Beyond ChatGPT: NExT-GPT Is an Open-Source Model That Lets You Master AI With Audio, Video and Text

27.09.2023

In a burgeoning technology scene dominated by giants like OpenAI and Google, NExT-GPT—an open source multimodal AI large language model (LLM)—might have what it takes to compete in the big leagues.

ChatGPT took the world by storm with its ability to understand natural language queries and generate human-like responses. But as AI continues to advance at lightning speed, people have demanded more power. The era of pure text is already over, and multimodal LLMs are arriving.

Developed through a collaboration between the National University of Singapore (NUS) and Tsinghua University, NExT-GPT can process and generate combinations of text, images, audio and video. This allows for more natural interactions than text-only models like the basic ChatGPT tool.

The team that created it pitches NExT-GPT as an “any-to-any” system, meaning it can accept inputs in any modality and deliver responses in the appropriate form.

The potential for rapid advancement is enormous. As an open-source model, NExT-GPT can be modified by users to suit their specific needs. This could lead to dramatic improvements over the original, much as community work took Stable Diffusion well beyond its initial release. Democratizing access lets creators shape the technology for maximum impact.

So how does NExT-GPT work? As explained in the model’s research paper, the system has separate modules to encode inputs like images and audio into text-like representations that the core language model can process.
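As a rough illustration of that encoding step, the sketch below maps a feature vector from a modality encoder into the embedding space of a language model with a single linear projection. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the tiny dimensions and the random weights are all hypothetical, and a real system would learn the projection during training.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Build a random (out_dim x in_dim) weight matrix as nested lists.

    Hypothetical stand-in for a learned projection layer.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(features, weights):
    """Map one encoder feature vector into the language model's embedding space."""
    if any(len(row) != len(features) for row in weights):
        raise ValueError("feature size does not match projection input size")
    # Plain matrix-vector product: one dot product per output dimension.
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

# Hypothetical sizes: an 8-dim encoder feature and a 4-dim LLM embedding.
ENCODER_DIM, LLM_DIM = 8, 4
weights = make_projection(ENCODER_DIM, LLM_DIM)
image_features = [0.5] * ENCODER_DIM   # stand-in for an image encoder's output
image_embedding = project(image_features, weights)
print(len(image_embedding))
```

The point of the design is that the language model never sees pixels or waveforms, only vectors of the same size as its own word embeddings.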

The researchers introduced a technique called “modality-switching instruction tuning” to improve the model's cross-modal reasoning, meaning its ability to treat different types of input as one coherent whole. This tuning teaches the model to switch smoothly between modalities during a conversation.
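To make the idea of switching modalities mid-conversation concrete, here is one way a training dialogue for such tuning might be represented. This is only a sketch: the `Turn` record, its fields and the example file names are hypothetical and are not taken from the researchers' dataset.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str       # "user" or "assistant"
    modality: str   # "text", "image", "audio" or "video"
    content: str    # the text itself, or a file path for non-text content

def count_modality_switches(dialogue):
    """Count how often two consecutive turns differ in modality."""
    return sum(
        1 for prev, cur in zip(dialogue, dialogue[1:])
        if prev.modality != cur.modality
    )

# Hypothetical dialogue in which the assistant answers in audio, then video.
sample = [
    Turn("user", "text", "What does a thunderstorm sound like?"),
    Turn("assistant", "audio", "thunder.wav"),
    Turn("user", "text", "Now show me one over the sea."),
    Turn("assistant", "video", "storm.mp4"),
]
print(count_modality_switches(sample))
```

A dialogue with many such switches is the kind of example that would exercise the behavior the tuning is meant to teach.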

To handle inputs, NExT-GPT marks each non-text modality with its own special token: one for images, one for audio, and one for video. Each input type is converted into embeddings that the language model understands. The language model can then output response text, as well as special signal tokens that trigger generation in other modalities.

For example, a video signal token in the response tells the video decoder to produce a corresponding video output. Using tailored tokens for each input and output modality is what allows flexible any-to-any conversion.

Separate decoders then create the output for each modality: Stable Diffusion as the image decoder, AudioLDM as the audio decoder, and Zeroscope as the video decoder. NExT-GPT also uses Vicuna as the base LLM and ImageBind to encode the inputs.
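The routing described above can be sketched as a small dispatcher that scans the model's response for signal tokens and picks the matching decoder. The token strings below are made up for this illustration and are not the ones NExT-GPT actually emits. The decoders are represented by name only, since calling the real models is outside the scope of a short example.

```python
import re

# Hypothetical signal-token names, invented for this sketch.
DECODERS = {
    "[SIGNAL_IMAGE]": "Stable Diffusion",
    "[SIGNAL_AUDIO]": "AudioLDM",
    "[SIGNAL_VIDEO]": "Zeroscope",
}

# One pattern that matches any of the signal tokens literally.
SIGNAL_PATTERN = re.compile("|".join(re.escape(tok) for tok in DECODERS))

def route_response(response):
    """Split a model response into plain text and the decoders to invoke."""
    decoders = [DECODERS[tok] for tok in SIGNAL_PATTERN.findall(response)]
    text = SIGNAL_PATTERN.sub("", response)
    text = " ".join(text.split())   # tidy whitespace left behind by removed tokens
    return text, decoders

text, decoders = route_response("Here is the clip you asked for. [SIGNAL_VIDEO]")
print(text)
print(decoders)
```

A response with no signal tokens yields an empty decoder list, so a text-only reply passes through unchanged.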

NExT-GPT is essentially a model that combines the power of different AIs to become a kind of all-in-one super AI.

Screenshot courtesy of: AI Papers Academy via YouTube

NExT-GPT achieves this flexible “any-to-any” conversion while training only 1% of its total parameters. The rest sit in frozen, pretrained modules, which the researchers describe as a very efficient design.
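The arithmetic behind a figure like that is simple to check. In the sketch below, every module name and parameter count is a round, illustrative number chosen to land near 1%; they are assumptions for the example, not the figures reported by the researchers.

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    params_millions: float
    trainable: bool   # False means the module is frozen during training

def trainable_fraction(modules):
    """Return the share of all parameters that are updated during training."""
    total = sum(m.params_millions for m in modules)
    trainable = sum(m.params_millions for m in modules if m.trainable)
    return trainable / total

# Illustrative sizes only (in millions of parameters).
modules = [
    Module("input encoder", 1000, False),
    Module("input projection", 10, True),
    Module("base LLM", 7000, False),
    Module("output projections", 110, True),
    Module("diffusion decoders", 4000, False),
]
fraction = trainable_fraction(modules)
print(f"{fraction:.1%}")
```

Because the large pretrained modules are frozen, only the small connecting layers contribute to the numerator, which keeps the trainable share tiny.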

A demo site has been set up to allow people to test NExT-GPT, but its availability is intermittent.

With tech giants like Google and OpenAI launching their own multimodal AI products, NExT-GPT represents an open source alternative for creators to build on. Multimodality is key to natural interactions. And by open sourcing NExT-GPT, researchers are providing a springboard for the community to take AI to the next level.
