Even the Worst Version of Claude AI Is Better Than GPT 3.5, Researchers Say
The AI industry is watching a close contest between OpenAI’s ChatGPT and Anthropic’s Claude models. The Large Model Systems Organization (LMSO), which created the Chatbot Arena and the well-known Vicuna model, has just updated its Chatbot Arena Leaderboard, which shows how each AI chatbot measures up to its competitors. It turns out Anthropic is giving OpenAI a run for its money, even while its models are still free to use.
GPT-4, the powerhouse behind ChatGPT Plus and Bing AI, reigns supreme with the highest score, setting the gold standard for Large Language Models (LLMs). But as we move down the leaderboard, an unexpected underdog story unfolds. Anthropic’s Claude models — Claude 1, Claude 2, and Claude Instant — all outperform GPT-3.5, the engine that powers the free version of ChatGPT. This implies that every Large Language Model developed by Anthropic can outclass the free version of ChatGPT.
The LMSO ranking offers a look at how the models compare. According to the leaderboard, GPT-4 holds an Arena Elo rating of 1181, the highest on the chart, while the Claude models follow with ratings ranging from 1119 to 1155. GPT-3.5 trails them with a rating of 1115.
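To put those gaps in perspective, the textbook Elo model converts a rating difference into an expected head-to-head score. The sketch below applies that standard formula (base 10, scale 400) to the ratings quoted above. It assumes the leaderboard follows the same convention, and the function name is ours, chosen for illustration.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the textbook Elo model.

    Assumption: base-10 logistic curve with a 400-point scale, the
    common Elo convention; the leaderboard's exact method may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as quoted in this article.
GPT_4 = 1181
CLAUDE_HIGHEST = 1155
CLAUDE_LOWEST = 1119
GPT_3_5 = 1115

pairs = [
    ("GPT-4 vs. GPT-3.5", GPT_4, GPT_3_5),
    ("GPT-4 vs. highest-rated Claude", GPT_4, CLAUDE_HIGHEST),
    ("Lowest-rated Claude vs. GPT-3.5", CLAUDE_LOWEST, GPT_3_5),
]

for label, a, b in pairs:
    print(f"{label}: {expected_win_rate(a, b):.1%}")
```

Under these assumptions, GPT-4 would be expected to win roughly 59% of its matchups against GPT-3.5 and roughly 54% against the highest-rated Claude model, while the lowest-rated Claude model would hold only a slight edge over GPT-3.5.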
To rank the models, the LMSO pits them against each other in “battles” in which two models answer the same prompt. The model with the better answer wins and the other loses. Users pick the winner based on their own preferences, but they never get to know which models are competing.
Image: LMSO
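For readers curious how a series of blind votes turns into a score, here is a minimal sketch of the classic Elo update rule applied to a handful of battles. The model names, the starting rating of 1000, and the K-factor of 32 are all illustrative assumptions, not values taken from the leaderboard, whose exact computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expectation that A beats B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_battle(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update both ratings in place after one user vote.

    k is the K-factor (an assumed value here): it caps how many points
    a single battle can move a rating.
    """
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise


# Hypothetical models, both starting from the same assumed rating.
ratings = {"model-x": 1000.0, "model-y": 1000.0}

# Each tuple is one anonymous vote: (preferred model, other model).
votes = [
    ("model-x", "model-y"),
    ("model-x", "model-y"),
    ("model-y", "model-x"),
]

for winner, loser in votes:
    record_battle(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Because the points gained by the winner are taken from the loser, an upset against a higher-rated model moves the ratings more than an expected win does.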
As Decrypt previously reported, the difference in token processing capabilities between ChatGPT Plus and Claude Pro, although not a factor in the LMSO ranking, is also a major advantage that Claude models have over GPT.
“Claude Pro, based on the Claude 2 LLM, can process up to 100K tokens of information, while ChatGPT Plus, powered by the GPT-4 LLM, handles 8,192 tokens,” we wrote. That gap, roughly twelvefold, gives Claude models an edge in handling long contextual inputs, which matters for tasks that depend on large documents or lengthy conversations.
Moreover, Claude 2 has outperformed GPT when handling long prompts. When prompts are of comparable length, Claude 1 and Claude Instant provide results similar to or slightly better than those of GPT-3.5, which shows how closely matched these models are. With Claude’s larger context window, a poor initial answer can be dramatically improved with a more refined, larger and richer prompt.
Open-source models are not far behind in this race.
WizardLM, a 70-billion-parameter model fine-tuned from Meta’s Llama 2, stands out as the best open-source LLM. Following closely are Vicuna 33B and the original Llama 2, released by Meta.
🎉The @lmsysorg just updated the Chatbot Arena Leaderboard!
Our WizardLM-70B is now the🥇Top-1 open-source model on both ⚔️Arena Elo and 📈MT-bench.
❤️Main Contributors: @CanXu20 @victorsungo_ai @ChiYeung_Law @hpluo12 @tangmensan
Leaderboard: https://t.co/1gkZKGVutQ
Model… pic.twitter.com/bsJ0jv2i7I— WizardLM (@WizardLM_AI) October 5, 2023
Open-source models play an important role in the development of the AI space for several reasons. They can be run locally, which gives users the opportunity to fine-tune them and engages the community in a collective effort to improve the model. They are also cheaper to run due to their licenses, which is why the space has dozens of open-source LLMs and only a handful of proprietary models.
But the game of AI chatbots isn’t solely about numbers. It’s about real-world implications.
As chatbots become integral in various sectors from customer service to personal assistants, their efficacy, adaptability, and accuracy become paramount. With Claude models ranking higher than GPT-3.5, businesses and individual users might find themselves at a crossroads, evaluating which model aligns best with their needs. Decrypt has prepared two guides to help you decide what model suits you best.
For the uninitiated, this might seem like just another leaderboard update. But for those closely watching the AI industry, it’s a testament to how fierce the competition is and how swiftly the tides can turn. And as for the rest of us who sit in between those two camps, it’s a reminder that in the AI world, today’s most popular model could fall to the most efficient.