Model Benchmarks

Compare the performance of different AI models across standardized benchmarks. Higher scores generally indicate better performance, but each benchmark targets a specific capability, so results should be interpreted in the context of the task it tests.

Benchmarks

Massive Multitask Language Understanding (MMLU)

MMLU evaluates a model's knowledge and reasoning across 57 subjects spanning STEM, the humanities, and the social sciences. Each item is a four-option multiple-choice question, and the score is the percentage of questions answered correctly.
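
As a rough illustration of how an MMLU-style score is produced, the sketch below grades four-option multiple-choice items by exact-match accuracy. The `Question` type and the `model_answer` stub are hypothetical stand-ins, not part of any official evaluation harness.

```python
# Minimal sketch of MMLU-style scoring: each item is a multiple-choice
# question with four options (A-D); the score is plain accuracy in percent.
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    choices: list[str]  # the four answer options
    answer: str         # gold label: "A", "B", "C", or "D"

def model_answer(q: Question) -> str:
    """Hypothetical model call; a real harness would query an LLM here."""
    return "A"  # placeholder prediction

def mmlu_accuracy(questions: list[Question]) -> float:
    """Fraction of items where the predicted letter matches the gold label."""
    correct = sum(model_answer(q) == q.answer for q in questions)
    return 100.0 * correct / len(questions)

if __name__ == "__main__":
    sample = [
        Question("2 + 2 = ?", ["4", "3", "5", "22"], "A"),
        Question("Capital of France?", ["Berlin", "Paris", "Rome", "Madrid"], "B"),
    ]
    print(f"MMLU-style accuracy: {mmlu_accuracy(sample):.2f}%")
```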

Rank  Model            Provider             Score (%)
1     GPT-3            OpenAI               92.30
2     MiniMax-Text-01  MiniMax              88.50
3     Claude 3         Anthropic            86.80
4     GPT-4            OpenAI               86.50
5     Qwen-14B         Alibaba Cloud        84.20
6     Qwen 2           Alibaba Cloud        84.20
7     Mistral Large    Mistral AI           84.00
8     Claude 2         Anthropic            78.50
9     Cohere Command   Cohere               78.50
10    Grok 1           xAI                  73.00
11    DeepSeek-LLM     DeepSeek             71.30
12    Chinchilla       Google DeepMind      67.60
13    LLaMA            Meta AI              63.40
14    Phi-2            Microsoft            56.70
15    Qwen-7B          Alibaba Cloud        56.70
16    Galactica        Meta AI              52.60
17    GLM-130B         Tsinghua University  44.80

All scores shown were recorded on May 4, 2025.
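
For readers who want to compare scores programmatically, here is a minimal sketch that represents a few rows transcribed from the table above and ranks them by score. The `Entry` structure and its field names are illustrative choices, not a published schema.

```python
# Minimal sketch: represent leaderboard rows and rank them by score.
# Scores transcribed from the MMLU table above (May 4, 2025 snapshot).
from typing import NamedTuple

class Entry(NamedTuple):
    model: str
    provider: str
    score: float  # MMLU score, percent

entries = [
    Entry("Claude 3", "Anthropic", 86.80),
    Entry("GPT-4", "OpenAI", 86.50),
    Entry("Mistral Large", "Mistral AI", 84.00),
    Entry("Grok 1", "xAI", 73.00),
]

# Rank descending: higher MMLU accuracy generally indicates better performance.
ranked = sorted(entries, key=lambda e: e.score, reverse=True)
for rank, e in enumerate(ranked, start=1):
    print(f"{rank}. {e.model} ({e.provider}): {e.score:.2f}%")
```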