GPT-4.5 Shows Lower Creative Performance Than GPT-4o in New Comprehensive Benchmark

Zhejiang University and Shanghai AI Lab have released Creation-MMBench, a benchmark specifically designed to evaluate multimodal creativity in real-world scenarios. This tool reveals surprising insights about the creative capabilities of today’s most advanced AI models, including the discovery that GPT-4.5’s creative abilities lag behind those of GPT-4o in many scenarios.
Pushing Beyond Traditional AI Evaluation
While GPT-4.5 has been widely praised for its impressive contextual coherence in everyday Q&A and various creative tasks, researchers identified a critical question: Where exactly is the “creativity ceiling” of multimodal large language models (MLLMs)?
The challenge has been measuring creativity in complex scenarios. Existing benchmarks struggle to quantify whether an AI model produces genuinely creative insights, with many test scenarios being too simplistic to reflect how these models perform in real-world creative thinking situations.
Creation-MMBench addresses this gap by comprehensively evaluating “visual creative intelligence” across four major task categories, 51 fine-grained tasks, and 765 challenging test cases.
Why Visual Creative Intelligence Matters
Creative intelligence has traditionally been the most challenging aspect of AI to evaluate and develop. Unlike analytical tasks with clear right or wrong answers, creativity involves generating novel yet appropriate solutions across diverse contexts.
Current MLLM benchmarks, like MMBench and MMMU, focus primarily on analytical or practical tasks while overlooking creative challenges that are common in real-life interactions with multimodal AI. Creation-MMBench sets itself apart by featuring complex scenarios with diverse content and both single-image and multi-image problems.
For example, the benchmark challenges models to:
- Generate compelling museum exhibit commentary
- Write emotional, story-driven essays based on photos of people
- Create nuanced culinary guidance as a Michelin chef interpreting food photographs
These tasks require simultaneous mastery of visual content understanding, contextual adaptation, and creative text generation—abilities that existing benchmarks rarely assess comprehensively.
Creation-MMBench’s Rigorous Evaluation Framework
The benchmark features four main task categories:
- Literary Creation: Evaluates artistic expression through poems, dialogues, stories, and narrative construction
- Everyday Functional Writing: Tests practical writing for social media, public initiatives, emails, and real-life questions
- Professional Functional Writing: Assesses specialized writing for interior design, lesson planning, and landscape descriptions
- Multimodal Understanding and Creation: Examines visual-textual integration through document analysis and photography appreciation
What sets Creation-MMBench apart is its complexity. It incorporates thousands of cross-domain images across nearly 30 categories and supports up to 9 image inputs per task. Test prompts are comprehensive, often exceeding 500 words to provide rich, creative context.
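To make this structure concrete, here is one plausible way a single test case could be represented in code. The `CreationCase` class and its field names are illustrative assumptions, not the benchmark’s actual data schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one Creation-MMBench test case; the class and
# field names are assumptions for illustration, not the real data format.
@dataclass
class CreationCase:
    task_category: str            # one of the four major task categories
    task_name: str                # one of the 51 fine-grained tasks
    instruction: str              # role-playing prompt, often exceeding 500 words
    image_paths: List[str] = field(default_factory=list)  # 1 to 9 images per case
    criteria: List[str] = field(default_factory=list)     # per-task scoring criteria
    reference_answer: str = ""    # reference response used during judging

example = CreationCase(
    task_category="Professional Functional Writing",
    task_name="museum_exhibit_commentary",
    instruction="As a museum docent, write an engaging commentary on the exhibit shown ...",
    image_paths=["exhibit_photo_1.jpg"],
    criteria=["faithful to visible details", "engaging, audience-appropriate tone"],
)
```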
Dual Evaluation System Quantifies Creative Quality
To quantify creative quality objectively, the team implemented a dual evaluation approach:
- Visual Fact Score (VFS): Ensures the model accurately reads image details without fabricating information
- Reward: Evaluates the model’s creative ability and presentation skills in conjunction with visual content
The evaluation process uses GPT-4o as a judging model, which weighs the evaluation criteria, image content, and model responses to produce relative preference ratings between model replies and reference answers.
To verify reliability, human volunteers manually evaluated 13% of the samples, confirming that GPT-4o demonstrates strong consistency with human preferences.
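To illustrate how such a judging step could be wired up, here is a minimal sketch assuming an OpenAI-style chat API. The prompt wording, scoring scales, and the `judge_case` helper are illustrative assumptions rather than the authors’ exact protocol; the real setup could also pass the image itself to GPT-4o instead of a textual description.

```python
# Minimal LLM-as-judge sketch for one test case, assuming an OpenAI-style
# chat API. Prompt wording and scales are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_case(criteria: str, image_description: str,
               model_reply: str, reference_reply: str) -> str:
    prompt = (
        "You are grading a creative-writing response against a reference.\n"
        f"Scoring criteria:\n{criteria}\n\n"
        f"Image content:\n{image_description}\n\n"
        f"Model response:\n{model_reply}\n\n"
        f"Reference response:\n{reference_reply}\n\n"
        "1. Visual Fact Score: rate 0-10 how faithfully the model response "
        "reflects the image content without fabricating details.\n"
        "2. Reward: state whether the model response is much better, better, "
        "similar, worse, or much worse than the reference overall."
    )
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```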
Benchmark Results: Closed vs. Open Source Models
The research team evaluated over 20 mainstream MLLMs using the VLMEvalKit toolchain, including GPT-4o, the Gemini series, Claude 3.5, and open-source models like Qwen2.5-VL and InternVL.
Key Findings:
- Gemini-2.0-Pro outperformed GPT-4o in multimodal creative writing, particularly in daily functional writing tasks
- GPT-4.5 showed weaker overall performance than both Gemini-Pro and GPT-4o, though it excelled specifically in multimodal content understanding and creation
- Open-source models like Qwen2.5-VL-72B and InternVL2.5-78B-MPO came closest to closed-source models in creative capability, but a clear performance gap remained
Category-Specific Insights:
- Professional functional writing proved most challenging due to high demands for specialized knowledge and deep visual content understanding
- Models with weaker overall performance could still excel in everyday tasks related to daily social life, where situations and visual content are more straightforward
- Most models achieved high visual factual scores on multimodal understanding and creation tasks but struggled with recreation based on visual content
The Impact of Visual Fine-Tuning
To further understand model capabilities, the team created a text-only version called Creation-MMBench-TO, in which each image was replaced with a detailed textual description generated by GPT-4o.
The text-only evaluation showed:
- Closed-source language models slightly outperformed open-source ones in authoring ability
- GPT-4o achieved higher creative reward scores on the text-only version, possibly because it could focus on divergent thinking without visual understanding constraints
- Open-source multimodal models with visual instruction fine-tuning consistently performed worse on Creation-MMBench-TO than their base language models
This suggests that visual instruction fine-tuning might limit a model’s ability to understand longer texts and create extended content, resulting in lower visual factual scores and creative rewards.
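For intuition, here is a rough sketch of how such a text-only variant could be derived from a multimodal test case. The `describe_image` and `to_text_only` helpers and the captioning prompt are assumptions for illustration, not the authors’ actual pipeline.

```python
# Rough sketch: replace a test case's images with detailed GPT-4o captions
# to build a text-only prompt. Functions and prompts are illustrative
# assumptions, not the authors' exact conversion pipeline.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_image(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in exhaustive detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return result.choices[0].message.content

def to_text_only(instruction: str, image_paths: list) -> str:
    # Prepend the generated descriptions so a text-only LLM sees the same
    # information that the multimodal prompt carried in its images.
    descriptions = "\n\n".join(describe_image(p) for p in image_paths)
    return f"Image descriptions:\n{descriptions}\n\nTask:\n{instruction}"
```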
Real-World Example: Software Engineering Interpretation
Qualitative research revealed significant differences in how models handled specific professional tasks:
- Qwen2.5-VL misidentified a swimlane diagram as a data flow diagram due to insufficient domain knowledge, leading to incorrect analysis
- GPT-4o avoided this error and provided more professional, structured language with accurate diagram interpretation
This example highlights the critical importance of domain-specific knowledge and detailed image comprehension in professional tasks, demonstrating the persistent gap between open-source and closed-source models.
Conclusion
Creation-MMBench, with details available on GitHub, represents a significant advancement in evaluating multimodal large models’ creative capabilities in realistic scenarios. With 765 instances spanning 51 detailed tasks and comprehensive evaluation criteria, it provides unprecedented insight into model performance.
The benchmark is now integrated into VLMEvalKit, supporting one-click evaluation to assess any model’s performance in creative tasks comprehensively. This makes it easier than ever to determine whether your model can effectively tell a compelling story based on visual input.
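For reference, a run could look roughly like the sketch below, which shells out to VLMEvalKit’s run.py entry point. The dataset identifier Creation_MMBench and the model identifier GPT4o are assumptions and should be checked against the VLMEvalKit documentation; the judge model also needs API access configured.

```python
# Rough sketch of launching an evaluation through VLMEvalKit's run.py.
# Dataset and model identifiers below are assumptions; consult the
# VLMEvalKit docs for the exact names supported in your installation.
import subprocess

subprocess.run(
    ["python", "run.py", "--data", "Creation_MMBench", "--model", "GPT4o"],
    check=True,
)
```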