Multimodal CoT Prompting

Zhang et al. (2023) proposed a multimodal chain-of-thought (CoT) prompting approach. Traditional CoT focuses on the language modality. In contrast, Multimodal CoT incorporates text and vision into a two-stage framework. The first stage is rationale generation based on multimodal information. The second stage is answer inference, which leverages the generated rationales.
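The two-stage flow can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function `multimodal_cot`, the `MultimodalCoTResult` type, and both toy models are hypothetical stand-ins, and the image is assumed to be a plain list of feature values. The sketch also assumes that the second stage receives the question with the generated rationale appended, together with the same image features.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types for illustration: an "image" is a flat feature vector,
# and a "model" is any callable mapping (text, image) to generated text.
ImageFeatures = Sequence[float]
Model = Callable[[str, ImageFeatures], str]


@dataclass
class MultimodalCoTResult:
    rationale: str
    answer: str


def multimodal_cot(
    question: str,
    image: ImageFeatures,
    rationale_model: Model,
    answer_model: Model,
) -> MultimodalCoTResult:
    # Stage 1: rationale generation from text and vision inputs.
    rationale = rationale_model(question, image)
    # Stage 2: answer inference. Assumption: the rationale is appended to
    # the question, and the same image is passed in again.
    augmented_text = f"{question}\nRationale: {rationale}"
    answer = answer_model(augmented_text, image)
    return MultimodalCoTResult(rationale=rationale, answer=answer)


# Toy stand-ins for real vision-language models (assumptions, not real APIs).
def toy_rationale_model(text: str, image: ImageFeatures) -> str:
    brightness = sum(image) / len(image)
    description = "bright" if brightness > 0.5 else "dark"
    return f"The image looks {description}."


def toy_answer_model(text: str, image: ImageFeatures) -> str:
    # Relies on the rationale that stage 1 placed into the text input.
    return "day" if "bright" in text else "night"


result = multimodal_cot(
    "Is it day or night?", [0.9, 0.8, 0.7], toy_rationale_model, toy_answer_model
)
print(result.rationale)
print(result.answer)
```

Keeping the two stages as separate callables mirrors the framework's structure: the rationale can be inspected or evaluated on its own before it is used for answer inference.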

The Multimodal CoT model (1B parameters) outperforms GPT-3.5 on the ScienceQA benchmark.

Image Source: Zhang et al. (2023)
