Multimodal · 2024-05-13

GPT-4o

GPT-4o ('omni') is OpenAI's versatile multimodal model for text, audio, and vision.

Overview

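GPT-4o ('omni') is OpenAI's flagship multimodal model, designed for real-time interaction. It is trained end to end across text, vision, and audio, so a single neural network processes all inputs and outputs. Earlier voice setups chained separate speech-to-text, language, and text-to-speech models. Removing those handoffs lowers latency and lets the model carry tone and emotion through spoken conversation.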

Unique Factor

Native multimodality across text, audio, and vision in one model, with spoken responses in as little as 232 ms (about 320 ms on average).

Key Capabilities

● Real-time audio
● Native multimodal
● High speed

Benchmarks

MMLU: 88.7%

HumanEval (coding): 90.2%

GPQA Diamond: 53.6%

MATH: 76.6%

Scores as reported by OpenAI at launch (May 2024).

Top Use Cases

Real-time voice assistant

Interactive voice conversations with emotional nuance.

Example: “Talk to me about the stars in a soothing voice.”

Technical Specs

Context: 128,000 tokens

Parameters: Not disclosed

License: Proprietary

Architecture: Transformer

API Pricing

Input: $2.50 / 1M tokens

Output: $10.00 / 1M tokens

✓ Free tier available
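To show how the input and output rates combine, here is a minimal Python sketch that estimates the cost of a request from its token counts. The function name `estimate_cost_usd` is hypothetical. The rates are hard-coded from the prices listed above, and the sketch assumes they still apply. It also ignores any discounts or free-tier usage.

```python
# Hypothetical helper: rates copied from the listed GPT-4o prices (USD per 1M tokens).
# Assumption: these prices are current; verify against OpenAI's pricing page.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a request from its input and output token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # Example: a 10,000-token prompt that produces a 1,000-token reply.
    print(f"${estimate_cost_usd(10_000, 1_000):.4f}")
```

Output tokens cost four times as much as input tokens, so long replies dominate the bill even when prompts are large. In practice, the token counts can be read from the `usage` field of an API response.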

Developer

OpenAI

Creator of ChatGPT, GPT-4o, and o3, and operator of the world's most widely used AI platform.

Previous Version

GPT-4 Turbo