9/23/24, 2:49 AM
OpenAI o1-mini | OpenAI
September 12, 2024
OpenAI o1-mini
Advancing cost-efficient reasoning.
We're releasing OpenAI o1-mini, a cost-efficient reasoning model. o1-mini excels at
STEM, especially math and coding—nearly matching the performance of OpenAI o1
on evaluation benchmarks such as AIME and Codeforces. We expect o1-mini will be a
faster, cost-effective model for applications that require reasoning without broad
world knowledge.
Today, we are launching o1-mini to tier 5 API users at a cost that is 80% cheaper than
OpenAI o1-preview. ChatGPT Plus, Team, Enterprise, and Edu users can use o1-mini
as an alternative to o1-preview, with higher rate limits and lower latency (see Model
Speed).
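As a rough illustration of what "80% cheaper" means in practice, the sketch below applies a percentage reduction to a reference price. The reference price used here is a made-up placeholder, not a figure from this post; only the 80% reduction comes from the text above.

```python
def discounted_price(reference_price: float, percent_cheaper: float) -> float:
    """Return the price after reducing reference_price by percent_cheaper percent."""
    if not 0.0 <= percent_cheaper <= 100.0:
        raise ValueError("percent_cheaper must be between 0 and 100")
    return reference_price * (100.0 - percent_cheaper) / 100.0


# Hypothetical o1-preview price per million tokens (placeholder, not from the post).
HYPOTHETICAL_O1_PREVIEW_PRICE = 10.0

# The post states that o1-mini is 80% cheaper than o1-preview.
o1_mini_price = discounted_price(HYPOTHETICAL_O1_PREVIEW_PRICE, 80.0)
print(f"o1-mini would cost {o1_mini_price:.2f} per million tokens at that reference price")
```

In other words, an 80% reduction leaves one fifth of the reference price, whatever that reference price is.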
Optimized for STEM Reasoning
Large language models such as o1 are pre-trained on vast text datasets. While these
high-capacity models have broad world knowledge, they can be expensive and slow
for real-world applications. In contrast, o1-mini is a smaller model optimized for STEM
reasoning during pretraining. After training with the same high-compute
reinforcement learning (RL) pipeline as o1, o1-mini achieves comparable performance
on many useful reasoning tasks, while being significantly more cost efficient.
https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
When evaluated on benchmarks requiring intelligence and reasoning, o1-mini
performs well compared to o1-preview and o1. However, o1-mini performs worse on
tasks requiring non-STEM factual knowledge (see Limitations).
Figure: Math Performance vs Inference Cost. AIME accuracy (0% to 80%) plotted against
inference cost (%, 0 to 100) for GPT-4o, GPT-4o mini, o1-preview, o1-mini, and o1.
Mathematics: In the high school AIME math competition, o1-mini (70.0%) is
competitive with o1 (74.4%), while being significantly cheaper, and outperforms
o1-preview (44.6%). o1-mini’s score (about 11/15 questions) places it among
approximately the top 500 US high-school students.
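The conversion from an AIME accuracy to a number of questions can be checked with a little arithmetic. The sketch below assumes a 15-question exam, as implied by the "11/15" figure above, and uses the accuracies quoted in this section; the function name is illustrative.

```python
AIME_QUESTIONS = 15  # implied by the "11/15" figure in the text


def expected_correct(accuracy_percent: float, n_questions: int = AIME_QUESTIONS) -> float:
    """Convert an accuracy percentage into an expected number of correct answers."""
    return accuracy_percent * n_questions / 100.0


# Accuracies quoted in the text above.
aime_accuracy = {"o1-mini": 70.0, "o1": 74.4, "o1-preview": 44.6}

for model, accuracy in aime_accuracy.items():
    print(f"{model}: {expected_correct(accuracy):.2f} of {AIME_QUESTIONS} questions")
```

For o1-mini this gives 10.5 questions, which is consistent with the "about 11/15" stated above.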
Coding: On the Codeforces competition website, o1-mini achieves 1650 Elo, which is
again competitive with o1 (1673) and higher than o1-preview (1258). This Elo score
puts the model at approximately the 86th percentile of programmers who compete
on the Codeforces platform. o1-mini also performs well on the HumanEval coding
benchmark and high-school level cybersecurity capture the flag challenges (CTFs).
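To give a feel for what these rating gaps mean, the sketch below uses the standard Elo expected-score formula. This is an assumption for illustration: Codeforces ratings are Elo-like, and the post does not describe how the model ratings were derived, so the resulting numbers are indicative only.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the text above.
ratings = {"o1-mini": 1650, "o1": 1673, "o1-preview": 1258}

for opponent in ("o1", "o1-preview"):
    score = elo_expected_score(ratings["o1-mini"], ratings[opponent])
    print(f"o1-mini vs {opponent}: expected score {score:.3f}")
```

Under this formula, the 23-point gap to o1 corresponds to a near-even matchup, while the 392-point gap to o1-preview corresponds to an expected score of roughly 0.9.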
Codeforces (Elo)
o1-mini: 1650
o1-preview: 1258
GPT-4o: 900


HumanEval (Accuracy)
o1-mini: 92.4%
o1-preview: 92.4%
GPT-4o: 90.2%


Cybersecurity CTFs (Accuracy, Pass@12)
o1-mini: 28.7%
o1-preview: 43%
GPT-4o: 20.0%
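The post does not say how Pass@12 was computed. A common definition of pass@k is the probability that at least one of k sampled attempts solves a task, and a widely used way to estimate it from n attempts with c successes is the unbiased estimator from the HumanEval paper (Chen et al., 2021). The sketch below implements that estimator as an assumption about the metric, not as the evaluation code used here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total attempts sampled for a task
    c: number of those attempts that succeeded
    k: number of attempts allowed by the metric
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-task results as (attempts, successes); not data from the post.
tasks = [(12, 0), (12, 1), (12, 0), (12, 3)]
benchmark_score = sum(pass_at_k(n, c, 12) for n, c in tasks) / len(tasks)
print(f"Pass@12 over {len(tasks)} hypothetical tasks: {benchmark_score:.2f}")
```

With n equal to k, as in the hypothetical tasks above, the estimator reduces to checking whether any of the 12 attempts succeeded.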


STEM: On some academic benchmarks requiring reasoning, such as GPQA (science)
and MATH-500, o1-mini outperforms GPT-4o. o1-mini does not perform as well as
GPT-4o on tasks such as MMLU and lags behind o1-preview on GPQA due to its lack
of broad world knowledge.
MMLU (0-shot CoT)
GPT-4o: 88.7%
o1-mini: 85.2%
o1-preview: 90.8%
o1: 92.3%


GPQA (Diamond, 0-shot CoT)
GPT-4o: 53.6%
o1-mini: 60.0%
o1-preview: 73.3%
o1: 77.3%


MATH-500 (0-shot CoT)
GPT-4o: 60.3%
o1-mini: 90.0%
o1-preview: 85.5%
o1: 94.…%

Human preference evaluation: We had human raters compare o1-mini to GPT-4o on
challenging, open-ended prompts in various domains, using the same methodology
as our o1-preview vs GPT-4o comparison. Similar to o1-preview, o1-mini is preferred to
GPT-4o in reasoning-heavy domains, but is not preferred to GPT-4o in
language-focused domains.
Figure: Human preference evaluation vs chatgpt-4o-latest. Win rate vs GPT-4o (%) for
o1-preview and o1-mini by domain: Personal Writing, Editing Text, Computer Programming,
Data Analysis, Mathematical Calcula…


Model Speed
As a concrete example, we compared responses from GPT-4o, o1-mini, and o1-preview
on a word reasoning question. While GPT-4o did not answer correctly, both
o1-mini and o1-preview did, and o1-mini reached the answer around 3-5x faster.
Figure: Chat speed comparison.
Safety
o1-mini is trained using the same alignment and safety techniques as o1-preview. The
model has 59% higher jailbreak robustness on an internal version of the
StrongREJECT dataset compared to GPT-4o. Before deployment, we carefully
assessed the safety risks of o1-mini using the same approach to preparedness,
external red-teaming, and safety evaluations as o1-preview. We are publishing the
detailed results from these evaluations in the accompanying system card.
Metric                                                                          GPT-4o   o1-mini
% Safe completions refusal on harmful prompts (standard)                        0.99     0.99
% Safe completions on harmful prompts (Challenging: jailbreaks & edge cases)    0.714    0.932
% Compliance on benign edge cases (“not over-refusal”)                          0.91     0.923
Goodness@0.1 StrongREJECT jailbreak eval (Souly et al. 2024)                    0.22     0.83
Human sourced jailbreak eval                                                    0.77     0.95
Limitations and What’s Next
Due to its specialization in STEM reasoning, o1-mini’s factual knowledge of
non-STEM topics such as dates, biographies, and trivia is comparable to that of small
LLMs such as GPT-4o mini. We will address these limitations in future versions and
experiment with extending the model to other modalities and specialties outside of
STEM.
Authors
OpenAI