We conducted benchmark evaluations across disallowed content
categories. We report here on our Production Benchmarks, a newer and
more challenging evaluation set built from conversations representative
of difficult examples in production data. As we noted in previous
system cards, we introduced these Production Benchmarks to help us
measure continuing progress, given that our earlier Standard evaluations
for these categories had become relatively saturated.
These evaluations were deliberately designed to be difficult: they
were built around cases in which our existing models were not yet giving
ideal responses, and this is reflected in the scores below. Error rates
are therefore not representative of average production traffic. The
primary metric is not_unsafe, which checks that the model did not
produce output that is disallowed under the relevant OpenAI policy.
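As an illustrative sketch only (the actual grading pipeline is not
described here), not_unsafe can be thought of as the fraction of graded
responses that a policy grader does not flag as disallowed. The function
and data shapes below are hypothetical:

```python
# Hypothetical sketch: compute a not_unsafe score as the fraction of
# graded responses that were NOT flagged as policy-violating.
# This is not OpenAI's actual grading pipeline.

def not_unsafe_rate(flags):
    """flags: list of booleans, True if the grader flagged the
    response as disallowed under the relevant policy."""
    if not flags:
        raise ValueError("no graded responses")
    return sum(1 for f in flags if not f) / len(flags)

# Example: 3 of 4 responses pass the policy check.
print(not_unsafe_rate([False, True, False, False]))  # 0.75
```

Under this framing, a score of 1.0 means no graded response in the
evaluation set was flagged as disallowed.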
* New evaluations, as introduced in the GPT-5 update on sensitive
conversations.
Overall, both gpt-5.1-thinking and gpt-5.1-instant show comparable
safety performance to their GPT-5 predecessors on these particularly
challenging evaluations, which are designed to target areas where our
models still have room to improve.
The new gpt-5.1-thinking model shows slight regressions relative to
gpt-5-thinking for content involving harassment and hateful language, as
well as disallowed sexual content. We are working on further
improvements in these categories.
The new gpt-5.1-instant model outperforms gpt-5-instant-aug15 on all
of the above evaluations, but performs slightly worse than
gpt-5-instant-oct3 on the evaluations for disallowed sexual content,
violent content, mental health, and emotional reliance. We provide
further context on the latter two safety categories below.
Early signal on prevalence of undesired responses for
sensitive situations
In addition to these offline evaluations, we also share some very
early signal on the prevalence of undesired responses for sensitive
situations, based on online measurements that we ran during A/B testing.
Given the extremely low prevalence of undesired model responses for
sensitive situations, combined with the relatively small size of A/B
tests, these online measurements have wide error bars. However, they can
still provide early signal on potential improvements or regressions.
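To illustrate why low prevalence plus a modest sample produces wide
error bars, the sketch below computes a Wilson score confidence interval
for a binomial proportion. The sample counts are entirely hypothetical,
not actual measurements, and the Wilson interval is our choice of
illustration rather than the specific method used in these measurements:

```python
# Hypothetical illustration: a rare event rate measured on a modest
# sample yields a confidence interval that is wide relative to the
# point estimate. Uses the standard Wilson score interval.
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score confidence interval for a
    binomial proportion (z=1.96 for 95% coverage)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / n + z * z / (4 * n * n)
    )
    return center - half, center + half

# e.g. 3 undesired responses observed in 20,000 sampled conversations:
lo, hi = wilson_interval(3, 20_000)
print(f"point estimate {3/20_000:.5%}, 95% CI [{lo:.5%}, {hi:.5%}]")
```

With numbers like these, the upper bound of the interval is several
times the point estimate, which is what "wide error bars" means in
practice for rare events.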
After launch, we continue to run these measurements to gain more
precise signal on the prevalence of undesired responses in real-world
usage, which more fully informs whether further mitigations are needed
(such as routing to specific safer models). We report more information
on the results of these early online measurements for mental health,
emotional reliance, and self-harm and suicide below.
Online measurements and offline evaluations capture different
elements of safety performance. Online measurements can provide
real-time signals on the prevalence of risks in deployment, and can
capture shifts in live user behavior with our models. In contrast, our
offline evaluations focus on challenging conversations closer to a
"worst case," and are typically very long conversations seeded with
undesired behavior from past models in earlier turns.
Mental Health, Emotional Reliance, and Self-Harm and
Suicide
Mental health: On offline evaluations (i.e., the
Production Benchmarks table shown above), gpt-5.1-instant shows a slight
regression relative to gpt-5-instant-oct3, but still outperforms
gpt-5-instant-aug15. gpt-5.1-thinking improves relative to
gpt-5-thinking. On early online measurements, both gpt-5.1-instant and
gpt-5.1-thinking show a slight improvement, albeit with low statistical
confidence, relative to gpt-5-instant-oct3 and gpt-5-thinking,
respectively.
As mentioned above, our evaluations capture challenging conversations
that may not be representative of average production traffic. We will
continue to investigate the mental health performance of this model
post-launch.
Emotional reliance: On offline evaluations, both
gpt-5.1-instant and gpt-5.1-thinking show a slight regression relative
to gpt-5-instant-oct3 and gpt-5-thinking, respectively; gpt-5.1-instant
still improved relative to gpt-5-instant-aug15. Preliminary online
measurements also show a regression for gpt-5.1-instant compared to
gpt-5-instant-oct3, although this regression has low statistical
confidence. Even with this possible regression, gpt-5.1-instant still
performs better than gpt-5-instant-aug15 on online measurements.
gpt-5.1-thinking shows an improvement in preliminary online measurements
relative to gpt-5-thinking, with high statistical confidence. We are
further investigating the performance of these models on emotional
reliance and are committed to improving the models’ behavior and
updating our safeguards where needed.
Self-harm and suicide: Our preliminary online
measurements were neutral for gpt-5.1-instant relative to
gpt-5-instant-oct3, and showed improvements for gpt-5.1-thinking
relative to gpt-5-thinking. However, these estimates have low
statistical confidence.