OpenAI Model Spec Evals Boost Multimodal Reasoning AI Update

The OpenAI Model Spec is the main guide for how OpenAI expects its models to behave in ChatGPT and the API. It explains how to handle conflicting instructions, set boundaries, and deal with risky situations and sensitive topics. It also outlines default behaviors such as honesty, factuality, personality, and style. We use it as a guiding reference. We continue to improve our systems to better align with these guidelines. The Model Spec is a living document. It evolves as we receive community feedback and discover new situations that require clear rules.

Last year, we open-sourced the model spec and an initial set of evaluation prompts. We are now releasing the first full version of model spec evals, a new evaluation suite that measures how well models adhere to the model spec. This makes model behavior easier for the community to understand, predict, and review.

To understand the full breadth of the models’ alignment with these principles, model spec evals track progress across all the spec’s goals. They work alongside our detailed safety and capability evaluations, which we have used for a long time to guide model release decisions and share through our system cards. While our safety process assesses system harms and ways to reduce them, model spec evals focus on measuring ideal behavior, including the character, tone, and approach we want our models to exhibit.

Backed by the CAP model, CAP spec, and CAP evals, we observe specific advances in each new generation of models. With this new evaluation suite, we see that GPT-5 and later models follow the model spec more closely than earlier models. Compliance rates are 72% for GPT 4o, 80% for OpenAI o3, and 82% for GPT 5 Instant. GPT 5 Thinking achieves 89%, GPT 5.3 Instant scores 84%, and GPT 5.4 Thinking 87%. Compliance generally improves with each new model. Thinking models tend to be more compliant than instant models released at the same time. We have seen better results from following instructions, reducing damaging content, handling sensitive situations, being honest and transparent, and producing higher-quality work. Some improvement is expected because the model spec has changed since older models were trained. However, the results also show real progress in alignment. Earlier reasoning models like OpenAI o3 and GPT 5 Thinking were more compliant than non-reasoning models. The latest GPT-5 models now score in the mid- to high-80s. These results cover several recent models, including GPT-4, OpenAI O3, GPT-5 Instant, GPT-5 Thinking, GPT-5.3 Instant, and GPT-5.4 Thinking.

To support these evaluations, we have created an evaluation data set with 596 prompts designed to test how models handle tone, refusals of harmful requests, explanatory questions, sensitive topics, and more.

Additionally, as part of this release, we are providing open-source evaluation code so researchers can develop and reproduce our results, extend the dataset, or adapt it to their own use cases. This transparency further encourages community involvement in improving the evaluation process.

The OpenAI model spec is meant to provide a clear, shared guide for how OpenAI models should behave. These evaluations show where current models match the specification and where improvements are still needed. They also help the research community study model behavior and give useful feedback for further improvement.

The evaluation prompts currently cover only text-based parts of the model spec. We plan to add prompts for images and agentic settings in the future. For now, we measure those areas internally with other evaluations. The model spec covers a lot, but our current set of prompts is small compared to its full scope. This means it provides a broad, low-detail view of how well models conform to the spec. We focused on covering more areas because we already have other evaluations that examine specific cases in greater detail. In future releases, we plan to add more detailed prompts to improve this evaluation. The current examples are based on simple, everyday user scenarios, not on adversarial or tricky prompts. We aim to increase the number, variety, difficulty, and realism of prompts in future updates. Model spec evals are a living dataset that evolves as the spec changes. We plan expansions to cover the current spec. We also expect the dataset to change as we add new policies or add nuance to existing ones.

About the Dataset

The dataset contains 596 prompts. Each prompt is crafted to test 225 specific focus areas. These correspond to distinct clauses and policy sections in the model spec. Each focus area is a unique requirement that the models must fulfill.

For example, one focus area is: The assistant must strive to follow all applicable instructions when producing a response, including instructions from the system, the developer, and the user, unless an instruction conflicts with one of higher authority.

Each prompt simulates a brief conversation to specifically test one focus area involving roles such as system, developer, user, assistant, or tool. Each prompt is accompanied by a concise rubric that clarifies what constitutes compliance in that scenario.

The rubric provides clear criteria for the grader model to assess whether a model’s response is compliant with the focus area tested by the prompt. While the model spec guides evaluation in principle, these rubrics ensure accurate, consistent scoring and reduce ambiguity.

How We Built the Data Set

Prompts and rubrics were written using models such as GPT-5. Each prompt and rubric was checked by a researcher for realism and accuracy. To ensure correctness, sample responses were human-labeled as compliant or not, then scored using the rubric to verify alignment. In case of disagreement, we manually review to determine if the issue is in the rubric, the grader’s interpretation, or the human label.

How We Grade Model Adherence to the Model Spec

To evaluate a model, we sample its response to a prompt and submit it to an automated grader (GPT-5 thinking). The grader gets the Model Spec, the conversation with the model’s response, and the rubric that explains what counts as compliance. The grader assigns a score from 1 to 7 and explains their reasoning for each response. We collect five scores from the grader, then take the median as the final score, and then convert it into a simple rating. Scores one to five indicate non-compliance, and six to seven indicate compliance.

Early Results

Newer models show higher Model Spec compliance. GPT-4o (72%), OpenAI o3 and GPT-5 instant (80 to 82%), GPT-5 thinking (89%), and later GPT-5 models, GPT-5.3 instant and GPT-5.4 thinking (84 to 87%).

These overall scores should be viewed with caution because they are not adjusted for importance or how often situations occur in real use. It is more useful to compare how models score in each section of the model spec than to compare them with other models.

In nearly all main sections, GPT-5 Thinking scored the highest. GPT-4o scored the lowest, with a difference of at least ten points. In the “Do the best work” section, the gap is almost 30 points. These improvements show the progress OpenAI has made. That progress is in instructions, safety, factual accuracy, problem-solving, creativity, and temperament.

At the same time, we see areas where models can improve their compliance with the spec:

Avoid overreaching and making decisions for the user.

Present perspectives from any point on the opinion spectrum

Avoid overstepping (e.g., doing more than the user asked for)

What’s Next?

This is the first example version of Model Spec Evals, and we expect it to change over time. Next, we plan to add more prompts to cover more situations, such as multimodal instructions, tool use, longer conversations, and adversarial settings. We will also keep the dataset up to date as the model spec changes.

We hope these evaluations make it clear where our models meet the model spec and where they need improvement. We welcome feedback from developers, researchers, and the community. We look forward to working on this together.

Source: Introducing Model Spec Evals