URL: https://developers.openai.com/api/docs/guides/predicted-outputs

Predicted Outputs


Predicted Outputs enable you to speed up API responses from Chat Completions when many of the output tokens are known ahead of time. This is most common when you are regenerating a text or code file with minor modifications. You can provide your prediction using the prediction request parameter in Chat Completions.

Predicted Outputs are available today using the latest gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models. Read on to learn how to use Predicted Outputs to reduce latency in your applications.

Code refactoring example

Predicted Outputs are particularly useful for regenerating text documents and code files with small modifications. Let’s say you want the GPT-4o model to refactor a piece of TypeScript code by renaming the username property of the User class to email.
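The snippet itself is missing from this copy of the guide. A minimal stand-in, assuming a User class whose username property sits on the fourth line (the class shape and the other field names are illustrative assumptions), could look like this:

```typescript
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = ""; // line 4: the property to rename to `email`
}
```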

Most of the file will be unchanged; only the line that declares the username property needs to change. If you use the current text of the code file as your prediction, you can regenerate the entire file with lower latency. These time savings add up quickly for larger files.

To do this with our SDKs, pass the original code file as the prediction text in the prediction parameter. This tells the model that its final output will be very similar to that file.
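A minimal sketch of such a request body follows. It assumes the prediction parameter takes an object of the form `{ type: "content", content: ... }`; the helper name `buildPredictedRequest`, the instruction wording, and the sample file are hypothetical and only for illustration. The sketch builds the body without calling the API, so it runs without credentials.

```typescript
// Fields of the Chat Completions request body used in this sketch.
interface PredictedRequest {
  model: string;
  messages: { role: "user" | "system"; content: string }[];
  prediction: { type: "content"; content: string };
}

// Hypothetical helper (not part of any SDK): builds a request body that
// reuses the current file text as the prediction.
function buildPredictedRequest(code: string, instruction: string): PredictedRequest {
  return {
    model: "gpt-4.1",
    messages: [
      { role: "user", content: instruction },
      { role: "user", content: code },
    ],
    // The prediction is the unmodified file: most of it should survive the edit.
    prediction: { type: "content", content: code },
  };
}

// Illustrative file contents (an assumption, not the guide's original snippet).
const originalCode = [
  "class User {",
  '  firstName: string = "";',
  '  lastName: string = "";',
  '  username: string = "";',
  "}",
].join("\n");

const requestBody = buildPredictedRequest(
  originalCode,
  "Replace the username property with an email property. Respond only with code."
);
console.log(JSON.stringify(requestBody, null, 2));
```

With the official Node SDK, an object like `requestBody` would be passed to `openai.chat.completions.create(...)`.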

In addition to the refactored code, the model response contains a usage object that reports both accepted_prediction_tokens and rejected_prediction_tokens. In this example, 18 tokens from the prediction were used to speed up the response, while 10 were rejected.

Note that any rejected tokens are still billed like other completion tokens generated by the API, so Predicted Outputs can introduce higher costs for your requests.
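As a rough sketch of how these counters might be read, the helper below summarizes the two fields. It assumes they sit under `completion_tokens_details` inside the usage object; `summarizePrediction` is a hypothetical name, and the counts are the ones quoted above.

```typescript
// Subset of the usage object relevant to Predicted Outputs (assumed shape).
interface PredictionUsage {
  completion_tokens_details: {
    accepted_prediction_tokens: number;
    rejected_prediction_tokens: number;
  };
}

// Hypothetical helper: reports how much of the prediction was used.
function summarizePrediction(usage: PredictionUsage) {
  const accepted = usage.completion_tokens_details.accepted_prediction_tokens;
  const rejected = usage.completion_tokens_details.rejected_prediction_tokens;
  const counted = accepted + rejected;
  return {
    accepted,
    rejected, // still billed at completion token rates
    // Share of counted prediction tokens that were accepted (0 if none were counted).
    acceptanceRate: counted === 0 ? 0 : accepted / counted,
  };
}

// Counts taken from the example in the text.
const exampleUsage: PredictionUsage = {
  completion_tokens_details: {
    accepted_prediction_tokens: 18,
    rejected_prediction_tokens: 10,
  },
};

const predictionSummary = summarizePrediction(exampleUsage);
console.log(
  predictionSummary.accepted,
  predictionSummary.rejected,
  predictionSummary.acceptanceRate.toFixed(2)
);
// prints: 18 10 0.64
```

Tracking the rejected count over time is one way to notice when predictions stop paying for themselves.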

Streaming example

The latency gains of Predicted Outputs are even greater when you use streaming for API responses. The same code refactoring use case works with streaming in the OpenAI SDKs: the request is unchanged apart from enabling streaming.
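The sketch below shows one way the streamed pieces might be stitched back into a file. It assumes streamed chunks carry text in `choices[0].delta.content` and that streaming is requested with `stream: true`; `appendDelta`, the sample file, and the stand-in chunks are hypothetical, and no network call is made.

```typescript
// Minimal shape of a streamed chunk, as assumed in this sketch.
interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
}

// Hypothetical helper: appends one chunk's text delta to the text so far.
function appendDelta(textSoFar: string, chunk: StreamChunk): string {
  const first = chunk.choices[0];
  const piece = first && first.delta.content ? first.delta.content : "";
  return textSoFar + piece;
}

// Illustrative current file, reused as the prediction.
const currentFile = 'class User {\n  username: string = "";\n}\n';

// Same request as the non-streaming case, plus `stream: true`.
const streamingRequest = {
  model: "gpt-4.1",
  messages: [
    { role: "user", content: "Replace the username property with an email property. Respond only with code." },
    { role: "user", content: currentFile },
  ],
  prediction: { type: "content", content: currentFile },
  stream: true,
};

// Stand-in chunks; a real stream would deliver these over time.
const fakeChunks: StreamChunk[] = [
  { choices: [{ delta: { content: "class User {\n" } }] },
  { choices: [{ delta: { content: '  email: string = "";\n' } }] },
  { choices: [{ delta: { content: "}\n" } }] },
  { choices: [{ delta: {} }] }, // chunk without text
  { choices: [] }, // chunk without choices
];

let streamedText = "";
for (const chunk of fakeChunks) {
  streamedText = appendDelta(streamedText, chunk);
}
console.log(streamedText);
```

With the official Node SDK, `appendDelta` would be called inside a `for await (const chunk of stream)` loop over the value returned by `openai.chat.completions.create(...)`.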

Position of predicted text in response

When providing prediction text, your prediction can appear anywhere within the generated response and still reduce latency. Let’s say your predicted text is the source file of a simple Hono server, and you prompt the model to regenerate the file with new content added partway through it.

The response would contain the original file with the new content inserted. You would still see accepted prediction tokens in the usage data, even though the prediction text appears both before and after the new content added to the response.

This time, there were no rejected prediction tokens, because the entire content of the file we predicted was used in the final response. Nice! 🔥
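To make the layout concrete, the sketch below compares a predicted file with a response that adds content in the middle. The Hono file, the `/nested` route, and `splitAroundPrediction` are hypothetical stand-ins. The comparison only illustrates where the predicted text lands; it says nothing about how the API matches prediction tokens internally.

```typescript
// Hypothetical predicted file: a small Hono server, one string per line.
const predictedLines = [
  'import { Hono } from "hono";',
  "",
  "const app = new Hono();",
  "",
  'app.get("/api", (c) => c.text("Hello Hono!"));',
  "",
  "export default app;",
];

// Hypothetical response: the same file with one new route before the export.
const responseLines = predictedLines
  .slice(0, 6)
  .concat(['app.get("/nested", (c) => c.text("Nested!"));', ""])
  .concat(predictedLines.slice(6));

// Counts the predicted lines that open and close the response, and
// returns whatever sits between them.
function splitAroundPrediction(predicted: string[], response: string[]) {
  let prefix = 0;
  while (
    prefix < predicted.length &&
    prefix < response.length &&
    predicted[prefix] === response[prefix]
  ) {
    prefix++;
  }
  let suffix = 0;
  // The suffix may not reuse lines already counted in the prefix.
  while (
    suffix < predicted.length - prefix &&
    suffix < response.length - prefix &&
    predicted[predicted.length - 1 - suffix] === response[response.length - 1 - suffix]
  ) {
    suffix++;
  }
  return { prefix, suffix, added: response.slice(prefix, response.length - suffix) };
}

const layout = splitAroundPrediction(predictedLines, responseLines);
console.log(layout.prefix, layout.suffix, layout.added.length);
// prints: 6 1 2
```

Here six predicted lines precede the new route and one follows it, so all seven predicted lines reappear in the response.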

Limitations

When using Predicted Outputs, you should consider the following factors and limitations.

  • Predicted Outputs are only supported with the GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, and GPT-4.1-nano series of models.
  • When providing a prediction, any tokens provided that are not part of the final completion are still charged at completion token rates. Check the rejected_prediction_tokens property of the usage object to see how many prediction tokens were not used in the final response.
  • The following API parameters are not supported when using Predicted Outputs:
    • n: values higher than 1 are not supported
    • logprobs: not supported
    • presence_penalty: values greater than 0 are not supported
    • frequency_penalty: values greater than 0 are not supported
    • audio: Predicted Outputs are not compatible with audio inputs and outputs
    • modalities: Only text modalities are supported
    • max_completion_tokens: not supported
    • tools: Function calling is not currently supported with Predicted Outputs
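The parameter restrictions above can be checked before a request is sent. The sketch below is a hypothetical client-side helper, not an SDK feature. It assumes that `logprobs` is only a problem when set to true and that any supplied `audio`, `max_completion_tokens`, or `tools` value counts as unsupported.

```typescript
// Request fields this check looks at (all optional).
interface PredictionRequestOptions {
  n?: number;
  logprobs?: boolean;
  presence_penalty?: number;
  frequency_penalty?: number;
  audio?: object;
  modalities?: string[];
  max_completion_tokens?: number;
  tools?: object[];
}

// Hypothetical helper: returns the names of parameters that conflict
// with Predicted Outputs, in the order of the list above.
function findUnsupportedOptions(req: PredictionRequestOptions): string[] {
  const problems: string[] = [];
  if (req.n !== undefined && req.n > 1) problems.push("n");
  if (req.logprobs === true) problems.push("logprobs"); // assumption: false is harmless
  if (req.presence_penalty !== undefined && req.presence_penalty > 0) problems.push("presence_penalty");
  if (req.frequency_penalty !== undefined && req.frequency_penalty > 0) problems.push("frequency_penalty");
  if (req.audio !== undefined) problems.push("audio");
  if (req.modalities !== undefined) {
    for (const modality of req.modalities) {
      if (modality !== "text") {
        problems.push("modalities");
        break;
      }
    }
  }
  if (req.max_completion_tokens !== undefined) problems.push("max_completion_tokens");
  if (req.tools !== undefined) problems.push("tools");
  return problems;
}

console.log(findUnsupportedOptions({ n: 2, tools: [] }).join(", "));
// prints: n, tools
```

The API remains the source of truth for what it accepts; a check like this only surfaces likely conflicts earlier.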
