
URL: https://developers.openai.com/api/docs/guides/moderation

Moderation | OpenAI API



Use OpenAI moderation models to detect harmful content in text and images. You can classify standalone inputs with the moderation endpoint or request moderation scores alongside a generated response. Use the results to enforce your application’s policy, such as filtering content, routing a request for review, or intervening with accounts that submit flagged content.

The omni-moderation-latest model accepts text and image inputs. It doesn’t classify audio. The moderation endpoint is free to use, and image files can be up to 20 MB.

Choose a moderation workflow

Workflow | Use when
Moderate generated content | Your application generates text with the Responses API or Chat Completions API and needs moderation signals.
Classify standalone inputs | Your application needs to classify text or images without generating a model response.
Understand moderation results | Your application needs to interpret flags, categories, scores, or applied input types.
Review supported categories | Your application needs to know which harm categories apply to text, images, or both.

Moderate generated content

When your application needs generated text and moderation scores together, pass a top-level moderation object in the generation request. The API returns moderation scores for the model input and generated output without a separate moderation request.

The model still generates normally. Review the moderation results before you show the output to a user or take downstream actions.

To request inline moderation from the Responses API, set moderation.model when you create a response.

The Responses API returns an input moderation_result object at response.moderation.input and an output moderation_result object at response.moderation.output.
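The request and response shapes described above can be sketched with plain Python dictionaries rather than a specific SDK call. This is a minimal sketch under assumptions: the helper names are hypothetical, the generation model is whatever your application already uses, and the exact contents of the moderation objects are not shown in this guide, so the helper only locates them.

```python
from typing import Any


def build_responses_request(
    generation_model: str, user_input: str, moderation_model: str
) -> dict[str, Any]:
    """Build a Responses API request body that also asks for moderation scores."""
    return {
        "model": generation_model,
        "input": user_input,
        # Top-level moderation object; setting moderation.model requests inline scores.
        "moderation": {"model": moderation_model},
    }


def read_inline_moderation(response: dict[str, Any]) -> tuple[Any, Any]:
    """Return the (input, output) moderation objects from a decoded response body.

    Either element is None when the corresponding field is absent.
    """
    moderation = response.get("moderation") or {}
    return moderation.get("input"), moderation.get("output")
```

Passing "omni-moderation-latest" as moderation_model is an assumption here; the guide names that model for the standalone endpoint and does not list which values moderation.model accepts.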

To request inline moderation from Chat Completions, set moderation.model when you create a chat completion.

Chat Completions returns moderation result containers at completion.moderation.input and completion.moderation.output. For a request with one generated choice, read the first input and output result at results[0]. If you request multiple choices, completion.moderation.output.results[i] corresponds to completion.choices[i].
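The index correspondence between choices and output results can be sketched as follows. The function works on a decoded completion body represented as a dictionary; the helper name is hypothetical, and the layout of each individual result is left opaque because it is not shown here.

```python
from typing import Any, Optional


def pair_choices_with_moderation(
    completion: dict[str, Any],
) -> list[tuple[dict[str, Any], Optional[dict[str, Any]]]]:
    """Pair completion.choices[i] with completion.moderation.output.results[i].

    A choice with no matching result is paired with None so the caller can
    decide how to treat unmoderated output.
    """
    choices = completion.get("choices", [])
    output = (completion.get("moderation") or {}).get("output") or {}
    results = output.get("results", [])
    paired = []
    for i, choice in enumerate(choices):
        result = results[i] if i < len(results) else None
        paired.append((choice, result))
    return paired
```

Pairing a missing result with None, rather than skipping the choice, keeps every generated choice visible to whatever policy code runs next.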

Inline moderation results use the same category fields as a standalone moderation result. Start with flagged for a first-pass decision, then inspect categories and category_scores for logging, routing, audit trails, or human-review queues. A refusal or other safety-aware response can still trigger a flag if it discusses harmful content. Treat moderation scores as signals for your application’s policy, not as an automatic blocking decision.
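One way to turn these fields into a policy signal is sketched below. The action names ("review", "log", "allow") and the 0.5 threshold are hypothetical choices for illustration, not values from the API; the function only assumes the flagged, categories, and category_scores fields described above.

```python
from typing import Any

# Hypothetical application threshold; tune it against your own data.
REVIEW_THRESHOLD = 0.5


def first_pass_decision(
    result: dict[str, Any], review_threshold: float = REVIEW_THRESHOLD
) -> dict[str, Any]:
    """Map one moderation result to an application-level action.

    Flagged content is routed to review instead of being blocked outright,
    because a refusal that discusses harmful content can also be flagged.
    """
    scores = result.get("category_scores", {})
    flagged_categories = sorted(
        name for name, hit in result.get("categories", {}).items() if hit
    )
    top_category, top_score = max(
        scores.items(), key=lambda item: item[1], default=(None, 0.0)
    )
    if result.get("flagged"):
        action = "review"
    elif top_score >= review_threshold:
        action = "log"  # not flagged, but worth recording for audit trails
    else:
        action = "allow"
    return {
        "action": action,
        "flagged_categories": flagged_categories,
        "top_category": top_category,
        "top_score": top_score,
    }
```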

Check the moderation result type before you read scores if your application needs to handle moderation failures. If a moderation step can’t complete, the corresponding input or output moderation field can contain an error instead of moderation scores.
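A defensive reader for this case might look like the sketch below. It rests on a loud assumption: that each moderation field carries a type discriminator whose value is "moderation_result" on success. The guide does not show the exact field name or the error shape, so adjust the check to the payload you actually receive. ModerationUnavailable is a hypothetical exception name.

```python
from typing import Any, Optional


class ModerationUnavailable(Exception):
    """Raised when a moderation field holds no usable scores."""


def scores_or_raise(field: Optional[dict[str, Any]]) -> dict[str, float]:
    """Return category_scores only when the field looks like a successful result.

    Assumption: success is marked by field["type"] == "moderation_result".
    Anything else, including a missing field, is treated as a failure.
    """
    if not isinstance(field, dict) or field.get("type") != "moderation_result":
        raise ModerationUnavailable(f"no moderation scores available: {field!r}")
    return field.get("category_scores", {})
```

Raising instead of returning an empty dictionary keeps a failed moderation step from being mistaken for content that scored zero everywhere.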

For tool-calling requests, moderation covers tool-call arguments and tool outputs when they appear in conversation content. It doesn’t cover tool names, tool descriptions, tool schemas, or response-format schemas.

If you stream a generated response, moderation scores arrive after the full generated output is available. They aren’t included with partial output deltas.

Classify standalone inputs

Use the moderation endpoint to classify text or image inputs without generating a model response. With the OpenAI libraries and the omni-moderation-latest model, you can moderate text inputs alone or moderate images and text together.
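The sketch below builds a standalone moderation request with only the Python standard library, so the request shape is visible without an SDK. It assumes the REST endpoint https://api.openai.com/v1/moderations with bearer-token authentication and the typed input parts (text and image_url) used for mixed inputs; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Union

MODERATIONS_URL = "https://api.openai.com/v1/moderations"


def text_and_image_input(text: str, image_url: str) -> list[dict[str, Any]]:
    """Build a mixed input: one text part and one image part."""
    return [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]


def build_moderation_request(
    api_key: str,
    moderation_input: Union[str, list[dict[str, Any]]],
    model: str = "omni-moderation-latest",
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to the moderation endpoint."""
    body = json.dumps({"model": model, "input": moderation_input}).encode("utf-8")
    return urllib.request.Request(
        MODERATIONS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def classify(
    api_key: str, moderation_input: Union[str, list[dict[str, Any]]]
) -> dict[str, Any]:
    """Send the request and return the decoded JSON body (performs network I/O)."""
    request = build_moderation_request(api_key, moderation_input)
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

For a text-only check, pass a plain string as moderation_input. For images with text, pass the list from text_and_image_input. Per-input results are expected under the results key of the returned body.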

Understand moderation results

Consider the output for an image taken from a single frame of a war movie. The model identifies indicators of violence in the image and returns a violence category score greater than 0.8.

The JSON response includes fields that describe which categories are present in the input and the model’s confidence in each category.

Output category | Description
flagged | Set to true if the model classifies the content as potentially harmful, false otherwise.
categories | Contains a dictionary of per-category violation flags. For each category, the value is true if the model flags the corresponding category as violated, false otherwise.
category_scores | Contains a dictionary of per-category scores. Each score represents the model’s confidence that the input contains content in the category. The value is between 0 and 1, where higher values denote higher confidence.
category_applied_input_types | Contains the input types that the category score applies to. For example, if the violence/graphic category applies to both image and text inputs, the violence/graphic property is set to ["image", "text"].
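The four fields can be read together as in the sketch below, which lists each violated category with its score and the input types the score applies to. The sample dictionary holds illustrative, made-up values shaped like the fields above; it is not real API output, and summarize_result is a hypothetical helper name.

```python
from typing import Any


def summarize_result(result: dict[str, Any]) -> list[tuple[str, float, list[str]]]:
    """List (category, score, applied input types) for each violated category.

    Rows are sorted by score, highest first.
    """
    applied = result.get("category_applied_input_types", {})
    rows = [
        (category, result["category_scores"][category], applied.get(category, []))
        for category, violated in result["categories"].items()
        if violated
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative values only, shaped like the fields described above.
sample = {
    "flagged": True,
    "categories": {"violence": True, "violence/graphic": False, "sexual": False},
    "category_scores": {"violence": 0.85, "violence/graphic": 0.02, "sexual": 0.0},
    "category_applied_input_types": {
        "violence": ["image"],
        "violence/graphic": ["image"],
        "sexual": ["image"],
    },
}

for category, score, input_types in summarize_result(sample):
    print(f"{category}: score={score:.2f}, applies to {input_types}")
```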

We plan to continuously upgrade the moderation endpoint’s underlying model. Therefore, custom policies that rely on category_scores may need recalibration over time.

Review supported categories

The table below describes the content categories that the moderation endpoint can detect and the input types that each category supports.

Categories marked as “Text only” do not support image inputs. If you send only images (without accompanying text) to the omni-moderation-latest model, it will return a score of 0 for these unsupported categories. Image files are limited to 20 MB.
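Because an image-only request returns 0 for text-only categories, a score of 0 there means "not evaluated" rather than "benign". The sketch below drops those scores before they reach policy code. The category set is copied from the table in this section; evaluated_scores is a hypothetical helper name.

```python
# Categories listed as "Text only" in the supported-categories table.
TEXT_ONLY_CATEGORIES = frozenset({
    "harassment",
    "harassment/threatening",
    "hate",
    "hate/threatening",
    "illicit",
    "illicit/violent",
    "sexual/minors",
})


def evaluated_scores(category_scores: dict[str, float], sent_text: bool) -> dict[str, float]:
    """Keep only scores for categories that could apply to the request.

    When the request contained no text, text-only categories are removed so
    their placeholder score of 0 is not read as a clean result.
    """
    if sent_text:
        return dict(category_scores)
    return {
        name: score
        for name, score in category_scores.items()
        if name not in TEXT_ONLY_CATEGORIES
    }
```

Hard-coding the set trades flexibility for clarity; if the supported categories change, the set needs to be updated to match.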

Category | Description | Inputs
harassment | Content that expresses, incites, or promotes harassing language towards any target. | Text only
harassment/threatening | Harassment content that also includes violence or serious harm towards any target. | Text only
hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. Hateful content aimed at non-protected groups (e.g., chess players) is harassment. | Text only
hate/threatening | Hateful content that also includes violence or serious harm towards the targeted group based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. | Text only
illicit | Content that gives advice or instruction on how to commit illicit acts. A phrase like “how to shoplift” would fit this category. | Text only
illicit/violent | The same types of content flagged by the illicit category, but also includes references to violence or procuring a weapon. | Text only
self-harm | Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders. | Text and images
self-harm/intent | Content where the speaker expresses that they are engaging or intend to engage in acts of self-harm, such as suicide, cutting, and eating disorders. | Text and images
self-harm/instructions | Content that encourages performing acts of self-harm, such as suicide, cutting, and eating disorders, or that gives instructions or advice on how to commit such acts. | Text and images
sexual | Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness). | Text and images
sexual/minors | Sexual content that includes an individual who is under 18 years old. | Text only
violence | Content that depicts death, violence, or physical injury. | Text and images
violence/graphic | Content that depicts death, violence, or physical injury in graphic detail. | Text and images
