Content moderation
Content moderation is the process of analyzing text, images, or other content to identify potentially harmful, inappropriate, or unsafe material. In the context of AI systems, moderation helps:
- Filter out harmful or inappropriate user inputs
- Prevent the generation of harmful or inappropriate AI responses
- Ensure compliance with ethical guidelines and legal requirements
- Protect users from exposure to potentially harmful content
Moderation systems typically analyze content against predefined categories of harmful content (such as hate speech, violence, sexual content, etc.) and provide a determination of whether the content violates policies in any of these categories.
Content moderation is crucial in AI applications for several reasons:
- Safety and security
  - Protect users from harmful, offensive, or disturbing content
  - Prevent the misuse of AI systems for generating harmful content
  - Maintain a safe environment for all users
- Legal and ethical compliance
  - Comply with regulations regarding content distribution
  - Adhere to ethical guidelines for AI deployment
  - Avoid potential legal liabilities associated with harmful content
- Quality control
  - Maintain the quality and appropriateness of interactions
  - Ensure AI responses align with organizational values and standards
  - Build user trust by consistently providing safe and appropriate content
Types of moderated content
Koog's moderation system can analyze various types of content:
- User messages
  - Text inputs from users before they are processed by the AI
  - Images uploaded by users (with the OpenAI Moderation.Omni model)
- Assistant messages
  - AI-generated responses before they are shown to users (see the sketch after this list)
  - Responses can be checked to ensure they don't contain harmful content
- Tool content
  - Content generated by or passed to tools integrated with the AI system
  - Ensures that tool inputs and outputs maintain content safety standards
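The client examples later in this document moderate user prompts, but the same call works for assistant output: wrap the generated reply in a prompt and moderate it before returning it. Below is a minimal sketch, assuming the prompt DSL's assistant(...) builder and using openAIClient and generatedReply as placeholders for your own client instance and model output.

// Sketch: check an AI-generated reply before it is shown to the user.
// `openAIClient` and `generatedReply` are placeholders for your client and model output.
val replyCheck = prompt("assistant-reply-check") {
    assistant(generatedReply)
}
val verdict = openAIClient.moderate(replyCheck, OpenAIModels.Moderation.Omni)
if (verdict.isHarmful) {
    // Suppress or rewrite the reply instead of returning it to the user
}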
Supported providers and models
Koog supports content moderation through multiple providers and models:
OpenAI
OpenAI offers two moderation models:
- OpenAIModels.Moderation.Text
  - Text-only moderation
  - Previous generation moderation model
  - Analyzes text content against multiple harm categories
  - Fast and cost-effective (see the example after this list)
- OpenAIModels.Moderation.Omni
  - Supports both text and image moderation
  - Most capable OpenAI moderation model
  - Can identify harmful content in both text and images
  - More comprehensive than the Text model
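For fast, text-only checks, the Text model can be passed in place of Omni; the call shape is identical to the client examples below. This is a minimal sketch assuming an already constructed OpenAILLMClient and a placeholder userInput value.

// Sketch: a cheaper text-only check using the previous-generation model.
val textOnlyResult = openAIClient.moderate(
    prompt("quick-check") { user(userInput) }, // `userInput` is a placeholder
    OpenAIModels.Moderation.Text
)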
Ollama
Ollama supports moderation through the following model:
- OllamaModels.Meta.LLAMA_GUARD_3
  - Text-only moderation
  - Based on Meta's Llama Guard family of models
  - Specialized for content moderation tasks
  - Runs locally through Ollama
Using moderation with LLM clients
Koog provides two main approaches to content moderation: calling moderate directly on an LLMClient instance, or using the moderate method on a PromptExecutor.
Direct moderation with LLMClient
You can use the moderate method directly on an LLMClient instance:
// Example with OpenAI client
val openAIClient = OpenAILLMClient(apiKey)

val prompt = prompt("harmful-prompt") {
    user("I want to build a bomb")
}

// Moderate with OpenAI's Omni moderation model
val result = openAIClient.moderate(prompt, OpenAIModels.Moderation.Omni)

if (result.isHarmful) {
    println("Content was flagged as harmful")
    // Handle harmful content (e.g., reject the prompt)
} else {
    // Proceed with processing the prompt
}
The moderate method takes the following arguments:

Name | Data type | Required | Default | Description |
---|---|---|---|---|
prompt | Prompt | Yes | | The prompt to moderate. |
model | LLModel | Yes | | The model to use for moderation. |

The method returns a ModerationResult.
Here is an example of using content moderation with the Llama Guard 3 model through Ollama:
// Example with Ollama client
val ollamaClient = OllamaClient()

val prompt = prompt("harmful-prompt") {
    user("How to hack into someone's account")
}

// Moderate with Llama Guard 3
val result = ollamaClient.moderate(prompt, OllamaModels.Meta.LLAMA_GUARD_3)

if (result.isHarmful) {
    println("Content was flagged as harmful")
    // Handle harmful content
} else {
    // Proceed with processing the prompt
}
Moderation with PromptExecutor
You can also use the moderate method on a PromptExecutor, which will use the appropriate LLMClient based on the model's provider:
// Create a multi-provider executor
val executor = MultiLLMPromptExecutor(
    LLMProvider.OpenAI to OpenAILLMClient(openAIApiKey),
    LLMProvider.Ollama to OllamaClient()
)

val prompt = prompt("harmful-prompt") {
    user("How to create illegal substances")
}

// Moderate with OpenAI
val openAIResult = executor.moderate(prompt, OpenAIModels.Moderation.Omni)

// Or moderate with Ollama
val ollamaResult = executor.moderate(prompt, OllamaModels.Meta.LLAMA_GUARD_3)

// Process the results
if (openAIResult.isHarmful || ollamaResult.isHarmful) {
    // Handle harmful content
}
The moderate method takes the following arguments:

Name | Data type | Required | Default | Description |
---|---|---|---|---|
prompt | Prompt | Yes | | The prompt to moderate. |
model | LLModel | Yes | | The model to use for moderation. |

The method returns a ModerationResult.
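A common pattern is to treat the executor as a moderation gate: moderate the prompt first and only pass it on when it is not flagged. The sketch below assumes moderate is a suspending call and uses a hypothetical handleSafePrompt function to stand in for whatever your application does with an approved prompt.

// Sketch: a pre-execution moderation gate built on PromptExecutor.moderate.
suspend fun moderateThenRun(executor: PromptExecutor, prompt: Prompt) {
    val moderation = executor.moderate(prompt, OpenAIModels.Moderation.Omni)
    if (moderation.isHarmful) {
        println("Prompt rejected by moderation")
        return
    }
    handleSafePrompt(prompt) // hypothetical application-specific continuation
}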
ModerationResult structure
The moderation process returns a ModerationResult object with the following structure:
@Serializable
public data class ModerationResult(
    val isHarmful: Boolean,
    val categories: Map<ModerationCategory, Boolean>,
    val categoryScores: Map<ModerationCategory, Double> = emptyMap(),
    val categoryAppliedInputTypes: Map<ModerationCategory, List<InputType>> = emptyMap()
) {
    /**
     * Represents the type of input provided for content moderation.
     *
     * This enumeration is used in conjunction with moderation categories to specify
     * the format of the input being analyzed.
     */
    @Serializable
    public enum class InputType {
        /**
         * This enum value is typically used to classify inputs as textual data
         * within the supported input types.
         */
        TEXT,

        /**
         * Represents an input type specifically designed for handling and processing images.
         * This enum constant can be used to classify or determine behavior for workflows requiring image-based inputs.
         */
        IMAGE,
    }
}
A ModerationResult object includes the following properties:

Name | Data type | Required | Default | Description |
---|---|---|---|---|
isHarmful | Boolean | Yes | | If true, the content was flagged as harmful. |
categories | Map<ModerationCategory, Boolean> | Yes | | A map of moderation categories to boolean values indicating which categories were flagged. |
categoryScores | Map<ModerationCategory, Double> | No | emptyMap() | A map of moderation categories to confidence scores (0.0 to 1.0). |
categoryAppliedInputTypes | Map<ModerationCategory, List<InputType>> | No | emptyMap() | A map indicating which input types (TEXT or IMAGE) triggered each category. |
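As a sketch of how these properties can be consumed together (assuming the structure shown above; the 0.8 threshold is illustrative, not a Koog default):

// Sketch: inspect a ModerationResult beyond the top-level isHarmful flag.
fun inspect(result: ModerationResult) {
    // Categories the provider flagged
    val flagged = result.categories.filterValues { it }.keys
    println("Flagged categories: $flagged")

    // Categories whose confidence score exceeds an illustrative threshold
    val highConfidence = result.categoryScores.filterValues { it > 0.8 }.keys
    println("High-confidence categories: $highConfidence")

    // Input types (TEXT or IMAGE) that triggered each category
    result.categoryAppliedInputTypes.forEach { (category, inputTypes) ->
        println("$category triggered by: $inputTypes")
    }
}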
Moderation categories
Koog moderation categories
Possible moderation categories provided by the Koog framework (regardless of the underlying LLM and LLM provider) are as follows:
- Harassment: content that involves intimidation, bullying, or other behaviors directed towards individuals or groups with the intent to harass or demean.
- HarassmentThreatening: harmful interactions or communications that are intended to intimidate, coerce, or threaten individuals or groups.
- Hate: content that contains elements perceived as offensive, discriminatory, or expressing hatred towards individuals or groups based on attributes such as race, religion, gender, or other characteristics.
- HateThreatening: hate-related moderation category focusing on harmful content that not only spreads hate but also includes threatening language, behavior, or implications.
- Illicit: content that violates legal frameworks or ethical guidelines, including illegal or illicit activities.
- IllicitViolent: content that involves a combination of illegal or illicit activities with elements of violence.
- SelfHarm: content that pertains to self-harm or related behavior.
- SelfHarmIntent: material that contains expressions or indications of an individual's intent to harm themselves.
- SelfHarmInstructions: content that provides guidance, techniques, or encouragement for engaging in self-harm behaviors.
- Sexual: content that is sexually explicit or contains sexual references.
- SexualMinors: content concerning the exploitation, abuse, or endangerment of minors in a sexual context.
- Violence: content that promotes, incites, or depicts violence and physical harm towards individuals or groups.
- ViolenceGraphic: content that includes graphic depictions of violence, which may be harmful, distressing, or triggering to viewers.
- Defamation: responses that are verifiably false and likely to injure a living person's reputation.
- SpecializedAdvice: content that contains specialized financial, medical, or legal advice.
- Privacy: content that contains sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.
- IntellectualProperty: responses that may violate the intellectual property rights of any third party.
- ElectionsMisinformation: content that contains factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections.
Note
These categories are subject to change as new moderation categories might be added, and existing ones may evolve over time.
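Because these categories are provider-independent, application code can branch on them without knowing which model produced the result. A minimal sketch, assuming the categories map of ModerationResult described above (the handling shown is purely illustrative):

// Sketch: act on specific Koog moderation categories, regardless of provider.
fun routeModeration(result: ModerationResult): String = when {
    result.categories[ModerationCategory.SelfHarm] == true ||
        result.categories[ModerationCategory.SelfHarmIntent] == true ->
        "Show support resources instead of a normal reply"
    result.categories[ModerationCategory.SexualMinors] == true ->
        "Block and escalate for human review"
    result.isHarmful ->
        "Block under the general moderation policy"
    else ->
        "Safe to process"
}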
OpenAI moderation categories
OpenAI's moderation API provides the following categories:
- Harassment: content that expresses, incites, or promotes harassing language towards any target.
- Harassment/threatening: harassment content that also includes violence or serious harm towards any target.
- Hate: content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. Hateful content aimed at non-protected groups is harassment.
- Hate/threatening: hateful content that also includes violence or serious harm towards the targeted group based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste.
- Illicit: content that gives advice or instruction on how to commit illicit acts. A phrase like "how to shoplift" would fit this category.
- Illicit/violent: the same types of content flagged by the illicit category, but also includes references to violence or procuring a weapon.
- Self-harm: content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders.
- Self-harm/intent: content where the speaker expresses that they are engaging or intend to engage in acts of self-harm, such as suicide, cutting, and eating disorders.
- Self-harm/instructions: content that encourages performing acts of self-harm, such as suicide, cutting, and eating disorders, or that gives instructions or advice on how to commit such acts.
- Sexual: content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness).
- Sexual/minors: sexual content that includes an individual who is under 18 years old.
- Violence: content that depicts death, violence, or physical injury.
- Violence/graphic: content that depicts death, violence, or physical injury in graphic detail.
Ollama hazard categories
Ollama's Llama Guard models use the following hazard categories:
- S1 - Violent crimes: responses that enable, encourage, or endorse the commission of violent crimes, including:
  - Unlawful violence toward people such as terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping.
  - Unlawful violence toward animals such as animal abuse.
- S2 - Non-violent crimes: responses that enable, encourage, or endorse the commission of non-violent crimes, including:
  - Personal crimes such as labor trafficking, threats, intimidation, hate-motivated non-violent crimes
  - Financial crimes such as fraud, scams, money laundering
  - Property crimes such as theft, arson, vandalism
  - Drug crimes such as creating or using regulated substances
  - Weapons crimes such as producing unlicensed firearms
  - Cyber crimes such as hacking
- S3 - Sex-related crimes: responses that enable, encourage, or endorse the commission of sex-related crimes, including:
  - Sex trafficking
  - Sexual assault such as rape
  - Sexual harassment (physical, verbal, or visual)
  - Prostitution
- S4 - Child sexual exploitation: responses that contain, describe, enable, encourage, or endorse the sexual abuse of children.
- S5 - Defamation: responses that are both verifiably false and likely to injure a living person's reputation.
- S6 - Specialized advice: responses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe.
- S7 - Privacy: responses that contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.
- S8 - Intellectual property: responses that may violate the intellectual property rights of any third party.
- S9 - Indiscriminate weapons: responses that enable, encourage, or endorse the creation of indiscriminate weapons, including:
  - Chemical weapons such as nerve gas
  - Biological weapons such as anthrax
  - Radiological weapons such as cobalt bombs
  - Nuclear weapons such as fission bombs
  - High-yield explosive weapons such as cluster munitions
- S10 - Hate: responses that demean or dehumanize people on the basis of their sensitive, personal characteristics such as race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, or serious disease.
- S11 - Suicide and self-harm: responses that enable, encourage, or endorse acts of intentional self-harm, including:
  - Suicide
  - Self-injury such as cutting
  - Disordered eating
- S12 - Sexual content: responses that contain erotica.
- S13 - Elections: responses that contain factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections.
Category mapping between providers
The following table shows the mapping between Ollama and OpenAI moderation categories:
Ollama category | Closest OpenAI moderation category or categories | Notes |
---|---|---|
S1 – Violent crimes | illicit/violent, violence (violence/graphic when gore is described) | Covers instructions or endorsement of violent wrongdoing, plus the violent content itself. |
S2 – Non-violent crimes | illicit | Provides or encourages non-violent criminal activity (fraud, hacking, drug making, etc.). |
S3 – Sex-related crimes | illicit/violent (rape, trafficking, etc.), sexual (sexual-assault descriptions) | Violent sexual wrongdoing combines illicit instructions + sexual content. |
S4 – Child sexual exploitation | sexual/minors | Any sexual content involving minors. |
S5 – Defamation | UNIQUE | OpenAI's categories don't have a dedicated defamation flag. |
S6 – Specialized advice (medical, legal, financial, dangerous-activity "safe" claims) | UNIQUE | Not directly represented in the OpenAI schema. |
S7 – Privacy (exposed personal data, doxxing) | UNIQUE | No direct privacy-disclosure category in OpenAI moderation. |
S8 – Intellectual property | UNIQUE | Copyright / IP issues are not a moderation category in OpenAI. |
S9 – Indiscriminate weapons | illicit/violent | Instructions to build or deploy WMDs are violent illicit content. |
S10 – Hate | hate (demeaning), hate/threatening (violent or murderous hate) | Same protected-class scope. |
S11 – Suicide and self-harm | self-harm, self-harm/intent, self-harm/instructions | Matches exactly to OpenAI's three self-harm sub-types. |
S12 – Sexual content (erotica) | sexual | Ordinary adult erotica (minors would shift to sexual/minors). |
S13 – Elections misinformation | UNIQUE | Electoral-process misinformation isn't singled out in OpenAI's categories. |
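For readers who want this mapping in code form, the table above can be transcribed directly. The function below is purely illustrative: Koog performs its own mapping internally, and the rows marked UNIQUE correspond to Koog-specific categories rather than OpenAI ones.

// Illustrative transcription of the mapping table above (not Koog's internal logic).
// Returns the closest OpenAI moderation category names for a given Ollama hazard code,
// or an empty list where the table marks the category as UNIQUE.
fun closestOpenAICategories(ollamaCode: String): List<String> = when (ollamaCode) {
    "S1" -> listOf("illicit/violent", "violence") // plus violence/graphic when gore is described
    "S2" -> listOf("illicit")
    "S3" -> listOf("illicit/violent", "sexual")
    "S4" -> listOf("sexual/minors")
    "S9" -> listOf("illicit/violent")
    "S10" -> listOf("hate", "hate/threatening")
    "S11" -> listOf("self-harm", "self-harm/intent", "self-harm/instructions")
    "S12" -> listOf("sexual")
    else -> emptyList() // S5-S8 and S13 are unique to the Ollama taxonomy
}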
Examples of moderation results
OpenAI moderation example (harmful content)
OpenAI provides a dedicated /moderations API that returns responses in the following JSON format:
{
  "isHarmful": true,
  "categories": {
    "Harassment": false,
    "HarassmentThreatening": false,
    "Hate": false,
    "HateThreatening": false,
    "Sexual": false,
    "SexualMinors": false,
    "Violence": false,
    "ViolenceGraphic": false,
    "SelfHarm": false,
    "SelfHarmIntent": false,
    "SelfHarmInstructions": false,
    "Illicit": true,
    "IllicitViolent": true
  },
  "categoryScores": {
    "Harassment": 0.0001,
    "HarassmentThreatening": 0.0001,
    "Hate": 0.0001,
    "HateThreatening": 0.0001,
    "Sexual": 0.0001,
    "SexualMinors": 0.0001,
    "Violence": 0.0145,
    "ViolenceGraphic": 0.0001,
    "SelfHarm": 0.0001,
    "SelfHarmIntent": 0.0001,
    "SelfHarmInstructions": 0.0001,
    "Illicit": 0.9998,
    "IllicitViolent": 0.9876
  },
  "categoryAppliedInputTypes": {
    "Illicit": ["TEXT"],
    "IllicitViolent": ["TEXT"]
  }
}
In Koog, the response above maps to the following ModerationResult:
ModerationResult(
    isHarmful = true,
    categories = mapOf(
        ModerationCategory.Harassment to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.HarassmentThreatening to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Hate to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.HateThreatening to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Sexual to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SexualMinors to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Violence to ModerationCategoryResult(false, confidenceScore = 0.0145),
        ModerationCategory.ViolenceGraphic to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SelfHarm to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SelfHarmIntent to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SelfHarmInstructions to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Illicit to ModerationCategoryResult(true, confidenceScore = 0.9998, appliedInputTypes = listOf(InputType.TEXT)),
        ModerationCategory.IllicitViolent to ModerationCategoryResult(true, confidenceScore = 0.9876, appliedInputTypes = listOf(InputType.TEXT)),
    )
)
OpenAI moderation example (safe content)
{
  "isHarmful": false,
  "categories": {
    "Harassment": false,
    "HarassmentThreatening": false,
    "Hate": false,
    "HateThreatening": false,
    "Sexual": false,
    "SexualMinors": false,
    "Violence": false,
    "ViolenceGraphic": false,
    "SelfHarm": false,
    "SelfHarmIntent": false,
    "SelfHarmInstructions": false,
    "Illicit": false,
    "IllicitViolent": false
  },
  "categoryScores": {
    "Harassment": 0.0001,
    "HarassmentThreatening": 0.0001,
    "Hate": 0.0001,
    "HateThreatening": 0.0001,
    "Sexual": 0.0001,
    "SexualMinors": 0.0001,
    "Violence": 0.0001,
    "ViolenceGraphic": 0.0001,
    "SelfHarm": 0.0001,
    "SelfHarmIntent": 0.0001,
    "SelfHarmInstructions": 0.0001,
    "Illicit": 0.0001,
    "IllicitViolent": 0.0001
  },
  "categoryAppliedInputTypes": {}
}
In Koog, the OpenAI response above is presented as follows:
ModerationResult(
    isHarmful = false,
    categories = mapOf(
        ModerationCategory.Harassment to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.HarassmentThreatening to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Hate to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.HateThreatening to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Sexual to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SexualMinors to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Violence to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.ViolenceGraphic to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SelfHarm to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SelfHarmIntent to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.SelfHarmInstructions to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.Illicit to ModerationCategoryResult(false, confidenceScore = 0.0001),
        ModerationCategory.IllicitViolent to ModerationCategoryResult(false, confidenceScore = 0.0001),
    )
)
Ollama moderation example (harmful content)
Ollama's approach to moderation differs significantly from OpenAI's. There are no dedicated moderation API endpoints in Ollama; instead, moderation requests go through the general chat API.
Ollama moderation models such as llama-guard3 respond with a plain-text result (an Assistant message), where the first line is always unsafe or safe, and the next line or lines contain comma-separated Ollama hazard categories (a parsing sketch follows the example below).
For example:

unsafe
S1,S10
This is translated to the following result in Koog:
ModerationResult(
    isHarmful = true,
    categories = mapOf(
        ModerationCategory.Harassment to ModerationCategoryResult(false),
        ModerationCategory.HarassmentThreatening to ModerationCategoryResult(false),
        ModerationCategory.Hate to ModerationCategoryResult(true), // from S10
        ModerationCategory.HateThreatening to ModerationCategoryResult(false),
        ModerationCategory.Sexual to ModerationCategoryResult(false),
        ModerationCategory.SexualMinors to ModerationCategoryResult(false),
        ModerationCategory.Violence to ModerationCategoryResult(false),
        ModerationCategory.ViolenceGraphic to ModerationCategoryResult(false),
        ModerationCategory.SelfHarm to ModerationCategoryResult(false),
        ModerationCategory.SelfHarmIntent to ModerationCategoryResult(false),
        ModerationCategory.SelfHarmInstructions to ModerationCategoryResult(false),
        ModerationCategory.Illicit to ModerationCategoryResult(true), // from S1
        ModerationCategory.IllicitViolent to ModerationCategoryResult(true), // from S1
    )
)
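Koog performs this safe/unsafe parsing and the S-code mapping internally, so you never handle the raw text yourself. Purely for illustration, a minimal parser for the raw llama-guard output format described above might look like the sketch below (a hypothetical helper, not Koog's actual implementation):

// Hypothetical sketch: parse the raw llama-guard response format
// ("safe" or "unsafe" on the first line, comma-separated hazard codes on the next).
// Koog does this internally; this is only to illustrate the format.
fun parseLlamaGuardOutput(raw: String): Pair<Boolean, List<String>> {
    val lines = raw.trim().lines()
    val isHarmful = lines.firstOrNull()?.trim().equals("unsafe", ignoreCase = true)
    val hazardCodes = lines
        .drop(1)
        .flatMap { it.split(",") }
        .map { it.trim() }
        .filter { it.isNotEmpty() }
    return isHarmful to hazardCodes
}

// parseLlamaGuardOutput("unsafe\nS1,S10") == (true to listOf("S1", "S10"))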
Ollama moderation example (safe content)
Here is an example of an Ollama response that marks the content as safe:

safe

Koog translates the response in the following way:
ModerationResult(
    isHarmful = false,
    categories = mapOf(
        ModerationCategory.Harassment to ModerationCategoryResult(false),
        ModerationCategory.HarassmentThreatening to ModerationCategoryResult(false),
        ModerationCategory.Hate to ModerationCategoryResult(false),
        ModerationCategory.HateThreatening to ModerationCategoryResult(false),
        ModerationCategory.Sexual to ModerationCategoryResult(false),
        ModerationCategory.SexualMinors to ModerationCategoryResult(false),
        ModerationCategory.Violence to ModerationCategoryResult(false),
        ModerationCategory.ViolenceGraphic to ModerationCategoryResult(false),
        ModerationCategory.SelfHarm to ModerationCategoryResult(false),
        ModerationCategory.SelfHarmIntent to ModerationCategoryResult(false),
        ModerationCategory.SelfHarmInstructions to ModerationCategoryResult(false),
        ModerationCategory.Illicit to ModerationCategoryResult(false),
        ModerationCategory.IllicitViolent to ModerationCategoryResult(false),
    )
)