Stop begging for JSON
How OpenAI's Structured Outputs makes building with AI much more reliable.
If you've ever dealt with LLMs in production, you've probably ended up with some variation of this prompt:
IMPORTANT: Return ONLY valid JSON. Do NOT include any other text or explanations. The response MUST be a valid JSON object and NOTHING else.
If you're lucky, this prompt works. If you're not, you'll see your evals failing as the LLM cheerfully ignores your instructions:
Of course! Here's the JSON output you requested:
{
"result": "lol. lmao even."
}
Let me know if you need anything else!
I've said before that demos are easy, and products are hard. While this is only one reason why, it's one of the most common roadblocks AI engineers encounter. And the issue isn't just annoying - it represents a real challenge to building reliable AI-enabled applications.
So today, I want to look at some of the strategies (besides begging) for working with JSON outputs from your LLMs and explain why OpenAI's approach may be close to solving the problem once and for all.
A brief history of "Please return JSON"
When GPT-3 was first released, one of its most promising use cases was as a translator between unstructured and structured data. Developers quickly discovered they could prompt the model with examples of converting raw text (like email dumps or customer feedback) to structured JSON output, and it would learn to follow the pattern. This wasn't just about making data prettier - it was about building a reliable bridge between natural language and programmatic interfaces.
With instruction tuning and RLHF, this process became even more straightforward - you could simply tell the model what output you wanted and didn't have to bother with examples. But RLHF introduced a new wrinkle: models became more conversational, more eager to help, and more likely to wrap their JSON outputs in friendly prose.
Suddenly, you couldn't just assume the AI's response would be a valid JSON string - you had to account for preambles like "Here's what I found:" and postscripts like "Is there anything else you'd like me to check?" Developers then turned to a variety of workarounds - some great, some not.
The poor man's parsing toolkit
Over time, AI engineers have tried different techniques to handle the "almost JSON" responses from LLMs, each with its own flaws and tradeoffs.
Regular expressions. One’s first instinct might be to use regexes to find anything that looks like a JSON object in the response and extract that. And if you're dealing with extremely simple data structures, this might work. At least until you hit nested structures, escaped quotes, or any other edge cases that make parsing JSON with regex a notorious anti-pattern.
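For illustration, here's a deliberately naive sketch (the helper name and pattern are my own, and this is exactly the kind of regex that falls over on anything non-trivial):

import json
import re

def extract_json_naively(response_text: str) -> dict:
    # Grab the first flat {...} span in the response and hope for the best.
    # Nested objects, braces inside strings, and escaped quotes all break this.
    match = re.search(r"\{[^{}]*\}", response_text)
    if match is None:
        raise ValueError("No JSON-looking object found in response")
    return json.loads(match.group(0))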
Delimiters. But regular expressions aren't totally useless. An alternative approach is to have the model wrap its JSON output in specific delimiters, and then parse those:
<json_output> { "result": "data" } </json_output>
This adds another layer of potential syntax issues, but it's more reliable than trying to parse raw responses. It's also good if you're working with barebones or open-source LLM APIs that don't have many bells and whistles.
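Parsing that takes only a few lines (a minimal sketch, assuming you prompted the model to use the <json_output> tags shown above):

import json
import re

def parse_delimited_json(response_text: str) -> dict:
    # Pull out whatever the model placed between the agreed-upon delimiters.
    match = re.search(r"<json_output>(.*?)</json_output>", response_text, re.DOTALL)
    if match is None:
        raise ValueError("No <json_output> delimiters found in response")
    return json.loads(match.group(1).strip())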
Assistant prefills. A more sophisticated approach emerged from understanding how LLMs handle conversation context. When you provide a message history that ends with an assistant response - including an interrupted one - the model tends to pick up where that response left off.
This "prefill" technique involves starting the assistant's response with something like "Here is your JSON output: " or even just "{". The model will likely continue with clean JSON rather than yapping. But you still need to handle edge cases and re-attach any opening characters you used to prompt the response before parsing.
Multi-step prompting. And a more experimental technique, especially as new models are released without all of the developer-oriented features, is to pass the results of one model, which has done the reasoning, to a second model, which can return properly formatted JSON.
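In code, that's just two calls chained together (a sketch; the model names are placeholders for "good at reasoning" and "happy to reformat"):

analysis = client.chat.completions.create(
    model="o1-preview",  # reasoning model without the JSON-oriented features
    messages=[{"role": "user", "content": "Analyze this email thread: ..."}],
).choices[0].message.content

formatted = client.chat.completions.create(
    model="gpt-4o-mini",  # second model only reformats the analysis
    messages=[
        {"role": "system", "content": "Convert the analysis below into JSON with the keys 'topics', 'sentiment', and 'main_action_item'. Return JSON only."},
        {"role": "user", "content": analysis},
    ],
).choices[0].message.content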
Ultimately, though, these approaches still can't guarantee that your results will be valid. You may parse your <json_output> tags and find that there are incorrect braces or quotes, or your field names don’t match what you expected - it's an incredibly frustrating experience.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
JSON Mode: A step forward
Recognizing these challenges, OpenAI introduced "JSON mode" over a year ago. It's exactly what it sounds like: a guarantee that the model's response will be valid JSON.
Using it is straightforward:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    response_format={"type": "json_object"}
)
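The content still comes back as a string, but you can now hand it straight to a parser without any defensive cleanup:

import json

data = json.loads(response.choices[0].message.content)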
As simple as this seems, it was a major win for developers. No more boilerplate parsing code or try-catch blocks just to handle the model's chattiness. But it came with some big limitations:
Inconsistent naming. The model could return any valid JSON, without guarantees on field names. A field might be named "phone" one minute and "phone_number" the next.
Missing validation. Likewise, there was no way to enforce types. You might get a string if you need a number, and vice versa.
Nested complexity. Nested objects were especially unpredictable, and you'd have to write a lot of custom, brittle logic to check that your nested types were correct.
So, while JSON Mode marked a significant step forward, it still left developers wrestling with a fundamental gap: valid syntax isn't the same as reliable structure. Having valid JSON was valuable, but having predictably structured data was essential.
Structured Outputs: The final solution?
This brings us to Structured Outputs, which represents a fundamental shift in working with model responses. Instead of just saying "return JSON," you provide an entire schema that defines exactly what shape that JSON should take.
Here's how it works: when you make a call to GPT-4o, you pass a JSON schema that defines the structure you expect:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "email_analysis",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "topics": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Main topics discussed in the email"
                    },
                    "sentiment": {
                        "type": "string",
                        "enum": ["positive", "negative", "neutral"],
                        "description": "Overall emotional tone of the email"
                    },
                    "main_action_item": {
                        "type": "string",
                        "description": "The primary action item of the email"
                    }
                },
                "required": ["topics", "sentiment", "main_action_item"],
                "additionalProperties": False
            }
        }
    }
)
And if you're using Pydantic or Zod, it's actually even easier:
from pydantic import BaseModel
from typing import List, Literal

class EmailAnalysis(BaseModel):
    topics: List[str]
    sentiment: Literal["positive", "negative", "neutral"]
    main_action_item: str

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[...],
    response_format=EmailAnalysis
)
The response is now guaranteed to match this exact structure:
{
  "topics": ["quarterly results", "revenue growth", "customer satisfaction"],
  "sentiment": "positive",
  "main_action_item": "Create slides to present quarterly revenue results by next week"
}
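And if you went the Pydantic route, the SDK's parse helper hands you a typed object directly, so you don't even need json.loads:

analysis = response.choices[0].message.parsed  # an EmailAnalysis instance
print(analysis.sentiment)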
This isn't just about formatting – it's about constraining what the model can output. Structured Outputs ensures:
Required fields must be present
Arrays must contain the specified type
Strings in the 'sentiment' field must be one of the enumerated values
Nested objects must follow their defined structure
For someone who regularly gets LLMs to spit out JSON, this was a pretty big deal.
Technical notes: Constrained Decoding
While I was impressed with Structured Outputs when it first launched, I was even more impressed when I learned how it was implemented behind the scenes. The magic is a technique called Constrained Decoding, which OpenAI discusses in its launch post. If you're interested in the technicals, I highly suggest reading the post.
Normally, with LLMs, the token generation process is "unconstrained," meaning the model can produce any sequence of tokens it wants and just chooses the one it thinks is most likely to come next ("autocomplete on steroids"), even if the token isn't "grammatically correct."
With constrained decoding, the model is limited to generating tokens that are valid for the given format. Imagine that GPT-4o has already generated {"name" as its first few tokens (or even just {, to keep things simple). Instead of allowing the model to pick any next token it likes, tokens that would create invalid JSON (another {, a stray newline, etc.) are hidden from the model as possible choices. At each step, the model is guided down only those paths that lead to valid output.
Behind the scenes, OpenAI is taking the JSON schema and creating what's known as a "context-free grammar" - a set of rules that defines the syntax of a given language. Just as it’s not valid in English to have a sentence with no verb, it is not valid in JSON to have double curly braces. The provided schema (and the syntax of JSON in general) creates the rules that bind the LLM's token generation to the language.
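To make the idea concrete, here's a toy sketch of the masking step (not OpenAI's actual implementation; assume the compiled grammar tells us which token IDs are legal next):

import torch

def sample_next_token(logits: torch.Tensor, valid_token_ids: list[int]) -> int:
    # Constrained decoding in miniature: push the probability of every token
    # the grammar forbids at this step to zero, then sample from what's left.
    mask = torch.full_like(logits, float("-inf"))
    mask[valid_token_ids] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())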
As a result, the first time you use Structured Outputs with a new schema, there's a bit of extra latency while the schema is compiled into this grammar (the rules are cached for subsequent requests). It's a marvel of engineering, both from the API design and machine learning sides.
In practice
OpenAI has a number of interesting examples of Structured Outputs in action.
Getting the model to think step by step:
class Step(BaseModel):
    explanation: str
    output: str

class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str
prompt = "You are a helpful math tutor. Guide the user through the solution step by step."
Extracting important data from research papers:
class ResearchPaperExtraction(BaseModel):
    title: str
    authors: list[str]
    abstract: str
    keywords: list[str]
prompt = """
You are an expert at structured data extraction.
You will be given unstructured text from a research paper and should convert it into the given structure.
"""
Moderating user content for violent or offensive messages:
from enum import Enum
from typing import Optional

class Category(str, Enum):
    violence = "violence"
    sexual = "sexual"
    self_harm = "self_harm"

class ContentCompliance(BaseModel):
    is_violating: bool
    category: Optional[Category]
    explanation_if_violating: Optional[str]
prompt = "Determine if the user input violates specific guidelines and explain if they do."
And dynamically generating UIs:
class UIType(str, Enum):
    div = "div"
    button = "button"
    header = "header"
    section = "section"
    field = "field"
    form = "form"

class Attribute(BaseModel):
    name: str
    value: str

class UI(BaseModel):
    type: UIType
    label: str
    children: List["UI"]
    attributes: List[Attribute]

class Response(BaseModel):
    ui: UI
prompt = "You are a UI generator AI. Convert the user input into a UI."
From begging to constraining
While Structured Outputs is an incredibly cool feature, I've also been interested in what it represents more broadly about building with LLMs. We've gone from prompt engineering to JSON mode to Structured Outputs in just over a year.
Prompt engineering relies on carefully crafted instructions, hoping the model follows them effectively. Structural guarantees, however, build constraints directly into the system. This allows the model's "intelligence" to shine through while improving the system's robustness.
Of course, Structured Outputs isn't perfect. There's an initial latency penalty for processing new schemas. The model can still make mistakes within the values themselves. And it can be clunky to use the feature alongside function calling or predicted outputs.
However, it's important for developers to know that these features exist and how to use them effectively. Prompt engineering is a broadly useful skill with LLMs, one that I doubt will ever go away completely. Clearly, though, it can be brittle and unpredictable when trying to connect APIs together. As more features like Structured Outputs arrive, I'm hopeful that we can spend less time begging models to behave and more time building systems we can trust.