Open Responses: an open inference standard for agents
Open Responses arrives as an open inference standard designed specifically for the era of autonomous agents. Initiated by OpenAI and developed with the open source community and the Hugging Face ecosystem, it aims to overcome the limitations of traditional chat formats.
What is Open Responses
Open Responses is an open specification based on the Responses API (launched in 2025) that unifies how models generate text, images, structured JSON and videos, and how they execute agentic loops that call tools and return final results.
Why does it matter? Because most of the ecosystem still uses formats built for turn-by-turn conversations, and those don't fit well when systems need to reason, plan and act over long time horizons. Open Responses proposes a standardized, extensible format better suited for those flows.
Design and key technical changes
Stateless by default: the standard assumes independent requests, with support for encrypted reasoning when the provider requires it.
Standardized configuration parameters: this makes interoperability between models and providers easier.
Streaming defined as semantic events: the stream carries more than text deltas; it describes typed events such as reasoning steps, tool calls and state changes.
Extensible: allows provider-specific parameters to cover practical differences without breaking compatibility (see the sketch right after this list).
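To make the last two points concrete, here is a minimal sketch of what a request payload could look like. The sampling fields mirror the Responses API; the provider_options block and its contents are hypothetical and only show where a provider-specific extension could live without breaking the base schema.
# Illustrative request payload, written as a Python dict mirroring the JSON wire format.
request_payload = {
    "model": "moonshotai/Kimi-K2-Thinking:nebius",
    "input": "Summarize the attached incident report",
    "temperature": 0.2,            # standardized sampling parameter
    "max_output_tokens": 1024,     # standardized output cap
    "provider_options": {          # hypothetical provider-specific extension
        "nebius": {"priority": "low_latency"},
    },
}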
Streaming and reasoning
Open Responses replaces "reasoning_text" with extensible reasoning chunks. Streaming is no longer just sending text fragments: events are transmitted with a type, a sequence number and metadata, which improves observability and reproducibility.
Example request via a proxy (curl):
curl https://evalstate-openresponses.hf.space/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $HF_TOKEN" \
-H "OpenResponses-Version: latest" \
-N \
-d '{
"model": "moonshotai/Kimi-K2-Thinking:nebius",
"input": "explain the theory of life"
}'
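On the client side, the same request can be consumed as a stream of semantic events. The sketch below is assumption-laden: it presumes the proxy emits server-sent events whose data: lines carry JSON objects with a type field, and that streaming is requested with a "stream": true flag as in the Responses API; the event names follow the response.reasoning.* naming used later in this article and are not an exhaustive list.
import json
import os
import requests

resp = requests.post(
    "https://evalstate-openresponses.hf.space/v1/responses",
    headers={
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "OpenResponses-Version": "latest",
    },
    json={
        "model": "moonshotai/Kimi-K2-Thinking:nebius",
        "input": "explain the theory of life",
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    # Server-sent events arrive as lines of the form "data: {...}".
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        break
    event = json.loads(payload)
    etype = event.get("type", "")
    if etype.startswith("response.reasoning"):
        print("[reasoning]", event)            # reasoning chunk
    elif etype.endswith("output_text.delta"):
        print(event.get("delta", ""), end="")  # visible text delta
    else:
        print("[event]", etype)                # tool calls, state changes, etc.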
Internal and external tools
Open Responses clearly distinguishes two categories of tools:
Internal: run inside the provider's infrastructure (for example, file searches or Drive integrations). The provider executes the call and feeds the result to the model without the developer running it manually.
External: functions or servers hosted outside the provider, executed by the client or external infrastructures.
This separation lets providers optimize internal loops and lets routers orchestrate calls across different upstreams.
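As an illustration, a single request could declare one tool of each category. The type names below follow the Responses API convention (file_search for a hosted tool, function for a client-executed one) and are assumptions here; send_email is a hypothetical external tool.
# Two tool categories in one request: the provider runs "file_search" itself,
# while "send_email" is declared as a function the client will execute.
tools = [
    {"type": "file_search"},  # internal: executed inside the provider's infrastructure
    {
        "type": "function",   # external: executed by the client
        "name": "send_email",
        "description": "Send an email to a list of recipients",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "array", "items": {"type": "string"}},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    },
]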
Standardized agent loop
The specification formalizes the loop: model sampling, emission of tool calls, tool execution (internal or external), feeding results back to the model, and repeating until the task is complete. For control, the client can use parameters like max_tool_calls and tool_choice:
{
"model": "zai-org/GLM-4.7",
"input": "Find Q3 sales data and email a summary to the team",
"tools": [...],
"max_tool_calls": 5,
"tool_choice": "auto"
}
The response includes all intermediate items (tool calls, tool results and reasoning steps), which makes auditing and debugging easier.
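A minimal sketch of the client side of that loop is shown below, under a few assumptions: intermediate items arrive in an output list, external tool results are fed back as function_call_output items in the next request's input, and the item and field names mirror the Responses API rather than being quoted from the Open Responses text. The send_email tool is hypothetical.
import json
import os
import requests

URL = "https://evalstate-openresponses.hf.space/v1/responses"
HEADERS = {
    "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
    "OpenResponses-Version": "latest",
}
TOOLS = [{
    "type": "function",
    "name": "send_email",
    "description": "Email a text summary to the team",
    "parameters": {
        "type": "object",
        "properties": {"body": {"type": "string"}},
        "required": ["body"],
    },
}]

def send_email(body: str) -> str:
    # Stand-in for a real integration; returns a JSON string as the tool result.
    return json.dumps({"status": "sent"})

items = [{"role": "user", "content": "Find Q3 sales data and email a summary to the team"}]
while True:
    response = requests.post(URL, headers=HEADERS, json={
        "model": "zai-org/GLM-4.7",
        "input": items,
        "tools": TOOLS,
        "max_tool_calls": 5,
        "tool_choice": "auto",
    }).json()

    output = response.get("output", [])
    items.extend(output)  # keep every intermediate item as an audit trail
    calls = [o for o in output if o.get("type") == "function_call"]
    if not calls:
        break             # no pending tool calls: the task is complete

    for call in calls:
        args = json.loads(call["arguments"])
        items.append({
            "type": "function_call_output",
            "call_id": call["call_id"],
            "output": send_email(**args),
        })
Because the loop keeps every intermediate item, the final items list doubles as the audit trail of reasoning steps, tool calls and tool results mentioned above.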
Migration: clients, providers and routers
Clients: if you already support the Responses API, migrating to Open Responses is usually low-effort. The main changes are adopting reasoning chunks and handling richer state payloads (for example, states specific to code interpreters).
Model providers: if you already comply with Responses, the transition is straightforward; you mainly need to support the new semantic events and the standardized parameters.
Routers (intermediaries): can normalize endpoints and expose provider-specific configuration options for customization without breaking compatibility.
Over time, useful features will tend to be standardized in the base specification, reducing the fragmentation that exists today from undocumented extensions.
Implications for security, observability and operations
Improved observability: semantic events and intermediate states allow you to trace reasoning steps and tool calls.
Privacy and encryption: the stateless mode and support for encrypted reasoning help meet regulatory requirements for some providers.
Cost and latency control: routers can limit max_tool_calls and choose providers based on latency or cost, which is essential in production (a sketch follows this list).
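As a purely hypothetical sketch of that last point, a router could clamp max_tool_calls and pick an upstream by advertised cost before forwarding the request; the upstream names and prices below are invented for illustration.
# Hypothetical router-side policy: cap tool calls and route to the cheapest upstream.
UPSTREAM_COST = {"provider-a": 0.8, "provider-b": 1.2}  # illustrative price per 1M tokens

def apply_policy(request: dict, max_calls_cap: int = 3) -> tuple[str, dict]:
    request = dict(request)  # don't mutate the caller's payload
    request["max_tool_calls"] = min(request.get("max_tool_calls", max_calls_cap), max_calls_cap)
    upstream = min(UPSTREAM_COST, key=UPSTREAM_COST.get)
    return upstream, request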
From my experience orchestrating search-and-summarize pipelines, having all the steps (search, summarization, drafting) inside a single request reduces infrastructure complexity and makes traceability much easier. Open Responses aims precisely at that kind of flow.
Practical recommendations for developers
Start by trying the early access version on Hugging Face to validate your current stack.
Adapt clients to consume response.reasoning.* events and store intermediate items for auditing.
Define limits (max_tool_calls) and tool-selection policies before you put agents into production.
Evaluate encryption needs if you work with sensitive data and prioritize providers that offer encrypted reasoning.
Open Responses is not just another API: it's an attempt to align inference interfaces with the real needs of autonomous agents and with interoperability across providers. If you work with agents, it's worth exploring the specification and planning the migration.