Trustworthy AI agents: practices, risks, and governance

The arrival of AI agents is no longer a futuristic promise; it's part of everyday practice. These systems already do more than answer questions: they run code, handle files, and complete workflows that cross multiple apps. What does that mean for security, human control, and regulation? Here I explain how agents work, what safeguards are effective, and what still needs to be built in the ecosystem.

Cómo funcionan los agentes y por qué importan

An agent is an AI model that directs its own processes and decides which tools to use to achieve a goal. It doesn't follow a fixed script; it operates in a loop of self-direction: it plans, acts, observes the result, adjusts, and repeats until the task is done or it asks for human guidance.

A concrete example: ask Claude in Claude Cowork to manage receipts from a business trip. The agent can transcribe photos, extract amounts, categorize expenses, and submit the report. If something goes wrong, say a charge that exceeds a limit, the agent can detect the uncertainty and ask for permission to access the expense policy in your drive before proceeding. That ability to reason about its own plan is what makes agents useful, but it also introduces new attack vectors.

Cómo funcionan los agentes y por qué importan

Principios de confianza aplicados en la práctica

Diseñar para control humano

Ayudar al agente a entender metas y límites

Defenderse de ataques como prompt injection

Qué puede aportar el ecosistema: benchmarks, estándares y evidencia

Recomendaciones técnicas para equipos que implementan agentes

Reflexión final

Fuente original

Stay up to date!

Trustworthy AI agents: practices, risks, and governance