Visual understanding in large models is no longer a single, static glance. With Agentic Vision in Gemini 3 Flash, Google turns image understanding into an active process: the model thinks, acts, and observes, executing code to inspect and manipulate images step by step. Why does this matter to you as a developer or professional? Because it reduces guesswork, provides verifiable visual evidence, and improves accuracy on complex tasks.
What is Agentic Vision
Agentic Vision introduces a Think, Act, Observe loop that turns a visual task into an automatic investigation.
- Think: the model analyzes the query and the initial image to draft a multi-step plan.
- Act: it generates and runs Python to manipulate images (crop, rotate, annotate) or to analyze them (count, measure, compute).
- Observe: the transformed images are added to the context window; the model inspects them again with more information before giving the final answer.
This approach mixes visual reasoning with deterministic execution, lowering the chance the model will guess when fine details are missing.
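To make the shape of the loop concrete, here is a deliberately simplified sketch of how a Think, Act, Observe cycle could be orchestrated. Everything in it is illustrative: the callables stand in for model and sandbox calls, and none of the names come from Google's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Toy sketch of a Think-Act-Observe loop. The callables are stand-ins for
# model and sandbox calls; nothing here is the actual Gemini 3 Flash code.

@dataclass
class Step:
    done: bool              # True once the "model" is confident enough to answer
    answer: str = ""        # final answer when done
    tool_code: str = ""     # Python the model wants to run otherwise

def agentic_loop(
    think: Callable[[List[Any]], Step],      # stand-in for the model's planning pass
    run_code: Callable[[str], List[Any]],    # stand-in for the sandboxed executor
    question: str,
    image: Any,
    max_steps: int = 4,
) -> str:
    context: List[Any] = [question, image]   # prompt plus visual evidence gathered so far
    for _ in range(max_steps):
        step = think(context)                 # Think: plan against the current evidence
        if step.done:
            return step.answer
        artifacts = run_code(step.tool_code)  # Act: execute the generated Python
        context.extend(artifacts)             # Observe: new crops/values rejoin the context
    return think(context).answer              # answer with whatever evidence exists
```

Each iteration either answers or produces new artifacts that become part of the evidence for the next pass; the cap on steps is what keeps latency bounded.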
Key milestone: enabling code execution with Gemini 3 Flash delivers a consistent 5–10% quality improvement on most vision benchmarks.
How it works technically
The core idea is that the model doesn't stop at a single pass. When it senses uncertainty about a detail, it generates instructions as code that run in a controlled Python environment. That environment can:
- Crop and rescale regions of interest for a new look at higher resolution.
- Draw annotations and boxes on pixels to create a verifiable "visual scratchpad."
- Run deterministic calculations (for example, sum, normalize, plot data) with libraries like numpy or matplotlib.
The resulting images get appended to the model's context, enabling new inferences based on updated visual evidence. That reduces errors in multi-step tasks because the numeric or drawing work comes from real code execution rather than from language-model probabilities.
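If you orchestrate a similar loop yourself instead of relying on the built-in tool, "appending to context" simply means sending the derived image alongside the original in a follow-up request. A minimal sketch with the google-genai Python SDK follows; the model ID, file names, and prompt are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Placeholder files: the original image and a crop produced in an earlier step.
with open("plan_full.png", "rb") as f:
    full_png = f.read()
with open("plan_crop.png", "rb") as f:
    crop_png = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # placeholder ID; check the current model catalog
    contents=[
        types.Part.from_bytes(data=full_png, mime_type="image/png"),
        types.Part.from_bytes(data=crop_png, mime_type="image/png"),
        "The second image is a crop of the first. Re-check your earlier reading "
        "of the highlighted label using both views.",
    ],
)
print(response.text)
```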
Relevant technical aspects:
- The Think, Act, Observe loop requires a context window capable of storing multiple transformed images.
- Code execution runs in a deterministic sandbox to reduce variability and help reproducibility.
- There's a tradeoff between accuracy and latency: iterative inspections improve quality but increase time and compute costs.
Agentic Vision in action: three clear cases
1. Zoom and iterative inspection
When the model spots a fine detail, it can crop and re-analyze the region. A real example: PlanCheckSolver.com improved its accuracy by 5% by letting Gemini 3 Flash generate Python to crop sections of architectural plans and analyze them iteratively. Those visual crops become part of the context and ground the final decision.
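The code behind this kind of zoom is usually a few lines of image manipulation. As a rough illustration (not the model's actual output), cropping a region of interest and upscaling it with Pillow might look like this; the file name and coordinates are made up:

```python
from PIL import Image

# Illustrative only: crop a region of interest and upscale it so fine details
# (a dimension label, a serial number) occupy more pixels on the next pass.
image = Image.open("floor_plan.png")  # placeholder input file

# Hypothetical region of interest in pixel coordinates (left, top, right, bottom).
box = (1200, 800, 1600, 1100)
crop = image.crop(box)

# Upscale 3x with a high-quality filter before re-inspecting the detail.
zoomed = crop.resize((crop.width * 3, crop.height * 3), Image.Resampling.LANCZOS)
zoomed.save("floor_plan_detail.png")
```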
2. Visual annotation as a "scratchpad"
Instead of only describing what it sees, the model can draw bounding boxes and labels on the image to verify counts or locations. That avoids counting errors (for example, fingers on a hand) because the final answer is based on annotated, verifiable pixels.
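A rough sketch of what that annotation step can look like, again with Pillow rather than whatever code the model actually emits; the coordinates and labels are invented:

```python
from PIL import Image, ImageDraw

# Illustrative "visual scratchpad": draw numbered boxes on the detections so the
# count can be checked against pixels instead of described from memory.
image = Image.open("hand.png").convert("RGB")  # placeholder input file
draw = ImageDraw.Draw(image)

# Hypothetical detections as (left, top, right, bottom) boxes, one per finger.
boxes = [(40, 20, 90, 160), (100, 10, 150, 160), (160, 15, 210, 160),
         (220, 30, 270, 160), (280, 60, 330, 170)]

for i, box in enumerate(boxes, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], box[1] - 14), str(i), fill="red")

image.save("hand_annotated.png")
print(f"Annotated {len(boxes)} regions")
```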
3. Visual math and visualization
Problems with dense tables or multi-step calculations often lead to hallucinations in LLMs. Gemini 3 Flash generates and runs code that normalizes data, performs calculations and creates charts with matplotlib. The result is reproducible: rather than trusting a probabilistic answer, you get a chart and numbers produced by deterministic code.
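A sketch of the deterministic half of that workflow, with invented numbers standing in for values read off a document:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, as a sandboxed environment would likely use
import matplotlib.pyplot as plt

# Hypothetical values transcribed from a table in the image.
labels = ["Q1", "Q2", "Q3", "Q4"]
revenue = np.array([120.0, 135.5, 98.2, 150.3])

# Deterministic math: totals and shares come from code, not token probabilities.
total = revenue.sum()
share = revenue / total

plt.bar(labels, share * 100)
plt.ylabel("Share of annual revenue (%)")
plt.title(f"Total revenue: {total:.1f}")
plt.savefig("revenue_share.png")

print({label: f"{pct:.1%}" for label, pct in zip(labels, share)})
```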
Practical considerations for developers
If you're going to integrate Agentic Vision into your products, keep in mind:
- Security and sandboxing: Python execution should be isolated to prevent unwanted file or network access (see the sketch after this list).
- Latency and cost: each Act/Observe cycle adds execution steps; measure impact on experience and billing.
- Tokens and context: attaching multiple crops increases context window usage; plan for limits and truncation strategy.
- Determinism: running code reduces reasoning randomness, but you should version dependencies and environments for reproducibility.
- Human oversight: in sensitive domains (health, legal, infrastructure) keep a human review loop.
- Prompt engineering: design prompts that make clear when the model should generate code on its own and when it should wait for an explicit instruction.
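One lightweight way to approximate the sandboxing, latency, and timeout points above is to run generated code in a separate process with hard limits. This is a sketch, not a production sandbox: real deployments typically rely on containers, gVisor, or a managed execution service, and the limits below are POSIX-only.

```python
import resource
import subprocess
import sys

def run_untrusted(code: str, timeout_s: int = 10, mem_bytes: int = 512 * 1024 * 1024) -> str:
    """Run model-generated Python in a child process with CPU, memory, and time caps.

    Sketch only: this bounds time and memory but does NOT block network or
    filesystem access; use containers or a managed sandbox for real isolation.
    """
    def limits() -> None:  # applied in the child before exec (POSIX only)
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))   # address space
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))  # CPU seconds

    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env and user site
        capture_output=True,
        text=True,
        timeout=timeout_s,                   # wall-clock cutoff for runaway executions
        preexec_fn=limits,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr

# Example: a harmless snippet the model might have generated.
print(run_untrusted("print(sum(range(10)))"))
```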
Integration and recommended practices
- Start by enabling Code Execution in Google AI Studio or Vertex AI and try the demo in the AI Studio Playground (a minimal API call is sketched after this list).
- Design pipelines that limit crops to high-value regions to control latency.
- Log every artifact (crops, generated scripts, outputs) for auditing and debugging.
- Implement timeouts and resource limits in the sandbox to avoid expensive or runaway executions.
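As a starting point, enabling the code execution tool through the Gemini API looks roughly like this with the google-genai Python SDK. The model ID and file name are placeholders, and field names can shift between SDK versions, so treat it as a sketch rather than production code.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

with open("table_scan.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # placeholder ID; check the current model catalog
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Total each column of this table and plot the shares as a bar chart.",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# Responses can interleave text, generated code, and execution results;
# logging all three gives you the audit trail recommended above.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```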
What's next and current limits
Google aims to make more behaviors implicit (for example, rotations or visual math without explicit nudges), add external tools (web search, reverse image search) and extend Agentic Vision to other model sizes. But it's not magic:
- There's still risk of errors if the original image is too poor or the generated plan is wrong.
- The balance between automation and human control is critical in high-risk applications.
Can you imagine automating inspections, visual audits or scientific analysis with pixel-by-pixel evidence? Agentic Vision opens that door, but practical implementation requires careful design.
Source
https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash
