Visual understanding in large models is no longer a single, static glance. With Agentic Vision in Gemini 3 Flash, Google turns image understanding into an active process: the model thinks, acts, and observes, executing code to inspect and manipulate images step by step. Why does this matter to you as a developer or professional? Because it reduces guesswork, provides verifiable visual evidence, and improves accuracy on complex tasks.
What is Agentic Vision
Agentic Vision introduces a Think, Act, Observe loop that turns a visual task into an automatic investigation.
- Think: the model analyzes the query and the initial image to draft a multi-step plan.
- Act: it generates and runs Python to manipulate images (crop, rotate, annotate) or to analyze them (count, measure, compute).
- Observe: the transformed images are added to the context window; the model inspects them again with more information before giving the final answer.
This approach mixes visual reasoning with deterministic execution, lowering the chance the model will guess when fine details are missing.
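To make the shape of the loop concrete, here is a deliberately simplified sketch of how a Think, Act, Observe cycle could be orchestrated. Everything in it is illustrative: the callables stand in for model and sandbox calls, and none of the names come from Google's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Toy sketch of a Think-Act-Observe loop. The callables are stand-ins for
# model and sandbox calls; nothing here is the actual Gemini 3 Flash code.

@dataclass
class Step:
    done: bool              # True once the "model" is confident enough to answer
    answer: str = ""        # final answer when done
    tool_code: str = ""     # Python the model wants to run otherwise

def agentic_loop(
    think: Callable[[List[Any]], Step],      # stand-in for the model's planning pass
    run_code: Callable[[str], List[Any]],    # stand-in for the sandboxed executor
    question: str,
    image: Any,
    max_steps: int = 4,
) -> str:
    context: List[Any] = [question, image]   # prompt plus visual evidence gathered so far
    for _ in range(max_steps):
        step = think(context)                 # Think: plan against the current evidence
        if step.done:
            return step.answer
        artifacts = run_code(step.tool_code)  # Act: execute the generated Python
        context.extend(artifacts)             # Observe: new crops/values rejoin the context
    return think(context).answer              # answer with whatever evidence exists
```

Each iteration either answers or produces new artifacts that become part of the evidence for the next pass; the cap on steps is what keeps latency bounded.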
Key milestone: enabling code execution with Gemini 3 Flash delivers a consistent 5–10% quality improvement on most vision benchmarks.
How it works technically
The core idea is that the model doesn't stop at a single pass. When it senses uncertainty about a detail, it generates instructions as code that run in a controlled Python environment. That environment can:
- Crop and rescale regions of interest for a new look at higher resolution.
- Draw annotations and boxes on pixels to create a verifiable "visual scratchpad."
- Run deterministic calculations (for example, sum, normalize, plot data) with libraries like numpy or matplotlib.
The resulting images get appended to the model's context, enabling new inferences based on updated visual evidence. That reduces errors in multi-step tasks because the numeric or drawing work comes from real code execution rather than from language-model probabilities.
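If you orchestrate a similar loop yourself instead of relying on the built-in tool, "appending to context" simply means sending the derived image alongside the original in a follow-up request. A minimal sketch with the google-genai Python SDK follows; the model ID, file names, and prompt are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Placeholder files: the original image and a crop produced in an earlier step.
with open("plan_full.png", "rb") as f:
    full_png = f.read()
with open("plan_crop.png", "rb") as f:
    crop_png = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # placeholder ID; check the current model catalog
    contents=[
        types.Part.from_bytes(data=full_png, mime_type="image/png"),
        types.Part.from_bytes(data=crop_png, mime_type="image/png"),
        "The second image is a crop of the first. Re-check your earlier reading "
        "of the highlighted label using both views.",
    ],
)
print(response.text)
```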
Relevant technical aspects:
- The Think, Act, Observe loop requires a context window capable of storing multiple transformed images.
- Code execution runs in a deterministic sandbox to reduce variability and help reproducibility.
- There's a tradeoff between accuracy and latency: iterative inspections improve quality but increase time and compute costs.
Agentic Vision in action: three clear cases
1. Zoom and iterative inspection
When the model spots a fine detail, it can crop and re-analyze the region. A real example: PlanCheckSolver.com improved its accuracy by 5% by letting Gemini 3 Flash generate Python to crop sections of architectural plans and analyze them iteratively. Those visual crops become part of the context and ground the final decision.
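The code behind this kind of zoom is usually a few lines of image manipulation. As a rough illustration (not the model's actual output), cropping a region of interest and upscaling it with Pillow might look like this; the file name and coordinates are made up:

```python
from PIL import Image

# Illustrative only: crop a region of interest and upscale it so fine details
# (a dimension label, a serial number) occupy more pixels on the next pass.
image = Image.open("floor_plan.png")  # placeholder input file

# Hypothetical region of interest in pixel coordinates (left, top, right, bottom).
box = (1200, 800, 1600, 1100)
crop = image.crop(box)

# Upscale 3x with a high-quality filter before re-inspecting the detail.
zoomed = crop.resize((crop.width * 3, crop.height * 3), Image.Resampling.LANCZOS)
zoomed.save("floor_plan_detail.png")
```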
2. Visual annotation as a "scratchpad"
Instead of only describing what it sees, the model can draw bounding boxes and labels on the image to verify counts or locations. That avoids counting errors (for example, fingers on a hand) because the final answer is based on annotated, verifiable pixels.
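A rough sketch of what that annotation step can look like, again with Pillow rather than whatever code the model actually emits; the coordinates and labels are invented:

```python
from PIL import Image, ImageDraw

# Illustrative "visual scratchpad": draw numbered boxes on the detections so the
# count can be checked against pixels instead of described from memory.
image = Image.open("hand.png").convert("RGB")  # placeholder input file
draw = ImageDraw.Draw(image)

# Hypothetical detections as (left, top, right, bottom) boxes, one per finger.
boxes = [(40, 20, 90, 160), (100, 10, 150, 160), (160, 15, 210, 160),
         (220, 30, 270, 160), (280, 60, 330, 170)]

for i, box in enumerate(boxes, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], box[1] - 14), str(i), fill="red")

image.save("hand_annotated.png")
print(f"Annotated {len(boxes)} regions")
```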
3. Visual math and visualization
Problems with dense tables or multi-step calculations often lead to hallucinations in LLMs. Gemini 3 Flash generates and runs code that normalizes data, performs calculations and creates charts with matplotlib. The result is reproducible: rather than trusting a probabilistic answer, you get a chart and numbers produced by deterministic code.
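A sketch of the deterministic half of that workflow, with invented numbers standing in for values read off a document:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, as a sandboxed environment would likely use
import matplotlib.pyplot as plt

# Hypothetical values transcribed from a table in the image.
labels = ["Q1", "Q2", "Q3", "Q4"]
revenue = np.array([120.0, 135.5, 98.2, 150.3])

# Deterministic math: totals and shares come from code, not token probabilities.
total = revenue.sum()
share = revenue / total

plt.bar(labels, share * 100)
plt.ylabel("Share of annual revenue (%)")
plt.title(f"Total revenue: {total:.1f}")
plt.savefig("revenue_share.png")

print({label: f"{pct:.1%}" for label, pct in zip(labels, share)})
```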
Practical considerations for developers
If you're going to integrate Agentic Vision into your products, keep in mind:
- Security and sandboxing: Python execution should be isolated to prevent unwanted file or network access (see the sketch after this list).
- Latency and cost: each Act/Observe cycle adds execution steps; measure impact on experience and billing.
- Tokens and context: attaching multiple crops increases context window usage; plan for limits and truncation strategy.
- Determinism: running code reduces reasoning randomness, but you should version dependencies and environments for reproducibility.
- Human oversight: in sensitive domains (health, legal, infrastructure) keep a human review loop.
- Prompt engineering: design prompts that make clear when the model should generate code on its own and when it should wait for an explicit instruction.
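One lightweight way to approximate the sandboxing, latency, and timeout points above is to run generated code in a separate process with hard limits. This is a sketch, not a production sandbox: real deployments typically rely on containers, gVisor, or a managed execution service, and the limits below are POSIX-only.

```python
import resource
import subprocess
import sys

def run_untrusted(code: str, timeout_s: int = 10, mem_bytes: int = 512 * 1024 * 1024) -> str:
    """Run model-generated Python in a child process with CPU, memory, and time caps.

    Sketch only: this bounds time and memory but does NOT block network or
    filesystem access; use containers or a managed sandbox for real isolation.
    """
    def limits() -> None:  # applied in the child before exec (POSIX only)
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))   # address space
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))  # CPU seconds

    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env and user site
        capture_output=True,
        text=True,
        timeout=timeout_s,                   # wall-clock cutoff for runaway executions
        preexec_fn=limits,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr

# Example: a harmless snippet the model might have generated.
print(run_untrusted("print(sum(range(10)))"))
```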
Integration and recommended practices
- Start by enabling Code Execution in Google AI Studio or Vertex AI and try the demo in the AI Studio Playground (a minimal API call is sketched after this list).
- Design pipelines that limit crops to high-value regions to control latency.
- Log every artifact (crops, generated scripts, outputs) for auditing and debugging.
- Implement timeouts and resource limits in the sandbox to avoid expensive or runaway executions.
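As a starting point, enabling the code execution tool through the Gemini API looks roughly like this with the google-genai Python SDK. The model ID and file name are placeholders, and field names can shift between SDK versions, so treat it as a sketch rather than production code.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

with open("table_scan.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # placeholder ID; check the current model catalog
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Total each column of this table and plot the shares as a bar chart.",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# Responses can interleave text, generated code, and execution results;
# logging all three gives you the audit trail recommended above.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```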
What's next and current limits
Google aims to make more behaviors implicit (for example, rotations or visual math without explicit nudges), add external tools (web search, reverse image search) and extend Agentic Vision to other model sizes. But it's not magic:
- There's still risk of errors if the original image is too poor or the generated plan is wrong.
- The balance between automation and human control is critical in high-risk applications.
Can you imagine automating inspections, visual audits or scientific analysis with pixel-by-pixel evidence? Agentic Vision opens that door, but practical implementation requires careful design.
Source
https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash
