Google has added Agentic Vision to its Gemini 3 Flash model to help it better understand images and make fewer mistakes in visual tasks.
Agentic Vision lets the model work like an active investigator. It follows a think-act-observe process to examine and modify images by running code.
This update helps prevent the AI from making guesses when image details are small or hard to see.
Main features of Agentic Vision
- Think, Act, Observe loop: The model first studies the question and the image (think). Next, it writes Python code to modify or analyze the image, such as cropping or adding annotations (act). Finally, it reviews the updated image for greater context before answering (observe).
- 5-10% quality boost: Running code with Agentic Vision makes Gemini 3 Flash 5-10% more accurate on most visual benchmarks.
- Visual scratch pad: The model can add notes or marks directly on images, so its analysis is grounded in actual image pixels.
- Reduced hallucination: Agentic Vision uses Python code to perform tasks such as counting small objects, reading distant text, or studying tables. This stops the model from making random guesses that lead to mistakes.
Main Uses
- Zooming and inspection: The model can zoom in on small or blurry details on its own.
- Visual Math and Plotting: Agentic Vision pulls data from tables in images, does the math, and makes charts rather than guessing the numbers.
- Interactive annotation: The model can draw boxes and labels to count items in busy images accurately.
Where to Find it
You can find Agentic Vision in:
- Google AI Studio: Developers can turn on code execution under “Tools” in the playground.
- Vertex AI: Available through the Gemini API.
- Gemini app: added under the Thinking Model option.
Future updates will make these features automatic and bring them to other Gemini models.
Frontier AI models like Gemini usually process the world in a single static glance. If they miss a small detail, such as a microchip’s serial number or a distant street sign, they have to guess.
Agentic Vision in Gemini 3 Flash changes image understanding from a static process to an active one. It treats vision as an investigation by combining visual reasoning with code execution. The model can plan to zoom in, inspect, and manipulate images step by step, grounding its answers in visual evidence.
Allowing code execution with Gemini 3 Flash gives a steady 5-10% quality boost on most vision benchmarks.
Agentic Vision: A New Frontier in AI Capability
Agentic Vision brings a think-act-observe loop to image understanding tasks.
- Think: the model examines the user’s question and the initial image, then generates a step-by-step plan.
- Act: The model writes and runs Python code to work with images, such as cropping, rotating, and adding annotations. It also analyzes images by running calculations or counting objects.
- Observe: The changed image is added to the model’s context. This helps the model review the new data with more context before giving a final answer.
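The loop above can be sketched in plain Python. This is an illustrative skeleton, not Google's implementation: the fixed "zoom into the top-left quadrant" plan stands in for whatever step the model actually decides on, and the function names are hypothetical.

```python
from PIL import Image

def think(question, image):
    # Plan: decide which region of the image needs closer inspection.
    # Hypothetical fixed plan for illustration: zoom into the top-left quadrant.
    w, h = image.size
    return ("crop", (0, 0, w // 2, h // 2))

def act(image, step):
    # Execute the planned step with ordinary image-manipulation code.
    op, box = step
    if op == "crop":
        return image.crop(box)
    raise ValueError(f"unknown op: {op}")

def observe(context, new_image):
    # Feed the transformed image back into the model's working context.
    context.append(new_image)
    return context

# One pass of the loop on a synthetic 100x80 image.
img = Image.new("RGB", (100, 80), "white")
context = [img]
step = think("What does the sign say?", img)
zoomed = act(img, step)
context = observe(context, zoomed)
print(zoomed.size)  # (50, 40)
```

In the real system this loop can repeat: each observed image enlarges the context the model reasons over before it commits to an answer.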
Agentic Vision in Action
When you enable code execution in the API, you open up a range of new possibilities. Our demo app in Google AI Studio shows many of these in action. Developers, from large companies to small startups, are already using this feature for a variety of use cases, such as:
- Zooming and Inspecting
Gemini 3 Flash automatically zooms in on small, detailed features. Planchecksolver.com, an AI tool for checking building plans, increased its accuracy by 5% after enabling code execution with Gemini 3 Flash. This allowed the platform to inspect high-definition images in a step-by-step fashion. In a video of the backend logs, you can see Gemini 3 Flash generate Python code to crop and analyze specific areas, such as roof edges or building sections, into new images. By adding these cropped images back into its context, the model can visually check its reasoning and confirm that plans meet complex building codes.
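The cropping step described above is ordinary Python imaging code. The sketch below shows the kind of snippet the model generates, using a synthetic white image as a stand-in for a blueprint scan; the region-of-interest coordinates are made up for illustration.

```python
from PIL import Image

# Hypothetical blueprint scan (synthetic 1000x800 image for illustration).
blueprint = Image.new("RGB", (1000, 800), "white")

# Crop a region of interest (e.g. a roof edge) and upscale it
# so fine details become legible before re-inspection.
roi = (600, 100, 900, 300)  # left, upper, right, lower
detail = blueprint.crop(roi)
detail = detail.resize((detail.width * 4, detail.height * 4), Image.LANCZOS)
print(detail.size)  # (1200, 800)
```

Feeding `detail` back into the model's context is what lets it verify fine print it could not resolve in the full-frame view.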
- Image Annotation
With Agentic Vision, the model can interact with its environment by adding notes or drawings to images rather than only describing what it sees. Gemini 3 Flash can run code to draw directly on the image, helping to show its reasoning.
In the example below, the model is asked to count the fingers on a hand. In the Gemini app, to avoid mistakes, it uses Python to draw boxes and numbers over each finger it finds. This visual scratch pad helps ensure the answer is accurate down to the pixel.
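A visual scratch pad of this kind is simple to express in code. The sketch below, assuming hypothetical detection boxes on a synthetic image in place of the real hand photo, draws a numbered box over each finger so the final count is tied to pixels rather than a one-shot guess.

```python
from PIL import Image, ImageDraw

# Synthetic stand-in for the hand photo; the boxes are hypothetical detections.
img = Image.new("RGB", (400, 300), "white")
draw = ImageDraw.Draw(img)

finger_boxes = [(40, 50, 80, 200), (100, 30, 140, 200),
                (160, 20, 200, 200), (220, 30, 260, 200),
                (280, 60, 320, 200)]

# Draw a numbered red box over each detection so the count is pixel-grounded.
for i, box in enumerate(finger_boxes, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], box[1] - 14), str(i), fill="red")

print(len(finger_boxes))  # 5
```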
- Visual Math and Plotting
Agentic Vision can read complex tables and use Python code to create visualizations of the results.
Standard language models can make mistakes when doing multi-step visual math. Gemini 3 Flash avoids this by using a reliable Python environment for calculations. In the example below, from our demo app in Google AI Studio, the model finds the raw data, writes code, sets the previous SOTA to 1.0, and creates a matplotlib bar chart. This way, the results are based on real execution, not guesses.
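The normalization step described above reduces to a few lines of Python. This is a minimal sketch with made-up benchmark numbers standing in for values read off a table image; the model-generated code in the demo follows the same shape.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical benchmark scores extracted from a table image.
scores = {"Prev SOTA": 62.0, "Model A": 68.2, "Model B": 71.3}

# Normalize so the previous SOTA sits at 1.0, as in the demo.
baseline = scores["Prev SOTA"]
normalized = {name: round(s / baseline, 3) for name, s in scores.items()}

fig, ax = plt.subplots()
ax.bar(list(normalized.keys()), list(normalized.values()))
ax.axhline(1.0, linestyle="--", color="gray")  # baseline reference line
ax.set_ylabel("Score relative to previous SOTA")
fig.savefig("benchmark.png")

print(normalized["Prev SOTA"])  # 1.0
```

Because the division and the chart come from actual execution, any arithmetic slip would surface as a wrong bar rather than a silently hallucinated number.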
What’s Next?
We are only at the beginning with Agentic Vision.
- More implicit code-driven behaviors: Right now, Gemini 3 Flash excels at automatically zooming in on small details. Other features, like rotating images or doing visual math, still need a clear prompt to work. We are working to make these actions automatic in future updates.
- More Tools: We are also exploring ways to give Gemini models more tools, such as web search and reverse image search, to help them better understand the world.
- More model sizes: We also plan to bring this feature to more of our models, not just Flash.










