FunctionGemma brings function calls to the edge

Dec 18, 2025 · 4 minutes

FunctionGemma arrives as a concrete bet: not just to chat, but to execute. Google introduces a specialized version of Gemma 3 270M optimized to translate natural language into executable function calls right on the device. The goal? Fast, private local agents that can act without relying on the cloud.

What FunctionGemma is and why it matters

FunctionGemma is a variant of Gemma 3 270M fine-tuned for function calling. It’s meant as a foundation for training local agents that turn instructions into structured API calls, or that operate independently for offline tasks. It can also work as a traffic controller: handling common actions at the edge and delegating complex work to larger models like Gemma 3 27B.

Why does this change the game? Because many flows aren’t just conversations anymore. Modern assistants need to automate sequences, touch operating-system APIs, and do it with minimal latency and full privacy. That only happens with models small enough to run on device and specialized enough to be reliable.

Key features

Unified action and chat: FunctionGemma generates structured calls to tools and then reports results back in natural language. It can switch between machine-to-machine protocols and human-facing responses without losing context.
Built for customization: FunctionGemma is designed to be fine-tuned rather than only prompted. In the Mobile Actions evaluation, fine-tuning raised accuracy from 58% to 85%, which suggests that for edge agents a dedicated, trained model is an efficient route to consistent, repeatable behavior.
Optimized for the edge: It fits on devices like the NVIDIA Jetson Nano and mobile phones. It uses Gemma’s tokenization with a 256k vocabulary, which makes tokenizing JSON and multilingual inputs more efficient. Shorter sequence lengths mean lower latency and less memory use.
Broad ecosystem: It supports training frameworks like Hugging Face Transformers, Unsloth, Keras and NVIDIA NeMo. For inference and deployment you can use LiteRT-LM, vLLM, MLX, llama.cpp, Ollama, Vertex AI or LM Studio.
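
The action-then-chat round trip described under "Unified action and chat" can be sketched roughly as follows. This is an illustration only, not FunctionGemma's actual output format: the JSON call shape, the set_flashlight tool and the dispatch helper are all hypothetical, and the model output is a hard-coded string standing in for real generation.

```python
import json

# Hypothetical tool registry: names and signatures are illustrative only.
def set_flashlight(on: bool) -> dict:
    return {"flashlight": "on" if on else "off"}

TOOLS = {"set_flashlight": set_flashlight}

def dispatch(model_output: str) -> dict:
    """Parse a structured call (assumed here to be JSON) and run the matching tool."""
    call = json.loads(model_output)
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    result = TOOLS[name](**call.get("arguments", {}))
    # In a real loop this result would be handed back to the model,
    # which then phrases a natural-language reply for the user.
    return {"tool": name, "result": result}

# Stand-in for what a model might emit for "turn on the flashlight".
fake_model_output = '{"name": "set_flashlight", "arguments": {"on": true}}'
tool_message = dispatch(fake_model_output)
print(tool_message)
```

Rejecting unknown tool names before executing anything is a cheap guard that matters more when the calls touch real device APIs.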

Important note: thinking local-first isn’t just about performance. It’s also about privacy. If your flow handles sensitive data, running logic on device avoids sending context to external servers.

Use cases and when to choose it

Is FunctionGemma useful for you? Probably yes if:

You have a defined API surface: smart home controls, a media player, navigation, or system tools.
You’re willing to fine-tune: you want consistent, repeatable behavior with less variability than zero-shot.
You prioritize local-first: you need low latency with no network round trip, and you want user data to stay on the device.
You’re building composite systems: a lightweight model on the edge handles frequent, fast cases, and a large cloud model resolves rare or complex ones.
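
One way to picture the composite setup in the last point is a small router that keeps a request on the device only when the local model proposes a known tool with enough confidence. Everything here is an assumption made for illustration: the tool names, the 0.8 threshold, and the idea that a confidence score is available (for example, derived from token log-probabilities).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of tools the on-device model is trusted to call.
LOCAL_TOOLS = {"set_flashlight", "create_event", "add_contact"}
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; tune it on your own data

@dataclass
class LocalResult:
    tool: Optional[str]  # None when the local model proposed no tool call
    confidence: float    # assumed score in [0, 1]

def route(result: LocalResult) -> str:
    """Return "local" to execute on device, or "cloud" to escalate."""
    if result.tool in LOCAL_TOOLS and result.confidence >= CONFIDENCE_THRESHOLD:
        return "local"
    return "cloud"

print(route(LocalResult("set_flashlight", 0.95)))  # known tool, high confidence
print(route(LocalResult("book_flight", 0.99)))     # tool outside the local surface
```

Keeping the escalation rule in one explicit function makes it easy to log and audit which requests leave the device.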

Concrete examples: creating calendar events offline, adding contacts, turning on the flashlight, or handling game mechanics in TinyGarden where an instruction like "plant sunflowers in the top row and water them" is broken down into plantCrop and waterCrop with specific coordinates.
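
To make the TinyGarden decomposition concrete, here is a rough sketch of how a game might apply such a sequence of calls. Only the function names plantCrop and waterCrop come from the example above; the argument names, the three-column grid and the apply_calls helper are hypothetical, and the call list is written by hand as a stand-in for model output.

```python
GRID_COLS = 3  # hypothetical grid width

# Hand-written stand-in for the calls a model might emit for
# "plant sunflowers in the top row and water them".
calls = (
    [{"name": "plantCrop", "arguments": {"crop": "sunflower", "row": 0, "col": c}}
     for c in range(GRID_COLS)]
    + [{"name": "waterCrop", "arguments": {"row": 0, "col": c}}
       for c in range(GRID_COLS)]
)

def apply_calls(calls):
    """Apply the calls in order to an empty garden and return its state."""
    garden = {}
    for call in calls:
        args = call["arguments"]
        cell = (args["row"], args["col"])
        if call["name"] == "plantCrop":
            garden[cell] = {"crop": args["crop"], "watered": False}
        elif call["name"] == "waterCrop":
            if cell not in garden:
                raise ValueError(f"nothing planted at {cell}")
            garden[cell]["watered"] = True
        else:
            raise ValueError(f"unknown function: {call['name']!r}")
    return garden

garden = apply_calls(calls)
print(garden)
```

Order matters in this sketch: watering an empty cell is treated as an error, which is one reason multi-step instructions need reliable sequencing.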

Demos and hands-on experience

Google shows three useful demos to understand the flow:

Mobile Actions fine-tuning: a demo of a 100% offline assistant that executes operating-system commands on the device.
TinyGarden: a game where the model controls mechanics by voice on the phone without sending data to the cloud.
FunctionGemma Physics Playground: physics puzzles that run locally in the browser with Transformers.js.

These demos aren’t decorative. They show that a 270M-parameter model can handle multi-turn logic and concrete actions in real time.

How to try FunctionGemma today

Download: the model is available on Hugging Face and Kaggle.
Learn: there are guides on function calling templates, how to sequence responses, and how to fine-tune.
Explore: the Google AI Edge Gallery includes the interactive demos.
Build: there’s a Colab and a dataset to train your own Mobile Actions agent.
Deploy: ship models to mobile with LiteRT-LM, or pair them with larger models on Vertex AI or on NVIDIA hardware like RTX PRO and DGX Spark.

Technical considerations for developers

Fine-tuning vs prompt engineering: for deterministic tasks at the edge, fine-tuning usually wins. Zero-shot variation can be unacceptable in systems that touch user APIs.
Tokenization and efficiency: the 256k vocabulary helps tokenize JSON efficiently and reduce sequence length. Fewer tokens mean lower latency and less memory use on constrained devices.
Hybrid orchestration: design a hierarchy where FunctionGemma handles local, frequent cases. Define clear criteria for escalating requests to a large model. This reduces cost and keeps response times low.
Tools and runtimes: benchmark each candidate runtime on your target device, measuring inference latency and power consumption. vLLM and LiteRT-LM are production options; llama.cpp and Ollama work well for prototypes and local tests.
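
For the latency half of that measurement, a small timing harness is usually enough to compare runtimes on the same device. The sketch below assumes you can wrap each runtime in a plain Python callable; fake_generate is a placeholder, not a real inference call, and power consumption needs platform-specific tooling that is not covered here.

```python
import math
import statistics
import time

def measure_latency(generate, prompt, warmup=2, runs=10):
    """Time repeated calls to generate(prompt) and summarize in milliseconds."""
    for _ in range(warmup):
        generate(prompt)  # warm-up calls are not timed
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    p95_index = max(0, math.ceil(0.95 * len(samples_ms)) - 1)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }

def fake_generate(prompt):
    # Placeholder for a real runtime call; sleeps to simulate work.
    time.sleep(0.001)
    return "ok"

stats = measure_latency(fake_generate, "turn on the flashlight", warmup=1, runs=5)
print(stats)
```

Reporting a tail percentile next to the median helps here, because occasional slow responses are what users of an on-device assistant tend to notice.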

Final thought

FunctionGemma confirms a clear direction: models aren’t just for chat anymore. They’re intermediaries between your language and the world of APIs. If you’re building products that need speed, privacy and deterministic actions, this kind of model enables real local deployments. Ready to turn conversations into real on-device actions?

Original source

https://blog.google/technology/developers/functiongemma
