Google updates Gemini models: lower price and more speed


Google has updated its Gemini 1.5 models with improvements aimed at production use. If you work with large models or plan to integrate them into a product, this changes cost and speed in a tangible way.

What changes in Gemini 1.5

The most important points are direct and practical:

  • New models: Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002.
  • Significant price reduction on 1.5 Pro for prompts under 128K tokens, effective October 1, 2024.
  • Increases in paid rate limits: 1.5 Flash rises to 2000 RPM and 1.5 Pro to 1000 RPM.
  • Performance improvements: roughly 2x faster output and up to 3x lower latency.

These changes were announced by Google on their developer blog. (deepmind.google)

Key technical details

  • Quality and accuracy: noticeable gains, especially in math and long-context tasks, with improvements in benchmarks like MMLU-Pro and MATH.

  • Massive context window: Gemini 1.5 Pro keeps a context window of 2 million tokens, useful for processing long documents or large repositories.

  • Reduction in default verbosity: responses are typically 5–20% shorter, designed to save cost in extraction or summarization tasks.
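That 2-million-token window raises a practical question: will your document actually fit? A minimal pre-check sketch, assuming the common rough heuristic of ~4 characters per token for English text (this is not Gemini's real tokenizer; the API's token-counting endpoint gives exact figures):

```python
def fits_in_context(text: str,
                    context_tokens: int = 2_000_000,
                    chars_per_token: float = 4.0) -> bool:
    """Rough client-side pre-check: estimate tokens from character count
    (~4 chars/token is a heuristic for English, not the real tokenizer)
    and compare against the model's context window."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_tokens

# A hypothetical 1,000-page PDF at ~3,000 characters per page:
doc = "x" * (1_000 * 3_000)
print(fits_in_context(doc))  # ~750K estimated tokens, well under 2M
```

For production use, replace the heuristic with an exact token count from the API before committing to a single-prompt design.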

All of these points are documented in Google's official note. (deepmind.google)

Impact for developers and companies

And what does this mean for you or your team? Basically three things:

  1. Lower cost per token on 1.5 Pro helps when you process large volumes of text or multimodal loads — for example, indexing 1,000-page PDFs or analyzing long videos. (deepmind.google)

  2. More speed and higher query rates (RPM) let you scale apps that need fast responses, like real-time customer support or high-throughput generation pipelines.

  3. Shorter default responses reduce spend for extraction-focused use cases, but if you need longer answers you can adjust prompting to increase verbosity. (deepmind.google)
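The new RPM ceilings are enforced server-side, but a client-side throttle keeps you from burning quota on rejected calls. A minimal sketch of a sliding-window limiter, assuming the 1000 RPM figure cited above for 1.5 Pro (the `RateLimiter` class itself is illustrative, not an official SDK feature):

```python
import time
from collections import deque

class RateLimiter:
    """Client-side throttle to stay under a requests-per-minute cap,
    e.g. the 1000 RPM paid limit cited for Gemini 1.5 Pro."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.rpm = rpm
        self.clock = clock        # injectable for testing
        self.sleep = sleep
        self.sent = deque()       # timestamps of requests in the last 60 s

    def _prune(self, now: float) -> None:
        # Drop timestamps that fell out of the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()

    def acquire(self) -> None:
        """Block until a request slot is free, then record the request."""
        now = self.clock()
        self._prune(now)
        if len(self.sent) >= self.rpm:
            # Wait until the oldest request ages out of the window.
            self.sleep(60 - (now - self.sent[0]))
            now = self.clock()
            self._prune(now)
        self.sent.append(now)

# Usage: call limiter.acquire() before each API request.
limiter = RateLimiter(rpm=1000)
```

For multi-process deployments you would need a shared counter (e.g. in Redis) instead of this in-process deque, but the windowing logic is the same.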

Practical recommendations

  • If you already use Gemini 1.5 Pro, test the new version in a staging environment before October 1, 2024, to measure cost and latency against your real workload.

  • Use context caching when you can: combined with the price cuts, it reduces the cost of repeated prompts.

  • Adjust safety filters to your use case. In these versions filters are not applied by default, so you control the safety settings in production. (deepmind.google)

Why does it matter now?

Because these improvements make large models more accessible for real products, not just demos. Lower price, more speed, and higher limits reduce friction for startups and teams that want to add multimodal AI to everyday services.

If you've ever felt that using advanced models was too expensive or too slow, this lowers those barriers. Can you imagine automating long reports or analyzing video without blowing your budget? It's now much closer.
