As developers build more advanced AI applications, they regularly face situations where large amounts of context, such as long documents, detailed instructions, or entire codebases, must be sent to the model again and again. While this information helps models respond accurately, reprocessing the same data on every request increases costs and slows performance.  

Vertex AI context caching, launched by Google Cloud in 2024, addresses this. Since then, we’ve continued to improve it for greater speed and cost-effectiveness. Caching saves and reuses pre-computed input tokens. Benefits include:  

  • Significant cost reduction: cached tokens on supported Gemini 2.5 and newer models cost just 10% of the standard input token rate. Implicit caching applies this saving automatically when repeated inputs are recognized, while explicit caching lets you designate inputs for reuse, guaranteeing the discount for predictable savings.  
  • Reduced latency: caching serves previously computed input tokens instead of reprocessing them, speeding up responses.  

With these core benefits outlined, let’s now explore in greater detail how context caching operates and how you can take full advantage of it in your applications.  

What is Vertex AI Context Caching? 

Vertex AI context caching stores tokens for repeated content. There are two types of caching available:  

  • Implicit caching: enabled by default, saving money on cache hits without any API changes. Vertex AI reuses prior request states to serve future prompts, reducing costs. Implicit caches are cleared within 24 hours, with actual retention depending on system load and how often the content is reused.  
  • Explicit caching: gives you control over what to cache and lets you reference it as needed. You receive a guaranteed discount for explicit caching, allowing for predictable savings.  

To support different prompt sizes and use cases, caching can range from 2,048 tokens up to the entire model context window. For Gemini 2.5 Pro, that’s over a million tokens. You can cache any content supported by Gemini multimodal models, including text, PDFs, images, audio, and video. For example, you might cache a large block of text, a long document, an audio file, or a video. See the list of supported models here.  
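As a quick sanity check against the limits above, a helper like the following can flag whether a given token count even qualifies for caching (the exact context-window constant for Gemini 2.5 Pro is an assumption here; check the model documentation for the current value):

```python
# Minimum cacheable size from the text above; the context-window figure
# is an ASSUMED value for Gemini 2.5 Pro ("over a million tokens").
MIN_CACHE_TOKENS = 2_048
GEMINI_25_PRO_CONTEXT = 1_048_576

def is_cacheable(token_count: int) -> bool:
    """Content qualifies for caching if it meets the minimum size
    and still fits inside the model's context window."""
    return MIN_CACHE_TOKENS <= token_count <= GEMINI_25_PRO_CONTEXT

print(is_cacheable(500))      # a short system prompt is too small to cache
print(is_cacheable(250_000))  # a large PDF comfortably qualifies
```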

Both types of caching work with global and regional endpoints, so you get the benefits no matter how you use Gemini. Implicit caching is also integrated with provisioned throughput to support production-level traffic. For extra security and compliance, you can encrypt explicit caches using customer-managed encryption keys (CMEKs).  

Ideal Use Cases for Context Caching 

  • Large-scale document processing: cache lengthy contracts or regulatory documents to enable repeated queries for clauses or compliance checks.  

  • A financial analyst using Gemini can upload and cache documents, such as annual reports, to enable repeated querying or analysis without re-uploading the files each time. The explicit cache can be cleared manually, while the implicit cache clears automatically.  
  • For customer support chatbots, cache detailed persona instructions and comprehensive product information. Instead of reprocessing these for every new session, compute them once and let the chatbot reference them, delivering consistent, accurate, and relevant responses with faster response times and lower costs.  
  • For code use cases, you can cache your entire codebase or frequently used code files. This supports more efficient code Q&A, enables faster auto-complete, accelerates bug fixing, and improves the pace of feature development by reducing redundant processing.  
  • For enterprise knowledge bases, organizations can cache large collections of internal documentation, wikis, or manuals. This helps employees quickly access essential process details or compliance information without re-entering data, streamlining daily workflows.  
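The document-analysis scenario above can be sketched with the google-genai SDK for Vertex AI. This is a hedged outline, not a complete recipe: the project ID, location, bucket path, model name, and TTL below are placeholder assumptions you would replace with your own values.

```python
from google import genai
from google.genai import types

# Placeholder project and location; replace with your own.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Create an explicit cache holding a large document (hypothetical GCS path)
# plus the system instruction, stored for one hour (TTL).
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        display_name="annual-report-cache",
        system_instruction="You are a financial analyst reviewing this report.",
        contents=[
            types.Part.from_uri(
                file_uri="gs://my-bucket/annual-report.pdf",
                mime_type="application/pdf",
            )
        ],
        ttl="3600s",
    ),
)

# Each request that references the cache bills the cached tokens at the
# discounted rate instead of resending the whole document.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the year-over-year revenue trends.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
print(response.usage_metadata.cached_content_token_count)  # tokens served from cache

# Delete the cache early if it is no longer needed, which stops storage billing.
client.caches.delete(name=cache.name)
```

Because the cache object persists server-side, follow-up questions against the same report reuse it by name rather than re-uploading the file.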

Cost Implications for Implicit and Explicit Caching 

Implicit caching: enabled by default for all Google Cloud projects. When repeated content produces a cache hit, the discount is applied automatically; any remaining tokens are billed at the standard input rate, and there is no storage charge.  

Explicit caching:  

  • Cached token count: Creating a cached content object incurs a one-time fee at the standard input token rate. After that, each generateContent request that uses the cache bills the cached tokens at a 90% discount.  
  • Storage duration (TTL): You are also billed for the time the cached content is stored, based on its time-to-live (TTL), which determines how many hours the data remains available before it is deleted. The charge is an hourly rate per million tokens, prorated down to the minute.  
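The billing model above can be made concrete with a small sketch. The dollar rates below are hypothetical placeholders (the real per-million-token rates vary by model and region; see the Vertex AI pricing page); only the structure of the calculation follows the text.

```python
# Sketch of the explicit-caching cost model described above.
# All dollar rates are HYPOTHETICAL placeholders; see the Vertex AI
# pricing page for the rates that apply to your model and region.
INPUT_RATE_PER_M = 1.25         # $ per 1M input tokens, standard rate
CACHE_DISCOUNT = 0.90           # cached tokens billed at a 90% discount
STORAGE_RATE_PER_M_HOUR = 0.05  # $ per 1M cached tokens per storage hour

def explicit_cache_cost(cached_tokens: int, requests: int, ttl_minutes: int) -> float:
    """Total cost: one-time creation + discounted hits + prorated storage."""
    millions = cached_tokens / 1e6
    creation = millions * INPUT_RATE_PER_M                      # standard rate, once
    hits = requests * millions * INPUT_RATE_PER_M * (1 - CACHE_DISCOUNT)
    storage = millions * STORAGE_RATE_PER_M_HOUR * (ttl_minutes / 60)  # per minute
    return creation + hits + storage

def no_cache_cost(cached_tokens: int, requests: int) -> float:
    """Cost of resending the same tokens at the standard rate every request."""
    return requests * cached_tokens / 1e6 * INPUT_RATE_PER_M
```

For a 500,000-token document queried 20 times within a one-hour TTL, the cached path is a fraction of the uncached cost, since 19 of the 20 requests avoid the full input rate.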

Caching Best Practices and How to Optimize Cache Hit Rate 

  • Check the limitations. Make sure you meet the caching requirements, such as the minimum cache size and supported models.  
  • Granularity: Place repeated, cacheable context at the start of the prompt and variable content, such as the user’s question, at the end. Avoid caching small pieces that change frequently.  
  • Monitor usage and costs: Check Google Cloud billing to see caching’s impact. To see how many tokens were served from cache, check cachedContentTokenCount in the response’s usageMetadata.  
  • Frequency: Implicit caches are cleared within 24 hours, and may be cleared sooner. Sending requests that share the same prefix close together in time improves the chance of a cache hit.  

For explicit caching specifically:  

  • TTL management: Choose the time-to-live (TTL) setting carefully. A longer TTL means higher storage costs but fewer cache re-creations. Balance this based on how long your context stays useful and how often it is reused.  
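One way to frame the TTL trade-off is a break-even count: how many requests must reuse the cache within its TTL before explicit caching pays for itself. The rates below are hypothetical placeholders; the arithmetic just follows the pricing structure described earlier.

```python
import math

# Rough break-even sketch for explicit caching. All dollar rates are
# HYPOTHETICAL placeholders; consult the Vertex AI pricing page.
INPUT_RATE_PER_M = 1.25         # $ per 1M input tokens at the standard rate
CACHED_RATE_PER_M = 0.125       # cached tokens cost 10% of the standard rate
STORAGE_RATE_PER_M_HOUR = 0.05  # $ per 1M cached tokens per hour stored

def break_even_requests(cached_tokens: int, ttl_hours: float) -> int:
    """Minimum number of requests within the TTL for explicit caching
    (creation fee + storage) to beat resending the tokens every time."""
    millions = cached_tokens / 1e6
    fixed = millions * INPUT_RATE_PER_M + millions * STORAGE_RATE_PER_M_HOUR * ttl_hours
    saved_per_request = millions * (INPUT_RATE_PER_M - CACHED_RATE_PER_M)
    return math.ceil(fixed / saved_per_request)
```

A longer TTL raises the fixed storage cost and therefore the break-even count, which is exactly the balance the bullet above asks you to strike.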

Get Started 

Context caching can greatly improve the efficiency and cost-effectiveness of your AI applications. By using this feature, you can reduce redundant token processing, achieve faster response times, and build more scalable, cost-effective generative AI solutions.  

Implicit caching is turned on by default for all GCP projects, so you can start using it right away.  

To get started with explicit caching, check out our documentation, which includes sample code to create your first cache and a Colab notebook with common examples and code.

Source: Save costs and decrease latency while using Gemini with Vertex AI context caching