Google has launched a new always-on memory agent. The system continually rereads, organizes, and handles memory tasks, enabling models like the Flash Lite version of Gemini to stay active at lower cost. The agent also delivers faster response times and outperforms earlier versions.

Some key features of the system are:  

  • The memory agent operates continuously in the background, keeping the AI’s memory updated without demanding ongoing costly processing.  
  • It targets common tasks such as UI generation, moderation, and simulation with high efficiency.  
  • The system integrates into runtime strategies and supports workflow agents and multi-agent systems deployed on Google Cloud Run and Vertex AI.  
  • This technology actively manages memory and could replace traditional vector databases by delivering a more efficient, always-on solution.  

Overall, this development addresses the amnesia problem in large language models by leveraging long-term memory.  

Tech companies are adding long-term memory to large language models to fix the amnesia problem.  

The project was built using Google’s Agent Development Kit (ADK), which launched in spring 2025, and Google Gemini 3.1 Flash Lite, a low-cost model released on March 3, 2026. Flash Lite is the fastest and most cost-efficient model in the Gemini 3 series.

This project serves as a practical example of something many AI teams want. Few have built an agent system that continuously takes in information, organizes it in the background, and retrieves it later without a traditional vector database.  

For enterprise developers, this release is more important as a sign of where agent infrastructure is going than as a product launch.  

The repository offers a look at long-running autonomy, which is becoming more appealing for support systems, research assistance, internal copilots, and workflow automation. It also raises governance questions when memory is not limited to a single session.  

What the Repository Seems to Do and What It Does Not Clearly Claim 

The repository appears to use a multi-agent internal architecture with specialized components for ingestion, consolidation, and querying.

The materials do not present this as a shared memory framework for multiple independent agents.  

The difference matters. ADK supports multi-agent systems, but this repository is best described as an always-on memory agent or memory layer built with specialized sub-agents and persistent storage.  

Even at this more limited level, it tackles a key infrastructure problem that many teams are trying to solve.  

The Architecture Is Simple and Avoids a Traditional Retrieval Stack 

The repository says the agent runs continuously, accepts files or API input, stores structured data in SQLite, and consolidates memory by default every 30 minutes.

A local HTTP API and a Streamlit dashboard are in place. The system can handle text, image, audio, video, and PDF files. The repository describes the design bluntly: no vector database, no embeddings, just an LLM that reads incoming material and writes structured memory.
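As a rough illustration of what that always-on loop involves, here is a minimal sketch; the SQLite schema, the table names, and the `llm_summarize` callable are hypothetical and are not taken from the repository.

```python
import sqlite3
import time

CONSOLIDATION_INTERVAL_SECONDS = 30 * 60  # repository default: consolidate every 30 minutes

def init_db(path="memory.db"):
    # Structured memory lives in plain SQLite tables, not a vector index.
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS raw_events (
        id INTEGER PRIMARY KEY, source TEXT, content TEXT, created_at REAL)""")
    conn.execute("""CREATE TABLE IF NOT EXISTS memories (
        id INTEGER PRIMARY KEY, topic TEXT, summary TEXT, updated_at REAL)""")
    conn.commit()
    return conn

def consolidate(conn, llm_summarize):
    # Re-read recent raw events and let the model rewrite them as structured memory.
    rows = conn.execute("SELECT id, content FROM raw_events").fetchall()
    if not rows:
        return
    summary = llm_summarize([content for _, content in rows])  # hypothetical LLM call
    conn.execute(
        "INSERT INTO memories (topic, summary, updated_at) VALUES (?, ?, ?)",
        ("consolidated", summary, time.time()),
    )
    conn.execute("DELETE FROM raw_events")
    conn.commit()

def run_forever(llm_summarize):
    # The always-on part: a long-running background service, not a per-request job.
    conn = init_db()
    while True:
        consolidate(conn, llm_summarize)
        time.sleep(CONSOLIDATION_INTERVAL_SECONDS)
```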

The design will likely catch the eye of developers focused on cost and complexity. Traditional retrieval stacks often require separate embeddings, pipelines, vector storage, indexing logic, and synchronization.  

Saboo’s example relies on the model to organize and update memory. This can make prototypes simpler and reduce infrastructure. The performance focus shifts from vector search overhead to model latency, memory compaction, and stability.

Flash Lite Makes the Always-On Model More Affordable 

Gemini 3.1 Flash Lite enables this always-on model.  

Google says the model is designed for high-volume developer workloads and is priced at $0.25 per million input tokens and $1.50 per million output tokens.

The company also says that Flash Lite is 2.5 times faster than Gemini 2.5 in time-to-first-token and offers a 45% boost in output speed while maintaining or improving quality.  

According to Google’s benchmarks, the model scores 1432 on LMArena, 86.9% on GPQA Diamond, and 76.8% on MMMU Pro. Google says these characteristics make it well-suited for high-frequency tasks such as translation, moderation, UI generation, and simulation.

These numbers show why Flash Lite is paired with a background memory agent. It enables a 24/7 service to re-read, consolidate, and serve memory with predictable latency and low inference costs, ensuring affordable, reliable, always-on performance.
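To make the pricing concrete, the back-of-the-envelope calculation below uses the published per-token prices; the token volumes for the background workload are invented for illustration only.

```python
# Gemini 3.1 Flash Lite list prices quoted above
INPUT_PRICE_PER_M = 0.25   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 1.50  # USD per 1M output tokens

# Hypothetical always-on workload: one consolidation pass every 30 minutes, 24/7
runs_per_day = 48
input_tokens_per_run = 50_000   # assumed size of the memory being re-read
output_tokens_per_run = 5_000   # assumed size of the rewritten summaries

daily_cost = (runs_per_day * input_tokens_per_run / 1e6) * INPUT_PRICE_PER_M \
           + (runs_per_day * output_tokens_per_run / 1e6) * OUTPUT_PRICE_PER_M
print(f"~${daily_cost:.2f} per day")  # ~$0.96/day under these assumptions
```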

Google’s ADK documentation supports this bigger picture. The framework is model-agnostic and deployment-agnostic. It supports workflow agents, multi-agent systems, and tools, with evaluation and deployment options such as Cloud Run and Vertex AI Agent Engine. This makes the memory agent seem less like a one-off demo and more like a reference for a wider set of agents. For an enterprise, the main debate is about governance, not just capability. Public reaction shows that enterprise adoption of persistent memory depends on more than just speed or token pricing.
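For orientation, a minimal single-agent definition with ADK looks roughly like the sketch below; the structure follows ADK’s public quickstart, while the tool and the instruction are placeholders rather than anything taken from Saboo’s repository.

```python
from google.adk.agents import Agent  # Google's Agent Development Kit

def remember(note: str) -> dict:
    """Hypothetical tool: write a note into the agent's persistent memory store."""
    # A real implementation would append to SQLite or another durable store.
    return {"status": "stored", "note": note}

# A single ADK agent; the same framework composes workflow and multi-agent systems
# and can be deployed to Cloud Run or Vertex AI Agent Engine.
root_agent = Agent(
    name="memory_agent",
    model="gemini-3.1-flash-lite",  # model name as described in the article
    description="Always-on agent that ingests notes and answers from memory.",
    instruction=(
        "Store incoming facts with the remember tool and answer questions "
        "using previously stored memory."
    ),
    tools=[remember],
)
```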

On X, several responses highlighted enterprise concerns. Franck Abe pointed to Google ADK and 24/7 agent autonomy but warned that an agent dreaming and mixing memories in the background without clear boundaries creates a compliance nightmare.

Another commenter agreed, saying the main cost of always-on agents is not tokens but drift and loops.

These critiques focus on the practical challenges of persistent systems. Who can write memory? What gets merged? How does retention work? What happens if the agent learns something incorrect, and can that memory be deleted? How do teams audit what the agent has learned over time?

Another user, Iffy, questioned the repo’s claim of no embeddings, arguing that the system still needs to chunk, index, and retrieve structured memory. Iffy also said it may work well for small-context agents but could struggle as memory stores grow.

This criticism matters. Removing a vector database does not eliminate the need for retrieval design; it just shifts the complexity elsewhere.  

For developers, the trade-off is about fit, not ideology. A lighter stack suits those building lightweight, bounded memory agents. Larger deployments may need stricter retrieval controls, clearer indexing strategies, and stronger lifecycle tooling. ADK expands the story beyond just one demo.

Other commenters focused on the developer workflow. One person asked for the ADK repository and documentation and wanted to know whether the runtime is serverless or long-running, and whether tool calling and evaluation hooks are available by default.

The answer is both. The memory agent example runs as a long-running service, and ADK supports multiple deployment patterns and includes tools and evaluation features. The always-on memory agent is notable, but the main point is that Saboo wants agents to function as deployable software systems, not just isolated prototypes; in this approach, memory becomes part of the runtime layer rather than an add-on.

What Saboo Has Shown and What He Has Not 

What Saboo has not shown yet is just as important as what he has published.  

The provided materials do not include a direct benchmark comparing Flash Lite with Anthropic’s Claude Haiku for agent loops in production.

They do not outline enterprise-grade compliance controls for this memory agent. These would include deterministic policy boundaries, retention guarantees, segregation rules, or formal audit workflows.  

While the repository appears to use several specialist agents internally, the materials do not clearly support a broader claim about persistent memory shared across multiple independent agents.

For now, the repository serves as a strong engineering template, not a full enterprise memory platform.  

Why This Is Important Now 

Still, this release comes at the right time. Enterprise AI teams are moving past single-session assistants and toward systems that remember preferences, retain project information, and operate for longer periods.

Saboo’s open-source memory agent provides teams with a solid foundation for building infrastructure that supports long-term context and persistent information. Flash Lite further benefits organizations by reducing costs and making advanced agent capabilities accessible to more teams.  

The main takeaway: continuous memory will be judged on both governance and capability.  

The real enterprise question is whether an agent can remember in ways that are limited, inspectable, and safe for production.  

Source: Google PM open-sources Always On Memory Agent, ditching vector databases for LLM-driven persistent memory

Vectors are the basic tools AI models use to process information. Simple vectors describe points, while high-dimensional embeddings capture complex details, such as the features of an image or the meanings of words, but they use much more memory. This can slow down the key-value cache, a fast store for frequently accessed data, so the computer has to search a slower database.

Vector quantization is a classic data compression method that reduces the size of multidimensional vectors. This helps AI in two ways: it speeds up vector search, a core technology behind large-scale AI and search engines, and it reduces key-value cache slowdowns by making key-value pairs smaller. This means faster searches and lower memory cost. However, traditional vector quantization often incurs a memory cost of its own, because most methods must compute and store precise quantization constants for each small block of data. This can add 1 or 2 extra bits per number, partly defeating the purpose of compressing the data.
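To see where that per-block overhead comes from, here is a toy block-wise scalar quantizer (not TurboQuant itself): each block of values stores its own scale constant, and those extra floats amount to roughly the 1 or 2 extra bits per number mentioned above.

```python
import numpy as np

def blockwise_quantize(x, bits=4, block=32):
    """Toy block-wise scalar quantization: each block stores its own scale constant."""
    max_code = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed codes
    codes, scales = [], []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        scale = float(np.abs(chunk).max()) / max_code or 1.0  # per-block constant (the overhead)
        codes.append(np.clip(np.round(chunk / scale), -max_code, max_code).astype(np.int8))
        scales.append(scale)
    return codes, scales  # the scales are extra metadata on top of the low-bit codes

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
codes, scales = blockwise_quantize(x)
# 32 float32 scales for 1024 values ≈ 1 extra bit per value on top of the codes themselves
```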

Today, we are introducing TurboQuant, a new compression algorithm that addresses this memory overhead in vector quantization. TurboQuant builds on two other methods, quantized Johnson–Lindenstrauss (QJL) and PolarQuant, to achieve its results. In our tests, all three techniques help reduce key-value bottlenecks without lowering AI model effectiveness. This could have a big impact on any use case that relies on compression, especially in search and AI.

How TurboQuant Works 

TurboQuant is a compression method that significantly reduces model size without sacrificing accuracy. It is suitable for both key-value (KV) cache compression and vector search. Its approach involves two main steps, each building on concepts introduced by PolarQuant and QJL.

  1. High-quality compression (the PolarQuant method): TurboQuant starts by randomly rotating data vectors. This step simplifies the data geometry, making it easy to apply a standard, high-quality quantizer (a tool that maps a large set of continuous values, such as precise decimals, to a smaller discrete set of symbols or numbers, such as integers) to each part of the vector individually. This first step uses most of the compression budget (the majority of the bits) to capture the main content and strength of the source vector.  
  2. Next, TurboQuant applies QJL to the remaining error using a single bit, thereby improving the accuracy of the attention score (a sketch of both steps follows this list).  
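Here is a compact sketch of that two-step idea (random rotation, coarse per-coordinate codes, then a 1-bit sign code for the residual). It follows the description above rather than the actual TurboQuant implementation, and the residual correction factor is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # A random orthogonal matrix; practical methods use fast structured rotations instead.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def encode(x, R, bits=3):
    y = R @ x                                    # rotate to simplify the geometry
    max_code = 2 ** (bits - 1) - 1
    scale = np.abs(y).max() / max_code
    coarse = np.clip(np.round(y / scale), -max_code, max_code)   # step 1: coarse codes
    sign_bits = np.sign(y - coarse * scale)      # step 2: 1-bit code of the remaining error
    return coarse, scale, sign_bits

def decode(coarse, scale, sign_bits, R):
    # The residual's magnitude is not stored; a small fixed correction stands in for it here.
    y_hat = coarse * scale + 0.25 * scale * sign_bits
    return R.T @ y_hat                           # undo the rotation

d = 64
R = random_rotation(d)
x = rng.standard_normal(d)
x_hat = decode(*encode(x, R), R)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # relative reconstruction error
```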

To better understand TurboQuant’s efficiency, let’s examine the specific roles that QJL (quantized Johnson Lindenstrauss) and Polar Quant (polar coordinate quantization) play in its two-step process.  

QJL: The Zero-Overhead 1-Bit Trick  

QJL uses the Johnson–Lindenstrauss transform, a mathematical method, to compress high-dimensional datasets while preserving important distance relationships between data points. It converts each vector element to a single sign bit, either +1 or −1. This creates a fast shorthand that requires no extra memory. To keep the results accurate, QJL uses a special estimator that bridges a high-precision query with a simpler, lower-precision dataset. This helps the model accurately calculate the attention score, which determines which parts of its input matter most and which can be ignored.  
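The toy example below illustrates that sign-bit idea: the query stays at full precision, each stored key keeps only the signs of a shared random projection plus its norm, and inner products are estimated from those bits. It is an illustration of the principle described above, not the paper’s exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def qjl_style_encode(k, S):
    """Keep only the sign bits of the random projection, plus the key's norm (one scalar)."""
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_inner_product(q, sign_bits, k_norm, S):
    """Asymmetric estimate of <q, k>: full-precision query against a 1-bit-per-projection key."""
    proj_q = S @ q
    # For Gaussian S, E[sign(S @ k) * (S @ q)] = sqrt(2/pi) * <q, k> / ||k||
    return k_norm * np.sqrt(np.pi / 2) * np.mean(sign_bits * proj_q)

d, m = 128, 2048                       # original dimension, number of 1-bit projections
S = rng.standard_normal((m, d))
q, k = rng.standard_normal(d), rng.standard_normal(d)

bits, k_norm = qjl_style_encode(k, S)
print(q @ k, estimate_inner_product(q, bits, k_norm, S))  # exact vs. 1-bit estimate
```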

PolarQuant: A New Angle Of Compression 

PolarQuant solves the memory cost problem in a different way. Instead of using standard coordinates like x, y, and z to represent distance along each axis, PolarQuant converts the vector to polar coordinates. This is like saying "go five blocks at a 37-degree angle" instead of "go three blocks east and four blocks north." This gives two pieces of information: the radius shows the strength of the data, and the angle shows its direction and meaning. Because the angles follow a known pattern, the model does not need to perform a costly data normalization step. It maps data onto a fixed, predictable circular grid with predefined boundaries, rather than a square grid whose boundaries keep changing. This lets PolarQuant avoid the memory overhead of older methods.
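A toy two-dimensional version of that idea, which assumes nothing about the actual PolarQuant implementation, splits a vector into 2-D pairs, keeps each pair’s radius, and snaps each pair’s angle onto a fixed circular grid:

```python
import numpy as np

def polar_quantize_pairs(x, angle_bits=4):
    """Toy polar quantization: radius kept as-is, angle snapped to a fixed circular grid."""
    pairs = x.reshape(-1, 2)
    radius = np.linalg.norm(pairs, axis=1)                 # strength of each 2-D pair
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])           # direction of each 2-D pair
    step = 2 * np.pi / (2 ** angle_bits)
    angle_code = np.round(angle / step).astype(int) % (2 ** angle_bits)  # fixed grid
    return radius, angle_code

def polar_dequantize(radius, angle_code, angle_bits=4):
    step = 2 * np.pi / (2 ** angle_bits)
    angle = angle_code * step
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1).reshape(-1)

x = np.random.default_rng(1).standard_normal(64)
radius, codes = polar_quantize_pairs(x)
x_hat = polar_dequantize(radius, codes)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))       # relative reconstruction error
```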

Experiments And Results 

We tested all three algorithms on standard long-context benchmarks, including LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval, using open-source LLMs such as Gemma and Mistral. TurboQuant delivers top performance on both dot-product distortion and recall while using less key-value (KV) memory. The chart below summarizes how TurboQuant, PolarQuant, and KIVI baselines performed on tasks such as question answering, code generation, and summarization.

The chart below shows how the algorithms performed on long-context needle-in-a-haystack tasks, which test a model’s ability to find information hidden in large text. TurboQuant achieved perfect results across all benchmarks and reduced the key-value memory footprint by at least 6x. PolarQuant performed almost as well on this task.

TurboQuant can reduce the key-value cache to just 3 bits without any training or fine-tuning, while maintaining model effectiveness on the tested LLMs (Gemma and Mistral). TurboQuant is easy to implement and adds almost no extra runtime. The plot below shows that 4-bit TurboQuant can be up to 8× faster than 32-bit unquantized keys on H100 GPUs.

This also makes TurboQuant a great fit for jobs like vector search, where it can speed up index building. We tested TurboQuant for high-dimensional vector search against top methods like PQ and RaBitQ using the recall 1@K ratio, which shows how often the algorithm correctly identifies the true top inner-product result among its top K guesses. TurboQuant consistently achieved higher recall ratios than the baseline techniques, even though the baselines use large codebooks and require tuning for each dataset. This shows that TurboQuant is both strong and efficient for large-scale, high-dimensional search, setting a new benchmark for what is achievable. It delivers near-optimal distortion rates in a data-oblivious manner, which means nearest-neighbor search engines can operate with the efficiency of a 3-bit system while maintaining the accuracy of much heavier models.
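For reference, the recall 1@K metric mentioned above can be computed as in the sketch below. The stand-in quantizer here is deliberately crude and is not TurboQuant; it only serves to show how the metric compares exact and compressed search.

```python
import numpy as np

rng = np.random.default_rng(0)

def recall_1_at_k(queries, db, db_quantized, k=10):
    """Fraction of queries whose true top-1 inner-product result survives in the quantized top-K."""
    hits = 0
    for q in queries:
        true_top = np.argmax(db @ q)                       # exact top-1 by inner product
        approx_top_k = np.argsort(db_quantized @ q)[-k:]   # top-K under the compressed vectors
        hits += true_top in approx_top_k
    return hits / len(queries)

db = rng.standard_normal((10_000, 64)).astype(np.float32)
queries = rng.standard_normal((100, 64)).astype(np.float32)

# Crude stand-in quantizer: round to a handful of levels, then decode for scoring.
scale = np.abs(db).max() / 3
db_q = (np.round(db / scale) * scale).astype(np.float32)

print(recall_1_at_k(queries, db, db_q))
```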

Peering Forward 

TurboQuant, QJL, and PolarQuant are not simply practical engineering solutions. They are also important algorithmic advances backed by rigorous proofs. These methods work well in practice and are provably efficient, operating close to the theoretical optimum. This solid foundation makes them reliable for large, critical systems.

One main use of these methods is to relieve key-value cache bottlenecks in models like Gemini, but efficient online vector quantization has an even wider impact. For example, modern search is moving beyond keywords to understand intent and meaning. This shift needs vector search, which finds the most relevant items in a database with billions of vectors. Techniques like TurboQuant are essential here: they enable the construction and querying of large vector indexes with minimal memory usage, almost no pre-processing, and top accuracy, making semantic search at scale faster and more efficient. As AI is integrated into more products, advances in vector quantization will become even more important.

Source: TurboQuant: Redefining AI efficiency with extreme compression 

Google is bringing Search Live, an AI feature of Google Search, to the Google app. It uses real-time camera and voice input. The feature runs on the Gemini 3.1 Flash Live model. Users can now show their surroundings to the AI in Google Search and ask questions in a true dialogue.

Important Details 

  • Real-time communication: users benefit from seamless voice conversations with AI, enabling faster, more natural searches.  
  • Camera and voice integration: with instant camera and voice activation, users can quickly get answers about any object or place they encounter.  
  • Location: the feature is in the Google app (Android and iOS), accessible via the live icon under the search bar.  
  • Availability: expanding to 200+ countries and several Indian languages, Search Live benefits a broad global audience.  

How Search Live Works 

  • Open Live mode in the Google app by tapping the Live icon, or access it via Google Lens (a tool for searching by images captured from a camera).  
  • Point and ask: enable the camera and ask questions aloud.  
  • The AI gives audio feedback. It also shows relevant web links.  
  • Continuous conversation—the feature permits follow-up questions for natural interaction.  
  • Background operation: users can keep interacting with the AI while multitasking, and the conversation continues even when camera sharing pauses.  

Use Cases 

  • Troubleshooting: users can point the camera at electronics to ask how to connect specific cables.  
  • Traveling: users can identify landmarks.  
  • Hobbies and learning: users can request explanations for items in a matcha set or about educational experiments.  
  • Shopping: users can get product details and reviews.  

This is part of a shift toward multimodal search where imagery, visual cues, and speech replace text input.  

Google has launched Gemini 3.1 Flash Live, a real-time audio and voice AI model for faster, more natural conversations. It reduces latency, improves reliability, and enhances dialogue quality for advanced, voice-first, multimodal AI applications.  

Gemini 3.1 Flash Live  

Gemini 3.1 Flash Live manages real-time conversations with enhanced responsiveness and context awareness. It supports natural dialogue flow, multi-turn interactions, extended conversations, and dynamic user inputs.

The model delivers reliable, natural-sounding conversations and completes complex tasks, with benchmark results showing significant improvements over previous versions. For example:  

  • ComplexFuncBench (audio): Gemini 3.1 Flash Live achieves 90.8% on multi-step function calling with various constraints, outperforming earlier models.  
  • Scale AI MultiChallenge (audio): it scores 36.1% with thinking enabled, excelling at complex instruction following and long-horizon reasoning despite the interruptions and hesitations typical of real-world audio.  

Key Features And Improvements 

  • Faster responses: the model delivers quicker replies and maintains fluid, near-instant interactions.  
  • Better reliability in real-life conditions: Gemini 3.1 Flash Live executes tasks more reliably in noisy environments by filtering out background noise such as traffic or television, ensuring agents remain responsive to instructions.  
  • It closely follows complex instructions and guardrails, ensuring dependable performance even as conversations shift.  
  • The model accurately interprets pitch, tone, and pace, adapting responses to user sentiment and enabling more natural dialogue.  
  • More natural dialogue flow: the model maintains conversation threads for longer periods, preserving context throughout extended interactions and longer sessions.  
  • It enables real-time conversations in over 90 languages for global accessibility and consistent performance.  

Developers can use the Gemini Live API (a platform for building features using real-time data) to build real-time conversational agents that process voice and video inputs and respond instantly. Key capabilities include:  

  • Handling real-time audio and multimodal input  
  • Function calling and external tool integration  
  • Session management for long-running conversations  
  • Ephemeral tokens for secure interactions  
  • Building interactive voice-first AI agents  

In addition to these foundational capabilities, the Google Gen AI SDK (a software toolkit for building generative AI features) enables asynchronous connections to audio sessions and supports instant interactions.
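A minimal, text-only sketch of that asynchronous connection, based on the publicly documented google-genai Live API surface, is shown below; the model identifier is a placeholder implied by the article and the exact string may differ.

```python
import asyncio
from google import genai

client = genai.Client()  # reads the API key from the environment

async def main():
    # Placeholder model name derived from the article; substitute whichever
    # Live-capable Gemini model is available to you.
    model = "gemini-3.1-flash-live"
    config = {"response_modalities": ["TEXT"]}  # audio output is also supported

    async with client.aio.live.connect(model=model, config=config) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "What can you help me with?"}]},
            turn_complete=True,
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```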

Search Live Expansion And Use Cases 

Search Live now works in 200+ regions with AI mode, using Gemini 3.1 Flash Live for real-time voice and camera queries. AI mode is available in Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Tamil, Telugu, Urdu, and more.  

Key Features Of Search Live Include: 

  • Voice-activated conversation through the Google app  
  • Follow-up questions in ongoing sessions  
  • Camera input for context-aware queries  
  • Google Lens integration for visual interaction  
  • Helpful audio responses with supporting web links  

This allows users to perform tasks that require real-time interaction, such as troubleshooting, learning, or investigating real-world objects.  

Ecosystem And Integrations 

Gemini 3.1 Flash Live delivers scalable infrastructure and partner integration for production environments:  

  • WebRTC-based systems for live voice and video  
  • Global edge routing for distributed applications  
  • Partner integrations for handling diverse input systems  

Companies such as Verizon, LiveKit, and the Home Depot report positive results using the model in conversational workflows.  

Safety And Content Authenticity 

All generated audio includes a SynthID watermark imperceptibly embedded in the output. This enables the detection of AI-produced content, supporting transparency and reducing misinformation.

Availability 

Gemini 3.1 Flash Live is available across multiple Google platforms.  

  • Developers: preview access via Gemini Live API in Google AI Studio  
  • Enterprises: Gemini Enterprise for Customer Experience Applications  
  • End users: Gemini Live and Search Live  
  • Global Reach: Search Live is available in 200+ countries and territories with AI mode.  
  • Languages: real-time conversation support in more than 90 languages  
  • Platforms: accessible via the Google app on Android and iOS, as well as through Google Lens for camera-based interactions.  

Source: Google rolls out Gemini 3.1 Flash Live for real-time voice AI conversations, expands Search Live globally