Vectors are the basic tools AI models use to process information. Simple vectors describe points, while high-dimensional embeddings capture complex details, such as the features of an image or the meanings of words, but use much more memory. This can overwhelm the key-value (KV) cache, a fast store for frequently accessed data, forcing the computer to fall back on a slower database.  

Vector quantization is a classic data compression method that reduces the size of multidimensional vectors. This helps AI in two ways. It speeds up vector search, a core technology behind large-scale AI and search engines, and it reduces key-value cache slowdowns by making key-value pairs smaller. This means faster searches and lower memory cost. However, traditional vector quantization often incurs its own memory cost, because most methods must compute and store precise quantization constants for each small block of data. This can add 1 or 2 extra bits per number, partly defeating the purpose of compressing the data.  
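To see why per-block constants matter, here is a back-of-the-envelope calculation. The block size and bit widths below are illustrative assumptions, not figures from the paper.

```python
# Overhead of per-block quantization constants.
# Illustrative setup: 4-bit codes, blocks of 32 values, and one 16-bit
# scale plus one 16-bit zero-point stored per block.

def bits_per_value(code_bits: int, block_size: int, constant_bits: int) -> float:
    """Total stored bits per value: the code itself plus the block's
    constants amortized over every value in the block."""
    return code_bits + constant_bits / block_size

# 32 bits of constants amortized over a 32-value block adds a full extra
# bit on top of each 4-bit code.
print(bits_per_value(code_bits=4, block_size=32, constant_bits=32))  # 5.0
```

Shrinking the block improves quantization accuracy but makes this overhead worse, which is the trade-off the text describes.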

Today, we are introducing TurboQuant, a new compression algorithm that addresses this memory overhead. TurboQuant builds on two other methods, quantized Johnson–Lindenstrauss (QJL) and PolarQuant, to achieve its results. In our tests, all three techniques help reduce key-value cache bottlenecks without lowering AI model effectiveness. This could have a big impact on any use case that relies on compression, especially in search and AI.  

How TurboQuant Works 

TurboQuant is a compression method that significantly reduces model memory use without sacrificing accuracy. It is suitable for both key-value cache compression and vector search. Its approach involves two main steps, each building on concepts introduced by PolarQuant and QJL.  

  1. High-quality coarse compression (the PolarQuant step): TurboQuant starts by randomly rotating data vectors. This clever step simplifies the data geometry, making it easy to apply a standard, high-quality quantizer (a tool that maps a large set of continuous values, such as precise decimals, to a smaller discrete set of symbols or numbers, such as integers) to each coordinate of the vector individually. This first step uses most of the compression budget (the majority of the bits) to capture the main content and strength of the source vector.  
  2. Residual correction (the QJL step): Next, TurboQuant applies QJL to the remaining error using a single bit per coordinate, thereby improving the accuracy of the attention score.  
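The two steps above can be sketched in a few lines. This is a simplified illustration, not TurboQuant's actual implementation: the QR-based rotation, the 2-bit uniform quantizer, and the single global residual scale are stand-ins chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    # Random orthogonal matrix via QR (a stand-in for the fast structured
    # rotations a real implementation would use).
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def quantize(x: np.ndarray, rot: np.ndarray, bits: int = 2):
    """Step 1: rotate, then apply a uniform quantizer to each coordinate.
    Step 2: keep one sign bit per coordinate of the residual error."""
    y = rot @ x
    levels = 2 ** bits
    scale = np.abs(y).max() / (levels / 2)      # single scalar constant
    codes = np.clip(np.round(y / scale), -levels // 2, levels // 2 - 1)
    residual = y - codes * scale
    return codes, np.sign(residual), scale, np.abs(residual).mean()

def dequantize(codes, resid_signs, scale, resid_scale, rot):
    y_hat = codes * scale + resid_signs * resid_scale   # add 1-bit correction
    return rot.T @ y_hat                                # undo the rotation

d = 64
x = rng.normal(size=d)
rot = random_rotation(d)
x_hat = dequantize(*quantize(x, rot), rot)
# x_hat approximates x using roughly 3 bits per coordinate (2 coarse + 1 sign).
```

The rotation spreads each vector's energy evenly across coordinates, which is what lets a single shared scale work in place of per-block constants.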

To better understand TurboQuant’s efficiency, let’s examine the specific roles that QJL (quantized Johnson–Lindenstrauss) and PolarQuant (polar-coordinate quantization) play in its two-step process.  

QJL: The Zero-Overhead 1-Bit Trick  

QJL uses the Johnson–Lindenstrauss transform, a mathematical method that compresses high-dimensional data while preserving important distance relationships between points. It converts each coordinate of the transformed vector to a single sign bit, either +1 or −1. This creates a fast shorthand that requires no extra memory for quantization constants. To keep the results accurate, QJL uses a special estimator that bridges a high-precision query with the lower-precision dataset. This helps the model accurately calculate the attention score, which determines which parts of its input matter most and which can be ignored.  
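A minimal sketch of the sign-bit idea, assuming Gaussian random projections (the paper's actual transform and estimator may differ): each key is reduced to one sign bit per projection plus its norm, and inner products with a full-precision query are recovered with a sqrt(pi/2) correction, which follows from the identity E[⟨s,q⟩·sign(⟨s,k⟩)] = sqrt(2/pi)·⟨q,k⟩/‖k‖ for a standard Gaussian s.

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 16, 100_000           # m projections; large here only to make the
                             # Monte Carlo estimate tight for the demo

S = rng.normal(size=(m, d))  # JL-style random Gaussian projection

def qjl_encode(k: np.ndarray):
    """Store one sign bit per projection, plus the vector's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q: np.ndarray, sign_bits: np.ndarray, k_norm: float):
    # Rescaling by sqrt(pi/2) * ||k|| makes the sign-bit average an
    # unbiased estimate of the true inner product <q, k>.
    return np.sqrt(np.pi / 2) * k_norm * np.mean((S @ q) * sign_bits)

q = rng.normal(size=d)
k = rng.normal(size=d)
bits, k_norm = qjl_encode(k)
est = qjl_inner_product(q, bits, k_norm)
true = float(q @ k)
```

In a KV cache, `k` plays the role of a stored key and `q` the incoming query, so attention scores can be estimated directly from the 1-bit codes.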

PolarQuant: A New Angle Of Compression 

PolarQuant solves the memory cost problem in a different way. Instead of using standard coordinates like x, y, and z to represent distance along each axis, PolarQuant converts the vector to polar coordinates. This is like saying "go five blocks at a 37-degree angle" instead of "go three blocks east and four blocks north." This gives two pieces of information: the radius captures the strength of the data, and the angle captures its direction, or meaning. Because the angles follow a known pattern, the model does not need to perform a costly data normalization step. It maps data onto a fixed, predictable circular grid with predefined boundaries, rather than a square grid whose boundaries keep changing. This lets PolarQuant avoid the memory overhead of older methods.  
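A toy version of the polar idea, not the paper's exact scheme: split the vector into 2-D pairs and quantize each pair's angle on a fixed uniform grid, with a single global scale for the radii. The fixed angular grid is what removes the need for per-block constants.

```python
import numpy as np

def polar_quantize_pairs(x: np.ndarray, angle_bits: int = 4, radius_bits: int = 3):
    """Quantize consecutive (x, y) pairs as (radius, angle) codes.

    The angle grid is fixed and uniform over the full circle, so the
    angular part needs no stored constants at all; only the radius codes
    share one global scale.  Simplified illustration.
    """
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])

    n_angles = 2 ** angle_bits
    angle_codes = np.round(theta / (2 * np.pi) * n_angles) % n_angles

    r_scale = r.max() / (2 ** radius_bits - 1)      # one global constant
    radius_codes = np.round(r / r_scale)
    return angle_codes, radius_codes, r_scale

def polar_dequantize(angle_codes, radius_codes, r_scale, angle_bits: int = 4):
    theta = angle_codes * 2 * np.pi / 2 ** angle_bits
    r = radius_codes * r_scale
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).ravel()

rng = np.random.default_rng(2)
x = rng.normal(size=64)
x_hat = polar_dequantize(*polar_quantize_pairs(x))
```

With 4 angle bits and 3 radius bits this stores 7 bits per pair, i.e. 3.5 bits per coordinate, plus a single scalar for the whole vector.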

Experiments And Results 

We tested all three algorithms on standard long-context benchmarks, including LongBench, Needle in a Haystack, ZeroScrolls, RULER, and L-Eval, using open-source LLMs such as Gemma and Mistral. TurboQuant delivers top performance across both dot-product distortion and recall while using less key-value (KV) memory. The chart below summarizes how TurboQuant, PolarQuant, and KIVI baselines performed on tasks such as question answering, code generation, and summarization.  

The chart below shows how the algorithms performed on long-context needle-in-a-haystack tasks, which test a model’s ability to find information hidden in large text. TurboQuant achieved perfect results across all benchmarks and reduced the key-value memory footprint by at least 6x. PolarQuant performed almost as well for this task.  

TurboQuant can reduce the key-value cache to just 3 bits per value without any training or fine-tuning, while maintaining the effectiveness of the underlying LLMs (Gemma and Mistral). TurboQuant is easy to implement and adds almost no extra runtime. The plot below shows that 4-bit TurboQuant can be up to 8× faster than 32-bit unquantized keys on H100 GPUs.  
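The memory impact of 3-bit KV quantization can be estimated with simple arithmetic. The model dimensions below are hypothetical (a generic 7B-class configuration), chosen only to illustrate the calculation, not taken from the paper.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Size of the key-value cache for one sequence: one entry per layer,
    per KV head, per token, per head dimension, for both keys and values."""
    values = 2 * layers * kv_heads * head_dim * seq_len   # 2 = keys + values
    return values * bits_per_value / 8

# Hypothetical 7B-class model dimensions, for illustration only.
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128_000)

fp16 = kv_cache_bytes(**cfg, bits_per_value=16)
q3 = kv_cache_bytes(**cfg, bits_per_value=3)
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB, "
      f"ratio: {fp16 / q3:.1f}x")
# fp16: 15.6 GiB, 3-bit: 2.9 GiB, ratio: 5.3x
```

At a 128K-token context, that is the difference between a cache that spills out of GPU memory and one that fits comfortably alongside the model weights.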

This makes TurboQuant a great fit for workloads like vector search, where it can speed up index building. We tested TurboQuant on high-dimensional vector search against top methods like PQ and RaBitQ using the recall 1@K ratio. This ratio shows how often the algorithm's top K candidates include the true top inner-product result. TurboQuant consistently achieved higher recall than the baseline techniques, even though the baselines use large codebooks and require tuning for each dataset. This shows that TurboQuant is both strong and efficient for large-scale, high-dimensional search, setting a new benchmark for what is achievable: it delivers near-optimal distortion rates in a data-oblivious manner. In practice, this means our nearest-neighbor engines can operate with the memory footprint of a 3-bit system while maintaining the accuracy of much heavier models.  
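The recall 1@K metric described above is straightforward to compute. The brute-force sketch below uses a crude rounding quantizer purely as a stand-in for TurboQuant's codes; any approximate database can be dropped in its place.

```python
import numpy as np

def recall_1_at_k(queries, exact_db, approx_db, k: int) -> float:
    """Fraction of queries whose true top-inner-product item appears
    among the top-k candidates retrieved from the approximate
    (e.g. quantized) database."""
    hits = 0
    for q in queries:
        true_top = int(np.argmax(exact_db @ q))        # ground truth
        topk = np.argsort(approx_db @ q)[-k:]          # approximate top-k
        hits += true_top in topk
    return hits / len(queries)

rng = np.random.default_rng(3)
db = rng.normal(size=(1000, 32))
queries = rng.normal(size=(50, 32))

# Crude stand-in for a quantizer: round each coordinate to a 0.5 grid.
approx_db = np.round(db * 2) / 2

r = recall_1_at_k(queries, db, approx_db, k=10)
```

In a real pipeline the top-k candidates from the quantized index would then be re-ranked with full-precision vectors, so high recall 1@K is what determines end-to-end search quality.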

Peering Forward 

TurboQuant, QJL, and PolarQuant are not simply practical engineering solutions. They are also important algorithmic advances supported by rigorous theoretical guarantees. These methods work well in practice and are provably efficient, operating close to theoretical limits. This solid foundation makes them reliable for large, critical systems.  

One main use of these methods is to solve key-value cache bottlenecks in models like Gemini, but efficient online vector quantization has an even wider impact. For example, modern search is moving beyond keywords to understand intent and meaning. This shift needs vector search, which finds the most relevant items in a database with billions of vectors. Techniques like TurboQuant are essential here: they enable the construction and querying of large vector indexes with minimal memory usage, almost no pre-processing, and top accuracy, making semantic search at scale faster and more efficient. As AI is integrated into more products, advances in vector quantization will become even more important.  

Source: TurboQuant: Redefining AI efficiency with extreme compression 

Google Search is releasing Search Live in the Google app. It uses real-time camera and voice input, and the feature runs on the Gemini 3.1 Flash Live model. Users can now show their surroundings to the AI in Google Search and ask questions in a true dialogue.  

Important Details 

  • Real-time communication: users benefit from seamless voice conversations with AI, enabling faster, more natural searches.  
  • Camera and voice integration: with instant camera and voice activation, users can quickly get answers about any object or place they encounter.  
  • Location: the feature is in the Google app (Android and iOS), accessible via the live icon under the search bar.  
  • Availability: expanding to 200+ countries and several Indian languages, Search Live benefits a broad global audience.  

How Search Live Works 

  • Open live mode in the Google app, tap the Live button, or access it via Google Lens (a tool for searching by images captured from a camera).  
  • Point and ask: enable the camera and ask questions aloud.  
  • The AI gives audio feedback. It also shows relevant web links.  
  • Continuous conversation—the feature permits follow-up questions for natural interaction.  
  • Background operation: users can keep interacting with the AI while multitasking, maintaining efficiency even though camera sharing pauses.  

Use Cases 

  • Troubleshooting: users can point the camera at electronics to ask how to connect specific cables.  
  • Traveling: users can identify landmarks.  
  • Hobbies and learning: users can request explanations for items in a matcha set or about educational experiments.  
  • Shopping: users can get product details and reviews.  

This is part of a shift toward multimodal search where imagery, visual cues, and speech replace text input.  

Google has launched Gemini 3.1 Flash Live, a real-time audio and voice AI model for faster, more natural conversations. It reduces latency, improves reliability, and enhances dialogue quality for advanced, voice-first, multimodal AI applications.  

Gemini 3.1 Flash Live  

Gemini 3.1 Flash Live manages real-time conversations with enhanced responsiveness and context awareness. It supports natural dialogue flow, multi-turn interactions, extended conversations, and dynamic user inputs.  

The model delivers reliable, natural-sounding conversations and completes complex tasks, achieving benchmarks that exhibit significant improvements over previous versions. For example:  

  • ComplexFuncBench audio: Gemini 3.1 Flash Live achieves 90.8% on multi-step function calling with varied constraints, outperforming earlier models.  
  • Scale AI audio multi-challenge: it scores 36.1% with thinking enabled, excelling at complex instruction following and long-horizon reasoning, despite interruptions and hesitations typical of real-world audio.  

Key Features And Improvements 

  • Lower latency: the model delivers faster responses and maintains fluid, instant interactions.  
  • Better reliability in real-life conditions: Gemini 3.1 Flash Live executes tasks more reliably in noisy environments by filtering out background noise such as traffic or television, ensuring agents remain responsive to instructions.  
  • It closely follows complex instructions and guardrails, ensuring dependable performance even as conversations shift.  
  • The model accurately interprets pitch, tone, and pace, adapting responses to user sentiment and enabling more natural dialogue.  
  • More natural dialogue flow: the model maintains conversation threads for longer periods, preserving context throughout extended sessions.  
  • It enables real-time conversations in over 90 languages for global accessibility and consistent performance.  

Developers can use the Gemini Live API (a platform for building features using real-time data) to build real-time conversational agents that process voice and video inputs and respond instantly. Key capabilities include:  

  • Handling real-time audio and multimodal input  
  • Function calling and external tool integration  
  • Session management for long-running conversations  
  • Ephemeral tokens for secure interactions  
  • Building interactive voice-first AI agents  

In addition to these foundational capabilities, the Google Gen AI SDK (a software toolkit for building generative AI features) enables asynchronous connections to audio sessions and supports instant interaction.  

Search Live Expansion And Use Cases 

Search Live now works in 200+ regions with AI mode, using Gemini 3.1 Flash Live for real-time voice and camera queries. AI mode is available in Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Tamil, Telugu, Urdu, and more.  

Key Features Of Search Live Include: 

  • Voice-activated conversation through the Google app  
  • Follow-up questions in ongoing sessions  
  • Camera input for context-aware queries  
  • Google Lens integration for visual interaction  
  • Helpful audio responses with supporting web links  

This allows users to perform tasks that require real-time interaction, such as troubleshooting, learning, or investigating real-world objects.  

Ecosystem And Integrations 

Gemini 3.1 Flash Live delivers scalable infrastructure and partner integration for production environments:  

  • WebRTC-based systems for live voice and video  
  • Global edge routing for distributed applications  
  • Partner integrations for handling diverse input systems  

Companies such as Verizon, LiveKit, and the Home Depot report positive results using the model in conversational workflows.  

Safety And Content Authenticity 

All generated audio includes a SynthID watermark imperceptibly embedded in the output. This enables the detection of AI-produced content, supporting transparency and reducing misinformation.  

Availability 

Gemini 3.1 Flash Live is available across multiple Google platforms.  

  • Developers: preview access via Gemini Live API in Google AI Studio  
  • Enterprises: Gemini Enterprise for customer experience applications  
  • End users: Gemini Live and Search Live  
  • Global Reach: Search Live is available in 200+ countries and territories with AI mode.  
  • Languages: real-time conversation support in more than 90 languages  
  • Platforms: accessible via the Google app on Android and iOS, as well as through Google Lens for camera-based interactions.  

Source: Google rolls out Gemini 3.1 Flash Live for real-time voice AI conversations, expands Search Live globally