Compression helps reduce storage costs and speeds up data transfers across databases, data centers, high-performance computing, deep learning, and other areas. However, decompressing this data can slow things down by adding latency and consuming valuable compute resources.
To address these challenges, NVIDIA introduced a hardware Decompression Engine (DE) in the NVIDIA Blackwell architecture and integrated support for it into the nvCOMP library. Together they offload decompression from general-purpose compute, accelerate widely used formats such as Snappy, and make adoption seamless.
In this blog, we will explain how DE and nvCOMP work, share usage tips, and highlight the performance benefits they deliver for data-intensive tasks.
How The Decompression Engine Works
The new DE in the Blackwell architecture is a dedicated hardware block that accelerates decompression for Snappy-, LZ4-, and Deflate-based streams. By handling decompression in hardware, the DE frees streaming multiprocessor (SM) resources to focus on computation rather than data movement.
The DE is built into the copy engine, so you no longer need to perform a host-to-device copy followed by software decompression. Instead, compressed data can move directly over PCIe or C2C and be decompressed in flight, removing a major I/O bottleneck.
The DE does more than boost throughput: it allows data movement and computation to occur simultaneously. In multi-stream workloads, decompression can run in parallel with SM kernels, so the GPU stays busy. This helps data-intensive tasks, such as training LLMs, analyzing large genomics datasets, or running HPC simulations, keep up with the high bandwidth of Blackwell GPUs without being bottlenecked by I/O.
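As a rough, CPU-only analogy (not the nvCOMP API), this overlap can be pictured as a two-stage pipeline: the next chunk is decompressed while "compute" consumes the previous one. Here zlib stands in for the decompressor and a trivial checksum stands in for the SM kernel:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def overlapped_pipeline(compressed_chunks):
    """Decompress chunk i+1 while 'computing' on chunk i -- a CPU-side
    stand-in for overlapping DE decompression with SM kernels."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as decompressor:
        future = decompressor.submit(zlib.decompress, compressed_chunks[0])
        for nxt in compressed_chunks[1:] + [None]:
            data = future.result()                # wait for the current chunk
            if nxt is not None:                   # start the next decompress early
                future = decompressor.submit(zlib.decompress, nxt)
            results.append(data[0])               # placeholder "compute" on the chunk
    return results

# Four 1 KiB chunks, each filled with its own index byte.
chunks = [zlib.compress(bytes([i]) * 1024) for i in range(4)]
print(overlapped_pipeline(chunks))  # one result per chunk
```

The key point mirrored here is that decompression of chunk i+1 and computation on chunk i proceed concurrently, so neither stage sits idle.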
The Benefits of nvCOMP’s GPU-Accelerated Decompression
The NVIDIA nvCOMP library offers GPU-accelerated routines for both compression and decompression. It works with many standard formats, as well as formats that NVIDIA has tuned for top GPU performance.
For standard formats, CPUs and fixed-function hardware have often had an edge over GPUs, because these formats expose limited parallelism. The Decompression Engine closes this gap for many workloads. Next, we explain how to use nvCOMP with the DE.
How to use DE and nvCOMP
Developers should access the DE through the nvCOMP APIs. Right now, the DE is available only on certain GPUs (B200, B300, GB200, and GB300), so using nvCOMP lets you write code that works across different GPUs as support grows. If the DE is available, nvCOMP uses it automatically; if not, it falls back to its fast SM-based methods without requiring any changes to your code.
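The dispatch behavior described above can be pictured as a capability check with a software fallback. This Python sketch is purely illustrative: the `HW_ENGINE_AVAILABLE` flag and `hardware_decompress` function are hypothetical, with zlib standing in for the SM-based path:

```python
import zlib

HW_ENGINE_AVAILABLE = False  # hypothetical flag: True only on DE-equipped GPUs

def hardware_decompress(buf: bytes) -> bytes:
    # Placeholder for a dedicated-hardware path (not a real API).
    raise NotImplementedError

def decompress(buf: bytes) -> bytes:
    """Use the hardware engine when present, otherwise fall back to a
    software codec -- mirroring how nvCOMP falls back to SM kernels."""
    if HW_ENGINE_AVAILABLE:
        return hardware_decompress(buf)
    return zlib.decompress(buf)  # software fallback

payload = zlib.compress(b"hello decompression engine")
print(decompress(payload))  # identical result on either path
```

Because the check lives inside the library, callers write one code path and get the fastest available backend on each GPU.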
To ensure the DE is actually used on DE-enabled GPUs, pay attention to how your buffers are allocated. nvCOMP usually accepts any input and output buffers the device can access, but the DE has stricter requirements. If your buffers don’t meet them, nvCOMP falls back to SM-based decompression instead.
You can use cudaMalloc as usual for device-to-device decompression. For host-to-device or host-to-host decompression, use cudaMallocFromPoolAsync or cuMemCreate, but make sure to configure the allocators correctly.
How SM Performance Compares to DE
The DE offers faster decompression and frees the SMs for other tasks. The DE has dozens of execution units, while SMs have thousands of cores. Each DE unit is much faster at decompression, but in some cases a fully loaded SM can come close to DE speed. Both the SM and the DE can use host-pinned data as input, enabling zero-copy decompression.
The next figure shows how the SM and DE perform on the Silesia benchmark for the LZ4, Deflate, and Snappy algorithms. Snappy was newly optimized in nvCOMP 5.0, and there is further room to improve Deflate and LZ4 as well.
Performance was measured using 64 KiB and 512 KiB chunk sizes on both small and large datasets. The large dataset is the full Silesia corpus, and the small dataset is the first 50 MB of Silesia.tar.
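To make the chunking concrete: the input is split into independently compressed chunks (here 64 KiB, matching the smaller benchmark size), so each chunk is a standalone decompression task that an engine could process in parallel. A minimal CPU-only sketch using Deflate via Python's zlib:

```python
import zlib

CHUNK = 64 * 1024  # 64 KiB, matching the smaller benchmark chunk size

def compress_chunked(data: bytes, chunk_size: int = CHUNK) -> list[bytes]:
    """Split data into fixed-size chunks and Deflate-compress each one
    independently, so every chunk is a self-contained decompression task."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def decompress_chunked(chunks: list[bytes]) -> bytes:
    # Chunks are independent, so they could be decompressed in parallel.
    return b"".join(zlib.decompress(c) for c in chunks)

data = bytes(range(256)) * 1024          # 256 KiB of sample data
chunks = compress_chunked(data)
assert decompress_chunked(chunks) == data
print(len(chunks))  # 256 KiB / 64 KiB = 4 chunks
```

Smaller chunks expose more parallelism but typically compress slightly worse, which is why the benchmark reports both 64 KiB and 512 KiB results.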
Get started
The Decompression Engine in Blackwell helps solve one of the biggest problems in data-heavy workloads: fast, efficient decompression. By moving this job to a dedicated hardware block, decompression runs faster and frees up GPU resources for other tasks. Developers can take advantage of these improvements without changing their code, leading to better pipelines and better performance.
Source: Speeding Up Data Decompression with nvCOMP and the NVIDIA Blackwell Decompression Engine