Key Takeaways 

  • AWS and Cerebras are partnering to enable fast AI inference on Amazon Bedrock, launching soon.  
  • AWS Trainium handles prefill, and Cerebras CS-3 handles decode, delivering fast, efficient AI inference.  
  • AWS is the first cloud provider to deliver Cerebras’s disaggregated inference solution, a method that separates the AI processing steps onto different specialized hardware, exclusively via Amazon Bedrock.

AWS and Cerebras have announced a partnership to deliver fast AI inference for generative AI and LLMs. Amazon Bedrock will use AWS Trainium servers, Cerebras CS-3 systems, and Elastic Fabric Adapter (EFA) networking. AWS will add open-source LLMs and Amazon Nova to Cerebras hardware later this year.

“Inference is where AI delivers real value to customers, but speed remains a critical bottleneck for workloads such as real-time coding assistance and interactive applications,” said David Brown, Vice President, Compute and ML Services, AWS. “What we’re building with Cerebras solves that: by splitting the inference workload across Trainium and CS-3 and connecting them with Amazon’s Elastic Fabric Adapter, each system does what it’s best at. The result will be inference that’s an order of magnitude faster and higher performing than what’s available today.”

“Partnering with AWS to build a disaggregated inference solution will bring the fastest inference to a global consumer base,” said Andrew Feldman, founder and CEO of Cerebras Systems. “Every enterprise worldwide can benefit from blisteringly fast inference within its existing AWS environment.”

How It Works: Inference Disaggregation 

The Trainium and CS-3 solution uses inference disaggregation, which splits AI inference into two steps: prompt analysis (prefill) and output generation (decode). Prefill processes the input prompt in parallel, which requires substantial compute and moderate memory bandwidth. Decode produces the output one token at a time in a serial process, which needs little compute per step but high memory bandwidth. Decode usually takes most of the inference time because each output token is created one after another.
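
To make the two stages concrete, here is a minimal, runnable Python sketch of the prefill/decode split. ToyModel is a hypothetical stand-in for a real transformer, not AWS or Cerebras code; the point is the control flow, not the math.

```python
# Illustrative only: ToyModel is a hypothetical stand-in for a real
# transformer, not an AWS or Cerebras API.

class ToyModel:
    def prefill(self, prompt_tokens):
        # Prefill: process the whole prompt in one parallel pass and
        # return a KV cache (here, just the token history).
        # Compute-heavy, moderate memory bandwidth.
        return list(prompt_tokens)

    def decode_step(self, kv_cache):
        # Decode: emit one token using the full cache. Little compute
        # per step, but the entire cache is re-read every time, so
        # memory bandwidth dominates.
        next_token = sum(kv_cache) % 100  # fake "prediction"
        kv_cache.append(next_token)
        return next_token

def generate(model, prompt_tokens, max_new_tokens):
    kv_cache = model.prefill(prompt_tokens)        # stage 1: prefill
    output = []
    for _ in range(max_new_tokens):                # stage 2: serial decode
        output.append(model.decode_step(kv_cache))
    return output

print(generate(ToyModel(), [3, 14, 15], max_new_tokens=5))
```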

Since each stage has its own computing needs, each runs best on different hardware, connected by fast, high-bandwidth EFA networking that splits the inference process across them. Trainium can focus on prefill (analyzing inputs), and CS-3 can handle decode (step-by-step output generation), letting each part run as efficiently as possible.
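
A similarly hedged sketch shows what disaggregation adds: separate prefill and decode workers with the KV cache shipped between them. Here a Python queue stands in for the EFA link, and the two workers stand in for Trainium instances and CS-3 systems; none of this is actual AWS or Cerebras code.

```python
import queue
import threading

# Hypothetical disaggregated pipeline: the queue stands in for the EFA
# link; the workers stand in for Trainium (prefill) and CS-3 (decode).

link = queue.Queue()

def prefill_worker(prompts):
    # Compute-optimized side: build the KV cache for each prompt in one
    # parallel pass, then ship the cache across the "network".
    for prompt in prompts:
        kv_cache = list(prompt)              # toy prefill
        link.put((prompt, kv_cache))
    link.put(None)                           # end-of-stream marker

def decode_worker(max_new_tokens=4):
    # Bandwidth-optimized side: serial, token-by-token generation
    # against the cache received from the prefill side.
    while (item := link.get()) is not None:
        prompt, kv_cache = item
        output = []
        for _ in range(max_new_tokens):
            token = sum(kv_cache) % 100      # fake "prediction"
            kv_cache.append(token)
            output.append(token)
        print(prompt, "->", output)

producer = threading.Thread(target=prefill_worker, args=([[1, 2], [3, 4, 5]],))
producer.start()
decode_worker()
producer.join()
```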

The new solution is built on the AWS Nitro System, a combination of specialized hardware and software that helps create a secure, high-performance environment. This ensures that Cerebras CS-3 systems (for output generation) and Trainium-powered instances (for input analysis) offer the same security, isolation, and reliability that AWS customers expect.

AWS Trainium for Prefill and Cerebras CS-3 for Decode

Trainium is Amazon’s custom AI chip, designed for high performance and cost efficiency in both training (teaching AI models by exposing them to data) and inference (using AI models to generate answers or predictions based on input data) across many generative AI tasks (such as writing, coding, and image creation). Leading AI labs like Anthropic and OpenAI use Trainium. Anthropic has chosen AWS as its main training partner and uses Trainium for its models. OpenAI will use 2 gigawatts of training capacity through AWS to support stateful runtime environments (systems that remember previous interactions), frontier models (next-generation AI models), and other advanced workloads. Since its launch, Trainium 3 has been widely adopted by organizations across industries.

Cerebras CS-3 is the world’s fastest AI inference system, offering much higher memory bandwidth than the fastest GPU. As reasoning models now handle most inference tasks and generate more tokens per request, speeding up this part of the workflow has become increasingly important. Companies like OpenAI, Cognition, and Mistral use Cerebras to speed up their toughest workloads, especially agentic coding, where fast inference is key to improving productivity.

CS-3 speeds up decode to deliver tokens faster. Trainium manages prefill, CS-3 handles decode, and EFA networking connects them, maximizing each system’s strengths.

About Amazon Web Services 

Amazon Web Services (AWS) focuses on customers, innovation, operational excellence, and long-term goals. For almost 20 years, AWS has made technology and cloud computing available to organizations of all sizes and industries, becoming one of the fastest-growing tech companies ever. Millions of customers use AWS to innovate, grow, and shape the future with broad AI capabilities and a global network. Amazon helps people turn big ideas into reality. Learn more at aws.amazon.com or follow @AWSNewsroom.  

About Cerebras Systems 

Cerebras Systems builds the world’s fastest AI infrastructure. Our team includes computer architects, scientists, AI researchers, and engineers who work together to make AI extremely fast through new ideas, believing that faster AI can change the world. Our core technology, the Wafer-Scale Engine (WSE-3), is the largest and fastest AI processor, 56 times bigger than the largest GPU. It uses less power per unit of compute and delivers inference and training over 20 times faster than alternatives. Top companies, research institutes, and governments across four continents use Cerebras solutions for their AI needs. Our solutions are available both on-premises and in the cloud. For more information, visit cerebras.ai or follow us on LinkedIn, X, or Threads.

Source: AWS and Cerebras collaboration aims to set a new standard for AI inference speed and performance in the cloud 

