Product Launches

Google DeepMind's Gemma 4 Models Arrive on Amazon Bedrock: 31B, 26B MoE & E2B Explained

AWS Machine Learning Blog
Jun 16, 202613 min read2 views
+1
Google DeepMind's Gemma 4 Models Arrive on Amazon Bedrock: 31B, 26B MoE & E2B Explained

Amazon Bedrock now hosts Google DeepMind's Gemma 4 models, including 31B, 26B MoE, and E2B variants with reasoning mode, multimodal support, and enterprise-grade security.

Three open-weight models. A Mixture-of-Experts architecture that costs like a small model but thinks like a large one. And full enterprise privacy controls on AWS infrastructure. Gemma 4 just became significantly easier to deploy at scale.


Introduction

For enterprises building production AI systems, open-weight models present a persistent tension. The appeal is clear — full transparency into the architecture, the ability to fine-tune on proprietary data, and freedom from vendor lock-in. The problem is everything that comes after: provisioning infrastructure, managing inference stacks, ensuring data privacy compliance, and handling traffic spikes without dropping requests.

On June 15, 2026, AWS announced that the Gemma 4 family — built by Google DeepMind and released under the Apache 2.0 open-weight license — is now available on Amazon Bedrock. The partnership removes the infrastructure burden entirely. Teams get Google DeepMind's latest open models through a fully managed AWS service, with the same data protection and compliance controls Bedrock customers already rely on.


Quick Summary

DetailInfoModel familyGemma 4 (Google DeepMind)LicenseApache 2.0 (open-weight)Announced on BedrockJune 15, 2026Variants availableGemma 4 31B, Gemma 4 26B-A4B, Gemma 4 E2BEndpointbedrock-mantleSDK compatibilityOpenAI Python and TypeScript SDKsShared capabilitiesReasoning mode, native function calling, text and image inputLaunch regionsUS East N. Virginia, US East Ohio, US West Oregon, Europe FrankfurtData privacyPrompts not used for training, content not shared with third parties


The Three Gemma 4 Variants on Bedrock

Google DeepMind designed the Gemma 4 family around a single principle: intelligence per parameter — squeezing the most usable capability out of the fewest active parameters. The three variants on Bedrock reflect that philosophy at different cost and performance points.

Complete Specifications at a Glance

ModelArchitectureTotal ParametersActive per TokenContext WindowBest ForGemma 4 31BDense30.7B30.7B256K tokensReasoning, complex codingGemma 4 26B-A4BMixture-of-Experts25.2B3.8B256K tokensHigh throughput, cost efficiencyGemma 4 E2BDense with PLE5.1B2.3B effective128K tokensLatency-sensitive, edge-style


Gemma 4 31B — The Dense Flagship

Model ID: google.gemma-4-31b Architecture: Dense Parameters: 30.7B Context window: 256K tokens

Gemma 4 31B is the largest and most capable model in the family on Bedrock. It is a traditional dense architecture — all 30.7 billion parameters activate for every request — designed for workloads that need maximum reasoning depth and coding quality from a single model.

On the Artificial Analysis Intelligence Index, Gemma 4 31B scores 39. The median score for models in the 4B to 40B open-weight class is 15. That gap — 39 versus a field median of 15 — is the clearest single-number representation of Gemma 4 31B's intelligence-per-parameter advantage over its size tier.

The 256K token context window makes it suitable for tasks involving long documents, extended multi-turn conversations, large codebases, or complex multi-step agentic workflows where context continuity matters.


Gemma 4 26B-A4B — The Mixture-of-Experts Model

Model ID: google.gemma-4-26b-a4b Architecture: Mixture-of-Experts (MoE) Total parameters: 25.2B Active parameters per token: 3.8B Context window: 256K tokens

The 26B-A4B variant is where the Gemma 4 family gets architecturally interesting. Mixture-of-Experts is a design in which a model holds a large pool of specialized sub-networks — called experts — but activates only a small subset for each token during inference. The 26B-A4B has 25.2 billion total parameters, but only 3.8 billion activate per token. That means the compute cost and latency of each request is roughly equivalent to running a 4-billion-parameter dense model, while the breadth of knowledge encoded in the full 25.2B parameter pool remains available.

The practical implication: organizations running high-throughput workloads that need genuine knowledge breadth — document understanding, multilingual tasks, domain-spanning analysis — can do so at a cost profile that makes production scale financially viable. The 256K token context window matches the 31B model, making it equally suited for long-form tasks.


Gemma 4 E2B — The Compact Speed-Optimized Variant

Model ID: google.gemma-4-e2b Architecture: Dense with Per-Layer Embeddings (PLE) Total parameters: 5.1B Effective parameters: 2.3B Context window: 128K tokens

The E2B variant uses a technique called Per-Layer Embeddings (PLE) to keep its effective parameter count at 2.3 billion out of a total 5.1 billion — reducing both memory requirements and compute cost per inference call. The 128K token context window is smaller than the other two variants but still covers most real-world document and conversation lengths.

E2B is the right choice when response speed takes priority over analytical depth — edge-style deployments, real-time classification, latency-sensitive customer interfaces, and multimodal tasks where a fast answer matters more than a deeply reasoned one.

One important configuration note: AWS recommends setting reasoning effort to high for this variant specifically. The E2B model tends to reason extensively by default, and the high effort setting keeps that thinking in the dedicated reasoning channel — improving output quality and preventing reasoning text from appearing in the final answer.


What All Three Variants Share

Despite the architectural differences, every Gemma 4 model on Bedrock shares a common capability set and API surface — meaning teams can build once and switch between variants based on workload requirements.

Built-in reasoning mode: All three variants can emit an explicit internal thought process before delivering a final answer. This is controllable per request — useful for complex multi-step tasks where showing the reasoning matters, but optional for simpler queries where it would just add latency. Reasoning effort has three levels: low, medium, and high.

Native function calling: All variants support structured tool calling for agentic workflows. A model can receive a set of tool definitions, decide which tool to call, pass the correct arguments, receive the result, and incorporate it into a final response — the complete loop needed for autonomous agent behavior.

Multimodal input: Every variant accepts both text and images as input. Images can be passed as inline base64-encoded data or as Amazon S3 URLs. Public HTTPS image URLs are not supported. Google DeepMind recommends placing image content before text in the prompt for best results.

Language support: All models support over 35 languages out of the box, with pre-training spanning more than 140 languages.

Fine-tuning capability: Because Gemma 4 is open-weight under Apache 2.0, teams can fine-tune any variant on proprietary data — an option not available with closed-source models.


The Architecture Behind the Long Context Window

Handling 256K tokens without ballooning memory and compute costs is a genuine engineering challenge. Gemma 4 addresses it through hybrid attention — an architecture that interleaves local attention (which processes nearby tokens efficiently) with global attention (which maintains relationships across the full context). The combination keeps memory footprint small while sustaining coherent reasoning across very long inputs.

This matters practically for teams building document understanding pipelines, multi-document analysis tools, or agentic systems that accumulate large amounts of conversation and tool-call history. The 256K window on both the 31B and 26B-A4B models handles these workloads without requiring external chunking or retrieval engineering.


How to Access Gemma 4 on Bedrock

The bedrock-mantle Endpoint

Gemma 4 models on Amazon Bedrock are accessed through a dedicated endpoint called bedrock-mantle, which is built on a next-generation inference engine designed with Model Deployment Account isolation and zero operator access. The engine itself is the infrastructure; bedrock-mantle is the API surface developers call.

The endpoint URL format is: https://bedrock-mantle.{region}.api.aws/openai/v1

It exposes two APIs: Chat Completions and the Responses API. Both follow the same interface as the OpenAI Python and TypeScript SDKs — teams already using those SDKs to call other models can switch to Gemma 4 on Bedrock by updating only the base URL and model ID. No other code changes are needed.

Choosing Between Chat Completions and Responses API

Chat Completions is the right choice for multi-turn conversations and agentic workflows involving client-side tool-calling loops. It accepts a structured messages list and is the more familiar of the two interfaces for most developers.

The Responses API uses a single input field and returns a top-level output_text, making it simpler for single-turn generation. It is also the only way to access the reasoning output — when reasoning mode is enabled, the model returns its thought process as a separate item alongside the final answer. This keeps reasoning visible for inspection and auditing without mixing it into the response text.

IAM Permissions Required

Two managed IAM policies cover different access levels. AmazonBedrockMantleInferenceAccess grants read and inference creation permissions — everything needed to call the models. AmazonBedrockMantleFullAccess covers the full action set including project management, fine-tuning, and custom model operations.

API Keys

The bedrock-mantle endpoint supports Amazon Bedrock API keys. For production workloads, short-term keys are recommended — they expire automatically after a maximum of 12 hours and inherit the permissions of the IAM role that generated them. Credentials should be stored in AWS Secrets Manager or AWS Systems Manager Parameter Store rather than environment variables.


Reasoning Mode: How It Works and What to Watch

When reasoning mode is enabled, Gemma 4 works through a problem step by step before producing its final answer — similar to how a person might write out their logic before committing to a conclusion. The thought process comes back as a separate output item, not embedded in the final response text.

Three effort levels are available per request: low, medium, and high. Higher effort means more thorough reasoning at the cost of additional latency and token usage.

One critical rule for multi-turn conversations: send back only the final answers from previous turns, not the reasoning items. Feeding prior reasoning back into the model degrades response quality. Teams can still log and audit the reasoning locally — the instruction is to strip it from the conversation history sent on the next API call, not to discard it entirely.


Service Tiers: Matching Cost to Workload

Every Gemma 4 variant on Bedrock is available across three service tiers, which can be mixed within the same application.

TierBest WorkloadsCost vs StandardKey CharacteristicsPriorityReal-time agents, customer-facing interfacesPremiumUp to 25% better output tokens per second, processed firstStandardEveryday tasks, content generation, analysisBaselineDefault tier, consistent performanceFlexBackground jobs, evaluations, batch summarizationDiscountedHigher latency during peak, processed after Standard

Priority tier does not require upfront reservations or commitments — it is activated per request by setting a service_tier parameter. This lets applications route latency-sensitive requests to Priority and background work to Flex within the same codebase, optimizing cost without architectural changes.


Scaling and Traffic Management

The bedrock-mantle endpoint has no requests-per-minute quota. Instead, it is governed by per-model, per-region token limits — separate input-tokens-per-minute and output-tokens-per-minute caps. Gemma 4 models do not currently have published per-account quotas in the Service Quotas console; throughput is managed through internal service capacity.

Two error codes matter for production operations:

HTTP 429 means a token-per-minute quota has been exceeded. The correct response is to reduce the submission rate and retry with exponential backoff. Sustained 429 errors can be addressed by requesting a quota increase through AWS Support.

HTTP 503 means regional capacity for the model is under pressure. Occasional 503s are handled by exponential backoff with a bounded retry count. Sustained 503s require reducing the request rate.

How to Ramp Traffic Safely

Sudden large increases in request volume are more likely to trigger 503 errors than gradual ramps. AWS recommends a structured ramp procedure: start at the target rate, and if 503s appear, reduce by 50% and keep reducing until requests succeed consistently. Hold at that rate for 15 minutes before increasing by 50% again. Repeat until the target volume is reached.

As a concrete example: targeting 2,000 requests per minute and hitting 503s means dropping to 1,000, then to 500 if errors persist. Once 500 is stable for 15 minutes, step to 750, then 1,125, and so on. Skipping the hold period at each step turns every increment into a fresh load test rather than a controlled ramp.


Prompt Caching: Automatic Latency Reduction

All Gemma 4 models on Bedrock support implicit prompt caching, which activates automatically with no code changes or cache markers required. When consecutive requests share a common prompt prefix, the model can reuse its cached internal state instead of recomputing from scratch — reducing latency on the matching tokens.

Prompt caching works across all three service tiers and is particularly effective for workloads with stable prefixes: multi-turn agents that reuse the same system prompt, RAG pipelines that include the same source documents, and long-context analysis tasks that reference consistent instructions. The practical guidance is to place static content at the beginning of the prompt and dynamic content at the end — this maximizes the portion of each request that can benefit from a cache hit.


Data Privacy and Security

AWS has been explicit on the data handling terms: prompts and completions sent to Gemma 4 models on Bedrock are not used to train any models, and content is not shared with third parties. Inference runs entirely on infrastructure operated by AWS. The bedrock-mantle engine is built with Model Deployment Account isolation and zero operator access — meaning AWS operators cannot access model weights or inference traffic.

For organizations in regulated industries where data residency and audit requirements are non-negotiable, these terms are the practical argument for running open-weight models through a managed cloud service rather than self-hosted infrastructure.


Availability and Pricing

At launch, Gemma 4 models are available in four AWS regions: US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Frankfurt).

Pricing is per token and varies by model variant and service tier. AWS has not published specific per-token rates in the announcement — current pricing is available at the Amazon Bedrock pricing page.


Why This Matters for Enterprise AI Teams

The Gemma 4 launch on Bedrock sits at the intersection of two trends that have been reshaping enterprise AI adoption.

The first is the maturing of open-weight models. The intelligence-per-parameter gap between the best open models and the best proprietary models has narrowed substantially. Gemma 4 31B scoring 39 on the Artificial Analysis Intelligence Index — against a class median of 15 — is evidence of how far open-weight models have come relative to their size tier. Teams that previously felt they had to choose between open-model flexibility and frontier-level capability now have a more competitive option.

The second is the demand for managed infrastructure. Fine-tuning and deploying open-weight models self-hosted is technically feasible but operationally expensive. Engineering time spent on inference stacks, auto-scaling, security hardening, and compliance documentation is time not spent on applications. Bedrock removes that operational layer, making the practical argument for open-weight models stronger in enterprise contexts where infrastructure overhead is a real cost.

The Apache 2.0 license means Gemma 4 can be used commercially without royalty obligations, fine-tuned on proprietary data without restrictions, and benchmarked independently — all of which matter to organizations that need full visibility into what they are deploying and full control over how it evolves.


Final Takeaway

Gemma 4 on Amazon Bedrock gives enterprise teams three meaningfully different models under one managed API: a dense flagship for maximum reasoning quality, a Mixture-of-Experts variant that delivers large-model knowledge at small-model cost, and a compact speed-optimized model for latency-critical applications. All three share the same interface, the same reasoning mode, the same native function calling, and the same multimodal input support.

For teams that have been waiting for open-weight models to reach a capability level worth deploying in production — and for infrastructure that removes the operational burden of doing so — this combination is worth evaluating seriously.


Source: AWS Machine Learning Blog — aws.amazon.com/blogs/machine-learning/introducing-gemma-4-models-on-amazon-bedrock — Published June 15, 2026

Original Source

This analysis is based on reporting from AWS Machine Learning Blog.

View on AWS Machine Learning Blog
Share:

📌 Related Posts

What do you think?
+1
Share:

Comments

Leave a comment

0/2000