Running Open Source LLMs on AWS: Bedrock, SageMaker, EKS, and EC2

Written by Hector Zelaya | Jun 4, 2026

Open source LLMs have grown up. Models like Llama 4, Mistral, and DeepSeek now go toe-to-toe with commercial options like GPT, Gemini, and Claude on real-world tasks like code generation, summarization, document analysis, and customer support. And AWS gives teams more paths to get there than ever before.

AWS gives teams four distinct paths to run open source LLMs in production: Amazon Bedrock, Amazon SageMaker, Amazon EKS, and Amazon EC2. Think of them as a spectrum. On one end, you trade control for simplicity. On the other, you get full ownership of the infrastructure at the cost of managing the OS, drivers, scaling, and everything in between.

In this post, we’ll walk through this spectrum of approaches for running open source LLMs on AWS — from a fully managed API you can call in five lines of Python, all the way to raw GPU instances where you own every layer of the stack. We’ll include working code for each one so you can get your hands dirty.

Open Source LLMs vs Commercial APIs

Choosing open source over commercial APIs isn’t always the right call. But when it is, the reasons are compelling:

  • Data sovereignty and compliance. If you’re in healthcare, finance, or government, sending patient records or classified documents to a third-party API might not be an option. With a self-hosted open source model, your data never leaves your environment.
  • Cost at scale. Per-token API pricing adds up fast at high volume. Once you cross a certain threshold, the economics of running your own inference stack become difficult to ignore.
  • Customization. An open source LLM gives you full control over its inner workings. Fine-tuning a model on your proprietary data (your codebase, your legal documents, your product catalog) gives you something no generic API can match.
  • Latency control. Owning the infrastructure lets you size and tune it for predictable response times instead of hoping the provider isn’t having a bad day.
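To make the “cost at scale” point concrete, here is a rough break-even sketch. The prices are hypothetical placeholders, not real AWS or vendor rates, and the math assumes one always-on instance can actually serve the volume. It also ignores engineering and operations time, so treat the result as a lower bound.

```python
def monthly_api_cost(tokens_per_day: float, price_per_million_tokens: float, days: int = 30) -> float:
    """Monthly cost of a pay-per-token API at a given daily volume."""
    return tokens_per_day / 1_000_000 * price_per_million_tokens * days


def monthly_self_hosted_cost(hourly_instance_price: float, instances: int = 1, days: int = 30) -> float:
    """Monthly cost of keeping GPU instances running around the clock."""
    return hourly_instance_price * 24 * days * instances


def break_even_tokens_per_day(price_per_million_tokens: float,
                              hourly_instance_price: float,
                              instances: int = 1) -> float:
    """Daily token volume at which the API and the instances cost the same."""
    daily_instance_cost = hourly_instance_price * 24 * instances
    return daily_instance_cost / price_per_million_tokens * 1_000_000


# Hypothetical prices, for illustration only. Plug in your own numbers.
API_PRICE = 1.00        # USD per million tokens (assumed)
INSTANCE_PRICE = 10.00  # USD per hour for one GPU instance (assumed)

threshold = break_even_tokens_per_day(API_PRICE, INSTANCE_PRICE)
print(f"Break-even: {threshold:,.0f} tokens/day")
```

With these made-up numbers the crossover sits at 240 million tokens per day. Below that, the API is cheaper. Above it, the self-hosted instance starts to pay for itself.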

When commercial still wins: If you’re a small team shipping an MVP, if your volume is low, or if you need the absolute frontier of reasoning quality, proprietary APIs like GPT-4o or Claude remain hard to beat. There’s no shame in using the right tool for the job, and as you will see shortly, AWS also has you covered in this regard.

AWS Deployment Options: From Managed to DIY

The right starting point depends on your team’s skills, your compliance requirements, and how much infrastructure you want to own. Here’s how the four options line up:

Less Complexity ◄─────────────────────────────────► More Complexity
Little/Nothing to Manage                                     More Control

  Amazon Bedrock   Amazon SageMaker     Amazon EKS             Amazon EC2
  (Managed API)    (ML Platform)        (Containers)           (Raw Compute)

AgilityFeat’s recommendation: start simple and graduate as your needs grow. There’s no prize for over-engineering on day one. Begin with Bedrock, and move down the spectrum only when you hit a real limitation — a model that isn’t available, a security/performance requirement you can’t meet, or a cost curve that no longer makes sense.

Here’s a quick decision guide:

  1. Is the model on Bedrock, and is your volume under 100M tokens/day? → Use Bedrock.
  2. Need custom fine-tuning or ML platform features (A/B testing, experiments)? → Use SageMaker.
  3. Already running Kubernetes and want the LLM to be just another microservice? → Use EKS.
  4. Need maximum cost control, custom hardware, or full OS-level compliance? → Use EC2.
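If it helps to see the guide as logic, here is a minimal sketch of the four questions as a function. The `Workload` fields and the `choose_service` name are made up for illustration. Two things are assumptions on our part rather than rules from the guide: the questions are checked in order with the first match winning, and the fallback is SageMaker as the next step along the spectrum.

```python
from dataclasses import dataclass

BEDROCK_DAILY_TOKEN_LIMIT = 100_000_000  # the 100M tokens/day threshold from question 1


@dataclass
class Workload:
    model_on_bedrock: bool
    tokens_per_day: int
    needs_fine_tuning_or_ml_platform: bool = False
    already_on_kubernetes: bool = False
    needs_os_level_control: bool = False  # cost control, custom hardware, OS-level compliance


def choose_service(w: Workload) -> str:
    """Walk the four questions in order and return the first match."""
    if w.model_on_bedrock and w.tokens_per_day < BEDROCK_DAILY_TOKEN_LIMIT:
        return "Amazon Bedrock"
    if w.needs_fine_tuning_or_ml_platform:
        return "Amazon SageMaker"
    if w.already_on_kubernetes:
        return "Amazon EKS"
    if w.needs_os_level_control:
        return "Amazon EC2"
    # Nothing matched: assume the next managed step after Bedrock.
    return "Amazon SageMaker"


print(choose_service(Workload(model_on_bedrock=True, tokens_per_day=5_000_000)))
print(choose_service(Workload(model_on_bedrock=False, tokens_per_day=5_000_000,
                              already_on_kubernetes=True)))
```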

Hybrid strategies are perfectly valid, too! Use Bedrock for low-volume, simple tasks and self-hosted infrastructure for high-volume or latency-sensitive workloads. There’s no rule that says you have to pick just one.
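A hybrid setup can be as simple as a thin router in front of two backends. The sketch below is one possible shape, not a prescribed design: `HybridRouter` and the workload names are hypothetical, and the backends are stubs so the example runs without AWS access. In practice you would swap in a Bedrock call and a call to your self-hosted endpoint.

```python
from typing import Callable, Set

Backend = Callable[[str], str]  # takes a prompt, returns the completion text


class HybridRouter:
    """Send each request to Bedrock or to a self-hosted endpoint by workload name."""

    def __init__(self, bedrock: Backend, self_hosted: Backend, self_hosted_workloads: Set[str]):
        self.bedrock = bedrock
        self.self_hosted = self_hosted
        # Workloads that are high-volume or latency-sensitive go to your own infrastructure.
        self.self_hosted_workloads = self_hosted_workloads

    def backend_for(self, workload: str) -> str:
        return "self-hosted" if workload in self.self_hosted_workloads else "bedrock"

    def complete(self, workload: str, prompt: str) -> str:
        if self.backend_for(workload) == "self-hosted":
            return self.self_hosted(prompt)
        return self.bedrock(prompt)


# Stub backends for illustration only.
router = HybridRouter(
    bedrock=lambda prompt: f"[bedrock] {prompt}",
    self_hosted=lambda prompt: f"[self-hosted] {prompt}",
    self_hosted_workloads={"support-chat", "bulk-summarization"},
)
print(router.complete("support-chat", "Where is my order?"))
print(router.complete("weekly-report", "Summarize this week's tickets"))
```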

Hands-On: Deploying Open Source LLMs on AWS

Let’s get hands-on. For each approach, here’s a working snippet that gets you from zero to inference.

Amazon Bedrock — Minutes to First Inference

Bedrock is the “I just want it to work” option. No instances to provision, no drivers to install, no model weights to download. You request access to a model, and within minutes you’re making API calls.

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.converse(
    modelId="us.meta.llama3-1-70b-instruct-v1:0",
    messages=[{"role": "user", "content": [{"text": "Explain quantum computing"}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.7}
)
print(response["output"]["message"]["content"][0]["text"])

That’s it. Five lines and you’re running Llama 3.1 70B. Bedrock also gives you built-in guardrails for content filtering and PII detection — features that would take weeks to build on raw infrastructure. 
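To use a guardrail, you attach it to the same `converse` call through the `guardrailConfig` parameter. The sketch below assumes you have already created a guardrail in Bedrock. The helper names and the `<your-guardrail-id>` placeholder are ours. `boto3` is imported inside `ask` so the request builder can be inspected without AWS credentials.

```python
import json

MODEL_ID = "us.meta.llama3-1-70b-instruct-v1:0"  # same model as the snippet above


def build_converse_request(prompt, model_id=MODEL_ID, guardrail_id=None, guardrail_version="DRAFT"):
    """Build keyword arguments for bedrock-runtime's converse call."""
    request = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.7},
    }
    if guardrail_id:
        # Attach an existing guardrail; the ID comes from your own account.
        request["guardrailConfig"] = {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,
        }
    return request


def ask(prompt, **kwargs):
    """Call Bedrock and return the generated text (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.converse(**build_converse_request(prompt, **kwargs))
    return response["output"]["message"]["content"][0]["text"]


# Inspect the request without calling AWS. "<your-guardrail-id>" is a placeholder.
print(json.dumps(build_converse_request("Explain quantum computing",
                                        guardrail_id="<your-guardrail-id>"), indent=2))
```

When a guardrail blocks or rewrites a response, the `stopReason` field of the response should say so, which makes it a handy field to log.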

Amazon Bedrock also gives you access to commercial models, such as those from Anthropic and OpenAI, so you can think of it as a sort of “commercial API within your own environment.”

Best for: Prototypes, low-to-moderate volume, teams without ML infrastructure expertise.

Limitations: No GPU control, limited fine-tuning options, no latency SLA, only models in the catalog.

Amazon SageMaker — The ML Platform Sweet Spot

SageMaker gives you the control that Bedrock doesn’t. You pick the model, the instance type, and the inference engine while AWS still handles the infrastructure orchestration. It’s like moving from an apartment to a house: more responsibility, but you can finally knock down walls if you want.

Deploy the endpoint (run once, takes ~10 min):

from sagemaker.jumpstart.model import JumpStartModel

# Using Llama 3.2 1B on a single-GPU instance for quick testing (~$1.50/hr).
# For production, you'd want to use something like
# "meta-textgeneration-llama-3-1-70b-instruct" on ml.p4d.24xlarge, 
# depending on your use case
model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-2-1b-instruct",
    role="arn:aws:iam::<your-account-id>:role/<your-sagemaker-role>",
    instance_type="ml.g5.xlarge"
)
predictor = model.deploy(accept_eula=True)

Query the endpoint:

import json, boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="<your-endpoint-name>",  # from deploy output
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Explain quantum computing in one paragraph",
        "parameters": {"max_new_tokens": 256}
    })
)
print(json.loads(response["Body"].read()))

Clean up when done (~$1.50/hr while running):

# Stop billing immediately
aws sagemaker delete-endpoint --endpoint-name <your-endpoint-name>
# Optional: remove leftover config (no cost, but keeps your account tidy)
aws sagemaker delete-endpoint-config --endpoint-config-name <your-endpoint-name>
aws sagemaker delete-model --model-name <your-model-name>

Under the hood, you can choose between vLLM and TensorRT-LLM as your inference engine, configure tensor parallelism, and set up A/B testing between model versions with native traffic splitting.
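Traffic splitting is configured through production variants on the endpoint config. Here is a minimal sketch, assuming both models already exist in SageMaker. The helper names, variant names, and model-name placeholders are hypothetical. The `ProductionVariants` structure and `create_endpoint_config` call are the standard boto3 ones.

```python
from typing import Dict, List, Tuple


def production_variants(variants: List[Tuple[str, str, float]],
                        instance_type: str = "ml.g5.xlarge") -> List[dict]:
    """Build the ProductionVariants list; each tuple is (variant_name, model_name, weight)."""
    return [
        {
            "VariantName": variant_name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }
        for variant_name, model_name, weight in variants
    ]


def traffic_share(variants: List[dict]) -> Dict[str, float]:
    """Each variant receives weight / sum(weights) of the requests."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


def create_ab_config(config_name: str, variants: List[dict]) -> None:
    """Create the endpoint config in SageMaker (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    boto3.client("sagemaker").create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=variants,
    )


# Placeholder model names: send roughly 90% of traffic to the current model.
variants = production_variants([
    ("llama-current", "<your-current-model-name>", 9.0),
    ("llama-candidate", "<your-candidate-model-name>", 1.0),
])
print(traffic_share(variants))
```

Weights are relative, so 9 and 1 behave the same as 0.9 and 0.1.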

Best for: ML teams, custom fine-tuning with full control (LoRA, FSDP, custom loss functions), model experimentation.

Limitations: ~20% cost markup over raw EC2, cold starts of 10-15 minutes for large models.

Amazon EKS — Kubernetes-Native LLM Serving

If your team already runs Kubernetes, deploying an LLM is just another workload. You get GPU-aware scheduling, Karpenter for dynamic node provisioning, and native integration with your existing service mesh and monitoring.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels: {app: llm-inference}
  template:
    metadata:
      labels: {app: llm-inference}
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: ["vllm serve meta-llama/Llama-3.2-1B-Instruct --max-model-len 4096 --port 8000"]
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            cpu: "2"
            memory: "8Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5

For production with a 70B model on 8 GPUs, you’d change the serve command to include --tensor-parallel-size 8, request nvidia.com/gpu: 8, and use a larger instance like g5.48xlarge.
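Why 8 GPUs? A quick back-of-the-envelope estimate shows it. The sketch assumes 16-bit weights (2 bytes per parameter) split evenly across GPUs, and counts only the weights. The KV cache and activations need room on top, so treat this as a floor, not a sizing tool. The 24 GB figure is the memory of the NVIDIA A10G GPUs in g5 instances.

```python
GPU_MEMORY_GB = 24  # NVIDIA A10G, the GPU in g5 instances


def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (using 1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def per_gpu_weight_gb(params_billion: float, tensor_parallel_size: int,
                      bytes_per_param: int = 2) -> float:
    """Weights per GPU when tensor parallelism shards them evenly."""
    return weight_memory_gb(params_billion, bytes_per_param) / tensor_parallel_size


def weights_fit(params_billion: float, tensor_parallel_size: int,
                gpu_memory_gb: float = GPU_MEMORY_GB) -> bool:
    """True if each GPU's shard of the weights is smaller than its memory."""
    return per_gpu_weight_gb(params_billion, tensor_parallel_size) < gpu_memory_gb


for tp in (1, 4, 8):
    shard = per_gpu_weight_gb(70, tp)
    verdict = "fits" if weights_fit(70, tp) else "does not fit"
    print(f"tensor-parallel-size {tp}: {shard:.1f} GB of weights per GPU, {verdict} in {GPU_MEMORY_GB} GB")
```

A 70B model needs about 140 GB for weights alone, or 17.5 GB per GPU across 8 GPUs. That leaves only a few gigabytes per GPU for the KV cache, so expect to keep the context length modest on this instance type.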

Best for: Platform teams already on K8s, microservices architectures, multi-model routing, cloud portability.

Limitations: Requires Kubernetes expertise. If you don’t already run K8s, this isn’t the place to start learning.

Amazon EC2 — Full Stack Ownership

EC2 is the “there is no spoon” option. You control everything: the OS, the drivers, the inference framework, the networking. It’s the most work, but also the most flexibility.

python3 -m venv ~/vllm-env && source ~/vllm-env/bin/activate
pip install vllm
export HF_TOKEN=hf_your_token_here
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --max-model-len 4096 \
  --port 8000

Four commands and you’re serving a 1B model. For production, you’d want to use a 70B (depending on the use case), wrap it in a systemd service, put an NLB in front, and configure an Auto Scaling Group with warm pools so new instances come up in 2 minutes instead of 15.
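Once the server is up, you can talk to it through vLLM’s OpenAI-compatible API. Here is a minimal client sketch using only the standard library. It assumes the server is reachable at http://localhost:8000, for example from the instance itself or through an SSH tunnel. The helper names are ours.

```python
import json
import urllib.request

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # must match the --model passed to the server


def build_chat_request(prompt: str, model: str = MODEL,
                       base_url: str = "http://localhost:8000",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server running:
#   print(chat("Explain quantum computing in one paragraph"))
```

The same client works against the EKS deployment if you forward the port first, for instance with kubectl port-forward deploy/llm-inference 8000:8000.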

You also get access to AWS’s cost-optimized Inferentia2 chips, which deliver ~40% better price-performance for supported models. And you can use Spot instances for non-critical workloads to cut costs by 60-70%.

Best for: Maximum cost optimization, strict compliance (full OS audit trail), custom hardware, research.

Limitations: You own patching, scaling, monitoring — everything. With great power comes great operational burden.

Build for Where You’re Going

All four approaches in this post work. Bedrock, SageMaker, EKS, EC2 — each is a legitimate production path with real teams running real workloads on it today. Start with what fits your team now, and graduate to more control only when you need it. The goal is to grow out of each stage gracefully, not to be future-proof on day one.

As an AWS partner specializing in production systems, AgilityFeat helps teams make those decisions well. We offer assessments to help you select the right approach, and can also work with you to configure inference engines, scale infrastructure, and tune for performance. Whether you’re just getting started with Bedrock or ready to run a fleet of GPU instances on EKS, let’s talk about what you’re building.

About the author

Hector Zelaya

Hector is a Computer Systems Engineer specializing in DevOps, WebRTC, and AI. He has been part of the AgilityFeat/WebRTC.ventures team since 2016. Hector is a member of the AWS Community Builder Program and an AWS-Certified DevOps Engineer. He has presented at numerous conferences and is a frequent author of technical blog posts. Outside of work, Hector is a happy husband, proud father, hobbyist musician, and gamer.
