Deep Dive: Hardened Private AI Inference on AWS (Open WebUI + Ollama + Optional vLLM)

How CoreNova engineers a production-grade Private AI Sandbox on Ubuntu GPU EC2—dynamic EBS mounts, anti-hijack admin bootstrap, Ollama-first chat, and optional multi-GPU vLLM—so teams skip days of integration work.

Tags: AWS Marketplace · AMI · EC2 · GPU · Open WebUI · Ollama · vLLM · AI Inference

Deploying large language models inside an enterprise VPC is no longer experimental—it is a compliance and data-sovereignty requirement. Frameworks like Open WebUI, Ollama, and vLLM make local AI look simple on a laptop. Moving the same stack onto hardened AWS GPU infrastructure surfaces three production bottlenecks that a naive docker compose up will not solve:

Engine choice and GPU utilization: Chat should work out of the box on a single-GPU g6.xlarge, while optional high-throughput serving must scale tensor parallelism on a 4-GPU g5.12xlarge without hand-editing flags.
EBS device path randomization: Model weights quickly exceed root volume capacity. On Nitro instances, secondary volumes appear as unpredictable /dev/nvme* paths—you cannot hardcode /dev/sdb.
Public admin hijacking: Open WebUI can allow public signup on first launch, and the first account registered becomes the administrator. If a crawler reaches your public IP before you register, it controls the stack.

This post walks through how we built Enterprise Secure AI Sandbox—now available on AWS Marketplace—on top of our Ubuntu 22.04 LTS Hardened GPU AMI.

Architecture overview

The stack isolates AI services in /opt/corenova/ai/ while OS-level hardware and storage setup runs before containers start.

       [ Public / VPC traffic ]
                  │  HTTPS 443 (Nginx reverse proxy)
                  ▼
         [ Open WebUI ]  signup=false · admin = Instance ID
                  │
                  ▼  localhost:11434
         [ Ollama ]  DEFAULT chat engine · qwen3:0.6b on first boot
                  │
                  ▼
   [ /mnt/models ]  auto-mounted secondary EBS (by-id, not /dev/sdX)

Optional (off by default):
         [ vLLM OpenAI API ]  localhost:8000 · auto tensor-parallel-size

Default path: Ollama serves all chat. Optional path: sudo systemctl start corenova-ai-vllm enables vLLM with dynamic --tensor-parallel-size. On a single GPU, do not run Ollama and vLLM concurrently.

1. Boot-order orchestration: GPU runtime before Compose

Packer builds on non-GPU builders, so the NVIDIA Container Toolkit runtime is configured on first GPU boot, not at image bake time:

# configure-docker-gpu.sh (excerpt)
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
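
One way to confirm that the runtime registration took effect before Compose starts is to look for an nvidia entry in Docker's runtime list. The following is a minimal sketch rather than part of the shipped scripts; the function name docker_has_nvidia_runtime is hypothetical.

```shell
# Hypothetical helper: succeeds if Docker reports a runtime named "nvidia".
# `docker info --format '{{json .Runtimes}}'` prints the registered runtimes
# as a JSON object keyed by runtime name.
docker_has_nvidia_runtime() {
  docker info --format '{{json .Runtimes}}' 2>/dev/null | grep -q '"nvidia"'
}

# Usage sketch:
#   docker_has_nvidia_runtime || echo "nvidia runtime not registered yet" >&2
```

A check like this could gate the Compose unit so that containers requesting GPUs never start against an unconfigured Docker daemon.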

Systemd enforces order:

corenova-docker-gpu → setup-models-volume → corenova-bootstrap-admin → corenova-ai-stack → nginx

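Nginx terminates TLS on 443 (self-signed certificate on first boot; replace it for production) and proxies to Open WebUI on 127.0.0.1:8080. Port 80 only redirects to HTTPS, and the stack does not expose plain HTTP on port 3000, the port commonly mapped in Open WebUI's Docker examples.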
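
When first boot seems stuck, it helps to know which link in the unit chain has not come up yet. The sketch below walks the chain in order and reports the first unit that is not active. It assumes the one-shot units are declared with RemainAfterExit=yes so that they still report as active after their script finishes; the function and variable names are hypothetical.

```shell
# Unit chain in the boot order described above.
CORENOVA_UNITS="corenova-docker-gpu setup-models-volume corenova-bootstrap-admin corenova-ai-stack nginx"

# Prints the first unit in the chain that is not active and returns 1.
# Returns 0 without output when every unit is active.
# Assumption: one-shot units use RemainAfterExit=yes.
first_unready_unit() {
  for unit in $CORENOVA_UNITS; do
    if ! systemctl is-active --quiet "$unit"; then
      printf '%s\n' "$unit"
      return 1
    fi
  done
  return 0
}

# Usage sketch:
#   first_unready_unit || journalctl -b -u "$(first_unready_unit)"
```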

2. Dynamic EBS mounting via NVMe by-id

We mount the first non-root Amazon EBS volume to /mnt/models using stable symlinks under /dev/disk/by-id/:

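MOUNT_POINT="/mnt/models"
# Resolve the root disk (e.g. /dev/nvme0n1p1 -> /dev/nvme0n1) so it is never touched
ROOT_DEV="$(readlink -f "$(findmnt -n -o SOURCE /)" | sed 's/p[0-9]*$//')"

# Pick the first EBS volume whose underlying disk is not the root disk
DATA_DEV=""
for id in /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol*; do
  [[ -e "$id" ]] || continue   # the glob matched nothing
  dev="$(readlink -f "$id")"
  base_dev="$(echo "$dev" | sed 's/p[0-9]*$//')"
  [[ "$base_dev" == "$ROOT_DEV" ]] && continue
  DATA_DEV="$dev"
  break
done

# No secondary volume attached: leave the root-filesystem fallback in place
[[ -n "$DATA_DEV" ]] || exit 0

# Format only if no filesystem; persist via UUID in fstab
if ! blkid "$DATA_DEV" >/dev/null 2>&1; then
  mkfs.ext4 -F "$DATA_DEV"
fi
UUID="$(blkid -s UUID -o value "$DATA_DEV")"
mkdir -p "$MOUNT_POINT"
# Append the fstab entry only once so reboots do not duplicate it
grep -q "corenova-models-ebs" /etc/fstab || \
  echo "UUID=${UUID} ${MOUNT_POINT} ext4 defaults,nofail 0 2 # corenova-models-ebs" >> /etc/fstab
mountpoint -q "$MOUNT_POINT" || mount "$MOUNT_POINT"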

Ollama models live under /mnt/models/ollama; Hugging Face caches for vLLM under /mnt/models/vllm. Without a data volume, the stack falls back to root filesystem paths—works for pilots, not for 7B+ weights.
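
The fallback decision can be sketched as a small helper that returns /mnt/models only when something is actually mounted there. This is an illustration under assumptions, not the shipped logic: the function name and the fallback directory /var/lib/corenova/models are hypothetical, and the check relies on the util-linux mountpoint command.

```shell
# Prints the directory that model data should live under.
# $1: candidate mount point (default: /mnt/models)
# $2: fallback directory on the root filesystem (hypothetical default)
resolve_models_root() {
  mount_point="${1:-/mnt/models}"
  fallback="${2:-/var/lib/corenova/models}"
  # mountpoint -q succeeds only if the path is an active mount point
  if mountpoint -q "$mount_point" 2>/dev/null; then
    printf '%s\n' "$mount_point"
  else
    printf '%s\n' "$fallback"
  fi
}

# Usage sketch:
#   MODELS_ROOT="$(resolve_models_root)"
#   mkdir -p "$MODELS_ROOT/ollama" "$MODELS_ROOT/vllm"
```

Checking for an active mount instead of a plain directory avoids silently writing multi-gigabyte weights to the root volume when the data volume failed to attach.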

3. Anti-hijack admin bootstrap

Public signup stays off permanently. On first boot, bootstrap-admin.sh reads the EC2 Instance ID via IMDSv2 and writes runtime secrets:

# Request a short-lived IMDSv2 session token, then read the Instance ID with it
TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -sf -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/instance-id")

# The file holds the admin password, so create it readable by root only
umask 077
cat > /opt/corenova/ai/compose/.env <<EOF
ENABLE_SIGNUP=false
WEBUI_ADMIN_EMAIL=admin@local.host
WEBUI_ADMIN_PASSWORD=${INSTANCE_ID}
OLLAMA_BASE_URL=http://127.0.0.1:11434
DEFAULT_MODELS=qwen3:0.6b
EOF

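Login: https://YOUR_PUBLIC_IP/ · email admin@local.host · password = Instance ID. An outside scanner cannot guess that value, but anyone with Console or API access to the account, or shell access to the instance, can read it. Treat it as a one-time bootstrap credential and change it immediately after first login.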
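
Because the Instance ID becomes the initial password, an empty or malformed IMDS response should stop the bootstrap rather than produce a blank credential. The following sketch validates the value before it is written anywhere; the function name is hypothetical, and the pattern covers the current 17-character and the legacy 8-character hexadecimal ID formats.

```shell
# Succeeds only for strings shaped like an EC2 instance ID:
# "i-" followed by 17 hex characters (current) or 8 (legacy).
is_valid_instance_id() {
  printf '%s\n' "$1" | grep -Eq '^i-([0-9a-f]{17}|[0-9a-f]{8})$'
}

# Usage sketch inside a bootstrap script:
#   is_valid_instance_id "$INSTANCE_ID" || { echo "IMDS lookup failed" >&2; exit 1; }
```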

We also fixed a subtle Ollama networking issue: inside the container, Ollama listens on OLLAMA_HOST=0.0.0.0:11434 rather than 127.0.0.1, so that Open WebUI’s health checks succeed when using host networking.

4. Ollama-first chat with automatic model pull

After admin bootstrap, the stack pulls a small default model so chat works without manual steps:

# pull-default-model.sh waits for Ollama, then:
docker exec corenova-ollama ollama pull qwen3:0.6b
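
The wait step can be sketched as a generic retry helper plus an Ollama probe. This is an illustrative version, not the contents of pull-default-model.sh: the function names are hypothetical, and the probe assumes Ollama's GET /api/tags endpoint on the port used throughout this post.

```shell
# Generic retry helper: run a probe command up to $1 times, sleeping $2
# seconds between attempts. Returns 0 as soon as the probe succeeds,
# or 1 if every attempt fails.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only when another attempt will follow
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Probe: the Ollama HTTP API answers on /api/tags once the server is up.
ollama_ready() {
  curl -sf http://127.0.0.1:11434/api/tags
}

# Usage sketch:
#   wait_for 60 5 ollama_ready && docker exec corenova-ollama ollama pull qwen3:0.6b
```

Bounding the number of attempts keeps a failed Ollama container from blocking the rest of the boot chain indefinitely.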

First-boot timeline (typical g6/g4dn): minutes 0–2 NVIDIA driver + Docker GPU runtime; 2–6 Ollama pull; 6–8 WebUI + Nginx ready. Plan 5–10 minutes before first HTTPS login.

5. Optional vLLM with dynamic tensor parallelism

When you need OpenAI-compatible throughput on multi-GPU instances:

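# run-vllm.sh
# nvidia-smi -L prints one "GPU N: ..." line per device; count only those lines
GPU_COUNT=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)
# Refuse to start with zero GPUs, which would yield an invalid parallel size
[ "${GPU_COUNT:-0}" -ge 1 ] || { echo "no NVIDIA GPU detected" >&2; exit 1; }
export VLLM_TENSOR_PARALLEL_SIZE="${GPU_COUNT}"
docker compose --profile vllm up -d vllm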

Compose binds vLLM to 127.0.0.1:8000; Open WebUI can point OPENAI_API_BASE_URL at it. Default model: Qwen/Qwen2.5-0.5B-Instruct (override via .env).
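
vLLM rejects a tensor-parallel size that does not evenly divide the model's attention-head count, so the GPU count is best treated as an upper bound when you override the model. As a sketch under that assumption, the hypothetical helper below picks the largest usable size; the head count must be supplied by the caller (for Hugging Face models it is usually the num_attention_heads field of config.json), and vLLM may impose further constraints that this does not check.

```shell
# Prints the largest tensor-parallel size that is <= the GPU count and
# evenly divides the attention-head count. Returns 1 if no GPU is given.
# $1: number of visible GPUs
# $2: the model's attention-head count (supplied by the caller)
pick_tp_size() {
  gpus="$1"
  heads="$2"
  if [ "$gpus" -lt 1 ]; then
    echo "need at least one GPU" >&2
    return 1
  fi
  tp="$gpus"
  # Walk down from the GPU count until the size divides the head count
  while [ "$tp" -gt 1 ]; do
    if [ $((heads % tp)) -eq 0 ]; then
      break
    fi
    tp=$((tp - 1))
  done
  printf '%s\n' "$tp"
}

# Usage sketch: a 14-head model on a 4-GPU instance yields a size of 2
#   export VLLM_TENSOR_PARALLEL_SIZE="$(pick_tp_size "$GPU_COUNT" 14)"
```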

6. Hardening foundation

This product inherits the same CIS-oriented baseline as our GPU Base AMI: SSH key-only, UFW, auditd, AIDE, unattended security updates, NVIDIA Driver 550, CUDA 12.4. OpenSCAP scan artifacts ship with the product; buyers retain final compliance responsibility.

The economics: build vs. buy

Configuring NVIDIA container tooling, NVMe-by-id mount automation, signup-safe WebUI bootstrap, and a reproducible Packer pipeline is days of engineering—even for teams that know each component individually.

CoreNova packages this architecture as a zero-maintenance AMI:

Software fee: $0.25/hr with a 7-day free trial (EC2/GPU charges separate)
Delivery: AWS Marketplace
Support: CoreNovaLabs@aipalnet.cn

If you could wire this stack yourself in a weekend, you probably should, unless your hourly engineering cost makes $0.25/hr the cheaper way to get an audited, repeatable baseline inside your VPC. That is the bet we optimized for.

For custom RAG layouts or MCP integrations on top of this AMI, email our engineering team at CoreNovaLabs@aipalnet.cn.