Deep Dive: Hardened Private AI Inference on AWS (Open WebUI + Ollama + Optional vLLM)
How CoreNova engineers a production-grade Private AI Sandbox on Ubuntu GPU EC2—dynamic EBS mounts, anti-hijack admin bootstrap, Ollama-first chat, and optional multi-GPU vLLM—so teams skip days of integration work.
Tags AWS Marketplace · AMI · EC2 · GPU · Open WebUI · Ollama · vLLM · AI Inference
Deploying large language models inside an enterprise VPC is no longer experimental—it is a compliance and data-sovereignty requirement. Frameworks like Open WebUI, Ollama, and vLLM make local AI look simple on a laptop. Moving the same stack onto hardened AWS GPU infrastructure surfaces three production bottlenecks that a naive docker compose up will not solve:
- Engine choice and GPU utilization: Chat should work out of the box on a single-GPU g6.xlarge, while optional high-throughput serving must scale tensor parallelism on a 4-GPU g5.12xlarge without hand-editing flags.
- EBS device path randomization: Model weights quickly exceed root volume capacity. On Nitro instances, secondary volumes appear as unpredictable /dev/nvme* paths; you cannot hardcode /dev/sdb.
- Public admin hijacking: Open WebUI can allow public signup on first launch. If crawlers reach your public IP before you register admin, the stack is compromised.
This post walks through how we built Enterprise Secure AI Sandbox—now available on AWS Marketplace—on top of our Ubuntu 22.04 LTS Hardened GPU AMI.
Architecture overview
The stack isolates AI services in /opt/corenova/ai/ while OS-level hardware and storage setup runs before containers start.
[ Public / VPC traffic ]
│ HTTPS 443 (Nginx reverse proxy)
▼
[ Open WebUI ] signup=false · admin = Instance ID
│
▼ localhost:11434
[ Ollama ] DEFAULT chat engine · qwen3:0.6b on first boot
│
▼
[ /mnt/models ] auto-mounted secondary EBS (by-id, not /dev/sdX)
Optional (off by default):
[ vLLM OpenAI API ] localhost:8000 · auto tensor-parallel-size
Default path: Ollama serves all chat. Optional path: sudo systemctl start corenova-ai-vllm enables vLLM with dynamic --tensor-parallel-size. On a single GPU, do not run Ollama and vLLM concurrently.
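The single-GPU rule above can be turned into a pre-start check. The following is a minimal sketch, not part of the shipped product: the function names `vllm_start_allowed` and `check_before_vllm` are hypothetical, and it assumes the Ollama container is named corenova-ollama, as in the pull command later in this post.

```shell
#!/bin/sh
# Hypothetical guard; not taken from the shipped corenova-ai-vllm unit.

# vllm_start_allowed GPU_COUNT OLLAMA_RUNNING
# OLLAMA_RUNNING is "yes" or "no". Fails only when a single GPU
# would have to be shared between Ollama and vLLM.
vllm_start_allowed() {
  if [ "$1" -le 1 ] && [ "$2" = "yes" ]; then
    return 1
  fi
  return 0
}

# Gather the two inputs from the host and apply the rule.
check_before_vllm() {
  gpu_count="$(nvidia-smi -L | wc -l | tr -d ' ')"
  if [ -n "$(docker ps -q --filter name=corenova-ollama --filter status=running)" ]; then
    ollama_running=yes
  else
    ollama_running=no
  fi
  if ! vllm_start_allowed "$gpu_count" "$ollama_running"; then
    echo "single GPU with Ollama running: stop Ollama before starting vLLM" >&2
    return 1
  fi
}
```

Calling `check_before_vllm` before `sudo systemctl start corenova-ai-vllm` would refuse the start on a one-GPU instance while Ollama is still up, and pass through on multi-GPU instances.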
1. Boot-order orchestration: GPU runtime before Compose
Packer builds on non-GPU builders, so the NVIDIA Container Toolkit runtime is configured on first GPU boot, not at image bake time:
# configure-docker-gpu.sh (excerpt)
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
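Since this step runs at boot, it can help to make it idempotent so Docker is not restarted on every start. A minimal sketch under that assumption follows; `has_nvidia_runtime` and `ensure_docker_gpu_runtime` are hypothetical names, not the contents of the shipped script.

```shell
#!/bin/sh
# Hypothetical helpers; not taken from the shipped configure-docker-gpu.sh.

# has_nvidia_runtime RUNTIMES_JSON
# Succeeds if the JSON printed by `docker info --format '{{json .Runtimes}}'`
# contains a runtime keyed "nvidia".
has_nvidia_runtime() {
  case "$1" in
    *'"nvidia"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Configure the NVIDIA runtime and restart Docker only when it is missing.
ensure_docker_gpu_runtime() {
  runtimes="$(docker info --format '{{json .Runtimes}}' 2>/dev/null || true)"
  if has_nvidia_runtime "$runtimes"; then
    echo "nvidia runtime already registered; skipping restart"
    return 0
  fi
  nvidia-ctk runtime configure --runtime=docker
  systemctl restart docker
}
```

On the first GPU boot `ensure_docker_gpu_runtime` would configure and restart Docker; on later boots it would return without touching the daemon.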
Systemd enforces order:
corenova-docker-gpu → setup-models-volume → corenova-bootstrap-admin → corenova-ai-stack → nginx
Nginx terminates TLS on 443 (self-signed on first boot; replace for production) and proxies to Open WebUI on 127.0.0.1:8080. Port 80 redirects to HTTPS; the UI is not exposed over plain HTTP on port 3000, as it is in many default Open WebUI setups.
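One way to confirm the ordering on a running instance is to read each unit's After= property. This is a sketch only: the `.service` suffixes on the unit names are an assumption, both function names are hypothetical, and the check only sees direct After= entries, so an ordering enforced some other way would be reported as missing.

```shell
#!/bin/sh
# Hypothetical verification helpers; unit names are assumed from the chain above.

# after_lists_unit AFTER_VALUE UNIT
# Succeeds if UNIT appears as a whole word in a space-separated After= value.
after_lists_unit() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# verify_boot_chain UNIT1 UNIT2 ...
# Checks that every unit is ordered after the one listed before it.
verify_boot_chain() {
  prev=""
  for unit in "$@"; do
    if [ -n "$prev" ]; then
      after="$(systemctl show -p After --value "$unit")"
      if ! after_lists_unit "$after" "$prev"; then
        echo "$unit is not directly ordered after $prev" >&2
        return 1
      fi
    fi
    prev="$unit"
  done
  echo "boot chain ordering looks consistent"
}
```

A possible invocation would be `verify_boot_chain corenova-docker-gpu.service setup-models-volume.service corenova-bootstrap-admin.service corenova-ai-stack.service nginx.service`.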
2. Dynamic EBS mounting via NVMe by-id
We mount the first non-root Amazon EBS volume to /mnt/models using stable symlinks under /dev/disk/by-id/:
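MOUNT_POINT="/mnt/models"
DATA_DEV=""
# Resolve the root disk, stripping the partition suffix (nvme0n1p1 -> nvme0n1)
ROOT_DEV="$(readlink -f "$(findmnt -n -o SOURCE /)" | sed 's/p[0-9]*$//')"
shopt -s nullglob  # an unmatched glob expands to nothing instead of itself
for id in /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol*; do
  dev="$(readlink -f "$id")"
  base_dev="$(echo "$dev" | sed 's/p[0-9]*$//')"
  [[ "$base_dev" == "$ROOT_DEV" ]] && continue
  DATA_DEV="$dev"
  break
done
# No secondary volume attached: leave the root filesystem fallback in place
[[ -z "$DATA_DEV" ]] && exit 0
# Format only if no filesystem; persist via UUID in fstab
if ! blkid "$DATA_DEV" >/dev/null 2>&1; then
  mkfs.ext4 -F "$DATA_DEV"
fi
UUID="$(blkid -s UUID -o value "$DATA_DEV")"
mkdir -p "$MOUNT_POINT"
# Append the fstab entry only once so reboots do not duplicate it
grep -q "corenova-models-ebs" /etc/fstab || \
  echo "UUID=${UUID} ${MOUNT_POINT} ext4 defaults,nofail 0 2 # corenova-models-ebs" >> /etc/fstab
mountpoint -q "$MOUNT_POINT" || mount "$MOUNT_POINT"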
Ollama models live under /mnt/models/ollama; Hugging Face caches for vLLM under /mnt/models/vllm. Without a data volume, the stack falls back to root filesystem paths—works for pilots, not for 7B+ weights.
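The fallback described above could be sketched as follows. The function names are hypothetical, and the fallback directory is passed in as a parameter because the actual root-filesystem path is not stated here.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the data-volume fallback.

# select_models_root MOUNT_POINT FALLBACK_DIR
# Prints MOUNT_POINT when a filesystem is mounted there, otherwise FALLBACK_DIR.
select_models_root() {
  if mountpoint -q "$1" 2>/dev/null; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "$2"
  fi
}

# prepare_model_dirs ROOT
# Creates the per-engine subdirectories used by Ollama and vLLM.
prepare_model_dirs() {
  mkdir -p "$1/ollama" "$1/vllm"
}
```

For example, `root="$(select_models_root /mnt/models /path/to/fallback)"` followed by `prepare_model_dirs "$root"` would yield /mnt/models/ollama and /mnt/models/vllm when the EBS volume is mounted, and the same layout under the fallback directory when it is not.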
3. Anti-hijack admin: IMDSv2 + disabled signup
Public signup stays off permanently. On first boot, bootstrap-admin.sh reads the EC2 Instance ID via IMDSv2 and writes runtime secrets:
TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -sf -H "X-aws-ec2-metadata-token: ${TOKEN}" \
"http://169.254.169.254/latest/meta-data/instance-id")
cat > /opt/corenova/ai/compose/.env <<EOF
ENABLE_SIGNUP=false
WEBUI_ADMIN_EMAIL=admin@local.host
WEBUI_ADMIN_PASSWORD=${INSTANCE_ID}
OLLAMA_BASE_URL=http://127.0.0.1:11434
DEFAULT_MODELS=qwen3:0.6b
EOF
Login: https://YOUR_PUBLIC_IP/ · email admin@local.host · password = Instance ID. Only someone with AWS Console access or your SSH key can recover that password. Change it after first login.
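Because `curl -sf` prints nothing on failure, a metadata hiccup could leave the password empty. A minimal defensive sketch follows, assuming the bootstrap should refuse to write secrets unless the value looks like an instance ID. The function names are hypothetical, and only two of the .env keys are written to keep the example short.

```shell
#!/bin/sh
# Hypothetical hardening helpers; not taken from the shipped bootstrap-admin.sh.

# is_instance_id VALUE
# Succeeds for "i-" followed by 8 or 17 lowercase hex characters.
is_instance_id() {
  printf '%s' "$1" | grep -Eq '^i-([0-9a-f]{8}|[0-9a-f]{17})$'
}

# write_env_file PATH INSTANCE_ID
# Writes the file with owner-only permissions, or fails without creating it.
write_env_file() {
  if ! is_instance_id "$2"; then
    echo "refusing to write $1: value is not an instance ID" >&2
    return 1
  fi
  (
    umask 077  # the file holds the admin password
    printf 'ENABLE_SIGNUP=false\nWEBUI_ADMIN_PASSWORD=%s\n' "$2" > "$1"
  )
}
```

With this shape, a failed IMDSv2 call stops the bootstrap instead of producing an admin account with a blank password, and the resulting file is not readable by other local users.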
We also fixed a subtle Ollama networking issue: the container sets OLLAMA_HOST=0.0.0.0:11434 (not 127.0.0.1) so that Open WebUI's health checks succeed when using host networking.
4. Ollama-first chat with automatic model pull
After admin bootstrap, the stack pulls a small default model so chat works without manual steps:
# pull-default-model.sh waits for Ollama, then:
docker exec corenova-ollama ollama pull qwen3:0.6b
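The "waits for Ollama" step might look like the sketch below. It is an illustration, not the shipped pull-default-model.sh: the function names, the 60 x 5 second budget, and the use of Ollama's /api/version endpoint as the readiness probe are all assumptions.

```shell
#!/bin/sh
# Hypothetical wait-then-pull helpers.

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Readiness probe against the local Ollama HTTP API.
ollama_ready() {
  curl -sf http://127.0.0.1:11434/api/version >/dev/null
}

# Wait up to about five minutes, then pull the default model.
pull_default_model() {
  if ! retry 60 5 ollama_ready; then
    echo "Ollama did not become ready in time" >&2
    return 1
  fi
  docker exec corenova-ollama ollama pull qwen3:0.6b
}
```

Bounding the wait means a broken container shows up as a failed unit in the journal rather than a boot step that hangs indefinitely.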
First-boot timeline (typical g6/g4dn): minutes 0–2 NVIDIA driver + Docker GPU runtime; 2–6 Ollama pull; 6–8 WebUI + Nginx ready. Plan 5–10 minutes before first HTTPS login.
5. Optional vLLM with dynamic tensor parallelism
When you need OpenAI-compatible throughput on multi-GPU instances:
# run-vllm.sh
GPU_COUNT=$(nvidia-smi -L | wc -l | tr -d ' ')
export VLLM_TENSOR_PARALLEL_SIZE="${GPU_COUNT}"
docker compose --profile vllm up -d vllm
Compose binds vLLM to 127.0.0.1:8000; Open WebUI can point OPENAI_API_BASE_URL at it. Default model: Qwen/Qwen2.5-0.5B-Instruct (override via .env).
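One caveat with setting the tensor-parallel size to the raw GPU count: vLLM requires the model's attention head count to be divisible by the tensor-parallel size, so a model with, say, 14 heads will not start across 4 GPUs. A minimal sketch of a safer choice follows. `pick_tp_size` is a hypothetical helper, and the head count is an input you would read from the model's config.json (num_attention_heads), not something this code detects.

```shell
#!/bin/sh
# Hypothetical helper; not taken from the shipped run-vllm.sh.

# pick_tp_size GPU_COUNT NUM_ATTENTION_HEADS
# Prints the largest size <= GPU_COUNT that divides NUM_ATTENTION_HEADS.
pick_tp_size() {
  if [ "$1" -lt 1 ] || [ "$2" -lt 1 ]; then
    echo "GPU count and head count must be at least 1" >&2
    return 1
  fi
  tp="$1"
  while [ "$tp" -gt 1 ]; do
    if [ $(( $2 % tp )) -eq 0 ]; then
      break
    fi
    tp=$((tp - 1))
  done
  echo "$tp"
}
```

For instance, `pick_tp_size 4 14` prints 2 and `pick_tp_size 4 32` prints 4. The result could be exported as VLLM_TENSOR_PARALLEL_SIZE in place of the raw GPU count, at the cost of leaving some GPUs idle for models whose head count does not divide evenly.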
6. Hardening foundation
This product inherits the same CIS-oriented baseline as our GPU Base AMI: SSH key-only, UFW, auditd, AIDE, unattended security updates, NVIDIA Driver 550, CUDA 12.4. OpenSCAP scan artifacts ship with the product; buyers retain final compliance responsibility.
The economics: build vs. buy
Configuring NVIDIA container tooling, NVMe-by-id mount automation, signup-safe WebUI bootstrap, and a reproducible Packer pipeline is days of engineering—even for teams that know each component individually.
CoreNova packages this architecture as a zero-maintenance AMI:
- Software fee: $0.25/hr with a 7-day free trial (EC2/GPU charges separate)
- Delivery: AWS Marketplace
- Support: CoreNovaLabs@aipalnet.cn
If you could wire this stack yourself in a weekend, you probably should, unless your hourly engineering cost makes $0.25/hr the cheaper way to get an audited, repeatable baseline inside your VPC. That is the bet we optimized for.
For custom RAG layouts or MCP integrations on top of this AMI, email our engineering team at CoreNovaLabs@aipalnet.cn.