
# BLUEPRINT-AWS-PRIVATE-AI

Private LLM platform on AWS

Run private AI workflows on dedicated GPU capacity in your own AWS account — open-weight models, zero standby cost, multi-tenant by design.

// authored by Oleks Saloid

A blueprint is a pattern already running in production. The numbers below describe the system that produced the evidence, not a prospective engagement. We adapt blueprints to each client's stack and scale.

# THE-PROBLEM

why this pattern exists.

Hosted AI APIs are easy to start with — but the moment you have sensitive prompts, regulated data, or material inference volume, the trade-offs flip. You're sending business-critical context to a third party, paying per-token at someone else's pricing curve, and locked to whichever models that vendor chooses to expose.

Self-hosting fixes all three — but only if it doesn't burn a 24/7 GPU bill, only if it's actually reliable for real work, and only if a small team can stand it up and own it. What you actually want is private, dedicated capacity that scales to zero when nobody's using it, picks up sub-second when someone does, and lets you swap models or add use cases without rebuilding from scratch.

# ARCHITECTURE

how it's built.

  1. Private AI infrastructure

    Your models, your data, your VPC, your AWS account. Prompts and outputs never leave your perimeter — no third-party data-processing terms to negotiate, no data-residency exceptions, no shared tenancy with other vendors' customers.

  2. Zero standby cost

    GPUs are provisioned on demand and released when work drains. You pay for inference minutes, not for parked capacity. An idle team costs nothing; a busy team scales up automatically and back down when they're done. (A sketch of one possible control loop follows this list.)

  3. Open-weight model freedom

    Run any open-weight model your workloads need — Gemma, Qwen, Llama, others — and swap models without re-architecting the platform. Right-size each use case to the model that fits, not to whatever the vendor picked for you.

  4. Multi-tenant by design

    Per-team isolation, per-user authentication (including SSO via SAML), and per-application credentials for service-to-service callers. Different teams share GPU capacity safely without sharing prompts, history, or quotas. (The second sketch after this list shows the service-to-service token flow.)

  5. Production-grade reliability

    Long-running generations are protected from termination, queued work survives capacity transitions, and the API stays sub-second even when the GPU fleet is cold or scaling. The platform is designed to sit at zero overnight and resume the next morning without lost work. (The third sketch after this list shows one way queued work can outlive a capacity transition.)

  6. Deployable anywhere, fully codified

    The entire platform is infrastructure as code. Stand it up in any AWS account or region in hours; extend it with new endpoints or use cases by attaching another stack — no rebuilds, no environment-specific code paths.
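
A minimal sketch of what the zero-standby control loop from item 2 can look like on this stack: a scheduled Lambda sizes the GPU Auto Scaling group from the SQS backlog and sets it to zero when the queue drains. The queue and group names, the jobs-per-instance heuristic, and the instance ceiling are illustrative assumptions, not the production values.

```typescript
// Scheduled "capacity controller" Lambda: scale the GPU fleet with the SQS
// backlog, down to zero when the queue is empty. Assumes the Auto Scaling
// group's MinSize is 0. All names and numbers here are placeholders.
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";
import {
  AutoScalingClient,
  SetDesiredCapacityCommand,
} from "@aws-sdk/client-auto-scaling";

const sqs = new SQSClient({});
const autoscaling = new AutoScalingClient({});

const QUEUE_URL = process.env.QUEUE_URL!;       // inference job queue
const GPU_ASG_NAME = process.env.GPU_ASG_NAME!; // GPU worker fleet
const JOBS_PER_INSTANCE = 4;                    // illustrative packing factor
const MAX_INSTANCES = 8;                        // illustrative cost ceiling

export const handler = async (): Promise<void> => {
  // Visible plus in-flight messages approximate the outstanding work.
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: QUEUE_URL,
      AttributeNames: [
        "ApproximateNumberOfMessages",
        "ApproximateNumberOfMessagesNotVisible",
      ],
    })
  );
  const backlog =
    Number(Attributes?.ApproximateNumberOfMessages ?? 0) +
    Number(Attributes?.ApproximateNumberOfMessagesNotVisible ?? 0);

  // Zero backlog means zero instances; otherwise enough instances to drain it.
  const desired = Math.min(MAX_INSTANCES, Math.ceil(backlog / JOBS_PER_INSTANCE));

  await autoscaling.send(
    new SetDesiredCapacityCommand({
      AutoScalingGroupName: GPU_ASG_NAME,
      DesiredCapacity: desired,
      HonorCooldown: false,
    })
  );
};
```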
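
A minimal sketch of the service-to-service path from item 4: an application exchanges its per-application Cognito credentials for a bearer token via the standard OAuth2 client-credentials grant, then calls the platform API with it. The Cognito domain, scope, and API URL below are placeholders.

```typescript
// Service-to-service caller: trade a Cognito app client's credentials for an
// access token, then invoke the API behind API Gateway. Runs on Node.js 18+
// (global fetch). Domain, scope, and endpoint are illustrative placeholders.
const COGNITO_DOMAIN = "https://example.auth.eu-central-1.amazoncognito.com";
const CLIENT_ID = process.env.CLIENT_ID!;
const CLIENT_SECRET = process.env.CLIENT_SECRET!;

async function getServiceToken(): Promise<string> {
  const basic = Buffer.from(`${CLIENT_ID}:${CLIENT_SECRET}`).toString("base64");
  const res = await fetch(`${COGNITO_DOMAIN}/oauth2/token`, {
    method: "POST",
    headers: {
      Authorization: `Basic ${basic}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    // The scope ties the caller to what its tenant is allowed to invoke.
    body: new URLSearchParams({
      grant_type: "client_credentials",
      scope: "inference-api/invoke",
    }),
  });
  if (!res.ok) throw new Error(`token request failed: ${res.status}`);
  const { access_token } = (await res.json()) as { access_token: string };
  return access_token;
}

async function submitJob(prompt: string): Promise<Response> {
  const token = await getServiceToken();
  return fetch("https://api.example.com/v1/generate", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ prompt }),
  });
}
```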
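
A minimal sketch of the reliability mechanism from item 5, assuming SQS visibility timeouts as the claim mechanism: a worker keeps extending its claim on a job while the generation runs, so the job is redelivered to another instance only if its worker actually disappears. The queue URL and timings are illustrative.

```typescript
import {
  SQSClient,
  ReceiveMessageCommand,
  ChangeMessageVisibilityCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!;

// Pull one job and keep it claimed for as long as the generation runs.
export async function processOne(
  runInference: (body: string) => Promise<void>
): Promise<void> {
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20,   // long polling
      VisibilityTimeout: 60, // initial claim on the job
    })
  );
  const msg = Messages?.[0];
  if (!msg) return;

  // Heartbeat: re-extend the claim every 30s while the generation runs, so a
  // slow job is never redelivered while this worker is still alive.
  const heartbeat = setInterval(() => {
    sqs
      .send(
        new ChangeMessageVisibilityCommand({
          QueueUrl: QUEUE_URL,
          ReceiptHandle: msg.ReceiptHandle!,
          VisibilityTimeout: 60,
        })
      )
      .catch(() => clearInterval(heartbeat)); // claim lost: stop extending
  }, 30_000);

  try {
    await runInference(msg.Body!);
    // Done: remove the job so it is never redelivered.
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: msg.ReceiptHandle!,
      })
    );
  } finally {
    clearInterval(heartbeat);
  }
}
```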

# KEY-PRIMITIVE

the load-bearing idea.

Zero standby cost without dropping work

The decision that makes this blueprint commercially viable is solving cost and reliability together: GPUs run only when there is work for them, and termination never kills an in-flight generation. When scale-down fires, busy instances signal that they are still working and stay up until they are truly idle, so the fleet returns to zero only once the work is genuinely complete. The result is a platform that can sit at zero spend overnight and resume sub-second the next morning with no lost work in between: the cost profile of pay-per-token, the privacy and control of self-hosted.
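
A minimal sketch of the busy signal, assuming EC2 Auto Scaling scale-in protection as the underlying mechanism: the worker protects itself while generations are in flight and releases the protection once it has drained, so scale-down can never reap an instance mid-generation. The group name is a placeholder, and the wrapper is an illustration of the pattern, not the production code.

```typescript
// GPU worker side of "signal you're still busy": toggle Auto Scaling scale-in
// protection around in-flight generations. Group name is a placeholder.
import {
  AutoScalingClient,
  SetInstanceProtectionCommand,
} from "@aws-sdk/client-auto-scaling";

const autoscaling = new AutoScalingClient({});
const GPU_ASG_NAME = process.env.GPU_ASG_NAME!;
let inFlight = 0;

// Resolve this instance's id from the EC2 instance metadata service (IMDSv2).
async function instanceId(): Promise<string> {
  const token = await fetch("http://169.254.169.254/latest/api/token", {
    method: "PUT",
    headers: { "X-aws-ec2-metadata-token-ttl-seconds": "21600" },
  }).then((r) => r.text());
  return fetch("http://169.254.169.254/latest/meta-data/instance-id", {
    headers: { "X-aws-ec2-metadata-token": token },
  }).then((r) => r.text());
}

async function setProtection(busy: boolean): Promise<void> {
  await autoscaling.send(
    new SetInstanceProtectionCommand({
      AutoScalingGroupName: GPU_ASG_NAME,
      InstanceIds: [await instanceId()],
      ProtectedFromScaleIn: busy,
    })
  );
}

// First job in marks the instance busy; last job out releases it. Only an
// idle instance is ever eligible for scale-in.
export async function withScaleInProtection<T>(
  work: () => Promise<T>
): Promise<T> {
  if (inFlight++ === 0) await setProtection(true);
  try {
    return await work();
  } finally {
    if (--inFlight === 0) await setProtection(false);
  }
}
```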

# TECH-STACK

what runs it.

AWS Lambda · API Gateway · Cognito (SSO, S2S) · SQS · SNS · DynamoDB · EC2 GPU Auto Scaling · Network Load Balancer · VPC · CloudWatch · KMS · IAM · Serverless Framework · CloudFormation · Node.js 24 · AWS SDK v3 · Ollama · Open-weight LLMs (Gemma, Qwen, Llama, …) · Docker

# PRODUCTION-EVIDENCE

what we've measured.

Private AI on dedicated AWS infrastructure. Years in production.

This is a working platform — not a POC, not a demo. Companies run their own LLM workflows on dedicated GPU capacity inside their own AWS account, with prompts and outputs that never leave their perimeter. Open-weight models give them model choice without vendor lock-in; the serverless control plane keeps idle cost at zero; the IaC foundation means new use cases ship as additional endpoints, not as another platform to operate. It's the cost profile of pay-per-token with the privacy and control of self-hosted.

- Idle GPU cost: Zero
- Your AWS account: Private
- API latency: Sub-sec
- In production: Years

want one of these in production?

30-min discovery call. we adapt the blueprint, we don't resell it.

 book-call

// or write: hello@saloid.com · gräfelfing · de