# BLUEPRINT-AWS-EMR
EMR platform for mixed teams
Visual pipeline authoring, batch jobs on EMR Classic, interactive Spark on EMR Serverless, an S3 file explorer, and per-team cost reporting — one web app for engineers, data scientists, and analysts.
# THE-PROBLEM
why this pattern exists.
Heterogeneous teams — software engineers, data engineers, data scientists, and client operations — all need Spark on EMR for both scheduled batch ETL and interactive exploration, but each reinvents cluster bootstrapping, IAM, access control, and cost attribution. The result is duplicated pipelines, inconsistent security posture, and no way to attribute compute spend.
The moment you have more than one team on EMR, the cost of not having a shared control plane compounds. Teams either collide on shared clusters or run their own — and neither scales. What you actually want is one UI that handles batch DAGs, ad-hoc Spark sessions, file management, and cost reporting, with platform plumbing solid enough that a small team can own it for years.
# ARCHITECTURE
how it's built.
-
Unified platform for authoring & execution
One Cognito-secured web app handles every step — design a Spark workflow on a React Flow DAG editor, run it, monitor it, inspect inputs and outputs. No separate scheduler, no per-team UIs, no YAML to hand-write. Workflow definitions persist to DynamoDB; executions stream over SQS. The same surface drives both batch and interactive jobs.
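As a rough sketch of how a DAG drawn in the editor could be turned into an execution order before its steps are queued: the `id`, `source`, and `target` fields below follow React Flow's node and edge naming, while `WorkflowNode`, `WorkflowEdge`, and `executionOrder` are hypothetical names and not the platform's actual schema.

```typescript
// Hypothetical shapes: only the fields needed for ordering are modelled.
interface WorkflowNode { id: string; }
interface WorkflowEdge { source: string; target: string; }

// Kahn's algorithm: returns node ids so that every edge's source
// comes before its target. Throws if the graph contains a cycle.
function executionOrder(nodes: WorkflowNode[], edges: WorkflowEdge[]): string[] {
  const indegree = new Map<string, number>();
  const downstream = new Map<string, string[]>();
  for (const n of nodes) {
    indegree.set(n.id, 0);
    downstream.set(n.id, []);
  }
  for (const e of edges) {
    if (!indegree.has(e.source) || !indegree.has(e.target)) {
      throw new Error(`edge references unknown node: ${e.source} -> ${e.target}`);
    }
    downstream.get(e.source)!.push(e.target);
    indegree.set(e.target, indegree.get(e.target)! + 1);
  }
  // Start with the nodes that have no upstream dependency.
  const ready = nodes.filter((n) => indegree.get(n.id) === 0).map((n) => n.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of downstream.get(id)!) {
      const remaining = indegree.get(next)! - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }
  if (order.length !== nodes.length) {
    throw new Error("workflow contains a cycle");
  }
  return order;
}
```

Validating the graph at save time, rather than at run time, would keep cyclic workflows out of the queue entirely; whether the platform does so is not stated here.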
-
Batch jobs on EMR Classic
Long-running Spark batch workloads execute on dedicated EMR Classic clusters with S3-hosted bootstrap scripts and assumed-role execution contexts. Steps are submitted, tracked, and surfaced back to the UI through the same workflow model used for interactive jobs.
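To make the step submission concrete, here is a minimal sketch of building the request body for EMR's `AddJobFlowSteps` API from a batch job description. The interfaces are a local, partial mirror of the request shape so the block stays dependency-free; `BatchJob`, `buildSparkStep`, and the example values are hypothetical, and the choice of `CONTINUE` on failure is an assumption.

```typescript
// Partial local mirror of the AddJobFlowSteps request (only fields used here).
interface StepConfig {
  Name: string;
  ActionOnFailure: "CONTINUE" | "CANCEL_AND_WAIT" | "TERMINATE_CLUSTER";
  HadoopJarStep: { Jar: string; Args: string[] };
}
interface AddJobFlowStepsInput {
  JobFlowId: string;
  Steps: StepConfig[];
}

// Hypothetical description of one batch node in a workflow.
interface BatchJob {
  name: string;
  clusterId: string; // EMR cluster id, assumed to be resolved elsewhere
  scriptUri: string; // S3 URI of the Spark application
  args: string[];
}

// Wraps a Spark application in a command-runner.jar step that calls spark-submit.
function buildSparkStep(job: BatchJob): AddJobFlowStepsInput {
  return {
    JobFlowId: job.clusterId,
    Steps: [
      {
        Name: job.name,
        ActionOnFailure: "CONTINUE", // assumption: keep the cluster alive on failure
        HadoopJarStep: {
          Jar: "command-runner.jar",
          Args: ["spark-submit", "--deploy-mode", "cluster", job.scriptUri, ...job.args],
        },
      },
    ],
  };
}

const stepRequest = buildSparkStep({
  name: "nightly-etl",
  clusterId: "j-EXAMPLE",
  scriptUri: "s3://example-bucket/jobs/etl.py",
  args: ["--date", "2024-01-01"],
});
```

An object of this shape could be passed to `AddJobFlowStepsCommand` from `@aws-sdk/client-emr`; the returned step ids would then be what the UI tracks.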
-
Interactive Spark on EMR Serverless
Ad-hoc exploration and short-lived analytics workloads run on EMR Serverless, so data scientists get sub-minute job starts without keeping a cluster warm. One workflow model, two execution substrates — batch or interactive picked per job.
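The interactive path could be sketched the same way: a job description is mapped onto the request body of EMR Serverless's `StartJobRun` API. The interface is a partial local mirror of that request; `InteractiveJob`, `buildServerlessRun`, the `team` tag key, and all example identifiers are assumptions made for illustration.

```typescript
// Partial local mirror of the EMR Serverless StartJobRun request.
interface StartJobRunInput {
  applicationId: string;
  executionRoleArn: string;
  name?: string;
  jobDriver: {
    sparkSubmit: {
      entryPoint: string;
      entryPointArguments?: string[];
      sparkSubmitParameters?: string;
    };
  };
  tags?: Record<string, string>;
}

// Hypothetical description of one interactive job.
interface InteractiveJob {
  name: string;
  team: string; // assumed to be used as a tag for cost attribution
  entryPoint: string; // S3 URI of the Spark application
  args: string[];
  sparkConf: Record<string, string>;
}

function buildServerlessRun(
  job: InteractiveJob,
  applicationId: string,
  executionRoleArn: string
): StartJobRunInput {
  // Spark settings are passed as a single "--conf k=v --conf k=v" string.
  const sparkSubmitParameters = Object.entries(job.sparkConf)
    .map(([key, value]) => `--conf ${key}=${value}`)
    .join(" ");
  return {
    applicationId,
    executionRoleArn,
    name: job.name,
    jobDriver: {
      sparkSubmit: {
        entryPoint: job.entryPoint,
        entryPointArguments: job.args,
        sparkSubmitParameters,
      },
    },
    tags: { team: job.team },
  };
}

const runRequest = buildServerlessRun(
  {
    name: "adhoc-profile",
    team: "analytics",
    entryPoint: "s3://example-bucket/jobs/profile.py",
    args: ["--table", "events"],
    sparkConf: { "spark.executor.memory": "4g", "spark.executor.cores": "2" },
  },
  "00example",
  "arn:aws:iam::123456789012:role/example-emr-serverless-role"
);
```

Because both builders start from a plain job description, the same workflow node can target either substrate, and only the final request differs. The object above matches what `StartJobRunCommand` from `@aws-sdk/client-emr-serverless` accepts.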
-
File explorer + cost reporting
An S3 explorer and a Cost Explorer-backed reporting dashboard, both scoped by the same Cognito identity that drives execution. Browse, upload, and preview pipeline inputs and outputs without shelling into one-off scripts; attribute spend by team, workflow, and execution substrate without spreadsheet exports.
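Per-team attribution of this kind is typically a Cost Explorer `GetCostAndUsage` query grouped by a cost-allocation tag. The sketch below builds such a request and sums the response groups; it assumes resources carry a `team` tag, and `teamCostQuery` and `totalsByGroup` are hypothetical helper names. The interfaces mirror only the parts of the API used here.

```typescript
// Partial local mirror of the GetCostAndUsage request and response.
interface GetCostAndUsageInput {
  TimePeriod: { Start: string; End: string }; // YYYY-MM-DD, End is exclusive
  Granularity: "DAILY" | "MONTHLY";
  Metrics: string[];
  GroupBy: { Type: "TAG" | "DIMENSION"; Key: string }[];
}
interface CostGroup {
  Keys: string[];
  Metrics: Record<string, { Amount: string; Unit: string }>;
}
interface ResultByTime { Groups: CostGroup[]; }

// Monthly unblended cost, grouped by a cost-allocation tag (assumed: "team").
function teamCostQuery(start: string, end: string, tagKey = "team"): GetCostAndUsageInput {
  return {
    TimePeriod: { Start: start, End: end },
    Granularity: "MONTHLY",
    Metrics: ["UnblendedCost"],
    GroupBy: [{ Type: "TAG", Key: tagKey }],
  };
}

// Sums one metric across all returned periods, keyed by the group's Keys.
function totalsByGroup(results: ResultByTime[], metric = "UnblendedCost"): Map<string, number> {
  const totals = new Map<string, number>();
  for (const period of results) {
    for (const group of period.Groups) {
      const key = group.Keys.join("/");
      const amount = Number(group.Metrics[metric]?.Amount ?? "0"); // amounts arrive as strings
      totals.set(key, (totals.get(key) ?? 0) + amount);
    }
  }
  return totals;
}

const costQuery = teamCostQuery("2024-01-01", "2024-03-01");
```

Grouping by a second key (for example a `workflow` tag, or the `SERVICE` dimension) would give the workflow and substrate breakdowns mentioned above; a tag only appears in Cost Explorer after it has been activated as a cost-allocation tag in the billing console.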
-
Infrastructure as code, deployable anywhere
Every Lambda, queue, table, IAM role, WAF rule, and CloudFront distribution is codified as Serverless Framework v4 + CloudFormation, composed from a single config.yml with per-environment overrides under stages/. Bootstrap a new AWS account or region in hours, not weeks — deterministic, zero-downtime deploys; no environment-specific code paths.
-
Security & Well-Architected by default
KMS-encrypted data at rest, Cognito authentication with per-endpoint API keys on a shared API Gateway, WAF on public routes, Secrets Manager for credentials, blocked public S3, least-privilege IAM scoped per service, structured Pino logging with 90-day CloudWatch retention. The AWS Well-Architected pillars — Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability — addressed by construction, not retrofit.
# KEY-PRIMITIVE
the load-bearing idea.
One config.yml. Any AWS account. Hours, not weeks.
The decision that keeps paying off in this blueprint is making the platform fully reproducible from a single source of truth. A single config.yml plus per-environment overrides under stages/ composes with Serverless Framework v4 to produce deterministic, zero-downtime deploys: no environment-specific code paths, no manual click-ops, no drift. Bring the entire platform up in a fresh AWS account or region in a few hours, with every Lambda, queue, IAM role, WAF rule, and KMS key in place by the time the deploy finishes. This is what makes the blueprint deployable, not just describable.
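The base-plus-overrides idea can be illustrated with a small deep merge. This is a sketch of the general technique, not the blueprint's actual composition logic: `mergeConfig` and every key in the example (`region`, `logRetentionDays`, `emr`) are hypothetical, and the parsed YAML is represented here as plain objects.

```typescript
type ConfigValue = string | number | boolean | ConfigValue[] | { [key: string]: ConfigValue };
type Config = { [key: string]: ConfigValue };

function isPlainObject(value: ConfigValue | undefined): value is Config {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a new config where stage values win over base values.
// Nested objects are merged key by key; arrays and scalars are replaced whole.
function mergeConfig(base: Config, override: Config): Config {
  const out: Config = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const existing = out[key];
    out[key] =
      isPlainObject(existing) && isPlainObject(value)
        ? mergeConfig(existing, value)
        : value;
  }
  return out;
}

// Hypothetical contents of config.yml and a dev-stage override, already parsed.
const baseConfig: Config = {
  region: "eu-central-1",
  logRetentionDays: 90,
  emr: { instanceCount: 3, instanceType: "m5.xlarge" },
};
const devOverride: Config = {
  emr: { instanceCount: 1 },
};
const devConfig = mergeConfig(baseConfig, devOverride);
```

Keeping the merge pure (no mutation of the base object) means every stage is derived from the same unmodified source, which is the property that rules out environment-specific code paths.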
# TECH-STACK
what runs it.
-
// compute & infra
- AWS Lambda
- API Gateway
- CloudFormation
- EMR (Classic & Serverless)
- CloudFront
- WAF
-
// data
- S3
- S3 Tables
- DynamoDB
- Glue
-
// messaging
- SQS
- SNS
- SES
-
// identity & ops
- Cognito
- Secrets Manager
- IAM (cross-account)
- Cost Explorer
-
// backend
- Node.js 22
- Serverless Framework v4
- esbuild
- Pino
-
// frontend
- React 19
- Vite
- Redux Toolkit
- React Flow
-
// testing & ci
- Jest
- Jenkins (19 pipelines)
# PRODUCTION-EVIDENCE
what we've measured.
More than 30 AWS services, brought together as one product. Years in production.
This is a working platform — not a POC, not a slide deck. The serverless-first design means there's no standby cost: you pay when a job runs and nothing when it doesn't. The unified Cognito-secured surface means new teams onboard without bespoke tooling. Everything is codified as IaC, so changes ship as zero-downtime deploys and the operational footprint is small enough for a single engineer to own and evolve safely. It earned its users by being good, not by being mandated.
- AWS services unified: 30+
- Idle / standby cost: Zero
- EMR compute savings: ~90%
- In production: 5+ yrs
want one of these in production?
30-min discovery call. we adapt the blueprint, we don't resell it.
book a call, or write: hello@saloid.com · gräfelfing · de