AWS defines building blocks to train and serve AI models
AWS publishes a technical framework to understand how to train and serve foundation models at scale using accelerated infrastructure, low-latency networking, and an open source ecosystem. Are you curious where the real bottlenecks show up and how hardware, orchestration, and ML software connect in production? Here I explain the essentials, with practical details for engineers and architects.
Architecture and infrastructure building blocks
AWS organizes the stack around three tightly coupled blocks: accelerated compute with lots of on-device memory, high-bandwidth low-latency interconnect, and scalable distributed storage. It’s not just hardware; it’s the combination that enables pre-training, post-training and serving at large scale.
Accelerated compute: P5 and P6 families with NVIDIA H100, H200, Blackwell B200 and B300 GPUs. The key scaling axes are Tensor Core throughput, HBM capacity and bandwidth, and interconnect bandwidth.
| Representative GPU | BF16/FP16 peak | FP8 peak | HBM | HBM BW |
|---|---|---|---|---|
| H100 SXM | 0.9895 PFLOPS | 1.979 PFLOPS | 80 GB HBM3 | 3.35 TB/s |
| H200 SXM | 0.9895 PFLOPS | 1.979 PFLOPS | 141 GB HBM3e | 4.8 TB/s |
| B200 HGX | 2.25 PFLOPS | 4.5 PFLOPS | 180 GB HBM3e | 8 TB/s |
| B300 HGX | 2.25 PFLOPS | 4.5 PFLOPS | 288 GB HBM3e | 8 TB/s |
Network: NVLink/NVSwitch for intra-node scale-up and EFA (Elastic Fabric Adapter) for scale-out between nodes. EFA provides OS-bypass RDMA using libfabric and the SRD protocol, reducing latency in collectives.
Storage: a tiered hierarchy with local NVMe for hot data, FSx for Lustre for parallel throughput, and Amazon S3 for durable checkpoints and datasets.
On top of that, AWS offers UltraClusters to group thousands of instances on a non-blocking network, and UltraServers that extend NVLink domains across multiple instances, reaching up to 72 GPUs and terabytes of HBM inside a single NVLink domain. That changes the calculus when the bottleneck is traffic leaving the NVLink domain.
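To make those numbers concrete, here is a back-of-the-envelope sizing sketch. The 70B-parameter model and the Adam mixed-precision layout are illustrative assumptions, not figures from AWS's framework, but they show why weights alone can outgrow a single GPU's HBM and why training state has to be sharded across an NVLink domain.

```python
# Illustrative sizing: a hypothetical 70B-parameter dense model.
params = 70e9

# Weights in BF16 (2 bytes per parameter).
weights_bf16_gb = params * 2 / 1e9
print(f"BF16 weights: {weights_bf16_gb:.0f} GB")  # ~140 GB: over an H100's 80 GB, just under an H200's 141 GB

# Mixed-precision training state with Adam:
# BF16 weights (2 B) + FP32 master weights (4 B) + FP32 momentum (4 B) + FP32 variance (4 B)
train_state_gb = params * (2 + 4 + 4 + 4) / 1e9
print(f"Training state (before activations): {train_state_gb:.0f} GB")  # ~980 GB, so it must be sharded

# Per-GPU share when fully sharded (ZeRO-3 / FSDP style) across a 72-GPU NVLink domain.
print(f"Per-GPU share across 72 GPUs: {train_state_gb / 72:.1f} GB")
```

A full checkpoint of that training state is on the order of a terabyte, which is exactly why the storage tiering above and the checkpoint strategy matter as much as the GPUs themselves.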
Orchestration and resource scheduling
When a job needs hundreds or thousands of GPUs, coordinating them by hand is impossible. That's where Slurm and Kubernetes with extensions come in.
Slurm: the classic HPC batch paradigm. It schedules atomic jobs and supports backfill, topology-aware placement and fine-grained GPU control with GRES. On AWS it's deployed with ParallelCluster, or with managed control planes like AWS Parallel Computing Service (PCS) and the Slurm orchestration in SageMaker HyperPod.
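A minimal sketch of the handshake between Slurm and PyTorch: each srun task reads its rank from the environment variables Slurm sets and joins the NCCL process group. The rendezvous address handling is simplified here (a real sbatch script would export MASTER_ADDR from the node list); the Slurm variable names themselves are standard.

```python
import os
import torch
import torch.distributed as dist

# srun launches one process per task and sets these variables for each of them.
rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])  # total tasks in the job
local_rank = int(os.environ["SLURM_LOCALID"]) # rank within the node -> GPU index

# Rendezvous endpoint: assumed to be exported by the sbatch script in a real job;
# the localhost fallback below only works for a single-node run.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# From here, DDP/FSDP and NCCL collectives run on the topology Slurm allocated.
```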
Kubernetes: declarative and powerful for deployment, but native Kubernetes doesn’t guarantee gang admission or topology-aware placement for collectives. That’s why there are extra layers:
Kueue: an admission controller that manages gang scheduling and quotas.
Volcano and NVIDIA KAI Scheduler: replace or extend the scheduler for topology-aware placement and gang scheduling sensitive to NVLink.
Amazon EKS integrates the NVIDIA device plugin and, together with SageMaker HyperPod in EKS mode, adds governance, managed Kueue, checkpointless training, elastic training and auto-resume.
What is checkpointless training? Instead of writing multi-TB checkpoints to shared storage, state is replicated among peers over EFA so it can be recovered after a failure. This reduces the dependence on storage IO when failures are frequent at scale.
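The article doesn't detail HyperPod's internal mechanism, so the sketch below is only an illustration of the general idea: keep a redundant copy of training state on a peer rank instead of on shared storage, using plain torch.distributed point-to-point calls. A real system would also replicate optimizer state, handle membership changes and orchestrate recovery.

```python
import torch
import torch.distributed as dist

def replicate_state_to_peer(model: torch.nn.Module, peer_offset: int = 1) -> None:
    """Illustrative only: exchange parameter copies with a peer rank so the state
    survives a single-node failure without touching shared storage. Assumes the
    model lives on the local GPU and the NCCL process group is initialized."""
    rank = dist.get_rank()
    world = dist.get_world_size()
    peer = (rank + peer_offset) % world      # who keeps my replica
    source = (rank - peer_offset) % world    # whose replica I keep

    for p in model.parameters():
        send_buf = p.detach().clone()
        recv_buf = torch.empty_like(p)
        # Symmetric exchange; over EFA these point-to-point ops go through NCCL.
        ops = [dist.P2POp(dist.isend, send_buf, peer),
               dist.P2POp(dist.irecv, recv_buf, source)]
        for req in dist.batch_isend_irecv(ops):
            req.wait()
        # recv_buf now holds the peer's copy; a real implementation would keep it
        # (for example pinned in host memory) until the next replication round.
```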
ML software stack: from drivers to advanced frameworks
Think of the stack as five layers: drivers and hardware support, runtimes and libraries, communication substrate, ML frameworks, and distributed frameworks for training and inference.
Drivers and runtimes: NVIDIA drivers, GDRCopy for CPU-GPU copies, the EFA driver and Lustre client. CUDA Toolkit 13.x adds support for Blackwell architectures.
Kernels and toolchains: optimizations like FlashAttention, Triton, CuTe and CUTLASS dominate real performance. Many gains come from fused, specialized kernels for attention, layernorm, MoE dispatch and KV-cache.
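A small taste of what "fused kernel" means in practice: PyTorch's scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel when dtype and shapes allow it, instead of materializing the full attention matrix. The shapes below are arbitrary examples.

```python
import torch
import torch.nn.functional as F

# 4 sequences, 16 heads, 2048 tokens, head dimension 128, BF16 on GPU.
q = torch.randn(4, 16, 2048, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# With BF16 inputs and is_causal=True, PyTorch can select the fused
# FlashAttention backend rather than the memory-hungry naive path.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([4, 16, 2048, 128])
```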
Communication: NCCL for collectives with topology-aware algorithms. On AWS, NCCL connects to libfabric via aws-ofi-nccl to use EFA without changing the application. For MoE, all-to-all operations are critical and can dominate step time as expert parallelism grows.
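A minimal sketch of that MoE-style all-to-all. Nothing in the application code knows about EFA: when aws-ofi-nccl is installed, the same collective runs over it. The tensor sizes and the torchrun launch are illustrative assumptions.

```python
import torch
import torch.distributed as dist

# Assumes torchrun already called init_process_group(backend="nccl"), one process per GPU.
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Each rank holds tokens routed to every other rank's experts:
# 1024 tokens of hidden size 4096 per destination rank (illustrative sizes).
send = torch.randn(world * 1024, 4096, device="cuda", dtype=torch.bfloat16)
recv = torch.empty_like(send)

# The MoE dispatch: every rank exchanges an equal slice with every other rank.
dist.all_to_all_single(recv, send)
```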
Point-to-point transfers in serving: NIXL (NVIDIA Inference Xfer Library) unifies movement between HBM, DRAM and storage and integrates with UCX and GPUDirect Storage backends.
Base frameworks: PyTorch and JAX. This series focuses on PyTorch because of its prevalence in OSS. In PyTorch, torch.distributed provides process groups, DDP and FSDP2 (sharding inspired by ZeRO).
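For orientation, here is a minimal sharded-training sketch with the stable FullyShardedDataParallel wrapper. FSDP2's fully_shard API is newer and differs slightly, so treat this as an illustration of ZeRO-style sharding rather than the exact API the series will use; the toy model and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torchrun launched one process per GPU (single node or homogeneous nodes).
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()

# Parameters, gradients and optimizer state are sharded across ranks (ZeRO-3 style);
# full parameters are gathered on demand for each forward/backward.
model = FSDP(model)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optim.step()
```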
High-level frameworks:
Hugging Face Transformers + Accelerate: easy to use, ideal for fine-tuning and moderate-scale scenarios.
NVIDIA Megatron Core and NeMo: tuned for maximum throughput with 3D parallelism and FP8 support.
veRL: designed for RLHF and post-training; lets you mix backends within the same job.
vLLM and SGLang: inference solutions that handle KV cache with paging or techniques like RadixAttention to reuse prefixes and improve batching.
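As a taste of the serving side, a minimal vLLM offline-inference sketch; the model name is just an example of any HF-compatible checkpoint that fits in HBM.

```python
from vllm import LLM, SamplingParams

# Model name is illustrative; swap in whatever checkpoint you actually serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what a KV cache is in one paragraph."], params)

# Continuous batching and the paged KV cache are handled internally by vLLM.
print(outputs[0].outputs[0].text)
```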
Practical result? Choosing the right mix of optimized kernels, communication patterns and parallelism strategy often matters as much as, or more than, picking the fastest GPU.
Observability and operating at scale
Without telemetry you can’t operate large clusters. Observability covers infrastructure, workload and alerts.
Standard stack: Prometheus + Grafana. On AWS, Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG) are the managed versions that remove operational overhead.
Key metrics: DCGM-Exporter for GPU metrics (utilization, HBM, ECC, XID). EFA exposes counters for bytes and retransmits. FSx for Lustre provides throughput and metadata latency. At the application level you export step time, tokens per second and losses.
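A minimal sketch of exporting those application-level metrics with prometheus_client so they land next to the DCGM and EFA series. The metric names, port and the on_step_end hook are illustrative choices, not a standard.

```python
import time
from prometheus_client import Counter, Gauge, start_http_server

# Metric names are illustrative; keep them consistent with your dashboards.
STEP_TIME = Gauge("train_step_seconds", "Wall-clock time of the last training step")
TOKENS = Counter("train_tokens_total", "Total tokens processed")
LOSS = Gauge("train_loss", "Most recent training loss")

start_http_server(8000)  # Prometheus scrapes this port alongside DCGM-Exporter

def on_step_end(step_seconds: float, tokens_in_step: int, loss: float) -> None:
    STEP_TIME.set(step_seconds)
    TOKENS.inc(tokens_in_step)
    LOSS.set(loss)

# Example: report one step so the exporter has something to show.
on_step_end(step_seconds=2.3, tokens_in_step=4 * 8192, loss=1.87)
time.sleep(60)  # keep the process alive long enough to be scraped
```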
A common pattern: check DCGM for GPU health, then EFA/NCCL for collective issues, and finally IO for checkpoint bottlenecks.
Failures you should watch: rising ECC single-bit errors, or XID errors 63, 64, 94 and 95. These often foreshadow serious failures and warrant quick node replacement.
Dashboards and alerts: the GPU Health - Cluster dashboard is a good starting point. Part 5 of the series goes deeper into alert rules and metric retention.
What this means for your project
If you work with large models, the lesson is clear: the three scaling regimes (pre-training, post-training and test-time compute) converge on the same infrastructure requirements. It's not just buying GPUs; it's aligning NVLink domains, EFA, storage and orchestration with your model's communication needs.
Practical questions to ask yourself:
Does your parallelism require heavy all-to-all, or can you stay with data/tensor parallelism?
Do you need checkpointless recovery, or can your pipeline tolerate the IO load of writing and restoring multi-TB checkpoints?
Does your scheduler respect topology-aware placement to minimize network hops?
Understanding these integration points is the foundation for diagnosing bottlenecks and making smart scaling decisions.