# Distributed Transformer Training with Horovod
A custom GPT-style transformer trained from scratch on BookCorpus using Horovod across 4x H100 GPUs, from raw data to text generation.
## Why I Built This
Most people fine-tune existing models. I wanted to understand what happens when you train one from scratch, on real data, on real HPC hardware. Not a toy example, but a full decoder-only transformer on 34GB of BookCorpus across 4x H100 GPUs using Horovod.
## How It Works
- Custom transformer architecture built from scratch: 12 layers, 12 attention heads, 768-dimensional embeddings, 3072-dimensional feed-forward, and a 512-token context window. Weight tying between the token embeddings and the output projection reduces parameter count. Sinusoidal positional encoding, pre-norm layer normalization, GELU activations, and causal masking for autoregressive generation.
- 34GB data pipeline: downloaded BookCorpus, tokenized with GPT-2's 50,257-token-vocab tokenizer, split 95/5 train/val, and packed tokens into fixed-length 512-token blocks to minimize padding waste. Stored in Apache Arrow format and served through multi-worker DataLoaders with pinned memory and prefetching.
- Horovod distributed training across 4 NVIDIA H100 GPUs on UMD's Zaratan HPC cluster. Rank-aware operations: only rank 0 saves checkpoints and writes logs. Allreduce averages metrics across workers. Gradient accumulation (2 steps) is synced so Horovod only communicates on actual update steps, not on every micro-batch. Falls back gracefully to single-GPU training when needed.
- H100 optimizations: bfloat16 mixed precision (native Tensor Core support), TensorFloat-32 for cuBLAS, and NCCL P2P plus shared-memory transports for fast intra-node communication. Built against CUDA 12.3, OpenMPI 4.1.5, and NCCL 2.18.1.
- Training setup: AdamW with linear warmup (1000 steps) followed by linear decay, gradient clipping at max norm 1.0, and full checkpoint serialization with resume support. TensorBoard logging, CSV metrics, and rank-aware console output with SLURM job metadata.
- Text generation using top-k and nucleus (top-p) sampling with temperature control.
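The sinusoidal positional encoding mentioned above can be sketched in pure Python (the real model builds this as a PyTorch tensor; `positional_encoding` is my own illustrative name, not a function from the codebase):

```python
import math

def positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal positional encodings as in "Attention Is All You Need".

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)       # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe
```

For this project the table would be built once with `max_len=512` and `d_model=768` and added to the token embeddings.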
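The block-packing step in the data pipeline amounts to concatenating token streams and slicing them into fixed 512-token windows. A minimal sketch (the real pipeline does this over Arrow-backed batches; `pack_into_blocks` is a hypothetical helper name):

```python
from typing import Iterable

def pack_into_blocks(token_ids: Iterable[int], block_size: int = 512) -> list[list[int]]:
    """Concatenate tokens and slice into fixed-length blocks.

    Leftover tokens at the end (fewer than block_size) are dropped,
    so no block ever needs padding.
    """
    buffer = list(token_ids)
    n_blocks = len(buffer) // block_size
    return [buffer[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Example: 1100 tokens -> two full 512-token blocks, 76 leftover tokens dropped
blocks = pack_into_blocks(range(1100), block_size=512)
```

Dropping the remainder instead of padding is what makes every training example a dense 512-token sequence.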
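For reference, a single-node Horovod run like the one described above is typically launched with one process per GPU; `train.py` here is a placeholder script name, and on a SLURM cluster like Zaratan the equivalent would go inside an `sbatch` script:

```shell
# One Horovod process per GPU on a single node with 4x H100
horovodrun -np 4 python train.py
```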
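The linear warmup + linear decay schedule can be expressed as a single scale factor on the base learning rate. A sketch, assuming a hypothetical `total_steps` (the real value depends on dataset size and batch size; only the 1000-step warmup is from the training setup above):

```python
def lr_lambda(step: int, warmup_steps: int = 1000, total_steps: int = 100_000) -> float:
    """Multiplier on the base LR: ramp linearly to 1.0, then decay linearly to 0."""
    if step < warmup_steps:
        # Warmup phase: scale grows from 0 to 1 over warmup_steps
        return step / max(1, warmup_steps)
    # Decay phase: scale shrinks from 1 to 0 over the remaining steps
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))
```

In PyTorch this kind of function is commonly handed to `torch.optim.lr_scheduler.LambdaLR` wrapped around the AdamW optimizer.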
## Results
| Component | Detail |
|---|---|
| Model | 12-layer decoder-only transformer, GPT-2 tokenizer (50,257 vocab) |
| Dataset | BookCorpus — 34GB preprocessed, Apache Arrow format |
| GPUs | 4x NVIDIA H100 (80GB VRAM each) on Zaratan HPC |
| Batch size | 128 effective (16 per GPU x 4 GPUs x 2 accumulation steps) |
| Precision | bfloat16 mixed precision (H100 optimized) |
| Checkpointing | Full state resume — model, optimizer, scheduler, step |
| Monitoring | TensorBoard + CSV metrics + rank-aware logging |
| Generation | Top-k / nucleus sampling with temperature control |
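The generation strategy in the last row combines temperature scaling, top-k filtering, and nucleus (top-p) filtering. A pure-Python sketch of the idea (the actual code operates on PyTorch logit tensors; `sample_next_token` is an illustrative name):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """Sample a token id: temperature scaling, then top-k, then top-p filtering."""
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    # Candidates sorted by descending logit, truncated to the top k
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving candidates (subtract max for stability)
    m = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the smallest prefix whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw a sample
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for idx, p in kept:
        r -= p
        if r <= 0:
            return idx
    return kept[-1][0]
```

Low temperature sharpens the distribution toward greedy decoding; top-k and top-p each truncate the long tail of unlikely tokens before sampling.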