Distributed Transformer Training with Horovod

Custom GPT-style transformer trained from scratch on BookCorpus using Horovod across 4x H100 GPUs, from raw data to text generation.

Python · PyTorch · Horovod · NCCL · SLURM · HuggingFace

Why I Built This

Most people fine-tune existing models. I wanted to understand what happens when you train one from scratch, on real data, on real HPC hardware. Not a toy example, but a full decoder-only transformer on 34GB of BookCorpus across 4x H100 GPUs using Horovod.

How It Works

  • Custom transformer architecture built from scratch. 12 layers, 12 attention heads, 768-dimensional embeddings, 3072-dim feed-forward, 512-token context window. Weight tying between token embeddings and the output projection to reduce parameters. Sinusoidal positional encoding, pre-norm layer normalization, GELU activations, and causal masking for autoregressive generation.
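As a rough sanity check, the configuration above lands close to GPT-2 small in size. A minimal parameter-count estimate (the function name and accounting layout are my own assumptions; the exact total depends on implementation details such as bias usage):

```python
def param_count(vocab=50257, d=768, ff=3072, layers=12, tied=True):
    """Rough parameter count for a 12-layer, 768-dim decoder-only transformer."""
    emb = vocab * d                       # token embeddings (shared with output head when tied)
    attn = 4 * (d * d + d)                # Q, K, V, and output projections, with biases
    mlp = (d * ff + ff) + (ff * d + d)    # two feed-forward layers, with biases
    ln = 2 * (2 * d)                      # two LayerNorms (scale + shift) per block
    block = attn + mlp + ln
    head = 0 if tied else vocab * d       # weight tying removes the output projection
    final_ln = 2 * d
    return emb + layers * block + final_ln + head

print(f"{param_count():,} parameters")   # roughly 124M with weight tying
```

Note that sinusoidal positional encoding adds no parameters, and weight tying saves about 38.6M (50,257 x 768) relative to an untied output head.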

  • 34GB data pipeline. Downloaded BookCorpus, tokenized with GPT-2's 50K-vocab tokenizer, split 95/5 train/val, packed tokens into fixed-length 512-token blocks to minimize padding waste. Stored in Apache Arrow format. Multi-worker DataLoaders with pinned memory and prefetch.

  • Horovod distributed training across 4 NVIDIA H100 GPUs on UMD's Zaratan HPC cluster. Rank-aware operations: only rank 0 saves checkpoints and writes logs. Allreduce for metric averaging. Gradient accumulation (2 steps) synced so Horovod only communicates on actual update steps, not micro-batches. Falls back gracefully to single-GPU when needed.
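A minimal sketch of the rank-aware setup with the single-GPU fallback, assuming Horovod's standard `hvd.init()` / `hvd.rank()` API; the accumulation helper is an illustrative stand-in for the real training loop:

```python
try:
    import horovod.torch as hvd
    hvd.init()
    rank, world_size = hvd.rank(), hvd.size()
except ImportError:
    rank, world_size = 0, 1  # graceful single-GPU fallback

is_rank0 = rank == 0  # only rank 0 checkpoints and logs

ACCUM_STEPS = 2

def is_update_step(micro_step):
    """True on every ACCUM_STEPS-th micro-batch, when the optimizer actually steps.

    With Horovod, passing backward_passes_per_step=ACCUM_STEPS to
    hvd.DistributedOptimizer defers the allreduce to these update steps,
    so gradients are not communicated on intermediate micro-batches.
    """
    return (micro_step + 1) % ACCUM_STEPS == 0
```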

  • H100 optimizations. bfloat16 mixed precision (native Tensor Core support), TensorFloat-32 for cuBLAS, NCCL P2P and shared memory for fast intra-node communication. CUDA 12.3, OpenMPI 4.1.5, NCCL 2.18.1.
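A configuration sketch of those settings using standard PyTorch APIs (`torch.autocast` and the TF32 backend flags); the NCCL P2P/shared-memory behavior is controlled by environment variables outside this snippet:

```python
import torch

# Allow TensorFloat-32 for fp32 matmuls and convolutions on H100.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# bfloat16 mixed precision via autocast; H100 Tensor Cores support bf16
# natively, so no loss-scaling is needed (unlike fp16).
def forward_loss(model, batch):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(batch)
```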

  • Training setup. AdamW with linear warmup (1000 steps) + linear decay. Gradient clipping at max norm 1.0. Full checkpoint serialization with resume support. TensorBoard logging, CSV metrics, and rank-aware console output with SLURM job metadata.
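The warmup-plus-decay schedule can be expressed as a `LambdaLR`-style multiplier on the base learning rate (the total step count here is an assumed placeholder, not a figure from the training run):

```python
def lr_multiplier(step, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to 1.0 over warmup_steps, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

This multiplier is what `torch.optim.lr_scheduler.LambdaLR` expects: the optimizer's base LR is scaled by `lr_multiplier(step)` at each step.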

  • Text generation using top-k and nucleus (top-p) sampling with temperature control.
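A self-contained sketch of combined top-k / nucleus filtering with temperature, written in pure Python for clarity (the actual implementation presumably operates on logit tensors; function name and signature are illustrative):

```python
import math
import random

def sample_token(logits, k=0, p=1.0, temperature=1.0, rng=random):
    """Sample an index from logits after temperature scaling,
    top-k truncation (k > 0), and nucleus/top-p truncation (p < 1.0)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # stable softmax
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda t: -t[1])
    if k > 0:
        probs = probs[:k]                     # keep k most likely tokens
    if p < 1.0:
        kept, cum = [], 0.0
        for item in probs:                    # smallest set with mass >= p
            kept.append(item)
            cum += item[1]
            if cum >= p:
                break
        probs = kept
    # renormalize over the surviving tokens and sample
    z = sum(pr for _, pr in probs)
    r = rng.random() * z
    for idx, pr in probs:
        r -= pr
        if r <= 0:
            return idx
    return probs[-1][0]
```

With `k=1` (or a very small `p`) this degenerates to greedy decoding, which is a handy way to check the filtering logic.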

Results

| Component | Detail |
| --- | --- |
| Model | 12-layer decoder-only transformer, GPT-2 tokenizer (50,257 vocab) |
| Dataset | BookCorpus, 34GB preprocessed, Apache Arrow format |
| GPUs | 4x NVIDIA H100 (80GB VRAM each) on Zaratan HPC |
| Batch size | 128 effective (16 per GPU x 4 GPUs x 2 accumulation steps) |
| Precision | bfloat16 mixed precision (H100 optimized) |
| Checkpointing | Full state resume: model, optimizer, scheduler, step |
| Monitoring | TensorBoard + CSV metrics + rank-aware logging |
| Generation | Top-k / nucleus sampling with temperature control |