MacFleet: Distributed ML for Apple Silicon
Pool Apple Silicon Macs into a distributed ML training cluster — auto-discovery, adaptive compression, and thermal-aware scheduling over WiFi, Ethernet, or Thunderbolt.
Python · PyTorch · MLX · gRPC · mDNS · Ring AllReduce
View on GitHub ↗
Problem Statement
Apple Silicon Macs have powerful GPU cores but no native way to pool them for distributed ML training. Individual machines sit idle while training bottlenecks on a single GPU. The challenge: turn a collection of heterogeneous Macs into one unified training cluster with zero configuration.
Technical Approach
- Zero-config discovery — nodes find each other via mDNS/Bonjour. macfleet join is the only command needed: no IP addresses, no config files.
- Framework-agnostic core — the communication layer uses only NumPy, never importing PyTorch or MLX. Both engines work through the same pool/network/compression infrastructure.
- Adaptive gradient compression — a scheme is auto-selected per link type: no compression over Thunderbolt 4, TopK 10% + FP16 (~20x) over Ethernet, TopK 1% + FP16 (~200x) over WiFi.
- Heterogeneous scheduling — faster Macs get proportionally larger batches based on GPU core count. The scheduler continuously re-profiles throughput and adjusts for thermal throttling (nominal → 100%, fair → 90%, serious → 70%, critical → 30%).
- Ring AllReduce — bandwidth-efficient N-node gradient synchronization: each node transfers roughly 2(N−1)/N of the gradient per round regardless of cluster size, so per-node communication cost stays nearly flat as nodes are added.
- Dual engine support — native PyTorch (MPS backend) and Apple MLX with identical APIs. One-liner, context manager, or decorator patterns.
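To make the compression arithmetic concrete, here is a minimal NumPy sketch of TopK + FP16 gradient compression. The function names are illustrative, not MacFleet's API, and the quoted ~200x for TopK 1% + FP16 presumably combines 100x sparsification with 2x precision; once uint32 index metadata is counted, the effective ratio is lower:

```python
import numpy as np

def topk_fp16_compress(grad, k_frac):
    """Keep the k largest-magnitude entries as (uint32 index, fp16 value) pairs."""
    flat = grad.ravel()
    k = max(1, int(flat.size * k_frac))
    idx = np.argpartition(np.abs(flat), -k)[-k:]    # O(n) top-k selection
    return idx.astype(np.uint32), flat[idx].astype(np.float16), grad.shape

def topk_fp16_decompress(idx, vals, shape):
    out = np.zeros(int(np.prod(shape)), dtype=np.float32)
    out[idx] = vals                                  # fp16 upcasts on assignment
    return out.reshape(shape)

grad = np.random.default_rng(0).standard_normal((1000, 1000)).astype(np.float32)
idx, vals, shape = topk_fp16_compress(grad, 0.01)    # "WiFi" setting: TopK 1% + FP16
ratio = grad.nbytes / (idx.nbytes + vals.nbytes)     # ~66x once indices are counted
```

The compression is lossy twice over: dropped entries are zeroed on decompression, and kept values lose precision to fp16 rounding, which is why such schemes are reserved for the slowest links.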
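The heterogeneous-scheduling rule above can be sketched as a weighted split: each node's share of the global batch is proportional to its GPU core count scaled by the quoted thermal multiplier. The exact weighting and names here are assumptions, not MacFleet's scheduler:

```python
# Thermal multipliers are the ones quoted above; the proportional
# weighting itself is an illustrative assumption.
THERMAL_SCALE = {"nominal": 1.00, "fair": 0.90, "serious": 0.70, "critical": 0.30}

def split_batch(global_batch, nodes):
    """nodes: {name: (gpu_cores, thermal_state)} -> {name: local_batch}."""
    weights = {n: cores * THERMAL_SCALE[state]
               for n, (cores, state) in nodes.items()}
    total = sum(weights.values())
    return {n: round(global_batch * w / total) for n, w in weights.items()}

shares = split_batch(256, {
    "m2-ultra": (76, "nominal"),   # 76 GPU cores, running cool
    "m2-max":   (38, "serious"),   # throttled to 70% effective weight
    "m1":       (8,  "fair"),      # small node at 90%
})
# shares -> {'m2-ultra': 177, 'm2-max': 62, 'm1': 17}
```

Re-profiling throughput and recomputing these weights each round lets a node that starts throttling shed work without stalling the synchronous gradient exchange.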
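The ring synchronization can be illustrated with a two-phase in-process simulation (scatter-reduce, then allgather). This is a didactic NumPy model of the algorithm, not MacFleet's gRPC implementation:

```python
import numpy as np

def ring_allreduce(node_chunks):
    """node_chunks[i] is node i's gradient split into n chunks.
    Two phases of n-1 ring steps each; afterwards every node
    holds the elementwise sum of all nodes' gradients."""
    n = len(node_chunks)
    # Phase 1 (scatter-reduce): at step s, node i forwards chunk (i - s) % n
    # to its right neighbour, which accumulates it into its own copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            node_chunks[(i + 1) % n][c] += node_chunks[i][c]
    # Now node i owns the fully reduced chunk (i + 1) % n.
    # Phase 2 (allgather): circulate each finished chunk around the ring.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            node_chunks[(i + 1) % n][c] = node_chunks[i][c].copy()

# Four simulated nodes, each with an 8-value gradient split into 4 chunks.
rng = np.random.default_rng(42)
grads = [rng.standard_normal(8).astype(np.float32) for _ in range(4)]
node_chunks = [list(np.split(g.copy(), 4)) for g in grads]
ring_allreduce(node_chunks)
expected = grads[0] + grads[1] + grads[2] + grads[3]
```

Each node sends 2(n−1) chunks in total, about twice the gradient size, independent of how many nodes join the ring — which is what makes the pattern attractive for commodity links like WiFi.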
Results
| Feature | Detail |
|---|---|
| Install | pip install macfleet |
| Discovery | Automatic via mDNS (zero config) |
| Engines | PyTorch (MPS) + Apple MLX |
| Compression | Up to 200x over WiFi |
| Thermal management | Real-time workload adjustment |
| CLI tools | join, status, train, bench, diagnose |
| API patterns | One-liner, context manager, decorator |