MacFleet: Distributed ML for Apple Silicon

Pool Apple Silicon Macs into a distributed ML training cluster — auto-discovery, adaptive compression, and thermal-aware scheduling over WiFi, Ethernet, or Thunderbolt.

Python · PyTorch · MLX · gRPC · mDNS · Ring AllReduce
View on GitHub ↗

Problem Statement

Apple Silicon Macs have powerful GPU cores but no native way to pool them for distributed ML training. Individual machines sit idle while training bottlenecks on a single GPU. The challenge: turn a collection of heterogeneous Macs into one unified training cluster with zero configuration.

Technical Approach

  • Zero-config discovery — nodes find each other via mDNS/Bonjour. macfleet join is the only command needed. No IP addresses, no config files.
  • Framework-agnostic core — the communication layer uses only NumPy, never importing PyTorch or MLX. Both engines work through the same pool/network/compression infrastructure.
  • Adaptive gradient compression — auto-selects based on network: no compression over Thunderbolt 4, TopK 10% + FP16 (~20x) over Ethernet, TopK 1% + FP16 (~200x) over WiFi.
  • Heterogeneous scheduling — faster Macs get proportionally larger batches based on GPU core count. The scheduler continuously re-profiles throughput and adjusts for thermal throttling (nominal → 100%, fair → 90%, serious → 70%, critical → 30%).
  • Ring AllReduce — bandwidth-efficient N-node gradient synchronization: each node transfers a fixed 2(N-1)/N share of the gradient data, so per-node communication cost stays nearly flat as the cluster grows.
  • Dual engine support — native PyTorch (MPS backend) and Apple MLX with identical APIs. One-liner, context manager, or decorator patterns.
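The WiFi compression preset above (TopK 1% + FP16) can be sketched in plain NumPy, in keeping with the framework-agnostic core. The helper names (`compress_topk_fp16`, `decompress`) are illustrative, not MacFleet's actual API:

```python
import numpy as np

def compress_topk_fp16(grad: np.ndarray, k_frac: float):
    """Keep the largest-magnitude k_frac of gradient values, cast to FP16.
    Returns (indices, values, original shape)."""
    flat = grad.ravel()
    k = max(1, int(flat.size * k_frac))
    idx = np.argpartition(np.abs(flat), -k)[-k:]      # top-k by magnitude
    return idx.astype(np.uint32), flat[idx].astype(np.float16), grad.shape

def decompress(idx, vals, shape):
    """Scatter the sparse FP16 values back into a dense FP32 gradient."""
    flat = np.zeros(int(np.prod(shape)), dtype=np.float32)
    flat[idx] = vals.astype(np.float32)
    return flat.reshape(shape)

rng = np.random.default_rng(0)
grad = rng.standard_normal(1000).astype(np.float32)
idx, vals, shape = compress_topk_fp16(grad, 0.01)     # WiFi preset: keep top 1%
restored = decompress(idx, vals, shape)
# wire payload: 10 uint32 indices + 10 fp16 values, vs 4000 bytes of raw fp32
```

Dropped values are simply zeroed here; production top-k schemes typically accumulate the discarded residual locally so it is not lost across steps.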
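The core-count-proportional, thermally scaled batch split can be sketched as a weighted partition. `split_batch` and the node tuples are hypothetical; the thermal multipliers are the ones listed above:

```python
# Thermal multipliers from the scheduler description above
THERMAL_FACTOR = {"nominal": 1.0, "fair": 0.9, "serious": 0.7, "critical": 0.3}

def split_batch(global_batch, nodes):
    """Split a global batch across nodes given as (gpu_cores, thermal_state),
    proportional to GPU core count scaled by thermal pressure."""
    weights = [cores * THERMAL_FACTOR[state] for cores, state in nodes]
    total = sum(weights)
    sizes = [int(round(global_batch * w / total)) for w in weights]
    sizes[0] += global_batch - sum(sizes)   # absorb rounding drift on node 0
    return sizes

# e.g. a 38-core Mac, a 10-core Mac, and a 19-core Mac running thermally "serious"
print(split_batch(256, [(38, "nominal"), (10, "nominal"), (19, "serious")]))
# -> [158, 42, 56]
```

The real scheduler re-profiles measured throughput continuously rather than trusting static core counts; this sketch only shows the proportional-split arithmetic.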
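The ring all-reduce pattern can be illustrated with a single-process NumPy simulation (a sketch of the algorithm only, not MacFleet's networking code). Gradients circulate around the ring twice: a reduce-scatter pass that leaves each node holding one fully summed chunk, then an all-gather pass that distributes those chunks to everyone:

```python
import numpy as np

def ring_allreduce(tensors):
    """Simulate ring all-reduce across len(tensors) 'nodes' in one process.
    Each node i starts with tensors[i] and ends with the elementwise sum."""
    n = len(tensors)
    chunks = [list(np.array_split(t.astype(np.float64), n)) for t in tensors]

    # Reduce-scatter: at step s, node i forwards chunk (i - s) mod n to node i+1,
    # which adds it to its own copy. After n-1 steps each chunk is fully
    # summed at exactly one node.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data

    # All-gather: the fully reduced chunks circulate, overwriting stale copies,
    # until every node holds every summed chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    return [np.concatenate(chunks[i]) for i in range(n)]

grads = [np.full(8, float(i + 1)) for i in range(3)]   # 3 nodes: all 1s, 2s, 3s
synced = ring_allreduce(grads)
# every node ends with the elementwise sum: all values 6.0
```

Because each node only ever exchanges one chunk (1/N of the data) per step over 2(N-1) steps, per-node bandwidth is roughly constant regardless of cluster size.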

Results

Feature              Detail
Install              pip install macfleet
Discovery            Automatic via mDNS (zero config)
Engines              PyTorch (MPS) + Apple MLX
Compression          Up to 200x over WiFi
Thermal management   Real-time workload adjustment
CLI tools            join, status, train, bench, diagnose
API patterns         One-liner, context manager, decorator