·Dashboard·Research·Work·Archive
← wiki
sector src/semi-analysissemi/datacenterstatus/complete 2024-09-04

Multi-Datacenter Training: OpenAI vs Google Infrastructure

Key Points

  • Leading frontier AI model training clusters scaled to 100,000 GPUs in 2024
  • 300,000+ GPU clusters in development for 2025 deployment
  • Gigawatt clusters requiring advanced infrastructure with physical construction constraints
  • OpenAI pursuing distributed multi-datacenter training to compete with Google's capabilities
  • Stark infrastructure differences between Google's efficient design and competitors' approaches

Google vs Microsoft Infrastructure

  • Google's AI training campus with 300MW+ power capacity expanding to 500MW next year
  • Large cooling towers and centralized water systems rejecting ~200MW of heat
  • Direct-to-chip liquid cooling with liquid-to-liquid heat exchangers for efficiency
  • Microsoft's largest training cluster lacks liquid cooling, 35% lower IT capacity per building vs. Google

Key Insights

  • Physical construction timelines limiting datacenter deployment expansion
  • Liquid cooling critical for high-density GPU clustering and efficiency
  • Telecom networking and long-haul fiber required for distributed infrastructure
  • Hierarchical and asynchronous SGD enabling multi-datacenter training approaches
  • Distributed infrastructure winners include networking and cooling solution providers

Source

  • File: SA - Multi-Datacenter Training_ OpenAI's Ambitious Plan To Beat Google's Infrastructure.pdf
  • Location: Dropbox/2. Semi/Datacenter/
  • Pages: 38
  • Date: September 4, 2024
  • Authors: Dylan Patel, Daniel Nishball, Jeremie Eliahou Ontiveros
  • Publisher: SemiAnalysis
  • Type: Paid Research

Related