·Dashboard ·Research ·Work ·Archive

← wiki

sector src/semi-analysissemi/datacenterstatus/complete 2024-09-04

Multi-Datacenter Training: OpenAI vs Google Infrastructure

Key Points

Leading frontier AI model training clusters scaled to 100,000 GPUs in 2024
300,000+ GPU clusters in development for 2025 deployment
Gigawatt clusters requiring advanced infrastructure with physical construction constraints
OpenAI pursuing distributed multi-datacenter training to compete with Google's capabilities
Stark infrastructure differences between Google's efficient design and competitors' approaches

Google vs Microsoft Infrastructure

Google's AI training campus with 300MW+ power capacity expanding to 500MW next year
Large cooling towers and centralized water systems rejecting ~200MW of heat
Direct-to-chip liquid cooling with liquid-to-liquid heat exchangers for efficiency
Microsoft's largest training cluster lacks liquid cooling, 35% lower IT capacity per building vs. Google

Key Insights

Physical construction timelines limiting datacenter deployment expansion
Liquid cooling critical for high-density GPU clustering and efficiency
Telecom networking and long-haul fiber required for distributed infrastructure
Hierarchical and asynchronous SGD enabling multi-datacenter training approaches
Distributed infrastructure winners include networking and cooling solution providers

Source

File: SA - Multi-Datacenter Training_ OpenAI's Ambitious Plan To Beat Google's Infrastructure.pdf
Location: Dropbox/2. Semi/Datacenter/
Pages: 38
Date: September 4, 2024
Authors: Dylan Patel, Daniel Nishball, Jeremie Eliahou Ontiveros
Publisher: SemiAnalysis
Type: Paid Research

Related

Pages that link here (1)

China Liquid Cooling: AI and Carbon Neutrality Wave

Open in Obsidian