AI Infrastructure & Datacenter
Overview / thesis
AI infrastructure is the full stack powering AI workloads — silicon, memory, networking fabric, data center shell, power delivery, and the grid behind it. It is the largest capex cycle in tech history, and the one domain where nearly every semiconductor sub-sector converges. The investable thesis is not "AI is big." It is that the bottleneck keeps moving, the spend keeps compounding, and the value is captured at the chokepoints — chips, memory, networking, power — far more reliably than at the model or application layer where it is created. The cleanest expression of the trade remains the oldest one: sell picks and shovels in a gold rush.
The thesis in one paragraph
Training clusters now exceed 100K GPUs and are heading toward million-GPU scale by 2027 (xAI's Colossus roadmap targets 1M GPUs by late 2026). But the story has shifted from "more compute" to "what constrains the compute." NVIDIA's GB200 NVL72 rack consumes ~120kW, roughly 6-8x a conventional 15-20kW CPU rack, which moves the binding constraint stack to memory (HBM supply from SK Hynix / Samsung / Micron), networking (800G → 1.6T optical interconnects), power delivery (48V → 800V HVDC rack redesign), and raw electricity (JPM projects ~100GW of US datacenter demand, roughly the size of the entire current US nuclear fleet). On top of that, agentic AI is rewriting inference economics: longer sessions, heavier CPU loads (50-90% of latency is tool processing, not LLM inference), and fundamentally different memory and storage patterns. The recurring conclusion across every source is the same: memory bandwidth, not compute, is the binding constraint on inference, and the companies that sit on the constraint capture the value.
Why it matters — the so-what
This is the most capital-intensive technology buildout since the internet, compressed into roughly three years. The numbers anchor everything downstream:
- Hyperscalers (the Big Five — Amazon, Google, Meta, Microsoft, Oracle) have committed $660–690B in aggregate capex for 2026, roughly double 2025's ~$350-370B, with ~75% ($450B+) AI-related. Amazon ~$200B; Google $175-185B; Meta $115-135B; Microsoft $120B+; Oracle ~$50B.
- NVIDIA printed $215.9B in FY2026 revenue (up 65% YoY) with a $320B backlog for FY2027 — more revenue than the GDP of most countries, almost entirely on AI silicon.
- OpenAI raised $110B at a $730B valuation (Feb 2026) and targets ~$600B cumulative compute spend by 2030.
- The total AI accelerator market is on track to blow past $1 trillion by 2030.
The "so what" for an investor: the value chain is a cascade, not a sum. NVIDIA's L1 chip revenue is funded by hyperscaler L3 capex, which is funded by enterprise L7 AI spending. The most value is created at the application layer but the most value is captured at chip design, where NVIDIA holds near-monopoly pricing power (71% gross margin on hardware) protected by the CUDA moat. That asymmetry is the core reason the picks-and-shovels framing holds.
Sizing the opportunity — TAM, growth, and the capital math
Total AI and accelerator markets (from the LLM industry primer, multiple analysts):
| Market | 2025 | 2026E | 2030E | CAGR |
|---|---|---|---|---|
| Total AI market | $391B | ~$500B+ | ~$3.5T | 30.6% |
| AI accelerator (training + inference) | ~$210B | ~$280B+ | $1T+ | ~35% |
| AI inference market | $106B | ~$130B+ | $255B | 19.2% |
| HBM | $38B | $54.6-58B | $100B (2028) | — |
| Optical interconnects (AI DC) | ~$10B | — | $31B (2033) | 15.3% |
| Agentic AI software | $7.92B | — | $155B (BofA, 2030) | — |
| LLM API revenue | $8.4B | ~$15B+ | — | — |
| AI coding tools | $7.4B | $12-15B | — | ~40% |
The single most important macro frame is the AI capital math — Doug O'Laughlin / SemiAnalysis's "$6 trillion hyperscaler capital thesis." Combined hyperscaler cash from operations is ~$450B annually in 2024. The deployable-capital ceiling is reconstructed from three multipliers: (1) full FCF reinvestment (~$450B/yr, vs. the current 40-50%, heading to 70-75% of cash flows by 2026 per iCapital); (2) off-balance-sheet SPV structures (Meta's $27B+ Hyperion deal with Blue Owl/PIMCO/BlackRock; Microsoft's BlackRock GIP partnership; the $40B Aligned Data Centers acquisition at 70% fund-level leverage); and (3) 1x corporate leverage (only ~7% of industry capex is debt-funded today vs. 32% during the telecom boom of 2000). The total deployable capital lands at ~$5.6-6.7 trillion over 5-7 years. This converges with major institutional estimates: McKinsey $6.7T by 2030, JPMorgan $5-7T, iCapital $5.3T through 2030, Morgan Stanley $2.9T through 2028. O'Laughlin's split: ~60% ($3.1T) to compute hardware, ~25% ($1.3T) to power infrastructure, ~15% ($0.8T) to construction.
The demand drivers — what's actually pulling the spend
- Enterprise AI adoption is in the 3rd inning. 84% of corporations are positive on AI but most are still experimenting; only ~6% of organizations have agents in production, 64% plan to. The addressable market is every knowledge-worker workflow in every company — a 10+ year runway resembling cloud computing circa 2012.
- The agentic shift from chat to work. This is the structurally underappreciated angle. Agentic workloads need balanced compute (GPU + CPU + memory + storage + network), not just GPUs: a Georgia Tech/Intel study found tool processing on CPUs is 50-90% of total latency. A single agentic task burns 50,000-500,000 tokens and costs $5-8 vs. fractions of a cent for a chatbot query, a cost multiplier of several hundred times or more per interaction that scales infrastructure demand non-linearly. Lisa Su (AMD, March 2026): CPU demand "has actually far exceeded my expectations." NVIDIA's purpose-built Vera CPU is the strongest validation of the CPU renaissance.
- Inference overtaking training. Deloitte estimates inference was 50% of AI compute in 2025, jumping to 67% in 2026; by 2030 the inference market could be 10x training. The datacenter is becoming a token factory — silicon, electricity, and water in, intelligence out — and per-token cost is the new unit economic.
- Inference-time compute scaling. Reasoning models (o1/o3, extended thinking) decouple capability from training cost, shifting GPU demand toward inference and generating massive "thinking" token volumes.
- Cost deflation feeding consumption (Jevons Paradox). Capability-adjusted token costs are falling 5-10x per year (Epoch AI: ~50x/yr on some benchmarks; B200 hits $0.02/M tokens vs. $20/M for equivalent GPT-4 performance in 2022). Cheaper inference expands usage faster than it shrinks per-task spend — the net direction is up.
- Sovereign AI. UAE and Saudi Arabia spending $100B+; France backing Mistral; every major economy now has an AI strategy. This fragments the market but expands the total pie.
The central debate — is this a bubble?
This is the load-bearing tension across the sources, and the vault holds both sides honestly.
The bear / bubble case (the "$600B reckoning"): ~$400-500B poured into AI infrastructure in 2024 against only ~$100B in actual AI-services revenue (Sequoia/David Cahn's accounting). Cahn calculates AI operators need $600B in annual revenue just to earn back investment, 6x current revenue generation, and the gap has quadrupled from $125B to ~$500B since late 2023. JPMorgan puts the bar at $650B in annual revenue for a "mere 10% return." The framing is the railroad bubble: 70,400 miles of track added 1880-1890, yet by 1892 only 44% of railroad shares paid a dividend and a third of US mileage hit receivership by 1895. The financing structure rhymes: private credit deploying ~$50B/quarter into AI datacenters, covenant erosion, PIK income at 11.7% of BDC loans, 33% recovery rates, pension funds holding 31% of private-credit assets. Oracle "broke the pattern" by leveraging up (500% debt-to-equity vs. Microsoft's 30%), risking a shift from "disciplined, cash-flow-funded race" to "debt-fueled arms race." Utilization undercuts the demand story: GPU utilization runs 60-70%, Meta's Llama 3 training hit just 38% model-flop utilization, 42% of enterprise AI projects are abandoned before production, and only 5.4% of US businesses report regular AI use. Gartner predicts 40% of agentic AI projects will be cancelled by 2027. Goldman's Jim Covello: "Overbuilding things the world doesn't have use for typically ends badly."
The bull / "this time partly differs" case: Unlike the dot-com era, the companies doing the spending (Microsoft, Google, Amazon, Meta) are among the most profitable in history, funding capex from advertising and cloud cash flows rather than venture capital — only ~7% of industry capex is debt-funded. The technology works: adoption curves are steep, enterprises are signing real contracts, and physical datacenter capacity is genuinely undersupplied (North American vacancy 1.9-2.8%, Northern Virginia <1%, 78-98% of construction pre-committed). The honest synthesis the vault lands on: some valuations (especially private — OpenAI at $730B on 33% gross margins burning ~$17B/yr) will prove too high, but the underlying technology shift is real and durable. The question is not whether AI matters but whether current prices already reflect the upside.
The constraint stack — where the bottleneck actually sits
The recurring through-line is that the binding constraint keeps migrating up the stack, and each migration creates an investable layer:
- Compute was the original constraint; now NVIDIA holds 80-90% of accelerator revenue with the CUDA moat, while custom silicon (Google TPU, AWS Trainium, Microsoft Maia, Meta MTIA — designed via Broadcom ~70% and Marvell ~20% of the ASIC market) carves out inference share.
- Memory bandwidth is the constraint that every technical source converges on. KV cache grows linearly with context, batch, and depth — a 70B model at 128K context needs ~80GB just for KV cache, nearly a full H100. HBM production consumes 3x the wafer capacity of standard DRAM, tightening the entire memory market; DRAM demand is growing ~35% vs. ~23% supply, the widest gap in decades. The cost of an output token is fundamentally model-bytes ÷ memory-bandwidth ÷ batch-size — which is why output tokens cost 4-8x input tokens and why HBM capacity (not just bandwidth) is the lever on inference economics.
- Networking is the bottleneck after compute — the 800G → 1.6T → 3.2T optical transition, InfiniBand vs. Ethernet (Ethernet now >two-thirds of AI cluster switch sales via the Ultra Ethernet Consortium), and the unresolved CPO-vs-pluggable question at 1.6T.
- Power is the ultimate binding constraint. Microsoft disclosed $80B of Azure orders it cannot fulfill because it can't get power. Power transformer lead times have stretched to 115-210 weeks (2-4 years) with prices up 60-80% since 2020. Global datacenter electricity consumption is projected to double to ~945 TWh by 2030, with AI pushing datacenters to ~4.5% of global electricity generation. US critical IT capacity must roughly triple between 2023 and 2027.
Datacenter economics and the financing moat
A theme that recurs and is easy to lose: building AI clouds is a financing problem, not just a technology problem. Neoclouds face 12-18% debt rates and a fundamental duration mismatch — 30-year assets funded against 1-3 year customer rental contracts. The Dell-vs-Supermicro case crystallizes it: both buy identical NVIDIA baseboards, Supermicro builds cheaper, but Dell Financial Services ($8.4B FY24 originations, $10.5B portfolio) offers rates 2-3 points below third parties, so Dell wins on total cost of ownership despite a higher sticker price. The emerging model is "a bank with a server-manufacturing arm attached," analogous to Toyota being both the largest automaker and the largest car lender.
Cross-domain stakes (relevant to GGGI / green-finance angle)
The buildout collides directly with energy and decarbonization. You cannot add ~100GW of US datacenter demand without addressing carbon intensity, and Southeast Asia — Singapore (60% of SEA capacity, but a moratorium pushing demand to Malaysia/Johor), the ASEAN grid, transformer supply chains — is the next frontier and the next bottleneck simultaneously. Power generation with a low-carbon mix, grid interconnection backlogs, and transformer lead times are the gating factors, which is precisely where green-finance and corporate-PPA structuring intersect the AI story.
Where this sits in the wiki
Company-specific theses live on their ticker pages: NVDA, AMD, INTC, AVGO, TSM, ANET, MRVL, ALAB, AIXA, SOITEC, 268A, 5802, 6855, 6834, 6754, 6777. The primary research anchors are SemiAnalysis (Dylan Patel — data center economics, HBM, power, neocloud playbook; 260-article local mirror at semianalysis/index) and FundaAI (agentic AI, memory, infrastructure). Layer-by-layer detail (compute, memory, networking, datacenter, power, equipment) is covered in the other sections of this page.
How it works
The AI data center is a token factory. Silicon, electricity, and water go in; intelligence comes out. To understand the unit economics you have to reason from the hardware up, not from the API price list down. The product physically works through a stack — silicon, memory, interconnect, power delivery, thermal, the data center shell — and every layer has a distinct physics, cost structure, and bottleneck. The single most important reframing in the sector: memory bandwidth, not compute, is the binding constraint on inference, and power, not chips, is the binding constraint on the buildout. Everything below follows from those two facts.
The transformer is the engine the whole buildout exists to run
Every frontier model — GPT, Claude, Gemini, Llama, DeepSeek, Grok — is a transformer, the architecture from the 2017 "Attention Is All You Need" paper (now 173,000+ citations). The entire data center, chip, networking, cooling, and nuclear-power buildout exists to train and run transformers. Understanding the math is the difference between knowing NVIDIA sells GPUs and understanding why an H100 is worth $25,000–$40,000.
The core innovation is attention, which replaced the sequential RNN/LSTM (slow, can't parallelize, forgets long sequences) by letting every token look at every other token at once. Mechanically, each token is projected into three vectors via learned weight matrices (W_Q, W_K, W_V): a Query ("what am I looking for"), a Key ("what do I contain"), and a Value ("what information I carry"). The formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
You take each token's Query, dot it against every token's Key, divide by √d_k (the square root of the key dimension, which keeps the softmax from saturating and killing the gradient), softmax into a probability distribution, then take a weighted sum of the Values. The QK^T step produces an N × N attention matrix where N is sequence length, which is why attention scales quadratically with context length, the single fact that drives billions in optimization R&D. Doubling context quadruples attention compute. Multi-head attention runs this many times in parallel with different projections (GPT-3 uses 96 heads), each learning a different relationship type. Order is injected separately via positional encoding: the original sinusoidal scheme, modern RoPE (rotates Q/K vectors so the dot product depends only on relative distance m−n, zero extra params, used by Llama/Qwen/Mistral), or ALiBi (linear distance penalty on attention scores).
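A minimal sketch of the attention formula above for a single head, written in plain Python so the N × N structure is visible. It is illustrative only: it omits the causal mask, the learned projections W_Q/W_K/W_V, and multi-head splitting, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q and K are N x d_k lists of lists, V is N x d_v.
    Returns (output, weights), where weights is the N x N attention matrix.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # One dot product per (query token, key token) pair, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights


def score_count(n_tokens):
    """Number of Query-Key dot products: one per pair of tokens."""
    return n_tokens * n_tokens
```

Because `score_count` grows with the square of the sequence length, going from 1,024 to 2,048 tokens multiplies the number of scores by four, which is the quadratic cost described above.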
Each transformer block stacks attention with a Feed-Forward Network (a two-layer MLP that expands dimension ~4×, applies a nonlinearity like GeLU/SwiGLU, projects back down) — the FFN holds roughly two-thirds of all parameters and acts as the model's factual "memory bank." Plus residual connections (skip connections that add input to output, essential for training deep nets — GPT-3 has 96 layers) and layer normalization (modern models use Pre-Norm, applied before the sub-layer, for training stability). Stack 32–128 blocks and you have an LLM.
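The two-thirds figure can be checked with back-of-envelope weight counts. The sketch below assumes a classic two-matrix FFN with 4× expansion and ignores biases, layer norms, and embeddings; the helper names are hypothetical, and SwiGLU variants use three matrices with a different expansion factor, so their exact split differs.

```python
def block_weight_counts(d_model, ffn_mult=4):
    """Approximate weights in one transformer block (biases and norms ignored)."""
    attn = 4 * d_model * d_model            # W_Q, W_K, W_V and the output projection
    ffn = 2 * ffn_mult * d_model * d_model  # up-projection and down-projection
    return attn, ffn


def ffn_share(d_model, ffn_mult=4):
    """Fraction of a block's weights that sit in the FFN."""
    attn, ffn = block_weight_counts(d_model, ffn_mult)
    return ffn / (attn + ffn)


def stacked_weights(d_model, n_layers, ffn_mult=4):
    """Total block weights for a stack of identical layers."""
    attn, ffn = block_weight_counts(d_model, ffn_mult)
    return n_layers * (attn + ffn)
```

With GPT-3's published hidden size of 12,288 and 96 layers, `stacked_weights(12288, 96)` comes to roughly 174B, close to the 175B headline figure, and `ffn_share` is 8/12, or two-thirds.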
Building a model is a multi-stage industrial process
Like steelmaking: raw materials (data), a furnace (compute), a controlled process (training), finishing (alignment).
Tokenization converts text to sub-word units (typically 3–4 characters) via Byte Pair Encoding: start with characters, then iteratively merge the most frequent adjacent pair for ~32K–100K merges. GPT-4 uses a ~100K vocabulary (tiktoken), Llama 3 uses 128K (a tiktoken-based BPE tokenizer; Llama 2 used a 32K SentencePiece vocabulary), Gemma 256K. Larger vocabularies shorten sequences but bloat the embedding table. Sub-word tokenization is why models struggle with spelling and arithmetic ("123 + 456" can tokenize such that the model never sees individual digits), and because you pay per token, it directly sets API costs.
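The merge loop described above can be sketched on a toy corpus. This is a character-level illustration with hypothetical function names, not the byte-level implementation used by tiktoken or SentencePiece, and it skips pre-tokenization and special tokens.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair in {symbol_tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} corpus."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words
```

On a corpus such as `{"low": 5, "lower": 2, "newest": 6, "widest": 3}`, two merges are enough to fuse the frequent suffix "est" into a single symbol. Production vocabularies simply run the same loop tens of thousands of times.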
Pre-training consumes 90%+ of compute via a single objective: next-token prediction. Process trillions of tokens, predict t(n+1), penalize with cross-entropy. Capabilities — grammar, world knowledge, reasoning, code — emerge from scale rather than being programmed. The data pipeline is itself a moat: URL filtering → language ID → boilerplate removal → quality scoring → MinHash deduplication → safety filtering → domain mixing, drawing from Common Crawl (petabytes), The Pile (825 GB), RedPajama V2 (100T+ raw tokens), FineWeb2 (20 TB, showed 10% curated beats 100% raw). Chinchilla (2022) showed the compute-optimal ratio is ~20 tokens per parameter — most models had been too large and undertrained.
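A small worked example of the Chinchilla ratio, using GPT-3's figures from this page (175B parameters, 300B tokens). The 6 × parameters × tokens estimate of training FLOPs is a widely used rule of thumb and is an assumption here, not something the sources above state; the function names are hypothetical.

```python
def compute_optimal_tokens(params, tokens_per_param=20):
    """Training tokens suggested by the ~20 tokens-per-parameter ratio."""
    return params * tokens_per_param


def observed_ratio(params, tokens):
    """Tokens actually seen per parameter."""
    return tokens / params


def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens


gpt3_params = 175 * 10**9
gpt3_tokens = 300 * 10**9
print(observed_ratio(gpt3_params, gpt3_tokens))              # about 1.7 tokens per parameter
print(compute_optimal_tokens(gpt3_params))                   # 3.5 trillion tokens
print(approx_training_flops(gpt3_params, gpt3_tokens))       # about 3.15e23 FLOPs
```

At roughly 1.7 tokens per parameter against a target near 20, GPT-3 illustrates the "too large and undertrained" pattern.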
Post-training turns a raw text-predictor into something usable: SFT (supervised fine-tuning on instruction/response pairs teaches format) → RLHF (reward model on human preferences, optimized via PPO — works but unstable and expensive) or its simpler successor DPO (Direct Preference Optimization, reformulates RLHF as supervised learning on preferred/rejected pairs — no reward model, no RL infrastructure; by 2025, 70% of enterprises use RLHF or DPO) → Constitutional AI (Anthropic — model critiques its own outputs against principles) → RL with Verifiable Rewards (math/code with checkable answers, popularized by DeepSeek-R1 for reasoning). Distillation trains a small student on a large teacher's soft probabilities (how Llama 3 8B punches above its weight from the 405B model).
Training cost has exploded 287,000× from the 2017 transformer (~$670) to frontier runs. GPT-3 (175B params, 300B tokens) cost ~$4.6M; GPT-4 (~1.8T est., ~13T tokens) cost $78–100M+; current frontier runs cost ~$1B per Dario Amodei, with $10B runs expected by 2027. At FP16 each parameter is 2 bytes, so a 70B model needs 140 GB just for weights — more than a single H100's 80 GB HBM, forcing multi-GPU deployment and quantization.
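The weight-memory arithmetic generalizes to other precisions. The sketch below uses decimal gigabytes and counts weights only (no KV cache, activations, or framework overhead), so the GPU counts are lower bounds; the helper names are hypothetical.

```python
import math

# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions, precision):
    """Weight footprint in GB (1 GB = 1e9 bytes) for a model of the given size."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billions, precision, hbm_gb=80):
    """Fewest GPUs whose combined HBM can hold the weights alone."""
    return math.ceil(weight_gb(params_billions, precision) / hbm_gb)


for precision in ("fp16", "int8", "int4"):
    print(precision, weight_gb(70, precision), min_gpus_for_weights(70, precision))
```

For a 70B model this gives 140 GB at FP16 (at least two 80 GB GPUs), 70 GB at INT8, and 35 GB at INT4.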
Architecture variants reshape the hardware bill
Dense vs. Mixture of Experts (MoE). A dense transformer activates all parameters per token. MoE replaces the single FFN with N expert FFNs (8–256) plus a lightweight router that scores each token and activates only the top-K (typically 2). You get a huge model's capacity at a small model's inference cost. GPT-4, Gemini, DeepSeek-V3, and Grok all use MoE. Mixtral 8×7B: 46.7B total / 12.9B active. DeepSeek-V3: 671B total / 37B active, 256 experts, top-8. Llama 4 Maverick: 400B total / 17B active, 128 experts. The hard part is load balancing — poorly balanced expert utilization can cause 82% GPU idle time as popular experts bottleneck. Google's "Expert Choice" (experts select tokens) guarantees balance; DeepSeek's auxiliary-loss-free balancing solves it without distorting training loss. MoE has a paradoxical memory profile: you must load all experts into memory (high capacity) but activate few per pass (low compute). At low batch sizes it's bandwidth-efficient; at high batch sizes, different tokens route to different experts until you read every expert anyway — making MoE more bandwidth-intensive than dense models of equal compute. MoE's all-to-all communication (tokens travel to the GPU hosting their expert) forced infrastructure redesigns — AWS rebuilt Trainium3 networking from 3D Torus to a switched fabric specifically because all-to-all collectives perform poorly on mesh topologies.
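A minimal sketch of top-K routing and the load-balancing problem it creates. It assumes one common formulation (softmax over the selected experts' logits); real routers differ in where normalization happens and in how they enforce balance, and all names here are hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts for one token and normalise their gates."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def expert_load(token_logits, k=2):
    """Count tokens sent to each expert; a skewed result means idle experts."""
    load = [0] * len(token_logits[0])
    for logits in token_logits:
        for expert, _gate in top_k_route(logits, k):
            load[expert] += 1
    return load


def active_fraction(active_params, total_params):
    """Share of parameters that do work for any single token."""
    return active_params / total_params
```

With the figures above, `active_fraction(37, 671)` is about 5.5% for DeepSeek-V3 and `active_fraction(12.9, 46.7)` about 28% for Mixtral. If `expert_load` shows a few experts receiving most tokens, the GPUs hosting the rest sit idle, which is the bottleneck described above.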
State Space Models (SSMs) attack the N×N quadratic problem directly. Instead of all-pairs interactions, SSMs process sequences through a learned continuous-time dynamical system: linear time complexity, constant memory (a fixed-size state instead of a growing KV cache), ~5× higher throughput. Mamba (2023) added selectivity — input-dependent state transitions, a learned forget gate. The pragmatic answer is hybrid architectures interleaving Mamba and attention layers (AI21 Jamba, NVIDIA Nemotron 3) — cheap linear processing for most of the sequence, attention sprinkled in where all-pairs interaction is genuinely needed.
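The constant-memory property can be seen in a toy discrete recurrence. This is a scalar illustration under simplifying assumptions, not Mamba: real SSMs carry a vector or matrix state, and Mamba makes the parameters depend on the input. The function name and parameters `a`, `b`, `c` are hypothetical.

```python
def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    The only thing carried between steps is the single state h, so memory
    stays constant and work grows linearly with sequence length.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


print(ssm_scan([1.0, 1.0, 1.0], a=0.5, b=1.0, c=1.0))  # [1.0, 1.5, 1.75]
```

Contrast this with attention, where every new token must read Keys and Values for all previous tokens, so the stored state grows with the sequence.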
Multi-modal models bolt a vision encoder (a ViT splitting images into 14×14 patches, each treated as a token) and a projection adapter onto a standard decoder backbone — the transformer barely changes; it's all sequences of vectors.
Inference splits into two phases with opposite hardware needs — this is the origin story of everything
LLM inference is two fundamentally different operations:
- Prefill processes the entire input prompt in parallel (matrix-matrix multiplication). It is compute-bound — GPU FLOPS are the bottleneck while expensive HBM bandwidth sits idle. Latency metric: Time to First Token (TTFT). GDDR7 is sufficient.
- Decode generates output tokens one at a time, each requiring the model to stream all weights plus the entire KV cache from memory (matrix-vector multiplication). It is memory-bound — HBM bandwidth determines speed, not compute. Latency metric: Time Per Output Token (TPOT). HBM is essential.
| Characteristic | Prefill | Decode |
|---|---|---|
| Bottleneck | Compute (FLOPS) | Memory bandwidth |
| Parallelism | High (all tokens at once) | Low (one token at a time) |
| GPU compute utilization | Can approach 100% | Severely underutilized |
| HBM bandwidth usage | Underutilized | Fully stressed |
| Optimal memory | GDDR7 sufficient | HBM essential |
Running both phases on one GPU wastes resources — during prefill expensive HBM idles; during decode expensive compute idles — yielding 70–90% hardware underutilization. The answer is disaggregated serving: separate prefill and decode onto different GPU pools. DistServe (UCSD) proved 4.48× higher throughput or 10.2× tighter latency; Microsoft Splitwise showed 1.4× throughput at 20% lower cost; Moonshot AI (Kimi) runs 100B+ tokens/day disaggregated. NVIDIA validated it by building Rubin CPX, the first GPU designed for prefill only: 128 GB GDDR7 (~50% cheaper per GB than HBM), no NVLink, ~25% of the manufacturing cost of the decode-oriented Rubin R200 (288 GB HBM, 20.5 TB/s, 14.4 Tb/s NVLink). SemiAnalysis estimated ~$0.90/hour of wasted TCO per R200 used for prefill due to idle HBM. The VR NVL144 CPX rack combines 72 R200 GPUs with 144 Rubin CPX chips at ~370 kW.
The KV cache is the beating heart of inference economics
In a causal decoder the Key and Value vectors of prior tokens never change, so they're cached (the KV cache), cutting the attention work for each new token from O(n²) to O(n) and delivering ~5× inference speedups. But the cache grows linearly with sequence length, layer count, the number of K/V heads, and batch size, and during decode the GPU reloads the entire growing cache every step. GPUs hit only ~23% compute utilization during inference because bandwidth, not FLOPS, is the limit. More FLOPs alone won't help.
The KV cache footprint is large. For a 70B model with GQA at 128K context: ~80 GB — nearly a whole H100, before model weights load. Scaling:
| Context length | KV cache (70B model) |
|---|---|
| 4K tokens | ~2.5 GB |
| 32K tokens | ~20 GB |
| 128K tokens | ~80 GB |
| 1M tokens | ~625 GB |
Attention mechanisms evolved specifically to shrink this. MHA (2017, separate Q/K/V per head) has the largest footprint. MQA (Google, single shared K/V) cut it 8–32× but degraded quality. GQA (Grouped-Query — query heads grouped, each group sharing K/V; e.g. 32 query heads, 8 K/V groups) is the industry standard: ~99% of MHA quality at 4–8× KV reduction (Llama 3, Mistral, Mixtral, Falcon 40B+). Sliding Window Attention (Mistral) limits attention to the last W tokens (~4,096), plus attention sinks (initial tokens act as attention dumps), enabling stable generation to ~4M tokens at constant memory. Multi-Head Latent Attention (MLA) from DeepSeek compresses K/V into low-rank latent vectors before caching — 93% KV cache reduction while improving quality versus MHA. If MLA replicates broadly it could disrupt the GQA equilibrium.
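The scaling behind these variants follows from one sizing formula. The sketch below assumes an illustrative 70B-class shape (80 layers, head dimension 128, 64 query heads, FP16 cache); those values are assumptions, and absolute totals move with them. The context-length table above implies roughly 0.6 MB per token, which sits between this sketch's GQA and MHA settings.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2, batch=1):
    """KV cache size in bytes.

    The leading 2 counts one Key and one Value vector per token,
    per layer, per K/V head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * batch


LAYERS, HEAD_DIM = 80, 128  # assumed 70B-class shape, for illustration only

per_token_mha = kv_cache_bytes(1, LAYERS, kv_heads=64, head_dim=HEAD_DIM)
per_token_gqa = kv_cache_bytes(1, LAYERS, kv_heads=8, head_dim=HEAD_DIM)
per_token_mqa = kv_cache_bytes(1, LAYERS, kv_heads=1, head_dim=HEAD_DIM)

print(per_token_mha // per_token_gqa)  # 8: GQA with 8 groups vs 64-head MHA
print(per_token_mha // per_token_mqa)  # 64: MQA shares a single K/V head
```

The function is linear in every argument, so doubling context or batch doubles the footprint, and switching the cache from FP16 to FP8 (`bytes_per_value=1`) halves it.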
Why output tokens cost 4–8× input tokens — and how batching saves the economics
The pricing asymmetry (OpenAI 4×, Anthropic 5×, Gemini 2.5 Pro 8×) is hardware physics, not margin-stacking. Input tokens process during prefill (parallel, high arithmetic intensity, efficient tensor-core use). Output tokens generate during decode at arithmetic intensity near 1 FLOP per byte — tensor cores idle while the GPU waits on HBM. For a 70B FP16 model (140 GB), each decode step streams 140 GB through an H100's 3.35 TB/s = 41.8 ms/token (~24 tokens/sec). The GPU is a $25,000 space heater waiting on memory.
At $2.50/hour, batch-size-1 decode costs $29 per million output tokens — the raw hardware floor. Batching is the only economical path: below saturation, adding requests to a decode step is nearly free (load weights once, apply to B tokens). Batch 64 drops cost to $0.45/M tokens; batch 256 to $0.11/M. This is why providers sell at $10–15/M output and still earn margin.
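The figures above can be reproduced with a few lines of arithmetic. The sketch assumes the decode step is purely bandwidth-bound on the weights, that step time does not grow with batch size (true only below saturation), and that KV cache reads are ignored; function names are hypothetical.

```python
def decode_ms_per_token(model_gb, hbm_tb_per_s):
    """Milliseconds to stream every weight once: the batch-1 decode floor."""
    bandwidth_gb_per_s = hbm_tb_per_s * 1000
    return model_gb / bandwidth_gb_per_s * 1000


def cost_per_million_tokens(ms_per_step, gpu_dollars_per_hour, batch=1):
    """Dollar cost of one million output tokens at a given decode batch size."""
    seconds = ms_per_step / 1000 * 1_000_000 / batch
    return seconds / 3600 * gpu_dollars_per_hour


step_ms = decode_ms_per_token(140, 3.35)  # 70B FP16 at H100 bandwidth
for batch in (1, 64, 256):
    print(batch, round(cost_per_million_tokens(step_ms, 2.50, batch), 2))
```

This gives about 41.8 ms per token, roughly $29 per million tokens at batch 1, $0.45 at batch 64, and $0.11 at batch 256, matching the numbers in the text.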
The central tension: KV cache memory directly caps batch size. Each concurrent request needs its own KV cache. On two H100s (160 GB, ~20 GB free after weights) at 2K context you max out at batch 3–4. The H200's extra 61 GB (141 vs 80 GB) is not a luxury — it directly enables 2–4× larger batches, linearly cutting cost/token. This is why NVIDIA charges a premium for capacity, not just bandwidth.
| GPU | HBM BW | HBM capacity | Decode latency (70B FP16) | Max batch @ 4K | Cost/M tokens at max batch |
|---|---|---|---|---|---|
| A100 | 2.0 TB/s | 80 GB | 70 ms | ~4 (2 GPUs) | ~$7.50 |
| H100 | 3.35 TB/s | 80 GB | 42 ms | ~4 (2 GPUs) | ~$4.50 |
| H200 | 4.8 TB/s | 141 GB | 29 ms | ~15 (2 GPUs) | ~$0.65 |
| B200 | 8.0 TB/s | 192 GB | 17.5 ms | ~25 (2 GPUs) | ~$0.20 |
Quantization compounds multiplicatively. FP16→INT8 halves model size (140→70 GB), which halves decode latency and roughly doubles the memory left for KV cache: about a 4× cost reduction from one precision change. FP8 KV cache (native on Hopper/Blackwell) halves cache memory at under 1% accuracy loss; NVIDIA's NVFP4 on Blackwell halves it again. INT4 gets a 70B model to 35 GB (fits consumer GPUs). The roofline model formalizes this: the H100's ridge point is ~295 FLOPs/byte (990 TFLOPS ÷ 3.35 TB/s); prefill runs at ~95 FLOPs/byte (decent utilization), while decode at batch 1 plummets to ~1–8 FLOPs/byte (deeply memory-bound). You need batch >~150 to push decode FFN above the ridge on H100. Critically, attention kernel arithmetic intensity stays ~0.5–1 FLOPs/byte regardless of batch size (KV cache scanning destroys memory locality, L2 hit rates drop 74–82% from prefill to decode), creating a throughput ceiling FFN batching can't break. For MoE (DeepSeek 256 experts, top-8) the saturation batch balloons to ~3,840, which is far harder to hit and is why MoE inference economics differ from dense.
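The roofline arithmetic is short enough to write out. The sketch uses the H100 figures quoted above (990 TFLOPS, 3.35 TB/s) and treats the roofline as the simple minimum of the compute and bandwidth ceilings; the function names are hypothetical.

```python
def ridge_point(peak_tflops, hbm_tb_per_s):
    """Arithmetic intensity (FLOPs per byte) where compute and bandwidth balance."""
    return peak_tflops / hbm_tb_per_s


def attainable_tflops(intensity, peak_tflops, hbm_tb_per_s):
    """Roofline ceiling: limited by bandwidth below the ridge, by compute above it."""
    return min(peak_tflops, intensity * hbm_tb_per_s)


def is_memory_bound(intensity, peak_tflops, hbm_tb_per_s):
    """True when the kernel sits to the left of the ridge point."""
    return intensity < ridge_point(peak_tflops, hbm_tb_per_s)


H100 = (990, 3.35)
print(ridge_point(*H100))            # about 295.5 FLOPs/byte
print(attainable_tflops(1, *H100))   # batch-1 decode: about 3.35 TFLOPS
print(attainable_tflops(95, *H100))  # prefill at ~95 FLOPs/byte: about 318 TFLOPS
```

At 1 FLOP per byte the ceiling is well under 1% of peak, which is the "space heater" regime. Only workloads whose intensity clears the ridge can use the full 990 TFLOPS.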
Cache hit rates and token warehousing turn ephemeral compute into durable inventory
A KV cache hit occurs when a request shares a prefix with cached states — those tokens skip prefill entirely. Anthropic prices cache reads at 0.1× base input (90% discount) because a hit eliminates the prefill FLOPS; cache writes cost 1.25× base (you pay to compute and store). Hit-rate drivers: system prompt reuse (dominant), multi-turn conversations, RAG querying the same docs, few-shot templates. DeepSeek reported a 56.3% hit rate with on-disk caching; well-structured prompts exceed 87%. Implementations: SGLang's RadixAttention (token-level trie, provably optimal hit rates), vLLM's Automatic Prefix Caching (16-token block granularity). PagedAttention (vLLM, borrowing OS virtual-memory paging) solved fragmentation that previously wasted 60–80% of KV cache memory, dropping waste below 4% and enabling 2–4× more concurrent requests.
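A toy sketch of block-granular prefix matching and the resulting input price. It is not vLLM's or SGLang's implementation (those hash blocks or walk a radix tree across many cached sequences); it compares one request against one cached sequence, ignores the cache-write premium, and uses hypothetical names and a hypothetical base price.

```python
def shared_prefix_blocks(cached_tokens, request_tokens, block_size=16):
    """Count leading full blocks of the request that match the cached sequence."""
    blocks = 0
    limit = min(len(cached_tokens), len(request_tokens))
    while (blocks + 1) * block_size <= limit:
        start, end = blocks * block_size, (blocks + 1) * block_size
        if cached_tokens[start:end] != request_tokens[start:end]:
            break
        blocks += 1
    return blocks


def prefill_saved(cached_tokens, request_tokens, block_size=16):
    """Fraction of the request's prompt tokens that can skip prefill."""
    if not request_tokens:
        return 0.0
    hit_tokens = shared_prefix_blocks(cached_tokens, request_tokens, block_size) * block_size
    return hit_tokens / len(request_tokens)


def blended_input_price(base_price, hit_rate, read_multiplier=0.1):
    """Average input price per token when a share of tokens are cache reads."""
    return base_price * ((1 - hit_rate) + hit_rate * read_multiplier)
```

A 40-token shared system prompt yields only two reusable 16-token blocks (32 tokens), because partial blocks do not count. That is one reason stable, front-loaded prompt structure raises hit rates. At a 56.3% hit rate and a 0.1× read price, the blended input price is about 49% of base.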
Token warehousing decouples KV cache from local GPU memory. Today most providers "re-manufacture" KV cache every request even when the same prompt ran 10 seconds ago on another GPU; evicted caches (typically after 5–15 min) force recomputation. WEKA trademarked "Token Warehouse" (March 2025) for NVMe-backed persistent KV storage, claiming 96–99% hit rates for agentic workloads, 75× faster prefill, 4.2× throughput, using GPUDirect Storage/RDMA to bypass the CPU at up to 252 GB/s/node. NVIDIA formalized it (Jan 2026) as the Inference Context Memory Storage Platform (ICMSP) on BlueField-4 DPUs: a "G3.5" Ethernet-attached flash tier between host DRAM and networked storage, claiming 5× tokens/sec and 5× better power efficiency, with 12 vendors (including Dell, HPE, Pure Storage, VAST, and WEKA) building on it. The emerging memory hierarchy:
| Tier | Medium | Latency | Capacity | Purpose |
|---|---|---|---|---|
| SRAM | On-chip L1/L2 | Sub-ns | ~50 MB | Active attention compute |
| HBM | GPU memory | Nanoseconds | 80–288 GB | Hot KV cache, model weights |
| DRAM | Host CPU memory | Microseconds | 1–2 TB | Warm KV cache overflow |
| NVMe (ICMSP) | Local/networked flash | Microseconds | 100s of TB | Persistent KV cache store |
| Object storage | Networked | Milliseconds | Petabytes | Cold/archival KV cache |
The full inference cost formula:
Cost/token = [(1 − hit_rate) × prefill_cost + decode_cost] / batch_size
where prefill_cost ∝ input length × model FLOPs, decode_cost is dominated by model_bytes / HBM_bandwidth, and batch_size is capped by (GPU_memory − model_weights) / KV_cache_per_request. Every optimization maps to one term. SemiAnalysis: prefill consumes ~80% of GPU cycles in many workloads, so DeepSeek's 56% hit rate removes ~45% of total GPU cycles (0.80 × 0.563), a direct capex cut. Software is transitioning from near-zero marginal cost to high marginal cost because every API call burns real silicon time. A GB200 NVL72 (~$5M) can reportedly generate ~$75M in DeepSeek-R1 token revenue (15× the hardware cost), but only at high utilization. Capability-adjusted token prices fall 5–10× per year, and ~50× per year on some benchmarks (Epoch AI; B200 hits ~$0.02/M tokens versus $20/M for equivalent GPT-4 performance in 2022), while consumption explodes faster.
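The cost formula and its batch-size cap translate directly into code. This is a sketch under the formula's own simplifications; the function names are hypothetical, and the 5 GB-per-request KV figure used in the example is an assumption chosen for illustration.

```python
def max_batch(gpu_memory_gb, model_weights_gb, kv_cache_gb_per_request):
    """Largest batch that fits: free memory divided by per-request KV cache."""
    free = gpu_memory_gb - model_weights_gb
    return max(0, int(free // kv_cache_gb_per_request))


def cost_per_token(hit_rate, prefill_cost, decode_cost, batch_size):
    """Cost/token = [(1 - hit_rate) * prefill_cost + decode_cost] / batch_size."""
    return ((1 - hit_rate) * prefill_cost + decode_cost) / batch_size


def gpu_cycles_saved(prefill_share, hit_rate):
    """Share of total GPU cycles removed when cache hits skip prefill."""
    return prefill_share * hit_rate


print(max_batch(160, 140, 5))               # two 80 GB GPUs, 140 GB of weights -> 4
print(round(gpu_cycles_saved(0.80, 0.563), 2))  # about 0.45
```

Each lever maps to one argument: caching raises `hit_rate`, faster memory or quantization lowers `decode_cost`, and more HBM capacity or a smaller KV cache raises the `max_batch` ceiling that `batch_size` cannot exceed.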
Agentic AI rebalances the rack toward CPU, memory, networking, and storage
Training was GPU-only. Simple inference was GPU-weighted. Agentic AI — autonomous systems that reason through multi-step problems, call tools, execute code, coordinate with other agents — is fundamentally different. A single agentic task consumes 50,000–500,000 tokens across dozens of LLM calls and hundreds of tool invocations (versus a few hundred tokens for a chatbot), costing $5–8 per complex task versus fractions of a cent. The Georgia Tech/Intel paper (arXiv:2511.00739) measured the key fact: tool processing on CPUs accounts for 50–90% of total latency in agentic workloads (up to 90.6%). The LLM calls are the only GPU-heavy part; tool calling, code execution (sandboxed containers), memory retrieval (vector DBs — Pinecone, Milvus, Weaviate, Chroma — via RAG), orchestration, API communication, and multi-agent message passing all run on CPU, system RAM, networking, and storage.
The agent's six components: the LLM backbone (GPU), tool use (CPU orchestration: JSON parsing, HTTP, timeouts), memory (short-term context window + long-term vector DB), planning (Chain-of-Thought, Tree-of-Thought, Reflexion, each of which multiplies LLM calls), code execution (CPU containers), and multi-agent orchestration (Orchestrator-Worker patterns, CPU + networking). The ReAct loop (Reason→Act→Observe, repeated) means dozens of LLM calls per task. The CPU-to-GPU ratio shifts from training (hundreds of GPUs per handful of CPUs) toward 1:1 or higher; some orchestration-heavy workloads are purely CPU-bound. Latency profile: a single LLM call ~800ms, an orchestrator-worker Reflexion flow 10–30s, complex multi-agent tasks minutes to hours. NVIDIA validated this by building the Vera CPU specifically for agentic workloads, the strongest possible confirmation that GPUs alone aren't sufficient. What an agentic CPU needs: high core count (parallel agents), high single-thread performance (critical-path latency), large caches and fast memory (state management), and strong I/O / PCIe lanes (constant API and storage hits), a list that reads like the AMD EPYC spec sheet. Company-level CPU/GPU positioning lives on the AMD, INTC, and other ticker pages.
Power consumption escalation forced a rack-level redesign
The physical driver of the entire infrastructure shift is power density. A CPU/storage server draws ~1 kW; an AI server now eclipses 10 kW. An NVIDIA H100 has a 700W TDP versus <200W for the most common data center CPU (Intel Skylake/Cascade Lake). Conventional CPU racks deliver 15–20 kW; next-gen AI accelerator racks require >200 kW at the rack level — a 10× increase. The GB200 NVL72 rack consumes ~120 kW; the Rubin CPX rack ~370 kW.
The electrical fundamentals dictate the architecture. Power dissipation in resistive elements follows P = R × I², so transporting power at low voltage / high current creates large I²R losses. The fix: transport at high voltage / low current, then step down as close to the silicon as possible. The grid-to-chip path runs grid (hundreds of thousands of volts AC) → transformers → PSUs (AC to DC) → Voltage Regulator Modules (final step-down to the ~1V the silicon needs). VRMs comprise capacitors (smooth delivery, handle transients), inductors (resist current spikes), and power stages (MOSFETs + drivers). The shift to 48V DC racks (Google drove the first adoption in 2017) and emerging 800V HVDC (Computex 2025) reduces conduction losses at these power levels. Power delivery is now a significant BoM line with real design-win competition: Vicor was the 2017 leader but was replaced by MPS (Monolithic Power Systems) as the H100 GPU power supplier, with Delta, Renesas, Infineon, and ADI all making share gains. Component-level winners live on ticker pages.
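A short worked example of the I²R argument, as a sketch under stated assumptions: the 120 kW rack figure is from the text, while the 0.1 mΩ distribution-path resistance is a hypothetical value chosen only to make the arithmetic visible. For a fixed power P delivered at voltage V, the current is I = P / V, so the loss scales with 1 / V².

```python
def conduction_loss_w(power_w, bus_voltage_v, resistance_ohm):
    """I^2 * R loss when power_w is delivered over a bus at bus_voltage_v."""
    current_a = power_w / bus_voltage_v
    return resistance_ohm * current_a ** 2


RACK_POWER_W = 120_000   # GB200 NVL72 rack, per the text
ASSUMED_BUS_OHM = 1e-4   # hypothetical 0.1 milliohm distribution path

loss_48v = conduction_loss_w(RACK_POWER_W, 48, ASSUMED_BUS_OHM)
loss_800v = conduction_loss_w(RACK_POWER_W, 800, ASSUMED_BUS_OHM)

print(f"48 V bus:  {RACK_POWER_W / 48:,.0f} A, {loss_48v:,.2f} W lost")
print(f"800 V bus: {RACK_POWER_W / 800:,.0f} A, {loss_800v:,.2f} W lost")
print(f"loss ratio: {loss_48v / loss_800v:,.0f}x")
```

Whatever resistance is assumed, the ratio between the two cases is (800 / 48)², roughly 278×, which is the reason the rack bus voltage keeps rising.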
Thermal: air cooling can't dissipate 120 kW, so liquid moves in
At >200 kW/rack, air cooling is physically inadequate. The GB200 thermal stack is liquid-based: integrated heat spreader → cold plate (liquid interface extracting heat off the die) → inner manifold/CDM → rack manifold → CDU. Two architectures: L2A (liquid-to-air, CDU rejects heat to room air) and L2L (liquid-to-liquid, to a facility loop). Quick disconnects (QDs) enable serviceability. Two-phase cooling is the next-gen technology on the roadmap.
Networking and optics become a critical path, not just a bus
As clusters scale to 100K+ GPUs (heading toward million-GPU by 2027) the network determines training and inference performance. Three interconnect tiers: NVLink/NVSwitch (NVIDIA's proprietary in-rack GPU-to-GPU fabric — NVLink 5.0 in Blackwell at 1.8 TB/s per GPU, NVLink C2C at 1.8 TB/s for CPU-GPU, a hard moat element), InfiniBand (ultra-low latency for GPU-to-GPU training, NVIDIA via Mellanox), and Ethernet (cheaper, more flexible — the Ultra Ethernet Consortium released its 1.0 spec June 2025, and Ethernet now exceeds two-thirds of AI cluster switch sales; Broadcom's Tomahawk 6 "Davidson" hits 102.4 Tbps on 3nm). MoE's all-to-all traffic and disaggregated serving's KV-cache transfer (a single 512-token request on a large model moves ~1.13 GB; 10 req/sec needs 90+ Gbps sustained between phases) make networking demand structural, not incidental, and favor RDMA/InfiniBand for latency.
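The KV-cache transfer figure above is plain unit conversion, sketched here assuming decimal units (1 GB = 10^9 bytes). The function name is illustrative.

```python
def sustained_gbps(bytes_per_request, requests_per_s):
    """Sustained link bandwidth in Gbps needed to ship KV cache between phases."""
    return bytes_per_request * 8 * requests_per_s / 1e9


KV_BYTES_PER_REQUEST = 1.13e9  # ~1.13 GB for a 512-token request, per the text

rate = sustained_gbps(KV_BYTES_PER_REQUEST, 10)
print(f"10 req/s needs {rate:.1f} Gbps sustained")

# The requirement is linear in request rate.
for rps in (10, 50, 100):
    print(f"{rps:>3} req/s -> {sustained_gbps(KV_BYTES_PER_REQUEST, rps):,.0f} Gbps")
```

Because the requirement grows linearly with request rate and with prompt length, prefill-to-decode traffic becomes a first-order sizing input rather than background noise.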
The optical interconnect is the physical layer underneath. The 800G → 1.6T transition (current 800G uses PAM4 at 200G/lane; 1.6T needs more lanes or new modulation) drives explosive transceiver and connector demand — each speed jump (400G→800G→1.6T→3.2T) requires more connector units per rack, a volume tailwind independent of market share. The GB200 BoM includes transceivers, DSPs (signal equalization), single-mode and multi-mode optics, ACC chips (signal-integrity retiming), and DAC/ACC copper cable assemblies. The big unresolved technical question is CPO (co-packaged optics) vs. pluggable — pluggable keeps winning each generation; analysis suggests CPO economics don't flip until 3.2T. Supply chain is geographically split: USA designs chips, China manufactures modules, Taiwan assembles. The AI-DC optical market is projected to grow from ~$10B (2025) to ~$31B by 2033 (~15.3% CAGR). Specific connector/transceiver/laser names live on ticker pages.
Memory is the tightening bottleneck across the whole stack
HBM (high-bandwidth memory: stacked DRAM dies co-packaged beside the GPU) is the scarcest resource. The market runs ~$38B (2025) → ~$54–58B (2026) → ~$100B (2028), a three-player oligopoly (SK Hynix ~62%, Micron ~21%, Samsung ~17%) with all three sold out through 2026. HBM commands a 3–5× price premium over standard DDR and consumes 3× the wafer capacity of standard DRAM per gigabyte, which squeezes DDR5 supply just as agentic CPU deployments spike DDR5 demand. DRAM demand is growing ~35% versus ~23% supply, the widest gap in decades. TSMC's CoWoS advanced-packaging capacity and TSV conversion remain the gating bottlenecks for HBM stacking, alongside bonders (Hanmi/ASMPT/Besi) and inspection (Camtek). The disaggregation trend creates a second memory market: prefill-optimized chips (Rubin CPX) use cheaper GDDR7 instead of HBM because prefill underutilizes HBM bandwidth, which is a recovery path for Samsung. Flash enters via persistent KV cache (token warehousing) and "LLM in Flash" inference. CXL memory pooling, declared dead two years ago, is returning as memory demands grow.
Data center economics: critical IT power, the utility constraint, and the buildout's binding limit
Critical IT power is the usable electrical capacity at the data center floor for compute, servers, and networking — it excludes cooling and power-delivery overhead (captured by PUE). Global DC critical IT power surges from 49 GW (2023) to 96 GW (2026), of which AI consumes ~40 GW; the growth rate jumped from a 12–15% CAGR to 25%. US critical IT capacity must roughly triple from 2023–2027. AI is projected to push data centers to 4.5% of global electricity generation by 2030 (global DC electricity ~doubling to 945 TWh).
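The growth rate quoted above can be checked from the two endpoints. The sketch below also shows how PUE (total facility power divided by IT power) turns critical IT power into grid draw; the PUE value of 1.3 is a hypothetical assumption, not a figure from the source.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


it_growth = cagr(49, 96, 2026 - 2023)  # GW of global critical IT power
ai_share = 40 / 96                     # AI portion of 2026 critical IT power

ASSUMED_PUE = 1.3  # hypothetical; PUE = total facility power / IT power
facility_gw = 96 * ASSUMED_PUE

print(f"CAGR 2023-2026: {it_growth:.1%}")
print(f"AI share of 2026 critical IT power: {ai_share:.0%}")
print(f"Facility power at PUE {ASSUMED_PUE}: {facility_gw:.0f} GW")
```

The endpoints reproduce the ~25% CAGR, and the PUE line is a reminder that the grid has to supply more than the critical IT number.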
Power is the binding constraint, and the bottleneck is the transformer. Microsoft disclosed $80B in unfillable Azure orders purely because it can't get enough power. Building DC power means stepping grid voltage (operators connect at 11kV or 220kV) down through substations to 480V at the data hall, and that requires transformers whose lead times have stretched catastrophically: medium transformers now run 115–130 weeks (Hitachi's book stretches to 130 weeks) and large substation/GSU units 120–210 weeks (2.3–4 years), against pre-pandemic lead times of 30–60 weeks. Transformer prices have risen 60–80% since January 2020, driven by grain-oriented electrical steel (nearly doubled) and copper (+40%). The underlying technology is unchanged in 50 years; it's a manpower and capacity problem, and "building a new factory takes four years," so supply can't outrun demand. Hitachi Energy is investing $6B and hiring 15,000. This is why siting logic matters: training is latency-insensitive and can be deployed wherever power is available, while inference is distributed and enormous in volume. It is also why the SEA buildout hits power-grid walls (Singapore's 60 MW/year cap and DC moratorium pushing demand to Malaysia/JB). Utility-scale transformer demand alone is projected to reach ~$116B by 2032; transformers have ~40-year service lives, so legacy replacement compounds the crunch.
The OEM/ODM split shapes the systems layer: ODMs (Quanta, FII/Foxconn, Inventec, Wistron, Wiwynn, ZT Systems) build for hyperscalers at ~2–3% margins because hyperscalers need minimal service; OEMs (Dell, Supermicro, HPE, Lenovo) serve enterprise, neocloud, and sovereign buyers who need full service.
The GPU cloud / neocloud business model and its financing physics
GPU clouds are far simpler than traditional clouds — workloads are homogeneous (LLM training, high-volume inference, diffusion inference all run well on the same H100 configuration), so they need few config options and networking is rarely the cost bottleneck (small relative to GPU cost). The structural problem is duration mismatch: DC assets have 30+ year lives funded with long-term capital, colocation tenants sign 15-year leases, but neoclouds lock customers for only 1–3 year server rentals — and contract terms compressed from 3-year-with-prepayment (2022–23 shortage) to 6-month-to-1-year deals (2024 as supply improved), creating refinancing and pricing-pressure risk.
The decisive insight: making AI clouds is a question of financing, not technology. Debt rates run 12%+ standard, up to 18% for unproven neoclouds without customer contracts. Financing preference order: vendor financing (best) > prominent lenders > equity (worst, 20%+ hurdle rates) — with customer prepayment ideal. This is why Dell beats Supermicro despite a worse product: both buy identical NVIDIA H100 8-GPU baseboards, and Supermicro is the more efficient manufacturer (server costs $5,000–$10,000 less, higher gross margin), but Dell Financial Services (a captive lender — $8.4B FY2024 originations, $10.5B portfolio) offers rates 2–3% vs. the 12–18% third-party rates, making Dell's total cost of ownership lower. TCO components — server acquisition, financing, power, colocation, networking, thermal — are dominated by financing for smaller players, where access becomes binary (get a DFS loan or no creditor will lend). The emerging paradigm: an AI server vendor is "a bank with a server manufacturing arm attached," exactly like the auto industry (Toyota is both the largest automaker and the largest car lender). Project-level economics often model <2-year paybacks, which is what drives the aggressive buildout. Provider-specific lender maps (CoreWeave/Blackstone/Magnetar, Crusoe, APLD/Macquarie, Nebius) live on ticker pages.
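A minimal sketch of why the financing rate can outweigh the hardware price gap. The $250,000 server price and the 36-month term are hypothetical. The $5,000–$10,000 price gap and the 12% third-party rate come from the text, and the captive rate is read here as roughly 3%, which is an assumption about how the quoted figure should be interpreted. The payment formula is the standard fixed-payment amortizing loan.

```python
def monthly_payment(principal, annual_rate, months):
    """Standard fixed-payment (amortizing) loan formula."""
    r = annual_rate / 12
    if r == 0:
        return principal / months
    return principal * r / (1 - (1 + r) ** -months)


def financed_cost(server_price, annual_rate, months=36):
    """Total cash paid for one server over the life of the loan."""
    return monthly_payment(server_price, annual_rate, months) * months


# Hypothetical prices: only the gap between them comes from the text.
cheaper_server = 250_000                 # assumed price from the leaner vendor
captive_server = cheaper_server + 7_500  # midpoint of the $5,000-$10,000 gap

third_party_total = financed_cost(cheaper_server, 0.12)  # low end of 12-18%
captive_total = financed_cost(captive_server, 0.03)      # assumed captive rate

print(f"third-party financed: ${third_party_total:,.0f}")
print(f"captive financed:     ${captive_total:,.0f}")
```

Under these assumptions the pricier server is cheaper in total cash terms, because interest dwarfs the hardware discount. Raising the third-party rate toward 18% widens the gap further.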
The capital math: a $6 trillion ceiling and a $600 billion revenue gap
The buildout's scale is unprecedented. Hyperscaler cash from operations was ~$450B in 2024; the Big Five (Amazon, Google, Meta, Microsoft, Oracle) are guided to $660–690B aggregate capex in 2026, ~75% AI-related and roughly double 2025. The deployment ceiling reconstructs to ~$6 trillion from three sources: full FCF reinvestment (~$450B/yr × ~7 years ≈ $3.15T) + off-balance-sheet SPV structures (Meta's Hyperion: $27B SPV debt with Meta retaining 20% equity and receiving $3B cash at close, the debt rated A+ and invisible on Meta's books; Microsoft's BlackRock/GIP partnership; Oracle's Stargate $18B project loan) + 1× corporate leverage (today only 7% of industry capex is debt-funded versus 32% during the 2000 telecom boom). McKinsey models $6.7T by 2030; iCapital $5.3T with cash flows covering only ~$1.5T; JPMorgan $5–7T needing $1.5T of IG bonds. By component: ~60% compute hardware (~$3.1T), ~25% power (~$1.3T), ~15% DC construction (~$0.8T).
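The arithmetic behind these figures can be laid out explicitly. This is a sketch that only recombines numbers quoted above; the text does not say which total the component percentages apply to, so the last step simply reports the base that each quoted dollar figure implies.

```python
CFO_PER_YEAR_B = 450  # hyperscaler cash from operations, $B per year
YEARS = 7
CEILING_T = 6.0       # headline deployment ceiling, $T

reinvested_t = CFO_PER_YEAR_B * YEARS / 1000
remaining_t = CEILING_T - reinvested_t  # left for SPVs plus corporate leverage

# (share of total, quoted dollars in $T) for each component
split = {
    "compute hardware": (0.60, 3.1),
    "power": (0.25, 1.3),
    "DC construction": (0.15, 0.8),
}
component_sum_t = sum(dollars for _, dollars in split.values())

print(f"reinvested cash flow: ${reinvested_t:.2f}T")
print(f"left for SPVs and leverage: ${remaining_t:.2f}T")
print(f"component dollars sum to: ${component_sum_t:.1f}T")
for name, (share, dollars) in split.items():
    print(f"  {name}: implies a ${dollars / share:.2f}T base")
```

The component dollars sum to about $5.2T rather than $6T, so the percentage split appears to be taken against a narrower base than the headline ceiling.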
Against this sits the revenue gap. Sequoia's David Cahn computes that AI infrastructure operators need ~$600B in annual revenue to earn back costs, six times current revenue, a gap that tripled from $125B since late 2023. The method: NVIDIA DC revenue × 2 (GPUs are ~half of DC cost) × 2 again (for the gross margin GPU buyers need to break even). J.P. Morgan independently calculates $650B in annual AI revenue for a "mere 10% return." Sequoia's earlier framing implied ~$600B of annual AI spending needs corresponding application revenue, but only OpenAI generates significant AI-native revenue (and at 33% gross margins). The railroad parallel is structural: 70,400 miles of track added 1880–1890 yet only 44% of railroad shares paid dividends by 1892; social returns ~43% but railroads captured ~8%; debt-to-equity rose from 0.62:1 (1875–79) to 1.58:1 (1885–89); one-third of US railroad mileage was in receivership by 1895. The honest read: utilization tells a softer story than the 1.9–2.8% DC vacancy rates suggest. GPU utilization runs 60–70%, Meta's Llama 3 training hit 38% model-flop utilization, and ~42% of enterprise AI projects are abandoned before production. Whether Jevons Paradox (cost deflation → more total usage) closes the gap is the central bull/bear question. Financing-system cascade risk is detailed in ai-infra-financing-bubble.
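The two doublings in the method above can be written as a small function. This is a sketch: the $150B NVIDIA data-center run rate is not stated in the text and is simply the input that the $600B result implies, and the parameter names are illustrative.

```python
def required_ai_revenue(nvidia_dc_revenue_b,
                        gpu_share_of_dc_cost=0.5,
                        end_user_gross_margin=0.5):
    """Annual AI revenue ($B) needed to pay back the implied data-center spend."""
    # Step 1: GPUs are about half of total data-center cost, so gross up.
    total_dc_cost_b = nvidia_dc_revenue_b / gpu_share_of_dc_cost
    # Step 2: the buyer of that capacity needs its own gross margin on top.
    return total_dc_cost_b / (1 - end_user_gross_margin)


IMPLIED_NVIDIA_RUN_RATE_B = 150  # backed out from the $600B figure, not sourced

needed = required_ai_revenue(IMPLIED_NVIDIA_RUN_RATE_B)
print(f"required annual AI revenue: ${needed:.0f}B")
```

Both multipliers are assumptions of the method: a lower GPU share of data-center cost or a higher target margin pushes the required revenue up.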
The value chain and where margin is captured
Money cascades through a seven-layer stack (chip design → memory & interconnect → cloud/compute → training tools → model layer → inference/serving → applications). Value is created at the application layer but captured at chip design, because NVIDIA's CUDA software ecosystem is the real lock-in — competitors match specs on paper, but ripping CUDA out of a production pipeline is a multi-quarter engineering project. Margin and concentration by layer:
| Layer | Gross margin | Concentration | Barrier to entry |
|---|---|---|---|
| Foundry (TSMC) | 50–55% | Oligopoly, TSMC dominant | Extreme ($20B+/fab) |
| GPUs/accelerators | 70–78% | Near-monopoly (NVIDIA 80–90%) | Very high (CUDA) |
| CPUs | 50–58% | Duopoly (AMD/Intel) + ARM | High (x86 ISA) |
| Memory (HBM/DDR5) | 35–50% | Oligopoly (3 players) | Extreme (fab + IP) |
| Networking | 60–75% | Oligopoly | High (ASIC, protocols) |
| Optical | 40–55% | Fragmented to moderate | Moderate |
| Software/orchestration | 70–85% | Fragmented | Low (open-source threat) |
| Cloud | 30–40% | Oligopoly (3–4 hyperscalers) | Extreme (capex, scale) |
The conventional wisdom holds: sell picks and shovels in a gold rush. The toll roads are GPUs (NVIDIA + CUDA), foundry (TSMC at 2nm, no alternative), and HBM (SK Hynix). The custom-silicon counterattack — hyperscalers designing their own accelerators (Google TPU, AWS Trainium, Microsoft Maia, Meta MTIA) and ARM CPUs (Graviton, Cobalt, Axion) — routes design revenue to Broadcom (~70% of custom ASICs) and Marvell rather than creating standalone investables. Inference is overtaking training as the dominant compute load (Deloitte: 50% of AI compute in 2025 → 67% in 2026), shifting GPU demand toward recurring, memory-bound decode workloads. Company-specific positioning and theses live on the respective ticker pages.
Subsectors
AI infrastructure is not one market — it is a stack of ten-or-so distinct sub-areas, each with its own technology, its own set of players, and its own investment angle. They are linked by a single cascade: hyperscaler capex (~$660–690B guided for 2026, ~75% AI-related) funds GPU and custom-silicon purchases, which require HBM, networking, power and cooling, which sit inside data centers that need grid electricity, which are financed by an increasingly leveraged web of corporate debt, SPVs, private credit and neocloud lenders. The layers are a cascade, not a sum — NVIDIA's chip revenue is funded by hyperscaler capex, which is funded by enterprise AI spend — so there is heavy double-counting if you add them. What follows enumerates each subsector. Company-specific deep dives live on ticker pages and are referenced with wikilinks rather than re-hosted here.
Compute silicon — CPU / GPU / accelerators
The largest and highest-margin layer (L1: chip design, ~$180B in 2025 rising to ~$270B+ in 2026; 60–75% gross margins; very high concentration). Three competing visions for how to build the data center:
- NVIDIA full-stack — couple a custom CPU (Grace, then Vera) with a custom GPU (Hopper → Blackwell → Rubin) over NVLink C2C and sell the integrated platform (GB200 NVL72, Vera Rubin NVL72). 80–90% of AI accelerator revenue, training share >90%, inference share lower (60–75%) because of custom ASICs. CUDA is the actual moat, not the silicon. See NVDA.
- AMD merchant silicon — sell best-in-class CPUs (EPYC; Turin → Venice 256-core / 1.6 TB/s on TSMC 2nm → Verano) and GPUs (Instinct MI300X → MI350X → MI400 "Vulkan" 432GB HBM4 → MI450X) separately. The only company selling both a market-leading server CPU and a competitive AI GPU. OpenAI deal: up to 6GW of MI450X, ~$100B+ over 4–6 years. See AMD.
- Custom silicon / hyperscaler ASICs — Google TPU (7th-gen Ironwood, 4,614 TFLOPS FP8, 100% better perf/watt than v6e; trains Gemini entirely on TPUs), Amazon Trainium (Trainium2 in production, Anthropic on 500K+ Trainium2; Trainium3 on 3nm, 4.4× compute; AWS custom run-rate >$10B), Microsoft Maia 200 (TSMC 3nm, 216GB HBM3e, 30% better perf/$ for inference), Meta MTIA (a new generation every ~6 months). Custom benefits flow to the design partners — AVGO (~70% of custom AI ASIC market, 5 confirmed hyperscaler customers) and MRVL (~20% and growing) — not to standalone names.
The CPU renaissance is the underappreciated angle within this subsector. Agentic AI is CPU-heavy: a Georgia Tech/Intel paper (arXiv:2511.00739) found tool processing on CPUs is 50–90% of total latency in agentic workloads (up to 90.6%). The CPU-to-GPU ratio moves from GPU-dominated (training) toward ~1:1 (agentic). NVIDIA validated this by building the Vera CPU (88 Olympus cores) specifically for agentic AI — the strongest possible confirmation. AMD EPYC is the primary merchant beneficiary; AMD has gone from ~5% server CPU share (2018) to ~28–39% (early 2026) and is approaching Intel parity, with supply nearly fully allocated. INTC is the comeback attempt (Diamond Rapids / Xeon 7, up to 192 cores on 18A; Cisco Unified Edge win) but carries severe execution risk and <1% of discrete AI accelerators. ARM custom chips (Graviton4/5, Cobalt, Axion) capture first-party cloud but not enterprise/on-prem/multi-cloud. Ampere Computing (SoftBank, $6.5B) is being squeezed from both sides — an "avoid."
Hardware requirements by workload type, the most striking shift being CPU criticality in the agentic era:
| Component | Training | Simple Inference | Agentic AI |
|---|---|---|---|
| GPU | Primary (>90%) | Primary (~70–80%) | Important not dominant (~40–50%) |
| CPU | Minimal (data loading) | Supporting | Critical (50–90% of latency) |
| Memory | HBM for GPU | HBM + DDR5 | HBM + DDR5 + large system RAM |
| Networking | GPU-to-GPU (NVLink, InfiniBand) | Moderate | Heavy (APIs, agent-to-agent, tools) |
| Storage | Dataset storage | Model weights | Persistent memory, vector DBs, caches |
Memory — HBM and the bandwidth wall
A supply-constrained oligopoly and a goldmine. The HBM market is ~$38B in 2025 and ~$54.6–58B in 2026, projected to reach ~$100B by 2028. Three makers, all sold out through 2026: SK Hynix (53–62% share, HBM4 complete with 40% power-efficiency gain, exclusive NVIDIA relationships, MR-MUF packaging), Micron (MU, 21% and gaining, the US-listed way to play the supercycle, HBM4 samples at 11 Gbps), Samsung (17%, lagging on NVIDIA qualification; its recovery hinges on passing qual, and it is well-positioned for GDDR7 from prefill-optimized chips). Roadmap: HBM3E → HBM4 (2026 mix ~55/45) → HBM4E. HBM consumes ~3x the wafer capacity of standard DRAM per gigabyte and commands a 3–5x price premium over DDR, squeezing DDR5 supply: DRAM demand is growing ~35% vs ~23% supply, the widest gap in decades. The binding constraint for inference is memory bandwidth, not compute (see LLM inference economics below). TSMC CoWoS / TSV conversion capacity is the upstream bottleneck on HBM assembly; CXL memory pooling failed two years ago and is returning as memory demand grows.
AI connectivity & retimers
Networking scales proportionally with GPU count and becomes a critical path, not just a data bus — especially for MoE all-to-all traffic and disaggregated-serving KV-cache transfer.
- Switching fabric — InfiniBand (NVIDIA, via Mellanox; dominant in training) vs Ethernet (the challenger, now >two-thirds of AI cluster switch sales). The Ultra Ethernet Consortium (UEC) released its 1.0 spec June 2025. ANET leads AI-DC Ethernet (~21.5% DC-Ethernet share, $2.75B AI-DC revenue target, surpassed Cisco; $105B TAM by 2029). Broadcom supplies the switching silicon (Tomahawk 6 "Davidson": 102.4 Tbps, 3nm; plus Jericho).
- NVLink / NVSwitch — NVIDIA's proprietary GPU-to-GPU interconnect (NVLink 5.0 in Blackwell: 1.8 TB/s per GPU). Essential for MoE expert parallelism and a key moat element — leaving NVIDIA means giving up NVLink.
- Retimers / DSP / connectivity ASICs — ALAB (Astera Labs) retimers carry GB200 content; CRDO Credo for DSP/retimer datacenter connectivity; Marvell PAM4 DSP leadership; Semtech (SMTC) gaining TPU share. These ride the per-rack-content volume tailwind independent of share.
Optical interconnects & silicon photonics
The physical layer beneath networking, treated as its own emerging market (~$9.94B in AI DC in 2025 → $31B by 2033, ~15.3% CAGR; JPM sizes the broader silicon-photonics market separately at ~$11B). The supply chain is geographically split: USA designs chips, China manufactures modules, Taiwan assembles, which Nomura flags as a concentration risk. The 800G → 1.6T → 3.2T transition drives more connector units per rack every generation. NVIDIA invested $2B in COHR Coherent with multi-billion purchase commitments; Meta signed a $6B Corning (GLW) deal; Seikoh Giken (6834, inventor of the APC connector) ran +600% on AI demand. The unresolved question is CPO vs pluggable: Ayar Labs/NVIDIA and Meta bet on co-packaged optics, but pluggable keeps winning each generation; analysis suggests CPO economics don't flip until 3.2T. LPO/LRO (linear-drive optics) is the bridge technology. Japanese optical names: Santec 6777, JEM 6855, Sumitomo Electric 5802, Anritsu 6754.
Datacenter power & thermal
The fastest-escalating constraint inside the rack. Power consumption per AI server now eclipses 10kW vs ~1kW for CPU/storage servers; the NVIDIA H100 is 700W TDP vs <200W for the most common datacenter CPU (Intel Skylake/Cascade Lake). Conventional CPU racks deliver 15–20kW; AI accelerator racks now require >200kW — a 10x increase. The GB200 NVL72 rack is ~120kW.
- Power delivery / VRM — the path is grid AC → transformers → PSUs (AC→DC) → VRMs (final step-down to ~1V at the silicon). Minimize I²R losses by stepping down as close to the silicon as possible; the industry shift is to 48V DC racks and now 800V HVDC (Computex 2025). VRM suppliers: MPS (Monolithic Power, which displaced Vicor as the H100 power supplier), Delta (and subsidiary Cyntec), Renesas, Infineon, ADI. Vicor was the 2017 leader when Google drove first 48V adoption; it has lost share.
- Thermal management — liquid cooling is now standard for these densities (50kW+ racks). GB200 thermal stack: integrated heat spreader, cold plate, inner manifold (CDM), quick disconnects, rack manifold, L2A CDU. L2A vs L2L liquid cooling distinction matters. Two-phase cooling is the next-gen technology. Vendors: Vertiv VRT (power + cooling), Trane, Schneider Electric.
Datacenter capacity vs grid
The distinction that resolves the "is there overcapacity" debate. Critical IT power is the usable electrical capacity at the data center floor for compute/servers/networking; it excludes cooling and power-delivery overhead. Global DC critical IT power: 49GW (2023) → 96GW (2026), of which AI ~40GW; capacity-growth CAGR jumped from 12–15% to 25%. US critical IT capacity needs to triple 2023–2027. AI is projected to push data centers to ~4.5% of global electricity generation by 2030, with global DC electricity consumption doubling to ~945 TWh by 2030. Training is latency-insensitive and can sit anywhere economic (subject to data residency); inference is eventually the larger and more distributed workload.
The binding constraint is the grid, not the floor. Vacancy rates are at historic lows (1.9–2.8% in primary NA markets, <1% in Northern Virginia; 78–98% pre-leased) yet GPU utilization runs only 60–70% and training MFU can be ~38%. Power-related systems are the most-cited bottleneck: transformer lead times have stretched to 115–130 weeks for medium units and 120–210 weeks (2.3–4 years) for large substation/GSU transformers; prices up 60–80% since 2020 (GOES steel ~2x, copper ~40%). Hitachi Energy is investing $6B and hiring 15,000 to expand transformer capacity. Building a new transformer factory takes ~4 years. The utility-scale transformer market alone is projected to ~$116B by 2032, with parallel demand for switchgear, cabling, and smart meters, plus replacement of ~40-year-life legacy equipment.
Southeast Asia is the frontier and its own sub-theme: Singapore's 2019–2022 moratorium (since lifted, but new capacity is capped at ~60MW/year and must be greened) pushed demand to Malaysia (Johor) and beyond; Singapore is still ~60% of SEA DC capacity, with DCs >7% of its electricity, and it depends on imported natural gas for ~90% of power. Keppel's Datapark+ proposes a hydrogen + DC grid. ASEAN grid interconnections and corporate PPAs are the enabling/financing layer. This intersects grid decarbonization: you can't build 100GW without addressing carbon intensity.
GPU-cloud / neocloud financing
Neoclouds (CoreWeave CRWV, Crusoe, Nebius, Applied Digital APLD, Lambda, Together AI) rent GPUs as a homogeneous workload — far fewer config options than traditional multi-tenant cloud, networking spend rarely the bottleneck. The defining feature is that building an AI cloud is a financing problem, not a technology problem. Debt rates run 12%+ standard, up to 18% for unproven neoclouds without customer contracts; blockbuster deals (Blackstone × CoreWeave) get better terms. CoreWeave: IPO Mar 2025 (~$35B cap, now ~$24B), revenue ~$3.5B annualized → $12–13B 2026 guide, backlog ~$55–66.8B, anchored by OpenAI ($22.4B/5yr) and Meta ($14.2B); acquired Weights & Biases for $1.4B. The "GPU debt wall" is the key risk.
The structural fragility is a duration mismatch: DC assets have 30+ year lives funded with long-term capital, colo tenants sign 15-year leases, but neoclouds lock customers for only 1–3 year server rentals — and contract terms shortened from 3-year-with-prepayment (2022–23 shortage) to 6-month–1-year (2024 supply improvement), creating H100 repricing pressure at expiry. Financing preference order: customer prepayment (ideal) → vendor financing → prominent lenders → equity (worst, 20%+ hurdle). Many projects model <2-year payback at the project level.
Lenders/investors mapped (from gpu-cloud-lenders): CoreWeave debt led by Blackstone + Magnetar ($7.5B May 2024, $2.3B Aug 2023), plus Carlyle, CDPQ, DigitalBridge, BlackRock, PIMCO; $650M revolver led by JPMorgan/Goldman/Morgan Stanley; equity from Coatue, Fidelity (7.6%), NVIDIA (6%). Crusoe: Upper90 asset-backed GPU financing; Blue Owl + Primary Digital Infrastructure $3.4B JV for a 206MW DC; equity from Founders Fund, Mubadala, NVIDIA, Valor. APLD: Macquarie $150M senior secured + up to $5.0B perpetual preferred (15% equity stake) for the HPC business; NVIDIA $160M strategic. Nebius: $700M private placement (Accel, NVIDIA, Orbis).
Vendor-financing as moat is why DELL is beating Supermicro SMCI: Dell Financial Services (DFS, $8.4B FY24 originations, $10.5B portfolio) lends to enterprises, neoclouds and sovereign AI at rates 2–3 points below third parties. Both buy the same NVIDIA 8-GPU baseboards at the same price; Supermicro is the more efficient manufacturer ($5–10K cheaper server) but its weak balance sheet forces customers to third-party financing at higher rates, so Dell's total cost of ownership is lower despite the higher list price. The emerging model is "a bank with a server-manufacturing arm attached" (the Toyota analogy). Dell has won sockets at CoreWeave, Tesla and xAI. The OEM/ODM split: hyperscalers buy from ODMs (Quanta, FII, Inventec, Wistron, Wiwynn, ZT Systems) at ~2–3% margins; non-hyperscale buyers need OEMs (Dell, Supermicro, HPE, Lenovo).
AI-capex financing & bubble risk
The macro/systemic-risk subsector. The framing question: AI infrastructure faces a ~$600B annual revenue gap — the largest gap between capital invested and economic return since the railroad bubbles (Panics of 1873/1893). ~$400–500B poured into AI infra in 2024 (hyperscaler capex $251B, +62% YoY, plus ~$100B VC, ~$115B PE DC deals, CHIPS Act) against Sequoia/David Cahn's calculation of only ~$100B of actual AI-services revenue. The gap tripled from $125B since late 2023. J.P. Morgan: $650B annual AI revenue needed for a "mere 10% return"; Goldman's Jim Covello: "overbuilding things the world doesn't have use for typically ends badly." Enterprise AI abandonment 42% (up from 17%); only 5.4% of US businesses report regular AI use.
The $6 trillion capital math (Doug O'Laughlin / SemiAnalysis): hyperscaler cash-from-operations ~$450B/year; full FCF reinvestment over ~7 years ~$3.15T; plus off-balance-sheet SPVs (Meta's $27–29B Hyperion deal with Blue Owl/PIMCO/BlackRock, A+ rated, Meta keeps 20% and got $3B cash at close; Microsoft's BlackRock/GIP $100B+ partnership; the $40B Aligned acquisition at 70% fund-level leverage); plus 1x corporate leverage (hyperscalers run minimal debt today, with D/E of 7% at Google, 15% at Meta and 30% at Microsoft, and only 7% of industry capex is debt-funded today vs 32% in the 2000 telecom boom). Total deployable ceiling ~$5.6–6.7T, aligning with McKinsey ($6.7T by 2030), iCapital ($5.3T, ~$1.5T cash-flow-funded leaving a gap), JPM ($5–7T needing $1.5T of IG bonds), Morgan Stanley ($2.9T with a $1.5T financing gap). Allocation roughly: ~60% compute hardware, ~25% power, ~15% construction. Oracle broke the discipline by leveraging up aggressively (Stargate $18B project loan, ~500% D/E).
Private credit is the dominant financier — ~$50B/quarter into AI DCs, 2–3x what public markets provide; market quadrupled to ~$2T globally with $450B dry powder; Ares pegs the opportunity at $5.5T through 2035. Credit quality is deteriorating: PIK income at 11.7% of BDC loans (highest since 2020), interest coverage ~2.0x, private-credit recovery rates ~33% vs 52% for syndicated; covenant-lite has spread to large-cap. The cascade pathways (railroad-debt analogy, debt/equity rose 0.62→1.58 in the 1880s): AI startup failures → private credit losses → pension capital calls (pensions hold 31% / ~$307B of private-credit assets) → forced selling; tech-stock concentration → bank securities losses → SVB-style deposit flight; CLO market seizure at the 2028 maturity wall (CLOs buy 70–80% of leveraged-loan issuance); NBFI liquidity mismatch (US bank exposure to NBFIs >120% of Tier 1 capital). Hyperscalers issued ~$121B of new debt in 2025. The key counter: the spenders fund mostly from operating cash flow, and physical DC capacity is genuinely undersupplied. See ai-infra-financing-bubble and 6t-ai-capital-math.
LLM architecture & inference economics
The software/model subsector that determines hardware demand from the top down. Every frontier model is a transformer (decoder-only); the value chain runs seven layers (chip design → memory/interconnect → cloud → training tools → model layer → inference/serving → applications). The model layer is consolidating around 5–7 labs (OpenAI: $25B ARR, $730B valuation, ~33% gross margin, $17B 2026 burn; Anthropic: $19B ARR, $380B valuation; xAI: $230B valuation; Meta open-source; Google; Mistral; DeepSeek). See llm-architecture-primer and llm-industry-primer.
The architectural shifts with direct supply-chain consequences:
- MoE is the frontier default (GPT-4, Gemini, DeepSeek-V3 671B/37B-active, Grok). Sparse activation gives 2–7x training efficiency but creates a paradoxical memory profile (load all experts, activate few) and all-to-all communication that forced network redesigns (AWS moved Trainium3 from 3D Torus to switched fabric). Poor load balancing causes up to 82% GPU idle.
- The memory wall — bandwidth, not compute, is the binding constraint for inference. GPUs hit only ~23% compute utilization during inference because they're memory-bound.
- KV cache is the dominant consumer of scarce GPU memory: linear in sequence length × layers × heads × batch. A Llama-70B 128K-context request needs ~40–80GB of HBM for cache alone. Attention evolution traded expressiveness for memory: MHA → MQA (8–32x smaller cache) → GQA (industry standard, ~99% quality, 4–8x reduction) → DeepSeek's MLA (93% reduction while improving quality).
- Prefill vs decode have opposite hardware needs — prefill is compute-bound (GDDR7 sufficient), decode is memory-bandwidth-bound (HBM essential). This birthed disaggregated serving (DistServe 4.48x throughput; Microsoft Splitwise; NVIDIA Dynamo orchestration) and purpose-built prefill silicon: NVIDIA Rubin CPX (128GB GDDR7, ~25% the cost of an R200 decode chip; SemiAnalysis estimates ~$0.90/hr wasted TCO per R200 used for prefill).
- Tokenomics — output tokens cost 4–8x more than input because decode is sequential and memory-bound. Cost/token = [(1−hit_rate) × prefill_cost + decode_cost] / batch_size. Batch size is capped by (GPU memory − weights) / KV-cache-per-request, so HBM capacity directly sets unit economics (B200 reaches ~$0.02/M tokens on some benchmarks). Cache hit rates are the biggest lever after batch size (Anthropic prices cache reads at 0.1x; DeepSeek 56.3% production hit rate). Token warehousing — persistent tiered KV-cache storage (WEKA's trademarked "Token Warehouse"; NVIDIA's ICMSP "G3.5" tier on BlueField-4, 12-vendor ecosystem incl. Dell/HPE/Pure/VAST) — is the emerging step-change, claiming 96–99% hit rates. SemiAnalysis: software is moving from near-zero to high marginal cost; a ~$5M GB200 NVL72 can generate ~$75M of DeepSeek-R1 token revenue (15x ROI) at high utilization; per-unit token prices crash ~50x/year while consumption explodes faster (Jevons Paradox).
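The KV-cache, batch-size, and cost-per-token relationships in the bullets above can be sketched as follows. The Llama-70B shape (80 layers, 8 KV heads under GQA, head dimension 128, FP16) is an assumption about the model configuration, and the node capacity and cost units are hypothetical. Under that assumed shape, a 128K-token request works out to 40 GiB of cache, the low end of the quoted range.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """KV-cache size: keys and values for every layer, KV head and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch


def max_batch(gpu_mem_bytes, weight_bytes, kv_bytes_per_request):
    """Batch cap: (GPU memory - weights) / KV cache per request."""
    return (gpu_mem_bytes - weight_bytes) // kv_bytes_per_request


def cost_per_token(hit_rate, prefill_cost, decode_cost, batch_size):
    """[(1 - hit_rate) * prefill_cost + decode_cost] / batch_size."""
    return ((1 - hit_rate) * prefill_cost + decode_cost) / batch_size


# Assumed Llama-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
kv = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"KV cache per 128K request: {kv / 2**30:.0f} GiB")

# Hypothetical node: 1,536 GB aggregate HBM, 140 GB of FP16 weights.
batch = max_batch(1536 * 10**9, 140 * 10**9, kv)
print(f"max concurrent 128K requests: {batch}")

# Arbitrary cost units; only the shape of the formula matters here.
base = cost_per_token(hit_rate=0.50, prefill_cost=1.0, decode_cost=4.0, batch_size=batch)
warm = cost_per_token(hit_rate=0.96, prefill_cost=1.0, decode_cost=4.0, batch_size=batch)
print(f"cost/token at 50% hits: {base:.4f}, at 96% hits: {warm:.4f}")
```

The batch cap falls as context grows because the KV cache per request scales linearly with sequence length, which is how HBM capacity feeds directly into unit economics.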
The DeepSeek shock (Jan 2025): R1 matched o1 at a claimed $5.6M final-run cost, wiped ~$1T off US tech (NVIDIA −$593B in one session), and lifted Chinese LLM global share from 3% to 13% in two months — permanently changing the efficiency-vs-scale debate.
Wafer-scale compute
A distinct accelerator architecture that escapes the reticle limit by fabricating an entire wafer as one chip — Cerebras WSE2 is the canonical example (referenced alongside NVIDIA H100, Google TPUv5, AMD MI300 and Intel Gaudi3/PVC in SemiAnalysis's power-delivery competitive landscape). Coverage in the vault is thin: it appears only as a comparison point in the power-delivery primer and is not yet a dedicated page. The investment angle is as an alternative to GPU-based scale-up for training and high-throughput inference; the open question is whether wafer-scale economics and yield compete with NVLink-connected rack-scale systems for the workloads that matter.
GB200 BOM / rack content
The single most analyzed system unit, and the lens for tracing per-rack content value (SemiAnalysis GB200 component & supply-chain BOM model). The GB200 NVL72 is a 72-GPU rack-scale system (~$3M+ list, ~120kW). The Bianca board pairs B200 Blackwell GPUs with a Grace ARM CPU. The BOM splits into:
- Core processing: B200 GPU (~$6,400 production cost; ASP ~$30–40K), Grace CPU. GB200 superchip (2× B200 + Grace) ASP ~$60–70K.
- NVLink ecosystem: NVLink 5 backing-plane connector, NVSwitch board + ASIC, SkewClear EXD gen-2 cable, UltraPass connectors, flyover cable to OSFP cage, 1.6T Twin-Port OSFP LinkX ACC cable + assembly, ACC active-cable chips. Interconnect is a large share of BOM cost.
- Non-NVLink networking: BlueField-3 DPU, custom NIC, transceivers, DSP, single-mode + multi-mode optics, ACC retiming chips, DAC/ACC cable assemblies. Subject to design-win competition.
- Liquid cooling: integrated heat spreader, cold plate, inner manifold (CDM), fans, quick disconnects, rack manifold, L2A CDU.
The investment angle: each component is a separate design-win battleground, and per-rack content value rises with every generation (GB200 → GB300/B300 → Vera Rubin NVL72 → VR NVL144 CPX at ~370kW). Tracking BOM share is how you find the suppliers (retimers, connectors, optics, VRMs, cold plates) that ride volume regardless of which GPU vendor wins. See GB200-Supply-Chain-BOM and sa-gb200-component-outline.
Value chain
The AI infrastructure value chain is a cascade, not a sum. Money flows from end applications down through cloud capex into systems, silicon, memory, networking, power, and ultimately the foundry — but the dollars are double-counted at every layer. NVIDIA's revenue (chip design) is funded by hyperscaler capex (cloud), which is funded by enterprise AI spending (applications). Value is created most at the application layer (where AI generates revenue for enterprises) but captured most at chip design, because NVIDIA holds near-monopoly pricing power. This is the classic picks-and-shovels dynamic, and in AI it is more extreme than in prior tech cycles because the hardware moat (CUDA) is harder to replicate than cloud infrastructure ever was.
The end-to-end map
The two primers in the vault frame the chain slightly differently — a six-stage agentic view and a seven-layer LLM view — but they describe the same physical reality.
The agentic-AI framing (manufacturing → end applications):
[Foundry/Manufacturing] → [Silicon/Components] → [Systems/Platforms] → [Cloud/Infrastructure] → [Software/Orchestration] → [End Applications]
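[Foundry/Manufacturing] TSMC, Intel, Samsung Foundry
[Silicon/Components] AMD, NVIDIA, Intel, Broadcom, Marvell, SK Hynix, Micron, Arista, Coherent, Seikoh Giken, Corning
[Systems/Platforms] Dell, HPE, Supermicro
[Cloud/Infrastructure] AWS, Azure, GCP, Oracle, CoreWeave
[Software/Orchestration] LangChain, CrewAI
[End Applications] Enterprise agents, Salesforce, ServiceNow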
The seven-layer LLM framing (chip up to apps):
[L1: Chip Design] NVIDIA, AMD, Broadcom, Marvell, Google/Amazon/Meta custom
[L2: Memory & Interconnect] SK Hynix, Samsung, Micron (HBM) + Arista, Cisco (networking)
[L3: Cloud / Compute] AWS, Azure, GCP, CoreWeave + power infrastructure
[L4: Training Tools] Scale AI, Databricks, Weights & Biases, PyTorch
[L5: Model Layer] OpenAI, Anthropic, Google, Meta, Mistral, xAI, DeepSeek
[L6: Inference & Serving] Groq, Fireworks, Together AI, vLLM, TensorRT-LLM
[L7: Application Layer] ChatGPT, Claude Code, Copilot, Palantir, Cursor, Perplexity
Margin pools and concentration by layer
The single clearest fact in the value chain is the margin hierarchy: chip design earns the most, the model layer the least. Two source tables converge on this.
Agentic-AI primer, layer economics:
| Layer | Revenue Pool | Gross Margin | Concentration | Barrier to Entry |
|---|---|---|---|---|
| Foundry | ~$100B+ | 50–55% (TSMC) | Oligopoly (TSMC dominates) | Extreme ($20B+ per fab) |
| GPUs/Accelerators | ~$150B+ | 70–78% (NVIDIA) | Near-monopoly | Very high (CUDA ecosystem) |
| CPUs | ~$50B+ server | 50–58% | Duopoly (AMD/Intel) + ARM | High (x86 ISA, ecosystem) |
| Memory (HBM/DDR5) | ~$55B HBM (2026) | 40–50% | Oligopoly (3 players) | Extreme (fab + IP) |
| Networking | ~$15B+ AI networking | 60–75% | Oligopoly | High (ASIC design, protocols) |
| Optical | ~$10B AI DC | 40–55% | Fragmented to moderate | Moderate |
| Software/Orchestration | ~$7–50B (early) | 70–85% | Fragmented | Low (open source threat) |
| Cloud | ~$450B AI infra | 30–40% | Oligopoly (3–4 hyperscalers) | Extreme (capex, scale) |
LLM primer, seven-layer revenue and margin (2025E / 2026E):
| Layer | Revenue 2025E | Revenue 2026E | Gross Margin | Concentration | Barrier |
|---|---|---|---|---|---|
| L1: Chip Design | ~$180B | ~$270B+ | 60–75% | Very high (NVIDIA 80–90%) | Extreme (CUDA, R&D, IP) |
| L2: Memory & Interconnect | ~$50B | ~$70B+ | 35–65% | Oligopoly (3 HBM makers) | High (fab cost, qualification) |
| L3: Cloud / Compute | ~$30–40B | ~$50–60B | 30–60% | Oligopoly (3 hyperscalers) | Extreme (capex, scale) |
| L4: Training Tools | ~$10B | ~$15B+ | 60–80% | Fragmented | Moderate |
| L5: Model Layer | ~$25–30B | ~$50B+ | 30–70% | Fragmenting | High but declining |
| L6: Inference & Serving | ~$5–8B | ~$12–15B | 40–60% | Fragmented | Moderate |
| L7: Application Layer | ~$20B | ~$35B+ | 60–85% | Wide variance | Varies widely |
The highest-margin, most concentrated layers — the toll roads — are GPUs (NVDA), foundry (TSM), and HBM (000660.KS SK Hynix). The worst risk-reward for a new investor is the model layer (L5): OpenAI runs ~33% gross margin, terrible for software, because inference costs scale faster than revenue and open-source closes the capability gap within 6–12 months, compressing pricing power.
Choke point one: the foundry
TSM manufactures virtually all leading-edge AI chips — NVIDIA, AMD, Broadcom, Marvell, Apple, Qualcomm. At 2nm there is no alternative. This is the ultimate picks-and-shovels position and a strategic single point of failure (Taiwan). Arizona CHIPS-Act fabs are a partial hedge but won't match Taiwan's scale or cost for years. Within the foundry, advanced packaging is the binding sub-bottleneck: TSMC's CoWoS capacity, alongside TSV conversion capacity, is the constraint that gates HBM-stacked GPU output, not raw wafer starts. Equipment makers capturing the HBM/packaging build-out include Applied Materials (TSV tools), Hanmi / ASMPT / Besi (bonders), and Camtek (inspection). Vault ticker pages cover the equipment edge: AIXA (MOCVD for transceiver lasers and GaN/SiC power), 268A (Rigaku X-ray metrology), SOITEC (photonics-SOI and FD-SOI engineered wafers).
Choke point two: HBM memory
Memory bandwidth, not compute, is the binding constraint for inference. HBM is a supply-constrained three-player oligopoly with pricing power and a multi-year upgrade cycle (HBM3E → HBM4 → HBM4E). Share (Q2 2025): SK Hynix 62% (the clear leader, exclusive NVIDIA supply, MR-MUF packaging lead, sold out through 2026), Micron (MU) 21% (the US-listed fast follower, shipping HBM4 samples at 11 Gbps, claims 30% lower power on HBM3E), Samsung 17% (trailing; recovery hinges entirely on passing NVIDIA qualification, and its HBM stumble wiped ~$126B of market value). Market size: $38B (2025) → $58B (2026); the agentic primer cites BofA at $54.6B (2026) → $100B (2028). 2026 mix is ~55% HBM4 / 45% HBM3E, with ~20% HBM3E price hikes planned. HBM commands a 3–5x price premium over standard DDR.
A second-order squeeze: HBM production consumes ~3x the wafer capacity of standard DRAM per gigabyte, so the HBM ramp starves DDR5 supply just as agentic CPU deployments spike DDR5 demand. DRAM demand is growing ~35% in 2026 vs ~23% supply — the widest gap in decades. Micron exited consumer memory entirely to focus on AI.
The KV cache is why memory is the bottleneck. KV cache grows linearly with sequence length, layers, heads, and batch size; LLaMA-2 70B at 128K context needs ~80GB just for cache (nearly a full H100), and batch size 32 at 8K context reaches ~640GB, dwarfing the ~140GB of model weights. GPUs hit only ~23% compute utilization during inference because they are memory-bandwidth-bound, not FLOP-bound. Architectural fixes (GQA cutting cache 4–8x; DeepSeek's MLA cutting it 93%) reduce per-token footprint but enable longer contexts, so net HBM demand stays strong — a "memory-Parkinson" dynamic where models grow to fill whatever HBM appears.
KV-cache context scaling (70B model):
| Context Length | Approx KV cache |
|---|---|
| 4K tokens | ~2.5GB |
| 32K tokens | ~20GB |
| 128K tokens | ~80GB |
| 1M tokens | ~625GB |
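The linear scaling behind this table can be sketched with the standard per-token KV-cache formula. The model shape below (80 layers, 64 attention heads, 8 KV heads, head dimension 128, FP16 cache) is an assumption taken from the public Llama-2-70B configuration, not from the vault sources, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, batch=1, bytes_per_elem=2):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 2**30
# Assumed Llama-2-70B-style shape: 80 layers, head_dim 128, FP16 cache.
mha = kv_cache_bytes(128 * 1024, n_layers=80, n_kv_heads=64, head_dim=128)  # full multi-head
gqa = kv_cache_bytes(128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)   # grouped-query
print(mha / GIB, gqa / GIB, mha / gqa)  # 320.0 40.0 8.0
```

Under these assumptions a 128K-token request needs about 40 GiB with grouped-query attention, the low end of the ~40–80GB range quoted earlier, and eight times that with full multi-head attention. The table's round numbers imply a per-token footprint roughly twice the GQA case, so the constants here are best read as illustrative.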
Per-stage economics: the GB200 NVL72 BOM
The rack is where component-supplier margin pools live. SemiAnalysis's GB200 component model (mirrored in the vault and detailed in GB200-Supply-Chain-BOM) breaks a Blackwell NVL72 rack into discrete high-value buckets, each with its own design-win competition:
- Compute die (majority of BOM cost): B200 GPU + Grace CPU on the Bianca board. Production cost ~$6,400 per B200; ASP ~$30–40K — extraordinary margin. GB200 superchip ~$60–70K; a full GB200 NVL72 rack (72 GPUs + 36 Grace CPUs) ~$3M+.
- NVLink ecosystem (interconnect — significant BOM): NVLink 5 backing-plane connector, NVSwitch board + ASIC (1.8 TB/s per GPU), SkewClear EXD gen-2 cable, UltraPass connectors, flyover cable to OSFP cage, 1.6T twin-port OSFP LinkX ACC cable assemblies, ACC signal-integrity chips. This is a NVIDIA moat element — switching off NVIDIA forfeits NVLink performance.
- Non-NVLink networking: BlueField-3 DPU, custom NIC, transceivers, DSP (signal equalization), single-mode and multi-mode optics, DAC/ACC copper cable assemblies. Subject to design-win competition; ALAB (retimers), 6855 JEM and 6834 Seikoh Giken (connectors) sit here.
- Liquid cooling (critical for density): integrated heat spreader, cold plate, inner manifold (CDM), quick-disconnects, rack manifold, L2A CDU. At 120kW+ per rack, thermal is no longer optional.
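The compute-die figures above can be cross-checked with simple arithmetic. This is a rough sketch using only the production cost, ASP ranges, and GPU count quoted in the list; it ignores R&D, software, and other costs, so the percentages are an implied product-level margin rather than a reported one.

```python
def gross_margin(asp: float, unit_cost: float) -> float:
    """Gross margin as a fraction of selling price."""
    return (asp - unit_cost) / asp

B200_COST = 6_400                      # production cost per B200 (from the text)
for asp in (30_000, 40_000):           # ASP range from the text
    print(f"ASP ${asp:,}: {gross_margin(asp, B200_COST):.0%} margin")

# A 72-GPU rack holds 36 superchips (2 B200 each) at ~$60-70K apiece.
superchips = 72 // 2
low, high = superchips * 60_000, superchips * 70_000
print(f"Superchip content per rack: ${low / 1e6:.2f}M to ${high / 1e6:.2f}M")
```

On these inputs the superchips alone account for roughly $2.2–2.5M of a ~$3M+ rack, which is consistent with compute being the majority of BOM value.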
Per-stage economics: the token factory
The datacenter is a token factory — silicon, electricity, and water in, intelligence out. The unit cost of an output token is set by one ratio: model bytes ÷ memory bandwidth ÷ batch size. Everything downstream (KV cache management, prefix caching, token warehousing, disaggregated serving) manipulates the terms of that equation.
The prefill/decode asymmetry is the origin of the whole inference cost structure. Prefill (processing the input prompt) is compute-bound, parallel matrix-matrix work that saturates tensor cores. Decode (generating output one token at a time) is memory-bandwidth-bound, matrix-vector work with arithmetic intensity near 1 FLOP/byte that leaves compute idle while the GPU streams weights from HBM. Running both on the same GPU wastes 70–90% of hardware. This is why output tokens cost more than input tokens at every provider (OpenAI 4x, Anthropic 5x, Gemini 2.5 Pro 8x) — a hardware reality, not margin-stacking.
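A simple roofline model makes the asymmetry concrete. The sketch below assumes an H100-class GPU with a dense FP16 peak of about 989 TFLOPS (an assumption from public spec sheets, not from the vault sources) and 3.35 TB/s of HBM bandwidth, and it treats decode arithmetic intensity as growing linearly with batch size, which ignores KV-cache reads.

```python
def attainable_tflops(intensity_flop_per_byte: float, peak_tflops: float, bandwidth_tbs: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_tflops, intensity_flop_per_byte * bandwidth_tbs)

PEAK_TFLOPS = 989.0   # assumed dense FP16 peak for an H100-class GPU
BANDWIDTH_TBS = 3.35  # HBM bandwidth in TB/s

for batch in (1, 64, 512):
    # Assumption: decode intensity grows ~linearly with batch (about 1 FLOP/byte per sequence).
    used = attainable_tflops(batch * 1.0, PEAK_TFLOPS, BANDWIDTH_TBS)
    print(f"batch {batch:>3}: {used / PEAK_TFLOPS:.1%} of peak compute")
```

With these numbers compute only becomes the binding limit at an intensity near 295 FLOP/byte, which is why small-batch decode leaves tensor cores mostly idle while prefill, with far higher intensity, saturates them.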
Decode cost by GPU at max batch (70B FP16):
| GPU | HBM Bandwidth | HBM Capacity | Decode latency | Max batch @4K ctx | Cost/M tokens at max batch |
|---|---|---|---|---|---|
| A100 | 2.0 TB/s | 80 GB | 70 ms | ~4 (2 GPUs) | ~$7.50 |
| H100 | 3.35 TB/s | 80 GB | 42 ms | ~4 (2 GPUs) | ~$4.50 |
| H200 | 4.8 TB/s | 141 GB | 29 ms | ~15 (2 GPUs) | ~$0.65 |
| B200 | 8.0 TB/s | 192 GB | 17.5 ms | ~25 (2 GPUs) | ~$0.20 |
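The decode-latency column follows from dividing the model's weight bytes by HBM bandwidth, since each decode step streams the weights once. A minimal sketch, assuming a 70B model at 2 bytes per parameter and each GPU's own bandwidth as listed in the table:

```python
WEIGHT_BYTES = 70e9 * 2  # 70B parameters at 2 bytes each (FP16) = 140 GB

def decode_latency_ms(bandwidth_tb_per_s: float, weight_bytes: float = WEIGHT_BYTES) -> float:
    """Time to stream all weights from HBM once, i.e. one decode step."""
    return weight_bytes / (bandwidth_tb_per_s * 1e12) * 1e3

for gpu, bw in {"A100": 2.0, "H100": 3.35, "H200": 4.8, "B200": 8.0}.items():
    print(f"{gpu}: {decode_latency_ms(bw):.1f} ms")
```

This reproduces the 70 / 42 / 29 / 17.5 ms figures to rounding. It is a lower bound that ignores KV-cache reads and kernel overheads.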
At batch-1, H100 decode costs ~$29/M output tokens — the raw hardware floor. Batching is the only escape: batch 64 → $0.45/M, batch 256 → $0.11/M. But KV cache memory caps batch size, which is why NVIDIA charges a premium for capacity, not just bandwidth — the H200's extra 61GB directly buys 2–4x larger batches. The inference cost formula:
Cost/token = [(1 - hit_rate) × prefill_cost + decode_cost] / batch_size
Cache hit rate is the single largest lever after batch size. Anthropic prices cache reads at 0.1x base input (90% discount) because a hit eliminates prefill FLOPS; cache writes cost 1.25x. DeepSeek reported 56.3% production hit rates; well-structured prompts exceed 87%; WEKA claims 96–99% for agentic workloads. SemiAnalysis estimates prefill consumes ~80% of GPU cycles, so a 56% hit rate cuts the prefill GPU fleet ~45% — a direct capex reduction. A GB200 NVL72 (~$5M) can reportedly generate $75M in DeepSeek-R1 token revenue (15x ROI) — but only at high utilization. Per-unit token prices are crashing ~50x/year (Epoch AI), but consumption is exploding faster (Jevons Paradox).
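The cost formula and the fleet-reduction arithmetic can be checked in a few lines. The prefill and decode cost units below are hypothetical, chosen so that prefill is 80% of the work to mirror the SemiAnalysis estimate; the $29/M batch-1 figure and the 56.3% hit rate come from the text.

```python
def cost_per_token(hit_rate, prefill_cost, decode_cost, batch_size):
    """Cost/token = [(1 - hit_rate) * prefill_cost + decode_cost] / batch_size."""
    return ((1 - hit_rate) * prefill_cost + decode_cost) / batch_size

def prefill_fleet_reduction(hit_rate, prefill_share=0.80):
    """Fraction of total GPU cycles saved when cache hits skip prefill."""
    return hit_rate * prefill_share

# Hypothetical cost units: prefill 4.0, decode 1.0 (so prefill is 80% of the work).
base = cost_per_token(hit_rate=0.0, prefill_cost=4.0, decode_cost=1.0, batch_size=8)
cached = cost_per_token(hit_rate=0.563, prefill_cost=4.0, decode_cost=1.0, batch_size=8)
print(f"cost falls {1 - cached / base:.0%} at a 56.3% hit rate")
print(f"fleet reduction: {prefill_fleet_reduction(0.563):.0%}")

H100_BATCH1_COST = 29.0  # $/M output tokens at batch 1 (from the text)
for b in (64, 256):
    print(f"batch {b}: ${H100_BATCH1_COST / b:.2f}/M")
```

Both routes give roughly 45%, and dividing the batch-1 cost by batch size recovers the $0.45/M and $0.11/M figures quoted above.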
This changes the value chain's economics: software is moving from near-zero marginal cost to high marginal cost, because every API call now consumes real silicon time. A new infrastructure category, "token warehousing" (WEKA's trademarked term; NVIDIA's ICMSP / G3.5 memory tier on BlueField-4 DPUs; 12 vendors including Dell/HPE/Pure/VAST), turns ephemeral KV cache into persistent, location-independent inventory: a tiered hierarchy from SRAM → HBM → DRAM → NVMe flash → object storage.
Disaggregation reshapes who supplies what
Disaggregated serving — splitting prefill and decode onto specialized hardware pools — is now standard at every hyperscaler. It validates purpose-built silicon and reshuffles the memory supply chain. NVIDIA's Rubin CPX is the first prefill-only GPU, and crucially it uses GDDR7 instead of HBM because prefill underutilizes HBM bandwidth (SemiAnalysis estimates ~$0.90/hour wasted TCO per R200 used for prefill):
| Spec | Rubin CPX (prefill) | Rubin R200 (decode) |
|---|---|---|
| Compute (FP4) | 20 PFLOPS | 33.3 PFLOPS |
| Memory | 128GB GDDR7 | 288GB HBM |
| Bandwidth | 2 TB/s | 20.5 TB/s |
| Memory cost | ~50% lower/GB | Premium HBM |
| Interconnect | PCIe Gen6 (no NVLink) | 14.4 Tb/s NVLink |
| Mfg cost | ~25% of R200 | Baseline |
This opens a GDDR7 supply-chain lane that benefits Samsung (well-positioned in GDDR7, a recovery path from its HBM stumble) and Micron, while forcing any competing custom-silicon program to ship both prefill and decode variants. Disaggregation also multiplies networking demand: moving KV cache between prefill and decode nodes needs 90+ Gbps sustained per 10 req/s, favoring RDMA/InfiniBand and the 800G→1.6T Ethernet transition.
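The networking claim above reduces to request rate times KV-cache size. In the sketch below the ~1.125 GB per request is an assumption back-solved from the 90 Gbps figure, not a sourced number, and the function name is hypothetical.

```python
def kv_transfer_gbps(requests_per_s: float, kv_gb_per_request: float) -> float:
    """Sustained network rate needed to move KV cache from prefill to decode nodes."""
    return requests_per_s * kv_gb_per_request * 8  # GB/s -> Gbit/s

# Assumption: ~1.125 GB of KV cache per request, back-solved from the 90 Gbps figure.
print(kv_transfer_gbps(requests_per_s=10, kv_gb_per_request=1.125))  # 90.0
```

At 100 requests/s the same assumption needs 900 Gbps, more than a single 800G port can carry, which is one way to see why disaggregation pulls forward the 1.6T transition.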
The networking and optical layer
As clusters scale to hundreds of thousands of GPUs, the network becomes a critical path, not a data bus — especially for MoE models whose all-to-all routing (tokens travel to the GPUs hosting their experts) punishes mesh topologies (AWS redesigned Trainium3 from 3D Torus to switched fabric specifically for this). Ethernet is winning the AI networking war: the Ultra Ethernet Consortium released its 1.0 spec in June 2025, and Ethernet now exceeds two-thirds of AI cluster switch sales. ANET Arista leads (surpassed Cisco; ~$2.75B AI DC revenue target; $105B TAM by 2029) on switching silicon from Broadcom (Tomahawk 6 "Davidson": 102.4 Tbps, 3nm). NVLink/NVSwitch and InfiniBand (via Mellanox) remain NVIDIA's proprietary moat in training. Custom networking and XPU design value accrues to AVGO Broadcom (~70% of custom AI ASICs, Tomahawk/Jericho) and MRVL Marvell (Teralynx switches, 500ns latency; 76% of datacenter revenue from AWS; optical DSP via the Celestial AI acquisition).
Optical interconnects are the physical layer making it all work. Each speed transition (400G → 800G → 1.6T → 3.2T) needs more connector units per rack — a volume tailwind independent of share. The optical interconnect market in AI datacenters grows from $9.94B (2025) to $31B (2033), 15.3% CAGR. NVIDIA invested $2B in Coherent (COHR) with multi-billion purchase commitments; Meta signed $6B with Corning. Supply concentration: USA designs chips, China manufactures modules, Taiwan assembles — a chokepoint flagged for export-control risk. CPO vs pluggable is the unresolved 1.6T question; analysis suggests CPO economics don't flip until 3.2T. Vault ticker pages: 5802 Sumitomo Electric (fiber/cable), 6777 Santec (tunable lasers), 6754 Anritsu (test).
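The quoted growth rate can be verified from its two endpoints. A minimal sketch of the compound-annual-growth calculation, using only the market sizes and years given above:

```python
def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1

# Optical interconnect market in AI datacenters: $9.94B (2025) -> $31B (2033).
print(f"{cagr(9.94, 31.0, 2033 - 2025):.1%}")  # 15.3%
```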
The CPU renaissance — a new balanced-compute claim on the chain
Agentic AI rebalances the rack. Tool processing on CPUs accounts for 50–90% of total latency in agentic workloads (Georgia Tech/Intel, arXiv:2511.00739) — only the LLM calls are GPU-heavy; orchestration, tool calling, code execution, memory retrieval, and API communication are CPU/memory/networking/storage. The CPU-to-GPU ratio moves from training's hundreds-of-GPUs-per-handful-of-CPUs toward ~1:1 in agentic workloads. Hardware requirements by workload:
| Component | Training | Simple Inference | Agentic AI |
|---|---|---|---|
| GPU | Primary (>90%) | Primary (~70–80%) | ~40–50% |
| CPU | Minimal | Supporting | Critical (50–90% of latency) |
| Memory | HBM | HBM + DDR5 | HBM + DDR5 + large system RAM |
| Networking | GPU-to-GPU | Moderate | Heavy (APIs, agent-to-agent) |
| Storage | Datasets | Model weights | Persistent memory, vector DBs |
AMD is the primary beneficiary as the only vendor selling both a market-leading server CPU (EPYC Venice: 256 cores, 1.6 TB/s, TSMC 2nm) and a competitive AI GPU (Instinct MI400, 432GB HBM4). NVIDIA validated the thesis by building the Vera CPU (88 Olympus ARM cores, sold only inside the Vera Rubin NVL72 — the captive Apple model vs AMD's merchant Android model). INTC fights back with Diamond Rapids on 18A. Hyperscaler custom ARM (Graviton, Cobalt, Axion) captures first-party cloud but can't reach enterprise/on-prem/multi-cloud — the merchant lane AMD owns. Historical precedent: optionality wins in datacenter silicon.
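The 50–90% CPU latency share implies an Amdahl-style ceiling on what faster GPUs alone can deliver. The sketch below assumes a hypothetical 10x GPU speedup and treats the CPU-side share as unchanged; it is a back-of-envelope illustration, not a model from the cited paper.

```python
def max_speedup(gpu_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law: only the GPU-bound fraction of latency gets faster."""
    return 1 / ((1 - gpu_fraction) + gpu_fraction / gpu_speedup)

for cpu_share in (0.5, 0.9):          # CPU/tool share of latency (range from the text)
    s = max_speedup(gpu_fraction=1 - cpu_share, gpu_speedup=10)  # assumed 10x faster GPU
    print(f"CPU share {cpu_share:.0%}: end-to-end speedup {s:.2f}x")
```

Even a 10x faster GPU yields under 2x end to end when half the latency is CPU-side, and about 1.1x when the CPU share is 90%, which is the arithmetic behind rebalancing the rack toward CPUs.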
Systems and platforms: financing is the moat, not engineering
The systems layer (OEMs assembling NVIDIA boards into deployable servers) has a counterintuitive margin driver: making AI clouds is a question of financing, not technology. Both DELL and Supermicro buy identical H100 8-GPU baseboards from NVIDIA at the same price; Supermicro is the more efficient manufacturer and lists servers $5,000–$10,000 cheaper. Yet Dell wins on total cost of ownership because Dell Financial Services (DFS) — a captive credit arm, the "Toyota model" (largest automaker and largest car lender) — lends at rates 2–3 points below third parties. DFS scale: $8.4B FY24 originations, $10.5B portfolio. For neoclouds and secondary buyers, financing is binary: get a DFS loan or find no willing third-party creditor at all. Supermicro's weak working-capital-strained balance sheet can't compete on credit, so its superior engineering becomes "largely irrelevant." The emerging paradigm: an AI server vendor is a bank with a manufacturing arm attached.
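A worked loan comparison shows how a lower rate can outweigh a lower sticker price. Everything except the price gap and rate gap is hypothetical: the $280K server price, the 9% captive rate, and the 3-year fully amortizing term are assumptions picked for illustration, and the midpoints ($7.5K cheaper, 2.5 points higher) are taken from the ranges in the text.

```python
def total_paid(principal: float, annual_rate: float, months: int) -> float:
    """Total of level monthly payments on a fully amortizing loan."""
    r = annual_rate / 12
    payment = principal * r / (1 - (1 + r) ** -months)
    return payment * months

# Hypothetical inputs: $280K server, 3-year loan, 9% captive rate.
# From the text: rival lists ~$7.5K cheaper but borrows ~2.5 points higher.
captive = total_paid(280_000, 0.09, 36)
third_party = total_paid(280_000 - 7_500, 0.09 + 0.025, 36)
print(f"captive financing: ${captive:,.0f}; third-party: ${third_party:,.0f}")
```

Under these assumptions the cheaper server ends up costing about $3K more over the life of the loan. The gap widens with longer terms or larger rate spreads, and it becomes moot if no third-party lender is available at all.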
The buyer split shapes who supplies: hyperscalers buy from ODMs (Quanta, FII/Foxconn, Inventec, Wistron, Wiwynn, ZT Systems) at ~2–3% margins because they need minimal service; non-hyperscale buyers (enterprises, neoclouds, sovereign AI) buy from OEMs (Dell, Supermicro, HPE, Lenovo). Dell has taken sockets at CoreWeave, Tesla, and xAI.
Cloud / neocloud economics and the duration-risk choke point
Hyperscaler capex is the funding engine: the Big Five are projected at $660–690B in 2026 (roughly double 2025), ~75% AI-related. But they are supply-, not demand-, constrained — Microsoft holds $80B in Azure orders it cannot fill for lack of power. Financing has shifted off balance sheet: hyperscalers issued ~$121B in new debt in 2025; SPV structures (Meta's $27B Hyperion deal with Blue Owl/PIMCO/BlackRock; Microsoft's BlackRock GIP partnership; Oracle's $18B Stargate project loan) keep debt off corporate books. The $6T capital-deployment ceiling reconstructs as ~$450B annual hyperscaler cash from operations × 5–7 years + SPV doubling + 1x corporate leverage. Today only ~7% of industry capex is debt-funded vs 32% in the 2000 telecom boom; McKinsey models $6.7T by 2030 (split ~60% compute / ~25% power / ~15% construction).
Neoclouds (CoreWeave, Crusoe, Nebius, Applied Digital) are the new intermediaries, and their economics carry a structural mismatch: datacenter assets have 30+ year lives funded with long-term capital, colocation tenants sign 15-year leases — but neoclouds lock customers for only 1–3 year GPU-rental agreements. That duration risk is the choke point. Debt rates run 12%+ standard, up to 18% for unproven neoclouds without contracts; blockbuster deals (Blackstone × CoreWeave) achieve better. Financing preference order: vendor financing > prominent lenders > equity (20%+ hurdle) > customer prepayment (ideal, lowers upfront cash). Project-level payback models often target <2 years. GPU-cloud workloads are homogeneous (H100 optimal for training, high-volume inference, diffusion) so networking is rarely the bottleneck and config complexity is far lower than multi-tenant CPU cloud. Lender/investor concentration is documented on gpu-cloud-lenders; deeper deal terms on neocloud-gpu-cloud-economics and sa-ai-neocloud-playbook.
Power delivery — the constraint stack beneath the rack
Power is the most-cited bottleneck and increasingly the binding one. A conventional CPU/storage server draws ~1kW; an AI server now eclipses 10kW. Conventional CPU racks deliver 15–20kW; AI accelerator racks need >200kW at the rack level — a 10x step. The H100 is 700W TDP vs <200W for the most-installed datacenter CPU (Intel Skylake/Cascade Lake); the GB200 NVL72 rack consumes 120kW.
Inside the rack, the value pool is the VRM and DC-DC conversion chain. The grid-to-chip path steps hundreds of thousands of volts AC down through transformers, PSUs (AC→DC), and finally VRMs (final step-down to ~1V at the silicon). The physics: transporting power at low voltage/high current creates I²R losses, so you transport high-voltage/low-current and step down as close to the silicon as possible — driving 48V DC and now 800V HVDC rack architectures. Supplier landscape: Vicor led in 2017 when Google drove first 48V adoption but was displaced — MPS (Monolithic Power) replaced Vicor as primary H100 GPU power supplier, with Delta, Renesas, Infineon, and ADI all making share gains. Delta's bi-directional DC-DC converter (U50SU4P162, a Vicor NBM equivalent) is winning hyperscaler deals, with Foxconn/Quanta in its supply chain. Detail on rack-power-delivery-primer-gb200 and dc-hpc-power-density.
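The I²R argument can be made concrete for a single rack. The 120 kW load is the GB200 NVL72 figure from the text; the 0.1 milliohm distribution resistance is a hypothetical value used only to compare the two voltages on equal terms.

```python
def conduction_loss_w(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """I^2 * R loss in a conductor carrying the given power at the given voltage."""
    current_a = power_w / voltage_v
    return current_a ** 2 * resistance_ohm

RACK_POWER_W = 120_000        # GB200 NVL72 rack (from the text)
BUSBAR_OHM = 0.0001           # hypothetical 0.1 milliohm distribution path

for volts in (48, 800):
    amps = RACK_POWER_W / volts
    loss = conduction_loss_w(RACK_POWER_W, volts, BUSBAR_OHM)
    print(f"{volts:>3} V: {amps:,.0f} A, {loss:,.1f} W lost")
```

For the same power and conductor, moving from 48 V to 800 V cuts current by about 17x (2,500 A to 150 A) and conduction loss by about 278x, since loss scales with the square of the voltage ratio.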
Power delivery — the grid-level choke point
Above the rack, the binding constraint is electricity itself and the transformer supply chain. Global datacenter critical IT power surges from 49GW (2023) to 96GW (2026), of which AI consumes ~40GW; the capacity CAGR jumped from 12–15% to 25%. AI is projected to push datacenters to 4.5% of global electricity generation by 2030 (SemiAnalysis); global DC electricity roughly doubles to 945 TWh by 2030. US DC critical IT capacity must triple 2023–2027. JPM projects 100GW of US datacenter demand — 2–3x the current US nuclear fleet.
Transformers are the chokepoint. Even in normal times they carry 12–24 month lead times; now Hitachi's order book stretches to 130 weeks for medium units and ~4 years for the largest utility-substation units. Wood Mackenzie (April): lead times 115–130 weeks average; large substation and generator step-up (GSU) transformers 120–210 weeks (2.3–4 years), vs 30–60 weeks pre-pandemic. Prices up 60–80% since Jan 2020 (grain-oriented electrical steel nearly doubled, copper +40%); Rystad estimates +40% since 2019 with the crunch lasting at least through end-2026. "Power transformers are currently the most severely undersupplied critical power grid equipment." A new operator may build an entire substation to step 100kV/220kV down to 11kV or 22kV, then another transformer bank down to 480V for the data hall. Underlying transformer tech has been unchanged ~50 years — the constraint is manpower and factory capacity (a new factory takes ~4 years to build). Hitachi Energy is investing $6B and hiring 15,000. Adjacent grid equipment (switchgear, cabling, smart meters) faces the same demand surge; utility-scale transformer market projected to ~$116B by 2032. The broader electrical BOM (transformers, switchgear, UPS, OCP busbar, generators, substations) and Vertiv / Schneider / Eaton positioning are catalogued on ai-infrastructure.
Southeast Asia: where the power choke point bites a regional value chain
Training is latency-insensitive and can locate anywhere economic (subject to data residency); inference is the larger eventual workload but distributable. SEA's pull factors are the AI-DC industry's binding inputs: abundant cheap power, energy-supply-chain stability against geopolitical/weather shocks, ability to ramp fuel and grid capacity, and low-carbon power mix to meet emission commitments and chip-export conditions. Singapore imposed a 2023 three-year moratorium, then capped new DC allocation at 60MW/year and required greening; SG still hosts ~60% of SEA DC capacity and DCs consume >7% (some sources 8–10%) of its electricity. Singapore depends on imported natural gas for ~90% of power and is exploring hydrogen for up to half its power by 2050 (Keppel's "Datapark+" concept — using DC demand to anchor hydrogen-supply investment, not to power DCs directly). Demand displaced by the moratorium flows to Malaysia (Johor) and beyond. Datapoints on sea-ai-boom-datapoints and datapoints; regional buildout thesis on bcg-sea-datacenters and asean-power-grid-interconnections.
Pricing power, in one line per stage
Foundry: extreme (TSMC 2nm monopoly, no alternative). GPUs: near-monopoly (CUDA lock-in; ripping out CUDA is a multi-quarter engineering project). HBM: oligopoly pricing power, sold out, 3–5x DDR premium. CPUs: shifting from Intel monopoly to AMD/Intel duopoly + ARM, so weakening for the incumbent. Networking: oligopoly with Ethernet displacing InfiniBand. Optical: fragmented, volume-driven, not pricing-driven. Systems/OEM: thin (financing arbitrage, not product margin; ODMs at 2–3%). Cloud: scale-protected but capital-intensive (30–40% margin). Model layer: collapsing under open-source (OpenAI ~33% gross). Applications: bimodal — Palantir prints money, AI-wrappers go to zero.
Notes on sources
interconnect-challenges-computing.md is a bare bookmark (a Science.org URL) with no extractable value-chain content. utility-vs-it-capacity-dc.md contains only repeated "What's Next?" newsletter sign-off boilerplate, with no value-chain substance. semianalysis/index.md is a 260-article catalog (titles + ticker tags); the value-chain-relevant SemiAnalysis pieces it indexes are the upstream sources for the BOM, neocloud, power, and memory analysis reproduced above: GB200 BOM (Jul 2024), AI Neocloud Playbook (Oct 2024), Datacenter Anatomy Part 1 Electrical (Oct 2024), Rubin CPX (Sep 2025), Scaling the Memory Wall / HBM (Aug 2025), How Dell is Beating Supermicro (May 2024), AI Server Cost Analysis "Memory Is The Biggest Loser" (May 2023).
Players
The AI infrastructure stack runs from foundry through silicon, systems, cloud, software, and applications. Value is created highest up the stack (applications that generate enterprise revenue) but captured lowest down — NVIDIA's near-monopoly pricing power at the chip-design layer is the most extreme picks-and-shovels dynamic of any tech cycle, because the CUDA moat is harder to replicate than cloud infrastructure ever was. The map below positions the companies that matter, with the company-specific deep work living on the linked ticker pages.
Compute silicon — GPUs, CPUs, accelerators
NVIDIA (NVDA) — the toll road. Designs the GPUs that train and run virtually every frontier model, plus the networking fabric (NVLink, InfiniBand via Mellanox) and the software stack (CUDA, TensorRT, NIM, Dynamo) that locks customers in. 80–90% of AI accelerator revenue, training share above 90%, inference share lower at 60–75% because of custom silicon. FY2026 revenue $215.9B (+65% YoY), data center ~$170B, GAAP gross margin ~71%, $320B FY2027 backlog. Blackwell (B200/GB200/GB200 NVL72) is the current workhorse; Vera Rubin NVL72 launches H2 2026 promising 5x inference performance and 10x lower cost-per-token; Rubin in 2027. The moat is CUDA, not the silicon. Bear case: hyperscaler custom ASICs taking 30–40% of inference in-house, plus DeepSeek-style efficiency shocks capping total GPU demand. NVIDIA also validated the CPU renaissance by building the Vera CPU (88 custom Olympus ARM cores, 1.5TB LPDDR5X, NVLink C2C at 1.8 TB/s) specifically for agentic workloads, sold captive in the NVL72.
AMD — the only company selling both a market-leading server CPU (EPYC) and a competitive AI GPU (Instinct), which uniquely positions it for the balanced CPU+GPU compute profile that agentic workloads demand. EPYC has gone from ~5% server CPU share in 2018 to ~28–39% by early 2026, with multiple sources projecting AMD overtakes Intel as the largest x86 data center CPU supplier in 2026. EPYC Venice (Zen 6, H2 2026): up to 256 cores / 512 threads, TSMC 2nm, 1.6 TB/s memory bandwidth — purpose-built for agentic orchestration. Instinct MI350X ships 288GB HBM3e (60% more than B200); MI400 "Vulkan" in 2026 with 432GB HBM4 and 40 PFLOPS FP4; the OpenAI deal is for up to 6GW of MI450X capacity, potentially $100B+ over 4–6 years. AI-specific revenue ~$9.5B in 2025, targeting 20–30% discrete AI accelerator share by 2027. Detail and the agentic CPU thesis live on AMD. Competitive gap to watch: NVIDIA's prefill-specific Rubin CPX has no AMD equivalent announced yet, a real risk in disaggregated serving.
Intel (INTC) — the comeback attempt. Server CPU share has bled to AMD for five straight years; under Lip-Bu Tan (CEO since March 2025), Diamond Rapids (Xeon 7, 2026, up to 192 cores on Intel 18A, MRDIMM to 1.6 TB/s) is the must-execute product. Cisco chose Xeon 6 for Cisco Unified Edge (agentic/inference at the edge). The thesis is not agentic AI — it is whether 18A yields materialize on schedule. In discrete AI accelerators Intel is a rounding error (<1%; Gaudi 3 pitched on price), though it still holds ~22% of broader data center revenue with CPUs included. SemiAnalysis's "Intel on the Brink of Death" frames the structural foundry/culture problem. Watchlist-only until 18A yield confirmation.
CBRS — Cerebras Systems. Wafer-scale-engine inference specialist, tracking pre-IPO. The bet is that a single wafer-scale chip beats GPU clusters on inference latency for specific workloads — a differentiated architectural play rather than a CUDA-ecosystem competitor.
ARM-based hyperscaler CPUs — AWS Graviton (4/5; 96/192 cores, 20–40% cost savings vs x86; ~50% of EC2 compute), Microsoft Cobalt 100/200 (30–50% cost advantage), Google Axion (72 cores, 50% better perf / 60% better energy efficiency), NVIDIA Grace (72 Neoverse V2, deployed at Meta scale). ARM holds ~12% of total server market and ~25% of cloud instances. The competitive dynamic that matters: custom ARM chips capture first-party cloud workloads but cannot reach enterprise, on-prem, or multi-cloud — that merchant silicon market is AMD's to win. Ampere Computing (independent ARM server CPU, acquired by SoftBank for $6.5B) is being squeezed from both sides; share declining to 18.2%, classified "avoid."
Custom ASICs — the hyperscaler counter-attack and the biggest long-term threat to NVIDIA. Google TPU (7th-gen Ironwood, 4,614 TFLOPS FP8, 100% better perf/watt than v6e; the most mature program, a decade in), Amazon Trainium (Trainium3 on 3nm, 4.4x compute vs Trainium2; Anthropic trains on 500K+ Trainium2; AWS custom-chip run rate >$10B), Microsoft Maia 200 (TSMC 3nm, 216GB HBM3e, 30% better perf/$ for inference, deployed since Jan 2026), Meta MTIA (four generations in two years, new chip every ~6 months). For investors the benefit flows to the design partners, not standalone companies.
Custom-silicon design partners and networking
Broadcom (AVGO) — the anti-NVIDIA play and custom-chip kingmaker. ~70% of the custom AI ASIC market across five confirmed hyperscaler customers (Google TPU, Meta MTIA, ByteDance, reportedly Apple/OpenAI). AI semis ~$20–21B FY2025 (+74% YoY in Q4), Q1 FY2026 guide $8.2B (doubling), FY2026 projection $40–50B AI revenue, ~67% EBITDA margin. $10B TPU rack order in Q3 FY2025 plus $11B follow-on in Q4. Also supplies the switching silicon (Tomahawk 6 "Davidson": 102.4 Tbps, 3nm) that benefits from the 800G→1.6T transition and MoE all-to-all traffic. Concern: customer concentration (Google ~58% of ASIC revenue).
Marvell (MRVL) — the other custom-silicon winner; same thesis, smaller scale, earlier innings, trading at a discount to Broadcom. Custom AI ASIC business went from near-zero to a $1.5B run rate in one fiscal year; FY2026 revenue ~$8.2B (+42.6%), data center record $1.65B, ~59–60% non-GAAP gross margin. Two XPU programs in volume, a third underway; 18 design wins (XPUs + optical) in 2025. Design partner for AWS Trainium, Microsoft, Google Axion; ~76% of data center revenue from AWS (both an anchor and a dependency). Acquired Celestial AI ($5.5B) for co-packaged optics; PAM4 DSP leader; Teralynx switches at 500ns latency for disaggregated serving.
Arista Networks (ANET) — the Ethernet networking play. Ethernet now wins over two-thirds of AI cluster switch sales; the Ultra Ethernet Consortium 1.0 spec (June 2025) narrows the gap with InfiniBand. Arista leads with ~21.5% data center Ethernet share and surpassed Cisco. 2025 revenue $9B (+28.6%), 47.5% operating margin; 2026 guide ~$10.65B; AI data center revenue target raised to $2.75B (from $1.5B); TAM ~$105B by 2029. Agentic AI's networking-intensive profile (constant API calls, agent-to-agent traffic) is a direct tailwind. Dependency: Broadcom switching silicon. The ISI "AI switching primer / Arista long thesis" is the best single resource. Credo (CRDO) supplies DSP/retimers for datacenter connectivity; Cisco is the incumbent challenger losing share.
ALAB — Astera Labs. Retimers and AI connectivity silicon (PCIe/CXL/Ethernet signal integrity), with direct GB200 content opportunity — its retimers sit on the connectivity path that scales with every speed transition. Pure-play connectivity exposure to the rack-scale buildout; positioned in the SemiAnalysis "Astera Labs IPO — the next connectivity" coverage. Detail on ALAB.
Memory — the bottleneck layer
The binding constraint on AI inference is memory bandwidth, not compute, which makes HBM the goldmine. Three suppliers, all sold out through 2026, with pricing power and a multi-year HBM3E→HBM4→HBM4E upgrade cycle. The HBM market is ~$38B in 2025 and ~$54.6B in 2026, heading to $100B by 2028; HBM commands a 3–5x price premium over standard DDR.
| Maker | HBM share (2025) | Status |
|---|---|---|
| SK Hynix (000660.KS) | 53–62% | Clear leader; HBM4 complete (40% power efficiency gain, 10 Gbps); exclusive NVIDIA relationships; MR-MUF packaging edge; overtook Samsung in annual profit for the first time in 2025 |
| Micron (MU) | 12–21% | Fast follower; HBM3E into NVIDIA, HBM4 samples at 11 Gbps; ~30% lower power claim; US-listed HBM play; exited consumer memory to focus on AI; GDDR7 exposure for prefill chips |
| Samsung | 17–35% | Trailing; recovery hinges entirely on NVIDIA HBM4 qualification; well-positioned for GDDR7 demand from Rubin CPX-style prefill accelerators |
HBM production consumes ~3x the wafer capacity of standard DRAM per gigabyte, squeezing DDR5 supply just as agentic CPU deployments scale DDR5 demand — DRAM demand is growing ~35% in 2026 vs ~23% for supply, the widest gap in decades. SemiAnalysis's "Scaling the Memory Wall" is the reference.
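The claim that opens this section, that decode is limited by memory bandwidth rather than compute, can be made concrete with a back-of-envelope bound. The sketch below assumes a dense model whose weights must be streamed from HBM once per decode step, and it ignores KV-cache reads, compute time, and interconnect overhead. The 70B-parameter FP8 model is a hypothetical chosen for round numbers; the 7 TB/s bandwidth is the Maia 200 HBM3e figure quoted elsewhere on this page.

```python
def decode_tokens_per_sec_bound(bandwidth_bytes_per_s: float,
                                weight_bytes: float,
                                batch_size: int = 1) -> float:
    """Upper bound on decode throughput when each step streams all weights once.

    Simplified model: ignores KV-cache traffic, compute time, and interconnect
    overhead, so real throughput will be lower than this bound.
    """
    steps_per_sec = bandwidth_bytes_per_s / weight_bytes
    # Every sequence in the batch emits one token per step.
    return steps_per_sec * batch_size


# Hypothetical dense model: 70e9 parameters at 1 byte each (FP8) = 70 GB.
WEIGHT_BYTES = 70e9 * 1
# HBM3e bandwidth quoted on this page for Maia 200: 7 TB/s.
HBM_BANDWIDTH = 7e12

single = decode_tokens_per_sec_bound(HBM_BANDWIDTH, WEIGHT_BYTES)
batched = decode_tokens_per_sec_bound(HBM_BANDWIDTH, WEIGHT_BYTES, batch_size=32)
print(f"batch=1:  at most {single:.0f} tokens/s")
print(f"batch=32: at most {batched:.0f} tokens/s")
```

In this simplified model the bound scales with bandwidth and batch size and contains no FLOPS term at all, which is the sense in which bandwidth is the binding constraint.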
Optical interconnects and fiber
The physical layer that makes hyperscale clusters work. The 800G→1.6T module transition in 2026 drives transceiver and connector volume independent of market share; AI-DC optical interconnects grow from ~$10B (2025) to $31B by 2033 (~15.3% CAGR). NVIDIA invested $2B in Coherent (COHR) with multi-billion purchase commitments (800G/1.6T transceivers; competes with Lumentum and Broadcom's vertical integration). Corning (GLW) signed a $6B optical fiber/cable deal with Meta. Japanese connector names cluster here: Seikoh Giken (6834) invented the APC connector and ran up ~600% on AI fiber demand (62x P/E, watchlist for a pullback); JEM (6855), Santec (6777) for optical T&M and tunable lasers, Anritsu (6754) for RF/microwave test, Sumitomo Electric (5802) for fiber and cables. The CPO-vs-pluggable question (Ayar Labs / NVIDIA, Meta) remains unresolved — pluggable keeps winning each generation; analysis suggests CPO economics don't flip until 3.2T.
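The growth rates quoted on this page are easy to sanity-check with the standard compound-annual-growth-rate formula. The sketch below checks two of them: the optical interconnect forecast above ($10B in 2025 to $31B in 2033) and the global critical IT power figure quoted in the watch-items (49GW in 2023 to 96GW in 2026). The inputs are the rounded values from the text, so small differences from the quoted CAGRs are expected.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Optical interconnects: ~$10B (2025) -> $31B (2033), as quoted above.
optical = cagr(10.0, 31.0, 2033 - 2025)

# Global DC critical IT power: 49 GW (2023) -> 96 GW (2026), quoted elsewhere on this page.
dc_power = cagr(49.0, 96.0, 2026 - 2023)

print(f"optical interconnect CAGR: {optical:.1%}")
print(f"critical IT power CAGR:    {dc_power:.1%}")
```

Both results land close to the figures in the text (roughly 15% and 25%), so the quoted endpoints and growth rates are internally consistent.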
Systems — OEMs vs ODMs, and the financing moat
The NVIDIA-server market splits into OEMs (Dell, Supermicro, HPE, Lenovo) and ODMs (Quanta, FII, Inventec, Wistron, Wiwynn, ZT Systems). Hyperscalers buy from ODMs at ~2–3% margins because they need minimal service; non-hyperscale buyers (enterprises, neoclouds, sovereign AI) need OEMs.
DELL is beating Supermicro, and the reason is financing, not engineering. Both buy H100 8-GPU baseboards from NVIDIA at identical prices, and Supermicro is the more efficient manufacturer (server cost $5,000–10,000 lower). But Dell Financial Services — a captive credit arm with $8.4B FY24 originations and a $10.5B portfolio — lends 2–3 percentage points cheaper than the 12–18% third-party rates neoclouds face, which flips total cost of ownership in Dell's favor despite the higher list price. The model is "a bank with a server manufacturing arm attached," the same dynamic as auto OEMs (Toyota is the largest automaker and largest car lender). For smaller/secondary buyers, financing access is binary — get a DFS loan or find no willing creditor — so manufacturing efficiency becomes irrelevant. Dell has gained sockets at CoreWeave, Tesla, and xAI, and is best-positioned for the enterprise on-prem AI opportunity (where Dell and NVIDIA share the goal of keeping control away from hyperscalers). Supermicro retains hyperscaler share but its weak balance sheet caps its pace. Full breakdown in why-dell-beating-supermicro (SemiAnalysis "How Dell Is Beating Supermicro"). Vertiv (VRT), Schneider Electric, Trane, Celestica round out the power/cooling/contract-manufacturing layer.
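The financing argument can be illustrated with a standard amortizing-loan calculation. This is a minimal sketch, not the SemiAnalysis model: the $250,000 base server price and the 4-year term are hypothetical, and the other inputs are midpoints of the ranges quoted above ($5,000–10,000 price gap, 12–18% third-party rates, 2–3 points cheaper at DFS).

```python
def total_financed_cost(price: float, annual_rate: float, years: int) -> float:
    """Sum of all payments on a fully amortizing loan with monthly payments."""
    monthly_rate = annual_rate / 12.0
    n_payments = years * 12
    payment = price * monthly_rate / (1.0 - (1.0 + monthly_rate) ** -n_payments)
    return payment * n_payments


BASE_PRICE = 250_000.0                 # hypothetical Supermicro server price
DELL_PREMIUM = 7_500.0                 # midpoint of the $5,000-10,000 gap
THIRD_PARTY_RATE = 0.15                # midpoint of the 12-18% range
DFS_RATE = THIRD_PARTY_RATE - 0.025    # midpoint of "2-3 points cheaper"
TERM_YEARS = 4                         # hypothetical loan term

supermicro_total = total_financed_cost(BASE_PRICE, THIRD_PARTY_RATE, TERM_YEARS)
dell_total = total_financed_cost(BASE_PRICE + DELL_PREMIUM, DFS_RATE, TERM_YEARS)

print(f"Supermicro + third-party loan: ${supermicro_total:,.0f}")
print(f"Dell + DFS loan:               ${dell_total:,.0f}")
print(f"Dell advantage:                ${supermicro_total - dell_total:,.0f}")
```

Under these inputs the Dell total comes out lower despite the higher sticker price. The gap widens with larger ticket sizes, longer terms, or a wider rate spread, and it disappears if the buyer pays cash.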
2308 — Delta Electronics is the dominant power-delivery player: an estimated ~60% share of the AI server PSU market and the only company offering integrated power delivery from 20,000V grid to 0.8V chip. It co-developed the 800V HVDC rack power shelf with NVIDIA that defines next-gen architecture (>98% conversion efficiency, 1.1 MW per rack, unveiled at NVIDIA GTC 2025). FY2025 Infrastructure segment grew +82% revenue / +413% EBIT; the stock re-rated ~530% in 12 months to ~90x TTM P/E — the thesis is real, the valuation extreme. Detail and the WATCH stance on 2308. In the broader VRM/power landscape, MPS (Monolithic Power) replaced Vicor as the H100 power supplier, with Delta, Renesas, Infineon, and ADI all gaining share as rack power jumps from 15–20kW (CPU) to >200kW (AI accelerator) — SemiAnalysis's "Energizing AI: Power Delivery Competition" tracks this.
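The reason rack power delivery is moving from 48V to 800V follows from two textbook relations: current is power divided by voltage, and resistive loss in a conductor scales with the square of the current. The sketch below applies them to the ~120kW GB200 NVL72 rack and the 1.1 MW figure quoted above. It is a simplified single-bus model that ignores conversion stages and conductor sizing.

```python
def bus_current_amps(power_watts: float, bus_volts: float) -> float:
    """Current drawn on a DC bus delivering `power_watts` at `bus_volts` (I = P / V)."""
    return power_watts / bus_volts


def relative_conduction_loss(v_low: float, v_high: float) -> float:
    """How many times larger I^2 * R loss is at v_low than at v_high,
    for the same delivered power and the same conductor resistance."""
    return (v_high / v_low) ** 2


for label, watts in (("120 kW rack", 120_000.0), ("1.1 MW rack", 1_100_000.0)):
    i_48 = bus_current_amps(watts, 48.0)
    i_800 = bus_current_amps(watts, 800.0)
    print(f"{label}: {i_48:,.0f} A at 48 V vs {i_800:,.0f} A at 800 V")

print(f"I^2*R loss ratio, 48 V vs 800 V: {relative_conduction_loss(48.0, 800.0):.0f}x")
```

Tens of thousands of amps at 48V is not practical to carry in busbars, which is why megawatt-class racks push the distribution voltage up.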
GPU clouds / neoclouds
CoreWeave (CRWV) — the leading neocloud. IPO March 2025 ($1.5B raised, ~$35B cap, now ~$24B). Revenue $3.5B annualized mid-2025, guided $4.9–5.1B FY2025 (300% growth), $12–13B 2026; backlog $66.8B; anchor customers OpenAI ($22.4B over 5yr) and Meta ($14.2B). Acquired Weights & Biases for $1.4B. Key risk: a "GPU debt wall" — massive debt to finance GPU purchases (Blackstone-led $7.5B facility, Magnetar, Coatue) against only 1–3yr customer rental commitments. Nebius, Crusoe, Applied Digital (APLD), Lambda Labs, Together AI fill out the field; lender/investor maps for each are in gpu-cloud-lenders. The economics (12–18% financing rates, sub-2yr project paybacks, duration mismatch between 30-yr assets and 1–3yr contracts) are detailed in neocloud-gpu-cloud-economics and SemiAnalysis's "AI Neocloud Playbook." SemiAnalysis's "GPU Cloud ClusterMAX rating system" ranks these providers head-to-head.
Model developers (private, mostly)
Headlines live here; economics are most precarious. The layer is consolidating around 5–7 well-funded labs.
- OpenAI ($25B ARR Feb 2026, $730B valuation after $110B raise, but ~33% gross margin and ~$17B 2026 cash burn — terrible software economics; committed $250B to Azure, $22.4B to CoreWeave).
- Anthropic ($19B ARR Mar 2026, $380B post-money; Claude Code at $2.5B+ ARR; Constitutional AI differentiation; Google + Amazon backed).
- Google DeepMind/Alphabet (GOOGL) — the sleeping giant with TPUs, proprietary data, and 750M Gemini MAUs buried in a $403B revenue machine.
- Meta (META) — the open-source wrecking ball (Llama 4, free, subsidized by ad cash; $115–135B 2026 capex; MTIA silicon).
- xAI — brute force at scale ($230B valuation; Colossus toward 1M GPUs by late 2026).
- Mistral (Europe's champion, $14B, ASML owns 11%).
- DeepSeek — the efficiency revolution that wiped $1T off US tech in Jan 2025 and lifted Chinese LLM share from 3% to 13% in two months; MLA, auxiliary-loss-free MoE balancing, FP8 training.
- Other Chinese players: Alibaba Qwen (73.5M DAUs), ByteDance Doubao (145M DAUs), Baidu ERNIE (200M MAUs).
Inference, serving, tools, applications
Groq — purpose-built LPU, 456 tokens/s, 1–3 J/token vs 10–30 for GPUs; $1.5B from Saudi Arabia; ~$20B NVIDIA licensing deal. Fireworks, Together AI, vLLM, TensorRT-LLM for serving. Scale AI (training data, ~$2B 2026, exploring $25B valuation), Databricks ($5.4B ARR, $134B, IPO expected 2026), Weights & Biases (CoreWeave-owned). Applications: Palantir (PLTR) (the standout — $4.5B 2025 revenue +56%, Rule of 40 of 127%), Salesforce Agentforce ($500M+ ARR +330%), ServiceNow, Cursor ($2B ARR), GitHub Copilot, Claude Code, Perplexity. Orchestration frameworks (LangChain/LangGraph, CrewAI, Microsoft AutoGen) sit at the commoditization-risk frontier as model providers build native agent capabilities.
The investable ranking (agentic-AI primer)
| Rank | Company | Ticker | Thesis | Risk |
|---|---|---|---|---|
| 1 | AMD | AMD | Dual CPU+GPU; primary agentic CPU beneficiary; Venice 256-core + MI400; approaching Intel parity | CUDA moat; China export |
| 2 | NVIDIA | NVDA | 85–90% GPU share + Vera CPU validates full stack; Rubin NVL72 generational | Valuation; custom ASICs |
| 3 | Broadcom | AVGO | ~70% custom ASIC share; networking; AI +74% YoY | Google ~58% of ASIC revenue |
| 4 | TSMC | TSM | Fabs for NVIDIA, AMD, Broadcom, Marvell; 2nm monopoly | Taiwan geopolitics; capex |
| 5 | SK Hynix | 000660.KS | 53% HBM share; sold out; HBM4 lead | Cyclicality; Samsung catching up |
| 6 | Arista | ANET | AI-DC Ethernet leader; $10B target; EtherLink | Broadcom silicon dependency; Cisco |
| 7 | Marvell | MRVL | #2 custom ASICs; PAM4 DSP; Celestial AI optics | Execution; AWS dependency |
Tiered conviction from the same primer: Tier 1 core holdings AMD, NVDA, TSM; Tier 2 tactical AVGO, SK Hynix, ANET, MRVL; Tier 3 watchlist INTC (await 18A yields), COHR, Seikoh Giken/6834 (terrible entry after 11x), Salesforce; avoid Ampere Computing.
Head-to-head analyses in _compare/
For the optics and equipment names that feed this sector, the comparative work is in _compare/: santec-vs-jem-vs-anritsu-and-more-versus.md (optical T&M and components — Santec/6777, JEM/6855, Anritsu/6754), soitec-vs-aixa-showdown.md (engineered wafers vs MOCVD equipment for optoelectronics and power), and amkr-vs-umc-vs-6809-and-more-showdown.md / ats-vs-onto-vs-uctt-and-more-showdown.md for packaging and semicap subsystems. The TSMC CoWoS bottleneck, HBM stacking equipment (Applied Materials TSV tools; Hanmi/ASMPT/Besi bonders; Camtek inspection) sit upstream of every name above.
Monitor
A rolling log of what moved this sector, what's dated and worth re-checking, and the standing watch-items that recur across the primers. Industry-wide only — company-specific catalysts live on the ticker pages (NVDA, AMD, INTC, AVGO, MRVL, MU, ANET, TSM, AIXA, SOITEC, 268A, 6855, 6834, 6777, 5802, ALAB). Dated specifics are preserved with their dates so the snapshot can be aged correctly — every primer here carries the same caveat: numbers go stale fast.
Standing watch-items (the recurring checklist)
These are the leading indicators every primer in this sector tells you to track. Run them on each hyperscaler earnings cycle and each NVIDIA print.
- Hyperscaler capex guidance — quarterly earnings calls. The Big Five (Amazon, Google, Meta, Microsoft, Oracle) guided to $660–690B aggregate capex in 2026, ~double 2025, with ~75% AI-related. Watch for moderation, which is the first sign the cycle is rolling over. Microsoft's 1.5GW self-build slowdown (Apr 28, 2025) is the template for how a "freeze" gets misread.
- NVIDIA data center revenue — the single most important AI data point. FY2026 ~$170B DC revenue; $320B backlog for FY2027. If this rolls, everything downstream rolls.
- GPU utilization rates at cloud providers — currently 60–70%; Meta's Llama 3 training hit just 38% model-flop utilization. If utilization drops, it signals over-building.
- Enterprise AI contract size and duration — growing = real adoption. The neocloud model rests on locking multi-year demand; contract terms compressed from 3-year-with-prepayment (2022–23 shortage) to 6-month–1-year deals (2024 supply improvement). Watch whether they re-lengthen or shorten.
- Open-source benchmark gap vs frontier — narrowing = margin pressure on the model layer. Llama 4 / DeepSeek / Qwen reach "good enough" within 6–12 months of closed frontier.
- Inference cost per token — falling ~50× per year (Epoch AI); capability-adjusted cost falling 5–10×/yr. Faster decline = faster adoption (Jevons) but lower per-query revenue. B200 hit $0.02/M tokens on some benchmarks vs $20/M for equivalent GPT-4 quality in 2022.
- HBM demand–supply gap — DRAM demand growing ~35% in 2026 vs ~23% supply, the widest gap in decades. Track capacity additions vs bit-demand growth, and Samsung's NVIDIA HBM qualification status — it determines Samsung's whole competitive trajectory (it lost ~$126B of market value waiting on HBM3E qualification).
- Power transformer lead times — the binding physical constraint. Lead times are now 115–130 weeks for medium units and 120–210 weeks (2.3–4 years) for large utility-scale units, versus 30–60 weeks pre-pandemic; Hitachi's order book stretches to ~130 weeks. Transformer prices are up 60–80% since Jan 2020. Rystad expects the crunch to last at least through end-2026. This gates data center construction directly — Microsoft has $80B of unfillable Azure orders because it can't get power.
- Custom silicon share of inference — is NVIDIA at ~85% and rising, or rising-in-dollars-but-falling-in-percent? Track hyperscaler XPU ramps (Google TPU, AWS Trainium, Microsoft Maia, Meta MTIA) flowing to AVGO / MRVL design revenue.
- Enterprise AI ROI data — due late 2026–2027. This either validates or kills the capex cycle. Gartner's counter-call: 40% of agentic AI projects canceled by 2027; only ~130 of thousands of agentic vendors are "real."
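The inference-cost item in the checklist above quotes two endpoints ($20 per million tokens in 2022 and $0.02 per million on B200) without an annualized rate. The sketch below converts the endpoints into an implied per-year decline factor. The three-year span is an assumption, since the text does not date the B200 benchmark.

```python
def annual_decline_factor(start_price: float, end_price: float, years: float) -> float:
    """Constant per-year factor by which price falls to go from start to end."""
    return (start_price / end_price) ** (1.0 / years)


START_PRICE = 20.0    # $ per million tokens, 2022 figure quoted above
END_PRICE = 0.02      # $ per million tokens, B200 figure quoted above
YEARS = 3             # assumption: the B200 benchmark is from roughly 2025

factor = annual_decline_factor(START_PRICE, END_PRICE, YEARS)
print(f"total decline: {START_PRICE / END_PRICE:,.0f}x")
print(f"implied annual decline: {factor:.1f}x per year")
```

Under that assumption the endpoints imply about 10x per year, which sits at the top of the 5–10x capability-adjusted range rather than near the ~50x headline figure.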
Dated developments — 2026
- Jan 2026 — NVIDIA formalized the Inference Context Memory Storage Platform (ICMSP), a BlueField-4-powered "G3.5" Ethernet-attached flash tier purpose-built for KV cache (claims 5× tokens/sec, 5× power efficiency vs traditional storage). Twelve storage vendors building on it (Dell, HPE, Pure Storage, VAST, WEKA, others). Signals "token warehousing" — persistent, location-independent KV cache — is now an industry architectural direction, not speculation.
- Jan 2026 — Microsoft Maia 200 deployed at scale (TSMC 3nm, 216GB HBM3e at 7 TB/s, 30% better perf/$ than third-party hardware in MSFT's fleet, inference-built). xAI $20B Series E closed, valuation to $230B.
- Feb 2026 — OpenAI raised $110B at a $730B valuation (SoftBank $30B, NVIDIA $30B, Amazon $50B). Anthropic closed its $30B Series G at $380B post-money (led by GIC and Coatue). NVIDIA reported Q4 FY2026 alone at $68.1B (+73% YoY); FY2026 $215.9B. Alibaba Qwen3.5 shipped 60% cheaper than predecessor; ByteDance Doubao 2.0 launched as LLM+multimodal+agent triple-threat.
- Mar 2026 — At Morgan Stanley TMT, Lisa Su said the CPU portion of AMD's business "far exceeded my expectations in terms of demand"; AMD server CPU supply nearly fully allocated for 2026. Intel reallocated capacity from PC to server chips the same week. This is the inflection that reframed the narrative from "GPUs only" to the CPU renaissance / agentic pivot. Anthropic at $19B ARR, OpenAI at $25B ARR (Feb), Cursor at $2B ARR.
- 2026 product cadence to watch: AMD EPYC Venice (Zen 6, H2 2026) — up to 256 cores / 512 threads, TSMC 2nm, 1.6 TB/s/socket. Intel Diamond Rapids (Xeon 7, 2026) on 18A, up to 192 cores, MRDIMM to 1.6 TB/s — gated on 18A yield confirmation. NVIDIA Vera Rubin NVL72 (H2 2026) — 5× inference, 10× lower cost/token vs Blackwell; the Vera CPU (88 Olympus cores) is the strongest validation of the CPU-renaissance thesis. AMD MI450 / 1GW deployment starts H2 2026 (OpenAI deal: up to 6GW, $100B+ over 4–6 years). HBM4 begins Q3 2026 (2026 mix ~55% HBM4 / 45% HBM3E); SK Hynix + Samsung planning ~20% HBM3E price hikes.
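One way to compare the two CPU roadmaps in the product-cadence item above is memory bandwidth per core, since both parts are quoted at 1.6 TB/s with different core counts. The sketch below does that division. It assumes the Diamond Rapids figure is per socket like the Venice one (the text only says so explicitly for Venice) and uses decimal units (1 TB = 1000 GB).

```python
def bandwidth_per_core_gbs(socket_bandwidth_tbs: float, cores: int) -> float:
    """Peak memory bandwidth per core in GB/s, given socket bandwidth in TB/s."""
    return socket_bandwidth_tbs * 1000.0 / cores


# Figures quoted above; Diamond Rapids per-socket basis is an assumption.
venice = bandwidth_per_core_gbs(1.6, 256)
diamond_rapids = bandwidth_per_core_gbs(1.6, 192)

print(f"EPYC Venice (256 cores):    {venice:.2f} GB/s per core")
print(f"Diamond Rapids (192 cores): {diamond_rapids:.2f} GB/s per core")
```

These are peak figures at maximum core count, so they say nothing about sustained bandwidth or about lower-core-count SKUs.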
Dated developments — 2025 (the year the framing changed)
- Jan 2025 — the DeepSeek moment. DeepSeek-R1 matched GPT-4o/o1 at a claimed ~$5.6M training run (vs $100M+); $1T wiped from US tech stocks, NVIDIA −$593B in one session (largest single-day value destruction in US history). Chinese LLM global share went 3% → 13% in two months. R1 API launched 27× cheaper than o1; by V3.2 the gap grew to 140×. Permanently shifted the efficiency-vs-scale debate. SA coverage: DeepSeek Debates (Jan 31), DeepSeek Debrief 128 Days Later (Jul 3).
- Jan 15, 2025 — AI Diffusion export controls — 3-tier country system, Malaysia stranded capacity, sovereign-AI restrictions. NVIDIA China revenue fell from ~25% to <10% of total.
- Jan 23, 2025 — OpenAI Stargate JV demystified — Texas GigaCampus, SoftBank capital question, Abilene/Crusoe/Oracle cost breakdown.
- Mar 19, 2025 — NVIDIA GTC 2025: Vera Rubin, Kyber, CPO, Dynamo inference, Feynman roadmap.
- Mar 26, 2025 — ClusterMAX GPU-cloud rating system published.
- Apr 10, 2025 — Tariff Armageddon — GPU loopholes, Mexico supply-chain shift, optical-module pricing surge.
- Apr 16, 2025 — Huawei CloudMatrix 384, China's answer to GB200 NVL72.
- May 16, 2025 — US–UAE/KSA AI deal; sovereign AI ($100B+ Gulf spend) becomes a market-fragmenting/expanding force.
- Jun 11, 2025 — Ultra Ethernet Consortium UEC 1.0 spec released; Ethernet now >2/3 of AI cluster switch sales. New AI Networks: UEC / UALink vs Broadcom SUE.
- Jun 25, 2025 — Gigawatt-scale training load fluctuations flagged as grid-blackout risk.
- Sep 10, 2025 — NVIDIA Rubin CPX announced — first prefill-only GPU (GDDR7, not HBM); validates disaggregated serving and "sends competitors back to the drawing board" (any custom-silicon project now needs both prefill and decode variants). SA Rubin CPX piece.
- Sep 16, 2025 — xAI Colossus 2 — first gigawatt-class datacenter; roadmap to 1M GPUs by late 2026 (Colossus 1: 230K+ GPUs ~240MW; Colossus 2: 550K Blackwell, ~$18B).
- Sep 2025 — Oracle "broke the pattern" by leveraging up aggressively (Stargate $18B project loan from ~20 banks; ~500% debt-to-equity directly on Oracle's books). Per O'Laughlin, this is what turns a "disciplined, cash-flow-funded race" into a "debt-fueled arms race."
- Oct 2025 — Meta Hyperion SPV (the off-balance-sheet template). Morgan Stanley + Blue Owl structured a vehicle: PIMCO ($18B) + BlackRock ($3B) debt (A+ rated), Blue Owl $2.5B equity for 80%, Meta retains 20% and leases back — Meta took $3B cash at close, debt stays off its balance sheet. Microsoft's BlackRock/GIP partnership ($40B Aligned Data Centers, 70% fund-level leverage) is the parallel.
- Throughout 2025 — Hyperscalers issued ~$121B of new debt ($90B in the last three months alone) to bridge the AI-capex / FCF gap; AI capex is no longer funded from operating cash flow alone.
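The Meta Hyperion SPV item in the list above gives enough detail to compute how debt-heavy the externally raised capital is. The sketch below uses only the figures as stated there (PIMCO $18B and BlackRock $3B of debt, Blue Owl $2.5B of equity). It does not attempt to value Meta's retained 20% stake, which the text does not quantify.

```python
# Figures in $B, as stated in the Oct 2025 item above.
debt_tranches = {"PIMCO": 18.0, "BlackRock": 3.0}
external_equity = 2.5  # Blue Owl's equity cheque

total_debt = sum(debt_tranches.values())
external_capital = total_debt + external_equity

debt_share = total_debt / external_capital
debt_to_equity = total_debt / external_equity

print(f"external capital raised: ${external_capital:.1f}B")
print(f"debt share of external capital: {debt_share:.1%}")
print(f"debt per $1 of external equity: ${debt_to_equity:.1f}")
```

On these numbers roughly nine-tenths of the outside money is debt, which is what makes the structure attractive to the sponsor and sensitive to any shortfall in lease cash flows.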
Dated developments — 2024 (foundation events)
- Late 2024 — Private credit deploying ~$50B/quarter into AI data centers (UBS), 2–3× what public markets supply. Meta's Louisiana DC financing $29B ($26B 144a bonds + $3B equity; PIMCO + Blue Owl). CoreWeave $7.5B Blackstone/Magnetar debt facility (May 2024). Apollo arranged $11B for Intel's Ireland fab. Tenors 15–24 years.
- Dec 25, 2024 — NVIDIA GB300 / B300 reasoning-inference, Blackwell delays, GB300 BOM, VRMs.
- Oct 14, 2024 / Feb 13, 2025 — SemiAnalysis Datacenter Anatomy Part 1 (Electrical: transformers, switchgear, UPS, OCP busbar, generators, substation) and Part 2 (Cooling). Reference base for the power-bottleneck thesis.
- Oct 3, 2024 — AI Neocloud Playbook — H100 rental price cuts, cluster BOM, cost-of-ownership and returns.
- Sep 4, 2024 — Multi-datacenter training (OpenAI) — gigawatt clusters across sites changes networking requirements fundamentally.
- CES 2026 / 2024 demos — AMD EPYC Venice prototypes shown at CES 2026; NVIDIA Vera CPU unveiled. Track the Vera-adoption-vs-Venice-adoption race as the live test of full-stack lock-in vs merchant optionality.
Things to watch / unresolved questions (carried from the primers)
These are the open debates the sector keeps re-litigating. Each resolves over a different horizon.
- The $600B (now ~$6T) revenue gap. Sequoia/Cahn: NVIDIA run-rate implies ~$600B of annual AI revenue needed to break even; the gap tripled from $125B (late 2023). JPMorgan: ~$650B needed for a "mere 10% return." Only OpenAI generates significant AI-native revenue. O'Laughlin's $6T figure is the total deployable capital ceiling (FCF ~$450B/yr × 7yr + SPV doubling + 1x corporate leverage). McKinsey $6.7T by 2030, iCapital $5.3T (cash flow covers only ~$1.5T), Morgan Stanley $2.9T-through-2028 with a $1.5T financing gap. Resolves over 3–5 years. The railroad parallel: 1/3 of US rail mileage went into receivership by 1895 despite the technology being real.
- CPU-to-GPU ratio normalization. Georgia Tech/Intel (arXiv:2511.00739, Nov 2025): tool processing = 50–90% of agentic latency; agentic CPU:GPU approaches 1:1. Futurum projects 34.9% CPU demand growth by 2029. Watch whether this holds as agent architectures and GPUs evolve.
- Does agentic adoption follow BofA's $155B curve or Gartner's 40% failure rate? Same market, opposite predictions. Only 6% of orgs have agents in production today; 64% plan to. Enterprise deployment data over the next 12 months resolves it.
- CPO vs pluggable at 1.6T. Pluggable keeps winning at each generation; analysis suggests CPO economics don't flip until 3.2T. Watch Ayar Labs / NVIDIA and Meta CPO bets vs the 800G→1.6T→3.2T pluggable cadence.
- Custom silicon vs NVIDIA. Does TPU/Trainium/Maia/MTIA erode the moat or expand the market? Morgan Stanley's ASIC analysis: custom wins inference, NVIDIA keeps training. If the Big Five bring 30–40% of inference in-house over 3–5 years, NVIDIA's TAM shrinks.
- Power as the binding constraint. JPM projects 100GW of US DC demand (2–3× current US nuclear output). Global DC critical IT power 49GW (2023) → 96GW (2026), of which ~40GW is AI; capacity CAGR jumped 12–15% → 25%. SA: AI pushes datacenters to 4.5% of global generation by 2030; US critical IT capacity must triple 2023–2027. Nuclear deals take 5–10 years. Near-certain and already binding.
- SEA as the next frontier — and its bottlenecks. Singapore's 2023 moratorium (lifted but capped ~60MW/yr, must be "greened") pushed demand to Malaysia/Johor; SG still ~60% of SEA DC capacity and DCs consume 7%+ of its electricity (SG depends on imported natural gas for ~90% of power; eyeing hydrogen to 2050 — Keppel's "DataPark+"). Watch SEA grid-interconnect backlog and transformer wait times as the gating items. Note the physical-DC vs chip-demand discrepancy: the 40/96GW AI split is a chip-demand estimate; physical DCs tell a different story.
- Financing-structure risk (the bubble watch). PIK income up to 11.7% of BDC loans (highest since 2020); private-credit interest coverage ~2.0x; recovery rates ~33% vs 52% for syndicated loans. Banks hold ~$95B committed credit lines to private-credit vehicles ($56B utilized, +145% over five years). US bank exposure to NBFIs >120% of Tier 1 capital. Watch the 2028 maturity wall, with 'B'-category debt rising from $122B (2025) to $682B (2028), and CLO-market freeze risk. The neocloud-specific version: the "GPU debt wall" — duration mismatch between 30-yr-life DC assets / 15-yr colo leases and 1–3 yr GPU rental contracts.
- Token-cost trajectory + token warehousing. Persistent KV-cache (WEKA "Token Warehouse," NVIDIA ICMSP) could collapse the three-tier API pricing (input / cached input / output) to two as cache hits become default. DeepSeek reported 56.3% production hit rate; WEKA claims 96–99% for agentic. Watch whether prefill GPU fleets shrink (SA: prefill ~80% of GPU cycles in many workloads).
- Does open-source kill model-layer margins? OpenAI at 33% gross margin, ~$17B 2026 cash burn, $730B valuation. If API pricing collapses, that valuation is unjustifiable. Happening now.
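The CPU-to-GPU question in the list above has a simple quantitative core in Amdahl's law. If tool processing on CPUs is 50–90% of agentic latency, a faster GPU only shrinks the remaining share. The sketch below computes the end-to-end speedup for a given tool-processing share and LLM speedup; the 5x case borrows the Vera Rubin inference claim quoted on this page as an illustrative input.

```python
def end_to_end_speedup(tool_fraction: float, llm_speedup: float) -> float:
    """Amdahl's law: only the LLM share (1 - tool_fraction) of latency gets faster."""
    llm_fraction = 1.0 - tool_fraction
    return 1.0 / (tool_fraction + llm_fraction / llm_speedup)


for tool_fraction in (0.5, 0.9):
    for llm_speedup in (5.0, float("inf")):
        s = end_to_end_speedup(tool_fraction, llm_speedup)
        print(f"tool share {tool_fraction:.0%}, LLM {llm_speedup}x faster "
              f"-> end-to-end {s:.2f}x")
```

Even with an infinitely fast accelerator, the end-to-end gain is capped at 2x when tools are half of latency and at about 1.1x when they are 90%. That cap is the arithmetic behind the CPU-renaissance thesis, and it loosens only if the tool-processing share itself falls.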
Best sources for ongoing monitoring
- SemiAnalysis (Dylan Patel et al., ★★ paid) — the dated archive lives in the SemiAnalysis MOC (260 articles, 2020–2025, navigable by year and by ticker). Essential: "CPUs are Back: The Datacenter CPU Landscape 2026," the Tokenomics Model, the AI Datacenter Model (5,000+ facilities). Authoritative on interconnects, HBM, power delivery, neocloud economics.
- FundaAI (★★ paid) — AI agents, memory, infrastructure (DeepINTC agentic-bottleneck, DeepKioxia NAND/HBM). Distinct from the funda.ai tooling platform.
- Fabricated Knowledge (Doug O'Laughlin) — capital-cycle framing ("Capital Cycles and AI," Jan 2025; "Oracle and Animal Spirits," Sep 2025), the $6T capital math, semiconductor company context.
- Futurum / BofA / Epoch AI / The Information / Sequoia blog — agentic infra projections, the $155B and $600B frameworks, scaling-law and token-cost data, AI company revenue reporting.
- NVIDIA, AMD, Broadcom, hyperscaler earnings calls — the primary data source for the hardware cycle; Lisa Su and Jensen Huang comments on agentic AI directly move markets.
- Georgia Tech/Intel arXiv:2511.00739 — "A CPU-Centric Perspective on Agentic AI," the foundational academic work behind the CPU-renaissance thesis.
Sources
The AI Infrastructure corpus draws on a tight set of named analysts and a large body of cited primary research. Where an author or publication has a handle page under _sources/, it is linked below.
Primary research authors / publications
- source-semianalysis — SemiAnalysis (lead author Dylan Patel; also Daniel Nishball et al.). The single dominant source for this sector. Covers data center infrastructure, AI compute economics, multi-datacenter training, power delivery, interconnects, HBM, and geopolitics. The vault holds a full local mirror: a 260-article archive (2020–2025) under KB/wiki/semianalysis/, indexed by semianalysis/index.md. Cornerstone pieces include "Scaling the Memory Wall: The Rise and Roadmap of HBM," "Another Giant Leap: The Rubin CPX Specialized Accelerator & Rack," "AI Neocloud Playbook and Anatomy," "Multi-Datacenter Training: OpenAI's Ambitious Plan," "100,000 H100 Clusters: Power, Network Topology," the "Datacenter Anatomy" series (Electrical / Cooling), "GB200 Hardware Architecture – Component Supply Chain & BOM," "OpenAI Stargate Joint Venture Demystified," "DeepSeek Debates," and the AMD vs NVIDIA inference benchmarks. SemiAnalysis's proprietary Tokenomics Model and AI Datacenter Model (5,000+ facilities tracked) are referenced as the basis for the $6T capital figure.
- source-doug-olaughlin — Doug O'Laughlin / Fabricated Knowledge (President of SemiAnalysis, founder of Fabricated Knowledge). Source for semiconductor industry context, company-level analysis, and the "$6 trillion hyperscaler capital" thesis built from his "Capital Cycles and AI" (Jan 2025) and "Oracle and Animal Spirits" (Sept 2025) articles, plus his TBPN (Technology Brothers Podcast Network) appearances. His Fabricated Knowledge podcast hosted WEKA Chief AI Officer Val Bercovici (July 2025) on KV-cache persistence. Recommended as the ongoing-research source for semis context in the agentic primer.
- source-fundaai — FundaAI (Substack; not to be confused with funda.ai / the /funda-tools skill). Covers AI agents, memory, infrastructure. Several SemiAnalysis archive articles carry a FundaAI byline in the mirror.
- source-citrini — Citrini Research. Thematic/macro; appears in the SemiAnalysis archive byline set and in the broader AI-infra hub (Celestica long thesis).
- source-meridian-report — The Meridian Report (author "Steve"). Byline on several 2024 SemiAnalysis-archive pieces (Intel 14A, NVIDIA Blackwell perf/TCO, Astera Labs IPO, B100/B200/GB200 COGS).
- source-photoncap — PhotonCap (photonics / OFC / CPO). Byline on SemiAnalysis-archive entries including "How Dell is Beating Supermicro" and "Apple's AI Strategy."
- Sequoia Capital — David Cahn. The "$600B revenue gap" accounting (Nvidia DC run-rate × 2 × gross-margin breakeven), tripled from $125B since late 2023. Foundation of the financing-bubble note.
- Georgia Tech / Intel — "A CPU-Centric Perspective on Agentic AI" (arXiv:2511.00739, Nov 2025). The foundational academic finding that tool processing on CPUs accounts for 50–90% of total agentic-workload latency.
- Academic / lab papers cited: "Attention Is All You Need" (Google Brain, 2017, 173K+ citations); Kaplan et al. scaling laws (2020); Chinchilla (DeepMind, 2022); Mamba (Gu & Dao, 2023); DistServe (UCSD Hao AI Lab, Jan 2024); Microsoft Research Splitwise; Google Research (GQA, Switch Transformer, Expert Choice routing); MIT HAN Lab StreamingLLM / attention sinks; DeepSeek MLA and V2/V3 papers; Moonshot AI (Kimi) Mooncake architecture (FAST '25 Best Paper); Apple "LLM in Flash" (arXiv 2312.11514v3); "Addressing interconnect challenges for enhanced computing performance" (Science, doi:10.1126/science.adk6189).
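The Sequoia entry in the list above states the "$600B" accounting only as a formula (NVIDIA data center run-rate × 2 × gross-margin breakeven). A worked version is sketched below. The $150B run-rate and the 50% end-user gross margin are not given on this page; they are the inputs as recalled from Cahn's article and should be treated as assumptions here.

```python
def required_ai_revenue(gpu_run_rate_b: float,
                        dc_cost_multiplier: float = 2.0,
                        end_user_gross_margin: float = 0.5) -> float:
    """Annual AI revenue ($B) needed to pay back the implied data center spend.

    gpu_run_rate_b:        NVIDIA data center revenue run-rate, $B (assumed input)
    dc_cost_multiplier:    GPUs taken as roughly half of total data center cost
    end_user_gross_margin: margin the GPU buyer needs on the revenue it earns
    """
    total_dc_spend = gpu_run_rate_b * dc_cost_multiplier
    return total_dc_spend / (1.0 - end_user_gross_margin)


# Assumed inputs: $150B run-rate, 50% end-user gross margin.
print(f"required AI revenue: ${required_ai_revenue(150.0):,.0f}B")
```

The output is linear in the run-rate, so a higher NVIDIA run-rate scales the required revenue proportionally unless the margin or cost-share assumptions change.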
Sell-side, consultancy, and analyst research
- BofA Global Research (June 2025) — $155B agentic-AI software spending estimate by 2030; HBM market $54.6B (2026) → $100B (2028); the $18.6T knowledge-worker-wage TAM logic.
- Goldman Sachs — Jim Covello AI-skeptic commentary; CLO-freeze warning; $1.07T 2025 corporate-debt maturity estimate; HBM market analysis.
- J.P. Morgan — $650B required annual AI revenue for a 10% return; $5–7T data center spend / $1.5T IG-bond framework; the 100GW datacenter deep-dive and $11B silicon-photonics market primer.
- Morgan Stanley — TMT Conference (Lisa Su, March 2026); $2.9T-through-2028 / $1.5T financing-gap estimate; ASIC analysis.
- McKinsey ($6.7T by 2030, $5.2T for AI workloads), iCapital ($5.3T through 2030, ~$1.5T cash-flow coverage), UBS ($50B/quarter private-credit deployment), BCG (74% of companies struggle to scale AI; SEA datacenter buildout), Gartner (40% of agentic-AI projects canceled by 2027; "agent washing"), Futurum Group (CPU demand +34.9% by 2029; the $690B capex framework), Deloitte (inference = 50% of AI compute 2025 → 67% in 2026), Grand View Research / Precedence / MarketsandMarkets / Fortune Business Insights / DataM Intelligence (market-size figures), Counterpoint Research (HBM), William Blair / Evercore ISI / Jefferies / Barclays / Citi / Wells Fargo / Kerrisdale / Nomura (company- and component-level notes referenced via the hub).
Industry / company primary sources
- Company filings and earnings calls: AMD (Lisa Su), NVIDIA (Jensen Huang), Intel, Broadcom, SK Hynix, Micron, Samsung, Arista, Marvell, Coherent, Corning, Salesforce, ServiceNow, Dell, Supermicro, Vertiv, Schneider, Hitachi Energy.
- NVIDIA technical material: CES 2026, GTC 2025, Dynamo / TensorRT-LLM / NIXL blogs, Rubin CPX and the Inference Context Memory Storage Platform (ICMSP, Jan 2026). WEKA "Token Warehouse" trademark (March 2025) and Augmented Memory Grid.
- Energy / power-grid sources: Rystad Energy (Edvard Christoffersen — transformer prices +40% since 2019), Wood Mackenzie (115–210-week transformer lead times), IEEFA (ASEAN grid supply chains), PowerMag ("The Transformer Crisis"), EPA eGRID, Keppel / Nikkei Asia (Wong Wai Meng, Singapore hydrogen DC), Uniper (Schierenbeck) on transformer waits.
- Vendor and tooling docs: vLLM (PagedAttention, UC Berkeley), SGLang (RadixAttention), HuggingFace Transformers, AWS Neuron SDK, Mistral / Meta LLaMA technical reports, Epoch AI (50×/year price decline data).
- Mainstream / trade press: Tom's Hardware (CPU renaissance), TechCrunch, Bloomberg, CNBC, Fortune, RAND, EE Times, IEEE ComSoc, and the various per-claim citations in the LLM-industry primer's source list.
Consolidated source files (the wiki pages this sector page is built from)
- agentic-ai-infrastructure-primer.md: agentic CPU thesis, six-component agent architecture, value-chain map, CPU/GPU/memory/networking/software landscape.
- ai-infra-financing-bubble.md: "AI's Half-Trillion Dollar Reckoning," railroad-bubble parallel, private-credit cascade pathways.
- ai-infrastructure.md: the topic hub; the densest link map of external references (SemiAnalysis, JPM, Morgan Stanley, William Blair, ISI, Nomura, BCG, OBI, etc.).
- gpu-cloud-lenders.md: CoreWeave / Crusoe / APLD / Nebius lender-and-investor map.
- neocloud-gpu-cloud-economics.md: SemiAnalysis neocloud TCO and financing economics (Dec 2023).
- llm-architecture-primer.md: transformer mechanics, training pipeline, MoE/SSM, value chain.
- llm-industry-primer.md: model-developer and seven-layer value-chain primer; carries its own ~60-link source list.
- interconnect-challenges-computing.md: bookmark to the Science interconnect paper.
- rack-power-delivery-primer-gb200.md: VRM / 48V DC / power-delivery supplier landscape.
- sa-gb200-component-outline.md: SemiAnalysis GB200 BOM component breakdown.
- sea-ai-boom-datapoints.md: SEA datacenter power, transformer-lead-time and Singapore-moratorium datapoints.
- supply-chain-implications.md: MoE / KV-cache / attention-evolution / prefill-decode / disaggregated-serving supply-chain note (carries its own multi-source list).
- tokenomics-economics-of-intelligence.md: inference unit economics, KV-cache hit rates, token-warehousing.
- utility-vs-it-capacity-dc.md: SEA "What's Next?" newsletter fragments (Krungsri-style).
- 6t-ai-capital-math.md: Doug O'Laughlin's $6T hyperscaler-capital reconstruction.
- why-dell-beating-supermicro.md: Dell DFS captive-finance vs Supermicro TCO thesis.
- datapoints.md: raw SEA / transformer / Singapore datapoints (near-duplicate of sea-ai-boom-datapoints.md).
- dc-hpc-power-density.md: 50kW rack density, OEM/ODM procurement split.
- _sources/source-fundaai.md, _sources/source-semianalysis.md, semianalysis/index.md: source-handle pages and the 260-article SemiAnalysis archive MOC.
Consolidation queue (merged 2026-05-30 — section-scoped rebuild)
Industry-wide content has been folded in from the source files below. They stay live until Pink confirms they can be archived.
- [ ] agentic-ai-infrastructure-primer.md
- [ ] ai-infra-financing-bubble.md
- [ ] ai-infrastructure.md
- [ ] gpu-cloud-lenders.md
- [ ] neocloud-gpu-cloud-economics.md
- [ ] llm-architecture-primer.md
- [ ] llm-industry-primer.md
- [ ] interconnect-challenges-computing.md
- [ ] rack-power-delivery-primer-gb200.md
- [ ] sa-gb200-component-outline.md
- [ ] sea-ai-boom-datapoints.md
- [ ] supply-chain-implications.md
- [ ] tokenomics-economics-of-intelligence.md
- [ ] utility-vs-it-capacity-dc.md
- [ ] 6t-ai-capital-math.md
- [ ] why-dell-beating-supermicro.md
- [ ] datapoints.md
- [ ] dc-hpc-power-density.md
- [ ] _sources/source-fundaai.md
- [ ] _sources/source-semianalysis.md
- [ ] semianalysis/index.md