Why VC Moved from AI Apps to AI Infrastructure — An Inference Layer Playbook for Korean Founders

Published: 2026-05-07

Tags: AI Infrastructure · Inference Engine · Startup · B2B · Multi-Cloud

The 2026 AI investment landscape has quietly shifted. Capital is no longer chasing flashy AI apps; it is flowing into the layers below: chips, inference engines, and data pipelines. Cerebras's IPO at a $26.6B valuation, Sierra's $950M Series E, RadixArk (the SGLang commercialization entity) raising a $100M seed at a $400M valuation, and Gimlet Labs' $80M Series A all landed within the same week, and they all point in the same direction.

Behind this trend is a simple question: “How do you serve already-trained models cheaper?” That is the defining B2B pain point of 2026.

Where the Real Gaps Are in AI Infrastructure

VC-targeted opportunities cluster into three layers.

Layer 1: Inference optimization middleware. vLLM and TensorRT-LLM are now default choices, but there’s no layer that automatically routes “which workload goes to which engine.” Parasail orchestrates GPU spot capacity across 40 countries at per-token pricing. Runware unifies 400,000+ image/video generation models under a single API endpoint.
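To make the gap concrete, here is a minimal sketch of what that missing routing layer could look like. The engine names are real projects, but the `Workload` fields and thresholds are illustrative assumptions, not anyone's production policy.

```python
# A minimal sketch of the missing routing layer: choose an inference engine
# per request based on workload shape. Thresholds and Workload fields are
# illustrative assumptions, not production heuristics.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Workload:
    prompt_tokens: int
    max_new_tokens: int
    shared_prefix: bool                  # many requests reuse one system prompt
    latency_sla_ms: Optional[int] = None

def pick_engine(w: Workload) -> str:
    if w.shared_prefix:
        return "sglang"                  # RadixAttention caches the shared prefix
    if w.latency_sla_ms is not None and w.latency_sla_ms < 200:
        return "tensorrt-llm"            # compiled kernels suit tight latency SLAs
    return "vllm"                        # strong default for mixed-batch throughput

print(pick_engine(Workload(prompt_tokens=4000, max_new_tokens=256, shared_prefix=True)))
# -> sglang
```

The value is not in the if-statements themselves but in keeping the heuristics current as engines, hardware, and prices move; that maintenance burden is the moat.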

Layer 2: Heterogeneous chip compilers. Gimlet Labs is the lead case — slicing models across NVIDIA, AMD, Intel, ARM, and Cerebras simultaneously, claiming 3–10x inference acceleration. Already at $10M+ monthly revenue.
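The core idea of slicing one model across mixed chips can be illustrated with a toy proportional partitioner. This is a sketch under the assumption of a single per-device throughput score; it is not Gimlet Labs' actual compiler, which would also weigh interconnect cost, memory capacity, and kernel coverage.

```python
# Toy sketch: assign a model's layers to heterogeneous devices in proportion
# to a per-device throughput score. The scores below are made-up placeholders.
def partition_layers(n_layers: int, device_scores: dict[str, float]) -> dict[str, range]:
    total = sum(device_scores.values())
    plan, start = {}, 0
    for i, (device, score) in enumerate(device_scores.items()):
        # The last device takes the remainder so every layer is assigned.
        count = n_layers - start if i == len(device_scores) - 1 else round(n_layers * score / total)
        plan[device] = range(start, start + count)
        start += count
    return plan

print(partition_layers(80, {"nvidia-h100": 5.0, "amd-mi300x": 4.0, "intel-gaudi3": 3.0}))
# {'nvidia-h100': range(0, 33), 'amd-mi300x': range(33, 60), 'intel-gaudi3': range(60, 80)}
```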

Layer 3: Transformer-specific ASICs. Etched Sohu allocates ~96% of transistors to matmul, claiming ~20x H100 throughput on Llama-70B. Cerebras WSE-3 eliminates the memory wall with 44GB on-chip SRAM at 21PB/s bandwidth.
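The memory-wall claim is easy to sanity-check: single-stream decode throughput is bounded above by memory bandwidth divided by the bytes read per generated token. The arithmetic below uses public spec-sheet figures; the fp16 weight size is an approximation, and it ignores that a 140 GB model would span multiple WSE-3 wafers (each has 44 GB of SRAM).

```python
# Back-of-envelope memory-wall check: every weight is read once per decoded
# token, so tokens/s per stream <= memory bandwidth / model bytes.
MODEL_BYTES = 70e9 * 2    # Llama-70B at fp16, roughly 140 GB (approximate)

H100_HBM = 3.35e12        # H100 SXM HBM3 bandwidth, ~3.35 TB/s
WSE3_SRAM = 21e15         # Cerebras WSE-3 on-chip SRAM, 21 PB/s (claimed)

for name, bw in [("H100", H100_HBM), ("WSE-3", WSE3_SRAM)]:
    print(f"{name}: <= {bw / MODEL_BYTES:,.0f} tokens/s per stream")
# H100: <= 24 tokens/s per stream
# WSE-3: <= 150,000 tokens/s per stream
```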

Entry Angles for Korean Founders

Korea’s major GPU clouds — KT, Naver Cloud, and NHN Cloud — each run different chip configurations. This heterogeneity is exactly the environment where the Gimlet Labs model works.

Three concrete opportunities emerge:

First, a multi-cloud AI inference router for Korean clouds: a middleware layer that dynamically selects the lowest-cost endpoint across the three domestic clouds in real time. Adding KISA network-separation and AI-ethics compliance modules creates a differentiated entry into financial, medical, and public-sector markets.
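A minimal sketch of the routing core, assuming each cloud exposes an OpenAI-compatible endpoint with a known per-million-token price. The URLs, prices, and the /models health check below are hypothetical placeholders.

```python
# Hypothetical sketch: route to the cheapest healthy endpoint across three
# domestic clouds. URLs, prices, and the health check are placeholder assumptions.
import urllib.request

ENDPOINTS = {  # name: (base_url, USD per 1M output tokens); made-up numbers
    "kt-cloud":    ("https://inference.kt.example/v1",  0.90),
    "naver-cloud": ("https://inference.ncp.example/v1", 0.80),
    "nhn-cloud":   ("https://inference.nhn.example/v1", 0.85),
}

def healthy(base_url: str) -> bool:
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def cheapest_endpoint() -> str:
    live = {name: ep for name, ep in ENDPOINTS.items() if healthy(ep[0])}
    best = min(live, key=lambda name: live[name][1])  # lowest price per token
    return live[best][0]
```

Per-request compliance tagging, deciding which workloads may leave a network-separated zone at all, would slot in before the cost comparison.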

Second, an SGLang-based agent-serving PaaS. Korean SaaS companies run chatbots and agents where each customer has its own long, repeated system prompt, and that is the perfect SGLang workload: SGLang's RadixAttention delivers +29% throughput over vLLM for this pattern. There is no commercial support for SGLang in Korea yet.
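Because the SGLang server speaks the OpenAI-compatible API, prefix reuse is transparent to clients: repeated calls carrying the same per-tenant system prompt hit RadixAttention's cached KV prefix automatically. A minimal sketch, assuming a locally launched server (the model path, port, and prompt are examples):

```python
# Server (example): python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# One long system prompt per tenant, reused on every request; this shared
# prefix is exactly what RadixAttention caches server-side.
TENANT_PROMPT = "You are the support agent for ACME Mall. Refund policy: ..."

def answer(user_msg: str) -> str:
    resp = client.chat.completions.create(
        model="default",  # SGLang serves the single loaded model under this name
        messages=[
            {"role": "system", "content": TENANT_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    return resp.choices[0].message.content
```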

Third, an inference-cost monitoring SaaS. A tool that breaks down GPU cloud billing by workload, model, and engine, then recommends optimizations. Companies spending $50K+/month on GPU compute feel the ROI immediately.
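The technical core of such a tool is plain aggregation; the hard part is per-provider invoice parsing. A sketch under a hypothetical normalized billing schema (the fields and numbers are made up):

```python
# Sketch: roll up GPU spend by (workload, model, engine). The record schema
# is a hypothetical normalized form, not any cloud's actual billing export.
from collections import defaultdict

records = [
    {"workload": "chatbot", "model": "llama-70b", "engine": "vllm",   "usd": 540.0},
    {"workload": "chatbot", "model": "llama-70b", "engine": "sglang", "usd": 360.0},
    {"workload": "batch-summary", "model": "qwen-7b", "engine": "vllm", "usd": 660.0},
]

spend: dict[tuple[str, str, str], float] = defaultdict(float)
for r in records:
    spend[(r["workload"], r["model"], r["engine"])] += r["usd"]

for key, usd in sorted(spend.items(), key=lambda kv: -kv[1]):
    print(f"{'/'.join(key):40s} ${usd:,.2f}")
```

A recommendation layer on top of the rollup, flagging, say, a prefix-heavy chatbot workload as an SGLang migration candidate, is what turns a dashboard into savings.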