AI·Technology
Why VC Moved from AI Apps to AI Infrastructure — An Inference Layer Playbook for Korean Founders
Published: 2026-05-07
The 2026 AI investment landscape has quietly shifted. Capital is no longer chasing flashy AI apps; it is flowing into the layers below: chips, inference engines, and data pipelines. Cerebras's IPO at a $26.6B valuation, Sierra's $950M Series E, RadixArk (the SGLang commercialization entity) raising a $100M seed at a $400M valuation, and Gimlet Labs' $80M Series A all landed within the same week, and they all point in the same direction.
Behind this trend is a simple question: how do you serve already-trained models more cheaply? That is the defining B2B pain point of 2026.
Where the Real Gaps Are in AI Infrastructure
VC-targeted opportunities cluster into three layers.
Layer 1: Inference optimization middleware. vLLM and TensorRT-LLM are now the default choices, but no layer yet automatically routes each workload to the right engine. Parasail orchestrates GPU spot capacity across 40 countries at per-token pricing; Runware unifies 400,000+ image and video generation models behind a single API endpoint.
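To make the missing routing layer concrete, here is a minimal sketch of a workload-to-engine router. The descriptor fields, thresholds, and routing rule are all illustrative assumptions, not anyone's actual product logic: the idea is simply that latency-sensitive, small-batch traffic favors a pre-compiled TensorRT-LLM endpoint, while bursty, high-throughput traffic favors vLLM's continuous batching.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Hypothetical descriptor for a batch of inference requests."""
    model: str
    avg_prompt_tokens: int
    batch_size: int
    latency_sensitive: bool

def pick_engine(w: Workload) -> str:
    """Toy routing rule: latency-critical, small-batch traffic goes to a
    pre-compiled TensorRT-LLM endpoint; everything else goes to vLLM.
    The thresholds are illustrative only."""
    if w.latency_sensitive and w.batch_size <= 4:
        return "tensorrt-llm"
    return "vllm"

print(pick_engine(Workload("llama-70b", 512, 2, True)))     # latency path
print(pick_engine(Workload("llama-70b", 2048, 64, False)))  # throughput path
```

A real middleware product would replace the hard-coded rule with measured per-engine latency and cost curves, which is exactly the data moat such a layer would accumulate.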
Layer 2: Heterogeneous chip compilers. Gimlet Labs is the lead case, slicing models across NVIDIA, AMD, Intel, ARM, and Cerebras hardware simultaneously and claiming 3–10x inference acceleration. It is already past $10M in monthly revenue.
Layer 3: Transformer-specific ASICs. Etched's Sohu allocates roughly 96% of its transistors to matrix multiplication, claiming ~20x H100 throughput on Llama-70B. Cerebras's WSE-3 sidesteps the memory wall with 44GB of on-chip SRAM at 21PB/s bandwidth.
Entry Angles for Korean Founders
Korea’s major GPU clouds — KT, Naver Cloud, and NHN Cloud — each run different chip configurations. This heterogeneity is exactly the environment where the Gimlet Labs model works.
Three concrete opportunities emerge:
First, a multi-cloud AI inference router for Korean clouds: a middleware layer that dynamically selects the lowest-cost endpoint across the three domestic clouds in real time. Adding KISA network-separation and AI ethics compliance modules creates a differentiated entry into the financial, medical, and public-sector markets.
Second, an SGLang-based agent serving PaaS. Korean SaaS companies that run chatbots and agents with a distinct, repeated system prompt per customer are the perfect SGLang workload: SGLang's RadixAttention delivers +29% throughput over vLLM for this pattern, and there is no commercial SGLang support in Korea yet.
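A back-of-envelope calculation shows why shared system prompts are such a good fit for prefix caching of the kind RadixAttention provides. The token counts below are made-up but typical; the point is that with caching, the shared prefix is prefilled once instead of once per request.

```python
def prefill_savings(system_prompt_tokens: int, user_tokens: int, requests: int) -> float:
    """Fraction of prefill tokens saved when the shared system prompt is
    cached once instead of recomputed for every request."""
    without_cache = requests * (system_prompt_tokens + user_tokens)
    with_cache = system_prompt_tokens + requests * user_tokens
    return 1 - with_cache / without_cache

# e.g., a 1,500-token system prompt, 100-token user turns, 1,000 requests
print(f"{prefill_savings(1500, 100, 1000):.0%} fewer prefill tokens")  # 94%
```

This is only the prefill side, so it does not translate one-to-one into end-to-end throughput, but it explains why the gain concentrates in exactly the per-customer-prompt workloads described above.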
Third, inference cost monitoring SaaS. A tool that breaks down GPU cloud billing by workload, model, and engine, then recommends optimizations. Companies spending $50K+/month on GPU compute feel the ROI immediately.
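The analytical core of such a tool is just attributing spend along a few dimensions. The record schema and numbers below are hypothetical stand-ins for what a billing export might contain:

```python
from collections import defaultdict

# Hypothetical usage records, as a cost-monitoring tool might ingest them
# from GPU cloud billing exports.
records = [
    {"workload": "chatbot", "model": "llama-70b", "engine": "vllm",   "usd": 1800.0},
    {"workload": "chatbot", "model": "llama-70b", "engine": "sglang", "usd": 1250.0},
    {"workload": "search",  "model": "llama-8b",  "engine": "vllm",   "usd": 400.0},
]

def cost_by(records: list, key: str) -> dict:
    """Total spend grouped by one dimension (workload, model, or engine)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["usd"]
    return dict(totals)

print(cost_by(records, "engine"))    # spend per engine
print(cost_by(records, "workload"))  # spend per workload
```

The product value is in everything around this kernel: parsing each cloud's billing format, tagging untagged usage, and turning per-engine deltas into concrete migration recommendations.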
Sources
- Bret Taylor's Sierra raises nearly $1B in latest AI capital push — CNBC
- AI Startup Funding News May 2026 — Mean CEO Blog