From Words to Worlds: from LLMs to VLAs

OpenAI’s ChatGPT birthed a multi-trillion dollar ecosystem of companies. Large language models (LLMs) are on track to replace humans in every category of intellectual work. If your job is currently done on a computer screen, that job will soon be done cheaper, faster, and better by the computer itself.
Computers replacing humans isn’t stopping at pushing pixels. In the next 3-5 years, robots are going to be able to see, understand, and do — in a general universal sense. Once they do, physical labor is solved.
ChatGPT showed the world that the ability to predict the next word (i.e. token) in a sentence is surprisingly powerful when you suck down the entire written corpus of humanity and run it through football fields of GPU supercomputer farms. The resulting LLM is the culmination of a massive ecosystem and supply chain of data, labeling, compute, alignment, safety, etc. Now imagine that the LLM grew eyes, ears, and hands: a Vision-Language-Action (VLA) model. A strong VLA requires a new industrial stack, and I’m ready to put $1B to work to help make this a reality.
General robots are inevitable, and whoever is first to control a powerful general VLA driving an army of robots will have a narrow window of opportunity to take over the world. All I want is to be on that person’s team. So how do we deploy $1B? Let’s build our intuition by analogy: study the industrial stack for LLMs, then extrapolate what VLAs will need.
The 5‑layer stack: LLM vs. VLA
Layer | LLMs needed … | A VLA will need … |
---|---|---|
Raw data | Web‑scale text | Petabyte‑scale, time‑synced RGB‑D + joint + force logs |
Supervision & RL | Instruction‑tuning corpora, RLHF, RLAIF | Goal‑label repository, Robot‑RLHF & reward models |
Compute & infra | GPU/TPU farms | GPU + CPU‑heavy physics farms |
Evaluation & safety | Prompt red‑teams | Test tracks and compliance |
Deployment | Cloud inference | Edge runtime, tele‑op override |
1 · Raw pre‑training data
LLMs have now ingested the entire corpus of human written language: the whole Internet, every printed book teams have already scanned in, and any loose scraps in between. A VLA needs the data of the physical world: stereo video, depth maps, joint angles, force curves, IMU ticks, all timestamp-locked.
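To make “timestamp-locked” concrete, here is a minimal sketch of what a single frame of such a log could look like. The field names, shapes, and joint count are illustrative assumptions, not any robot company’s actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SensorFrame:
    """One timestamp-locked sample from a robot's sensor suite (illustrative schema)."""
    t_ns: int                 # shared monotonic clock, nanoseconds
    rgb_left: np.ndarray      # (H, W, 3) uint8, left stereo camera
    rgb_right: np.ndarray     # (H, W, 3) uint8, right stereo camera
    depth: np.ndarray         # (H, W) float32, meters
    joint_pos: np.ndarray     # (n_joints,) radians
    joint_torque: np.ndarray  # (n_joints,) N·m, the "force curves"
    imu_accel: np.ndarray     # (3,) m/s^2
    imu_gyro: np.ndarray      # (3,) rad/s

def dummy_frame(t_ns: int, h: int = 480, w: int = 640, n_joints: int = 26) -> SensorFrame:
    """Build a synthetic frame, e.g. for smoke-testing a logging pipeline."""
    return SensorFrame(
        t_ns=t_ns,
        rgb_left=np.zeros((h, w, 3), dtype=np.uint8),
        rgb_right=np.zeros((h, w, 3), dtype=np.uint8),
        depth=np.zeros((h, w), dtype=np.float32),
        joint_pos=np.zeros(n_joints),
        joint_torque=np.zeros(n_joints),
        imu_accel=np.zeros(3),
        imu_gyro=np.zeros(3),
    )
```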
There are some datasets like Meta’s Ego4D (3,670 hours of first-person footage with gaze and depth) and DeepMind’s OpenX-Embodiment (one million robot trajectories across 22 body types), but these are academic-scale. The current humanoid players like Tesla Optimus, Figure, and 1X all capture their own proprietary datasets via tele-operation (humans piloting robots), but none of them works well yet. Why?
The most important bottleneck is high-quality data. We’re at least 2 orders of magnitude short of the noisy chaos of real life. A general VLA will require >100 million hours of real-world data. I’m working with a stealth founder who has a unique approach to this problem, and there’s plenty of frontier research to be done to harness better, cheaper, faster data via tele-op, simulation, and other approaches.
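A rough back-of-envelope, using the public numbers above plus a couple of loudly labeled assumptions (average trajectory length, how much proprietary tele-op data exists), shows where that two-orders-of-magnitude intuition comes from:

```python
import math

# Back-of-envelope: how far is today's data from a 100M-hour target?
# Ego4D and OpenX figures come from the text; the seconds-per-trajectory and
# private-fleet multiplier are assumptions for illustration only.
ego4d_hours = 3_670                    # Meta Ego4D, first-person footage
openx_trajectories = 1_000_000         # DeepMind OpenX-Embodiment
assumed_seconds_per_traj = 30          # assumption: ~30 s per trajectory
openx_hours = openx_trajectories * assumed_seconds_per_traj / 3600

assumed_private_multiplier = 100       # assumption: proprietary tele-op fleets dwarf public data
existing_hours = (ego4d_hours + openx_hours) * assumed_private_multiplier

target_hours = 100_000_000             # >100 million hours for a general VLA
gap = math.log10(target_hours / existing_hours)
print(f"existing ≈ {existing_hours:,.0f} h, target = {target_hours:,} h, gap ≈ 10^{gap:.1f}")
# → roughly a two-order-of-magnitude shortfall, even with generous assumptions
```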
2 · Supervision & RL
A lot of humans clicking a lot of ‘thumbs-up’ or ‘thumbs-down’ buttons is at the heart of the supervision and RL that align LLMs. Is ChatGPT answering your question well? Thumbs up or thumbs down. Is your robot picking up a cup well? Thumbs up or thumbs down.
OpenAI’s RLHF pipeline and labeling vendors like Scale AI and Surge AI orchestrate a lot of low-cost humans to make these up-or-down clicks on chat sessions (Anthropic’s RLAIF swaps many of those humans out for an AI judge). Robots need much more complex, multi-dimensional coaching. For example, someone needs to watch a robot and label each action for a simple instruction like: “Pick up the red mug without spilling.”
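To see why this is richer than a thumbs-up, here is a minimal sketch of what a multi-dimensional label for that mug instruction might look like, collapsed into a scalar that a reward model could learn to predict. The rubric fields and weights are placeholders I made up, not any lab’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryLabel:
    """Human coaching for one robot trajectory (illustrative rubric)."""
    task: str                   # e.g. "Pick up the red mug without spilling."
    grasp_succeeded: bool       # did the robot get the mug at all?
    spill_ml: float             # how much liquid was spilled
    collision: bool             # did it knock anything over?
    seconds_to_complete: float  # slower is worse

def scalar_reward(label: TrajectoryLabel) -> float:
    """Collapse the rubric into one number; the weights are arbitrary placeholders."""
    reward = 1.0 if label.grasp_succeeded else 0.0
    reward -= 0.01 * label.spill_ml
    reward -= 0.5 if label.collision else 0.0
    reward -= 0.01 * label.seconds_to_complete
    return reward

label = TrajectoryLabel("Pick up the red mug without spilling.", True, 5.0, False, 12.0)
print(scalar_reward(label))  # 0.83
```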
Current approaches to labeling a single trajectory cost about $2 and ten human minutes. Assume you need 10B action-label pairs: that’s $20B, which is not economical. Physical Intelligence (π), where I’m a small seed investor, is a leading company operating across Layers 1 and 2 for its foundation VLA model π₀. Tesla Optimus, Figure, and 1X all run RLHF loops for their in-house proprietary VLAs. There’s likely white space to launch a Layer 2 “robotic reinforcement-learning-as-a-service”, i.e. a Scale AI for physical-world data, in a similar vein to the Layer 1 raw data refinery.
3 · Adding physics to compute infrastructure
LLMs need a lot of NVIDIA GPUs. This is why NVIDIA is the most valuable company in the world at $4T. VLAs still need GPUs, but the workload also needs to simulate the physics of the real world. Solving physics is a different math problem than solving language, so it stands to reason that you’ll need a differently optimized cluster setup (a different mix of CPU vs. GPU, memory, I/O, etc.). Cloud providers like CoreWeave are already piloting “simulation clusters”, but perhaps there’s an opportunity to win share by offering a highly specialized physics cluster.
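As a toy sketch of why the hardware mix changes (my illustration, not CoreWeave’s actual setup): physics stepping tends to be branchy and CPU-bound, while policy inference wants large GPU batches, so a simulation cluster interleaves many CPU workers with one batched forward pass.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

N_ENVS = 64              # CPU-side physics workers
OBS_DIM, ACT_DIM = 48, 12

def step_physics(args):
    """Stand-in for one CPU-heavy physics step (contacts, friction, integration)."""
    obs, action = args
    return obs + 0.01 * np.tanh(action.sum())    # toy dynamics

def policy_batch(obs_batch: np.ndarray) -> np.ndarray:
    """Stand-in for one GPU-batched VLA forward pass over all environments at once."""
    return np.tanh(obs_batch[:, :ACT_DIM])       # toy policy

if __name__ == "__main__":
    obs = np.zeros((N_ENVS, OBS_DIM))
    with ProcessPoolExecutor() as pool:
        for _ in range(10):
            actions = policy_batch(obs)                                       # one big batch (GPU-shaped work)
            obs = np.stack(list(pool.map(step_physics, zip(obs, actions))))   # many small steps (CPU-shaped work)
    print(obs.shape)  # (64, 48)
```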
4 · Evaluation & safety
Every now and then you’ll see mainstream news headlines about OpenAI or Anthropic LLMs doing something “dangerous.” These articles often simply wrap self-published reports from in-house red teams, dramatized to get views. There are also third-party security companies that specialize in jailbreaking these LLMs and then selling their services back to the companies they jailbroke.
Robots running around and doing stuff in real life will require even more rigorous safety evaluation. The industry will obviously be regulated by governments, and that means there will be a whole new stack of insurance, legal, and compliance services. Groups will own valuable businesses selling third-party evaluation, insurance, and regulatory compliance to the next generation of robotics companies.
5 · Deployment: cloud latency gives way to edge autonomy
When you talk to ChatGPT, you’re talking to a supercomputer cluster geographically far away from you, which means you have lag. A supercomputer server farm is far more powerful than the chip on your phone, so it’s currently worth the trade-off. Big companies like Apple are just now starting to explore LLMs that are smart enough yet small enough to run locally on your phone. This is a nice-to-have for latency and data security.
However, a robot reacting to something in real life can’t wait for a faraway server to tell it what to do. This means decision-making, i.e. inference, must live on the robot itself. Bundling distilled VLA weights with a safety-certified runtime (and maybe a tele-op failsafe API) is a necessary lily pad towards a robotic future.
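Here is a minimal sketch of what that on-robot loop could look like, using the 75 ms budget from the table below; run_vla, safe_stop, and request_teleop are hypothetical placeholders, not a real runtime API.

```python
import time

LATENCY_BUDGET_S = 0.075    # <75 ms on-robot inference target (see the metrics table)

def run_vla(observation):
    """Placeholder for the distilled, on-robot VLA forward pass."""
    time.sleep(0.010)                        # pretend inference takes ~10 ms
    return {"joint_targets": [0.0] * 12}

def safe_stop():
    """Placeholder: freeze or back off to a certified safe posture."""
    return {"mode": "safe_stop"}

def request_teleop(observation):
    """Placeholder: stream state to a human operator over the tele-op failsafe channel."""
    print("escalating to tele-op")

def control_step(observation):
    """One tick of the edge control loop with a hard latency budget and tele-op escalation."""
    start = time.monotonic()
    action = run_vla(observation)
    if time.monotonic() - start > LATENCY_BUDGET_S:
        request_teleop(observation)          # human takes over while the robot fails safe
        return safe_stop()
    return action

print(control_step({"rgb": None, "joints": [0.0] * 12}))
```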
Let me help you build
Layer | A VLA will need | Metrics |
---|---|---|
Raw data | Physical-world data refinery | 100 M hrs of multi‑sensor video by 2028 (≈ 3 PB). |
Supervision & RL | Goal‑label repository, Robot‑RLHF & reward models | Drive per‑trajectory label cost from $2 → $0.10 |
Compute & infra | GPU + CPU‑heavy physics farms | 10× cheaper per trajectory than H100 LLM cluster |
Evaluation & safety | Test tracks and compliance | 10 000 regression trials/night with ISO audit |
Deployment | Edge runtime and tele-op override | <75 ms on‑robot inference; 99.99 % fail‑safe switchover |
The LLM stack already has household names at every layer (OpenAI to Scale AI to CoreWeave) with trillions of dollars of value created and captured. The LLM players are some of the most ambitious humans in the world, and they will expand into the VLA stack. But there is a window now to compete for the VLA frontier. And when there’s a tectonic shift, there is startup opportunity. Winners will have to be nimble to navigate the next biggest game in the world. If you’re building one of these startups, email me a one-pager at my antifund.com email or DM me on x.com/geoffreywoo.
Thank you to Gianluca Bencomo, Palmer Luckey, Logan Paul and Rahul Sidhu for reading early drafts of this essay.