Full Transcript

How to sell RL envs and data to AI labs: Interview with Sean Cai

27:16 · 4,431 words · ~22 min read · English · Transcribed May 28, 2026
AI Summary

AI frontier labs are experiencing a massive bottleneck in high-quality, long-horizon reinforcement learning environments and data, creating an opaque multi-billion dollar market where success depends on research sophistication and verifiability rather than simple data labeling.

Understanding the shifting dynamics of AI training data markets reveals where elite labs are directing capital, which software environments are being simulated, and how developers can build high-margin businesses in reinforcement learning.

Section summaries

0:00-3:20

Underserved Data Domains: Bio, Cyber, and ERP Environments

watch

Provides critical context on where the next-generation bottlenecks in AI data collection lie, specifically biology and cyber security.

3:20-5:45

Specific Cyber Security and Bioinformatics Tools

optional

Dives into specific software types (GUI vs Bash, CTF challenges) which is only useful if you are building specifically in those niches.

5:45-11:33

Data Sales Cycles & Quality Control (QC) Pitfalls

watch

Essential advice for founders on how to structure QC, bypass gatekeepers, and avoid reward-hacking in datasets.

11:33-14:25

Robotics, Egocentric Data, and Container Formats

watch

Reveals the technical deployment formats (Docker) and the necessity of having post-training taste in robotics datasets.

14:25-17:40

Exclusivity Dynamics & Frontier Lab Hierarchies

watch

Exposes how Anthropic and OpenAI treat data vendors differently and the shifting landscape of exclusive data contracts.

17:40-22:03

Enterprise Post-Training & App-Layer Behavior

optional

Analyzes the Ramp Labs report and how non-frontier app companies like Cursor are structuring their internal model fine-tuning.

22:03-27:11

Market Sizing: 70 Billion+ Aggregate Data Spend & M&A Trends

watch

Breaks down the macroeconomic reality of lab data budgets, Surge's revenue run rate, and why neoclouds are acquiring RL infrastructure players.

Key points

  • The Underserved Frontiers of Verifiable Data — While coding matured quickly due to GitHub, domains like biology, cybersecurity, and complex back-office ERP/Excel tasks are severely underserved. The market is moving toward highly complex search and retrieval environments with non-technical physical-to-digital boundaries.
  • The Transition of AI Labs into M&A Buyers of Real-World Assets — Frontier labs (like Anthropic and OpenAI) are quiet but aggressive buyers of small enterprise app-layer companies, wet labs, and legacy software with 10-15 years of user interaction history to bypass user-trace data ceilings.
  • Post-Training Research Taste Over Raw Data Supply — Succeeding as an RL environment or egocentric data vendor requires 'research taste'—the ability to run downstream post-training experiments on your own data to prove to labs that it actually improves model capability.
  • Neocloud and Inference Provider Horizontal Expansion — Inference and GPU neocloud providers are increasingly acquiring RL and post-training infrastructure startups to offer end-to-end capabilities to app-layer companies and mid-tier labs.
"Good data taste, good research taste, and perceived ability to scale quality with quantity are the three things that are necessary to building a good human data company out." (Sean Cai)
"If you're trying to sell data, but you're not actually cognizant of how model improvement is achieved... you are in a losing battle and losing market and you are going to be commoditized." (Sean Cai)

AI-generated from the transcript. May contain errors.

0:00

What areas of data do you feel are

0:02

underserved by now um that labs want to

0:05

buy?

0:07

>> Uh certainly we're still in a huge dearth

0:09

of long and unverifiable. Uh the domains

0:13

that matured the quickest

0:16

matured because they were much much more

0:19

easily verifiable with web 2.0

0:21

instruments. Coding had GitHub. We don't

0:22

have a GitHub for all other domains but

0:24

we ventured into finance healthcare and

0:26

law afterwards. Nowadays, biological and

0:29

cyber security is is is a craze but only

0:32

I think by the most sophisticated labs

0:34

namely Anthropic and some of OAI.

0:36

>> You said bio and cyber.

0:38

>> Yes.

0:40

>> But in general I think incredibly long

0:43

horizon realistic is still very much in

0:46

demand. A lot of benchmarks are built as

0:48

this but not actually that. Does that

0:50

mean things that would like basically

0:52

serve making Cowork and Codex better

0:55

for non-technical work?

0:58

>> Uh potentially. Yeah. I would classify

1:00

that as like back office ERP type tasks.

1:02

>> Yeah.

1:03

>> Which largely have to do with very

1:04

complicated search and retrieval

1:05

functions across quite convoluted data

1:07

lakes uh and environments.

1:10

>> Okay. Um what kinds of software programs

1:13

would people be using?

1:16

>> Uh even so Excel file systems.

1:20

um just like think very convoluted file

1:22

systems and applications with data

1:24

across multiple formats sometimes

1:26

tabular sometimes graphical even as well

1:29

>> um and it's an exercise in model tool

1:32

calling as well

1:33

>> cool also stuff like SAP or or not

1:36

necessarily

1:38

>> uh yes but I think for maybe some of the

1:40

computer use circles computer use

1:42

continues to be sort of like a smaller

1:44

market but dominated by a few top RL env

1:47

companies there uh like a a version of

1:50

SAP would have gone for like 500K uh in

1:53

the computer use craze of like late 25

1:55

probably. Um I suspect a version of SAP

2:00

has been created by one of the RL

2:02

computer use companies at this point or

2:03

maybe internally by Anthropic and OAI

2:06

but that's speculation. Yeah. The um bio

2:11

and cyber for the bio stuff is it uh bio

2:14

real world things or like purely digital

2:18

you know stuff that's pure computer?

2:20

>> Yeah it started from bioinformatics but

2:21

nowadays we're trying to make a lot of

2:23

these processes with one step in the

2:25

digital and one step in the physical

2:27

verifiable.

2:28

Uh naturally this gets really hard

2:30

because you think about the verification

2:32

mechanism for a lot of things in

2:33

biology. Anthropic just put out this

2:35

benchmark called mystery biobench which

2:37

I think enumerates a lot of the problems

2:39

pretty succinctly there. We don't even

2:41

know even amongst the top experts how to

2:44

verify something in biology because

2:45

they're literally de novo experiments,

2:47

right? Like we combine certain chemicals

2:50

like what happens? Um

2:53

so uh they're almost veering into

2:55

physics based models. Of course if you

2:57

get to physics based models then you

2:59

have like

3:01

uh sim-to-real and robotics uh to which

3:04

you could do RL and robotics and that's

3:06

an entirely different domain. But I I'd

3:08

say it's like on our slow march to make

3:11

sim to real uh and and generally model

3:13

much more things in the physical world

3:15

more accurately as verifiers and RL um

3:18

>> certain domains

3:20

>> fall in the middle of purely software

3:22

based work and like robotics

3:24

>> like biological workflows, chemistry

3:26

workflows, even scientific discovery

3:28

that make that pretty useful. What are

3:30

some of the pieces of software that uh

3:33

bioinformaticists might be using that

3:34

like they would have you know envs for?

3:38

>> Oh, I think I think like I'm not a

3:41

biologist but uh there's so many bespoke

3:44

tools and like bespoke processes within

3:47

a lab itself. I I I'm helping a wet lab

3:52

sort of digitize our processes here

3:54

doing this and a lot of this stuff

3:56

doesn't have purpose-built software for

3:58

it. It's

4:00

and it just adds to the environment

4:02

complexity and bespoke tool build out

4:04

for those environments complexities as

4:06

well.

4:06

>> Okay. So it's more like general computer

4:09

use, both GUI and Bash.

4:12

>> Yeah, I would I I would say so that's

4:14

the primitive in which like kind of

4:16

everything is based right.

4:18

>> Yeah. the um

4:21

cyber stuff any particular subsets like

4:24

uh uh you know program analysis or um

4:27

pentesting or um you know what's the

4:30

what's the most underserved cyber

4:32

subsets do you think?

4:34

Yeah, for sure. Uh, I think cyber is

4:37

mostly being bought by Anthropic and

4:39

maybe some of OAI right now and then the

4:40

rest of the labs are are following

4:41

whatever they do. Uh, naturally

4:46

Irregular security is probably the

4:48

company to follow in this space who does a

4:49

lot of this type of stuff. Um,

4:52

I would say a lot of offensive cyber is

4:57

uh being modeled right now in sort of

4:59

interactive environments. So a lot of

5:01

stuff in code level security uh web app

5:04

exploitation CTF challenges and like

5:06

remediation is already covered by

5:07

existing benchmarks like um CyberGym,

5:10

Cybench. These are pretty saturated

5:13

nowadays so long horizon wise as all

5:17

domains are tending towards you're

5:19

looking towards um

5:22

stuff like infrastructure exploits uh

5:24

and agent layer attacks. So there

5:28

there's a lot of stuff you can model out

5:29

there. Um because there are new zero

5:32

days every single day. It's like one of

5:34

the most dynamically changing fields. So

5:36

naturally you're going to expect there

5:38

to need to be uh real-time data streams

5:41

to translate these things into model

5:44

actionable formats.

5:45

>> Yep. Awesome. The um

5:49

the is the the sales process is roughly

5:52

um either do something in public such

5:55

that researchers reach out to you

5:56

already know researchers or kind of get

5:59

intros or do cold emails to researchers

6:02

to get a you know a pilot. Is that

6:04

roughly the first step?

6:06

Yeah, but you know the our our appetite

6:09

for data is voracious and expanding, but

6:11

there's still only 24 hours in a day and

6:13

a researcher's job is not to talk to

6:15

data vendors all the time. So the bar to

6:18

get I think a researcher's attention is

6:22

getting higher and higher. So those

6:24

without research sophistication, it's

6:25

just like that's simply not the best

6:27

move anymore. the human data supply

6:29

chain is expanding such that one can

6:31

meaningfully participate in it um

6:33

without interacting with an end

6:35

researcher

6:37

>> cross-selling partnerships you mean or

6:40

>> yeah Protege is an example uh I think

6:44

there are companies out there which

6:48

almost train companies to produce good

6:50

post training data and then sell their

6:53

data to researchers themselves using

6:55

themselves as a stamp of approval

6:58

Yep. Yep. The um

7:02

the uh what are what are the most

7:05

important things that labs look for?

7:07

What are the main reasons that a lab

7:09

might uh not kind of renew or increase

7:12

their purchase volume? You know, after

7:15

they get the uh the first set of data.

7:20

>> Uh each lab has their own QC processes

7:23

and they run them internally and they

7:24

see. But I I you know there there are so

7:26

many reasons why

7:29

some data might be quite quite poor. Um

7:34

so

7:37

I think there are just a lot of very

7:39

small things that a researcher can look

7:41

at a data set and like maybe a data set

7:43

is a million lines and they notice

7:45

something that is off about one line and

7:46

they really start to question whether a

7:48

startup even has a QC process at all in

7:50

the first place or not. Um,

7:53

subsequently,

7:55

uh, it's it's it's I think the most

7:57

common case is just your tasks are

8:00

incredibly incredibly poorly designed,

8:03

um, in terms of they're easily reward

8:04

hackable. The prompts are vague, they're

8:07

not emblematic of real world tasks from

8:09

an obvious setting.

8:11

There's a lot of other smaller things

8:14

afterwards too like the model failures

8:16

you identify and the reasons for why the

8:19

model fails at them are not actually

8:20

genuine capability failures. So because

8:22

you designed the harness quite poorly.

8:24

Um it's because you ran it in a very

8:27

specific environment that isn't actually

8:28

how most users would have ran this task

8:30

or model in. Uh you don't do cross

8:32

harness testing. You don't do n-gram

8:34

contamination testing. So which is to

8:36

say you don't test whether a data set is

8:38

already in the pre-training corpus

8:39

literature or not. Um there there's a

8:43

lot of uh QC checks that one can run

8:47

before delivering say OTS RL data that

8:51

would make it just uh that would make

8:54

them a great partner that researchers

8:56

would want to work with and iterating

8:57

the shape of post-training data.
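
The n-gram contamination check mentioned here can be sketched in a few lines of Python. This is a minimal illustration, not any lab's actual QC pipeline: the function names `ngrams` and `contamination_rate`, the whitespace tokenization, and the window size `n=8` are all assumptions made for the example, and a real check would query an index over the pre-training corpus rather than an in-memory list of documents.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(sample: str, reference_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the sample's n-grams that also appear in the reference docs.

    `reference_docs` is a hypothetical stand-in for whatever public text a
    vendor suspects may already be in a model's pre-training data.
    """
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    reference_grams: Set[Tuple[str, ...]] = set()
    for doc in reference_docs:
        reference_grams |= ngrams(doc, n)
    return len(sample_grams & reference_grams) / len(sample_grams)


if __name__ == "__main__":
    reference = ["the quick brown fox jumps over the lazy dog near the river bank"]
    task_prompt = "the quick brown fox jumps over the lazy dog"
    # A high rate suggests the task text may already exist in public data.
    print(contamination_rate(task_prompt, reference, n=4))
```

A vendor might run a check like this over every task prompt and reference answer before delivery and flag anything above a chosen threshold for manual review; the threshold itself is a judgment call that the interview does not specify.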

8:59

>> Yeah. the um

9:02

the and the QC is both uh researcher

9:05

flagged issues and then also just the

9:07

data teams as well reviewing things

9:10

themselves.

9:12

Yeah, I think it's uh

9:16

c certainly researcher needs are bespoke

9:18

as well for certain projects, but if you

9:20

think about

9:22

uh researchers exploring a net new

9:26

research question like how do we improve

9:27

taste in models and they're exploring

9:30

different OTS RL data sets to uh along

9:34

with maybe a data company provided

9:36

benchmark to explore this question.

9:38

There there are just so many things that

9:41

they look for in a typical data delivery

9:43

that are really difficult for

9:46

you to know unless you worked in the in

9:48

the data industry yourself or you've

9:49

been a researcher and you understand

9:50

what good RL data is, right?

9:53

>> Good data taste, good research taste,

9:56

and perceived ability to scale quality

9:58

with quantity are the three things that

10:00

are necessary to building a good human

10:02

data company out.

10:04

>> Good data taste, good research taste,

10:05

and what was the third one?

10:06

perceived ability to scale quality with

10:08

quantity.

10:09

>> Yeah. Cool. The um uh when does a

10:13

researcher like in a question like that

10:15

they're doing some new kind of area um

10:17

that's vague. How do they get the model

10:19

to do a certain thing?

10:21

Um

10:23

what do researchers do to first explore

10:25

off-the-shelf offerings before um will

10:28

they just hit up all their existing

10:30

vendors from their team or they go to

10:32

their you know data team people

10:34

internally and ask them to go and get a

10:36

a uh a set of options for them? What's

10:39

the what's the first steps that the

10:40

researcher takes?

10:43

Yeah, I'd say they do some bespoke reach

10:46

out themselves, especially for new

10:49

research team directions like OpenAI's

10:53

newest robotics VLA direction which spun

10:56

out of Sora or not I wouldn't say de novo

10:59

spun out of Sora but was sort of

11:01

combined with the remnants of Sora. they

11:03

will go out to real world data vendors

11:08

and reach out to get samples uh because

11:11

you got to remember their jobs are to

11:12

improve model capabilities and if that's

11:14

the bottleneck they won't go and solve

11:16

the bottleneck themselves but that's why

11:18

these labs have human data teams. It's

11:19

like one to procure the data necessary

11:22

and manage vendor relations but two is

11:24

like negotiate on price

11:27

>> and and all that all those things

11:29

associated. So it's a collaboration

11:32

between those two entities.

11:33

>> Cool. The um in the robotics data space

11:37

is there anything that you view as

11:38

underserved

11:39

uh there? Obviously there's a lot of

11:42

things that are well served but what are

11:43

the ones you view as underserved in

11:45

robotics?

11:46

>> Data vendors who are genuinely research

11:48

first and running post training

11:49

experiments on their own data. if

11:51

they're trying to sell things like

11:55

egocentric data up the training mix pyramid to

12:00

if you like companies

12:03

uh or

12:06

uh ju just running a lot of training

12:09

experiments on their data that match the

12:11

research direction of companies they're

12:12

trying to sell to in order to be a bit

12:14

more

12:14

>> selling egocentric data to VLA companies

12:16

is underserved.

12:20

It is underserved in a sense of there

12:24

are not many

12:25

>> my naive view is like everyone is

12:27

selling egocentric data.

12:29

>> Well, there are very few egocentric data

12:31

vendors that actually know how the

12:33

downstream training is done.

12:36

>> So it's egocentric data vendors who are

12:37

doing their own post-training and

12:39

therefore like research have good

12:40

research taste.

12:42

>> Yeah. But um you you got to remember

12:45

what is being sold when you sell data in

12:46

the first place is just model capability

12:48

improvement, right? And data is just the

12:50

medium to do that. So if you're trying

12:51

to sell data, but you're not actually

12:53

cognizant of how model improvement is

12:56

achieved or you don't have an opinion

12:57

there and can't really help the

12:59

researcher with that, you are in a

13:01

losing battle and losing market uh and

13:04

you are going to be commoditized.

13:06

>> Yep.

13:07

>> Yep. The um

13:10

what is the uh what does the initial

13:12

meeting look like? the um someone talks

13:14

to researcher, the researcher maybe

13:16

requests some samples and the founder

13:18

sends it in Google Drive. Um how does

13:21

that uh what does that typically look

13:23

like? What's the formats people are

13:25

expecting?

13:26

>> For RL data, for the longest time, it's

13:28

literally just been a Docker container.

13:30

>> Yeah.

13:31

>> Uh a Docker container isolated

13:33

environment, all the tools on there, all

13:35

the verification mechanisms and rubrics.

13:37

One simply simply has to plug and play

13:40

their agent. Uh and then you get an eval

13:44

score and then you can use a multitude

13:45

of these software containers to run

13:47

rollouts for GRPO RL.

13:49

>> Yep.

13:49

>> Whatever other training mechanisms

13:51

you employ
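
To make the rollout step concrete: in GRPO-style RL, several rollouts of the same task are scored by the environment's verifier, and each rollout's advantage is its reward normalized against the group. The sketch below shows only that normalization. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` term are assumptions for illustration; how a given lab wires containers to its trainer is not described in the interview.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize verifier scores from one group of rollouts on the same task.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their siblings get positive advantage and the rest negative.
    """
    if not rewards:
        raise ValueError("need at least one rollout reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical pass/fail scores from four containerized rollouts of one task.
    scores = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(scores))
```

Note that when every rollout in a group gets the same score, all advantages are zero and the task contributes no learning signal, which is one reason tasks that are far too easy or far too hard for the current model are of limited use for this kind of training.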

13:53

>> and labs do they have more kind of

13:54

sophisticated internal kind of you know

13:57

uh

14:00

setups for um running environments now

14:04

that need different formats.

14:08

>> Yes, they do. Anthropic notably has one

14:11

whose name I can't uh disclose but the

14:14

most sophisticated labs I would say are

14:16

like Anthropic, OpenAI, DeepMind and

14:21

then everybody else and then Chinese

14:23

labs in that order.

14:25

>> Yep. Yep. the um

14:30

the um and then in terms of the data

14:33

that the non

14:35

you know three frontier labs are buying

14:38

um are they buying more off-the-shelf

14:41

data that you know companies have

14:42

already sold to Anthropic and OpenAI on

14:45

like a non-exclusive thing

14:50

>> uh I think OTS is a relatively new

14:53

phenomenon it is the mechanism with

14:56

which Surge has done business for a long

14:59

long time. Uh but that's because Surge

15:01

is a very fundamentally different

15:03

company than all the other venture-backed

15:04

players. Um I I believe

15:07

>> how so, by the way, on that?

15:09

>> Oh, they genuinely started off as just

15:11

model capability caring about model

15:12

capability improvement, right? Not not

15:17

as a sort of data company and for the

15:19

longest time like mostly SFT data as

15:21

well. Um,

15:25

>> so

15:26

you're talking about exclusivity.

15:30

Certainly exclusivity reflects different

15:32

labs' philosophies towards data vendors.

15:35

Anthropic is the only one who I think

15:37

really pushes for exclusivity. OpenAI

15:40

at different points throughout its human

15:42

data turnover uh human data teams

15:45

turnovers because a lot of people shift

15:48

around in OAI a lot. Uh but Anthropic

15:51

genuinely views their data vendors as

15:54

research partners and if you think your

15:57

research partner is genuinely novel

15:59

research you probably want to get

16:01

exclusivity on that. Um

16:04

>> which is the approach that they've

16:08

employed with many of the data companies

16:09

they've worked with.

16:10

>> Yep. Do they have expiry clauses on the

16:12

exclusivity like 12 months or 24 or

16:15

something like this? I'd imagine they

16:16

are starting to think about that pretty

16:18

closely now. But I am aware of many many

16:20

companies who have recently just ended

16:24

Anthropic exclusivity. Some of them had

16:26

the agreement that they would only have

16:27

it for a year. Some of them

16:30

for other strategic reasons they've

16:32

stopped exclusivity with. So

16:35

>> yeah.

16:36

>> Yep. the um

16:39

the

16:41

uh

16:42

almost all purchase decisions researcher

16:44

led at this point as opposed to you know

16:47

like the researcher pulls it in and then

16:48

the the data team kind of uh is effect

16:52

is like a form of procurement or um

16:57

is it different? You can imagine it's a

16:59

partnership of sorts,

17:02

but if you want to think about it from

17:04

from a economic buyer perspective, you

17:07

always want to be just in general B2B

17:09

sales dealing with the economic buyer

17:11

because if you can convince the economic

17:14

sorry, not the economic buyer, the the

17:16

end user, right? If you can convince the

17:17

end user of your product that there's

17:19

substantial value, the question is not

17:21

whether the org is going to buy it or

17:22

not. It's just how much are they going

17:24

to buy it for. Yep.

17:25

>> Um, and so if you're going to the guy

17:28

who's pricing it first, who doesn't know

17:30

how valuable it is, doubtlessly it's

17:34

going to be a harder sell than if you

17:36

had convinced the end user that it's

17:37

valuable first, right?

17:40

>> Um, Decagon, Sierra, and Ramp. Um, what

17:43

kinds of uh data are they buying

17:45

relative to the Frontier Labs?

17:47

>> Voice data. Uh, Ramp is not so much

17:51

buying data. Actually, the Ramp Labs

17:53

report came out um the other day, and I

17:57

was surprised at a couple things. One, I

18:00

really love the fact that you've got

18:01

really sophisticated elite engineering

18:03

or app-layer companies out there

18:05

post-training their own small models for

18:06

their own use cases. But I was surprised

18:08

that they used a synthetic data set to

18:11

inform some of the environments in which

18:13

they were training like accounting level

18:15

transactions if they're an app layer

18:17

company that should have access to that

18:18

data themselves

18:20

>> which is

18:22

which one suggests that one could

18:26

sell data to them if they can't use

18:28

their own app layer data uh for for

18:31

these training environments. But uh two

18:35

uh also suggests that app-layer companies

18:37

may be feasible buyers in the future if

18:39

there's a substantial systematic issue

18:41

that prevents them from using their own

18:43

users data. Certainly doesn't look like

18:45

it's been a problem with Cursor though.

18:47

So I'm sure this is just a small uh

18:50

quirk.

18:51

>> Yeah. Do you think it's a it's a privacy

18:52

thing that they'll just figure out?

18:55

>> I think so. Privacy is not like data

18:57

privacy is very easy to figure out

18:59

nowadays for all these companies.

19:01

>> Yep. the um

19:05

uh do you have a certain view on you

19:08

know long-term

19:10

um

19:12

the

19:13

labs have their applications those

19:15

applications give them you know traces

19:17

that they can train on um

19:21

how it evolves where they still need to

19:24

buy data externally versus training on

19:27

the data from their users

19:31

Um

19:34

yeah, one would have thought that

19:36

Anthropic has so much data from Claude

19:38

Code and

19:40

Cowork

19:41

>> right

19:41

>> that maybe they would not have needed to

19:44

procure from external vendors

19:46

but they still do. Um and and and this

19:50

reflects the fact that most external

19:53

data vendors that are succeeding with

19:54

sophisticated research labs and data

19:56

markets, they're mostly selling

19:58

capabilities that are N plus one of

19:59

current tier models, right?

20:02

>> Y

20:03

>> um Andon Labs, by the way, Andon is a

20:06

fantastic company in this regard in

20:09

terms of producing really hard realistic

20:11

benchmarks, but a bit too ahead of its

20:14

time, I think.

20:15

>> Yeah. Um, Andon Labs is a good example

20:20

of the fact that we're we we're going to

20:24

produce these really real world long

20:26

horizon benchmarks that are not going to

20:28

be saturated for a long time and that is

20:30

quite valuable to us.

20:31

>> Yep. So if it's already within the

20:33

capabilities of the model then they can

20:36

train on it from their traces but if

20:38

it's not and no user is going to attempt

20:40

it in the model then they don't have any

20:41

traces to train on. And this is from a

20:44

purely single axis performance-based

20:46

perspective, right? Whereas it's like

20:48

there's only one thing to hill-climb and

20:49

it's perceived performance. Um cost

20:52

and latency are also big questions too.

20:55

An Anthropic researcher I think told me

20:58

at some point our benchmarks are really

21:00

not going to index on performance and

21:03

that we'll have prohibitively expensive

21:04

AGI in some sense but like how much does

21:07

it cost and how fast does it take to do

21:10

something is going to be new

21:12

>> new new dimensions of benchmarks. So

21:16

then you expect that um end vendors will

21:20

uh start to do benchmarks that are

21:22

basically performance divided by price

21:25

rather than just performance essentially

21:28

>> perhaps. Yeah. And this expands greatly

21:30

the aperture of different niches that RL env

21:33

companies can play in because if you

21:35

think about the enterprise world, right?

21:37

There are many use cases where I just

21:39

want a much much cheaper model at a

21:42

fixed level of intelligence

21:44

>> that is satisfactory for certain like

21:46

job functions, right? And then even in

21:49

Ramp Labs' recent implementation in

21:51

their Twitter post they showed that they

21:53

use a above head frontier model for

21:55

planning but they they collapse the

21:57

search and retrieval function to a small

21:59

model that they post trained just for

22:01

that purpose.
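
The cost-aware benchmarking idea in this exchange (performance divided by price, or the cheapest model at a fixed level of intelligence) can be sketched as below. Everything here is hypothetical: the `EvalResult` fields, the helper names, and the example numbers are invented for illustration and do not come from any published benchmark or from Ramp's setup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalResult:
    name: str
    score: float      # task success rate in [0, 1]
    cost_usd: float   # total spend for the eval run
    latency_s: float  # mean wall-clock seconds per task


def score_per_dollar(result: EvalResult) -> float:
    """Performance divided by price for a single eval run."""
    if result.cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return result.score / result.cost_usd


def cheapest_meeting_bar(results: List[EvalResult], min_score: float) -> Optional[EvalResult]:
    """Cheapest model whose score clears a fixed capability bar, if any."""
    eligible = [r for r in results if r.score >= min_score]
    return min(eligible, key=lambda r: r.cost_usd) if eligible else None


if __name__ == "__main__":
    # Made-up numbers: a frontier model versus a small post-trained model.
    runs = [
        EvalResult("frontier", score=0.92, cost_usd=400.0, latency_s=30.0),
        EvalResult("small-tuned", score=0.85, cost_usd=20.0, latency_s=4.0),
    ]
    best = cheapest_meeting_bar(runs, min_score=0.80)
    print(best.name if best else "no model clears the bar")
```

The second helper reflects the enterprise framing in the answer: the bar is fixed by the job function, and the question becomes which model clears it most cheaply rather than which model scores highest.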

22:03

>> The um how many labs are spending at the

22:06

you know billion dollar plus per year

22:08

data level?

22:11

>> Seven or eight.

22:12

>> Mhm. the how much more than

22:16

like you know Anthropic talked about

22:18

their billion dollar number. Do you

22:19

think it's going to end up being like

22:20

closer to like you know three to four

22:22

kind of this year?

22:24

>> Yeah. I mean I'd say like honestly each

22:27

Frontier Lab if you're loose with your

22:30

definition of data like they spend

22:32

between 10 to 20 billion a year. I think

22:34

I posted about this a while back too.

22:36

>> 10 to 20 if you're loose with your

22:38

definition of data. Um yeah. Can you say

22:40

how so? Uh this is this shouldn't be a

22:44

surprise to anybody, right? Like three

22:46

things hill-climb model capabilities:

22:47

compute, data, and talent. And data

22:49

spend is still a drop in a bucket

22:50

compared to compute costs, right? Um I

22:54

I'd say we're generally still supply

22:57

constrained in that if you think about

23:00

RL data or just data in general, that

23:02

means the quality bar for these labs,

23:04

we're still very much still in demand of

23:06

that data.

23:07

>> Yep. So you're saying 10 to 20 billion

23:08

in aggregate?

23:11

No, per lab.

23:13

>> Per lab

23:15

with eight labs spending that much

23:17

>> uh sevenish.

23:20

uh I think for some labs

23:23

>> including this isn't like salaries of

23:25

data team people is included like how

23:27

does it get to

23:28

>> like literal data from external vendors

23:30

and and and by the way most of this

23:32

spend does not actually get satisfied

23:34

like I'm sure that there is a data

23:36

budget set aside whose upper limit is

23:39

not actually met because there's

23:40

just simply not enough good quality data

23:43

vendors. I have still

23:46

never seen a data contract get turned

23:48

down by a top lab if it's good quality

23:51

data for budget reasons.

23:53

>> Yeah. What's the delta between the

23:54

billion dollar number versus the you

23:56

know 10 to 20 billion like what's

23:58

included in the latter that's not

23:59

included in the former?

24:01

>> Uh I would say body shop type data

24:03

labeling that's very emblematic of Scale

24:04

type what what Scale used to do uh and

24:07

what many people still think the data

24:09

industry is which is just manual manual

24:11

data labeling for pre-training data. Um

24:14

>> so then that would be like you know 70

24:16

billion plus in aggregate. Um

24:21

what's what's what's like the ballpark

24:23

of like Surge, Scale

24:25

annual revenue

24:28

>> Surge is between two to three bill

24:30

run rate, I'm pretty sure

24:32

>> what's the yeah where's the where's the

24:34

gap come from like if Surge is you know

24:36

leading provider they're doing two to

24:38

three 70 billion aggregate spend

24:43

>> um there are so many companies that

24:44

participate in data markets that you

24:46

would have never even expected just a

24:48

big massive long tail basically.
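
The arithmetic behind this exchange can be checked quickly. The sketch below uses only the figures stated in the conversation (roughly seven labs, 10 to 20 billion per lab under a loose definition of data, and a 2 to 3 billion run rate for Surge); the function names are made up for the example, and the result is a rough range under those stated figures, not a verified market estimate.

```python
from typing import Tuple


def aggregate_spend_range(labs: int, per_lab_low_b: float, per_lab_high_b: float) -> Tuple[float, float]:
    """Low and high aggregate spend in billions of dollars."""
    return labs * per_lab_low_b, labs * per_lab_high_b


def vendor_share_range(vendor_low_b: float, vendor_high_b: float,
                       market_low_b: float, market_high_b: float) -> Tuple[float, float]:
    """Smallest and largest share one vendor could hold of the aggregate."""
    return vendor_low_b / market_high_b, vendor_high_b / market_low_b


if __name__ == "__main__":
    low, high = aggregate_spend_range(labs=7, per_lab_low_b=10, per_lab_high_b=20)
    share_min, share_max = vendor_share_range(2, 3, low, high)
    print(f"aggregate: {low}-{high}B, vendor share: {share_min:.1%}-{share_max:.1%}")
```

On these numbers even the leading provider accounts for only a few percent of aggregate spend, which is consistent with the long-tail explanation given here.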

24:50

>> Yeah, it's an it's a very massive long

24:52

tail. Yes. Uh also

24:55

>> yeah staffing agencies as well. It's

24:58

like uh uh and this encompasses a lot of

25:02

the spend that OAI and Anthropic

25:05

directly have like acquiring companies

25:07

from the real world too just for data

25:08

assets.

25:09

>> Yeah.

25:10

>> Uh which certainly I don't know why like

25:12

is happening a lot more and more and

25:13

people are not discussing this very

25:15

closely. M

25:16

>> um

25:17

>> this is like acquiring little like

25:19

little wet labs and that kind of stuff

25:21

>> like app layer companies in certain

25:23

domains that they're they're interested

25:25

in building products in right

25:28

>> uh I I I can't name them specifically.

25:32

Um

25:32

>> enterprise software type small app

25:35

layer companies.

25:36

>> Yeah. Yeah, you could say that with like

25:37

network effects from like having I don't

25:40

know 10 to 15 years worth of user

25:42

activity like a stack overflow type type

25:44

thing. Uh so the the data markets as

25:50

exemplified by like Mercor and these

25:52

companies they represent like the tip of

25:54

the iceberg in terms of like the the

25:57

entire long tail of companies where data

25:59

procured actually comes from.

26:00

>> Cool. As a last question, um the

26:05

what makes inference providers and

26:06

neoclouds a good fit uh to acquire RL env

26:10

companies is that they basically act as

26:11

implementers to the enterprise partners

26:14

that are their customers.

26:16

>> They they are the compute they are the

26:18

compute providers for our labs as well.

26:20

It naturally makes sense that they want

26:22

to do horizontal product expansion and

26:24

bring post-training infrastructure.

26:26

>> Yeah.

26:26

>> Uh and tooling alongside their product

26:29

offering to labs. Baseten actually I think

26:31

it was Baseten made an RL envs acquisition

26:34

>> uh like in December January time that

26:36

very few people are talking about so

26:38

there's precedent and I think um some of

26:40

the sophisticated RL env targets are very

26:43

good acquisition targets for this

26:45

>> both to help them sell to enterprises and

26:47

to labs

26:51

>> uh to build out their uh post training

26:53

infrastructure product suite

26:56

>> the the and the end customer the post

26:57

training infra is um mostly like non-top

27:01

three frontier labs just like other

27:03

enterprises.

27:04

>> Yeah. Yeah. Like app layer companies

27:06

too. Y

27:07

>> um like for a while while you know

27:09

Perplexity and Cursor were more than 50%

27:11

of Fireworks' revenue, for example.
