Full Transcript

How to sell RL envs and data to AI labs: Interview with Sean Cai

27:16 | English | By Chris Barber | Transcribed May 28, 2026

0:00

What areas of data do you feel are

0:02

underserved by now um that labs want to

0:05

buy?

0:07

>> Uh certainly we're still in a huge dearth

0:09

of long and unverifiable. Uh the domains

0:13

that matured the quickest

0:16

matured because they were much much more

0:19

easily verifiable with web 2.0

0:21

instruments. Coding had GitHub. We don't

0:22

have a GitHub for all other domains but

0:24

we ventured into finance healthcare and

0:26

law afterwards. Nowadays, biological and

0:29

cyber security is a craze but only

0:32

I think by the most sophisticated labs

0:34

namely Anthropic and some of OAI.

0:36

>> You said bio and cyber.

0:38

>> Yes.

0:40

>> But in general I think incredibly long

0:43

horizon realistic is still very much in

0:46

demand. A lot of benchmarks are built as

0:48

this but not actually that. Does that

0:50

mean things that would like basically

0:52

serve making Cowork and Codex better

0:55

for non-technical work?

0:58

>> Uh potentially. Yeah. I would classify

1:00

that as like back office ERP type tasks.

1:02

>> Yeah.

1:03

>> Which largely have to do with very

1:04

complicated search and retrieval

1:05

functions across quite convoluted data

1:07

lakes uh and environments.

1:10

>> Okay. Um what kinds of software programs

1:13

would people be using?

1:16

>> Uh even so Excel file systems.

1:20

um just like think very convoluted file

1:22

systems and applications with data

1:24

across multiple formats sometimes

1:26

tabular sometimes graphical even as well

1:29

>> um and it's an exercise in model tool

1:32

calling as well

1:33

>> cool also stuff like SAP or or not

1:36

[snorts] necessarily

1:38

>> uh yes but I think for maybe some of the

1:40

computer use circles computer use

1:42

continues to be sort of like a smaller

1:44

market but dominated by a few top RL env

1:47

companies there uh like a a version of

1:50

SAP would have gone for like 500K uh in

1:53

the computer use craze of like late '25

1:55

probably. Um I suspect a version of SAP

2:00

has been created by one of the RL

2:02

computer use companies at this point or

2:03

maybe internally by Anthropic and OAI

2:06

but that's speculation. Yeah. The um bio

2:11

and cyber for the bio stuff is it uh bio

2:14

real world things or like purely digital

2:18

you know stuff that's pure computer?

2:20

>> Yeah it started from bioinformatics but

2:21

nowadays we're trying to make a lot of

2:23

these processes with one step in the

2:25

digital and one step in the physical

2:27

verifiable.

2:28

Uh naturally this gets really hard

2:30

because you think about the verification

2:32

mechanism for a lot of things in

2:33

biology. Anthropic just put out this

2:35

benchmark called mystery biobench which

2:37

I think enumerates a lot of the problems

2:39

pretty succinctly there. We don't even

2:41

know even amongst the top experts how to

2:44

verify something in biology because

2:45

they're literally de novo experiments,

2:47

right? Like we combine certain chemicals

2:50

like what happens? Um

2:53

so uh they're almost veering into

2:55

physics based models. Of course if you

2:57

get to physics based models then you

2:59

have like

3:01

uh sim-to-real and robotics uh to which

3:04

you could do RL and robotics and that's

3:06

an entirely different domain. But I I'd

3:08

say it's like on our slow march to make

3:11

sim to real uh and and generally model

3:13

much more things in the physical world

3:15

more accurately as verifiers and RL um

3:18

>> certain domains

3:20

>> fall in the middle of purely software

3:22

based work and like robotics

3:24

>> like biological workflows, chemistry

3:26

workflows, even scientific discovery

3:28

that make that pretty useful. What are

3:30

some of the pieces of software that uh

3:33

bioinformaticists might be using that

3:34

like they would have you know envs for?

3:38

>> Oh, I think I think like I'm not a

3:41

biologist but uh there's so many bespoke

3:44

tools and like bespoke processes within

3:47

a lab itself. I I I'm helping a wet lab

3:52

sort of digitize our processes here

3:54

doing this and a lot of this stuff

3:56

doesn't have purpose-built software for

3:58

it. It's

4:00

and it just adds to the environment

4:02

complexity and bespoke tool build out

4:04

for those environments complexities as

4:06

well.

4:06

>> Okay. So it's more like general computer

4:09

use, both GUI and bash.

4:12

>> Yeah, I would I I would say so that's

4:14

the primitive in which like kind of

4:16

everything is based right.

4:18

>> Yeah. the um

4:21

cyber stuff any particular subsets like

4:24

uh uh you know program analysis or um

4:27

pentesting or um you know what's the

4:30

what's the most underserved cyber

4:32

subsets do you think?

4:34

Yeah, for sure. Uh, I think cyber is

4:37

mostly being bought by Anthropic and

4:39

maybe some of OAI right now and then the

4:40

rest of the labs are are following

4:41

whatever they do. Uh, naturally

4:46

Irregular security is probably the

4:48

company to follow in the space who does a

4:49

lot of this type of stuff. Um,

4:52

I would say a lot of offensive cyber is

4:57

uh being modeled right now in sort of

4:59

interactive environments. So a lot of

5:01

stuff in code level security uh web app

5:04

exploitation CTF challenges and like

5:06

remediation is already covered by

5:07

existing benchmarks like um CyberGym,

5:10

Cybench; these are pretty saturated

5:13

nowadays so long horizon wise as all

5:17

domains are tending towards you're

5:19

looking towards um

5:22

stuff like infrastructure exploits uh

5:24

and agent layer attacks. So there

5:28

there's a lot of stuff you can model out

5:29

there. Um because there are new zero

5:32

days every single day. It's like one of

5:34

the most dynamically changing fields. So

5:36

naturally you're going to expect there

5:38

to need to be uh real-time data streams

5:41

to translate these things into model

5:44

actionable formats.

5:45

>> Yep. Awesome. The um

5:49

the is the the sales process is roughly

5:52

um either do something in public such

5:55

that researchers reach out to you

5:56

already know researchers or kind of get

5:59

intros or do cold emails to researchers

6:02

to get a you know a pilot. Is that

6:04

roughly the first step?

6:06

Yeah, but you know the our our appetite

6:09

for data is voracious and expanding, but

6:11

there's still only 24 hours in a day and

6:13

a researcher's job is not to talk to

6:15

data vendors all the time. So the bar to

6:18

get I think a researcher's attention is

6:22

getting higher and higher. So those

6:24

without research sophistication, it's

6:25

just like that's simply not the best

6:27

move anymore. the human data supply

6:29

chain is expanding such that one can

6:31

meaningfully participate in it in it um

6:33

without interacting with an end

6:35

researcher

6:37

>> cross-selling partnerships you mean or

6:40

>> yeah protege is an example uh I think

6:44

there are companies out there which

6:48

almost train companies to produce good

6:50

post training data and then sell their

6:53

data to researchers themselves using

6:55

themselves as a stamp of approval

6:58

Yep. Yep. The um

7:02

the uh what are what are the most

7:05

important things that labs look for?

7:07

What are the main reasons that a lab

7:09

might uh not kind of renew or increase

7:12

their purchase volume? You know, after

7:15

they get the uh the first set of data.

7:20

>> Uh each lab has their own QC processes

7:23

and they run them internally and they

7:24

see. But I I you know there there are so

7:26

many reasons why

7:29

some data might be quite quite poor. Um

7:34

so

7:37

I think there are just a lot of very

7:39

small things that a researcher can look

7:41

at a data set and like maybe a data set

7:43

is a million lines and they notice

7:45

something that is off about one line and

7:46

they really start to question whether a

7:48

startup even has a QC process at all in

7:50

the first place or not. Um,

7:53

subsequently,

7:55

uh, it's it's it's I think the most

7:57

common case is just your tasks are

8:00

incredibly incredibly poorly designed,

8:03

um, in terms of they're easily reward

8:04

hackable. The prompts are vague, they're

8:07

not emblematic of real world tasks from

8:09

an obvious setting.

8:11

There's a lot of other smaller things

8:14

afterwards too like the model failures

8:16

you identify and the reasons for why the

8:19

model fails at them are not actually

8:20

genuine capability failures. So because

8:22

you designed the harness quite poorly.

8:24

Um it's because you ran it in a very

8:27

specific environment that isn't actually

8:28

how most users would have run this task

8:30

or model in. Uh you don't do cross

8:32

harness testing. You don't do n-gram

8:34

contamination testing. So which is to

8:36

say you don't test whether a data set is

8:38

already in the pre-training corpus

8:39

literature or not. Um there there's a

8:43

lot of uh QC checks that one can run

8:47

before delivering say OTS RL data that

8:51

would make it just uh that would make

8:54

them a great partner that researchers

8:56

would want to work with and iterating

8:57

the shape of post-training data.
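
The n-gram contamination check mentioned above can be sketched roughly as follows. This is a minimal illustration, not any lab's actual QC pipeline: the function names (`ngrams`, `contamination_rate`), the whitespace tokenizer, and the default n of 8 are all assumptions made for the example. It reports the fraction of a sample's n-grams that also appear in a reference corpus, as a rough proxy for whether the sample may already sit in pre-training data.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens, so the set is empty too.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(sample: str, reference_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the sample's n-grams that also occur in any reference document.

    Hypothetical helper: the choice of n and the tokenizer are assumptions,
    and a real check would run against a much larger indexed corpus.
    """
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    reference_grams: Set[Tuple[str, ...]] = set()
    for doc in reference_docs:
        reference_grams |= ngrams(doc, n)
    return len(sample_grams & reference_grams) / len(sample_grams)
```

A vendor could run something like this per task prompt and flag any item whose rate exceeds a threshold they choose before delivery.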

8:59

>> Yeah. the um

9:02

the and the QC is both uh researcher

9:05

flagged issues and then also just the

9:07

data teams as well reviewing things

9:10

themselves.

9:12

Yeah, I think it's uh

9:16

c certainly researcher needs are bespoke

9:18

as well for certain projects, but if you

9:20

think about

9:22

uh researchers exploring a net new

9:26

research question like how do we improve

9:27

taste in models and they're exploring

9:30

different OTS RL data sets to uh along

9:34

with maybe a data company provided

9:36

benchmark to explore this question.

9:38

There there are just so many things that

9:41

they look for in a typical data delivery

9:43

that are really difficult for

9:46

you to know unless you worked in the in

9:48

the data industry yourself or you've

9:49

been a researcher and you understand

9:50

what good RL data is, right?

9:53

>> Good data taste, good research taste,

9:56

and perceived ability to scale quality

9:58

with quantity are the three things that

10:00

are necessary to building a a good human

10:02

data company out.

10:04

>> Good data taste, good research taste,

10:05

and what was the third one?

10:06

perceived ability to scale quality with

10:08

quantity.

10:09

>> Yeah. Cool. The um uh when does a

10:13

researcher like in a question like that

10:15

they're doing some new kind of area um

10:17

that's vague. How do they get the model

10:19

to do a certain thing?

10:21

Um

10:23

what do researchers do to first explore

10:25

off-the-shelf offerings before um will

10:28

they just hit up all their existing

10:30

vendors from their team or they go to

10:32

their you know data team people

10:34

internally and ask them to go and get a

10:36

a uh a set of options for them? What's

10:39

the what's the first steps that the

10:40

researcher takes?

10:43

Yeah, I'd say they do some bespoke reach

10:46

out themselves, especially for new

10:49

research team directions like OpenAI's

10:53

newest robotics VLA direction which spun

10:56

out of Sora, or not, I wouldn't say de novo

10:59

spun out of Sora but was sort of

11:01

combined with the remnants of Sora. they

11:03

will go out to real world data vendors

11:08

and reach out to get samples uh because

11:11

you got to remember their jobs are to

11:12

improve model capabilities and if that's

11:14

the bottleneck they won't go and solve

11:16

the bottleneck themselves but that's why

11:18

these labs have human data teams. It's

11:19

like one to procure the data necessary

11:22

and manage vendor relations but two is

11:24

like negotiate on price

11:27

>> and and all that all those things

11:29

associated. So it's a collaboration

11:32

between those two entities.

11:33

>> Cool. The um in the robotics data space

11:37

is there anything that you view as

11:38

underserved

11:39

uh there? Obviously there's a lot of

11:42

things that are well served but what are

11:43

the ones you view as underserved in

11:45

robotics?

11:46

>> Data vendors who are genuinely research

11:48

first and running post training

11:49

experiments on their own data. if

11:51

they're trying to sell things like ego

11:55

and data up the training mix pyramid to

12:00

VLA-like companies

12:03

uh or

12:06

uh ju just running a lot of training

12:09

experiments on their data that match the

12:11

research direction of companies they're

12:12

trying to sell to in order to be a bit

12:14

more

12:14

>> selling egocentric data to VLA companies

12:16

is underserved.

12:20

It is underserved in a sense of there

12:24

are not many

12:25

>> my naive view is like everyone is

12:27

selling egocentric data.

12:29

>> Well, there are very few egocentric data

12:31

vendors that actually know how the

12:33

downstream training is done.

12:36

>> So it's egocentric data vendors who are

12:37

doing their own post-training and

12:39

therefore like research have good

12:40

research taste.

12:42

>> Yeah. But um you you got to remember

12:45

what is being sold when you sell data in

12:46

the first place is just model capability

12:48

improvement, right? And data is just the

12:50

medium to do that. So if you're trying

12:51

to sell data, but you're not actually

12:53

cognizant of how model improvement is

12:56

achieved or you don't have an opinion

12:57

there and can't really help the

12:59

researcher with that, you are in a

13:01

losing battle and losing market uh and

13:04

you are going to be commoditized.

13:06

>> Yep.

13:07

>> Yep. The um

13:10

what is the uh what does the initial

13:12

meeting look like? the um someone talks

13:14

to researcher, the researcher maybe

13:16

requests some samples and the founder

13:18

sends it in Google Drive. Um how does

13:21

that uh what does that typically look

13:23

like? What's the formats people are

13:25

expecting?

13:26

>> For RL data, for the longest time, it's

13:28

literally just been a Docker container.

13:30

>> Yeah.

13:31

>> Uh a Docker container isolated

13:33

environment, all the tools on there, all

13:35

the verification mechanisms and rubrics.

13:37

One simply simply has to plug and play

13:40

their agent. Uh and then you get an eval

13:44

score and then you can use a multitude

13:45

of these software containers to run

13:47

rollouts for GRPO RL.

13:49

>> Yep.

13:49

>> Whatever other training mechanisms

13:51

you employ
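
To make the rollout step concrete: GRPO scores each rollout relative to the other rollouts sampled for the same task. Below is a minimal sketch, assuming each container run returns one scalar eval score. The function name, the use of population standard deviation, and the small epsilon are assumptions for illustration; real training stacks differ in detail.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize a group of rollout rewards into GRPO-style advantages.

    `rewards` is assumed to hold one eval score per rollout of the same task,
    e.g. one score per containerized environment run.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some stacks use sample std
    # eps keeps the division finite when every rollout got the same score.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group gets the same score, all advantages come out as zero and the task contributes no learning signal, which is one reason tasks that are too easy or too hard for the model are of little use.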

13:53

>> and labs do they have more kind of

13:54

sophisticated internal kind of you know

13:57

uh

14:00

setups for um running environments now

14:04

that need different formats.

14:08

>> Yes, they do. Anthropic notably has one

14:11

whose name I can't uh disclose but the

14:14

most sophisticated labs I would say are

14:16

like Anthropic, OpenAI, DeepMind and

14:21

then everybody else and then Chinese

14:23

labs in that order.

14:25

>> Yep. Yep. the um

14:30

the um and then in terms of the data

14:33

that the non

14:35

you know three frontier labs are buying

14:38

um are they buying more off-the-shelf

14:41

data that you know companies have

14:42

already sold to Anthropic and OpenAI on

14:45

like a non-exclusive thing

14:50

>> uh I think OTS is a relatively new

14:53

phenomenon it is the mechanism with

14:56

which Surge has done business for a long

14:59

long time. Uh but that's because Surge

15:01

is a very fundamentally different

15:03

company than all the other venture-backed

15:04

players. Um I I believe

15:07

>> How so, by the way, on that?

15:09

>> Oh, they genuinely started off as just

15:11

model capability caring about model

15:12

capability improvement, right? Not not

15:17

as a sort of data company and for the

15:19

longest time like mostly SFT data as

15:21

well. Um,

15:25

>> so

15:26

you're talking about exclusivity.

15:30

Certainly exclusivity reflects different

15:32

labs philosophies towards data vendors.

15:35

Anthropic is the only one who I think

15:37

really pushes for exclusivity. OpenAI

15:40

at different points throughout its human

15:42

data turnover uh human data teams

15:45

turnovers because a lot of people shift

15:48

around in OAI a lot. Uh but Anthropic

15:51

genuinely views their data vendors as

15:54

research partners and if you think your

15:57

research partner is genuinely novel

15:59

research you probably want to get

16:01

exclusivity on that. Um

16:04

>> which is the approach that they've

16:08

employed with many of the data companies

16:09

they've worked with.

16:10

>> Yep. Do they have expiry clauses on the

16:12

exclusivity like 12 months or 24 or

16:15

something like this? I'd imagine they

16:16

are starting to think about that pretty

16:18

closely now. But I am aware of many many

16:20

companies who have recently just ended

16:24

Anthropic exclusivity. Some of them had

16:26

the agreement that they would only have

16:27

it for a year. Some of them

16:30

for other strategic reasons they've

16:32

stopped exclusivity with. So

16:35

>> yeah.

16:36

>> Yep. the um

16:39

the

16:41

uh

16:42

almost all purchase decisions researcher

16:44

led at this point as opposed to you know

16:47

like the researcher pulls it in and then

16:48

the the data team kind of uh is effect

16:52

is like a form of procurement or um

16:57

is it different? You can imagine it's a

16:59

partnership of sorts,

17:02

but if you want to think about it from

17:04

from a economic buyer perspective, you

17:07

always want to be just in general B2B

17:09

sales dealing with the economic buyer

17:11

because if you can convince the economic

17:14

sorry, not the economic buyer, the the

17:16

end user, right? If you can convince the

17:17

end user of your product that there's

17:19

substantial value, the question is not

17:21

whether the org is going to buy it or

17:22

not. It's just how much are they going

17:24

to buy it for. Yep.

17:25

>> Um, and so if you're going to the guy

17:28

who's pricing it first, who doesn't know

17:30

how valuable it is, doubtlessly it's

17:34

going to be a harder sell than if you

17:36

had convinced the end user that it's

17:37

valuable first, right?

17:40

>> Um, Decagon, Sierra, and Ramp. Um, what

17:43

kinds of uh data are they buying

17:45

relative to the Frontier Labs?

17:47

>> Voice data. Uh, Ramp is not so much

17:51

buying data. Actually, the Ramp Labs

17:53

report came out um the other day, and I

17:57

was surprised at a couple things. One, I

18:00

really love the fact that you've got

18:01

really sophisticated elite engineering

18:03

at app-layer companies out there

18:05

post-training their own small models for

18:06

their own use cases. But I was surprised

18:08

that they used a synthetic data set to

18:11

inform some of the environments in which

18:13

they were training like accounting level

18:15

transactions if they're an app layer

18:17

company that should have access to that

18:18

data themselves

18:20

>> which is

18:22

which one suggests that one could

18:26

sell data to them if they can't use

18:28

their own app layer data uh for for

18:31

these training environments. But uh two

18:35

uh also suggests that app-layer companies

18:37

may be feasible buyers in the future if

18:39

there's a substantial systematic issue

18:41

that prevents them from using their own

18:43

users data. Certainly doesn't look like

18:45

it's been a problem with Cursor though.

18:47

So I'm sure this is just a small uh

18:50

quirk.

18:51

>> Yeah. Do you think it's a it's a privacy

18:52

thing that they'll just figure out?

18:55

>> I think so. Privacy is not like data

18:57

privacy is very easy to figure out

18:59

nowadays for all these companies.

19:01

>> Yep. the um

19:05

uh do you have a certain view on you

19:08

know long-term

19:10

um

19:12

the

19:13

labs have their applications those

19:15

applications give them you know traces

19:17

that they can train on um

19:21

how it evolves where they still need to

19:24

buy data externally versus training on

19:27

the data from their users

19:31

Um

19:34

yeah, one would have thought that

19:36

Anthropic has so much data from Claude

19:38

Code and

19:40

Cowork

19:41

>> right

19:41

>> that maybe they would not have needed to

19:44

procure from external vendors

19:46

but they still do. Um and and and this

19:50

reflects the fact that most external

19:53

data vendors that are succeeding with

19:54

sophisticated research labs and data

19:56

markets, they're mostly selling

19:58

capabilities that are N plus one of

19:59

current tier models, right?

20:02

>> Y

20:03

>> um Andon Labs, by the way, Andon is a

20:06

fantastic company in this regard in

20:09

terms of producing really hard realistic

20:11

benchmarks, but a bit too ahead of its

20:14

time, I think.

20:15

>> Yeah. Um, Andon Labs is a good example

20:20

of the fact that we're we we're going to

20:24

produce these really real world long

20:26

horizon benchmarks that are not going to

20:28

be saturated for a long time and that is

20:30

quite valuable to us.

20:31

>> Yep. So if it's already within the

20:33

capabilities of the model then they can

20:36

train on it from their traces but if

20:38

it's not and no user is going to attempt

20:40

it in the model then they don't have any

20:41

traces to train on. And this is from a

20:44

purely single axis performance-based

20:46

perspective, right? Whereas it's like

20:48

there's only one thing to hill-climb and

20:49

it's perceived performance. Um cost

20:52

and latency are also big questions too.

20:55

An Anthropic researcher I think told me

20:58

at some point our benchmarks are really

21:00

not going to index on performance and

21:03

that we'll have prohibitively expensive

21:04

AGI in some sense but like how much does

21:07

it cost and how fast does it take to do

21:10

something is going to be new

21:12

>> new new dimensions of benchmarks. So

21:16

then you expect that um env vendors will

21:20

uh start to do benchmarks that are

21:22

basically performance divided by price

21:25

rather than just performance essentially

21:28

>> perhaps. Yeah. And this expands greatly

21:30

the aperture of different niches that RL env

21:33

companies can play in because if you

21:35

think about the enterprise world, right?

21:37

There are many use cases where I just

21:39

want a much much cheaper model at a

21:42

fixed level of intelligence

21:44

>> that is satisfactory for certain like

21:46

job functions, right? And then even in

21:49

Ramp Labs' recent implementation on

21:51

their Twitter post they showed that they

21:53

use a above head frontier model for

21:55

planning but they they collapse the

21:57

search and retrieval function to a small

21:59

model that they post trained just for

22:01

that purpose.
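
One way to picture a cost-aware benchmark along these lines is to fix a minimum acceptable score and then pick the cheapest model that clears it, alongside a simple score-per-dollar ratio. The sketch below is hypothetical throughout: the `ModelResult` fields, the function names, and the idea of a single per-task cost are assumptions made for illustration, not anything described by Ramp or the labs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelResult:
    name: str
    score: float      # benchmark score, assumed to lie in [0, 1]
    cost_usd: float   # assumed average cost per task, in dollars
    latency_s: float  # assumed average wall-clock seconds per task


def cheapest_satisfactory(results: List[ModelResult], min_score: float) -> Optional[ModelResult]:
    """Return the cheapest model whose score meets the fixed quality bar, or None."""
    ok = [r for r in results if r.score >= min_score]
    return min(ok, key=lambda r: r.cost_usd) if ok else None


def score_per_dollar(result: ModelResult) -> float:
    """Performance divided by price; undefined for a zero-cost entry."""
    if result.cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return result.score / result.cost_usd
```

The same selection could use `latency_s` as the key instead of `cost_usd` when speed matters more than price for a given job function.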

22:03

>> The um how many labs are spending at the

22:06

you know billion dollar plus per year

22:08

data level?

22:11

>> Seven or eight.

22:12

>> Mhm. the how much more than

22:16

like you know Anthropic talked about

22:18

their billion dollar number. Do you

22:19

think it's going to end up being like

22:20

closer to like you know three to four

22:22

kind of this year?

22:24

>> Yeah. I mean I'd say like honestly each

22:27

Frontier Lab if you're loose with your

22:30

definition of data like they spend

22:32

between 10 to 20 billion a year. I think

22:34

I posted about this a while back too.

22:36

>> 10 to 20 if you're loose with your

22:38

definition of data. Um yeah. Can you say

22:40

how so? Uh this is this shouldn't be a

22:44

surprise to anybody, right? Like three

22:46

things hill-climb model capabilities:

22:47

compute, data, and talent. And data

22:49

spend is still a drop in a bucket

22:50

compared to compute costs, right? Um I

22:54

I'd say we're generally still supply

22:57

constrained in that if you think about

23:00

RL data or just data in general, that

23:02

means the quality bar for these labs,

23:04

we're still very much still in demand of

23:06

that data.

23:07

>> Yep. So you're saying 10 to 20 billion

23:08

in aggregate?

23:11

No, per lab.

23:13

>> Per lab

23:15

with eight labs spending that much

23:17

>> uh sevenish.

23:20

uh I think for some labs

23:23

>> including this isn't like salaries of

23:25

data team people is included like how

23:27

does it get to

23:28

>> like literal data from external vendors

23:30

and and and by the way most of this

23:32

spend does not actually get satisfied

23:34

like I'm sure that there is a data

23:36

budget set aside whose upper limit is

23:39

not actually met because there's

23:40

just simply not enough good quality data

23:43

vendors. I have still

23:46

never seen a data contract get turned

23:48

down by a top lab if it's good quality

23:51

data for budget reasons.

23:53

>> Yeah. What's the delta between the

23:54

billion dollar number versus the you

23:56

know 10 to 20 billion like what's

23:58

included in the latter that's not

23:59

included in the former?

24:01

>> Uh I would say body shop type data

24:03

labeling that's very emblematic of Scale

24:04

type, what Scale used to do uh and

24:07

what many people still think the data

24:09

industry is which is just manual manual

24:11

data labeling for pre-training data. Um

24:14

>> so then that would be like you know 70

24:16

billion plus in aggregate. Um

24:21

what's what's what's like the ballpark

24:23

of like Surge, Scale

24:25

annual revenue

24:28

>> Surge is between two to three bill

24:30

run rate, I'm pretty sure

24:32

>> what's the yeah where's the where's the

24:34

gap come from like if Surge is you know

24:36

leading provider they're doing two to

24:38

three 70 billion aggregate spend

24:43

>> um there are so many companies that

24:44

participate in data markets that you

24:46

would have never even expected just a

24:48

big massive long tail basically.

24:50

>> Yeah, it's an it's a very massive long

24:52

tail. Yes. Uh also

24:55

>> yeah staffing agencies as well. It's

24:58

like uh uh and this encompasses a lot of

25:02

the spend that OAI and Anthropic

25:05

directly have like acquiring companies

25:07

from the real world too just for data

25:08

assets.

25:09

>> Yeah.

25:10

>> Uh which certainly I don't know why like

25:12

is happening a lot more and more and

25:13

people are not discussing this very

25:15

closely. M

25:16

>> um

25:17

>> this is like acquiring little like

25:19

little wet labs and that kind of stuff

25:21

>> like app layer companies in certain

25:23

domains that they're they're interested

25:25

in building products in right

25:28

>> uh I I I can't name them specifically.

25:32

Um

25:32

>> enterprise software type small app

25:35

layer companies.

25:36

>> Yeah. Yeah, you could say that with like

25:37

network effects from like having I don't

25:40

know 10 to 15 years worth of user

25:42

activity like a stack overflow type type

25:44

thing. Uh so the the data markets as

25:50

exemplified by like Mercor and these

25:52

companies they represent like the tip of

25:54

the iceberg in terms of like the the

25:57

entire long tail of companies where data

25:59

procured actually comes from.

26:00

>> Cool. As a last question, um the

26:05

what makes inference providers and

26:06

neoclouds a good fit uh to acquire RL env

26:10

cos is that they basically act as

26:11

implementers to the enterprise partners

26:14

that are their customers.

26:16

>> They they are the compute they are the

26:18

compute providers for our labs as well.

26:20

It naturally makes sense that they want

26:22

to do horizontal product expansion and

26:24

bring post-training infrastructure.

26:26

>> Yeah.

26:26

>> Uh and tooling alongside their product

26:29

offering to labs. Baseten actually I think

26:31

it was Baseten made an RL envs acquisition

26:34

>> uh like in December January time that

26:36

very few people are talking about so

26:38

there's precedent and I think um some of

26:40

the sophisticated RL env targets are very

26:43

good acquisition targets for this

26:45

>> both help themselves to enterprises and

26:47

to labs

26:51

>> uh to build out their uh post training

26:53

infrastructure product suite

26:56

>> the the and the end customer the post

26:57

training infra is um mostly like non-top

27:01

three frontier labs just like other

27:03

enterprises.

27:04

>> Yeah. Yeah. Like app layer companies

27:06

too. Y

27:07

>> um like for a while, you know,

27:09

Perplexity and Cursor were more than 50%

27:11

of Fireworks revenue for example.
