0:00
What areas of data do you feel are
0:02
underserved right now that labs want?
0:07
>> Certainly we're still in a huge dearth
0:09
of long-horizon and unverifiable data. The domains
0:13
that matured the quickest
0:16
matured because they were much more
0:19
easily verifiable with Web 2.0
0:21
instruments. Coding had GitHub. We don't
0:22
have a GitHub for every other domain, but
0:24
we ventured into finance, healthcare, and
0:26
law afterwards. Nowadays, biological and
0:29
cybersecurity data is a craze, but only,
0:32
I think, among the most sophisticated labs,
0:34
namely Anthropic and some of OpenAI.
0:36
>> You said bio and cyber.
0:40
>> But in general, I think incredibly long-
0:43
horizon, realistic data is still very much in
0:46
demand. A lot of benchmarks are billed as
0:48
this but are not actually that. Does that
0:50
mean things that would basically
0:52
serve making Cowork and Codex better
0:55
for non-technical work?
0:58
>> Potentially, yeah. I would classify
1:00
that as back-office, ERP-type tasks.
1:03
>> Which largely have to do with very
1:04
complicated search-and-retrieval
1:05
functions across quite convoluted data
1:07
links and environments.
1:10
>> Okay. What kinds of software programs
1:13
would people be using?
1:16
>> Even just Excel and file systems.
1:20
Just think very convoluted file
1:22
systems and applications with data
1:24
across multiple formats: sometimes
1:26
tabular, sometimes even graphical.
1:29
>> And it's an exercise in model tool use.
1:33
>> Cool. Also stuff like SAP, or not?
1:38
>> Yes, but I think for maybe some of the
1:40
computer-use circles. Computer use
1:42
continues to be a smaller
1:44
market, dominated by a few top RL env
1:47
companies. A version of
1:50
SAP would have gone for something like $500K in
1:53
the computer-use craze of late '25,
1:55
probably. I suspect a version of SAP
2:00
has been created by one of the RL
2:02
computer-use companies at this point, or
2:03
maybe internally by Anthropic or OpenAI,
2:06
but that's speculation. Yeah. On bio
2:11
and cyber: for the bio stuff, is it bio
2:14
real-world things, or purely digital
2:18
stuff that's purely on the computer?
2:20
>> Yeah, it started from bioinformatics, but
2:21
nowadays we're trying to make a lot of
2:23
these processes with one step in the
2:25
digital and one step in the physical.
2:28
Naturally, this gets really hard
2:30
when you think about the verification
2:32
mechanism for a lot of things in
2:33
biology. Anthropic just put out this
2:35
benchmark called mystery biobench, which
2:37
I think enumerates a lot of the problems
2:39
pretty succinctly. We don't even
2:41
know, even amongst the top experts, how to
2:44
verify something in biology, because
2:45
they're literally de novo experiments,
2:47
right? Like, we combine certain chemicals,
2:50
what happens?
2:53
So they're almost veering into
2:55
physics-based models. Of course, if you
2:57
get to physics-based models, then you
3:01
get into sim-to-real and robotics, where
3:04
you could do RL in robotics, and that's
3:06
an entirely different domain. But I'd
3:08
say we're on a slow march to make
3:11
sim-to-real work and, more generally, to model
3:13
many more things in the physical world
3:15
more accurately, as verifiers for RL.
3:20
>> Falling in the middle between purely software-
3:22
based work and robotics.
3:24
>> Like biological workflows, chemistry
3:26
workflows, even scientific discovery,
3:28
that make that pretty useful. What are
3:30
some of the pieces of software that
3:33
bioinformaticists might be using that
3:34
they would want envs for?
3:38
>> Oh, I think, well, I'm not a
3:41
biologist, but there are so many bespoke
3:44
tools and bespoke processes within
3:47
a lab itself. I'm helping a wet lab
3:52
digitize its processes here,
3:54
and a lot of this stuff
3:56
doesn't have purpose-built software,
4:00
and that just adds to the environment
4:02
complexity and the bespoke tool build-out
4:04
those environments need as well.
4:06
>> Okay. So it's more like general computer
4:09
use, both GUI and bash.
4:12
>> Yeah, I would say so. That's
4:14
the primitive on which kind of
4:16
everything is based, right?
4:21
On the cyber stuff, any particular subsets, like
4:24
program analysis or
4:27
pentesting? What are the
4:30
most underserved cyber
4:32
subsets, do you think?
4:34
Yeah, for sure. I think cyber is
4:37
mostly being bought by Anthropic and
4:39
maybe some of OpenAI right now, and the
4:40
rest of the labs are following
4:41
whatever they do. Naturally,
4:46
Irregular is probably the
4:48
company to follow in this space; they do a
4:49
lot of this type of stuff.
4:52
I would say a lot of offensive cyber is
4:57
being modeled right now in
4:59
interactive environments. A lot of
5:01
stuff in code-level security, web app
5:04
exploitation, CTF challenges, and
5:06
remediation is already covered by
5:07
existing benchmarks like CyberGym and
5:10
Cybench. These are pretty saturated
5:13
nowadays. So, long-horizon-wise, which is where all
5:17
domains are trending, you're looking at
5:22
stuff like infrastructure exploits
5:24
and agent-layer attacks. There
5:28
is a lot of stuff you can model out
5:29
there, because there are new zero-
5:32
days every single day. It's one of
5:34
the most dynamically changing fields. So
5:36
naturally you're going to expect a
5:38
need for real-time data streams
5:41
to translate these things into model
5:45
>> Yep. Awesome.
5:49
Is the sales process roughly to either
5:52
do something in public such
5:55
that researchers reach out to you,
5:56
already know researchers, or get
5:59
intros or send cold emails to researchers
6:02
to get a pilot? Is that
6:04
roughly the first step?
6:06
Yeah, but our appetite
6:09
for data is voracious and expanding, but
6:11
there's still only 24 hours in a day, and
6:13
a researcher's job is not to talk to
6:15
data vendors all the time. So the bar to
6:18
get a researcher's attention is, I think,
6:22
getting higher and higher. For those
6:24
without research sophistication, it's
6:25
simply not the best
6:27
move anymore. The human data supply
6:29
chain is expanding such that one can
6:31
meaningfully participate in it
6:33
without interacting with an end
6:37
>> Cross-selling partnerships, you mean, or
6:40
>> Yeah, Protege is an example. I think
6:44
there are companies out there which
6:48
almost train companies to produce good
6:50
post-training data and then sell that
6:53
data to researchers themselves, using
6:55
themselves as a stamp of approval.
7:02
What are the most
7:05
important things that labs look for?
7:07
What are the main reasons that a lab
7:09
might not renew or increase
7:12
their purchase volume after
7:15
they get the first set of data?
7:20
>> Each lab has its own QC processes,
7:23
and they run them internally and
7:24
see. But, you know, some
7:29
data might be quite poor.
7:37
I think there are just a lot of very
7:39
small things that a researcher can look
7:41
at in a data set. Maybe a data set
7:43
is a million lines, and they notice
7:45
something that is off about one line, and
7:46
they really start to question whether a
7:48
startup even has a QC process at all in
7:50
the first place.
7:55
I think the most
7:57
common case is just that your tasks are
8:00
incredibly poorly designed,
8:03
in that they're easily reward-
8:04
hackable, the prompts are vague, they're
8:07
not emblematic of real-world tasks from
8:11
There are a lot of other, smaller things
8:14
afterwards too, like when the model failures
8:16
you identify, and the reasons why the
8:19
model fails at them, are not actually
8:20
genuine capability failures. Instead, it's because
8:22
you designed the harness quite poorly.
8:24
Or it's because you ran it in a very
8:27
specific environment that isn't actually
8:28
how most users would have run this task
8:30
or model. You don't do cross-
8:32
harness testing. You don't do n-gram
8:34
contamination testing, which is to
8:36
say you don't test whether a data set is
8:38
already in the pre-training corpus
8:39
or the literature. There are a
8:43
lot of QC checks that one can run
8:47
before delivering, say, off-the-shelf (OTS) RL data that
8:51
would make
8:54
you a great partner that researchers
8:56
would want to work with on iterating
8:57
the shape of post-training data.
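The n-gram contamination test described here can be sketched in a few lines of Python. This is a minimal illustration, not any lab's or vendor's actual check: it assumes word-level n-grams from a lowercase whitespace split, an in-memory set as the corpus index, and an arbitrary 50% overlap threshold. The function names are hypothetical, and a real pre-training corpus would need a hashed or suffix-array index rather than a Python set.

```python
from __future__ import annotations

from typing import Iterable


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Word-level n-grams of `text`, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int) -> set[tuple[str, ...]]:
    """Union of all n-grams in the reference corpus (only viable for small corpora)."""
    index: set[tuple[str, ...]] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index


def contamination_rate(item: str, index: set[tuple[str, ...]], n: int) -> float:
    """Fraction of the item's n-grams that already occur in the indexed corpus."""
    grams = ngrams(item, n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)


def flag_contaminated(items: list[str], corpus: Iterable[str],
                      n: int = 8, threshold: float = 0.5) -> list[int]:
    """Indices of items whose n-gram overlap with the corpus exceeds `threshold`."""
    index = build_index(corpus, n)  # built once, reused for every item
    return [i for i, item in enumerate(items)
            if contamination_rate(item, index, n) > threshold]


corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
items = [
    "the quick brown fox jumps over the lazy dog",
    "a completely novel task prompt with no overlap at all here",
]
print(flag_contaminated(items, corpus, n=4))  # [0]
```

Flagged items would then be dropped or rewritten before delivery; `n` and the threshold are tuning choices, not standards.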
9:02
And the QC is both researcher-
9:05
flagged issues and also the
9:07
data teams reviewing things?
9:12
Yeah, I think
9:16
certainly researcher needs are bespoke
9:18
for certain projects, but if a
9:22
researcher is exploring a net-new
9:26
research question, like how do we improve
9:27
taste in models, and they're exploring
9:30
different OTS RL data sets, along
9:34
with maybe a data-company-provided
9:36
benchmark, to explore this question,
9:38
there are just so many things that
9:41
they look for in a typical data delivery
9:43
that are really difficult for
9:46
you to know unless you've worked in
9:48
the data industry yourself or you've
9:49
been a researcher and you understand
9:50
what good RL data is, right?
9:53
>> Good data taste, good research taste,
9:56
and perceived ability to scale quality
9:58
with quantity are the three things that
10:00
are necessary to building a good human
10:04
>> Good data taste, good research taste,
10:05
and what was the third one?
10:06
Perceived ability to scale quality with quantity.
10:09
>> Yeah. Cool. When a
10:13
researcher has a question like that,
10:15
where they're in some new kind of area
10:17
that's vague, like how do they get the model
10:19
to do a certain thing,
10:23
what do researchers do first to explore
10:25
off-the-shelf offerings? Will
10:28
they just hit up all their existing
10:30
vendors, or do they go to
10:32
their data team people
10:34
internally and ask them to go and get a
10:36
set of options for them? What are
10:39
the first steps that the
10:43
Yeah, I'd say they do some bespoke reach
10:46
out themselves, especially for new
10:49
research team directions like OpenAI's
10:53
newest robotics VLA direction, which spun
10:56
out of Sora. Or, well, I wouldn't say it
10:59
spun out of Sora de novo, but it was sort of
11:01
combined with the remnants of Sora. They
11:03
will go out to real-world data vendors
11:08
and reach out to get samples, because
11:11
you've got to remember their job is to
11:12
improve model capabilities, and if that's
11:14
the bottleneck, they won't go and solve
11:16
the bottleneck themselves, which is why
11:18
these labs have human data teams. It's
11:19
one, to procure the necessary data
11:22
and manage vendor relations, and two,
11:24
to negotiate on price
11:27
>> and all those things
11:29
associated with that. So it's a collaboration
11:32
between those two entities.
11:33
>> Cool. In the robotics data space,
11:37
is there anything that you view as
11:39
underserved there? Obviously there are a lot of
11:42
things that are well served, but what are
11:43
the ones you view as underserved in
11:46
>> Data vendors who are genuinely research-
11:48
first and run post-training
11:49
experiments on their own data, if
11:51
they're trying to sell things like egocentric
11:55
data up the training-mix pyramid to
12:00
VLA companies.
12:06
Just running a lot of training
12:09
experiments on their data that match the
12:11
research direction of companies they're
12:12
trying to sell to in order to be a bit
12:14
>> Selling egocentric data to VLA companies?
12:20
It is underserved in the sense that there
12:25
>> My naive view is that everyone is
12:27
selling egocentric data.
12:29
>> Well, there are very few egocentric data
12:31
vendors that actually know how the
12:33
downstream training is done.
12:36
>> So it's egocentric data vendors who are
12:37
doing their own post-training and
12:39
therefore have good research taste?
12:42
>> Yeah. But you've got to remember
12:45
what is being sold when you sell data in
12:46
the first place is just model capability
12:48
improvement, right? And data is just the
12:50
medium to do that. So if you're trying
12:51
to sell data, but you're not actually
12:53
cognizant of how model improvement is
12:56
achieved or you don't have an opinion
12:57
there and can't really help the
12:59
researcher with that, you are in a
13:01
losing battle, losing market share, and
13:04
you are going to be commoditized.
13:10
What does the initial
13:12
meeting look like? Someone talks
13:14
to a researcher, the researcher maybe
13:16
requests some samples, and the founder
13:18
sends them via Google Drive. What does
13:21
that typically look
13:23
like? What formats are people using?
13:26
>> For RL data, for the longest time, it's
13:28
literally just been a Docker container.
13:31
>> A Docker container: an isolated
13:33
environment, all the tools on there, all
13:35
the verification mechanisms and rubrics.
13:37
One simply has to plug and play
13:40
their agent, and then you get an eval
13:44
score, and then you can use a multitude
13:45
of these software containers to run
13:47
rollouts for GRPO RL.
13:49
>> Or whatever other training mechanisms.
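To make this delivery format concrete, here is a sketch of the two pieces inside such a container that matter most for training: rubric-based scoring of an agent's final answer, and the group-relative advantage that GRPO computes from several rollouts on the same prompt. The Docker packaging and the agent itself are out of scope. `RubricTask`, `RubricItem`, the string-matching checks, and the example task are all hypothetical stand-ins for real verifiers, not any vendor's format.

```python
from __future__ import annotations

import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass
class RubricItem:
    description: str
    weight: float
    check: Callable[[str], bool]  # hypothetical verifier for one criterion


@dataclass
class RubricTask:
    prompt: str
    rubric: list[RubricItem]

    def score(self, answer: str) -> float:
        """Weighted fraction of rubric items the answer satisfies, in [0, 1]."""
        total = sum(item.weight for item in self.rubric)
        earned = sum(item.weight for item in self.rubric if item.check(answer))
        return earned / total if total else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward against its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary here
    return [(r - mean) / (std + eps) for r in rewards]


task = RubricTask(
    prompt="Report the Q3 invoice total and cite the source file.",
    rubric=[
        RubricItem("states the correct total", 2.0, lambda a: "1,250" in a),
        RubricItem("cites the source file", 1.0, lambda a: "invoices_q3.csv" in a),
    ],
)
# One group of rollouts for the same prompt, as an agent might produce them.
rollouts = [
    "Total is 1,250 (from invoices_q3.csv)",
    "Total is 1,250",
    "Total is 1,300",
]
rewards = [task.score(a) for a in rollouts]
advantages = group_advantages(rewards)
```

GRPO uses the group mean as its baseline instead of a learned value function, so a rollout earns a positive advantage only by beating its siblings on the same prompt; if every rollout scores the same, all advantages are zero and the prompt contributes no learning signal.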
13:53
>> And do labs have more
13:54
sophisticated internal
14:00
setups for running environments now
14:04
that need different formats?
14:08
>> Yes, they do. Anthropic notably has one
14:11
whose name I can't disclose, but the
14:14
most sophisticated labs, I would say, are
14:16
Anthropic, OpenAI, DeepMind, and
14:21
then everybody else, and then Chinese
14:30
And then, in terms of the data
14:35
frontier labs are buying,
14:38
are they buying more off-the-shelf
14:41
data that companies have
14:42
already sold to Anthropic and OpenAI on
14:45
a non-exclusive basis?
14:50
>> I think OTS is a relatively new
14:53
phenomenon. It is the mechanism by
14:56
which Surge has done business for a long,
14:59
long time, but that's because Surge
15:01
is a very fundamentally different
15:03
company from all the other venture-backed
15:04
players. I believe
15:07
>> How so, by the way, on that?
15:09
>> Oh, they genuinely started off as just
15:11
caring about model
15:12
capability improvement, right? Not
15:17
as a sort of data company, and for the
15:19
longest time, mostly SFT data. As for
15:26
exclusivity, which you're asking about:
15:30
certainly exclusivity reflects different
15:32
labs' philosophies towards data vendors.
15:35
Anthropic is the only one who I think
15:37
really pushes for exclusivity. OpenAI has varied
15:40
at different points throughout its human
15:42
data team turnovers, since a lot of
15:45
people shift
15:48
around at OpenAI. But Anthropic
15:51
genuinely views its data vendors as
15:54
research partners, and if you think your
15:57
research partner is doing genuinely novel
15:59
research, you probably want to get
16:01
exclusivity on that.
16:04
>> Which is the approach that they've
16:08
employed with many of the data companies
16:09
they've worked with.
16:10
>> Yep. Do they have expiry clauses on the
16:12
exclusivity, like 12 months or 24 or
16:15
something like that? I'd imagine they
16:16
are starting to think about that pretty
16:18
closely now. But I am aware of many
16:20
companies who have recently just ended
16:24
Anthropic exclusivity. Some of them had
16:26
the agreement that they would only have
16:27
it for a year. With others,
16:30
for other strategic reasons, they've
16:32
stopped exclusivity. So
16:42
are almost all purchase decisions researcher-
16:44
led at this point, where
16:47
the researcher pulls it in and then
16:48
the data team is effectively
16:52
a form of procurement, or
16:57
is it different? You can imagine it's a
16:59
partnership of sorts,
17:02
but if you want to think about it from
17:04
an economic-buyer perspective, in
17:07
general B2B sales you always want to be
17:09
dealing with the end user,
17:11
not the economic buyer, right?
17:14
If you can convince the end
17:16
user of your product that there's
17:17
substantial value, the
17:19
question is not
17:21
whether the org is going to buy it or
17:22
not. It's just how much they're going to buy.
17:25
>> And so if you're going to the guy
17:28
who's pricing it first, who doesn't know
17:30
how valuable it is, doubtlessly it's
17:34
going to be a harder sell than if you
17:36
had convinced the end user that it's
17:37
valuable first, right?
17:40
>> Decagon, Sierra, and Ramp: what
17:43
kinds of data are they buying
17:45
relative to the frontier labs?
17:47
>> Voice data. Ramp is not so much
17:51
buying data. Actually, the Ramp Labs
17:53
report came out the other day, and I
17:57
was surprised at a couple things. One, I
18:00
really love the fact that you've got
18:01
really sophisticated elite engineering
18:03
or app-layer companies out there
18:05
post-training their own small models for
18:06
their own use cases. But I was surprised
18:08
that they used a synthetic data set to
18:11
inform some of the environments in which
18:13
they were training, like accounting-level
18:15
transactions. If they're an app-layer
18:17
company, they should have access to that,
18:22
which, one, suggests that one could
18:26
sell data to them if they can't use
18:28
their own app-layer data for
18:31
these training environments, and
18:35
two, also suggests that app-layer companies
18:37
may be feasible buyers in the future if
18:39
there's a substantial systematic issue
18:41
that prevents them from using their own
18:43
users' data. It certainly doesn't look like
18:45
it's been a problem for Cursor, though.
18:47
So I'm sure this is just a small
18:51
>> Yeah. Do you think it's a privacy
18:52
thing that they'll just figure out?
18:55
>> I think so. Data
18:57
privacy is very easy to figure out
18:59
nowadays for all these companies.
19:05
Do you have a view on how, as
19:13
labs have their applications and those
19:15
applications give them traces
19:17
that they can train on,
19:21
it evolves, where they still need to
19:24
buy data externally versus training on
19:27
the data from their users?
19:34
Yeah, one would have thought that
19:36
Anthropic has so much data from Claude
19:41
>> that maybe they would not have needed to
19:44
procure from external vendors,
19:46
but they still do. And this
19:50
reflects the fact that most external
19:53
data vendors that are succeeding with
19:54
sophisticated research labs and data
19:56
markets are mostly selling
19:58
capabilities that are N+1 relative to
19:59
current frontier models, right?
20:03
>> Andon Labs, by the way, is a
20:06
fantastic company in this regard in
20:09
terms of producing really hard, realistic
20:11
benchmarks, but a bit too far ahead of its time.
20:15
>> Yeah. Andon Labs is a good example
20:20
of the fact that we're going to
20:24
produce really real-world, long-
20:26
horizon benchmarks that are not going to
20:28
be saturated for a long time, and that is
20:30
quite available to us.
20:31
>> Yep. So if it's already within the
20:33
capabilities of the model then they can
20:36
train on it from their traces but if
20:38
it's not, and no user is going to attempt
20:40
it with the model, then they don't have any
20:41
traces to train on. And this is from a
20:44
purely single-axis, performance-based
20:46
perspective, right? As if
20:48
there's only one thing to hill-climb, and
20:49
it's perceived performance. But cost
20:52
and latency are also big questions.
20:55
An Anthropic researcher, I think, told me
20:58
at some point that our benchmarks are really
21:00
not going to index on performance alone, and
21:03
that we'll have prohibitively expensive
21:04
AGI in some sense, but how much does
21:07
it cost and how long does it take to do
21:10
something are going to be
21:12
>> new dimensions of benchmarks. So
21:16
then you expect that env vendors will
21:20
start to build benchmarks that are
21:22
basically performance divided by price
21:25
rather than just performance, essentially?
21:28
>> Perhaps, yeah. And this greatly expands
21:30
the aperture of different niches that RL env
21:33
companies can play in, because if you
21:35
think about the enterprise world, right?
21:37
There are many use cases where I just
21:39
want a much cheaper model at a
21:42
fixed level of intelligence
21:44
>> that is satisfactory for certain
21:46
job functions, right? And even in
21:49
Ramp Labs' recent implementation, in
21:51
their Twitter post, they showed that they
21:53
use a frontier model for
21:55
planning, but they collapse the
21:57
search-and-retrieval function into a small
21:59
model that they post-trained just for that.
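A benchmark that reports performance relative to price, as discussed here, could be summarized in two simple ways: a single score-per-dollar ratio, or a cost/score Pareto frontier that keeps every model no cheaper-and-better rival beats. The sketch below assumes each result has one score and one total run cost; the model names and numbers are made up purely for illustration and are not measurements.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str       # hypothetical model name
    score: float     # e.g. fraction of benchmark tasks solved
    cost_usd: float  # total spend to run the full benchmark


def score_per_dollar(r: Result) -> float:
    """The 'performance divided by price' view of a single result."""
    return r.score / r.cost_usd


def dominates(a: Result, b: Result) -> bool:
    """True if `a` is at least as good and as cheap as `b`, and strictly better on one axis."""
    return (a.score >= b.score and a.cost_usd <= b.cost_usd
            and (a.score > b.score or a.cost_usd < b.cost_usd))


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Results that no other result dominates, sorted from cheapest to most expensive."""
    frontier = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(frontier, key=lambda r: r.cost_usd)


# Made-up numbers for illustration only.
results = [
    Result("big-frontier", 0.82, 400.0),
    Result("mid-size", 0.70, 60.0),
    Result("small-posttrained", 0.66, 8.0),
    Result("small-base", 0.40, 10.0),
]
best_value = max(results, key=score_per_dollar)
frontier = [r.model for r in pareto_frontier(results)]
```

The ratio alone favors very cheap models even when they fall short of what a job function needs; the frontier, optionally filtered to a minimum acceptable score, fits the "much cheaper model at a fixed level of intelligence" framing better.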
22:03
>> How many labs are spending at the
22:06
billion-dollar-plus per year level?
22:12
>> Mhm. How much more than
22:16
that? Anthropic talked about
22:18
their billion-dollar number. Do you
22:19
think it's going to end up being
22:20
closer to three to four billion?
22:24
>> Yeah. I mean, I'd say, honestly, each
22:30
frontier lab, if you're loose with your
22:32
definition of data, spends
22:34
between 10 and 20 billion a year. I think
22:36
I posted about this a while back too.
22:38
>> 10 to 20, if you're loose with your
22:40
definition of data. Yeah. Can you say
22:44
how so? This shouldn't be a
22:46
surprise to anybody, right? Three
22:47
things hill-climb model capabilities:
22:49
compute, data, and talent. And data
22:50
spend is still a drop in the bucket
22:54
compared to compute costs, right?
22:57
I'd say we're generally still supply-
23:00
constrained, in that if you think about
23:02
RL data, or just data in general, that
23:04
meets the quality bar for these labs,
23:07
we're still very much in demand of
23:15
>> Yep. So you're saying 10 to 20 billion
23:20
with eight labs spending that much?
23:23
I think, for some labs.
23:25
>> Is this including salaries of
23:28
data team people, or how
23:30
>> Literal data from external vendors,
23:32
and by the way, most of this
23:34
spend does not actually get satisfied,
23:36
like, I'm sure that there is a data
23:39
budget set aside whose upper limit is
23:40
not actually met because there are
23:43
simply not enough good-quality data
23:46
vendors. I have still
23:48
never seen a data contract get turned
23:51
down by a top lab, if it's good-quality
23:53
data, for budget reasons.
23:53
>> Yeah. What's the delta between the
23:54
billion-dollar number versus the
23:56
10 to 20 billion? Like, what's
23:58
included in the latter that's not
23:59
included in the former?
24:01
>> I would say body-shop-type data
24:03
labeling that's very emblematic of
24:04
what Scale used to do, and
24:07
what many people still think the data
24:09
industry is, which is just manual
24:11
data labeling for pre-training data.
24:14
>> So then that would be like 70
24:16
billion-plus in aggregate.
24:21
What's the ballpark
24:28
>> Surge is at a two-to-three-billion
24:30
run rate, I'm pretty sure.
24:32
>> Yeah, where does the
24:34
gap come from? Like, if Surge is
24:36
the leading provider and they're doing two to
24:38
three billion, versus 70 billion in aggregate spend?
24:43
>> There are so many companies that
24:44
participate in data markets that you
24:46
would have never even expected, just a
24:48
big, massive long tail, basically.
24:50
>> Yeah, it's a very massive long tail.
24:55
>> Yeah, staffing agencies as well. And
24:58
this encompasses a lot of
25:02
the spend that OpenAI and Anthropic
25:05
directly have, like acquiring companies
25:07
from the real world too, just for data.
25:10
>> Which, I don't know why,
25:12
is happening more and more, and
25:13
people are not discussing this very
25:17
>> This is like acquiring little
25:19
wet labs and that kind of stuff?
25:21
>> Like app-layer companies in certain
25:23
domains that they're interested
25:25
in building products in, right.
25:28
>> I can't name them specifically.
25:32
>> Enterprise-software-type small app
25:36
>> Yeah, you could say that, with
25:37
network effects from having, I don't
25:40
know, 10 to 15 years' worth of user
25:42
activity, like a Stack Overflow-type
25:44
thing. So the data markets, as
25:50
exemplified by Mercor and these
25:52
companies, represent the tip of
25:54
the iceberg in terms of the
25:57
entire long tail of companies where
25:59
procured data actually comes from.
26:00
>> Cool. As a last question:
26:05
what makes inference providers and
26:06
neoclouds a good fit to acquire RL env
26:10
companies is that they basically act as
26:11
implementers to the enterprise partners
26:14
that are their customers.
26:16
>> They are the
26:18
compute providers for labs as well.
26:20
It naturally makes sense that they want
26:22
to do horizontal product expansion and
26:24
bring post-training infrastructure
26:26
>> and tooling alongside their product
26:29
offering to labs. Baseten, actually, I think
26:31
it was Baseten, made an RL env acquisition
26:34
>> around December or January that
26:36
very few people are talking about, so
26:38
there's precedent, and I think some of
26:40
the sophisticated RL env companies are very
26:43
good acquisition targets for this,
26:45
>> both to help themselves with enterprises and
26:51
>> to build out their post-training
26:53
infrastructure product suite.
26:56
>> And the end customer for the post-
26:57
training infra is mostly non-top-
27:01
three frontier labs, just like other
27:04
>> Yeah. Like app-layer companies.
27:07
>> Like, for a while,
27:09
Perplexity and Cursor were more than 50%
27:11
of Fireworks' revenue, for example.