Full Transcript

Google DeepMind Pre-Training Lead: How To Land a Job at a Frontier Lab | Vlad Feinberg

1:04:05 · English · Transcribed Jun 16, 2026

0:00

Every single time you go up for a

0:01

pre-training run, you're about to put in

0:03

more flops into this run than you've

0:05

ever done before. This is Vlad Feinberg.

0:08

He's Google DeepMind's pre-training area

0:10

lead. And I asked him all about how to

0:12

get a job at a frontier lab.

0:14

>> That was a particular skill that I see

0:16

voracious demand for across all the

0:18

different labs. The research skill set

0:21

is going to become increasingly

0:23

important. If you do the scaling book

0:26

exercises and, you know, send me a video

0:28

of yourself doing them, I would love to,

0:30

you know, interview you. Here's the full

0:33

episode.

0:37

You wrote this post that was titled,

0:40

"How to get a job at a frontier lab."

0:42

What are the skills that are kind of in

0:45

demand in frontier labs? Maybe we can

0:47

talk about the shape of the work.

0:49

There's quite a range of different

0:52

things that frontier labs require

0:56

at this point. LLMs are artifacts that

0:59

are connected to uh research and product

1:03

in ways that machine learning really

1:06

hasn't been as connected to before. And

1:10

so it it really touches on so many

1:12

different things. The goal of my post

1:14

was to propose just a couple tangible

1:18

directions in which labs could require a

1:21

certain set of skills, not not to be

1:23

fully exhaustive. And really the ones

1:25

that I I dive into have to do with uh

1:30

kernel development and a low-level

1:32

engineering to accelerate

1:35

the runtime for these LLMs uh in

1:38

practice. And so that that was a

1:41

particular skill that I see voracious

1:43

demand for across all the different labs

1:46

and uh among different projects within

1:49

the labs. So that that seemed like a

1:51

very sharp one to call out as uh an

1:55

overall need. Uh and so specifically

1:59

whenever we're doing a research project

2:02

that involves changing the architecture

2:04

for the neural net in a particular way

2:07

or rethinking how we might do serving to

2:11

uh you know do better KV caching or

2:13

something like that again across the

2:16

stack you just need to be able to

2:18

implement these new techniques in

2:20

efficient ways and

2:23

uh the inner loop of all of these

2:26

different changes is creating software

2:28

artifacts that can function at large

2:30

scales with high throughput, low

2:33

latency. Uh, and this is just

2:35

fundamental work that's tied to

2:38

classical backend engineering thinking.

2:40

Uh, so yeah, it seemed like a very open

2:44

thing for people to specialize in. My

2:46

friends that work at OpenAI and

2:48

Anthropic, there's this distinction of

2:52

an applied org and the research org and

2:56

I was wondering if DeepMind has a

2:58

similar uh distinction and if you could

3:01

speak about what that difference is. So

3:04

we we have different focus areas and

3:07

like you know for instance within GDM

3:10

there's a team that focuses on how uh we

3:15

can use our Gemini LLMs to better inform

3:19

search results and so like that might be

3:22

some you know you know in some way like

3:25

an applied version of the LLMs but I I

3:29

am hesitant to you know make a very

3:32

sharp distinction here because there's

3:34

so much actual like hard research that

3:37

has to go into this kind of level of

3:40

product integration like specifically

3:42

for the one I mentioned uh quite a lot

3:44

of work goes into making sure that these

3:46

LLMs are factual and can cite sources uh

3:50

to have very precise grounded answers

3:53

assessing the quality of these sources

3:55

to make sure that you're not referring

3:57

to anything that's like sarcastic or a

3:59

joke.

4:01

This is uh I guess a good example of how

4:04

even in like product specific quote

4:07

unquote applied AI verticals you're

4:09

still doing research. Uh that being

4:11

said, there's definitely

4:14

what I would say is like very classical

4:16

LLM research teams, pre-training,

4:18

post-training.

4:20

These are things that are still

4:22

standalone uh teams inside of GDM that

4:25

are focused on

4:27

what I would say is like you know

4:28

creating SOTA models, you know, pure

4:31

research.

4:33

Again, the caveat is the the pure

4:36

research that we do like the extent that

4:38

it matters is the extent to which we can

4:39

realize it. And so, you know, we're just

4:42

as responsible with uh delivering these

4:46

models and making sure they train stably

4:48

and actually being like the SREs of

4:52

sorts for the training run to make sure

4:53

that the model training is going

4:55

smoothly. Uh, as we are for coming up

4:57

with the recipes to make these LLMs and

4:59

you can't separate those two roles. It's

5:01

it's really crucial to kind of wear both

5:04

of those hats. So

5:07

yeah, I think you can you can draw up a

5:09

spectrum between research and applied.

5:11

Uh but uh no matter what in today's

5:13

world, I think uh everyone needs to be

5:17

fluid across that spectrum. I noticed

5:20

there's also another spectrum of

5:22

software engineer to pure AI researcher

5:25

and like how do you think of that

5:27

spectrum like software engineering

5:29

versus like AI researcher roles? So I

5:32

guess in um in in my case specifically I

5:36

think a lot of what we do and a lot of

5:41

the new techniques that we develop

5:45

the groundwork is laid in infrastructure

5:49

investment. So um I can walk through

5:53

what my team does a little bit more

5:55

detail uh later but one of the verticals

5:58

is uh distillation and in order to do uh

6:03

distillation it's it's some way of of

6:06

transferring the knowledge or some form

6:10

of statistics about the underlying data

6:12

set through a teacher model into the

6:15

student model to make the student model

6:16

better than if it hadn't ever seen these

6:18

auxiliary statistics from the teacher.
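
[Editor's sketch] The speaker does not say which teacher statistics his team stores. One common choice, assumed here purely for illustration, is the teacher's per-token output distribution, with the student trained to match it through a KL-divergence term. The function names and the toy shapes below are hypothetical.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) across token positions.

    Assumption: the 'teacher statistics' are full logits over the vocabulary.
    A production system might store something more compact.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher log-probs
    log_q = log_softmax(student_logits, temperature)  # student log-probs
    kl_per_token = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(kl_per_token.mean())


# Toy data: 4 token positions, vocabulary of 8 (made-up sizes).
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 8))
student_logits = rng.normal(size=(4, 8))

loss_untrained = distillation_loss(student_logits, teacher_logits)
loss_matched = distillation_loss(teacher_logits, teacher_logits)
```

The loss is zero only when the student reproduces the teacher's distribution. Computing `teacher_logits` for every training token is the part that becomes expensive at the scale described in the transcript.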

6:21

And when you're talking about statistics

6:24

derived from a massive LLM applied to

6:27

trillions and trillions of tokens, uh

6:30

you're talking about a level of flops

6:33

investment that you know is, you know,

6:36

millions and millions of dollars. And

6:40

that in turn means that you have to be

6:44

able to think through how do you uh

6:48

optimize the system to be as efficient

6:50

as possible because every operation that

6:52

we're performing is is multiplied by

6:54

such a large factor that yeah every

6:56

second counts, every byte of storage

6:58

counts and quite a bit of that work is

7:05

you know good old-fashioned software

7:06

engineering. And so uh in particular the

7:10

infrastructure for distillation has

7:13

evolved

7:15

through maybe three to four generations

7:18

at this point. And in each one we've

7:22

taken a step back looked at what kind of

7:26

research methods have we been applying

7:27

for distillation holistically thought

7:30

about how do we broaden what the

7:33

infrastructure is capable of. And

7:36

there's definitely a couple discrete

7:37

points where rethinking the system

7:41

design of how we perform distillation

7:44

enables us to do research on

7:47

distillation methods much more quickly.

7:49

And so it's this kind of investment that

7:52

like okay this like four month or

7:54

whatever rewrite of our distillation

7:57

infrastructure uh then results in a

8:00

dramatically new understanding of uh

8:03

distillation scaling laws that

8:06

translates to really strong models. So

8:09

it really requires just work across the

8:12

stack and I you know I can't yeah I

8:15

can't imagine that we would have gotten

8:16

results like Flash 3.0, you know, without

8:19

having made those distillation

8:21

infrastructure investments that are at

8:23

the end of the day things that started

8:25

with a good old-fashioned design doc and

8:26

thinking about what the right

8:27

abstractions are for uh generating these

8:30

teacher statistics coming up with the

8:32

right storage system for them thinking

8:34

through what could support uh the

8:36

reading and writing across uh multiple

8:39

different data centers at this scale

8:42

really classical distributed systems

8:44

problems

8:45

>> yeah I mean it sounds like there's

8:46

there's a lot of software engineering

8:48

backend infra type problems

8:51

given just the scale of the compute at

8:53

this point. It still feels like though

8:55

there at some point in that spectrum

8:58

there is some crossover where there's

9:00

these new skills like somewhere where if

9:03

you had you took a arbitrary backend

9:06

engineer and you placed them to I don't

9:09

know adjust the model architecture or

9:11

something like there that is like a bit

9:13

of a jump more than the infra work. Um

9:16

like how do you see that distinction?

9:18

>> Yeah. So I think there is a crossover

9:20

point in terms of doing research where

9:24

research is an endeavor where the

9:30

payoffs become a lot higher risk higher

9:33

reward and we have this notion of uh

9:38

kind of research taste which is you know

9:40

some high level intuition about what

9:43

path you should be proceeding through

9:44

the DAG of the multiple uh different

9:47

milestones that you need accomplish in a

9:49

particular project.

9:51

In some sense, we can view software

9:53

engineering projects through a similar

9:54

DAG where you know you have all of these

9:56

intermediate artifacts that you want to

9:58

hit in a software program to uh get to

10:02

the final result. But in the software

10:05

engineering case, the DAG is more or

10:07

less deterministic where you you know

10:10

build one service then a different

10:11

service then a third service and you

10:13

know you figure out your storage

10:14

infrastructure layer first uh that kind

10:17

of thing and you can just make monotone

10:18

progress. But in the research case, you

10:21

have to uh kind of explore this DAG

10:24

which is now stochastic because some of

10:26

the nodes which might be some research

10:28

ideas or some you know aspect of getting

10:31

to a final goal uh may or may not work

10:34

out

10:35

and I think that requires a bit of a

10:38

mindset shift

10:40

and that that kind of mindset shift

10:42

takes a while to learn and it takes

10:44

specialized skills to learn. uh this

10:46

would be the kind of skills you pick up

10:47

in a PhD. For instance, one succinct way

10:51

I could put it, there's a really

10:54

excellent post by this uh professor

10:57

Jacob Steinhardt and I I love to frame a

11:01

lot of the research work that I do in

11:02

this way and it's research as an MDP. So

11:05

MDP here: Markov decision process. Uh, it's

11:09

again we have this high-level idea of a

11:12

stochastic dependency graph between

11:15

different milestones in a research

11:16

project where you might need to have a

11:19

pertinent certain kind of result or

11:20

prove a certain kind of theorem before

11:22

you get to a certain kind of conclusion.

11:24

Uh similarly for a machine learning

11:26

research project you might need to have

11:28

this and that featurization working

11:29

before you can get this and that ImageNet

11:32

accuracy or something like that.

11:34

um and expanding those nodes in this

11:37

graph. It's this stochastic endeavor

11:40

where these approaches may or may not

11:42

work out and whether or not one works

11:45

out opens up a set of new possibilities

11:47

for you. And so the approach that you

11:51

might have in the software engineering

11:55

case where you could fully write out

11:57

here are all the paths to the goal

12:00

across walking this graph. what's the

12:02

shortest path to your goal? That

12:04

approach is not optimal in the research

12:07

case because if all of a sudden the

12:10

transitions between the edges in this

12:12

graph become unreliable and uh some of

12:16

the nodes you might not even be aware

12:19

of. It might be a hidden MDP. Then the

12:22

way that you might approach this problem

12:24

would really differ. And in particular,

12:26

you have to factor in the success rate

12:29

and the time investment that you're

12:31

going to be putting into uh these

12:33

different research ideas as well as

12:37

a priori estimating what those different

12:40

rates are. And that's a very different

12:42

exercise than writing up what the you

12:45

know design for your software

12:47

engineering project might be. And it's

12:49

it it's this skill set of of building an

12:51

intuition of how likely an approach is

12:53

to work out without having yet done that

12:56

approach that I think people often you

13:00

know correlate with this uh research

13:02

taste notion.

13:04

But that's exactly the one that you need

13:06

to build up in order to properly uh

13:08

traverse this MDP
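
[Editor's sketch] A toy version of the planning problem described above, under strong assumptions that are not in the conversation: the candidate ideas are independent, any one success reaches the goal, and each idea has a known success probability and time cost. Under those assumptions, trying ideas in decreasing order of probability divided by cost minimizes expected time, which the brute-force search below confirms for this made-up example.

```python
from itertools import permutations


def expected_time(order):
    """Expected time spent until the first idea succeeds (or all fail).

    Each entry is (name, p_success, cost). An idea's cost is only paid
    if every idea tried before it has failed.
    """
    total, p_all_failed = 0.0, 1.0
    for _name, p_success, cost in order:
        total += p_all_failed * cost
        p_all_failed *= 1.0 - p_success
    return total


# Hypothetical research ideas: (name, estimated success rate, weeks of work).
ideas = [("A", 0.9, 10.0), ("B", 0.2, 1.0), ("C", 0.5, 4.0)]

# Greedy rule: highest success probability per unit of time first.
greedy = sorted(ideas, key=lambda idea: idea[1] / idea[2], reverse=True)

# Exhaustive check over all orderings (feasible only for tiny examples).
best = min(permutations(ideas), key=expected_time)
```

Here the cheap long shot `B` goes first even though `A` is the most likely to work. The hard part the speaker calls research taste is estimating those probabilities and costs before doing the work, and the code simply takes them as given.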

13:11

>> for the the research projects and just

13:13

like generally the nature of the

13:14

research work. And it sounds like you're

13:16

you're saying that there's a lot more

13:18

uncertainty here. I'm still trying to

13:21

get a sense of the the nature of the

13:23

work. If you threw this backend engineer

13:26

into a team that's doing research, like

13:29

what are those like concrete examples

13:32

where they fall short?

13:34

Like I think the the very first thing

13:37

that comes to mind is having the right

13:39

context for the research landscape in

13:43

which you're operating. So quite a bit

13:47

of research work involves like almost uh

13:53

this kind of

13:55

um you have to take on this like very

13:57

humble viewpoint of there's been quite a

14:00

lot of investment in related work in the

14:03

past and until I know

14:07

the sum total of humanity's bleeding

14:10

edge in this topic I'm definitely not

14:12

going to be able to further that

14:14

bleeding edge. So building up uh a solid

14:19

understanding of of past work uh in a

14:22

particular area uh and doing that

14:24

related literature review is maybe the

14:26

first thing that I would imagine people

14:28

might stumble on is uh having read and

14:31

having the skills to effectively

14:33

traverse you know historical uh citation

14:37

tree for a particular topic because you

14:40

don't have the time to read all of these

14:42

different papers. You need to build

14:43

up a sense of uh what are the high value

14:46

papers and what are the ways in which I

14:49

can assess if a paper is worth reading

14:50

without fully reading it. That's like

14:53

the first thing that comes to mind as

14:55

the you know skill that people need to

14:57

build up even to be able to read these

14:59

research level papers. You have to have

15:01

a background in

15:04

machine learning in um some you know

15:07

computer science and uh you know

15:11

depending on the paper and depending on

15:12

the domain there might be all sorts of

15:14

prerequisites in terms of like the

15:16

underlying math and coursework that you

15:19

would want to have to properly

15:20

understand. So

15:23

that's that's quite important to be able

15:25

to have a deep understanding of what

15:27

methodology is available because you

15:29

really won't have a lot of hope of

15:32

improving upon the methodology if you

15:34

don't understand what's there already.

15:35

So, so I think I like mentioned earlier,

15:38

one of the things that my team works on

15:40

is is distillation.

15:42

And in order to advance our

15:48

understanding in uh distillation for

15:50

large language models, you have to have

15:53

a good understanding of like what we're

15:55

trying to do with LLMs.

15:58

And uh just to give a cursory overview

16:01

here, the name of the game for LLM

16:04

research is

16:06

especially in pre-training is uh is

16:09

scaling laws. And so what are scaling

16:12

laws? People focus a lot about like you

16:14

know this power law structure and the

16:16

fact that like you you know have this

16:18

and that exponent but like what matters

16:20

is less so the functional form. What

16:22

matters is for a given recipe of scaling

16:27

up your LLM. So as you invest more and

16:29

more flops into the pre-training run of

16:31

an LLM, you have to be able to predict

16:34

what the final test loss of this LLM is

16:38

going to be. And why why do we care

16:41

about this question? Why do we care

16:42

about predicting what our uh

16:44

generalization error is in the classical

16:48

machine learning world? Like say we're

16:50

trying to you know win ImageNet

16:53

we have our test loss which is our

16:55

classification uh error for you know a

16:58

thousand different classes and uh you

17:01

run your VGG or your ResNet proposal to

17:04

get that uh classification error that's

17:06

an estimate of how well that model does

17:08

at classifying

17:10

amongst those thousand classes various

17:13

different images. we can estimate how

17:16

good our method's going to be by taking

17:18

a validation set and then whenever we

17:20

have an architecture idea for a neural

17:21

net we just train it and then we uh do a

17:24

bunch of uh validation set runs and we

17:27

get a cross validation error that is

17:28

itself an estimator of our final test

17:30

error and so in this way you can just

17:32

iterate on different ideas uh through

17:34

this process but what's different in LLM

17:37

world is every single time you go up for

17:39

a pre-training run you're about to put

17:42

in more flops into this run than you've

17:44

ever done before. So it's in some sense

17:47

like a one-shot version of this ImageNet

17:49

problem. You never get to see the full

17:51

ImageNet training data set. You have to

17:53

practice on MNIST and then CIFAR and then

17:55

maybe based off of those you try to come

17:58

up with a method that just works right

17:59

off the bat on ImageNet. And if you were

18:03

to just do that by itself, as I'm sure

18:05

many people have tried, like certainly

18:07

when I was learning how to do all of

18:09

those different things, you get

18:10

something, it works really great on

18:11

MNIST, it maybe even works on CIFAR, and

18:14

then all of a sudden it breaks on

18:15

ImageNet. You'll find out that like

18:18

things don't just generalize easily

18:19

across scale like this. And so much of

18:23

what we do for LLMs is coming up with

18:25

recipes where a recipe is this function

18:28

that goes from number of flops you'd

18:30

like to train on to a training routine

18:32

for this LLM. And if you can couple this

18:36

recipe with a prediction rule that can

18:39

predict accurately what your LLM accuracy

18:41

is going to be, then um you're able to

18:44

make decisions about how to improve your

18:46

recipe because you can use that

18:48

prediction.
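
[Editor's sketch] A minimal illustration of "recipe plus prediction rule". Everything here is an assumption for the sake of example: the recipe uses the common approximation that training compute is about 6 x parameters x tokens together with a roughly 20-tokens-per-parameter rule of thumb, and the prediction rule is a pure power law fitted to synthetic small-run losses. The speaker stresses that the functional form matters less than predictive accuracy, so a real fit would likely look different.

```python
import numpy as np


def recipe(flops, tokens_per_param=20.0):
    """Hypothetical recipe: FLOP budget -> model size and token count.

    Assumes flops ~= 6 * params * tokens and tokens = tokens_per_param * params.
    """
    params = np.sqrt(flops / (6.0 * tokens_per_param))
    return {"params": params, "tokens": tokens_per_param * params}


def fit_power_law(flops, losses):
    """Fit loss ~= a * flops**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(flops), np.log(losses), 1)
    return np.exp(intercept), -slope


def predict_loss(a, b, flops):
    """Extrapolate the fitted power law to a new FLOP budget."""
    return a * flops ** (-b)


# Synthetic "small runs" generated from made-up constants, not real results.
small_flops = np.array([1e17, 1e18, 1e19, 1e20])
true_a, true_b = 50.0, 0.05
small_losses = true_a * small_flops ** (-true_b)

a, b = fit_power_law(small_flops, small_losses)

# The big run uses more FLOPs than any run so far, so its loss is extrapolated.
target_flops = 1e22
config = recipe(target_flops)
predicted = predict_loss(a, b, target_flops)
```

Because the data is noise-free, the fit recovers the generating constants exactly. With real runs the interesting question is how far the extrapolation can be trusted, which is what makes the recipe and the prediction rule worth studying together.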

18:49

That is all a ton of context on what uh

18:52

LLM research looks like in general. But

18:55

that's like an understanding that we got

18:57

to that we even thought was feasible

19:00

thanks to so much uh

19:03

initial LLM scaling work that we've seen

19:06

across the Kaplan paper across

19:09

Chinchilla.

19:11

Since those two papers, there's been a

19:13

lot more work in terms of like what

19:15

other factors are there beyond uh number

19:18

of params and um number of tokens that

19:21

you train on that influence your

19:23

prediction accuracy uh like number of

19:25

unique tokens for instance. But like I

19:30

would say like those two foundational

19:31

papers for LLMs uh those are informed by

19:35

uh an even even longer line of uh

19:38

different uh scaling works uh going back

19:40

to like say the original uh GPTs and

19:43

then Google has had a ton of scaling

19:44

work across its PaLM papers. This is

19:47

just a set of works that have informed

19:52

that viewpoint that I described earlier

19:54

that

19:55

you you kind of just need to build up by

19:58

having gone through that literature

20:00

review yourself. If you were, for

20:02

instance, if uh you were trying to pick

20:05

someone that was going on your team and

20:08

the the way that you would judge their

20:10

fitness to help you push the frontier is

20:13

their understanding of the frontier,

20:15

including the existing literature, which

20:18

requires all these prerequisites. I think

20:20

you called it mathematical maturity in

20:22

your post.

20:24

>> Yeah. So I think I I I think it's easy

20:29

to read and understand those papers once

20:30

you have mathematical maturity.

20:33

So I guess the ones I mentioned in

20:36

particular nowadays they're table

20:37

stakes. So I I would expect candidates

20:39

to be familiar with them. Um I think um

20:45

the the general skill set is being able

20:47

to dive into

20:50

uh a paper of that level and then

20:52

understanding it.

20:54

uh you know being able to take a

20:57

research idea uh from a paper and

21:00

implementing it yourself like that's

21:02

that's just a a very important skill set

21:05

to be able to have like we get you know

21:08

all sorts of uh different ideas

21:10

presented you know they might not all

21:13

directly apply to our domain but if you

21:15

can deeply understand them then you can

21:17

iterate on them and you can improve them

21:18

inside of uh inside of our domain and so

21:22

when we assess for people who can work

21:26

with the mathematical concepts in these

21:27

machine learning papers. That's that's I

21:30

guess the the key skill there that would

21:32

be evidence that you can go pick up this

21:34

arbitrary paper and see to what extent

21:38

these ideas carry over uh in the Google

21:40

setting. This probably won't be

21:42

exhaustive, but I'd be curious to hear

21:45

other domains that maybe people could

21:47

dig into to see what kind of matters in

21:51

frontier AI research. So you'd mentioned

21:53

distillation, you also mentioned

21:55

kernels. It sounds like kernels are

21:56

helpful everywhere. Um, but are there

21:59

other areas that come to mind if you

22:00

were to just rattle off areas that are not

22:03

necessarily exhaustive?

22:05

One thing that I think is is quite

22:07

powerful is uh actually

22:10

programming language research. So by

22:13

looking into how we can create

22:15

abstractions at the programming language

22:17

level, we could facilitate kernel

22:19

development. I think ThunderKittens is a

22:21

really good example of this. Like coming

22:23

up with an ab an abstraction that allows

22:25

you to write kernels through four

22:27

functions instead of arbitrary globs of

22:30

C++ code uh allows you to move really

22:33

quickly uh in uh developing algorithms

22:37

that fully utilize hardware.

22:40

So like it at that point it's um it's

22:44

not about the PL research itself. It's

22:46

about having a passion for

22:49

you know these kind of programming

22:50

language abstractions and and working

22:52

with low-level hardware um you know uh

22:55

people who you know are interested in

22:58

and will try to work with like CuTe DSL,

23:02

this kind of thing where there's a lot

23:04

of hardware specific uh domain specific

23:07

languages one other thing that comes to

23:09

mind besides PL and uh scaling law

23:13

literature would be reinforcement

23:15

learning literature. uh so in particular

23:18

ever since uh RLHF uh I think we've seen

23:22

that deep RL algorithms uh like PPO do have a

23:27

place in production systems and you know

23:29

there was a time where that was in

23:32

question but uh now it's you know uh

23:35

pretty unanimous that we see these kind

23:37

of algorithms applied to real production

23:39

systems and

23:42

the uh theory behind that uh you kind of

23:46

have to start with the basics for

23:48

reinforcement learning and work your way

23:51

up to

23:52

you know the myriad uh value based

23:55

methods and and uh policy gradient

23:57

methods that we have today.

24:00

That's that's another domain that I

24:02

think is just like a very rich

24:03

literature tree to crawl. Um, and then

24:06

for more of the backend engineer folks,

24:08

just beyond just the kernels themselves,

24:12

there's I think a pretty fun overlap

24:15

between distributed systems and

24:16

optimization work where uh figuring out

24:20

how to design neural net training

24:23

algorithms that allow for

24:26

training across

24:28

many GPUs.

24:31

There's all sorts of fun challenges

24:34

between asynchronicity, how up-to-date

24:37

your gradients are, how

24:40

pipelining affects the staleness, uh all

24:43

of these system choices that you could

24:45

make in your training algorithm design

24:48

will impact convergence and the final

24:50

quality of your neural net and uh those

24:52

are things that can be analyzed

24:53

independently of the LLM setting uh and

24:55

have been for a while. So uh you know

24:59

especially if you're kind of more

25:00

infra-inclined then having a good

25:02

understanding of like uh how those

25:05

different algorithms work is

25:07

a really good place to start. Do you see

25:09

any difference between the the demands

25:12

of the different frontier labs? So for

25:14

instance if someone wants to work at

25:16

DeepMind is there like a particular

25:18

area that you see DeepMind cares about

25:21

more than Anthropic for instance? I

25:24

think in terms of the skill set, it's

25:26

probably pretty similar. Yeah, I think I

25:28

think there's maybe differences in like

25:32

business strategy and uh you know the

25:36

set of offerings that's a function of uh

25:40

the specialties of the labs and uh like

25:44

the kind of different uh you know

25:46

customers that the labs could have. Uh

25:48

but uh I would say that there's there's

25:51

quite a lot of overlap between the labs

25:54

in terms of what people look for and

25:56

like yeah like when I posted uh my post

25:58

you would you would see like you know

26:00

people from both OpenAI and Anthropic

26:02

saying like yeah like we agree with this

26:04

advice and so you know I I think um that

26:07

that's just a little bit of evidence

26:09

towards that. I think one reason for the

26:11

the huge demand for wanting to go closer

26:14

to AI research is because people are

26:17

thinking oh software engineering is not

26:18

going to be as important in the future.

26:20

Is there a similar thought in when it

26:23

comes to research where LLMs are also

26:26

going to handle a lot of that work as

26:27

well? So there's no reason to favor AI

26:31

research versus software engineering. Um

26:34

so I think the the research skill set is

26:37

going to become increasingly important.

26:40

Uh so I would say like being able to

26:43

handle stochastic components in the

26:45

planning of your work is is just going

26:48

to be a larger and larger part of how we

26:53

approach our jobs.

26:56

figuring out how to leverage AI in

26:59

whatever thing you work on, which

27:01

doesn't even have to be software

27:02

related, is just an important muscle to

27:04

start building right away. Um, because

27:07

these components aren't deterministic.

27:09

And thinking about how do I construct

27:11

systems around these LLMs to do my job

27:13

more effectively, uh, that's that's

27:16

going to be the thing that sets you

27:17

apart in the future. And I think that's

27:19

true no matter what you're going to be

27:20

doing. Look, I think I think there's

27:22

there's FUD everywhere, especially with

27:24

with some of the approach to marketing

27:27

that some people have in terms of AI.

27:29

It's FUD that is being intentionally

27:32

leveraged. And so I I feel like people

27:36

should really just focus on themselves

27:38

and and trying to uh be more productive

27:41

themselves. I I don't think that like AI

27:45

is going to replace all of our roles.

27:47

And so the reason for that is that

27:51

one of the important aspects of what we

27:54

do as humans in an organization which is

27:57

really this web of trust

28:00

from like you know this organization

28:04

that is you know this pool of resources

28:05

and this pool of people that manages

28:07

these resources. One of the important

28:10

things that we do is we allocate those

28:12

resources towards c certain goals and um

28:17

even when we can accelerate our

28:21

execution

28:23

there's an element of making decisions

28:26

around how we allocate these resources

28:28

that will always be something that needs

28:31

to be attributable to a human making

28:32

that decision. And uh that's simply

28:35

because you can't hand off blame to AI.

28:40

So we at this point have LLMs that

28:43

really deeply understand law and they

28:45

could, you know, review your contract

28:47

for you or something like that. But they

28:49

can't represent you in court because

28:51

they can't be disbarred.

28:54

And so that's that's I think like a a

28:56

really you know sharp way that I might

28:58

describe like okay this is why the legal

29:01

profession will go on even though LLMs

29:04

are really good at recalling precedent

29:06

is you want to have someone who is

29:10

responsible who can validate the output

29:12

of AI to perform uh legal work more

29:17

effectively for you rather than hand off

29:23

your legal defense to an LLM.

29:26

>> Yeah, I think the FUD that was actually

29:28

the original motivation for your post.

29:30

>> Yeah, I mean I I really think that the

29:33

mindset that people should have is is a

29:35

constructive one. And so there was a

29:38

tweet that I saw I think by Dee that was

29:41

like some long form you know

29:44

fear-mongering about you know uh uh AI

29:48

permanent underclass or something like

29:50

that. And uh it's easy to get stuck in

29:53

that loop, but I think the important

29:56

thing to think about is like we all have

30:00

agency over our future and we can start

30:03

investing in uh skills that matter for

30:07

tomorrow today and um that's that's

30:12

really

30:14

the only thing you should be doing,

30:15

right? Like you know worrying about it

30:16

is not going to not going to help you.

30:18

And so part of why I wanted to write

30:20

this post is is in response to that

30:24

uh because it it it was something that I

30:26

could see echoed. You know, I gave a

30:29

lecture at Princeton a while back and

30:31

you know, a big question that came up is

30:33

like, you know, how do I work at

30:35

DeepMind? And and it's something that like

30:37

uh yeah, just when people find out what

30:39

I do, that's the top question people

30:40

ask. So, I figured it would be helpful

30:43

to add a little bit more constructive,

30:47

you know, direction to the discourse

30:49

here. One last thing on the post, cuz

30:52

you know, if you think about getting a

30:53

role, there's obviously the skills and

30:56

we talked a lot about the skills and

30:58

your fitness for the role, but there's

31:00

also kind of the uh signaling for that

31:03

role and like what is kind of valued if

31:05

you were to be, say, marketing yourself

31:08

to one of these frontier labs. What

31:10

signals um matter most?

31:13

>> Actual evidence that you've created

31:16

something of uh

31:19

of use to other people uh along the line

31:21

of kernels, right? Like you can take any

31:25

of the many open source LLMs that we

31:27

have and optimize them. You don't have

31:30

to make them better in every case. You

31:31

could show that, oh, I have an

31:33

improvement for this and that setting.

31:35

It doesn't even have to be something

31:36

that speeds up the model on GPU. There's

31:39

all sorts of open-source stacks like

31:41

vLLM. There's a lot of other um things

31:44

that you can do besides accelerating the

31:47

LLM inference on device. The serving

31:50

stack that surrounds LLMs is a very

31:53

sophisticated distributed system that

31:55

has to maintain this KV cache memory and

31:59

deal with uh all sorts of like load

32:02

balancing and uh request queuing and and

32:06

very common problems for for back-end

32:08

servers. Uh and these projects are

32:11

always looking for help. So, you know,

32:13

contributions to vLLM or SGLang uh or

32:17

demonstrations with TensorRT uh they

32:20

have, I think, a a a distributed system

32:23

called Dynamo that uh allows for

32:25

disaggregated serving where you could

32:29

show that you you made a project using

32:31

these components, you improve these

32:33

components like that would be an

32:35

extremely positive signal uh for any

32:37

candidate that I'm looking at uh and and

32:40

a very welcome contribution to uh open

32:42

source.

32:43

>> I I think also a lot of what we said is

32:45

kind of assuming the path of external

32:49

hire into frontier lab. Um but a lot of

32:52

these frontier labs have large

32:55

organizations that aren't necessarily

32:57

doing the cutting edge uh frontier work.

33:00

So let's say yeah for instance I mean

33:02

you know Google DeepMind versus let's

33:05

say there's some infrastructure that's

33:07

working on search and they have the

33:09

backend skill set maybe not as much

33:11

domain context and they try to internal

33:14

transfer to Google DeepMind. Does any of

33:16

your advice differ in that kind of case

33:18

for like an internal transfer versus uh

33:21

someone who's coming from external?

33:23

>> there's someone who I worked with

33:26

closely on the search side who actually

33:29

did transfer to my team uh Nate Linds

33:33

and he's amazing and now he owns so much

33:35

of uh like what we do on my team in

33:38

terms of

33:40

inference co-design for uh like Flash

33:43

and Flash-Lite, and I would say like he's

33:46

a really great example of this where

33:49

his approach was you know how do I help

33:53

my PA my product area

33:57

adopt this technology as effectively as

33:59

possible. So I think there's you know

34:02

definitely if you're in an organization

34:07

that isn't directly generating these

34:09

models but in some way trying to

34:11

leverage them there's a very big gap in

34:15

terms of applying these LLMs effectively

34:18

serving them effectively

34:20

within

34:22

uh your organization and becoming

34:25

someone who does that really

34:27

effectively. not only creates a ton of

34:29

value uh in terms of like

34:33

the uh you know specific business need

34:36

for your org which will definitely

34:38

elevate you in your org. Uh but it'll

34:40

also be the case that you're going to

34:42

just naturally become the partner that

34:44

we work with uh on the research side to

34:48

make sure that our models are effective

34:50

within your org. And so at that point,

34:52

you know, you may or may not want to

34:54

transfer. uh definitely if you transfer

34:56

we'd you know be happy to work with you

34:58

but like

35:00

at that point I think you're you're

35:01

you're already doing something that is

35:03

cutting edge which is integrating this

35:05

new technology into uh you know a real

35:08

product that people use and so

35:11

yeah that'd be my advice there is a

35:14

towards the end of this post as we as we

35:16

leave this topic you had the concrete

35:19

invitation because I know you were

35:21

hiring uh do you want to say what that

35:23

was

35:23

>> yeah so I just trying to think of like,

35:26

you know, you know, how do I put my

35:28

money where my mouth is? Um, how do I

35:31

demonstrate, look, this is a good way to

35:34

show that you have, you know, at least

35:36

some evidence of of like the the skills

35:38

that I called out as important, you

35:40

know, intent, mathematical maturity,

35:42

grit. Uh and so I listed out a couple of

35:46

exercises that demonstrate you know some

35:48

initial knowledge of scaling laws, some

35:51

willingness to get into the weeds

35:53

engineering wise in terms of

35:54

implementing a real transformer

35:57

and uh sort of willingness to pick up

35:59

the kind of bread-and-butter

36:01

math that we use uh every day to

36:05

size uh these LLMs and uh you know I I

36:09

won't I won't recall the full list of

36:11

like the exercises that I expected here.

36:13

But like uh you know if you do the

36:15

detailed like uh handwritten uh version

36:18

of the scaling book exercises and you

36:21

know send me a video of yourself doing

36:22

them along with the transformer exercise

36:24

on my post then that's something if you

36:28

can work in the uh New York office I

36:30

would love to you know interview you for

36:33

and quite a few people reached out to me

36:35

about that. I actually already have had

36:37

a couple submissions and we're

36:39

proceeding with the loop with those

36:41

people. So

36:43

yeah, it's it's quite a bit of work, but

36:45

uh impressively I got a response within

36:47

like I think a week of posting. So uh

36:52

it's definitely doable.

36:54

Uh yeah, I mean I don't have unlimited

36:57

headcount. So I mean the offer is on the

36:59

table, but the you know I can only hire

37:02

so many people. The good thing is though

37:05

that is such a strong sign of you know

37:08

self-development

37:09

that uh not only is this a something

37:12

that you should be doing for its own

37:13

sake regardless of whether or not you

37:16

will get a job at at DeepMind

37:18

specifically

37:20

but I think it'll be something that you

37:22

know lets you basically prepare for

37:25

interviews in other places. Uh certainly

37:28

if you reach out to me with these uh

37:30

exercises completed like even if you

37:33

know I do all my hiring there's tons of

37:36

people who I know who are hiring as well

37:38

and I'd be happy to refer people as

37:40

well. OpenAI, Anthropic, Cursor, and

37:44

Vercel all use this product to make

37:46

their lives better. And the problem it

37:48

solves is when you're building SaaS or an

37:51

AI product and you want to sell to other

37:53

companies there's all these requirements

37:55

you need to meet. There's SSO, there's

37:58

SCIM, there's RBAC, there's audit

38:00

logs. These are all things that take

38:02

time to integrate but aren't the main

38:04

focus of your app. WorkOS is an API

38:07

layer that lets you meet all of these

38:08

requirements in just a few lines of

38:10

code. So, let's say you have a new SaaS

38:12

product and you want to sell to other

38:14

companies. WorkOS will solve all of

38:16

these critical feature gaps for you. You

38:19

can check them out at workos.com to

38:21

learn more and get started. And I

38:24

appreciate them for supporting my work

38:25

and sponsoring this podcast. On the next

38:28

topic, I mean, I saw you're the the the

38:31

area lead for pre-training on Gemini,

38:34

and I just thought it might be

38:35

interesting to hear you give um uh kind

38:38

of like a high-level overview of what

38:40

pre-training is in your words and maybe

38:43

what are the high-level challenges in

38:45

the area. We can talk about that.

38:47

>> Yeah. So there's there's quite a lot of

38:49

work that we do in pre-training

38:52

um as an area lead for it.

38:55

The specific things that my team is

38:57

responsible for delivering uh include uh

39:01

the Flash model, the Flash-Lite model.

39:04

These are models that get used for AI

39:06

Overviews and AI Mode in the search bar.

39:09

Uh as well as some other uh 1P models

39:11

that are used by different orgs like ads

39:14

and YouTube.

39:16

Besides this, we're also key technical

39:18

POCs for the uh Google-Apple

39:21

partnership. Uh and so we do technical

39:23

work there.

39:26

Those are the actual like product level

39:28

deliverables uh for my team. Uh beyond

39:32

that, we do research to make sure that

39:36

these deliverables are state-of-the-art

39:38

and also we do general pre-training

39:40

research that contributes to the Pro

39:42

Series model as well. And the nature of

39:47

the research I would say generally

39:48

breaks down into three different

39:50

verticals. There's distillation which I

39:53

mentioned earlier.

39:55

There's what I like to call inference

39:57

co-design. So uh creating neural

40:01

architectures that are efficient uh to

40:04

run inference on. So coming up with the

40:08

network topology, the shapes of the

40:11

matrices that the matmuls uh use uh

40:15

inside of uh uh gating and linear layers

40:18

for this transformer as well as the

40:19

attention shapes, num heads, that kind

40:21

of thing.

40:23

So that that is effectively utilizing

40:24

the hardware that you're serving on. And

40:27

then the final uh pillar here is new

40:31

quantization methods. And so

40:34

quantization is just something that's uh

40:36

been near and dear to my heart that I've

40:37

been working on the research side for

40:39

ever since I joined Google. And it

40:42

really changes what's feasible for uh

40:47

the first two. So uh that's why you know

40:51

furthering the state-of-the-art in terms

40:53

of how you can compress models is is

40:55

also a very important pillar in the

40:58

research that my team does. Generally uh

41:01

uh quantization uh refers to reducing in

41:05

some sense the size that the neural nets

41:08

take up uh in order to represent their

41:10

weights. So typically a neural net when

41:14

you're training it uh is represented as

41:17

a a series of numbers that make up the

41:19

matrices inside of the neural net uh

41:22

that are stored in FP32 32-bit

41:25

floating-point weights. Um, it turns out

41:28

that when you do these computations, you

41:31

don't need all of that extra precision

41:33

to still maintain the quality of your

41:34

neural net. And you can with pretty

41:37

simple methods reduce the precision at

41:41

which you store these weights down to

41:43

four bits. So uh all of a sudden this

41:47

huge range of numbers uh that we would

41:50

take you know this float 32 to represent

41:53

uh something that gets you down to like

41:55

you know seven digits of precision uh

41:58

can

42:00

you know with somewhat high fidelity uh

42:03

still be um uh represented well by 4-bit

42:07

ints which you know just cover this uh

42:09

tiny range of like minus 8 to 7 and um

42:14

It's it's kind of a miracle that you can

42:15

do this. But what's even more of a

42:18

miracle is that you can apply these kind

42:21

of quantization transforms to the

42:24

runtime activations that the neural net

42:26

processes. And as soon as you do that,

42:29

the actual math that you're performing

42:31

because you're taking much smaller

42:34

operands to your matmul,

42:37

the amount of electricity that it takes

42:39

to compute the neural net drops

42:41

significantly. And what's interesting is

42:43

that like 99% of the total cost of

42:47

operation for AI hardware comes from the

42:53

uh power that it takes to run these

42:54

chips. And so if you can do these

42:57

operations, you could just make neural

42:59

nets run more cheaply, run more

43:00

efficiently.

43:02

That helps uh uh in terms of like

43:05

serving more requests and helps in terms

43:07

of latency.

43:09

So the name of the game for quant

43:12

research is how do we push the frontier

43:13

beyond this 4-bit range.
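
As a rough illustration of the weight quantization described above (not the method used for any Gemini model), here is a minimal Python sketch of symmetric 4-bit quantization. The function names, the single per-tensor scale, and the sample weights are all assumptions chosen for illustration; production schemes typically use finer-grained scales and more careful rounding.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7 so every code fits in the int4 range.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate floating-point weights from the 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.70, -0.33, 0.12, -0.02]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)

# Storage for a hypothetical 1M-parameter matrix: 32 bits vs 4 bits per weight.
num_params = 1_000_000
fp32_bytes = num_params * 32 // 8
int4_bytes = num_params * 4 // 8

print(codes)                     # integer codes, each in [-8, 7]
print(restored)                  # approximate reconstruction of the weights
print(fp32_bytes / int4_bytes)   # storage reduction factor
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the sense in which 16 integer levels can stand in for a 32-bit float "with somewhat high fidelity".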

43:16

There's this take that I see on Twitter

43:18

all the time um which is just talking

43:21

about MFU,

43:23

or model FLOPs utilization.

43:26

Someone who's not in the space they see

43:27

a number in the low tens and they think

43:30

wow they're wasting all of those GPU

43:32

resources. Um, I was curious if you

43:35

could just clarify that for people why a

43:38

low MFU or I guess naively low is

43:41

actually not low at all and maybe also

43:43

explain what MFU is.

43:44

>> Yeah. So

43:46

when we compute MFU, you want to divide

43:50

the actual number of flops that the

43:52

neural net is performing here by the

43:55

total number of flops that the

43:56

accelerator could have done in the time

43:58

of your request. And so in some sense

44:01

this is giving us the uh percent of time

44:06

that we're usefully utilizing the flops

44:08

rate of the accelerator. And to get to

44:11

100% MFU, you would just need to

44:14

be fully utilizing uh the matmul unit of

44:17

uh whatever accelerator uh you're doing

44:20

here. So it would just have to be doing

44:21

like a bunch of matmuls in a loop uh

44:24

without reading any memory or doing any

44:26

other operations.
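
The ratio just described can be written down directly. The following Python sketch is an illustration only: the function names and all numbers are hypothetical, and the second function assumes a deliberately simplified model in which matmul, memory, and vector work run back to back with no overlap.

```python
def model_flops_utilization(model_flops, elapsed_seconds, peak_flops_per_second):
    """FLOPs the model actually performed, divided by the FLOPs the
    accelerator could have performed in the same wall-clock time."""
    achievable_flops = elapsed_seconds * peak_flops_per_second
    return model_flops / achievable_flops


def mfu_from_time_breakdown(matmul_seconds, memory_seconds, vector_seconds):
    """Simplified view: if matmuls run at peak rate and nothing overlaps,
    MFU is the share of time spent in the matmul unit."""
    total = matmul_seconds + memory_seconds + vector_seconds
    return matmul_seconds / total


# Hypothetical request: 2e14 model FLOPs in 1 second on a 1e15 FLOP/s chip.
mfu = model_flops_utilization(2.0e14, 1.0, 1.0e15)
print(f"MFU = {mfu:.0%}")

# Hypothetical breakdown: 20% matmul, 50% memory traffic, 30% vector ops.
print(f"MFU = {mfu_from_time_breakdown(0.2, 0.5, 0.3):.0%}")
```

Under these assumptions, reaching 100% would require `memory_seconds` and `vector_seconds` to be zero, which is the "matmuls in a loop" case described above.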

44:28

That's not a very useful computation. Uh

44:31

and in practice,

44:33

neural nets have to apply activation

44:37

functions or do attention or write

44:41

intermediate outputs back to uh HBM. And

44:45

all of those different operations

44:48

will require utilizing the memory bus or

44:50

utilizing vector processing units. uh or

44:54

simply they might be mathematical

44:56

operations that

44:59

the underlying hardware performs more

45:02

slowly than it might

45:05

perform a matmul. And so all of those

45:07

things contribute to not running at the

45:10

full speed that the processor is rated

45:13

at. Uh and so that's why you might not

45:16

see 100% MFU all the time is cuz you

45:19

know part of the time your neural net

45:20

was you know reading and writing to

45:22

memory or part of the time it was doing

45:24

an operation that uh you know

45:26

fundamentally runs slower than certain

45:29

other units on your on your device. And

45:34

I think quite a bit of this inference

45:36

co-design work that I talked about

45:38

earlier is across all of the different

45:42

um capabilities of the chip. So uh

45:45

communication to other chips um memory

45:48

bandwidth the speed at which we can read

45:50

parameters from memory

45:53

flops of course uh this can be matmul

45:57

flops this could be flops for processing

46:00

uh vectors. So like things like doing

46:03

activations uh all of these have

46:05

different rates in the hardware and a

46:08

given computation isn't going to match

46:10

the natural hardware's rate

46:13

uh of each of those operations. So when

46:17

you design a neural net, you want to be

46:19

able to choose shapes for this neural

46:22

net that fully saturate all of those

46:25

hardware units to get you as high of an

46:27

MFU as possible. um when you are doing

46:31

uh inference here. What makes this more

46:34

than just an algebra problem is that

46:37

those choices translate to different

46:40

quality outcomes when you actually train

46:42

this neural net. So the process of this

46:45

kind of inference co-design is how do

46:48

we come up with neural architectures

46:50

that scale predictably

46:54

have a good prediction so are high

46:56

quality and still make the MFU as large

46:59

as possible during inference. And so

47:02

this kind of joint optimization is what

47:04

makes uh inference co-design really

47:06

fun. uh and also this kind of evergreen

47:08

problem because as the hardware changes

47:11

all of those relative constants of flops

47:14

to memory bandwidth to communication

47:16

bandwidth change and those will have

47:18

different implications for what the

47:21

optimal neural net shape should be. On

47:23

another topic, Google has this idea of a

47:27

spot bonus where someone can kind of

47:29

give you a one-off lump sum of money as

47:33

a thank you for like good performance.

47:35

And I I saw on your resume that Jeff

47:38

Dean, the legend himself, gave you a

47:40

spot bonus. And you know, if you can

47:42

tell that story, I'd love to hear why

47:44

did he give you a spot bonus? Yeah. So,

47:47

that one actually was at the very

47:49

beginning of the Gemini program. Uh he

47:53

gave out his spot bonus to people who

47:56

hopped on and launched the first version

47:58

of Bard. And like I had a you know very

48:02

small contribution to a very very large

48:03

project at the time. Uh I helped with uh

48:06

SFT for uh one of the first versions uh

48:10

of uh supervised fine-tuning for one of

48:12

the first versions of uh Bard that got

48:15

released like right you know the biggest

48:18

lesson out of uh that experience was you

48:23

know at that time I was just doing like

48:27

pure research in um uh Google Brain and

48:31

I was super focused on just how do I

48:33

maximize the number of first author

48:35

papers at NeurIPS, ICLR

48:38

and

48:41

I remember distinctly thinking like I I

48:44

had this instinct of like oh like you

48:47

know should I just keep my head down and

48:49

try to write more papers and luckily at

48:52

the time uh my manager Rohan Anil

48:55

like he really encouraged all of us to

48:57

get involved in

49:00

uh you know this space and

49:03

that was just the right motivation that

49:06

I needed to like roll up my sleeves, do a

49:09

bunch of hyperparameter tuning and

49:11

engineering work to get uh this uh model

49:15

running on uh uh some like really old

49:18

TPUs to get some extra you know cycles

49:21

in for for uh SFT attempts.

49:25

that very small initial engagement that

49:28

was recognized by Jeff Dean I I think

49:30

blossomed into more and more investment

49:33

on the LLM side by me and ultimately led

49:37

me to where I am today. Uh so yeah, I

49:41

would say you know it it's less so about

49:44

you know you know how much that like SFT

49:47

helped the initial release and it's much

49:48

more about uh uh recognizing that like

49:53

there there's quite a bit of work some

49:55

of it not glamorous some of it just like

49:57

you know hyperparameter tuning and uh

50:00

golfing the XLA compiler to make your

50:02

program fit in a certain memory amount

50:05

that contributes to a wider business

50:07

goal that is is really quite important

50:10

for getting involved in in very high

50:13

value projects.

50:15

>> You've been working on Gemini for a

50:16

while now and because it's a top

50:18

priority, there has to be some, you

50:21

know, incidents or war stories that

50:23

you've been involved in. So, I'm

50:24

curious, you know, what's your favorite

50:27

uh war story when working on Gemini? So,

50:31

I think my all-time favorite would have

50:34

to be

50:37

Flash 2.0.

50:39

Uh, so this one this one was quite a

50:42

challenge and a very long journey to get

50:44

there.

50:46

But, uh, one of the main things that we

50:50

were optimizing for which which Flash

50:53

1.5 established is this category of very

50:56

fast low latency model that's still

51:00

quite good. Um, and you know,

51:04

in particular, it has to be fast because

51:06

it's it's used by search to serve uh uh

51:09

responses in in AI mode uh very quickly.

51:14

Because of that uh for Flash 1.5 and

51:17

before we we focused on dense models

51:20

which uh allow you to respond very

51:22

quickly

51:25

even though at the time we we knew about

51:27

MoE models and how they increase capacity

51:29

and so I think um one thing that like

51:34

came up was okay like we sure would like

51:37

to use this new architecture but it it's

51:40

difficult to just simply switch to an MoE.

51:43

Because what happens with an MoE is it uses

51:45

a lot more parameters in general and

51:48

because it uses more parameters it takes

51:50

up more HBM.

51:52

These chips that we serve on have a

51:54

finite amount of HBM. So you have to

51:56

shard the MoE across uh multiple different

52:00

chips. So if you have you know whatever

52:03

n experts then you might shard it across

52:05

n chips or you know some factor of n.

52:08

And what this causes is a lot of

52:12

communication in the middle of the model

52:14

when you have a token that needs to be

52:17

routed to an expert and that token might

52:20

live on the first TPU but it needs to go

52:22

to the last TPU. That's a lot of

52:23

communication that you're inducing in

52:25

the forward pass. So the latency of this

52:29

operation

52:31

like increases dramatically with N. uh

52:35

and uh you know the challenge with MoEs is

52:37

they increase N. So uh that that that

52:40

like kind of really bottlenecked this

52:42

approach.
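
To make the communication cost concrete, here is a small Python sketch of how much activation traffic expert sharding can induce per layer. It is a back-of-the-envelope model, not a description of any real serving system: it assumes one expert group per chip, uniformly random routing, and that each routed token's activation vector is sent to the expert's chip and the result sent back. The function names and the example sizes are hypothetical.

```python
def cross_chip_fraction(num_chips):
    """Probability that a token's chosen expert lives on a different chip,
    assuming experts are spread evenly and routing is uniform."""
    return (num_chips - 1) / num_chips


def routed_bytes_per_layer(num_tokens, hidden_dim, bytes_per_value,
                           num_chips, experts_per_token=1):
    """Expected bytes crossing chip boundaries in one MoE layer.

    The factor of 2 counts sending the activation to the expert and
    returning the expert's output to the token's home chip.
    """
    per_token = hidden_dim * bytes_per_value * 2
    return (num_tokens * experts_per_token * per_token
            * cross_chip_fraction(num_chips))


# Hypothetical sizes: 1000 tokens, hidden size 4096, 2-byte activations.
for n in (1, 2, 8, 64):
    traffic = routed_bytes_per_layer(1000, 4096, 2, n)
    print(n, cross_chip_fraction(n), traffic)
```

This only counts traffic volume. It does not model the latency of the exchange itself, which is the quantity said above to grow with N, and which is paid on every MoE layer of the forward pass.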

52:44

And one interesting thing that happened

52:47

was uh we we definitely knew about uh

52:51

pipeline serving for a while. It's just in

52:54

the dense case uh it never really ended

52:57

up mattering. Like I distinctly remember

52:59

a very early conversation I had with

53:00

Sholto about it and Sholto's like oh yeah

53:03

you're like so FLOP-bound and so

53:05

pipelining is just not going to change

53:06

your prefill profile. And and he was

53:08

right. I tested it out and like then

53:10

abandoned the idea.

53:12

But what's interesting is

53:15

I I had a very small team at the time

53:17

and and one of my reports uh Gen Yan uh

53:20

had a very nice idea. He was working

53:22

with Rahul Aaria and a couple folks from

53:25

uh the Israel team at Google and that

53:27

was to apply pipeline prefill to MoEs.

53:32

Pipelining is a technique where instead

53:36

of parallelizing experts

53:40

across those N machines, you parallelize

53:42

layers across those N machines. So

53:45

instead of on a particular layer you

53:47

have to route tokens from machine to

53:49

machine, now one layer does the

53:52

computation for one subset of your

53:55

prefill request and then hands off uh

53:57

the processed tokens to the next machine

54:00

to process the second layer and then the

54:02

third layer and the fourth layer and

54:06

all of the experts can then stay

54:08

resident on a single machine or a

54:09

smaller set of machines. So uh what this

54:13

does effectively is it changes the

54:16

communication pattern from something

54:17

that required a lot of token exchange on

54:21

every single layer to uh something that

54:25

actually can be uh hidden behind other

54:28

computation because you can do this uh

54:31

pipeline prefill across different parts

54:33

of your request. Uh so uh while layer 2

54:37

is working on the first thousand tokens

54:40

of your request uh layer 1 on the first

54:42

chip uh is processing uh the second

54:46

thousand tokens of your request. So it

54:48

was a way of

54:50

breaking this HBM constraint by moving

54:54

layers across the machines rather than

54:55

moving experts across these machines.

54:58

And because of that the communication

55:00

overhead has gone down and all of a

55:02

sudden latency looks really attractive.
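
The overlap described above can be sketched as a simple schedule. This Python example is illustrative only: it assumes each pipeline stage (a machine holding a contiguous group of layers) takes one time step per chunk of the prefill request, and the names `pipeline_schedule`, `num_stages`, and `num_chunks` are hypothetical.

```python
def pipeline_schedule(num_stages, num_chunks):
    """For each time step, list the (stage, chunk) pairs running concurrently.

    Stage s can start chunk c only after stage s-1 has finished it,
    so stage s works on chunk (t - s) at time step t.
    """
    steps = []
    for t in range(num_stages + num_chunks - 1):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_chunks]
        steps.append(active)
    return steps


# Two stages, with a request split into three chunks of tokens.
schedule = pipeline_schedule(num_stages=2, num_chunks=3)
for t, active in enumerate(schedule):
    print(t, active)
```

At time step 1 the schedule contains both `(0, 1)` and `(1, 0)`: the first machine is already processing the second chunk while the second machine processes the first chunk, which is the overlap that lets the hand-off between machines hide behind computation.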

55:05

Now this you know the the Gemini 2.0

55:09

report says like it's an MoE series of

55:11

models and the thing that made that

55:13

possible is you know or one of the

55:15

things that made that possible is is

55:16

this uh uh you know serving time

55:19

innovation.

55:21

Dwarkesh and Reiner have an amazing post

55:23

about exactly this

55:26

optimization that you can write up in

55:29

the algebra of the scaling book and it's

55:31

just a wonderful example of how uh this

55:34

kind of change can

55:38

uh have really dramatic implications on

55:40

LLM quality. What really made Flash 2.0

55:43

rewarding is this, you know, giant

55:47

decision. And it sounds like a small

55:49

technical decision at the time, but

55:51

people were really worried about whether

55:53

or not the latency of this would uh

55:56

actually be reasonable. Luckily, I was

55:58

able to run like a very transparent

56:00

technical process to get to the bottom

56:02

of this. And by the end of it, uh you

56:06

know, we we made the right call. Uh but

56:08

then we had to train it. So

56:11

this was a bigger model than we've ever

56:14

trained before at the Flash scale. And

56:16

like we knew this would be the right

56:18

call, but it was just going to be 40

56:20

days of grueling work for like a really

56:22

really small team. Like we probably had

56:24

like five people on the rotation for

56:28

training this model. I remember, you

56:30

know, all of us just kind of like

56:32

rotated day by day handing off like, you

56:36

know, all of this like SRE-style work

56:39

of uh keeping the training job alive,

56:42

which at the time was was a very

56:45

interactive thing cuz uh you had to make

56:48

sure that everything was moving stably,

56:49

that you know you have tuned data

56:51

iterators that aren't slowing down your

56:54

job, that you know if there's like a gap

56:57

in the data somewhere or or an indexing

56:59

issue, you have to like really quickly

57:01

put up a fix because it's, you know,

57:03

wasting all of this GPU time. Um,

57:05

>> what about at nighttime and on the

57:07

weekends?

57:08

>> So, yeah, like I think, you know, for

57:10

those 40 days, we did not do a lot of

57:12

sleeping. Like we had to like do like

57:16

kind of these dual shifts across like

57:18

the Paris office and Mountain View. And

57:20

like the thing that makes it so

57:23

rewarding was when this model came out,

57:26

like around the same time uh DeepSeek

57:29

V3 came out and uh the Wall Street

57:32

Journal put out this article that was

57:34

like this giant Red Scare article about

57:36

how China's going to take over AI with

57:38

open source models. And I remember my

57:42

friend sent me a screenshot of this

57:44

table of the LMSYS Arena leaderboard and

57:48

you know all the way at the top right

57:50

you've got uh ChatGPT and uh

57:55

DeepSeek right behind it and like oh

57:57

DeepSeek was trained for whatever few

58:00

million dollars you know and and like

58:02

they're right there. Uh, and then my

58:04

friend was like, "Oh, like you know,

58:06

Gemini is so behind cuz they had, you

58:09

know, a version of like I think 1.5 Pro

58:11

or something in that table at the very

58:13

bottom." And then I looked at it, it's

58:16

like, "Oh, that's really interesting. I

58:17

was just looking at this leaderboard cuz

58:20

we just released a model and it

58:23

definitely doesn't look like that when

58:24

you go to the website." So, turns out

58:27

there were kind of some elided rows in

58:31

the uh Wall Street Journal uh article.

58:34

And so now if you go to that article

58:36

today, you can see, you know, what at

58:39

the time was the state-of-the-art model,

58:42

you know, Flash 2.0, you know, Thinking

58:45

uh up in the top right corner way far

58:47

ahead of uh DeepSeek V3. Might be messing

58:51

with the open source narrative that they

58:53

were trying to publish there, but uh it

58:55

was a really important accomplishment uh

58:57

for for the team.

58:59

>> Last question for you is if you could go

59:01

back to yourself when you just graduated

59:04

college, I guess undergrad, and give

59:07

yourself some advice knowing what you

59:08

know now, what would you say?

59:12

You you got to chase the problems that

59:16

people are facing

59:18

like in the world today like like go

59:21

after

59:23

uh the challenges that people see in

59:25

everyday life and don't be afraid to

59:31

tackle a smaller part of this problem or

59:34

maybe a more menial sounding part of

59:36

this problem even if it's not fancy

59:38

research math or something like that.

59:40

like trust that by working on what's

59:43

important, even if it's a smaller part

59:46

of a larger project for what's

59:47

important, you're going to get to see

59:50

what really matters in terms of moving

59:52

the frontier forward. And it's it's this

59:56

kind of I guess humility maybe in your

1:00:00

problem approach that that you should

1:00:03

really be chasing. Um that's one piece

1:00:06

of advice. I think the other bit that I

1:00:09

would give like maybe as professional

1:00:11

advice perhaps

1:00:14

would be

1:00:16

be the kind of co-worker

1:00:19

that

1:00:20

people would want to see succeed.

1:00:24

Uh, and so like what I mean by that is

1:00:28

there's this this like conception of

1:00:30

like workplace psychopath or Machiavellian

1:00:33

leaders or or whatever people who like

1:00:36

will do anything at all costs to get the

1:00:39

results they want and they they they

1:00:40

they might be able to squeeze people to

1:00:43

get some short-term gain. But

1:00:46

you know, having interacted with a

1:00:48

variety of people professionally for so

1:00:51

long,

1:00:52

what is interesting to me is there have

1:00:55

been a select few,

1:00:59

you know, one in particular is is

1:01:01

probably a very dear friend and mentor

1:01:04

of mine, Todd Lipkin, who first got me

1:01:06

into computer science. um that are just

1:01:10

1:01:13

like kind and

1:01:16

like people that you can learn from and

1:01:20

uh you know someone that I can

1:01:24

follow and be successful by following

1:01:28

that just genuinely inspire me to want

1:01:31

to help them succeed. Uh and so in

1:01:35

particular, if you are the kind of

1:01:38

person who helps people succeed in their

1:01:42

projects, comes up with projects that

1:01:45

can leverage other people's

1:01:47

complementary skills in ways that help

1:01:49

them shine, people will notice that.

1:01:52

People will want to contribute to

1:01:55

projects that you come up with in the

1:01:57

future and in general will want to

1:02:00

support you going forward. And so, you

1:02:03

know, people can get really cynical

1:02:06

thinking about the game theory of how to

1:02:08

interact at work. Uh but I found that

1:02:11

this kind of more amicable approach

1:02:16

generally like it it creates like this

1:02:19

this deep sense of collaboration and you

1:02:23

know willingness to help that like

1:02:26

is so important to get very large

1:02:28

projects that require multiple people

1:02:30

and multiple skill sets uh over the

1:02:33

line. And so yeah, I think if if I could

1:02:37

give any kind of like inner, you know,

1:02:39

interpersonal feedback or professional

1:02:41

feedback or whatever to an earlier version

1:02:44

of myself, it's it's to be that guy is

1:02:46

to to be the kind of person that other

1:02:49

people want to see succeed. I love that

1:02:52

this advice combats the cynical advice

1:02:55

and I also love that your original post

1:02:58

combats the doomer, you know, permanent

1:03:00

underclass stuff. So, um, yeah, thank

1:03:03

you so much for your time. This is a lot

1:03:04

of fun. Really appreciate it.

1:03:05

>> Thanks for having me, Ryan.

1:03:07

>> Hey, thank you for watching this

1:03:08

podcast. If you liked it and you want to

1:03:10

see the show grow, please support with a

1:03:12

comment or a like. Also, if you have any

1:03:15

recommendations for people you want me

1:03:17

to bring on, please drop a comment.

1:03:19

Guests like Barbara Liskov, Mike

1:03:22

Stonebraker, Marc Brooker, these were

1:03:24

all people that I brought on because

1:03:26

someone left a comment. On another note,

1:03:28

aside from the podcast, I'm working on

1:03:30

building the ergonomic keyboard that I

1:03:32

wish existed. Here's a glance at the

1:03:34

prototype. It's a split keyboard, so

1:03:36

there's two sides. Um, this is in the

1:03:39

case, but yeah, we launched on

1:03:40

Kickstarter and we hit our goal within 8

1:03:42

hours of launching. I really appreciate

1:03:44

it if you were one of the people who

1:03:45

grabbed one of the early units. Um,

1:03:48

we're now working on the long journey of

1:03:50

building the tooling now. So, if you

1:03:51

still want to pick one up, I've left the

1:03:53

late pledges open on Kickstarter, so you

1:03:56

can grab one there. I'll put a link in

1:03:58

the description. Thank you again for

1:04:00

watching the podcast and I'll see you in

1:04:02

the next
