Google DeepMind Pre-Training Lead: How To Land a Job at a Frontier Lab | Vlad Feinberg
Every single time you go up for a
pre-training run, you're about to put in
more flops into this run than you've
ever done before. This is Vlad Feinberg.
He's Google DeepMind's pre-training area
lead. And I asked him all about how to
get a job at a frontier lab.
>> That was a particular skill that I see
voracious demand for across all the
different labs. The research skill set
is going to become increasingly
important. If you do the scaling book
exercises and, you know, send me a video
of yourself doing them, I would love to,
you know, interview you. Here's the full
episode.
You wrote this post that was titled,
"How to get a job at a frontier lab."
What are the skills that are kind of in
demand in frontier labs? Maybe we can
talk about the shape of the work.
There's quite a range of different
things that Frontier Labs require
at this point. LLMs are artifacts that
are connected to uh research and product
in ways that machine learning really
hasn't been as connected to before. And
so it it really touches on so many
different things. The goal of my post
was to propose just a couple tangible
directions in which labs could require a
certain set of skills, not not to be
fully exhaustive. And really the ones
that I I dive into have to do with uh
kernel development and a low-level
engineering to accelerate
the runtime for these LLMs uh in
practice. And so that that was a
particular skill that I see voracious
demand for across all the different labs
and uh among different projects within
the labs. So that that seemed like a
very sharp one to call out as uh an
overall need. Uh and so specifically
whenever we're doing a research project
that involves changing the architecture
for the neural net in a particular way
or rethinking how we might do serving to
uh you know do better KV caching or
something like that again across the
stack you just need to be able to
implement these new techniques in
efficient ways and
uh the inner loop of all of these
different changes is creating software
artifacts that can function at large
scales with high throughput, low
latency. Uh, and this is just
fundamental work that's tied to
classical backend engineering thinking.
Uh, so yeah, it seemed like a very open
thing for people to specialize in. For my
friends that work at OpenAI and
Anthropic, there's this distinction of
an applied org and the research org and
I was wondering if DeepMind has a
similar uh distinction and if you could
speak about what that difference is. So
we we have different focus areas and
like you know for instance within GDM
there's a team that focuses on how uh we
can use our Gemini LLMs to better inform
search results and so like that might be
some you know you know in some way like
an applied version of the LLMs but I I
am hesitant to you know make a very
sharp distinction here because there's
so much actual like hard research that
has to go into this kind of level of
product integration like specifically
for the one I mentioned uh quite a lot
of work goes into making sure that these
LLMs are factual and can cite sources uh
to have very precise grounded answers
assessing the quality of these sources
to make sure that you're not referring
to anything that's like sarcastic or a
joke.
This is uh I guess a good example of how
even in like product specific quote
unquote applied AI verticals you're
still doing research. Uh that being
said, there's definitely
what I would say is like very classical
LLM research teams, pre-training,
post-training.
These are things that are still
standalone uh teams inside of GDM that
are focused on
what I would say is like you know
creating SOTA models, you know, pure
research.
Again, the caveat is the the pure
research that we do like the extent that
it matters is the extent to which we can
realize it. And so, you know, we're just
as responsible with uh delivering these
models and making sure they train stably
and actually being like the SREs of
sorts for the training run to make sure
that the model training is going
smoothly. Uh, as we are for coming up
with the recipes to make these LLMs and
you can't separate those two roles. It's
it's really crucial to kind of wear both
of those hats. So
yeah, I think you can you can draw up a
spectrum between research and applied.
Uh but uh no matter what in today's
world, I think uh everyone needs to be
fluid across that spectrum. I noticed
there's also another spectrum of
software engineer to pure AI researcher
and like how do you think of that
spectrum like software engineering
versus like AI researcher roles? So I
guess in um in in my case specifically I
think a lot of what we do and a lot of
the new techniques that we develop
the groundwork is laid in infrastructure
investment. So um I can walk through
what my team does a little bit more
detail uh later but one of the verticals
is uh distillation and in order to do uh
distillation it's it's some way of of
transferring the knowledge or some form
of statistics about the underlying data
set through a teacher model into the
student model to make the student model
better than if it hadn't ever seen these
auxiliary statistics from the teacher.
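
As a rough illustration of the idea just described, here is a minimal sketch. It assumes the "auxiliary statistics" are the teacher's next-token probabilities, which the speaker does not specify, and it uses the classic soft-target distillation loss rather than whatever the team actually runs. The names `distillation_loss`, `alpha`, and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=1.0):
    """Mix a soft-target term (match the teacher) with a hard-target term.

    Assumption: the teacher statistic is its full next-token distribution.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    # KL(teacher || student): how far the student is from the teacher.
    soft = sum(t * (math.log(t) - math.log(s))
               for t, s in zip(teacher_probs, student_probs) if t > 0)
    # Ordinary cross-entropy against the true next token.
    hard = -math.log(softmax(student_logits)[target_index])
    return alpha * soft + (1 - alpha) * hard


# Toy three-token vocabulary; the logits are made up for illustration.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(f"distillation loss: {distillation_loss(student, teacher, 0):.4f}")
```

At the scale discussed here, the teacher distributions would be computed once and stored rather than recomputed per student run, which is where the storage and throughput concerns come from.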
And when you're talking about statistics
derived from a massive LLM applied to
trillions and trillions of tokens, uh
you're talking about a level of flops
investment that you know is, you know,
millions and millions of dollars. And
that in turn means that you have to be
able to think through how do you uh
optimize the system to be as efficient
as possible because every operation that
we're performing is is multiplied by
such a large factor that yeah every
second counts, every byte of storage
counts and quite a bit of that work is
you know good old-fashioned software
engineering. And so uh in particular the
infrastructure for distillation has
evolved
through maybe three to four generations
at this point. And in each one we've
taken a step back looked at what kind of
research methods have we been applying
for distillation holistically thought
about how do we broaden what the
infrastructure is capable of. And
there's definitely a couple discrete
points where rethinking the system
design of how we perform distillation
enables us to do research on
distillation methods much more quickly.
And so it's this kind of investment that
like okay this like four month or
whatever rewrite of our distillation
infrastructure uh then results in a
dramatically new understanding of uh
distillation scaling laws that
translates to really strong models. So
it really requires just work across the
stack and I you know I can't yeah I
can't imagine that we would have gotten
results like Flash 3.0, you know, without
having made those distillation
infrastructure investments that are at
the end of the day things that started
with a good old-fashioned design doc and
thinking about what the right
abstractions are for uh generating these
teacher statistics coming up with the
right storage system for them thinking
through what could support uh the
reading and writing across uh multiple
different data centers at this scale
really classical distributed systems
problems
>> yeah I mean it sounds like there's
there's a lot of software engineering
backend infra type problems
given just the scale of the compute at
this point. It still feels like though
there at some point in that spectrum
there is some crossover where there's
these new skills like somewhere where if
you had you took a arbitrary backend
engineer and you placed them to I don't
know adjust the model architecture or
something like there that is like a bit
of a jump more than the infra work. Um,
like how do you see that distinction?
>> Yeah. So I think there is a crossover
point in terms of doing research where
research is an endeavor where the
payoffs become a lot higher risk higher
reward and we have this notion of uh
kind of research taste which is you know
some high level intuition about what
path you should be proceeding through
the DAG of the multiple uh different
milestones that you need accomplish in a
particular project.
In some sense, we can view software
engineering projects through a similar
DAG where you know you have all of these
intermediate artifacts that you want to
hit in a software program to uh get to
the final result. But in the software
engineering case, the DAG is more or
less deterministic where you you know
build one service then a different
service then a third service and you
know you figure out your storage
infrastructure layer first uh that kind
of thing and you can just make monotone
progress. But in the research case, you
have to uh kind of explore this DAG
which is now stochastic because some of
the nodes which might be some research
ideas or some you know aspect of getting
to a final goal uh may or may not work
out
and I think that requires a bit of a
mindset shift
and that that kind of mindset shift
takes a while to learn and it takes
specialized skills to learn. uh this
would be the kind of skills you pick up
in a PhD. For instance, one succinct way
I could put it, there's a really
excellent post by this uh professor
Jacob Steinhardt and I love to frame a
lot of the research work that I do in
this way and it's research as an MDP. So
MDP here means Markov decision process. Uh it's
again we have this high-level idea of a
stochastic dependency graph between
different milestones in a research
project where you might need to have a
certain kind of result or
prove a certain kind of theorem before
you get to a certain kind of conclusion.
Uh similarly for a machine learning
research project you might need to have
this and that featurization working
before you can get this and that ImageNet
accuracy or something like that.
um and expanding those nodes in this
graph. It's this stochastic endeavor
where these approaches may or may not
work out and whether or not one works
out opens up a set of new possibilities
for you. And so the approach that you
might have in the software engineering
case where you could fully write out
here are all the paths to the goal
across walking this graph. what's the
shortest path to your goal? That
approach is not optimal in the research
case because if all of a sudden the
transitions between the edges in this
graph become unreliable and uh some of
the nodes you might not even be aware
of. It might be a hidden MDP. Then the
way that you might approach this problem
would really differ. And in particular,
you have to factor in the success rate
and the time investment that you're
going to be putting into uh these
different research ideas as well as
a priori estimating what those different
rates are. And that's a very different
exercise than writing up what the you
know design for your software
engineering project might be. And it's
it it's this skill set of of building an
intuition of how likely an approach is
to work out without having yet done that
approach that I think people often you
know correlate with this uh research
taste notion.
But that's exactly the one that you need
to build up in order to properly uh
traverse this MDP
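
The contrast between a deterministic plan and a stochastic one can be sketched with a toy calculation. This is a simplification under stated assumptions, not the formulation from the post being referenced: each route is a chain of milestones, every milestone has a made-up cost in weeks and a made-up success probability, and work on a route stops at the first failure. The names `route_stats` and `nominal_weeks` are hypothetical.

```python
def nominal_weeks(route):
    """Deterministic view: total weeks if every milestone simply works."""
    return sum(weeks for weeks, _ in route)


def route_stats(route):
    """Stochastic view: (probability the route succeeds, expected weeks spent).

    Each milestone is (weeks, success_probability). A later milestone only
    costs time if every earlier one succeeded.
    """
    p_reach = 1.0
    expected_weeks = 0.0
    for weeks, p_success in route:
        expected_weeks += p_reach * weeks
        p_reach *= p_success
    return p_reach, expected_weeks


# Hypothetical numbers chosen only to illustrate the point.
short_risky = [(2, 0.3), (2, 0.5)]
long_safe = [(3, 0.9), (3, 0.9), (3, 0.9)]

for name, route in [("short_risky", short_risky), ("long_safe", long_safe)]:
    p, weeks = route_stats(route)
    print(f"{name}: nominal={nominal_weeks(route)}w "
          f"p_success={p:.3f} expected={weeks:.2f}w "
          f"success_per_week={p / weeks:.4f}")
```

With these numbers the shortest-path plan prefers the risky route, while ranking by success probability per expected week prefers the longer one. Estimating those probabilities before doing the work is the part the speaker associates with research taste.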
>> for the the research projects and just
like generally the nature of the
research work. And it sounds like you're
you're saying that there's a lot more
uncertainty here. I'm still trying to
get a sense of the the nature of the
work. If you threw this backend engineer
into a team that's doing research, like
what are those like concrete examples
where they fall short?
Like I think the the very first thing
that comes to mind is having the right
context for the research landscape in
which you're operating. So quite a bit
of research work involves like almost uh
this kind of
um you have to take on this like very
humble viewpoint of there's been quite a
lot of investment in related work in the
past and until I know
the sum total of humanity's bleeding
edge in this topic I'm definitely not
going to be able to further that
bleeding edge. So building up uh a solid
understanding of of past work uh in a
particular area uh and doing that
related literature review is maybe the
first thing that I would imagine people
might stumble on is uh having read and
having the skills to effectively
traverse you know historical uh citation
tree for a particular topic because you
don't have the time to read all of these
different papers. is you need to build
up a sense of uh what are the high value
papers and what are the ways in which I
can assess if a paper is worth reading
without fully reading it. That's like
the first thing that comes to mind as
the you know skill that people need to
build up even to be able to read these
research level papers. You have to have
a background in
machine learning in um some you know
computer science and uh you know
depending on the paper and depending on
the domain there might be all sorts of
prerequisites in terms of like the
underlying math and coursework that you
would want to have to properly
understand. So
that's that's quite important to be able
to have a deep understanding of what
methodology is available because you
really won't have a lot of hope of
improving upon the methodology if you
don't understand what's there already.
So, so I think I like mentioned earlier,
one of the things that my team works on
is is distillation.
And in order to advance our
understanding in uh distillation for
large language models, you have to have
a good understanding of like what we're
trying to do with LLMs.
And uh just to give a cursory overview
here, the name of the game for LLM
research is
especially in pre-training is uh is
scaling laws. And so what are scaling
laws? People focus a lot about like you
know this power law structure and the
fact that like you you know have this
and that exponent but like what matters
is less so the functional form. What
matters is for a given recipe of scaling
up your LLM. So as you invest more and
more flops into the pre-training run of
an LLM, you have to be able to predict
what the final test loss of this LLM is
going to be. And why why do we care
about this question? Why do we care
about predicting what our uh
generalization error is in the classical
machine learning world? Like say we're
trying to you know win ImageNet
we have our test loss which is our
classification uh error for you know a
thousand different classes and uh you
run your VGG or your ResNet proposal to
get that uh classification error that's
an estimate of how well that model does
at classifying
amongst those thousand classes various
different images. we can estimate how
good our method's going to be by taking
a validation set and then whenever we
have an architecture idea for a neural
net we just train it and then we uh do a
bunch of uh validation set runs and we
get a cross validation error that is
itself an estimator of our final test
error and so in this way you can just
iterate on different ideas uh through
this process. But what's different in LLM
world is every single time you go up for
a pre-training run you're about to put
in more flops into this run than you've
ever done before. So it's in some sense
like a one-shot version of this ImageNet
problem. You never get to see the full
ImageNet training data set. You have to
practice on MNIST and then CIFAR and then
maybe based off of those you try to come
up with a method that just works right
off the bat on ImageNet. And if you were
to just do that by itself, as I'm sure
many people have tried, like certainly
when I was learning how to do all of
those different things, you get
something, it works really great on
MNIST, it maybe even works on CIFAR, and
then all of a sudden it breaks on
ImageNet. You'll find out that like
things don't just generalize easily
across scale like this. And so much of
what we do for LLMs is coming up with
recipes where a recipe is this function
that goes from number of flops you'd
like to train on to a training routine
for this LLM. And if you can couple this
recipe with a prediction rule that can
predict accurately what your LLM accuracy
is going to be, then um you're able to
make decisions about how to improve your
recipe because you can use that
prediction.
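
A "prediction rule" of this kind can be sketched as fitting a curve to small runs and extrapolating to a larger budget. The snippet below is a minimal sketch under loud assumptions: the data is synthetic, the functional form is a bare power law `loss = a * flops ** (-b)` with no irreducible-loss term, and `fit_power_law` and `predict_loss` are hypothetical names. As the speaker notes, the functional form matters less than whether the prediction holds at the next scale.

```python
import math


def fit_power_law(flops, losses):
    """Least-squares fit of loss = a * flops ** (-b) in log-log space."""
    xs = [math.log(c) for c in flops]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, flops):
    """Extrapolate the fitted curve to a compute budget not yet tried."""
    return a * flops ** (-b)


# Synthetic "small runs": losses generated from a known curve, not real data.
small_budgets = [1e17, 1e18, 1e19]
small_losses = [30.0 * c ** -0.05 for c in small_budgets]

a, b = fit_power_law(small_budgets, small_losses)
print(f"fitted a={a:.3f}, b={b:.4f}")
print(f"predicted loss at 1e21 flops: {predict_loss(a, b, 1e21):.4f}")
```

Comparing two recipes then amounts to comparing their extrapolated curves at the target budget instead of paying for both full-scale runs.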
That is all a ton of context on what uh
LLM research looks like in general. But
that's like an understanding that we got
to that we even thought was feasible
thanks to so much uh
initial LLM scaling work that we've seen
across the Kaplan paper across
Chinchilla.
Since those two papers, there's been a
lot more work in terms of like what
other factors are there beyond uh number
of params and um number of tokens that
you train on that influence your
prediction accuracy uh like number of
unique tokens for instance. But like I
would say like those two foundational
papers for LLMs uh those are informed by
uh an even even longer line of uh
different uh scaling works uh going back
to like say the original uh GPTs and
then Google has had a ton of scaling
work across its PaLM papers. This is
just a set of works that have informed
that viewpoint that I described earlier
that
you you kind of just need to build up by
having gone through that literature
review yourself. If you were, for
instance, if uh you were trying to pick
someone that was going on your team and
the the way that you would judge their
fitness to help you push the frontier is
their understanding of the frontier,
including the existing literature, which
requires all these prerequisite. I think
you called it mathematical maturity in
your post.
>> Yeah. So I think I I I think it's easy
to read and understand those papers once
you have mathematical maturity.
So I guess the ones I mentioned in
particular nowadays they're table
stakes. So I I would expect candidates
to be familiar with them. Um I think um
the the general skill set is being able
to dive into
uh a paper of that level and then
understanding it.
uh you know being able to take a
research idea uh from a paper and
implementing it yourself like that's
that's just a a very important skill set
to be able to have like we get you know
all sorts of uh different ideas
presented you know they might not all
directly apply to our domain but if you
can deeply understand them then you can
iterate on them and you can improve them
inside of uh inside of our domain and so
when we assess for people who can work
with the mathematical concepts in these
machine learning papers. That's that's I
guess the the key skill there that would
be evidence that you can go pick up this
arbitrary paper and see to what extent
these ideas carry over uh in the Google
setting. This probably won't be
exhaustive, but I'd be curious to hear
other domains that maybe people could
dig into to see what kind of matters in
frontier AI research. So you'd mentioned
distillation, you also mentioned
kernels. It sounds like kernels are
helpful everywhere. Um, but are there
other areas that come to mind if you
were to just rattle off areas that are not
necessarily exhaustive?
One thing that I think is is quite
powerful is uh actually
programming language research. So by
looking into how we can create
abstractions at the programming language
level, we could facilitate kernel
development. I think ThunderKittens is a
really good example of this. Like coming
up with an abstraction that allows
you to write kernels through four
functions instead of arbitrary globs of
C++ code uh allows you to move really
quickly uh in uh developing algorithms
that fully utilize hardware.
So like it at that point it's um it's
not about the PL research itself. It's
about having a passion for
you know these kind of programming
language abstractions and and working
with low-level hardware um you know uh
people who you know are interested in
and will try to work with like CuTe DSL
this kind of thing where there's a lot
of hardware specific uh domain specific
languages one other thing that comes to
mind besides PL and uh scaling law
literature would be reinforcement
learning literature. uh so in particular
ever since uh RLHF uh I think we've seen
that deep RL algorithms uh like PPO do have a
place in production systems and you know
there was a time where that was in
question but uh now it's you know uh
pretty unanimous that we see these kind
of algorithms applied to real production
systems and
the uh theory behind that uh you kind of
have to start with the basics for
reinforcement learning and work your way
up to
you know the myriad uh value based
methods and and uh policy gradient
methods that we have today.
That's that's another domain that I
think is just like a very rich
literature tree to crawl. Um, and then
for more of the backend engineer folks,
just beyond just the kernels themselves,
there's I think a pretty fun overlap
between distributed systems and
optimization work where uh figuring out
how to design neural net training
algorithms that allow for
training across
many GPUs.
There's all sorts of fun challenges
between asynchronicity, how up-to-date
your gradients are, how
pipelining affects the staleness, uh all
of these system choices that you could
make in your training algorithm design
will impact convergence and the final
quality of your neural net and uh those
are things that can be analyzed
independently of the LLM setting uh and
have been for a while. So uh you know
especially if you're kind of more
infra-inclined then having a good
understanding of like uh how those
different algorithms work is
a really good place to start. Do you see
any difference between the the demands
of the different frontier labs? So for
instance if someone wants to work at
DeepMind is there like a particular
area that you see DeepMind cares about
more than Anthropic for instance? I
think in terms of the skill set, it's
probably pretty similar. Yeah, I think I
think there's maybe differences in like
business strategy and uh you know the
set of offerings that's a function of uh
the specialties of the labs and uh like
the kind of different uh you know
customers that the labs could have. Uh
but uh I would say that there's there's
quite a lot of overlap between the labs
in terms of what people look for and
like yeah like when I posted uh my post
you would you would see like you know
people from both OpenAI and Anthropic
saying like yeah like we agree with this
advice and so you know I I think um that
that's just a little bit of evidence
towards that. I think one reason for the
the huge demand for wanting to go closer
to AI research is because people are
thinking oh software engineering is not
going to be as important in the future.
Is there a similar thought in when it
comes to research where LLMs are also
going to handle a lot of that work as
well? So there's no reason to favor AI
research versus software engineering. Um
so I think the the research skill set is
going to become increasingly important.
Uh so I would say like being able to
handle stochastic components in the
planning of your work is is just going
to be a larger and larger part of how we
approach our jobs.
figuring out how to leverage AI in
whatever thing you work on, which
doesn't even have to be software
related, is just an important muscle to
start building right away. Um, because
these components aren't deterministic.
And thinking about how do I construct
systems around these LLMs to do my job
more effectively, uh, that's that's
going to be the thing that sets you
apart in the future. And I think that's
true no matter what you're going to be
doing. Look, I think I think there's
there's FUD everywhere, especially with
with some of the approach to marketing
that some people have in terms of AI.
It's FUD that is being intentionally
leveraged. And so I I feel like people
should really just focus on themselves
and and trying to uh be more productive
themselves. I I don't think that like AI
is going to replace all of our roles.
And so the reason for that is that
one of the important aspects of what we
do as humans in an organization which is
really this web of trust
from like you know this organization
that is you know this pool of resources
and this pool of people that manages
these resources. One of the important
things that we do is we allocate those
resources towards c certain goals and um
even when we can accelerate our
execution
there's an element of making decisions
around how we allocate these resources
that will always be something that needs
to be attributable to a human making
that decision. And uh that's simply
because you can't hand off blame to AI.
So we at this point have LLMs that
really deeply understand law and they
could, you know, review your contract
for you or something like that. But they
can't represent you in court because
they can't be disbarred.
And so that's that's I think like a a
really you know sharp way that I might
describe like okay this is why the legal
profession will go on even though LLMs
are really good at recalling precedent
is you want to have someone who is
responsible who can validate the output
of AI to perform uh legal work more
effectively for you rather than hand off
your legal defense to an LLM.
>> Yeah, I think the FUD that was actually
the original motivation for your post.
>> Yeah, I mean I I really think that the
mindset that people should have is is a
constructive one. And so there was a
tweet that I saw I think by Dee that was
like some long form you know
fear-mongering about you know uh uh AI
permanent underclass or something like
that. And uh it's easy to get stuck in
that loop, but I think the important
thing to think about is like we all have
agency over our future and we can start
investing in uh skills that matter for
tomorrow today and um that's that's
really
the only thing you should be doing,
right? Like you know worrying about it
is not going to not going to help you.
And so part of why I wanted to write
this post is is in response to that
uh because it it it was something that I
could see echoed. You know, I gave a
lecture at Princeton a while back and
you know, a big question that came up is
like, you know, how do I work at
DeepMind? And it's something that like
uh yeah, just when people find out what
I do, that's the top question people
ask. So, I figured it would be helpful
to add a little bit more constructive,
you know, direction to the discourse
here. One last thing on the post, cuz
you know, if you think about getting a
role, there's obviously the skills and
we talked a lot about the skills and
your fitness for the role, but there's
also kind of the uh signaling for that
role and like what is kind of valued if
you were to be saying marketing yourself
to one of these frontier labs. What
signals um matter most?
>> Actual evidence that you've created
something of uh
of use to other people uh along the line
of kernels, right? Like you can take any
of the many open source LLMs that we
have and optimize them. You don't have
to make them better in every case. You
could show that, oh, I have an
improvement for this and that setting.
It doesn't even have to be something
that speeds up the model on GPU. There's
all sorts of open-source stacks like
vLLM. There's a lot of other um things
that you can do besides accelerating the
LLM inference on device. The serving
stack that surrounds LLMs is a very
sophisticated distributed system that
has to maintain this KV cache memory and
deal with uh all sorts of like load
balancing and uh request queuing and and
very common problems for for back-end
servers. Uh and these projects are
always looking for help. So, you know,
contributions to vLLM or SGLang uh or
demonstrations with TensorRT uh they
have, I think, a distributed system
called Dynamo that uh allows for
disaggregated serving where you could
show that you you made a project using
these components, you improve these
components like that would be an
extremely positive signal uh for any
candidate that I'm looking at uh and and
a very welcome contribution to uh open
source.
>> I I think also a lot of what we said is
kind of assuming the path of external
hire into frontier lab. Um but a lot of
these frontier labs have large
organizations that aren't necessarily
doing the cutting edge uh frontier work.
So let's say yeah for instance I mean
you know Google DeepMind versus let's
say there's some infrastructure that's
working on search and they have the
backend skill set maybe not as much
domain context and they try to internal
transfer to Google DeepMind, does any of
your advice differ in that kind of case
for like an internal transfer versus uh
someone who's coming from external
>> there's someone who I worked with
closely on the search side who actually
did transfer to my team uh Nate Linds
and he's amazing and now he owns so much
of uh like what we do on my team in
terms of
inference code design for uh like Flash
and Flash-Lite and I would say like he's
a really great example of this where
his approach was you know how do I help
my PA my product area
adopt this technology as effective ly as
possible. So I think there's you know
definitely if you're in a organization
that isn't directly generating these
models but in some way trying to
leverage them there's a very big gap in
terms of applying these LLMs effectively
serving them effectively
within
uh your organization and becoming
someone who does that really
effectively. not only creates a ton of
value uh in terms of like
the uh you know specific business need
for your org which will definitely
elevate you in your org. Uh but it'll
also be the case that you're going to
just naturally become the partner that
we work with uh on the research side to
make sure that our models are effective
within your org. And so at that point,
you know, you may or may not want to
transfer. uh definitely if you transfer
we'd you know be happy to work with you
but like
at that point I think you're you're
you're already doing something that is
cutting edge which is integrating this
new technology into uh you know a real
product that people use and so
yeah that'd be my advice there.
Towards the end of this post, as we
leave this topic you had the concrete
invitation because I know you were
hiring uh do you want to say what that
was
>> yeah so I just trying to think of like,
you know, you know, how do I put my
money where my mouth is? Um, how do I
demonstrate, look, this is a good way to
show that you have, you know, at least
some evidence of of like the the skills
that I called out as important, you
know, intent, mathematical maturity,
grit. Uh and so I listed out a couple of
exercises that demonstrate you know some
initial knowledge of scaling laws, some
willingness to get into the weeds
engineering wise in terms of
implementing a real transformer
and uh sort of willingness to pick up
the kind of bread and
butter math that we use uh every day to
size uh these LLMs and uh you know I I
won't I won't recall the full list of
like the exercises that I expected here.
But like uh you know if you do the
detailed like uh handwritten uh version
of the scaling book exercises and you
know send me a video of yourself doing
them along with the transformer exercise
on my post then that's something if you
can work in the uh New York office I
would love to you know interview you for
and quite a few people reached out to me
about that. I actually already have had
a couple submissions and we're
proceeding with the loop with those
people. So
yeah, it's it's quite a bit of work, but
uh impressively I got a response within
like I think a week of posting. So uh
it's definitely doable.
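
The speaker does not spell out the "bread and butter math" used to size LLMs, so the following is only one common example of that kind of arithmetic, not necessarily what the exercises cover. It assumes the widely used rule of thumb that training a dense transformer costs roughly 6 FLOPs per parameter per token. Every hardware number below is hypothetical, and `training_flops` and `training_days` are made-up helper names.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost for a dense transformer: about 6 * N * D."""
    return 6.0 * n_params * n_tokens


def training_days(n_params, n_tokens, n_chips, peak_flops_per_chip, utilization):
    """Wall-clock estimate given a chip count and an achieved utilization."""
    achieved_flops_per_second = n_chips * peak_flops_per_chip * utilization
    seconds = training_flops(n_params, n_tokens) / achieved_flops_per_second
    return seconds / 86400.0


# Hypothetical setup: a 1B-parameter model on 20B tokens (about 20 tokens
# per parameter), 64 accelerators at 1e14 peak FLOP/s each, 40% utilization.
n_params, n_tokens = 1e9, 2e10
print(f"total training FLOPs: {training_flops(n_params, n_tokens):.2e}")
print(f"estimated days: {training_days(n_params, n_tokens, 64, 1e14, 0.4):.2f}")
```

The same arithmetic run in reverse gives the model and data size that fit inside a fixed compute budget.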
Yeah, I mean, I don't have unlimited
headcount. So the offer is on the
table, but I can only hire
so many people. The good thing, though, is
that this is such a strong sign of
self-development
that not only is it something
that you should be doing for its own
sake, regardless of whether or not you
will get a job at DeepMind
specifically,
but I think it'll be something that
lets you basically prepare for
interviews in other places. Certainly,
if you reach out to me with these
exercises completed, even if
I've done all my hiring, there are tons of
people I know who are hiring as well,
and I'd be happy to refer people as
well.
>> OpenAI, Anthropic, Cursor, and
Vercel all use this product to make
their lives better. And the problem it
solves is when you're building SaaS or an
AI product and you want to sell to other
companies, there are all these requirements
you need to meet. There's SSO, there's
SCIM, there's RBAC, there are audit
logs. These are all things that take
time to integrate but aren't the main
focus of your app. WorkOS is an API
layer that lets you meet all of these
requirements in just a few lines of
code. So, let's say you have a new SaaS
product and you want to sell to other
companies. WorkOS will solve all of
these critical feature gaps for you. You
can check them out at workos.com to
learn more and get started. And I
appreciate them for supporting my work
and sponsoring this podcast. On the next
topic, I saw you're the
area lead for pre-training on Gemini,
and I just thought it might be
interesting to hear you give
a high-level overview of what
pre-training is in your words and maybe
what the high-level challenges are in
the area. We can talk about that.
>> Yeah. So there's quite a lot of
work that we do in pre-training.
As an area lead for it,
the specific things that my team is
responsible for delivering include
the Flash model and the Flash-Lite model.
These are models that get used for AI
Overviews and AI Mode in the search bar,
as well as some other 1P models
that are used by different orgs like Ads
and YouTube.
Besides this, we're also key technical
POCs for the Google-Apple
partnership, and so we do technical
work there.
Those are the actual product-level
deliverables for my team. Beyond
that, we do research to make sure that
these deliverables are state-of-the-art,
and we also do general pre-training
research that contributes to the Pro
series models as well. The nature of
the research, I would say, generally
breaks down into three different
verticals. There's distillation, which I
mentioned earlier.
There's what I like to call inference
co-design: creating neural
architectures that are efficient to
run inference on. So coming up with the
network topology, the shapes of the
matrices that the matmuls use
inside the gating and linear layers
for this transformer, as well as the
attention shapes, num heads, that kind
of thing,
so that you are effectively utilizing
the hardware that you're serving on. And
then the final pillar here is new
quantization methods.
Quantization is just something that's
been near and dear to my heart, that I've
been working on on the research side
ever since I joined Google. And it
really changes what's feasible for
the first two. So that's why
furthering the state of the art in terms
of how you can compress models is
also a very important pillar in the
research that my team does. Generally,
quantization refers to reducing, in
some sense, the size that the neural nets
take up in order to represent their
weights. Typically a neural net, when
you're training it, is represented as
a series of numbers that make up the
matrices inside of the neural net,
stored as FP32, 32-bit
floating-point weights. It turns out
that when you do these computations, you
don't need all of that extra precision
to still maintain the quality of your
neural net. And you can, with pretty
simple methods, reduce the precision at
which you store these weights down to
four bits. So all of a sudden this
huge range of numbers that we would
take a float32 to represent,
something that gets you
about seven digits of precision,
can,
with fairly high fidelity,
still be represented well by 4-bit
ints, which just cover this
tiny range of minus 8 to 7.
It's kind of a miracle that you can
do this. But what's even more of a
miracle is that you can apply these kinds
of quantization transforms to the
runtime activations that the neural net
processes. And as soon as you do that,
the actual math that you're performing
takes much smaller
operands to your matmul, and
the amount of electricity that it takes
to compute the neural net drops
significantly. And what's interesting is
that like 99% of the total cost of
operation for AI hardware comes from the
power that it takes to run these
chips. And so if you can do these
operations, you can just make neural
nets run more cheaply, run more
efficiently.
That helps in terms of
serving more requests and helps in terms
of latency.
So the name of the game for quant
research is how do we push the frontier
beyond this 4-bit range.
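[Editor's sketch] To make the "minus 8 to 7" range concrete, here is a minimal Python illustration of one simple scheme: symmetric round-to-nearest quantization with a single scale per tensor. The function names, the sample weights, and the choice of scheme are assumptions made for illustration; the conversation does not say which quantization methods the team actually uses.

```python
def quantize_int4(weights):
    """Symmetric absmax quantization of floats to integer codes in [-8, 7].

    Illustrative only: one scale for the whole list, round-to-nearest.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to code 7 so every value fits in signed 4 bits.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for the example.
weights = [0.42, -1.30, 0.07, 0.91, -0.55, 1.75]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)

# Worst-case rounding error is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage: 4 bytes per FP32 weight versus two 4-bit codes per byte.
fp32_bytes = 4 * len(weights)
int4_bytes = len(weights) / 2

print(codes, scale, max_error, fp32_bytes, int4_bytes)
```

Under these assumptions the weights shrink by 8x in storage, at the price of an error of up to half a step per weight.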
>> There's this take that I see on Twitter
all the time, which is just talking
about MFU, or model flops utilization.
Someone who's not in the space sees
a number in the low tens and they think,
wow, they're wasting all of those GPU
resources. I was curious if you
could just clarify for people why a
low MFU, or I guess naively low, is
actually not low at all, and maybe also
explain what MFU is.
>> Yeah. So
when we compute MFU, you want to divide
the actual number of flops that the
neural net is performing by the
total number of flops that the
accelerator could have done in the time
of your request. So in some sense
this gives us the percent of time
that we're usefully utilizing the flops
rate of the accelerator. And to get to
100% MFU, you would need to
be fully utilizing the matmul unit of
whatever accelerator you're running
on. So it would just have to be doing
a bunch of matmuls in a loop
without reading any memory or doing any
other operations.
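[Editor's sketch] The ratio described above can be written out in a few lines of Python. The peak rate, matrix sizes, and timing below are made-up numbers for a hypothetical accelerator, and counting a matmul as 2·m·k·n flops is the usual convention rather than something stated in the conversation.

```python
def matmul_flops(m, k, n):
    """Flops for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


def mfu(model_flops, elapsed_seconds, peak_flops_per_second):
    """Model flops utilization: useful flops over what the chip could have done."""
    achievable_flops = elapsed_seconds * peak_flops_per_second
    return model_flops / achievable_flops


# Hypothetical accelerator and measurement, for illustration only.
PEAK_FLOPS_PER_SECOND = 1e15   # assumed peak matmul rate
ELAPSED_SECONDS = 2.5e-3       # assumed wall-clock time for the operation

flops_done = matmul_flops(8192, 8192, 8192)
utilization = mfu(flops_done, ELAPSED_SECONDS, PEAK_FLOPS_PER_SECOND)
print(f"MFU = {utilization:.1%}")
```

With these numbers the result is a bit under half of peak, even though the only operation is a matmul; any time spent on memory traffic or non-matmul work lowers it further.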
That's not a very useful computation.
In practice,
neural nets have to apply activation
functions or do attention or write
intermediate outputs back to HBM. And
all of those different operations
will require utilizing the memory bus or
utilizing vector processing units, or
they might simply be mathematical
operations that
the underlying hardware performs more
slowly than it might
perform a matmul. And so all of those
things contribute to not running at the
full speed that the processor is rated
at. So that's why you might not
see 100% MFU all the time:
part of the time your neural net
was reading and writing to
memory, or part of the time it was doing
an operation that
fundamentally runs slower than certain
other units on your device. And
I think quite a bit of this inference
co-design work that I talked about
earlier is across all of the different
capabilities of the chip:
communication to other chips, memory
bandwidth (the speed at which we can read
parameters from memory),
and flops, of course. This can be matmul
flops, or it could be flops for processing
vectors, so things like doing
activations. All of these have
different rates in the hardware, and a
given computation isn't going to match
the hardware's natural rate
for each of those operations. So when
you design a neural net, you want to be
able to choose shapes for this neural
net that fully saturate all of those
hardware units to get you as high an
MFU as possible when you are doing
inference. What makes this more
than just an algebra problem is that
those choices translate to different
quality outcomes when you actually train
this neural net. So the process of this
kind of inference co-design is: how do
we come up with neural architectures
that scale predictably,
have good predictions, so are high
quality, and still make the MFU as large
as possible during inference? And so
this kind of joint optimization is what
makes inference co-design really
fun, and also this kind of evergreen
problem, because as the hardware changes,
all of those relative constants of flops
to memory bandwidth to communication
bandwidth change, and those will have
different implications for what the
optimal neural net shape should be.
>> On
another topic, Google has this idea of a
spot bonus, where someone can
give you a one-off lump sum of money as
a thank-you for good performance.
And I saw on your resume that Jeff
Dean, the legend himself, gave you a
spot bonus. If you can
tell that story, I'd love to hear why
he gave you a spot bonus.
>> Yeah. So,
that one actually was at the very
beginning of the Gemini program. He
gave out his spot bonus to people who
hopped on and launched the first version
of Bard. And I had a very
small contribution to a very, very large
project at the time. I helped with
SFT, supervised fine-tuning, for one of
the first versions of Bard that got
released. The biggest
lesson out of that experience was,
at that time I was just doing
pure research in Google Brain, and
I was super focused on just how do I
maximize the number of first-author
papers at NeurIPS and ICLR.
And
I remember distinctly
having this instinct of, oh,
should I just keep my head down and
try to write more papers? And luckily, at
the time, my manager, Rohan Anil,
really encouraged all of us to
get involved in
this space, and
that was just the right motivation that
I needed to roll up my sleeves and do a
bunch of hyperparameter tuning and
engineering work to get this model
running on some really old
TPUs, to get some extra cycles
in for SFT attempts.
That very small initial engagement that
was recognized by Jeff Dean, I think,
blossomed into more and more investment
on the LLM side by me and ultimately led
me to where I am today. So yeah, I
would say it's less about
how much that SFT
helped the initial release and much
more about recognizing that
there's quite a bit of work, some
of it not glamorous, some of it just
hyperparameter tuning and
golfing the XLA compiler to make your
program fit in a certain memory amount,
that contributes to a wider business
goal, and that is really quite important
for getting involved in very high
value projects.
>> You've been working on Gemini for a
while now, and because it's a top
priority, there have to be some
incidents or war stories that
you've been involved in. So, I'm
curious, what's your favorite
war story from working on Gemini?
>> So,
I think my all-time favorite would have
to be
Flash 2.0.
This one was quite a
challenge and a very long journey to get
there.
But one of the main things that we
were optimizing for, which Flash
1.5 established, is this category of very
fast, low-latency model that's still
quite good. And
in particular, it has to be fast because
it's used by Search to serve
responses in AI Mode very quickly.
Because of that, for Flash 1.5 and
before, we focused on dense models,
which allow you to respond very
quickly,
even though at the time we knew about
MoE models and how they increase capacity.
And so I think one thing that
came up was, okay, we sure would like
to use this new architecture, but it's
difficult to just simply switch to an MoE,
because what happens with an MoE is it uses
a lot more parameters in general, and
because it uses more parameters, it takes
up more HBM.
These chips that we serve on have a
finite amount of HBM, so you have to
shard the MoE across multiple different
chips. So if you have, you know, whatever
N experts, then you might shard it across
N chips or some factor of N.
And what this causes is a lot of
communication in the middle of the model,
when you have a token that needs to be
routed to an expert and that token might
live on the first TPU but needs to go
to the last TPU. That's a lot of
communication that you're inducing in
the forward pass. So the latency of this
operation
increases dramatically with N,
and the challenge with MoEs is that
they increase N. So that
kind of really bottlenecked this
approach.
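[Editor's sketch] A toy Python model of why sharding experts across chips creates traffic. It assumes one expert per chip, tokens spread evenly over chips, and a router that picks experts uniformly at random; none of these assumptions comes from the conversation, and the model counts transfers only, not latency.

```python
def cross_chip_fraction(num_chips):
    """Fraction of (token chip, expert chip) pairs that need a transfer.

    Toy assumptions: one expert per chip, uniform token placement,
    uniform routing. Real routers and layouts differ.
    """
    pairs = [(src, dst) for src in range(num_chips) for dst in range(num_chips)]
    remote = sum(1 for src, dst in pairs if src != dst)
    return remote / len(pairs)


for n in (1, 2, 4, 8, 16):
    print(n, cross_chip_fraction(n))
```

Under these assumptions the fraction is 1 - 1/N, so with more chips nearly every token has to leave its chip on every MoE layer, and that exchange sits on the critical path of the forward pass.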
And one interesting thing that happened
was, we definitely knew about
pipelined serving for a while; it's just that in
the dense case it never really ended
up mattering. I distinctly remember
a very early conversation I had with
Sholto about it, and Sholto's like, oh yeah,
you're so flop-bound, and so
pipelining is just not going to change
your prefill profile. And he was
right. I tested it out and then
abandoned the idea.
But what's interesting is
I had a very small team at the time,
and one of my reports, Gen Yan,
had a very nice idea. He was working
with Rahul Aaria and a couple of folks from
the Israel team at Google, and that
was to apply pipelined prefill to MoEs.
Pipelining is a technique where, instead
of parallelizing experts
across those N machines, you parallelize
layers across those N machines. So
instead of having to route tokens
from machine to
machine on a particular layer, now one layer does the
computation for one subset of your
prefill request and then hands off
the processed tokens to the next machine
to process the second layer, and then the
third layer and the fourth layer, and
all of the experts can then stay
resident on a single machine or a
smaller set of machines. So what this
does effectively is it changes the
communication pattern from something
that required a lot of token exchange on
every single layer to something that
can actually be hidden behind other
computation, because you can do this
pipelined prefill across different parts
of your request. So while layer 2
is working on the first thousand tokens
of your request, layer 1 on the first
chip is processing the second
thousand tokens of your request. So it
was a way of
breaking this HBM constraint by moving
layers across the machines rather than
moving experts across these machines.
And because of that, the communication
overhead went down, and all of a
sudden latency looks really attractive.
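[Editor's sketch] The overlap described above can be pictured as a simple schedule in Python. Each "stage" stands for a chip holding a contiguous group of layers, and each "chunk" for a slice of the prefill request (say, a thousand tokens). The function name and the equal-cost time steps are assumptions for illustration; this is not the production serving system.

```python
def pipeline_schedule(num_stages, num_chunks):
    """List, for each time step, the (stage, chunk) pairs that are busy.

    Chunk c reaches stage s at step s + c, so later stages work on earlier
    chunks while earlier stages start on the next ones.
    """
    steps = []
    for t in range(num_stages + num_chunks - 1):
        active = [(stage, t - stage)
                  for stage in range(num_stages)
                  if 0 <= t - stage < num_chunks]
        steps.append(active)
    return steps


# Two stages (chips) and three chunks of the prompt.
schedule = pipeline_schedule(num_stages=2, num_chunks=3)
for t, active in enumerate(schedule):
    print(t, active)
```

At step 1 in this example, stage 1 is working on chunk 0 while stage 0 is already on chunk 1, which mirrors the "layer 2 on the first thousand tokens, layer 1 on the second thousand" description. The only traffic is the hand-off of each chunk between neighbouring stages, and that can overlap with the next chunk's compute.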
Now, the Gemini 2.0
report says it's an MoE series of
models, and the thing that made that
possible, or one of the
things that made that possible, is
this serving-time
innovation.
Dwarkesh and Reiner have an amazing post
about exactly this
optimization, which you can write up in
the algebra of the scaling book, and it's
just a wonderful example of how this
kind of change can
have really dramatic implications for
LLM quality. What really made Flash 2.0
rewarding is this giant
decision. It sounded like a small
technical decision at the time, but
people were really worried about whether
or not the latency of this would
actually be reasonable. Luckily, I was
able to run a very transparent
technical process to get to the bottom
of this, and by the end of it,
we made the right call. But
then we had to train it. So
this was a bigger model than we had ever
trained before at the Flash scale. And
we knew this would be the right
call, but it was just going to be 40
days of grueling work for a really,
really small team. We probably had
like five people on the rotation for
training this model. I remember
all of us just kind of
rotated day by day, handing off
all of this SRE-style work
of keeping the training job alive,
which at the time was a very
interactive thing, because you had to make
sure that everything was moving stably,
that you have tuned data
iterators that aren't slowing down your
job, that if there's a gap
in the data somewhere or an indexing
issue, you really quickly
put up a fix, because it's
wasting all of this GPU time.
>> What about at nighttime and on the
weekends?
>> So, yeah, I think
for those 40 days, we did not do a lot of
sleeping. We had to do
these dual shifts across
the Paris office and Mountain View. And
the thing that made it so
rewarding was when this model came out.
Around the same time, DeepSeek
V3 came out, and the Wall Street
Journal put out this article that was
like this giant Red Scare article about
how China's going to take over AI with
open source models. And I remember my
friend sent me a screenshot of this
table of the LMSYS Arena leaderboard, and
all the way at the top right
you've got ChatGPT and
DeepSeek right behind it, and like, oh,
DeepSeek was trained for whatever few
million dollars, and
they're right there. And then my
friend was like, "Oh, you know,
Gemini is so behind," because they had
a version of, I think, 1.5 Pro
or something in that table at the very
bottom. And then I looked at it and was
like, "Oh, that's really interesting. I
was just looking at this leaderboard because
we just released a model, and it
definitely doesn't look like that when
you go to the website." So, it turns out
there were some elided rows in
the Wall Street Journal article.
And so now if you go to that article
today, you can see what at
the time was the state-of-the-art model,
Flash 2.0 Thinking,
up in the top right corner, way far
ahead of DeepSeek V3. It might be messing
with the open source narrative that they
were trying to publish there, but it
was a really important accomplishment
for the team.
>> Last question for you is if you could go
back to yourself when you just graduated
college, I guess undergrad, and give
yourself some advice knowing what you
know now, what would you say?
>> You've got to chase the problems that
people are facing
in the world today. Go
after
the challenges that people see in
everyday life, and don't be afraid to
tackle a smaller part of the problem or
maybe a more menial-sounding part of
the problem, even if it's not fancy
research math or something like that.
Trust that by working on what's
important, even if it's a smaller part
of a larger project,
you're going to get to see
what really matters in terms of moving
the frontier forward. And it's this
kind of, I guess, humility in your
problem approach that you should
really be chasing. That's one piece
of advice. I think the other bit that I
would give, maybe as professional
advice,
would be:
be the kind of co-worker
that
people would want to see succeed.
What I mean by that is
there's this conception of the
workplace psychopath or Machiavellian
leaders or whatever, people who
will do anything at all costs to get the
results they want, and
they might be able to squeeze people to
get some short-term gain. But
having interacted with a
variety of people professionally for so
long,
what is interesting to me is there have
been a select few,
one in particular being
a very dear friend and mentor
of mine, Todd Lipkin, who first got me
into computer science, who are just
so
kind, and
people that you can learn from,
someone that I can
follow and be successful by following,
who just genuinely inspire me to want
to help them succeed. And so in
particular, if you are the kind of
person who helps people succeed in their
projects, comes up with projects that
can leverage other people's
complementary skills in ways that help
them shine, people will notice that.
People will want to contribute to
projects that you come up with in the
future and in general will want to
support you going forward. And so,
people can get really cynical
thinking about the game theory of how to
interact at work. But I've found that
this kind of more amicable approach
generally creates this
deep sense of collaboration and
willingness to help that
is so important to get very large
projects that require multiple people
and multiple skill sets over the
line. And so yeah, I think if I could
give any kind of
interpersonal feedback or professional
feedback or whatever to an earlier version
of myself, it's to be that guy,
to be the kind of person that other
people want to see succeed.
>> I love that
this advice combats the cynical advice,
and I also love that your original post
combats the doomer, you know, permanent
underclass stuff. So, yeah, thank
you so much for your time. This is a lot
of fun. Really appreciate it.
>> Thanks for having me, Ryan.
>> Hey, thank you for watching this
podcast. If you liked it and you want to
see the show grow, please support with a
comment or a like. Also, if you have any
recommendations for people you want me
to bring on, please drop a comment.
Guests like Barbara Liskov, Mike
Stonebraker, Marc Brooker, these were
all people that I brought on because
someone left a comment. On another note,
aside from the podcast, I'm working on
building the ergonomic keyboard that I
wish existed. Here's a glance at the
prototype. It's a split keyboard, so
there's two sides. Um, this is in the
case, but yeah, we launched on
Kickstarter and we hit our goal within 8
hours of launching. I really appreciate
it if you were one of the people who
grabbed one of the early units. Um,
we're now working on the long journey of
building the tooling now. So, if you
still want to pick one up, I've left the
late pledges open on Kickstarter, so you
can grab one there. I'll put a link in
the description. Thank you again for
watching the podcast and I'll see you in
the next