Full Transcript

·YouTLDR

Jack Morris: Stuffing Context is not Memory, Updating Weights is

1:02:45 · 11,967 words · ~60 min read · English · Transcribed Apr 19, 2026
0:13

[music]

0:20

>> Let's talk about ChatGPT. I think like

0:22

ChatGPT knows a lot of things. It's

0:24

actually extremely impressive. I use it

0:27

all the time. I use it to help prepare

0:29

for the presentation. You know, I use it

0:30

to cook last night. Um,

0:33

you know, like growing increasingly

0:35

dependent. And yet there's a lot that

0:37

ChatGPT doesn't know. Like um

0:40

it didn't know why my speaker pass

0:41

wasn't working when I was trying to get

0:43

into the building and it uh

0:46

if you ask it did the Blue Jays win the

0:47

World Series, the answer is no. And I

0:49

know that because I watched the World

0:50

Series, but ChatGPT doesn't know that if

0:52

you don't enable web search because it

0:54

has something called a knowledge cutoff.

0:55

So all the training data is kind of

0:57

segmented by date and things after a

1:00

certain date are not known by ChatGPT

1:03

like unilaterally.

1:05

Uh if you ask ChatGPT help me optimize

1:07

this kernel I wrote for AMD GPUs,

1:10

it's so bad at it. And I think there's a

1:12

few reasons for this. One, it's really

1:13

hard. Two, uh there's not a lot of data

1:16

for it. But three, I think it's more

1:19

that the data that does exist is such a

1:21

small portion of its training data that

1:23

it just like can't do it very well. And

1:25

so a lot of tasks like this, which I

1:27

I would guess a lot of you face in your

1:29

jobs, like the things that are more

1:30

niche or here I call long tail, are

1:33

really hard for ChatGPT to do. Even if

1:36

you say, "Please like

1:38

please sir." [laughter]

1:39

Like I want you to learn more about this

1:41

or practice. Like it can't learn more

1:42

about this. It can't practice. It

1:44

doesn't know what to do when you ask it

1:46

that. And yeah, if you ask what are the

1:48

terms of our partnership agreement for

1:50

BlackRock, it doesn't know about your

1:51

company. Which of these shirts should I

1:53

order from Amazon? Implement a new

1:55

feature in our

1:57

company monorepo. Write an email in my

2:00

style. Diagnose this patient given their

2:02

history. What arguments did the opposing

2:04

counsel use in the Martinez settlement

2:06

negotiations?

2:08

Is this question already answered on our

2:09

company internal wiki? Like none of

2:11

these things are

2:13

possibly answered by ChatGPT because

2:15

they're not in the training data or

2:16

they're too niche or they require some

2:18

data that's not available to it.

2:21

So I think like the question I want to

2:22

talk about today is like what's the

2:24

right [clears throat] way to solve this

2:24

problem? Like if we want to build new

2:26

systems that actually know the things we

2:28

want them to know, how how should we

2:30

build them? And I think like the way I

2:33

want to think about it is like how do we

2:35

take some knowledge and inject it into

2:38

the parameters of the model? Like what's

2:39

the right way to do this?

2:41

And like the way that I think about it

2:43

and I think the way this manifests in my

2:45

research and other people's research is

2:47

there's three ways. There's

2:48

full context. You can take as much stuff

2:51

as you can and cram it into the language

2:53

model. There's RAG, or retrieval

2:55

augmented generation where you have so

2:57

many things that you can't fit them all

2:59

in and so you retrieve the most useful

3:02

ones and then feed them in.

3:04

And then there's this third thing, which

3:06

I think is like really new and no one is

3:08

doing it yet, which is training things

3:09

into weights. And I want what I mostly

3:11

want to talk about today is like why I

3:13

think we should be training things into

3:15

weights. But I'm going to start with the

3:17

other two.

3:18

And also I guess like along the way I

3:20

about 10% of the time I'm going to be

3:22

shilling my own research, but I'm going

3:24

to like try to be honest about it. And

3:25

you can just tune me out if you want.

3:28

So I think like the easiest way to solve

3:30

these problems is to put everything into

3:32

context. Like if you work at a small

3:33

company or

3:36

all you care about is like maybe the 100

3:39

World Series that have occurred, you can

3:41

kind of copy all the data and paste it

3:43

into ChatGPT or paste it into Grok or

3:45

whatever model you use and that's

3:48

finite enough that the model can

3:49

understand.

3:51

And this like works works pretty well. I

3:54

think that this is something that got

3:55

people really excited for a while a few

3:57

years ago. I have this example of like a

3:59

doctor answering a question from a

4:01

medical record. A medical record is

4:02

small enough that it can presumably be

4:04

like input into the context of the model

4:07

and the model can do pretty well. I

4:09

think there's a few problems with this.

4:11

Maybe the main one is just that it's so

4:13

expensive. Like if you do anything like

4:15

this in your day-to-day workflow, you

4:17

put like a ton of tokens in the context

4:19

to start generating. I mean one, it's

4:21

going to cost a lot of money like US

4:22

dollars. But two, it's just so slow like

4:28

you know, a few months ago I was writing

4:29

my thesis and I wrote it myself, but I

4:32

did ask for some feedback a few times

4:35

from Claude. And like the second you

4:37

paste in I I don't know, it's like

4:39

maybe 80 pages of text or something. It

4:42

like as

4:43

documents go, it's medium length. I

4:46

paste into Claude, the second you paste

4:47

into Claude, everything slows down by

4:49

10x or something. I have this stat here

4:51

that if you have 1,000 tokens of

4:52

context,

4:54

we can output 10,000 tokens per second.

4:56

If you have

4:59

128k tokens of context, we can output

5:02

130 tokens per second. So that's like

5:04

several orders of magnitude slow down

5:06

and I think we've all faced this. So

5:07

it's very annoying and it's hard to

5:08

imagine how we can get around this.

5:11

Um I'll give you like the quick

5:14

background from the research world,

5:16

which maybe people know, which is this

5:17

inherent limitation of the models we use.

5:20

The models we use are transformers.

5:21

Transformers look like this. The real

5:23

problem with transformers comes

5:26

in this one little

5:28

box right here called self-attention.

5:30

The problem is that all of the words

5:32

that go into the transformer need to

5:33

look at each other. And this has a

5:35

quadratic tendency. So if there's four

5:37

words, four tokens, maybe the matrix has

5:40

16 entries. If there are 12 tokens,

5:42

there are 144 entries. And we can manage

5:44

this for a while, but at some point it

5:46

becomes infeasible. Like especially from

5:48

a memory perspective, we can't

5:51

From a memory perspective, we can't keep

5:53

all these things in context.
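To make the quadratic growth concrete, here is a minimal sketch that counts the pairwise scores in one full self-attention matrix and the memory needed if that matrix were stored densely. The function names and the 2-bytes-per-score (fp16) figure are assumptions for illustration. Optimized kernels avoid materializing the whole matrix, so read the memory column as an upper bound on a naive implementation; the amount of computation still grows with the square of the token count.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Number of pairwise scores in one full self-attention matrix."""
    return num_tokens * num_tokens


def attention_matrix_bytes(num_tokens: int, bytes_per_score: int = 2) -> int:
    """Memory for one such matrix if stored densely (2 bytes per score = fp16, an assumption)."""
    return attention_matrix_entries(num_tokens) * bytes_per_score


# 4 and 12 tokens are the examples from the talk; the rest are typical context sizes.
for n in (4, 12, 1_000, 128_000, 1_000_000):
    entries = attention_matrix_entries(n)
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {entries:>19,} scores, {gib:,.3f} GiB per head per layer")
```

Going from 1,000 to 128,000 tokens multiplies the token count by 128 but the number of scores by 128 squared, which is 16,384.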

5:56

You might say, "Well Jack, Grok 4 has 2

5:59

million token context window."

6:01

Yeah, 2 million token context window.

6:03

It's It's a very large number. Gemini 3

6:06

dropped

6:07

during this conference and Gemini 3 has

6:09

1 million token context window.

6:11

You also might ask why did Gemini 3 not

6:14

do a larger context window even though

6:16

it came after Grok. And I think the

6:18

reason is because there's

6:19

[clears throat] a difference between

6:21

the model not breaking when you put in

6:23

that many tokens and the model actually

6:25

like properly reasoning across many

6:28

large chunks of tokens.

6:30

And I think the second part we're still

6:33

figuring out. I think people have

6:34

realized how to train models that don't

6:36

break with more and more tokens, but we

6:39

haven't really gotten to the point where

6:40

we can train models that truly work as

6:43

well on a million tokens as they do on a

6:45

thousand tokens.

6:47

And if you're more curious about this,

6:48

there's this really good report from

6:49

Chroma called Context Rot

6:53

about how performance degrades when you

6:56

add just like other stuff into the

6:58

context. So this graph shows like the

7:01

larger the context grows, even with the

7:03

same finite amount of relevant

7:04

information, the LLMs get worse and

7:07

worse. And I think like two things to

7:09

observe here that I think are

7:10

interesting. One, Claude is the best by

7:12

far. I like graphs like this because I

7:14

feel like when you talk to people, a lot

7:16

of people think Claude is the best. But

7:17

if you

7:19

measure on a lot of standard benchmarks,

7:21

it actually is worse. But then you use

7:23

it and you're like, "Oh, there's

7:23

something's better here." So I like this

7:24

because it captures what people actually

7:26

say to me. But I also like it because

7:28

once you get here, the performance is

7:30

horrible. So like if they if they enter

7:33

a bunch of relevant stuff that doesn't

7:35

actually help you solve the problem,

7:36

once you get to 10 to the fourth tokens,

7:38

which is 10,000, like the models don't

7:40

work at all. And even though they're not

7:42

breaking, like they're outputting

7:45

things that make sense and are

7:47

grammatical,

7:48

they're not actually solving the

7:49

problem. So context rot is a huge

7:51

issue.

7:52

Um

7:53

maybe like just anecdotally, if you look

7:56

up there's a ton of people saying stuff

7:57

like this. Like, "Oh, the context

7:59

window is so long, why does it not actually

8:01

work?" Or people think Claude Code when

8:03

it fills up the context window sort of

8:04

like stops working.

8:06

Um there's a ton of people working on

8:07

these efficient architectures that you

8:09

might hear about like

8:10

>> [music]

8:11

>> Mamba, state space models, linear

8:13

attention,

8:14

hybrid attention, sparse attention,

8:16

sliding window. They're all more

8:18

efficient, but they basically have the

8:20

same properties of transformers. Like

8:21

even if they can

8:23

operate uh in a faster time or with a

8:26

lower memory requirement, there's some

8:28

trade-off in the terms of performance

8:29

they give you. So even if you build a

8:31

linear attention model that can fit

8:33

infinite context, it's not good. Like

8:36

it's not going to be able to solve the

8:38

problem you have, which is how do I

8:40

actually like reason and get smarter

8:44

when I input more tokens into the model?

8:47

There's so many examples of this. I saw

8:50

this recent post. If you're like kind of

8:52

deep in the model architecture world,

8:54

maybe you've seen this. This is like a

8:55

couple weeks ago. There's new Chinese

8:57

model MiniMax M2 that's one of the state

8:59

of the art open models. And a bunch of

9:02

the other Chinese labs have been pushing

9:04

these new hybrid architectures that are

9:05

like more efficient and can take longer

9:08

context. And MiniMax M2 just didn't do

9:10

that. They just used sort of like the

9:11

regular quadratic attention that I was

9:13

showing you. And they have this really

9:15

long story about how they tried and

9:16

tried and

9:18

it's basically just not worth it.

9:19

There's like an inherent trade-off in

9:21

how much computation you use and and how

9:23

good the models are. And so even if you

9:25

can technically build a model that

9:27

doesn't break at millions of tokens,

9:30

it's not actually better for any of the

9:31

tasks they care about. So no one is

9:34

really doing this.

9:35

And I think to conclude, we think that

9:37

like we're pretty limited by the context

9:39

window in full context. There's like one

9:41

systems problem that you can't put

9:43

millions of tokens into the model. And

9:45

then there's another reasoning problem

9:47

that even if you can, the models don't

9:49

actually get better. So it's probably

9:51

not practical. And I think if if you

9:53

work in industry, I'm sure you see

9:56

document sets that are much much larger

9:59

like on the order of I don't know

10:00

billions to trillions of tokens and even

10:03

though we're getting better at training

10:05

the models and the system side we're

10:07

getting much better at running them more

10:08

efficiently, faster, cheaper, we're not

10:12

near fitting trillions of tokens into a

10:14

model. I think like that's pretty far

10:16

off.

10:17

So I would guess a lot of you are doing

10:18

RAG. How many people in this room use or

10:21

work on a RAG system on like a weekly

10:23

basis?

10:25

That's actually pretty crazy. Okay, so

10:27

over half for sure.

10:29

So now we're going to talk about RAG.

10:31

I'm going to talk about why it's good

10:33

and then I'll talk about why I think um

10:35

it's fundamentally limited and

10:38

the products of the future will use

10:40

something better than RAG.

10:44

So if you use RAG you probably use a

10:45

vector database. There are many vector

10:47

databases. I think

10:50

I know some of these: turbopuffer,

10:53

Weaviate,

10:54

another one is S3 that's Chroma. I made

10:56

this slide.

10:58

Uh

10:59

Uh there there are many different vector

11:01

databases. They all offer you like

11:02

slightly different trade-offs. They give

11:04

you your vectors for cheaper, faster, um

11:08

Vector databases are the way that memory

11:09

works in production. If you're using a

11:12

company internal question answering

11:14

system, it's it's definitely running on

11:15

RAG which is powered by a vector

11:17

database which stores embeddings.

11:20

ChatGPT memory uh uses embeddings.

11:24

Uh

11:24

Andrej Karpathy has this diagram from

11:27

last year, 2 years ago actually, of what

11:30

the

11:30

an operating system that runs on

11:32

language models would look like and he

11:34

called embeddings the file system of

11:36

LLMs. Um I think that's true in today's

11:39

terms like today, November 22nd, 2025,

11:43

probably like if you think of what

11:45

you're working on as an operating system

11:46

the file system is embeddings. But I

11:48

think embeddings are the file system of

11:50

today and they're not the file system of

11:52

the future. And that's what I'm going to

11:54

talk about today.

11:56

I I also want to point out that they're

11:57

extremely easy to use like any of the

11:59

tools I'm going to talk about at the end

12:01

of the talk that are like related to

12:03

training things into models are just

12:05

fundamentally harder. But this is just

12:07

really nice and we can all take a moment

12:08

to appreciate it. You just sort of

12:11

take your text and then you like run

12:13

this

12:14

and and that's all. So five lines of

12:16

code. That's that's really really good.
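The code on the slide is not captured in the transcript. As a rough sketch of the embed, store, and retrieve loop being described, the following uses a toy hashed bag-of-words function (`toy_embed`, a hypothetical stand-in for a real embedding model) and a plain Python list in place of a vector database. The documents and the query are made up for illustration.

```python
import hashlib
import math

DIM = 256  # toy embedding size, an assumption for this sketch


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: str, index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k stored documents most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


documents = [
    "the speaker pass desk is in the north lobby",
    "the partnership agreement renews every january",
    "kernel tuning notes for amd gpus",
]
index = [(doc, toy_embed(doc)) for doc in documents]  # the "vector database"

question = "where is the speaker pass desk"
top = retrieve(question, index, k=1)
# Only the retrieved text is placed in the prompt, not the whole corpus.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + question
print(prompt)
```

A production system would swap `toy_embed` for a trained embedding model and the list for an approximate nearest-neighbor index, but the control flow stays this short.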

12:19

The problem is they just aren't that

12:21

good and I they have a lot of problems I

12:24

think. Um which I think also okay, how

12:26

many people work on RAG or experience

12:30

a RAG system and are satisfied

12:33

completely with

12:34

>> [laughter]

12:37

>> Okay, that's great. So I think we're all

12:39

kind of in agreement here that maybe

12:40

there there could be something more.

12:42

Like even if we don't know exactly what

12:44

it is there must be something else out

12:45

there.

12:46

Um

12:47

I'll talk about a few problems that I've

12:48

run into in my own research. So let's

12:51

like start with this abstract. And so

12:52

this is the vector database that powers

12:55

RAG.

12:56

Every dot here is is supposed to be a

12:58

document. So the document goes through

13:00

the LLM. The LLM is trained to give you

13:03

just this one vector that represents the

13:05

document. I've projected them down to

13:07

two dimensions for this slide, but each

13:09

doc document is one dot. Um if you

13:12

actually look at what's in the vector

13:13

database it looks like this. So there's

13:16

lots of

13:18

numbers. There's no one in the world who

13:20

can tell tell you what this means. Um

13:24

one thing that I think is interesting is

13:26

that even though they look random and no

13:29

one can actually read them, if you build

13:31

a system to read them, it works pretty

13:33

well. So like if you're working in RAG

13:36

and you're sending someone embeddings,

13:37

you're actually sending them

13:39

something analogous to text. And I think

13:42

this is important because a lot of the

13:43

actual architectures like turbopuffer,

13:46

Pinecone, what have you, they store only

13:49

embeddings. And so like maybe there's

13:50

this false premise that if you just send

13:52

them embeddings there's no security

13:54

flaws, but actually uh even slightly

13:57

motivated person can build this system

13:59

here, this white arrow on the right,

14:01

which takes the embedding and produces

14:03

maybe not the exact same text, but

14:04

something extremely close to it. This is

14:07

what I worked on for like about a year

14:09

of my PhD. This is a

14:12

animation of like so I type in the

14:14

sentence, it goes into the embedding

14:15

model, it gets stored in a vector

14:17

database, and then we run this it's like

14:19

a multi-round correction thing. And then

14:21

by the end we actually can get most I

14:23

think our research has at a certain

14:25

length we can get 90% of text back

14:27

exactly from vector databases. So the

14:29

takeaway here is that there's no uh

14:32

security benefits to using a vector

14:34

database. And also they're very hard to

14:36

run at scale. So this is like an

14:38

inherent problem for people with

14:39

sensitive data. That's the paper.

14:42

Um

14:43

I think a second problem that I

14:45

personally have with embeddings is that

14:46

they're not adaptive. Like there's this

14:49

one universal sense of what the world

14:51

looks like that's captured in these

14:52

vectors and it's not adjustable based on

14:55

what you work on. So like to give you a

14:57

concrete example,

15:00

we embedded a bunch of databases or we

15:03

created a database of a bunch of

15:04

embeddings of credit card related

15:06

documents. I think we had half of them

15:09

that were from MasterCard and half of

15:11

them that were from Visa. But if you

15:13

actually look at where the embeddings

15:14

get stored,

15:16

um I guess it's not in this picture, but

15:17

it's like only right here. So even

15:20

though there's this like really large

15:21

space of kind of all possible semantics,

15:24

embeddings only represent like one

15:26

universal one if that makes sense. So

15:29

credit cards are actually clustered in

15:30

this like really small area and this

15:32

means search works bad. So

15:35

like to give you a concrete example, if

15:38

you take these two documents, one's from

15:40

Visa, one's from MasterCard, in these in

15:42

the system we were designing like if you

15:44

search something that's about a Visa

15:45

query, you should never receive

15:47

MasterCard. But they're all so close to

15:49

each other that they're actually like

15:50

completely all jumbled together. And

15:52

this is just like a problem with all

15:54

conventional embedding mechanisms. So we

15:56

built this new model that lets you feed

15:59

in some like surrounding documents. So

16:01

like to give you an example, this is

16:02

kind of the first half of our model. We

16:04

would feed in a bunch of credit cards. I

16:07

just put Amex, but there actually was no

16:09

Amex when we did it. And um

16:12

and the model kind of works like this.

16:14

Like when it produces the embedding for

16:15

the text, which is here, it also looks

16:17

at a bunch of surrounding documents so

16:19

it can kind of know like, okay, this

16:21

text is about Visa, but also all the

16:23

other documents are about either Visa or

16:25

MasterCard. And it gets trained so that

16:27

it can like dynamically adjust the

16:30

embeddings based on like the surrounding

16:32

context. So I thought this was cool.
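The model described here is a trained two-stage neural network, which is beyond a short example. As a loose analogy only, the sketch below swaps in classical IDF-style reweighting: words that appear in every surrounding document get weight zero, so what remains is whatever distinguishes a document within its own corpus. The function names (`plain_embed`, `contextual_embed`) and the example documents are hypothetical, and the numbers it produces are not the ones reported in the talk.

```python
import math
from collections import Counter


def plain_embed(text: str) -> dict[str, float]:
    """Context-free stand-in: raw word counts."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def contextual_embed(text: str, neighbors: list[str]) -> dict[str, float]:
    """Reweight each word by how rare it is among the surrounding documents."""
    n = len(neighbors)
    doc_freq = Counter(word for doc in neighbors for word in set(doc.lower().split()))
    weighted = {}
    for word, count in Counter(text.lower().split()).items():
        # Smoothed IDF: exactly 0 when the word occurs in every neighbor.
        idf = math.log((n + 1) / (doc_freq[word] + 1))
        weighted[word] = count * idf
    return weighted


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


visa_a = "visa credit card annual fee"
visa_b = "visa credit card annual fee waiver"
mc_a = "mastercard credit card annual fee"
mc_b = "mastercard credit card annual fee dispute"
corpus = [visa_a, visa_b, mc_a, mc_b]  # the "surrounding documents"

plain_cross = cosine(plain_embed(visa_a), plain_embed(mc_a))
ctx_cross = cosine(contextual_embed(visa_a, corpus), contextual_embed(mc_a, corpus))
ctx_same = cosine(contextual_embed(visa_a, corpus), contextual_embed(visa_b, corpus))

print(f"Visa vs MasterCard, context-free: {plain_cross:.3f}")
print(f"Visa vs MasterCard, contextual:   {ctx_cross:.3f}")
print(f"Visa vs Visa, contextual:         {ctx_same:.3f}")
```

Without context the Visa and MasterCard documents look nearly identical because they share "credit card annual fee". Once those shared words are discounted, the brand is what separates them, which is the effect the contextual model is trained to achieve.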

16:34

And it works better. So like in this

16:36

Visa MasterCard case, the similarity

16:39

between a Visa and MasterCard is now

16:40

0.144. And I think anything containing

16:43

Visa has a much higher similarity.

16:46

So that's like maybe correcting one

16:47

small thing. Um it works better on like

16:50

out of domain stuff. So we have a

16:53

forgot what the climate data set is. A

16:54

data set of arguments, a data set of

16:56

financial questions, and then I think

16:59

like scientific articles, and I guess

17:03

the point I'm making here is that if you

17:04

do this contextual thing, embeddings

17:06

work a bit better. So like if you build

17:08

them in a way that they can dynamically

17:09

adapt to the domain, they can solve some

17:12

problems, but I think at the end of the

17:14

day they're still embeddings. And so

17:17

you're

17:18

Yeah, yeah.

17:20

Was this approach picked up by anyone

17:22

else? Do you know if they know it?

17:23

Yeah, I think we know they're using it

17:26

at OpenAI in practice like behind the

17:28

scenes now that embedding models are

17:29

contextual. It's a pretty it's kind of a

17:31

free lunch. Like you add these extra

17:34

tokens. Uh

17:36

I guess it's it's kind of hard to build.

17:38

Like you have to build this two-stage

17:39

model and then uh when you embed

17:41

something you have to grab some

17:42

embeddings from the surrounding

17:44

documents, but once you build it it just

17:46

works, you know, better on like

17:49

especially on long tail stuff. I think

17:50

if you look at um

17:52

like MS MARCO, which is this large web

17:54

scale

17:55

embedding task, it really doesn't get

17:57

much better when you add surrounding

17:59

stuff because like it's already pretty

18:02

global if that makes sense. But if you

18:03

look at like really niche things,

18:05

embeddings work a lot better. So yeah, I

18:07

I know it's productionized at some other

18:09

companies. Um I think if you're actually

18:11

building an embedding model at your

18:12

company and you want to put effort into

18:15

making it better, this is probably like

18:17

the easiest way besides data. Probably

18:19

the first way is data. Um

18:22

There's some recent work that I think is

18:24

worth mentioning about like fundamental

18:26

limitations of embeddings and vector

18:28

databases and RAG which says that like

18:30

if you

18:32

it's not even really worth explaining,

18:33

but there's like some

18:36

uh

18:37

there there's some relationships that

18:39

cannot be captured in fixed dimensional

18:40

vector. Like you have to reason about

18:42

things to answer all possible tasks. And

18:44

this is this kind of combinatorial setup

18:47

where there are so many possible

18:48

relationships that the embeddings simply

18:50

can't store them. And so like in theory

18:53

embeddings are obviously

18:56

not the best way to do all possible

18:58

relationships between text.

19:01

But I think everyone knows that RAG has

19:03

its issues. Like I'm glad that no one

19:04

raised their hand when I asked if anyone

19:06

was going to like really stand up and

19:08

speak for RAG. And like we can I I

19:11

actually think this is a hard point to

19:12

make. Like everyone kind of knows this,

19:14

but it's hard to come up with examples

19:16

that retrieval can't solve in practice.

19:18

Like speaking as someone who's recently

19:20

sat down and tried to make benchmarks

19:22

for tasks that I care about, it's hard

19:25

to express questions that require kind

19:28

of this like latent reasoning over

19:29

multiple documents in a way that RAG

19:32

doesn't solve. But they do appear. Like

19:35

um anything that kind of requires

19:38

association between multiple things or

19:40

questions that are they're like sort of

19:42

implied but not explicitly answered by

19:44

the documents are just not solvable by

19:47

current techniques. And also if you have

19:49

interesting examples of this would love

19:50

to hear after after the presentation. Um

19:55

Hopefully I've made my case that I think

19:58

rag Oh, yeah, yeah, go ahead.

20:00

I'm curious if you would classify

20:02

agentic search as RAG as well. Yes,

20:05

that's a good question. So, I guess the

20:06

way I think of agentic search is like a

20:09

model that can grep and it makes a bunch

20:11

of queries in a row and then it

20:13

responds. Um

20:15

Yeah, that's that's a really good

20:17

question. I think

20:19

I think I wouldn't classify it as RAG,

20:22

but I think it has

20:24

different fundamental limitations that

20:26

are also tough to overcome. Like what

20:28

you what you would really want is like a

20:30

model that reads the entire thing and

20:33

reasons about every possible

20:34

relationship and then answers. And I

20:36

think in theory maybe you could build an

20:38

agentic RAG system that does that, but

20:40

it would be very expensive.

20:42

I think cuz [clears throat] isn't that

20:44

isn't that oh isn't deep research in the

20:47

direction of that where like it goes

20:49

through a bunch of hundreds of thousands

20:51

of sources, but then what ends up in

20:52

context is only like a small subset of

20:54

those. Yeah, yeah, I actually think deep

20:57

research is like really in the right

20:58

direction. Like they're trying to

21:00

do something that's a little bit higher

21:02

level and requires a lot of compute.

21:05

Like I think um anything that works

21:07

better than RAG is going to be more

21:09

expensive. And so like just the property

21:12

that it takes a while and it makes a lot

21:14

of searches and it thinks a lot is like

21:16

good. I think that there's probably a

21:19

more elegant way to train like a really

21:23

big kind of deep research-esque system.

21:25

But I think that's that's actually a

21:28

a good way of doing this and and not the

21:29

one that I'm talking about today, but

21:31

it's very promising as well. Like maybe

21:33

the question is like are you willing to

21:35

spend a lot of money at training time or

21:38

at inference time? And deep research is

21:39

like kind of they don't spend a lot of

21:41

money to train it, but it's willing to

21:42

wait for a long time at inference. And I

21:44

think the things I'm going to talk about

21:46

today are more like if you're willing to

21:47

spend a lot of money up front and you

21:49

get a really smart model that knows all

21:51

your data already. Um

21:53

and it's really cheap to do inference.

21:55

So, it's like kind of different sides of

21:57

the same trade-off. And I think like a

21:59

good way of thinking about these things

22:00

is like to get better models you're

22:02

going to need to pay somewhere, you

22:03

know? Like you're either going to need

22:05

to like generate better data and spend

22:07

more time on the data, you're going to

22:08

need to spend time on training, or

22:09

you're going to need to spend time on

22:11

inference. And a nice thing about RAG is

22:12

it kind of just works, but anything

22:14

better will cost more. Yeah. Uh getting

22:17

back to your example of MasterCard

22:19

versus Visa. Yeah, sure. I I I don't

22:21

know if that's in your presentation

22:22

later, but what are your thoughts on

22:24

using knowledge graph for that?

22:26

As kind of augmenting

22:28

It's a good question. Maybe ask me

22:30

after. I have to think about knowledge

22:32

graphs. It's been a while.

22:33

Um

22:34

so, let's talk about how to learn things

22:36

in weights.

22:37

Um I think like the question that we

22:40

want to get at is like okay, so say we

22:42

have the example I showed earlier or

22:45

like you have a small data set you

22:46

collected from your own personal work

22:48

and you want to teach it to the model.

22:49

It's

22:50

one thing to put it into context and

22:52

that's a good way to get started. And if

22:54

you don't have that much data, that'll

22:55

get you pretty far. But I think we can

22:57

do more. Like there's some questions

22:59

that even when your data is in context,

23:01

the model can't answer. And so what I

23:03

want us to think about is like how can

23:05

we inject things into a model

23:07

uh in such that it learns better than in

23:09

context and also that it doesn't forget

23:11

everything that it already knows.

23:13

Um I want to point out something from my

23:15

own research, which is that there is a

23:17

fixed capacity to language models. Like

23:18

one way to think about this is ChatGPT

23:21

has like only so many parameters. We

23:23

have this measurement that it can store

23:25

3.6 bits per parameter. So like uh I

23:30

think a billion parameter model is like

23:33

at 3.6 bits is maybe like

23:36

4 terabytes? Is that right? 4 gigabytes?

23:40

What a Yeah, thank you. Thank you. Um

23:43

this is like some information, but it's

23:44

actually not that much. So, the models

23:47

they basically do their best to fit the

23:50

training distribution and they throw

23:51

everything else out. So like to give you

23:54

a concrete example, this morning I was

23:56

putting this together. I asked Claude,

23:57

what is the capital of the smallest

23:59

province in Tajikistan? And

24:02

it gave me a very detailed answer. It's

24:03

actually very impressive. No web search,

24:05

the model just knows this in its

24:07

parameters. I guess I'm arguing that

24:09

this is bad. Like if you want to build a

24:11

system that can answer

24:13

really detailed documentation questions

24:16

for your company,

24:17

you don't need it to know what the

24:19

capital of the smallest province in

24:20

Tajikistan is. And since we know these

24:22

models have fixed capacity, I think that

24:25

this is bad. Like what we really want is

24:27

to know how to like find this kind of

24:28

thing and just like delete it and

24:30

replace it with the things we care

24:31

about. And I think that's like what

24:33

we're getting towards, but we don't 100%

24:34

know how to do that yet.
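For a sense of scale, the 3.6 bits-per-parameter figure quoted earlier can be turned into bytes with one line of arithmetic. This is a back-of-the-envelope sketch that takes the figure at face value; the model sizes in the loop are illustrative assumptions, not measurements of any particular model. A billion parameters comes out to 3.6 gigabits, which is about 0.45 gigabytes.

```python
BITS_PER_PARAM = 3.6  # figure quoted in the talk


def capacity_bytes(num_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Approximate storage capacity of a model, in bytes (8 bits per byte)."""
    return num_params * bits_per_param / 8


# Illustrative parameter counts only.
for name, params in [("1B", 1e9), ("8B", 8e9), ("70B", 70e9)]:
    gigabytes = capacity_bytes(params) / 1e9
    print(f"{name:>4} parameters -> about {gigabytes:.2f} GB of stored information")
```

Under this estimate even a 70-billion-parameter model holds only a few tens of gigabytes, which supports the point that capacity spent on facts you never need is capacity not spent on your own data.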

24:36

Oh, sorry. So, when I originally put

24:38

this talk together, the way I was

24:39

thinking of explaining it is calling it

24:41

a neural file system. And then I decided

24:43

to just call it weights. I think it's

24:45

easier to understand, but this slide

24:47

still says neural file systems. Um

24:50

So, I think there's a few questions

24:52

here. Like we want to train all our data

24:53

into the model. One question is like how

24:55

do we train it? Do we do RL? Do we do

24:57

SFT? Uh what's what even is the data? Um

25:01

another question is like out of

25:03

uh all the possible data, what do we

25:06

use? Do we just like fine-tune directly

25:07

on our data? Do we try to generate more?

25:11

I think my argument is that we should

25:12

try to generate more and I'll show you

25:14

why. And then there's an architectural

25:16

question. Like I think for a long time

25:18

people really cared in the machine

25:21

learning deep learning community about

25:22

like what architectures we should use.

25:24

And then for like what, 8 years,

25:27

everyone who knows what they're doing has

25:28

really just been using transformers

25:30

unless they're trying to make them

25:31

better. And I think now in this world

25:34

where we're trying to train stuff into

25:35

models, like

25:37

like if you think of, okay, a world where

25:39

each of us has our own model or

25:40

maybe multiple models and those models

25:42

are getting updated a lot, I think we

25:44

start to care about architecture again.

25:46

And I'll and I'll tell you why and like

25:48

what I think the options are.

25:50

So, [clears throat] first let's talk

25:51

about learning.

25:53

Um

25:55

So, I think like the mental [snorts]

25:57

model here, which I mentioned before, is

26:00

like we're trying to train the model to

26:02

learn the data as best as it possibly

26:05

can and it's going to be expensive. So,

26:08

like we didn't like RAG, but also RAG

26:10

didn't cost us very much money. I think

26:12

to do better than RAG, we're going to

26:13

have to like pay some GPU points. And

26:17

that's just like the state of the world.

26:19

Okay, fine. So, this is our model. It's

26:22

like this homogeneous blob of data and

26:25

this is our data. So like maybe we have

26:27

the MasterCard data set or maybe we

26:29

collected data about ourselves or maybe

26:31

I uh

26:32

collected all my traces from coding in

26:34

November and December and I want to like

26:36

train the model to learn my problems

26:38

better. What do I do? How do I actually

26:40

do this? Um

26:43

Let's let's like start with the dumbest

26:45

possible approach and just like see what

26:47

happens. So, say

26:49

uh we start with a data set

26:51

and we just train on it.

26:54

Um like using I guess next token

26:56

prediction. So, we actually ran this

27:00

little experiment. This is like uh 3M.

27:03

It's a company that makes duct tape.

27:06

And

27:07

um

27:08

this is like some financial reports. So,

27:10

maybe like you're working there and you

27:12

really don't want to read all of this.

27:14

So, you just want to ask the model to

27:15

like really understand this and be able

27:18

to answer questions. And like RAG isn't

27:19

really working cuz it's like this weird

27:21

structure and there's a lot of ways the

27:23

documents interrelate. Okay, cool. So,

27:25

we're just going to like train the model

27:27

using next token prediction. See what

27:30

happens. You know what? Actually, even

27:32

if you don't train the whole model, um

27:34

you still get zero loss. So, the

27:37

model can perfectly memorize the

27:39

entire 3M 10-K financial report. Um

27:44

it's extremely impressive.

27:46

Okay, so now let's talk to it. So, so we

27:48

did this and then we didn't want to ask

27:50

anything that's like exactly present in

27:52

the document cuz we want to see if the

27:53

model's actually good. So, we started,

27:55

you know, like everyone loves to test

27:56

poems. So, we started with a poem. We

27:58

said, "Can you write a poem about 3M in

28:01

fiscal year 2025?"

28:03

So, register your bets. And what do you

28:06

think happened?

28:09

It's terrible. Someone said it. It says,

28:12

"The passage of a passage is a poem."

28:15

End of sentence.

28:17

It's crazy.

28:18

>> [laughter]

28:19

>> Yeah. So, now maybe we ask like why does

28:21

this happen and and how do we fix it?

28:23

So, unfortunately this doesn't work and

28:25

I actually think this is like one of the

28:26

reasons why people haven't been doing

28:27

this yet is because the dumbest possible

28:29

approach usually does work in machine

28:31

learning, but in this case we have to do

28:33

something a little bit more

28:34

sophisticated. Um

28:37

So, maybe take a second and think about

28:38

like what you would do when you're

28:39

facing this problem at work or in a side

28:41

project. Um I think there's like two

28:44

things we need to fix. One is that um

28:48

the data

28:50

is not it's not exactly what we want to

28:52

train on, I think. And two is that we

28:55

probably don't want to update the entire

28:58

model because what we did there was

28:59

basically overwrite all the, you know,

29:02

stuff about Tajikistan and everything

29:04

else that's in the model with just like

29:05

this 3M knowledge. And I think that's

29:08

like too specific and then the model is

29:10

just obsessed with 3M and it'll only

29:12

produce exact copy sentences from the

29:15

document. That's that's clearly too

29:17

much. So, I think we need a better way

29:18

to update the model and we need a better

29:21

way to change the data.
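The failure described here, zero training loss but nothing useful once you go off-script, can be reproduced in miniature. The sketch below is not the 3M experiment: it swaps the transformer for a count-based next-token table (an assumption made so the example runs with no dependencies), and the document and prompt are made up.

```python
import math
from collections import Counter, defaultdict

CONTEXT = 3  # number of previous tokens the toy model conditions on (illustrative)


def train_next_token(tokens, context=CONTEXT):
    """Count which token follows each context window in the training text."""
    table = defaultdict(Counter)
    for i in range(context, len(tokens)):
        table[tuple(tokens[i - context:i])][tokens[i]] += 1
    return table


def average_loss(table, tokens, context=CONTEXT):
    """Mean next-token negative log-likelihood; infinite if a context was never seen."""
    total, n = 0.0, 0
    for i in range(context, len(tokens)):
        counts = table.get(tuple(tokens[i - context:i]))
        if not counts or counts[tokens[i]] == 0:
            return math.inf
        total -= math.log(counts[tokens[i]] / sum(counts.values()))
        n += 1
    return total / n


# Made-up stand-in for a financial report.
document = "net sales rose four percent while operating income fell two percent in the period".split()
# A request that never appears verbatim in the document.
prompt = "write a poem about net sales in the period".split()

model = train_next_token(document)
train_loss = average_loss(model, document)  # every context is unique, so the text is memorized
prompt_loss = average_loss(model, prompt)   # unseen phrasing: the table has nothing to say
```

A real fine-tuned model degrades more gracefully than a lookup table, but the shape of the problem is the same: fitting the exact token sequence says little about handling a differently worded request.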

29:23

Um there's this pretty relevant work. I

29:25

don't know if you follow this like

29:27

nanochat thing from Andrej Karpathy. Shout

29:30

out. I think it's very educational. And

29:32

he had a really good question, which is

29:34

like he built this small LLM and trained

29:36

it from scratch and everything. And then

29:38

he wanted to teach it about himself. And

29:41

okay, maybe the first thing you would

29:43

try is RAG. You put like a little

29:44

database of information about yourself,

29:46

but that's only scalable to a certain

29:49

amount and then the model can't really

29:51

like combine things. It can only

29:53

kind of regurgitate facts. And so he

29:56

wants to actually teach it

29:57

properly, he says, meaning in weights.

30:00

And so notice he doesn't just like take

30:02

one example and and train the model

30:04

using next token prediction. He does

30:06

something a bit more complicated. He

30:08

like generates this task, or, you

30:10

don't have to care about the specifics,

30:12

but basically he makes a

30:13

diverse training data set of examples

30:15

that look like the thing he cares about

30:18

and then trains on it. And if you go,

30:19

you can find this it actually does work

30:21

pretty well, which is cool. So he's able

30:23

to teach a novel behavior to a model by

30:25

like generating a lot of synthetic data

30:27

that looks like the example he cares

30:29

about and then fine-tuning the model for

30:31

a little bit and it and it learns.
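The recipe here, turning a handful of facts into many differently phrased training examples, can be sketched as follows. In the approach described an LLM does the rewriting; this sketch substitutes fixed string templates so it runs on its own, and the facts, the template wording, and the `synthesize` helper are all hypothetical.

```python
import itertools

# Hypothetical facts as (subject, relation, value) triples.
FACTS = [
    ("Acme Corp", "headquarters", "Springfield"),
    ("Acme Corp", "founding year", "1999"),
]

# Each template presents the same fact in a different surface form.
TEMPLATES = [
    "The {relation} of {subject} is {value}.",
    "{value} is the {relation} of {subject}.",
    "Q: What is the {relation} of {subject}? A: {value}.",
    "User: Tell me the {relation} of {subject}.\nAssistant: It is {value}.",
]


def synthesize(facts, templates):
    """Render every fact with every template to get varied training strings."""
    return [
        template.format(subject=s, relation=r, value=v)
        for (s, r, v), template in itertools.product(facts, templates)
    ]


examples = synthesize(FACTS, TEMPLATES)
```

The point of the variety is that the model sees each fact as a statement, a question, and a dialogue turn, rather than as one fixed string to memorize.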

30:34

There's a paper that's really good

30:36

that's from last year from some folks at

30:38

Stanford called Synthetic Continued

30:40

Pretraining, and they have the same

30:41

problem. So they have like a really

30:43

small data set and they want to teach

30:44

the model the data set without like

30:46

breaking the model essentially.

30:48

And

30:50

they have this kind of fancy way of

30:52

generating synthetic data by extracting

30:54

entities, but I think the important part

30:56

is that they take a small data set and

30:59

they generate like a very large more

31:01

diverse data set representative of the

31:04

thing that they care about. And this is

31:05

something that like breaks the whole

31:08

like conventional machine learning

31:09

paradigm. Like they only have a small

31:12

training data set, so

31:14

uh what you learn in school would tell

31:15

you that you would just like overfit and

31:17

there's nothing you can do, you just

31:18

have to go back and collect more data.

31:20

But actually because LLMs are so good

31:22

now, we can do this second thing where

31:24

we generate like a much larger training

31:26

data set. It really contains only the

31:29

like facts that were present in the

31:30

original data, but it's so large that

31:32

you can train a model on it. It's like

31:34

very strange and only recently started

31:36

working, but it does work. I'll show you

31:38

some evidence. Um the green line is what

31:41

happens when you do the dumb thing you

31:43

tried before. So you just like fine-tune

31:45

the model on the data. It actually

31:47

starts at the black line, so

31:48

[clears throat] surprisingly actually

31:48

gets worse. So it like memorizes the

31:51

data so well that it can't answer any

31:53

slightly different questions about it.

31:55

Um the thing they do, they have like two

31:57

different ways of doing it, but

31:59

basically like generating lots of

32:00

synthetic data that describes the things

32:02

in the original data set. It works very

32:05

well. Like at some scale, I guess

32:07

100 million tokens close to a billion,

32:10

they can actually outperform GPT-4 on

32:12

this data set, which is really cool. So

32:14

I think like the takeaway here is

32:17

even though you don't have a lot of

32:18

data, if you're willing to generate like

32:20

a large synthetic data set that

32:22

describes the data you have, you can

32:24

actually train a model on it and it

32:25

works really well.
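One way to get diversity without adding new facts, loosely in the spirit of the entity-based approach mentioned above, is to enumerate pairs of entities from the source text and ask an LLM to write about each pair using only that text. The sketch below only builds the prompts; the entity list, the prompt wording, and `build_prompts` are assumptions, and the LLM call itself is left out.

```python
import itertools


def build_prompts(source_text, entities):
    """One generation prompt per unordered entity pair, grounded in the source text."""
    prompts = []
    for a, b in itertools.combinations(entities, 2):
        prompts.append(
            f"Using only the document below, explain how {a} relates to {b}.\n\n"
            f"Document:\n{source_text}"
        )
    return prompts


# Made-up document and hand-picked entities; a real pipeline would extract them.
source = "Acme Corp opened a plant in Springfield in 1999 to produce adhesives."
entities = ["Acme Corp", "Springfield", "1999", "adhesives"]
prompts = build_prompts(source, entities)
# 4 entities give 6 pairs, so one short document yields 6 distinct generation prompts.
```

Because the number of pairs grows as n(n-1)/2, even a small document with a few dozen entities can seed hundreds of prompts, which is roughly how a small corpus gets expanded into a much larger one.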

32:28

There's a bunch of other papers that do

32:29

this. One is called Active Reading.

32:32

Um they basically ask the LLM what types

32:35

of things should we generate and then

32:36

they generate from it. There's

32:38

self-study, which is from this

32:39

Cartridges paper, which is more like

32:41

question answering, like asking the

32:42

model to like quiz itself. And then

32:45

there's this rephrasing the web thing. I

32:47

didn't realize my

32:50

Whatever. A rephrasing the web thing

32:52

where they kind of like rephrase the

32:53

entire pre-training data set. So this

32:55

actually works at scale in kind of a

32:57

surprising way. Um and there's a lot

32:59

more work in this direction. So I'm

33:00

really excited about this like and I'm

33:02

kind of monitoring it. There's a company

33:03

called Datology that's doing this really

33:05

well. They're like generating really

33:07

high quality synthetic data. It's just

33:09

like not something that used to be

33:10

possible until very recently when LLMs

33:14

crossed some threshold that they're like

33:16

able to generate data that's good enough

33:18

to actually train themselves on. Oh,

33:20

there's actually something pretty cool

33:21

that's not in the slides called

33:22

Self-Adapting Language Models.

33:25

Self-edit. It's called SEAL, S E A L,

33:29

and they

33:30

ask the model what data to generate to

33:32

make itself better. And under some like

33:35

constrained scenarios, this is actually

33:36

working. So that's like actually quite

33:38

bizarre.

33:40

And like obviously it doesn't work

33:41

infinitely or else they would have

33:42

caused an intelligence explosion. But

33:45

the fact that it works at all is like

33:47

really remarkable and I think like worth

33:49

monitoring. So

33:52

in conclusion for this section, we want

33:53

to train things into weights. We can

33:55

generate large synthetic data sets that

33:57

describe pretty small data sets and

34:00

it works fine.

34:02

Um

34:03

now I think the money question here is

34:05

like how do we inject the information

34:07

into the model? I think before I

34:08

mentioned we were training all the

34:09

parameters and we tried it and it worked

34:11

really badly, and this is a problem

34:14

that's been around for a long time. It's

34:17

called like catastrophic forgetting.

34:19

Um even in old school machine learning

34:21

like you train a model to

34:23

recognize handwritten digits and then

34:24

you train that model to recognize house

34:25

numbers and it's no longer able to

34:27

recognize handwritten digits. This is

34:29

like a very well-known problem. There's

34:30

a lot of like theory and like approaches

34:33

proposed to solve it, but no one really

34:34

knows how to solve it. It's very very

34:36

hard. Um

34:38

but I think there are some easy ways we

34:41

can get around it in the conventional

34:43

paradigm where we have like this big

34:45

pre-trained ChatGPT transformer.

34:48

Uh instead of retraining the entire

34:49

model, there's a few different ways we

34:51

can do it. I mean the first one

34:53

is retraining the entire model. So the

34:55

things we're training I'm highlighting

34:56

in blue here. That's like if we take our

34:58

transformer and we update all the

34:59

parameters,

35:00

we're probably going to forget stuff.

35:02

Um there's another one that's pretty

35:04

cool called prefix tuning where you just

35:06

train the KV cache. Um I mean we'll

35:09

like skip the details for now, but ask

35:11

me if you have questions. Prefix

35:12

tuning's cool. Um another way is since a

35:16

lot of these models are called like

35:17

mixture of experts and they have this MLP

35:19

layer in them, you can add another

35:22

part to the MLP that is optionally

35:24

routed to and used and that's like

35:26

pretty scalable. I think people try

35:27

this.

35:28

Um

35:29

there's another approach where,

35:31

instead of like another MLP, you

35:33

build this thing called a memory layer,

35:35

which is like a big lookup table. I

35:36

think memory layers are really good. And

35:39

let me pause and say now this part of

35:41

the talk is getting close to purely

35:43

speculative. This is like the things

35:45

that are like they exist and like

35:47

someone's going to do this and someone's

35:48

going to use like one of them, but I

35:50

really don't know what the right answer

35:51

is. Um another one is called LoRA, so

35:54

low-rank adaptation. You probably heard

35:56

of this very like hot topic. Um they

35:59

kind of like train a small

36:01

matrix or a few small matrices to adapt

36:05

the linear layers. So it's like if your

36:06

model's 10 billion parameters, maybe you

36:08

train 10 million parameters that can

36:11

like control it. Um

36:14

and if we look at them together, maybe

36:15

it's not super obvious which thing would

36:18

work best. Like ICL is just like putting

36:20

stuff in context. So we have in context,

36:23

RAG, full fine-tuning. We could do the

36:25

memory layers and MLP, cartridges, which

36:28

is prefix tuning. And we can do LoRA.

36:30

You could also add something to the

36:32

mixture of experts. I think to me it's not

36:34

like clear and I'm not positive that it

36:36

matters which one we do. Like I think

36:39

the main thing is like we have this

36:40

giant model and we're adding a tiny bit

36:43

to it to control it and train only those

36:46

parameters. That way we retain most of

36:48

the information in the model. I think

36:49

that's like the most important part.
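The shared idea across these options, a frozen base plus a small trainable add-on, is easiest to see with the low-rank case. The following is a minimal sketch in plain Python, not any library's implementation: the tiny matrices, the "trained" values, and the function names are all made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def adapted_forward(W, A, B, x):
    """y = W x + B (A x). W stays frozen; only A and B would receive gradients."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


def trainable_fraction(d_in, d_out, rank):
    """Adapter parameters divided by the parameters of the frozen layer."""
    return rank * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0, 1.0]]              # rank-1 down-projection (1x2)
B_zero = [[0.0], [0.0]]       # up-projection initialised to zero (2x1)
B_trained = [[1.0], [2.0]]    # pretend values after some training

x = [2.0, 3.0]
before = adapted_forward(W, A, B_zero, x)    # identical to the base model
after = adapted_forward(W, A, B_trained, x)  # base output plus a low-rank correction
```

With the up-projection at zero the adapted layer reproduces the base model exactly, which is why such adapters start out forgetting nothing. For a hypothetical 4096 by 4096 layer at rank 8, `trainable_fraction` comes out to 1/256 of the layer's parameters.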

36:52

But I think for the end of this talk,

36:53

I'll just talk through like

36:55

what I think people are doing in this

36:57

space up to like the minute and then

37:00

you can make up your own mind what you

37:01

think the right way to do it is.

37:03

So let's talk for a second about what

37:04

properties we want. I think we want um

37:07

we want our changes to the model to be

37:09

very small. Like say you're serving a

37:11

model to each person, you actually can

37:14

do it, but you have to use one of these

37:15

like parameter efficient methods. If

37:17

you're trying to fine-tune a new Kimi

37:19

for each person, Kimi's like a

37:20

terabyte, it's a trillion parameters.

37:22

It's just like not even storable, let

37:25

alone servable. Um we want something

37:27

that's resistant to forgetting like we

37:29

said, so it would be nice to have

37:31

an architectural change that's both

37:33

small and makes the minimal impact on

37:35

the model as it is now because the model

37:36

as it is now works really well. Um and

37:40

preferably high capacity. I think like

37:42

changes that are really expressive and

37:45

can capture a lot of facts in few

37:47

parameters are the ones that we prefer.

37:49

And we want to be able to do inference

37:50

quickly. As like a small aside, you

37:54

actually can do this quickly with a lot

37:55

of um

37:57

a lot of these methods. Like maybe some

37:59

of you have seen Tinker, this new

38:00

training API from Thinking Machines.

38:02

It's basically all predicated on this

38:04

idea that you can

38:06

you can serve one model per person as

38:08

long as you do LoRA and batch the LoRAs.

38:11

And there's like it's actually most

38:12

interesting from a systems

38:13

perspective. There's like ways you can

38:15

train it and train each one separately

38:17

and there's ways you can do inference

38:18

and it basically has no cost, um which

38:21

is really interesting just cuz like the

38:22

base model doesn't change and we all

38:24

share the same base model. So all the

38:26

ideas I'm going to talk about are

38:27

kind of like in the same direction as

38:29

Tinker. Um

38:32

we can think about like whether certain

38:34

methods might learn more or forget more.

38:37

Um

38:39

so this is comparing LoRA to full

38:41

fine-tuning. So LoRA makes a tiny change

38:43

to the model. Full fine-tuning updates

38:45

the entire model. And on two different

38:47

settings, they show like LoRA here is

38:49

like purplish or pink. The pink one's a

38:51

little bit smaller capacity. Um it

38:54

basically doesn't do as well at least

38:56

when you're doing SFT.

38:58

Uh LoRA can learn a little bit less,

39:01

but also if we look at how much it's

39:03

degrading, it forgets less. So this

39:05

paper's called LoRA Learns Less

39:07

and Forgets Less. And it's actually

39:10

a very nice finding. So like if you want

39:13

to at least teach a model via SFT

39:15

and you use one of these low-rank or

39:17

parameter efficient methods like all the

39:18

ones I described, they're going to make

39:20

a small change to the model in a way

39:22

that it's probably not going to be as

39:23

expressive as full fine-tuning, but it

39:25

also doesn't destroy a lot of the

39:26

knowledge. Um

39:29

here's something going in the exact

39:30

opposite direction. This is the result

39:31

from Thinking Machines showing that they

39:33

think LoRA is about as good as full

39:36

fine-tuning, which is interesting

39:38

because they're doing RL. So it's like

39:40

maybe dependent on the training

39:42

mechanism, like if you do RL, maybe it

39:44

makes small updates and

39:46

um

39:47

you can do LoRA, you can do memory

39:49

layers, but for SFT, it really has to

39:51

store a lot of information, so you

39:53

really have to do full fine-tuning. I

39:54

think that's the takeaway I have, and I

39:56

actually have a paper that's like

39:58

kind of

39:59

blocked for legal reasons, but coming

40:01

out soon. Um here's one result from my

40:03

paper that's relevant to this. So, we

40:06

have this like tiny LoRA thing that's

40:07

even smaller than LoRA. And well,

40:10

there's actually LoRA-XS which already

40:11

exists, and then we made tiny LoRA which

40:13

is even smaller. And if you're doing RL

40:15

on GSM8K

40:17

math reasoning, [clears throat]

40:18

you can train

40:20

14 parameters and get like 91% accuracy,

40:24

which is pretty crazy. I think um

40:26

there's like a lot of reasons for this.

40:28

Like RL makes really tiny changes. I

40:30

think with this QLoRA model, like, something

40:32

fishy is going on with the training

40:34

data. Do you have a one parameter

40:37

experiment? Uh yeah, yeah, it's just one

40:39

parameter. It actually learns It gets 5%

40:42

better with one parameter.

40:43

>> [laughter]

40:45

>> Pretty cool.

40:46

>> Yeah, yeah, it's it's it's really nice.

40:48

I think um

40:50

Literally the smallest. Yeah, yeah, the

40:52

smallest thing you can possibly train.

40:54

It's more like you you generate a lot of

40:56

random projections, and then you control

40:58

them all with one number, if that makes

41:00

sense.

41:01

Like the model actually changes a lot,

41:03

but the only thing you can actually

41:05

train and store is the one parameter.
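The one-parameter idea, a fixed random direction scaled by a single trainable number, can be sketched like this. The paper is unreleased, so this is only one reading of the sentence above and not the authors' method; the seed, the matrix sizes, and the function names are assumptions.

```python
import random


def random_direction(seed, rows, cols):
    """Fixed pseudo-random matrix; it can be regenerated from the seed, so it is never stored."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def apply_scalar_update(W, alpha, seed):
    """Return W + alpha * R, where R is fixed and alpha is the only trainable number."""
    R = random_direction(seed, len(W), len(W[0]))
    return [[w + alpha * r for w, r in zip(w_row, r_row)] for w_row, r_row in zip(W, R)]


W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
unchanged = apply_scalar_update(W, alpha=0.0, seed=7)
shifted = apply_scalar_update(W, alpha=0.1, seed=7)
```

Every weight moves when `alpha` changes, yet the only thing that has to be trained and saved is `alpha` (plus the seed), which matches the remark that the model changes a lot while the stored update is a single number.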

41:08

Uh tell me more about it later.

41:10

>> Yeah.

41:10

Um yeah, it's pretty cool. Um

41:14

This is another result that's like kind

41:16

of in the mix, but I'm not sure how to

41:18

place it. So, if you do the KV cache

41:20

tuning or prefix tuning, this paper

41:22

thinks prefix tuning works much better

41:24

than LoRA. I met some people at Meta um

41:26

when I used to be affiliated there that

41:28

said that they think LoRA works much

41:30

better than prefix tuning. So, I really

41:31

don't know, but I think like what it

41:34

really will come down to is like when

41:36

you do it at scale, what's like most

41:37

efficient. And I'm not exactly sure, but

41:40

I think prefix tuning is a pretty good

41:42

candidate because like KV caches are so

41:45

commonly used these days and like a lot

41:48

of the system stuff is built around KV

41:50

caches. I think a cool thing about

41:51

Thinking Machines is like they're

41:53

designing this entire organization

41:54

around like scaling LoRA, which is

41:56

awesome, but it's not really possible in

41:58

open source right now. Like there's not

42:00

kernels for training many LoRAs at the

42:02

same time. It's like very complex, and

42:04

you have to have a lot of people working

42:05

on that. Prefix tuning on the other

42:06

hand, it's like very well supported.

42:08

Um and then finally, I'll quickly talk

42:10

about memory layers. This is another

42:12

approach to injecting data into models

42:14

that I think is good. This is like uh

42:16

adding an expert to the MLP, but the

42:19

expert is just like this giant

42:21

differentiable lookup table. So, it's

42:23

kind of

42:24

not that important exactly how it works,

42:27

but it's like it's just a different way

42:28

to inject information into models. The

42:31

cool thing about memory layers is it's

42:32

controllable. So, in this work uh by

42:35

Jessy Lin from this year, they specify

42:39

exactly which parts of the memory layer

42:41

get updated and keep it to like a very

42:43

small number. And so, their result shows

42:46

that memory layers actually work the

42:48

best. So, the axes here are

42:51

forgetting, so down is bad, and

42:54

learning, right is good. So, the memory

42:56

layers basically don't forget at all,

42:59

and they learn close to as much. So, I

43:02

think if you're trying to

43:04

inject information into models and you

43:06

really care about them not forgetting

43:07

any of their base information, maybe

43:09

memory layers are the way to go. I think

43:11

honestly there's a lot of conflicting

43:12

evidence right now. Like some people

43:14

think LoRA is good, some people think

43:15

prefix tuning is good, these people

43:17

think memory layers are good. I really am

43:19

not sure, but I think it's going to be

43:21

one of them.
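Of the three candidates, the memory layer is probably the least familiar, so here is a minimal sketch of the "lookup table with controllable updates" idea. It is a toy, not the mechanism from the cited work: real memory layers learn keys and values by gradient descent over far larger tables, and the sizes, the update rule, and the function names here are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_lookup(query, keys, values, k=2):
    """Softmax-weighted average of the k value rows whose keys best match the query."""
    scores = [dot(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    norm = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for j, v in enumerate(values[i]):
            out[j] += (w / norm) * v
    return out, top


def sparse_update(values, slots, target, lr=0.5):
    """Move only the selected value rows toward target; every other row is untouched."""
    for i in slots:
        values[i] = [v + lr * (t - v) for v, t in zip(values[i], target)]


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [9.0, 9.0]]

output, slots = memory_lookup([10.0, 1.0], keys, values, k=2)
sparse_update(values, slots, target=[2.0, 2.0])
```

Only the two slots that the query actually hit are modified, which is the controllability being described: whatever is stored in the remaining rows cannot be overwritten by this update.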

43:23

Okay, cool. That's That's the end of the

43:25

training stuff into weights part. Maybe

43:27

actually I'll stop and see if anyone has

43:28

any questions about the different

43:30

parameterizations. Yeah.

43:32

Can you go back to the

43:35

slide where you were showing the

43:37

when

43:39

GRPO

43:40

Oh yeah, yeah, yeah. From from my yet

43:42

unreleased research.

43:44

So, have you used SFT before?

43:47

Yeah, yeah, I can show you the SFT

43:49

results later, but SFT

43:52

uh

43:53

takes a lot more parameters, is the short

43:55

explanation. Like many, many more, like

43:57

a thousand X more or something. And you

44:00

attribute that to the sparsity of the

44:01

reward? Yeah, yeah, I think it's

44:03

something like that. Like the SFT

44:05

learning signal is like cross-entropy on

44:08

all of the tokens with or without

44:10

thinking tokens, and that's a lot of

44:12

bits essentially. And then RL just gives

44:15

you a one or a zero. If you get it right,

44:17

and you already knew that, it's no

44:18

information. If you get it wrong, you

44:20

get like one bit. So, I think because RL

44:22

is like so sparse and

44:25

uh information efficient, then you can

44:26

do it with way fewer parameters. That's

44:28

That's kind of the takeaway from our

44:30

paper, actually.
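The bits argument can be made concrete with a rough calculation. This is a back-of-the-envelope framing, not the analysis from the paper: the token count and vocabulary size are made-up numbers, and the SFT figure is only a loose upper bound.

```python
import math


def surprisal_bits(probability):
    """Information, in bits, carried by observing an outcome with this probability."""
    return math.log2(1.0 / probability)


def sft_bits_upper_bound(num_tokens, vocab_size):
    """Crude ceiling: each target token can carry at most log2(vocab_size) bits."""
    return num_tokens * math.log2(vocab_size)


expected_success = surprisal_bits(1.0)  # a correct answer the model was sure of: 0 bits
coin_flip = surprisal_bits(0.5)         # a genuinely uncertain pass/fail outcome: 1 bit
sft_ceiling = sft_bits_upper_bound(num_tokens=100, vocab_size=1024)  # 100 * 10 bits
```

Under these assumptions one supervised example can carry up to a thousand bits while one pass/fail reward carries about one, which is at least consistent with RL needing far fewer trainable parameters to absorb its signal.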

44:30

>> So, you didn't do GRPO after doing SFT?

44:34

No, no SFT.

44:35

We just either do GRPO or SFT. And then

44:38

we see like kind of how many parameters

44:40

you need to train to get to equivalent

44:42

performance. And SFT requires many more

44:45

parameters.

44:47

Yeah.

44:48

Uh so, here you are comparing like

44:51

training versus RAG. Like we are

44:54

we want to solve the problems that we are

44:56

facing with RAG. So, does the volume of

44:58

documents also matter? Like do you

45:00

have any studies, like,

45:02

because if some problem has a smaller

45:04

number of documents,

45:06

uh

45:07

RAG will be better or the

45:10

training will be better? That's a really

45:12

good point. Um maybe that let's uh

45:15

go to the last slide. So, I think the

45:17

question is like, okay, if you're trying

45:19

to train all of your data into a model,

45:21

but something only happens once. Yeah,

45:23

I mean, when should I focus on

45:26

RAG and when should I focus on, like,

45:29

training, because

45:31

whenever I have like a small set

45:33

of documents, the training might not

45:36

be feasible.

45:37

Yes, yes, like

45:39

maybe something is so

45:42

underrepresented in your data that it

45:44

probably wouldn't

45:45

>> data is frequently changing, might be.

45:47

Your data is changing a lot. Yeah, maybe

45:49

in the short term it's hard to train.

45:51

Um

45:52

Yeah, so let me point out like Okay, so

45:55

obviously we're always going to put

45:56

stuff into context, and I think we'll

45:59

also probably always do RAG. Like I

46:02

think um there's basically no scenario

46:06

that you can imagine for a long time

46:08

where you're just like always training

46:09

the model and never doing RAG. I think

46:11

you'll do both. I think like maybe if

46:13

you have a ton of documents, I don't

46:15

know, maybe every day you do this big

46:16

training, and then every time you start,

46:18

you also do RAG. And so, like what I

46:21

really imagine is like

46:22

or maybe my my point is that no one is

46:25

doing this right now. And like people

46:27

will start doing it.

46:28

>> Do you have any, like, prediction, like

46:29

after certain amount of data, like

46:32

training will be like more efficient

46:34

[cough] than the RAG like Yeah, yeah,

46:36

yeah, no, that's a really good question.

46:38

Uh no, like I think I think this kind of

46:40

thing is really new, so there's a lot of

46:42

room for analysis like that. I would

46:43

definitely be interested to see both

46:45

analysis on how the frequency of

46:47

information affects like the tradeoff

46:50

and how just like how much data you have

46:52

to have for training to become

46:53

economically feasible. That's a really

46:55

good question.

46:57

Yeah. Um is your suggestion kind of in

47:01

uh diving more into like the weights

47:04

side of uh the presentation to use a

47:07

fine-tuned model for like

47:10

completion type tasks or also for

47:12

embeddings?

47:14

Oh yeah, that's a good question. Um

47:16

No, I think I think

47:19

the fine-tuning I'm talking about is all

47:20

for like assistant agent completion. Um

47:24

it's an interesting question. You

47:25

probably could do like dynamic embedding

47:26

model training, but I guess like the way

47:28

I think about it is like

47:30

the real like 10X improvement here is

47:33

going to come from training into

47:34

weights. You can maybe make RAG like

47:36

2X better if you really, really work,

47:38

but I think there's so many fundamental

47:40

problems with it that I wouldn't spend

47:43

that much time on making embeddings

47:45

better.

47:46

What were What do you feel like the most

47:48

fundamental problem is where even if

47:50

like your retrieval is fantastic,

47:53

I think like chunking, like um

47:55

you just like kind of retrieve some of

47:57

the stuff you need, and then you can't

47:59

really reason across all of it. And like

48:01

I think in the limit like there's some

48:04

types of data where like no matter how

48:06

you chunk, you'll never get like

48:07

everything you need, if that makes

48:08

sense. Yeah, totally.
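The chunking limitation can be shown with a toy retriever. This is a sketch under simplifying assumptions: word overlap stands in for embedding similarity, and the text, chunk size, and helper names are invented for the example.

```python
def chunk(words, size):
    """Split a word list into consecutive fixed-size chunks."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def retrieve(chunks, query_words, k):
    """Return the k chunks sharing the most words with the query (toy scorer)."""
    ranked = sorted(chunks, key=lambda c: len(set(c) & query_words), reverse=True)
    return ranked[:k]


# Made-up text: answering the question needs one fact from each chunk.
text = "alice manages the paris office . paris office revenue was ten million"
chunks = chunk(text.split(), size=6)
needed = {"alice", "revenue"}  # the question links a person to a revenue figure

top1 = retrieve(chunks, needed, k=1)
top2 = retrieve(chunks, needed, k=2)
covered_top1 = needed <= set(word for c in top1 for word in c)
covered_top2 = needed <= set(word for c in top2 for word in c)
```

With one retrieved chunk the two facts never appear together, so no amount of downstream reasoning can connect them. Raising `k` fixes this toy case, but on a large corpus the number of chunks a question might need is not known in advance.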

48:10

Cool.

48:11

Yeah.

48:12

Do you see any fundamental limitations

48:14

as you scale up the amount of

48:15

personalization you need? Let's say you

48:17

had a B2C product that had 100 million

48:19

or 10 million users with memory for all

48:22

Mhm. Do you think that's just not

48:23

feasible? You say 10 million users?

48:25

Yeah, 10 million, 100 million, somewhere

48:26

in that range. Yeah, um

48:28

No, no, I actually think it is it is

48:30

feasible. Like LoRA, maybe you train

48:34

a few megabytes per user or something.
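A quick storage estimate shows why this is plausible. The numbers are assumptions chosen to match the rough figures in the talk: 4 MB stands in for "a few megabytes" per adapter, and 1 TB for one full copy of a trillion-parameter model.

```python
def total_bytes(num_users, bytes_per_user):
    """Storage needed if every user gets their own copy of something."""
    return num_users * bytes_per_user


MB, TB, EB = 10**6, 10**12, 10**18  # decimal units

users = 10_000_000
adapter_bytes = 4 * MB     # assumed size of one per-user adapter
full_model_bytes = 1 * TB  # assumed size of one full model copy

adapters_total_tb = total_bytes(users, adapter_bytes) / TB        # 40.0 TB
full_copies_total_eb = total_bytes(users, full_model_bytes) / EB  # 10.0 EB
```

Under these assumptions, per-user adapters for ten million users fit in tens of terabytes, while per-user full copies would need exabytes, a gap of more than five orders of magnitude.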

48:37

It's not that crazy, right? Like YouTube

48:39

probably has gigabytes

48:43

Right, that's a good point.

48:43

[clears throat] Like the continual

48:44

updates are hard. Like probably in

48:46

realistic short term, it's more like you

48:48

update once a day or something like

48:49

that. But I think that's

48:51

that's doable, but you make a good point

48:53

that the paradigm I'm describing is much

48:55

more expensive than

48:56

Also, do you consider there's a lot more

48:58

that you can do in the other two like

49:00

buckets? You can compress the data

49:02

context, you can compress it before you

49:03

put it in RAG, break that down. There

49:05

are

49:05

buckets, you don't just have to use

49:07

RAGs, SQLs, and knowledge graphs, all of

49:10

them together to accomplish that solve

49:12

the problem.

49:13

Yeah, yeah, that's a good point. There's

49:14

kind of like three axes of optimization

49:16

here, and I guess like we are we're

49:20

getting pretty good at this. We're okay

49:22

at this, and we're horrible at this. And

49:23

so, like we'll continue improving upon

49:26

all three axes.

49:27

Yeah. What's your

49:30

like kind of hearing that maybe it's not

49:33

just fine-tuning, but what's your kind

49:34

of like intuition or guess in terms of

49:36

like where the decision boundary is in

49:39

terms of investing your effort in those

49:41

optimizations. Particularly in like

49:43

let's say a couple of years where you

49:45

could do something like a deep research

49:46

but it would be way cheaper and way

49:48

faster.

49:49

When what are there

49:52

You were saying that there isn't like a

49:54

number of documents but what is the

49:56

boundary that you would think about

49:57

looking at is it the freshness of the

50:00

data or is it how fast it's changing or

50:01

is it number of documents or what's the

50:03

what's the cost? Yeah, I it's a really

50:05

good question. I I think um

50:08

I think the paradigm I'm describing is

50:10

especially effective when you have like

50:11

a large amount of data that's not been

50:14

indexed into the LLM at all and it gives

50:16

you a big benefit there. I think when

50:18

you start seeing like

50:19

sparser updates to your data set or like

50:22

some new data that comes in but it's not

50:23

that much and it's like fairly often,

50:26

then you probably want to turn to

50:27

inference time approaches that are

50:28

closer to deep research.

50:31

Um yeah, and that guy had a question

50:33

long

50:34

Yeah. Can you elaborate a little bit

50:35

more about the

50:37

synthetic data generation? So let's say

50:40

that you have an LLM and you need to

50:43

get it to talk uh similar to the

50:46

language and terminology of like a

50:48

proprietary field, right? Like millions

50:51

of new documents.

50:53

Like how would synthetic data generation

50:55

in that context be helpful?

51:00

So your company has millions of

51:01

documents, you said, and you want the model

51:04

to It's more like a scenario. Yeah,

51:05

yeah. Okay. Yeah, yeah, yeah. Cuz it

51:07

wouldn't cuz it wouldn't you said you

51:10

wouldn't just train off of the next word

51:12

prediction,

51:13

Yeah.

51:14

Um

51:15

trials and such as and I think one of

51:17

those questions that you have talked

51:18

about was

51:19

uh synthetic data to

51:21

Yeah, yeah. No, I think I think

51:23

synthetic data generation could work for

51:25

that problem. So I guess like

51:29

um

51:31

it depends on how information dense your

51:32

data is. If you have millions of

51:34

documents from your company, I would

51:35

guess many of them share formatting and

51:38

only contribute maybe like a few bits of

51:41

kind of global information to the data

51:43

set. And so what you want to think about

51:45

is like does there exist a function that

51:47

could produce a good training data set

51:49

for an LLM that would teach it about my

51:51

data? And like there probably is. Like

51:53

you could probably design some strategy

51:54

that looks at the documents kind of like

51:56

figures out what's new about each

51:58

document and creates like kind of

51:59

question answer pairs.

52:01

But this is very blue sky. Like I think

52:03

a lot of people are working on this

52:04

right now but I don't have like a

52:06

a

52:07

global answer of how to actually do it.
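
One way to picture the strategy described above, as a minimal sketch and not a known-good recipe: skip lines that already appeared in earlier documents (the shared formatting), then turn whatever is new into question-answer pairs. Everything here is an assumption made for illustration: the function names, the `label: value` document layout, and the template-based `qa_from_line`, which stands in for what would realistically be a call to an LLM.

```python
from typing import Iterable, Optional


def novel_lines(docs: Iterable[str]) -> list[list[str]]:
    """For each document, keep only the lines not seen in any earlier document."""
    seen: set[str] = set()
    result = []
    for doc in docs:
        fresh = []
        for line in doc.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                fresh.append(line)
        result.append(fresh)
    return result


def qa_from_line(line: str) -> Optional[tuple[str, str]]:
    """Hypothetical stand-in for an LLM: turn 'label: value' into a Q&A pair."""
    label, sep, value = line.partition(":")
    if not sep or not value.strip():
        return None  # nothing we know how to ask about
    return (f"What is the {label.strip().lower()}?", value.strip())


def build_qa_dataset(docs: Iterable[str]) -> list[tuple[str, str]]:
    """Generate Q&A pairs only from the lines that are new in each document."""
    pairs = []
    for fresh in novel_lines(docs):
        for line in fresh:
            pair = qa_from_line(line)
            if pair is not None:
                pairs.append(pair)
    return pairs


# Two made-up documents that share a header and one field.
docs = [
    "ACME Corp Internal Memo\nOwner: Dana\nStatus: draft",
    "ACME Corp Internal Memo\nOwner: Lee\nStatus: draft",
]
dataset = build_qa_dataset(docs)
```

The second document contributes only one pair, because its header and status line were already seen. A real pipeline would also need to tie each question to its source document, which this sketch leaves out.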

52:09

>> Right now my only solution that I can

52:11

think of is um

52:13

you know, getting it to generate those

52:14

Q&A pairs for you to send. Right. And

52:17

then

52:19

for other

52:20

documents I'm wondering if there's other

52:21

ways

52:24

Yeah, yeah. I think it also depends on

52:26

what types of questions you'll be asking

52:28

about the documents. Like what you

52:29

really want to model is like all

52:30

possible questions or something like

52:32

that. But I think Q&A gets you pretty

52:33

far.

52:37

Yeah.

52:38

Um so with with this approach, right?

52:41

You you you mentioned this example where

52:42

you're um

52:44

you would train your model, right? On 3M

52:48

uh quarterly earnings, right? And you

52:51

can take 10-K, 10-Q documents.

52:54

What would like

52:56

what would the prompt basically look

52:58

like, right? Like is there is there

53:00

anything within, like, the in-context

53:03

learning that would still need to be

53:04

specified kind of specified to

53:08

bring your data into the context?

53:11

Yeah, uh so I think the question was if

53:14

you start with the 3M example we had and

53:17

you train all that into a model using

53:19

something like magic synthetic data,

53:20

what does the prompt actually look like?

53:21

Yeah. I think actually if you do it

53:23

right, you don't need a prompt at all.

53:24

Like you can just ask the model a

53:25

question, no system prompt, no

53:29

extra information and if nothing has

53:30

changed, it should know everything. Like

53:33

and you even there's some scenarios

53:34

where there's only one document and the

53:36

model knows which document it is so you

53:37

don't have to specify that you're even

53:39

asking a question about the document.

53:40

It's like implied, you know? So um

53:43

it depends on how you set it up but I

53:44

think in like the ideal case, there's no

53:47

prompt at all.

53:51

Yeah.

53:53

I

53:54

it's not obvious to me that information

53:56

is best stored in model weights. Yeah.

53:58

Why do you have do you have that?

54:00

Um it feels implied.

54:03

You have you might be right. Good

54:04

question.

54:06

So he said it's not obvious that

54:08

information needs to be stored in

54:09

weights. Yeah, yeah. This is this is a

54:11

good question. I think um

54:14

I'm not saying that it's best to store

54:16

information in weights. I guess I'm

54:18

arguing that that gets you a lot and

54:21

we're not using it right now.

54:22

>> Yeah. And like once you get to the scale

54:25

of like a GitHub repo, you might have

54:27

millions of tokens and it's just like

54:29

very expensive. And so at least like

54:32

this is the cheapest way to do it. The

54:34

question of like can we generate

54:36

synthetic data to do better than

54:37

in-context is like it's it's hard, I

54:40

think. It's like that's research.

54:44

Do you know what I mean when I say it's

54:45

cheaper though?

54:47

Like if you have a million token prompt,

54:49

you can just like compress it into the

54:50

weights and produce a model that gives

54:52

the same outputs with no prompt. And

54:55

then the inference costs less.
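
The cost argument can be made concrete with some back-of-the-envelope arithmetic. This is a sketch under loudly stated assumptions: the price of $1 per million input tokens, the 100-token question, and the $50 training cost are all made up, and the calculation ignores prompt caching, output tokens, and whether the trained model really matches the in-context model's answers.

```python
def cost_per_query(prompt_tokens: int, question_tokens: int,
                   dollars_per_million_tokens: float) -> float:
    """Input cost of one query: every token in the context is paid for."""
    total_tokens = prompt_tokens + question_tokens
    return total_tokens / 1_000_000 * dollars_per_million_tokens


def break_even_queries(training_cost: float, saving_per_query: float) -> float:
    """Number of queries after which a one-time training cost is recouped."""
    return training_cost / saving_per_query


PRICE = 1.0  # hypothetical: dollars per million input tokens

# Option A: resend the million-token prompt with every question.
in_context = cost_per_query(1_000_000, 100, PRICE)

# Option B: the prompt has been trained into the weights, so only the
# question is sent.
in_weights = cost_per_query(0, 100, PRICE)

saving = in_context - in_weights
queries_to_recoup = break_even_queries(50.0, saving)  # hypothetical $50 training run
```

Under these numbers each query saves about a dollar, so the made-up training run pays for itself after roughly 50 queries. The real break-even point depends entirely on actual prices and how often the underlying data changes.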

55:04

We can talk after.

55:05

Follow up. Yeah.

55:13

Mm. That's actually a really good

55:14

question. Never thought of that before.

55:17

Um I think it's probably pretty hard.

55:18

Like I guess if you're training on user

55:19

data and like you have some user that

55:21

wants to sabotage your system and you're

55:24

generating training data from their

55:26

inputs, there probably are a lot of

55:27

these like security risks and

55:31

uh I guess in this scenario if you're

55:33

serving the same model to that user and

55:35

it doesn't work anymore, that's like not

55:36

your problem. But once you start

55:38

aggregating information across users, I

55:39

bet it becomes hard. I'm sure ChatGPT

55:41

has the same problem where some people

55:43

always click thumbs down instead of

55:44

thumbs up to try to like

55:47

>> [laughter]

55:50

>> Uh [snorts] they segment it

55:51

geographically by country. Some

55:53

cultures are in conflict.

55:56

Oh yeah.

55:57

So you can

55:58

>> [laughter]

55:58

>> apply some bias in the

56:00

That's funny.

56:02

Yeah.

56:03

Yeah, um so speaking of maybe

56:05

[clears throat] a little bit about

56:05

practical limitations of something like

56:07

this, um especially in terms of like say

56:10

version control that you mentioned

56:11

GitHub models that you can keep

56:13

fine-tuning over time.

56:14

Say you're a company that just changed a

56:16

policy in the system, a one [snorts] line

56:17

sentence, from "we honor something" to "we do not

56:19

honor it anymore." Mhm. And that keeps

56:21

going back and forth. Do you then, you

56:24

know, start from the base model again

56:25

and fine-tune that Yeah, yeah. Or the

56:27

one that already already has a good

56:28

representation of it.

56:30

And just has to change that one small

56:32

thing and then you know how that kind of

56:34

is joined at the hallucinations.

56:36

Which is kind of why we were doing full

56:38

context, right?

56:39

Partially to avoid that.

56:42

Yeah, I think it So So his question was

56:44

about

56:46

what you do once you start making

56:48

multiple updates to the model,

56:49

especially when you have like

56:50

conflicting information. And I think

56:53

like the optimal synthetic data strategy

56:55

would somehow figure this out during

56:56

training and maybe even like if there's

56:58

some documents from a few days ago that

57:00

are no longer relevant, you can just

57:01

like delete them. But I don't know how

57:03

to do it.
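
The "just delete them" idea can at least be sketched on the data side, before any training happens. This is a naive illustration, not a solution to the problem the speaker says is unsolved: it assumes every document carries a `key` saying what it is about and a `day` saying when it was written, both of which are hypothetical fields, and it does nothing about what a model already absorbed from earlier fine-tuning rounds. The "returns" wording is invented to match the policy flip-flop in the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    key: str   # hypothetical: identifies what the document is about
    day: int   # hypothetical: when the document was written
    text: str


def drop_superseded(docs: list[Doc]) -> list[Doc]:
    """Keep only the most recent document for each key, oldest first."""
    latest: dict[str, Doc] = {}
    for doc in docs:
        current = latest.get(doc.key)
        # On a tie, the document that appears later in the list wins.
        if current is None or doc.day >= current.day:
            latest[doc.key] = doc
    return sorted(latest.values(), key=lambda d: d.day)


docs = [
    Doc("returns-policy", 1, "We honor returns."),
    Doc("returns-policy", 2, "We do not honor returns anymore."),
    Doc("shipping", 1, "We ship worldwide."),
    Doc("returns-policy", 3, "We honor returns."),
]
training_docs = drop_superseded(docs)
```

Only the day-3 policy and the shipping note survive, so synthetic data would be generated from the current state and not from the history of reversals.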

57:04

It's not How how we can give more

57:06

attention in the same like whatever

57:08

let's say uh

57:10

information is conflicting with each

57:12

other.

57:12

Uh whatever pre-trained versus what uh

57:15

friend document we are giving for

57:16

training. If they contradict each

57:18

other but I want more preference from my

57:20

document.

57:22

Like what we are doing in right like uh

57:24

asking the questions from the ground

57:26

truth.

57:27

So how

57:29

uh it will replace that uh scenario?

57:33

I'm [clears throat] not sure I

57:33

understood your question.

57:35

Sorry? I I don't know if I understood

57:36

your question.

57:37

Okay.

57:38

So what you have

57:40

I didn't understand your question. So my

57:41

question is like I

57:43

uh we have the data uh whatever the

57:44

training data we are giving it is

57:47

contradicting with the pre-training

57:48

data.

57:49

It is conflicting.

57:50

Now while asking the question while the

57:53

inference, I want to give more

57:54

preference on my data. I don't need the

57:58

pre-training information. That's why we

58:00

are using right like uh I need the

58:02

output from my ground truth uh whatever

58:04

the context I'm giving. So how it will

58:07

uh we can achieve in the like a

58:09

training?

58:12

I think that the

58:16

the paradigm I'm proposing has all the

58:18

same limitations of RAG.

58:21

Uh I'm not positive that answers your

58:22

question but like for example, if

58:26

uh like with maybe in the scenario he

58:28

said where you said something many times

58:29

and then it turns out not to be true,

58:31

RAG would retrieve that and in the

58:34

uh

58:35

dumbest setup, that would also be

58:37

present a lot in the training data. So I

58:38

think like the same problems have to be

58:39

solved.

58:42

Have you done any work with federated uh

58:44

tuning? Fine-tuning uh parameters

58:47

So so what might be your problem? Yeah.

58:48

Millions of users.

58:50

Have you done any research in this space

58:51

yet? No, no, no. Not really but I think

58:54

it's an interesting uh opportunity. So

58:56

like back in the day, a lot of people

58:57

were really excited about the idea that

58:59

you could share gradients and train the

59:00

same model across many machines. This is

59:03

federated learning. And I think like one

59:05

of the problems why it's hard is because

59:07

the models now are so big that the

59:09

network costs are way too high. And

59:11

because like I'm arguing that you only

59:13

need to train a million parameters

59:14

instead of a trillion, it probably comes

59:16

back into play. So I think it's a very

59:18

good idea, especially in the RL world

59:20

where you do a lot of work for a long

59:23

time and then do gradients like very

59:26

seldomly. So I think it probably will

59:29

come back and it's smart to think of it,

59:31

but it hasn't quite yet.
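
To see why training a million parameters instead of a trillion changes the network picture, here is a rough sketch. The 2 bytes per parameter (16-bit values) is an assumption, as are the function names, and the averaging step is plain unweighted averaging in the style of federated averaging, not a description of any particular system.

```python
def payload_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes one machine sends when it shares an update for every trained parameter."""
    return num_params * bytes_per_param


def federated_average(updates: list[list[float]]) -> list[float]:
    """Average the per-parameter updates received from several machines."""
    n = len(updates)
    return [sum(values) / n for values in zip(*updates)]


small = payload_bytes(1_000_000)          # a million trained parameters
large = payload_bytes(1_000_000_000_000)  # a trillion trained parameters
ratio = large // small

# Two machines each send a tiny two-parameter update; the server averages them.
merged = federated_average([[0.0, 2.0], [2.0, 4.0]])
```

With these assumptions a single exchange is about 2 MB for the small case and about 2 TB for the full model, a factor of a million. If updates are also sent only rarely, as in the RL setting mentioned above, the total traffic shrinks further.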

59:34

Um, maybe I'll take like two more

59:35

questions. Yeah, go. Um, so

59:38

your argument here about training in

59:41

um

59:42

information seems to be

59:44

uh counter to Karpathy's view of like a

59:47

reasoning engine, like distilling just

59:49

the pure like you know, intelligence

59:52

aspect of the of the model down to like

59:54

a 2 billion parameter thing.

59:56

Um,

59:58

and like I think that there's a bit of

59:59

overlap there, like um

1:00:03

like a lawyer doesn't have the

1:00:06

entire legal code memorized, but they

1:00:08

know how to use the tools available to

1:00:10

them to find what they need to.

1:00:12

And so I I think part of it is kind of a

1:00:15

combination of those two things, where

1:00:16

you're doing task-specific training

1:00:20

with something like this on a relatively

1:00:22

small reasoning brain to get a sense of

1:00:26

where it needs to find the things that

1:00:29

might become stale or or, you know,

1:00:31

am I on the right track here or Yeah,

1:00:34

yeah. So, I think

1:00:36

you're making a comparison between some

1:00:38

people who have said, "Oh, the best

1:00:39

model we could ever have is like really

1:00:41

small and knows nothing, but can use

1:00:43

tools really well or something like

1:00:44

that." And

1:00:46

I guess I I was proposing some similar

1:00:48

ideas. I said models know way too much.

1:00:51

I think everyone agrees. The model

1:00:52

doesn't need to know the capital of the

1:00:53

smallest province in Tajikistan for most

1:00:56

use cases at least in like my life. It

1:00:59

doesn't need to remember, you know,

1:01:00

encryption keys. Yeah, but I think

1:01:03

there's I I think it's a very

1:01:05

philosophical question, but

1:01:07

I think it's really hard to create a

1:01:08

model that doesn't know anything. And

1:01:10

so, I'm more advocating for like

1:01:12

specialized models that are good at

1:01:14

something you care about, but bad at

1:01:16

other things rather than advocating for

1:01:17

a model that's like bad at everything.

1:01:20

Uh okay, last question here. Yeah, have

1:01:22

you ever done any research yet into the

1:01:24

temporal elements of the information?

1:01:26

No, but I think that's like one of the

1:01:28

first things to think about is like,

1:01:29

okay, if you have information from day

1:01:30

one and day two and day three, do you

1:01:33

just sort of like have data everything

1:01:34

or do you train them in order kind of

1:01:36

like you were asking or do you like

1:01:38

train multiple models and merge them or

1:01:40

I actually don't know, but that's a good

1:01:42

segue. So, now I'm

1:01:45

I'm working on

1:01:47

problems related to this a lot, thinking

1:01:48

about this a lot.

1:01:50

Started a company with a few other

1:01:51

people and um

1:01:54

this is like the kind of research we're

1:01:56

doing. If anyone knows someone who lives

1:01:58

in San Francisco and is good at

1:02:00

engineering and you think they're

1:02:01

interested in this, let me know or send

1:02:03

me an email. Or if you're interested in

1:02:05

like using this kind of thing, send me

1:02:06

an email. That would be great. Is it

1:02:08

temporal stuff or Not necess- I mean,

1:02:12

it's kind of all of this, I would say.

1:02:13

Um, trying to build models that you can

1:02:15

teach things to.

1:02:17

Tell us more.

1:02:21

All right, thanks so much for having me.

1:02:23

This was great.

1:02:24

>> [applause]

1:02:27

[music]

1:02:43

[music]
