Full Transcript

·YouTLDR

The Never Ending Lore of Harness | Vivek Trivedy (Product Lead, Langchain)

1:33:26EnglishTranscribed Apr 22, 2026

Open in Studio

0:00

Hey everyone, welcome back to ground

0:01

zero. This is episode 13. Yeah, we are

0:04

running fast. Today we have ve from

0:06

langchain. So we leads their work on

0:09

open source agents and harnesses the hot

0:12

term right now. He's the person behind

0:15

DP agents the coding agent that went

0:17

from top 30 to top five on terminal

0:20

bench 2.0 by only changing the harness.

0:23

He's been writing some really good stuff

0:25

with lot of signal and alpha on what

0:28

harnesses actually are. Why agents

0:30

should be more opinated the idea of

0:33

harness as a service and um how planning

0:36

agents are really just dynamic workflow

0:38

generators.

0:40

Before Langen, he ran his own startup on

0:42

visual understanding agents and before

0:44

that uh was a scientist at AWS while

0:47

doing his PhD in CS at Temple. Uh, we'll

0:51

cover a lot into this. Uh, there's a lot

0:53

to get into. We welcome.

0:55

>> Thank you for having me. I'm super

0:57

hyped. I'm super I've been following you

0:58

on Twitter a bunch. So, yeah, I'm glad

1:00

we're making this happen.

1:01

>> How are you doing? And would love to

1:02

know your uh initial VIP check on Opus

1:05

4.7.

1:05

>> First of all, doing great. Whenever

1:07

there's a new model release, you know,

1:08

it's always like a good week for all of

1:09

us. It's maybe like an even more fun

1:12

week for like anyone who does like evals

1:14

on all the models. Um, so yeah, dropped

1:17

yesterday. We started like evaling it.

1:19

We have our like set across our

1:21

products. We have like open source evals

1:22

that we use and like also like for some

1:24

of like Lang Smith's products that we

1:26

use. It's a good model. It's a good

1:28

model. I don't think it was like a crazy

1:29

step change for tons of stuff that we're

1:31

doing. But TBD I think like the fun part

1:34

about stuff we'll like jump into which

1:36

is strong belief that every model needs

1:40

its own custom things that you add to

1:42

it. I know like anthropic release is a

1:43

nice skill uh that you can like easily

1:45

convert prompts and stuff but we're in

1:47

the middle of that process for like the

1:49

agents that we're going to use it for.

1:51

So it's a good model not a crazy step

1:53

change but we'll we'll fit it. We'll

1:54

we'll make it good.

1:55

>> I mean it is interesting in a way that I

1:57

have been seeing a lot of mixed opinions

2:00

right now. People have pretty much mixed

2:02

opinions on 4.7. Basically what they

2:04

have doing it with um the kota users as

2:07

well. I mean in just three four prompts

2:09

you are running out of I mean there's a

2:12

lot of good story I mean interesting

2:13

story behind but but yeah I mean the

2:16

kind of piece about these models being

2:18

coming up be open air or anthropic

2:20

anthropic specifically how they have

2:22

been doing good at public perception and

2:24

effective marketing as I say I mean

2:26

working well working working I mean it's

2:28

been rewarding for them

2:29

>> I mean they're great they're great they

2:30

they put out like great models obviously

2:32

they put out great products around the

2:34

models I think there's definitely some

2:37

stuff where

2:39

people are playing a lot more with the

2:42

models and like they're basically like

2:44

picking use cases they use models for.

2:46

So it's like everyone uses cloud code

2:47

like everyone uses codecs and that sort

2:48

of stuff. But like when you build like

2:50

your agents on top of those models, it's

2:52

like I need to actually care about the

2:54

prompts. I need to care about the

2:55

context engineering. I need to like care

2:57

about the tool design. And I think like

2:59

that's where it's really cool to like us

3:03

putting out content like other like

3:04

really cool people putting out content

3:05

which is like like how do I make a model

3:07

good at like my task basically because

3:09

at the end like my customers that's all

3:10

they care about that's all I care about

3:12

and I think like that's like a bunch of

3:13

the harnessge journey basically whether

3:16

you call context whether you call like

3:17

agent edge it's basically like fit some

3:20

sort of system around this model to make

3:21

it like sit at my task and that's like

3:24

what we're all trying to do and like

3:25

anthropic is trying to help us with

3:26

that. Open models are trying to help us

3:28

with that as well.

3:29

>> Totally makes sense. Um let's dive in um

3:31

about your journey. So you went for a

3:34

PhD in CS at Temple and I mean worth to

3:37

mention you did your bachelor's,

3:38

masters, PhD everything at Temple and

3:40

this has been a talk of the town as well

3:42

in past years on Twitter. People were

3:43

talking about it. People have again I

3:45

mean some opinions about Temple being a

3:48

university, good university or not. So

3:50

my question is to being a scientist I

3:52

mean doing a PhD PhD then to being a

3:55

scientist at AWS to running your own

3:56

startup on agents or visual

3:59

understanding to leading open source

4:00

agents at Langen. How has your journey

4:02

been like?

4:02

>> Happy to dive in. Um cool cool I'm so

4:05

I'm from around this area. So I'm from

4:06

like east coast uh Jersey like

4:08

Philadelphia area. I went to school at

4:10

Temple. So I did my undergrad there did

4:12

my masters there like my PhD there. So

4:14

like super early I was like I'm just

4:17

going to be a doctor like most kids

4:19

pressured by their parents like I'm

4:20

going to be a great doctor like quickly

4:22

realized like I don't really want to do

4:23

that most of my undergrad. So I do my

4:24

underground in math and math is like

4:27

really cool. I think there's a lot of

4:29

concepts in math that like translate

4:30

really well to CS and like physics and

4:32

things like that sort of like systems

4:33

thinking.

4:35

>> Math is also like at least for me maybe

4:37

I'm just not amazing at it. It's

4:38

incredibly hard. So like doing something

4:40

really hard does prepare you for other

4:42

things.

4:44

Yeah, dude. Undergrad was like really

4:45

fun. I enjoyed math. I got into like

4:46

some CS stuff. I think like late 2010s

4:50

was when there was a lot of cool stuff

4:53

in different parts of ML. So like I got

4:55

into computer vision stuff, like

4:57

undergrad research. And like I love

4:59

vision. So like I think vision is still

5:01

one of the coolest things out there.

5:03

There's like way less research done on

5:04

vision even today relative to text. Like

5:08

>> OCR is pretty important, right? OCR is

5:11

like now okay just just send just send

5:14

the PDF to Claude basically and like

5:16

obviously a bunch of systems engineering

5:17

around that but yeah man like I I loved

5:20

vision I still love vision vision was

5:22

really cool so like I did undergrad in

5:24

that did like research around that and

5:25

then I just went straight into like

5:27

masters in PhD like right after I

5:29

graduated like early 2020s and then yeah

5:33

my PhD was basically all around like

5:36

vision focused representation learning

5:39

so yeah I can talk a little bit about

5:40

that. So the first like topics that I

5:42

was working on was like graph neural

5:44

networks which are like I don't know how

5:46

hot those are anymore but I do see like

5:48

some really cool people still doing

5:49

research around those. Um basically like

5:51

graph representation learning but it's

5:52

like graph representation learning for

5:54

like vision basically. So it's like if I

5:56

like decompose an image into like

5:58

particular objects and like I make a

6:00

graph of that and then I do like

6:01

representation learning do we get like a

6:03

better end vector for like retrieval

6:05

like classification and then like we did

6:06

this at also like the data set level as

6:09

well. So like what if I have like kind

6:11

of like few shot examples. It's called

6:12

like transductive learning like use

6:14

other information in the data set to

6:16

help you classify the next thing. Dude,

6:18

that was really cool. Like I think

6:19

graphs I'm like bearish on graphs

6:21

overall actually. So maybe hot take but

6:23

like that was a really cool part of

6:24

research and like that was my first like

6:26

dabbling into like computer vision stuff

6:28

like undergrad then my first like PhD

6:30

topic which like it shifted a little bit

6:32

after like the chat PT moment like tons

6:34

of research became around okay like

6:37

let's do VLMs for everything and let's

6:39

do like representation learning on the

6:41

VLMs like what are VLMs like actually

6:43

seeing when they're doing their like

6:45

attention mechanism over images. So

6:48

yeah, dude, it was great. It was great.

6:50

I like really enjoyed my time in PhD. I

6:52

think it's like you get some sort of

6:54

unbounded time with your adviser to just

6:58

pick an interesting problem and just

6:59

like rabbit hole in it. So I did like

7:01

retrieval stuff like representation

7:03

learning stuff. Yeah, dude. It was

7:04

great. I enjoyed it.

7:05

>> Awesome. Um, so I had a chat with

7:08

Tensorcut the other day. He started

7:10

Paradigma. He dropped out of PhD. So my

7:13

question to you is what do you really

7:15

think about the scenario right now the

7:18

linkage between academia and the

7:20

industry and right now if you have been

7:23

like if someone is going for PhD or

7:24

something like that. So what do you

7:25

really think about is is is that is this

7:27

worth it or how far we have come is

7:31

still necessary to go for a PhD to I

7:34

mean it is again very opinionated um

7:36

question but still I mean I want to

7:38

really understand your

7:39

>> yeah absolutely so like it's a great

7:41

question like people ask me this

7:43

question like locally like my friends or

7:44

like younger brothers and stuff.

7:46

>> Yeah. So like maybe my PhD was like

7:49

slightly different because I was doing

7:51

research at Temple but I was also doing

7:53

research and like working on like prod

7:55

projects when I was at AWS and those are

7:57

happening at the same time and I like

8:00

strongly believe that that is like a

8:02

fantastic mix for anyone who wants to do

8:05

like research but then sort of

8:07

understand maybe like how their research

8:10

is going to be applied in like some

8:11

settings. And I think today like the

8:15

point basically of a PhD to me is like

8:17

you pick a topic that you're like really

8:19

deeply interested in and you like poke

8:22

around the edges of that topic to try to

8:24

figure out like how we can make like

8:25

this thing better. And like that doesn't

8:27

like really require a degree to do that.

8:29

There's tons of like sick researchers on

8:31

X who just like post like random blogs

8:33

and like they don't have a PhD. they

8:35

probably don't maybe don't have CS

8:36

background but there's like you just

8:37

pick a topic you like rabbit hole it

8:41

you just like push the boundary of

8:42

what's possible and you do that like in

8:44

a verifiable way so you like write code

8:46

do experiments you try to share like

8:47

open research and if you're able to find

8:51

a company that allows you to do that

8:52

like lang's fantastic at that like I

8:54

think they really cultivate like hey

8:56

like we're going to like pick this topic

8:57

we're just going to like figure out how

8:58

it works and we're going to like publish

9:00

content about it basically

9:02

>> I would say that's great I think it it

9:03

kind of depends like if you find a great

9:05

company, a good great founder that you

9:07

vibe with that lets you do both.

9:08

Industry is like amazing and like

9:10

especially AI research like it's super

9:12

helpful across a lot of companies. You

9:14

can probably make a lot of money and

9:15

like do interesting research at the same

9:17

time. So yeah, kind of like a

9:18

non-answer, but if you do find that

9:20

scenario amazing if you just want to

9:22

like grind on like some sort of topic

9:24

and PhD for like a bunch of years, also

9:27

great. I actually don't think you can go

9:29

wrong like just by being curious and

9:30

just exploring it.

9:31

>> Yep. I can see you have uh you you were

9:35

like working on your startup about

9:36

visual understanding agents. So I want

9:39

to understand your learnings there and

9:41

how do you see the vision space right

9:44

now like how can you correlate between

9:47

uh the time when you started and the

9:48

time we have come so far with the

9:51

current frontier state-of-the-art

9:52

research and products building. Yeah,

9:54

dude. Um, yeah. So, like I started that

9:56

startup after I graduated like my PhD.

9:59

So, that was sort of like mid last year

10:01

with a friend. And basically like the

10:04

main thing that we were working on like

10:06

starts was called Agentify. And like the

10:08

main idea was basically that basically

10:10

vision compared to text like really lags

10:12

behind in frontier models for like

10:14

things like visual reasoning but also

10:16

things like perception just generally.

10:18

So there's like tons of things where

10:19

you'll like show an image or like an

10:22

object like o two overlapping boxes to

10:24

the model, right? And it's like it

10:25

doesn't like fully understand that those

10:26

two things like overlapping and like

10:28

part of this is just a perception

10:30

problem in the visual encoder where it's

10:32

like some of these like fine grain

10:33

details, it's just not able to

10:34

understand them with like the native

10:36

training that it has. But that I think

10:39

is like a fantastic opportunity because

10:41

it's like how much of that gets absorbed

10:43

into the vision encoder backbone versus

10:46

like how much do we augment models with

10:49

like tool calling behavior that they're

10:51

exceptional at and actually use that as

10:54

the mechanism to like take vision

10:56

capabilities and like put them into the

10:57

models. Like that's basically the whole

10:59

like idea that we were working on. So

11:00

like research and like product around

11:02

that which is like what if I just took

11:04

all of the classic vision models that we

11:06

already have and like a lot of this was

11:08

honestly inspired by Meta's work on SAM.

11:11

So I think like SAM and that whole

11:13

series is like incredible like SAM 123.

11:17

It also supports like video segmentation

11:19

which is like insane and you can also

11:20

like fine-tune it. You can do like meds

11:22

SAM and things like that. So it's

11:23

basically like BET was okay models are

11:26

amazing. They're getting very smart, but

11:28

like their vision capabilities are

11:29

lagging behind. But we can augment them

11:31

with tools and like you can basically

11:34

like do the right tool selection in the

11:37

moment to like get that capability. Like

11:39

segmentation is something that it was in

11:41

Gemini Flash across the Gemini series,

11:43

but like compare that to like SAM,

11:44

right? Like SAM was like way better. If

11:46

you just use like Sam as a tool compared

11:48

to like the native segmentation Gemini,

11:49

you would be just like way happier. and

11:51

like all you had to really do was like

11:52

point to the right spot which is like

11:54

way easier than doing like semantic

11:56

segmentation. So that was the idea. I

11:58

still think that that is true in vision

12:00

today. Like even with like Opus 4.7's

12:04

new benches, it's still not as good at

12:08

visual perception as like we need it to

12:10

be. So I still think tool use is like

12:12

really really exciting for yeah just for

12:16

like agentic systems like visual

12:18

basically making a bunch of like vision

12:19

specific tools for your task and like

12:20

augmenting uh yeah augmenting your agent

12:23

with that.

12:23

>> I think there is a lot of scope to do

12:25

alongside UI bench as well. I mean again

12:29

uh it's more about one's taste but uh

12:31

but there are lots of ifs and buts lot

12:34

of nuances where you really need to take

12:37

care of like even if you're cloning a

12:38

website I mean there's lot of sc uh

12:41

scope to play around something so my

12:43

next question is about your work at

12:45

Loheed Martin. So you you you interned

12:48

there. I think that was your first um

12:51

job and honestly a lot of what people

12:53

see about world is kind of sophisticated

12:56

reals on social media about American

12:58

weaponry. So what was the reality like

13:00

from the inside? What what what you were

13:02

working on? How does it feel like to

13:03

work at some defense um kind of defense

13:06

company and what experience lead?

13:08

>> That is like such a throwback. So that

13:10

was like my first internship at like

13:12

tech ever. So, I was like a bio intern

13:15

in undergrad and I was like looking for

13:16

internships and I gave my resume and I

13:19

got an internship at like Loy Martin

13:20

which is amazing because like I don't

13:22

know how good my bio resume was for

13:24

getting like any internships. Yeah, man.

13:26

I wish I say like tons of stuff I did on

13:29

>> What do you mean by bio resume? It was

13:31

like like you were working on some bio

13:33

>> Yeah. So like I went to undergrad as

13:35

like a biochem major because like I

13:37

wanted to be like a doctor.

13:40

>> Amazing.

13:40

>> Yeah. So like then like after freshman

13:42

year I applied to like internships cuz I

13:44

I switched I wanted to do tech after

13:46

that or like at least explore it with

13:47

like a bio resume and they were like

13:50

dude like what like what are what are we

13:52

doing here? And then like I think I

13:53

basically just like talked like to the

13:55

hiring manager and just said like hey

13:57

I'm like really down to like learn this

13:58

thing like which is like data science

14:00

like that time there bunch of these like

14:01

data science courses and things coming

14:02

out so it was still like early and I was

14:04

like hey like I took these like Python

14:05

classes and like I'm super down to learn

14:08

this. And basically it was like yeah I

14:10

mean it sounds great.

14:13

I ended up working on the data science

14:14

team there and it was basically like my

14:17

first introduction into like kind of

14:21

like data analysis sort of stuff. So

14:23

like understanding like it was much like

14:25

stats basically. So like I wouldn't say

14:27

it was like ML but it was like this is

14:29

like intro to like making plots like

14:32

slice this data this way. So it was a

14:33

bunch of just like empathy for like very

14:36

very messy data as like my first

14:38

internship which is actually like very

14:40

valuable today just like insane amounts

14:42

of data which is like does not look very

14:43

clean and yeah man I wish I say more it

14:46

was basically like a great learning

14:47

experience because I was kind of

14:48

learning how to code and like doing like

14:50

data science stuff and then it was also

14:52

like a decent confidence boost because

14:53

I'm like okay maybe I can do like tech

14:56

stuff and yeah I interned there and it

14:58

was like fun and then yeah I didn't

15:00

really go back after that but I started

15:03

getting into more like research stuff at

15:04

school.

15:04

>> Awesome. Um, also recently I was just

15:07

kind of exploring the timeline. I see

15:09

Mike Mill who is a pretty famous, you

15:11

know, internet celebrity was looking for

15:13

an AI guy and you came up through

15:15

Temple. Apparently Mike was surprised

15:18

how many Temple people are in AI and so

15:21

did you end up connecting with him? Did

15:22

you share anything about Langchen and

15:24

stuff?

15:24

>> So Meek Mill is like he's like a rapper

15:26

from from Philadelphia and like I guess

15:29

he lives around Temple like that's where

15:30

he was from and I think everyone was

15:33

like when they saw that tweet they were

15:34

like Meek Mills get into AI so okay let

15:37

me just like reply basically because I

15:38

think like honestly like randomly

15:40

posting on Twitter X is like awesome.

15:42

You can meet so many cool people like

15:44

that and I we'll talk about this but I

15:47

met like Harrison the founder of like

15:48

CEO and the CEO of like W

15:52

And yeah, he did not reply to me. I hope

15:54

his like startup is doing sick, whatever

15:56

he's whatever he's doing. But like I'll

15:58

like repeat it if he does need someone

16:00

for help with like AI. I'm actually like

16:03

seven blocks down. So I could totally

16:06

like just pull up and help him. So no, I

16:09

think that's a good lesson though is

16:10

just like randomly posting maybe like

16:11

I'll just keep doing that and then maybe

16:13

something will happen.

16:14

>> Yep. Awesome.

16:17

So I mean the next question to you is so

16:20

when did you join Langchain and uh what

16:22

actually pulled you there specifically?

16:24

So and since you joined what actually

16:27

has

16:27

>> So this is like this is so much fun. Um

16:30

I was working on my startup like after I

16:32

finished my PhD that didn't work out

16:34

like we basically stopped around the

16:36

fall. At the same time, I was basically

16:39

like doing my first foray into just like

16:41

posting like random stuff on Twitter

16:44

just like my thoughts like basically

16:45

just like open source stuff like hacking

16:46

on random stuff and

16:49

from a bunch of the stuff I was posting

16:50

around like so like last year I also

16:53

like sort of believe that like we have

16:54

amazing models but like because we did a

16:56

bunch of stuff in this like visual

16:58

understanding space with like agents and

17:00

stuff. I was like very very confident

17:02

that models need like some stuff around

17:04

them to like help them do these tasks

17:06

because like they just suck at them out

17:08

of the box and like we basically saw

17:09

this every day. So that's basically when

17:12

a lot of maybe the ideas that were

17:14

brewing around harnessge like started to

17:17

maybe get more like crystallized and I

17:19

just started like posting about that

17:20

online. It's like, hey, like this is

17:23

maybe like what harnesses look like.

17:25

Like harnesses are like supposed to like

17:26

wrap models and like if we're trying to

17:28

do like vertical tasks. It like really

17:30

helps to have some sort of like

17:31

opinionated like prompts, context

17:33

engineering, like tool call structure

17:35

like all this sort of stuff. And I think

17:38

I just like started DMing Harrison like

17:41

the CEO from that which is like super

17:42

sick. He is also always thinking about

17:46

like the frontier of like AI systems

17:49

which is awesome. And then we started

17:51

chatting maybe like late last year just

17:54

like yeah like what would it look like

17:55

to build open-source infrastructure

17:58

around like agent engineering and like

18:02

maybe the best way to facilitate that is

18:04

by helping people build good harnesses

18:07

like whatever good means like let's

18:08

discover like what good means and make

18:10

open source software about that. So, it

18:12

was basically like, okay, that sounds

18:14

sick. And then I was like, I don't

18:16

exactly know what I'm going to do. Like,

18:17

maybe I'll continue like working on the

18:18

startup or like, but I would love to

18:20

join something that like really aligns.

18:21

So, then I started working with like

18:22

their open source team late like last

18:25

year on what ended up becoming like what

18:28

was deep agents, but ended up becoming

18:30

like a lot bigger. Um, so yeah, we were

18:32

working on like the very very early

18:34

versions of like deep agents last year,

18:36

which is like one of our libraries at

18:38

Langchain that we that we have. It's

18:40

like our library to help people build

18:42

harnesses. Um, or at least it's one of

18:44

the ways that people can build harnesses

18:45

using using Wangchain. And yeah, I loved

18:48

it. I love the team. Uh, amazing people

18:51

doing open source. And then I decided to

18:53

join like full-time in in December.

18:54

>> Amazing. Um, and and I mean, the

18:57

adoption is just crazy, dude. I mean, so

18:59

I want to understand about the growth

19:01

here. So, so again I mean right now

19:03

Twitter is full of people flaming

19:05

millions in ARR every month and but like

19:07

a feels like one of the most you know

19:10

stressed metrics out there. So my

19:12

question is how has lang approached

19:14

growth in real terms be it opensource be

19:17

it community adoption be it enterprise

19:19

>> yeah dude it's a great question. So I I

19:21

think about this a bunch because like I

19:23

think the best way to maybe think about

19:24

it is like basically like work backwards

19:26

from you want to like help people build

19:30

stuff using like the tools you're you're

19:32

putting out there, right? And like the

19:34

goal is basically just like help people

19:36

build like really cool things and like

19:39

make that process of building as easy as

19:40

possible. I think in like open source

19:42

that comes through like very clearly

19:44

because in open source I think you get a

19:46

lot of like empathy for the end user

19:48

because they're like directly using your

19:50

product like all the code is like fully

19:52

visible like go inspect it also like put

19:55

your opinions in like our GitHub issues

19:58

and tell us like what's good what's bad

20:00

like what should we fix like what should

20:02

we add also like it's totally cool to

20:05

like disagree in open source because

20:07

like the maintainers sort of have

20:09

limited bandwidth to address like all of

20:12

the things, but we want to make sure

20:13

that the most impactful things that are

20:15

going to help like the most users build

20:17

like the coolest stuff like we like

20:18

prioritize those. So, I think there's a

20:21

there's a big part of growth which is

20:23

why I like really like X um and like

20:27

these direct feedback channels or like

20:28

Slack for example or just like messaging

20:31

builders and customers because you

20:34

basically get to see exactly what

20:35

they're doing. you build like a lot of

20:36

empathy for shoot like this thing that

20:39

we built like it's a little broken in

20:41

this way or like it doesn't exactly like

20:42

fit the use case and then you hear a

20:44

bunch of those stories and you sort of

20:45

like work backwards to say okay like we

20:47

need to improve like this part of our

20:49

library or like we need to like make it

20:51

possible for others to improve our

20:53

library as well. That's like an amazing

20:54

part of open source that we get tons of

20:56

like amazing feedback, tons of like user

20:59

contributions which is great because you

21:02

sort of like grow with your community

21:04

and I think like that's a really big

21:06

part of open source and related to that

21:08

which I really really like about

21:09

Langchain like one of the reasons why I

21:12

joined and like I really enjoy working

21:13

here is there's a lot of like learnings

21:16

that we get from all the research that I

21:19

do in like open source and like putting

21:21

stuff out there and getting feedback

21:23

that slowly like make their way into our

21:25

products as well because it's like for

21:27

example a lot of stuff in like Lang

21:29

Smith for example which is like okay

21:31

like how do you build good evals like

21:33

how do you how do you actually enable

21:35

agents and users to build like really

21:37

good evals like how do you like

21:38

understand what's happening in traces

21:40

like mind signals from

21:42

>> like a lot of that we put out just in

21:44

the open like I did a bunch of blogs on

21:46

that stuff there's other people who are

21:47

like hacking on that stuff as well and a

21:49

lot of the stuff in open source you sort

21:51

of see how the community interacts with

21:53

it. You also just see the raw numbers

21:55

and you put it out there and it's like

21:56

hey like I would love this or like I'm

21:59

using this and it's like oh we should

22:01

make that as easy as possible. Put it

22:04

into a product and like if people love

22:06

the product then like the rest of it

22:08

sort of takes care of itself. It's like

22:09

yes you will make money you know your

22:13

customers will be really happy and then

22:14

like just continue the loop like just

22:15

keep making it better basically. So I

22:17

think like yeah dude customer feedback

22:19

is amazing like community feedback is

22:21

amazing. So it's like a really really

22:22

big part of I think lang chain a really

22:25

big part of like a a lot of the open

22:26

source stuff that we do

22:27

>> I can imagine of course and more

22:29

specifically here so you are leading the

22:32

open source egen and harnesses work

22:34

right now so what does a typical um week

22:37

looks like for you it's more about

22:39

research engineering or product

22:42

>> yeah dude whatever

22:44

>> I think the fun part is like it's it is

22:46

actually like a mix of a ton of stuff

22:48

and I like really really like that so

22:50

it's like the goal is bas basically pick

22:52

the most important thing to work on at

22:55

this time and then like we'll like we'll

22:57

chat about it maybe over the weekend or

22:59

like the week before like Harrison

23:00

jumped in with with us like we'll DM and

23:03

let's just like sprint towards that and

23:05

build it basically and like maybe what

23:08

that looks like lately

23:11

like lately like a ton of my work has

23:13

been on like eval continual learning

23:17

essentially like methods for using like

23:19

evals and continual learning to make

23:21

like agents and like their harness

23:22

better. So that's like basically like

23:24

the research direction and I would say

23:26

maybe like 50% of the week goes into

23:30

okay let's like pick a research

23:31

hypothesis let's like figure out what

23:33

the experiment design around that might

23:35

be. Like for example, last week we were

23:37

doing a bunch on can you like just in

23:40

time generate evals uh like for any

23:42

given task like what does that look

23:44

like? Like are you overfitting to them

23:45

and like what is your like fitting

23:47

algorithm? There's like tons of stuff

23:48

that we put out. There's like a lot of

23:50

good content on like harness hill

23:52

climbing basically. But yeah,

23:54

essentially it's like research. Let's

23:55

pick that task. Um kind of like a PhD.

23:58

We're going to make a hypothesis. We're

24:00

going to like run the experiments on it.

24:02

We're going to get like get metrics and

24:03

we're going to post them on Slack and

24:04

we're going to like review them and like

24:07

argue our takes about them essentially.

24:11

Yeah. Then the other maybe bunch of

24:12

percentage like 50% is like talking to

24:15

customers like talking to people like on

24:17

Twitter getting a bunch of feedback from

24:18

them on like the open source stuff like

24:20

how can we improve our libraries whether

24:23

that's like lang chain lang graph like

24:24

deep agents anything in like lang and

24:27

then a bunch of that is talking with

24:29

like product teams as well. So there's

24:31

like tons of great teams at Lang Chain

24:34

that do a bunch of good work on like all

24:36

the products that we have. So there's

24:37

tons of learnings that I think come from

24:38

open source that we can like port back

24:41

into the products that we're going to

24:42

build and yeah just keeping that

24:44

feedback loop is good. So I would say

24:46

like it's a mix bunch of like research

24:48

and then engineering stuff and then a

24:51

bunch of like I don't know like what the

24:53

term today is but like devril like

24:56

devril devx which is just like if

24:58

someone asks a question on Twitter like

24:59

we should respond to them and we should

25:01

like put our ideas out there and we

25:02

should like be willing to engage with

25:04

other people's ideas and yeah just hear

25:06

what people are saying. So it's like a

25:07

mix yeah it's a mix of those things.

25:09

what percentage of your article source

25:12

like article is coming from this

25:14

research source I can imagine a certain

25:16

percentage but because dude I mean I

25:20

mean let's just come to harnesses like

25:22

what this what is all about the load

25:24

behind harnesses right you know

25:26

>> so you mentioned that the definition of

25:28

agent is basically model plus harness

25:30

right

25:31

>> so I mean this is something like I mean

25:33

it is being in like people know this

25:35

from quite some time like this is this

25:37

is a fact but I think this is the

25:39

cleanest framing anyone any anyone have

25:42

seen at least on Twitter. So if you're

25:44

not the model, you are a harness, right?

25:46

And and a harness is every piece of

25:49

code, configuration or execution logic

25:51

that isn't the model itself.

25:53

>> So can you walk me through how you

25:56

arrive at the definition?

25:58

>> Yeah. Yeah. Yeah, dude. I think like it

26:00

is it is definitely like a cleanish sort

26:04

of specification of like what is this

26:06

thing that we're talking about and I

26:09

think like maybe the definition doesn't

26:11

really matter like as much like what the

26:13

exact equation is but like there is one

26:16

thing that's helpful which is like when

26:18

you're communicating with someone about

26:20

like how we're going to make this agent

26:21

better we need like some shared language

26:24

so we can talk about like what is the

26:26

thing that we're going to optimize

26:27

basically right so it's like

26:29

like working backward from model

26:32

capabilities because like that's sort of

26:35

the thing that we need to wrap

26:37

intelligence like wrap systems around to

26:40

like amplify the intelligence of the

26:42

model. So it's like I basically view it

26:44

as there's some sort of computation

26:46

happening inside the LLM and like where

26:49

that's happening is over this like

26:51

context window boundary. So like all the

26:54

compute happens when I basically like

26:55

take context from like my system and I

26:59

push it over the boundary and I put it

27:00

into the context window like for the

27:03

model to do computation on and then

27:05

produce tokens basically. And like some

27:07

of those tokens correspond to like tool

27:09

calls and then I go and execute those

27:10

tool calls and like I return the context

27:12

back. And like the reason why I like

27:14

that is because like models by

27:16

themselves they're basically just like

27:18

>> token input machines and like token

27:20

generators basically. But like we need

27:22

to put a system around the model so it

27:25

can do useful things. And I really like

27:29

maybe like working backwards from what

27:31

should the agent do and like maybe even

27:34

like what does my customer want the

27:36

agent to do and then like figure out if

27:38

I just like give it like a really really

27:40

simple model like maybe like really

27:41

really simple harness. Can the agent can

27:43

the model and like the agent can the

27:44

agent basically just do that? And like

27:46

if the agent can just do that with like

27:48

a really simple harness, then that's

27:50

like amazing because then we can just

27:52

like give that to the user essentially.

27:55

Where things maybe get like more

27:57

interesting is like where like a really

27:59

simple harness just like can't do that

28:01

today. And that might just be because

28:02

like it doesn't have the right tools or

28:04

maybe like the model isn't intelligent

28:06

enough to like orchestrate those tools

28:07

in order to do that. Or maybe it's like

28:10

some of our context engineering opinions

28:13

in the harness aren't good enough and

28:14

it's like hey like you're you're putting

28:17

a bunch of like really big tool call

28:19

outputs like into the context window and

28:21

it's like confusing the model. We should

28:24

find out ways to not do that. But these

28:26

are all basically like harness level

28:28

configurations that we're doing and

28:30

they're external to the model. Like the

28:32

model is basically just like a

28:34

computation unit and it computes things

28:36

over its context window and like we need

28:38

to decide what goes into that context

28:40

window so it can do like useful work for

28:42

us.

28:42

>> If I have to ask you some like three uh

28:45

three bullet points what really makes a

28:49

good hardness according to you what are

28:51

they?

28:51

>> Yeah. So there's a bunch, but if if I

28:54

had to pick like three right now, I

28:55

would say

28:57

basically prompting and like very very

29:01

clear instructions

29:03

for better or worse. Like there was this

29:04

whole thing like prompting is dead. Like

29:06

prompting is like totally not dead. It

29:08

is like so useful, so helpful. And like

29:10

I I don't just mean like prompting in

29:12

terms of just a system prompt. Like

29:14

prompting also applies to like the tool

29:17

descriptions as well that get like

29:19

autoloaded into context. It also applies

29:21

to how well your like skills front

29:25

matter explains like how to use these

29:27

skills or like how to use like other

29:28

skills. It it also applies to like if

29:31

you have sub agents, does like the sub

29:33

agent front matter specify like when

29:35

this should be used or like how to use

29:37

it basically. So it's just like

29:38

basically prompting that encodes really

29:41

really good instructions from the user

29:44

or on behalf of the user for like how to

29:47

use this agent to do useful work. That's

29:49

like super important. I think like

29:50

prompting is honestly more important

29:52

today than it ever was before because

29:54

our like the systems we have are way

29:56

more intelligent. So we're able to guide

29:58

them towards doing useful work more

30:00

easily with good prompts. That's one. I

30:03

think the other one that we're spending

30:04

a bunch of time on right now is

30:07

basically verification. So we did like

30:10

some blogs around this on like making

30:12

coding agents better. But there's sort

30:15

of like maybe two things in

30:17

verification. like first is prompting,

30:18

second is like verification. So there's

30:20

like a built-in verification that you

30:23

might inject like into into the harness

30:26

itself. So like that can be like a hook

30:29

basically. So like before the model

30:31

tries to go and exit like force it to

30:33

like recheck the work or like make sure

30:37

>> really

30:37

>> verification is basically like if if I

30:39

give so for example if we just use like

30:42

all the terminal bench tasks, right? So

30:44

like terminal bench task comes with like

30:46

an environment. It comes with like a

30:48

task and then it comes with like a

30:49

verifier that will run after the agent

30:52

thinks it's done, right? But like

30:54

obviously we can't use that verifier

30:55

information. So like what the agent

30:57

needs to do is like it needs to like

30:59

self-verify its work before that

31:01

verifier runs to like be like very very

31:04

sure that the code that it developed

31:07

solves the task that we're that we're

31:08

like trying to solve. Maybe there's two

31:10

parts of that. One part is we need to

31:13

like teach agents what the useful

31:16

primitives are for verifying their work.

31:18

I think like one immediate one if like

31:20

anyone uses like the claude model or

31:23

like even like GPT 5.4 is like agents

31:26

are very susceptible towards like

31:28

picking the easy way out in verification

31:30

which is like they test like trivial

31:32

cases or like not not like very

31:34

difficult cases. Obviously, that fails

31:37

in the verifier because it's just like,

31:38

hey, like I checked like these three

31:39

cases are really easy, so like I'm good

31:41

essentially and like that's bad. Like we

31:44

should teach agents to be much more

31:46

thorough when they're like generating

31:49

verification for themselves. That's like

31:50

one part of it. The other part of it is

31:52

like like this is all code. So like we

31:55

have in our repos tons of like unit

31:58

tests and like tons of like evals that

32:00

we already use. Like that is great

32:03

context that we should give to the

32:04

agent. so that it can like run that eval

32:07

suite and that might be run with a hook

32:08

for example like I don't want like maybe

32:11

the agent won't run it by itself but

32:12

like when it tries to exit that should

32:14

just maybe run my eval suite or a subset

32:16

of it and it should inject the context

32:18

or like the results back to the agent so

32:21

the agent can see like what failed like

32:24

what what passed basically because like

32:26

we need some sort of signal to give back

32:29

to the agent so we can like fix the

32:31

thing that it generated so it's like

32:33

self-verify or like use external signals

32:36

from like existing evals so you can like

32:38

fix the things that are going wrong. And

32:39

I think that's like a really really big

32:41

part of it. And like maybe the last part

32:43

that we're focusing a ton on is

32:47

high level. It's kind of like

32:49

orchestration basically but for doing

32:52

things that are more long horizon

32:55

basically like it's problem

32:57

decomposition and like making sure that

32:59

like when we use like sub agents to do

33:02

problem decomposition like two things

33:03

are true. So one is we're picking the

33:06

right model like agent for the job

33:08

because like every model is like good at

33:11

different things and also that um this

33:14

is a lot of context engineering. We're

33:16

basically like bounding the sub problem

33:19

that the agent needs to do in like a

33:20

decent enough window that it can like

33:22

manage it. Basically what I mean by that

33:24

is um I wanted to like do things in like

33:28

a 50k to like a 150k token range roughly

33:32

or like 200k. sort of it depends on the

33:35

model but like I don't want to give a

33:36

subtask to like a sub agent if it's if

33:40

it's so big that it's like okay it's

33:43

going to start getting into like really

33:45

really high context zones like dumb zone

33:47

which like Dex calls it um from human

33:49

layer which I love and yeah so it's like

33:52

efficiently being able to take a problem

33:54

decompose it and then use like sub

33:56

agents as like compute sources to like

33:58

do those problems and like filter stuff

34:00

back to the main agent and like some of

34:02

it is just good model choice like for

34:04

example like we find that maybe the GPT

34:08

series like 5.4 for is exceptional at

34:10

like planning uh which is amazing and

34:13

like Gemini like I find is like really

34:16

really good at like multimodal stuff and

34:18

so actually so is they all are but like

34:20

Gemini is like really good at it and

34:21

like Flash is actually amazing bang for

34:24

a buck for like speed cost and

34:26

multimodal stuff like a lot of this is

34:28

just informed by like dog fooding and

34:29

evals like hey like we need to like test

34:31

these models and figure out what are

34:33

they good at so yeah I think I think

34:34

those are the three maybe roughly and

34:36

there's like way more obviously so it's

34:37

like like prompting

34:38

like systems around like verification

34:41

like self-improvement uh like via traces

34:44

or like via evals and then the last

34:46

thing is like kind of like orchestration

34:47

but maybe it's like context engineering

34:50

around problem decomposition

34:52

>> makes sense um you just mentioned about

34:54

uh 5.4 for for uh planning. So uh so uh

35:00

pretty much I think it uh not just a

35:03

black box but it is kind of a reasoning

35:06

sandwich where where I mean you

35:08

mentioned as well x high for planning

35:10

high for execution x high for

35:12

verification um like running only at x

35:15

high scored 53.9%

35:18

due to timeouts versus 63.6% at high. So

35:22

I mean that's counterative right? I mean

35:25

does more reasoning made it worse?

35:27

>> Yeah. So I think I think this is

35:29

basically touching on like the point

35:30

that I think about a bunch which is like

35:33

we need to like what we try to do is

35:35

basically like we're trying to design

35:36

like an agent system around like a task

35:39

that we need to solve right and like

35:40

that task has maybe like a bunch of

35:42

constraints like I think the one you're

35:44

talking about is maybe like the the some

35:45

of the terminal bench work that we were

35:47

doing and just trying to publish. So

35:49

yeah like for that use case we we had

35:51

like an artificial constraint which was

35:53

like we have a like a timebounded run

35:57

essentially like after this amount of

35:58

time like the sandbox just like exits

36:00

and like the run doesn't get scored or

36:02

like the run gets scored like wherever

36:04

we left the state of the sandbox and

36:06

yeah so I think maybe the takeaway from

36:08

that is less that like maybe like x high

36:10

reasoning all the way through like

36:12

wouldn't have been better. It actually

36:14

like does a great job. It just takes

36:16

like a really long time. So then it like

36:18

runs out of time to like complete the

36:20

task. But also it's like not compute

36:23

efficient and it's not like cost

36:24

efficient. Like it's awesome to like run

36:26

X high at everything all the time and

36:28

spend a bunch of token on like every

36:29

single problem. Like practically

36:32

speaking um you have to pay for the

36:35

tokens and like also like practically

36:37

speaking from like a user experience

36:38

like am I just going to wait for GPT 5.4

36:41

afford to just like think super hard all

36:43

the time or like can I use a smaller

36:46

model or like a cheaper model that I

36:48

like write really good instructions for

36:50

and it can just go do that task like

36:52

immediately then my user just like sort

36:53

of gets like a more you know like

36:56

latency reduced interaction. So it's

36:59

like yeah I think main takeaway is like

37:01

XH high actually for me is amazing and I

37:03

do a bunch of like planning in X high

37:04

when I'm like just coding but because

37:06

like when I'm in the loop I want like

37:08

feedback because like it's annoying if

37:10

I'm just like staring at a blank screen.

37:12

I use like high for a bunch of like in

37:14

the loop coding. So like X high planning

37:17

and then like high for execution. So but

37:19

yeah it just depends. It like totally

37:20

depends on like the work that we're

37:21

doing. I think that's like the main

37:24

thread that I think about.

37:24

>> Awesome. Okay. I mean yeah that makes

37:27

sense. saw and and I have seen that

37:29

people are using people are preferring

37:32

5.4 xi codeex over opus 4.6 six I mean

37:36

now seven has like mixed opinions I mean

37:39

anyways um so uh again like you said

37:42

about what about hardnesses and

37:44

everything and there was a potential a

37:46

lot of news about file system as well

37:48

like I can't give a count the number of

37:51

blogs I have number of Twitter articles

37:53

I have read about file system right and

37:56

even like in your anatomy post you said

37:58

that the file system is arguably the

38:01

most foundational harness primitive so I

38:04

mean it's a it's It's it's a strong

38:06

claim and um and previously obsidian co

38:09

also mentioned about everything just

38:11

about file system. So why the file

38:13

system and how does it kind of make it

38:16

really influential in in this harness

38:19

design and things around agent

38:21

engineering. What other tools?

38:23

>> I mean I'm like incredibly bullish on

38:25

file systems. I think like a ton of

38:27

people internally also are and like a

38:30

ton of people across industry like very

38:31

bullish on file systems. Like one of the

38:33

early decisions in like DB agent when we

38:35

were building it last year was basically

38:37

like using the file system and that was

38:40

more because we saw like two things. one

38:43

like how useful it actually is for

38:45

context management and like two agents

38:49

are just exceptional at using file

38:50

systems already right so it's like it's

38:52

kind of two things like the model is

38:54

already very very good at using this

38:55

tool so I don't have to coersse it a

38:58

bunch to get good at like using these

39:00

sort of like patterns and like now like

39:02

with newer models is probably even like

39:03

post trainer even more on getting good

39:05

at file system stuff so that's like

39:06

amazing the the other thing that's like

39:08

really amazing about file systems or

39:10

like basically the concept of a file

39:13

system. I I'll I'll maybe like

39:14

generalize it a little bit, which is

39:15

like I need some sort of like persistent

39:18

storage that my agent can use to both

39:21

like access information and then like

39:24

offload information. And like that's

39:26

maybe the higher level primitive like a

39:28

file system ends up being like a really

39:29

really easy way to do that. But like the

39:31

primitive is like the LLM the model

39:35

basically has like this computational

39:37

boundary that I put stuff into and like

39:39

I can take stuff out of essentially,

39:41

right? And like all the comput happens

39:43

here and the decision for like where to

39:47

store stuff and like how to access it

39:48

like file systems end up being fantastic

39:51

storage primitives to do that and like

39:53

the reason why I say like the concept of

39:55

a file system is like in in like lang

39:57

chain like in our libraries we have this

39:59

concept like virtual file systems where

40:01

it's like you expose file system like

40:04

storage essentially right so like the

40:07

operations that you would do on a file

40:09

system for example like ls for example

40:11

right or like you're like grapping over

40:13

that. It depends like what your

40:15

underlying storage system is. But can

40:17

you like use existing storage like for

40:19

example like S3 for example or like

40:22

Postgress, right? And then like what

40:23

does it look like to use that as storage

40:25

and then like put it over the

40:27

computational boundary so like the agent

40:28

can like search over this stuff and like

40:30

pull it into context.

40:32

Like agents are exceptional at doing

40:34

that. And the other thing is like

40:36

context management is so important

40:38

because like the context window is like

40:40

where all the computation actually

40:41

happens that we need some mechanism of

40:43

achieving that which is like why I'm so

40:45

bullish on file systems. It's both like

40:47

and then and then actually like maybe

40:48

one more thing I'll add is

40:51

>> now that we're doing a bunch more stuff

40:53

on multi- aent orchestration and like

40:56

multi- aent like collaboration sort of

40:57

stuff. So I think I said like a little

40:59

bit about decomposing like really big

41:00

problems into like sub problems, right?

41:03

But like where should all of that work

41:05

get stored for all of like the

41:07

decomposition that the sub agents do? So

41:09

like file systems actually also become

41:12

excellent like collaborations places. So

41:16

like sub agents can like write to

41:17

particular files and like main agent can

41:19

like read from there and like it doesn't

41:20

pollute like the main agent context

41:22

window a bunch. So it becomes like a

41:24

place where you just like write files

41:26

and like files are basically excellent

41:28

scratch pads or excellent like like

41:30

planning places or excellent like

41:32

persistent storage places like an agent

41:34

needs to come back to something and this

41:36

sort of like primitive that files encode

41:39

information really well like file

41:41

systems

41:42

offer like interfaces to like external

41:45

storage that already exists and like it

41:48

really helps with context management.

41:50

Like all of those things together I

41:52

think make it really really good for for

41:55

as like a harness tool for like an

41:57

agent. And I think a lot of harnesses

41:59

like like basically I think everyone is

42:01

like settled around file systems like

42:03

like it's uh it's not like too

42:04

controversial to say like I'm going to

42:06

give my agent a file system and like

42:08

that's a part of my harness you know

42:09

like people just sort of like oh yeah

42:10

that that makes sense. It's interesting

42:12

to know right I mean this is something

42:14

so basic something so fundamental is

42:17

kind of changed the whole trajectory of

42:19

the space in like 6 months and everyone

42:22

is kind of getting adapted to this thing

42:24

and on the same note you have uh you

42:27

have also mentioned about memory via

42:29

agents.mmd and and this is something you

42:31

kind of connect with you know like

42:33

injecting and start and you also call

42:36

this continual learning so I'm very

42:38

interesting to know about why do you

42:40

think So, and like is it really or is it

42:43

more like a persistent or consistent

42:44

notepad? So, what you really think about

42:47

this could be aligned to

42:49

>> I think like a a ton of a ton of like my

42:52

work recently has been around like this

42:54

just general idea of continual learning

42:57

basically. So like h how do I help my

43:00

agents which are producing a bunch of

43:02

data over time like I'm using let's

43:05

let's just take like my personal agent

43:06

like I'm using this one agent a ton over

43:08

time

43:09

>> and it's producing a ton of data which

43:12

is like traces essentially right and

43:14

then like all those traces like I'm

43:15

storing somewhere like we store them in

43:16

length you can put all your traces in

43:19

one place and how do I update the

43:22

definition of the agent in order to

43:25

learn from all of the data that it's

43:27

producing Right. So there's like maybe

43:30

two ways to really do that. And memory

43:33

is sort of a subpiece of continual

43:35

learning. Like continual learning like

43:36

overall to me is as I'm acting in the

43:39

world and as I'm like sort of like

43:41

producing data kind of like how we

43:42

humans do. Like I'm doing stuff in the

43:44

world and I'm like learning from the

43:46

feedback that I'm getting, right? Like I

43:48

ran and I tripped and I fell when I was

43:49

a kid and like this is a great trace

43:51

stored in my brain to say like please

43:53

like don't do that. Same thing for

43:55

agents. But the way that we actually

43:58

like update the like the agent knowledge

44:01

is like really different probably

44:03

because like we don't understand exactly

44:05

how like experiential memory that humans

44:09

experience like how does like my

44:10

experiential memory as a human get

44:12

encoded into my brain like I don't

44:14

exactly know how that process works and

44:17

we need to do that process essentially

44:21

for agents and like the agents

44:24

computation boundary is just it's

44:26

context window basically. So I need to

44:28

be able to like take learnings from the

44:30

past and I need to be able to like do

44:32

two things. One is um inject them into

44:37

the context window at the appropriate

44:40

time

44:41

>> so that when that scenario comes up, it

44:44

can like use that prior information to

44:46

like fix the thing. Like for example,

44:48

maybe this comes up in like user memory

44:50

for coding, right? It's like you're

44:52

doing a bunch of like coding with your

44:54

coding agent and then like you give it

44:57

it has that trace and like maybe you

44:58

like annotate that trace with human

45:00

feedback saying like hey like the way

45:02

that you did this or like you use this

45:04

library but like we never use that

45:06

library so like please like always use

45:07

this other library right and it's like

45:09

okay like great should that piece of

45:12

feedback and like context should that

45:13

always be in like my always on memory

45:16

right is that like just in my agents.mmd

45:18

that always gets like loaded in or is

45:21

this something that gets injected like

45:22

in real time into the agent like

45:25

contextually. This is like why I'm super

45:27

interested also in like search as a way

45:29

of doing this because like we're I think

45:33

it's like almost like unfathomable the

45:35

data scale that we're going to start

45:36

producing with agents. So like agents

45:38

run like all the time non-stop. they

45:41

produce like millions of tokens like

45:43

every few minutes and like that's a ton

45:45

of information that we need to like sift

45:47

through to figure out what's useful from

45:49

that and like what's not useful from

45:50

that. So like search is like a really

45:53

really big part of distilling a bunch of

45:55

trace knowledge into like nuggets or

45:58

like memories that I can actually

46:00

retrieve that are useful because like

46:02

tons of that trace will actually be

46:04

noise. So it's sort of this process of

46:05

like distilling

46:07

great data which is like trace data but

46:10

into nuggets that I can actually like

46:12

bring into context when I need to.

46:13

That's like one. And then the other one

46:15

is like really interesting for us is

46:17

instead of just selectively and

46:19

contextually pulling the right thing

46:21

over the like the context window

46:23

boundary for like computation to happen

46:25

over it. So like context engineering

46:26

like you can also just touch the

46:28

weights. So like we like lean in a bunch

46:30

into like open models and like I love

46:32

open models. I use like GLM5 a bunch

46:35

like a ton of the team does as well. And

46:37

that's like amazing as well. That's like

46:39

continual learning by using feedback

46:41

from traces and like distilling that

46:43

into data that you can do like RL on

46:47

essentially and like making that process

46:48

a lot easier. And both are really

46:52

interesting like we're leaning into both

46:54

and I think both will happen. So it's

46:56

actually not going to be like an or like

46:58

everything will be RL or like everything

47:00

will be like context entry. you totally

47:01

need both because there's like tons of

47:04

things that you don't want to RL or like

47:06

it just doesn't make sense to like

47:07

fact-based retrieval like you can like

47:10

include that data in there but it makes

47:13

more sense to do search in order to

47:15

retrieve some of that stuff. So it's

47:16

like yeah those are maybe the

47:18

interesting bits that we're sort of

47:20

leaning into like sort of

47:21

>> you just mentioned there are tons of

47:23

things which you don't want to RL so can

47:27

you mention what kind of arenas do you

47:29

think we should go for RL or we should

47:32

not like where there is like it is

47:35

constrained by compute resources or

47:37

anything

47:38

>> I'm like super bullish on if you're like

47:42

if you're a builder or a company

47:44

producing some sort of like data in

47:46

vertical and you want to like do two

47:50

things. One, make your model way better

47:52

at that task and like basically like fit

47:54

to your data, fit to your use case, then

47:56

also like make it like way faster and

47:57

like way cheaper. Like RL is something

47:59

like definitely like worth exploring

48:00

because fine-tuning has gotten like way

48:03

easier in the last whatever year. Like

48:06

there's actually like amazing companies

48:07

that will help you fine-tune if you like

48:09

bring the data, if you massage it

48:10

properly, like you store all your data

48:12

like Langmith and you can like pull it

48:13

down to do RL over it. Um,

48:16

in terms of things that you like should

48:18

RL on or you shouldn't RL on, I think

48:21

it's really really great if you have

48:23

some sort of like vertical that you want

48:24

to like make your model like really

48:26

really good at. I think we see a lot of

48:27

companies that have started, okay, like

48:30

I'm building this like model and it's

48:33

going to be really really good at search

48:35

and I'm going to expose that as like a

48:37

sub agent to like my main agent and like

48:39

this sub agent is going to rock at that

48:41

or it's like this this model we like

48:44

fine-tune on a bunch of our like

48:46

customer service data and like it's

48:48

really really good at that use case or

48:50

like finance data for example or like

48:51

even even yesterday um like OpenAI

48:53

released Rosalind right which is like

48:55

all about bio

48:57

That's like amazing, right? And that

48:58

also like sort of it it butts heads with

49:01

this whole idea that the general purpose

49:05

everything is just going to like kind of

49:07

like subsume everything, right? It's

49:09

like I'm going to have like one general

49:10

agent that's just going to like it's

49:12

going to be so good. It's just going to

49:13

get exactly what I'm saying. It's going

49:14

like solve the task. Like maybe in the

49:16

limit that is definitely maybe going to

49:18

be true, but to like today like we have

49:20

to build for today, you know? So like

49:22

today it's super helpful actually to

49:24

take the opposite view like curate a ton

49:27

of data and like pick a niche that you

49:30

really care about or like that your

49:31

customers care about and like build the

49:33

best data for that like build the best

49:35

harness for your model around that and

49:38

just like sort of rock at that task. And

49:39

I think like RL is amazing for imbuing

49:42

sort of like vertical specific skills

49:44

into an open model and you get it like

49:48

way cheaper like way faster and like

49:50

depending on the original like training

49:52

distribution of that task in like the

49:55

frontier labs like data mixture like

49:58

you're it's very likely that your

50:00

fine-tune model will be better than that

50:02

open model or sorry than that closed

50:04

model at that task as well because like

50:05

you have the data and you like

50:06

fine-tuned it and like maybe like where

50:08

you don't want to use RL4 is like I I

50:10

honestly think it's a really good idea

50:12

just to start with harness engineering

50:13

like or like just really good context

50:15

engineering

50:17

because it's so easy actually like

50:20

relative to RL that just like pick your

50:22

model like design like a really really

50:24

simple harness around it first like for

50:26

example we have like this abstraction

50:27

and lang chain called like create agent

50:30

which is just a react loop and then you

50:31

can like build a bunch of stuff on top

50:33

of that until like you don't need to

50:34

anymore or you can use like deep agents

50:36

out of the box if you want to and Yeah,

50:38

just like go and build and do maybe

50:40

start with harness engineering and like

50:41

maybe the other point was like

50:44

there's things that like things like

50:46

factbased retrieval like fact-based

50:48

retrieval is just it's just like maybe

50:50

More transcripts

Explore other videos transcribed with YouTLDR.

Get the TLDR of any YouTube video

Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.

Try YouTLDR Free