Full Transcript

·YouTLDR

The Never Ending Lore of Harness | Vivek Trivedy (Product Lead, Langchain)

1:33:2619,026 words · ~95 min readEnglishTranscribed Apr 22, 2026
0:00

Hey everyone, welcome back to ground

0:01

zero. This is episode 13. Yeah, we are

0:04

running fast. Today we have ve from

0:06

langchain. So we leads their work on

0:09

open source agents and harnesses the hot

0:12

term right now. He's the person behind

0:15

DP agents the coding agent that went

0:17

from top 30 to top five on terminal

0:20

bench 2.0 by only changing the harness.

0:23

He's been writing some really good stuff

0:25

with lot of signal and alpha on what

0:28

harnesses actually are. Why agents

0:30

should be more opinated the idea of

0:33

harness as a service and um how planning

0:36

agents are really just dynamic workflow

0:38

generators.

0:40

Before Langen, he ran his own startup on

0:42

visual understanding agents and before

0:44

that uh was a scientist at AWS while

0:47

doing his PhD in CS at Temple. Uh, we'll

0:51

cover a lot into this. Uh, there's a lot

0:53

to get into. We welcome.

0:55

>> Thank you for having me. I'm super

0:57

hyped. I'm super I've been following you

0:58

on Twitter a bunch. So, yeah, I'm glad

1:00

we're making this happen.

1:01

>> How are you doing? And would love to

1:02

know your uh initial VIP check on Opus

1:05

4.7.

1:05

>> First of all, doing great. Whenever

1:07

there's a new model release, you know,

1:08

it's always like a good week for all of

1:09

us. It's maybe like an even more fun

1:12

week for like anyone who does like evals

1:14

on all the models. Um, so yeah, dropped

1:17

yesterday. We started like evaling it.

1:19

We have our like set across our

1:21

products. We have like open source evals

1:22

that we use and like also like for some

1:24

of like Lang Smith's products that we

1:26

use. It's a good model. It's a good

1:28

model. I don't think it was like a crazy

1:29

step change for tons of stuff that we're

1:31

doing. But TBD I think like the fun part

1:34

about stuff we'll like jump into which

1:36

is strong belief that every model needs

1:40

its own custom things that you add to

1:42

it. I know like anthropic release is a

1:43

nice skill uh that you can like easily

1:45

convert prompts and stuff but we're in

1:47

the middle of that process for like the

1:49

agents that we're going to use it for.

1:51

So it's a good model not a crazy step

1:53

change but we'll we'll fit it. We'll

1:54

we'll make it good.

1:55

>> I mean it is interesting in a way that I

1:57

have been seeing a lot of mixed opinions

2:00

right now. People have pretty much mixed

2:02

opinions on 4.7. Basically what they

2:04

have doing it with um the kota users as

2:07

well. I mean in just three four prompts

2:09

you are running out of I mean there's a

2:12

lot of good story I mean interesting

2:13

story behind but but yeah I mean the

2:16

kind of piece about these models being

2:18

coming up be open air or anthropic

2:20

anthropic specifically how they have

2:22

been doing good at public perception and

2:24

effective marketing as I say I mean

2:26

working well working working I mean it's

2:28

been rewarding for them

2:29

>> I mean they're great they're great they

2:30

they put out like great models obviously

2:32

they put out great products around the

2:34

models I think there's definitely some

2:37

stuff where

2:39

people are playing a lot more with the

2:42

models and like they're basically like

2:44

picking use cases they use models for.

2:46

So it's like everyone uses cloud code

2:47

like everyone uses codecs and that sort

2:48

of stuff. But like when you build like

2:50

your agents on top of those models, it's

2:52

like I need to actually care about the

2:54

prompts. I need to care about the

2:55

context engineering. I need to like care

2:57

about the tool design. And I think like

2:59

that's where it's really cool to like us

3:03

putting out content like other like

3:04

really cool people putting out content

3:05

which is like like how do I make a model

3:07

good at like my task basically because

3:09

at the end like my customers that's all

3:10

they care about that's all I care about

3:12

and I think like that's like a bunch of

3:13

the harnessge journey basically whether

3:16

you call context whether you call like

3:17

agent edge it's basically like fit some

3:20

sort of system around this model to make

3:21

it like sit at my task and that's like

3:24

what we're all trying to do and like

3:25

anthropic is trying to help us with

3:26

that. Open models are trying to help us

3:28

with that as well.

3:29

>> Totally makes sense. Um let's dive in um

3:31

about your journey. So you went for a

3:34

PhD in CS at Temple and I mean worth to

3:37

mention you did your bachelor's,

3:38

masters, PhD everything at Temple and

3:40

this has been a talk of the town as well

3:42

in past years on Twitter. People were

3:43

talking about it. People have again I

3:45

mean some opinions about Temple being a

3:48

university, good university or not. So

3:50

my question is to being a scientist I

3:52

mean doing a PhD PhD then to being a

3:55

scientist at AWS to running your own

3:56

startup on agents or visual

3:59

understanding to leading open source

4:00

agents at Langen. How has your journey

4:02

been like?

4:02

>> Happy to dive in. Um cool cool I'm so

4:05

I'm from around this area. So I'm from

4:06

like east coast uh Jersey like

4:08

Philadelphia area. I went to school at

4:10

Temple. So I did my undergrad there did

4:12

my masters there like my PhD there. So

4:14

like super early I was like I'm just

4:17

going to be a doctor like most kids

4:19

pressured by their parents like I'm

4:20

going to be a great doctor like quickly

4:22

realized like I don't really want to do

4:23

that most of my undergrad. So I do my

4:24

underground in math and math is like

4:27

really cool. I think there's a lot of

4:29

concepts in math that like translate

4:30

really well to CS and like physics and

4:32

things like that sort of like systems

4:33

thinking.

4:35

>> Math is also like at least for me maybe

4:37

I'm just not amazing at it. It's

4:38

incredibly hard. So like doing something

4:40

really hard does prepare you for other

4:42

things.

4:44

Yeah, dude. Undergrad was like really

4:45

fun. I enjoyed math. I got into like

4:46

some CS stuff. I think like late 2010s

4:50

was when there was a lot of cool stuff

4:53

in different parts of ML. So like I got

4:55

into computer vision stuff, like

4:57

undergrad research. And like I love

4:59

vision. So like I think vision is still

5:01

one of the coolest things out there.

5:03

There's like way less research done on

5:04

vision even today relative to text. Like

5:08

>> OCR is pretty important, right? OCR is

5:11

like now okay just just send just send

5:14

the PDF to Claude basically and like

5:16

obviously a bunch of systems engineering

5:17

around that but yeah man like I I loved

5:20

vision I still love vision vision was

5:22

really cool so like I did undergrad in

5:24

that did like research around that and

5:25

then I just went straight into like

5:27

masters in PhD like right after I

5:29

graduated like early 2020s and then yeah

5:33

my PhD was basically all around like

5:36

vision focused representation learning

5:39

so yeah I can talk a little bit about

5:40

that. So the first like topics that I

5:42

was working on was like graph neural

5:44

networks which are like I don't know how

5:46

hot those are anymore but I do see like

5:48

some really cool people still doing

5:49

research around those. Um basically like

5:51

graph representation learning but it's

5:52

like graph representation learning for

5:54

like vision basically. So it's like if I

5:56

like decompose an image into like

5:58

particular objects and like I make a

6:00

graph of that and then I do like

6:01

representation learning do we get like a

6:03

better end vector for like retrieval

6:05

like classification and then like we did

6:06

this at also like the data set level as

6:09

well. So like what if I have like kind

6:11

of like few shot examples. It's called

6:12

like transductive learning like use

6:14

other information in the data set to

6:16

help you classify the next thing. Dude,

6:18

that was really cool. Like I think

6:19

graphs I'm like bearish on graphs

6:21

overall actually. So maybe hot take but

6:23

like that was a really cool part of

6:24

research and like that was my first like

6:26

dabbling into like computer vision stuff

6:28

like undergrad then my first like PhD

6:30

topic which like it shifted a little bit

6:32

after like the chat PT moment like tons

6:34

of research became around okay like

6:37

let's do VLMs for everything and let's

6:39

do like representation learning on the

6:41

VLMs like what are VLMs like actually

6:43

seeing when they're doing their like

6:45

attention mechanism over images. So

6:48

yeah, dude, it was great. It was great.

6:50

I like really enjoyed my time in PhD. I

6:52

think it's like you get some sort of

6:54

unbounded time with your adviser to just

6:58

pick an interesting problem and just

6:59

like rabbit hole in it. So I did like

7:01

retrieval stuff like representation

7:03

learning stuff. Yeah, dude. It was

7:04

great. I enjoyed it.

7:05

>> Awesome. Um, so I had a chat with

7:08

Tensorcut the other day. He started

7:10

Paradigma. He dropped out of PhD. So my

7:13

question to you is what do you really

7:15

think about the scenario right now the

7:18

linkage between academia and the

7:20

industry and right now if you have been

7:23

like if someone is going for PhD or

7:24

something like that. So what do you

7:25

really think about is is is that is this

7:27

worth it or how far we have come is

7:31

still necessary to go for a PhD to I

7:34

mean it is again very opinionated um

7:36

question but still I mean I want to

7:38

really understand your

7:39

>> yeah absolutely so like it's a great

7:41

question like people ask me this

7:43

question like locally like my friends or

7:44

like younger brothers and stuff.

7:46

>> Yeah. So like maybe my PhD was like

7:49

slightly different because I was doing

7:51

research at Temple but I was also doing

7:53

research and like working on like prod

7:55

projects when I was at AWS and those are

7:57

happening at the same time and I like

8:00

strongly believe that that is like a

8:02

fantastic mix for anyone who wants to do

8:05

like research but then sort of

8:07

understand maybe like how their research

8:10

is going to be applied in like some

8:11

settings. And I think today like the

8:15

point basically of a PhD to me is like

8:17

you pick a topic that you're like really

8:19

deeply interested in and you like poke

8:22

around the edges of that topic to try to

8:24

figure out like how we can make like

8:25

this thing better. And like that doesn't

8:27

like really require a degree to do that.

8:29

There's tons of like sick researchers on

8:31

X who just like post like random blogs

8:33

and like they don't have a PhD. they

8:35

probably don't maybe don't have CS

8:36

background but there's like you just

8:37

pick a topic you like rabbit hole it

8:41

you just like push the boundary of

8:42

what's possible and you do that like in

8:44

a verifiable way so you like write code

8:46

do experiments you try to share like

8:47

open research and if you're able to find

8:51

a company that allows you to do that

8:52

like lang's fantastic at that like I

8:54

think they really cultivate like hey

8:56

like we're going to like pick this topic

8:57

we're just going to like figure out how

8:58

it works and we're going to like publish

9:00

content about it basically

9:02

>> I would say that's great I think it it

9:03

kind of depends like if you find a great

9:05

company, a good great founder that you

9:07

vibe with that lets you do both.

9:08

Industry is like amazing and like

9:10

especially AI research like it's super

9:12

helpful across a lot of companies. You

9:14

can probably make a lot of money and

9:15

like do interesting research at the same

9:17

time. So yeah, kind of like a

9:18

non-answer, but if you do find that

9:20

scenario amazing if you just want to

9:22

like grind on like some sort of topic

9:24

and PhD for like a bunch of years, also

9:27

great. I actually don't think you can go

9:29

wrong like just by being curious and

9:30

just exploring it.

9:31

>> Yep. I can see you have uh you you were

9:35

like working on your startup about

9:36

visual understanding agents. So I want

9:39

to understand your learnings there and

9:41

how do you see the vision space right

9:44

now like how can you correlate between

9:47

uh the time when you started and the

9:48

time we have come so far with the

9:51

current frontier state-of-the-art

9:52

research and products building. Yeah,

9:54

dude. Um, yeah. So, like I started that

9:56

startup after I graduated like my PhD.

9:59

So, that was sort of like mid last year

10:01

with a friend. And basically like the

10:04

main thing that we were working on like

10:06

starts was called Agentify. And like the

10:08

main idea was basically that basically

10:10

vision compared to text like really lags

10:12

behind in frontier models for like

10:14

things like visual reasoning but also

10:16

things like perception just generally.

10:18

So there's like tons of things where

10:19

you'll like show an image or like an

10:22

object like o two overlapping boxes to

10:24

the model, right? And it's like it

10:25

doesn't like fully understand that those

10:26

two things like overlapping and like

10:28

part of this is just a perception

10:30

problem in the visual encoder where it's

10:32

like some of these like fine grain

10:33

details, it's just not able to

10:34

understand them with like the native

10:36

training that it has. But that I think

10:39

is like a fantastic opportunity because

10:41

it's like how much of that gets absorbed

10:43

into the vision encoder backbone versus

10:46

like how much do we augment models with

10:49

like tool calling behavior that they're

10:51

exceptional at and actually use that as

10:54

the mechanism to like take vision

10:56

capabilities and like put them into the

10:57

models. Like that's basically the whole

10:59

like idea that we were working on. So

11:00

like research and like product around

11:02

that which is like what if I just took

11:04

all of the classic vision models that we

11:06

already have and like a lot of this was

11:08

honestly inspired by Meta's work on SAM.

11:11

So I think like SAM and that whole

11:13

series is like incredible like SAM 123.

11:17

It also supports like video segmentation

11:19

which is like insane and you can also

11:20

like fine-tune it. You can do like meds

11:22

SAM and things like that. So it's

11:23

basically like BET was okay models are

11:26

amazing. They're getting very smart, but

11:28

like their vision capabilities are

11:29

lagging behind. But we can augment them

11:31

with tools and like you can basically

11:34

like do the right tool selection in the

11:37

moment to like get that capability. Like

11:39

segmentation is something that it was in

11:41

Gemini Flash across the Gemini series,

11:43

but like compare that to like SAM,

11:44

right? Like SAM was like way better. If

11:46

you just use like Sam as a tool compared

11:48

to like the native segmentation Gemini,

11:49

you would be just like way happier. and

11:51

like all you had to really do was like

11:52

point to the right spot which is like

11:54

way easier than doing like semantic

11:56

segmentation. So that was the idea. I

11:58

still think that that is true in vision

12:00

today. Like even with like Opus 4.7's

12:04

new benches, it's still not as good at

12:08

visual perception as like we need it to

12:10

be. So I still think tool use is like

12:12

really really exciting for yeah just for

12:16

like agentic systems like visual

12:18

basically making a bunch of like vision

12:19

specific tools for your task and like

12:20

augmenting uh yeah augmenting your agent

12:23

with that.

12:23

>> I think there is a lot of scope to do

12:25

alongside UI bench as well. I mean again

12:29

uh it's more about one's taste but uh

12:31

but there are lots of ifs and buts lot

12:34

of nuances where you really need to take

12:37

care of like even if you're cloning a

12:38

website I mean there's lot of sc uh

12:41

scope to play around something so my

12:43

next question is about your work at

12:45

Loheed Martin. So you you you interned

12:48

there. I think that was your first um

12:51

job and honestly a lot of what people

12:53

see about world is kind of sophisticated

12:56

reals on social media about American

12:58

weaponry. So what was the reality like

13:00

from the inside? What what what you were

13:02

working on? How does it feel like to

13:03

work at some defense um kind of defense

13:06

company and what experience lead?

13:08

>> That is like such a throwback. So that

13:10

was like my first internship at like

13:12

tech ever. So, I was like a bio intern

13:15

in undergrad and I was like looking for

13:16

internships and I gave my resume and I

13:19

got an internship at like Loy Martin

13:20

which is amazing because like I don't

13:22

know how good my bio resume was for

13:24

getting like any internships. Yeah, man.

13:26

I wish I say like tons of stuff I did on

13:29

>> What do you mean by bio resume? It was

13:31

like like you were working on some bio

13:33

>> Yeah. So like I went to undergrad as

13:35

like a biochem major because like I

13:37

wanted to be like a doctor.

13:40

>> Amazing.

13:40

>> Yeah. So like then like after freshman

13:42

year I applied to like internships cuz I

13:44

I switched I wanted to do tech after

13:46

that or like at least explore it with

13:47

like a bio resume and they were like

13:50

dude like what like what are what are we

13:52

doing here? And then like I think I

13:53

basically just like talked like to the

13:55

hiring manager and just said like hey

13:57

I'm like really down to like learn this

13:58

thing like which is like data science

14:00

like that time there bunch of these like

14:01

data science courses and things coming

14:02

out so it was still like early and I was

14:04

like hey like I took these like Python

14:05

classes and like I'm super down to learn

14:08

this. And basically it was like yeah I

14:10

mean it sounds great.

14:13

I ended up working on the data science

14:14

team there and it was basically like my

14:17

first introduction into like kind of

14:21

like data analysis sort of stuff. So

14:23

like understanding like it was much like

14:25

stats basically. So like I wouldn't say

14:27

it was like ML but it was like this is

14:29

like intro to like making plots like

14:32

slice this data this way. So it was a

14:33

bunch of just like empathy for like very

14:36

very messy data as like my first

14:38

internship which is actually like very

14:40

valuable today just like insane amounts

14:42

of data which is like does not look very

14:43

clean and yeah man I wish I say more it

14:46

was basically like a great learning

14:47

experience because I was kind of

14:48

learning how to code and like doing like

14:50

data science stuff and then it was also

14:52

like a decent confidence boost because

14:53

I'm like okay maybe I can do like tech

14:56

stuff and yeah I interned there and it

14:58

was like fun and then yeah I didn't

15:00

really go back after that but I started

15:03

getting into more like research stuff at

15:04

school.

15:04

>> Awesome. Um, also recently I was just

15:07

kind of exploring the timeline. I see

15:09

Mike Mill who is a pretty famous, you

15:11

know, internet celebrity was looking for

15:13

an AI guy and you came up through

15:15

Temple. Apparently Mike was surprised

15:18

how many Temple people are in AI and so

15:21

did you end up connecting with him? Did

15:22

you share anything about Langchen and

15:24

stuff?

15:24

>> So Meek Mill is like he's like a rapper

15:26

from from Philadelphia and like I guess

15:29

he lives around Temple like that's where

15:30

he was from and I think everyone was

15:33

like when they saw that tweet they were

15:34

like Meek Mills get into AI so okay let

15:37

me just like reply basically because I

15:38

think like honestly like randomly

15:40

posting on Twitter X is like awesome.

15:42

You can meet so many cool people like

15:44

that and I we'll talk about this but I

15:47

met like Harrison the founder of like

15:48

CEO and the CEO of like W

15:52

And yeah, he did not reply to me. I hope

15:54

his like startup is doing sick, whatever

15:56

he's whatever he's doing. But like I'll

15:58

like repeat it if he does need someone

16:00

for help with like AI. I'm actually like

16:03

seven blocks down. So I could totally

16:06

like just pull up and help him. So no, I

16:09

think that's a good lesson though is

16:10

just like randomly posting maybe like

16:11

I'll just keep doing that and then maybe

16:13

something will happen.

16:14

>> Yep. Awesome.

16:17

So I mean the next question to you is so

16:20

when did you join Langchain and uh what

16:22

actually pulled you there specifically?

16:24

So and since you joined what actually

16:27

has

16:27

>> So this is like this is so much fun. Um

16:30

I was working on my startup like after I

16:32

finished my PhD that didn't work out

16:34

like we basically stopped around the

16:36

fall. At the same time, I was basically

16:39

like doing my first foray into just like

16:41

posting like random stuff on Twitter

16:44

just like my thoughts like basically

16:45

just like open source stuff like hacking

16:46

on random stuff and

16:49

from a bunch of the stuff I was posting

16:50

around like so like last year I also

16:53

like sort of believe that like we have

16:54

amazing models but like because we did a

16:56

bunch of stuff in this like visual

16:58

understanding space with like agents and

17:00

stuff. I was like very very confident

17:02

that models need like some stuff around

17:04

them to like help them do these tasks

17:06

because like they just suck at them out

17:08

of the box and like we basically saw

17:09

this every day. So that's basically when

17:12

a lot of maybe the ideas that were

17:14

brewing around harnessge like started to

17:17

maybe get more like crystallized and I

17:19

just started like posting about that

17:20

online. It's like, hey, like this is

17:23

maybe like what harnesses look like.

17:25

Like harnesses are like supposed to like

17:26

wrap models and like if we're trying to

17:28

do like vertical tasks. It like really

17:30

helps to have some sort of like

17:31

opinionated like prompts, context

17:33

engineering, like tool call structure

17:35

like all this sort of stuff. And I think

17:38

I just like started DMing Harrison like

17:41

the CEO from that which is like super

17:42

sick. He is also always thinking about

17:46

like the frontier of like AI systems

17:49

which is awesome. And then we started

17:51

chatting maybe like late last year just

17:54

like yeah like what would it look like

17:55

to build open-source infrastructure

17:58

around like agent engineering and like

18:02

maybe the best way to facilitate that is

18:04

by helping people build good harnesses

18:07

like whatever good means like let's

18:08

discover like what good means and make

18:10

open source software about that. So, it

18:12

was basically like, okay, that sounds

18:14

sick. And then I was like, I don't

18:16

exactly know what I'm going to do. Like,

18:17

maybe I'll continue like working on the

18:18

startup or like, but I would love to

18:20

join something that like really aligns.

18:21

So, then I started working with like

18:22

their open source team late like last

18:25

year on what ended up becoming like what

18:28

was deep agents, but ended up becoming

18:30

like a lot bigger. Um, so yeah, we were

18:32

working on like the very very early

18:34

versions of like deep agents last year,

18:36

which is like one of our libraries at

18:38

Langchain that we that we have. It's

18:40

like our library to help people build

18:42

harnesses. Um, or at least it's one of

18:44

the ways that people can build harnesses

18:45

using using Wangchain. And yeah, I loved

18:48

it. I love the team. Uh, amazing people

18:51

doing open source. And then I decided to

18:53

join like full-time in in December.

18:54

>> Amazing. Um, and and I mean, the

18:57

adoption is just crazy, dude. I mean, so

18:59

I want to understand about the growth

19:01

here. So, so again I mean right now

19:03

Twitter is full of people flaming

19:05

millions in ARR every month and but like

19:07

a feels like one of the most you know

19:10

stressed metrics out there. So my

19:12

question is how has lang approached

19:14

growth in real terms be it opensource be

19:17

it community adoption be it enterprise

19:19

or

19:19

>> yeah dude it's a great question. So I I

19:21

think about this a bunch because like I

19:23

think the best way to maybe think about

19:24

it is like basically like work backwards

19:26

from you want to like help people build

19:30

stuff using like the tools you're you're

19:32

putting out there, right? And like the

19:34

goal is basically just like help people

19:36

build like really cool things and like

19:39

make that process of building as easy as

19:40

possible. I think in like open source

19:42

that comes through like very clearly

19:44

because in open source I think you get a

19:46

lot of like empathy for the end user

19:48

because they're like directly using your

19:50

product like all the code is like fully

19:52

visible like go inspect it also like put

19:55

your opinions in like our GitHub issues

19:58

and tell us like what's good what's bad

20:00

like what should we fix like what should

20:02

we add also like it's totally cool to

20:05

like disagree in open source because

20:07

like the maintainers sort of have

20:09

limited bandwidth to address like all of

20:12

the things, but we want to make sure

20:13

that the most impactful things that are

20:15

going to help like the most users build

20:17

like the coolest stuff like we like

20:18

prioritize those. So, I think there's a

20:21

there's a big part of growth which is

20:23

why I like really like X um and like

20:27

these direct feedback channels or like

20:28

Slack for example or just like messaging

20:31

builders and customers because you

20:34

basically get to see exactly what

20:35

they're doing. you build like a lot of

20:36

empathy for shoot like this thing that

20:39

we built like it's a little broken in

20:41

this way or like it doesn't exactly like

20:42

fit the use case and then you hear a

20:44

bunch of those stories and you sort of

20:45

like work backwards to say okay like we

20:47

need to improve like this part of our

20:49

library or like we need to like make it

20:51

possible for others to improve our

20:53

library as well. That's like an amazing

20:54

part of open source that we get tons of

20:56

like amazing feedback, tons of like user

20:59

contributions which is great because you

21:02

sort of like grow with your community

21:04

and I think like that's a really big

21:06

part of open source and related to that

21:08

which I really really like about

21:09

Langchain like one of the reasons why I

21:12

joined and like I really enjoy working

21:13

here is there's a lot of like learnings

21:16

that we get from all the research that I

21:19

do in like open source and like putting

21:21

stuff out there and getting feedback

21:23

that slowly like make their way into our

21:25

products as well because it's like for

21:27

example a lot of stuff in like Lang

21:29

Smith for example which is like okay

21:31

like how do you build good evals like

21:33

how do you how do you actually enable

21:35

agents and users to build like really

21:37

good evals like how do you like

21:38

understand what's happening in traces

21:40

like mind signals from

21:42

>> like a lot of that we put out just in

21:44

the open like I did a bunch of blogs on

21:46

that stuff there's other people who are

21:47

like hacking on that stuff as well and a

21:49

lot of the stuff in open source you sort

21:51

of see how the community interacts with

21:53

it. You also just see the raw numbers

21:55

and you put it out there and it's like

21:56

hey like I would love this or like I'm

21:59

using this and it's like oh we should

22:01

make that as easy as possible. Put it

22:04

into a product and like if people love

22:06

the product then like the rest of it

22:08

sort of takes care of itself. It's like

22:09

yes you will make money you know your

22:13

customers will be really happy and then

22:14

like just continue the loop like just

22:15

keep making it better basically. So I

22:17

think like yeah dude customer feedback

22:19

is amazing like community feedback is

22:21

amazing. So it's like a really really

22:22

big part of I think lang chain a really

22:25

big part of like a a lot of the open

22:26

source stuff that we do

22:27

>> I can imagine of course and more

22:29

specifically here so you are leading the

22:32

open source egen and harnesses work

22:34

right now so what does a typical um week

22:37

looks like for you it's more about

22:39

research engineering or product

22:42

>> yeah dude whatever

22:44

>> I think the fun part is like it's it is

22:46

actually like a mix of a ton of stuff

22:48

and I like really really like that so

22:50

it's like the goal is bas basically pick

22:52

the most important thing to work on at

22:55

this time and then like we'll like we'll

22:57

chat about it maybe over the weekend or

22:59

like the week before like Harrison

23:00

jumped in with with us like we'll DM and

23:03

let's just like sprint towards that and

23:05

build it basically and like maybe what

23:08

that looks like lately

23:11

like lately like a ton of my work has

23:13

been on like eval continual learning

23:17

essentially like methods for using like

23:19

evals and continual learning to make

23:21

like agents and like their harness

23:22

better. So that's like basically like

23:24

the research direction and I would say

23:26

maybe like 50% of the week goes into

23:30

okay let's like pick a research

23:31

hypothesis let's like figure out what

23:33

the experiment design around that might

23:35

be. Like for example, last week we were

23:37

doing a bunch on can you like just in

23:40

time generate evals uh like for any

23:42

given task like what does that look

23:44

like? Like are you overfitting to them

23:45

and like what is your like fitting

23:47

algorithm? There's like tons of stuff

23:48

that we put out. There's like a lot of

23:50

good content on like harness hill

23:52

climbing basically. But yeah,

23:54

essentially it's like research. Let's

23:55

pick that task. Um kind of like a PhD.

23:58

We're going to make a hypothesis. We're

24:00

going to like run the experiments on it.

24:02

We're going to get like get metrics and

24:03

we're going to post them on Slack and

24:04

we're going to like review them and like

24:07

argue our takes about them essentially.

24:11

Yeah. Then the other maybe bunch of

24:12

percentage like 50% is like talking to

24:15

customers like talking to people like on

24:17

Twitter getting a bunch of feedback from

24:18

them on like the open source stuff like

24:20

how can we improve our libraries whether

24:23

that's like lang chain lang graph like

24:24

deep agents anything in like lang and

24:27

then a bunch of that is talking with

24:29

like product teams as well. So there's

24:31

like tons of great teams at Lang Chain

24:34

that do a bunch of good work on like all

24:36

the products that we have. So there's

24:37

tons of learnings that I think come from

24:38

open source that we can like port back

24:41

into the products that we're going to

24:42

build and yeah just keeping that

24:44

feedback loop is good. So I would say

24:46

like it's a mix bunch of like research

24:48

and then engineering stuff and then a

24:51

bunch of like I don't know like what the

24:53

term today is but like devril like

24:56

devril devx which is just like if

24:58

someone asks a question on Twitter like

24:59

we should respond to them and we should

25:01

like put our ideas out there and we

25:02

should like be willing to engage with

25:04

other people's ideas and yeah just hear

25:06

what people are saying. So it's like a

25:07

mix yeah it's a mix of those things.

25:09

what percentage of your article source

25:12

like article is coming from this

25:14

research source I can imagine a certain

25:16

percentage but because dude I mean I

25:20

mean let's just come to harnesses like

25:22

what this what is all about the load

25:24

behind harnesses right you know

25:26

>> so you mentioned that the definition of

25:28

agent is basically model plus harness

25:30

right

25:31

>> so I mean this is something like I mean

25:33

it is being in like people know this

25:35

from quite some time like this is this

25:37

is a fact but I think this is the

25:39

cleanest framing anyone any anyone have

25:42

seen at least on Twitter. So if you're

25:44

not the model, you are a harness, right?

25:46

And and a harness is every piece of

25:49

code, configuration or execution logic

25:51

that isn't the model itself.

25:53

>> So can you walk me through how you

25:56

arrive at the definition?

25:58

>> Yeah. Yeah. Yeah, dude. I think like it

26:00

is it is definitely like a cleanish sort

26:04

of specification of like what is this

26:06

thing that we're talking about and I

26:09

think like maybe the definition doesn't

26:11

really matter like as much like what the

26:13

exact equation is but like there is one

26:16

thing that's helpful which is like when

26:18

you're communicating with someone about

26:20

like how we're going to make this agent

26:21

better we need like some shared language

26:24

so we can talk about like what is the

26:26

thing that we're going to optimize

26:27

basically right so it's like

26:29

like working backward from model

26:32

capabilities because like that's sort of

26:35

the thing that we need to wrap

26:37

intelligence like wrap systems around to

26:40

like amplify the intelligence of the

26:42

model. So it's like I basically view it

26:44

as there's some sort of computation

26:46

happening inside the LLM and like where

26:49

that's happening is over this like

26:51

context window boundary. So like all the

26:54

compute happens when I basically like

26:55

take context from like my system and I

26:59

push it over the boundary and I put it

27:00

into the context window like for the

27:03

model to do computation on and then

27:05

produce tokens basically. And like some

27:07

of those tokens correspond to like tool

27:09

calls and then I go and execute those

27:10

tool calls and like I return the context

27:12

back. And like the reason why I like

27:14

that is because like models by

27:16

themselves they're basically just like

27:18

>> token input machines and like token

27:20

generators basically. But like we need

27:22

to put a system around the model so it

27:25

can do useful things. And I really like

27:29

maybe like working backwards from what

27:31

should the agent do and like maybe even

27:34

like what does my customer want the

27:36

agent to do and then like figure out if

27:38

I just like give it like a really really

27:40

simple model like maybe like really

27:41

really simple harness. Can the agent can

27:43

the model and like the agent can the

27:44

agent basically just do that? And like

27:46

if the agent can just do that with like

27:48

a really simple harness, then that's

27:50

like amazing because then we can just

27:52

like give that to the user essentially.

27:55

Where things maybe get like more

27:57

interesting is like where like a really

27:59

simple harness just like can't do that

28:01

today. And that might just be because

28:02

like it doesn't have the right tools or

28:04

maybe like the model isn't intelligent

28:06

enough to like orchestrate those tools

28:07

in order to do that. Or maybe it's like

28:10

some of our context engineering opinions

28:13

in the harness aren't good enough and

28:14

it's like hey like you're you're putting

28:17

a bunch of like really big tool call

28:19

outputs like into the context window and

28:21

it's like confusing the model. We should

28:24

find out ways to not do that. But these

28:26

are all basically like harness level

28:28

configurations that we're doing and

28:30

they're external to the model. Like the

28:32

model is basically just like a

28:34

computation unit and it computes things

28:36

over its context window and like we need

28:38

to decide what goes into that context

28:40

window so it can do like useful work for

28:42

us.

28:42

>> If I have to ask you some like three uh

28:45

three bullet points what really makes a

28:49

good hardness according to you what are

28:51

they?

28:51

>> Yeah. So there's a bunch, but if if I

28:54

had to pick like three right now, I

28:55

would say

28:57

basically prompting and like very very

29:01

clear instructions

29:03

for better or worse. Like there was this

29:04

whole thing like prompting is dead. Like

29:06

prompting is like totally not dead. It

29:08

is like so useful, so helpful. And like

29:10

I I don't just mean like prompting in

29:12

terms of just a system prompt. Like

29:14

prompting also applies to like the tool

29:17

descriptions as well that get like

29:19

autoloaded into context. It also applies

29:21

to how well your like skills front

29:25

matter explains like how to use these

29:27

skills or like how to use like other

29:28

skills. It it also applies to like if

29:31

you have sub agents, does like the sub

29:33

agent front matter specify like when

29:35

this should be used or like how to use

29:37

it basically. So it's just like

29:38

basically prompting that encodes really

29:41

really good instructions from the user

29:44

or on behalf of the user for like how to

29:47

use this agent to do useful work. That's

29:49

like super important. I think like

29:50

prompting is honestly more important

29:52

today than it ever was before because

29:54

our like the systems we have are way

29:56

more intelligent. So we're able to guide

29:58

them towards doing useful work more

30:00

easily with good prompts. That's one. I

30:03

think the other one that we're spending

30:04

a bunch of time on right now is

30:07

basically verification. So we did like

30:10

some blogs around this on like making

30:12

coding agents better. But there's sort

30:15

of like maybe two things in

30:17

verification. like first is prompting,

30:18

second is like verification. So there's

30:20

like a built-in verification that you

30:23

might inject like into into the harness

30:26

itself. So like that can be like a hook

30:29

basically. So like before the model

30:31

tries to go and exit like force it to

30:33

like recheck the work or like make sure

30:37

>> really

30:37

>> verification is basically like if if I

30:39

give so for example if we just use like

30:42

all the terminal bench tasks, right? So

30:44

like terminal bench task comes with like

30:46

an environment. It comes with like a

30:48

task and then it comes with like a

30:49

verifier that will run after the agent

30:52

thinks it's done, right? But like

30:54

obviously we can't use that verifier

30:55

information. So like what the agent

30:57

needs to do is like it needs to like

30:59

self-verify its work before that

31:01

verifier runs to like be like very very

31:04

sure that the code that it developed

31:07

solves the task that we're that we're

31:08

like trying to solve. Maybe there's two

31:10

parts of that. One part is we need to

31:13

like teach agents what the useful

31:16

primitives are for verifying their work.

31:18

I think like one immediate one if like

31:20

anyone uses like the claude model or

31:23

like even like GPT 5.4 is like agents

31:26

are very susceptible towards like

31:28

picking the easy way out in verification

31:30

which is like they test like trivial

31:32

cases or like not not like very

31:34

difficult cases. Obviously, that fails

31:37

in the verifier because it's just like,

31:38

hey, like I checked like these three

31:39

cases are really easy, so like I'm good

31:41

essentially and like that's bad. Like we

31:44

should teach agents to be much more

31:46

thorough when they're like generating

31:49

verification for themselves. That's like

31:50

one part of it. The other part of it is

31:52

like like this is all code. So like we

31:55

have in our repos tons of like unit

31:58

tests and like tons of like evals that

32:00

we already use. Like that is great

32:03

context that we should give to the

32:04

agent. so that it can like run that eval

32:07

suite and that might be run with a hook

32:08

for example like I don't want like maybe

32:11

the agent won't run it by itself but

32:12

like when it tries to exit that should

32:14

just maybe run my eval suite or a subset

32:16

of it and it should inject the context

32:18

or like the results back to the agent so

32:21

the agent can see like what failed like

32:24

what what passed basically because like

32:26

we need some sort of signal to give back

32:29

to the agent so we can like fix the

32:31

thing that it generated so it's like

32:33

self-verify or like use external signals

32:36

from like existing evals so you can like

32:38

fix the things that are going wrong. And

32:39

I think that's like a really really big

32:41

part of it. And like maybe the last part

32:43

that we're focusing a ton on is

32:47

high level. It's kind of like

32:49

orchestration basically but for doing

32:52

things that are more long horizon

32:55

basically like it's problem

32:57

decomposition and like making sure that

32:59

like when we use like sub agents to do

33:02

problem decomposition like two things

33:03

are true. So one is we're picking the

33:06

right model like agent for the job

33:08

because like every model is like good at

33:11

different things and also that um this

33:14

is a lot of context engineering. We're

33:16

basically like bounding the sub problem

33:19

that the agent needs to do in like a

33:20

decent enough window that it can like

33:22

manage it. Basically what I mean by that

33:24

is um I wanted to like do things in like

33:28

a 50k to like a 150k token range roughly

33:32

or like 200k. sort of it depends on the

33:35

model but like I don't want to give a

33:36

subtask to like a sub agent if it's if

33:40

it's so big that it's like okay it's

33:43

going to start getting into like really

33:45

really high context zones like dumb zone

33:47

which like Dex calls it um from human

33:49

layer which I love and yeah so it's like

33:52

efficiently being able to take a problem

33:54

decompose it and then use like sub

33:56

agents as like compute sources to like

33:58

do those problems and like filter stuff

34:00

back to the main agent and like some of

34:02

it is just good model choice like for

34:04

example like we find that maybe the GPT

34:08

series like 5.4 for is exceptional at

34:10

like planning uh which is amazing and

34:13

like Gemini like I find is like really

34:16

really good at like multimodal stuff and

34:18

so actually so is they all are but like

34:20

Gemini is like really good at it and

34:21

like Flash is actually amazing bang for

34:24

a buck for like speed cost and

34:26

multimodal stuff like a lot of this is

34:28

just informed by like dog fooding and

34:29

evals like hey like we need to like test

34:31

these models and figure out what are

34:33

they good at so yeah I think I think

34:34

those are the three maybe roughly and

34:36

there's like way more obviously so it's

34:37

like like prompting

34:38

like systems around like verification

34:41

like self-improvement uh like via traces

34:44

or like via evals and then the last

34:46

thing is like kind of like orchestration

34:47

but maybe it's like context engineering

34:50

around problem decomposition

34:52

>> makes sense um you just mentioned about

34:54

uh 5.4 for for uh planning. So uh so uh

35:00

pretty much I think it uh not just a

35:03

black box but it is kind of a reasoning

35:06

sandwich where where I mean you

35:08

mentioned as well x high for planning

35:10

high for execution x high for

35:12

verification um like running only at x

35:15

high scored 53.9%

35:18

due to timeouts versus 63.6% at high. So

35:22

I mean that's counterative right? I mean

35:25

does more reasoning made it worse?

35:27

>> Yeah. So I think I think this is

35:29

basically touching on like the point

35:30

that I think about a bunch which is like

35:33

we need to like what we try to do is

35:35

basically like we're trying to design

35:36

like an agent system around like a task

35:39

that we need to solve right and like

35:40

that task has maybe like a bunch of

35:42

constraints like I think the one you're

35:44

talking about is maybe like the the some

35:45

of the terminal bench work that we were

35:47

doing and just trying to publish. So

35:49

yeah like for that use case we we had

35:51

like an artificial constraint which was

35:53

like we have a like a timebounded run

35:57

essentially like after this amount of

35:58

time like the sandbox just like exits

36:00

and like the run doesn't get scored or

36:02

like the run gets scored like wherever

36:04

we left the state of the sandbox and

36:06

yeah so I think maybe the takeaway from

36:08

that is less that like maybe like x high

36:10

reasoning all the way through like

36:12

wouldn't have been better. It actually

36:14

like does a great job. It just takes

36:16

like a really long time. So then it like

36:18

runs out of time to like complete the

36:20

task. But also it's like not compute

36:23

efficient and it's not like cost

36:24

efficient. Like it's awesome to like run

36:26

X high at everything all the time and

36:28

spend a bunch of token on like every

36:29

single problem. Like practically

36:32

speaking um you have to pay for the

36:35

tokens and like also like practically

36:37

speaking from like a user experience

36:38

like am I just going to wait for GPT 5.4

36:41

afford to just like think super hard all

36:43

the time or like can I use a smaller

36:46

model or like a cheaper model that I

36:48

like write really good instructions for

36:50

and it can just go do that task like

36:52

immediately then my user just like sort

36:53

of gets like a more you know like

36:56

latency reduced interaction. So it's

36:59

like yeah I think main takeaway is like

37:01

XH high actually for me is amazing and I

37:03

do a bunch of like planning in X high

37:04

when I'm like just coding but because

37:06

like when I'm in the loop I want like

37:08

feedback because like it's annoying if

37:10

I'm just like staring at a blank screen.

37:12

I use like high for a bunch of like in

37:14

the loop coding. So like X high planning

37:17

and then like high for execution. So but

37:19

yeah it just depends. It like totally

37:20

depends on like the work that we're

37:21

doing. I think that's like the main

37:24

thread that I think about.

37:24

>> Awesome. Okay. I mean yeah that makes

37:27

sense. saw and and I have seen that

37:29

people are using people are preferring

37:32

5.4 xi codeex over opus 4.6 six I mean

37:36

now seven has like mixed opinions I mean

37:39

anyways um so uh again like you said

37:42

about what about hardnesses and

37:44

everything and there was a potential a

37:46

lot of news about file system as well

37:48

like I can't give a count the number of

37:51

blogs I have number of Twitter articles

37:53

I have read about file system right and

37:56

even like in your anatomy post you said

37:58

that the file system is arguably the

38:01

most foundational harness primitive so I

38:04

mean it's a it's It's it's a strong

38:06

claim and um and previously obsidian co

38:09

also mentioned about everything just

38:11

about file system. So why the file

38:13

system and how does it kind of make it

38:16

really influential in in this harness

38:19

design and things around agent

38:21

engineering. What other tools?

38:23

>> I mean I'm like incredibly bullish on

38:25

file systems. I think like a ton of

38:27

people internally also are and like a

38:30

ton of people across industry like very

38:31

bullish on file systems. Like one of the

38:33

early decisions in like DB agent when we

38:35

were building it last year was basically

38:37

like using the file system and that was

38:40

more because we saw like two things. one

38:43

like how useful it actually is for

38:45

context management and like two agents

38:49

are just exceptional at using file

38:50

systems already right so it's like it's

38:52

kind of two things like the model is

38:54

already very very good at using this

38:55

tool so I don't have to coersse it a

38:58

bunch to get good at like using these

39:00

sort of like patterns and like now like

39:02

with newer models is probably even like

39:03

post trainer even more on getting good

39:05

at file system stuff so that's like

39:06

amazing the the other thing that's like

39:08

really amazing about file systems or

39:10

like basically the concept of a file

39:13

system. I I'll I'll maybe like

39:14

generalize it a little bit, which is

39:15

like I need some sort of like persistent

39:18

storage that my agent can use to both

39:21

like access information and then like

39:24

offload information. And like that's

39:26

maybe the higher level primitive like a

39:28

file system ends up being like a really

39:29

really easy way to do that. But like the

39:31

primitive is like the LLM the model

39:35

basically has like this computational

39:37

boundary that I put stuff into and like

39:39

I can take stuff out of essentially,

39:41

right? And like all the comput happens

39:43

here and the decision for like where to

39:47

store stuff and like how to access it

39:48

like file systems end up being fantastic

39:51

storage primitives to do that and like

39:53

the reason why I say like the concept of

39:55

a file system is like in in like lang

39:57

chain like in our libraries we have this

39:59

concept like virtual file systems where

40:01

it's like you expose file system like

40:04

storage essentially right so like the

40:07

operations that you would do on a file

40:09

system for example like ls for example

40:11

right or like you're like grapping over

40:13

that. It depends like what your

40:15

underlying storage system is. But can

40:17

you like use existing storage like for

40:19

example like S3 for example or like

40:22

Postgress, right? And then like what

40:23

does it look like to use that as storage

40:25

and then like put it over the

40:27

computational boundary so like the agent

40:28

can like search over this stuff and like

40:30

pull it into context.

40:32

Like agents are exceptional at doing

40:34

that. And the other thing is like

40:36

context management is so important

40:38

because like the context window is like

40:40

where all the computation actually

40:41

happens that we need some mechanism of

40:43

achieving that which is like why I'm so

40:45

bullish on file systems. It's both like

40:47

and then and then actually like maybe

40:48

one more thing I'll add is

40:51

>> now that we're doing a bunch more stuff

40:53

on multi- aent orchestration and like

40:56

multi- aent like collaboration sort of

40:57

stuff. So I think I said like a little

40:59

bit about decomposing like really big

41:00

problems into like sub problems, right?

41:03

But like where should all of that work

41:05

get stored for all of like the

41:07

decomposition that the sub agents do? So

41:09

like file systems actually also become

41:12

excellent like collaborations places. So

41:16

like sub agents can like write to

41:17

particular files and like main agent can

41:19

like read from there and like it doesn't

41:20

pollute like the main agent context

41:22

window a bunch. So it becomes like a

41:24

place where you just like write files

41:26

and like files are basically excellent

41:28

scratch pads or excellent like like

41:30

planning places or excellent like

41:32

persistent storage places like an agent

41:34

needs to come back to something and this

41:36

sort of like primitive that files encode

41:39

information really well like file

41:41

systems

41:42

offer like interfaces to like external

41:45

storage that already exists and like it

41:48

really helps with context management.

41:50

Like all of those things together I

41:52

think make it really really good for for

41:55

as like a harness tool for like an

41:57

agent. And I think a lot of harnesses

41:59

like like basically I think everyone is

42:01

like settled around file systems like

42:03

like it's uh it's not like too

42:04

controversial to say like I'm going to

42:06

give my agent a file system and like

42:08

that's a part of my harness you know

42:09

like people just sort of like oh yeah

42:10

that that makes sense. It's interesting

42:12

to know right I mean this is something

42:14

so basic something so fundamental is

42:17

kind of changed the whole trajectory of

42:19

the space in like 6 months and everyone

42:22

is kind of getting adapted to this thing

42:24

and on the same note you have uh you

42:27

have also mentioned about memory via

42:29

agents.mmd and and this is something you

42:31

kind of connect with you know like

42:33

injecting and start and you also call

42:36

this continual learning so I'm very

42:38

interesting to know about why do you

42:40

think So, and like is it really or is it

42:43

more like a persistent or consistent

42:44

notepad? So, what you really think about

42:47

this could be aligned to

42:49

>> I think like a a ton of a ton of like my

42:52

work recently has been around like this

42:54

just general idea of continual learning

42:57

basically. So like h how do I help my

43:00

agents which are producing a bunch of

43:02

data over time like I'm using let's

43:05

let's just take like my personal agent

43:06

like I'm using this one agent a ton over

43:08

time

43:09

>> and it's producing a ton of data which

43:12

is like traces essentially right and

43:14

then like all those traces like I'm

43:15

storing somewhere like we store them in

43:16

length you can put all your traces in

43:19

one place and how do I update the

43:22

definition of the agent in order to

43:25

learn from all of the data that it's

43:27

producing Right. So there's like maybe

43:30

two ways to really do that. And memory

43:33

is sort of a subpiece of continual

43:35

learning. Like continual learning like

43:36

overall to me is as I'm acting in the

43:39

world and as I'm like sort of like

43:41

producing data kind of like how we

43:42

humans do. Like I'm doing stuff in the

43:44

world and I'm like learning from the

43:46

feedback that I'm getting, right? Like I

43:48

ran and I tripped and I fell when I was

43:49

a kid and like this is a great trace

43:51

stored in my brain to say like please

43:53

like don't do that. Same thing for

43:55

agents. But the way that we actually

43:58

like update the like the agent knowledge

44:01

is like really different probably

44:03

because like we don't understand exactly

44:05

how like experiential memory that humans

44:09

experience like how does like my

44:10

experiential memory as a human get

44:12

encoded into my brain like I don't

44:14

exactly know how that process works and

44:17

we need to do that process essentially

44:21

for agents and like the agents

44:24

computation boundary is just it's

44:26

context window basically. So I need to

44:28

be able to like take learnings from the

44:30

past and I need to be able to like do

44:32

two things. One is um inject them into

44:37

the context window at the appropriate

44:40

time

44:41

>> so that when that scenario comes up, it

44:44

can like use that prior information to

44:46

like fix the thing. Like for example,

44:48

maybe this comes up in like user memory

44:50

for coding, right? It's like you're

44:52

doing a bunch of like coding with your

44:54

coding agent and then like you give it

44:57

it has that trace and like maybe you

44:58

like annotate that trace with human

45:00

feedback saying like hey like the way

45:02

that you did this or like you use this

45:04

library but like we never use that

45:06

library so like please like always use

45:07

this other library right and it's like

45:09

okay like great should that piece of

45:12

feedback and like context should that

45:13

always be in like my always on memory

45:16

right is that like just in my agents.mmd

45:18

that always gets like loaded in or is

45:21

this something that gets injected like

45:22

in real time into the agent like

45:25

contextually. This is like why I'm super

45:27

interested also in like search as a way

45:29

of doing this because like we're I think

45:33

it's like almost like unfathomable the

45:35

data scale that we're going to start

45:36

producing with agents. So like agents

45:38

run like all the time non-stop. they

45:41

produce like millions of tokens like

45:43

every few minutes and like that's a ton

45:45

of information that we need to like sift

45:47

through to figure out what's useful from

45:49

that and like what's not useful from

45:50

that. So like search is like a really

45:53

really big part of distilling a bunch of

45:55

trace knowledge into like nuggets or

45:58

like memories that I can actually

46:00

retrieve that are useful because like

46:02

tons of that trace will actually be

46:04

noise. So it's sort of this process of

46:05

like distilling

46:07

great data which is like trace data but

46:10

into nuggets that I can actually like

46:12

bring into context when I need to.

46:13

That's like one. And then the other one

46:15

is like really interesting for us is

46:17

instead of just selectively and

46:19

contextually pulling the right thing

46:21

over the like the context window

46:23

boundary for like computation to happen

46:25

over it. So like context engineering

46:26

like you can also just touch the

46:28

weights. So like we like lean in a bunch

46:30

into like open models and like I love

46:32

open models. I use like GLM5 a bunch

46:35

like a ton of the team does as well. And

46:37

that's like amazing as well. That's like

46:39

continual learning by using feedback

46:41

from traces and like distilling that

46:43

into data that you can do like RL on

46:47

essentially and like making that process

46:48

a lot easier. And both are really

46:52

interesting like we're leaning into both

46:54

and I think both will happen. So it's

46:56

actually not going to be like an or like

46:58

everything will be RL or like everything

47:00

will be like context entry. you totally

47:01

need both because there's like tons of

47:04

things that you don't want to RL or like

47:06

it just doesn't make sense to like

47:07

fact-based retrieval like you can like

47:10

include that data in there but it makes

47:13

more sense to do search in order to

47:15

retrieve some of that stuff. So it's

47:16

like yeah those are maybe the

47:18

interesting bits that we're sort of

47:20

leaning into like sort of

47:21

>> you just mentioned there are tons of

47:23

things which you don't want to RL so can

47:27

you mention what kind of arenas do you

47:29

think we should go for RL or we should

47:32

not like where there is like it is

47:35

constrained by compute resources or

47:37

anything

47:38

>> I'm like super bullish on if you're like

47:42

if you're a builder or a company

47:44

producing some sort of like data in

47:46

vertical and you want to like do two

47:50

things. One, make your model way better

47:52

at that task and like basically like fit

47:54

to your data, fit to your use case, then

47:56

also like make it like way faster and

47:57

like way cheaper. Like RL is something

47:59

like definitely like worth exploring

48:00

because fine-tuning has gotten like way

48:03

easier in the last whatever year. Like

48:06

there's actually like amazing companies

48:07

that will help you fine-tune if you like

48:09

bring the data, if you massage it

48:10

properly, like you store all your data

48:12

like Langmith and you can like pull it

48:13

down to do RL over it. Um,

48:16

in terms of things that you like should

48:18

RL on or you shouldn't RL on, I think

48:21

it's really really great if you have

48:23

some sort of like vertical that you want

48:24

to like make your model like really

48:26

really good at. I think we see a lot of

48:27

companies that have started, okay, like

48:30

I'm building this like model and it's

48:33

going to be really really good at search

48:35

and I'm going to expose that as like a

48:37

sub agent to like my main agent and like

48:39

this sub agent is going to rock at that

48:41

or it's like this this model we like

48:44

fine-tune on a bunch of our like

48:46

customer service data and like it's

48:48

really really good at that use case or

48:50

like finance data for example or like

48:51

even even yesterday um like OpenAI

48:53

released Rosalind right which is like

48:55

all about bio

48:57

That's like amazing, right? And that

48:58

also like sort of it it butts heads with

49:01

this whole idea that the general purpose

49:05

everything is just going to like kind of

49:07

like subsume everything, right? It's

49:09

like I'm going to have like one general

49:10

agent that's just going to like it's

49:12

going to be so good. It's just going to

49:13

get exactly what I'm saying. It's going

49:14

like solve the task. Like maybe in the

49:16

limit that is definitely maybe going to

49:18

be true, but to like today like we have

49:20

to build for today, you know? So like

49:22

today it's super helpful actually to

49:24

take the opposite view like curate a ton

49:27

of data and like pick a niche that you

49:30

really care about or like that your

49:31

customers care about and like build the

49:33

best data for that like build the best

49:35

harness for your model around that and

49:38

just like sort of rock at that task. And

49:39

I think like RL is amazing for imbuing

49:42

sort of like vertical specific skills

49:44

into an open model and you get it like

49:48

way cheaper like way faster and like

49:50

depending on the original like training

49:52

distribution of that task in like the

49:55

frontier labs like data mixture like

49:58

you're it's very likely that your

50:00

fine-tune model will be better than that

50:02

open model or sorry than that closed

50:04

model at that task as well because like

50:05

you have the data and you like

50:06

fine-tuned it and like maybe like where

50:08

you don't want to use RL4 is like I I

50:10

honestly think it's a really good idea

50:12

just to start with harness engineering

50:13

like or like just really good context

50:15

engineering

50:17

because it's so easy actually like

50:20

relative to RL that just like pick your

50:22

model like design like a really really

50:24

simple harness around it first like for

50:26

example we have like this abstraction

50:27

and lang chain called like create agent

50:30

which is just a react loop and then you

50:31

can like build a bunch of stuff on top

50:33

of that until like you don't need to

50:34

anymore or you can use like deep agents

50:36

out of the box if you want to and Yeah,

50:38

just like go and build and do maybe

50:40

start with harness engineering and like

50:41

maybe the other point was like

50:44

there's things that like things like

50:46

factbased retrieval like fact-based

50:48

retrieval is just it's just like maybe

50:50

more of a search problem like I just

50:52

want to find the thing and I want to

50:53

inject it into my context essentially.

50:57

So it's like yeah that might be like one

50:58

example where it's like hey like you can

50:59

RL this thing and maybe RL on the domain

51:01

but like the way that you put it over

51:03

the like boundary for computation the

51:06

context window is just find it

51:08

essentially via some search mechanism

51:10

>> you mentioned about search previously

51:11

right like you will be going for search

51:14

like essentially so uh there comes this

51:18

concept of context context ro so you

51:21

site the chroma research on how models

51:24

get words on on as context fills up

51:27

maybe compaction tool call offloading

51:30

skills as you know progressive

51:31

disclosure. So which of these has the

51:33

biggest impact in practice when when it

51:36

comes to context fraud and and what are

51:39

the what are the kind of potential um

51:41

practices you specifically use to avoid

51:43

these?

51:45

>> They they all matter actually and I I

51:48

think it it sort of depends on like the

51:51

design that you're going for out of the

51:53

box, right? So I think like maybe maybe

51:55

like a good recipe essentially is that

51:57

like we start building the agent with

52:00

like a goal in mind like I want the

52:02

agent to do this thing but like really

52:03

really focus on like context rot because

52:06

after you pass like some sort of like

52:07

context threshold it gets like just like

52:09

really dumb and like like you said we

52:11

have like levers to fight against that

52:13

which is I can use like sub agents to

52:16

decompose the problem into like more

52:17

manageable chunks so I don't pollute my

52:19

main context window like that's like

52:21

amazing but like basically what what

52:24

like predicates that is that I can

52:25

actually efficiently decompose the

52:27

problem, right? So it's like maybe

52:29

that's like instructions that I give in

52:31

the system prompt to the agent of saying

52:33

like this is how you like decompose a

52:35

problem into like these tasks and like

52:38

if it's like a task specific agent then

52:40

you probably already have a bunch of

52:41

like human priors for how to go tackle

52:43

the problem. Like for example, for

52:44

coding agents, the way that we decompose

52:46

a problem is like you have agents that

52:49

do like sub agents do like codebased

52:50

search essentially and they like do that

52:53

separately and they pull in the

52:55

important information into like the main

52:57

agent to do some of that stuff and like

52:59

maybe there's like a web search agent as

53:00

well like

53:01

>> has to go like pull external information

53:02

and like find that and prepare it for

53:04

the agent. So it's like yeah basically

53:06

like working backwards from like I need

53:08

to avoid context rot like one way to do

53:10

that is like sub aents like is my

53:12

problem amendable to like sub aents like

53:14

if it is fantastic another way to do

53:16

that is like and these are often in

53:18

conjunction like we we like lang like

53:21

our docs we publish like a bunch of

53:22

stuff on like multi- aent docs as well

53:25

and skills are kind of related to that

53:27

which is skills to me they basically

53:31

kind of like encode knowledge and

53:33

workflows like skills are awesome

53:35

because everyone before skills like

53:40

hated writing good docs if that makes

53:42

sense. Like everyone was just like so

53:44

lazy

53:45

>> and they were like I'm just going to

53:47

like tell the model like some sort of

53:48

like random stuff like kind of like hand

53:50

wavy and it'll just get it. But like for

53:52

some reason like skills came out and

53:54

like because maybe skills are like

53:55

sharable and like other people like see

53:57

the skills like everyone writes like

53:59

very very good like workflow

54:01

descriptions in skills and like the

54:03

agent sort of like sees the skill

54:04

content and then it executes the

54:06

workflow and that's like amazing because

54:10

>> I basically get like a very small

54:12

snippet of like when to use this skill

54:14

and like I avoid all the context rot and

54:17

like when necessary we like pull in the

54:20

right context from the skill into there.

54:22

Like the the tricky thing with skills is

54:24

always like

54:26

basically knowing when to trigger them.

54:27

And that again comes down to like

54:29

instruction following which is like we

54:31

have some skills evals as well where

54:32

like we'll have scenarios and then we'll

54:35

sort of have like the skills that we

54:36

want to have like triggered basically

54:40

and then like we we have evals where

54:42

it's like we we only want that skill to

54:44

be triggered because like like let's say

54:46

it like triggers the wrong skill first

54:48

and then like eventually it does a bunch

54:49

of like stuff and then it figures out

54:51

like oh actually I need to do this skill

54:52

like that's bad because you wasted a

54:54

bunch of tokens essentially. So I think

54:56

eval help a bunch with context rot which

54:59

is one does the problem succeed at the

55:02

end like that's a really big part of

55:04

evals and the other one is sort of like

55:06

fine grained metrics on the evals which

55:09

is like how long did it take like how

55:11

many tokens did it take what was the

55:13

overall cost right and then like reading

55:14

the trajectory and then seeing like in

55:17

my effort to reduce like context rot by

55:19

doing like sub agent routing or like

55:21

triggering the right skills is that

55:23

working and then there's like maybe also

55:25

like determine ministic stuff which is

55:26

good. So like tool call offloading. So

55:29

like this happens a bunch with like bash

55:31

calls like you have you you you run the

55:33

shell and uh it's just like a mess. So

55:36

you get this gigantic like tool like

55:38

this output string and you can just pipe

55:40

that into context or you can just take

55:43

like the head and the tail and you pipe

55:45

that into context because that's usually

55:47

the important bits and then you tell the

55:49

model that like the rest of this string

55:51

lives in this file over here if you can

55:53

if you want to access it and then you

55:55

can go and do that. So, it's basically

55:56

like doing a bunch of stuff on the

55:59

model's behalf to really protect that

56:02

like incredibly precious artifact, which

56:04

is our context window. And like we I we

56:08

just think like really hard about like

56:10

if something doesn't need to go in here,

56:12

like really like don't put it in there.

56:13

But if something does need to go in

56:15

here, like do our very best to like

56:18

spend compute on like search or like

56:19

really good instructions to make sure it

56:21

gets in there.

56:21

>> Makes sense.

56:23

Interesting. I mean that that actually

56:25

makes a lot of sense. Um I'm curious

56:27

about so for people who may who may not

56:31

know the space well. So there's been

56:32

like open claw boom. I mean I I just saw

56:36

on Twitter it is kind of declining as

56:38

well. So Hermes has been getting as much

56:40

attention as open claw right which which

56:42

is coming out of news research. So how

56:45

does

56:46

deep agents differ from both of them? It

56:49

would be useful to explain this from

56:51

first principles for both technical and

56:53

nontechnical listeners since we are

56:55

going to spend a lot of time talk about

56:57

hardness in this conversation.

56:58

>> Both amazingly sick projects like

57:02

openclaw amazing like what Peter did

57:04

there and then also like what the new

57:05

guys are doing with Hermes is like so

57:07

cool. Yeah. I think like the main way

57:09

that I think about it is like you have

57:11

this like claw architecture right looks

57:13

like a little bit different from like

57:15

claw to claw but like overarching

57:17

architecture of like I deploy this

57:20

somewhere there's some sort of like

57:21

messaging

57:24

>> it's like live talk to it back and forth

57:27

there's like a heartbeat that triggers

57:29

like over and over again that has like

57:31

some sort of like memory primitives in

57:33

there so it's like it's basically like a

57:36

very opinionated

57:38

harness for the use case that is like my

57:42

personal agent. So I think like a claw

57:44

is like really the first it's the first

57:47

like really mainstream personal agent

57:50

like maybe like besides chatbt like

57:52

chachi is like it didn't like really

57:54

feel like a personal agent like it had

57:56

like memory and stuff like people like

57:57

message their claws like all day like

57:59

maybe people do that with chatbt too but

58:00

it's like the architecture of the

58:02

harness behind like the claw that makes

58:05

it like feel really personal because of

58:07

all the things they put behind it around

58:08

like the integrations like like what's

58:10

happening and telegram and those types

58:12

of things and like the memory that gets

58:13

updated. Like the big thing is honestly

58:14

like I like the heartbeat thing a lot. I

58:16

think like that doesn't get enough hype.

58:17

It's like very ingenious to like wake it

58:20

up on some cadence to like do things for

58:22

example crrons and things like that. So

58:24

the way I think about it like high level

58:26

is a claw is an amazing choice of an

58:31

opinionated harness for like a personal

58:34

agent essentially. And that's like an

58:38

awesome choice that like they make. And

58:41

I think we should have like a lot more

58:43

of these like people should like build

58:45

their own or like people should use them

58:46

more and see if they like them. And then

58:50

maybe like going back to like the

58:51

primitives, I think like you can build

58:54

tons of agents that are not claws that

58:57

like completely solve like your task

59:00

like really really well. And that's

59:03

basically like how I view like maybe

59:04

like Langchain's create agent or like

59:07

deep agents or like all of the other

59:09

great companies that are like building

59:11

harness primitives which is your

59:14

probably your task does not require like

59:16

a claw like most most likely like it's

59:19

awesome like you should have a claw in

59:20

your life but like if you're doing

59:23

something else like you don't need a

59:24

claw. So like actually what you need is

59:27

amazing instructions, amazing context

59:30

engineering, like amazing choice of like

59:32

what models you're going to use to hit

59:34

like the paro frontier of like Perf cost

59:38

and latency and like you can start from

59:40

like a simple harness and you can

59:42

assemble a harness around like that

59:44

model or like models to like build that

59:47

thing essentially. And I think like claw

59:49

is like one instantiation of an

59:51

opinionated harness for like personal

59:53

agents basically and it's like awesome.

59:55

And I know like people use claws for

59:57

like other things as well. So like I

59:58

think claws if you like edit the harness

1:00:01

around them like the base harness and

1:00:02

you like make them I don't know if you

1:00:04

like change them for like another task

1:00:06

of research that's like awesome too. But

1:00:08

I think like that whole process of like

1:00:10

taking a task and like you have a

1:00:13

harness that like wraps a model or like

1:00:14

models and like you sort of like direct

1:00:16

it towards a goal. That's basically I

1:00:19

think the goal of like laying chains

1:00:21

like create agent and like deep agents

1:00:23

which is we have like some opinions in

1:00:24

there to get you started but like really

1:00:27

we want to help you build the best agent

1:00:30

like for your tasks. Like that might be

1:00:33

us giving you all the tooling. That

1:00:34

might be like me and like the rest of

1:00:37

the team like blogging about like actual

1:00:38

use cases and like sharing our evals and

1:00:41

like just publishing results. But yeah,

1:00:44

basically like customize a base harness

1:00:46

to make it like really really good at a

1:00:47

task and like a claw is a like

1:00:49

phenomenal example of

1:00:51

>> makes sense.

1:00:53

What what you really see the future of

1:00:55

it? I mean I mean let's say down the

1:00:58

line what the next um what what it would

1:01:02

look like after let's say five

1:01:04

iterations of it or what you really see

1:01:06

the future of in in a year or so let's

1:01:10

say

1:01:11

>> I mean honestly in a year and six months

1:01:12

like everything's going to change

1:01:13

obviously no I'm just kidding like it's

1:01:15

it's hard to say like I'm like in the

1:01:18

short term very very bullish on

1:01:21

basically helping people build like open

1:01:25

or like us providing open infrastructure

1:01:28

to help other people build agents that

1:01:31

are like amazing for their task. And I

1:01:32

think that is not going away in the near

1:01:36

to medium term at all. In fact, I think

1:01:39

it's going to go in the complete

1:01:40

opposite direction, which is like

1:01:41

everyone is going to start basically

1:01:43

taking their tasks and they're going to

1:01:45

either do like harness engineering

1:01:47

around those tasks, which is largely

1:01:49

like very good context engineering, like

1:01:50

very good like prompts and very good

1:01:52

tools and like very good skills. They're

1:01:54

going to do all of that around like some

1:01:57

sort of task they care about basically.

1:01:59

And I think open harness engineering is

1:02:01

a big part of that. And I think like

1:02:02

open models are also like a really big

1:02:04

part of that which is we're going to see

1:02:06

like a big growth of I'm going to take

1:02:08

like Kimmy, I'm going to take like GLM5

1:02:10

and I have this data and like a big

1:02:12

future is I'm just going to like

1:02:14

fine-tune that model on my data and I'm

1:02:17

going to make it really good. I'm going

1:02:18

to just keep doing that over and over

1:02:19

again and I'm going to compare how that

1:02:21

does to like a Frontier model and I'm

1:02:23

going to make the trade-off between like

1:02:25

is it better, is it like just as good,

1:02:27

what's like the cost, what's like the

1:02:28

latency tradeoff and then like maybe

1:02:31

like a little bit more like longish term

1:02:35

from that which is like it would be

1:02:38

awesome if we got like some sort of like

1:02:43

AGI model that just did everything. I

1:02:46

would love that. Like so then I can like

1:02:48

totally stop talking about like

1:02:50

harnesses and like evals and I can just

1:02:52

like enjoy the model. But it still

1:02:55

really does help to specify like the

1:02:58

intelligence that we want that model to

1:03:01

like act on in a particular situation.

1:03:03

And like I still think even in like the

1:03:05

medium to long term, it's going to be

1:03:06

super helpful for humans to get really

1:03:09

really good at both describing the thing

1:03:13

that they want and not just like hand

1:03:15

wavy like writing like kind of how we

1:03:17

like write really detailed prompts like

1:03:20

getting comfortable with taking the

1:03:22

thing I want and like putting it into

1:03:24

like language basically. And then the

1:03:25

other thing is we're still going to want

1:03:27

to like verify the work that agents are

1:03:30

doing in in some way. I hope like

1:03:32

autonomous verification systems get a

1:03:34

lot better, but like they're not going

1:03:37

to be perfect and we're still going to

1:03:38

want to be able to say like when an

1:03:40

agent is doing good versus like when an

1:03:42

agent is doing bad and that can become

1:03:43

like part of the feedback signal and I

1:03:46

still think that's going to exist like

1:03:47

for for a little bit and like that's

1:03:48

that's like not a bad thing. That's like

1:03:50

totally fine like we can still work on

1:03:52

that.

1:03:53

>> Makes sense. Um you know dude there are

1:03:56

a lot of I mean bunch of companies I

1:03:57

mean interestingly everyone who is

1:03:59

working on frontier are coming out of

1:04:00

their own harness own their own agent.

1:04:03

So so recently RAM basically built their

1:04:05

own harness right. So have you seen what

1:04:08

they put out? So and my other question

1:04:11

is like you yourself um have used open

1:04:13

code. So it seems like um enterprises

1:04:16

building custom harnesses puts real

1:04:19

pressure on competitors. So I'm guessing

1:04:21

a um release like that forces companies

1:04:24

like SLA which also got recently big fun

1:04:27

who competed to RAM and others to build

1:04:29

their own thing too. So what do you make

1:04:32

of this trend?

1:04:33

>> Yeah, I mean like the ramp is amazing

1:04:35

obviously like they put out such fire

1:04:37

blogs. I like ramp lab stuff. I think

1:04:39

like the the overall trend of like

1:04:42

building your own harness or like or

1:04:44

basically like building your own agent

1:04:46

that's custom for your task is like

1:04:48

fantastic. Like I I think more teams

1:04:50

should basically devote time towards

1:04:53

like investing maybe like in the process

1:04:56

of both like helping their teams build

1:04:59

agents, right? That doesn't just mean

1:05:00

like coders. That means like everyone

1:05:02

like the people who are doing go to

1:05:03

market like marketing, sales, all those

1:05:06

people can like benefit in some way from

1:05:08

agents. They just need like help doing

1:05:09

that basically. And I think it's great

1:05:12

that a company basically like picks a

1:05:14

problem and they're like we're going to

1:05:18

solve that by building the best harness

1:05:20

and that means like the best context

1:05:22

edge like the best verification the best

1:05:23

tool stack also a big part of it and

1:05:27

like we work on a lot of the stuff at

1:05:29

like lang is building the correct or

1:05:32

building like really easy to use systems

1:05:35

for taking the trace data and then like

1:05:38

improving the agent because I think

1:05:39

there's a lot of stuff around

1:05:40

improvement loops which is our first

1:05:43

pass at the agent isn't amazing. So like

1:05:45

this comp these companies are like okay

1:05:47

I'm going to pick a task I care about

1:05:49

and like my first version is going to be

1:05:51

like kind of mid totally fine but then

1:05:53

I'm going to get the data from somewhere

1:05:56

and I'm going to like make it better

1:05:57

over time by just like spending a ton of

1:05:59

time on it or like maybe spending a

1:06:01

bunch of like compute on it to like

1:06:02

understand the data and like improve the

1:06:05

prompts like fix the edge cases like

1:06:07

improve all the errors right and it's

1:06:09

just like I still think like we we will

1:06:11

have tons of vertical companies because

1:06:13

today like someone has to do the work

1:06:16

like someone has to like invest in doing

1:06:18

that like someone has to do like sales

1:06:19

around that right it's not just going to

1:06:20

like happen by itself and I think like

1:06:23

more tooling around that and like more

1:06:25

yeah just more like research that helps

1:06:27

people do that that's like a good thing

1:06:28

like doing the open is like an even

1:06:30

better thing so it's like more

1:06:33

>> I think it's also very ambitious to do

1:06:34

you know you you're you're already at

1:06:37

Frontier and why to depend on someone

1:06:39

like let I mean if you want to be at

1:06:41

frontier you have to build something

1:06:42

like what what other people at are

1:06:44

working on. Interesting. So, um beforeh

1:06:47

going to the other segment of the

1:06:48

podcast, let's have some quickfire

1:06:50

chats. Uh so, so there is a meme you

1:06:53

liked from Mintly Fly Slack. Would be

1:06:56

awesome if you can share screen and

1:06:58

share that. Yeah, please. Then I'll go

1:07:00

ahead.

1:07:00

>> Let me

1:07:02

let me get that off. Um

1:07:05

I love this guy. Let me share. Dude,

1:07:07

this guy is so funny. A dude, I love

1:07:09

this guy so much. Dude, this guy this I

1:07:11

don't know like where this came from or

1:07:13

like who can I turn on volume?

1:07:16

>> So then I mean you like this like this

1:07:19

from mental slack channel that

1:07:21

apparently went viral across I think

1:07:23

startups. So what is it and what it made

1:07:25

it so hard? What's

1:07:26

>> so I have I have no idea. I think it's

1:07:28

Nick. It's Nick from Mintify like

1:07:30

tweeted it one day who's funny like I

1:07:33

was just like this is amazing. So I I

1:07:35

just sent it to all my friends um just

1:07:37

like randomly. I think it was our like

1:07:38

uh like soccer chat. I'm like something

1:07:41

happened with like Arsenal or something.

1:07:42

I I sent this to like my friends because

1:07:44

they I think they lost. Yeah. And like

1:07:45

now we have this in our Slack as well.

1:07:48

Like someone made it into like a gift

1:07:50

and like whenever maybe something goes

1:07:51

wrong like we just sort of throw this

1:07:53

guy like I don't know what it is or who

1:07:55

made it, but like I love this guy. I use

1:07:57

it all the time.

1:07:58

>> Awesome. Um my next question is um most

1:08:03

underrated harness feature that nobody

1:08:06

talks about. most underrated.

1:08:09

It's a good question because like I feel

1:08:10

like if it's underrated, we should be

1:08:11

talking about it.

1:08:14

>> Yeah, exactly.

1:08:16

>> Okay. Okay. Okay. I think like one thing

1:08:19

that we use a bunch is like this idea of

1:08:23

like we call it middleware but like

1:08:24

hooks just generally. So like for for a

1:08:28

lot of teams it's like super useful to

1:08:31

inject sort of like deterministic

1:08:33

actions like basically like do

1:08:35

deterministic code execution like

1:08:37

somewhere in the harness. And I think

1:08:39

that's like super underrated maybe

1:08:40

because it requires like sort of like

1:08:41

custom logic. It's not just like you

1:08:43

think of a tool and you just sort of

1:08:44

like add it. But yeah, I think hooks

1:08:48

that sort of like control bad model

1:08:51

behavior are like really really helpful

1:08:53

or like not just bad model behavior like

1:08:54

help the model like do things. So for

1:08:55

example like triggering excuse me

1:08:58

triggering like self-verification and I

1:08:59

think people should like build more

1:09:01

hooks to control their models. Makes

1:09:03

sense. Um interesting. So we have

1:09:06

something which people should talk

1:09:08

about. Interesting. The model that

1:09:10

surprised you most in agent workloads

1:09:13

this year

1:09:14

>> in both I mean we can go in both ways

1:09:18

which was like something which you were

1:09:20

not expecting and it comes out really

1:09:22

better and something which you kind of

1:09:24

not expecting and it was like it comes

1:09:26

out.

1:09:26

>> Yeah. So, I'm like so impressed by open

1:09:30

models generally as like actually ways

1:09:33

that I get work done. And like I think

1:09:36

like it's it always like feels really

1:09:38

good to talk about open models, but like

1:09:40

you sort of like love the idea of open

1:09:42

models, but then like you don't use

1:09:43

them. Like that's like that's not good.

1:09:46

But actually like the open models that

1:09:48

have come out this year are like

1:09:50

amazing. So like the GLM series is like

1:09:53

fantastic and like it is actually a good

1:09:57

agentic coding partner. It's like very

1:09:59

fast and it does amazing work. So like

1:10:02

maybe at the start of this year like

1:10:04

last year I don't think I would have

1:10:05

expected my GLM f like my GLM use to be

1:10:09

so high and there's other models too

1:10:11

like um like the Ry team who you had on

1:10:13

like they're they're amazing. Miniax is

1:10:15

one that we actually like eval on a

1:10:16

bunch and like these are all amazing.

1:10:18

So, like open models have surprised me.

1:10:21

Like I was hoping it would happen, but

1:10:23

it did happen and that's awesome and

1:10:25

like we should invest a bunch more in

1:10:27

that and like I hope like I hope like

1:10:30

teams actually like think about using

1:10:31

them in like their actual workloads

1:10:32

because they're amazing. Yeah, that

1:10:34

surprised me but in a in a good way. I

1:10:36

was like super happy and like it's only

1:10:38

going to get better and that's like

1:10:39

really really good and it's like way

1:10:41

cheaper and faster.

1:10:42

>> Awesome. Um okay, this one is lost. one

1:10:44

thing you would change about how the

1:10:46

industry builds agents right now. It can

1:10:49

be any common practice or something like

1:10:50

that.

1:10:52

>> How should they change? I think

1:10:54

basically like this whole thing that

1:10:56

we've been talking about right now is

1:10:57

like I would love if like that was like

1:11:00

easier for people to do or like more

1:11:01

people like did it which is basically

1:11:03

like maybe like work backwards from a

1:11:06

task and like a goal that you really

1:11:07

want and then like the whole point to me

1:11:11

is just like build a system like for

1:11:13

your team or like for yourself and like

1:11:15

for your agent to like make it better

1:11:17

over time. Like maybe like I'm saying

1:11:18

that because I'm like we're thinking a

1:11:20

lot about continual learning. So this is

1:11:22

like both the agent design which is like

1:11:24

prompts tools like the whole harness

1:11:26

thing like the verification loops like

1:11:27

all this sort of stuff and then also

1:11:28

it's like sort of the infrastructure

1:11:31

around it for doing like

1:11:32

self-improvement. So this is like the

1:11:34

unsexy stuff, but I think the stuff that

1:11:36

like really matters, which is okay like

1:11:38

are you like is tracing on? Like are you

1:11:41

like putting your traces somewhere

1:11:43

basically like are you using your traces

1:11:47

to like mine errors like via monitoring

1:11:50

basically? like lang supports that and

1:11:51

like we think about that a bunch which

1:11:52

is like trace came in like how do I

1:11:54

figure out if something happened and

1:11:55

like am I making eval from that right

1:11:57

and then like am I am I like reading the

1:12:00

evals basically so it's sort of like the

1:12:01

systems approach around like building an

1:12:03

agent and like making it better I think

1:12:06

teams are doing that that's amazing but

1:12:08

like it's awesome and I think teams

1:12:10

should team should try to do that

1:12:11

>> makes sense um also on the same note

1:12:14

there was this um recent paper called

1:12:16

meta harness and DDR also posted about

1:12:19

it a lot of people are working on auto

1:12:21

research and this field adjacent if not

1:12:24

a version of auto research itself then

1:12:26

you have also things like you know post

1:12:28

train bench where a hardness is used to

1:12:31

post train models so if those two

1:12:34

directions start merging so better

1:12:37

harnesses improving post training and

1:12:39

meta harness improving the hardness loop

1:12:40

itself I mean that feels pretty

1:12:42

explosive pretty interesting how do you

1:12:44

think about that convergence

1:12:46

>> I I think it's super exciting like I

1:12:48

love teams that are like

1:12:49

productionalized ing like auto research

1:12:51

and like doing so like we we have I

1:12:53

think we did like something around like

1:12:55

harness opt maybe like a couple months

1:12:58

ago and there were definitely like some

1:13:00

issues that I saw back then and I think

1:13:03

we still have like some of the issues

1:13:04

but like now like a lot more teams like

1:13:06

putting a lot more effort into it. So

1:13:07

it's like I think this is amazing that

1:13:09

and also maybe I'll just like this is my

1:13:11

take as well. So, like I have like we

1:13:13

put up a bunch of blogs and I think like

1:13:16

there's there's like algorithms that we

1:13:18

need to discover to make like agents and

1:13:20

harnesses better using some sort of like

1:13:22

grounding signal. And that's basically

1:13:24

like auto research is like I have a

1:13:25

grounding signal and I hill climb that

1:13:27

signal and like I update my harness and

1:13:28

like meta harnesses that like we have

1:13:30

one like better harness and like we tons

1:13:32

of good people have like work around

1:13:33

this which is amazing. And basically

1:13:35

like I view like eval as such an

1:13:38

important part of this like feedback

1:13:41

loop because like eval are basically how

1:13:42

we like ground our like auto research

1:13:45

loops over time and it's not just like

1:13:46

ground like in the moment like if I run

1:13:48

auto research like later like a two

1:13:50

weeks later I still have that same like

1:13:52

grounding mechanism and maybe hot take

1:13:55

but I think like you can almost try to

1:13:59

define an agent via a set of like evals

1:14:02

that sort of serve as not just a spec,

1:14:04

but a spec that you can like verify and

1:14:06

like ground. You can do it via fitting

1:14:08

to like eval.

1:14:11

And then you basically have a fitting

1:14:13

algorithm, right? And the fitting

1:14:14

algorithm can be like meta meta harness

1:14:16

or like better harness or like any of

1:14:18

these. And that fitting algorithm is

1:14:20

basically run on evals like reflect on

1:14:23

traces and like update harness or like

1:14:26

prepare data to do like RL on it. And I

1:14:28

think like we're in such early early

1:14:30

innings of this self-improvement loop

1:14:34

basically and I'm I'm super excited

1:14:35

about it. I think it's like really

1:14:36

really cool. There's like stuff to work

1:14:38

out around overfitting and stuff but

1:14:41

like that will happen and like people

1:14:42

will use this a bunch more.

1:14:43

>> Awesome. Makes sense. Um

1:14:45

>> I'm pretty much looking forward to as

1:14:47

we're approaching to the next um I mean

1:14:49

last segment of the podcast we have some

1:14:50

questions around environments harnesses.

1:14:54

pretty much harnesses we have covered

1:14:56

but eval and benchmarks around

1:14:58

benchmarks. So, so how do harness fit

1:15:02

into this broader idea of simulation as

1:15:05

a service? So there are companies whose

1:15:07

whole business is simulating work

1:15:09

categories, decisions, operating

1:15:11

environments. If better harnesses lead

1:15:14

to better simulations, so where does the

1:15:16

open-source side go and do you think

1:15:18

langen will eventually release an open

1:15:20

source simulation?

1:15:20

>> Yeah. So like I think these these two

1:15:22

things are super related. So like thing

1:15:25

that we wrote about before there's like

1:15:28

like evals and like environments like

1:15:30

they're not the same but they sort of

1:15:32

like rhyme with harnesses as well. So

1:15:34

it's like basically like the like the

1:15:37

main idea is like I need like some place

1:15:40

for my agent to do work that sort of

1:15:42

like reflects actual work that's going

1:15:44

to be doing like in the real world

1:15:46

basically right so it's like I'm going

1:15:47

to build an environment like there's

1:15:48

tons of like awesome environment

1:15:49

startups that are doing that and like

1:15:51

running the agent in them so it can

1:15:53

produce like a good feedback signal so I

1:15:54

can like train on basically that's like

1:15:56

amazing. I think like even a big part of

1:16:00

like evals are going to start looking

1:16:02

like environments because like when when

1:16:05

we first started trying to like eval

1:16:08

it was really simple. It was like chat

1:16:10

completions evals, right? It was like

1:16:12

I'm going to give you like a really

1:16:13

simple like input prompt and I'm going

1:16:15

to have like a number or like a

1:16:17

structured output at the end of it. I'm

1:16:18

just going to like map the keys, right?

1:16:19

I'm going to be like, "Hey, did you like

1:16:21

did you get them all right?" But as

1:16:23

agents are doing much more like

1:16:25

complicated work and like much more like

1:16:27

long horizon work actually like the

1:16:29

thing I want to eval is like a task

1:16:31

essentially and like the the best way to

1:16:34

maybe do that is to just like build an

1:16:36

environment and just like drop my agent

1:16:38

into the environment and like maybe like

1:16:42

what we do right because we actually do

1:16:43

this like we basically use Harbor right

1:16:45

and like Harbor those guys are awesome

1:16:46

like the the terminal bench guys

1:16:49

so like we'll pick our eval like it maps

1:16:52

to some sort of like hardware config and

1:16:54

then like we run the eval in the

1:16:56

environment that we built. Then like all

1:16:58

of the traces like go into Lenmith and

1:17:00

then we like read them and we look at we

1:17:02

like segment them based on like the

1:17:04

rubric like how much did it pass, how

1:17:05

much did it fail like how long did it

1:17:06

take and then we try to like improve the

1:17:08

agent and I think that process of like

1:17:11

building the environment and like you

1:17:13

asked about simulations like we think I

1:17:14

think about this a bunch which is like

1:17:16

what I really want to happen is the like

1:17:19

the company that we're building or like

1:17:21

the app or product that I'm building

1:17:23

like I want my agent to be able to like

1:17:25

test itself in that exact environment.

1:17:28

So it can figure out like when stuff

1:17:30

goes wrong essentially and then I can

1:17:32

like fix it, right? And like that's

1:17:34

basically the whole point of eval which

1:17:35

is like

1:17:36

>> they're sort of like a proxy for what

1:17:39

happens in production and like as I fit

1:17:41

to my evals I'm kind of imbuing like

1:17:45

behavior into the agent to make it pass.

1:17:48

The whole goal of evals is like to make

1:17:49

them pass, right? Like and like a lot of

1:17:51

our evals fail because like maybe the

1:17:53

models just aren't smart enough yet.

1:17:55

like eventually they will pass and like

1:17:58

then what I've done is like I've taken

1:17:59

that information from that eval and I've

1:18:01

sort of like transfer learned it into

1:18:03

like some sort of agent whether it's

1:18:04

like the weights or like the harness or

1:18:06

something and yeah I'm bullish on both

1:18:08

I'm bullish on like eval as a mechanism

1:18:10

of doing like agent improvement and also

1:18:12

bullish on like more eval looking like

1:18:16

environments basically instead of like

1:18:18

just like input output

1:18:19

>> pretty interesting you know this uh

1:18:21

terminal bench 2.0 Sweet bench pinch

1:18:23

bench. So the benchmark landscape of for

1:18:26

agents is growing fast but you um

1:18:29

explicitly say in your opinionated

1:18:31

agents post where I mean you said test

1:18:33

on real world users for your product

1:18:35

don't trust benchmarks your user has

1:18:37

never heard of terminal benchtop please

1:18:39

don't introduce it to them right

1:18:41

>> so so so so which benchmarks do you

1:18:44

actually trust and which ones are most

1:18:45

performance theater I mean I mean what

1:18:48

what's the general landscape where one

1:18:51

should actually think

1:18:52

>> I was definitely a bit hyperbolic like

1:18:53

like don't introduce anyone to Turbo

1:18:55

Veg. Like I love Turbo Veg. Like those

1:18:57

guys are awesome. But I think I think

1:18:58

the general point actually does stand

1:19:00

though, which is

1:19:02

like like eval to me are basically like

1:19:04

again they're like a mechanism of like

1:19:07

evals and benchmarks. They're basically

1:19:09

like a mechanism that like proxies

1:19:10

behavior that I want my agent to

1:19:12

actually have like via this like thing

1:19:16

which like roughly measures that, right?

1:19:18

So it's like I'm trying to measure like

1:19:21

long horizon like problem solving. Like

1:19:26

can I do that with like a really hard

1:19:27

terminal bench task? Like kind of maybe.

1:19:31

But like if my actual like app has

1:19:34

nothing to do with that, then like me

1:19:36

passing that like terminal bench task

1:19:38

doesn't map well into like my like long

1:19:40

horizon problem solving like the bio

1:19:42

domain, right? So it's like there's sort

1:19:44

of like rough proxy signals that measure

1:19:47

like so like in at like Langshin we have

1:19:50

like axes that we try to measure on. So

1:19:52

like every eval we tag to like an

1:19:54

access. So it's like retrieval, it's

1:19:55

like problem solving, it's like

1:19:56

planning, it's like tool use for example

1:19:58

like we we like try to tag every eval

1:20:02

we do that we tag every eval like one or

1:20:04

multiple axes, right? But I think it's

1:20:08

useful to use benchmarks as like a

1:20:11

general like vibe like a guidance. Like

1:20:14

you should definitely read the traces

1:20:15

from benchmarks. We like I like spend

1:20:17

tons of time every day just like reading

1:20:18

the traces from pre-built benchmarks.

1:20:21

But I think really the thing that helps

1:20:23

teams is

1:20:26

using their trace data to build evals

1:20:29

for themselves that actually like map

1:20:31

onto their customer use case that maybe

1:20:33

like no existing benchmark like does a

1:20:36

really good job of. And I think like

1:20:37

it's kind of like a moat if you want to

1:20:40

call it but it's like it's just a really

1:20:41

good way of like building a better agent

1:20:43

product which is there's awesome people

1:20:46

building awesome benchmarks. None of

1:20:47

those benchmarks map exactly onto like

1:20:49

what my agent needs to do. So like I can

1:20:52

like use those to roughly measure

1:20:54

problem solving ability, but like really

1:20:56

the best way to measure problem solving

1:20:57

ability is just to like get a

1:20:59

representative set of like evals and

1:21:01

tasks like my own bench and just like

1:21:04

use those and like that's going to vary

1:21:05

from like person to person like product

1:21:06

to product like feature to feature. So

1:21:09

>> makes sense.

1:21:09

>> Yeah. Yeah. Yeah.

1:21:11

>> Yeah. What's your opinion on computer

1:21:13

use stuff? Because this is u this is

1:21:15

something very subject to people like

1:21:16

the current approach is not good. You

1:21:18

can't really go in the screenshot way.

1:21:20

You really can't use MPI MCP or API way.

1:21:23

You have to bullish. You have to scale

1:21:24

GUI stuff. So what do you think about it

1:21:27

and about it scaling part of

1:21:29

>> dude? We were just talking about this

1:21:30

today actually like how much should we

1:21:32

do like more examples on computer use?

1:21:33

Like I'm like very fascinated by

1:21:35

computer use. I think it's like super

1:21:36

interesting. I think like there's maybe

1:21:38

like two things. One is there is still

1:21:41

definitely a visual perception problem

1:21:42

like that we like we've known for a

1:21:45

while like like fine grain details is

1:21:47

not it's not like amazing at that maybe

1:21:48

it's like less of a limitation now like

1:21:50

some of these models are like better at

1:21:52

computer use I don't know I don't have

1:21:55

like a great opinion on which way of

1:21:58

doing computer use is going to win like

1:21:59

the hybrid like pulling down like the

1:22:01

actual like webpage content and like

1:22:03

clicking versus like how much do you use

1:22:05

screenshots I would be like very happy

1:22:08

If like everything was just like just

1:22:11

worked with vision because that would

1:22:12

mean that we did we have made like a

1:22:14

step change in like visual perception

1:22:16

and like visual reasoning over

1:22:17

screenshots and like doing or sorry like

1:22:18

yeah visual reasoning over like

1:22:20

>> images basically.

1:22:22

>> Yeah, I don't know if it'll happen. Um

1:22:24

but I'm like for the applications of

1:22:26

computer use I think they're awesome and

1:22:28

like we should do we should do like more

1:22:30

stuff around those but I don't know. I

1:22:32

just haven't played with it as much.

1:22:33

>> I mean yeah awesome. Makes sense. So I

1:22:36

mean do you think there is some secret

1:22:38

sauce something which can be scaled to

1:22:42

scale more long horizon task about in

1:22:44

your experiment experience what what is

1:22:47

something which is blocking

1:22:49

uh because I think a lot of companies

1:22:52

are being forming up involvements for

1:22:53

long horizon task these days and been

1:22:56

selling to enterprises and frontier labs

1:22:57

now I mean what do you think of the

1:22:59

space about scaling

1:23:01

>> I think

1:23:02

>> I think like there's a lot of like good

1:23:05

work that a I think a lot of companies

1:23:07

have like good agents in like medium

1:23:10

horizon tasks like for example like we

1:23:12

like we have like a background coding

1:23:13

agent that can go like do things over

1:23:15

like hours like a few hours basically

1:23:17

right it's like they're all coding

1:23:18

related tasks like it's easy to like

1:23:20

pick those and like scale those and I

1:23:22

think yesterday there was like really

1:23:23

good work from like the proximal team

1:23:25

for like frontier suite which are like

1:23:28

hey like these are like 20our tasks

1:23:30

basically and like we're going to go and

1:23:33

run on

1:23:34

I think like one thing that is still

1:23:37

like really really tricky for models and

1:23:40

I think like in the near term what's

1:23:42

will happen is like hopefully like

1:23:44

models get like post-trained better on

1:23:45

this but like we will still have to

1:23:46

build a bunch of like harness

1:23:47

infrastructure around it which like

1:23:48

hopefully falls away is one like

1:23:51

decomposing a really difficult problem

1:23:54

into like subpieces

1:23:56

and then doing like verification of the

1:23:58

intermediate steps. I think like that is

1:24:00

like a really really good general

1:24:03

purpose recipe that we can use to like

1:24:07

keep doing like longer and longer

1:24:08

horizon tasks because like basically

1:24:11

like all a long horizon task really is

1:24:14

is like I'm going to do I'm going to get

1:24:16

this like really hard task. I'm just

1:24:18

going to do like a bunch of like little

1:24:20

sub pieces like over and over and over

1:24:21

again. And I need to make sure that like

1:24:23

I don't mess up any of the sub pieces or

1:24:24

like if I do mess up I need to like go

1:24:26

back and fix those

1:24:29

like the key thing is like figuring out

1:24:30

like when you messed up that's hard. So

1:24:33

so we we need better like

1:24:34

self-verification systems there that

1:24:36

might be like self bootstrapping like

1:24:38

testing for example

1:24:39

>> and like the other thing we need to do

1:24:40

is like teach systems how to like

1:24:44

decompose problems into like sub agents.

1:24:46

I think like there's really cool stuff

1:24:47

around RLMs around this.

1:24:52

They're like I still find them like a

1:24:53

little bit tricky to get working, but

1:24:54

like the ideas behind them like amazing

1:24:56

basically like externalize context as

1:24:59

like an object and then like sort of

1:25:00

like search over that and like decompose

1:25:02

problems like that for like really

1:25:04

really long horizon tasks. I don't know

1:25:06

it doesn't work amazing right now but

1:25:07

like the general strategy of like verify

1:25:09

and then like decompose like iteratively

1:25:12

I think that's like a good path forward.

1:25:15

like we're we're spending time there as

1:25:16

well. I'm sure like a bunch of other

1:25:17

people are well.

1:25:20

>> Awesome.

1:25:22

Great. Um I think we are pretty much uh

1:25:23

to the end of the part and um so what is

1:25:26

uh what is something which you are most

1:25:29

excited about to happen in let's say

1:25:31

again in 6 months or year because again

1:25:33

like we pretty much don't know but you

1:25:35

really want to see to yeah to happen.

1:25:38

>> Two things like one I'm super excited

1:25:40

for the World Cup. So like World Cup is

1:25:41

happening like here it's happening in

1:25:44

Philly. So, I'm like super stoked for

1:25:45

that. But besides that, I think like the

1:25:48

thing I'm like super stoked about is

1:25:51

we're we're like just starting to get

1:25:53

the first sparks of these like

1:25:56

self-improvement loops from data that's

1:25:59

generated from agents. And like we're

1:26:01

pushing like a ton on this like in the

1:26:03

last like couple months like we put our

1:26:05

like first like research around this.

1:26:06

There's other good teams doing this. But

1:26:07

I think like this is like such an

1:26:09

amazing on-ramp for us for like all

1:26:11

teams to self-improve like all of their

1:26:14

systems by doing like very very good

1:26:17

like data engineering like looking at

1:26:19

all of their trace data like mining it

1:26:21

for errors and like bootstrapping self

1:26:24

like probably to start they're going to

1:26:26

be like semi-autonomous self-improvement

1:26:28

loops like like humans will need to be

1:26:30

in it but the systems will get better

1:26:32

and better and I think the the flow of

1:26:36

build agent

1:26:37

use an environment, generate data from

1:26:40

it, and then like mine the data, point a

1:26:44

lot of compute at the trace data to

1:26:46

derive like eval and to derive training

1:26:48

data and then like use that to like

1:26:50

improve the agent. Just keep doing that

1:26:52

loop. That is like super exciting to me.

1:26:54

And it it's it already works actually

1:26:56

like every like people like we're doing

1:26:58

it like people are already doing it. It

1:26:59

like works. Customers are doing it. It's

1:27:01

awesome.

1:27:02

But it will only get better, I think,

1:27:04

with like better models and like we're

1:27:05

going to build everyone's going to build

1:27:06

better systems around some of this

1:27:08

stuff. So, yeah, I'm stoked in six

1:27:10

months. Like, I can't even imagine like

1:27:11

how good this loop is going to be. It's

1:27:12

going to be amazing.

1:27:15

>> Likewise. Totally. What's the next blog

1:27:18

coming?

1:27:18

>> Next blog. Um, okay. I'm supposed to

1:27:20

write one over this like weekend. Yeah.

1:27:22

Hopefully like next week. Yeah. Oh,

1:27:23

maybe like one thing that's cool I like

1:27:25

lang chain is like because we talked in

1:27:27

the beginning I actually think blogs are

1:27:30

like fantastic like artifacts like work

1:27:32

backwards from so it's like but your

1:27:34

team does a bunch of like amazing work

1:27:36

and like you should like totally share

1:27:37

that work so you can like kind of like

1:27:38

pick like a blog it's like I want to

1:27:40

write a blog about this and it's like

1:27:42

okay like what's all the work I have to

1:27:43

do to make sure that that blog doesn't

1:27:45

like suck basically. Yeah. Yeah.

1:27:47

>> That's great.

1:27:48

>> Yeah. Yeah. I think there's one I'm

1:27:50

thinking about a bunch which is like um

1:27:52

it's less like like agent engineering

1:27:54

stuff but more just like how much we've

1:27:56

like unbundled like agents. I think

1:27:58

there's been like a huge like unbundling

1:28:00

of agents uh into like pick a base

1:28:04

harness and like pick your skills like

1:28:05

pick your tools like um design your

1:28:08

agent like design the models and it's

1:28:10

not just like one monolithic system like

1:28:12

you totally don't have to get locked

1:28:13

into anything like you have the choice

1:28:15

to build like bespoke tooling for

1:28:17

yourself like for your company and like

1:28:19

the unbundling is awesome and like I

1:28:20

think like people are doing cool stuff

1:28:21

around that so hopefully like I'll like

1:28:23

riff on something about that or

1:28:24

something or just whatever I don't

1:28:29

Great. Okay. We um last question to you.

1:28:31

Um so imagine so um so the world is the

1:28:35

technology is changing

1:28:38

by an order of magnitude every week. we

1:28:40

all can see uh what advice would you

1:28:43

give to someone who is just starting out

1:28:45

of college who is someone 20 20 21 year

1:28:48

old because because things are not same

1:28:50

as it has been like I can say it's been

1:28:54

like like couple of years ago it's not

1:28:56

the same the world is changing so fast

1:28:58

and and it's sad to see that lot of

1:29:01

people are actually I mean don't even

1:29:04

care about what is really happening

1:29:05

right so even like if someone is

1:29:08

starting out college so what should they

1:29:10

really look forward to? I mean to to be

1:29:12

at frontier and to actually scale on

1:29:15

things to actually learn and be at good

1:29:17

places. So what's your opinion over

1:29:20

that?

1:29:22

>> I don't know how amazing advice I can

1:29:23

give on this honestly but like maybe

1:29:25

like some like general thoughts of like

1:29:28

what I was thinking when I was like

1:29:30

finishing like PhD and stuff and also

1:29:32

like there's like so many sick like kids

1:29:34

who are just like graduating undergrad

1:29:35

already like that I see on Twitter doing

1:29:37

great work. I think like there's there's

1:29:38

a couple like common threads which are

1:29:40

really cool which is basically like you

1:29:42

just sort of like pick something you're

1:29:44

like kind of interested in and you just

1:29:46

use like AI to help you learn that and

1:29:50

you just like kind of like rabbit hole

1:29:52

like really deep into that one thing.

1:29:54

And I think like that's probably like

1:29:57

really really useful because you can

1:29:58

kind of maybe use AI to become like top

1:30:02

maybe like 10% or like 5% of the world

1:30:04

if you like care enough and like the

1:30:06

problem is not like super crazy. And I

1:30:09

think like that's like really good. And

1:30:10

the other thing is like I think like

1:30:12

it's awesome when people just like post

1:30:14

their thoughts like online. And um I was

1:30:17

saying like it helped me like meet a lot

1:30:19

of like cool people. I see like awesome

1:30:22

like posts on X and like I love

1:30:23

interacting with them, but I think it's

1:30:25

basically just like it's kind of scary

1:30:27

to maybe like put your ideas like online

1:30:29

like dude I'm gonna get like roasted

1:30:30

like first I'm going to get roasted like

1:30:32

by my friends who are like dude why is

1:30:34

he posting so much on like Twitter about

1:30:35

like AI but it's totally fine like you

1:30:37

kind of like get over it but like it's

1:30:40

just like really good to like sort of

1:30:41

share your ideas because it helps you

1:30:43

like other people like challenge you and

1:30:44

then like you realize like oh like that

1:30:46

idea was dumb or maybe that idea was

1:30:47

like really good like resonates with

1:30:48

people and like the only way maybe for

1:30:50

like other people to like really help

1:30:52

you is if they like see your work or

1:30:54

they see your thoughts and then like I

1:30:55

think there's so many people who are

1:30:56

like willing to help. So just like maybe

1:30:58

like pick something just like grind on

1:31:00

it just like post about it basically.

1:31:01

And I feel like if you do that enough

1:31:03

times then something good will hopefully

1:31:06

happen or like you'll have learned

1:31:08

something which is also like really good

1:31:12

>> dude. Um this is so honest and I can

1:31:14

totally relate with both of your points

1:31:16

and basically this is something which I

1:31:17

have experienced again like because

1:31:19

there are so many trajectories so many

1:31:21

arenas opening as AI is evolving to

1:31:24

learn to to actually u make your make

1:31:27

you context aware about things I mean it

1:31:29

can be anything it can be posting side

1:31:31

of things pre-training inference

1:31:32

engineering environments data a lot I

1:31:35

mean you can't really keep it up about

1:31:37

things so again as you said use AI use

1:31:40

your knowledge sources and like read

1:31:42

good blogs, references, hack on,

1:31:44

experiment on and this is something and

1:31:46

that is the reason I mean even good

1:31:48

professors lot of colleges are not

1:31:50

actually wor about things. So I think

1:31:52

this is the best time to learn and

1:31:55

actually dig on things and and I think

1:31:58

there are wide arenas where one can

1:31:59

master one thing right because everyone

1:32:02

needs master of something and get into

1:32:04

places and let's I mean it can be

1:32:06

anything it can be even hiring as well

1:32:08

if you're really good at it so you can

1:32:09

make it to the places of course and as

1:32:11

you said about posting about stuff dude

1:32:14

I mean this is so underrated I mean if

1:32:16

you are really good poster if if you if

1:32:18

you can really uh kind of um convey your

1:32:22

thoughts well

1:32:24

amazing opportunities can open up and

1:32:26

this has been happening for me and I

1:32:27

have seen a lot of amazing people been

1:32:29

to places just by like I've I've

1:32:32

interviewed bunch of people I can give

1:32:34

example of kalome he's he's 19 he just

1:32:37

did ready to wait he went to meet

1:32:40

Shopify CEO then he got hired at prime

1:32:42

so I mean there's so many people who

1:32:44

have just gone to the same trajectory

1:32:46

just by posting their thoughts online

1:32:48

and it is and it is a fascinatingly

1:32:50

>> rewarding

1:32:51

It is actually rewarding. Totally. Um

1:32:55

awesome. I think um we at a wrap. So

1:33:00

thanks Viv. Uh for everyone listening,

1:33:02

deep agent is open source. Everything is

1:33:03

on GitHub and absolutely you read a

1:33:06

web's blog coming on um Twitter. It's

1:33:10

it's just amazing and that is something

1:33:12

which has led to this conversation. So I

1:33:14

hope more more and more of them coming

1:33:17

and follow him at um with Tan on

1:33:21

Twitter.

1:33:21

>> Yeah, dude. This was so fun. Oh, I had a

1:33:23

blast.

Get the TLDR of any YouTube video

Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.

Try YouTLDR Free