The Never Ending Lore of Harness | Vivek Trivedy (Product Lead, Langchain)
Hey everyone, welcome back to ground
zero. This is episode 13. Yeah, we are
running fast. Today we have ve from
langchain. So we leads their work on
open source agents and harnesses the hot
term right now. He's the person behind
DP agents the coding agent that went
from top 30 to top five on terminal
bench 2.0 by only changing the harness.
He's been writing some really good stuff
with lot of signal and alpha on what
harnesses actually are. Why agents
should be more opinated the idea of
harness as a service and um how planning
agents are really just dynamic workflow
generators.
Before Langen, he ran his own startup on
visual understanding agents and before
that uh was a scientist at AWS while
doing his PhD in CS at Temple. Uh, we'll
cover a lot into this. Uh, there's a lot
to get into. We welcome.
>> Thank you for having me. I'm super
hyped. I'm super I've been following you
on Twitter a bunch. So, yeah, I'm glad
we're making this happen.
>> How are you doing? And would love to
know your uh initial VIP check on Opus
4.7.
>> First of all, doing great. Whenever
there's a new model release, you know,
it's always like a good week for all of
us. It's maybe like an even more fun
week for like anyone who does like evals
on all the models. Um, so yeah, dropped
yesterday. We started like evaling it.
We have our like set across our
products. We have like open source evals
that we use and like also like for some
of like Lang Smith's products that we
use. It's a good model. It's a good
model. I don't think it was like a crazy
step change for tons of stuff that we're
doing. But TBD I think like the fun part
about stuff we'll like jump into which
is strong belief that every model needs
its own custom things that you add to
it. I know like anthropic release is a
nice skill uh that you can like easily
convert prompts and stuff but we're in
the middle of that process for like the
agents that we're going to use it for.
So it's a good model not a crazy step
change but we'll we'll fit it. We'll
we'll make it good.
>> I mean it is interesting in a way that I
have been seeing a lot of mixed opinions
right now. People have pretty much mixed
opinions on 4.7. Basically what they
have doing it with um the kota users as
well. I mean in just three four prompts
you are running out of I mean there's a
lot of good story I mean interesting
story behind but but yeah I mean the
kind of piece about these models being
coming up be open air or anthropic
anthropic specifically how they have
been doing good at public perception and
effective marketing as I say I mean
working well working working I mean it's
been rewarding for them
>> I mean they're great they're great they
they put out like great models obviously
they put out great products around the
models I think there's definitely some
stuff where
people are playing a lot more with the
models and like they're basically like
picking use cases they use models for.
So it's like everyone uses cloud code
like everyone uses codecs and that sort
of stuff. But like when you build like
your agents on top of those models, it's
like I need to actually care about the
prompts. I need to care about the
context engineering. I need to like care
about the tool design. And I think like
that's where it's really cool to like us
putting out content like other like
really cool people putting out content
which is like like how do I make a model
good at like my task basically because
at the end like my customers that's all
they care about that's all I care about
and I think like that's like a bunch of
the harnessge journey basically whether
you call context whether you call like
agent edge it's basically like fit some
sort of system around this model to make
it like sit at my task and that's like
what we're all trying to do and like
anthropic is trying to help us with
that. Open models are trying to help us
with that as well.
>> Totally makes sense. Um let's dive in um
about your journey. So you went for a
PhD in CS at Temple and I mean worth to
mention you did your bachelor's,
masters, PhD everything at Temple and
this has been a talk of the town as well
in past years on Twitter. People were
talking about it. People have again I
mean some opinions about Temple being a
university, good university or not. So
my question is to being a scientist I
mean doing a PhD PhD then to being a
scientist at AWS to running your own
startup on agents or visual
understanding to leading open source
agents at Langen. How has your journey
been like?
>> Happy to dive in. Um cool cool I'm so
I'm from around this area. So I'm from
like east coast uh Jersey like
Philadelphia area. I went to school at
Temple. So I did my undergrad there did
my masters there like my PhD there. So
like super early I was like I'm just
going to be a doctor like most kids
pressured by their parents like I'm
going to be a great doctor like quickly
realized like I don't really want to do
that most of my undergrad. So I do my
underground in math and math is like
really cool. I think there's a lot of
concepts in math that like translate
really well to CS and like physics and
things like that sort of like systems
thinking.
>> Math is also like at least for me maybe
I'm just not amazing at it. It's
incredibly hard. So like doing something
really hard does prepare you for other
things.
Yeah, dude. Undergrad was like really
fun. I enjoyed math. I got into like
some CS stuff. I think like late 2010s
was when there was a lot of cool stuff
in different parts of ML. So like I got
into computer vision stuff, like
undergrad research. And like I love
vision. So like I think vision is still
one of the coolest things out there.
There's like way less research done on
vision even today relative to text. Like
>> OCR is pretty important, right? OCR is
like now okay just just send just send
the PDF to Claude basically and like
obviously a bunch of systems engineering
around that but yeah man like I I loved
vision I still love vision vision was
really cool so like I did undergrad in
that did like research around that and
then I just went straight into like
masters in PhD like right after I
graduated like early 2020s and then yeah
my PhD was basically all around like
vision focused representation learning
so yeah I can talk a little bit about
that. So the first like topics that I
was working on was like graph neural
networks which are like I don't know how
hot those are anymore but I do see like
some really cool people still doing
research around those. Um basically like
graph representation learning but it's
like graph representation learning for
like vision basically. So it's like if I
like decompose an image into like
particular objects and like I make a
graph of that and then I do like
representation learning do we get like a
better end vector for like retrieval
like classification and then like we did
this at also like the data set level as
well. So like what if I have like kind
of like few shot examples. It's called
like transductive learning like use
other information in the data set to
help you classify the next thing. Dude,
that was really cool. Like I think
graphs I'm like bearish on graphs
overall actually. So maybe hot take but
like that was a really cool part of
research and like that was my first like
dabbling into like computer vision stuff
like undergrad then my first like PhD
topic which like it shifted a little bit
after like the chat PT moment like tons
of research became around okay like
let's do VLMs for everything and let's
do like representation learning on the
VLMs like what are VLMs like actually
seeing when they're doing their like
attention mechanism over images. So
yeah, dude, it was great. It was great.
I like really enjoyed my time in PhD. I
think it's like you get some sort of
unbounded time with your adviser to just
pick an interesting problem and just
like rabbit hole in it. So I did like
retrieval stuff like representation
learning stuff. Yeah, dude. It was
great. I enjoyed it.
>> Awesome. Um, so I had a chat with
Tensorcut the other day. He started
Paradigma. He dropped out of PhD. So my
question to you is what do you really
think about the scenario right now the
linkage between academia and the
industry and right now if you have been
like if someone is going for PhD or
something like that. So what do you
really think about is is is that is this
worth it or how far we have come is
still necessary to go for a PhD to I
mean it is again very opinionated um
question but still I mean I want to
really understand your
>> yeah absolutely so like it's a great
question like people ask me this
question like locally like my friends or
like younger brothers and stuff.
>> Yeah. So like maybe my PhD was like
slightly different because I was doing
research at Temple but I was also doing
research and like working on like prod
projects when I was at AWS and those are
happening at the same time and I like
strongly believe that that is like a
fantastic mix for anyone who wants to do
like research but then sort of
understand maybe like how their research
is going to be applied in like some
settings. And I think today like the
point basically of a PhD to me is like
you pick a topic that you're like really
deeply interested in and you like poke
around the edges of that topic to try to
figure out like how we can make like
this thing better. And like that doesn't
like really require a degree to do that.
There's tons of like sick researchers on
X who just like post like random blogs
and like they don't have a PhD. they
probably don't maybe don't have CS
background but there's like you just
pick a topic you like rabbit hole it
you just like push the boundary of
what's possible and you do that like in
a verifiable way so you like write code
do experiments you try to share like
open research and if you're able to find
a company that allows you to do that
like lang's fantastic at that like I
think they really cultivate like hey
like we're going to like pick this topic
we're just going to like figure out how
it works and we're going to like publish
content about it basically
>> I would say that's great I think it it
kind of depends like if you find a great
company, a good great founder that you
vibe with that lets you do both.
Industry is like amazing and like
especially AI research like it's super
helpful across a lot of companies. You
can probably make a lot of money and
like do interesting research at the same
time. So yeah, kind of like a
non-answer, but if you do find that
scenario amazing if you just want to
like grind on like some sort of topic
and PhD for like a bunch of years, also
great. I actually don't think you can go
wrong like just by being curious and
just exploring it.
>> Yep. I can see you have uh you you were
like working on your startup about
visual understanding agents. So I want
to understand your learnings there and
how do you see the vision space right
now like how can you correlate between
uh the time when you started and the
time we have come so far with the
current frontier state-of-the-art
research and products building. Yeah,
dude. Um, yeah. So, like I started that
startup after I graduated like my PhD.
So, that was sort of like mid last year
with a friend. And basically like the
main thing that we were working on like
starts was called Agentify. And like the
main idea was basically that basically
vision compared to text like really lags
behind in frontier models for like
things like visual reasoning but also
things like perception just generally.
So there's like tons of things where
you'll like show an image or like an
object like o two overlapping boxes to
the model, right? And it's like it
doesn't like fully understand that those
two things like overlapping and like
part of this is just a perception
problem in the visual encoder where it's
like some of these like fine grain
details, it's just not able to
understand them with like the native
training that it has. But that I think
is like a fantastic opportunity because
it's like how much of that gets absorbed
into the vision encoder backbone versus
like how much do we augment models with
like tool calling behavior that they're
exceptional at and actually use that as
the mechanism to like take vision
capabilities and like put them into the
models. Like that's basically the whole
like idea that we were working on. So
like research and like product around
that which is like what if I just took
all of the classic vision models that we
already have and like a lot of this was
honestly inspired by Meta's work on SAM.
So I think like SAM and that whole
series is like incredible like SAM 123.
It also supports like video segmentation
which is like insane and you can also
like fine-tune it. You can do like meds
SAM and things like that. So it's
basically like BET was okay models are
amazing. They're getting very smart, but
like their vision capabilities are
lagging behind. But we can augment them
with tools and like you can basically
like do the right tool selection in the
moment to like get that capability. Like
segmentation is something that it was in
Gemini Flash across the Gemini series,
but like compare that to like SAM,
right? Like SAM was like way better. If
you just use like Sam as a tool compared
to like the native segmentation Gemini,
you would be just like way happier. and
like all you had to really do was like
point to the right spot which is like
way easier than doing like semantic
segmentation. So that was the idea. I
still think that that is true in vision
today. Like even with like Opus 4.7's
new benches, it's still not as good at
visual perception as like we need it to
be. So I still think tool use is like
really really exciting for yeah just for
like agentic systems like visual
basically making a bunch of like vision
specific tools for your task and like
augmenting uh yeah augmenting your agent
with that.
>> I think there is a lot of scope to do
alongside UI bench as well. I mean again
uh it's more about one's taste but uh
but there are lots of ifs and buts lot
of nuances where you really need to take
care of like even if you're cloning a
website I mean there's lot of sc uh
scope to play around something so my
next question is about your work at
Loheed Martin. So you you you interned
there. I think that was your first um
job and honestly a lot of what people
see about world is kind of sophisticated
reals on social media about American
weaponry. So what was the reality like
from the inside? What what what you were
working on? How does it feel like to
work at some defense um kind of defense
company and what experience lead?
>> That is like such a throwback. So that
was like my first internship at like
tech ever. So, I was like a bio intern
in undergrad and I was like looking for
internships and I gave my resume and I
got an internship at like Loy Martin
which is amazing because like I don't
know how good my bio resume was for
getting like any internships. Yeah, man.
I wish I say like tons of stuff I did on
>> What do you mean by bio resume? It was
like like you were working on some bio
>> Yeah. So like I went to undergrad as
like a biochem major because like I
wanted to be like a doctor.
>> Amazing.
>> Yeah. So like then like after freshman
year I applied to like internships cuz I
I switched I wanted to do tech after
that or like at least explore it with
like a bio resume and they were like
dude like what like what are what are we
doing here? And then like I think I
basically just like talked like to the
hiring manager and just said like hey
I'm like really down to like learn this
thing like which is like data science
like that time there bunch of these like
data science courses and things coming
out so it was still like early and I was
like hey like I took these like Python
classes and like I'm super down to learn
this. And basically it was like yeah I
mean it sounds great.
I ended up working on the data science
team there and it was basically like my
first introduction into like kind of
like data analysis sort of stuff. So
like understanding like it was much like
stats basically. So like I wouldn't say
it was like ML but it was like this is
like intro to like making plots like
slice this data this way. So it was a
bunch of just like empathy for like very
very messy data as like my first
internship which is actually like very
valuable today just like insane amounts
of data which is like does not look very
clean and yeah man I wish I say more it
was basically like a great learning
experience because I was kind of
learning how to code and like doing like
data science stuff and then it was also
like a decent confidence boost because
I'm like okay maybe I can do like tech
stuff and yeah I interned there and it
was like fun and then yeah I didn't
really go back after that but I started
getting into more like research stuff at
school.
>> Awesome. Um, also recently I was just
kind of exploring the timeline. I see
Mike Mill who is a pretty famous, you
know, internet celebrity was looking for
an AI guy and you came up through
Temple. Apparently Mike was surprised
how many Temple people are in AI and so
did you end up connecting with him? Did
you share anything about Langchen and
stuff?
>> So Meek Mill is like he's like a rapper
from from Philadelphia and like I guess
he lives around Temple like that's where
he was from and I think everyone was
like when they saw that tweet they were
like Meek Mills get into AI so okay let
me just like reply basically because I
think like honestly like randomly
posting on Twitter X is like awesome.
You can meet so many cool people like
that and I we'll talk about this but I
met like Harrison the founder of like
CEO and the CEO of like W
And yeah, he did not reply to me. I hope
his like startup is doing sick, whatever
he's whatever he's doing. But like I'll
like repeat it if he does need someone
for help with like AI. I'm actually like
seven blocks down. So I could totally
like just pull up and help him. So no, I
think that's a good lesson though is
just like randomly posting maybe like
I'll just keep doing that and then maybe
something will happen.
>> Yep. Awesome.
So I mean the next question to you is so
when did you join Langchain and uh what
actually pulled you there specifically?
So and since you joined what actually
has
>> So this is like this is so much fun. Um
I was working on my startup like after I
finished my PhD that didn't work out
like we basically stopped around the
fall. At the same time, I was basically
like doing my first foray into just like
posting like random stuff on Twitter
just like my thoughts like basically
just like open source stuff like hacking
on random stuff and
from a bunch of the stuff I was posting
around like so like last year I also
like sort of believe that like we have
amazing models but like because we did a
bunch of stuff in this like visual
understanding space with like agents and
stuff. I was like very very confident
that models need like some stuff around
them to like help them do these tasks
because like they just suck at them out
of the box and like we basically saw
this every day. So that's basically when
a lot of maybe the ideas that were
brewing around harnessge like started to
maybe get more like crystallized and I
just started like posting about that
online. It's like, hey, like this is
maybe like what harnesses look like.
Like harnesses are like supposed to like
wrap models and like if we're trying to
do like vertical tasks. It like really
helps to have some sort of like
opinionated like prompts, context
engineering, like tool call structure
like all this sort of stuff. And I think
I just like started DMing Harrison like
the CEO from that which is like super
sick. He is also always thinking about
like the frontier of like AI systems
which is awesome. And then we started
chatting maybe like late last year just
like yeah like what would it look like
to build open-source infrastructure
around like agent engineering and like
maybe the best way to facilitate that is
by helping people build good harnesses
like whatever good means like let's
discover like what good means and make
open source software about that. So, it
was basically like, okay, that sounds
sick. And then I was like, I don't
exactly know what I'm going to do. Like,
maybe I'll continue like working on the
startup or like, but I would love to
join something that like really aligns.
So, then I started working with like
their open source team late like last
year on what ended up becoming like what
was deep agents, but ended up becoming
like a lot bigger. Um, so yeah, we were
working on like the very very early
versions of like deep agents last year,
which is like one of our libraries at
Langchain that we that we have. It's
like our library to help people build
harnesses. Um, or at least it's one of
the ways that people can build harnesses
using using Wangchain. And yeah, I loved
it. I love the team. Uh, amazing people
doing open source. And then I decided to
join like full-time in in December.
>> Amazing. Um, and and I mean, the
adoption is just crazy, dude. I mean, so
I want to understand about the growth
here. So, so again I mean right now
Twitter is full of people flaming
millions in ARR every month and but like
a feels like one of the most you know
stressed metrics out there. So my
question is how has lang approached
growth in real terms be it opensource be
it community adoption be it enterprise
or
>> yeah dude it's a great question. So I I
think about this a bunch because like I
think the best way to maybe think about
it is like basically like work backwards
from you want to like help people build
stuff using like the tools you're you're
putting out there, right? And like the
goal is basically just like help people
build like really cool things and like
make that process of building as easy as
possible. I think in like open source
that comes through like very clearly
because in open source I think you get a
lot of like empathy for the end user
because they're like directly using your
product like all the code is like fully
visible like go inspect it also like put
your opinions in like our GitHub issues
and tell us like what's good what's bad
like what should we fix like what should
we add also like it's totally cool to
like disagree in open source because
like the maintainers sort of have
limited bandwidth to address like all of
the things, but we want to make sure
that the most impactful things that are
going to help like the most users build
like the coolest stuff like we like
prioritize those. So, I think there's a
there's a big part of growth which is
why I like really like X um and like
these direct feedback channels or like
Slack for example or just like messaging
builders and customers because you
basically get to see exactly what
they're doing. you build like a lot of
empathy for shoot like this thing that
we built like it's a little broken in
this way or like it doesn't exactly like
fit the use case and then you hear a
bunch of those stories and you sort of
like work backwards to say okay like we
need to improve like this part of our
library or like we need to like make it
possible for others to improve our
library as well. That's like an amazing
part of open source that we get tons of
like amazing feedback, tons of like user
contributions which is great because you
sort of like grow with your community
and I think like that's a really big
part of open source and related to that
which I really really like about
Langchain like one of the reasons why I
joined and like I really enjoy working
here is there's a lot of like learnings
that we get from all the research that I
do in like open source and like putting
stuff out there and getting feedback
that slowly like make their way into our
products as well because it's like for
example a lot of stuff in like Lang
Smith for example which is like okay
like how do you build good evals like
how do you how do you actually enable
agents and users to build like really
good evals like how do you like
understand what's happening in traces
like mind signals from
>> like a lot of that we put out just in
the open like I did a bunch of blogs on
that stuff there's other people who are
like hacking on that stuff as well and a
lot of the stuff in open source you sort
of see how the community interacts with
it. You also just see the raw numbers
and you put it out there and it's like
hey like I would love this or like I'm
using this and it's like oh we should
make that as easy as possible. Put it
into a product and like if people love
the product then like the rest of it
sort of takes care of itself. It's like
yes you will make money you know your
customers will be really happy and then
like just continue the loop like just
keep making it better basically. So I
think like yeah dude customer feedback
is amazing like community feedback is
amazing. So it's like a really really
big part of I think lang chain a really
big part of like a a lot of the open
source stuff that we do
>> I can imagine of course and more
specifically here so you are leading the
open source egen and harnesses work
right now so what does a typical um week
looks like for you it's more about
research engineering or product
>> yeah dude whatever
>> I think the fun part is like it's it is
actually like a mix of a ton of stuff
and I like really really like that so
it's like the goal is bas basically pick
the most important thing to work on at
this time and then like we'll like we'll
chat about it maybe over the weekend or
like the week before like Harrison
jumped in with with us like we'll DM and
let's just like sprint towards that and
build it basically and like maybe what
that looks like lately
like lately like a ton of my work has
been on like eval continual learning
essentially like methods for using like
evals and continual learning to make
like agents and like their harness
better. So that's like basically like
the research direction and I would say
maybe like 50% of the week goes into
okay let's like pick a research
hypothesis let's like figure out what
the experiment design around that might
be. Like for example, last week we were
doing a bunch on can you like just in
time generate evals uh like for any
given task like what does that look
like? Like are you overfitting to them
and like what is your like fitting
algorithm? There's like tons of stuff
that we put out. There's like a lot of
good content on like harness hill
climbing basically. But yeah,
essentially it's like research. Let's
pick that task. Um kind of like a PhD.
We're going to make a hypothesis. We're
going to like run the experiments on it.
We're going to get like get metrics and
we're going to post them on Slack and
we're going to like review them and like
argue our takes about them essentially.
Yeah. Then the other maybe bunch of
percentage like 50% is like talking to
customers like talking to people like on
Twitter getting a bunch of feedback from
them on like the open source stuff like
how can we improve our libraries whether
that's like lang chain lang graph like
deep agents anything in like lang and
then a bunch of that is talking with
like product teams as well. So there's
like tons of great teams at Lang Chain
that do a bunch of good work on like all
the products that we have. So there's
tons of learnings that I think come from
open source that we can like port back
into the products that we're going to
build and yeah just keeping that
feedback loop is good. So I would say
like it's a mix bunch of like research
and then engineering stuff and then a
bunch of like I don't know like what the
term today is but like devril like
devril devx which is just like if
someone asks a question on Twitter like
we should respond to them and we should
like put our ideas out there and we
should like be willing to engage with
other people's ideas and yeah just hear
what people are saying. So it's like a
mix yeah it's a mix of those things.
what percentage of your article source
like article is coming from this
research source I can imagine a certain
percentage but because dude I mean I
mean let's just come to harnesses like
what this what is all about the load
behind harnesses right you know
>> so you mentioned that the definition of
agent is basically model plus harness
right
>> so I mean this is something like I mean
it is being in like people know this
from quite some time like this is this
is a fact but I think this is the
cleanest framing anyone any anyone have
seen at least on Twitter. So if you're
not the model, you are a harness, right?
And and a harness is every piece of
code, configuration or execution logic
that isn't the model itself.
>> So can you walk me through how you
arrive at the definition?
>> Yeah. Yeah. Yeah, dude. I think like it
is it is definitely like a cleanish sort
of specification of like what is this
thing that we're talking about and I
think like maybe the definition doesn't
really matter like as much like what the
exact equation is but like there is one
thing that's helpful which is like when
you're communicating with someone about
like how we're going to make this agent
better we need like some shared language
so we can talk about like what is the
thing that we're going to optimize
basically right so it's like
like working backward from model
capabilities because like that's sort of
the thing that we need to wrap
intelligence like wrap systems around to
like amplify the intelligence of the
model. So it's like I basically view it
as there's some sort of computation
happening inside the LLM and like where
that's happening is over this like
context window boundary. So like all the
compute happens when I basically like
take context from like my system and I
push it over the boundary and I put it
into the context window like for the
model to do computation on and then
produce tokens basically. And like some
of those tokens correspond to like tool
calls and then I go and execute those
tool calls and like I return the context
back. And like the reason why I like
that is because like models by
themselves they're basically just like
>> token input machines and like token
generators basically. But like we need
to put a system around the model so it
can do useful things. And I really like
maybe like working backwards from what
should the agent do and like maybe even
like what does my customer want the
agent to do and then like figure out if
I just like give it like a really really
simple model like maybe like really
really simple harness. Can the agent can
the model and like the agent can the
agent basically just do that? And like
if the agent can just do that with like
a really simple harness, then that's
like amazing because then we can just
like give that to the user essentially.
Where things maybe get like more
interesting is like where like a really
simple harness just like can't do that
today. And that might just be because
like it doesn't have the right tools or
maybe like the model isn't intelligent
enough to like orchestrate those tools
in order to do that. Or maybe it's like
some of our context engineering opinions
in the harness aren't good enough and
it's like hey like you're you're putting
a bunch of like really big tool call
outputs like into the context window and
it's like confusing the model. We should
find out ways to not do that. But these
are all basically like harness level
configurations that we're doing and
they're external to the model. Like the
model is basically just like a
computation unit and it computes things
over its context window and like we need
to decide what goes into that context
window so it can do like useful work for
us.
>> If I have to ask you some like three uh
three bullet points what really makes a
good hardness according to you what are
they?
>> Yeah. So there's a bunch, but if if I
had to pick like three right now, I
would say
basically prompting and like very very
clear instructions
for better or worse. Like there was this
whole thing like prompting is dead. Like
prompting is like totally not dead. It
is like so useful, so helpful. And like
I I don't just mean like prompting in
terms of just a system prompt. Like
prompting also applies to like the tool
descriptions as well that get like
autoloaded into context. It also applies
to how well your like skills front
matter explains like how to use these
skills or like how to use like other
skills. It it also applies to like if
you have sub agents, does like the sub
agent front matter specify like when
this should be used or like how to use
it basically. So it's just like
basically prompting that encodes really
really good instructions from the user
or on behalf of the user for like how to
use this agent to do useful work. That's
like super important. I think like
prompting is honestly more important
today than it ever was before because
our like the systems we have are way
more intelligent. So we're able to guide
them towards doing useful work more
easily with good prompts. That's one. I
think the other one that we're spending
a bunch of time on right now is
basically verification. So we did like
some blogs around this on like making
coding agents better. But there's sort
of like maybe two things in
verification. like first is prompting,
second is like verification. So there's
like a built-in verification that you
might inject like into into the harness
itself. So like that can be like a hook
basically. So like before the model
tries to go and exit like force it to
like recheck the work or like make sure
>> really
>> verification is basically like if if I
give so for example if we just use like
all the terminal bench tasks, right? So
like terminal bench task comes with like
an environment. It comes with like a
task and then it comes with like a
verifier that will run after the agent
thinks it's done, right? But like
obviously we can't use that verifier
information. So like what the agent
needs to do is like it needs to like
self-verify its work before that
verifier runs to like be like very very
sure that the code that it developed
solves the task that we're that we're
like trying to solve. Maybe there's two
parts of that. One part is we need to
like teach agents what the useful
primitives are for verifying their work.
I think like one immediate one if like
anyone uses like the claude model or
like even like GPT 5.4 is like agents
are very susceptible towards like
picking the easy way out in verification
which is like they test like trivial
cases or like not not like very
difficult cases. Obviously, that fails
in the verifier because it's just like,
hey, like I checked like these three
cases are really easy, so like I'm good
essentially and like that's bad. Like we
should teach agents to be much more
thorough when they're like generating
verification for themselves. That's like
one part of it. The other part of it is
like like this is all code. So like we
have in our repos tons of like unit
tests and like tons of like evals that
we already use. Like that is great
context that we should give to the
agent. so that it can like run that eval
suite and that might be run with a hook
for example like I don't want like maybe
the agent won't run it by itself but
like when it tries to exit that should
just maybe run my eval suite or a subset
of it and it should inject the context
or like the results back to the agent so
the agent can see like what failed like
what what passed basically because like
we need some sort of signal to give back
to the agent so we can like fix the
thing that it generated so it's like
self-verify or like use external signals
from like existing evals so you can like
fix the things that are going wrong. And
I think that's like a really really big
part of it. And like maybe the last part
that we're focusing a ton on is
high level. It's kind of like
orchestration basically but for doing
things that are more long horizon
basically like it's problem
decomposition and like making sure that
like when we use like sub agents to do
problem decomposition like two things
are true. So one is we're picking the
right model like agent for the job
because like every model is like good at
different things and also that um this
is a lot of context engineering. We're
basically like bounding the sub problem
that the agent needs to do in like a
decent enough window that it can like
manage it. Basically what I mean by that
is um I wanted to like do things in like
a 50k to like a 150k token range roughly
or like 200k. sort of it depends on the
model but like I don't want to give a
subtask to like a sub agent if it's if
it's so big that it's like okay it's
going to start getting into like really
really high context zones like dumb zone
which like Dex calls it um from human
layer which I love and yeah so it's like
efficiently being able to take a problem
decompose it and then use like sub
agents as like compute sources to like
do those problems and like filter stuff
back to the main agent and like some of
it is just good model choice like for
example like we find that maybe the GPT
series like 5.4 for is exceptional at
like planning uh which is amazing and
like Gemini like I find is like really
really good at like multimodal stuff and
so actually so is they all are but like
Gemini is like really good at it and
like Flash is actually amazing bang for
a buck for like speed cost and
multimodal stuff like a lot of this is
just informed by like dog fooding and
evals like hey like we need to like test
these models and figure out what are
they good at so yeah I think I think
those are the three maybe roughly and
there's like way more obviously so it's
like like prompting
like systems around like verification
like self-improvement uh like via traces
or like via evals and then the last
thing is like kind of like orchestration
but maybe it's like context engineering
around problem decomposition
>> makes sense um you just mentioned about
uh 5.4 for for uh planning. So uh so uh
pretty much I think it uh not just a
black box but it is kind of a reasoning
sandwich where where I mean you
mentioned as well x high for planning
high for execution x high for
verification um like running only at x
high scored 53.9%
due to timeouts versus 63.6% at high. So
I mean that's counterative right? I mean
does more reasoning made it worse?
>> Yeah. So I think I think this is
basically touching on like the point
that I think about a bunch which is like
we need to like what we try to do is
basically like we're trying to design
like an agent system around like a task
that we need to solve right and like
that task has maybe like a bunch of
constraints like I think the one you're
talking about is maybe like the the some
of the terminal bench work that we were
doing and just trying to publish. So
yeah like for that use case we we had
like an artificial constraint which was
like we have a like a timebounded run
essentially like after this amount of
time like the sandbox just like exits
and like the run doesn't get scored or
like the run gets scored like wherever
we left the state of the sandbox and
yeah so I think maybe the takeaway from
that is less that like maybe like x high
reasoning all the way through like
wouldn't have been better. It actually
like does a great job. It just takes
like a really long time. So then it like
runs out of time to like complete the
task. But also it's like not compute
efficient and it's not like cost
efficient. Like it's awesome to like run
X high at everything all the time and
spend a bunch of token on like every
single problem. Like practically
speaking um you have to pay for the
tokens and like also like practically
speaking from like a user experience
like am I just going to wait for GPT 5.4
afford to just like think super hard all
the time or like can I use a smaller
model or like a cheaper model that I
like write really good instructions for
and it can just go do that task like
immediately then my user just like sort
of gets like a more you know like
latency reduced interaction. So it's
like yeah I think main takeaway is like
XH high actually for me is amazing and I
do a bunch of like planning in X high
when I'm like just coding but because
like when I'm in the loop I want like
feedback because like it's annoying if
I'm just like staring at a blank screen.
I use like high for a bunch of like in
the loop coding. So like X high planning
and then like high for execution. So but
yeah it just depends. It like totally
depends on like the work that we're
doing. I think that's like the main
thread that I think about.
>> Awesome. Okay. I mean yeah that makes
sense. saw and and I have seen that
people are using people are preferring
5.4 xi codeex over opus 4.6 six I mean
now seven has like mixed opinions I mean
anyways um so uh again like you said
about what about hardnesses and
everything and there was a potential a
lot of news about file system as well
like I can't give a count the number of
blogs I have number of Twitter articles
I have read about file system right and
even like in your anatomy post you said
that the file system is arguably the
most foundational harness primitive so I
mean it's a it's It's it's a strong
claim and um and previously obsidian co
also mentioned about everything just
about file system. So why the file
system and how does it kind of make it
really influential in in this harness
design and things around agent
engineering. What other tools?
>> I mean I'm like incredibly bullish on
file systems. I think like a ton of
people internally also are and like a
ton of people across industry like very
bullish on file systems. Like one of the
early decisions in like DB agent when we
were building it last year was basically
like using the file system and that was
more because we saw like two things. one
like how useful it actually is for
context management and like two agents
are just exceptional at using file
systems already right so it's like it's
kind of two things like the model is
already very very good at using this
tool so I don't have to coersse it a
bunch to get good at like using these
sort of like patterns and like now like
with newer models is probably even like
post trainer even more on getting good
at file system stuff so that's like
amazing the the other thing that's like
really amazing about file systems or
like basically the concept of a file
system. I I'll I'll maybe like
generalize it a little bit, which is
like I need some sort of like persistent
storage that my agent can use to both
like access information and then like
offload information. And like that's
maybe the higher level primitive like a
file system ends up being like a really
really easy way to do that. But like the
primitive is like the LLM the model
basically has like this computational
boundary that I put stuff into and like
I can take stuff out of essentially,
right? And like all the comput happens
here and the decision for like where to
store stuff and like how to access it
like file systems end up being fantastic
storage primitives to do that and like
the reason why I say like the concept of
a file system is like in in like lang
chain like in our libraries we have this
concept like virtual file systems where
it's like you expose file system like
storage essentially right so like the
operations that you would do on a file
system for example like ls for example
right or like you're like grapping over
that. It depends like what your
underlying storage system is. But can
you like use existing storage like for
example like S3 for example or like
Postgress, right? And then like what
does it look like to use that as storage
and then like put it over the
computational boundary so like the agent
can like search over this stuff and like
pull it into context.
Like agents are exceptional at doing
that. And the other thing is like
context management is so important
because like the context window is like
where all the computation actually
happens that we need some mechanism of
achieving that which is like why I'm so
bullish on file systems. It's both like
and then and then actually like maybe
one more thing I'll add is
>> now that we're doing a bunch more stuff
on multi- aent orchestration and like
multi- aent like collaboration sort of
stuff. So I think I said like a little
bit about decomposing like really big
problems into like sub problems, right?
But like where should all of that work
get stored for all of like the
decomposition that the sub agents do? So
like file systems actually also become
excellent like collaborations places. So
like sub agents can like write to
particular files and like main agent can
like read from there and like it doesn't
pollute like the main agent context
window a bunch. So it becomes like a
place where you just like write files
and like files are basically excellent
scratch pads or excellent like like
planning places or excellent like
persistent storage places like an agent
needs to come back to something and this
sort of like primitive that files encode
information really well like file
systems
offer like interfaces to like external
storage that already exists and like it
really helps with context management.
Like all of those things together I
think make it really really good for for
as like a harness tool for like an
agent. And I think a lot of harnesses
like like basically I think everyone is
like settled around file systems like
like it's uh it's not like too
controversial to say like I'm going to
give my agent a file system and like
that's a part of my harness you know
like people just sort of like oh yeah
that that makes sense. It's interesting
to know right I mean this is something
so basic something so fundamental is
kind of changed the whole trajectory of
the space in like 6 months and everyone
is kind of getting adapted to this thing
and on the same note you have uh you
have also mentioned about memory via
agents.mmd and and this is something you
kind of connect with you know like
injecting and start and you also call
this continual learning so I'm very
interesting to know about why do you
think So, and like is it really or is it
more like a persistent or consistent
notepad? So, what you really think about
this could be aligned to
>> I think like a a ton of a ton of like my
work recently has been around like this
just general idea of continual learning
basically. So like h how do I help my
agents which are producing a bunch of
data over time like I'm using let's
let's just take like my personal agent
like I'm using this one agent a ton over
time
>> and it's producing a ton of data which
is like traces essentially right and
then like all those traces like I'm
storing somewhere like we store them in
length you can put all your traces in
one place and how do I update the
definition of the agent in order to
learn from all of the data that it's
producing Right. So there's like maybe
two ways to really do that. And memory
is sort of a subpiece of continual
learning. Like continual learning like
overall to me is as I'm acting in the
world and as I'm like sort of like
producing data kind of like how we
humans do. Like I'm doing stuff in the
world and I'm like learning from the
feedback that I'm getting, right? Like I
ran and I tripped and I fell when I was
a kid and like this is a great trace
stored in my brain to say like please
like don't do that. Same thing for
agents. But the way that we actually
like update the like the agent knowledge
is like really different probably
because like we don't understand exactly
how like experiential memory that humans
experience like how does like my
experiential memory as a human get
encoded into my brain like I don't
exactly know how that process works and
we need to do that process essentially
for agents and like the agents
computation boundary is just it's
context window basically. So I need to
be able to like take learnings from the
past and I need to be able to like do
two things. One is um inject them into
the context window at the appropriate
time
>> so that when that scenario comes up, it
can like use that prior information to
like fix the thing. Like for example,
maybe this comes up in like user memory
for coding, right? It's like you're
doing a bunch of like coding with your
coding agent and then like you give it
it has that trace and like maybe you
like annotate that trace with human
feedback saying like hey like the way
that you did this or like you use this
library but like we never use that
library so like please like always use
this other library right and it's like
okay like great should that piece of
feedback and like context should that
always be in like my always on memory
right is that like just in my agents.mmd
that always gets like loaded in or is
this something that gets injected like
in real time into the agent like
contextually. This is like why I'm super
interested also in like search as a way
of doing this because like we're I think
it's like almost like unfathomable the
data scale that we're going to start
producing with agents. So like agents
run like all the time non-stop. they
produce like millions of tokens like
every few minutes and like that's a ton
of information that we need to like sift
through to figure out what's useful from
that and like what's not useful from
that. So like search is like a really
really big part of distilling a bunch of
trace knowledge into like nuggets or
like memories that I can actually
retrieve that are useful because like
tons of that trace will actually be
noise. So it's sort of this process of
like distilling
great data which is like trace data but
into nuggets that I can actually like
bring into context when I need to.
That's like one. And then the other one
is like really interesting for us is
instead of just selectively and
contextually pulling the right thing
over the like the context window
boundary for like computation to happen
over it. So like context engineering
like you can also just touch the
weights. So like we like lean in a bunch
into like open models and like I love
open models. I use like GLM5 a bunch
like a ton of the team does as well. And
that's like amazing as well. That's like
continual learning by using feedback
from traces and like distilling that
into data that you can do like RL on
essentially and like making that process
a lot easier. And both are really
interesting like we're leaning into both
and I think both will happen. So it's
actually not going to be like an or like
everything will be RL or like everything
will be like context entry. you totally
need both because there's like tons of
things that you don't want to RL or like
it just doesn't make sense to like
fact-based retrieval like you can like
include that data in there but it makes
more sense to do search in order to
retrieve some of that stuff. So it's
like yeah those are maybe the
interesting bits that we're sort of
leaning into like sort of
>> you just mentioned there are tons of
things which you don't want to RL so can
you mention what kind of arenas do you
think we should go for RL or we should
not like where there is like it is
constrained by compute resources or
anything
>> I'm like super bullish on if you're like
if you're a builder or a company
producing some sort of like data in
vertical and you want to like do two
things. One, make your model way better
at that task and like basically like fit
to your data, fit to your use case, then
also like make it like way faster and
like way cheaper. Like RL is something
like definitely like worth exploring
because fine-tuning has gotten like way
easier in the last whatever year. Like
there's actually like amazing companies
that will help you fine-tune if you like
bring the data, if you massage it
properly, like you store all your data
like Langmith and you can like pull it
down to do RL over it. Um,
in terms of things that you like should
RL on or you shouldn't RL on, I think
it's really really great if you have
some sort of like vertical that you want
to like make your model like really
really good at. I think we see a lot of
companies that have started, okay, like
I'm building this like model and it's
going to be really really good at search
and I'm going to expose that as like a
sub agent to like my main agent and like
this sub agent is going to rock at that
or it's like this this model we like
fine-tune on a bunch of our like
customer service data and like it's
really really good at that use case or
like finance data for example or like
even even yesterday um like OpenAI
released Rosalind right which is like
all about bio
That's like amazing, right? And that
also like sort of it it butts heads with
this whole idea that the general purpose
everything is just going to like kind of
like subsume everything, right? It's
like I'm going to have like one general
agent that's just going to like it's
going to be so good. It's just going to
get exactly what I'm saying. It's going
like solve the task. Like maybe in the
limit that is definitely maybe going to
be true, but to like today like we have
to build for today, you know? So like
today it's super helpful actually to
take the opposite view like curate a ton
of data and like pick a niche that you
really care about or like that your
customers care about and like build the
best data for that like build the best
harness for your model around that and
just like sort of rock at that task. And
I think like RL is amazing for imbuing
sort of like vertical specific skills
into an open model and you get it like
way cheaper like way faster and like
depending on the original like training
distribution of that task in like the
frontier labs like data mixture like
you're it's very likely that your
fine-tune model will be better than that
open model or sorry than that closed
model at that task as well because like
you have the data and you like
fine-tuned it and like maybe like where
you don't want to use RL4 is like I I
honestly think it's a really good idea
just to start with harness engineering
like or like just really good context
engineering
because it's so easy actually like
relative to RL that just like pick your
model like design like a really really
simple harness around it first like for
example we have like this abstraction
and lang chain called like create agent
which is just a react loop and then you
can like build a bunch of stuff on top
of that until like you don't need to
anymore or you can use like deep agents
out of the box if you want to and Yeah,
just like go and build and do maybe
start with harness engineering and like
maybe the other point was like
there's things that like things like
factbased retrieval like fact-based
retrieval is just it's just like maybe
more of a search problem like I just
want to find the thing and I want to
inject it into my context essentially.
So it's like yeah that might be like one
example where it's like hey like you can
RL this thing and maybe RL on the domain
but like the way that you put it over
the like boundary for computation the
context window is just find it
essentially via some search mechanism
>> you mentioned about search previously
right like you will be going for search
like essentially so uh there comes this
concept of context context ro so you
site the chroma research on how models
get words on on as context fills up
maybe compaction tool call offloading
skills as you know progressive
disclosure. So which of these has the
biggest impact in practice when when it
comes to context fraud and and what are
the what are the kind of potential um
practices you specifically use to avoid
these?
>> They they all matter actually and I I
think it it sort of depends on like the
design that you're going for out of the
box, right? So I think like maybe maybe
like a good recipe essentially is that
like we start building the agent with
like a goal in mind like I want the
agent to do this thing but like really
really focus on like context rot because
after you pass like some sort of like
context threshold it gets like just like
really dumb and like like you said we
have like levers to fight against that
which is I can use like sub agents to
decompose the problem into like more
manageable chunks so I don't pollute my
main context window like that's like
amazing but like basically what what
like predicates that is that I can
actually efficiently decompose the
problem, right? So it's like maybe
that's like instructions that I give in
the system prompt to the agent of saying
like this is how you like decompose a
problem into like these tasks and like
if it's like a task specific agent then
you probably already have a bunch of
like human priors for how to go tackle
the problem. Like for example, for
coding agents, the way that we decompose
a problem is like you have agents that
do like sub agents do like codebased
search essentially and they like do that
separately and they pull in the
important information into like the main
agent to do some of that stuff and like
maybe there's like a web search agent as
well like
>> has to go like pull external information
and like find that and prepare it for
the agent. So it's like yeah basically
like working backwards from like I need
to avoid context rot like one way to do
that is like sub aents like is my
problem amendable to like sub aents like
if it is fantastic another way to do
that is like and these are often in
conjunction like we we like lang like
our docs we publish like a bunch of
stuff on like multi- aent docs as well
and skills are kind of related to that
which is skills to me they basically
kind of like encode knowledge and
workflows like skills are awesome
because everyone before skills like
hated writing good docs if that makes
sense. Like everyone was just like so
lazy
>> and they were like I'm just going to
like tell the model like some sort of
like random stuff like kind of like hand
wavy and it'll just get it. But like for
some reason like skills came out and
like because maybe skills are like
sharable and like other people like see
the skills like everyone writes like
very very good like workflow
descriptions in skills and like the
agent sort of like sees the skill
content and then it executes the
workflow and that's like amazing because
>> I basically get like a very small
snippet of like when to use this skill
and like I avoid all the context rot and
like when necessary we like pull in the
right context from the skill into there.
Like the the tricky thing with skills is
always like
basically knowing when to trigger them.
And that again comes down to like
instruction following which is like we
have some skills evals as well where
like we'll have scenarios and then we'll
sort of have like the skills that we
want to have like triggered basically
and then like we we have evals where
it's like we we only want that skill to
be triggered because like like let's say
it like triggers the wrong skill first
and then like eventually it does a bunch
of like stuff and then it figures out
like oh actually I need to do this skill
like that's bad because you wasted a
bunch of tokens essentially. So I think
eval help a bunch with context rot which
is one does the problem succeed at the
end like that's a really big part of
evals and the other one is sort of like
fine grained metrics on the evals which
is like how long did it take like how
many tokens did it take what was the
overall cost right and then like reading
the trajectory and then seeing like in
my effort to reduce like context rot by
doing like sub agent routing or like
triggering the right skills is that
working and then there's like maybe also
like determine ministic stuff which is
good. So like tool call offloading. So
like this happens a bunch with like bash
calls like you have you you you run the
shell and uh it's just like a mess. So
you get this gigantic like tool like
this output string and you can just pipe
that into context or you can just take
like the head and the tail and you pipe
that into context because that's usually
the important bits and then you tell the
model that like the rest of this string
lives in this file over here if you can
if you want to access it and then you
can go and do that. So, it's basically
like doing a bunch of stuff on the
model's behalf to really protect that
like incredibly precious artifact, which
is our context window. And like we I we
just think like really hard about like
if something doesn't need to go in here,
like really like don't put it in there.
But if something does need to go in
here, like do our very best to like
spend compute on like search or like
really good instructions to make sure it
gets in there.
>> Makes sense.
Interesting. I mean that that actually
makes a lot of sense. Um I'm curious
about so for people who may who may not
know the space well. So there's been
like open claw boom. I mean I I just saw
on Twitter it is kind of declining as
well. So Hermes has been getting as much
attention as open claw right which which
is coming out of news research. So how
does
deep agents differ from both of them? It
would be useful to explain this from
first principles for both technical and
nontechnical listeners since we are
going to spend a lot of time talk about
hardness in this conversation.
>> Both amazingly sick projects like
openclaw amazing like what Peter did
there and then also like what the new
guys are doing with Hermes is like so
cool. Yeah. I think like the main way
that I think about it is like you have
this like claw architecture right looks
like a little bit different from like
claw to claw but like overarching
architecture of like I deploy this
somewhere there's some sort of like
messaging
>> it's like live talk to it back and forth
there's like a heartbeat that triggers
like over and over again that has like
some sort of like memory primitives in
there so it's like it's basically like a
very opinionated
harness for the use case that is like my
personal agent. So I think like a claw
is like really the first it's the first
like really mainstream personal agent
like maybe like besides chatbt like
chachi is like it didn't like really
feel like a personal agent like it had
like memory and stuff like people like
message their claws like all day like
maybe people do that with chatbt too but
it's like the architecture of the
harness behind like the claw that makes
it like feel really personal because of
all the things they put behind it around
like the integrations like like what's
happening and telegram and those types
of things and like the memory that gets
updated. Like the big thing is honestly
like I like the heartbeat thing a lot. I
think like that doesn't get enough hype.
It's like very ingenious to like wake it
up on some cadence to like do things for
example crrons and things like that. So
the way I think about it like high level
is a claw is an amazing choice of an
opinionated harness for like a personal
agent essentially. And that's like an
awesome choice that like they make. And
I think we should have like a lot more
of these like people should like build
their own or like people should use them
more and see if they like them. And then
maybe like going back to like the
primitives, I think like you can build
tons of agents that are not claws that
like completely solve like your task
like really really well. And that's
basically like how I view like maybe
like Langchain's create agent or like
deep agents or like all of the other
great companies that are like building
harness primitives which is your
probably your task does not require like
a claw like most most likely like it's
awesome like you should have a claw in
your life but like if you're doing
something else like you don't need a
claw. So like actually what you need is
amazing instructions, amazing context
engineering, like amazing choice of like
what models you're going to use to hit
like the paro frontier of like Perf cost
and latency and like you can start from
like a simple harness and you can
assemble a harness around like that
model or like models to like build that
thing essentially. And I think like claw
is like one instantiation of an
opinionated harness for like personal
agents basically and it's like awesome.
And I know like people use claws for
like other things as well. So like I
think claws if you like edit the harness
around them like the base harness and
you like make them I don't know if you
like change them for like another task
of research that's like awesome too. But
I think like that whole process of like
taking a task and like you have a
harness that like wraps a model or like
models and like you sort of like direct
it towards a goal. That's basically I
think the goal of like laying chains
like create agent and like deep agents
which is we have like some opinions in
there to get you started but like really
we want to help you build the best agent
like for your tasks. Like that might be
us giving you all the tooling. That
might be like me and like the rest of
the team like blogging about like actual
use cases and like sharing our evals and
like just publishing results. But yeah,
basically like customize a base harness
to make it like really really good at a
task and like a claw is a like
phenomenal example of
>> makes sense.
What what you really see the future of
it? I mean I mean let's say down the
line what the next um what what it would
look like after let's say five
iterations of it or what you really see
the future of in in a year or so let's
say
>> I mean honestly in a year and six months
like everything's going to change
obviously no I'm just kidding like it's
it's hard to say like I'm like in the
short term very very bullish on
basically helping people build like open
or like us providing open infrastructure
to help other people build agents that
are like amazing for their task. And I
think that is not going away in the near
to medium term at all. In fact, I think
it's going to go in the complete
opposite direction, which is like
everyone is going to start basically
taking their tasks and they're going to
either do like harness engineering
around those tasks, which is largely
like very good context engineering, like
very good like prompts and very good
tools and like very good skills. They're
going to do all of that around like some
sort of task they care about basically.
And I think open harness engineering is
a big part of that. And I think like
open models are also like a really big
part of that which is we're going to see
like a big growth of I'm going to take
like Kimmy, I'm going to take like GLM5
and I have this data and like a big
future is I'm just going to like
fine-tune that model on my data and I'm
going to make it really good. I'm going
to just keep doing that over and over
again and I'm going to compare how that
does to like a Frontier model and I'm
going to make the trade-off between like
is it better, is it like just as good,
what's like the cost, what's like the
latency tradeoff and then like maybe
like a little bit more like longish term
from that which is like it would be
awesome if we got like some sort of like
AGI model that just did everything. I
would love that. Like so then I can like
totally stop talking about like
harnesses and like evals and I can just
like enjoy the model. But it still
really does help to specify like the
intelligence that we want that model to
like act on in a particular situation.
And like I still think even in like the
medium to long term, it's going to be
super helpful for humans to get really
really good at both describing the thing
that they want and not just like hand
wavy like writing like kind of how we
like write really detailed prompts like
getting comfortable with taking the
thing I want and like putting it into
like language basically. And then the
other thing is we're still going to want
to like verify the work that agents are
doing in in some way. I hope like
autonomous verification systems get a
lot better, but like they're not going
to be perfect and we're still going to
want to be able to say like when an
agent is doing good versus like when an
agent is doing bad and that can become
like part of the feedback signal and I
still think that's going to exist like
for for a little bit and like that's
that's like not a bad thing. That's like
totally fine like we can still work on
that.
>> Makes sense. Um you know dude there are
a lot of I mean bunch of companies I
mean interestingly everyone who is
working on frontier are coming out of
their own harness own their own agent.
So so recently RAM basically built their
own harness right. So have you seen what
they put out? So and my other question
is like you yourself um have used open
code. So it seems like um enterprises
building custom harnesses puts real
pressure on competitors. So I'm guessing
a um release like that forces companies
like SLA which also got recently big fun
who competed to RAM and others to build
their own thing too. So what do you make
of this trend?
>> Yeah, I mean like the ramp is amazing
obviously like they put out such fire
blogs. I like ramp lab stuff. I think
like the the overall trend of like
building your own harness or like or
basically like building your own agent
that's custom for your task is like
fantastic. Like I I think more teams
should basically devote time towards
like investing maybe like in the process
of both like helping their teams build
agents, right? That doesn't just mean
like coders. That means like everyone
like the people who are doing go to
market like marketing, sales, all those
people can like benefit in some way from
agents. They just need like help doing
that basically. And I think it's great
that a company basically like picks a
problem and they're like we're going to
solve that by building the best harness
and that means like the best context
edge like the best verification the best
tool stack also a big part of it and
like we work on a lot of the stuff at
like lang is building the correct or
building like really easy to use systems
for taking the trace data and then like
improving the agent because I think
there's a lot of stuff around
improvement loops which is our first
pass at the agent isn't amazing. So like
this comp these companies are like okay
I'm going to pick a task I care about
and like my first version is going to be
like kind of mid totally fine but then
I'm going to get the data from somewhere
and I'm going to like make it better
over time by just like spending a ton of
time on it or like maybe spending a
bunch of like compute on it to like
understand the data and like improve the
prompts like fix the edge cases like
improve all the errors right and it's
just like I still think like we we will
have tons of vertical companies because
today like someone has to do the work
like someone has to like invest in doing
that like someone has to do like sales
around that right it's not just going to
like happen by itself and I think like
more tooling around that and like more
yeah just more like research that helps
people do that that's like a good thing
like doing the open is like an even
better thing so it's like more
>> I think it's also very ambitious to do
you know you you're you're already at
Frontier and why to depend on someone
like let I mean if you want to be at
frontier you have to build something
like what what other people at are
working on. Interesting. So, um beforeh
going to the other segment of the
podcast, let's have some quickfire
chats. Uh so, so there is a meme you
liked from Mintly Fly Slack. Would be
awesome if you can share screen and
share that. Yeah, please. Then I'll go
ahead.
>> Let me
let me get that off. Um
I love this guy. Let me share. Dude,
this guy is so funny. A dude, I love
this guy so much. Dude, this guy this I
don't know like where this came from or
like who can I turn on volume?
>> So then I mean you like this like this
from mental slack channel that
apparently went viral across I think
startups. So what is it and what it made
it so hard? What's
>> so I have I have no idea. I think it's
Nick. It's Nick from Mintify like
tweeted it one day who's funny like I
was just like this is amazing. So I I
just sent it to all my friends um just
like randomly. I think it was our like
uh like soccer chat. I'm like something
happened with like Arsenal or something.
I I sent this to like my friends because
they I think they lost. Yeah. And like
now we have this in our Slack as well.
Like someone made it into like a gift
and like whenever maybe something goes
wrong like we just sort of throw this
guy like I don't know what it is or who
made it, but like I love this guy. I use
it all the time.
>> Awesome. Um my next question is um most
underrated harness feature that nobody
talks about. most underrated.
It's a good question because like I feel
like if it's underrated, we should be
talking about it.
>> Yeah, exactly.
>> Okay. Okay. Okay. I think like one thing
that we use a bunch is like this idea of
like we call it middleware but like
hooks just generally. So like for for a
lot of teams it's like super useful to
inject sort of like deterministic
actions like basically like do
deterministic code execution like
somewhere in the harness. And I think
that's like super underrated maybe
because it requires like sort of like
custom logic. It's not just like you
think of a tool and you just sort of
like add it. But yeah, I think hooks
that sort of like control bad model
behavior are like really really helpful
or like not just bad model behavior like
help the model like do things. So for
example like triggering excuse me
triggering like self-verification and I
think people should like build more
hooks to control their models. Makes
sense. Um interesting. So we have
something which people should talk
about. Interesting. The model that
surprised you most in agent workloads
this year
>> in both I mean we can go in both ways
which was like something which you were
not expecting and it comes out really
better and something which you kind of
not expecting and it was like it comes
out.
>> Yeah. So, I'm like so impressed by open
models generally as like actually ways
that I get work done. And like I think
like it's it always like feels really
good to talk about open models, but like
you sort of like love the idea of open
models, but then like you don't use
them. Like that's like that's not good.
But actually like the open models that
have come out this year are like
amazing. So like the GLM series is like
fantastic and like it is actually a good
agentic coding partner. It's like very
fast and it does amazing work. So like
maybe at the start of this year like
last year I don't think I would have
expected my GLM f like my GLM use to be
so high and there's other models too
like um like the Ry team who you had on
like they're they're amazing. Miniax is
one that we actually like eval on a
bunch and like these are all amazing.
So, like open models have surprised me.
Like I was hoping it would happen, but
it did happen and that's awesome and
like we should invest a bunch more in
that and like I hope like I hope like
teams actually like think about using
them in like their actual workloads
because they're amazing. Yeah, that
surprised me but in a in a good way. I
was like super happy and like it's only
going to get better and that's like
really really good and it's like way
cheaper and faster.
>> Awesome. Um okay, this one is lost. one
thing you would change about how the
industry builds agents right now. It can
be any common practice or something like
that.
>> How should they change? I think
basically like this whole thing that
we've been talking about right now is
like I would love if like that was like
easier for people to do or like more
people like did it which is basically
like maybe like work backwards from a
task and like a goal that you really
want and then like the whole point to me
is just like build a system like for
your team or like for yourself and like
for your agent to like make it better
over time. Like maybe like I'm saying
that because I'm like we're thinking a
lot about continual learning. So this is
like both the agent design which is like
prompts tools like the whole harness
thing like the verification loops like
all this sort of stuff and then also
it's like sort of the infrastructure
around it for doing like
self-improvement. So this is like the
unsexy stuff, but I think the stuff that
like really matters, which is okay like
are you like is tracing on? Like are you
like putting your traces somewhere
basically like are you using your traces
to like mine errors like via monitoring
basically? like lang supports that and
like we think about that a bunch which
is like trace came in like how do I
figure out if something happened and
like am I making eval from that right
and then like am I am I like reading the
evals basically so it's sort of like the
systems approach around like building an
agent and like making it better I think
teams are doing that that's amazing but
like it's awesome and I think teams
should team should try to do that
>> makes sense um also on the same note
there was this um recent paper called
meta harness and DDR also posted about
it a lot of people are working on auto
research and this field adjacent if not
a version of auto research itself then
you have also things like you know post
train bench where a hardness is used to
post train models so if those two
directions start merging so better
harnesses improving post training and
meta harness improving the hardness loop
itself I mean that feels pretty
explosive pretty interesting how do you
think about that convergence
>> I I think it's super exciting like I
love teams that are like
productionalized ing like auto research
and like doing so like we we have I
think we did like something around like
harness opt maybe like a couple months
ago and there were definitely like some
issues that I saw back then and I think
we still have like some of the issues
but like now like a lot more teams like
putting a lot more effort into it. So
it's like I think this is amazing that
and also maybe I'll just like this is my
take as well. So, like I have like we
put up a bunch of blogs and I think like
there's there's like algorithms that we
need to discover to make like agents and
harnesses better using some sort of like
grounding signal. And that's basically
like auto research is like I have a
grounding signal and I hill climb that
signal and like I update my harness and
like meta harnesses that like we have
one like better harness and like we tons
of good people have like work around
this which is amazing. And basically
like I view like eval as such an
important part of this like feedback
loop because like eval are basically how
we like ground our like auto research
loops over time and it's not just like
ground like in the moment like if I run
auto research like later like a two
weeks later I still have that same like
grounding mechanism and maybe hot take
but I think like you can almost try to
define an agent via a set of like evals
that sort of serve as not just a spec,
but a spec that you can like verify and
like ground. You can do it via fitting
to like eval.
And then you basically have a fitting
algorithm, right? And the fitting
algorithm can be like meta meta harness
or like better harness or like any of
these. And that fitting algorithm is
basically run on evals like reflect on
traces and like update harness or like
prepare data to do like RL on it. And I
think like we're in such early early
innings of this self-improvement loop
basically and I'm I'm super excited
about it. I think it's like really
really cool. There's like stuff to work
out around overfitting and stuff but
like that will happen and like people
will use this a bunch more.
>> Awesome. Makes sense. Um
>> I'm pretty much looking forward to as
we're approaching to the next um I mean
last segment of the podcast we have some
questions around environments harnesses.
pretty much harnesses we have covered
but eval and benchmarks around
benchmarks. So, so how do harness fit
into this broader idea of simulation as
a service? So there are companies whose
whole business is simulating work
categories, decisions, operating
environments. If better harnesses lead
to better simulations, so where does the
open-source side go and do you think
langen will eventually release an open
source simulation?
>> Yeah. So like I think these these two
things are super related. So like thing
that we wrote about before there's like
like evals and like environments like
they're not the same but they sort of
like rhyme with harnesses as well. So
it's like basically like the like the
main idea is like I need like some place
for my agent to do work that sort of
like reflects actual work that's going
to be doing like in the real world
basically right so it's like I'm going
to build an environment like there's
tons of like awesome environment
startups that are doing that and like
running the agent in them so it can
produce like a good feedback signal so I
can like train on basically that's like
amazing. I think like even a big part of
like evals are going to start looking
like environments because like when when
we first started trying to like eval
it was really simple. It was like chat
completions evals, right? It was like
I'm going to give you like a really
simple like input prompt and I'm going
to have like a number or like a
structured output at the end of it. I'm
just going to like map the keys, right?
I'm going to be like, "Hey, did you like
did you get them all right?" But as
agents are doing much more like
complicated work and like much more like
long horizon work actually like the
thing I want to eval is like a task
essentially and like the the best way to
maybe do that is to just like build an
environment and just like drop my agent
into the environment and like maybe like
what we do right because we actually do
this like we basically use Harbor right
and like Harbor those guys are awesome
like the the terminal bench guys
so like we'll pick our eval like it maps
to some sort of like hardware config and
then like we run the eval in the
environment that we built. Then like all
of the traces like go into Lenmith and
then we like read them and we look at we
like segment them based on like the
rubric like how much did it pass, how
much did it fail like how long did it
take and then we try to like improve the
agent and I think that process of like
building the environment and like you
asked about simulations like we think I
think about this a bunch which is like
what I really want to happen is the like
the company that we're building or like
the app or product that I'm building
like I want my agent to be able to like
test itself in that exact environment.
So it can figure out like when stuff
goes wrong essentially and then I can
like fix it, right? And like that's
basically the whole point of eval which
is like
>> they're sort of like a proxy for what
happens in production and like as I fit
to my evals I'm kind of imbuing like
behavior into the agent to make it pass.
The whole goal of evals is like to make
them pass, right? Like and like a lot of
our evals fail because like maybe the
models just aren't smart enough yet.
like eventually they will pass and like
then what I've done is like I've taken
that information from that eval and I've
sort of like transfer learned it into
like some sort of agent whether it's
like the weights or like the harness or
something and yeah I'm bullish on both
I'm bullish on like eval as a mechanism
of doing like agent improvement and also
bullish on like more eval looking like
environments basically instead of like
just like input output
>> pretty interesting you know this uh
terminal bench 2.0 Sweet bench pinch
bench. So the benchmark landscape of for
agents is growing fast but you um
explicitly say in your opinionated
agents post where I mean you said test
on real world users for your product
don't trust benchmarks your user has
never heard of terminal benchtop please
don't introduce it to them right
>> so so so so which benchmarks do you
actually trust and which ones are most
performance theater I mean I mean what
what's the general landscape where one
should actually think
>> I was definitely a bit hyperbolic like
like don't introduce anyone to Turbo
Veg. Like I love Turbo Veg. Like those
guys are awesome. But I think I think
the general point actually does stand
though, which is
like like eval to me are basically like
again they're like a mechanism of like
evals and benchmarks. They're basically
like a mechanism that like proxies
behavior that I want my agent to
actually have like via this like thing
which like roughly measures that, right?
So it's like I'm trying to measure like
long horizon like problem solving. Like
can I do that with like a really hard
terminal bench task? Like kind of maybe.
But like if my actual like app has
nothing to do with that, then like me
passing that like terminal bench task
doesn't map well into like my like long
horizon problem solving like the bio
domain, right? So it's like there's sort
of like rough proxy signals that measure
like so like in at like Langshin we have
like axes that we try to measure on. So
like every eval we tag to like an
access. So it's like retrieval, it's
like problem solving, it's like
planning, it's like tool use for example
like we we like try to tag every eval
we do that we tag every eval like one or
multiple axes, right? But I think it's
useful to use benchmarks as like a
general like vibe like a guidance. Like
you should definitely read the traces
from benchmarks. We like I like spend
tons of time every day just like reading
the traces from pre-built benchmarks.
But I think really the thing that helps
teams is
using their trace data to build evals
for themselves that actually like map
onto their customer use case that maybe
like no existing benchmark like does a
really good job of. And I think like
it's kind of like a moat if you want to
call it but it's like it's just a really
good way of like building a better agent
product which is there's awesome people
building awesome benchmarks. None of
those benchmarks map exactly onto like
what my agent needs to do. So like I can
like use those to roughly measure
problem solving ability, but like really
the best way to measure problem solving
ability is just to like get a
representative set of like evals and
tasks like my own bench and just like
use those and like that's going to vary
from like person to person like product
to product like feature to feature. So
>> makes sense.
>> Yeah. Yeah. Yeah.
>> Yeah. What's your opinion on computer
use stuff? Because this is u this is
something very subject to people like
the current approach is not good. You
can't really go in the screenshot way.
You really can't use MPI MCP or API way.
You have to bullish. You have to scale
GUI stuff. So what do you think about it
and about it scaling part of
>> dude? We were just talking about this
today actually like how much should we
do like more examples on computer use?
Like I'm like very fascinated by
computer use. I think it's like super
interesting. I think like there's maybe
like two things. One is there is still
definitely a visual perception problem
like that we like we've known for a
while like like fine grain details is
not it's not like amazing at that maybe
it's like less of a limitation now like
some of these models are like better at
computer use I don't know I don't have
like a great opinion on which way of
doing computer use is going to win like
the hybrid like pulling down like the
actual like webpage content and like
clicking versus like how much do you use
screenshots I would be like very happy
If like everything was just like just
worked with vision because that would
mean that we did we have made like a
step change in like visual perception
and like visual reasoning over
screenshots and like doing or sorry like
yeah visual reasoning over like
>> images basically.
>> Yeah, I don't know if it'll happen. Um
but I'm like for the applications of
computer use I think they're awesome and
like we should do we should do like more
stuff around those but I don't know. I
just haven't played with it as much.
>> I mean yeah awesome. Makes sense. So I
mean do you think there is some secret
sauce something which can be scaled to
scale more long horizon task about in
your experiment experience what what is
something which is blocking
uh because I think a lot of companies
are being forming up involvements for
long horizon task these days and been
selling to enterprises and frontier labs
now I mean what do you think of the
space about scaling
>> I think
>> I think like there's a lot of like good
work that a I think a lot of companies
have like good agents in like medium
horizon tasks like for example like we
like we have like a background coding
agent that can go like do things over
like hours like a few hours basically
right it's like they're all coding
related tasks like it's easy to like
pick those and like scale those and I
think yesterday there was like really
good work from like the proximal team
for like frontier suite which are like
hey like these are like 20our tasks
basically and like we're going to go and
run on
I think like one thing that is still
like really really tricky for models and
I think like in the near term what's
will happen is like hopefully like
models get like post-trained better on
this but like we will still have to
build a bunch of like harness
infrastructure around it which like
hopefully falls away is one like
decomposing a really difficult problem
into like subpieces
and then doing like verification of the
intermediate steps. I think like that is
like a really really good general
purpose recipe that we can use to like
keep doing like longer and longer
horizon tasks because like basically
like all a long horizon task really is
is like I'm going to do I'm going to get
this like really hard task. I'm just
going to do like a bunch of like little
sub pieces like over and over and over
again. And I need to make sure that like
I don't mess up any of the sub pieces or
like if I do mess up I need to like go
back and fix those
like the key thing is like figuring out
like when you messed up that's hard. So
so we we need better like
self-verification systems there that
might be like self bootstrapping like
testing for example
>> and like the other thing we need to do
is like teach systems how to like
decompose problems into like sub agents.
I think like there's really cool stuff
around RLMs around this.
They're like I still find them like a
little bit tricky to get working, but
like the ideas behind them like amazing
basically like externalize context as
like an object and then like sort of
like search over that and like decompose
problems like that for like really
really long horizon tasks. I don't know
it doesn't work amazing right now but
like the general strategy of like verify
and then like decompose like iteratively
I think that's like a good path forward.
like we're we're spending time there as
well. I'm sure like a bunch of other
people are well.
>> Awesome.
Great. Um I think we are pretty much uh
to the end of the part and um so what is
uh what is something which you are most
excited about to happen in let's say
again in 6 months or year because again
like we pretty much don't know but you
really want to see to yeah to happen.
>> Two things like one I'm super excited
for the World Cup. So like World Cup is
happening like here it's happening in
Philly. So, I'm like super stoked for
that. But besides that, I think like the
thing I'm like super stoked about is
we're we're like just starting to get
the first sparks of these like
self-improvement loops from data that's
generated from agents. And like we're
pushing like a ton on this like in the
last like couple months like we put our
like first like research around this.
There's other good teams doing this. But
I think like this is like such an
amazing on-ramp for us for like all
teams to self-improve like all of their
systems by doing like very very good
like data engineering like looking at
all of their trace data like mining it
for errors and like bootstrapping self
like probably to start they're going to
be like semi-autonomous self-improvement
loops like like humans will need to be
in it but the systems will get better
and better and I think the the flow of
build agent
use an environment, generate data from
it, and then like mine the data, point a
lot of compute at the trace data to
derive like eval and to derive training
data and then like use that to like
improve the agent. Just keep doing that
loop. That is like super exciting to me.
And it it's it already works actually
like every like people like we're doing
it like people are already doing it. It
like works. Customers are doing it. It's
awesome.
But it will only get better, I think,
with like better models and like we're
going to build everyone's going to build
better systems around some of this
stuff. So, yeah, I'm stoked in six
months. Like, I can't even imagine like
how good this loop is going to be. It's
going to be amazing.
>> Likewise. Totally. What's the next blog
coming?
>> Next blog. Um, okay. I'm supposed to
write one over this like weekend. Yeah.
Hopefully like next week. Yeah. Oh,
maybe like one thing that's cool I like
lang chain is like because we talked in
the beginning I actually think blogs are
like fantastic like artifacts like work
backwards from so it's like but your
team does a bunch of like amazing work
and like you should like totally share
that work so you can like kind of like
pick like a blog it's like I want to
write a blog about this and it's like
okay like what's all the work I have to
do to make sure that that blog doesn't
like suck basically. Yeah. Yeah.
>> That's great.
>> Yeah. Yeah. I think there's one I'm
thinking about a bunch which is like um
it's less like like agent engineering
stuff but more just like how much we've
like unbundled like agents. I think
there's been like a huge like unbundling
of agents uh into like pick a base
harness and like pick your skills like
pick your tools like um design your
agent like design the models and it's
not just like one monolithic system like
you totally don't have to get locked
into anything like you have the choice
to build like bespoke tooling for
yourself like for your company and like
the unbundling is awesome and like I
think like people are doing cool stuff
around that so hopefully like I'll like
riff on something about that or
something or just whatever I don't
Great. Okay. We um last question to you.
Um so imagine so um so the world is the
technology is changing
by an order of magnitude every week. we
all can see uh what advice would you
give to someone who is just starting out
of college who is someone 20 20 21 year
old because because things are not same
as it has been like I can say it's been
like like couple of years ago it's not
the same the world is changing so fast
and and it's sad to see that lot of
people are actually I mean don't even
care about what is really happening
right so even like if someone is
starting out college so what should they
really look forward to? I mean to to be
at frontier and to actually scale on
things to actually learn and be at good
places. So what's your opinion over
that?
>> I don't know how amazing advice I can
give on this honestly but like maybe
like some like general thoughts of like
what I was thinking when I was like
finishing like PhD and stuff and also
like there's like so many sick like kids
who are just like graduating undergrad
already like that I see on Twitter doing
great work. I think like there's there's
a couple like common threads which are
really cool which is basically like you
just sort of like pick something you're
like kind of interested in and you just
use like AI to help you learn that and
you just like kind of like rabbit hole
like really deep into that one thing.
And I think like that's probably like
really really useful because you can
kind of maybe use AI to become like top
maybe like 10% or like 5% of the world
if you like care enough and like the
problem is not like super crazy. And I
think like that's like really good. And
the other thing is like I think like
it's awesome when people just like post
their thoughts like online. And um I was
saying like it helped me like meet a lot
of like cool people. I see like awesome
like posts on X and like I love
interacting with them, but I think it's
basically just like it's kind of scary
to maybe like put your ideas like online
like dude I'm gonna get like roasted
like first I'm going to get roasted like
by my friends who are like dude why is
he posting so much on like Twitter about
like AI but it's totally fine like you
kind of like get over it but like it's
just like really good to like sort of
share your ideas because it helps you
like other people like challenge you and
then like you realize like oh like that
idea was dumb or maybe that idea was
like really good like resonates with
people and like the only way maybe for
like other people to like really help
you is if they like see your work or
they see your thoughts and then like I
think there's so many people who are
like willing to help. So just like maybe
like pick something just like grind on
it just like post about it basically.
And I feel like if you do that enough
times then something good will hopefully
happen or like you'll have learned
something which is also like really good
>> dude. Um this is so honest and I can
totally relate with both of your points
and basically this is something which I
have experienced again like because
there are so many trajectories so many
arenas opening as AI is evolving to
learn to to actually u make your make
you context aware about things I mean it
can be anything it can be posting side
of things pre-training inference
engineering environments data a lot I
mean you can't really keep it up about
things so again as you said use AI use
your knowledge sources and like read
good blogs, references, hack on,
experiment on and this is something and
that is the reason I mean even good
professors lot of colleges are not
actually wor about things. So I think
this is the best time to learn and
actually dig on things and and I think
there are wide arenas where one can
master one thing right because everyone
needs master of something and get into
places and let's I mean it can be
anything it can be even hiring as well
if you're really good at it so you can
make it to the places of course and as
you said about posting about stuff dude
I mean this is so underrated I mean if
you are really good poster if if you if
you can really uh kind of um convey your
thoughts well
amazing opportunities can open up and
this has been happening for me and I
have seen a lot of amazing people been
to places just by like I've I've
interviewed bunch of people I can give
example of kalome he's he's 19 he just
did ready to wait he went to meet
Shopify CEO then he got hired at prime
so I mean there's so many people who
have just gone to the same trajectory
just by posting their thoughts online
and it is and it is a fascinatingly
>> rewarding
It is actually rewarding. Totally. Um
awesome. I think um we at a wrap. So
thanks Viv. Uh for everyone listening,
deep agent is open source. Everything is
on GitHub and absolutely you read a
web's blog coming on um Twitter. It's
it's just amazing and that is something
which has led to this conversation. So I
hope more more and more of them coming
and follow him at um with Tan on
Twitter.
>> Yeah, dude. This was so fun. Oh, I had a
blast.
Get the TLDR of any YouTube video
Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.