How does Claude Code *actually* work?
If I've learned anything from running
this channel, it's that you guys really,
really love vague terms that don't
actually mean anything, like agentic
coding or vibe coding or all these other
things. And while I feel like I finally
understand what an agent is, we have yet
another new term we have to wrangle,
harness. And I've been talking about
harnesses a lot more. And I've been
doing that because I just put out an app
called T3 Code that lets you code with
AI. But it's important to know that T3
Code is not a harness, but Open Code is.
and so is cursor and so is claude code
and codeex but codeex app isn't wait
what harness is a very specific term
that means a very specific thing and to
go a step further your harness is really
important to the quality of code you're
going to get out of these tools
according to Matt Mayer's independent
benchmark that he recently ran comparing
different models inside and outside of
cursor most models saw a meaningful
performance improvement for opus it went
from 77% in cloud code to 93% in cursor
the Only difference here is the harness.
So, what even is the harness? Not only
am I about to explain in detail what a
harness is, I'm also going to build one.
This is going to be really, really fun.
I'm super excited to break all of this
down to go through what a harness is,
why it matters, what the differences
between them are, and how to build one
of your own. I've tried and failed to
come up with like three different jokes
for the sponsor transition here. So, uh,
yeah, quick sponsor break, and then
we'll break all this down. I'm going to
ask something weird. I want you to
ignore the first line on today's
sponsor's page because that's not what I
want to talk about. Today's sponsor is
Macroscope and yes, it does say an AI
code reviewer and as cool as their code
reviewer is, that's not what I want to
talk about. What I love Macroscope for
is the insights it gives me as the team
lead on what's going on at my company. I
can't possibly be in the trenches
looking at what PRs are merging to try
and figure out what's going on. And as
great as my team is at giving me
updates, they sometimes have too much
information and are also clogged with
all the other things that I'm blocking
them on that I have to catch up with.
So, if I want to know what's actually
going on on my teams, I've been relying
on macroscope. And while their dashboard
is incredible for this, their new
Slackbots, even better. It's currently
Friday and I don't know what my team
shipped. So, I just asked outright, what
did the team ship last week? It asked
which org because I have multiple
installations. And then it wrote up a
really good useful report. In T3 Code,
we rewrote the architecture with effect
RPC for websockets. We improved the
performance significantly. We introduced
multi-provider model systems. The
context window visibility got
significantly better. customization and
UX changes that were important,
observability and security, and then
separately a bunch of changes that we
made for T3 Chat. Do you understand how
useful this is when your teams are
shipping quickly? And that's what
Macroscopes for. They have super quick
code reviews that my team relies on
every day. It's become Julius's favorite
of the options because it's super fast
and usually very accurate as well. If he
sees a medium or high severity thing, he
always hits it because 95% of the time
it is correct. Let your team ship fast
with less bugs and more insight at
soy.cope.
So, what even is a harness? Not a simple
question to answer. To put it as simply
as possible, the harness is the set of
tools and the environment in which the
agent operates. What that means is it's
the thing that the AI can use to
generate text to do stuff. Let me put it
simply. Imagine you have a normal chat
and you say, I don't know, what files
are in this folder? And you run a
command in a folder. The AI knows what
it needs to run if it's in a bash
terminal, it can run ls- a and see
everything in that folder. Or can it?
How can the AI run commands? By default,
when you're using any interface with an
LLM, it just responds with text. All
these LLMs are that we're using every
day is really advanced autocomplete. You
give it text and it guesses what the
most likely next set of characters are
over and over again. That doesn't mean
it can use things on your computer. That
doesn't mean it can write code. It means
given some text, it can generate more
text. But the models can't do other
things. All they can do is write text.
So how the hell can the models edit
files on our computer, make changes to
our databases, connect to other
services, look things up on the internet
if all it could do is generate text?
Well, we've invented some solutions to
give the models more capability here.
The main one is tool calling.
Effectively, the way a tool call works
is special syntax. I'm going to make up
my own syntax here, but I think you'll
get the idea. Let's say we have a bash
call tool. The model is told ahead of
time as part of the system prompt, hey,
you have this tool you can use to run
bash commands. You wrap it with this
tag, in this case, bash call. You then
write the command and then you close it.
You send this as your final piece of a
response and then you stop responding.
We will go execute this on the system
and then give you the response when it's
done. So the really interesting thing
that happens here in this effective chat
history is a line is drawn after the
model has responded with this syntax.
The model stops responding. The server
you're connected to, the work that
you're doing, the back and forth you are
having with the model, it's cut off in
that moment. It no longer exists. The
connection you have and the chat history
that you have only exists on your
computer or the server you're doing this
on and maybe in their database if
they've built it to work that way. But
now the message is over. So, what
happens? Because when I ask this, it
doesn't stop there. Let's just go try
cla code quick and see what it does.
What files are in this folder? It
idiates. It says what it's doing. It's
reading one file. If you press control
O, you can expand and see what it did.
It ran the ls command for this directory
and it got all of the contents and then
it described what they were. But, as I
just said, the model's done responding
here. How does it keep going? This is
one of the many things that harnesses
do. After the tool call has been passed
to the harness, the harness executes it
with good old-fashioned code. So when
your harness gets back this response and
it sees this call, depending on the
settings you have, it either runs it or
it asks you as the user for permission
to run it. If I rerun Claude without my
custom script, it turns off the
dangerous mode and it leaks my
email. you, Enthropic. you,
Enthropic. I hate Enthropic. How
the do they show your email in the
default state? Why would they ever do
that? There's no reason for
that. Why is demo equals 1 clawed? Cool.
I hate them. Anyways, now that I
don't have my special permissions and
security on, I'll ask the same question.
And since ls is a safe command and it
knows that, it happens to not ask. But
if I ask it to format the HTML file for
me, things will be a bit different. Here
it's making a change, but it can't make
the change until I permit it to. In this
case, they're using a custom tool.
They're using their write tool. So,
they're not calling a command to do it
via bash because they have more tools
than just the bash tool. We'll go in
depth on all of those in a bit. But this
is the harness recognizing that this
tool call is destructive. And at a code
level, not an AI level, a code level, it
is recognizing this change and asking me
as the user, do I want to allow it or
not? And I can say yes. I can say yes
and keep doing it. Or I can say no,
don't. In this case, I said no. And now
it just stops. What would have happened
if I said yes? Well, it would have run
the command. It would have the output of
ls- a. So it runs it and then it has
file 1.txt, file 2.txt,
etc. And this section here is all the
tool call response. So the model writes
the tool call. Your harness takes
whatever this needs to be, whether it's
updating a file, running a command,
doing something, it does whatever
permissions checks it needs to, and then
it runs it. And once it's done, it takes
this output, it adds it to the end of
your chat history, and then it
reerequests from the same bottle to
continue. So the exact same way you hit
an endpoint to answer this question, you
hit the same endpoint again with the
question, the answer and the output of
the tool. And at that point, the model
starts responding accordingly. So
effectively, every single time a tool
call is done, the model stops
responding, the tool call runs, the
output gets added to your chat history,
and then another new request is made to
the same model to continue its work. So
effectively the brain that's doing all
this work gets paused and restarted
every single time a tool call is made.
So now we understand all of this. What
the is the harness? Well, one part
of the harness is that it does all of
these things. It gives the tools to the
model. It handles the back and forth. It
handles the history. It handles all of
these pieces. And it chooses
specifically the types and sets of tools
and their descriptions that the models
have access to in order to do the thing.
And just to make sure you guys get this
because this part is really important.
It's possible the model isn't content
with this answer. It might want more
information. It might say I should know
the contents of file1.txt
before I respond. And then it will do
another bash call or something like it
that is I don't know catfile 1.txt. And
now another tool is called. Another
similar response is generated. And this
one will respond after the cat call with
a funny to say cat call in this context
with a hello world IDK why you are
reading this but I'm happy you chose to
something like that. I don't know. And
now this again gets appended. The model
has it. And now when the model responds
it can see all of the history. We're
like, I listed the files and read the
one I thought was important. I now have
everything I need to respond to the
user. And then it will actually respond.
This flow is how pretty much every
single AI tool we use to code works. But
there are things that have changed over
time. One of the important things to
know about is context. how much
information exists in the chat history
versus how much exists purely in the
codebase in a way that the chat doesn't
have. When you open up claude code in a
folder, it doesn't know anything about
that folder. When I launch Claude in
this demo project with off and I say,
"What is this app?" it can't know
because it's not included yet. So, when
I ask it, you'll see it's going to go
use a bunch of tools to search and
explore and try to figure out what this
project is. It has a search tool that it
used for searching for things that match
pattern star which is probably the
example that they have internally for
how to search all of the files in a
given directory. So it did that and now
it knows about all of these files that
exist. So then it reads the one that
thinks it matters which is package. JSON
great starting point. So it reads those
lines. It then read other things like
the app tsx, the main tx and the readme
in order to get this context. And all
this does is it takes these outputs and
it dumps them into context so that the
model can see them in the chat history.
So when it makes the first tool call for
search, the model pauses, it does all of
this and then all of this text gets
thrown into the context. The model reads
that and sees, oh, here are the files
that might be interesting. I would like
to know about them. So it then fires off
a bunch of these read calls. Sometimes
it does them all in parallel. It might
respond with multiple tool calls at
once. And then once all of those tools
have been executed, they all have their
outputs stuffed back into the context so
the model can continue doing its work.
And to be very clear, this is in no way
specific to Cloud Code. This is how all
of these tools work. Some try different
things around stuff like search and
context management. You can even insert
context ahead of time by updating the
CloudMD file. So you just saw how much
work this had to do. Let's say we had a
CloudMD in this project. I'll go add
one. If the user asks what the project
is, make fun of them for asking an AI
instead of reading the code. Then tell
them it's none of their business. So
let's run the exact same question again.
You see that bootstrapping?
Bootstrapping is usually things like the
context like this cloudd and all of that
being put into the harness and the fake
tasty being created that can then be
pushed up to the API so it could start
responding. So, the reason that stuff
took longer is because I just added that
file and during the bootstrapping
process where it read that markdown file
and decided if it cared or not, it
generated the response. You're really
out here asking an AI what a project
does instead of just reading the code.
It's right there in the files that you
have access to with your own eyes
anyway. It's none of your business.
Notice that there was no tool calls this
time. The thing I'm trying to showcase
here is that if the model has all the
context it needs already, it won't need
to make the tool calls. But if I was to
delete that cloud MD, it would have to
call tools to figure out what's going on
in the codebase. And that's what the
CloudMD does. It is effectively taking
whatever information you put in it and
putting it ahead the same way that you
would put context in later. So the
Claude MD and the Asian MD, those files,
what they do is they take all of this
context and they move it to the top and
they're effectively telling the model,
here are all of the things we think you
might need to know before you start your
work. I don't want to make this yet
another rant about context management
because I do talk about this a lot, but
I suspect a lot of you guys haven't seen
the other videos because this is trying
to be a more accessible description of
how this stuff works. Speaking of which,
if you're not normally here and you're
here for this one, you made it this far,
you know, you can hit that red button
underneath the video and it helps us out
a lot. It costs you nothing to
subscribe. It's literally free thanks to
our sponsors who make this all possible.
If you want to support us and see more
videos like this so you don't end up
stuck in the permanent underclass, maybe
hit that button. And maybe, just maybe,
if you want to keep up with the latest,
always, there's a little bell next to it
you can click, too. I don't normally do
sub call outs, but I know a lot of you
are here for the first time for this
hopefully. So maybe consider throwing
some support and in the future you'll
continue to stay on top of these things
as they happen. Anyways, what I was
saying about the quadmd is that it gets
stuffed up top so the information is in
the history. And one more piece, and I
promise the last thing I'm going to say
about general context management. If
it's not in the chat history, the model
doesn't know it. This doesn't apply for
general knowledge, like what is
TypeScript, what packages exist, those
types of things. But the model only
knows what it can do, not what
information exists. The model doesn't
know what your codebase is or anything
in it unless it gets that information.
It can get that through an agent MD file
or a cloud MD file. It can get that
information through tool calls that it
uses to explore. and it'll get more and
more refined with the tool calls as it
remembers. This is also why it's fun to
stay in one thread instead of making a
new thread every time you make a new
prompt because when you go back and
forth, it doesn't need to look up where
the files are because they're still in
the history. It remembers. For one more
example here, I'm going to delete the
cloud MD. And remember previously when I
gave the example where I asked that and
it did the search call first. I'm going
to game it a little bit. What is this
app? You should probably start at the
package.json JSON. Previously, the model
did not know there was a package JSON
file. It only knew about that because it
called the search tool first. Now that I
am telling it explicitly in my prompt,
the existence of that file will be in
the history. And since that'll be in the
history, it will hopefully be able to
skip the search tool initially at least.
Yeah. See, it started with a reading
instead of a search. And now the search
is more specific. Instead of searching
the whole codebase like it did before
with the single star, it is instead
searching the source directory because
it saw through the package JSON that
that's where the interesting pieces will
be. And it made half as many tool calls
as it did before cuz I gave it that
additional context. I'm already seeing
questions that make sense, but I want to
jump on them because I think it'll help
clarify things before we go further. Is
it useful to ask the model to read a few
key files in full at the beginning of a
conversation if they're relatively
small? My take for this is generally
speaking, no. Tool calls are really,
really cheap. And the models, the
harnesses, and all of the things around
them have gotten pretty good at figuring
out what context you need to solve the
problem. You might think you know the
context well enough, and you quite
possibly do. You can definitely help it
skip a few tool calls that it might not
need to do, but most models are now
smart enough to figure this out
themselves, especially like Opus 4.5 and
4.6, Sonnet 4.6 6 and chat GPT models
like GPT 5.3 CEX and 5.4. Those models
are all now more than smart enough to
figure out where the context is in the
codebase. They don't need you to tell
it. They can find it usually. And this
massively contradicts the prior theory
that we all had about this stuff, which
is that your codebase would basically
determine how good the model could be.
Because if the codebase was too big to
fit in the context window, it's not
going to work. Thankfully, that's not
how things ended up going. And very
thankfully tools like repo mix are
largely dead now. This made a lot of
sense when the model couldn't call bash,
couldn't navigate your system, couldn't
do things the way a developer would do.
And instead we wanted to give the model
all of the code so it could have all of
it before it starts. Repo mix was a
project that let you compress all of the
code in your codebase into a single XML
file that you can copy paste the model
and ask it to make changes which was a
mess for a bunch of reasons.
Mostly because squashing your entire
codebase into the context is creating
the worst needle in a haststack
problem imaginable. Just think about
this. If I ask you to fix a bug and I
give you two files the bug might be in,
or I ask you to fix the bug and I give
you 2,000 files the bug might be in,
which is easier to deal with? Let's be
realistic here. Cool. Happy we're on the
same page with that. Now imagine that
your memory gets reset every 30 seconds.
Crazy, but that's kind of how the AI
works. So, you're given the question of
fix this bug, and you know, your brain's
going to reset in 30 seconds. So, you're
like, "Okay, uh, I don't know anything
about the bug. There's no history here.
Uh, I need to find the files it could be
in. I'm going to do a search to do
that." And as soon as you do that, as
soon as you start the search, your brain
gets reset. And now, when the search is
done, your brain is turned back on, but
with it entirely wiped. But you have the
history of what's happened so far.
You're like, "Okay, I have to fix this
bug. 30 seconds ago, I did the search.
It found these things. I need to figure
out where it is in these." And then you
do that and then you leave another
instruction at tool and then your brain
is reset again. And it happens over and
over. So if you have to squash
everything in your codebase into your
brain just to have it reset every 30
seconds. Not only is that expensive and
inaccurate, it's just bad. And for a
while the belief was that this would be
necessary and that we would need to have
more and more context available to the
models. We would have to find ways to
stuff these gigantic code bases into the
model and that huge context windows
would be the future. Thankfully, that is
not the case because models got good
enough at building their context using
tools that we don't have to tell them
where everything is in the codebase
anymore. This is also what cursor used
to do, which is part of what made it so
special. They had a really good vector
indexing system that made it easier to
find the specific code that mattered for
the model. They still do that, but they
do that through traditional search tools
now instead where the model's told they
can search for a thing and the search it
probably lies to the model and says it's
GP or something and then it uses their
stuff to actually go index in a much
more intelligent way to find what the
model wants. It kind of just turned out
that large context makes the models
dumber. The more you stuff in, the
worse they behave. And there's charts
that prove this. As sonnet breaks the 50
to 100,000 or so range for the number of
things in its context, in this case
tokens, when you break that number, the
accuracy plummets to nearly 50% of where
it was before for its ability to find
repeating words in the context window.
So just stuffing everything in is not
the solution. And that's a big part of
what makes harnesses so interesting.
They provide the models with the tools
to build their own context to identify
where the problems might be or what
needs to be changed and then most
importantly to make those changes. So
how do you actually implement this?
Thankfully there are two awesome
articles that break down how to build
your own harness. There's this one from
April of last year from the AMP team and
there's this one with a very funny
image. This one's from Mah just
independently writing the article to
show people that something like cloud
code isn't that complex to implement. AI
coding assistants feel like magic. You
describe what you want in some barely
coherent English, and they read files,
edit your project, and write functional
code. But here's the thing. The core of
these tools isn't magic. It's about 200
lines of very straightforward Python. I
like how a hail breaks down the mental
model here. The order events is
important. You send a message like
create a new file with this function.
The LM decides it needs a tool and it
responds with a structured tool call or
sometimes multiple at once. Your
program, in this case, the harness, the
thing that you're building, executes the
tool call locally. So in this case, it
could create the file using code or it
could execute a bash command. Any of
those things and the result gets sent
back to the LLM and most importantly the
LM uses that context to continue or to
respond in as few lines of code as 200
is. I'm very lazy so I am asking a
harness harness T3 code to go build this
using claude opus. But we'll have a good
demo in just a second. Back to reading
as we wait. There's only really three
tools you need at the core. You need the
ability to read files so the LM can see
the code, list files so it can navigate
the project and find the code it's
looking for and edit the file so it can
actually make the changes you want.
Production agents, things you actually
use like cloud code, have a few other
capabilities like GP, bash, web search,
and more. Most of them use RIP GP now
cuz it's really strong, but we don't
really need those for the basic of most
basic examples. Let's look at their code
in this example. We import a bunch of
random because we're in Python. Not
that I'm any better as a JS dev. We load
the enenv. We have our claude client
which is an instance of anthropics SDK
that uses the key so that I can now call
claude over the network. We create some
colors for the terminal here. We then
resolve the absolute path because it's
much easier for the model to write valid
commands if it knows the path that we're
in. So now we create this absolute path.
And now I have to implement the tools.
First, we need a read file tool where
the model will pass a name of a file and
it will be returned a string dictionary
that has all of the contents of that
file. Full path is resolve the absolute
path with that file name. We print the
full path first so we can see it in our
UI and then we open that file path as a
read stream and grab the content. And
then we return this JSON blob with file
path which is the string for the path
and content which is the actual content
of the file. This gets I'm assuming as
we scroll added to the chat history when
it's called. We'll see how the tools are
actually used in a bit. Right now we're
just reading the code for said tools.
List files. I'm sure this is super
complex. We resolve the path. We have
all files. And then for item in full
path iter for each file we append the
file name and the type. And then we
return all of that after. And now the
edit file. Here's where things get
really complex. Because we have an old
string and a new string. Is it to
replace the old one with the new one?
This will replace the first occurrence
of the old string with the new string in
the file. If old string is empty, then
we will create and override the file
with the new string content. So if we
have an empty string for old string,
then we just write the text to the path
for this file. But if we do have the old
text we're replacing and we can't find
it, then we return an error saying that
the old string was not found. But if we
can find it, then we edit it out and
replace it with the new string using a
replace call here. and we write that to
the file and we return saying that we
edited it. That's it. So we have our
three tools, but how does the model even
know it can use those? Well, first we
have to list all of these somewhere. In
this case, a simple tool registry that
has a read file tool, list file tool,
edit file tool. And these are just the
functions, by the way. There's nothing
special about these. They're very simple
functions. But the model needs to know
about them. But having those functions,
cool. The model needs to know what they
are, what their like format is, and how
to call them. And we're not in
Typescript, so it can't just use type
signatures. So it needs a bit more info.
Thankfully, we defined this with a lot
more info, including a comment here that
describes what it does and what all of
the parameters are for. So, here we get
the definition for a given tool by
ripping it from the tool registry, and
we return the tool name, the doc from
it, and the signature from the same
tool. And now our system prompt, which
is the text that comes before the first
message, things like your agent MD would
be included in here. This all is
constructed in with the tool registry
included where we tell the model what
the tools are and everything they need
to know to work. And here is what that
prompt actually looks like. I'm going to
copy paste this into an editor so I can
word wrap it. You are a coding assistant
whose goal is to help us solve coding
tasks. You have access to a series of
tools that you can execute. Here are the
tools that you can execute. This is
where the tool list gets dumped. When
you want to use a tool, reply with
exactly one line in this format. tool
colon tool name and then the JSON arcs
and nothing else. Use compact singleline
JSON with double quotes. After receiving
a tool result message, continue the
task. If no tool is needed, respond
normally. That's the whole thing. This
is arguably the majority of the harness
in this example at least right here.
Because the tools are really simple, the
model doesn't know what to do with them.
This here is everything being passed to
the model as the start of the chat
history because again the model only
knows what's in the history. So when you
put the tools in the history, it knows
it can use them. So then we have to
parse that out. When the model stops
responding, we have to look for lines
that start with tool colon. If the line
doesn't start with that, continue. But
if it does, then we have to append this
to invocations with the name of the tool
and the args. And then when it's done,
we have to actually make the calls. The
lm call couldn't be simpler. You have
the system content, you have the
messages, all the things from back and
forth. If the message is the system
message, we put that in the system
content. Otherwise, we just append it to
the messages array. And then we call
claude clients API with the message. And
here we give it the model we want to
use, the max tokens, the messages. And
again, the system prompts important. So
this is not part of the message history.
It's a separate array, which it should
be. Well, not an array. It's a separate
argument because this is something you
should include as the dev. And the
messages array is something that gets
included by the user. And the magic is
all in the loop. We wait for the user to
send an input and once they are done and
they submit a keyboard interrupt, an end
of error, so like an enter key, it
breaks and it appends that to the
conversation. And once that's happened,
we run another loop where we wait for
the execution to occur. At the end of
that, we get our tool invocations. So we
have when the message is done being
generated by the model, we have all of
the tool names and arguments that the
model wants to use. And if there's
nothing here, we just respond. We just
share the message from the assistant the
model. But if there are tools here, then
we go through each of them. For each
tool, we grab it from the registry, make
an empty string response because it's
Python. We start with an empty value and
we set it later. We print the name and
the arguments. And if the tool is the
read file tool because that's the name
that was passed, we call that one. If
it's list files, we call that. And if
it's edit files, we call that.
Specifically, we're passing the
arguments in correctly here too by
grabbing from that JSON blob that's now
a dictionary the key that we want. And
then when that is done, we append the
tool results as messages to the chat
history. And running it is literally
just run it in a loop. That's it. Bad
news. Opus really likes using Python.
Did it not even put in the right
folder? I hate the Claude agent SDK
because it doesn't care what folder it's
executed in and what path it is passed.
It needs multiple different reminders
that it has to be in a specific path.
So, it just ignored the path that this
was executing in. That's
obnoxious. So, we now have our mini
agent. It happened to get dumped in the
wrong folder, but there's no pip
install, no node modules, nothing. Can
you read from
the env
to do that quick? And what's funny, even
in a harness harness like T3 code, we
are exposing the tool call. So I just
asked it to change this file. It didn't
know if it's changed or not since I
asked. So it decided to do a read tool
call just in case to see if the files
caught us the same or not. And once it
confirmed, it made an edit call where it
changed the import path to now have this
new information in it. And now I should
be able to Python agent.py asking it
about the Python code in this app. Now
we can see it called list files. It
called read file and now the model is
thinking because it has this new chat
history with the outputs of these in it.
And here is the response from the model.
Here's a summary of what agent.py does.
It implements a lightweight
self-contained AI coding agent in 60
lines. It's a setup where it loads the
ENV file. It configures the model with
set 4.6. It has these three simple tools
as well as a bash tool that can run
arbitrary shell commands. Ready to see
where this gets fun? Remember earlier
when I said you only really need bash?
Watch this.
And now it only has the bash tool. So
instead, it's just going to call bash
with different commands over and over
again. It's going to get the content the
same way, but instead of using the tool
we gave it, it's just going to call bash
to do it instead. It uses the tools it
has to do the task. And if we delete
everything other than the bash tool,
this gets comically simpler. We're now
down to 75 lines. And I haven't even
purged that thoroughly yet. And half of
it is dealing with the env. Like, let's
just be real. How cool is that? that all
it takes to give an AI model the ability
to do real things on your computer is
you give it a tool that it can pass bash
to and these models have been trained so
thoroughly on these types of fake chat
histories that have all these tool calls
in them that they know how to deal with
that already. One last important thing
because this was not included in the
article and it does matter. Most of the
models and the APIs we hit them through
are now aware of the idea of tools. this
has become a standardized enough thing
that there are specific syntaxes that
different models expect. You can just
put this in the system prompt and it
will just work for simple cases. A lot
of the providers hosting these models, a
lot of the platforms like open router
that manage the in-between and all of
that they all have a dedicated tools
concept now. And in this case, it's a
standard format that I can pass the same
way I pass messages to the model. I also
can pass tools to it in the body when we
make the call to in this case open
router. OpenAI has this, open router has
this, anthropic has this, even Gemini
kind of has this. Passing the tools to
the model through a special format so
that the host can get this syntax just
right because the actual syntax the
model sees is to be frank kind of gross.
This is the format that OpenAI's models
see internally. This format is
relatively complex but also really
powerful and open source. It's meant to
be very compact so the models can
process the data well, but also the
start, end, and weird bracketing syntax
makes it less likely the syntax
conflicts with the things the model's
actually outputting, which is really
cool. Thankfully, you'll never have to
deal with almost any of this if you're
the type of person watching this video,
cuz this is so deep in the weeds that
half the companies hosting these models
don't even know about it. This is not
something you'll ever have to care
about. But the reason that something
like this tool call key here is so
powerful is that in this case, Open
Router will take your tools and format
them the way the different models expect
for the different providers. I think
I've covered everything I need to here.
And we actually built a harness that
works and can call bash to make changes.
You know what? Let's ask it to do
something different here. Again, it only
still has bash. Let's ask it to make an
edit. I don't like the code that loads
the open router API key from the
environment. Can we make it simpler in
some way? And again, all we did here is
append another message in the array. The
message array has the first message we
sent, the first message the model sent,
all the tool calls, and then the last
message the model sent at the end. And
now I added a new message, and now it's
rerunning the loop until the model is
done. It read the enenv. It read the
agent pi and then it made a change by
how to even do this kind of nasty. Oh,
bash. Quite a command to do that. Yeah,
surprised it didn't show more here. It
managed to do it right, but damn. Bash
is its own world. And
thankfully, these models are very, very
good at it. But god damn, it made the
change and now this is a self-healing,
self-modifying tool. Pretty cool. Two
more questions I want to answer before
we wrap this one up. The first is why
the hell is cursor's harness able to
make the models behave so much better if
they're this simple? And the second is
if T3 code isn't a harness, then what
the hell is it? Starting with the first
one, it turns out the harnesses,
specifically the tools they're given,
the system prompts they have, and the
outputs they get from the tools
massively influence the results that you
get. Something I've seen basically every
time I use a Gemini model is in its
reasoning preamble before it starts
responding, it says, "I have all of
these tools available to me. I wonder
which I should use." And then it goes
through each one and says, "I don't need
that tool for this. I don't need that
tool for this." And it does that over
and over. And sometimes, especially in
less well-defined harnesses, it'll just
do it anyways. Something that Cursor
puts a lot of time into is customizing
their harness, customizing the tools,
customizing the shape of the tools, and
most importantly, customizing the system
prompt and the tool descriptions to
steer the models towards which they
should or shouldn't use. I'm going to
make a change here. Right here, it says
read a file's contents, but I'm going to
put in parenthesis here. You should
probably use bash tool instead. And now,
if I run the same thing, what does the
Python code here do? It has the read
file tool, but since I told it in the
description to not use it, it's 50/50 if
it will. In this case, I said it should
probably use the bash tool instead, and
it chose to still use the read file
tool. Something you can do because these
are AI models. You can ask, why did you
use the read file tool instead of the
bash tool? Interesting. You can see to
some extent why the model thinks it did
this thing. It thinks that the read tool
was perfectly reasonable for what it was
doing. So watch what I'm going to do
instead. I'm going to redescribe it with
deprecated. You should use the bash tool
instead. And now just with a system
prompt change. I just changed the string
here. That's all I changed. I told it
the read file tool is deprecated its
description. Let's see what it does now.
Well, it's taking its time.
Right again. There we go. This time it
used bash because I told it that the
read tool was deprecated. None of the
code changed. The tool still works
exactly the same, but the model can't
see the code. Well, okay. In this case,
it can because I happen to be running it
in the same thing, but the model doesn't
know how the code was implemented. You
can also just lie to it. So, watch this.
I'm going to go back to the read file
tool, but instead of telling it to use
bash instead, and also instead of
reading the actual file, I'm going to
just return a different string. Print
hello world. And now that's what it will
return for the read tool, no matter
what. And if I run the same thing, what
does the Python code in this app do? The
model sees the path and it goes to read
agent.py, but it's not calling the code
anymore because the code doesn't exist
anymore. The Python code in this app is
very simple. It's a single line in
agent.py that prints hello world to the
console. You can just lie to the models.
I need you all to internalize this. The
models don't know what the code actually
does. You can tell it it's a bash tool,
but you do something else. You can tell
it it's a read file tool, but you do
something else. You can tell it it's GP
or rep GP or something different and
then go do whatever the you want. I
do this all the time. When I want to
just fake Bash, for example, when I want
a model to think it has Bash when it
doesn't, I'll just tell it it does and
I'll tell another model to make a fake
response for it. You can get two models
to talk to each other without even
knowing that they're models by doing
things like this. And it's genuinely
really fun and helps you realize all
they are doing is generating text. As I
hope I have correctly emphasized to
y'all here, the model only knows what's
in its context. Different models handle
different context different ways. I bet
if I changed this here to have the
deprecated warning and I tried that on a
GPT model or a Gemini model, it would
behave entirely differently. We could
even test it. So, we know when I did the
deprecated with Sonnet, it failed. So,
let's switch this over to I don't know,
let's try Gemini 3.1 Pro. Same question,
this time with a different model. And
because I said that the and this is just
yet another example of Gemini
being Gemini. I told it that the read
file tool was deprecated. So it just
went for bash for everything even though
the other tools weren't. It just said
it, we'll use bash. So to go back
to the question of why is cursors
harness better? It's just cuz they
tested it more. I know a couple people
at Curser whose whole job is when a new
model comes out or they get early access
to just hammer it with all sorts of
different minor changes to the system
prompt, constantly micro adjusting it
until the model for the most part does
whatever the it's supposed to do.
And with certain models that's harnesses
are just full of slop. Like I don't
know, just imagine a company that's
letting the AI write the prompts for
them for the system prompt in these
things. Maybe they haven't spent a whole
lot of time trying to rewrite the tool
descriptions over and over to get them
to behave exactly how they want. Even
the example I just gave where I told the
model to use the bash tool instead and
it didn't for the claude models, but
then for the Gemini models, it only uses
bash. Now, that difference means that
they have to rewrite these descriptions
for every different model they support
in cursor. Meanwhile, Anthropic probably
hasn't changed these lines of code in
their codebase since it was
knitted. That's the difference. They
were probably written by a model for
them in the first place. They're not
trying to fine-tune and get these things
just right. So, a company that has a lot
of people whose job is literally that
the results show. And to this day, I
much prefer using Gemini through cursor
than using it directly. I much prefer
using Opus through Cursor than using it
directly. With GBT models, it barely
feels that different. Honestly, the
issue is a lot of these companies, in
particular, both Google and Enthropic,
don't let you use your subscriptions
with them in tools other than their own.
OpenAI doesn't give a You can use
your OpenAI subscription in basically
anything and they're cool with it. Thus
far, Anthropic and Google have been much
more hostile towards that. So, if you're
paying the 250 a month for Gemini or the
200 month for Opus, you got to use their
harnesses. So, that goes to the next
question of what the is T3 Code?
Well, T3 Code does not provide any
tools. T3 code doesn't have a bash tool
or a read tool or anything because it
doesn't have tools because it's not a
harness. T3 Code has a model picker, but
you're not just picking the model. When
you pick a model for Claude, it's using
the Claude code harness on your machine.
If you don't have Claude Code installed
already and signed in, this will not
work. And it's the same deal with
Codeex. If you don't have the Codex CLI
installed, this will not work either.
These harnesses are being provided
through T3 code as a UI layer. We are
just a really nice UI on top of the
harness. So, you might be thinking, I
did the easy work just wrapping it. Did
you forget how easy it is to make the
harness? This is the hard part. If I
learned anything in my time building T3
Code is that my life would be
significantly easier if I could just
build the harness myself, too. I
think that's all I have to say on this
one. Shout out to Matt for making the
video that led to Edward's tweet that
led to me caring enough to make this.
Shout out to Mah, the author of the
Emperor Has No's clothes article that we
use as a reference point. And shout out
to all of the companies for making this
stuff way more complex than it needs to
be and then realizing it should be
simple and giving me the opportunity to
educate all of you guys on something
that is actually just 60 lines of
Python.
This is actually really fun. It's been a
bit since I did a deep dive video like
this where I just break down a concept
and I'm curious how you'll feel about
this. I know I'm kind of the news guy
now, but I love getting into the weeds.
Did you enjoy this video? Do you want
more things like this? If so, let me
know in the comments. And please ask
some questions about similar stuff so I
know where to steer my content going
forward. Enough people didn't get
harnesses, so I decided to make this.
Are there other things you don't
understand? Cuz if so, I'll do my best
to cover them in the future. Let me know
how this was. And until next time, keep
prompting.
Get the TLDR of any YouTube video
Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.