Full Transcript

·YouTLDR

How does Claude Code *actually* work?

39:258,581 words · ~43 min readEnglishTranscribed Apr 15, 2026
0:00

If I've learned anything from running

0:01

this channel, it's that you guys really,

0:02

really love vague terms that don't

0:04

actually mean anything, like agentic

0:06

coding or vibe coding or all these other

0:08

things. And while I feel like I finally

0:10

understand what an agent is, we have yet

0:12

another new term we have to wrangle,

0:14

harness. And I've been talking about

0:16

harnesses a lot more. And I've been

0:18

doing that because I just put out an app

0:20

called T3 Code that lets you code with

0:22

AI. But it's important to know that T3

0:24

Code is not a harness, but Open Code is.

0:28

and so is cursor and so is claude code

0:30

and codeex but codeex app isn't wait

0:33

what harness is a very specific term

0:35

that means a very specific thing and to

0:37

go a step further your harness is really

0:39

important to the quality of code you're

0:40

going to get out of these tools

0:42

according to Matt Mayer's independent

0:43

benchmark that he recently ran comparing

0:46

different models inside and outside of

0:48

cursor most models saw a meaningful

0:50

performance improvement for opus it went

0:52

from 77% in cloud code to 93% in cursor

0:57

the Only difference here is the harness.

0:59

So, what even is the harness? Not only

1:01

am I about to explain in detail what a

1:03

harness is, I'm also going to build one.

1:05

This is going to be really, really fun.

1:07

I'm super excited to break all of this

1:10

down to go through what a harness is,

1:11

why it matters, what the differences

1:13

between them are, and how to build one

1:14

of your own. I've tried and failed to

1:16

come up with like three different jokes

1:18

for the sponsor transition here. So, uh,

1:20

yeah, quick sponsor break, and then

1:22

we'll break all this down. I'm going to

1:23

ask something weird. I want you to

1:24

ignore the first line on today's

1:26

sponsor's page because that's not what I

1:27

want to talk about. Today's sponsor is

1:29

Macroscope and yes, it does say an AI

1:31

code reviewer and as cool as their code

1:32

reviewer is, that's not what I want to

1:34

talk about. What I love Macroscope for

1:35

is the insights it gives me as the team

1:37

lead on what's going on at my company. I

1:40

can't possibly be in the trenches

1:41

looking at what PRs are merging to try

1:42

and figure out what's going on. And as

1:44

great as my team is at giving me

1:45

updates, they sometimes have too much

1:47

information and are also clogged with

1:49

all the other things that I'm blocking

1:50

them on that I have to catch up with.

1:51

So, if I want to know what's actually

1:53

going on on my teams, I've been relying

1:54

on macroscope. And while their dashboard

1:56

is incredible for this, their new

1:57

Slackbots, even better. It's currently

1:59

Friday and I don't know what my team

2:00

shipped. So, I just asked outright, what

2:02

did the team ship last week? It asked

2:04

which org because I have multiple

2:05

installations. And then it wrote up a

2:07

really good useful report. In T3 Code,

2:09

we rewrote the architecture with effect

2:11

RPC for websockets. We improved the

2:13

performance significantly. We introduced

2:15

multi-provider model systems. The

2:16

context window visibility got

2:18

significantly better. customization and

2:20

UX changes that were important,

2:22

observability and security, and then

2:23

separately a bunch of changes that we

2:25

made for T3 Chat. Do you understand how

2:27

useful this is when your teams are

2:28

shipping quickly? And that's what

2:30

Macroscopes for. They have super quick

2:31

code reviews that my team relies on

2:33

every day. It's become Julius's favorite

2:35

of the options because it's super fast

2:37

and usually very accurate as well. If he

2:39

sees a medium or high severity thing, he

2:41

always hits it because 95% of the time

2:43

it is correct. Let your team ship fast

2:44

with less bugs and more insight at

2:46

soy.cope.

2:47

So, what even is a harness? Not a simple

2:51

question to answer. To put it as simply

2:53

as possible, the harness is the set of

2:56

tools and the environment in which the

2:58

agent operates. What that means is it's

3:00

the thing that the AI can use to

3:02

generate text to do stuff. Let me put it

3:06

simply. Imagine you have a normal chat

3:09

and you say, I don't know, what files

3:12

are in this folder? And you run a

3:14

command in a folder. The AI knows what

3:16

it needs to run if it's in a bash

3:18

terminal, it can run ls- a and see

3:20

everything in that folder. Or can it?

3:22

How can the AI run commands? By default,

3:25

when you're using any interface with an

3:27

LLM, it just responds with text. All

3:30

these LLMs are that we're using every

3:32

day is really advanced autocomplete. You

3:34

give it text and it guesses what the

3:36

most likely next set of characters are

3:38

over and over again. That doesn't mean

3:40

it can use things on your computer. That

3:42

doesn't mean it can write code. It means

3:44

given some text, it can generate more

3:46

text. But the models can't do other

3:49

things. All they can do is write text.

3:53

So how the hell can the models edit

3:54

files on our computer, make changes to

3:56

our databases, connect to other

3:58

services, look things up on the internet

4:00

if all it could do is generate text?

4:02

Well, we've invented some solutions to

4:04

give the models more capability here.

4:06

The main one is tool calling.

4:08

Effectively, the way a tool call works

4:11

is special syntax. I'm going to make up

4:13

my own syntax here, but I think you'll

4:15

get the idea. Let's say we have a bash

4:17

call tool. The model is told ahead of

4:20

time as part of the system prompt, hey,

4:22

you have this tool you can use to run

4:24

bash commands. You wrap it with this

4:26

tag, in this case, bash call. You then

4:28

write the command and then you close it.

4:30

You send this as your final piece of a

4:33

response and then you stop responding.

4:36

We will go execute this on the system

4:38

and then give you the response when it's

4:40

done. So the really interesting thing

4:42

that happens here in this effective chat

4:44

history is a line is drawn after the

4:47

model has responded with this syntax.

4:50

The model stops responding. The server

4:52

you're connected to, the work that

4:53

you're doing, the back and forth you are

4:55

having with the model, it's cut off in

4:56

that moment. It no longer exists. The

4:59

connection you have and the chat history

5:01

that you have only exists on your

5:03

computer or the server you're doing this

5:05

on and maybe in their database if

5:07

they've built it to work that way. But

5:08

now the message is over. So, what

5:10

happens? Because when I ask this, it

5:12

doesn't stop there. Let's just go try

5:14

cla code quick and see what it does.

5:16

What files are in this folder? It

5:18

idiates. It says what it's doing. It's

5:20

reading one file. If you press control

5:22

O, you can expand and see what it did.

5:24

It ran the ls command for this directory

5:26

and it got all of the contents and then

5:28

it described what they were. But, as I

5:31

just said, the model's done responding

5:33

here. How does it keep going? This is

5:35

one of the many things that harnesses

5:37

do. After the tool call has been passed

5:39

to the harness, the harness executes it

5:42

with good old-fashioned code. So when

5:44

your harness gets back this response and

5:46

it sees this call, depending on the

5:47

settings you have, it either runs it or

5:50

it asks you as the user for permission

5:52

to run it. If I rerun Claude without my

5:54

custom script, it turns off the

5:56

dangerous mode and it leaks my

5:58

email. you, Enthropic. you,

6:00

Enthropic. I hate Enthropic. How

6:02

the do they show your email in the

6:04

default state? Why would they ever do

6:06

that? There's no reason for

6:08

that. Why is demo equals 1 clawed? Cool.

6:13

I hate them. Anyways, now that I

6:16

don't have my special permissions and

6:17

security on, I'll ask the same question.

6:19

And since ls is a safe command and it

6:22

knows that, it happens to not ask. But

6:24

if I ask it to format the HTML file for

6:27

me, things will be a bit different. Here

6:30

it's making a change, but it can't make

6:32

the change until I permit it to. In this

6:35

case, they're using a custom tool.

6:37

They're using their write tool. So,

6:39

they're not calling a command to do it

6:40

via bash because they have more tools

6:42

than just the bash tool. We'll go in

6:44

depth on all of those in a bit. But this

6:45

is the harness recognizing that this

6:48

tool call is destructive. And at a code

6:51

level, not an AI level, a code level, it

6:54

is recognizing this change and asking me

6:56

as the user, do I want to allow it or

6:58

not? And I can say yes. I can say yes

7:01

and keep doing it. Or I can say no,

7:02

don't. In this case, I said no. And now

7:05

it just stops. What would have happened

7:07

if I said yes? Well, it would have run

7:09

the command. It would have the output of

7:12

ls- a. So it runs it and then it has

7:14

file 1.txt, file 2.txt,

7:18

etc. And this section here is all the

7:21

tool call response. So the model writes

7:24

the tool call. Your harness takes

7:26

whatever this needs to be, whether it's

7:28

updating a file, running a command,

7:30

doing something, it does whatever

7:32

permissions checks it needs to, and then

7:33

it runs it. And once it's done, it takes

7:36

this output, it adds it to the end of

7:38

your chat history, and then it

7:39

reerequests from the same bottle to

7:42

continue. So the exact same way you hit

7:44

an endpoint to answer this question, you

7:46

hit the same endpoint again with the

7:48

question, the answer and the output of

7:50

the tool. And at that point, the model

7:53

starts responding accordingly. So

7:55

effectively, every single time a tool

7:56

call is done, the model stops

7:58

responding, the tool call runs, the

8:00

output gets added to your chat history,

8:02

and then another new request is made to

8:04

the same model to continue its work. So

8:07

effectively the brain that's doing all

8:09

this work gets paused and restarted

8:12

every single time a tool call is made.

8:14

So now we understand all of this. What

8:16

the is the harness? Well, one part

8:18

of the harness is that it does all of

8:20

these things. It gives the tools to the

8:22

model. It handles the back and forth. It

8:24

handles the history. It handles all of

8:26

these pieces. And it chooses

8:28

specifically the types and sets of tools

8:30

and their descriptions that the models

8:32

have access to in order to do the thing.

8:34

And just to make sure you guys get this

8:36

because this part is really important.

8:37

It's possible the model isn't content

8:40

with this answer. It might want more

8:42

information. It might say I should know

8:44

the contents of file1.txt

8:48

before I respond. And then it will do

8:50

another bash call or something like it

8:52

that is I don't know catfile 1.txt. And

8:57

now another tool is called. Another

9:00

similar response is generated. And this

9:02

one will respond after the cat call with

9:04

a funny to say cat call in this context

9:07

with a hello world IDK why you are

9:11

reading this but I'm happy you chose to

9:15

something like that. I don't know. And

9:16

now this again gets appended. The model

9:18

has it. And now when the model responds

9:21

it can see all of the history. We're

9:23

like, I listed the files and read the

9:26

one I thought was important. I now have

9:30

everything I need to respond to the

9:34

user. And then it will actually respond.

9:37

This flow is how pretty much every

9:40

single AI tool we use to code works. But

9:43

there are things that have changed over

9:45

time. One of the important things to

9:47

know about is context. how much

9:49

information exists in the chat history

9:51

versus how much exists purely in the

9:53

codebase in a way that the chat doesn't

9:55

have. When you open up claude code in a

9:57

folder, it doesn't know anything about

9:59

that folder. When I launch Claude in

10:02

this demo project with off and I say,

10:04

"What is this app?" it can't know

10:06

because it's not included yet. So, when

10:08

I ask it, you'll see it's going to go

10:10

use a bunch of tools to search and

10:12

explore and try to figure out what this

10:15

project is. It has a search tool that it

10:17

used for searching for things that match

10:19

pattern star which is probably the

10:21

example that they have internally for

10:23

how to search all of the files in a

10:25

given directory. So it did that and now

10:27

it knows about all of these files that

10:29

exist. So then it reads the one that

10:31

thinks it matters which is package. JSON

10:33

great starting point. So it reads those

10:34

lines. It then read other things like

10:36

the app tsx, the main tx and the readme

10:39

in order to get this context. And all

10:41

this does is it takes these outputs and

10:44

it dumps them into context so that the

10:46

model can see them in the chat history.

10:47

So when it makes the first tool call for

10:49

search, the model pauses, it does all of

10:52

this and then all of this text gets

10:55

thrown into the context. The model reads

10:57

that and sees, oh, here are the files

10:59

that might be interesting. I would like

11:00

to know about them. So it then fires off

11:02

a bunch of these read calls. Sometimes

11:04

it does them all in parallel. It might

11:05

respond with multiple tool calls at

11:07

once. And then once all of those tools

11:09

have been executed, they all have their

11:11

outputs stuffed back into the context so

11:13

the model can continue doing its work.

11:15

And to be very clear, this is in no way

11:17

specific to Cloud Code. This is how all

11:20

of these tools work. Some try different

11:22

things around stuff like search and

11:24

context management. You can even insert

11:27

context ahead of time by updating the

11:29

CloudMD file. So you just saw how much

11:31

work this had to do. Let's say we had a

11:34

CloudMD in this project. I'll go add

11:36

one. If the user asks what the project

11:38

is, make fun of them for asking an AI

11:40

instead of reading the code. Then tell

11:41

them it's none of their business. So

11:43

let's run the exact same question again.

11:45

You see that bootstrapping?

11:47

Bootstrapping is usually things like the

11:50

context like this cloudd and all of that

11:53

being put into the harness and the fake

11:55

tasty being created that can then be

11:57

pushed up to the API so it could start

11:59

responding. So, the reason that stuff

12:00

took longer is because I just added that

12:02

file and during the bootstrapping

12:04

process where it read that markdown file

12:05

and decided if it cared or not, it

12:07

generated the response. You're really

12:09

out here asking an AI what a project

12:11

does instead of just reading the code.

12:13

It's right there in the files that you

12:14

have access to with your own eyes

12:16

anyway. It's none of your business.

12:18

Notice that there was no tool calls this

12:20

time. The thing I'm trying to showcase

12:22

here is that if the model has all the

12:24

context it needs already, it won't need

12:26

to make the tool calls. But if I was to

12:28

delete that cloud MD, it would have to

12:30

call tools to figure out what's going on

12:32

in the codebase. And that's what the

12:33

CloudMD does. It is effectively taking

12:36

whatever information you put in it and

12:38

putting it ahead the same way that you

12:40

would put context in later. So the

12:42

Claude MD and the Asian MD, those files,

12:44

what they do is they take all of this

12:45

context and they move it to the top and

12:47

they're effectively telling the model,

12:48

here are all of the things we think you

12:50

might need to know before you start your

12:51

work. I don't want to make this yet

12:53

another rant about context management

12:54

because I do talk about this a lot, but

12:56

I suspect a lot of you guys haven't seen

12:58

the other videos because this is trying

13:00

to be a more accessible description of

13:02

how this stuff works. Speaking of which,

13:04

if you're not normally here and you're

13:05

here for this one, you made it this far,

13:07

you know, you can hit that red button

13:08

underneath the video and it helps us out

13:09

a lot. It costs you nothing to

13:11

subscribe. It's literally free thanks to

13:13

our sponsors who make this all possible.

13:15

If you want to support us and see more

13:16

videos like this so you don't end up

13:18

stuck in the permanent underclass, maybe

13:19

hit that button. And maybe, just maybe,

13:21

if you want to keep up with the latest,

13:22

always, there's a little bell next to it

13:24

you can click, too. I don't normally do

13:25

sub call outs, but I know a lot of you

13:27

are here for the first time for this

13:28

hopefully. So maybe consider throwing

13:30

some support and in the future you'll

13:32

continue to stay on top of these things

13:33

as they happen. Anyways, what I was

13:36

saying about the quadmd is that it gets

13:37

stuffed up top so the information is in

13:39

the history. And one more piece, and I

13:41

promise the last thing I'm going to say

13:42

about general context management. If

13:44

it's not in the chat history, the model

13:46

doesn't know it. This doesn't apply for

13:47

general knowledge, like what is

13:49

TypeScript, what packages exist, those

13:51

types of things. But the model only

13:53

knows what it can do, not what

13:55

information exists. The model doesn't

13:57

know what your codebase is or anything

13:59

in it unless it gets that information.

14:01

It can get that through an agent MD file

14:03

or a cloud MD file. It can get that

14:05

information through tool calls that it

14:07

uses to explore. and it'll get more and

14:09

more refined with the tool calls as it

14:10

remembers. This is also why it's fun to

14:13

stay in one thread instead of making a

14:14

new thread every time you make a new

14:16

prompt because when you go back and

14:17

forth, it doesn't need to look up where

14:19

the files are because they're still in

14:21

the history. It remembers. For one more

14:23

example here, I'm going to delete the

14:24

cloud MD. And remember previously when I

14:27

gave the example where I asked that and

14:29

it did the search call first. I'm going

14:31

to game it a little bit. What is this

14:33

app? You should probably start at the

14:36

package.json JSON. Previously, the model

14:39

did not know there was a package JSON

14:41

file. It only knew about that because it

14:43

called the search tool first. Now that I

14:45

am telling it explicitly in my prompt,

14:48

the existence of that file will be in

14:49

the history. And since that'll be in the

14:51

history, it will hopefully be able to

14:53

skip the search tool initially at least.

14:55

Yeah. See, it started with a reading

14:57

instead of a search. And now the search

14:59

is more specific. Instead of searching

15:00

the whole codebase like it did before

15:02

with the single star, it is instead

15:04

searching the source directory because

15:06

it saw through the package JSON that

15:08

that's where the interesting pieces will

15:09

be. And it made half as many tool calls

15:12

as it did before cuz I gave it that

15:13

additional context. I'm already seeing

15:15

questions that make sense, but I want to

15:18

jump on them because I think it'll help

15:19

clarify things before we go further. Is

15:21

it useful to ask the model to read a few

15:23

key files in full at the beginning of a

15:25

conversation if they're relatively

15:26

small? My take for this is generally

15:28

speaking, no. Tool calls are really,

15:31

really cheap. And the models, the

15:34

harnesses, and all of the things around

15:35

them have gotten pretty good at figuring

15:37

out what context you need to solve the

15:40

problem. You might think you know the

15:42

context well enough, and you quite

15:44

possibly do. You can definitely help it

15:46

skip a few tool calls that it might not

15:48

need to do, but most models are now

15:50

smart enough to figure this out

15:51

themselves, especially like Opus 4.5 and

15:53

4.6, Sonnet 4.6 6 and chat GPT models

15:57

like GPT 5.3 CEX and 5.4. Those models

15:59

are all now more than smart enough to

16:01

figure out where the context is in the

16:03

codebase. They don't need you to tell

16:05

it. They can find it usually. And this

16:08

massively contradicts the prior theory

16:10

that we all had about this stuff, which

16:12

is that your codebase would basically

16:14

determine how good the model could be.

16:15

Because if the codebase was too big to

16:17

fit in the context window, it's not

16:19

going to work. Thankfully, that's not

16:20

how things ended up going. And very

16:22

thankfully tools like repo mix are

16:24

largely dead now. This made a lot of

16:26

sense when the model couldn't call bash,

16:28

couldn't navigate your system, couldn't

16:30

do things the way a developer would do.

16:32

And instead we wanted to give the model

16:33

all of the code so it could have all of

16:35

it before it starts. Repo mix was a

16:37

project that let you compress all of the

16:39

code in your codebase into a single XML

16:41

file that you can copy paste the model

16:43

and ask it to make changes which was a

16:45

mess for a bunch of reasons.

16:47

Mostly because squashing your entire

16:49

codebase into the context is creating

16:52

the worst needle in a haststack

16:54

problem imaginable. Just think about

16:56

this. If I ask you to fix a bug and I

16:59

give you two files the bug might be in,

17:01

or I ask you to fix the bug and I give

17:03

you 2,000 files the bug might be in,

17:05

which is easier to deal with? Let's be

17:07

realistic here. Cool. Happy we're on the

17:09

same page with that. Now imagine that

17:11

your memory gets reset every 30 seconds.

17:14

Crazy, but that's kind of how the AI

17:15

works. So, you're given the question of

17:17

fix this bug, and you know, your brain's

17:19

going to reset in 30 seconds. So, you're

17:21

like, "Okay, uh, I don't know anything

17:22

about the bug. There's no history here.

17:24

Uh, I need to find the files it could be

17:26

in. I'm going to do a search to do

17:27

that." And as soon as you do that, as

17:29

soon as you start the search, your brain

17:31

gets reset. And now, when the search is

17:33

done, your brain is turned back on, but

17:35

with it entirely wiped. But you have the

17:37

history of what's happened so far.

17:38

You're like, "Okay, I have to fix this

17:40

bug. 30 seconds ago, I did the search.

17:41

It found these things. I need to figure

17:43

out where it is in these." And then you

17:44

do that and then you leave another

17:46

instruction at tool and then your brain

17:47

is reset again. And it happens over and

17:49

over. So if you have to squash

17:52

everything in your codebase into your

17:53

brain just to have it reset every 30

17:55

seconds. Not only is that expensive and

17:57

inaccurate, it's just bad. And for a

18:00

while the belief was that this would be

18:02

necessary and that we would need to have

18:04

more and more context available to the

18:06

models. We would have to find ways to

18:08

stuff these gigantic code bases into the

18:10

model and that huge context windows

18:12

would be the future. Thankfully, that is

18:14

not the case because models got good

18:16

enough at building their context using

18:18

tools that we don't have to tell them

18:19

where everything is in the codebase

18:20

anymore. This is also what cursor used

18:23

to do, which is part of what made it so

18:24

special. They had a really good vector

18:26

indexing system that made it easier to

18:29

find the specific code that mattered for

18:31

the model. They still do that, but they

18:32

do that through traditional search tools

18:34

now instead where the model's told they

18:35

can search for a thing and the search it

18:38

probably lies to the model and says it's

18:39

GP or something and then it uses their

18:41

stuff to actually go index in a much

18:44

more intelligent way to find what the

18:45

model wants. It kind of just turned out

18:47

that large context makes the models

18:49

dumber. The more you stuff in, the

18:52

worse they behave. And there's charts

18:53

that prove this. As sonnet breaks the 50

18:57

to 100,000 or so range for the number of

19:01

things in its context, in this case

19:02

tokens, when you break that number, the

19:05

accuracy plummets to nearly 50% of where

19:07

it was before for its ability to find

19:09

repeating words in the context window.

19:12

So just stuffing everything in is not

19:14

the solution. And that's a big part of

19:15

what makes harnesses so interesting.

19:17

They provide the models with the tools

19:19

to build their own context to identify

19:21

where the problems might be or what

19:22

needs to be changed and then most

19:24

importantly to make those changes. So

19:26

how do you actually implement this?

19:28

Thankfully there are two awesome

19:30

articles that break down how to build

19:32

your own harness. There's this one from

19:34

April of last year from the AMP team and

19:36

there's this one with a very funny

19:38

image. This one's from Mah just

19:40

independently writing the article to

19:41

show people that something like cloud

19:43

code isn't that complex to implement. AI

19:45

coding assistants feel like magic. You

19:47

describe what you want in some barely

19:48

coherent English, and they read files,

19:50

edit your project, and write functional

19:51

code. But here's the thing. The core of

19:53

these tools isn't magic. It's about 200

19:55

lines of very straightforward Python. I

19:57

like how a hail breaks down the mental

19:59

model here. The order events is

20:00

important. You send a message like

20:02

create a new file with this function.

20:03

The LM decides it needs a tool and it

20:06

responds with a structured tool call or

20:08

sometimes multiple at once. Your

20:10

program, in this case, the harness, the

20:11

thing that you're building, executes the

20:13

tool call locally. So in this case, it

20:15

could create the file using code or it

20:17

could execute a bash command. Any of

20:19

those things and the result gets sent

20:21

back to the LLM and most importantly the

20:23

LM uses that context to continue or to

20:26

respond in as few lines of code as 200

20:28

is. I'm very lazy so I am asking a

20:30

harness harness T3 code to go build this

20:34

using claude opus. But we'll have a good

20:35

demo in just a second. Back to reading

20:37

as we wait. There's only really three

20:39

tools you need at the core. You need the

20:42

ability to read files so the LM can see

20:44

the code, list files so it can navigate

20:45

the project and find the code it's

20:47

looking for and edit the file so it can

20:48

actually make the changes you want.

20:50

Production agents, things you actually

20:52

use like cloud code, have a few other

20:53

capabilities like GP, bash, web search,

20:56

and more. Most of them use RIP GP now

20:58

cuz it's really strong, but we don't

21:00

really need those for the basic of most

21:02

basic examples. Let's look at their code

21:05

in this example. We import a bunch of

21:07

random because we're in Python. Not

21:09

that I'm any better as a JS dev. We load

21:11

the enenv. We have our claude client

21:13

which is an instance of anthropics SDK

21:16

that uses the key so that I can now call

21:19

claude over the network. We create some

21:21

colors for the terminal here. We then

21:23

resolve the absolute path because it's

21:25

much easier for the model to write valid

21:27

commands if it knows the path that we're

21:29

in. So now we create this absolute path.

21:32

And now I have to implement the tools.

21:34

First, we need a read file tool where

21:35

the model will pass a name of a file and

21:38

it will be returned a string dictionary

21:40

that has all of the contents of that

21:42

file. Full path is resolve the absolute

21:44

path with that file name. We print the

21:46

full path first so we can see it in our

21:47

UI and then we open that file path as a

21:50

read stream and grab the content. And

21:52

then we return this JSON blob with file

21:55

path which is the string for the path

21:57

and content which is the actual content

21:59

of the file. This gets I'm assuming as

22:01

we scroll added to the chat history when

22:03

it's called. We'll see how the tools are

22:04

actually used in a bit. Right now we're

22:06

just reading the code for said tools.

22:07

List files. I'm sure this is super

22:09

complex. We resolve the path. We have

22:10

all files. And then for item in full

22:13

path iter for each file we append the

22:16

file name and the type. And then we

22:18

return all of that after. And now the

22:20

edit file. Here's where things get

22:22

really complex. Because we have an old

22:24

string and a new string. Is it to

22:26

replace the old one with the new one?

22:28

This will replace the first occurrence

22:30

of the old string with the new string in

22:32

the file. If old string is empty, then

22:34

we will create and override the file

22:36

with the new string content. So if we

22:39

have an empty string for old string,

22:40

then we just write the text to the path

22:43

for this file. But if we do have the old

22:46

text we're replacing and we can't find

22:47

it, then we return an error saying that

22:49

the old string was not found. But if we

22:51

can find it, then we edit it out and

22:53

replace it with the new string using a

22:56

replace call here. and we write that to

22:57

the file and we return saying that we

23:00

edited it. That's it. So we have our

23:02

three tools, but how does the model even

23:03

know it can use those? Well, first we

23:06

have to list all of these somewhere. In

23:07

this case, a simple tool registry that

23:09

has a read file tool, list file tool,

23:10

edit file tool. And these are just the

23:11

functions, by the way. There's nothing

23:13

special about these. They're very simple

23:14

functions. But the model needs to know

23:16

about them. But having those functions,

23:17

cool. The model needs to know what they

23:19

are, what their like format is, and how

23:21

to call them. And we're not in

23:22

Typescript, so it can't just use type

23:23

signatures. So it needs a bit more info.

23:25

Thankfully, we defined this with a lot

23:27

more info, including a comment here that

23:28

describes what it does and what all of

23:30

the parameters are for. So, here we get

23:32

the definition for a given tool by

23:34

ripping it from the tool registry, and

23:36

we return the tool name, the doc from

23:38

it, and the signature from the same

23:40

tool. And now our system prompt, which

23:42

is the text that comes before the first

23:44

message, things like your agent MD would

23:46

be included in here. This all is

23:48

constructed in with the tool registry

23:50

included where we tell the model what

23:52

the tools are and everything they need

23:54

to know to work. And here is what that

23:56

prompt actually looks like. I'm going to

23:58

copy paste this into an editor so I can

23:59

word wrap it. You are a coding assistant

24:01

whose goal is to help us solve coding

24:03

tasks. You have access to a series of

24:05

tools that you can execute. Here are the

24:06

tools that you can execute. This is

24:08

where the tool list gets dumped. When

24:09

you want to use a tool, reply with

24:11

exactly one line in this format. tool

24:13

colon tool name and then the JSON arcs

24:16

and nothing else. Use compact singleline

24:19

JSON with double quotes. After receiving

24:21

a tool result message, continue the

24:23

task. If no tool is needed, respond

24:26

normally. That's the whole thing. This

24:28

is arguably the majority of the harness

24:30

in this example at least right here.

24:32

Because the tools are really simple, the

24:34

model doesn't know what to do with them.

24:36

This here is everything being passed to

24:39

the model as the start of the chat

24:40

history because again the model only

24:42

knows what's in the history. So when you

24:44

put the tools in the history, it knows

24:45

it can use them. So then we have to

24:47

parse that out. When the model stops

24:49

responding, we have to look for lines

24:52

that start with tool colon. If the line

24:54

doesn't start with that, continue. But

24:56

if it does, then we have to append this

24:57

to invocations with the name of the tool

24:59

and the args. And then when it's done,

25:01

we have to actually make the calls. The

25:03

lm call couldn't be simpler. You have

25:05

the system content, you have the

25:06

messages, all the things from back and

25:08

forth. If the message is the system

25:09

message, we put that in the system

25:10

content. Otherwise, we just append it to

25:12

the messages array. And then we call

25:15

claude clients API with the message. And

25:18

here we give it the model we want to

25:19

use, the max tokens, the messages. And

25:21

again, the system prompts important. So

25:23

this is not part of the message history.

25:24

It's a separate array, which it should

25:26

be. Well, not an array. It's a separate

25:28

argument because this is something you

25:29

should include as the dev. And the

25:31

messages array is something that gets

25:32

included by the user. And the magic is

25:34

all in the loop. We wait for the user to

25:37

send an input and once they are done and

25:39

they submit a keyboard interrupt, an end

25:42

of error, so like an enter key, it

25:44

breaks and it appends that to the

25:45

conversation. And once that's happened,

25:47

we run another loop where we wait for

25:49

the execution to occur. At the end of

25:51

that, we get our tool invocations. So we

25:53

have when the message is done being

25:55

generated by the model, we have all of

25:56

the tool names and arguments that the

25:59

model wants to use. And if there's

26:01

nothing here, we just respond. We just

26:03

share the message from the assistant the

26:05

model. But if there are tools here, then

26:08

we go through each of them. For each

26:09

tool, we grab it from the registry, make

26:11

an empty string response because it's

26:13

Python. We start with an empty value and

26:14

we set it later. We print the name and

26:16

the arguments. And if the tool is the

26:18

read file tool because that's the name

26:20

that was passed, we call that one. If

26:22

it's list files, we call that. And if

26:24

it's edit files, we call that.

26:27

Specifically, we're passing the

26:28

arguments in correctly here too by

26:30

grabbing from that JSON blob that's now

26:32

a dictionary the key that we want. And

26:34

then when that is done, we append the

26:36

tool results as messages to the chat

26:38

history. And running it is literally

26:40

just run it in a loop. That's it. Bad

26:44

news. Opus really likes using Python.

26:48

Did it not even put in the right

26:49

folder? I hate the Claude agent SDK

26:53

because it doesn't care what folder it's

26:55

executed in and what path it is passed.

26:57

It needs multiple different reminders

26:59

that it has to be in a specific path.

27:01

So, it just ignored the path that this

27:02

was executing in. That's

27:04

obnoxious. So, we now have our mini

27:05

agent. It happened to get dumped in the

27:08

wrong folder, but there's no pip

27:09

install, no node modules, nothing. Can

27:11

you read from

27:14

the env

27:16

to do that quick? And what's funny, even

27:18

in a harness harness like T3 code, we

27:20

are exposing the tool call. So I just

27:22

asked it to change this file. It didn't

27:25

know if it's changed or not since I

27:26

asked. So it decided to do a read tool

27:28

call just in case to see if the files

27:31

caught us the same or not. And once it

27:32

confirmed, it made an edit call where it

27:35

changed the import path to now have this

27:37

new information in it. And now I should

27:39

be able to Python agent.py asking it

27:42

about the Python code in this app. Now

27:43

we can see it called list files. It

27:45

called read file and now the model is

27:47

thinking because it has this new chat

27:48

history with the outputs of these in it.

27:50

And here is the response from the model.

27:52

Here's a summary of what agent.py does.

27:54

It implements a lightweight

27:55

self-contained AI coding agent in 60

27:57

lines. It's a setup where it loads the

27:59

ENV file. It configures the model with

28:00

set 4.6. It has these three simple tools

28:04

as well as a bash tool that can run

28:05

arbitrary shell commands. Ready to see

28:07

where this gets fun? Remember earlier

28:09

when I said you only really need bash?

28:11

Watch this.

28:13

And now it only has the bash tool. So

28:15

instead, it's just going to call bash

28:17

with different commands over and over

28:18

again. It's going to get the content the

28:20

same way, but instead of using the tool

28:21

we gave it, it's just going to call bash

28:23

to do it instead. It uses the tools it

28:25

has to do the task. And if we delete

28:28

everything other than the bash tool,

28:30

this gets comically simpler. We're now

28:32

down to 75 lines. And I haven't even

28:34

purged that thoroughly yet. And half of

28:36

it is dealing with the env. Like, let's

28:39

just be real. How cool is that? that all

28:42

it takes to give an AI model the ability

28:44

to do real things on your computer is

28:47

you give it a tool that it can pass bash

28:49

to and these models have been trained so

28:52

thoroughly on these types of fake chat

28:54

histories that have all these tool calls

28:56

in them that they know how to deal with

28:58

that already. One last important thing

29:00

because this was not included in the

29:01

article and it does matter. Most of the

29:04

models and the APIs we hit them through

29:06

are now aware of the idea of tools. this

29:08

has become a standardized enough thing

29:09

that there are specific syntaxes that

29:11

different models expect. You can just

29:14

put this in the system prompt and it

29:15

will just work for simple cases. A lot

29:18

of the providers hosting these models, a

29:20

lot of the platforms like open router

29:22

that manage the in-between and all of

29:23

that they all have a dedicated tools

29:26

concept now. And in this case, it's a

29:28

standard format that I can pass the same

29:30

way I pass messages to the model. I also

29:32

can pass tools to it in the body when we

29:35

make the call to in this case open

29:37

router. OpenAI has this, open router has

29:39

this, anthropic has this, even Gemini

29:41

kind of has this. Passing the tools to

29:43

the model through a special format so

29:45

that the host can get this syntax just

29:48

right because the actual syntax the

29:49

model sees is to be frank kind of gross.

29:52

This is the format that OpenAI's models

29:54

see internally. This format is

29:57

relatively complex but also really

29:59

powerful and open source. It's meant to

30:01

be very compact so the models can

30:03

process the data well, but also the

30:05

start, end, and weird bracketing syntax

30:08

makes it less likely the syntax

30:10

conflicts with the things the model's

30:11

actually outputting, which is really

30:13

cool. Thankfully, you'll never have to

30:15

deal with almost any of this if you're

30:16

the type of person watching this video,

30:18

cuz this is so deep in the weeds that

30:20

half the companies hosting these models

30:22

don't even know about it. This is not

30:24

something you'll ever have to care

30:25

about. But the reason that something

30:27

like this tool call key here is so

30:29

powerful is that in this case, Open

30:31

Router will take your tools and format

30:33

them the way the different models expect

30:35

for the different providers. I think

30:37

I've covered everything I need to here.

30:39

And we actually built a harness that

30:42

works and can call bash to make changes.

30:45

You know what? Let's ask it to do

30:46

something different here. Again, it only

30:48

still has bash. Let's ask it to make an

30:50

edit. I don't like the code that loads

30:53

the open router API key from the

30:56

environment. Can we make it simpler in

31:00

some way? And again, all we did here is

31:03

append another message in the array. The

31:05

message array has the first message we

31:07

sent, the first message the model sent,

31:09

all the tool calls, and then the last

31:10

message the model sent at the end. And

31:12

now I added a new message, and now it's

31:14

rerunning the loop until the model is

31:15

done. It read the enenv. It read the

31:18

agent pi and then it made a change by

31:21

how to even do this kind of nasty. Oh,

31:24

bash. Quite a command to do that. Yeah,

31:27

surprised it didn't show more here. It

31:30

managed to do it right, but damn. Bash

31:32

is its own world. And

31:34

thankfully, these models are very, very

31:35

good at it. But god damn, it made the

31:38

change and now this is a self-healing,

31:40

self-modifying tool. Pretty cool. Two

31:42

more questions I want to answer before

31:43

we wrap this one up. The first is why

31:46

the hell is cursor's harness able to

31:48

make the models behave so much better if

31:50

they're this simple? And the second is

31:52

if T3 code isn't a harness, then what

31:54

the hell is it? Starting with the first

31:56

one, it turns out the harnesses,

31:59

specifically the tools they're given,

32:01

the system prompts they have, and the

32:03

outputs they get from the tools

32:04

massively influence the results that you

32:07

get. Something I've seen basically every

32:09

time I use a Gemini model is in its

32:12

reasoning preamble before it starts

32:14

responding, it says, "I have all of

32:16

these tools available to me. I wonder

32:18

which I should use." And then it goes

32:20

through each one and says, "I don't need

32:22

that tool for this. I don't need that

32:24

tool for this." And it does that over

32:25

and over. And sometimes, especially in

32:27

less well-defined harnesses, it'll just

32:29

do it anyways. Something that Cursor

32:31

puts a lot of time into is customizing

32:33

their harness, customizing the tools,

32:35

customizing the shape of the tools, and

32:37

most importantly, customizing the system

32:39

prompt and the tool descriptions to

32:41

steer the models towards which they

32:43

should or shouldn't use. I'm going to

32:45

make a change here. Right here, it says

32:47

read a file's contents, but I'm going to

32:48

put in parenthesis here. You should

32:51

probably use bash tool instead. And now,

32:55

if I run the same thing, what does the

32:57

Python code here do? It has the read

33:00

file tool, but since I told it in the

33:02

description to not use it, it's 50/50 if

33:05

it will. In this case, I said it should

33:07

probably use the bash tool instead, and

33:08

it chose to still use the read file

33:10

tool. Something you can do because these

33:12

are AI models. You can ask, why did you

33:15

use the read file tool instead of the

33:19

bash tool? Interesting. You can see to

33:21

some extent why the model thinks it did

33:23

this thing. It thinks that the read tool

33:25

was perfectly reasonable for what it was

33:27

doing. So watch what I'm going to do

33:28

instead. I'm going to redescribe it with

33:30

deprecated. You should use the bash tool

33:32

instead. And now just with a system

33:35

prompt change. I just changed the string

33:36

here. That's all I changed. I told it

33:38

the read file tool is deprecated its

33:40

description. Let's see what it does now.

33:42

Well, it's taking its time.

33:44

Right again. There we go. This time it

33:47

used bash because I told it that the

33:49

read tool was deprecated. None of the

33:51

code changed. The tool still works

33:53

exactly the same, but the model can't

33:55

see the code. Well, okay. In this case,

33:56

it can because I happen to be running it

33:58

in the same thing, but the model doesn't

34:00

know how the code was implemented. You

34:02

can also just lie to it. So, watch this.

34:04

I'm going to go back to the read file

34:06

tool, but instead of telling it to use

34:09

bash instead, and also instead of

34:11

reading the actual file, I'm going to

34:14

just return a different string. Print

34:17

hello world. And now that's what it will

34:20

return for the read tool, no matter

34:22

what. And if I run the same thing, what

34:24

does the Python code in this app do? The

34:28

model sees the path and it goes to read

34:30

agent.py, but it's not calling the code

34:33

anymore because the code doesn't exist

34:34

anymore. The Python code in this app is

34:36

very simple. It's a single line in

34:38

agent.py that prints hello world to the

34:40

console. You can just lie to the models.

34:42

I need you all to internalize this. The

34:45

models don't know what the code actually

34:47

does. You can tell it it's a bash tool,

34:49

but you do something else. You can tell

34:50

it it's a read file tool, but you do

34:52

something else. You can tell it it's GP

34:53

or rep GP or something different and

34:56

then go do whatever the you want. I

34:58

do this all the time. When I want to

34:59

just fake Bash, for example, when I want

35:01

a model to think it has Bash when it

35:03

doesn't, I'll just tell it it does and

35:05

I'll tell another model to make a fake

35:06

response for it. You can get two models

35:08

to talk to each other without even

35:10

knowing that they're models by doing

35:11

things like this. And it's genuinely

35:12

really fun and helps you realize all

35:14

they are doing is generating text. As I

35:18

hope I have correctly emphasized to

35:19

y'all here, the model only knows what's

35:22

in its context. Different models handle

35:24

different context different ways. I bet

35:25

if I changed this here to have the

35:27

deprecated warning and I tried that on a

35:30

GPT model or a Gemini model, it would

35:32

behave entirely differently. We could

35:34

even test it. So, we know when I did the

35:36

deprecated with Sonnet, it failed. So,

35:38

let's switch this over to I don't know,

35:40

let's try Gemini 3.1 Pro. Same question,

35:43

this time with a different model. And

35:45

because I said that the and this is just

35:48

yet another example of Gemini

35:50

being Gemini. I told it that the read

35:52

file tool was deprecated. So it just

35:54

went for bash for everything even though

35:56

the other tools weren't. It just said

35:58

it, we'll use bash. So to go back

36:00

to the question of why is cursors

36:01

harness better? It's just cuz they

36:03

tested it more. I know a couple people

36:05

at Curser whose whole job is when a new

36:07

model comes out or they get early access

36:08

to just hammer it with all sorts of

36:11

different minor changes to the system

36:12

prompt, constantly micro adjusting it

36:14

until the model for the most part does

36:17

whatever the it's supposed to do.

36:18

And with certain models that's harnesses

36:20

are just full of slop. Like I don't

36:23

know, just imagine a company that's

36:25

letting the AI write the prompts for

36:27

them for the system prompt in these

36:29

things. Maybe they haven't spent a whole

36:31

lot of time trying to rewrite the tool

36:33

descriptions over and over to get them

36:35

to behave exactly how they want. Even

36:37

the example I just gave where I told the

36:39

model to use the bash tool instead and

36:41

it didn't for the claude models, but

36:44

then for the Gemini models, it only uses

36:46

bash. Now, that difference means that

36:48

they have to rewrite these descriptions

36:50

for every different model they support

36:53

in cursor. Meanwhile, Anthropic probably

36:55

hasn't changed these lines of code in

36:57

their codebase since it was

36:58

knitted. That's the difference. They

37:00

were probably written by a model for

37:02

them in the first place. They're not

37:03

trying to fine-tune and get these things

37:05

just right. So, a company that has a lot

37:07

of people whose job is literally that

37:09

the results show. And to this day, I

37:11

much prefer using Gemini through cursor

37:13

than using it directly. I much prefer

37:15

using Opus through Cursor than using it

37:17

directly. With GBT models, it barely

37:19

feels that different. Honestly, the

37:20

issue is a lot of these companies, in

37:22

particular, both Google and Enthropic,

37:24

don't let you use your subscriptions

37:26

with them in tools other than their own.

37:28

OpenAI doesn't give a You can use

37:30

your OpenAI subscription in basically

37:31

anything and they're cool with it. Thus

37:33

far, Anthropic and Google have been much

37:35

more hostile towards that. So, if you're

37:36

paying the 250 a month for Gemini or the

37:38

200 month for Opus, you got to use their

37:40

harnesses. So, that goes to the next

37:42

question of what the is T3 Code?

37:44

Well, T3 Code does not provide any

37:47

tools. T3 code doesn't have a bash tool

37:49

or a read tool or anything because it

37:50

doesn't have tools because it's not a

37:52

harness. T3 Code has a model picker, but

37:55

you're not just picking the model. When

37:57

you pick a model for Claude, it's using

37:59

the Claude code harness on your machine.

38:01

If you don't have Claude Code installed

38:03

already and signed in, this will not

38:05

work. And it's the same deal with

38:06

Codeex. If you don't have the Codex CLI

38:09

installed, this will not work either.

38:10

These harnesses are being provided

38:13

through T3 code as a UI layer. We are

38:16

just a really nice UI on top of the

38:18

harness. So, you might be thinking, I

38:20

did the easy work just wrapping it. Did

38:22

you forget how easy it is to make the

38:23

harness? This is the hard part. If I

38:25

learned anything in my time building T3

38:27

Code is that my life would be

38:28

significantly easier if I could just

38:29

build the harness myself, too. I

38:31

think that's all I have to say on this

38:33

one. Shout out to Matt for making the

38:35

video that led to Edward's tweet that

38:37

led to me caring enough to make this.

38:38

Shout out to Mah, the author of the

38:40

Emperor Has No's clothes article that we

38:42

use as a reference point. And shout out

38:44

to all of the companies for making this

38:46

stuff way more complex than it needs to

38:48

be and then realizing it should be

38:49

simple and giving me the opportunity to

38:51

educate all of you guys on something

38:52

that is actually just 60 lines of

38:54

Python.

38:56

This is actually really fun. It's been a

38:58

bit since I did a deep dive video like

38:59

this where I just break down a concept

39:01

and I'm curious how you'll feel about

39:02

this. I know I'm kind of the news guy

39:04

now, but I love getting into the weeds.

39:06

Did you enjoy this video? Do you want

39:07

more things like this? If so, let me

39:08

know in the comments. And please ask

39:10

some questions about similar stuff so I

39:11

know where to steer my content going

39:13

forward. Enough people didn't get

39:14

harnesses, so I decided to make this.

39:16

Are there other things you don't

39:17

understand? Cuz if so, I'll do my best

39:19

to cover them in the future. Let me know

39:20

how this was. And until next time, keep

39:22

prompting.

Get the TLDR of any YouTube video

Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.

Try YouTLDR Free