Full Transcript

Building pi in a World of Slop — Mario Zechner

18:11 · English · Transcribed Apr 19, 2026

0:14

Hey there, I'm Mario. I built pi in a

0:17

world of slop and this is a strategy, a

0:19

tragedy in three acts. Just to talk

0:22

about this real quick, bunch of people

0:23

on the internet gave me money for ad

0:25

space on my torso and all of that goes

0:26

to a charity. So yeah, thanks guys.

0:29

So, act one: building pi. In the beginning

0:32

there was Claude Code, and it was good, right?

0:34

We all got basically catnipped by that

0:37

thing and stopped sleeping. Um, bunch of

0:41

stuff before that, but Claude Code

0:42

was the one thing that kind of clicked

0:44

with me the most. And to preface all of

0:46

this: I love the Claude Code team. They

0:48

are brilliant people, talented, super high

0:50

velocity. So, uh, they also created the

0:53

entire game, major props to them. So this

0:56

is not a roast. This is just me, an old

0:58

man, telling you why I stopped using

0:59

Claude Code and built my own thing. Um, in

1:03

2025 I started using Claude Code in about

1:05

April, I think, thanks to Peter, uh, because

1:08

he told us the agents are working now.

1:12

And back then it was simple and

1:13

predictable and fit my workflow. But

1:15

eventually

1:17

the token madness got hold of them, I

1:19

think, and the team got bigger, and they

1:20

started, uh, dogfooding that stuff and

1:23

built a lot of features, a lot of

1:24

features I don't need, which is fine. I

1:26

can just ignore them. But with velocity

1:28

and more features come more bugs, and

1:30

that's bad, because I used to work at

1:33

construction sites, and if my hammer

1:35

breaks every day, I'm getting really mad,

1:36

and if my development tools break every

1:38

day, I'm also getting mad. So there was

1:41

this, it's just a running gag, and here's

1:43

Thariq telling us that Claude Code is now a

1:44

game engine, and here's Mitchell from

1:46

Ghostty telling us no, it's not. And

1:48

eventually they fixed the flicker, but

1:50

then other stuff broke, and I think

1:51

they're now in the third iteration of a

1:54

TUI renderer. Yeah, but that's just a

1:56

symptom. The real problem is that my

1:58

context wasn't my context. Claude Code is

2:01

the thing that controls my context. And

2:03

behind my back, Claude Code does things,

2:05

uh, to the context. So you have the

2:08

system prompt which changes on every

2:09

release, including the tool definitions.

2:11

They would remove tools, modify tools.

2:14

It's not good. They would insert system

2:17

reminders in the most inopportune place in

2:20

your context, telling the model, "Here's

2:21

some information. It may or may not be

2:24

relevant to what you're doing." It

2:26

actually says it may or may not be

2:27

relevant to what you're doing. And that

2:29

kind of confused the model and that kind

2:30

of broke my workflows.

2:34

On top of all that, there's zero

2:35

observability because that's how the

2:36

tool is constructed and I like knowing

2:39

what my agents are doing. There's zero

2:41

model choice which is obvious. It's the

2:42

native Anthropic, uh, harness. So it makes

2:44

sense for them to want you to use Claude,

2:46

right? And there's almost zero

2:48

extensibility. And some of you might have

2:50

written some hooks for Claude Code, but

2:51

I'm telling you the number of hooks and

2:54

the depth of those hooks is very

2:55

shallow. Um, and every time a hook

2:58

triggers, what actually happens is a new

3:00

process gets spawned. Basically, the

3:01

command you specified for the hook to be

3:03

executed. And I don't find that

3:05

specifically efficient. So, I uh took a

3:08

step back and looked around for

3:09

alternatives. And I'd like to especially

3:11

call out Amp and Factory Droid, the

3:14

Porsche and Lamborghini of coding agent

3:16

harnesses. So, if you can afford them,

3:17

please use them. They're at the

3:18

frontier. They're really good, and the

3:20

teams are fantastic. And there's a bunch

3:22

of other options. And I have history in

3:23

OSS. So naturally I kind of gravitated

3:26

towards OpenCode, and again, brilliant

3:28

team, super high execution velocity, and

3:31

they don't sell you hype, they sell you

3:33

tools that work for the most part. I

3:36

started looking under the hood of

3:38

OpenCode, uh, with respect to context handling

3:40

as well because that's the most

3:41

important part for me and I found a

3:43

bunch of things like given some

3:45

conditions, OpenCode would just, uh, prune

3:49

tool output after a specific minimum

3:52

amount of tokens, and that basically

3:54

lobotomizes the model. Uh, there's also LSP

3:57

server support, which means every time

3:59

your model is calling the edit tool, OpenCode

4:02

goes to the LSP server that's

4:03

connected, asks "are there any errors?", and

4:06

if so, injects that as part of the edit

4:08

tool, uh, result, which is bad. Because

4:11

think about how you are editing code:

4:13

you're not writing a line of code,

4:15

checking the errors, writing the next

4:16

line, checking the errors. You don't do

4:18

that. You finish your work and then you

4:20

check the errors. This confuses the model.
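
To make that difference concrete, here is a minimal TypeScript sketch contrasting an edit tool that reports diagnostics on every call with one that defers checking until the work is finished. `Workspace`, `toyChecker`, and the result strings are hypothetical names made up for illustration. This is not OpenCode's or pi's actual code, and a real setup would query an LSP server instead of the toy checker.

```typescript
// Hypothetical types for illustration; not OpenCode's or pi's real code.
type Diagnostic = { file: string; message: string };
type Checker = (files: Map<string, string>) => Diagnostic[];

class Workspace {
  files = new Map<string, string>();
  checker: Checker;

  constructor(checker: Checker) {
    this.checker = checker;
  }

  // Per-edit style: every edit result also carries the current error count.
  editWithDiagnostics(file: string, content: string): string {
    this.files.set(file, content);
    const errors = this.checker(this.files);
    return `edited ${file}; ${errors.length} error(s)`;
  }

  // Deferred style: the edit result only confirms the edit.
  edit(file: string, content: string): string {
    this.files.set(file, content);
    return `edited ${file}`;
  }

  // Errors are requested once, after the whole change is in place.
  check(): Diagnostic[] {
    return this.checker(this.files);
  }
}

// Toy stand-in for an LSP server: flags calls to helper() while no file defines it.
const toyChecker: Checker = (files) => {
  const texts = Array.from(files.values());
  const defined = texts.some((text) => text.includes("function helper"));
  const out: Diagnostic[] = [];
  files.forEach((text, file) => {
    if (text.includes("helper(") && !defined) {
      out.push({ file, message: "helper is not defined" });
    }
  });
  return out;
};
```

With the per-edit variant, the first of two related edits reports an error that the second edit is about to resolve. With the deferred variant, the model only sees errors when it asks for them after the change is complete.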

4:23

There's a bunch of other things, like

4:24

storing individual messages of a session

4:26

in a JSON file. Each message is a

4:29

JSON file on disk. Uh, there was this, and

4:31

this happens to all of us. No, no blame

4:33

there. But it's not great if by default

4:36

a server spins up, CORS headers are

4:38

set in such a way that any website you

4:39

open in your browser can now access your

4:41

OpenCode server. That's... yeah. And

4:44

entirely unrelated to all of this, I

4:46

started looking into benchmarks for

4:47

coding agent harnesses and found

4:49

Terminal-Bench, um, which is a pretty good

4:52

benchmark, all things considered. And the

4:54

funny part about it is that it's the

4:56

most minimal kind of thing you can think

4:58

of. All it gives the model is a tool to

5:01

send keystrokes to a tmux session

5:03

and read the output of that tmux

5:05

session. There's no file tools, no

5:07

subagents, none of that stuff. And it's one

5:11

of the best performing harnesses in the

5:12

leaderboard. Here's the leaderboard from

5:14

December 2025. Irrespective of model

5:17

family, Terminus scores higher, mostly

5:20

even higher than the native harness

5:22

of that model. So what does that tell

5:25

us? I formed two theses. The first is: we are in the

5:28

[ __ ] around and find out phase of coding

5:30

agents, and their current form is not

5:31

their final form, right? So the second thesis

5:35

is: we need better ways to [ __ ] around,

5:37

and for me that means self-modifying,

5:40

malleable agents, things that the agent

5:42

itself can modify and I can modify

5:45

depending on my workflow. So I stripped

5:47

away all the things, built a minimal core,

5:49

but made it super extensible, and made it

5:52

so that the agent can modify itself,

5:55

with some creature comforts. It's not

5:56

entirely bare bones. Uh, so that's pi.

5:59

It's an agent that adapts to your

6:00

workflow instead of the other way

6:01

around. It comes with four packages. Uh

6:04

an AI package that's basically just an

6:06

abstraction across providers and context

6:08

handoff between providers. An agent core

6:11

uh which is just a while loop and the

6:12

tool calling. A bespoke TUI framework. I

6:15

come out of game development. So I built

6:17

a thing that actually doesn't flicker

6:18

too much. And the coding agent itself.
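
The "while loop and the tool calling" description of the agent core can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not pi's actual implementation: the `Message`, `Model`, and `Tool` types, the `runAgent` function, and the `maxTurns` safety limit are all hypothetical names chosen for this example.

```typescript
// Hypothetical types for illustration; pi's real agent core may look different.
type ToolCall = { id: string; name: string; args: Record<string, string> };
type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCalls: ToolCall[] }
  | { role: "tool"; callId: string; text: string };

// A model maps the context so far to a reply that may request tool calls.
type Model = (context: Message[]) => Promise<{ text: string; toolCalls: ToolCall[] }>;
type Tool = (args: Record<string, string>) => Promise<string>;

async function runAgent(
  model: Model,
  tools: Map<string, Tool>,
  prompt: string,
  maxTurns = 20, // assumed safety limit so the sketch cannot loop forever
): Promise<Message[]> {
  const context: Message[] = [{ role: "user", text: prompt }];
  let turn = 0;
  while (turn < maxTurns) {
    turn++;
    const reply = await model(context);
    context.push({ role: "assistant", text: reply.text, toolCalls: reply.toolCalls });
    // No tool calls means the model considers the task finished.
    if (reply.toolCalls.length === 0) break;
    // Run each requested tool and feed its output back into the context.
    for (const call of reply.toolCalls) {
      const tool = tools.get(call.name);
      const text = tool ? await tool(call.args) : `Unknown tool: ${call.name}`;
      context.push({ role: "tool", callId: call.id, text });
    }
  }
  return context;
}
```

Everything the model sees is in `context`, and nothing is added to it except the user prompt, the model's own replies, and tool results.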

6:21

Here's pi's system prompt.

6:23

That's it. Eventually the industry

6:26

created a new standard called skills

6:28

which is basically just markdown files.

6:30

So we added that as well. And that needs

6:31

to go in the system prompt. So,

6:33

begrudgingly, we had to add a couple more

6:35

lines. And finally, here's the magic

6:37

that makes pi able to modify itself. We

6:40

ship the documentation which was

6:42

handcrafted by me and an agent. Um, and

6:45

code examples of extensions,

6:48

and all we need to do for the agent to

6:50

modify itself is tell it, here's the

6:52

documentation. Here's some code that

6:54

shows you how to modify yourself by

6:55

writing extensions.

6:57

It comes with four tools. That's all it

6:59

has: read, write, edit, bash. Here's the

7:01

tool definitions. Don't read the

7:02

text. Just look at the size.

7:05

That's it. Here's what happens when you

7:08

start a new session in one of these

7:09

tools.

7:11

So the thing is the models are actually

7:13

reinforcement-trained up the wazoo. So

7:15

they know what a coding agent is because

7:17

a coding agent harness is basically what

7:19

they're being trained in when they are

7:20

post-trained. You don't need 10,000

7:22

tokens to tell them you're a coding

7:24

agent. They know because they are coding

7:26

agents. Now, pi is also YOLO by default

7:29

because my security needs are different

7:30

than yours. And I don't think a little

7:32

dialog that pops up every

7:35

time you call bash asking you to approve

7:38

is a smart security, uh, mechanism. So

7:41

instead I give you so much rope that you

7:44

can build anything that's fit for your

7:46

specific security needs. There's also

7:49

stuff that's not built in. I'm a he

7:53

because this is how I do it. But if you

7:56

don't like that, then you just ask pi to

7:57

build you subagent support or plan mode

8:00

or MCP support whatever you need.

8:02

Extensibility comes with a bunch of

8:04

table stakes and then with the

8:06

extensions themselves. And extensions in pi

8:08

are just TypeScript modules. In the

8:10

simplest case, a TypeScript file on disk.

8:12

You point pi at that. Here's an

8:14

extension loaded as part of the harness.

8:16

And with that you get, basically, an

8:19

extension API that lets you hook into

8:21

everything and define stuff for the

8:23

harness to expose to the model.

8:25

And that includes tools, uh, slash commands,

8:28

shortcuts. You can listen in on any kind

8:29

of event and react and then save state

8:32

in the session that's optionally

8:36

provided to the agent as well or stored

8:38

there for tools that analyze sessions as

8:41

part of your organizational workflows.
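
As a rough picture of what "an extension is just a TypeScript module" can mean, here is a small sketch. The `ExtensionAPI` interface, its method names (`registerTool`, `registerCommand`, `on`), the `tool_call` event name, and the in-memory `Host` are all invented for this example. pi's real extension API is documented with the project and will differ in names and shape.

```typescript
// Hypothetical extension surface; not pi's actual API.
type ToolHandler = (args: Record<string, string>) => Promise<string>;
type CommandHandler = (input: string) => string;
type EventHandler = (payload: string) => void;

interface ExtensionAPI {
  registerTool(name: string, description: string, run: ToolHandler): void;
  registerCommand(name: string, run: CommandHandler): void;
  on(event: string, handler: EventHandler): void;
}

// The extension itself: a function that receives the API and registers things.
function wordCountExtension(api: ExtensionAPI): void {
  let toolCallsSeen = 0; // state the extension keeps for itself

  // A tool the harness would expose to the model.
  api.registerTool("word_count", "Count words in a string", async (args) =>
    String((args.text ?? "").split(/\s+/).filter(Boolean).length),
  );
  // A slash command for the human at the keyboard.
  api.registerCommand("stats", () => `tool calls seen: ${toolCallsSeen}`);
  // React to an event emitted by the harness.
  api.on("tool_call", () => {
    toolCallsSeen++;
  });
}

// Minimal in-memory host so the sketch runs on its own.
class Host implements ExtensionAPI {
  tools = new Map<string, ToolHandler>();
  commands = new Map<string, CommandHandler>();
  listeners = new Map<string, EventHandler[]>();

  registerTool(name: string, _description: string, run: ToolHandler): void {
    this.tools.set(name, run);
  }
  registerCommand(name: string, run: CommandHandler): void {
    this.commands.set(name, run);
  }
  on(event: string, handler: EventHandler): void {
    this.listeners.set(event, [...(this.listeners.get(event) ?? []), handler]);
  }
  emit(event: string, payload: string): void {
    for (const handler of this.listeners.get(event) ?? []) handler(payload);
  }
}
```

Loading the extension is then a single call, `wordCountExtension(new Host())`, which is also what makes hot reloading cheap in a design like this: drop the old registrations and call the function again.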

8:43

You can do custom compaction, custom

8:45

providers and you have full control over

8:46

the tool. So you can modify everything

8:48

in pi, and you can then bundle all of

8:50

that up and put it on npm or on GitHub,

8:53

because I think we don't need to

8:55

reinvent another bunch of silos called

8:58

marketplaces. We already have package

9:00

managers. And all of that hot

9:03

reloads. So if you develop an extension

9:06

for pi, you do so in the session and you

9:09

hot reload the changes and see the

9:12

effects of that immediately, which is

9:14

very great. And it's also a game

9:15

development thing: in game development

9:17

you want very high iteration, uh,

9:20

speeds, and that's great. So, a couple of

9:23

examples. Claude, or Anthropic, ships the

9:25

slash "by the way", which lets you talk to

9:26

the agent while it goes on its main quest. I

9:29

posted this little prompt on Twitter

9:31

jokingly, and somebody built it in five

9:33

minutes with more features, and they

9:35

didn't have to fork or clone pi. They

9:37

just let the agent write the extension

9:40

based on the prompt. Here's Nico. He's

9:42

one of the most prolific uh extension

9:44

writers. I don't know what the [ __ ] is

9:46

going on here. It's a chat room for all

9:48

of his pi agents, and they talk with each

9:49

other. I would never use this, but all

9:51

of this is custom, including the UI. Or

9:53

you can play NES games or you can play

9:56

Doom.

9:58

And there's a bunch of other examples

9:59

I'm not going to talk about. So, how do

10:01

you build a pi extension? You don't. You

10:03

tell pi to build it for you based on

10:05

your specifications. And then you just

10:06

iterate with it on that and hot reload

10:08

during the session. I'm going to skip

10:10

that example as well. And if you don't

10:12

like building things yourself, and I

10:14

hope you do like building things

10:15

yourself, but if you don't, you can look

10:17

on npm or our little search, uh, interface

10:20

on top of npm to find packages for

10:23

subagents, MCP, and so on. So, does it

10:25

actually work? Well, here's the

10:27

Terminal-Bench leaderboard from October, before pi

10:29

had compaction. I added that for Peter's

10:31

claw thingy. It scored sixth place.

10:35

Uh, but none of this is actually about

10:36

pi. I basically

10:39

want you to retake control of your tools

10:40

and workflows. So build your own. Um and

10:43

if you want to know more about pi and

10:44

OpenClaw, go to this talk please. Yeah.

10:46

And then eventually Peter happened. He

10:48

put pi inside of OpenClaw as its agentic

10:51

core, which meant my open source project

10:53

became the target of a lot of OpenClaw

10:55

instances, unbeknownst to their users. So

10:57

this is act two: OSS in the age of

10:59

clankers. Clankers are destroying OSS.

11:01

Here's tldraw. They closed down the

11:03

issue and pull request tracker. Here's

11:05

OpenClaw's, uh, trackers. Here's mine.

11:08

Half of that is OpenClaw instances

11:10

who post garbage. So I started to rage

11:12

against the clankers.

11:14

Um if you send a pull request, it gets

11:16

autoclosed with a comment that asks you

11:18

to please write a nice issue in your

11:21

human voice, no longer than a screen

11:22

worth of text. And if I see that, I write

11:25

"looks good to me" and your account name

11:26

gets put in a file in the repository, and

11:28

the next time you send a pull request,

11:30

it's let through. Clankers don't read

11:33

that comment. They don't go back once

11:34

they posted a pull request. So that's a

11:36

perfect filter. Uh, Mitchell eventually

11:38

turned it into Vouch. Here's a clanker.
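
The auto-close filter described above boils down to an allowlist check. The following TypeScript sketch shows one way to express it. It is an illustration under assumptions, not the automation actually running on the pi repository or in Vouch: the `PullRequest` and `Decision` types, the function names, the one-account-per-line file format, and the comment text are all hypothetical.

```typescript
// Hypothetical shapes for illustration; not taken from pi's repository automation.
type PullRequest = { author: string; title: string };
type Decision = { action: "allow" } | { action: "close"; comment: string };

const CLOSE_COMMENT =
  "Please open an issue first, written in your own voice and no longer than one screen of text.";

// `approvedFile` is the text of an allowlist file in the repository,
// assumed here to hold one account name per line, with # for comments.
function parseApproved(approvedFile: string): Set<string> {
  return new Set(
    approvedFile
      .split("\n")
      .map((line) => line.trim().toLowerCase())
      .filter((line) => line.length > 0 && !line.startsWith("#")),
  );
}

// Pull requests from unknown accounts are closed with a comment;
// accounts already in the file are let through.
function gatePullRequest(pr: PullRequest, approved: Set<string>): Decision {
  return approved.has(pr.author.toLowerCase())
    ? { action: "allow" }
    : { action: "close", comment: CLOSE_COMMENT };
}

// Called after the maintainer replies "looks good to me" on the contributor's issue.
function approve(approvedFile: string, author: string): string {
  if (parseApproved(approvedFile).has(author.toLowerCase())) return approvedFile;
  const needsNewline = approvedFile !== "" && !approvedFile.endsWith("\n");
  return `${approvedFile}${needsNewline ? "\n" : ""}${author}\n`;
}
```

The filter works because it requires a second, human step: an automated account that never returns to read the closing comment never ends up in the file.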

11:40

Uh, I also labeled them. If you had

11:42

interactions with OpenClaw, your issues

11:44

get deprioritized. I also built tools

11:47

where I embed, uh, issue and pull request

11:49

texts into 3D space. So I see clusters

11:52

of issues. Uh, I also invented OSS

11:54

vacation. I just close the tracker

11:56

whenever I want. So I have my life back.

11:58

So does this work? Yes, sort of.

12:02

Which leads me to act three: slow the

12:04

[ __ ] down. Everything's broken.

12:09

And then there's people that say, "Our

12:10

product's been 100% built by agents."

12:12

Yes, we know it [ __ ] sucks now.

12:14

Congratulations.

12:22

And I'm hearing this from my peers and

12:24

this is entirely unhealthy.

12:27

Um, so here's how we should not work

12:28

with agents and why, at least in my

12:30

opinion. I wrote this on my blog a while

12:32

ago, but the basics are this. We're having

12:34

an army of agents, and you're using Beads

12:36

on been and you don't know that it's

12:38

basically uninstallable malware, and

12:40

Anthropic built a C compiler that kind of

12:41

works but actually doesn't, and we're

12:43

hoping the next generation of models

12:44

will fix it. And here is Cursor building a

12:46

browser, and that's also super [ __ ]

12:48

broken, but the next generation will fix

12:50

it. And SaaS is dead, software is solved in

12:52

six months, and my grandma just built

12:54

herself a Spotify with her OpenClaw.

12:56

Come on, people. So agents are actually

13:00

compounding booboos, which is my word for

13:01

errors, with zero learning and no

13:03

bottlenecks and, uh, delayed pain. The

13:06

delayed pain is for you. Here's your

13:08

codebase on a human, on one agent, and 10

13:11

agents. How much of the agent code can

13:13

you review? Here's the same codebase but

13:16

expressed in number of booboos per day.

13:19

How many of those booboos do you think

13:21

you'll find? Then you say, "Oh, I have a

13:23

review agent." Let me introduce you to

13:26

the wonderful world of the ouroboros. Doesn't

13:28

work. It catches some issues. Um, the

13:31

problem is that agents are merchants

13:32

of learned complexity. Where did they

13:34

learn that complexity from? From the

13:36

internet. What's on the internet? All

13:37

our old garbage code. There are some

13:39

pearls on the internet, really

13:41

well-designed systems, but 90% of code

13:43

on the internet is our old garbage. And

13:45

that's what the models learn from. And

13:47

every decision of an agent is local,

13:49

especially if the codebase is so big

13:51

that it doesn't fit into its context.

13:52

And if you let it go wild and add

13:55

abstractions everywhere that are

13:57

intertwined. Um, so that leads to lots

14:00

of abstractions and duplication and

14:02

backwards compatibility. Who has seen

14:04

that in the output of their agent? It's

14:06

[ __ ] annoying. Or defense in depth. So

14:09

yeah, you get enterprise-grade

14:11

complexity within two weeks with just

14:13

two humans and 10 agents.

14:15

Congratulations.

14:16

And then you say, "But my detailed spec."

14:19

Yes, sure. You know what we call a

14:21

sufficiently detailed spec? It's a

14:23

program.

14:25

So if you leave blanks in your spec,

14:28

what do you think happens? How does the

14:29

model fill in the blanks? And with what

14:31

does it fill that in? It fills it in

14:34

with the garbage that it learned on the

14:35

internet from our old code, which is

14:37

garbage to mediocre. And then you say,

14:39

"But humans also..." Yes, humans are

14:41

horrible, fallible beings, but they

14:44

can learn, and they are bottlenecks.

14:46

There's only so many booboos they can add

14:48

to your codebase on a daily basis. And

14:51

humans feel pain, which is a very

14:54

interesting property because humans hate

14:55

pain. And once there's too much pain,

14:57

the human has a bunch of options. It can

15:00

quit their job. It can uh blame somebody

15:04

else and make them fix it or everybody

15:06

bands together and starts refactoring

15:07

the [ __ ] out of the garbage codebase,

15:10

right? Agents will happily keep [ __ ]

15:13

into your codebase.

15:16

And no, your AGENTS.md and super complex

15:19

memory systems will not save you. Agents

15:21

don't learn the way we learn.

15:24

Those are my most beloved people: "I

15:26

don't even read the code anymore."

15:28

Congratulations. Something is broken and

15:31

your users are screaming. So, who you

15:32

going to call? Not yourself because you

15:35

haven't read the code. So, you're

15:36

relying on your agents, but they are now

15:38

also overwhelmed because the codebase is

15:40

so humongous that there's absolutely

15:42

zero chance they can get all the context

15:44

they need to fix the issues. And long

15:46

context windows are a hack, as most of

15:49

you will find out this year, as

15:50

everybody's switching to 1 million

15:52

token context windows, and agentic

15:54

search is also failing.

15:57

So the agent patches locally and [ __ ]

15:59

[ __ ] up globally. If you see this in

16:01

your codebase, you're [ __ ]

16:06

So you cannot trust your codebase

16:08

anymore, and also not your tests, because

16:09

your agent wrote your tests. So, good

16:11

game. So here's how I think we should

16:13

work. Um there's a bunch of properties

16:15

for good agent tasks. That means scope.

16:18

If you can scope it in such a way that

16:20

the agent is guaranteed to find all the

16:22

things it needs to find to do a good

16:23

job, you're done. That means modularize

16:26

your codebase. If you can give it a

16:28

function to evaluate how well it did the

16:30

job, even better. Hill climbing, auto

16:32

research. Uh, anything non-mission-critical,

16:34

let it vibe. Boring stuff, let

16:36

it vibe. Reproduction cases for user

16:39

issues, which are usually only partial

16:40

in information, perfect. I don't spend

16:43

any mornings anymore doing that. Or if

16:44

you don't have a human near you, rubber

16:46

duck. So, lots of tasks you can use them

16:48

for and save time. At the end of that,

16:51

you evaluate. You take what's

16:53

reasonable. Most of it isn't. And then

16:55

finalize. My final slide, more or less,

16:58

slow the [ __ ] down. Think about what

17:00

you're building and why. And don't just

17:02

build because your agent can do it. Now,

17:03

that's stupid. Uh, learn to say no. This

17:07

is your most valuable uh capability at

17:10

the moment. Fewer features, but the ones

17:12

that matter. And then use your agents to

17:14

polish the [ __ ] out of that. Enlighten

17:16

your users, not your uh token maxing

17:20

desires. Gate the amount of generated

17:22

code uh that you need to review.

17:26

And non-critical code, sure, vibe-slop

17:28

ahead. Critical code, read every [ __ ]

17:30

line. See the keynote after me for more

17:33

info on that. So, how do you know what's

17:35

critical? Any guesses?

17:38

Well, you read the [ __ ] code. Uh, if

17:42

you do anything important, write it by

17:43

hand. You can use a clanker to help you

17:45

with that, but don't let it make the

17:47

decisions for you because we've learned

17:49

all the decisions it makes are learned

17:51

from the internet. And that friction is

17:53

the thing that builds the understanding

17:55

of the system in your head, which is

17:57

important. And it's also where you learn

18:01

new things. And all of this requires

18:03

discipline and agency. And all of this

18:06

still requires humans. Thank you.
