The Friction is Your Judgment — Armin Ronacher & Cristina Poncela Cubeiro, Earendil
Good morning. Thanks for having us. Today I want to talk with Cristina about friction a little bit.
This is a social preview that came up automatically when someone submitted an issue. Basically, this is a forum post that goes with a security incident: a configuration change that was deployed accidentally and caused a problem. The social preview of the post had the marketing tagline of that company, which said "ship without friction."
And we want to encourage you to add a little bit of friction to it, and I'll tell you why. So, who are we? I've been doing software development for 20 years, most of it in the open source space. I created Flask, which is a Python framework, and which ironically is so much in the weights that a lot of people are learning about it now because the machines are producing it. I left my previous company, Sentry, in April last year, which perfectly coincided with me having time and then, obviously, Claude Code. So I fell deep into a hole of agentic engineering and started writing on my blog, and a lot of people reached out to me over the last year, all excited about this. Then in October I started a company with a friend, called Earendil, where we are trying to make sense of all the AI things.
>> Yeah, and my name is Cristina and I work with Armin at this company called Earendil. But importantly, I am what I like to call a native AI engineer. What that basically means is that these tools have been around longer than I have been an engineer. So they've been super foundational in how I've become a software engineer, not just because I obviously use them to work, but also because they are the means by which I've learned to do what I do. Before Earendil I was working at Bending Spoons.
>> So we want to share a little bit from practice, not just theory, but I will readily admit that I don't think we have all the solutions. We have been building with or on agents for a good 12 months. We had huge leverage and great disappointment, and we keep running into two types of problems. If you listened to some of the earlier talks at this conference, you will have heard a lot about how you should keep using your brain. For some reason that's really, really hard. So there's a psychological problem. The other one is the engineering challenge: the agents seem to be producing worse code for some people and better code for others, so what is it that actually makes that work? This is not so much a solution as it is our part of the journey, how we think we have managed so far.
So problem number one is the psychology part. Why is it that, even though everybody has told you many times over that you should be using your brain, that you should be slowing down, it's actually incredibly hard? It's just one more prompt, and we don't sleep that much. What is it that actually makes it so hard? Would it be that hard if the machines were actually writing perfect code and we didn't have to think quite as much? And is there something we can do to make this a little bit better?
>> So I'll begin by introducing the first of these problems, the psychology problem. What I want to talk about first is the shift. I'm sure a lot of us here who have been playing with these tools for a while have experienced this at some point. We were prompting and prompting, it was not so good, and then at some point it suddenly clicked and the tools were really, really useful for us. It was fun in the beginning, and they gave us a lot of extra time, because not everyone was using them. They were tools that made us more productive and made it more fun to do our jobs. But very quickly, because they were so useful and got us so hooked, everyone was using them. That had the opposite effect: suddenly the baseline expectation was that everyone is using them and you have to use them too. So this fun and free time translated into pressure. Now we all have to ship faster and produce more code, and it is just not sustainable to review it and actually have time to think.
And so this leads us to the trap, and I actually think there are two parts to this trap. One of them a lot of engineers have spoken about: these tools are super addictive. You never know if that next prompt is going to be the one that makes your product work and adds a new feature, or if it's going to be that last drop of slop that brings your product crashing down. So it's very addictive, and we keep doing what we're doing. It's not a great solution. But most importantly, and I don't think we realize this as much, because we produce a lot of output very fast, we are tricked into thinking that we're actually being more efficient and doing more work. It is quite the opposite, because now we don't have as much time to actually stop, think, and design what we're doing, to ask ourselves: is this the best way I can implement this, or could I be doing something better? When you're in this flow, it's very difficult for you to stop, and it's definitely very difficult for your agent to stop, because it's running around reading files that it should never have read. So we are the ones that need to have the agency to be in control here.
>> And one thing that took me quite a while to realize, if you start scaling this from one person to an engineering team, is that it really changes the composition of the engineering team. We used to be supply constrained by the creation of code, so the balance between writing code and reviewing code in engineering teams was usually quite decent. Now every engineer has a multiple of producing power compared to their reviewing power, so obviously we are piling up pull requests. But we are also slowly expanding the total number of humans in an organization who participate in the engineering process. I talked to a lot of engineers over the last year, and increasingly one of the things that came up is: now I have marketing people shipping code, I have CEOs who used to be engineers now shipping code again. The roles those people have in their companies don't put the responsibility on them. The responsibility still rests with the engineering team. So the total number of entities, both humans and machines, participating in the code creation process outnumbers the ones that can carry responsibility. We're not at the point where the machine can be responsible for the code changes. That has led to more and more code reviews being skipped or rubber stamped. On the way back to the small PRs that we want to see again, so that this reviewing process works, this amplification is something that at the very least we need to recognize.
And so when you get this pull request that looks really daunting and has 5,000 lines of code in it, this is actually when you should be thinking, and that's exactly when it's the most overwhelming. Increasingly, we're tapping out of this.
On the engineering side, what we're doing is creating larger pull requests. We're creating these massive changes because it is free now, right? And if you think about how the agents work, they're really optimized for creating code that runs. Their main objective is: write some code, run the tests, make some progress. The reinforcement learning sort of trains this in. So the agents are writing the kind of code that you, as a human learning software engineering, wouldn't necessarily write. For instance, you see quite a bit of code that tries to read a config file and, if it can't read the config file, loads some defaults. As an engineer, you know that's actually not great, because I might not notice that I'm running on the default config. I might only discover that I have a massive problem after two hours, when I have already written database records with wrong data. These machines optimize towards making progress, towards shipping stuff, towards unblocking themselves. As a result, they're creating many more failure conditions than human-written code normally would. In part that is because you as a human feel bad when you write code like this. There's something that builds up emotionally in yourself, but the agent doesn't have a reason for this. It doesn't feel anything. And if you create these services that are sort of hobbling along and are willing to recover from local failures, you actually create very, very brittle systems. This also means that you're very quickly creating a codebase of a size and complexity that the agent itself can no longer dig itself out of. It's going to stop reading all the files that it should. It's creating code in a new file that already exists somewhere else. So this entire machinery over time creates much more entropy in the source code than you would normally have if humans were on it. A big part of this is that humans feel bad, and agents don't really have any emotions that they communicate to you.
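The config-file fallback mentioned in this part of the talk can be sketched in a few lines. This is only an illustration, not code shown in the talk: the JSON format, the `DEFAULTS` values, and the names `load_config_lenient`, `load_config_strict`, and `ConfigError` are all assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical defaults, invented for this sketch.
DEFAULTS = {"database_url": "sqlite:///dev.db"}


class ConfigError(RuntimeError):
    """Raised when the config file is missing or cannot be parsed."""


def load_config_lenient(path: Path) -> dict:
    # The pattern described in the talk: any failure silently turns into
    # the defaults, so a wrong path goes unnoticed until bad data appears.
    try:
        return json.loads(path.read_text())
    except Exception:
        return dict(DEFAULTS)


def load_config_strict(path: Path) -> dict:
    # Fail loudly instead: a missing or broken file stops the program at
    # startup, before any database records are written.
    try:
        return json.loads(path.read_text())
    except (OSError, json.JSONDecodeError) as exc:
        raise ConfigError(f"cannot load config from {path}: {exc}") from exc
```

With a path that does not exist, the lenient loader returns the defaults without any signal, while the strict loader raises `ConfigError` immediately.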
>> But as Armin likes to say, don't worry, not all is lost. We have found some correlation between what the agents really excel at and the types of codebases that we put them to work in. The main example here is libraries versus products. What we found is that on libraries, agents tend to excel a lot more. This makes sense because intrinsically, when you're building a library, you tend to have a very clearly defined problem that you're trying to solve. Most of the time you can even map the set of features that you want to build to the API surface, and it has very tight constraints. And because this is something that you probably want to build on top of or make accessible to other people, it's likely going to have a very simple core that you can then plug into. Products, on the other hand, and perhaps this is a bit more unlucky for the rest of us because we are all probably more into building products, are much harder, because there are so many interacting concerns and components. For example, you have your UI, your API responses, different permissions depending on the feature flags, the billing, and so on. So there's very heavy intertwining between different components. What this means is that it's impossible for the agent to fit all of this into its context window. It has no way to actually understand the entire global structure. So locally the agent tends to be very reasonable, but when it gets to the global scale it becomes a bit demented.
So what we're proposing here is that, just as you would do with any type of system design in the past, your codebase has now become infrastructure, and as such you have to design it so that it is also legible for the agent and the agent can make the most of it.
So what we're proposing is an agent-legible codebase. One of the main points, which I'm sure is very clear to all of us, is modularization. We have different components, and this makes it easy for the agent to add one feature in one spot without corrupting everything else. But importantly, this also means modularizing your code flow itself. For example, I've been working on some refactoring. We're building somewhat of an AI assistant, and for me it was super important to understand which steps of my code are actually the main points. Say: you get the user message, then you pass the message to the agent loop, and then you have to deal with the output. These points were very clearly defined for me, so there the code was not as messy. But it happens that between these points, between these steps, is where the agent tends to add the most fuzz. It will be parsing between different types. It's adding things to state that shouldn't be in state. So you end up with behaviors that you didn't want to support, that are unexpected, and that can be quite dangerous. Another point is trying to follow all of the known patterns, because I think we all know by now there's no point in fighting the RL, the reinforcement learning. The more we can lean into it, the better our output is going to be, and it's also more scalable down the line. Then, as mentioned with libraries, if you have a simple core and you push the complexity to other abstraction layers, it's going to be easier for yourself and the agent to read your codebase. And no hidden magic. For example, using React Server Actions, or using an ORM instead of raw SQL, hides intent from the agent, and if the agent can't see something, it surely cannot respect it.
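One way to picture the three steps of the code flow described here (get the user message, run the agent loop, deal with the output) is to give every step one typed input and one typed output, so there is little room for ad hoc parsing or extra state in between. The following is a minimal sketch under that assumption. All names are hypothetical, and `run_agent_loop` is a stand-in stub, not the speakers' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserMessage:
    user_id: str
    text: str


@dataclass(frozen=True)
class AgentOutput:
    reply: str


def receive_message(user_id: str, raw_text: str) -> UserMessage:
    # Step 1: the only place where raw input is normalized and validated.
    text = raw_text.strip()
    if not text:
        raise ValueError("empty message")
    return UserMessage(user_id=user_id, text=text)


def run_agent_loop(message: UserMessage) -> AgentOutput:
    # Step 2: placeholder for the real agent loop (hypothetical stub).
    return AgentOutput(reply=f"echo: {message.text}")


def handle_output(output: AgentOutput) -> str:
    # Step 3: the only place where output is shaped for the caller.
    return output.reply


def handle_request(user_id: str, raw_text: str) -> str:
    # The whole flow is visible in one line: step 1, then 2, then 3.
    return handle_output(run_agent_loop(receive_message(user_id, raw_text)))
```

Because the dataclasses are frozen, code between the steps cannot quietly attach extra fields to a message. Attempting to assign to an attribute raises an error instead.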
To be more precise, these are examples of mechanical enforcement that we have been using at the company, and most of these we actually achieve with linting rules. The main example would be no bare catch-alls. Great. Imagine that there's an example here. The agent found the bare catch-all and was like, "Oh no, this is bad," and edited it. But yeah, we also try to keep our SQL always in one query interface, so that the agent doesn't have to go hunting around the codebase finding all of the different places, because if it misses one you can have breaking behaviors, and again that's dangerous. We try to have one primitives component library for the UI and not have any raw input boxes, for example, so that we always have one type of styling and one kind of behavior. It's very consistent. We don't have any dynamic imports. And this may not sound as important, but we actually enforce unique function names. The reason is not just legibility for you and the agent, it's also token efficiency. If your agent is grepping for a specific feature or something in your codebase and it only gets one result, it's going to be much better at continuing with the loop. And we've recently started exploring something called erasable-syntax-only TypeScript mode. What this does is that your code is basically JavaScript with the type annotations on top. This means there's no transpiling indirection, because there's one source of truth between your actual code and the compiler. So when the agent is looking for errors, it doesn't have this confusion of "oh my god, what am I looking at?" It is much better at finding them.
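Two of the rules listed here, no bare catch-alls and unique function names, can be sketched as a tiny lint pass. The talk does not show the actual linting rules, and the speakers appear to work in TypeScript, so this is an analogous illustration in Python using the standard `ast` module: a bare `except:` plays the role of the bare catch-all, and `lint_source` is a hypothetical name.

```python
import ast
from collections import Counter


def lint_source(source: str, filename: str = "<string>") -> list[str]:
    """Report bare `except:` handlers and duplicate function names."""
    tree = ast.parse(source, filename=filename)
    problems = []
    names = Counter()
    for node in ast.walk(tree):
        # An ExceptHandler without an exception type is a bare `except:`.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            problems.append(f"{filename}:{node.lineno}: bare except")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names[node.name] += 1
    # A name defined more than once means a search for it is ambiguous.
    for name, count in sorted(names.items()):
        if count > 1:
            problems.append(
                f"{filename}: function name {name!r} defined {count} times"
            )
    return problems
```

Run over a whole repository, a check like this would need to collect names across files rather than per file. The single-file version is kept here to stay short.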
And so the goal really is to get into this loop somehow: get the agent to produce as good code as it can. But you really need to find a way to feel the pain that the agent doesn't feel, and you need to be woken up when you should be looking at something. One of the things we have been doing is that we built a Pi extension for our review needs, where we separate out the kind of input that would normally go back to the agent. That is mechanical bugs, or places where it clearly violated the AGENTS.md. But then we specifically call out the kind of changes where the human's brain should reactivate, right? We don't think a database migration should ever go in without a human making a judgment call on it, because it very much depends on the locks and the size of the data in production. If there are permissioning changes, you'd better think about these yourself rather than leaving them to the agent, because they can be underdocumented.
These are just some examples where we learned that if we miss it, we regret it. And you will miss it. But these machines can help you find this, and then you see it, and you actually get a little bit of a hit, like, "Oh, now I have to kick into gear and do something here." This is what this looks like in Pi. On the bottom you have the human callouts. On the top you have what would go back to the agent: if you were to end this review and say "fix the issues," the agent would go back and automatically act on the first two. But this is the moment where I will go and ask: is this a dependency I actually want to have in this codebase? Do I like the maintainers? Does this work for me?
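The split described here, findings the agent can fix on its own versus changes that need a human judgment call, might be sketched as a simple triage step. The talk does not show how the Pi extension works internally, so everything below is an assumption for illustration: the `Finding` type, the `triage` function, and especially the path markers, which a real project would have to define for itself.

```python
from dataclasses import dataclass

# Hypothetical markers for changes a human should look at. The categories
# (migrations, permissions, dependencies) come from the talk; the concrete
# path fragments are invented for this sketch.
HUMAN_JUDGMENT_MARKERS = {
    "migrations/": "database migration",
    "permissions": "permission change",
    "package.json": "dependency change",
}


@dataclass(frozen=True)
class Finding:
    path: str
    message: str


def triage(findings):
    """Split findings into agent-fixable items and human callouts."""
    agent_fixable = []
    human_callouts = []
    for finding in findings:
        reason = next(
            (label for marker, label in HUMAN_JUDGMENT_MARKERS.items()
             if marker in finding.path),
            None,
        )
        if reason is None:
            agent_fixable.append(finding)
        else:
            human_callouts.append((reason, finding))
    return agent_fixable, human_callouts
```

Only the first list would be handed back to the agent with a "fix the issues" instruction. The second list is there to interrupt the reviewer.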
We obviously like the speed. This is addictive, it is great, and we feel there's a lot of productivity. But it is so devious if you start relying on its speed where you really shouldn't. So I can only encourage you to find the areas where you have the feeling that this is actually net positive. For me a lot of this is reproduction cases: when a customer reports an issue, I can have the agent reproduce it perfectly and I have a really good starting point. Another is exploring different product directions, for as long as you commit yourself to doing this with the code that it generates. All of this is great. On the other hand, there is system architecture and creating reliability in the system, which they're just not very good at, because there we really still have to go slow. There is so much mess that can appear in a codebase in so little time. Mario was already talking about this earlier, but we forget that we are producing months and months of technical debt in a matter of weeks, sometimes in a matter of days, and it becomes so much harder to actually understand what's going on in the codebase. When the understanding of your own code drops, it is really, really hard, and it's also psychologically hard. I've found pieces of code that didn't work in production, and I was kind of frustrated to learn that I was the one who committed them with the agent and just didn't really see it. It's a very disappointing experience when it happens and you realize that you were actually the one that screwed up. So it is psychologically incredibly hard to judge the state of the codebase objectively. The only way right now is to slow down a little bit on that front.
And this friction: I know that every engineering team I've ever worked on said we need to get rid of the friction in shipping, and that is true. There's a lot of stuff that's very, very annoying and shouldn't be there. But if you have worked in a large enough engineering organization, SLOs are a great system that is intentionally designed to put friction into the engineering process, to make you think: do I need this reliability? Do I need this criticality of the service? Am I sufficiently staffed to run it? With the agents, we have now gotten this idea that we should get rid of all of this, when in reality we need it. Because friction is, in many ways, what's necessary on a physical level to steer. Without friction there's no steering, and steering is really necessary. So you should put a little bit more of a positive association on this idea of friction, because this is really where your judgment is. This is where your experience is, and you should be inserting that and start feeling it. Thank you.
>> Thank you.