Full Transcript

·YouTLDR

Extreme Harness Engineering: 1M LOC, 1B toks/day, 0% human code or review — Ryan Lopopolo, OpenAI

1:17:46 · 15,508 words · ~78 min read · English · Transcribed Apr 11, 2026
0:00

I do think that there is an interesting

0:02

space to explore here with Codex, the

0:05

harness as part of building AI products

0:08

right there's a ton of momentum around

0:11

getting the models to be good at coding

0:12

we've seen big leaps in like the task

0:16

complexity with each incremental model

0:18

release where if you can figure out how

0:20

to collapse a product that you're trying

0:23

to build a user journey that you're

0:24

trying to solve into code it's pretty

0:27

natural to use the Codex harness to

0:29

solve that problem for you. It's

0:31

done all the wiring and lets you just

0:34

communicate in prompts to let the model

0:36

cook. You kind of have to step back,

0:37

right? Like you need to take a systems

0:40

thinking mindset to things and

0:42

constantly be asking where is the agent

0:45

making mistakes? Where am I spending my

0:47

time? How can I not spend that time

0:50

going forward? And then build confidence

0:52

in the automation that I'm putting in

0:53

place so I have solved this part of the

0:55

SDLC.

0:57

Before we get into today's episode, I

0:58

just have a small message for listeners.

1:00

Thank you. We would not be able to bring

1:02

you the AI engineering, science, and

1:04

entertainment content that you so

1:06

clearly want if you didn't choose to

1:07

also click in and tune into our content.

1:09

We've been approached by sponsors on an

1:11

almost daily basis. But fortunately,

1:13

enough of you actually subscribe to us

1:15

to keep all this sustainable without

1:17

ads, and we want to keep it that way.

1:20

But I just have one favor to ask all of

1:22

you. The single most powerful,

1:24

completely free thing you can do is to

1:26

click that subscribe button. It's the

1:27

only thing I'll ever ask of you, and it

1:30

means absolutely everything to me and my

1:32

team that works so hard to bring the

1:34

Latent Space podcast to you each and every week. If

1:36

you do it, I promise you we'll never

1:38

stop working to make the show even

1:40

better. Now, let's get into it.

1:47

All right, we're in the studio with Ryan

1:49

Lopopolo from OpenAI. Welcome. Hi.

1:52

>> Uh, thanks for visiting San Francisco

1:54

and thanks for spending some time with

1:55

us.

1:55

>> Yeah, thank you. I'm super excited to be

1:56

here.

1:57

>> You wrote a blockbuster article on

1:58

harness engineering. It's probably going

2:00

to be the defining piece of this

2:02

emerging discipline.

2:04

>> Thank you. It is uh it's been kind of

2:06

fun to feel like we've defined the

2:08

discourse in some sense.

2:10

>> Uh let's let's contextualize a little

2:12

bit. This is the first podcast you've ever done.

2:14

Yes. And thank you for spending it with us.

2:15

Uh what is where is this coming from?

2:17

What team are you in? All that jazz.

2:20

>> Sure. Sure. Sure. So uh I work on

2:22

frontier product exploration new product

2:24

development in uh the space of OpenAI

2:26

Frontier, which is our enterprise

2:28

platform for deploying agents safely at

2:32

scale with good governance in uh any

2:34

business. And the role of me and my team

2:38

has been to figure out novel ways to

2:40

deploy our models into package and

2:43

products that we can sell as solutions

2:45

to enterprises.

2:46

>> And you have a background I'll just

2:48

squeeze it in there. Snowflake, Stripe,

2:49

Citadel. Yes. Right. Yes.

2:52

>> The exact same kind of customer entire

2:53

life. Yes. The exact kind of customer

2:55

that you want to

2:56

>> So, I'll say I was actually I didn't

2:57

expect the background. When I looked at

2:59

your Twitter, I'm seeing the opposite,

3:01

right? Uh stuff like this. So, you've

3:03

got the mindset of like full send AI

3:06

coding, uh stuff about slop, like

3:08

buckling in your laptop on your

3:11

Waymos, and then I look at your profile,

3:12

I'm like, "Oh, you're just like you're

3:14

correct in the other room, too." So,

3:16

perfect mix. Perfect. I uh it's quite

3:18

fun to be AI maximalist. If you're going

3:20

to live that persona, OpenAI is the

3:23

place to do it and it's

3:24

>> a token is what they say.

3:25

>> Yeah. It certainly helps that we have no

3:27

rate limits internally and I can go like

3:29

you said full send at this thing.

3:30

>> Yeah. Yeah. Uh so, OpenAI Frontier,

3:33

and you're a special team within OpenAI

3:34

Frontier. We had been given some space

3:37

to cook which has been super super

3:39

exciting and this is kind of why I

3:42

started with kind of a out there

3:44

constraint to not write any of the code

3:46

myself. I was figuring if we're trying

3:49

to make agents that can be deployed into

3:51

enterprises, they should be able to

3:53

do all the things that I do. And having

3:55

worked with these coding models, these

3:57

coding harnesses over 6 7 8 months, I do

4:00

feel like the models are there enough,

4:03

the harnesses are there enough where

4:05

they're isomorphic to me in capability,

4:07

in the ability to do the job. So

4:09

starting with this constraint of I can't

4:12

write the code meant that the only way I

4:15

could do my job was to get the agent to

4:17

do my job

4:18

>> and like just a bit of background before

4:20

that this is basically the article. So

4:22

what you guys did is 5 months of working

4:25

on an internal tool zero lines of code

4:28

over a million lines of code in the

4:30

total codebase. You say it was 10x more

4:32

like it was 10x faster than you would

4:34

have if you had done it by hand. So yeah,

4:37

>> that was kind of the mindset going into

4:38

this, right?

4:39

>> That's right. I think right started with

4:41

some of the very first versions of Codex

4:43

CLI with the Codex mini model, which was

4:45

obviously much less capable than the

4:47

ones we have today. Uh which was also a

4:49

very good constraint, right? It it's

4:51

quite a visceral feeling to ask the

4:54

model to build you a product feature and

4:56

it it just not being able to assemble

4:59

the pieces together

5:00

>> which kind of defined one of the

5:02

mindsets we had for going into this

5:04

which is whenever the model just cannot

5:07

you always pop open the task, double

5:09

click into it and build smaller building

5:12

blocks that then you can reassemble into

5:14

the broader objective. And it was quite

5:18

painful to do this. Honestly, the first

5:20

month and a half was 10 times slower

5:23

than I would be. But because we paid

5:25

that cost, we ended up getting to

5:28

something much more productive than any

5:30

one engineer could be because we built

5:32

the tools, the assembly station for the

5:35

agent to do the whole thing. But yeah,

5:38

so onward to GPT-5, 5.1, 5.2, 5.3, 5.4. To go

5:42

through all these model generations and

5:44

see their kind of quirks and different

5:47

working styles also meant we had to

5:49

adapt the code base to change things up

5:52

when the model was revved. Um, one

5:55

interesting thing here is 5.2: the Codex

5:58

harness at the time, did not have

5:59

background shells in it, which means we

6:01

were able to rely on blocking scripts to

6:05

perform long-horizon work. But with 5.3

6:08

and background shells, it became less

6:11

patient, less willing to block. So, we

6:13

had to retool the entire build system to

6:16

complete in under a minute. And you know

6:19

this is not a thing I would expect to be

6:20

able to do uh in a codebase where people

6:24

have opinions.

6:26

But because the only goal was to make

6:28

the agent productive, over the course of

6:30

a week we went from a bespoke Makefile

6:34

build to Bazel to Turbo to NX and just

6:37

kind of left it there because builds

6:39

were fast at that point.

6:41

>> Interesting. Uh talk more about turbo to

6:42

NX. That's interesting because that's

6:44

the other direction that other people

6:45

have been doing. Ultimately, I have not

6:48

a lot of experience with actual

6:50

front-end repo architecture.

6:52

>> You're talking to Josh who built us this

6:53

guy. So, like I know the NX team and I

6:55

know Turbo from Jared Palmer and I'm

6:57

like yeah that's an interesting

6:59

comparison.

6:59

>> The hill we were climbing right was make

7:01

it fast.

7:01

>> Is there micro front ends involved? It's

7:04

like how

7:05

>> how complex? React, Electron, uh single app

7:09

sort of thing

7:10

>> and must be under a minute. That's an

7:12

interesting limitation. I'm actually not

7:15

super familiar with the background shell

7:16

stuff. Probably was talked about in the

7:18

5.3 release.

7:19

>> Basically means that uh Codex is able

7:21

to spawn commands in the background and

7:23

then go continue to work while it waits

7:25

for them to finish. So it can spawn an

7:27

expensive build and then continue uh

7:30

reviewing the code for example.

7:32

>> Yeah.

7:32

>> Uh and this helps it be uh more time

7:35

efficient for the user invoking the

7:37

harness.

7:38

>> I guess like and just to really nail

7:39

this, like why does 1 minute matter?

7:42

Like why not five, you know? Okay, we

7:44

want the inner loop to be as fast as

7:46

possible. 1 minute was just a nice round

7:48

number and we were able to hit it. So,

7:49

>> and if it doesn't complete, it kills it

7:51

or some something.

7:52

>> Uh, no. We just take that as a signal

7:54

that we need to stop what we're doing.

7:56

Double click, decompose the build graph

7:58

a bit to get the time back under so that

8:00

we can enable the agent to continue to

8:02

operate.
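[Editor's sketch] The one-minute build budget described here can be treated as a ratchet that a script enforces. The following is a minimal sketch, assuming a 60-second budget as stated in the conversation; the function names and the idea of wrapping an arbitrary build command are illustrative assumptions, not the team's actual tooling.

```python
import subprocess
import sys
import time

# The "nice round number" from the conversation; everything else is hypothetical.
BUILD_BUDGET_SECONDS = 60.0


def over_budget(elapsed_seconds: float, budget_seconds: float = BUILD_BUDGET_SECONDS) -> bool:
    """Return True when a build took longer than the agreed budget."""
    return elapsed_seconds > budget_seconds


def timed_build(command: list[str]) -> tuple[int, float]:
    """Run the build command and return (exit code, wall-clock seconds)."""
    start = time.monotonic()
    result = subprocess.run(command)
    return result.returncode, time.monotonic() - start


def report(elapsed_seconds: float) -> str:
    """Produce the signal described above: stop and decompose the build graph."""
    if over_budget(elapsed_seconds):
        message = (
            f"build took {elapsed_seconds:.1f}s, over the "
            f"{BUILD_BUDGET_SECONDS:.0f}s budget: decompose the build graph"
        )
    else:
        message = f"build took {elapsed_seconds:.1f}s, within budget"
    print(message, file=sys.stderr)
    return message
```

A breach here does not kill the build; it only produces a message, matching the description of treating a slow build as a signal rather than a hard failure.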

8:02

>> It's almost like you're you're it's like

8:04

a ratchet. It's like you're forcing

8:06

buildtime discipline because if you

8:08

don't, it'll just grow and grow and

8:09

grow. That's right. And you mentioned

8:10

that

8:11

>> like current like the software I work on

8:12

currently is at 12 minutes. It sucks.

8:14

>> This has been my experience with

8:16

platform teams in the past, right? Where

8:18

you have sort of an envelope of

8:20

acceptable build times and you let it go

8:22

up to breach and then you spend 2 3

8:25

weeks to bring it back down to the lower

8:26

end of the low end stop. But because

8:30

tokens are so cheap. Yeah. And we're so

8:32

insanely parallel with the model, we can

8:34

just constantly be gardening this thing

8:36

to make sure that we maintain these

8:37

invariants, which means there's way less

8:40

dispersion in the code and the SDLC,

8:42

which means we can kind of simplify in a

8:45

way and rely on a lot more invariants as

8:47

we write the software.

8:49

>> You kind of mentioned in your article

8:50

like humans became the bottleneck,

8:52

right? You you kicked off as a team of

8:53

like three people. You're putting out a

8:55

million lines of code, like 1,500 PRs.

8:57

basically what's the mindset there right

8:59

so as much as code is disposable you're

9:02

doing a lot of review a lot of the

9:03

article talks about how you want to

9:05

rephrase everything is prompting

9:07

everything is what the agent can't see

9:09

it's kind of garbage right you shouldn't

9:11

have it in there so what's kind of like

9:12

the high level of how you went about

9:15

building it and then how you address

9:16

like okay humans are just kind of PR

9:19

review like how is human in the loop for

9:21

this you know

9:21

>> we we've moved beyond even the the

9:23

humans reviewing the code uh as well.

9:25

Most of the human review is uh postmerge

9:28

at this point.

9:29

>> But merge merge

9:31

>> that's not even review that's just like

9:32

oh let's just make ourselves happy by

9:35

using

9:36

>> fundamentally the model is trivially

9:39

parallelizable, right? As many GPUs and

9:41

tokens as I am willing to spend I can

9:43

have capacity to work on my codebase.

9:46

>> The only fundamentally scarce thing is

9:48

the synchronous human attention of my

9:50

team. There's only so many hours in the

9:52

day. We have to eat lunch. Uh I would

9:55

like to sleep. Although it's quite

9:57

difficult to, you know, stop poking the

9:59

machine because it makes me want to feed

10:01

it. Uh you kind of have to step back,

10:04

right? Like you need to take a systems

10:06

thinking mindset to things and

10:09

constantly be asking where is the agent

10:11

making mistakes? Where am I spending my

10:14

time? How can I not spend that time

10:16

going forward? And then build confidence

10:18

in the automation that I'm putting in

10:20

place. So I have solved this part of the

10:22

SDLC. And usually what that has looked

10:25

like is like we started needing to pay

10:28

very close attention to the code because

10:29

the agent did not have the right

10:31

building blocks to produce modular

10:33

software that decomposed appropriately

10:36

that was reliable and observable and

10:40

actually accreted a working front end in

10:42

these things. Right? So in order to not

10:46

spend all of our time sitting in front

10:47

of a terminal at most doing one or two

10:49

things at a time, we invested in giving the

10:52

model that observability which is that

10:54

uh that graph in the the post here.

10:56

>> Yeah.

10:57

>> Let's walk through this

10:57

>> traces which which existed first.

11:00

>> We started with just the app and the

11:02

whole rest of it from Vector through to

11:05

all these logging and metrics APIs was, I don't

11:09

know half an afternoon of my time. We

11:11

have intentionally chosen very high

11:15

level fast developer tools. There's a

11:17

ton of great stuff out there now. Uh we

11:19

use mise a bunch, which makes it trivial to

11:22

pull down all these Go-written Victoria

11:24

stack binaries in our local development.

11:27

Tiny little bit of Python glue to spin

11:29

all these up and off you go. One neat

11:32

thing here is we have tried to invert

11:34

things as much as possible which is

11:35

instead of setting up an environment to

11:37

spawn the coding agent into instead we

11:41

spawn the coding agent like that's the

11:42

entry point, just Codex, and then we give

11:45

Codex, via skills and scripts, the

11:48

ability to boot this stack if it chooses

11:50

to
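[Editor's sketch] The "tiny little bit of Python glue" that boots a local observability stack, and the environment variables that point the app at it, might look roughly like this. It is only a sketch under stated assumptions: the binary paths, ports, and variable names below are placeholders invented for illustration, not the team's configuration or any real tool's defaults.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str           # label for humans and agents reading logs
    command: list[str]  # hypothetical binary invocation
    env_var: str        # hypothetical variable the app reads in local dev
    url: str            # hypothetical local endpoint


# Placeholder values only; none of these come from the source conversation.
LOCAL_STACK = [
    Service("logs", ["./bin/logs-server"], "APP_LOGS_URL", "http://127.0.0.1:9001"),
    Service("metrics", ["./bin/metrics-server"], "APP_METRICS_URL", "http://127.0.0.1:9002"),
    Service("traces", ["./bin/traces-server"], "APP_TRACES_URL", "http://127.0.0.1:9003"),
]


def boot(services: list[Service]) -> list[subprocess.Popen]:
    """Start every service in the background and return the process handles."""
    return [subprocess.Popen(service.command) for service in services]


def stack_env(services: list[Service]) -> dict[str, str]:
    """Environment variables that point the app at the stack that was booted."""
    return {service.env_var: service.url for service in services}
```

The design point is the inversion described in the conversation: the agent is the entry point, and a script like this is one of the options it may choose to run, rather than an environment set up before the agent starts.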

11:50

>> and then tell it how to set some env

11:53

variables so the app in local dev points

11:56

at this stack that it has chosen to spin

11:58

up and this I think is like the

12:00

fundamental difference between reasoning

12:02

models and the 4.1s and 4o's of

12:05

the past where these models could not

12:08

think. So you kind of had to put them in

12:10

boxes with a predefined set of state

12:12

transitions whereas here we have the

12:16

model the harness be the whole box and

12:19

give it a bunch of options for how to

12:20

proceed with enough context for it to

12:22

make intelligent choices. So sales

12:24

>> feel like a lot of that is around

12:25

scaffolding, right? Previous agents, you

12:27

would define a scaffold. It would it

12:29

would operate in that, you know, loop,

12:31

try again. That's kind of pivoted off

12:34

from when we've had reasoning models.

12:36

They're seeming to perform better when

12:37

you don't have a scaffold, right? You

12:39

and you go into like niches here too,

12:41

like your spec.md and like having a very

12:45

short AGENTS.md.

12:48

>> Yes.

12:48

>> Yeah. So you you even lay out what it is

12:51

here, but

12:52

>> I like the table of contents. Yeah. that

12:53

like stuff like this, it really helps

12:55

guide people because everyone's trying

12:56

to do this.

12:57

>> This structure also makes it super cheap

12:59

to put new content into the repository

13:02

to steer both the humans and and the

13:04

agents.

13:04

>> I mean, you you kind of reinvented

13:06

skills, right?

13:07

>> One big agent skills from first

13:09

principles.

13:10

>> Skills did not exist when we started

13:11

doing this, right? Um you have a short

13:14

one, a 100-line overall table of contents,

13:16

and then you have little skills, right?

13:18

Core beliefs MD, tech debt tracker. Yeah.

13:20

Yeah. Um yeah. So the skills over The

13:23

tech debt tracker and the quality score

13:24

are pretty interesting because this is

13:27

basically a tiny little scaffold like a

13:29

markdown table which is a hook for

13:31

Codex to review all the business logic

13:34

that we have defined in the app assess

13:37

how it matches all these documented

13:39

guardrails and propose follow-up work

13:41

for itself. So you know before beads and

13:44

all these ticketing systems we were just

13:46

tracking follow-up work as notes in a

13:48

markdown file which you know we could

13:50

spawn an agent on a cron to kind of burn

13:53

down. There's this really neat thing

13:54

that like the models fundamentally crave

13:56

text. So a lot of what we have done here

13:59

is figure out ways to inject text into

14:02

the system. Right? when we get a page

14:05

because we're missing a timeout, for

14:06

example, I can just @ Codex in Slack

14:09

on that page and say, I'm going to fix

14:12

this by adding a timeout. Please update

14:14

our reliability documentation to require

14:16

that all network calls have timeouts.

14:18

So, I have not only made a point in time

14:20

fix, but also like durably encoded this

14:23

process knowledge around what good looks

14:25

like.
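[Editor's sketch] A rule such as "all network calls have timeouts" can also be encoded as a lint whose error message explains the fix, so the guidance reaches the agent as text. Below is a minimal sketch using Python's standard `ast` module. The choice of `requests` and `httpx` as the modules to check, and the wording of the message, are assumptions for illustration; matching on module name is deliberately crude.

```python
import ast

# Assumption: only direct calls on these module names are checked.
NETWORK_MODULES = {"requests", "httpx"}

REMEDIATION = (
    "network call without a timeout; pass timeout=<seconds> so a stalled "
    "dependency cannot hang the caller (see the reliability documentation)"
)


def _is_network_call(node: ast.Call) -> bool:
    """True for calls shaped like requests.get(...) or httpx.post(...)."""
    func = node.func
    return (
        isinstance(func, ast.Attribute)
        and isinstance(func.value, ast.Name)
        and func.value.id in NETWORK_MODULES
    )


def find_missing_timeouts(source: str) -> list[str]:
    """Return one actionable message per network call lacking timeout=."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not _is_network_call(node):
            continue
        if any(keyword.arg == "timeout" for keyword in node.keywords):
            continue
        findings.append(f"line {node.lineno}: {REMEDIATION}")
    return findings
```

The source is only parsed, never executed, so the check is safe to run on any file in the repository.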

14:25

>> Yeah.

14:26

>> And we give that to the root coding

14:29

agent as it goes and does the thing. But

14:31

you can also use that to distill tests

14:34

out of or a code review agent which is

14:36

pointed at the same things to narrow the

14:38

acceptable universe of the code that's

14:39

produced. I think one of the concerns I

14:42

have with that kind of stuff is like you

14:44

think you're making the right call by

14:46

making it persisted for all time across

14:48

everything. Yes.

14:49

>> But then you didn't think about the

14:50

exceptions that you need to make, right?

14:52

And then you have to roll it back.

14:53

>> Part of it is also

14:54

>> sometimes it can follow instructions too

14:56

well.

14:56

>> It's somewhat a skill, right? So it

14:58

determines when it uses the tools,

15:00

right? Like it's not it's not like it'll

15:02

run at every call. It'll determine when

15:04

it wants to check quality score, right?

15:05

>> Yeah. And we do kind of in the prompts

15:08

we give these agents allow them to push

15:11

back. Um when we first started adding

15:13

code review agents to the PR, it would

15:15

be Codex CLI locally writes the change,

15:19

pushes up a PR. On those PR

15:21

synchronizations, a review agent fires,

15:23

it posts a comment. We instruct Codex

15:26

that it has to at least acknowledge and

15:28

respond to that feedback. And initially

15:30

the Codex driving the code author was

15:34

willing to be bullied by the PR reviewer

15:37

which meant you could kind of end up in

15:38

a situation where things were not

15:39

converging. So we kind of had

15:42

>> we kind of had to add more optionality

15:44

to the prompts on both of these things

15:46

right like the reviewer agents were

15:48

instructed to bias toward merging the

15:50

thing to not surface anything greater

15:52

than a P2 in priority. We didn't really

15:54

define P2 but we we gave it

15:56

>> to define P2. We gave it a framework

15:58

within which to uh score its output and

16:02

then

16:02

>> greater than P 0 is worse, right?

16:04

Georgia P2 is P 0 is you will like nuke

16:07

the code base if you merge this thing,

16:08

right?

16:09

>> Yeah. Yeah.

16:10

>> But also on the on the code authoring

16:12

agent side, we also gave it the

16:14

flexibility to either defer or push back

16:17

against review feedback, right? It

16:19

happens all the time, right? like I

16:21

happen to notice something and leave a

16:23

code review which could blow up the

16:26

scope by a factor of two, right? I

16:28

usually don't mean for that to be

16:30

addressed exactly in the moment. It's

16:32

more of an FYI, right? File it to the

16:34

backlog, pick it up in the next fix it

16:36

week sort of thing. And without the

16:39

context that this is permissible, the

16:41

coding agents are going to bias toward

16:42

what they do, which is following

16:43

instructions.
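[Editor's sketch] The two prompt changes described here (reviewers only surface findings up to a priority threshold, authors may fix, defer, or push back) can be pictured as a small triage policy. This is a sketch under assumptions: the conversation says P2 was never precisely defined, so the four-level scale, the reading of "greater than P2" as "less severe than P2", and the two boolean flags are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    P0 = 0  # merging this would break the codebase
    P1 = 1
    P2 = 2
    P3 = 3  # nits and FYIs (hypothetical level)


@dataclass(frozen=True)
class ReviewComment:
    priority: Priority
    text: str
    expands_scope: bool = False     # hypothetical: fixing it would balloon the PR
    author_disagrees: bool = False  # hypothetical: author has a reasoned objection


def surfaced(comments: list[ReviewComment], threshold: Priority = Priority.P2) -> list[ReviewComment]:
    """Reviewer side: bias toward merging by dropping anything less severe than the threshold."""
    return [comment for comment in comments if comment.priority <= threshold]


def author_response(comment: ReviewComment) -> str:
    """Author side: every comment is acknowledged, but not every one is fixed now."""
    if comment.priority == Priority.P0:
        return "fix"
    if comment.author_disagrees:
        return "push_back"
    if comment.expands_scope:
        return "defer"  # file it to the backlog
    return "fix"
```

Without an explicit "defer" or "push back" outcome, the only available response is "fix", which is the non-converging behavior described above.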

16:44

>> Yeah, I did want to check in on a

16:47

couple things, right? like uh all the

16:49

the the coding review agent it can merge

16:52

autonomously

16:53

>> I think that's something that a lot of

16:54

people aren't comfortable with right and

16:56

you have a list here of how much agents

16:58

do. They do product code and tests, CI

17:00

configuration, release tooling, internal

17:02

dev tools, documentation, eval harness,

17:03

review comments, scripts that manage the

17:05

repository itself, production dashboard

17:07

definition files, like everything. Yes. And

17:10

uh so they're just all churning at the

17:11

the same time is there like a cord that

17:14

that any human on the team pulls to stop

17:16

everything So because we are building a

17:19

native application here, we're not doing

17:21

continuous deployment, right? So there

17:23

is still a human in the loop for cutting

17:24

the release branch. I see

17:26

>> we require a blessed, human-approved smoke

17:30

test of the app before we promote it to

17:33

distribution these sorts of things.

17:34

>> So you're working on the app you're not

17:35

building like infrastructure where where

17:37

you have like nines of reliability that

17:38

kind of stuff.

17:39

>> That's correct. That's correct. Okay.

17:40

And also like full recognition here that

17:42

all of this activity took place in a

17:44

completely greenfield repository, like

17:46

there should be no claim that this applies

17:49

generally to like this is a production

17:51

thing you're going to ship to customers

17:52

of course. Yeah. You know so this is

17:54

real

17:55

>> and like one of the things there is you

17:56

mentioned you started this as a repo

17:58

from scratch. The onboarding first month

18:00

or so was pretty it was like working

18:02

backwards right and you had to work with

18:04

the system and now you're at that point

18:06

where you know you're very autonomous.

18:08

I'm curious like okay so what how human

18:10

in the loop is it right so like what are

18:12

the bottlenecks that you wish you could

18:14

still automate and part of that is also

18:16

like where do you see the model

18:17

trajectory improving and offloading more

18:19

human in the loop right we just got 5.4

18:22

for um it's a really good

18:24

>> fantastic model by the way.

18:25

>> Yeah. Yeah. It's the first one that's

18:26

merged uh top tier coding. So it's

18:29

Codex-level coding and reasoning. So

18:31

general reasoning both in one model,

18:32

right? So

18:33

>> and computer use

18:34

>> computer use. Now with I can just have

18:37

Codex write the blog post. Whereas for

18:39

this one I had to balance between chat

18:41

and

18:42

>> oh I need to I might be out of a job.

18:47

>> Oh my god.

18:48

>> I know. You just gave me an idea for a

18:51

completely AI newsletter that like 5.4

18:53

could do.

18:54

>> Yeah, I get it. Now,

18:56

>> this sort of thing is just one example

18:58

of closing the loop, right? like the

19:00

dashboard thing you mentioned. We have

19:02

Codex authoring the JSON for the

19:04

Grafana dashboards and publishing them

19:07

and also responding to the pages which

19:09

means when it gets the page it knows

19:11

exactly which dashboards are defined and

19:13

what alerts what alert was triggered by

19:17

which exact log in the codebase cuz all

19:19

of this stuff is collated together.

19:21

>> It has to own everything.

19:22

>> Yes. Yes. And it means that if we have

19:25

an outage that did not result in a page,

19:28

it has the existing set of dashboards

19:30

available to it. It has the existing set

19:32

of metrics and logs and can figure out

19:34

where the gaps in the dashboard are or

19:35

in the underlying metrics and fix them

19:37

in one go. In the same way you would

19:40

kind of have a full stack engineer be

19:42

able to drive a feature from the back

19:43

end all the way to the front end. So it

19:45

seems like a lot of the work you guys

19:47

had to do was you as a small team are

19:50

fully working for a way that the model

19:52

wants the software to be written right

19:54

it's less human legible for better code

19:56

legibility agent legibility how do you

19:58

think that affects broader teams so one

20:00

at OpenAI, like do you liaise, like this

20:03

is how software should be written like I

20:04

can imagine say you join a new team with

20:07

this methodology this mindset uh there's

20:10

ways that you know teams do code review

20:12

teams write code like teams are

20:13

structured And a lot of it is for human

20:15

legibility. So like should we all swap?

20:17

Like how does this play back one broader

20:20

into OpenAI and then like broader into

20:22

software engineering, right? Like is it

20:23

like teams that pick this up will like

20:25

you know it's pretty drastic, right? You

20:26

have to make a pretty big switch. Should

20:29

they just full send like

20:32

the mindset is very much that I'm

20:34

removed from the process, right? I can't

20:35

really have deep code level opinions

20:38

about things. It's as if I'm group tech

20:41

leading a 500 person organization like

20:43

yeah like it's not appropriate for me to

20:45

be in the weeds on every PR. This is why

20:47

that postmerge code review thing is like

20:50

a good analog here right like I have

20:51

some representative sample of the code

20:54

as it is written and I have to use that

20:56

to infer what the teams are struggling

20:58

with where they could use help where

21:01

they're already moving quickly and I can

21:02

pivot my focus elsewhere.

21:04

>> Yeah. So I don't really have too many

21:06

opinions around the code as it is

21:09

written. I do however have like a

21:13

command base class, which is like used to

21:16

have repeatable chunks of business logic

21:17

that comes with tracing and metrics and

21:19

observability for free right and the

21:21

thing to focus on is not how that

21:23

business logic is structured but that it

21:26

uses this primitive because I know

21:27

that's going to give leverage by

21:29

default.
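[Editor's sketch] A command base class that gives business logic tracing and metrics "for free" could be shaped like the following. This is an illustrative sketch only: the in-memory `METRICS` counter and `SPANS` list stand in for whatever real metrics client and tracer the team uses, and `RenameProject` is a made-up piece of business logic.

```python
import time
from abc import ABC, abstractmethod
from collections import Counter

METRICS: Counter = Counter()  # stand-in for a real metrics client
SPANS: list = []              # stand-in for a real tracer


class Command(ABC):
    """Subclasses write business logic; run() wraps it with telemetry."""

    name = "command"

    @abstractmethod
    def execute(self, **kwargs):
        """Business logic lives here and knows nothing about telemetry."""

    def run(self, **kwargs):
        start = time.monotonic()
        status = "ok"
        try:
            return self.execute(**kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            # Recorded on both success and failure.
            METRICS[f"{self.name}.{status}"] += 1
            SPANS.append(
                {"name": self.name, "status": status, "seconds": time.monotonic() - start}
            )


class RenameProject(Command):  # hypothetical business logic
    name = "rename_project"

    def execute(self, *, old: str, new: str) -> str:
        return f"{old} -> {new}"
```

With a primitive like this, review can focus on whether new logic goes through `Command.run` rather than on how each piece of logic is structured internally.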

21:29

>> Yeah.

21:30

>> Yeah. back to that sort of systems

21:32

thinking

21:32

>> and you have part of that in your blog

21:34

post, enforcing architecture and taste,

21:36

how you set boundaries for what's used

21:38

uh there's also a section on like

21:40

redefining engineering and stuff but

21:42

yeah it's just it's interesting to hear

21:44

you know

21:44

>> and you know as the models have gotten

21:46

better they have gotten better at

21:48

proposing these abstractions to unblock

21:50

themselves which again lets me move

21:51

higher and higher up the stack to look

21:55

deeper into the future on what

21:57

ultimately blocks the team from shipping

21:59

>> yeah you mentioned And uh so you this is

22:02

primarily a it's like a 1 million line

22:04

of code codebase, Electron app, uh but it

22:06

manages its own services as well. So

22:08

it's like a back end for front end type

22:09

thing.

22:10

>> We do have like a a backend in there but

22:13

that's hosted in the cloud. But this

22:16

sort of structure is actually within the

22:18

separate main and renderer processes

22:20

within Electron.

22:21

>> That's just how Electron works.

22:22

>> Yeah. Yeah. So like like have also

22:24

treated like MVC style decomposition

22:27

with the same same level of rigor which

22:30

has been very fun.

22:31

>> Uh I have a fun pun this is like a

22:33

tangent but you know MVC is model view

22:34

controller and any sort of full stack

22:36

web dev knows that but my AI native

22:39

version of this is model view claw the

22:42

claw the the harness.

22:44

>> That's right. That's right. That's

22:45

right. I do think that there is an

22:47

interesting space to explore here with

22:50

Codex, the harness, as part of building

22:53

AI products right there's a ton of

22:55

momentum around getting the models to be

22:58

good at coding we've seen big leaps in

23:01

like the task complexity with each

23:03

incremental model release where if you

23:06

can figure out how to collapse a product

23:08

that you're trying to build a user

23:10

journey that you're trying to solve into

23:11

code it's pretty natural to use the

23:14

Codex harness to solve that problem for

23:16

you. It's done all the wiring and lets

23:19

you just communicate in prompts to let

23:21

the model cook.

23:22

>> It's been very fun. And it's also like a

23:25

very engineering legible way of

23:28

increasing. Right. Yeah.

23:29

>> Just give you just give the model

23:31

scripts, the same scripts you would

23:32

already build for yourself.

23:34

>> Yeah.

23:34

>> Um

23:34

>> Yeah. So for listeners, this is Ryan

23:36

saying that software engineering or

23:39

coding agents will eat knowledge work

23:41

like the non-coding parts that you would

23:43

normally think, oh, you have to build a

23:44

separate agent for it. No, you start

23:46

with coding agent and go up from there,

23:47

which OpenClaw has, right? It's Pi

23:49

under the hood.

23:50

>> Yes.

23:50

>> Basically define your task in code.

23:52

Everything is a coding agent.

23:53

>> By the way, since I brought it up, it's

23:55

probably the only place you bring it up.

23:56

Is there any OpenClaw usage from you? Any

23:58

>> No, no, not for me. I don't have any

24:00

spare Mac minis rattling around my

24:02

house.

24:03

>> You can afford it. Um, no, I just I'm

24:05

kind of curious if it's like changed

24:07

anything in OpenAI yet, but it's

24:08

probably early days. And then the, you

24:10

know, the other thing I want to pull on

24:12

here is like you mentioned ticketing

24:13

systems and you mentioned PRs and I'm

24:16

wondering if both those things have to

24:18

go away or be reinvented for this kind

24:20

of coding, right? So the git itself and

24:24

is like very hostile to multi-agents.

24:27

>> Yeah, we make we make very heavy use of

24:30

worktrees,

24:31

>> right? But like even then like I just

24:33

dropped a podcast yesterday with

24:35

Cursor, and they said they're

24:36

getting rid of worktrees because like

24:38

it still has too many merge conflicts.

24:40

It's still too unintuitive. But go

24:42

ahead. The models are really great at

24:44

resolving merge conflicts. Yeah. And to

24:47

get to a state where I'm not

24:49

synchronously in the loop in my

24:51

terminal, I almost don't care that there

24:53

are merge

24:54

>> disposable, right? We invoke a

24:56

$land skill, and that coaches Codex to

25:00

push the PR, wait for human and agent

25:03

reviewers, wait for CI to be green, fix

25:06

the flakes if there are any, merge

25:09

upstream if the PR comes into conflict,

25:12

wait for everything to pass,

25:14

put it in the merge queue, deal with

25:16

flakes until it's in main. And this is

25:20

kind of what it means to delegate fully,

25:23

right? like this is this is in a you

25:25

know very large model probably a

25:27

significant tax on humans to get PRs

25:30

merged but the agent is more than

25:32

capable of doing this and I really don't

25:33

have to think about it other than keep

25:35

my laptop open.
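
[Editor's sketch] The landing flow described above (push the PR, wait for reviewers and CI, fix flakes, merge upstream on conflict, enter the merge queue) can be read as a small decision loop. The Python below is only an illustration under stated assumptions: the `PRStatus` fields, the action names, and the priority order are hypothetical and are not taken from the actual skill, which the speakers describe only in prose.

```python
from dataclasses import dataclass


@dataclass
class PRStatus:
    """Snapshot of a pull request; every field name here is hypothetical."""
    pushed: bool = False
    has_conflict: bool = False
    ci_flaky: bool = False
    ci_green: bool = False
    reviews_approved: bool = False
    in_merge_queue: bool = False
    merged: bool = False


def next_action(status: PRStatus) -> str:
    """Return the next step needed to move the PR toward main."""
    if status.merged:
        return "done"
    if not status.pushed:
        return "push_pr"
    if status.has_conflict:
        # Assumed priority: resolve conflicts before looking at CI again.
        return "merge_upstream"
    if status.ci_flaky:
        return "fix_flake"
    if not status.ci_green:
        return "wait_for_ci"
    if not status.reviews_approved:
        return "wait_for_reviewers"
    if not status.in_merge_queue:
        return "enqueue_in_merge_queue"
    return "wait_for_merge_queue"
```

An agent loop would refresh the status, call `next_action`, act, and repeat until it returns `"done"`, which is roughly what "delegate fully" means in this passage.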

25:36

>> Yeah.

25:38

I used to be much more of a control

25:39

freak but now I'm like yeah actually you

25:41

could do a better job than me.

25:42

>> Yeah.

25:42

>> With the right context.

25:43

>> Yes.

25:44

>> Anything else on harness engineering in

25:45

general? Just this piece. I just wanted

25:47

to make sure we

25:49

>> I think one thing that I maybe didn't

25:51

make super clear in the article that I I

25:55

kind of heard on Twitter as an interest

25:57

to them. What's the chatter and then

25:58

what's your response?

25:59

>> Ultimately,

26:01

all the things that we have encoded in

26:03

docs and tests and review agents and all

26:05

these things are ways to put all the

26:07

non-functional requirements of building

26:09

high-scale, high-quality, reliable software

26:12

into a space that prompt injects the

26:14

agent. We either write it down as docs,

26:16

we add lints where the error messages

26:18

tell how to do the right thing. So the

26:22

whole meta of the thing is to basically

26:24

tease out of the heads of all the

26:26

engineers on my team what they think

26:28

good looks like, what they would do by

26:30

default or what they would coach a new

26:33

hire on the team to do to get things to

26:35

merge. And that's why we pay attention

26:38

to all the mistakes that the

26:41

agent makes, right? This is code being

26:44

written that is misaligned with some as

26:47

yet not written down non-functional

26:48

requirement.
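
[Editor's sketch] One way to picture "lints where the error messages tell how to do the right thing" is a rule whose message carries the remediation, so the agent reads the fix straight out of the failure. The Python below is a minimal sketch, not the team's tooling: the rule itself (no `print` calls), the `get_logger` helper, and the `docs/logging.md` path are all hypothetical.

```python
import ast

# Hypothetical remediation text; get_logger and docs/logging.md are
# illustrative names, not real parts of any codebase discussed here.
REMEDIATION = (
    "print() is not allowed here. Use the structured logger instead: "
    "`log = get_logger(__name__); log.info(...)`. See docs/logging.md."
)


def lint_source(source: str, filename: str = "<memory>") -> list:
    """Return one finding per print() call, each carrying the fix."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        is_print_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        )
        if is_print_call:
            findings.append(f"{filename}:{node.lineno}: {REMEDIATION}")
    return findings
```

The point of the design is that the message, not a separate document, is what ends up in the agent's context when the check fails.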

26:49

>> Sorry. What did the online people

26:51

misunderstand or

26:53

>> No, what somebody just literally said

26:55

that. I was like, "Oh, yeah. Okay. This

26:57

this is this is the thing. This is what

26:58

I was doing agree with." Yeah. I see. I

27:00

see. I see. I see.

27:01

>> I see. I see. Interesting. One other

27:03

neat thing which I totally did not

27:05

expect is folks were just taking the

27:08

link to the article and giving it to

27:11

like Pi or Codex and saying, make my

27:14

repo this

27:15

>> you achieve a full recursion

27:16

>> and it was wildly effective really it

27:19

was wildly effective like this actually

27:21

is something I tried with 5.4 yesterday I

27:23

I didn't have that much time I was like

27:25

out speaking at something and this is

27:27

one of my things I was like okay I have

27:28

this article can we can we just like

27:30

scaffold out what it would be like to

27:32

run this and I I did it first as that

27:34

and then I was like okay let me take

27:35

another little side repo and say like

27:36

okay if I was to fully automate this

27:39

like this cuz I haven't written a line

27:40

of code it's like a full set

27:41

>> it's a side thing I'm doing with like

27:43

voice TTS I'm just like slopping out

27:46

whatever it's not production I'm like

27:48

how would I make this like this and it's

27:49

it's actually like a really good way

27:51

it's like a good way to learn what could

27:53

be changed what could be like it's just

27:54

a good analyzing right you give it all

27:56

the code you give it all the context you

27:57

give it the article and it it walks you

27:58

through it very well

27:59

>> that's right that's right I guess one

28:01

more thing before we go to Symphony is I

28:03

wanted to cover Bret Taylor's response.

28:04

We had him on the on the show. He is

28:06

your chairman which is wild.

28:08

>> Yeah.

28:09

>> Uh that he's reading your articles as

28:11

well and like getting engaged in it. He

28:12

says software dependencies are going

28:14

away basically. They can just be like

28:16

vendored.

28:16

>> Yes.

28:17

>> Uh response

28:19

>> 100%. You still, you still pay

28:22

Datadog. You still pay Temporal. Thank you.

28:24

>> Yep. The level of complexity of the

28:26

dependencies that we can internalize is

28:28

I would say low medium right now. Right.

28:30

just based on model capability.

28:32

>> What is what is medium?

28:34

>> I I I would say like a a couple thousand

28:37

line dependency is a thing that we could

28:39

in-house, no problem, in an afternoon of

28:41

time. One neat thing about it is like

28:43

probably most of that code you don't

28:44

even need, right? Like by in-housing an

28:47

abstraction, you can kind of strip away

28:49

all the generic parts of it and only

28:51

focus on what you need to enable the

28:53

specific things you're building.

28:54

>> I've been calling this the end of

28:56

plugins.

28:57

>> Yeah. because there's so much like you

28:59

know when I publish an open source thing

29:00

I want to accept everything and be

29:02

liberal I want to accept right this is

29:03

Postel's law, but that means there's so

29:05

much bl so much overhead

29:07

>> one other neat thing about this too is

29:09

when we deploy Codex Security on the

29:12

repo it is able to deeply review and

29:16

change the internalized dependencies

29:19

>> in a much lower friction way than it

29:21

would be to like push patches upstream

29:23

wait for them to be released pull them

29:24

down make sure that's compatible with

29:26

all the transitives I have in my repo

29:28

and things like that. So, it's also much

29:30

lower friction uh to kind of internalize

29:32

some of these things if code is free

29:34

because the tokens are cheap sort of

29:36

thing.

29:36

>> Yeah. Yeah. I I think like the the only

29:38

argument I have against this is

29:40

basically scale testing which obviously

29:42

the larger pieces of software like Linux

29:44

MySQL, he calls out, even the Datadogs and

29:47

Temporals, and then maybe security

29:49

testing where uh classically I think is

29:51

it Linus Torvalds who said like security

29:53

open source is the best disinfectant

29:55

>> right many eyes

29:56

>> many eyes and uh if you you know inline

29:59

your dependencies and and code them up

30:01

you're going to have to relearn mistakes

30:03

from other people that you know

30:05

>> Yep. Yep. And you know to internalize

30:08

that dependency you're back to zero and

30:10

you have to kind of start reassembling

30:13

all those bits and pieces to have high

30:16

confidence in the code as it is written

30:17

right.

30:18

>> Yeah. Um

30:19

>> even part of like the first intro of

30:22

this you basically mentioned like

30:23

everything was written by uh Codex

30:27

including internal tooling right so

30:29

internal tooling like when you're

30:30

visualizing what's going on it's it's

30:32

writing it forward to Yeah, I built

30:33

internal tools for AI now and like I

30:35

just showed them off and they're like

30:37

how long did you spend and I I they I

30:38

didn't spend any time I just prompted

30:40

it, you know,

30:41

>> very funny story here.

30:42

>> Yeah, go ahead.

30:42

>> We had deployed our app to the first

30:45

dozen users internally uh had some

30:47

performance issues. So we asked them to

30:50

export a trace for us. Uh get a tarball,

30:54

gave it to our on call engineer and he

30:58

did a fantastic job of working with

30:59

Codex to build this beautiful local dev

31:02

tool, a Next.js app that you drag and drop

31:04

the tarball in and it visualizes the

31:06

entire trace. Uh it's fantastic. Took an

31:09

afternoon, but none of this was

31:11

necessary because you could just spin up

31:13

Codex and give it the tarball and ask

31:15

the same thing and get the response

31:17

immediately. So in a way optimizing for

31:19

human legibility of that debugging

31:21

process was wrong. It kept him in the

31:24

loop unnecessarily when instead he could

31:27

have just, like, let Codex cook for 5

31:28

minutes and gotten the same.

31:29

>> Yeah. You have to fight your instincts

31:30

here of like this is how we used to do

31:32

it or this is how I I would have used to

31:35

solve it.

31:35

>> Yeah. in this in this local uh

31:37

observability stack like sure you can

31:39

deploy Jaeger to visualize the

31:41

traces but I wouldn't expect to be

31:44

looking at the traces in the first place

31:46

because I'm not going to write the code

31:47

to fix them.

31:48

>> Yeah. I mean so basically there needs to

31:49

be like this kind of house stack and

31:51

owning the whole loop. I think that that

31:52

is very well established and uh it

31:54

sounds like you might be like sharing

31:56

more about that in the future, right?

31:57

>> Yeah. Uh I think we're excited to do so.

32:00

We're gonna talk about Symphony in a

32:01

little bit, but like the way we

32:03

distribute it is as a spec, which I

32:05

think folks are calling ghost libraries

32:07

on Twitter. Like, this is such a

32:09

cool name. Um it does mean it becomes

32:12

much cheaper to share software with the

32:15

world, right? You define a spec how you

32:18

could build your own specifying as much

32:20

as is required for a coding agent to

32:23

reassemble it locally. The flow here is

32:26

very very cool. Like we have taken all

32:28

the scaffolding that has existed in our

32:30

proprietary repo, spun up a new one, asked

32:34

Codex, with our repo as a reference, to

32:36

write the spec. We tell it, spin up a

32:39

tmux, spawn a disconnected Codex to

32:41

implement the spec, wait for it to be

32:43

done, spawn another Codex and another

32:45

tmux to review the spec, or review the

32:49

implementation compared to upstream and

32:51

update the spec so it diverges less. And

32:53

then you just loop over and over and

32:55

over, Ralph-style, until you get a spec

32:58

that is with high fidelity able to

33:00

reproduce the system as it is. It's

33:03

fantastic and
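
[Editor's sketch] The spec-distillation loop described above (implement from the spec, compare against the reference, update the spec, repeat) has a simple shape. The Python below is a hedged illustration only: `refine_spec` and its `implement`, `review`, and `revise` callables are hypothetical stand-ins for the agent runs in separate tmux sessions, and the `max_rounds` cap is an added assumption.

```python
from typing import Any, Callable, List, Tuple


def refine_spec(
    spec: Any,
    reference: Any,
    implement: Callable[[Any], Any],
    review: Callable[[Any, Any], List[Any]],
    revise: Callable[[Any, List[Any]], Any],
    max_rounds: int = 10,
) -> Tuple[Any, int]:
    """Loop until an implementation built from the spec matches the reference.

    Returns the final spec and the number of rounds that were run.
    """
    for round_no in range(1, max_rounds + 1):
        # A fresh agent builds the system from the spec alone.
        implementation = implement(spec)
        # A second agent lists where the result diverges from upstream.
        divergences = review(implementation, reference)
        if not divergences:
            return spec, round_no
        # The spec is updated so the next attempt diverges less.
        spec = revise(spec, divergences)
    return spec, max_rounds
```

Keeping the implementer disconnected from the reference repo is what makes the check meaningful: the spec, not hidden context, has to carry everything needed to rebuild the system.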

33:03

>> and you're basically you're not really

33:05

adding any of your human bias in there,

33:07

right? Like a lot of times people will

33:08

write a spec and be like okay I think it

33:10

should be done this way and you'll

33:12

you'll riff on something and it's like

33:13

no that agent could have just handled

33:14

it. Like you're still scaffolding in a

33:16

sense, right? I want it done this way.

33:18

It can determine that spec better

33:19

better.

33:20

>> That's right. That's right. Part of me

33:22

uh you know I've been working a lot on

33:24

evals recently and part of me is

33:26

wondering if an agent can produce a spec

33:28

that it cannot solve like is it always

33:30

capable of things that it can imagine or

33:32

can you imagine things that it is

33:33

impossible to do. I think with symphony

33:36

we there's like this uh there's this

33:39

axis right where you have things that

33:41

are easy or hard or established or new

33:45

right and I think things that are hard

33:47

and new is still something that uh the

33:50

models need humans to drive, but I

33:52

think those other quadrants are largely

33:54

solved given the right scaffold and the

33:57

right thing that's going to drive the

33:58

agent to completion

33:59

>> it's crazy that it's solved

34:00

>> but it it means that the humans the ones

34:03

with limited time and attention get to

34:05

work on the hardest stuff, right? Like

34:07

the problems where it's pure white space

34:09

out in front or like the deepest

34:12

refactorings where you don't know what

34:14

the proper shape of the interfaces are.

34:16

And this is where I want to spend my

34:18

time because it lets me set up for the

34:20

next level of scale.

34:21

>> Yeah. Yeah. Amazing. Uh let's let's

34:23

introduce Symphony. I think we've been

34:24

mentioning it uh every now and then. Uh

34:26

Elixir, interesting option.

34:28

>> Yeah. Yeah. And again like the the the

34:30

the Elixir manifestation here is just

34:34

a derivative.

34:35

>> Is it a model chosen?

34:36

>> Uh yeah. Yeah. And it chose that because

34:39

>> the process supervision and the

34:41

GenServers are super amenable to the type

34:43

of process orchestration that we're

34:45

doing here. Right. You are essentially

34:47

spinning up little daemons for every

34:49

task that is in execution and driving it

34:52

to completion. Which means the model

34:54

gets a ton of stuff for free by using

34:55

Elixir and the BEAM. I mean, I had to

34:58

go do a crash course in BEAM and Elixir

35:00

and I think most people are not

35:03

operating at that scale of concurrency

35:05

where you need that but it is a good

35:08

mental model of resumability and all

35:10

those things and these are things I care

35:11

about. Uh but tell me the story the

35:13

origin story of Symphony uh what do you

35:15

use it for? Is this how did it form and

35:17

maybe any abandoned paths that you

35:19

didn't take?

35:20

>> At the end of December uh we were at

35:23

about three and a half PRs per engineer

35:25

per day.

35:26

This was before 5.2 came out in the

35:29

beginning of January. Everyone gets back

35:31

from holiday with 5.2 and no other work

35:34

on the repository. We were up in the

35:38

five to 10 PRs per day per engineer. And

35:41

like I don't know about y'all, but like

35:43

it's very taxing to constantly be

35:46

switching like that. Like I was pretty

35:47

tapped out at the end of the day. So

35:49

again, where are the humans spending

35:51

their time? They're spending their time

35:54

>> context switching between all these

35:55

active tmux panes to drive the agent

35:57

forward.

35:59

So let's again build something to remove

36:02

ourselves from the loop. And uh this is

36:04

what uh frantic uh sprint adapter here

36:06

to find a way to remove the need for the

36:10

human to sit in front of their terminal.

36:12

So lot of experimentation with dev boxes

36:15

and you know automatically spinning up

36:17

agents like it seems like a fantastic

36:20

end state here where my life is beach. I

36:23

open l twice a day and uh you know say

36:27

yes no to these things and

36:30

>> this is again a super super interesting

36:33

framing for how the work is done because

36:34

I become more latency insensitive. I

36:38

have way less attachment to the code as

36:41

this is written. Like I've had close to

36:44

zero investment in the actual authorship

36:47

experience. So if it's garbage, I can

36:49

just throw it away and not care too much

36:52

about it. In Symphony, there's this like

36:54

rework state where once the PR is

36:56

proposed and it's escalated to the human

36:58

for review, it should be a cheap review,

37:01

right? It is either mergeable or it is

37:02

not. And if it's not, you move it to

37:04

rework. The Elixir service will

37:06

completely trash the entire worktree

37:08

and PR and start it again from scratch.
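
[Editor's sketch] The rework behavior described above amounts to a small ticket state machine: a review is a cheap yes or no, and a "no" discards the whole attempt instead of patching it. The Python below is an assumption-laden illustration, not Symphony's Elixir code; the state names, the `Ticket` fields, and the worktree path format are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IN_PROGRESS = "in_progress"
    HUMAN_REVIEW = "human_review"
    REWORK = "rework"
    MERGED = "merged"


@dataclass
class Ticket:
    id: str
    state: State = State.IN_PROGRESS
    worktree: Optional[str] = None
    pr: Optional[str] = None
    attempts: int = 0


def start_attempt(ticket: Ticket) -> None:
    """Begin a fresh attempt in a brand-new worktree."""
    ticket.attempts += 1
    ticket.worktree = f"worktrees/{ticket.id}-attempt{ticket.attempts}"
    ticket.pr = None
    ticket.state = State.IN_PROGRESS


def propose_pr(ticket: Ticket, pr: str) -> None:
    """The agent finished and escalates to a human for review."""
    ticket.pr = pr
    ticket.state = State.HUMAN_REVIEW


def review(ticket: Ticket, mergeable: bool) -> None:
    """Binary review: merge, or throw the whole attempt away."""
    if ticket.state is not State.HUMAN_REVIEW:
        raise ValueError("ticket is not awaiting review")
    if mergeable:
        ticket.state = State.MERGED
    else:
        # Nothing is salvaged: both the worktree and the PR are discarded.
        ticket.worktree = None
        ticket.pr = None
        ticket.state = State.REWORK


def resume(ticket: Ticket) -> None:
    """Move a reworked ticket back to in-progress, starting from scratch."""
    if ticket.state is not State.REWORK:
        raise ValueError("only reworked tickets can be resumed")
    start_attempt(ticket)
```

The gap between `review(..., mergeable=False)` and `resume` is where a human would fix whatever context the agent was missing before the next attempt.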

37:11

>> And this is that opportunity again to

37:13

say why was it trash, right? What did

37:16

the agent do that was

37:17

>> fix that before moving the ticket to

37:20

progress again?

37:21

>> Yeah.

37:22

>> Why is this not in Codex app? I guess

37:24

it's you guys are you guys are ahead of

37:25

Codex app, I guess.

37:26

>> Yeah. So the way the team has been

37:29

working is basically to be as AI-pilled as

37:33

possible and sprint ahead, and a lot of

37:36

the things we have worked on have fallen

37:39

out into a lot of the products that we

37:41

have like we were in deep consultation

37:43

with the Codex team to have the Codex

37:46

app be a thing that exists right to have

37:48

skills be a thing that Codex is able to

37:50

use so we didn't have to roll our own to

37:54

put automations into the product so all

37:56

of our automatic refactoring agents

37:58

didn't have to be these hand-rolled

38:00

control loops. It has been really

38:02

fantastic to be in a way unanchored to

38:05

the product development of Frontier and

38:08

Codex and just very quickly try to

38:11

figure out what works and then later

38:14

find the scalable thing that can be

38:16

deployed widely. It's been a very fun

38:18

way to operate. It's certainly chaotic.

38:21

I have lost track very often of what the

38:24

actual state of the code looks like

38:26

because I'm not in the loop, right? Uh

38:29

there was one point where we had wired

38:32

Playwright directly up to the Electron

38:34

app uh with MCP. MCPs I'm pretty bearish

38:38

on because the harness forcibly injects

38:40

all those tokens in the context and I

38:42

don't really get a say over it. Uh they

38:44

mess with autocompaction. Uh the agent

38:46

can forget how to use the tool. There's

38:48

probably only like what three calls in

38:50

Playwright that I actually ever want to

38:52

use. So I pay the cost for a ton of

38:54

things. Somebody vibed a local daemon

38:58

that boots Playwright and exposes a tiny

39:01

little shim CLI to drive it. And I had

39:03

zero idea that this had occurred because

39:06

to me I run Codex and it's able to you

39:09

know get better.

39:11

>> Yeah. Like uh like no knowledge of this

39:13

at all. So we have had like in human

39:16

space uh to spend a lot of time doing

39:19

synchronous knowledge sharing. We have a

39:21

daily standup that's 45 minutes long

39:24

because we almost have to fan out the

39:28

understanding of the current state.

39:30

>> Yeah, I was going to say like this is

39:32

good for a single-human multi-agent but

39:34

multi-human multi-agent is a whole like

39:37

pol like explosion of stuff.

39:38

>> Yeah. And that this is fundamentally why

39:40

we have such a rigid like 10,000

39:43

engineer level architecture in the app

39:46

because we have to find ways to carve up

39:49

the space so people are not trampling on

39:52

each other.

39:52

>> Sorry, I don't I don't get the 10,000

39:54

thing. Uh did I miss that?

39:55

>> The structure of the repository is like

39:58

500 npm packages. Uh it's like

40:02

architected to excess for what you

40:04

would consider I think normal for a

40:05

seven person team. But if every person

40:09

is actually like 10 to 50 then the like

40:12

numbers on like being super super deep

40:14

into decomposition and sharding and like

40:18

proper interface boundaries make a lot

40:19

more sense

40:20

>> right to me that's why I talked about

40:22

micro front ends and you know NX is from

40:24

that world but cool just coming back to

40:26

to this like I don't know if you have

40:28

other you know thoughts on orchestrating

40:31

so much work going through this is this

40:33

enough is this like any aha moments

40:36

>> it'll be interesting to see like where

40:37

Okay, so right now you pick Linear as

40:39

your issue tracker, right? Like

40:40

>> or it's like a is it is it actually

40:42

Linear?

40:42

>> This is actually Linear.

40:43

>> Oh, that's Linear.

40:44

>> It's Linear.

40:44

>> Oh, I I never look at the video. The

40:46

demo video I had to download to run, but

40:49

>> yeah. So I I cuz I'm a Slack maxi, but

40:52

like Yeah, Linear is also really good.

40:53

Yes,

40:54

>> we do make a good use of Slack. We um we

40:57

fire off uh Codex to do all these

41:00

>> low-complexity fixups, the things that like

41:03

sync that knowledge into the repository.

41:05

It's super cheap.

41:05

>> Yeah. Do it in Codex.

41:07

>> My biggest plug is OpenAI needs to build

41:09

Slack, right? You need to own Slack

41:11

builds to turn this into

41:13

>> I I did I did read it. Yeah. Um

41:16

>> I would say that if we think that we

41:20

want these agents to do economically

41:23

valuable work, which is like this is the

41:24

mission, right? We want AI to be

41:26

deployed widely to do economically

41:28

valuable work. Then we need to find ways

41:30

for them to naturally collaborate with

41:32

humans which means collaboration tooling

41:34

I think is an interesting space to

41:36

explore.

41:36

>> Yeah totally. Yeah. GitHub, Slack, Linear.

41:38

Yeah, that was kind of my thing like

41:40

okay where do we see right now Codex has

41:42

started Codex model then CLI now there's

41:45

an app app can let me shoot off multiple

41:47

Codexes in parallel but there's no great

41:49

team collaboration for Codex right and

41:51

it seems like your team had some say

41:54

into what comes out right so like you

41:56

talked to them Codex kind of was a thing

41:57

from there if you guys are on the bound

42:00

stuff that like you know you might not

42:01

focus on but like what do you expect

42:03

other people to be building right so

42:05

people that are like 5x 50xing should

42:07

you build stuff that's like very niche

42:09

for your workflow, for your team. Should

42:12

it be more general so other people can

42:13

adopt it? Is there a niche there? Like

42:15

because because part of it is just like,

42:17

okay, is everything just internal

42:18

tooling? Do we have everything our own

42:19

way? Like the way our team operates has

42:22

our own ways that we like to communicate

42:23

or you know, is there a broader way to

42:25

do it? Is it is it something like a

42:27

issue tracker? Just thoughts if you want

42:28

to riff on that.

42:29

>> I think TBD like we have not figured

42:32

this out in a general way. I do think

42:35

that there is leverage to be had in

42:38

making the code and the processes as

42:40

much the same as possible. If you think

42:43

that code is context, code is prompts,

42:47

it's better from the agent behavior

42:49

perspective to be able to look in a

42:51

package in directory XYZ and it not to

42:54

have to page so deeply into directory

42:56

ABC because they have the same

42:58

structure, use the same language, they

43:00

have the same patterns internally. And

43:03

that same like leverage comes from

43:05

aligning on a single set of skills that

43:07

you're pouring every engineer's taste

43:10

into to make sure that the agent is

43:12

effective. So like in our codebase, we

43:14

have I think six skills. That's it. And

43:17

if some part of the software development

43:20

loop is not being covered, our first

43:22

attempt is to encode it in one of the

43:24

existing setup skills. Which means that

43:27

we can change the agent behavior more

43:30

cheaply than changing the human driver

43:32

behavior.

43:32

>> Yeah.

43:34

>> Have you ever experimented with

43:35

agents changing their own behavior?

43:37

>> We do. Uh yes. Or parent agent changing

43:40

a sub agent's you know behavior or

43:41

something like that. We have some bits

43:45

for skill distillation. Um, so for

43:48

example, there's one neat thing you can

43:50

do with Codex which is just point it at

43:52

its own session logs to ask it to

43:55

>> tell you how you can use the tool

43:57

better. It's like

43:58

>> introspection ask it to do things.

44:00

>> How can I use this session better? What

44:02

skills should I have? Yeah, I like the

44:04

modification of you can do just do

44:06

things to like you can just ask agent to

44:07

do things.

44:08

>> Yeah, you can just Codex things. This

44:09

is this is like a this is like a silly

44:11

emoji that we have. You can just Codex

44:12

things. You can just prompt things. Uh

44:14

it's really glorious future we live in.

44:17

But like okay, you can do that one-on-one,

44:19

but like we're actually slurping these

44:21

up for the entire team into blob storage

44:25

and running agent loops over them every

44:28

day to figure out where as a team can we

44:31

do better and how do we reflect that

44:33

back into the repository. Yeah. Though

44:34

everybody benefits from everybody else's

44:36

behavior for free. Same for like PR

44:39

comments, right? These are all feedback

44:42

that means the code as written deviated

44:44

from what was good. A PR comment, a

44:46

failed build, these are all signals that

44:48

mean at some point the agent was missing

44:51

context. We got to figure out how to

44:53

slurp it up and put it back in the repo.

44:55

>> By the way, I do this exactly right. I

44:57

when I use uh Claude Code for

44:59

knowledge work.

45:01

>> Claude Cowork is like a nice product,

45:02

right? I think you would agree. I always

45:05

have it tell me what do I do better next

45:07

time,

45:08

>> right? And that's the meta programming

45:09

reflection thing. So almost think like

45:11

you have six reflection extraction

45:12

levels in Symphony. Almost like the the

45:14

zero layer. So the six levels are

45:17

policy, configuration, coordination,

45:18

execution, integration, observability.

45:20

We've talked about a couple of these,

45:22

but the zero layer is like the okay well

45:25

are we working well? Can we can we

45:27

improve how we work? Like can I modify

45:30

my own workflow MD or something? I don't

45:32

know.

45:33

>> Yeah, of course. Yeah, of course you

45:34

can. Um, like this thing is also able to

45:38

cut its own tickets because we give it

45:40

full access.

45:41

>> Yeah. Make it a ticket to have it cut

45:42

tickets. You can put in the ticket that

45:45

you expected to file it on followup

45:46

work.

45:47

>> Self modifying. Yeah.

45:48

>> Yeah. Put don't put the agent in a box.

45:50

Give the agent full accessibility

45:52

over its domain.

45:53

>> I had a mental reaction when you said

45:55

don't put the agent in a box. So I think

45:57

you should put it in a box. Like it's

45:58

just that you're giving the box

45:59

everything it needs.

46:00

>> Yeah. Context and tools. Right. But

46:02

we're like as developers we're used to

46:04

calling out to different systems. But

46:06

here you use the open source things like

46:08

the Prometheus whatever and you run it

46:10

locally so that you can have the full

46:11

loop. Right. I I assume. Yep.

46:13

>> Right. Um

46:14

>> I think I think like

46:14

>> you want to minimize cloud cloud

46:16

dependencies.

46:16

>> You also want to make sure that you

46:19

think about what the agent has access

46:20

to, right? Like what does it see? Does

46:22

it go back into the loop like from the

46:24

most basic sense of uh you let it see

46:26

its own like calls traces. Uh it can

46:29

determine where it went wrong, right?

46:30

But are you feeding that back in? So,

46:32

you know, just the most basic level of

46:34

like you want to see exactly what's

46:35

input output. Like, does the agent have

46:38

access to what is being outputted, right?

46:41

It can self-improve a lot of these

46:42

things.

46:43

>> It's all text, right? My job is to

46:46

figure out ways to funnel text from one

46:48

agent to the other. Um,

46:49

>> it's so strange. Like, you know, like

46:51

way back at the start of this whole AI

46:53

wave, like uh Andrej was like, you know,

46:55

English is the hottest new programming

46:56

language is it's here. It's here. Yeah.

46:59

The features. Yeah, a lot of okay like a

47:02

lot of software a lot of stuff there's a

47:03

GUI, it's made for the human uh you know

47:06

we're seeing the the evolution of CLI

47:08

for everything right all tools have CLI

47:10

your can use them well but you know do

47:12

we get good vision do we get good little

47:14

sandboxes like right now it's a really

47:17

effective way right models love to use

47:19

tools, they love Bash, they love to

47:20

read through text so slap a CLI let it

47:23

let it go loose that works for

47:25

everything

47:25

>> that does yeah yeah yeah we've also been

47:28

adapting non-textual things to that

47:31

shape in order to uh improve uh model

47:34

behavior in some ways, right? Like we

47:37

want the agent to be able to see the UI.

47:40

Agents do not perceive visually in the

47:42

same way that we do, right? Like they

47:44

don't see a red box, they see red box

47:48

button, right? They see these things in

47:49

latent space. Uh so if we want

47:51

>> Yeah. Yeah. We have a thing that goes

47:53

off every time he goes to space.

47:55

>> Ding.

47:56

Anyway, um if we want to actually like

48:00

make it see the layout, it's almost

48:03

easier to rasterize that image to ASCII

48:06

and feed it in to the agent. Uh and

48:09

there's no reason you can't do both,

48:11

right? To like further refine how the

48:14

model perceives the object it's

48:16

manipulating.

48:17

>> Cool. Uh could we you want to talk about

48:19

a couple more of these layers that might

48:21

bear more introspection or that you have

48:23

personal passion for? I will say that

48:26

the coordination layer here was a really

48:28

tricky piece to get right.

48:29

>> Let's do it. Yeah, I'm all about that.

48:31

And this is Temporal's uh core core

48:33

thing.

48:34

>> This is where when we turn the spec into

48:38

Elixir where like the model takes a

48:40

shortcut, right? Like it's like, oh, I

48:42

have all these primitives that I can

48:44

make use of in this lovely runtime that

48:46

has native process supervision. uh which

48:48

is I think kind of a neat way to have

48:50

taken the spec and like made it more

48:54

achievable by making choices that

48:57

naturally map the domain, right? In the

49:00

same way that like you would

49:03

>> prefer to have a TypeScript monorepo

49:05

if you are doing full stack web

49:07

development, right? Because

49:09

>> the ability to share types across the

49:11

front end and back end reduces a lot of

49:14

complexity. Uh and because

49:15

>> that's what GraphQL used to be.

49:17

>> That's right. and and

49:18

>> I don't know if it's still alive, but

49:20

>> no humans in the loop here. So like my

49:23

own personal ability to write or not

49:25

write Elixir doesn't really have to bias

49:28

us away from using the right tool for

49:30

the job, which is just wild.

49:33

>> Love it. I love it. Yeah. I wonder if

49:35

any languages struggle more than others

49:38

because of this. I feel like everyone

49:40

has their own abstractions that would

49:42

make sense, but maybe it might be

49:44

slower. It might be more faulty where

49:46

like you would have to just kick the

49:48

server every now and then. Um I I don't

49:51

know. I think observability layer is

49:53

really well understood. Integration

49:54

layer MCP is dead. I think all these

49:56

like just like a really interesting

49:58

hierarchy to travel up and down. It's

50:01

common language for people working on

50:03

the system to understand.

50:04

>> The the policy stuff is really cool,

50:06

right? Like yeah, you don't really have

50:07

to build a bunch of code to make sure

50:09

the system waits for CI to pass. It's

50:11

your institutional knowledge.

50:12

>> Yeah, you just give it the GH CLI with

50:14

some text to say CI has to pass.

50:18

>> It makes the maintenance of these

50:19

systems a lot easier.

50:20

>> Do you think that like CLI maintainers

50:22

need to do anything special for

50:24

agents or just as is? It's good cuz like

50:26

I don't think when people made the

50:28

GitHub CLI they anticipated this

50:30

happening.

50:30

>> That's correct. The GH CLI is fantastic.

50:33

It's great. Super industry. If you want

50:34

to go try gh repo create, like gh pull

50:37

and then pull request number right gh

50:40

like 153 whatever right and then it it

50:42

like pulls

50:43

>> basically my only interaction with the

50:45

GitHub web UI at this point is gh pr

50:48

view --web

50:50

glance at the diff and be like sure

50:52

thing send it. Yeah. Yeah. Yeah. But um

50:56

the CLIs are nice because they're super token

50:58

efficient and they can be made more

51:01

token efficient really easily, right?

51:03

Like I'm sure you all have seen like I

51:06

go to Buildkite or Jenkins and I just

51:09

get this massive wall of build output

51:12

and in order to unblock the humans your

51:15

developer productivity team is almost

51:17

certainly going to write some code that

51:18

parses the actual exception out of the

51:20

build logs and sticks it in a sticky

51:22

note at the top of the page. And you

51:25

basically want CLIs to be structured in

51:27

a similar way, right? you're going to

51:29

want to patch --silent to Prettier

51:32

because the agent doesn't care that

51:34

every file was already formatted. It

51:37

just wants to know it's either formatted

51:39

or not, right? So they can then go run

51:41

the write command. Similarly like in our

51:45

pnpm sort of distributed script runner

51:48

when we had one, when you do --recursive

51:52

like it produces an absolute mountain of

51:56

text but all of that is for passing test

51:59

suites. So we ended up wrapping all of

52:01

this in another script

52:04

>> to suppress the

52:05

>> which you can vibe to generally output

52:07

the failing parts of the test. Yeah, you

52:09

could pipe uh errors versus the standard

52:12

out. I don't know. Okay,

52:14

whatever. Too much too much thinking to

52:16

have to do the CLI. I used to maintain a

52:18

CLI for my company and like Yeah, this

52:20

is this is like core very core to my

52:22

heart, but you're vibing my job.

52:25

>> That's right.
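The wrapper idea from this exchange (stay quiet on success, surface only the failing parts) can be sketched roughly as follows. This is an illustration, not the team's actual script: the function name `summarize_run`, the failure markers in `FAILURE_PATTERN`, and the 20-line fallback are all assumptions.

```python
import re

# Assumed failure markers; a real runner's output format may differ.
FAILURE_PATTERN = re.compile(r"\b(FAIL|FAILED|ERROR|Error)\b")


def summarize_run(output: str, exit_code: int, context: int = 2) -> str:
    """Reduce a noisy test log to what an agent needs to act on.

    On success, return a single token-cheap "ok". On failure, keep each
    line that matches a failure marker plus `context` lines after it.
    """
    if exit_code == 0:
        return "ok"
    lines = output.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if FAILURE_PATTERN.search(line):
            keep.update(range(i, min(len(lines), i + context + 1)))
    if not keep:
        # Nothing recognizable: fall back to the tail of the log.
        return "\n".join(lines[-20:])
    return "\n".join(lines[i] for i in sorted(keep))
```

A passing run costs the agent one word, and a failing run costs only the lines around the failure.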

52:27

>> Cool. Any other things? I mean, this is

52:29

a long spec. I I I appreciate that.

52:30

Like, it's it's like got a lot of strong

52:32

opinions in here. Any other things that

52:34

we should highlight? You know, I think

52:35

obviously you can spend the whole day

52:37

going through some of these, but like

52:39

you know, I I do think that some of

52:41

these have a lot of care or some of this

52:43

you might you might want to tell people,

52:45

hey, take this, but you know, make it

52:47

your own.

52:48

>> Fundamentally, software is made more

52:51

flexible when it's able to adapt to the

52:53

environment in which it is deployed,

52:54

right? Which means that things like

52:57

Linear or GitHub even are specified

53:01

within the spec, but not required pieces

53:04

of it, right? there's like a more

53:05

platonic ideal of the thing uh that you

53:07

could swap in like Jira or Bitbucket for

53:11

example, right? But being able to

53:14

tightly specify

53:16

things like the ID formats or how the

53:19

Ralph loop works for the individual

53:22

agents basically means you can get up

53:24

and running with a fully specified

53:27

system quickly that you then evolve

53:29

later on. I think we never intended for

53:32

this to be a static spec that you can

53:34

never change, right? It's more like a

53:37

blueprint to get something working up

53:40

and running

53:40

>> for you then to vibe later till your

53:42

heart's content.
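One way to picture "specified but swappable" is a small tracker description that pins down the ID format while leaving the concrete service open. This is a sketch under stated assumptions: the `Tracker` class, both ID patterns, and the example.com URLs are hypothetical and do not come from the spec itself.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracker:
    """A ticket system described only by what the rest of the system needs."""

    name: str
    id_pattern: re.Pattern
    url_template: str

    def is_valid_id(self, ticket_id: str) -> bool:
        return self.id_pattern.fullmatch(ticket_id) is not None

    def ticket_url(self, ticket_id: str) -> str:
        if not self.is_valid_id(ticket_id):
            raise ValueError(f"{ticket_id!r} is not a valid {self.name} ID")
        return self.url_template.format(id=ticket_id.lstrip("#"))


# Hypothetical formats: a key-style ID (such as ENG-153) and a number-style ID (#153).
KEY_STYLE = Tracker(
    "key-style", re.compile(r"[A-Z]+-\d+"), "https://tracker.example.com/issue/{id}"
)
NUMBER_STYLE = Tracker(
    "number-style", re.compile(r"#\d+"), "https://code.example.com/issues/{id}"
)
```

Swapping one tracker for another then means changing one value, while the tightly specified ID format keeps agents from guessing.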

53:43

>> You have like code and scripts in here

53:45

where it's like, oh, I mean I I think

53:48

this is a really good prompt. It's just

53:50

a very very long prompt.

53:52

>> Fundamentally, the agents are good at

53:54

following instructions. So, give them

53:55

instructions, right? And it will, you

53:58

know, improve the reliability of the

53:59

result, right? Like we much like the way

54:02

we use Symphony, we don't want folks to

54:04

have to monitor the agent as it is

54:06

vibing the system into existence. So

54:08

being very opinionated, very strict

54:12

around what these success criteria are

54:14

means that like

54:16

>> our deployment success rate goes up.

54:18

>> Yeah. Means we don't have to get tickets

54:20

on this thing.

54:22

>> I think it all goes back to that like go

54:23

to disposable, right? Like early on when

54:26

you had the CLI or you'd kick off a Codex

54:28

run, it would take two hours. you would

54:29

kind of want to monitor like, okay, I'm

54:31

in the workflow of just using one. I

54:33

don't want it to go down the wrong path.

54:34

I'll cut it off and but you know, just

54:36

shoot all four. Like that was my

54:37

favorite thing of the Codex app, right?

54:39

Just 4x it. Like it's okay. One of them

54:42

will probably be right, one of them

54:43

might be better. Stop stop overthinking

54:45

it. Like my my first example was

54:46

probably like deep research. when you

54:48

put out deep research and I'd ask it

54:50

something like I asked it something

54:51

about LLM, it thought it was legal

54:53

something and spent an hour came back

54:55

with a report completely off the rails

54:57

and I was like okay I got to monitor

54:58

this thing a bit no don't don't monitor

55:00

it just you you want to build it so that

55:01

it goes the right way and you don't want

55:03

to you don't want to sit there and

55:04

babysit right you don't want to babysit

55:06

your agents
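The "just 4x it" habit amounts to running several independent attempts and keeping the ones that pass an automated check, instead of watching any single run. A minimal sketch, assuming `attempt` and `accept` are stand-ins you supply; neither is a real Codex API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_n(
    attempt: Callable[[int], str],
    accept: Callable[[str], bool],
    n: int = 4,
) -> list[str]:
    """Run `n` attempts in parallel and return those that pass `accept`.

    `attempt` receives the attempt index so each run can vary (seed, prompt
    variant, and so on). Results keep their attempt order.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(attempt, range(n)))
    return [result for result in results if accept(result)]
```

The acceptance check is what replaces the babysitting: it plays the role of the guardrail fed back into the prompt or codebase.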

55:06

>> with that deep research query that you

55:09

made looking at the bad result you

55:11

probably figured out you needed to tweak

55:13

your prompt a bit right like that's that

55:15

guardrail that you fed back into the

55:17

code base for the ask your prompt to

55:20

further align the agent's execution.

55:22

Same sort of concepts apply there too

55:24

>> when you talk I mean how are the

55:25

customers feeling

55:27

>> For Symphony, I think we have none,

55:29

right this is a thing we have put out

55:31

into the world

55:31

>> I mean, Symphony is internal, right? As

55:33

long as you're happy you're the customer

55:34

>> that's right

55:35

>> uh just you know what's what's the

55:38

external view

55:39

>> I say folks are very excited about this

55:42

way of distributing software and ideas

55:45

in cheap ways. For us as users, it has

55:49

again pushed the productivity 5x

55:52

Which means I think there's something

55:54

here that's like a durable pattern

55:57

around removing the human from the loop

56:00

and figuring out ways to like trust the

56:02

output. Right? The video that is shared

56:06

here

56:07

>> is the same sort of video we would

56:09

expect the coding agent to attach to the

56:11

PR

56:12

>> that is created. You know that's part of

56:14

building trust in this system. And

56:17

that's to me like fundamentally what has

56:19

been cool about building this is like

56:23

it more closely pushes that persona of

56:26

the agent working with you to be like a

56:28

teammate, right? I I don't shoulder surf

56:31

you like for the tickets that you work

56:34

on during the week. I would never think

56:36

that I would want to do that. I wouldn't

56:38

want a screen recording of your entire

56:41

session in Cursor or Claude Code. I

56:43

would expect you to do what you think

56:46

you need to do to convince me that the

56:48

code is good and mergeable

56:50

>> and compress that full trajectory in a

56:53

way that is legible to me the reviewer.

56:55

>> Y

56:56

>> it's just uh and and you can just do

56:58

that because

57:00

>> Codex will absolutely sling some FFmpeg around.

57:03

It's great.

57:04

>> Oh, I mean, FFmpeg is the OG, like, god

57:07

CLI.

57:08

>> Yeah. Swiss army chainsaw.

57:10

I used to say there's a SaaS, micro-SaaS,

57:14

let's call it in every flag in FFmpeg.

57:17

>> Oh, for sure.

57:17

>> You know what I mean? For sure.

57:18

>> Like just host it as a service, put a UI

57:20

on it. People who don't know FFmpeg will

57:22

pay for it.
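As an illustration of the kind of FFmpeg call an agent might make before attaching a recording to a PR, here is a hedged sketch. The function name, file names, and default values are assumptions; the flags themselves (`-vf scale`, `-c:v libx264`, `-crf`, `-preset`, `-an`) are standard FFmpeg options.

```python
def compress_recording_cmd(
    src: str, dst: str, width: int = 1280, crf: int = 28
) -> list[str]:
    """Build an FFmpeg command that shrinks a screen recording.

    Scales to `width` pixels wide (height rounded to an even number),
    re-encodes with H.264 at the given quality, and drops the audio track.
    """
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"scale={width}:-2",
        "-c:v", "libx264",
        "-crf", str(crf),
        "-preset", "fast",
        "-an",
        dst,
    ]
```

Passing the returned list to `subprocess.run` would execute it, assuming `ffmpeg` is installed and on the PATH.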

57:23

>> When we were first experimenting with

57:25

this, it was a wild feeling to be at the

57:27

computer with just, like, windows just

57:30

popping up all over the place and

57:32

getting captured and files appearing on

57:34

my desktop. like very much felt like the

57:37

future to have a thing controlling

57:39

my computer for like actual productive

57:41

use, right? Like I'm just there keeping

57:43

it like awake jiggling the mouse every

57:45

once in a while.

57:48

>> That's what some office workers do. They

57:50

buy a mouse jiggler.

57:51

>> That's right. That's right.

57:53

>> One thing I wanted to ask so like okay

57:55

as stuff is so good is disposable async

57:57

shoot off a bunch of agents. One

57:59

question is like okay are you always

58:00

like an extra-high thinking guy, and where

58:03

do you see Spark? So 5.3 Spark, like,

58:06

there's a lot of me wanting to make

58:08

quick changes I'm not going to open up

58:10

an IDE, I'm not going to do anything, but I

58:11

will say okay fix this little thing

58:13

change a line, change a color. Spark is

58:15

great for that but like am I still the

58:18

bottleneck you know like why don't I

58:19

just let that go back in like just riff

58:21

on that you know is there

58:23

>> Spark is such a different model compared

58:25

to the the extra high level reasoning

58:30

that you get in these you know

58:32

>> to be fair for people it is a different

58:33

model different architecture different

58:35

like it doesn't support it just

58:36

>> it's incredibly fast

58:38

>> I have not quite figured out how to use

58:40

it yet, to be honest. At first I was

58:42

was adapting it to the same sorts of

58:44

tasks I would use x high reasoning for

58:47

and it would blow through three

58:48

compactions before writing a line of

58:50

code

58:50

>> and I mean that's another big thing with

58:53

5.4, right? Million-token context,

58:56

which is huge in agentic, right? Like you

58:59

can just run for longer before you have

59:00

to compact the more tokens you can spend

59:03

on a task before compacting like the

59:04

better you'll do
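To make the window-size argument concrete, here is a toy model of compaction. It is a sketch, not how any real harness works: the step costs, the window sizes, and the assumption that compaction shrinks the context to a fixed-size summary are all invented for illustration.

```python
def count_compactions(step_tokens: list[int], window: int, summary_tokens: int) -> int:
    """Count how often a run must compact its context.

    Each step adds `cost` tokens. When a step would overflow `window`,
    the context is first replaced by a summary of `summary_tokens`.
    """
    used = 0
    compactions = 0
    for cost in step_tokens:
        if used + cost > window:
            used = summary_tokens
            compactions += 1
        used += cost
    return compactions


# Twelve hypothetical steps of 100k tokens each (1.2M tokens of work).
steps = [100_000] * 12
small_window = count_compactions(steps, window=250_000, summary_tokens=20_000)
large_window = count_compactions(steps, window=1_000_000, summary_tokens=20_000)
```

Under these made-up numbers the smaller window compacts five times and the larger one once, which is the intuition behind "run for longer before you have to compact".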

59:05

>> that's right that's right I'm not sure

59:07

uh how to deploy spark I think your

59:09

intuition is right that like it's very

59:12

great for spiking out prototypes

59:14

exploring ideas quickly doing those

59:16

documentation updates it is fantastic

59:19

for us in taking that feedback and

59:22

transforming it into a lint where we

59:24

already have good infrastructure for

59:26

ESLint in the codebase. These sorts

59:28

of things it's great at and it allows us

59:30

to unblock quickly doing those like

59:33

antifragile healing tasks in the

59:35

codebase.

59:35

>> Yeah, that makes sense. So you're push

59:38

you guys are pushing models to the

59:39

freaking limit. What can current models not

59:42

do well yet?

59:43

>> They're definitely not there on being

59:46

able to go from new product idea to

59:50

prototype

59:51

>> single one shot. This is where I find I

59:53

spend a lot of time steering is

59:56

translating end state of a mock for a

59:58

net new thing, right? Think no existing

1:00:01

screens into product that is playable

1:00:04

with. Similarly, while this has gotten

1:00:07

better with each model release, like the

1:00:09

gnarliest refactorings are the ones that

1:00:11

I spend my most time with, right? The

1:00:13

ones where I am interrupting the most,

1:00:14

the ones where I am now double clicking

1:00:17

to build tooling to help decompose

1:00:19

monoliths and things like that. This is

1:00:21

a thing I only expect to get better,

1:00:23

right? Over the course of a month, we

1:00:24

went from the low complexity tasks to

1:00:27

like low complexity and big tasks in

1:00:30

both these directions. So, this is what

1:00:32

it means to not bet against the model,

1:00:34

right? You should you should expect that

1:00:35

it is going to push itself out into

1:00:37

these higher and higher complexity

1:00:38

spaces. Yeah. So, the things we do are

1:00:41

robust to that. It just basically means

1:00:43

I'll be able to spend my time elsewhere

1:00:45

and figure out what the next bottleneck

1:00:47

is. I

1:00:47

>> I do think it's also a bit of a

1:00:49

different type of task, right? Like

1:00:50

Codex is really good at codebase

1:00:52

understanding working with code bases

1:00:54

but companies like Lovable, Bolt,

1:00:57

Replit, they solve a very different

1:01:00

problem scaffold of zero to one right

1:01:02

idea to a product, and it's like there

1:01:05

are people working on that and models

1:01:06

models are also pushing like step

1:01:08

function changes there it's just kind of

1:01:11

different than the software engineering

1:01:12

agents you see today right

1:01:14

>> Like I said, the model is isomorphic to

1:01:17

myself uh the only thing that's

1:01:20

different is figuring out how to get

1:01:21

what's in here into context for the

1:01:24

model. And for these white-space sort

1:01:27

of projects, I myself I'm just not good

1:01:30

at it. uh which means that often over

1:01:33

the agent trajectory I realize the bits

1:01:36

that were missing which is why I find I

1:01:38

need to have the synchronous interaction

1:01:40

and I expect with the right harness with

1:01:42

the right scaffold that's able to tease

1:01:44

that out of me or refine the possible

1:01:47

space right to be super opinionated

1:01:49

around the frameworks that are deployed

1:01:51

or to put a template in place right

1:01:53

these are ways to give the model all

1:01:56

those non-functional requirements that

1:01:57

extra context to anchor on and avoid

1:02:00

that wide dispersion of possible

1:02:01

outcomes.

1:02:02

>> Thank you for that. Uh I wanted to talk

1:02:04

a little bit about Frontier.

1:02:05

>> Yeah, sure. Uh overall, uh you guys

1:02:07

announced it maybe like a month ago. Um

1:02:10

and there's there's a few charts in here

1:02:11

and there if it's kind of like your

1:02:13

enterprise offering is kind of what I

1:02:15

view it. Is there one product or is

1:02:17

there many? I can't speak to the full

1:02:19

product roadmap here but what I can say

1:02:21

is that frontier is the platform by

1:02:24

which we want to do AI transformation of

1:02:27

every enterprise and from big to small

1:02:30

and the way we want to do that is by

1:02:32

making it easy to deploy highly

1:02:35

observable, safe, controllable,

1:02:39

identifiable agents into the workplace

1:02:42

right we want it to work with your

1:02:44

company-native IAM stack. We want it to

1:02:48

plug into the security tooling

1:02:51

that you have. We want it to be able to

1:02:54

plug into the workspace tools that you

1:02:57

use.

1:02:57

>> So, you're just going to be shipping

1:02:58

specs,

1:03:00

>> right?

1:03:01

>> We expect that there will be some

1:03:03

harness things there. Agents SDK is a

1:03:06

core part of this to enable both startup

1:03:09

builders as well as enterprise builders

1:03:12

to have a works by default harness that

1:03:16

is able to use all the best features of

1:03:17

our models from the shell tool down to

1:03:20

the Codex harness with file attachments

1:03:22

and containers and all these other

1:03:24

things that we know go into building

1:03:27

highly reliable complex agents. We want

1:03:31

to make that great and we want to make

1:03:32

it easy to compose these things together

1:03:34

in ways that are safe. For example,

1:03:36

right, like the gpt-oss-safeguard model,

1:03:40

for example, one thing that's really

1:03:41

cool about it is it ships the ability to

1:03:44

interface with a safety spec. Safety

1:03:46

specs are things that are bespoke to

1:03:49

enterprises. We owe it to these folks to

1:03:51

figure out ways for them to instrument

1:03:53

the agents in their enterprise to avoid

1:03:56

exfiltration in the ways they

1:03:57

specifically care about to know about

1:03:59

their internal company code names these

1:04:01

sorts of things. So providing the right

1:04:03

hooks to make the platform customizable

1:04:07

but also you know mostly working by

1:04:10

default for folks is kind of the the

1:04:11

space we are trying to explore here.

1:04:13

>> Yeah. And this is like you know the

1:04:15

Snowflakes of the world just need this,

1:04:17

right. Yeah. Brexes of the world,

1:04:18

Stripes. Yeah, makes sense. I was going

1:04:20

to go back to your, you know, I I I

1:04:21

think the demo videos that you guys had

1:04:24

was was pretty illustrative. It's kind

1:04:26

of like also to me um an example of very

1:04:29

large scale agent management.

1:04:31

>> Yes. Like you give people a control

1:04:32

dashboard that if you play if you like

1:04:34

play any one of these like multiple

1:04:36

agent things. You can dig down to the

1:04:39

individual instance and see what's going

1:04:40

on.

1:04:40

>> Yes, of course.

1:04:42

>> But who's the user? Is it is it like the

1:04:44

CEO, the CTO, CIO, something like that?

1:04:47

So, you know, at least my personal

1:04:50

opinion here, the buyer that we're

1:04:51

trying to build product for here is one

1:04:54

and employees who are making productive

1:04:56

use of these agents, right? That's going

1:04:57

to be whatever surfaces they appear in,

1:04:59

the connectors they have access to,

1:05:01

things like that. Something like this

1:05:03

dashboard is for IT, your GRC and

1:05:07

governance folks, your AI innovation

1:05:10

office, your security team, right? the

1:05:13

stakeholders in your company that are

1:05:15

responsible for successfully deploying

1:05:18

into the spaces where your employees

1:05:20

work as well as doing so in a safe way

1:05:23

that is consistent with all the

1:05:25

regulatory requirements that you have

1:05:27

and customer attestations and things

1:05:29

like that. So it is kind of an iceberg

1:05:33

beneath the actual end. Yeah, you you

1:05:36

jump like every I guess layer in the UI

1:05:40

is like going down the layer of

1:05:42

abstraction in terms of the agent, right?

1:05:43

>> Yep.

1:05:44

>> Yeah. Yeah. I think it's good.

1:05:45

>> Yeah. The the ability to dive deep into

1:05:47

the individual agent trajectory level is

1:05:49

going to be super powerful

1:05:51

>> not only for like from like a security

1:05:54

perspective but also from like someone

1:05:56

who is accountable for developing

1:05:57

skills. One thing that was interesting

1:06:00

that we also blogged about shipping was

1:06:02

uh an internal data agent which uses a

1:06:05

lot of the Frontier technology in order

1:06:07

to make our data ontology accessible to

1:06:10

the agent and things like that to

1:06:12

understand what's actually in the data

1:06:14

warehouse.

1:06:14

>> Yeah. Semantic layer type things. Uh I

1:06:17

was briefly part of that that world. Uh

1:06:19

is it solved? I don't know. It's

1:06:21

actually really hard for humans to agree

1:06:23

on what revenue is.

1:06:24

>> Yes.

1:06:25

>> You know.

1:06:25

>> Yes. What is what is what is an active

1:06:27

user?

1:06:28

>> There's like what five data scientists

1:06:30

in the company that have defined this

1:06:31

golden

1:06:31

>> they all different yeah and like no and

1:06:34

there's also internal politics as to

1:06:36

attribution of like I I'm marketing I'm

1:06:38

responsible for this much and sales is

1:06:40

responsible for this much and they all

1:06:42

add up to more than 100 and I'm like

1:06:45

well you guys have different

1:06:46

definitions.
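The "adds up to more than 100" problem is easy to reproduce with made-up numbers. In this sketch the deals, field names, and attribution rules are all hypothetical; it only shows how two reasonable definitions can each claim the same revenue.

```python
# Hypothetical closed deals, flagged by which team touched them.
deals = [
    {"amount": 100, "campaign_touch": True, "sales_rep": True},
    {"amount": 50, "campaign_touch": True, "sales_rep": False},
    {"amount": 50, "campaign_touch": False, "sales_rep": True},
]


def share(deals: list[dict], key: str) -> float:
    """Percent of total revenue a team claims under its own definition."""
    total = sum(d["amount"] for d in deals)
    claimed = sum(d["amount"] for d in deals if d[key])
    return 100 * claimed / total


marketing_share = share(deals, "campaign_touch")  # marketing claims any touched deal
sales_share = share(deals, "sales_rep")  # sales claims any deal with a rep
combined = marketing_share + sales_share
```

Here each team claims 75% and together they claim 150%, because the 100-unit deal is counted under both definitions. An agent reading the warehouse needs to know which definition a given number uses.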

1:06:46

>> Yeah. And if you're a startup everything

1:06:48

is ARR, you know.

1:06:49

>> So so I think that's that's cool. Oh you

1:06:51

guys blogged about this. Okay. I didn't

1:06:52

I didn't see this. Uh yeah. Is this the

1:06:54

same thing?

1:06:54

>> I don't Uh, is this what you're

1:06:56

referring to?

1:06:56

>> Uh, yes.

1:06:57

>> Okay. Well, we'll send people to read

1:06:58

this as our data agency.

1:07:00

>> This one.

1:07:01

>> Uh, yeah. I don't know if you you have

1:07:03

any highlights. I

1:07:04

>> No, no, no. I mean, in general from the

1:07:05

point, a lot of good things to read.

1:07:06

>> Yeah. Yeah. Lot lots of homework for

1:07:08

people. Uh, no, but like data as the

1:07:11

feedback layer. You need to solve this

1:07:13

first in order to have the products

1:07:15

feedback loop closed. That's right. Like

1:07:16

so for the agents to to understand and

1:07:18

like this is not something that humans

1:07:20

have not know of this like in

1:07:21

>> this is how this is how you build

1:07:24

agents that do more than coding, right?

1:07:27

>> to actually understand how you operate

1:07:29

the business you have to understand what

1:07:32

revenue is what your customer segments

1:07:34

are right

1:07:35

>> what your product lines are right like

1:07:38

one thing that's in like looping back to

1:07:40

the codebase that we described here for

1:07:42

harnessing one thing that's in core

1:07:44

beliefs.md is like who's on the team,

1:07:47

what product we're building, who our end

1:07:50

customers are, who our pilot customers

1:07:53

are, what the full vision of what we

1:07:56

want to achieve over the next 12 months

1:07:58

is like these are all bits of context

1:08:00

that inform how we would go about

1:08:02

building the software. Oh my god. So, we

1:08:04

have to give it to the agent, too.

1:08:06

>> I'm guessing that stuff is like pretty

1:08:08

dynamic and it changes over time, too,

1:08:09

right? Like part of it was it's not just

1:08:11

a big spec. you you have it as one of

1:08:13

the things and it will iterate.

1:08:15

>> One one thing that I think is going to

1:08:17

break your mind even more is we have

1:08:18

skills for how to properly generate deep

1:08:22

fried memes and have reacji culture in

1:08:25

Slack because with the Slack ChatGPT app

1:08:29

that you're able to use and Codex, like,

1:08:32

I can get the agent to post on my

1:08:34

behalf.

1:08:35

>> Just it's part of humor. Humor is part

1:08:37

of AGI. Uh is it is it funny? It's

1:08:40

pretty good. Yeah.

1:08:41

>> Okay. Yeah, it's pretty good at making,

1:08:43

you know, it's it's a lot of like I

1:08:45

think humor is like a really hard

1:08:46

intelligence test, right? Like it's like

1:08:47

you have to get a lot of context into

1:08:49

like very few words.

1:08:50

>> This is why 5.4 is

1:08:52

such a big uplift for our varieties.

1:08:54

It's it's the memeing. Yeah, for sure.

1:08:58

>> Yeah. Yeah, it's really cool.

1:08:59

>> So 5.4 can chip us. That's the takeaway.

1:09:02

>> Yeah. Maybe um maybe when y'all are uh

1:09:06

done here today, ask Codex to go over

1:09:08

your coding agent sessions and to roast

1:09:10

you. Um love it.

1:09:13

>> I'll give it a shot. Give it a shot. Uh

1:09:14

just coming back to the the the final

1:09:16

point I wanted to make is yeah, I I

1:09:18

think that there there are multiple

1:09:20

other like you guys are working on this,

1:09:22

but this is a pattern that every other

1:09:25

company out there should adopt

1:09:27

regardless of whether or not they work

1:09:28

with you. To me this like I saw this I

1:09:31

was like every company needs this.

1:09:33

I mean

1:09:33

>> this is multiple business what it takes

1:09:35

to get people to Yes. Yeah. Actually

1:09:38

realize the benefits and distribute

1:09:41

layer. Um and it's it's it I think it

1:09:44

sounds boring to people like oh you know

1:09:46

it's for safeguards and and whatever but

1:09:47

like um I think you to to handle agents

1:09:51

at scale like you're envisioning here.

1:09:53

Um I don't know if it's like a real

1:09:55

screenshot like a demo but like this is

1:09:57

what you need. This is my original sort

1:09:59

of view of what Temporal was supposed to

1:10:01

be like you you built this dashboard and

1:10:03

you basically have every long-running

1:10:05

process in the company in one dashboard

1:10:07

and that's it.

1:10:09

>> That's right. That's right.

1:10:10

>> Yeah. I think it's pretty it's pretty

1:10:12

like customized towards every

1:10:14

enterprise, right? Like you care about

1:10:15

different things.

1:10:16

>> There's a lot of customization, right?

1:10:17

But like I mean there'll be multiple

1:10:19

unicorns just doing this as a service.

1:10:21

Like I don't know. I'm like very very

1:10:24

Frontier-pilled if you can't tell.

1:10:26

>> Amazing. But but like it only clicked

1:10:28

cuz obviously this came out first, then

1:10:30

harness and then Symphony and it only

1:10:32

clicked for me that like this is

1:10:34

actually kind of the thing you ship to

1:10:36

do that.

1:10:37

>> Yeah. Yeah. There's a set of building

1:10:38

blocks here that we assembled into these

1:10:41

agents and the building blocks

1:10:43

themselves are part of the product,

1:10:45

right? the ability to

1:10:47

>> steer, revoke authorization if a model

1:10:51

becomes misaligned. Like all of this is

1:10:53

accessible through Frontier

1:10:54

>> and there's going to be a bunch of

1:10:57

stakeholders in the company that have

1:10:59

>> the things they need to see in the

1:11:01

platform to get to Yes.

1:11:03

>> So we'll build all those into Frontier

1:11:05

so that we can actually do the

1:11:07

widespread deployment. That's the fun

1:11:09

part.

1:11:09

>> Yeah. Yeah. I'm also calling back to

1:11:11

like uh there's this like levels of AGI

1:11:13

like I don't know if OpenAI is still

1:11:15

talking about this but they used to talk

1:11:16

about five levels of AGI and one of it

1:11:19

was like oh it's like an intern and the

1:11:21

coding software engineer, then at some

1:11:23

point it was AI organization and this is

1:11:26

it right this is level four or five I

1:11:28

can't remember which which level but

1:11:30

it's somewhere along that path was this

1:11:32

>> you know how I mentioned that my team is

1:11:34

having fun sprinting ahead here right

1:11:36

and we do this thing where we're

1:11:38

collecting all the agent trajectories

1:11:39

from Codex to slurp them up and distill

1:11:41

them like this is what it means to build

1:11:43

our team level knowledge base you know

1:11:46

happen to reflect it back into the

1:11:47

codebase but it doesn't have to be that

1:11:49

way right you know and it doesn't have

1:11:50

to be bound to just Codex, right? I want

1:11:53

ChatGPT to also learn our meming culture

1:11:56

and also the product we are building and

1:11:57

how right so that when I go ask it it

1:12:00

also has the full context of the way I

1:12:02

do my work and I'm super excited for

1:12:05

Frontier to enable this

1:12:06

>> yeah amazing what are the the model

1:12:09

people say when they see you do this

1:12:12

like you have a lot of feedback

1:12:14

obviously you have a lot of usage you

1:12:15

have a lot of trajectories I don't I

1:12:17

don't imagine a lot of it's useful to

1:12:19

them but some of it is

1:12:21

>> you have this too you deploy a billion

1:12:23

tokens of intelligence a day and this

1:12:25

was you know this was at the beginning

1:12:28

of 2026. You're, yeah, you know, cooking.

1:12:31

>> yeah there's this fundamental tension

1:12:33

which I think you have talked about

1:12:35

between whether or not we invest deeper

1:12:37

into the harness or we invest deeper

1:12:39

into the training process to get the

1:12:41

model to do more of this by default.

1:12:42

Yeah.

1:12:43

>> And I think success for the way we are

1:12:47

operating here means the model gets

1:12:50

better taste because we can point the

1:12:53

way there and none of the things we have

1:12:56

built actively degrade agent performance

1:12:59

cuz really all they're doing is running

1:13:02

tests and like running tests is a good

1:13:05

part of what it means to write reliable

1:13:06

software. If we were building an entire

1:13:10

separate ROS scaffold around Codex to

1:13:13

restrict its output, that I think would

1:13:15

be like additional harness that would be

1:13:18

prone to being scrapped. But yeah, if

1:13:21

instead we can build all the guardrails

1:13:23

in a way that's just native to the

1:13:24

output that Codex is already producing,

1:13:26

which is code, I think one, no friction

1:13:29

with how the model continues to advance,

1:13:31

but also like just good engineering. And

1:13:34

that's that's the whole point.

1:13:36

>> Yeah. So I've had similar discussions

1:13:38

with research scientists where the RL

1:13:41

equivalent is on-policy versus off-policy.

1:13:42

>> Yeah.

1:13:43

>> And you're basically saying that you

1:13:45

should build an on policy harness which

1:13:47

is already like well within distribution

1:13:49

and you modify it from there. But if you

1:13:50

build off policy well it's not that

1:13:52

useful.

1:13:53

>> That's right.

1:13:53

>> Super cool. Well any thoughts any things

1:13:56

that we haven't covered that we should

1:13:57

get get out there?

1:13:58

>> Just uh I've been super excited to kind

1:14:02

of benefit from all the cooking that the

1:14:04

Codex team has been doing. They

1:14:05

absolutely ship relentlessly. This is

1:14:08

one of our core engineering values. Ship

1:14:09

relentlessly and they the team there

1:14:11

embodies it to an extreme degree. Oh

1:14:13

yeah, to have 5.3 and then Spark and 5.4

1:14:17

come out within like what feels like a

1:14:19

month is just phenomenally fast.

1:14:21

>> Exactly a month ago it was 5.3 and

1:14:22

yesterday was 5.4. Yeah. I mean, do we

1:14:25

have every month now? Is 5.5 next? Like

1:14:29

>> You know, I can't say that. The Polymarkets

1:14:31

would be very upset, right? Uh

1:14:35

well I I think it's interesting that

1:14:36

like it's also correlated with the

1:14:38

growth you know they they announced that

1:14:39

it's like 2 million uh users but like

1:14:42

almost don't care about Codex anymore

1:14:44

like this is it this is the game man

1:14:45

like it's like coding cool soft like

1:14:48

knowledge work

1:14:49

>> that's right you know this is the thing

1:14:51

to chase after and uh you know this is

1:14:53

one of the things that my team is

1:14:54

excited to support

1:14:55

>> get the whole like self-hosted harness

1:14:57

thing working which you have done and

1:14:59

like the rest of us are trying to figure

1:15:00

out how to catch up but like then do

1:15:03

things, you know, right with you.

1:15:05

>> Do things.

1:15:06

>> That's right. You can just do things.

1:15:07

That's the line for the episode.

1:15:09

>> That's it. Any other calls to action?

1:15:11

You're based in Seattle. Your

1:15:13

team, I'm guessing.

1:15:14

>> New Bellevue office.

1:15:15

>> New Bellevue office. We just had the

1:15:16

grand opening yesterday as of the

1:15:18

recording date. Uh which was fantastic.

1:15:20

Beautiful building. Super excited to be

1:15:21

part of the Bellevue community building

1:15:23

the future in Washington. And I would

1:15:27

say that there is lots of work to be

1:15:29

done in order to successfully serve

1:15:31

enterprise customers here uh in

1:15:34

Frontier. We are certainly hiring. And

1:15:37

if you haven't tried the Codex app yet,

1:15:39

please give it a download. We just

1:15:41

passed 2 million weekly active users,

1:15:43

growing at a phenomenally fast rate, 25%

1:15:46

week over week. Please come join us.

1:15:50

Uh yes and I think that's an interesting

1:15:53

I don't know my my final observation um

1:15:56

OpenAI is a very San Francisco-centric

1:15:58

company like I I know people who have

1:16:00

been who turned down the job or didn't

1:16:02

get the job because they didn't want to

1:16:03

move to SF and now they just don't have

1:16:05

a choice right you have to open the

1:16:07

London, you have to open the Seattle

1:16:09

and I wonder if that's going to be a

1:16:11

shift in the the culture right obviously

1:16:13

you can't say but

1:16:14

>> I was uh one of the first engineering

1:16:16

hires out of our Seattle office so See

1:16:19

it was very natural.

1:16:20

>> Success has been part of what I have

1:16:22

been building toward and it has grown

1:16:24

quite well. Right. We have durable

1:16:26

products and lines of business that are

1:16:28

built out of there. Uh ton of zero to

1:16:31

one work happening as well which is kind

1:16:34

of the core essence of the way we do

1:16:37

applied AI work at the company to sprint

1:16:40

after it uh new to figure out where we

1:16:42

can actually successfully deploy the

1:16:44

model. So uh yes 100%. We also have a

1:16:47

New York office too uh that has a ton of

1:16:49

engineering presence.

1:16:50

>> Yeah. Uh exa exactly that's these these

1:16:52

are my road maps for AIE.

1:16:55

>> Wherever people hire engineers I will

1:16:57

go. That's right.

1:16:58

>> It's a cool office too. New York is the

1:17:00

old REI building I believe. The REI

1:17:02

office.

1:17:02

>> Yeah it's just No, you'll never be as

1:17:03

big. Right. New York is like you can't

1:17:05

get the size of office that they need.

1:17:08

The the New York Seattle has a very like

1:17:11

office Mad Men sort of vibe. It's it's

1:17:14

beautiful. Uh the the Bellevue one is

1:17:16

very green, gold fixtures, very Pacific

1:17:19

Northwest is very cool place

1:17:22

which a lot of people are like there

1:17:23

for. People like New York, they want to

1:17:25

be in New York, right?

1:17:26

>> Yeah. Yeah. We have a fantastic

1:17:28

workplace team that has been building

1:17:29

out these offices. It really is a

1:17:31

privilege to work here.

1:17:32

>> Yeah. Excellent. Uh okay. Well, thank

1:17:33

you for your time. Uh you've been very

1:17:35

generous and uh you you've been cooking.

1:17:37

So, I'm going to let you get back to

1:17:38

cooking.

1:17:39

>> It's been amazing chatting with you

1:17:40

folks. Uh happy Friday.

1:17:42

>> Happy Friday.
