Full Transcript

From Chaos to Choreography: Multi-Agent Orchestration Patterns That Actually Work — Sandipan Bhaumik

26:30 · 4,072 words · ~20 min read · English · Transcribed Apr 19, 2026
0:00

Hi everyone, I'm Sandy. I have spent 18

0:02

years building data systems, a major

0:04

part of it focusing on building and

0:06

scaling distributed data systems in the

0:08

cloud. I've done it for multi-tenant

0:11

systems for software and SaaS companies,

0:13

and then for scaling data and AI

0:15

platforms in regulated industries like

0:17

financial services and healthcare. I've

0:19

learned a great deal about production

0:21

grade distributed systems while I have

0:23

been working at AWS and now at

0:25

Databricks. For the last 2 years, I've

0:27

been deploying multi-agent AI systems in

0:29

production. And I have watched brilliant

0:32

engineers make the same mistakes over

0:35

and over. They think adding more agents

0:38

is just like adding more features. It's

0:40

not. It's building a distributed system.

0:43

And today, I'm going to show you the

0:44

patterns that actually work when you

0:47

make the transition. These are lessons

0:49

that I have learned working in the

0:51

trenches, and today I'm here to share them

0:53

with you. Here's what we're covering

0:54

today. First, the problem. I'll share

0:57

with you a very basic production war story

1:00

about race conditions and why complexity

1:03

explodes when you go from one agent to

1:05

five agents.

1:07

I'll talk about the patterns,

1:09

choreography and orchestration patterns

1:11

for coordination of agents. I'll talk

1:13

about state management, talk about

1:16

failure recovery and how we can

1:19

design for failure in production

1:21

systems. And then I'll share what a

1:23

production grade architecture will look

1:24

like in as simple a way as possible. And

1:28

I'll also show you an example of how we

1:30

build this on Databricks. So, let's dive

1:32

into it. You see, one agent works

1:34

beautifully. You have got your LLM, some

1:37

prompts, maybe a retrieval augmented

1:40

generation pipeline, maybe some tool

1:43

calls. It demos great. Leadership loves

1:45

it. You feel happy and your team is

1:48

happy. And then, product comes back with

1:51

a request that changes everything. They

1:54

want five more agents. And here's what

1:56

happens. You think, "Okay, I know how to

1:58

build agents, and I will add five more."

2:01

Except, now you have coordination

2:03

problems. Agent A produces data that

2:07

Agent B needs. Agent C is waiting on

2:09

both Agent A and Agent B. Agent D just

2:12

updated the shared state that Agent B

2:14

was reading, and Agent E just crashed

2:18

and took down this entire workflow.

2:20

This is no longer an AI problem. This is

2:24

a distributed system problem. And most

2:26

of you didn't sign up to be distributed

2:29

systems engineers. Let me tell you about

2:31

a production deployment where this went

2:33

very wrong. We built a credit

2:35

decisioning system for a financial

2:36

services company. The first agent,

2:38

credit score calculation, worked

2:40

perfectly. It worked great in demos, 2

2:42

weeks in production, zero issues. Then

2:45

we added four more agents, income

2:46

verification, risk assessment, fraud

2:49

detection, and final approval.

2:51

We deployed all five. In 3 days'

2:54

time, we started seeing weird approvals.

2:57

20% of the decisions had incorrect

3:00

risk ratings. Customers who should have

3:02

been flagged were getting approved. The

3:04

business team was panicking. It took us

3:06

2 days to find out what was happening.

3:09

The credit score agent calculated a score of

3:11

750 and wrote to the database. The risk

3:14

assessment agent, on the other hand,

3:16

read from the database 500 milliseconds

3:19

later and got a score of 680 for the

3:22

same customer. Why did it happen?

3:25

Because we had a caching layer for

3:26

customer records. The write to

3:28

PostgreSQL succeeded, but the cache

3:31

was not invalidated. The risk agent read

3:35

from the cache, and it got stale data.
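
The stale read described here can be reproduced in a few lines. This is a minimal sketch, assuming a cache-aside read path; the two dictionaries are hypothetical stand-ins for PostgreSQL and the caching layer, and none of the names come from the talk.

```python
# Hypothetical stand-ins: plain dicts play the roles of PostgreSQL and the cache.
database = {"cust-1": 680}
cache = {"cust-1": 680}  # warmed earlier with the old score


def write_score(customer_id, score, invalidate_cache):
    """Credit score agent: persist the new score, optionally invalidating the cache."""
    database[customer_id] = score
    if invalidate_cache:
        cache.pop(customer_id, None)


def read_score(customer_id):
    """Risk agent: cache-aside read (cache first, database only on a miss)."""
    if customer_id not in cache:
        cache[customer_id] = database[customer_id]
    return cache[customer_id]


# The write succeeds, but nobody tells the cache: the risk agent sees the old value.
write_score("cust-1", 750, invalidate_cache=False)
stale = read_score("cust-1")

# Same write with invalidation: the cache miss forces a re-read from the database.
write_score("cust-1", 750, invalidate_cache=True)
fresh = read_score("cust-1")

print(stale, fresh)
```

The only difference between the two runs is whether the writer invalidates the cached entry, which is the coordination step that was missing.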

3:40

It used the wrong score and made the

3:43

wrong decision. This is a classic

3:45

distributed systems problem. We had

3:47

a caching layer between the agents and the

3:50

database. Cache invalidation failed, and

3:53

the agent was reading stale values. The

3:56

race condition wasn't in the database,

3:58

it was in the architecture. Multiple

4:01

agents, shared cache, no coordination on

4:04

cache invalidation. This took us quite a

4:07

while to find the pattern. It created

4:09

delays in delivery and led to wrong

4:13

decisions. And here's the lesson we

4:15

learned. The problem was, of course, not

4:17

with the model. The problem wasn't with

4:19

the prompts. The problem was we built a

4:22

distributed system without distributed

4:24

system thinking. And that's what kills

4:27

multi-agent projects, not bad AI, but

4:30

bad architecture. Now, I will show you

4:33

the architecture that works. We will

4:34

also look into a production grade

4:36

architecture. But first, let's

4:39

understand why this complexity explodes

4:42

so quickly. Now, when you move from a

4:44

one-agent system to a multi-agent system, let's

4:46

say five agents, it doesn't get

4:49

just five times harder. It gets 25 times

4:52

more complex. Coordination complexity

4:55

grows exponentially. One agent has got

4:57

zero coordination problems. Two agents

4:59

have got at least one connection. Five

5:02

agents have got at least 10 potential

5:04

connections and coordination. Each

5:06

connection is a failure point, a race

5:09

condition, a state synchronization

5:11

problem. You are not just building five

5:13

agents, you are building a coordination

5:16

problem across multiple relationships

5:19

and the possibility of

5:22

multiple failure modes. And that's why

5:25

the complexity increases very, very

5:27

quickly. Now, I'm going to show you two

5:29

critical patterns. First pattern is

5:32

about how to coordinate multiple agents.

5:35

Then we will talk about how you can

5:37

manage state. And then we'll talk about

5:38

how you can recover and design for

5:40

failure. Now, these patterns come from

5:42

multiple years of distributed systems

5:44

work, and I can directly apply them to

5:46

multi-agent AI systems. Once you get the

5:48

basics, it's really hard to miss these

5:51

patterns when you build multi-agent AI

5:53

architecture. The first decision you

5:55

need to make is about choreography or

5:58

orchestration. These are the two

5:59

fundamental patterns for distributed

6:01

coordination. Choreography means agents

6:04

coordinate through events. They are

6:06

decentralized, they are autonomous.

6:08

Orchestration means a central

6:10

coordinator manages the workflow. This

6:13

is centralized and controlled. Most

6:15

teams pick one instinctively and regret

6:18

it. Let me show you when to use each.

6:21

Let's start with choreography.

6:23

Choreography is event-driven.

6:25

The research agent finishes its

6:28

research and publishes a research

6:30

completed event to a message bus. Agent

6:33

B subscribes to that message bus and

6:36

listens for the event type it is

6:38

interested in. The analysis agent

6:40

subscribes to that event type, picks it

6:43

up, does analysis, and publishes

6:45

analysis ready. Then the report agent

6:48

picks up that analysis ready event and

6:50

generates the report. There is no

6:52

central coordinator here. Each agent is

6:55

autonomous, listening for events it

6:57

cares about, publishing when it is done.
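
The flow just described can be sketched as a tiny publish/subscribe loop. This is a minimal illustration, assuming a synchronous in-memory bus; `MessageBus`, the event names, and the agent functions are hypothetical, and a production system would use a durable message broker instead.

```python
from collections import defaultdict


class MessageBus:
    """Tiny synchronous in-memory bus (hypothetical stand-in for a real broker)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.log = []  # every published event type, in order, for tracing

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self._subscribers[event_type]:
            handler(payload)


bus = MessageBus()
reports = []


def analysis_agent(research):
    # Reacts to research_completed, then announces its own result.
    bus.publish("analysis_ready", {"summary": research["findings"].upper()})


def report_agent(analysis):
    # Reacts to analysis_ready and produces the final report.
    reports.append("REPORT: " + analysis["summary"])


bus.subscribe("research_completed", analysis_agent)
bus.subscribe("analysis_ready", report_agent)

# The research agent finishes and publishes; nothing calls the other agents directly.
bus.publish("research_completed", {"findings": "rates are rising"})

print(bus.log)
print(reports)
```

No agent knows about any other agent here; the only shared knowledge is the event names, which is what makes adding a new subscriber cheap and tracing a failure harder.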

7:00

This is the beauty of choreography.

7:02

Agents are loosely coupled.

7:04

It's easy to add new agents and make

7:07

them subscribe to the events that

7:08

they're interested in. This drives high

7:11

autonomy and scales really well.

7:13

However, the nightmare of choreography

7:15

is debugging. When something fails,

7:17

you're playing detective with no real

7:20

clue. Which agent failed to publish? Did

7:23

the event get consumed? Did the event

7:25

get consumed twice? You need bulletproof

7:28

observability to make choreography work.

7:31

Even with the event propagation, you

7:33

need strong guarantees across

7:36

delivery of these events. Without it,

7:39

debugging is really hard. So, when

7:40

should you use choreography? You use

7:43

choreography when your workflow is

7:45

naturally event-driven, when agents need

7:47

to operate independently, when you are

7:50

adding agents frequently and don't want

7:52

to update a central coordinator. But it

7:55

is important to understand

7:57

it is possible only if you have strong

8:00

observability. If you can't trace events

8:03

through your system, choreography will

8:04

destroy you. I have seen teams choose

8:06

choreography because it feels more

8:09

agentic, more autonomous. Then they

8:11

spend months firefighting because they

8:14

can't debug distributed event flows.

8:16

Don't make that mistake. Now, let's look

8:18

at the alternative, orchestration.

8:21

Orchestration is centralized. You have a

8:23

workflow orchestrator that calls each

8:25

agent directly. Agent A runs first. The

8:28

orchestrator calls Agent A, waits for

8:31

the result, gets the result back. Then

8:34

the orchestrator calls Agent B and C in

8:37

parallel if they are agents that need to

8:38

run in parallel. The orchestrator

8:40

manages the parallelism, not the agents.

8:43

B and C return their results to the

8:44

orchestrator. Then the orchestrator

8:46

calls Agent D with the combined results

8:49

from B and C. Every call goes through

8:51

the orchestrator. Agents never call each

8:53

other. The orchestrator is the single

8:55

source of truth. It knows the entire

8:58

execution graph. It manages state. It

9:01

handles retries. It logs every step.

9:04

Agents are dumb. They just take the

9:07

input, they do the work, they return the

9:09

output. The orchestrator does all the

9:12

smart coordination. In Databricks, one

9:14

way to implement this pattern would be

9:16

with LangGraph wired into Mosaic AI Agent

9:19

Framework as the orchestrator. But any

9:22

workflow engine that gives you

9:24

DAGs, directed acyclic graphs, and

9:27

proper retry mechanisms would fit in

9:30

this kind of orchestrator pattern. You

9:32

use orchestration when you have complex

9:34

dependencies that need central

9:37

management, when you need to roll back,

9:39

or compensate for failures, when you want

9:41

one dashboard showing the entire system

9:44

state, when your workflow is relatively

9:46

stable. In financial services, for

9:48

example, we use orchestration almost

9:51

exclusively. Why? Because it provides

9:54

easy debugging and the ability to roll

9:56

back, and that matters more than

9:58

autonomy in these kinds of industries.

10:01

When something goes wrong with a credit

10:03

decision, for example, we need to know

10:05

exactly which agent made that call, in

10:08

what order, and with what data.

10:10

Orchestration gives us that.

10:12

Choreography doesn't. So, how do you

10:14

choose? Here's your decision framework.

10:16

Two axes.

10:18

Workflow complexity, simple to complex.

10:21

Autonomy requirements, low to high.

10:24

Simple workflow, high autonomy, you go

10:26

with choreography. You need complex

10:28

workflow with low autonomy tolerance,

10:31

you go with orchestration. The

10:32

interesting quadrant is the top right,

10:35

where you need a complex workflow, but

10:36

agents need autonomy. This is where you

10:39

use hybrid patterns. Choreography with

10:42

saga patterns for compensation. I'll

10:44

talk about this pattern later in this

10:47

session as well.
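
The two-axis framework can be written down as a small lookup. This is only a sketch: the function name is hypothetical, and the talk does not say what to pick for a simple workflow with low autonomy needs, so that branch is an assumption.

```python
def choose_pattern(complex_workflow: bool, high_autonomy: bool) -> str:
    """Map the two axes from the talk onto a coordination pattern."""
    if complex_workflow and high_autonomy:
        # The "interesting quadrant": hybrid approach.
        return "hybrid: choreography with saga compensation"
    if complex_workflow:
        return "orchestration"
    if high_autonomy:
        return "choreography"
    # Not covered in the talk. Assumption: default to orchestration,
    # since it is the easier of the two to debug.
    return "orchestration (assumed default)"


print(choose_pattern(complex_workflow=False, high_autonomy=True))
print(choose_pattern(complex_workflow=True, high_autonomy=False))
print(choose_pattern(complex_workflow=True, high_autonomy=True))
```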

10:49

Tools like Agent Bricks on Databricks

10:52

are starting to package these

10:54

orchestration patterns for common

10:56

multi-agent use cases. So, you don't

10:59

need to rebuild them every time. It

11:01

makes

11:02

building these patterns really easy in

11:04

production environments. Now, I use the

11:06

decision matrix every time to make

11:09

decisions with customers based on their

11:10

use cases.

11:12

It's worth taking a screenshot. I'm

11:14

sure you'll reference it. Let me show

11:16

you what a production orchestration

11:18

actually looks like at the tail end of

11:20

the session. All right. Now, we have

11:21

chosen a coordination pattern.

11:24

Now, let's talk about the thing that

11:25

actually breaks when you scale. State. How do

11:29

agents share data without race

11:31

conditions? Without stale reads? Without

11:34

mystery bugs? Here's what most people do

11:36

first, and it's wrong. Shared mutable

11:39

state. Multiple agents writing to the

11:42

same database records at the same time.

11:44

Agent A reads credit score, calculates

11:47

the value, writes it back. Agent B does

11:49

the same thing at the same time. Both

11:51

read 680. Agent A writes

11:55

750. Agent B writes 720. Last write

12:00

wins. Agent A's update disappears. Lost

12:03

update. I understand, yes, modern

12:06

databases have protections in place, row

12:08

locks, isolation levels, etc. But, you

12:11

have to use them correctly. Explicit

12:13

transactions. You have to set

12:16

serializable isolation. You have to

12:19

make sure that you use SELECT FOR UPDATE.

12:22

And many teams don't.
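
The lost update from a moment ago can be replayed deterministically without a database. This is a minimal sketch; the talk names row locks and serializable isolation as the protections, and this example swaps in a simpler version check (optimistic concurrency) only because it runs in plain Python. `Row`, `naive_write`, and `guarded_write` are hypothetical names.

```python
class Row:
    """Hypothetical in-memory record with a version counter."""

    def __init__(self, score):
        self.score = score
        self.version = 0


def naive_write(row, new_score):
    """Blind write: whoever writes last wins."""
    row.score = new_score
    row.version += 1


def guarded_write(row, new_score, expected_version):
    """Compare-and-set: refuse the write if another writer committed first."""
    if row.version != expected_version:
        return False
    row.score = new_score
    row.version += 1
    return True


# Naive path: both agents read 680, then write one after the other.
row = Row(680)
read_a, read_b = row.score, row.score
naive_write(row, 750)  # Agent A
naive_write(row, 720)  # Agent B silently overwrites Agent A
naive_final = row.score

# Guarded path: each agent remembers the version it read.
row = Row(680)
seen_by_a, seen_by_b = row.version, row.version
a_ok = guarded_write(row, 750, seen_by_a)
b_ok = guarded_write(row, 720, seen_by_b)  # rejected: B must re-read and retry
guarded_final = row.score

print(naive_final, a_ok, b_ok, guarded_final)
```

In the guarded path Agent B finds out that its read is out of date instead of silently erasing Agent A's result.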

12:24

They use default isolation. They

12:27

don't use explicit locks, and they ship

12:30

race conditions to production. We did it.

12:32

We made that mistake, and that resulted

12:34

in delayed value to the business. We

12:36

just assumed that the database would

12:37

handle these conditions, but it doesn't.

12:39

When it gets really complex, you have to

12:42

handle them explicitly in the code. Now,

12:44

here's what works. Immutable state

12:46

snapshots with versioning. Agent A

12:48

produces a state version, let's say

12:51

version one. It's sealed. It's

12:53

immutable. Nobody can modify it. State

12:56

is stored in the orchestrator database

12:58

as an append-only log. These are insert

13:01

operations, never updates. Agent A

13:04

hands state version one to agent B.

13:07

Agent B validates the schema, checks

13:09

that the data contract matches with its

13:11

expectations. It processes it, produces

13:14

state version two. Also immutable. Agent

13:17

B inserts version two as the new row. It

13:19

doesn't update version one. And then

13:22

hands it to agent C. Same thing. Schema

13:24

validation, version tracking,

13:26

immutability guarantee at each handoff.

13:29

Now, if agent C fails,

13:32

you roll back to version two. If you

13:34

need to debug, you replay state

13:36

evolution

13:38

from version one through version N.

13:40

You can see exactly what each agent

13:43

received and produced. This eliminates

13:45

race conditions. No concurrent

13:48

modification to the same record. Each

13:50

agent appends a new version instead of

13:53

updating the shared state. Now, of

13:56

course, if you want to

13:58

save these state snapshots, they can

14:00

be logged

14:01

in any sort of append-only storage

14:04

for audit replay, but they are never

14:06

shared for read or write. Now, here's

14:08

how it looks in code. The agent state

14:10

class: frozen means immutable in

14:12

Python. It has a version number, the

14:14

data payload, and who created it. The

14:17

handoff function does three things.

14:19

First, it validates the schema.

14:21

This is the contract enforcement. We

14:24

are checking that agent A's output

14:26

matches agent B's input contract. This

14:29

is critical, and we will come back to

14:31

this. Second, increment version. Create

14:34

a new immutable state object with

14:37

version N plus one. Third, execute the

14:40

next agent with that immutable state.
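
The slide itself is not part of this transcript, so the following is a reconstruction from the spoken description rather than the speaker's code. It is a minimal sketch: the field names follow the talk (version, data payload, creator), while `handoff`, `required_keys`, and `risk_agent` are hypothetical. A read-only mapping is used for the payload because `frozen=True` alone does not stop a plain dict inside the object from being mutated.

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass(frozen=True)
class AgentState:
    version: int
    data: Mapping[str, object]  # wrapped in MappingProxyType, so read-only
    created_by: str


def handoff(state: AgentState,
            required_keys: set,
            next_agent: Callable[[AgentState], dict],
            handed_off_by: str) -> dict:
    # 1. Validate the schema: the producer's output must satisfy the consumer's contract.
    missing = required_keys - set(state.data)
    if missing:
        raise ValueError(f"contract violation, missing keys: {sorted(missing)}")
    # 2. Increment the version: build a new immutable state, never touch the old one.
    next_state = AgentState(
        version=state.version + 1,
        data=MappingProxyType(dict(state.data)),
        created_by=handed_off_by,
    )
    # 3. Execute the next agent with that immutable state.
    return next_agent(next_state)


def risk_agent(state: AgentState) -> dict:
    rating = "low" if state.data["credit_score"] >= 700 else "high"
    return {"risk": rating, "input_version": state.version}


v1 = AgentState(1, MappingProxyType({"credit_score": 750}), "credit_score_agent")
result = handoff(v1, {"credit_score"}, risk_agent, "credit_score_agent")
print(result)

try:
    v1.version = 99  # frozen dataclass: any assignment raises
except FrozenInstanceError:
    print("state is immutable")
```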

14:43

The agent can't modify the input state.

14:45

It can only produce a new state. This

14:48

prevents an entire class of bugs. It

14:51

prevents race conditions on shared

14:54

state. No stale reads. It provides a

14:56

clear lineage. Every state has a

14:58

version, and you know who has created

15:00

it. When something goes wrong, you can

15:02

trace back through state evolution.

15:04

Version seven produced bad output, look

15:07

into version six that went into the

15:08

agent. Look at version five before that.

15:11

You can binary search through your state

15:14

history to find where things went wrong.
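
A minimal sketch of that binary search, assuming the history is an ordered list of snapshots and that once a state goes bad every later state is bad too (that assumption is what makes bisection valid). The `first_bad_version` helper and the sample history are hypothetical.

```python
def first_bad_version(history, is_good):
    """Return the index of the first snapshot that fails is_good, or None.

    Assumes good snapshots come first and bad ones follow, with no mixing.
    """
    lo, hi = 0, len(history)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(history[mid]):
            lo = mid + 1  # first bad snapshot is to the right
        else:
            hi = mid      # mid is bad; the first bad one is mid or earlier
    return lo if lo < len(history) else None


# Hypothetical history: versions 1..7, where the score turns invalid at version 5.
history = [{"version": v, "score": 750 if v < 5 else -1} for v in range(1, 8)]

idx = first_bad_version(history, lambda s: s["score"] >= 0)
culprit = history[idx]["version"]
print(culprit)
```

With immutable, versioned snapshots the check can be rerun on any version, so locating the first bad one takes a logarithmic number of checks instead of a walk through the whole history.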

15:17

And this becomes really, really

15:18

powerful. Now, state management is half

15:20

the battle. Data contracts are the other

15:23

half. Agent A can't just throw

15:26

arbitrary data at agent B and hope it

15:29

works. It doesn't work that way. They

15:31

need a contract in place. In this

15:33

example, research agent promises to

15:37

output findings, confidence score,

15:39

sources, timestamp, etc. Analysis agent

15:42

declares it requires research agent

15:45

output with types enforced.

15:47

And it validates. If confidence is

15:50

below 0.7, it will reject the handoff.

15:55

This is the contract.

15:57

If the research agent tries to

15:59

hand off low-quality data, the contract

16:03

catches it at the boundary. You find out

16:05

immediately, not three agents downstream

16:08

when it produces a garbage report.
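
A contract check at the boundary could look roughly like this. It is a minimal sketch, assuming the four fields named in the talk and the 0.7 confidence threshold; the class and function names are hypothetical, and a real deployment might use a schema library or a governed schema registry instead.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MIN_CONFIDENCE = 0.7  # threshold mentioned in the talk


@dataclass(frozen=True)
class ResearchOutput:
    findings: str
    confidence: float
    sources: tuple
    timestamp: datetime


class ContractViolation(Exception):
    """Raised when a handoff does not satisfy the consumer's contract."""


def accept_handoff(payload: dict) -> ResearchOutput:
    """Analysis agent's entry point: validate types and quality before doing any work."""
    try:
        out = ResearchOutput(
            findings=str(payload["findings"]),
            confidence=float(payload["confidence"]),
            sources=tuple(payload["sources"]),
            timestamp=payload["timestamp"],
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ContractViolation(f"bad or missing field: {exc!r}") from exc
    if not isinstance(out.timestamp, datetime):
        raise ContractViolation("timestamp must be a datetime")
    if out.confidence < MIN_CONFIDENCE:
        raise ContractViolation(f"confidence {out.confidence} is below {MIN_CONFIDENCE}")
    return out


good = {
    "findings": "rates are rising",
    "confidence": 0.9,
    "sources": ["central bank bulletin"],
    "timestamp": datetime.now(timezone.utc),
}
accepted = accept_handoff(good)

try:
    accept_handoff({**good, "confidence": 0.4})
    rejected = False
except ContractViolation:
    rejected = True

print(accepted.confidence, rejected)
```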

16:10

When we work with our customers

16:13

using Databricks, one way of doing it is

16:16

registering these input-output

16:17

schemas in Unity Catalog. So, every

16:20

agent's contract is versioned and

16:22

governed in one place. All right. We

16:24

talked about coordination patterns. We

16:26

talked about state management.

16:28

Now let's talk about another

16:30

thing that you need to keep in mind, and

16:32

that's failure and recovery. And the

16:34

reason this is important is because

16:36

agents will fail. That's inevitable. The

16:38

LLM will time out. The API will rate

16:41

limit you. The agent will crash

16:43

mid-workflow. What happens then? What

16:45

happens then is what you need to plan

16:47

for and design in the system. Let's talk

16:49

about a few patterns. Let's talk about

16:51

the first pattern, which is the

16:52

circuit breaker pattern, and this comes

16:55

straight from distributed systems. When

16:57

agent A calls agent B, it wraps that

17:00

call in a circuit breaker. If agent B

17:02

fails repeatedly, say five times in a

17:05

row, the circuit breaker opens. Now,

17:07

instead of waiting for a timeout every

17:09

single time, you basically fail fast.

17:12

Circuit open, agent B is down, you just

17:15

try again later. You are not bombarding

17:17

agent B with requests. You're protecting

17:20

your system. After a timeout

17:22

period, let's say 60 seconds, the

17:24

circuit goes half open. Then you test

17:27

agent B again with one request. If it

17:29

succeeds, the circuit closes, and

17:32

normal operation resumes. If it fails,

17:34

the circuit opens again, and it resets

17:37

the timer. This prevents failures from

17:39

cascading through the system.
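
The closed, open, and half-open behaviour described above fits in a small class. This is a minimal single-threaded sketch using the numbers from the talk (five failures, 60 seconds); `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical, and the clock is injected only so the example does not have to sleep.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the agent while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast, circuit is open")
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()  # (re)start the timer
            raise
        # Success: reset the count and close the circuit.
        self.failures = 0
        self.state = "closed"
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60.0, clock=lambda: now[0])


def flaky_agent():
    raise TimeoutError("agent B timed out")


for _ in range(5):
    try:
        breaker.call(flaky_agent)
    except TimeoutError:
        pass
state_after_failures = breaker.state

now[0] = 61.0  # the timeout period has passed
recovered = breaker.call(lambda: "ok")  # half-open trial call succeeds
state_after_trial = breaker.state

print(state_after_failures, recovered, state_after_trial)
```

A production version would also need to be safe under concurrent callers and to limit the half-open state to a single in-flight trial request, which this sketch leaves out.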

17:42

One agent going down doesn't bring your

17:45

entire workflow down. You gracefully

17:48

degrade. Maybe you skip that agent and

17:51

continue with reduced functionality.

17:54

Maybe you use cached results. Maybe

17:57

you alert a human. But, you don't crash

17:59

the entire workflow. Circuit breakers

18:02

are the single most important failure

18:06

recovery pattern for multi-agent

18:08

systems. Every agent call should be

18:10

wrapped with a circuit breaker.

18:11

We enforce these circuit breaker

18:13

policies at the serving layer on

18:15

Databricks through model serving or

18:16

through AI Gateway. Here's how it looks

18:18

in code. You track the failure

18:20

count, and you track the state. When you

18:22

call an agent, you check the state

18:24

first. If it is open, you fail fast. You

18:27

don't even try. If it is closed, you

18:29

make the call. If the call succeeds, you

18:31

reset the failure count and stay closed.

18:34

If it fails, you increment the failure

18:36

count. If you hit the threshold, you

18:38

open the circuit. After the timeout

18:40

period, you transition to half open. You

18:43

test one request. If it succeeds, you

18:45

close the circuit. If it fails, you open

18:47

it again. This is a simple pattern, but

18:50

it has got a massive impact. And in

18:52

Databricks, you can log every

18:54

open-closed transition in MLflow, so you

18:57

can see when an agent started flaking

19:00

out. Now, let's talk about another

19:02

pattern. We call it the compensation

19:04

pattern. Also called saga pattern. Every

19:07

agent has two methods, execute and

19:10

compensate. Execute does the work.

19:12

Compensate rolls it back, undoes it. The

19:15

orchestrator tracks which

19:17

agents have executed. If the execution

19:20

agent fails, the orchestrator walks

19:23

backward through the executed

19:25

agents.

19:27

And it calls compensate for each one.

19:30

The analysis agent compensates: it deletes

19:33

the draft recommendation from the system

19:34

that it has written originally. And then

19:37

the research agent compensates by

19:38

clearing the cached research data that

19:41

it gathered previously. So, you're back

19:43

to the initial state. No partial

19:45

transactions. No stuck workflows. This

19:48

is a simple rollback pattern that you

19:50

can implement in a multi-agent system.
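
The rollback walk can be sketched as follows. This is a minimal illustration, assuming each agent exposes `execute` and `compensate` as the talk describes; the `Agent` class, the shared `journal` list, and `run_saga` are hypothetical names.

```python
class Agent:
    """Hypothetical agent with a forward action and its undo."""

    def __init__(self, name, journal, fail=False):
        self.name = name
        self.journal = journal  # shared list recording what actually happened
        self.fail = fail

    def execute(self):
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        self.journal.append(f"{self.name}.execute")

    def compensate(self):
        self.journal.append(f"{self.name}.compensate")


def run_saga(agents):
    """Run agents in order; on failure, compensate completed ones in reverse order."""
    executed = []
    for agent in agents:
        try:
            agent.execute()
        except Exception:
            for done in reversed(executed):
                done.compensate()
            return False
        executed.append(agent)
    return True


journal = []
ok = run_saga([
    Agent("research", journal),
    Agent("analysis", journal),
    Agent("execution", journal, fail=True),
])

print(ok)
print(journal)
```

The failing agent is never compensated because it never completed; only the steps already recorded as executed are undone.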

19:52

Compensation gives distributed agents.

19:55

It is not sexy, but it's how production

19:57

systems handle partial failures. Every

19:59

orchestrated workflow needs this kind of

20:02

compensation pattern, and you need to

20:04

plan for it depending on what you're

20:05

doing with your workflows. Here's how

20:07

compensation looks in code. Every agent,

20:10

as I mentioned earlier, has got two

20:12

methods, the execution method and the

20:15

compensate method. The execution does

20:17

the work, the compensate undoes it.

20:20

that's the contract. Every operation

20:23

must be reversible.

20:26

The orchestrator tracks which agents

20:29

have run successfully, and then it keeps

20:31

a list. Agent A executes, gets added.

20:34

Agent B executes, gets added. Agent C

20:37

fails, now we walk backward through the

20:39

list in reverse order. Agent B

20:41

compensates first, it undoes the work

20:44

that it has done. Agent A compensates

20:46

next, it undoes the work that Agent A

20:48

has done, and it goes back to the

20:50

initial state. This is the saga pattern from

20:52

distributed databases. Financial

20:54

services require this. Now that we have

20:57

covered these different patterns, I

20:59

wanted to show you what a production

21:00

architecture would look like when you

21:01

bring these things together. You've got

21:03

the orchestrator at the left-hand side.

21:07

It's the brain of the workflow. It

21:09

contains the workflow engine, it

21:11

contains the state store holding

21:14

versions zero through N, and it

21:16

can look into the

21:18

observability layer. It handles the

21:20

observability data. Every call goes

21:22

through the orchestrator. Orchestrator

21:24

calls Agent A. Agent A returns

21:27

state version one to the orchestrator.

21:29

Orchestrator then calls Agent B and C in

21:32

parallel if they need to run in

21:33

parallel.

21:34

Both receive state version one from

21:37

the orchestrator. They return results.

21:40

Orchestrator stores them as versions two

21:42

and three. Finally, orchestrator calls D

21:44

with these combined results. Agents

21:46

never call each other. All coordination

21:48

happens through the orchestrator.
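
The call sequence walked through here (A, then B and C in parallel, then D) can be sketched with the standard library. This is a minimal illustration and not the Databricks or LangGraph setup from the talk; `Orchestrator`, the lambda agents, and the dictionary layout of each state entry are hypothetical, and the in-memory list is append-only by convention only.

```python
from concurrent.futures import ThreadPoolExecutor


class Orchestrator:
    def __init__(self):
        self.state_log = []  # append-only by convention; entries are never updated

    def _record(self, created_by, data):
        entry = {"version": len(self.state_log) + 1, "created_by": created_by, "data": data}
        self.state_log.append(entry)
        return entry

    def run(self, agent_a, agent_b, agent_c, agent_d, request):
        # Agent A runs first and produces state version 1.
        v1 = self._record("A", agent_a(request))
        # B and C run in parallel; the orchestrator, not the agents, manages that.
        with ThreadPoolExecutor(max_workers=2) as pool:
            future_b = pool.submit(agent_b, v1["data"])
            future_c = pool.submit(agent_c, v1["data"])
            result_b, result_c = future_b.result(), future_c.result()
        v2 = self._record("B", result_b)
        v3 = self._record("C", result_c)
        # D receives the combined results; agents never call each other.
        return self._record("D", agent_d(v2["data"], v3["data"]))


orchestrator = Orchestrator()
final = orchestrator.run(
    agent_a=lambda req: {"score": 750},
    agent_b=lambda s: {"income_ok": True},
    agent_c=lambda s: {"risk": "low" if s["score"] >= 700 else "high"},
    agent_d=lambda b, c: {"approved": b["income_ok"] and c["risk"] == "low"},
    request={"customer": "cust-1"},
)

print(final)
print([entry["created_by"] for entry in orchestrator.state_log])
```

Results are recorded only after both parallel calls return, so version numbers stay deterministic regardless of which agent finishes first.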

21:50

And this is what gives us control,

21:53

observability, capability to roll back.

21:56

This runs 24/7 across billions of

21:58

transactions because the orchestrator is

22:01

the single source of truth. All right,

22:03

here's a production architecture that

22:06

you could implement with the Databricks

22:08

Data Intelligence Platform.

22:10

In the orchestration layer, you can have

22:12

LangGraph wired into Mosaic AI Agent

22:15

Framework. It handles multi-agent

22:17

orchestration. It manages the workflow

22:19

graph and knows which agents to call in

22:21

what order. Each agent is implemented as

22:24

a Unity Catalog function. It could be

22:27

written in SQL or Python, or it could be

22:29

a model registered in Unity Catalog.

22:32

When you register these

22:35

assets in Unity Catalog, they are

22:38

discoverable centrally within the

22:40

organization. They can be governed in

22:42

one place, and they can be versioned,

22:44

which is really critical in terms of

22:46

operating these workflows in

22:49

production. We expose these agents

22:51

through Databricks Model Serving or

22:53

Function Serving, and that's where we

22:55

enforce these circuit breaker style

22:57

policies like retries or timeouts or

23:00

rate limits at the serving layer,

23:02

typically via AI Gateway configuration.

23:05

Now when we talk about the data layer,

23:06

Delta Lake stores everything. It not

23:09

only stores the state versions from the

23:12

agent, it also stores customer data and,

23:15

you know, all the data that you

23:18

need for your workflows to work.

23:23

Talking about the state snapshots, a

23:25

Delta table

23:27

is immutable and versioned. For us,

23:30

those state versions are just rows in a

23:32

Delta table. We never update them in

23:35

place. Each agent run is tied to a state

23:38

version via MLflow Traces, so we can

23:40

step through the evolution when

23:42

something breaks. Now, I just wanted

23:45

to touch upon Unity Catalog. It

23:48

governs everything: access control,

23:50

lineage, audit trail for both data and

23:53

agents. MLflow gives us per-agent

23:56

tracing, evaluation capabilities with

23:58

out-of-the-box LLM judges,

24:03

and metrics on every call. And as I

24:05

mentioned earlier, tools like Agent

24:08

Bricks

24:09

are the higher-level way of Databricks

24:12

packaging these orchestration patterns

24:14

for common multi-agent use cases, so you

24:17

don't need to rebuild them every time.

24:19

So just to wrap up this workflow,

24:22

the LangGraph orchestrator calls Agent

24:24

A, a Unity Catalog function or model. It

24:27

gets a result, writes version one state

24:30

to Delta. It then calls Agent B with

24:34

state version one, writes version two,

24:36

and so on.

24:37

MLflow traces every call, latency,

24:40

inputs, outputs, token usage. A circuit

24:43

breaker at the serving layer guards each

24:45

call. If Agent C fails, LangGraph

24:48

triggers compensation logic and walks

24:51

backward, calling the compensate

24:53

functions for previous successful steps.

24:55

These kinds of patterns run in production

24:57

day in and day out. So thank you for

24:59

hearing me out. You can reach out to me

25:01

over LinkedIn. You can scan this keyword

25:03

that will take you directly to my

25:05

LinkedIn profile.

25:07

I would like to leave you with

25:09

three final thoughts. First of all,

25:11

agent chaos is inevitable. When you

25:14

scale past one agent, you will

25:18

hit coordination problems, race

25:20

conditions, cascading failures. That's

25:22

guaranteed. The complexity curve doesn't

25:25

lie. Your agent choreography is a

25:27

choice. You can build systems with

25:30

proper patterns, orchestration,

25:32

choreography, immutable state, circuit

25:35

breakers, compensation patterns, data

25:37

contracts. Make sure you understand

25:40

these patterns and bring them to your

25:42

production architecture. Doing so will

25:44

help you build systems, not demos. Demos

25:47

are easy. You use an LLM to show

25:50

something cool. Everyone can do it.

25:52

These things don't work in production.

25:54

In production, you have to build

25:56

systems, and systems are hard. Systems

25:59

are what create value for businesses.

26:02

Everything I showed you today,

26:03

choreography versus orchestration,

26:05

immutable state, circuit breakers, these

26:07

are all unsexy infrastructure work. You

26:10

won't get applause for implementing a

26:12

circuit breaker, but you make your

26:14

systems more reliable. They don't fail

26:16

at 2:00 a.m. That is what

26:18

people notice over time. Be a systems

26:20

engineer. The patterns here, they work.

26:23

Apply these patterns in your production

26:25

architecture. Thank you very much for

26:27

watching. Bye.
