Cloudflare's Code Orange Project: A Stronger, More Resilient Network

Published 2026-05-09 17:39:08 · Education & Careers

Introduction

Over the past two and a half quarters, Cloudflare has been hard at work on an intensive engineering initiative internally known as "Code Orange: Fail Small". This project was designed to make our infrastructure more resilient, secure, and reliable for every customer. Earlier this month, we completed the core work that would have prevented the global outages on November 18 and December 5, 2025.

Source: blog.cloudflare.com

While improving resiliency is a never-ending journey, this milestone marks a significant leap forward. The efforts focused on several key areas: safer configuration changes, reducing the impact of failures, revising our break-glass procedures and incident management, preventing configuration drift over time, and strengthening how we communicate with customers during incidents. Below, we dive into what we shipped and what it means for you.

Safer Configuration Changes

Health-Mediated Deployment for Config

One of the most impactful changes is how we handle internal configuration changes. Previously, many configuration updates reached our entire network instantly. Now, internal configuration changes roll out progressively under real-time health monitoring, so our observability tools can catch problems and automatically revert a change before it affects your traffic.

We identified high-risk configuration pipelines and built new tools to manage changes to them. For products that process customer traffic and receive configuration updates, we no longer deploy those changes instantly across the entire network. Instead, the relevant teams, including the product teams directly affected by the past incidents, have adopted "health-mediated deployment", the same methodology we already use for software releases, for all of their configuration deployments.
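The general shape of a health-mediated rollout can be sketched in a few lines. This is a minimal illustration, not Cloudflare's implementation: the stage fractions and the `apply`, `is_healthy`, and `rollback` callbacks are all hypothetical names chosen for the example.

```python
from typing import Callable, Sequence

# Hypothetical stage sizes: the fraction of the fleet that has the change.
STAGES: Sequence[float] = (0.01, 0.10, 0.50, 1.00)


def health_mediated_rollout(
    apply: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    stages: Sequence[float] = STAGES,
) -> bool:
    """Roll a change out stage by stage and revert on the first unhealthy signal.

    Returns True if the change reached every stage, False if it was reverted.
    """
    for fraction in stages:
        apply(fraction)        # expose the change to a larger slice of the fleet
        if not is_healthy():   # health gate between stages
            rollback()         # automatic revert, no human in the loop
            return False
    return True


if __name__ == "__main__":
    events = []
    finished = health_mediated_rollout(
        apply=lambda f: events.append(f"apply {f:.0%}"),
        is_healthy=lambda: len(events) < 2,   # simulate a failure at stage two
        rollback=lambda: events.append("rollback"),
    )
    print(finished, events)
```

In this sketch the simulated failure stops the rollout at the second stage, so the change never reaches the remaining slices of the fleet.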

Snapstone: A Unified Health-Mediation System

Central to this effort is a new internal component we call Snapstone. Snapstone bundles configuration changes into a package and releases that package gradually under health mediation. Before Snapstone, applying progressive rollout and automated rollback to configuration changes was possible but difficult: it required significant per-team effort and wasn't consistently applied across the network.

Snapstone closes this gap by providing a unified way to bring progressive rollout, real-time health monitoring, and automated rollback to configuration deployments by default. What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, it allows teams to dynamically define any unit of configuration that needs health mediation—whether it's a data file like the one that caused the November 18 outage, or a control flag in our global configuration system like the one involved in the December 5 outage. Teams create these configuration units on demand, ensuring that any future risky change can be rolled out safely.
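To make the idea of an on-demand configuration unit concrete, here is a small sketch. Snapstone's actual interface is not described in this post, so `ConfigUnit`, `UnitRegistry`, and every method and field name below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ConfigUnit:
    """A hypothetical unit of configuration placed under health mediation."""
    name: str
    kind: str                                   # e.g. "data_file" or "control_flag"
    history: List[Any] = field(default_factory=list)

    @property
    def current(self) -> Any:
        """The active value, or None if nothing has been proposed yet."""
        return self.history[-1] if self.history else None

    def propose(self, value: Any) -> None:
        """Record a new value as the active one."""
        self.history.append(value)

    def revert(self) -> None:
        """Drop the latest value, restoring the previous one."""
        if self.history:
            self.history.pop()


class UnitRegistry:
    """Teams define units on demand instead of relying on a fixed list."""

    def __init__(self) -> None:
        self._units: Dict[str, ConfigUnit] = {}

    def define(self, name: str, kind: str) -> ConfigUnit:
        # Return the existing unit if present, so definitions are idempotent.
        return self._units.setdefault(name, ConfigUnit(name, kind))


if __name__ == "__main__":
    registry = UnitRegistry()
    flag = registry.define("example-flag", "control_flag")
    flag.propose(False)
    flag.propose(True)   # the risky change
    flag.revert()        # what an automated rollback would do
    print(flag.current)
```

The point of the sketch is that a data file and a control flag share one abstraction: anything that can be proposed and reverted can be wrapped in a unit and rolled out behind a health gate.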

Reducing the Impact of Failure

Beyond configuration changes, we’ve taken steps to limit the blast radius when failures do occur. This includes architectural improvements that isolate faults to specific regions or services, preventing a single issue from cascading across the entire network. We’ve also enhanced our circuit breaker patterns and bulkheads in critical systems, ensuring that a failure in one component doesn’t bring down others.
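For readers unfamiliar with the circuit breaker pattern mentioned above, the following is a generic textbook sketch, not code from Cloudflare's systems. The class name, thresholds, and injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None   # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("dependency isolated; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each call to a dependency in `breaker.call(...)`; once the failure threshold is reached, further calls are rejected immediately until the cool-down passes, which keeps one failing component from dragging down the services that depend on it.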

Revised Break-Glass and Incident Management

Our break-glass procedures—the emergency access processes for critical systems—have been completely overhauled to reduce human error during high-stress situations. We’ve also revamped our incident management playbooks, introducing clearer roles, faster escalation paths, and mandatory post-incident reviews that feed directly back into our development lifecycle. The goal: learn from every incident and prevent recurrence.

Preventing Drift and Regressions

To ensure that improvements stick, we introduced measures to prevent configuration drift and regressions over time. This includes automated compliance checks, regular audits of configuration baselines, and improved version control for all infrastructure-as-code components. Teams now have dashboards that highlight deviations from approved configurations, enabling proactive correction before problems arise.
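A drift check of the kind described above boils down to comparing what is deployed against an approved baseline. The sketch below assumes, purely for illustration, that both are flat key-value mappings; the function name and the sample keys are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Deviation = Tuple[str, Any, Any]   # (key, approved value, deployed value)
_MISSING = "<missing>"             # marker for a key absent on one side


def find_drift(approved: Dict[str, Any], deployed: Dict[str, Any]) -> List[Deviation]:
    """Return every key whose deployed value differs from the approved baseline."""
    deviations: List[Deviation] = []
    for key in sorted(approved.keys() | deployed.keys()):
        want = approved.get(key, _MISSING)
        have = deployed.get(key, _MISSING)
        if want != have:
            deviations.append((key, want, have))
    return deviations


if __name__ == "__main__":
    approved = {"timeout_s": 30, "retries": 3}
    deployed = {"timeout_s": 45, "retries": 3, "debug": True}
    for key, want, have in find_drift(approved, deployed):
        print(f"{key}: approved={want!r} deployed={have!r}")
```

Run on a schedule, a check like this can feed a dashboard of deviations, so a changed or unapproved setting is flagged before it causes a problem.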

Better Customer Communication During Outages

Finally, we strengthened how we communicate with you during outages. We’ve implemented real-time status updates with more granular detail, improved our status page with clearer timelines, and added proactive notifications via email and webhooks. Our goal is to keep you informed at every stage of an incident, from detection to resolution, so you can plan accordingly.

Conclusion

The core work of Code Orange: Fail Small is complete, but our commitment to reliability never ends. The systems and processes we’ve built, especially Snapstone for health-mediated configuration deployments, will continue to evolve. We’re confident that these changes make Cloudflare’s network stronger, safer, and more resilient for every customer. Thank you for your patience and your trust, which we work to earn every day.