9 Crucial Insights into a CUBIC Congestion Control Bug in QUIC

Published 2026-05-17 19:47:03 · Linux & DevOps

When a seemingly harmless Linux kernel optimization meets the complex world of QUIC, unexpected bugs can emerge. This article explores a case where CUBIC, the default congestion controller in the Linux kernel and in Cloudflare’s quiche, suffered from a permanent congestion window stall. Working through the quiche implementation, we uncover the root cause, the simple fix, and the lessons learned. The nine points below explain the bug, its discovery, and its resolution.

1. CUBIC’s Role in Modern Networking

CUBIC is the default congestion control algorithm in the Linux kernel and is standardized in RFC 9438. It governs how a large share of TCP and QUIC connections on the public internet probe for bandwidth, handle loss, and recover. Cloudflare’s open-source QUIC implementation, quiche, also uses CUBIC as its default controller. As a result, a bug in CUBIC’s logic can affect a significant portion of internet traffic, and understanding CUBIC’s behavior is essential for diagnosing performance issues in high-speed networks.

Source: blog.cloudflare.com

2. The Congestion Window: The Core Knob

The congestion window (cwnd) is a sender-side limit on how many bytes can be in flight at any time. A larger cwnd allows more data per round trip, boosting throughput, while a smaller cwnd throttles the sender to avoid overwhelming the network. CUBIC, like all loss-based algorithms, adjusts cwnd based on packet loss signals. It increases cwnd when the network appears healthy and decreases it when loss is detected. This mechanism is central to both TCP and QUIC performance.
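
To make the role of cwnd concrete, here is a minimal sketch of a sender that refuses to transmit once the bytes in flight would exceed the window. The `Sender` class and its method names are hypothetical and are not taken from quiche or the Linux kernel.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """Toy sender that tracks only what the congestion window needs."""
    cwnd: int                 # congestion window, in bytes
    bytes_in_flight: int = 0  # sent but not yet acknowledged

    def can_send(self, packet_size: int) -> bool:
        # A packet may go out only if it still fits inside the window.
        return self.bytes_in_flight + packet_size <= self.cwnd

    def on_packet_sent(self, packet_size: int) -> None:
        self.bytes_in_flight += packet_size

    def on_ack(self, acked_bytes: int) -> None:
        # Acknowledged bytes leave the network and free up window space.
        self.bytes_in_flight = max(0, self.bytes_in_flight - acked_bytes)
```

With a 3000-byte window and 1200-byte packets, the sender can emit two packets and must then wait for an acknowledgment before sending a third.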

3. The Loss-Based Philosophy

Loss-based congestion control algorithms operate on a simple premise: if there is no packet loss, increase the sending rate; if loss occurs, assume capacity has been exceeded and back off. CUBIC follows this principle, using a cubic function to grow cwnd after a loss event, which allows for efficient bandwidth utilization. However, this logic relies on assumptions that may not hold in all scenarios, such as when the connection is application-limited or when the network experiences high jitter.
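
The cubic growth curve can be sketched directly from the window function in RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max, with the RFC's constant C = 0.4. The function names below are illustrative, windows are expressed in segments, and the sketch ignores the Reno-friendly region and other details that a real implementation handles.

```python
C = 0.4     # CUBIC scaling constant from RFC 9438
BETA = 0.7  # multiplicative decrease factor from RFC 9438


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Seconds needed to climb from cwnd_epoch back up to w_max."""
    # max() guards against a negative base if cwnd_epoch exceeds w_max.
    return (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window (in segments) t seconds after the epoch began."""
    k = cubic_k(w_max, cwnd_epoch)
    return C * (t - k) ** 3 + w_max


# Example: loss at a window of 100 segments, reduced to 70 by BETA.
w_max = 100.0
cwnd_epoch = w_max * BETA
k = cubic_k(w_max, cwnd_epoch)
```

The curve starts at the reduced window, flattens as it approaches the previous maximum at t = K, and then accelerates again to probe for more bandwidth.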

4. The Mysterious Test Failure

The investigation began when Cloudflare’s ingress proxy integration test pipeline started failing unexpectedly. In tests where CUBIC faced heavy packet loss early in the connection, the system failed to recover 61% of the time. Recovery after congestion collapse is a rare but critical regime—it’s exactly what a congestion controller is designed to handle. Most tests focus on steady-state behavior, so this corner case had been overlooked, leaving the bug hidden in production-like scenarios.

5. Pinpointing the Root Cause

Engineers discovered that after a congestion collapse, CUBIC’s cwnd became permanently pinned at its minimum value. Normally, the algorithm should gradually increase cwnd as acknowledgments arrive, but here it never recovered. The cause was traced to how CUBIC handled the app-limited exclusion, the rule that cwnd should not grow while the sender is not fully using the window. A recent Linux kernel patch intended to align CUBIC with RFC 9438 §4.2-12 inadvertently created this bug in certain edge cases.

6. The Linux Kernel Optimization

The patch aimed to improve CUBIC’s behavior when the sender is application-limited, i.e., when it doesn’t have enough data to fill the cwnd. In such cases, the algorithm should not increase cwnd, to avoid over-optimistic growth. The Linux kernel change introduced a check that prevented cwnd growth during app-limited periods. While correct for TCP, this check failed when ported to QUIC, because QUIC’s pacing and acknowledgment semantics differ from TCP’s, leading to the permanent stall.
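
A rough sketch of what such an app-limited check can look like follows. Both functions are hypothetical: the detector uses the common "bytes in flight below cwnd" definition, and the growth step is a simple placeholder rather than the cubic curve. Neither is copied from the kernel patch or from quiche.

```python
def is_app_limited(bytes_in_flight: int, cwnd: int) -> bool:
    """Hypothetical detector: the sender is not filling the window."""
    return bytes_in_flight < cwnd


def cwnd_after_ack(cwnd: int, acked_bytes: int, bytes_in_flight: int) -> int:
    """Return the new cwnd after an ACK, skipping growth when app-limited."""
    if is_app_limited(bytes_in_flight, cwnd):
        # The window was not the bottleneck, so do not grow it.
        return cwnd
    # Placeholder growth rule (assumption): add the acknowledged bytes.
    return cwnd + acked_bytes
```

The design intent is reasonable: a window that was never filled has not been tested against the network, so growing it would be based on no evidence.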

7. Porting to QUIC: Unexpected Consequences

When Cloudflare ported the Linux kernel patch to its QUIC library, quiche, the subtle differences between TCP and QUIC surfaced. QUIC handles retransmissions and acknowledgments differently, which meant the app-limited check was triggered far more frequently than intended. Instead of being a rare condition, it became the default after a loss event, preventing cwnd from ever growing again. The bug was invisible in throughput dashboards but devastating for connections that experienced early loss.

8. The Elegant One-Line Fix

The solution was surprisingly simple: removing a single condition that forced cwnd to stay constant during app-limited phases. By eliminating this check, CUBIC reverted to its original behavior for QUIC, allowing cwnd to increase even when the sender was app-limited after a loss. This one-line fix restored proper recovery and eliminated the 61% test failure rate. The fix was thoroughly validated and deployed, highlighting how minor changes in protocol implementations can have outsized effects.
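
The effect of removing the condition can be illustrated with a toy simulation. Everything here is an assumption made for illustration: the floor of two 1200-byte packets, the premise that every ACK after the collapse is flagged app-limited, and the placeholder growth rule. It is not quiche's actual code or the actual patch.

```python
MIN_CWND = 2 * 1200  # assumed floor: two 1200-byte packets


def simulate(acks: int, gate_on_app_limited: bool,
             packet_size: int = 1200) -> int:
    """Return cwnd after `acks` acknowledgments, starting from the floor."""
    cwnd = MIN_CWND
    for _ in range(acks):
        # Assumption: after the collapse, every ACK is flagged app-limited.
        app_limited = True
        if gate_on_app_limited and app_limited:
            continue  # growth skipped: this is the stall
        cwnd += packet_size  # placeholder growth, not the cubic curve
    return cwnd


stalled = simulate(50, gate_on_app_limited=True)
fixed = simulate(50, gate_on_app_limited=False)
```

With the gate in place the window never leaves the floor no matter how many ACKs arrive; without it, the window grows again.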

9. Lessons for Congestion Control Design

This bug underscores the importance of testing edge cases—especially recovery after congestion collapse—across different transport protocols. What works for TCP may not directly apply to QUIC due to differences in pacing, acknowledgment handling, and retransmission strategies. It also demonstrates that even well-tested algorithms like CUBIC can harbor subtle bugs that only appear under specific conditions. Future congestion control designs should account for such cross-protocol nuances to ensure robust performance.
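
One way to cover this edge case is a test helper that inspects a recorded cwnd trace and checks that the window leaves its floor within a bounded number of samples. The helper below is a hypothetical sketch; the trace format and the deadline are assumptions, not part of Cloudflare's test pipeline.

```python
from typing import Sequence


def recovered(cwnd_trace: Sequence[int], floor: int, deadline: int) -> bool:
    """True if cwnd rises above `floor` within `deadline` samples of first hitting it."""
    collapse_index = next(
        (i for i, w in enumerate(cwnd_trace) if w <= floor), None
    )
    if collapse_index is None:
        return True  # the window never collapsed, so there is nothing to recover from
    after = cwnd_trace[collapse_index + 1 : collapse_index + 1 + deadline]
    return any(w > floor for w in after)
```

A trace that stays pinned at the floor fails the check, while one that climbs back passes, so the helper can be used in the same way for TCP and QUIC traces.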

In conclusion, the CUBIC bug was a classic example of a well-intentioned optimization causing unintended harm when applied in a new context. The discovery and fix not only improved Cloudflare’s QUIC traffic but also provided valuable insights for the broader networking community. By sharing this story, we hope to encourage more rigorous testing of congestion control in diverse environments, ultimately making the internet more resilient.