feat(entity-caching-3): feature flag percentage based rollout router changes#2829
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## milinda/entity-caching-2-control-plane-ff-rollout #2829 +/- ##
======================================================================================
+ Coverage 9.59% 40.49% +30.90%
======================================================================================
Files 445 14 -431
Lines 56997 1521 -55476
Branches 905 0 -905
======================================================================================
- Hits 5468 616 -4852
+ Misses 51122 790 -50332
+ Partials 407 115 -292 🚀 New features to boost your workflow:
|
3294faa to
1af9b5f
Compare
The loop at feature_flag_rollout.go:53-58 read `!= nil` where it should
have been `== nil` — every flag that *carried* a traffic_percentage was
skipped, and only preview-only flags were appended. Net effect:
- Real rollout flags never registered → still client-pinnable via
X-Feature-Flag / cookie. The exact "header pin bypass" the design
promises is defeated; a client who knows the flag name escapes the
percentage gate entirely.
- Preview flags ended up in `rolloutFlags` because GetTrafficPercentage()
returns 0 when unset → isRolloutFlag(name)==true → graph_server.go's
pin-strip block fires → existing preview clients lose pin steering.
The doc-comment two lines up ("every flag whose execution config carries
a traffic_percentage participates") had the right intent; the code just
inverted it.
This single-line fix restores the security contract of the feature.
Enabling rollouts flips the routing semantics for every flag whose
execution config carries a traffic_percentage — header/cookie pins start
being ignored. That is a behavior change that should not happen on a
silent router upgrade.
Per router/CLAUDE.md ("Boolean defaults should be `false` when possible"),
flip envDefault to false, the schema's documented default to false, and
update config_defaults.json to match. fixtures/full.yaml keeps
`enabled: true` — it is a feature-on example, not a defaults snapshot.
… whole graph The previous logic returned an error and a nil selector when cumulative percentage exceeded 100. graph_server.go logs the error and proceeds with selector==nil — but selector==nil silently restores client-pinnability for every other rollout flag on the graph, because isRolloutFlag returns false for all of them. One operator typo (110% instead of 10%) blew up every sibling rollout's invariant. New semantics: - A flag whose own pct > 100 is always a typo — drop it, log Error, keep going. (Pre-fix this returned an error.) - Otherwise iterate the alphabetically-sorted flags; if the next would push cumulative past 100, drop *that* flag (log Error) and keep filling with the remaining ones. The dropped flag's traffic falls through to base. - Selector still returns nil only when nothing usable remains. This computes the budget gate inline as we build, so partial state is never exposed (pre-fix: rules were appended before the post-loop sum check, which would have leaked partial state on any future refactor that returned a half-built selector). Tests TestNewRolloutSelector_FailsClosedOnOverflow and TestNewRolloutSelector_RejectsPercentageAbove100 are renamed to …DropsOverflowingFlagButKeepsSiblings and …DropsAbove100PercentFlag and updated to assert the new drop-but-preserve-siblings behavior.
crypto/rand.Read is a getrandom(2) syscall on Linux — at 50k req/s that is 50k syscalls/s on the hot path purely for non-security bucketing. Bucketing has no security requirement: the bucket value is never visible to the client, and is not derived from request content under the current no-stickiness design (a determined client can already retry-grind via the X-Feature-Flag echo, which is the actual surface to address — separate fix). math/rand/v2.Uint32N is lock-free per goroutine, ~5ns, and rejection- samples to eliminate the modulo bias the previous `% rolloutBucketScale` introduced (≈1 in 430k buckets over-represented). Drops the crypto/rand fallback path (no longer needed) and the now-unused encoding/binary import.
… picks
The selector returns ("flag", "random", true) for traffic landing in a
rollout bucket; the request handler then ran ServeHTTP and unconditionally
set X-Feature-Flag: <picked> on the response. A determined client can
retry until that header reveals the variant, then pin via X-Feature-Flag
header / cookie — defeating the percentage gate the rollout selector
exists to enforce. Echo is preserved for explicit (preview-flag) pins
because the client already chose the flag.
Also drops the dead `_ string // versionSeed: unused` parameter from
newRolloutSelector and the corresponding routerConfig.GetVersion()
threading from graph_server.go. The parameter hinted at hash-based
stickiness that never landed; deleting it removes the cargo-culted
plumbing. If stickiness is wanted later, the right place is `pick`'s
already-existing *http.Request parameter.
Tests updated for the new signature; behavior unchanged where pickedSource
is exercised.
3ba5c01 to
04c9de5
Compare
Router-nonroot image scan passed✅ No security vulnerabilities found in image: |
… picks A previous fix in this branch stripped the X-Feature-Flag response echo when the selector picked the variant via random bucketing, on the theory that the echo lets a client retry until they grind onto the variant. In practice the variant must have an observably different response shape — otherwise the rollout has no purpose. The shape difference itself is a side-channel a grinder can use, so the strip doesn't actually prevent the attack it was meant to address. Meanwhile it: - Hides a useful debug signal from honest clients. - Breaks downstream caches that key on the flag. - Complicates integration tests that need to know which variant served (router-tests/operations/feature_flag_rollouts_test.go was relying on the header, and now correctly does so again). Reinstates the unconditional w.Header().Set(featureFlagHeader, ff) and drops the now-unused `source` capture from selector.pick. The Debug log for ignored pins (added in a prior commit) stays — that observability fix is unrelated and still useful.
res.Body is the raw response payload — a JSON string in which the inner double-quotes around field names are escaped as \". The discriminator constant was searching for `Cannot query field "productCount"` (plain quotes), which never matches the on-the-wire `Cannot query field \"productCount\"`. Every test that asserted the base graph rejected `productCount` (header- or cookie-pin-ignored variants, 0% rollout, >100% fail-closed) was silently broken — the require.Contains assertion passed only when the body genuinely lacked the substring, which is the opposite of the intended check. Now they pass when the base graph errors as expected.
Touches a path under proto/ to retrigger the Proto CI workflow on this branch. The previous Proto CI run on 3ba5c01 left a stale "uncommitted changes detected" sticky comment that never cleared because subsequent commits on this branch shared SHAs with branch 2 (where Proto CI ran once and succeeded). Locally `make generate-go` is clean against this proto state, so this push should produce a green Proto CI and clear the sticky.
@coderabbitai summary
Checklist
Open Source AI Manifesto
This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.