feat(entity-caching-3): feature flag percentage based rollout router changes #2829

Draft
SkArchon wants to merge 14 commits into milinda/entity-caching-2-control-plane-ff-rollout from milinda/entity-caching-3-feature-flag-rollout-router

@SkArchon SkArchon commented May 6, 2026

@coderabbitai summary

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/docs-website.
  • I have read the Contributors Guide.

Open Source AI Manifesto

This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2263c87-1380-4f90-999f-55ff0bdc710e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@github-actions github-actions Bot added the router label May 6, 2026
@SkArchon SkArchon changed the title feat: ff percentage based rollout router feat(entity-caching-3): feature flag percentage based rollout router changes May 6, 2026
codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.49%. Comparing base (09966b2) to head (3ec63a1).

Additional details and impacted files
@@                                  Coverage Diff                                   @@
##           milinda/entity-caching-2-control-plane-ff-rollout    #2829       +/-   ##
======================================================================================
+ Coverage                                               9.59%   40.49%   +30.90%     
======================================================================================
  Files                                                    445       14      -431     
  Lines                                                  56997     1521    -55476     
  Branches                                                 905        0      -905     
======================================================================================
- Hits                                                    5468      616     -4852     
+ Misses                                                 51122      790    -50332     
+ Partials                                                 407      115      -292     

see 431 files with indirect coverage changes

@SkArchon SkArchon force-pushed the milinda/entity-caching-2-control-plane-ff-rollout branch from 3294faa to 1af9b5f Compare May 6, 2026 15:20
SkArchon added 7 commits May 6, 2026 22:54
The loop at feature_flag_rollout.go:53-58 read `!= nil` where it should
have been `== nil` — every flag that *carried* a traffic_percentage was
skipped, and only preview-only flags were appended. Net effect:

- Real rollout flags never registered → still client-pinnable via
  X-Feature-Flag / cookie. The exact "header pin bypass" the design
  promises is defeated; a client who knows the flag name escapes the
  percentage gate entirely.
- Preview flags ended up in `rolloutFlags` because GetTrafficPercentage()
  returns 0 when unset → isRolloutFlag(name)==true → graph_server.go's
  pin-strip block fires → existing preview clients lose pin steering.

The doc-comment two lines up ("every flag whose execution config carries
a traffic_percentage participates") had the right intent; the code just
inverted it.

This single-line fix restores the security contract of the feature.
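
A minimal sketch of the corrected loop, for illustration only. The `flagConfig` type and the `collectRolloutFlags` helper are hypothetical stand-ins and not the code in feature_flag_rollout.go; the sketch assumes an unset traffic_percentage is represented as a nil pointer.

```go
package main

import "fmt"

// flagConfig is a hypothetical stand-in for a feature flag's execution config.
// TrafficPercentage is nil when the flag is preview-only (assumption).
type flagConfig struct {
	Name              string
	TrafficPercentage *uint32
}

// collectRolloutFlags returns the names of flags that carry a traffic_percentage.
func collectRolloutFlags(flags []flagConfig) []string {
	var rollout []string
	for _, f := range flags {
		// Skip preview-only flags. The bug described above used `!= nil` here,
		// which skipped exactly the flags that should have been registered.
		if f.TrafficPercentage == nil {
			continue
		}
		rollout = append(rollout, f.Name)
	}
	return rollout
}

func main() {
	ten := uint32(10)
	flags := []flagConfig{
		{Name: "rollout-flag", TrafficPercentage: &ten},
		{Name: "preview-flag"},
	}
	fmt.Println(collectRolloutFlags(flags))
}
```

With the inverted check, only "preview-flag" would have been returned; with `== nil`, only "rollout-flag" is.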

Enabling rollouts flips the routing semantics for every flag whose
execution config carries a traffic_percentage — header/cookie pins start
being ignored. That is a behavior change that should not happen on a
silent router upgrade.

Per router/CLAUDE.md ("Boolean defaults should be `false` when possible"),
flip envDefault to false, the schema's documented default to false, and
update config_defaults.json to match. fixtures/full.yaml keeps
`enabled: true` — it is a feature-on example, not a defaults snapshot.

… whole graph

The previous logic returned an error and a nil selector when cumulative
percentage exceeded 100. graph_server.go logs the error and proceeds with
selector==nil — but selector==nil silently restores client-pinnability for
every other rollout flag on the graph, because isRolloutFlag returns false
for all of them. One operator typo (110% instead of 10%) blew up every
sibling rollout's invariant.

New semantics:
- A flag whose own pct > 100 is always a typo — drop it, log Error, keep
  going. (Pre-fix this returned an error.)
- Otherwise iterate the alphabetically-sorted flags; if the next would push
  cumulative past 100, drop *that* flag (log Error) and keep filling with
  the remaining ones. The dropped flag's traffic falls through to base.
- Selector still returns nil only when nothing usable remains.

This computes the budget gate inline as we build, so partial state is never
exposed (pre-fix: rules were appended before the post-loop sum check, which
would have leaked partial state on any future refactor that returned a
half-built selector).

Tests TestNewRolloutSelector_FailsClosedOnOverflow and
TestNewRolloutSelector_RejectsPercentageAbove100 are renamed to
…DropsOverflowingFlagButKeepsSiblings and …DropsAbove100PercentFlag and
updated to assert the new drop-but-preserve-siblings behavior.
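
The new semantics can be sketched roughly as follows. This is an illustration under assumptions, not the newRolloutSelector implementation: the `rule` type and `buildRules` function are hypothetical, and dropped flags are returned to the caller here where the commit logs an Error.

```go
package main

import (
	"fmt"
	"sort"
)

// rule is a hypothetical rollout rule: a flag name and its traffic share.
type rule struct {
	Flag       string
	Percentage uint32
}

// buildRules applies the budget gate inline while building the rule list,
// so a half-built list is never returned alongside an error.
func buildRules(pcts map[string]uint32) (kept []rule, dropped []string) {
	names := make([]string, 0, len(pcts))
	for name := range pcts {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic, alphabetical fill order

	var cumulative uint32
	for _, name := range names {
		pct := pcts[name]
		// A single flag above 100 is treated as a typo: drop it, keep going.
		if pct > 100 {
			dropped = append(dropped, name)
			continue
		}
		// Drop only the flag that would overflow the budget; siblings survive.
		if cumulative+pct > 100 {
			dropped = append(dropped, name)
			continue
		}
		cumulative += pct
		kept = append(kept, rule{Flag: name, Percentage: pct})
	}
	return kept, dropped
}

func main() {
	kept, dropped := buildRules(map[string]uint32{"a": 60, "b": 110, "c": 50, "d": 30})
	fmt.Println(kept, dropped)
}
```

In this example "b" is dropped for exceeding 100 on its own, "c" is dropped because 60 + 50 would exceed the budget, and "d" still fits, so "a" and "d" are kept.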

crypto/rand.Read is a getrandom(2) syscall on Linux; at 50k req/s that
is 50k syscalls/s on the hot path purely for non-security bucketing.
Bucketing has no security requirement: the bucket value is never visible
to the client, and is not derived from request content under the current
no-stickiness design (a determined client can already retry-grind via the
X-Feature-Flag echo, which is the actual surface to address — separate
fix).

math/rand/v2.Uint32N is lock-free per goroutine, ~5ns, and rejection-
samples to eliminate the modulo bias the previous `% rolloutBucketScale`
introduced (≈1 in 430k buckets over-represented).

Drops the crypto/rand fallback path (no longer needed) and the now-unused
encoding/binary import.
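
A small sketch of bucketing with math/rand/v2, for illustration. The value 10,000 for `rolloutBucketScale` is an assumption (it would be consistent with the "≈1 in 430k" figure above, but the commit does not state it), and `pickBucket` and `inRollout` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// rolloutBucketScale is an assumed value: 10,000 buckets gives a resolution
// of 0.01% per bucket.
const rolloutBucketScale = 10_000

// pickBucket draws a bucket in [0, rolloutBucketScale). Uint32N returns a
// value in the half-open range, so no `%` reduction is needed.
func pickBucket() uint32 {
	return rand.Uint32N(rolloutBucketScale)
}

// inRollout reports whether a bucket falls inside a flag's share, where
// percentage is a whole number in [0, 100].
func inRollout(bucket, percentage uint32) bool {
	return bucket < percentage*(rolloutBucketScale/100)
}

func main() {
	b := pickBucket()
	fmt.Println(b, inRollout(b, 10))
}
```

A 10% flag in this sketch owns buckets 0 through 999, so roughly one request in ten lands in it.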

… picks

The selector returns ("flag", "random", true) for traffic landing in a
rollout bucket; the request handler then ran ServeHTTP and unconditionally
set X-Feature-Flag: <picked> on the response. A determined client can
retry until that header reveals the variant, then pin via X-Feature-Flag
header / cookie — defeating the percentage gate the rollout selector
exists to enforce. Echo is preserved for explicit (preview-flag) pins
because the client already chose the flag.

Also drops the dead `_ string // versionSeed: unused` parameter from
newRolloutSelector and the corresponding routerConfig.GetVersion()
threading from graph_server.go. The parameter hinted at hash-based
stickiness that never landed; deleting it removes the cargo-culted
plumbing. If stickiness is wanted later, the right place is `pick`'s
already-existing *http.Request parameter.

Tests updated for the new signature; behavior unchanged where pickedSource
is exercised.
@SkArchon SkArchon closed this May 6, 2026
@SkArchon SkArchon force-pushed the milinda/entity-caching-3-feature-flag-rollout-router branch from 3ba5c01 to 04c9de5 Compare May 6, 2026 17:50
@SkArchon SkArchon reopened this May 6, 2026
github-actions Bot commented May 6, 2026

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-10af5c602db85b6585758340bc97514087e57a26-nonroot

SkArchon added 3 commits May 6, 2026 23:36
… picks

A previous fix in this branch stripped the X-Feature-Flag response echo
when the selector picked the variant via random bucketing, on the theory
that the echo lets a client retry until they grind onto the variant.

In practice the variant must have an observably different response shape
— otherwise the rollout has no purpose. The shape difference itself is a
side-channel a grinder can use, so the strip doesn't actually prevent the
attack it was meant to address. Meanwhile it:

- Hides a useful debug signal from honest clients.
- Breaks downstream caches that key on the flag.
- Complicates integration tests that need to know which variant served
  (router-tests/operations/feature_flag_rollouts_test.go was relying on
  the header, and now correctly does so again).

Reinstates the unconditional w.Header().Set(featureFlagHeader, ff) and
drops the now-unused `source` capture from selector.pick. The Debug log
for ignored pins (added in a prior commit) stays — that observability
fix is unrelated and still useful.
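
A rough sketch of the reinstated echo, for illustration. `withFlagEcho` and `echoedFlag` are hypothetical helpers and not the router's handler; the sketch assumes the header is set before the wrapped handler writes the response.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// featureFlagHeader mirrors the header name used in the commit message.
const featureFlagHeader = "X-Feature-Flag"

// withFlagEcho sets the header regardless of how the flag was chosen
// (explicit pin or random bucketing), then delegates to the next handler.
func withFlagEcho(ff string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set(featureFlagHeader, ff)
		next.ServeHTTP(w, r)
	})
}

// echoedFlag serves one request through the wrapper and reports the header
// a client (or an integration test) would observe.
func echoedFlag(ff string) string {
	next := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/graphql", nil)
	withFlagEcho(ff, next).ServeHTTP(rec, req)
	return rec.Result().Header.Get(featureFlagHeader)
}

func main() {
	fmt.Println(echoedFlag("my-rollout-flag"))
}
```

An integration test can then read the header to learn which variant served the request, which is the reliance the commit describes.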

res.Body is the raw response payload: a JSON string in which the inner
double-quotes around field names are escaped as \". The discriminator
constant was searching for `Cannot query field "productCount"` (plain
quotes), which never matches the on-the-wire `Cannot query field \"productCount\"`.

Every test that asserted the base graph rejected `productCount` (header-
or cookie-pin-ignored variants, 0% rollout, >100% fail-closed) was
silently broken — the require.Contains assertion passed only when the
body genuinely lacked the substring, which is the opposite of the
intended check. Now they pass when the base graph errors as expected.
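
The mismatch can be reproduced with a small sketch. The `rawBody` value below is an illustrative payload, not a captured router response, and the constant and helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// rawBody is an illustrative on-the-wire JSON payload. Inside the JSON string
// value, the quotes around the field name are escaped with a backslash.
const rawBody = `{"errors":[{"message":"Cannot query field \"productCount\""}]}`

const (
	// plainNeedle uses unescaped quotes and never appears in the raw bytes.
	plainNeedle = `Cannot query field "productCount"`
	// escapedNeedle matches the escaped form actually present in the payload.
	escapedNeedle = `Cannot query field \"productCount\"`
)

// bodyHas reports whether the raw body contains the given substring.
func bodyHas(needle string) bool {
	return strings.Contains(rawBody, needle)
}

func main() {
	fmt.Println(bodyHas(plainNeedle))   // false
	fmt.Println(bodyHas(escapedNeedle)) // true
}
```

An alternative would be to decode the JSON first and match on the parsed message, which avoids depending on the escaping altogether.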

Touches a path under proto/ to retrigger the Proto CI workflow on this
branch. The previous Proto CI run on 3ba5c01 left a stale "uncommitted
changes detected" sticky comment that never cleared because subsequent
commits on this branch shared SHAs with branch 2 (where Proto CI ran
once and succeeded). Locally `make generate-go` is clean against this
proto state, so this push should produce a green Proto CI and clear
the sticky.