feat(entity-caching-2): control plane feature flag percentage based rollout (draft)#2828
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR adds proposal rollout management capabilities to the platform API. It introduces database schema changes to link proposals with feature flags, implements two new RPC endpoints for bulk rollout percentage updates and teardown operations, and threads traffic percentage tracking through the composition pipeline with SDL override rewriting for feature flags. ChangesProposal Rollout Management
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## milinda/entity-caching-1-raw-event-pipeline #2828 +/- ##
=============================================================================
Coverage ? 9.59%
=============================================================================
Files ? 445
Lines ? 56997
Branches ? 905
=============================================================================
Hits ? 5468
Misses ? 51122
Partials ? 407 🚀 New features to boost your workflow:
|
This comment has been minimized.
This comment has been minimized.
…ig + proposal-rollout RPCs
Adds the wire contract for percentage-based feature flag rollouts driven from
proposals. node.proto carries traffic_percentage on each feature flag's
execution config so the router can route a configured share of unpinned
traffic to that flag's variant. platform.proto adds the rollout-side RPCs:
- BulkUpdateProposalRolloutPercentages (atomic create-or-update of one or
more proposal rollouts on the same federated graph; deploys feature
subgraphs + flag if no rollout exists yet, otherwise updates the
percentage)
- TeardownProposalRollout (deletes the linked feature flag + feature
subgraphs)
- rolloutFeatureFlagId / rolloutPercentage fields on the Proposal message
Includes regenerated TS bindings (connect/src). Go bindings are regenerated
locally but not committed — those will land alongside the consuming code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the controlplane-side machinery for percentage-based feature flag rollouts
driven from proposals.
DB schema (migration 0137_rollout_feature_flags):
- feature_flags.traffic_percentage int (null → preview-only flag)
- feature_flags.proposal_id uuid FK → proposals.id ON DELETE SET NULL
- ff_proposal_id_idx for the linked-flag lookup
Composition wiring carries trafficPercentage through composeGraphs (types,
worker) into routerConfigToFeatureFlagExecutionConfig so the router config
proto's FeatureFlagRouterExecutionConfig.traffic_percentage is populated.
Bufservices:
- BulkUpdateProposalRolloutPercentages (new): atomic create-or-update of one
or more proposal rollouts on the same federated graph. Deploys feature
subgraphs + flag if no rollout exists yet, otherwise updates the
percentage. Single transaction + single composeAndDeployGraphs.
Cumulative-budget check across the whole graph (router fails closed at
>100%).
- TeardownProposalRollout (new): deletes the linked feature flag.
- getProposal / updateProposal: surface rolloutFeatureFlagId +
rolloutPercentage on the Proposal DTO; auto-teardown the linked rollout
when the proposal transitions to PUBLISHED (idempotent).
Repositories:
- FeatureFlagRepository: thread trafficPercentage through subgraphsToCompose
+ the FF DTO surface so composition can carry it onto the proto.
- FederatedGraphRepository: small touch to surface rollout flags.
- ProposalRepository: getLinkedRolloutFlag, setLinkedRolloutFlag,
updateRolloutPercentage helpers backing the rollout RPCs.
Includes integration tests under test/proposal/caching-rollout.test.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… feature subgraphs When a feature flag composition replaces a base subgraph with its feature subgraph, sibling subgraphs that have an @OverRide(from: "<baseName>") directive get orphaned by the swap. The FF composition then fails with a @Shareable collision and the router config silently produces no featureFlagConfigs entry, so the router falls back to baseMux and the rollout no-ops. This adds rewriteOverrideTargets, a small SDL-level helper that walks the field directives in a subgraph SDL and renames @OverRide(from:) targets according to a baseName -> featureSubgraphName map. FeatureFlagRepository applies it to every sibling DTO at compose-list construction time so the post-swap composition sees a coherent override graph. Worker re-parses each DTO from `schemaSDL` (composeGraphs.worker.ts), so the write back to schemaSDL is what counts; the parallel compositionSubgraphs AST is dead in this code path. Includes unit tests covering the empty-map no-op, single rewrite, multi- field rewrite, and unmatched-target behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3294faa to
1af9b5f
Compare
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
Router-nonroot image scan passed✅ No security vulnerabilities found in image: |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
…e race Under parallel vitest load (pool: forks), Keycloak's admin API occasionally returns "Realm not found" right after a successful realms.create (or 409), flaking SetupTest in unrelated tests like update-proposal and check-subgraph-schema. Retry up to 3x with linear backoff so the realm-cache catches up; non-matching errors still throw immediately so real bugs surface.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…andler validation The previous app-level check rejected `> 100` only — a negative percentage slipped through and *freed* cumulative budget, bypassing the >100 router-fail-closed guard the bulk RPC exists to enforce. - Migration 0137: add CHECK (traffic_percentage IS NULL OR BETWEEN 0 AND 100). - schema.ts: mirror the constraint via drizzle's check() so future generated migrations don't drop it. - bulkUpdate handler: reject non-integers and negatives explicitly, with a clearer error message.
… 1-flag-per-proposal Previously feature_flags.proposal_id was ON DELETE SET NULL, so deleting a proposal silently nulled the link and left the flag routing traffic against a vanished proposal — invisible to getLinkedRolloutFlag (which keys on proposalId) and never cleaned up. - Migration 0137: change FK to ON DELETE cascade so the rollout flag goes with the proposal it backs. - Add a unique partial index on proposal_id WHERE NOT NULL, so the schema enforces what getLinkedRolloutFlag's LIMIT 1 already assumes — at most one rollout flag per proposal. - schema.ts updated to mirror both.
…ects helper The auto-teardown branch in updateProposal (PUBLISHED state transition) and the TeardownProposalRollout RPC each duplicated the delete-flag → audit-log → composeAndDeployGraphs sequence — and the inline copy in updateProposal was missing the audit log emission. Extract into one helper so both call sites stay in lockstep on what gets recorded and what gets recomposed. teardownProposalRollout also gains an early-out when the federated graph is missing (idempotent: nothing to recompose). RBAC gates added in a follow-up.
… bulkUpdate Closes the four highest-severity findings on the rollout RPCs: - B1 RBAC: bulkUpdate and teardown now enforce organizationDeactivated, namespace.enableProposals, and rbac.hasFederatedGraphWriteAccess — matching every other proposal RPC. Previously any authenticated org member (incl. viewers) could deploy/teardown rollouts and trigger CDN-pushing recompositions. - B2 atomicity: the deploy fan-out — feature subgraph create + schema publish + feature flag create + link + audit + percentage update + composeAndDeployGraphs — now runs inside a single db.transaction, with every repo instantiated against the tx. A composition failure rolls back the DB so the served router config and the DB stay in sync. Previously partial failures left orphan feature subgraphs, live FF rows, and audit entries for a flag the caller believed wasn't created. - B4 TOCTOU: the cumulative-budget read now happens inside the tx after a SELECT FOR UPDATE on every active rollout flag for the federated graph. Two concurrent batches now serialize through the lock instead of both observing <100, both committing, and landing >100 — exactly the fail-closed scenario this RPC exists to prevent. - I4 labels: the open-coded "k=v,k=v" split is replaced with splitLabel from cosmo-shared, which handles the escapes the manual split mishandles. Also includes a few correctness/observability cleanups in the same file: batch capped at 50 items; feature flag name uses the full proposalId (was 8-char prefix — collision-prone); percentage-only updates now write a feature_flag.updated audit row, closing the forensic gap.
The previous file was 4 negative-path tests + 2 { retry: 3 } markers
papering over a Keycloak realm-cache flake (now fixed centrally in
test-util.ts). With retries gone, real regressions in the new RPC can no
longer hide as "flake".
Adds positive coverage that was missing entirely:
- APPROVED first deploy creates the linked rollout flag.
- Re-deploy updates percentage without creating a second flag.
- Cumulative >100 across siblings is rejected; cumulative ==100 is accepted.
- Multi-graph batch is rejected.
- Teardown after a deploy clears the link and percentage.
- PUBLISHED state transition auto-tears down the rollout.
Also tightens negative coverage (negative percentage, duplicate
proposalId in batch) and types the test client as
PromiseClient<typeof PlatformService> instead of `any`.
- ProposalRepository.getLinkedRolloutFlag: scope by organizationId. Defense-in-depth — every current caller pre-validates via ById/ByName, but a future careless caller could leak a sibling org's flag id. - composer.ts and getProposal.ts: replace `trafficPercentage == null ? undefined : trafficPercentage` with `trafficPercentage ?? undefined`. Same semantics, less noise. - node.proto: comment that `optional` on traffic_percentage is load-bearing — explicit `0` (paused rollout, still part of cumulative budget) is distinct from unset (preview-only flag, header-pinned), and proto3 presence is the only way to tell the two apart. Don't drop it. - Regenerated Go proto bindings to pick up the new comment.
b29889d to
601ff7e
Compare
…oc-comment Propagates the "optional is load-bearing" note from node.proto into the generated FeatureFlagRouterExecutionConfig.trafficPercentage doc on the TS side.
…in multi-graph test setupGraphAndCachingProposal calls enableProposalsForNamespace at the end, so calling it twice triggered the second call's createThenPublishSubgraph to fail with ERR_SCHEMA_MISMATCH_WITH_APPROVED_PROPOSAL: once proposals are enabled in the namespace, every subsequent publish requires an approved proposal matching the SDL. Inline the setup for the multi-graph test so both subgraphs + fed graphs are created BEFORE enabling proposals.
04c9de5 to
cf798fe
Compare
This PR introduces the concept of rollout feature flags. Normally Feature Flags can be accessed only by setting a header or a cookie. However rollout feature flags are essentially flags that can be reached WITHOUT any of the headers or cookies needed. It will be reached based on the percentage assigned to it, this percentage indicates how much traffic should be sent to the feature flags.
In summary
Summary by CodeRabbit
New Features
Checklist
Open Source AI Manifesto
This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.