
feat(entity-caching-2): control plane feature flag percentage based rollout (draft)#2828

Draft
SkArchon wants to merge 19 commits into milinda/entity-caching-1-raw-event-pipeline from milinda/entity-caching-2-control-plane-ff-rollout

@SkArchon
Contributor

@SkArchon SkArchon commented May 6, 2026

This PR introduces rollout feature flags. Normally, a feature flag is reached only when a request sets the corresponding header or cookie. A rollout feature flag can also be reached WITHOUT that header or cookie: it receives a share of traffic determined by the percentage assigned to it.

In summary:

  • Add RPCs to Cosmo to create, update, and tear down feature flag rollouts
  • Include the rollout percentage in the feature flag's execution config
  • The router reads the percentage from the new execution config and randomly assigns that share of traffic to the flag
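The router code is not part of this page. The following is only a minimal sketch of one way percentage-based assignment could work. It assumes each unpinned request draws a uniform number in [0, 100) and walks the rollout flags' cumulative percentages. `RolloutFlag` and `pickRolloutFlag` are hypothetical names, not the router's API.

```typescript
// Hypothetical view of a feature flag's execution config on the router side.
interface RolloutFlag {
  name: string;
  // undefined: preview-only flag (header/cookie pinned); a number: share of unpinned traffic.
  trafficPercentage?: number;
}

// Picks the flag for one unpinned request, or undefined to stay on the base graph.
// `roll` must be in [0, 100); taking it as a parameter keeps the function testable.
function pickRolloutFlag(flags: RolloutFlag[], roll: number): string | undefined {
  let upperBound = 0;
  for (const flag of flags) {
    if (flag.trafficPercentage === undefined) continue; // never receives random traffic
    upperBound += flag.trafficPercentage; // this flag owns [previous bound, upperBound)
    if (roll < upperBound) return flag.name;
  }
  return undefined; // the remaining (100 - sum)% of traffic stays on the base graph
}

const flags: RolloutFlag[] = [
  { name: "proposal-a", trafficPercentage: 10 },
  { name: "preview" },
  { name: "proposal-b", trafficPercentage: 25 },
];
// Math.random() returns a value in [0, 1), so this roll is uniform over [0, 100).
const chosen = pickRolloutFlag(flags, Math.random() * 100);
console.log(chosen ?? "base graph");
```

The buckets are laid end to end, so the percentages must sum to at most 100. Anything above that would push part of a later flag's bucket past 100, where no roll can land. This is why the bulk-update RPC checks a cumulative budget across the graph.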

Summary by CodeRabbit

New Features

  • Feature flag rollouts now support configurable traffic percentages to manage gradual rollout distribution across users.
  • Added bulk update capability to simultaneously modify rollout percentages for multiple proposals.
  • Added teardown functionality to safely remove rollouts and clean up associated infrastructure.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/docs-website.
  • I have read the Contributors Guide.

Open Source AI Manifesto

This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0f88377c-e982-43a1-a1cc-e9590d81596a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This PR adds proposal rollout management capabilities to the platform API. It introduces database schema changes to link proposals with feature flags, implements two new RPC endpoints for bulk rollout percentage updates and teardown operations, and threads traffic percentage tracking through the composition pipeline with SDL override rewriting for feature flags.

Changes

Proposal Rollout Management

Layer / File(s) Summary
Proto & Data Model
proto/wg/cosmo/node/v1/node.proto, proto/wg/cosmo/platform/v1/platform.proto, connect/src/wg/cosmo/node/v1/node_pb.ts, connect/src/wg/cosmo/platform/v1/platform_pb.ts, connect/src/wg/cosmo/platform/v1/platform_connect.ts, connect/src/wg/cosmo/platform/v1/platform-PlatformService_connectquery.ts
Protos add trafficPercentage to FeatureFlagRouterExecutionConfig and extend Proposal with rolloutFeatureFlagId and rolloutPercentage fields. Two new RPC methods added to PlatformService: BulkUpdateProposalRolloutPercentages and TeardownProposalRollout, with corresponding request/response message types. Generated TypeScript code reflects these changes.
Database Schema
controlplane/migrations/0137_rollout_feature_flags.sql, controlplane/migrations/meta/_journal.json, controlplane/src/db/schema.ts
Migration adds traffic_percentage and proposal_id columns to the feature_flags table, with a foreign key constraint to proposals.id and a matching index. Drizzle ORM schema updated to expose the new columns and index. Migration journal entry recorded.
Data Access Layer
controlplane/src/core/repositories/ProposalRepository.ts
Three new methods added: getLinkedRolloutFlag() retrieves the linked feature flag for a proposal, setLinkedRolloutFlag() establishes the association with traffic percentage, and updateRolloutPercentage() updates the traffic percentage of a linked flag.
Composition & Traffic Flow
controlplane/src/core/composition/composeGraphs.types.ts, controlplane/src/core/composition/composeGraphs.worker.ts, controlplane/src/core/composition/composer.ts, controlplane/src/core/composition/rewriteOverrideTargets.ts, controlplane/src/core/repositories/FeatureFlagRepository.ts, controlplane/src/core/repositories/FederatedGraphRepository.ts
New rewriteOverrideTargets() function rewrites SDL @override directives to remap subgraph references. ComposeGraphsTaskInput and ComposeGraphsTaskResultItem extended with optional trafficPercentage field. routerConfigToFeatureFlagExecutionConfig() signature updated to accept and map trafficPercentage into the returned feature flag execution config. FeatureFlagRepository now carries trafficPercentage through FeatureFlagWithFeatureSubgraphs and SubgraphsToCompose interfaces, applies override rewrites before composing, and passes traffic percentage into composition results. FederatedGraphRepository propagates trafficPercentage through subgraph composition inputs and feature-flag router config creation.
API Handlers & Wiring
controlplane/src/core/bufservices/PlatformService.ts, controlplane/src/core/bufservices/proposal/bulkUpdateProposalRolloutPercentages.ts, controlplane/src/core/bufservices/proposal/teardownProposalRollout.ts, controlplane/src/core/bufservices/proposal/getProposal.ts, controlplane/src/core/bufservices/proposal/updateProposal.ts
PlatformService wires two new mutation handlers. bulkUpdateProposalRolloutPercentages() validates items, enforces single federated graph per batch, checks cumulative traffic budget ≤100%, creates feature subgraphs and flags if needed, persists traffic percentages in a transaction, and triggers composition/deployment. teardownProposalRollout() removes linked rollout flags and triggers recomposition. getProposal() now fetches and returns linked rollout metadata. updateProposal() auto-tears down rollout flags when a proposal transitions to PUBLISHED state.
Tests
controlplane/test/composition/rewriteOverrideTargets.test.ts, controlplane/test/proposal/caching-rollout.test.ts
Unit tests validate SDL override rewriting for single/multiple/empty/non-matching cases. Integration test suite exercises proposal persistence, bulk rollout updates with validation (unknown IDs, DRAFT state, out-of-range percentages), and idempotent teardown.
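The batch checks listed in the handler summary can be sketched as a pure function. This is an illustration under assumptions. `BatchItem`, `validateBatch`, and the `active` map are hypothetical; `active` holds the percentages already stored for the graph's rollout flags, keyed by proposal id. The real handler performs these checks against the database.

```typescript
interface BatchItem {
  proposalId: string;
  federatedGraphId: string;
  percentage: number;
}

// Returns an error message, or null when the batch fits the graph's traffic budget.
// `active` holds the percentages already stored for this graph's rollout flags, keyed by proposal id.
function validateBatch(items: BatchItem[], active: Map<string, number>): string | null {
  if (items.length === 0) return "batch must contain at least one item";

  const graphIds = new Set(items.map((item) => item.federatedGraphId));
  if (graphIds.size > 1) return "all items must target the same federated graph";

  const seen = new Set<string>();
  for (const item of items) {
    if (seen.has(item.proposalId)) return `duplicate proposalId ${item.proposalId}`;
    seen.add(item.proposalId);
  }

  // Overlay the batch onto a copy of the existing allocation: an update replaces
  // the proposal's previous share, while a new rollout adds one.
  const next = new Map<string, number>();
  active.forEach((pct, proposalId) => next.set(proposalId, pct));
  for (const item of items) next.set(item.proposalId, item.percentage);

  let total = 0;
  next.forEach((pct) => {
    total += pct;
  });
  if (total > 100) return `cumulative traffic ${total}% exceeds 100%`;
  return null;
}
```

Merging the batch into the existing allocation means that raising one proposal from 60% to 70% is checked as 70%, not 130%.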

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Description Check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
Title check: ✅ Passed. The PR title accurately describes the main feature being implemented: percentage-based rollout control for feature flags. It is specific, concise, and directly reflects the primary purpose of the changeset.
Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.



@codecov

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload a report for BASE (milinda/entity-caching-1-raw-event-pipeline@8455b62).

Additional details and impacted files
@@                              Coverage Diff                              @@
##             milinda/entity-caching-1-raw-event-pipeline   #2828   +/-   ##
=============================================================================
  Coverage                                               ?   9.59%           
=============================================================================
  Files                                                  ?     445           
  Lines                                                  ?   56997           
  Branches                                               ?     905           
=============================================================================
  Hits                                                   ?    5468           
  Misses                                                 ?   51122           
  Partials                                               ?     407           


SkArchon and others added 3 commits May 6, 2026 20:49
…ig + proposal-rollout RPCs

Adds the wire contract for percentage-based feature flag rollouts driven from
proposals. node.proto carries traffic_percentage on each feature flag's
execution config so the router can route a configured share of unpinned
traffic to that flag's variant. platform.proto adds the rollout-side RPCs:

  - BulkUpdateProposalRolloutPercentages (atomic create-or-update of one or
    more proposal rollouts on the same federated graph; deploys feature
    subgraphs + flag if no rollout exists yet, otherwise updates the
    percentage)
  - TeardownProposalRollout (deletes the linked feature flag + feature
    subgraphs)
  - rolloutFeatureFlagId / rolloutPercentage fields on the Proposal message

Includes regenerated TS bindings (connect/src). Go bindings are regenerated
locally but not committed — those will land alongside the consuming code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the controlplane-side machinery for percentage-based feature flag rollouts
driven from proposals.

DB schema (migration 0137_rollout_feature_flags):
  - feature_flags.traffic_percentage int (null → preview-only flag)
  - feature_flags.proposal_id uuid FK → proposals.id ON DELETE SET NULL
  - ff_proposal_id_idx for the linked-flag lookup

Composition wiring carries trafficPercentage through composeGraphs (types,
worker) into routerConfigToFeatureFlagExecutionConfig so the router config
proto's FeatureFlagRouterExecutionConfig.traffic_percentage is populated.

Bufservices:
  - BulkUpdateProposalRolloutPercentages (new): atomic create-or-update of one
    or more proposal rollouts on the same federated graph. Deploys feature
    subgraphs + flag if no rollout exists yet, otherwise updates the
    percentage. Single transaction + single composeAndDeployGraphs.
    Cumulative-budget check across the whole graph (router fails closed at
    >100%).
  - TeardownProposalRollout (new): deletes the linked feature flag.
  - getProposal / updateProposal: surface rolloutFeatureFlagId +
    rolloutPercentage on the Proposal DTO; auto-teardown the linked rollout
    when the proposal transitions to PUBLISHED (idempotent).
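The create-or-update behavior can be pictured as partitioning the batch before the single transaction runs. This is a minimal sketch. It assumes a `linkedFlagIds` lookup from proposal id to the id of the linked feature flag. `RolloutRequest`, `RolloutPlan`, and `planRollouts` are hypothetical names.

```typescript
interface RolloutRequest {
  proposalId: string;
  percentage: number;
}

interface RolloutPlan {
  // No linked flag yet: deploy feature subgraphs and a feature flag.
  creates: RolloutRequest[];
  // A linked flag already exists: only its percentage changes.
  updates: { featureFlagId: string; percentage: number }[];
}

// linkedFlagIds maps proposalId -> id of the feature flag already linked to it.
function planRollouts(items: RolloutRequest[], linkedFlagIds: Map<string, string>): RolloutPlan {
  const plan: RolloutPlan = { creates: [], updates: [] };
  for (const item of items) {
    const flagId = linkedFlagIds.get(item.proposalId);
    if (flagId === undefined) plan.creates.push(item);
    else plan.updates.push({ featureFlagId: flagId, percentage: item.percentage });
  }
  return plan;
}
```

Creates need the feature subgraph and flag deployment, while updates only touch the percentage. Both sets are then applied together, so one composeAndDeployGraphs call covers the whole batch.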

Repositories:
  - FeatureFlagRepository: thread trafficPercentage through subgraphsToCompose
    + the FF DTO surface so composition can carry it onto the proto.
  - FederatedGraphRepository: small touch to surface rollout flags.
  - ProposalRepository: getLinkedRolloutFlag, setLinkedRolloutFlag,
    updateRolloutPercentage helpers backing the rollout RPCs.

Includes integration tests under test/proposal/caching-rollout.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… feature subgraphs

When a feature flag composition replaces a base subgraph with its feature
subgraph, sibling subgraphs that have an @override(from: "<baseName>")
directive get orphaned by the swap. The FF composition then fails with a
@shareable collision and the router config silently produces no
featureFlagConfigs entry, so the router falls back to baseMux and the
rollout no-ops.

This adds rewriteOverrideTargets, a small SDL-level helper that walks the
field directives in a subgraph SDL and renames @override(from:) targets
according to a baseName -> featureSubgraphName map. FeatureFlagRepository
applies it to every sibling DTO at compose-list construction time so the
post-swap composition sees a coherent override graph.

Worker re-parses each DTO from `schemaSDL` (composeGraphs.worker.ts), so the
write back to schemaSDL is what counts; the parallel compositionSubgraphs
AST is dead in this code path.

Includes unit tests covering the empty-map no-op, single rewrite, multi-
field rewrite, and unmatched-target behavior.
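The commit describes a helper that walks field directives in the SDL. The sketch below is a simplified stand-in that uses a regular expression instead of a GraphQL AST walk. Unlike a parser-based version, it would also rewrite matching text inside descriptions or comments. `rewriteOverrideTargetsSketch` and `RenameMap` are hypothetical names. Only the behavior follows the commit message: an empty map is a no-op, matched targets are renamed, and unmatched targets are left alone.

```typescript
// Maps a base subgraph name to the feature subgraph that replaces it in the FF composition.
type RenameMap = ReadonlyMap<string, string>;

// Matches @override(from: "name") and captures the quoted subgraph name.
const OVERRIDE_FROM = /@override\(\s*from\s*:\s*"([^"]*)"\s*\)/g;

// Regex-based simplification (not an AST walk) of renaming @override(from:) targets.
function rewriteOverrideTargetsSketch(sdl: string, renames: RenameMap): string {
  if (renames.size === 0) return sdl; // empty map: no-op
  return sdl.replace(OVERRIDE_FROM, (match: string, from: string) => {
    const target = renames.get(from);
    // Unmatched targets are returned exactly as written.
    return target === undefined ? match : `@override(from: "${target}")`;
  });
}
```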

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@SkArchon SkArchon force-pushed the milinda/entity-caching-2-control-plane-ff-rollout branch from 3294faa to 1af9b5f May 6, 2026 15:20
@SkArchon
Contributor Author

SkArchon commented May 6, 2026

@CodeRabbit review

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions

github-actions Bot commented May 6, 2026

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-aeedb48489bf29ac930ed3d0c1ec99476acc6433-nonroot

@github-actions github-actions Bot added the router label May 6, 2026
@SkArchon
Contributor Author

SkArchon commented May 6, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@SkArchon SkArchon changed the title feat(entity-caching-2) control plane feature flag percentage based rollout feat(entity-caching-2): control plane feature flag percentage based rollout May 6, 2026
@SkArchon
Contributor Author

SkArchon commented May 6, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

…e race

Under parallel vitest load (pool: forks), Keycloak's admin API occasionally
returns "Realm not found" right after a successful realms.create (or 409),
flaking SetupTest in unrelated tests like update-proposal and
check-subgraph-schema. Retry up to 3x with linear backoff so the realm-cache
catches up; non-matching errors still throw immediately so real bugs surface.
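Here is a minimal sketch of the retry described in this commit. It assumes "3x" means up to three retries after the first attempt, and it uses a hypothetical base delay of 200 ms. `retryOnRealmNotFound` is not the name used in test-util.ts. The `sleep` parameter can be injected, so the backoff can be exercised without real waiting.

```typescript
// Runs `op`, retrying only when the error message contains "Realm not found".
// Linear backoff: waits baseDelayMs, then 2 * baseDelayMs, then 3 * baseDelayMs.
async function retryOnRealmNotFound<T>(
  op: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 200,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      const retryable = message.includes("Realm not found");
      // Non-matching errors and exhausted retries are rethrown so real bugs surface.
      if (!retryable || attempt >= maxRetries) throw err;
      await sleep(baseDelayMs * (attempt + 1));
    }
  }
}
```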

…andler validation

The previous app-level check rejected `> 100` only — a negative percentage
slipped through and *freed* cumulative budget, bypassing the >100
router-fail-closed guard the bulk RPC exists to enforce.

- Migration 0137: add CHECK (traffic_percentage IS NULL OR traffic_percentage BETWEEN 0 AND 100).
- schema.ts: mirror the constraint via drizzle's check() so future generated
  migrations don't drop it.
- bulkUpdate handler: reject non-integers and negatives explicitly, with a
  clearer error message.
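The rule that the handler and the CHECK constraint now share can be sketched as follows. The exact error messages are hypothetical. The second part reproduces the bypass the commit describes: with a `> 100`-only check, a negative entry lowers the batch total.

```typescript
// Returns an error message for an invalid rollout percentage, or null if it is acceptable.
function validateRolloutPercentage(value: number): string | null {
  if (!Number.isInteger(value)) return `percentage must be an integer, got ${value}`;
  if (value < 0) return `percentage must not be negative, got ${value}`;
  if (value > 100) return `percentage must not exceed 100, got ${value}`;
  return null;
}

// The bypass: one negative entry pulls the batch total under 100.
const requested = [60, 50, -20];
const naiveTotal = requested.reduce((sum, pct) => sum + pct, 0); // 90, so a "> 100" check passes
const rejected = requested.filter((pct) => validateRolloutPercentage(pct) !== null); // only -20
```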
SkArchon added 5 commits May 6, 2026 22:10
… 1-flag-per-proposal

Previously feature_flags.proposal_id was ON DELETE SET NULL, so deleting a
proposal silently nulled the link and left the flag routing traffic against
a vanished proposal — invisible to getLinkedRolloutFlag (which keys on
proposalId) and never cleaned up.

- Migration 0137: change FK to ON DELETE cascade so the rollout flag goes
  with the proposal it backs.
- Add a unique partial index on proposal_id WHERE NOT NULL, so the schema
  enforces what getLinkedRolloutFlag's LIMIT 1 already assumes — at most
  one rollout flag per proposal.
- schema.ts updated to mirror both.
…ects helper

The auto-teardown branch in updateProposal (PUBLISHED state transition) and
the TeardownProposalRollout RPC each duplicated the
delete-flag → audit-log → composeAndDeployGraphs sequence — and the inline
copy in updateProposal was missing the audit log emission. Extract into one
helper so both call sites stay in lockstep on what gets recorded and what
gets recomposed.

teardownProposalRollout also gains an early-out when the federated graph is
missing (idempotent: nothing to recompose). RBAC gates added in a follow-up.
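The extracted helper's sequence can be sketched with its dependencies injected, which also shows the idempotent early-out. `TeardownDeps`, `teardownRolloutSideEffects`, and the `"feature_flag.deleted"` audit action are hypothetical. The real helper works through the repositories and composeAndDeployGraphs.

```typescript
// Hypothetical dependency surface for the shared teardown sequence.
interface TeardownDeps {
  getLinkedRolloutFlagId(proposalId: string): Promise<string | undefined>;
  deleteFeatureFlag(flagId: string): Promise<void>;
  writeAuditLog(action: string, flagId: string): Promise<void>;
  recompose(): Promise<void>;
}

// One helper for both call sites, so both record and recompose the same way.
// Returns false when no rollout flag is linked (idempotent no-op).
async function teardownRolloutSideEffects(deps: TeardownDeps, proposalId: string): Promise<boolean> {
  const flagId = await deps.getLinkedRolloutFlagId(proposalId);
  if (flagId === undefined) return false;
  await deps.deleteFeatureFlag(flagId);
  await deps.writeAuditLog("feature_flag.deleted", flagId); // hypothetical action name
  await deps.recompose();
  return true;
}
```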
… bulkUpdate

Closes the four highest-severity findings on the rollout RPCs:

- B1 RBAC: bulkUpdate and teardown now enforce organizationDeactivated,
  namespace.enableProposals, and rbac.hasFederatedGraphWriteAccess —
  matching every other proposal RPC. Previously any authenticated org
  member (incl. viewers) could deploy/teardown rollouts and trigger
  CDN-pushing recompositions.

- B2 atomicity: the deploy fan-out — feature subgraph create + schema
  publish + feature flag create + link + audit + percentage update +
  composeAndDeployGraphs — now runs inside a single db.transaction, with
  every repo instantiated against the tx. A composition failure rolls back
  the DB so the served router config and the DB stay in sync. Previously
  partial failures left orphan feature subgraphs, live FF rows, and audit
  entries for a flag the caller believed wasn't created.

- B4 TOCTOU: the cumulative-budget read now happens inside the tx after a
  SELECT FOR UPDATE on every active rollout flag for the federated graph.
  Two concurrent batches now serialize through the lock instead of both
  observing <100, both committing, and landing >100 — exactly the
  fail-closed scenario this RPC exists to prevent.

- I4 labels: the open-coded "k=v,k=v" split is replaced with splitLabel
  from cosmo-shared, which handles the escapes the manual split mishandles.

Also includes a few correctness/observability cleanups in the same file:
batch capped at 50 items; feature flag name uses the full proposalId (was
8-char prefix — collision-prone); percentage-only updates now write a
feature_flag.updated audit row, closing the forensic gap.
The previous file was 4 negative-path tests + 2 { retry: 3 } markers
papering over a Keycloak realm-cache flake (now fixed centrally in
test-util.ts). With retries gone, real regressions in the new RPC can no
longer hide as "flake".

Adds positive coverage that was missing entirely:
- APPROVED first deploy creates the linked rollout flag.
- Re-deploy updates percentage without creating a second flag.
- Cumulative >100 across siblings is rejected; cumulative ==100 is accepted.
- Multi-graph batch is rejected.
- Teardown after a deploy clears the link and percentage.
- PUBLISHED state transition auto-tears down the rollout.

Also tightens negative coverage (negative percentage, duplicate
proposalId in batch) and types the test client as
PromiseClient<typeof PlatformService> instead of `any`.
- ProposalRepository.getLinkedRolloutFlag: scope by organizationId.
  Defense-in-depth — every current caller pre-validates via ById/ByName,
  but a future careless caller could leak a sibling org's flag id.
- composer.ts and getProposal.ts: replace
  `trafficPercentage == null ? undefined : trafficPercentage` with
  `trafficPercentage ?? undefined`. Same semantics, less noise.
- node.proto: comment that `optional` on traffic_percentage is
  load-bearing — explicit `0` (paused rollout, still part of cumulative
  budget) is distinct from unset (preview-only flag, header-pinned), and
  proto3 presence is the only way to tell the two apart. Don't drop it.
- Regenerated Go proto bindings to pick up the new comment.
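The distinction that the node.proto comment protects can be sketched on the TypeScript side. This assumes the generated field surfaces as `trafficPercentage?: number`, where an absent field is `undefined` and an explicit zero is `0`. `FeatureFlagConfig`, `rolloutState`, `budgetedFlags`, and `toProtoPercentage` are hypothetical names. The last function also shows why `?? undefined` is the right conversion from the nullable column: `??` only replaces `null` and `undefined`, whereas `|| undefined` would turn a paused `0` into "unset".

```typescript
// Mirrors an optional proto3 int field as seen from TypeScript.
interface FeatureFlagConfig {
  name: string;
  trafficPercentage?: number;
}

type RolloutState = "preview-only" | "paused" | "active";

function rolloutState(cfg: FeatureFlagConfig): RolloutState {
  if (cfg.trafficPercentage === undefined) return "preview-only"; // reachable only via header/cookie
  return cfg.trafficPercentage === 0 ? "paused" : "active";
}

// Flags that take part in the cumulative budget: every flag with the field present, paused ones included.
function budgetedFlags(cfgs: FeatureFlagConfig[]): FeatureFlagConfig[] {
  return cfgs.filter((c) => c.trafficPercentage !== undefined);
}

// Converts the nullable traffic_percentage column to the optional proto field.
function toProtoPercentage(dbValue: number | null): number | undefined {
  return dbValue ?? undefined; // keeps 0; `dbValue || undefined` would drop it
}
```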
@SkArchon SkArchon force-pushed the milinda/entity-caching-2-control-plane-ff-rollout branch from b29889d to 601ff7e May 6, 2026 16:44
…oc-comment

Propagates the "optional is load-bearing" note from node.proto into
the generated FeatureFlagRouterExecutionConfig.trafficPercentage doc
on the TS side.
@SkArchon SkArchon changed the title feat(entity-caching-2): control plane feature flag percentage based rollout feat(entity-caching-2): control plane feature flag percentage based rollout (draft) May 6, 2026
…in multi-graph test

setupGraphAndCachingProposal calls enableProposalsForNamespace at the end,
so calling it twice triggered the second call's createThenPublishSubgraph
to fail with ERR_SCHEMA_MISMATCH_WITH_APPROVED_PROPOSAL: once proposals are
enabled in the namespace, every subsequent publish requires an approved
proposal matching the SDL.

Inline the setup for the multi-graph test so both subgraphs + fed graphs
are created BEFORE enabling proposals.
@SkArchon SkArchon force-pushed the milinda/entity-caching-2-control-plane-ff-rollout branch 2 times, most recently from 04c9de5 to cf798fe May 6, 2026 17:50