Compose by_path / paths_of_interest with survey_design (Wave 4 #10)#408
Compose by_path / paths_of_interest with survey_design (Wave 4 #10)#408
Conversation
PR ReviewOverall Assessment✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
MethodologyFinding 1 — P3 Informational: documented Python-only survey extensionImpact: The PR intentionally extends Concrete fix: None required. Code QualityNo findings. The new per-path event and placebo code paths are structurally mirrored, and inference fields use PerformanceNo findings. The added per-path survey variance work is only entered when MaintainabilityNo findings. Sibling-surface audit covered event-study vs placebo, global vs path inference refresh, and Tech DebtNo findings requiring action. The intentionally deferred survey + per-path multiplier-bootstrap derivation is documented in the REGISTRY and gated at runtime. SecurityNo findings. No secrets, external I/O, or security-sensitive code paths were introduced in the reviewed diff. Documentation/TestsFinding 2 — P2: constant-weight SE invariant test only asserts effectsImpact: Concrete fix: Add an assertion comparing Audit NotesCompleted the requested audits: sibling surfaces, pattern-wide grep for inline/unsafe inference patterns, reciprocal |
R0 P2: test_per_path_se_within_envelope_of_unweighted claimed an SE invariant in its name but only asserted point-estimate equality. Add finite-SE rtol=0.10 envelope assertion alongside the existing effect equality, matching the test's documented contract. Under unit weights + single stratum + PSU=group, Binder TSL contributes a Bessel n/(n-1) factor relative to plug-in SE's plain 1/n divisor, so SEs differ by O(1/n) but track within a few percent on cohort-clean panels. Empirical observation on the test fixture: max rtol ~0.84% (well under the 10% envelope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR ReviewOverall AssessmentExecutive Summary
MethodologyFinding 1 — P1 [Newly identified]: final replicate df is not applied to per-path inferenceImpact: Under replicate-weight Concrete fix: In the final replicate-df refresh block, also refresh Code QualityNo additional findings. The event-study and placebo per-path survey code is structurally mirrored and uses PerformanceNo findings. The new cell-period per-path survey variance work is gated to survey + per-path selectors. MaintainabilityThe sibling-surface mirror audit found the P1 above: final replicate-df refresh mirrors global event/placebo surfaces but not the per-path surfaces introduced by this PR. Tech DebtNo TODO-tracked item mitigates the P1 above. Existing TODO coverage for true replicate-weight test fixtures is unrelated to final df propagation. SecurityNo findings. The diff does not introduce secrets, external I/O, or security-sensitive behavior. Documentation/TestsThe previous P2 is resolved at Path To Approval
|
R1 P1: per-path event-study and placebo helpers snapshot df_inference BEFORE appending their own n_valid contributions to the shared _replicate_n_valid_list. The early _refresh_path_inference call then ran immediately after per-path runs but BEFORE later IF sites (overall / joiners / leavers / heterogeneity) appended their own n_valid values. If a later append further reduced the effective df, per-path t_stat / p_value / conf_int silently used a larger df than the global surfaces and survey_metadata.df_survey, which the existing final R2 P1b refresh block already updates with the FINAL df. Fix: relocate _refresh_path_inference to the final R2 P1b block at the bottom of fit() so per-path entries are refreshed alongside the global event-study / placebo / heterogeneity / overall / joiners / leavers / normalized surfaces with the final _final_inf_df. Narrow the helper to per-path only (the existing block handles globals inline). Add a dedicated regression test asserting that every populated per-path entry's t_stat / p_value / conf_int reproduces safe_inference(effect, se, df=results.survey_metadata.df_survey) under a JK1 replicate-weight design with placebo enabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR ReviewOverall Assessment✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
MethodologyNo P0/P1 findings. The prior stale-df issue is addressed by Code QualityNo findings. Sibling-surface audit: event-study, placebo, by_path, and paths_of_interest surfaces are mirrored. The multiplier-bootstrap gate applies reciprocally to both selectors at PerformanceNo findings. The new per-path survey work is gated to survey + per-path selectors, and the existing cell/group allocator dispatch is reused. MaintainabilityNo findings. The helper isolates the final inference refresh cleanly, and the affected call sites thread Tech DebtNo blocker findings. No TODO entry is needed for the resolved prior P1. Existing TODO entries do not appear to mask any new correctness issue in this diff. SecurityNo findings. Secret-pattern grep over the changed files found no obvious tokens, keys, or private material. Documentation/TestsFinding 1 — P2: replicate-df regression does not force a stale-vs-final df differenceImpact: The new regression checks that per-path inference equals Concrete fix: Strengthen the test fixture so a later overall/joiner/leaver or placebo IF site has fewer valid replicate columns than the per-path site, then assert per-path Audit Notes
|
R2 P2: the previous regression test asserted per-path inference matches `safe_inference(..., df=results.survey_metadata.df_survey)`, but under uniform-valid replicate fixtures every IF site reports the same `n_valid` so the snapshot df and final df happen to coincide and the assertion passes vacuously even when the bug is present (per-path uses a stale snapshot df that incidentally equals the final df). Add `test_refresh_path_inference_called_from_final_block`: wraps the helper with `mock.patch.object` to capture call_args, asserts (a) helper is invoked exactly once, (b) the `df_final` it received equals `results.survey_metadata.df_survey` — a relationship that holds by construction when invoked from the final R2 P1b block (which uses `_final_eff_df = _effective_df_survey(resolved_survey, _replicate_n_valid_list)` AFTER all appends), but can only coincide by chance when invoked from an earlier block on a fixture where snapshot equals final. Update the existing test's docstring to acknowledge it documents the contract on a uniform-valid fixture and points readers at the new mock-based test for direct call-site verification. Use `importlib.import_module` to access the dCDH module: the top-level `diff_diff` package re-exports the convenience function `chaisemartin_dhaultfoeuille`, shadowing the module of the same name in attribute lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR ReviewOverall Assessment✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
MethodologyFinding 1 — P3: documented Python-only survey extensionImpact: The PR changes dCDH per-path SE behavior under Concrete fix: No action required. Relevant implementation: Code QualityNo findings. Sibling-surface audit passed for event-study vs placebo, PerformanceNo findings. The extra per-period IF work is only requested when MaintainabilityNo findings. The Tech DebtNo blocker findings. No relevant unmitigated correctness debt was found in SecurityNo findings. Secret-pattern grep over changed files found no credentials or private-key material. The only hit was changelog prose about secret scanning. Documentation/TestsFinding 1 — P2: replicate-df regression still does not force a stale-vs-final df differenceImpact: The added mock-style regression improves coverage by checking Concrete fix: Add a deterministic fixture or monkeypatch that forces a later overall/joiner/leaver/placebo IF site to return a smaller Audit Notes
|
R3 P2: the previous mock-based regression checks the helper is called with the final df, but if no later IF site reduces n_valid relative to per-path snapshots, snapshot df coincides with final df and the assertion is vacuous. Add `test_per_path_inference_refreshes_to_lower_final_df`: deter- ministic forcing function via `mock.patch.object` on `_compute_se`, gated by a flag that flips after `_compute_path_effects` returns. After the flag is set, every subsequent `_compute_se` call returns a hardcoded low `n_valid=5` — so global placebo / overall / joiners / leavers all append 5, while per-path effects already snapshotted a high df from the unmodified pre-flag calls. Final `survey_metadata.df_survey = 5 - 1 = 4` is strictly less than the per-path snapshot df, forcing the refresh to demonstrably move per-path inference from the high snapshot df to the low final df. Sanity-checked: temporarily replacing `_refresh_path_inference` with a no-op causes the new test to fail with a stale-p_value assertion, confirming bug-detection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR ReviewOverall Assessment✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
MethodologyFinding 1 — P3: documented Python-only per-path survey extensionSeverity: P3 informational Code QualityNo findings. The sibling-surface mirror audit passed for PerformanceNo findings. The extra per-period IF construction is gated on survey usage, and MaintainabilityNo findings. The final-df refresh is localized in Tech DebtNo blocker findings. The broader survey cell-period allocator derivation limitation is already tracked in SecurityNo findings. Secret-pattern grep over the changed files found no credentials or private-key material; the only hit was changelog prose about secret scanning. Documentation/TestsFinding 1 — P3: minor wording ambiguity around “zeroed” non-path rowsSeverity: P3 informational Test execution was attempted but blocked: Audit Notes
|
Lift the gate at chaisemartin_dhaultfoeuille.py:1233-1239 so per-path
event-study effects compose with survey_design under analytical
Binder TSL SE and replicate-weight bootstrap variance. Multiplier
bootstrap (n_bootstrap > 0) under survey + by_path remains gated;
the survey-aware perturbation pivot for path-restricted IFs is
methodologically underived and deferred to a future wave.
Per-path SE routes through the existing _survey_se_from_group_if
cell-period allocator. The per-period IF (U_pp_l_path) with non-path
switcher contributions zeroed at both group and cell levels (the
row-sum identity U_pp.sum(axis=1) == U is preserved trivially under
group-level zeroing) is cohort-recentered via
_cohort_recenter_per_period, then expanded to observations as
psi_i = U_pp[g_i, t_i] * (w_i / W_{g_i, t_i}). Replicate-weight
designs unconditionally use the cell allocator (Class A contract,
PR #323).
New _refresh_path_inference helper post-call refreshes safe_inference
on every populated entry across multi_horizon_inference,
placebo_horizon_inference, path_effects, and path_placebos so all
four surfaces reflect the same final df_survey after per-path
replicate fits append n_valid to the shared accumulator.
Path-enumeration ranking under survey_design remains unweighted
(group-cardinality, not population-weight mass). Lonely-PSU policy
stays sample-wide. Telescope invariant holds bit-exactly: on a
single-path panel, per-path SE matches the global non-by_path
survey SE.
No R parity — R did_multiplegt_dyn does not support survey
weighting; this is a Python-only methodology extension.
14 new tests across two test classes:
- TestByPathSurveyDesignAnalytical: gate dispatch, anti-regression
on global TSL+bootstrap (locks per-path-only gate scope), per-path
analytical SE, single-path telescope, replicate-weight SE,
df_survey propagation, per-path placebos, trends_linear cumulated
SE inheritance, unobserved-path warnings under survey.
- TestByPathSurveyDesignTelescope: single-path telescoping
invariant for analytical TSL.
Documentation: REGISTRY.md "Per-path survey-design SE" sub-paragraph;
by_path / paths_of_interest docstrings updated; CHANGELOG entry;
docs/api/chaisemartin_dhaultfoeuille.rst and llms-full.txt updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R0 P2: test_per_path_se_within_envelope_of_unweighted claimed an SE invariant in its name but only asserted point-estimate equality. Add finite-SE rtol=0.10 envelope assertion alongside the existing effect equality, matching the test's documented contract. Under unit weights + single stratum + PSU=group, Binder TSL contributes a Bessel n/(n-1) factor relative to plug-in SE's plain 1/n divisor, so SEs differ by O(1/n) but track within a few percent on cohort-clean panels. Empirical observation on the test fixture: max rtol ~0.84% (well under the 10% envelope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R1 P1: per-path event-study and placebo helpers snapshot df_inference BEFORE appending their own n_valid contributions to the shared _replicate_n_valid_list. The early _refresh_path_inference call then ran immediately after per-path runs but BEFORE later IF sites (overall / joiners / leavers / heterogeneity) appended their own n_valid values. If a later append further reduced the effective df, per-path t_stat / p_value / conf_int silently used a larger df than the global surfaces and survey_metadata.df_survey, which the existing final R2 P1b refresh block already updates with the FINAL df. Fix: relocate _refresh_path_inference to the final R2 P1b block at the bottom of fit() so per-path entries are refreshed alongside the global event-study / placebo / heterogeneity / overall / joiners / leavers / normalized surfaces with the final _final_inf_df. Narrow the helper to per-path only (the existing block handles globals inline). Add a dedicated regression test asserting that every populated per-path entry's t_stat / p_value / conf_int reproduces safe_inference(effect, se, df=results.survey_metadata.df_survey) under a JK1 replicate-weight design with placebo enabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R2 P2: the previous regression test asserted per-path inference matches `safe_inference(..., df=results.survey_metadata.df_survey)`, but under uniform-valid replicate fixtures every IF site reports the same `n_valid` so the snapshot df and final df happen to coincide and the assertion passes vacuously even when the bug is present (per-path uses a stale snapshot df that incidentally equals the final df). Add `test_refresh_path_inference_called_from_final_block`: wraps the helper with `mock.patch.object` to capture call_args, asserts (a) helper is invoked exactly once, (b) the `df_final` it received equals `results.survey_metadata.df_survey` — a relationship that holds by construction when invoked from the final R2 P1b block (which uses `_final_eff_df = _effective_df_survey(resolved_survey, _replicate_n_valid_list)` AFTER all appends), but can only coincide by chance when invoked from an earlier block on a fixture where snapshot equals final. Update the existing test's docstring to acknowledge it documents the contract on a uniform-valid fixture and points readers at the new mock-based test for direct call-site verification. Use `importlib.import_module` to access the dCDH module: the top-level `diff_diff` package re-exports the convenience function `chaisemartin_dhaultfoeuille`, shadowing the module of the same name in attribute lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R3 P2: the previous mock-based regression checks the helper is called with the final df, but if no later IF site reduces n_valid relative to per-path snapshots, snapshot df coincides with final df and the assertion is vacuous. Add `test_per_path_inference_refreshes_to_lower_final_df`: deter- ministic forcing function via `mock.patch.object` on `_compute_se`, gated by a flag that flips after `_compute_path_effects` returns. After the flag is set, every subsequent `_compute_se` call returns a hardcoded low `n_valid=5` — so global placebo / overall / joiners / leavers all append 5, while per-path effects already snapshotted a high df from the unmodified pre-flag calls. Final `survey_metadata.df_survey = 5 - 1 = 4` is strictly less than the per-path snapshot df, forcing the refresh to demonstrably move per-path inference from the high snapshot df to the low final df. Sanity-checked: temporarily replacing `_refresh_path_inference` with a no-op causes the new test to fail with a stale-p_value assertion, confirming bug-detection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
faa97bc to
fb59427
Compare
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR ReviewOverall Assessment✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
MethodologyFinding 1 — P3: documented Python-only per-path survey extensionSeverity: P3 informational Code QualityNo findings. The sibling-surface mirror audit passed for PerformanceNo findings. The extra per-period IF allocation is gated on MaintainabilityNo findings. The final inference refresh is localized in Tech DebtNo blocker findings. The broader survey cell-period allocator derivation limitation remains tracked in SecurityNo findings. Secret-pattern grep over the changed files found no credentials; the only hit was changelog prose about secret scanning. Documentation/TestsNo findings. The previous wording concern is materially addressed in public-facing text by saying non-path switcher-side contributions are skipped while control contributions remain unchanged. Added tests cover gates, analytical telescope, replicate SE/df propagation, per-path placebos, Verification note: attempted targeted pytest runs, but the environment is missing |
Summary
Lifts the
survey_designgate forby_pathandpaths_of_interestper-path event-study disaggregation, supporting analytical Binder TSL SE and replicate-weight bootstrap variance. Multiplier-bootstrap (n_bootstrap > 0) undersurvey_design + by_path/paths_of_interestremains gated — the survey-aware perturbation pivot for path-restricted IFs is methodologically underived and deferred. The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test added).This is the largest methodology lift in the dCDH
by_pathfollow-up sequence; with this landing, only Wave 5 mechanical extensions remain (heterogeneity,design2,honest_did).Methodology references
chaisemartin_dhaultfoeuille.py::_compute_path_effectsand_compute_path_placebosnow threadobs_survey_info,eligible_groups, andreplicate_n_valid_listso the per-period IFU_pp_l_path(with non-path switcher contributions zeroed at both group and cell levels) is cohort-recentered via_cohort_recenter_per_periodand routed through_survey_se_from_group_if(cell-period allocator + Binder TSL / replicate dispatch).U_pp.sum(axis=1) == Uis preserved trivially: theswitcher_subset_maskzeroes a row of the per-group IF, which zeroes the corresponding row of the per-cell IF._refresh_path_inferencehelper refreshessafe_inferenceon every populated inference entry post-per-path somulti_horizon_inference,placebo_horizon_inference,path_effects, andpath_placebosall share the same finaldf_surveyafter replicate fits appendn_validto the shared accumulator.survey_designremains unweighted (group-cardinality, not population-weight mass) — documented decision.eligible_groupsmatches between by_path and non-by_path, per-path SE equals the global non-by_path survey SE bit-exactly.No R parity — R
did_multiplegt_dyndoes not support survey weighting (verified during PR #401's R-source extraction work). Methodology verification is internal: the telescope invariant for analytical TSL on a single-path panel.REGISTRY.md updated under §
ChaisemartinDHaultfoeuilleNote (Phase 3 by_path ...)with a new "Per-path survey-design SE" sub-paragraph; gated combinations list narrowed toheterogeneity / design2 / honest_did.Validation
TestByPathSurveyDesignAnalytical(~14 tests, ~3 slow) covering:atol=1e-12)df_surveypropagationtrends_linearcumulated SE inheritancepaths_of_interestunobserved-path warning under surveyTestByPathSurveyDesignTelescope: single-path bit-exact telescope under analytical TSL.test_chaisemartin_dhaultfoeuille.py,test_survey_dcdh.py,test_survey_dcdh_replicate_psu.py,test_methodology_chaisemartin_dhaultfoeuille.py. By_path R-parity (7 tests) still passes.Test plan
Security / privacy
Confirm no secrets/PII in this PR: Yes