Compose by_path / paths_of_interest with survey_design (Wave 4 #10) by igerber · Pull Request #408 · igerber/diff-diff

igerber · 2026-05-09T22:02:56Z

Summary

Lifts the survey_design gate for by_path and paths_of_interest per-path event-study disaggregation, supporting analytical Binder TSL SE and replicate-weight bootstrap variance. Multiplier-bootstrap (n_bootstrap > 0) under survey_design + by_path/paths_of_interest remains gated — the survey-aware perturbation pivot for path-restricted IFs is methodologically underived and deferred. The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test added).
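
The gate scope described above can be sketched as a small validation function. This is only an illustration: the function name `check_survey_path_gate`, the use of `None` to mean "selector off", and the choice of `NotImplementedError` are assumptions, not the library's actual code. Only the parameter names come from this PR.

```python
def check_survey_path_gate(survey_design, by_path, paths_of_interest, n_bootstrap):
    """Hypothetical gate: reject only survey + per-path + multiplier bootstrap."""
    # Assumption: a selector is "active" whenever it is not None.
    per_path = by_path is not None or paths_of_interest is not None
    if survey_design is not None and per_path and n_bootstrap > 0:
        raise NotImplementedError(
            "multiplier bootstrap under survey_design with by_path / "
            "paths_of_interest is gated in this sketch"
        )
    # Every other combination passes, including the global
    # (non-by_path) survey + multiplier-bootstrap combination.
    return True


def is_gated(**kwargs):
    """Convenience wrapper: True when the hypothetical gate raises."""
    try:
        check_survey_path_gate(**kwargs)
    except NotImplementedError:
        return True
    return False
```

In this sketch, analytical and replicate-weight fits (`n_bootstrap == 0`) pass for both selectors, and the global survey bootstrap passes as well, mirroring the anti-regression test mentioned in the summary.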

This is the largest methodology lift in the dCDH by_path follow-up sequence; with this landing, only Wave 5 mechanical extensions remain (heterogeneity, design2, honest_did).

Methodology references

Per-path SE composition: chaisemartin_dhaultfoeuille.py::_compute_path_effects and _compute_path_placebos now thread obs_survey_info, eligible_groups, and replicate_n_valid_list so the per-period IF U_pp_l_path (with non-path switcher contributions zeroed at both group and cell levels) is cohort-recentered via _cohort_recenter_per_period and routed through _survey_se_from_group_if (cell-period allocator + Binder TSL / replicate dispatch).
Row-sum identity U_pp.sum(axis=1) == U is preserved trivially: the switcher_subset_mask zeroes a row of the per-group IF, which zeroes the corresponding row of the per-cell IF.
New _refresh_path_inference helper refreshes safe_inference on every populated inference entry post-per-path so multi_horizon_inference, placebo_horizon_inference, path_effects, and path_placebos all share the same final df_survey after replicate fits append n_valid to the shared accumulator.
Path-enumeration ranking under survey_design remains unweighted (group-cardinality, not population-weight mass) — documented decision.
Lonely-PSU policy stays sample-wide, not per-path — documented.
Telescope invariant: on a single-path panel where every switcher follows the same trajectory and eligible_groups matches between by_path and non-by_path, per-path SE equals the global non-by_path survey SE bit-exactly.
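
The row-sum identity can be checked on a toy example. This is a minimal sketch using plain Python lists; the helper names and the data are hypothetical, and it treats a masked row as entirely zero, which simplifies the real per-path IF construction.

```python
def zero_non_path_rows(U_pp, on_path):
    """Zero every per-cell row whose group is not on the selected path."""
    return [row if keep else [0.0] * len(row) for row, keep in zip(U_pp, on_path)]


def row_sums(U_pp):
    """Collapse a per-cell IF (groups x periods) to a per-group IF."""
    return [sum(row) for row in U_pp]


# Toy per-cell IF: 3 groups x 3 periods (values chosen to be exact in binary).
U_pp = [
    [0.5, -0.25, 0.25],
    [1.0, 0.0, -0.5],
    [-0.75, 0.25, 0.0],
]
U = row_sums(U_pp)

# Hypothetical mask: the middle group does not follow the path of interest.
on_path = [True, False, True]

U_pp_path = zero_non_path_rows(U_pp, on_path)
U_path = [u if keep else 0.0 for u, keep in zip(U, on_path)]

# Masking whole rows at both levels keeps the two representations consistent.
identity_holds = row_sums(U_pp_path) == U_path
```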

No R parity — R did_multiplegt_dyn does not support survey weighting (verified during PR #401's R-source extraction work). Methodology verification is internal: the telescope invariant for analytical TSL on a single-path panel.

REGISTRY.md updated under §ChaisemartinDHaultfoeuille Note (Phase 3 by_path ...) with a new "Per-path survey-design SE" sub-paragraph; gated combinations list narrowed to heterogeneity / design2 / honest_did.

Validation

New TestByPathSurveyDesignAnalytical (~14 tests, ~3 slow) covering:
- Gate dispatch (lift passes; multiplier-bootstrap still gated under both selectors)
- Anti-regression for global TSL + n_bootstrap (locks per-path-only scope of the new gate)
- Per-path analytical SE finiteness
- Single-path telescope (per-path SE == global SE within atol=1e-12)
- Constant-weight envelope (survey effects match plain plug-in effects)
- Per-path replicate-weight SE under JK1 with 15-20 replicate columns
- df_survey propagation
- Per-path placebo SE under survey
- trends_linear cumulated SE inheritance
- paths_of_interest unobserved-path warning under survey
New TestByPathSurveyDesignTelescope: single-path bit-exact telescope under analytical TSL.
Full regression: 388 tests pass across test_chaisemartin_dhaultfoeuille.py, test_survey_dcdh.py, test_survey_dcdh_replicate_psu.py, test_methodology_chaisemartin_dhaultfoeuille.py. By_path R-parity (7 tests) still passes.

Test plan

New test classes pass (11 non-slow + 3 slow = 14/14)
By_path regression suite passes (72 tests)
Survey suite passes (63 tests)
Replicate ATT Class A regression passes
By_path R-parity tests pass (7 tests)
CI ready-for-ci checks pass after label

Security / privacy

Confirm no secrets/PII in this PR: Yes

github-actions · 2026-05-09T22:11:39Z

PR Review

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Affected method: ChaisemartinDHaultfoeuille dCDH by_path / paths_of_interest event-study and placebo inference under survey_design.
The methodology extension is documented in docs/methodology/REGISTRY.md:L641 under the Phase 3 by_path Note, including the survey IF allocator, unweighted path ranking, sample-wide lonely-PSU policy, and gated multiplier bootstrap.
The implementation mirrors event-study and placebo path surfaces: both route per-path survey SE through _survey_se_from_group_if with cell-period IFs and safe_inference.
No new inline inference or partial NaN-guard anti-pattern was found in the modified dCDH paths.
One minor test gap remains: a test named as an SE invariant only checks point estimates.

Methodology

Finding 1 — P3 Informational: documented Python-only survey extension

Impact: The PR intentionally extends by_path / paths_of_interest beyond R parity for survey designs. The REGISTRY explicitly documents the extension, the lack of R parity, the unweighted top-k ranking, the sample-wide lonely-PSU policy, and the multiplier-bootstrap gate. This is not a defect under the review rules. See docs/methodology/REGISTRY.md:L641, diff_diff/chaisemartin_dhaultfoeuille.py:L1245-L1256, diff_diff/chaisemartin_dhaultfoeuille.py:L6001-L6024, and diff_diff/chaisemartin_dhaultfoeuille.py:L6339-L6358.

Concrete fix: None required.

Code Quality

No findings. The new per-path event and placebo code paths are structurally mirrored, and inference fields use safe_inference.

Performance

No findings. The added per-path survey variance work is only entered when by_path / paths_of_interest and survey_design are active.

Maintainability

No findings. Sibling-surface audit covered event-study vs placebo, global vs path inference refresh, and by_path vs paths_of_interest dispatch.

Tech Debt

No findings requiring action. The intentionally deferred survey + per-path multiplier-bootstrap derivation is documented in the REGISTRY and gated at runtime.

Security

No findings. No secrets, external I/O, or security-sensitive code paths were introduced in the reviewed diff.

Documentation/Tests

Finding 2 — P2: constant-weight SE invariant test only asserts effects

Impact: test_per_path_se_within_envelope_of_unweighted says “survey SE matches plug-in SE,” but it only asserts equality of effect fields, not se fields. That leaves the named constant-weight SE invariant unpinned for multi-path survey fits. Location: tests/test_chaisemartin_dhaultfoeuille.py:L9374-L9407.

Concrete fix: Add an assertion comparing res_survey.path_effects[path]["horizons"][l_h]["se"] to res_plain.path_effects[path]["horizons"][l_h]["se"] when both are finite, or rename the test/docstring to make it effect-only.
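
The suggested assertion could look roughly like the sketch below. The nested `path_effects[path]["horizons"][l_h]["se"]` layout follows the review comment; the helper name, the tuple path key, the tolerance, and the toy values are assumptions for illustration.

```python
import math


def assert_se_within_envelope(survey_paths, plain_paths, rtol=0.10):
    """Compare per-path SEs entry by entry, skipping non-finite values.

    Returns the number of comparisons made so a caller can also assert
    that the check was not vacuous.
    """
    checked = 0
    for path, entry in survey_paths.items():
        for l_h, horizon in entry["horizons"].items():
            se_survey = horizon["se"]
            se_plain = plain_paths[path]["horizons"][l_h]["se"]
            if math.isfinite(se_survey) and math.isfinite(se_plain):
                assert math.isclose(se_survey, se_plain, rel_tol=rtol), (
                    path, l_h, se_survey, se_plain
                )
                checked += 1
    return checked


# Toy stand-ins for res_survey.path_effects and res_plain.path_effects.
toy_survey = {(0, 1, 1): {"horizons": {1: {"se": 0.1008}, 2: {"se": float("nan")}}}}
toy_plain = {(0, 1, 1): {"horizons": {1: {"se": 0.1000}, 2: {"se": 0.2}}}}
n_checked = assert_se_within_envelope(toy_survey, toy_plain)
```

Returning the comparison count lets the test fail when every entry is non-finite, which would otherwise make the assertion pass without checking anything.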

Audit Notes

Completed the requested audits: sibling surfaces, pattern-wide grep for inline/unsafe inference patterns, reciprocal by_path / paths_of_interest gate checks, and documentation/registry cross-check. No workflow path changes were in scope.

R0 P2: test_per_path_se_within_envelope_of_unweighted claimed an SE invariant in its name but only asserted point-estimate equality. Add finite-SE rtol=0.10 envelope assertion alongside the existing effect equality, matching the test's documented contract. Under unit weights + single stratum + PSU=group, Binder TSL contributes a Bessel n/(n-1) factor relative to plug-in SE's plain 1/n divisor, so SEs differ by O(1/n) but track within a few percent on cohort-clean panels. Empirical observation on the test fixture: max rtol ~0.84% (well under the 10% envelope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
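
The Bessel-factor argument in this commit message can be illustrated numerically. The sketch below is a stylized comparison, assuming a recentred influence-function vector `u` with one value per group and two simplified variance formulas; the function names are hypothetical and the library's estimators are more involved.

```python
import math


def plug_in_se(u):
    """SE with the plain 1/n divisor: sqrt((1/n) * sum(u_i^2)) / sqrt(n)."""
    n = len(u)
    return math.sqrt(sum(x * x for x in u) / n) / math.sqrt(n)


def tsl_se_single_stratum(u):
    """SE with the 1/(n-1) divisor, i.e. a Bessel factor n/(n-1) on the variance."""
    n = len(u)
    mean = sum(u) / n
    ss = sum((x - mean) ** 2 for x in u)
    return math.sqrt(ss / (n - 1)) / math.sqrt(n)


# Toy IF vector, symmetric around zero (unit weights, PSU = group).
n = 50
u = [(i - (n - 1) / 2) / n for i in range(n)]

# For a centred u the ratio is sqrt(n / (n - 1)), roughly 1 + 1/(2n).
ratio = tsl_se_single_stratum(u) / plug_in_se(u)
```

Because the gap shrinks like 1/(2n), a 10% relative envelope is loose for any panel with more than a handful of groups.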

igerber · 2026-05-09T22:20:06Z

/ai-review

github-actions · 2026-05-09T22:25:15Z

🔁 AI review rerun (requested by @igerber)

Head SHA: fd28256067a76e492668cff87e7b3458a17b4232

PR Review

Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

Affected method: ChaisemartinDHaultfoeuille per-path by_path / paths_of_interest event-study and placebo inference under survey_design.
The methodology lift is documented in docs/methodology/REGISTRY.md:L641, including the Python-only survey extension and multiplier-bootstrap gate.
Prior P2 test gap is addressed: test_per_path_se_within_envelope_of_unweighted now asserts SEs, not only effects.
New issue: replicate-weight final df refresh is not mirrored onto path_effects / path_placebo_event_study after later IF sites append n_valid.
Static audits found no new inline effect / se inference anti-pattern in the touched dCDH paths.
I could not run tests in this environment because pytest and numpy are not installed.

Methodology

Finding 1 — P1 [Newly identified]: final replicate df is not applied to per-path inference

Impact: Under replicate-weight survey_design + by_path/paths_of_interest, _refresh_path_inference() runs immediately after per-path effects/placebos are computed, using the then-current _replicate_n_valid_list. But later in fit(), overall / joiners / leavers append additional replicate n_valid values, and the final refresh block recomputes inference for overall, event-study, placebo, heterogeneity, and normalized surfaces only. It does not refresh path_effects or path_placebo_event_study, even though survey_metadata.df_survey is then persisted from the final reduced df. If any later IF site lowers the effective df, per-path p_value / conf_int can silently use a larger df than the result metadata and sibling surfaces. Locations: diff_diff/chaisemartin_dhaultfoeuille.py:L2554-L2577, diff_diff/chaisemartin_dhaultfoeuille.py:L2649-L2728, diff_diff/chaisemartin_dhaultfoeuille.py:L3939-L4048, diff_diff/chaisemartin_dhaultfoeuille.py:L4137-L4138.

Concrete fix: In the final replicate-df refresh block, also refresh path_effects and path_placebo_event_study with _final_inf_df before constructing results. Reuse _refresh_path_inference() there, or move the helper call to after all _replicate_n_valid_list appends. Add a regression that forces a lower final n_valid after per-path inference and asserts every per-path populated entry equals safe_inference(effect, se, df=results.survey_metadata.df_survey).

Code Quality

No additional findings. The event-study and placebo per-path survey code is structurally mirrored and uses safe_inference().

Performance

No findings. The new cell-period per-path survey variance work is gated to survey + per-path selectors.

Maintainability

The sibling-surface mirror audit found the P1 above: final replicate-df refresh mirrors global event/placebo surfaces but not the per-path surfaces introduced by this PR.

Tech Debt

No TODO-tracked item mitigates the P1 above. Existing TODO coverage for true replicate-weight test fixtures is unrelated to final df propagation.

Security

No findings. The diff does not introduce secrets, external I/O, or security-sensitive behavior.

Documentation/Tests

The previous P2 is resolved at tests/test_chaisemartin_dhaultfoeuille.py:L9380-L9430. Add the final-df per-path regression described in the Methodology finding.

Path To Approval

Refresh path_effects and path_placebo_event_study in the final replicate-df block using _final_inf_df.
Add a replicate-weight by-path regression that proves per-path t_stat, p_value, and conf_int are computed with results.survey_metadata.df_survey after all replicate n_valid appends.

R1 P1: per-path event-study and placebo helpers snapshot df_inference BEFORE appending their own n_valid contributions to the shared _replicate_n_valid_list. The early _refresh_path_inference call then ran immediately after per-path runs but BEFORE later IF sites (overall / joiners / leavers / heterogeneity) appended their own n_valid values. If a later append further reduced the effective df, per-path t_stat / p_value / conf_int silently used a larger df than the global surfaces and survey_metadata.df_survey, which the existing final R2 P1b refresh block already updates with the FINAL df. Fix: relocate _refresh_path_inference to the final R2 P1b block at the bottom of fit() so per-path entries are refreshed alongside the global event-study / placebo / heterogeneity / overall / joiners / leavers / normalized surfaces with the final _final_inf_df. Narrow the helper to per-path only (the existing block handles globals inline). Add a dedicated regression test asserting that every populated per-path entry's t_stat / p_value / conf_int reproduces safe_inference(effect, se, df=results.survey_metadata.df_survey) under a JK1 replicate-weight design with placebo enabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
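
The ordering problem behind this fix can be reproduced with a toy accumulator. The sketch assumes the effective replicate df is `min(n_valid) - 1` over every IF site that has run; that rule, the helper name, and the numbers are assumptions for illustration, not the library's `_effective_df_survey`.

```python
def effective_df(n_valid_list):
    """Assumed rule: df = min(n_valid) - 1 over the sites recorded so far."""
    return min(n_valid_list) - 1 if n_valid_list else None


n_valid_list = []

# Per-path sites run first and append their own n_valid.
n_valid_list.append(12)

# Early refresh (the old call site): sees only the per-path appends.
df_snapshot = effective_df(n_valid_list)

# Later IF sites (overall / joiners / leavers) append afterwards;
# one of them loses replicate columns.
n_valid_list.append(9)
n_valid_list.append(12)
n_valid_list.append(12)

# Final refresh (the relocated call site): sees every append.
df_final = effective_df(n_valid_list)

# A per-path entry refreshed early would keep the larger, stale df.
stale = df_snapshot != df_final
```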

igerber · 2026-05-09T22:31:57Z

/ai-review

github-actions · 2026-05-09T22:36:26Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 905e8871646478edf38589a4671f1ef8141d1de9

PR Review

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Affected method: ChaisemartinDHaultfoeuille per-path by_path / paths_of_interest event-study and placebo inference under survey_design.
Prior P1 is resolved: per-path event-study and placebo inference now refresh with final replicate df_survey before results construction.
The survey per-path methodology is documented in docs/methodology/REGISTRY.md:L641 and composes with the existing survey IF expansion note.
Pattern-wide grep found no new inline effect / se inference computation in the changed dCDH paths; safe_inference() is used.
I could not run tests because pytest, numpy, and pandas are not installed in this environment.

Methodology

No P0/P1 findings.

The prior stale-df issue is addressed by _refresh_path_inference() and the final replicate-df block, which now refreshes path_effects and path_placebo_event_study with _final_inf_df. Relevant locations: diff_diff/chaisemartin_dhaultfoeuille.py:L3923-L4035, diff_diff/chaisemartin_dhaultfoeuille.py:L7681-L7725.

Code Quality

No findings.

Sibling-surface audit: event-study, placebo, by_path, and paths_of_interest surfaces are mirrored. The multiplier-bootstrap gate applies reciprocally to both selectors at diff_diff/chaisemartin_dhaultfoeuille.py:L1245-L1256.

Performance

No findings.

The new per-path survey work is gated to survey + per-path selectors, and the existing cell/group allocator dispatch is reused.

Maintainability

No findings.

The helper isolates the final inference refresh cleanly, and the affected call sites thread obs_survey_info, eligible_groups, and replicate_n_valid_list consistently for path effects and path placebos.

Tech Debt

No blocker findings.

No TODO entry is needed for the resolved prior P1. Existing TODO entries do not appear to mask any new correctness issue in this diff.

Security

No findings.

Secret-pattern grep over the changed files found no obvious tokens, keys, or private material.

Documentation/Tests

Finding 1 — P2: replicate-df regression does not force a stale-vs-final df difference

Impact: The new regression checks that per-path inference equals safe_inference(..., df=results.survey_metadata.df_survey), but its random replicate weights do not appear to force a later IF site to reduce n_valid after per-path inference. If every site has all 12 replicates valid, the stale snapshot df and final df are identical, so the test would not fail on the old bug. Location: tests/test_chaisemartin_dhaultfoeuille.py:L9470-L9553.

Concrete fix: Strengthen the test fixture so a later overall/joiner/leaver or placebo IF site has fewer valid replicate columns than the per-path site, then assert per-path t_stat, p_value, and conf_int use that lower final df_survey.

Audit Notes

Pattern-wide grep for inline inference anti-patterns completed.
Sibling-surface mirror audit completed for global event-study, global placebo, per-path event-study, per-path placebo, and cumulated surfaces.
Reciprocal selector check completed for by_path and paths_of_interest.
Transitive workflow dependency audit: not applicable; no workflow/path-selection changes in this diff.

R2 P2: the previous regression test asserted per-path inference matches `safe_inference(..., df=results.survey_metadata.df_survey)`, but under uniform-valid replicate fixtures every IF site reports the same `n_valid` so the snapshot df and final df happen to coincide and the assertion passes vacuously even when the bug is present (per-path uses a stale snapshot df that incidentally equals the final df). Add `test_refresh_path_inference_called_from_final_block`: wraps the helper with `mock.patch.object` to capture call_args, asserts (a) helper is invoked exactly once, (b) the `df_final` it received equals `results.survey_metadata.df_survey` — a relationship that holds by construction when invoked from the final R2 P1b block (which uses `_final_eff_df = _effective_df_survey(resolved_survey, _replicate_n_valid_list)` AFTER all appends), but can only coincide by chance when invoked from an earlier block on a fixture where snapshot equals final. Update the existing test's docstring to acknowledge it documents the contract on a uniform-valid fixture and points readers at the new mock-based test for direct call-site verification. Use `importlib.import_module` to access the dCDH module: the top-level `diff_diff` package re-exports the convenience function `chaisemartin_dhaultfoeuille`, shadowing the module of the same name in attribute lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
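
The call-site check described in this commit can be sketched with `unittest.mock` on a toy class. `ToyEstimator` and its `fit` body are hypothetical stand-ins; only the helper name `_refresh_path_inference` and the idea of inspecting `call_args` come from the commit message.

```python
from unittest import mock


class ToyEstimator:
    """Hypothetical stand-in for the estimator; not the library class."""

    def __init__(self):
        self.n_valid_list = []

    def _refresh_path_inference(self, path_entries, df_final):
        # Overwrite the df used for inference on every per-path entry.
        for entry in path_entries:
            entry["df"] = df_final

    def fit(self):
        path_entries = [{"effect": 1.0, "se": 0.5, "df": None}]
        self.n_valid_list.append(12)  # per-path site
        self.n_valid_list.append(9)   # later global site
        df_final = min(self.n_valid_list) - 1  # assumed df rule
        # Final block: refresh per-path entries after all appends.
        self._refresh_path_inference(path_entries, df_final)
        return {"path_effects": path_entries, "df_survey": df_final}


est = ToyEstimator()
# wraps= keeps the real behaviour while recording how the helper was called.
with mock.patch.object(
    est, "_refresh_path_inference", wraps=est._refresh_path_inference
) as spy:
    results = est.fit()

call_count = spy.call_count
positional_args, _ = spy.call_args
df_passed_to_helper = positional_args[1]
```

Because the spy is patched onto the instance, the recorded positional arguments exclude `self`, so index 1 is `df_final`.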

igerber · 2026-05-09T22:47:12Z

/ai-review

github-actions · 2026-05-09T22:52:08Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e304c100cc508a8d7ddd57770503d7d07bc991cc

PR Review

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Affected method: ChaisemartinDHaultfoeuille per-path by_path / paths_of_interest event-study and placebo inference under survey_design.
Methodology is documented in docs/methodology/REGISTRY.md:L641; the survey extension is Python-only, with unweighted path ranking and sample-wide lonely-PSU policy explicitly documented.
Event-study and placebo sibling surfaces are mirrored: both thread obs_survey_info, eligible_groups, and replicate_n_valid_list.
The survey + per-path multiplier-bootstrap gate is reciprocal for by_path and paths_of_interest at diff_diff/chaisemartin_dhaultfoeuille.py:L1245-L1256.
Pattern-wide inference grep found no new inline effect / se anti-pattern in the changed dCDH path; new inference uses safe_inference().
I could not run tests because numpy, pandas, pytest, and scipy are not installed in this environment.

Methodology

Finding 1 — P3: documented Python-only survey extension

Impact: The PR changes dCDH per-path SE behavior under survey_design, but the deviation is documented in the registry: per-path IFs are routed through the survey allocator, path ranking remains group-count based, lonely-PSU policy stays sample-wide, and survey + per-path multiplier bootstrap remains gated. This is not a defect.

Concrete fix: No action required. Relevant implementation: diff_diff/chaisemartin_dhaultfoeuille.py:L5998-L6021, diff_diff/chaisemartin_dhaultfoeuille.py:L6336-L6355, diff_diff/chaisemartin_dhaultfoeuille.py:L7681-L7725.

Code Quality

No findings.

Sibling-surface audit passed for event-study vs placebo, by_path vs paths_of_interest, and analytical vs replicate survey dispatch. Reciprocal selector coverage is present at the gate and both per-path helper call sites.

Performance

No findings.

The extra per-period IF work is only requested when obs_survey_info is present, and the survey allocator already contains a group-level fallback for within-group-constant PSU designs.

Maintainability

No findings.

The _refresh_path_inference() helper keeps the final replicate-df refresh localized, and the per-path event-study/placebo helper signatures remain symmetric.

Tech Debt

No blocker findings.

No relevant unmitigated correctness debt was found in TODO.md; documented deviations in REGISTRY.md are treated as informational.

Security

No findings.

Secret-pattern grep over changed files found no credentials or private-key material. The only hit was changelog prose about secret scanning.

Documentation/Tests

Finding 1 — P2: replicate-df regression still does not force a stale-vs-final df difference

Impact: The added mock-style regression improves coverage by checking _refresh_path_inference() is called with results.survey_metadata.df_survey, but it still relies on random replicate weights where all IF sites can have identical n_valid. If an implementation refreshed per-path inference from an earlier call site and the later global IF sites did not reduce n_valid, the assertion can still pass. This is a test robustness gap, not a current correctness blocker.

Concrete fix: Add a deterministic fixture or monkeypatch that forces a later overall/joiner/leaver/placebo IF site to return a smaller n_valid than the per-path site, then assert per-path t_stat, p_value, and conf_int change to the lower final df_survey. Function: tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical::test_refresh_path_inference_called_from_final_block.

Audit Notes

Pattern-wide grep for inline inference anti-patterns completed.
Sibling-surface mirror audit completed for event-study, placebo, per-path effects, per-path placebos, and final df refresh surfaces.
Reciprocal/symmetry check completed for by_path and paths_of_interest.
Transitive workflow dependency audit: not applicable; no workflow/path-selection changes in this diff.

R3 P2: the previous mock-based regression checks the helper is called with the final df, but if no later IF site reduces n_valid relative to per-path snapshots, snapshot df coincides with final df and the assertion is vacuous. Add `test_per_path_inference_refreshes_to_lower_final_df`: deterministic forcing function via `mock.patch.object` on `_compute_se`, gated by a flag that flips after `_compute_path_effects` returns. After the flag is set, every subsequent `_compute_se` call returns a hardcoded low `n_valid=5`, so global placebo / overall / joiners / leavers all append 5, while per-path effects already snapshotted a high df from the unmodified pre-flag calls. Final `survey_metadata.df_survey = 5 - 1 = 4` is strictly less than the per-path snapshot df, forcing the refresh to demonstrably move per-path inference from the high snapshot df to the low final df. Sanity-checked: temporarily replacing `_refresh_path_inference` with a no-op causes the new test to fail with a stale-p_value assertion, confirming bug-detection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
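
The flag-gated forcing function can be sketched on a toy estimator. Everything here is hypothetical apart from the method names taken from the commit message: the class body, the `(se, n_valid)` return shape of `_compute_se`, and the `min(n_valid) - 1` df rule are assumptions made so the example runs on its own.

```python
from unittest import mock


class ToyEstimator:
    """Hypothetical stand-in; not the library's estimator."""

    def _compute_se(self, label):
        return 0.5, 12  # (se, n_valid): all 12 replicates valid

    def _compute_path_effects(self, n_valid_list):
        se, n_valid = self._compute_se("path")
        n_valid_list.append(n_valid)
        # Per-path entries snapshot the df available at this point.
        return [{"se": se, "df": min(n_valid_list) - 1}]

    def _refresh_path_inference(self, entries, df_final):
        for entry in entries:
            entry["df"] = df_final

    def fit(self):
        n_valid_list = []
        entries = self._compute_path_effects(n_valid_list)
        for label in ("overall", "joiners", "leavers"):
            _, n_valid = self._compute_se(label)
            n_valid_list.append(n_valid)
        df_final = min(n_valid_list) - 1
        self._refresh_path_inference(entries, df_final)
        return {"path_effects": entries, "df_survey": df_final}


def fit_with_forced_low_n_valid(est, low_n_valid=5):
    """Force every _compute_se call after the per-path step to report low n_valid."""
    state = {"after_paths": False}
    real_se = est._compute_se
    real_paths = est._compute_path_effects

    def fake_se(label):
        se, n_valid = real_se(label)
        return (se, low_n_valid) if state["after_paths"] else (se, n_valid)

    def fake_paths(n_valid_list):
        out = real_paths(n_valid_list)
        state["after_paths"] = True  # flip once per-path effects are done
        return out

    with mock.patch.object(est, "_compute_se", side_effect=fake_se):
        with mock.patch.object(est, "_compute_path_effects", side_effect=fake_paths):
            return est.fit()


forced = fit_with_forced_low_n_valid(ToyEstimator())
```

In this toy run the per-path snapshot df is 11 and the forced final df is 4, so a missing refresh would leave the per-path entries visibly out of step with `df_survey`.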

igerber · 2026-05-09T22:55:40Z

/ai-review

github-actions · 2026-05-09T23:06:45Z

🔁 AI review rerun (requested by @igerber)

Head SHA: faa97bc7c3d3bf6d4032a08a59085a5b5486d67f

PR Review

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Affected method: ChaisemartinDHaultfoeuille per-path by_path / paths_of_interest event-study and placebo inference under survey_design.
The methodology lift is documented in docs/methodology/REGISTRY.md:L641, including Python-only survey support, unweighted path ranking, sample-wide lonely-PSU policy, replicate df propagation, and the gated survey multiplier-bootstrap combination.
Event-study and placebo sibling surfaces are mirrored: both thread survey IF inputs, use the cell-period allocator, append replicate n_valid, and refresh inference with final df.
The by_path / paths_of_interest survey + n_bootstrap > 0 gate is reciprocal and does not block global non-by-path survey bootstrap: diff_diff/chaisemartin_dhaultfoeuille.py:L1245-L1256.
I could not run tests locally because pytest is not installed and numpy is missing.

Methodology

Finding 1 — P3: documented Python-only per-path survey extension

Severity: P3 informational
Impact: The PR changes dCDH per-path survey variance behavior, but the extension and deviations are explicitly documented in the registry. The implementation routes per-path event-study and placebo IFs through _survey_se_from_group_if, keeps top-k path ranking unweighted, keeps lonely-PSU policy sample-wide, and gates the underived survey multiplier-bootstrap path. This is not a defect. Relevant code: diff_diff/chaisemartin_dhaultfoeuille.py:L5956-L6021, diff_diff/chaisemartin_dhaultfoeuille.py:L6294-L6355, diff_diff/chaisemartin_dhaultfoeuille.py:L7681-L7725.
Concrete fix: No action required.

Code Quality

No findings.

The sibling-surface mirror audit passed for path_effects vs path_placebos, by_path vs paths_of_interest, analytical TSL vs replicate-weight variance, and final replicate-df refresh surfaces.

Performance

No findings.

The extra per-period IF construction is gated on survey usage, and _survey_se_from_group_if preserves the group-level allocator fallback for within-group-constant PSU analytical TSL designs.

Maintainability

No findings.

The final-df refresh is localized in _refresh_path_inference() and called from the final replicate-df block at diff_diff/chaisemartin_dhaultfoeuille.py:L3913-L4035.

Tech Debt

No blocker findings.

The broader survey cell-period allocator derivation limitation is already tracked in TODO.md:L51-L60 and documented in REGISTRY.md, so it is not an unmitigated blocker.

Security

No findings.

Secret-pattern grep over the changed files found no credentials or private-key material; the only hit was changelog prose about secret scanning.

Documentation/Tests

Finding 1 — P3: minor wording ambiguity around “zeroed” non-path rows

Severity: P3 informational
Impact: REGISTRY.md:L641 and the constructor docstring around diff_diff/chaisemartin_dhaultfoeuille.py:L461-L464 say non-path switcher contributions are zeroed at group and cell levels. The implementation and earlier registry wording are more precise: non-path groups are skipped as switchers but may still receive control-pool IF mass, preserving the established joiners/leavers convention (diff_diff/chaisemartin_dhaultfoeuille.py:L5582-L5589, diff_diff/chaisemartin_dhaultfoeuille.py:L5671-L5678).
Concrete fix: Optional wording cleanup: say “non-path switcher-side contributions are skipped; control contributions remain unchanged” instead of implying the whole row is zero.

Test execution was attempted but blocked: pytest was not found, and importing numpy failed. The supplied diff adds focused survey/by-path tests, including replicate df propagation coverage under TestByPathSurveyDesignAnalytical.

Audit Notes

Pattern-wide inference grep completed; no new inline effect / se anti-pattern in the changed dCDH path.
Reciprocal/symmetry check completed for by_path and paths_of_interest.
Empty-result and NaN/Inf propagation paths were checked; new inference uses safe_inference().
Transitive workflow dependency audit: not applicable, no workflow/path-selection changes in this diff.

Lift the gate at chaisemartin_dhaultfoeuille.py:1233-1239 so per-path event-study effects compose with survey_design under analytical Binder TSL SE and replicate-weight bootstrap variance. Multiplier bootstrap (n_bootstrap > 0) under survey + by_path remains gated; the survey-aware perturbation pivot for path-restricted IFs is methodologically underived and deferred to a future wave.

Per-path SE routes through the existing _survey_se_from_group_if cell-period allocator. The per-period IF (U_pp_l_path) with non-path switcher contributions zeroed at both group and cell levels (the row-sum identity U_pp.sum(axis=1) == U is preserved trivially under group-level zeroing) is cohort-recentered via _cohort_recenter_per_period, then expanded to observations as psi_i = U_pp[g_i, t_i] * (w_i / W_{g_i, t_i}). Replicate-weight designs unconditionally use the cell allocator (Class A contract, PR #323).

New _refresh_path_inference helper post-call refreshes safe_inference on every populated entry across multi_horizon_inference, placebo_horizon_inference, path_effects, and path_placebos so all four surfaces reflect the same final df_survey after per-path replicate fits append n_valid to the shared accumulator.

Path-enumeration ranking under survey_design remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide. Telescope invariant holds bit-exactly: on a single-path panel, per-path SE matches the global non-by_path survey SE.

No R parity: R did_multiplegt_dyn does not support survey weighting; this is a Python-only methodology extension.

14 new tests across two test classes:
- TestByPathSurveyDesignAnalytical: gate dispatch, anti-regression on global TSL+bootstrap (locks per-path-only gate scope), per-path analytical SE, single-path telescope, replicate-weight SE, df_survey propagation, per-path placebos, trends_linear cumulated SE inheritance, unobserved-path warnings under survey.
- TestByPathSurveyDesignTelescope: single-path telescoping invariant for analytical TSL.

Documentation: REGISTRY.md "Per-path survey-design SE" sub-paragraph; by_path / paths_of_interest docstrings updated; CHANGELOG entry; docs/api/chaisemartin_dhaultfoeuille.rst and llms-full.txt updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
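
The observation-level expansion psi_i = U_pp[g_i, t_i] * (w_i / W_{g_i, t_i}) quoted in this commit message can be sketched in a few lines. The function name, the `(group, period, weight)` tuple layout, and the toy data are assumptions; only the formula comes from the message.

```python
from collections import defaultdict


def expand_cell_if_to_obs(obs, U_pp):
    """Spread each cell's IF mass over its observations by weight share.

    obs  : list of (g, t, w) tuples, one per observation
    U_pp : nested list indexed as U_pp[g][t]
    """
    # W[(g, t)] is the total survey weight in cell (g, t).
    W = defaultdict(float)
    for g, t, w in obs:
        W[(g, t)] += w
    return [U_pp[g][t] * (w / W[(g, t)]) for g, t, w in obs]


# Toy data: two groups, two periods, unequal weights inside cell (0, 0).
toy_obs = [(0, 0, 1.0), (0, 0, 3.0), (0, 1, 2.0), (1, 0, 4.0), (1, 1, 4.0)]
toy_U_pp = [[2.0, -1.0], [0.5, 0.25]]
psi = expand_cell_if_to_obs(toy_obs, toy_U_pp)

# Within each cell the psi_i sum back to the cell's IF value.
cell_00_total = psi[0] + psi[1]
```

Summing psi_i within a cell recovers U_pp[g, t], so the expansion redistributes IF mass without changing cell totals.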

R0 P2: test_per_path_se_within_envelope_of_unweighted claimed an SE invariant in its name but only asserted point-estimate equality. Add finite-SE rtol=0.10 envelope assertion alongside the existing effect equality, matching the test's documented contract. Under unit weights + single stratum + PSU=group, Binder TSL contributes a Bessel n/(n-1) factor relative to plug-in SE's plain 1/n divisor, so SEs differ by O(1/n) but track within a few percent on cohort-clean panels. Empirical observation on the test fixture: max rtol ~0.84% (well under the 10% envelope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R1 P1: per-path event-study and placebo helpers snapshot df_inference BEFORE appending their own n_valid contributions to the shared _replicate_n_valid_list. The early _refresh_path_inference call then ran immediately after per-path runs but BEFORE later IF sites (overall / joiners / leavers / heterogeneity) appended their own n_valid values. If a later append further reduced the effective df, per-path t_stat / p_value / conf_int silently used a larger df than the global surfaces and survey_metadata.df_survey, which the existing final R2 P1b refresh block already updates with the FINAL df. Fix: relocate _refresh_path_inference to the final R2 P1b block at the bottom of fit() so per-path entries are refreshed alongside the global event-study / placebo / heterogeneity / overall / joiners / leavers / normalized surfaces with the final _final_inf_df. Narrow the helper to per-path only (the existing block handles globals inline). Add a dedicated regression test asserting that every populated per-path entry's t_stat / p_value / conf_int reproduces safe_inference(effect, se, df=results.survey_metadata.df_survey) under a JK1 replicate-weight design with placebo enabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R2 P2: the previous regression test asserted per-path inference matches `safe_inference(..., df=results.survey_metadata.df_survey)`, but under uniform-valid replicate fixtures every IF site reports the same `n_valid` so the snapshot df and final df happen to coincide and the assertion passes vacuously even when the bug is present (per-path uses a stale snapshot df that incidentally equals the final df). Add `test_refresh_path_inference_called_from_final_block`: wraps the helper with `mock.patch.object` to capture call_args, asserts (a) helper is invoked exactly once, (b) the `df_final` it received equals `results.survey_metadata.df_survey` — a relationship that holds by construction when invoked from the final R2 P1b block (which uses `_final_eff_df = _effective_df_survey(resolved_survey, _replicate_n_valid_list)` AFTER all appends), but can only coincide by chance when invoked from an earlier block on a fixture where snapshot equals final. Update the existing test's docstring to acknowledge it documents the contract on a uniform-valid fixture and points readers at the new mock-based test for direct call-site verification. Use `importlib.import_module` to access the dCDH module: the top-level `diff_diff` package re-exports the convenience function `chaisemartin_dhaultfoeuille`, shadowing the module of the same name in attribute lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R3 P2: the previous mock-based regression checks the helper is called with the final df, but if no later IF site reduces n_valid relative to per-path snapshots, snapshot df coincides with final df and the assertion is vacuous.

Add `test_per_path_inference_refreshes_to_lower_final_df`: deterministic forcing function via `mock.patch.object` on `_compute_se`, gated by a flag that flips after `_compute_path_effects` returns. After the flag is set, every subsequent `_compute_se` call returns a hardcoded low `n_valid=5`, so global placebo / overall / joiners / leavers all append 5, while per-path effects already snapshotted a high df from the unmodified pre-flag calls. Final `survey_metadata.df_survey = 5 - 1 = 4` is strictly less than the per-path snapshot df, forcing the refresh to demonstrably move per-path inference from the high snapshot df to the low final df.

Sanity-checked: temporarily replacing `_refresh_path_inference` with a no-op causes the new test to fail with a stale-p_value assertion, confirming bug-detection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
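The flag-gated forcing function from R3 P2 can be sketched on a toy class. Everything here is illustrative: `ToyEstimator`, the `(se, n_valid)` return shape of `_compute_se`, and the df rule are assumptions made for the example, not the library's actual signatures.

```python
from unittest import mock

class ToyEstimator:
    """Hypothetical stand-in; not the diff_diff estimator."""

    def _compute_se(self, site):
        return 1.0, 40  # (se, n_valid): every site is fully valid by default

    def _compute_path_effects(self):
        _, n_valid = self._compute_se("path")
        self.n_valid.append(n_valid)
        self.path_df = min(self.n_valid) - 1  # early snapshot

    def _refresh_path_inference(self, df_final):
        self.path_df = df_final

    def fit(self):
        self.n_valid = []
        self._compute_path_effects()
        self.snapshot_df = self.path_df
        for site in ("placebo", "overall", "joiners", "leavers"):
            _, n_valid = self._compute_se(site)
            self.n_valid.append(n_valid)
        self.df_survey = min(self.n_valid) - 1
        self._refresh_path_inference(self.df_survey)
        return self

state = {"force_low": False}
orig_se = ToyEstimator._compute_se
orig_path = ToyEstimator._compute_path_effects

def fake_se(self, site):
    se, n_valid = orig_se(self, site)
    # Once the flag is set, report a hardcoded low n_valid.
    return (se, 5) if state["force_low"] else (se, n_valid)

def fake_path(self):
    orig_path(self)
    state["force_low"] = True  # flip only after the per-path run returns

with mock.patch.object(ToyEstimator, "_compute_se", fake_se), \
        mock.patch.object(ToyEstimator, "_compute_path_effects", fake_path):
    est = ToyEstimator().fit()
```

Because the snapshot df and the final df are forced apart, a refresh that was skipped or called too early would leave `est.path_df` at the snapshot value and the comparison against `est.df_survey` would fail.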

igerber · 2026-05-09T23:14:05Z

/ai-review

github-actions · 2026-05-09T23:18:37Z

🔁 AI review rerun (requested by @igerber)

Head SHA: fb594275a0b3c7ff2e9f8906310adc0cf168a25a

PR Review

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Affected method: ChaisemartinDHaultfoeuille per-path by_path / paths_of_interest event-study and placebo inference under survey_design.
The survey methodology extension is documented in docs/methodology/REGISTRY.md:L643-L677, including unweighted path ranking, sample-wide lonely-PSU policy, replicate df propagation, and the remaining multiplier-bootstrap gate.
Event-study and placebo sibling surfaces are mirrored: both thread obs_survey_info, eligible_groups, and replicate_n_valid_list, use per-period IFs under survey, and route through _survey_se_from_group_if.
The prior stale-df_survey risk is addressed by the final-block _refresh_path_inference() call at diff_diff/chaisemartin_dhaultfoeuille.py:L4024-L4036.
Local tests could not be run because pytest, numpy, and pandas are not installed in this environment.

Methodology

Finding 1 — P3: documented Python-only per-path survey extension

Severity: P3 informational
Impact: The PR lifts survey_design for per-path dCDH analytical Binder TSL and replicate-weight variance. This is a Python-only extension because R did_multiplegt_dyn has no survey weighting surface, and the deviation/contract is documented in the registry. Implementation matches the documented contract: per-path event-study and placebo IFs use switcher-side path filtering while preserving control contributions, cohort-recenter per-period IFs, scale by the path-specific divisor, and route to _survey_se_from_group_if. Relevant code: diff_diff/chaisemartin_dhaultfoeuille.py:L5958-L6022, diff_diff/chaisemartin_dhaultfoeuille.py:L6296-L6356.
Concrete fix: No action required.

Code Quality

No findings.

The sibling-surface mirror audit passed for path_effects vs path_placebos, by_path vs paths_of_interest, and analytical TSL vs replicate-weight variance.

Performance

No findings.

The extra per-period IF allocation is gated on obs_survey_info is not None, and the existing _survey_se_from_group_if fallback preserves the cheaper group allocator for within-group-constant PSU analytical TSL cases.

Maintainability

No findings.

The final inference refresh is localized in _refresh_path_inference() at diff_diff/chaisemartin_dhaultfoeuille.py:L7682-L7725 and called once from the existing final replicate-df block.

Tech Debt

No blocker findings.

The broader survey cell-period allocator derivation limitation remains tracked in TODO.md under “Tech Debt from Code Reviews” and documented in REGISTRY.md, so it is not an unmitigated issue.

Security

No findings.

Secret-pattern grep over the changed files found no credentials; the only hit was changelog prose about secret scanning.

Documentation/Tests

No findings.

The previous wording concern is materially addressed in public-facing text by saying non-path switcher-side contributions are skipped while control contributions remain unchanged. Added tests cover gates, analytical telescope, replicate SE/df propagation, per-path placebos, trends_linear, and unobserved paths_of_interest under survey at tests/test_chaisemartin_dhaultfoeuille.py:L9117-L9925.

Verification note: attempted targeted pytest runs, but the environment is missing pytest; direct imports also show numpy and pandas are unavailable. Pattern-wide inference grep found no new inline effect / se anti-pattern in the changed dCDH path, reciprocal by_path / paths_of_interest gates are covered, and no workflow path-selection changes were present.

igerber and others added 5 commits May 9, 2026 19:13

igerber force-pushed the dcdh-by-path-survey-design branch from faa97bc to fb59427 Compare May 9, 2026 23:13

igerber added the ready-for-ci Triggers CI test workflows label May 9, 2026

igerber merged commit 8bd021d into main May 10, 2026
28 of 29 checks passed

igerber deleted the dcdh-by-path-survey-design branch May 10, 2026 00:47

igerber merged 5 commits into main from dcdh-by-path-survey-design

github-actions Bot commented May 9, 2026

PR Review

Methodology

Finding 1 (P3 informational): documented Python-only survey extension

Documentation/Tests

Finding 2 (P2): constant-weight SE invariant test only asserts effects

github-actions Bot commented May 9, 2026

PR Review

Methodology

Finding 1 (P1, newly identified): final replicate df is not applied to per-path inference

github-actions Bot commented May 9, 2026

PR Review

Documentation/Tests

Finding 1 (P2): replicate-df regression does not force a stale-vs-final df difference

github-actions Bot commented May 9, 2026

PR Review

Methodology

Finding 1 (P3): documented Python-only survey extension

Documentation/Tests

Finding 1 (P2): replicate-df regression still does not force a stale-vs-final df difference
