
Tutorial 21: HAD pre-test workflow (composite QUG + Stute + Yatchew)#409

Merged
igerber merged 12 commits into main from had-tutorial-21-pretest-workflow
May 10, 2026

Conversation

@igerber
Owner

@igerber igerber commented May 10, 2026

Summary

  • New end-to-end practitioner tutorial for did_had_pretest_workflow. It walks the composite pre-test battery through a panel close in shape to T20's brand campaign, surfaces the structural gap on the two-period (aggregate="overall") path (no Step 2 / parallel pre-trends check), and then upgrades to the multi-period (aggregate="event_study") path, which adds the joint pre-trends Stute and joint homogeneity Stute diagnostics.
  • The DGP draws regional spend from Uniform[$0.01K, $50K] (vs. T20's [$5K, $50K]): the true support is strictly positive but sits very near zero, chosen so that QUG fails to reject H0: d_lower = 0 in this finite sample. HAD's design="auto" then selects the continuous_at_zero identification path on the QUG outcome (a workflow decision that follows the test, not a property of the true DGP support; the tutorial prose draws this distinction explicitly).
  • A side panel exercises both yatchew_hr_test null modes: null="linearity" (default, paper Theorem 7) vs null="mean_independence" (PR #397; R-parity with R's YatchewTest::yatchew_test(order=0)), applied to the within-pre-period first difference paired with the post-period dose. It illustrates the stricter null's larger residual variance (sigma2_lin 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49).
  • Tutorial framing throughout treats fail-to-reject pre-tests as non-rejection evidence under finite-sample power and test specification, NOT as proof that identifying assumptions hold.
  • Companion drift-test file (tests/test_t21_had_pretest_workflow_drift.py, 15 tests) pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per feedback_bootstrap_drift_tests_need_backend_tolerance.
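
The dose-support contrast described above can be sketched in plain NumPy (a hypothetical recreation for illustration only; the region count and seed are not the tutorial's actual DGP):

```python
import numpy as np

rng = np.random.default_rng(42)
n_regions = 400

# T21-style draw: support strictly positive but nearly touching zero,
# so the sample minimum dose lands close to 0 in finite samples.
dose_t21 = rng.uniform(0.01, 50.0, size=n_regions)   # $0.01K .. $50K

# T20-style draw: support bounded well away from zero.
dose_t20 = rng.uniform(5.0, 50.0, size=n_regions)    # $5K .. $50K
```

With supports like these, the T21 sample minimum is typically a small fraction of a unit while the T20 minimum cannot fall below 5, which is what makes the near-zero identification question interesting in the T21 case.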
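
The Yatchew-style linearity check mentioned above rests on a differencing idea that can be sketched generically (a minimal sketch of the textbook statistic, not the library's yatchew_hr_test implementation or either of its null modes):

```python
import numpy as np

def yatchew_stat(x, y):
    """Differencing-based linearity test statistic (sketch).

    s2_diff estimates the error variance under any smooth m(x) by
    first-differencing y after sorting on x; s2_lin is the residual
    variance of an OLS line. Under linearity both estimate sigma^2,
    so large positive values of the ratio-based statistic indicate
    lack of fit of the linear model.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    s2_diff = np.sum(np.diff(y) ** 2) / (2 * (n - 1))
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2_lin = resid @ resid / n
    return np.sqrt(n) * (s2_lin / s2_diff - 1.0)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 500)
t_lin = yatchew_stat(x, 2.0 * x + rng.normal(0, 0.1, 500))   # linear truth
t_quad = yatchew_stat(x, x ** 2 + rng.normal(0, 0.1, 500))   # curved truth
```

The statistic stays small when the truth is linear and grows with curvature, which is the intuition behind comparing sigma2_lin against a differencing-based variance as the tutorial's side panel does.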

Surfaces touched

  • docs/tutorials/21_had_pretest_workflow.ipynb (new, 20 cells: 6 code + 14 markdown)
  • tests/test_t21_had_pretest_workflow_drift.py (new, 15 tests)
  • docs/tutorials/20_had_brand_campaign.ipynb Section 6 Extensions (forward-pointer to T21)
  • docs/tutorials/README.md (T21 catalog entry)
  • CHANGELOG.md [Unreleased] Added entry
  • TODO.md row 112 (T21 marked done; T22 row remains queued)
  • docs/doc-deps.yaml had_pretests.py block (T21 tutorial entry)

No source code changes in diff_diff/. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR per project_had_followups.md.

Test plan

  • pytest tests/test_t21_had_pretest_workflow_drift.py -v (Rust backend, 15/15 expected)
  • DIFF_DIFF_BACKEND=python pytest tests/test_t21_had_pretest_workflow_drift.py -v (pure-Python backend, 15/15 expected)
  • pytest --nbmake docs/tutorials/21_had_pretest_workflow.ipynb (notebook executes cleanly)
  • pytest tests/test_t20_had_brand_campaign_drift.py -v (T20 drift unaffected by Section 6 forward-pointer edit, 13/13 expected)

🤖 Generated with Claude Code

@github-actions

PR Review Report

Overall Assessment

⚠️ Needs changes — one P1 methodology/documentation mismatch.

Executive Summary

  • No diff_diff/ source implementation changes were introduced.
  • The HAD overall-path Step 2 deferral is documented in REGISTRY.md, so it is not a defect.
  • P1: new prose says design="auto" selects continuous_at_zero based on the QUG outcome, but the actual/registry contract uses the independent _detect_design() min/median/modal heuristic.
  • P2: REGISTRY.md still says the T21 tutorial is queued, while this PR marks it landed elsewhere.
  • I did not load docs/tutorials/*.ipynb, per the review carve-out.
  • Verification: AST parse of the new test file passed; pytest is not installed in this environment.

Methodology

Finding 1 — P1

Severity: P1
Location: CHANGELOG.md:L11, tests/test_t21_had_pretest_workflow_drift.py:L16-L21, tests/test_t21_had_pretest_workflow_drift.py:L188-L192; compare diff_diff/had.py:L1951-L2006, docs/methodology/REGISTRY.md:L2528

Impact: The new docs/test prose says the QUG fail-to-reject outcome lets design="auto" select continuous_at_zero. That conflicts with the actual HAD default behavior: _detect_design() selects continuous_at_zero when d.min() == 0 or d.min() < 0.01 * median(abs(d)), then checks modal-min mass. It does not consume the QUG p-value. This conflates a diagnostic pre-test with the estimator’s default design-dispatch rule.

Concrete fix: Reword the changed prose to separate the two facts: QUG fails to reject H0: d_lower = 0, and independently this sample’s near-zero minimum dose satisfies the design="auto" heuristic. If the tutorial wants a QUG-driven practitioner decision, show an explicit HAD(design="continuous_at_zero") choice after the pre-test instead of attributing it to design="auto". If the notebook contains the same wording, update it too. Add a drift assertion for actual HAD(design="auto") design == "continuous_at_zero" / target_parameter == "WAS" if the tutorial claims that estimator path.
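
The min/median rule quoted in this finding can be restated as a small function (a paraphrase for illustration, not the library's actual _detect_design implementation, which additionally checks modal-min mass):

```python
import numpy as np

def detect_design_sketch(d):
    """Sketch of the min/median heuristic described in the finding:
    continuous_at_zero fires when the minimum dose is exactly zero or
    negligible relative to the median absolute dose."""
    d = np.asarray(d, dtype=float)
    if d.min() == 0 or d.min() < 0.01 * np.median(np.abs(d)):
        return "continuous_at_zero"
    return "continuous_near_d_lower"
```

On a T21-style dose grid (minimum 0.01, median around 25) the rule fires continuous_at_zero; on a T20-style grid bounded away from zero it does not. Neither branch looks at a QUG p-value, which is the finding's point.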

Finding 2 — P3 Informational

Severity: P3
Location: docs/methodology/REGISTRY.md:L2489-L2491, docs/methodology/REGISTRY.md:L2542-L2550

Impact: The two-period aggregate="overall" workflow omits Step 2 / Assumption 7 pre-trends and reports that caveat. This is explicitly documented in the registry and is correctly treated as a structural limitation, not a defect.

Concrete fix: None required.

Code Quality

No P0/P1/P2 findings. The new test file parses successfully. Pattern-wide grep for inline inference anti-patterns found no new changed-source occurrence; there are no modified estimator/inference paths.

Performance

No findings. The new drift tests use bootstrap-heavy checks, but that is appropriate for tutorial drift coverage and not a runtime library path.

Maintainability

No additional P0/P1/P2 findings beyond the registry/doc status issue listed under Documentation/Tests.

Tech Debt

No blocking tech-debt issue. The TODO row now tracks T22 as remaining, which is consistent with the PR’s stated deferred work.

Security

No findings. No secrets or security-sensitive code paths were introduced in the reviewed non-notebook diff.

Documentation/Tests

Finding 3 — P2

Severity: P2
Location: docs/methodology/REGISTRY.md:L2509, docs/methodology/REGISTRY.md:L2556; compare TODO.md:L112, docs/doc-deps.yaml:L407-L409, docs/tutorials/README.md:L106-L112

Impact: The methodology registry still says the T21 tutorial is queued/remaining, while this PR marks T21 as landed in TODO, README, changelog, and doc-deps. Since the registry is the project’s methodology source of truth for reviews, stale phase status creates avoidable confusion.

Concrete fix: Update REGISTRY.md to mark T21 as shipped and leave only T22 as remaining, including the drift-test reference if desired.

Audit Confirmation

  • Sibling-surface mirror audit: checked README, TODO, doc-deps, registry, T20 drift-test pattern, and overall/event-study report surfaces.
  • Pattern-wide grep: checked the QUG/design wording and inline inference anti-patterns across allowed files.
  • Reciprocal/symmetry check: checked both aggregate="overall" and aggregate="event_study" workflow semantics.
  • Transitive workflow dependency audit: no GH Actions or pytest-selection changes in this PR.
  • Exclusions honored: did not load docs/tutorials/*.ipynb.

Path to Approval

  1. Correct the QUG/design="auto" wording in CHANGELOG.md, the new drift-test docstrings, and any matching T21 notebook prose so it matches _detect_design() and REGISTRY.md.
  2. If the tutorial claims the estimator’s auto-selected design/target, add a drift assertion for the actual HAD fit result’s design and target_parameter.

igerber added a commit that referenced this pull request May 10, 2026
P1 — Decouple QUG from HAD's `design="auto"` selector across all surfaces.
The two are independent rules: QUG is a statistical pre-test on
`H0: d_lower = 0`; `_detect_design()` is a min/median heuristic
(`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`). On
T21's panel both checks point to the same identification path but the
mechanisms are independent — `_detect_design()` does not consume the
QUG p-value. Reword tutorial Section 2 + Section 3, CHANGELOG entry,
and drift-test docstrings to reflect this.

Add `test_had_design_auto_lands_on_continuous_at_zero`: explicitly
fits `HAD(design="auto")` on the two-period panel and asserts
`design == "continuous_at_zero"` and `target_parameter == "WAS"`,
locking the prose claim independently of the QUG-test pins.

P2 — Update REGISTRY.md to mark T21 shipped (PR #409); leave T22 row
queued.

All 16 drift tests pass on both backends; notebook executes cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 10, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2a6c045be4350bb0f99e96067b302a8d5971b440


PR Review Report

Overall Assessment

Looks good — no unmitigated P0/P1 findings. One non-blocking P2 documentation cleanup remains.

Executive Summary

  • Prior P1 QUG/design="auto" mismatch is resolved in the reviewed non-notebook surfaces.
  • No diff_diff/ estimator, weighting, variance, SE, or inference code changes were introduced.
  • HAD workflow methodology matches REGISTRY.md: overall path documents Step 2 deferral; event-study path closes it with joint Stute diagnostics.
  • P2: a few status/count docs are stale after T21 landed.
  • I did not load docs/tutorials/*.ipynb, per the explicit exclusion. pytest could not run because this environment lacks pytest/numpy; AST parse of the new test file passed.

Methodology

Finding M1 — P3 Informational

Severity: P3
Location: CHANGELOG.md:L11, tests/test_t21_had_pretest_workflow_drift.py:L16-L26, tests/test_t21_had_pretest_workflow_drift.py:L302-L321; compare diff_diff/had.py:L1951-L2006, docs/methodology/REGISTRY.md:L2528

Impact: The previous P1 is addressed: the prose now separates QUG fail-to-reject from HAD’s independent design="auto" min/median heuristic, and the drift test explicitly fits HAD(design="auto") and asserts continuous_at_zero / WAS.

Concrete fix: None required.

Finding M2 — P3 Informational

Severity: P3
Location: docs/methodology/REGISTRY.md:L2489-L2491, docs/methodology/REGISTRY.md:L2542-L2550, tests/test_t21_had_pretest_workflow_drift.py:L170-L191

Impact: The two-period aggregate="overall" path still omits Step 2 / Assumption 7 pre-trends, but this is explicitly documented and the verdict is drift-tested. This is not a methodology defect.

Concrete fix: None required.

Code Quality

No P0/P1/P2 findings. The new drift test is focused and parses successfully.

Performance

No findings. The added bootstrap-heavy checks are test-only drift coverage, not runtime library code.

Maintainability

No blocking findings. The remaining issue is documentation consistency, listed below.

Tech Debt

No blocking tech-debt issue. T22 remains tracked in TODO.md:L112, which is acceptable deferred work.

Security

No findings. Secret-pattern grep across the changed non-notebook files returned no hits.

Documentation/Tests

Finding D1 — P2

Severity: P2
Location: docs/methodology/REGISTRY.md:L2509, CHANGELOG.md:L11, CHANGELOG.md:L15; compare docs/methodology/REGISTRY.md:L2556, tests/test_t21_had_pretest_workflow_drift.py:L153-L337

Impact: Documentation status is still partially stale: the registry’s older Phase 3 follow-up paragraph still says practitioner_next_steps() and the tutorial are queued for Phase 5, and an older changelog entry still says T21 remains queued. The new changelog entry also says the companion drift file has 15 tests, while the file and registry now show 16.

Concrete fix: Update REGISTRY.md:L2509 and CHANGELOG.md:L15 to say Phase 5 wave 1 and T21 have landed, with only T22 queued. Change CHANGELOG.md:L11 from “15 tests” to “16 tests”.

Audit Confirmation

  • Sibling-surface mirror audit: checked changelog, registry, TODO, doc-deps, README, HAD _detect_design, and HAD pretest workflow surfaces.
  • Pattern-wide grep: checked QUG/design wording, T21 queued/status wording, test counts, and inline inference anti-patterns across allowed files.
  • Reciprocal/symmetry check: checked overall vs event-study HAD workflow semantics.
  • Transitive workflow deps: no GH Actions workflow or pytest-selection changes in this PR.
  • Exclusions honored: did not load docs/tutorials/*.ipynb.
  • Verification: ast.parse() passed for tests/test_t21_had_pretest_workflow_drift.py; full test execution was blocked because pytest and numpy are not installed.

igerber added a commit that referenced this pull request May 10, 2026
- REGISTRY.md L2509: practitioner_next_steps + T21 tutorial were marked
  "queued for Phase 5"; both now landed (PR #402 + PR #409). Update to
  reflect actual status; T22 remains queued.
- CHANGELOG.md L11 (T21 entry): drift-test count was "15 tests"; now 16
  (after the new test_had_design_auto_lands_on_continuous_at_zero added
  in R1).
- CHANGELOG.md L15 (PR #402 entry, retroactive): said "T21 pretest
  tutorial and T22 weighted/survey tutorial remain queued"; T21 has
  since landed in PR #409. Update to reflect that.

No methodology change; no test surface changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 10, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d9ea86a5494fdd041c8dd8f750ebcf92fd76bce4


PR Review Report

Overall Assessment

Looks good — no unmitigated P0/P1 findings. One P2 maintainability/doc-impact mapping issue remains, but it does not block approval.

Executive Summary

  • Prior stale status/count issue is resolved in the supplied diff: T21 is recorded as landed, T22 remains queued, and the drift-test count is 16.
  • No diff_diff/ estimator, weighting, variance, SE, or default-behavior code changes are introduced.
  • HAD pretest methodology aligns with REGISTRY.md: overall path explicitly defers Step 2; event-study path runs joint pre-trends and joint homogeneity Stute diagnostics.
  • QUG fail-to-reject is now separated from HAD design="auto" resolution, matching _detect_design() and registry guidance.
  • I did not load docs/tutorials/*.ipynb, per the explicit exclusion. Full pytest could not run because this environment lacks pytest and numpy; AST parse/count found 16 tests.

Methodology

Finding M1 — P3 Informational

Severity: P3
Location: tests/test_t21_had_pretest_workflow_drift.py:L170-L321, docs/methodology/REGISTRY.md:L2489-L2557; compare diff_diff/had_pretests.py:L4413-L4850, diff_diff/had.py:L1951-L2006

Impact: The prior methodology concern is addressed. The tutorial/test surfaces distinguish QUG from HAD(design="auto"), and the overall vs event-study workflow behavior matches the registry and source contracts.

Concrete fix: None required.

Code Quality

No findings. The new test file is focused and parses successfully.

Performance

No findings. The added bootstrap-heavy checks are test-only drift coverage.

Maintainability

Finding MT1 — P2

Severity: P2
Location: docs/doc-deps.yaml:L366-L409, tests/test_t21_had_pretest_workflow_drift.py:L302-L321

Impact: T21 is mapped under diff_diff/had_pretests.py, but the tutorial/drift test also locks HAD(design="auto") and _detect_design() behavior from diff_diff/had.py. A future had.py design-selection change could miss T21 in the manual docs-impact map.

Concrete fix: Also add docs/tutorials/21_had_pretest_workflow.ipynb to the diff_diff/had.py docs block, with a note that it drift-locks design="auto" / _detect_design() behavior via tests/test_t21_had_pretest_workflow_drift.py.
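
The requested mapping change might look like the following fragment (entirely hypothetical structure; the real docs/doc-deps.yaml schema may differ, and the note wording is illustrative):

```yaml
# hypothetical doc-deps.yaml fragment for the diff_diff/had.py docs block
diff_diff/had.py:
  docs:
    - path: docs/tutorials/21_had_pretest_workflow.ipynb
      note: >
        Drift-locks design="auto" / _detect_design() behavior via
        tests/test_t21_had_pretest_workflow_drift.py.
```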

Tech Debt

No blocking findings. T22 remains tracked in TODO.md, which is acceptable deferred tutorial work.

Security

No findings. Secret-pattern grep over the changed non-notebook files found no secrets.

Documentation/Tests

No blocking findings. The prior documentation/count issue is resolved in the supplied diff. Verification was limited by missing local dependencies: python -m pytest tests/test_t21_had_pretest_workflow_drift.py -q could not run because pytest is not installed, and importing the package path was blocked by missing numpy.

Audit Confirmation

  • Sibling-surface mirror audit: checked changelog, registry, TODO, README, doc-deps, HAD _detect_design, and HAD pretest workflow surfaces.
  • Pattern-wide grep: checked inline inference/NaN anti-patterns; no new occurrence in the changed test/docs surfaces.
  • Reciprocal/symmetry check: checked overall vs event-study workflow and QUG vs design="auto" treatment.
  • Transitive workflow deps: no GH Actions workflow or pytest-selection changes in this PR.
  • Exclusions honored: did not load docs/tutorials/*.ipynb.

@igerber
Owner Author

igerber commented May 10, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f269599cb040c935f7baedad213a77a0c868d635


PR Review Report

Overall Assessment

⚠️ Needs changes — one P1 methodology-documentation mismatch remains in the new tutorial narrative.

Executive Summary

  • No diff_diff/ estimator, weighting, variance, SE, or default-behavior source code changes are introduced.
  • The drift test covers the key workflow pivots and counts 16 tests.
  • The tutorial extract incorrectly swaps the paper/registry labels for Design 1 vs Design 1′ in the HAD support-infimum taxonomy.
  • The tutorial also reintroduces “QUG selects the identification path” language, contradicting the source/test distinction between QUG and HAD(design="auto").
  • Prior P2 doc-impact mapping issue remains: T21 is mapped under had_pretests.py, but it also locks had.py::_detect_design() behavior.

Methodology

Finding M1 — P1 [Newly identified]

Severity: P1
Location: docs/_review/t21_notebook_extract.md:L31, L40, L155, L426, L447; compare docs/methodology/REGISTRY.md:L2267, diff_diff/had.py:L1951-L2006

Impact: The tutorial reverses the HAD design labels: it calls d_lower = 0 / continuous_at_zero “Design 1” and continuous_near_d_lower “Design 1′”. The registry and source use the opposite convention: Design 1′ is the QUG / continuous_at_zero case, while Design 1 is the strictly-positive lower-bound case. The extract also says QUG “led the workflow to select” / “uses the QUG outcome to pick” the path, while the source keeps did_had_pretest_workflow() and HAD(design="auto") separate; the drift test correctly locks that independence at tests/test_t21_had_pretest_workflow_drift.py:L302-L321.

Concrete fix: Regenerate the notebook and review extract so every occurrence says d_lower = 0 / continuous_at_zero is Design 1′, and strictly-positive d_lower / continuous_near_d_lower is Design 1. Reword L426 and L447 to say QUG fail-to-reject supports the continuous_at_zero pretest interpretation, while HAD(design="auto") independently resolves via the min/median heuristic.

Documented deviations checked: null="mean_independence" is documented in REGISTRY.md as an R-parity extension, so its side-panel use is not a defect.

Code Quality

No findings. The new test file is focused and AST-parses successfully.

Performance

No findings. Added bootstrap-heavy checks are test-only drift coverage.

Maintainability

Finding MT1 — P2

Severity: P2
Location: docs/doc-deps.yaml:L366-L409, tests/test_t21_had_pretest_workflow_drift.py:L302-L321

Impact: Prior finding remains. T21 is mapped under diff_diff/had_pretests.py, but the tutorial/drift test also locks HAD(design="auto") and _detect_design() behavior from diff_diff/had.py. A future had.py design-selection change could miss T21 in the manual docs-impact map.

Concrete fix: Also add docs/tutorials/21_had_pretest_workflow.ipynb to the diff_diff/had.py docs block, noting that it drift-locks design="auto" / _detect_design() through tests/test_t21_had_pretest_workflow_drift.py.

Tech Debt

No blocking findings. T22 remains tracked in TODO.md, which is acceptable deferred tutorial work.

Security

No findings. Secret-pattern grep over changed non-notebook surfaces found no secrets.

Documentation/Tests

No additional findings beyond M1 and MT1. I did not load docs/tutorials/*.ipynb, per the explicit exclusion; I reviewed the supplied markdown extract instead. pytest tests/test_t21_had_pretest_workflow_drift.py could not run in this environment because pytest, numpy, and pandas are not installed; AST inspection found 16 test functions.

Path to Approval

  1. Fix the HAD design labels in the notebook source and regenerated docs/_review/t21_notebook_extract.md: continuous_at_zero = Design 1′, continuous_near_d_lower = Design 1.
  2. Reword the summary/extension prose so QUG does not “select” the estimator path; keep the distinction that QUG is a pretest result and HAD(design="auto") uses the separate _detect_design() min/median rule.

Audit Confirmation

  • Sibling-surface mirror audit: checked README, changelog/TODO/registry/doc-deps, HAD pretest workflow, and _detect_design() surfaces.
  • Pattern-wide grep: checked inline inference / NaN anti-patterns; no new affected source code path found.
  • Reciprocal/symmetry check: checked overall vs event-study workflow and QUG vs design="auto" handling.
  • Transitive workflow deps: no GH Actions workflow or pytest-selection changes in this PR.
  • Exclusions honored: did not load docs/tutorials/*.ipynb, benchmarks/data/real/*.json, or benchmarks/data/real/*.csv.

igerber added a commit that referenced this pull request May 10, 2026
P1 — HAD design label convention was reversed across T21. Per
REGISTRY:2267 + had.py:7-33, the convention is:
  - Design 1' = continuous_at_zero (d_lower = 0, QUG case) — that's T21
  - Design 1  = continuous_near_d_lower (d_lower > 0)     — that's T20
T21 had Design 1 / Design 1' swapped throughout. Fixed in the build
script (Section 1 paper-step taxonomy, Section 2 panel framing,
Section 3 reading-the-verdict, Section 7 Extensions). Notebook
re-executed and review extract regenerated.

Two residual "QUG selects/picks the identification path" leakages from
the original prose also surfaced (Section 7 + Summary checklist). Both
contradicted the explicit QUG-vs-_detect_design separation locked by
test_had_design_auto_lands_on_continuous_at_zero. Reworded to keep the
two rules independent ("QUG fail-to-reject and `design="auto"`
heuristic both pointed independently"; "QUG is a statistical test on
H0; `design="auto"` calls _detect_design() which uses a min/median
heuristic — both pointed to continuous_at_zero on this panel").

P2 (MT1) — T21 was mapped under had_pretests.py in doc-deps.yaml but
the drift test now also locks HAD(design="auto") / _detect_design()
behavior from had.py via test_had_design_auto_lands_on_continuous_at_zero.
Add T21 entry to the had.py docs block with a note on the
_detect_design() drift coverage so a future had.py design-selection
change does not miss T21 in the manual docs-impact map.

All 16 drift tests still pass on Rust; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 10, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c5a0ee3aa608304c35b0b9b03bba3bb54cfb003b


Overall Assessment

⚠️ Needs changes — one P1 methodology-documentation mismatch remains in the new tutorial narrative.

Executive Summary

  • No diff_diff/ estimator, weighting, variance/SE, or default behavior code changes are introduced.
  • Prior P1 appears resolved: Design 1′ / continuous_at_zero and Design 1 / continuous_near_d_lower are now labeled correctly, and QUG is distinguished from HAD(design="auto").
  • Prior P2 docs-impact issue appears resolved in the supplied diff by adding T21 to both had.py and had_pretests.py doc-deps surfaces.
  • New P1: the tutorial still misstates paper Step 4 and assigns Assumptions 5/6 to the continuous_at_zero path.

Methodology

Finding M1 — P1 [Newly identified]

Severity: P1
Location: docs/_review/t21_notebook_extract.md:L31-L34, L239-L244, L361-L366, L386-L390; compare docs/methodology/REGISTRY.md:L2267-L2273, L2482-L2487, L2532, diff_diff/had.py:L11-L24, diff_diff/had_pretests.py:L54-L55

Impact: The tutorial describes “Step 4” as “Boundary continuity (paper Assumptions 5, 6)” and repeatedly says Assumptions 5/6 remain the non-testable Step 4 caveat on this T21 continuous_at_zero workflow. That does not match the registry/source contract. In the registry, paper Step 4 is the decision rule: if QUG, pre-trends, and linearity do not reject, TWFE may be used. Assumptions 5/6 are the positive-d_lower Design 1 caveats for continuous_near_d_lower / mass_point, while continuous_at_zero is Design 1′ and is tied to Assumption 3 plus regularity, not Assumptions 5/6.

Concrete fix: Reword the tutorial taxonomy so Section 4 workflow steps are: QUG, pre-trends, linearity/homogeneity, then the “use TWFE if none rejects” decision. Move the non-testable caveat into a separate paragraph and make it design-specific: for T21’s continuous_at_zero interpretation, cite Assumption 3 / boundary regularity at zero; mention Assumptions 5/6 only as the caveat for the Design 1 positive-d_lower path, e.g. T20’s continuous_near_d_lower case.
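
The corrected reading of Step 4 (a decision rule, not a test) can be sketched as follows (a hypothetical helper; the function name, step keys, and alpha threshold are illustrative, not library API):

```python
def step4_decision(p_values, alpha=0.05):
    """Paper Step 4 per the registry reading: TWFE is admissible only
    if none of the Step 1-3 pre-tests (QUG, pre-trends, linearity)
    rejects at the chosen level. No separate Step 4 test is run."""
    steps = ("qug", "pre_trends", "linearity")
    rejected = [s for s in steps if p_values[s] < alpha]
    return {"twfe_admissible": len(rejected) == 0, "rejected": rejected}
```

For example, a panel where all three pre-tests fail to reject leaves TWFE admissible, while a single pre-trends rejection flips the verdict; the design-specific non-testable caveat (Assumption 3 for Design 1′, Assumptions 5/6 for Design 1) sits outside this rule entirely.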

Finding M2 — P3 informational

Severity: P3
Impact: The tutorial uses yatchew_hr_test(null="mean_independence"), which is not paper-derived but is explicitly documented in REGISTRY.md as an R-parity extension. This is not a defect.

Concrete fix: None required.

Code Quality

No findings. The added test file is focused and AST-parses successfully.

Performance

No findings. Bootstrap-heavy checks are test-only drift coverage.

Maintainability

No blocking findings. The supplied diff fixes the previous docs-impact mirror by adding T21 under the diff_diff/had.py docs block as well as diff_diff/had_pretests.py.

Tech Debt

No blocking findings. T22 remains tracked in TODO.md, which is acceptable deferred tutorial work.

Security

No findings. Secret-pattern grep over changed non-notebook text/test surfaces found no new secret-like material.

Documentation/Tests

No additional findings beyond M1. I did not load docs/tutorials/*.ipynb, per the explicit exclusion. I could not run pytest tests/test_t21_had_pretest_workflow_drift.py because pytest is not installed in this environment; AST inspection found 16 test functions.

Path to Approval

  1. Update the T21 tutorial/review extract so “paper Step 4” is the TWFE decision rule, not boundary continuity.
  2. Replace the repeated T21 continuous_at_zero Assumptions 5/6 caveat with the correct Design 1′ caveat: Assumption 3 / boundary regularity at zero, while reserving Assumptions 5/6 for positive-d_lower Design 1 paths.

Audit Confirmation

  • Sibling-surface mirror audit: checked tutorial extract, README, changelog/TODO/registry/doc-deps, HAD source/docstrings, and _detect_design() surfaces.
  • Pattern-wide grep: checked inline inference / NaN anti-patterns; no new affected source-code path in this PR.
  • Reciprocal/symmetry check: checked overall vs event-study workflow, QUG vs design="auto", and Design 1′ vs Design 1 assumption mapping.
  • Transitive workflow deps: no GH Actions workflow or pytest-selection changes in this PR.
  • Exclusions honored: did not load docs/tutorials/*.ipynb, benchmarks/data/real/*.json, or benchmarks/data/real/*.csv.

igerber added a commit that referenced this pull request May 10, 2026
Two methodology framing errors conflated in the original tutorial:

- "Paper Step 4" was described as "Boundary continuity (Assumptions 5/6)"
  in the workflow taxonomy. Per REGISTRY's pretest workflow (lines
  2482-2487 surrounding the four-step enumeration), Step 4 is actually
  the DECISION RULE: "if Steps 1-3 don't reject, TWFE may be used."
  Boundary-continuity assumptions are a separate concern.

- Assumptions 5/6 are Design 1 (continuous_near_d_lower / mass_point)
  identification caveats — the library emits a UserWarning citing them
  on Design 1 fits and stays silent on Design 1' (continuous_at_zero)
  fits per REGISTRY:2532 and had.py. T21's panel resolves to Design 1'
  via QUG fail-to-reject + the _detect_design() heuristic, so the
  relevant non-testable caveat is **Assumption 3** (uniform continuity
  of d -> Y_2(d) at zero, REGISTRY:2270), NOT Assumptions 5/6.
  Inherited the 5/6 framing from T20 (which IS Design 1) inappropriately.

Reframed across 7 surfaces in the build script:
- Section 1 four-step enumeration: Step 4 is now the decision rule
- Section 1: added a separate paragraph for the non-testable
  identification caveat that's design-path-specific (Assumption 3 for
  Design 1', Assumptions 5/6 for Design 1) and explicitly notes the
  library's UserWarning behavior matches this split
- Section 4 event-study verdict reading: separated Step 4 (decision
  rule) from the Design 1' caveat
- Section 4 horizon-detail closing: same split
- Section 6 leadership template: replaced "Step 4 / Assumptions 5/6"
  caveat with the correct Design 1' caveat (Assumption 3); explicit
  parenthetical noting T20's caveat was different because T20 was
  Design 1
- Section 6 bottom line: same split (decision rule vs caveat)
- Section 8 summary checklist: replaced single Step-4-as-caveat
  bullet with a two-part bullet on the workflow vs caveat distinction

Notebook re-executed, review extract regenerated. All 16 drift tests
still pass; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 10, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bd5c558dde81b67678fc0825e3bfccce67708a27


Overall Assessment

⚠️ Needs changes — prior tutorial mismatch appears resolved, but one newly identified P1 methodology-doc mismatch remains on the public workflow docstring.

Executive Summary

  • No estimator/math/SE/default behavior changes in diff_diff/.
  • Prior P1 is resolved in the supplied T21 extract: paper Step 4 is now the decision rule, and T21’s continuous_at_zero caveat is Assumption 3, not Assumptions 5/6.
  • New P1 [Newly identified]: did_had_pretest_workflow() docstring still mislabels “Step 4” as the Yatchew alternative on the event-study path.
  • T21 drift test file AST-parses and contains 16 tests.
  • I could not run pytest: pytest is not installed; a direct Python smoke check also failed because numpy is not installed.

Methodology

M1 — P1 [Newly identified]

Severity: P1
Location: diff_diff/had_pretests.py:L4443-L4449; compare docs/methodology/REGISTRY.md:L2482-L2487, diff_diff/had_pretests.py:L53-L55, diff_diff/had_pretests.py:L2732-L2737

Impact: The public did_had_pretest_workflow() docstring says event-study dispatch covers paper steps 1–3 and then labels “Step 4” as “Yatchew-style linearity as an alternative to Stute.” That contradicts the registry and the corrected module/event-verdict docstrings: paper Step 4 is the decision rule, “use TWFE if none of the tests rejects.” The Yatchew alternative belongs to Step 3, and there is no joint Yatchew variant.

Concrete fix: Reword the function docstring to say the event-study path uses joint Stute for Step 3; users needing Yatchew robustness can call yatchew_hr_test() manually; paper Step 4 is the TWFE admissibility decision and has no separate test path.

M2 — P3 informational

Severity: P3
Impact: The tutorial uses yatchew_hr_test(null="mean_independence"), which is an R-parity extension, not paper-derived. This is explicitly documented in REGISTRY.md, so it is not a defect.

Concrete fix: None required.

Code Quality

No findings. The new drift test is focused and AST-parses successfully.

Performance

No findings. Bootstrap-heavy work is confined to drift tests/tutorial validation.

Maintainability

No additional findings. The temporary _review extract is excluded from Sphinx in the supplied diff, and the workflow notebook exclusion explains why the extract exists for review.

Tech Debt

No blocking findings. T22 remains tracked in TODO.md under “Tech Debt from Code Reviews,” which is acceptable deferred tutorial work.

Security

No findings. Secret-pattern grep over changed text/test surfaces found no hits.

Documentation/Tests

No additional findings beyond M1. The new drift file defines 16 tests and covers overall vs event-study paths, QUG/design-auto separation, joint horizon counts, and both Yatchew null modes. I could not execute tests because pytest and numpy are unavailable in this environment.

Path to Approval

  1. Update diff_diff/had_pretests.py did_had_pretest_workflow() docstring at L4443-L4449 so Step 4 is the TWFE decision rule, and the Yatchew note is described as a Step 3 alternative with no joint event-study variant.

Audit Confirmation

  • Sibling-surface mirror audit: checked tutorial extract, registry, HAD source/docstrings, README, doc-deps, TODO/changelog surfaces.
  • Pattern-wide grep: checked inline inference / NaN anti-patterns; no new affected source path in this PR.
  • Reciprocal/symmetry check: checked overall vs event-study workflow, QUG vs design="auto", and Design 1′ vs Design 1 caveat mapping.
  • Transitive workflow deps: no workflow paths: or pytest-selection change in this PR; reviewed pyproject.toml and tests/conftest.py.
  • Exclusions honored: did not load docs/tutorials/*.ipynb, benchmarks/data/real/*.json, or benchmarks/data/real/*.csv.

igerber added a commit that referenced this pull request May 10, 2026
…ion rule

Per REGISTRY's pretest workflow (lines 2482-2487 surrounding the
four-step enumeration) and the same module's two correctly-framed
docstrings (module-level at line 54, `_compose_verdict_event_study` at
line 2735), paper Step 4 is the decision rule "use TWFE if none of the
tests rejects" — NOT a test or assumption. The Yatchew-HR test is a
Step 3 alternative (a single-horizon swap-in for Stute), and there is
no joint Yatchew variant in the paper.

The `did_had_pretest_workflow()` docstring at L4445-4446 was the only
place in the file that mislabeled the Yatchew alternative as "Step 4".
Reword to:
- correctly tag Yatchew as a step-3 alternative
- explicitly state Step 4 is the decision rule with no code path
- cross-reference the two existing correctly-framed docstrings in the
  same file so future readers can confirm the convention is consistent

Surfaced by PR #409 R6 review (T21 tutorial pre-test taxonomy work
made the inconsistency visible). All 42 workflow-specific tests in
`tests/test_had_pretests.py` still pass; T21 drift test (16 tests)
still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented May 10, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 49c251ec536713e9ab33f9dacf5d7d47dbe2cbf1


Overall Assessment

Looks good — no unmitigated P0/P1 findings. Prior P1 methodology-doc mismatch is resolved in the supplied diff.

Executive Summary

  • No estimator, variance/SE, weighting, or default behavior changes.
  • Prior Step 4/Yatchew docstring issue is fixed: Step 4 is now framed as the TWFE decision rule.
  • T21’s Design 1′ caveat is correctly Assumption 3, while Design 1 caveats remain Assumptions 5/6.
  • null="mean_independence" is correctly treated as a documented R-parity extension, not paper-derived.
  • One P2 test-quality issue remains: two bootstrap p-value drift tests only lower-bound quoted values, so stale tutorial numbers could pass.
  • Could not run pytest: pytest, numpy, pandas, and scipy are not installed. AST parse passed and found 16 tests.

Methodology

M1 — P3 Informational

Severity: P3
Location: diff_diff/had_pretests.py:L4443-L4452; docs/methodology/REGISTRY.md:L2482-L2503

Impact: Prior P1 is addressed. The workflow docstring now describes event-study Step 3 as joint Stute, Yatchew-HR as a single-horizon alternative, and paper Step 4 as the decision rule rather than a separate Yatchew test path.

Concrete fix: None required.

M2 — P3 Informational

Severity: P3
Location: docs/_review/t21_notebook_extract.md Section 5; docs/methodology/REGISTRY.md:L2480

Impact: The tutorial uses yatchew_hr_test(null="mean_independence"), which is an R-parity extension rather than paper-derived. This deviation is explicitly documented in the registry, so it is not a defect.

Concrete fix: None required.

Code Quality

No findings.

Performance

No findings. The new bootstrap-heavy work is confined to tutorial drift validation.

Maintainability

No findings. The supplied diff excludes the temporary docs/_review extract from Sphinx via docs/conf.py, and T22 remains tracked separately.

Tech Debt

No blocking findings. T22 weighted/survey HAD tutorial remains tracked in TODO.md under “Tech Debt from Code Reviews,” which is acceptable deferred work.

Security

No findings. Secret-pattern grep over changed text/test surfaces found no actionable hits.

Documentation/Tests

D1 — P2

Severity: P2
Location: tests/test_t21_had_pretest_workflow_drift.py:L206-L214, tests/test_t21_had_pretest_workflow_drift.py:L293-L299

Impact: The drift file says it pins quoted bootstrap p-values with tolerance bands, but test_overall_stute_fails_to_reject and test_event_study_homogeneity_fails_to_reject each assert only p_value > 0.50. The tutorial quotes roughly 0.686 and 0.763; future drift to, say, 0.99 would leave the prose stale while the tests still pass.

Concrete fix: Use bounded tolerance bands around the quoted values, preferably backend-aware if needed. For example, assert roughly 0.53 <= overall_report.stute.p_value <= 0.84 and 0.61 <= homogeneity_joint.p_value <= 0.92, or route through the project’s backend tolerance fixture.
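The bounded-band pattern suggested above can be sketched as a small helper (names and the half-width default are illustrative, not the project's actual fixture API; the ~0.31 band width follows the quoted example ranges):

```python
def assert_in_band(p_value, quoted, half_width=0.155):
    """Fail when a bootstrap p-value drifts outside a two-sided band
    around the value quoted in the tutorial prose. A lower-bound-only
    assertion like `p > 0.50` would silently pass on drift to 0.99;
    this bounded band flags drift in either direction."""
    lo, hi = quoted - half_width, quoted + half_width
    assert lo <= p_value <= hi, (
        f"p={p_value:.3f} outside [{lo:.2f}, {hi:.2f}]; prose quoting "
        f"~{quoted:.3f} is likely stale"
    )

# Tutorial quotes ~0.686 for the overall Stute p-value.
assert_in_band(0.70, quoted=0.686)      # within band: passes
try:
    assert_in_band(0.99, quoted=0.686)  # stale-prose drift: fails
except AssertionError:
    print("drift detected")
```

The half-width can be tuned per backend (e.g. via a backend-tolerance fixture) as long as the band stays bounded on both sides.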

Audit Confirmation

  • Sibling-surface mirror audit: checked HAD workflow docstring, module/verdict framing, registry, tutorial extract, README, changelog/TODO, doc-deps, and drift tests.
  • Pattern-wide grep: checked inline inference / NaN anti-patterns in diff_diff/*.py; no new affected inference path in this PR.
  • Reciprocal/symmetry check: checked overall vs event-study workflow, QUG vs design="auto", Design 1′ vs Design 1 caveats, and both Yatchew null modes.
  • Transitive workflow deps: no GH Actions paths: or pytest-selection change in this PR; scanned workflow/doc test references.
  • Exclusions honored: did not load docs/tutorials/*.ipynb, benchmarks/data/real/*.json, or benchmarks/data/real/*.csv.

igerber added a commit that referenced this pull request May 10, 2026
Two bootstrap p-value drift tests had lower-bound-only assertions:
- `test_overall_stute_fails_to_reject`: was `p > 0.50`, tutorial quotes
  ~0.686 → would silently pass if p drifted to 0.99
- `test_event_study_homogeneity_fails_to_reject`: was `p > 0.50`,
  tutorial quotes ~0.763 → same silent-stale risk

The third bootstrap test (`test_event_study_pretrends_fails_to_reject`)
already used a bounded band `0.0 <= p <= 0.25`. Mirror that pattern on
the other two with bounded bands per
`feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15
width):
- Stute: 0.53 <= p <= 0.84 (band ~0.31 around 0.686)
- Homogeneity: 0.61 <= p <= 0.92 (band ~0.31 around 0.763)

Both bands wide enough for Rust ↔ pure-Python RNG path differences;
both narrow enough that drift in either direction (toward rejection
or toward an even cleaner pass) flags the prose as stale.

All 16 drift tests pass on both backends within the new bands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented May 10, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2f2c7ca21a18e167c742f6f0f6e445d6f5c737dc


Overall Assessment

Looks good — no unmitigated P0/P1 findings in the supplied diff.

Executive Summary

  • No estimator math, weighting, variance/SE, or default behavior changes.
  • Affected method surface is documentation/tests for HAD pretests: did_had_pretest_workflow, QUG, joint Stute, and Yatchew-HR.
  • Methodology framing matches REGISTRY.md: Step 4 is the decision rule, event-study uses joint Stute, and no joint Yatchew variant is claimed.
  • null="mean_independence" is correctly framed as the documented R-parity extension, not paper-derived.
  • Prior P2 drift-test issue is addressed in the supplied diff: bootstrap p-value tests now use bounded tolerance bands.
  • T22 weighted/survey tutorial remains tracked in TODO.md; acceptable deferred work.

Methodology

M1 — P3 Informational

Severity: P3
Location: diff_diff/had_pretests.py:L4443-L4452; docs/methodology/REGISTRY.md:L2482-L2503

Impact: Prior methodology-doc mismatch is resolved. The docstring now frames event-study Step 3 as joint Stute, Yatchew-HR as a single-horizon alternative, and paper Step 4 as the decision rule rather than another diagnostic.

Concrete fix: None required.

M2 — P3 Informational

Severity: P3
Location: docs/_review/t21_notebook_extract.md Section 5; docs/methodology/REGISTRY.md:L2480

Impact: The tutorial’s yatchew_hr_test(null="mean_independence") side panel is correctly labeled as an R-parity extension. This is documented in the registry, so it is not a methodology defect.

Concrete fix: None required.

Code Quality

No findings. The diff_diff/ change is documentation-only.

Performance

No findings. The bootstrap-heavy checks are confined to the tutorial drift test and use module-scoped fixtures.

Maintainability

No findings. The temporary docs/_review extract is excluded from Sphinx via docs/conf.py:L36, and the remaining T22 tutorial work is tracked.

Tech Debt

No blocking findings. TODO.md:L112 tracks the remaining weighted/survey HAD tutorial, which is acceptable deferred work.

Security

No findings. Secret-pattern scan over changed text/test surfaces found no actionable hits.

Documentation/Tests

D1 — P3 Informational

Severity: P3
Location: tests/test_t21_had_pretest_workflow_drift.py:L206-L214, tests/test_t21_had_pretest_workflow_drift.py:L293-L299

Impact: Prior P2 is addressed in the supplied diff: the overall Stute and event-study homogeneity bootstrap p-values now use bounded tolerance bands around the tutorial’s quoted values rather than only lower-bounding p_value > 0.50.

Concrete fix: None required.

Audit Confirmation

  • Sibling-surface mirror audit: checked overall vs event-study workflow framing, registry, README, changelog/TODO, doc-deps, review extract, and drift tests.
  • Pattern-wide grep: checked inference/NaN anti-patterns in diff_diff/*.py; no new affected inference path in this PR.
  • Reciprocal/symmetry check: checked QUG vs design="auto", Design 1′ vs Design 1 caveats, and both Yatchew null modes.
  • Transitive workflow deps: no GH Actions workflow paths: or pytest-selection changes in this PR.
  • Exclusions honored: did not load docs/tutorials/*.ipynb, benchmarks/data/real/*.json, or benchmarks/data/real/*.csv.

igerber and others added 9 commits May 10, 2026 11:25
End-to-end practitioner walkthrough for `did_had_pretest_workflow` building
on T20's brand-campaign framing. Uses a Design 1 (`continuous_at_zero`)
panel variant (Uniform[$0.01K, $50K] vs T20's [$5K, $50K]) so the QUG step
fails-to-reject and the verdict text fires the load-bearing
"Assumption 7 deferred" pivot for the upgrade-arc narrative.

Three sections:
- Overall workflow on a two-period collapse: Step 1 + Step 3 only;
  verdict explicitly flags Step 2 as deferred (single pre-period).
- Upgrade to event_study workflow: closes all three testable steps
  via QUG + joint pre-trends Stute (3 horizons) + joint homogeneity
  Stute (4 horizons); verdict reads "TWFE admissible under Section 4
  assumptions".
- Yatchew side panel comparing null="linearity" (default, paper Theorem 7)
  vs null="mean_independence" (Phase 4 R-parity with R
  YatchewTest::yatchew_test(order=0)) on the within-pre-period
  first-difference paired with post-period dose.

Companion drift-test file with 15 tests pinning panel composition, both
verdict pivots, structural anchors on both paths, deterministic stats,
and bootstrap p-value tolerance bands per backend.

Updates T20 Section 6 Extensions with a forward-pointer to T21,
`docs/tutorials/README.md` with a T21 entry, `docs/doc-deps.yaml`
`had_pretests.py` block, CHANGELOG `[Unreleased]`, and the T21/T22
TODO row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
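The Yatchew side panel's "within-pre-period first-difference paired with post-period dose" setup rests on the standard Yatchew differencing idea, which can be sketched generically (this is not the library's yatchew_hr_test; the DGP, slope, and noise scale here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
dose = np.sort(rng.uniform(0.01, 50.0, n))   # post-period dose, sorted
dy = 2.0 * dose + rng.normal(0, 3.0, n)      # first-differenced outcome, linear in dose here

# sigma2_lin: residual variance from an OLS fit of dy on dose.
# Inflated whenever the true dose-response is NOT linear.
X = np.column_stack([np.ones(n), dose])
beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
sigma2_lin = np.mean((dy - X @ beta) ** 2)

# sigma2_diff: half the mean squared successive difference after sorting
# by dose. Differencing adjacent-by-dose units removes ANY smooth
# dose-response (as the dose step vanishes), so this estimates the error
# variance under the alternative as well.
sigma2_diff = np.mean(np.diff(dy) ** 2) / 2.0

# Under linearity both estimate the same error variance (3^2 = 9 here),
# so a statistic comparing them should be near zero.
print(sigma2_lin, sigma2_diff)
```

The test logic is then a normalized comparison of the two variance estimates; the stricter mean-independence null mode swaps what the "restricted" variance absorbs, which is why it shows a larger sigma2_lin and a smaller p-value in the side panel.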
…dependence

PR #397 added the `null="mean_independence"` mode; PR #400 was the
release-rollup that bundled it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ection vs proof

Two methodology framing issues in T21:

1. The DGP `Uniform[$0.01K, $50K]` has support strictly above zero. The
   tutorial / README / CHANGELOG / drift-test docstrings called it a "true
   Design 1 (`continuous_at_zero`)" panel, conflating "QUG fails-to-reject
   d_lower=0 in this finite sample" with "the true DGP support is at zero".
   Reframe across all surfaces: the DGP has a strictly-positive but very
   near-zero lower bound chosen so QUG fails-to-reject; HAD's `design="auto"`
   then selects the `continuous_at_zero` identification path on that QUG
   outcome (a workflow decision following the test, not a property of the
   true DGP).

2. The notebook over-described fail-to-reject pre-tests as "formal validation",
   "conclusive", "closes assumptions", "TWFE admissible without methodological
   caveat". Soften to "diagnostics fail to reject", "supports but does not
   prove", "non-rejection evidence under finite-sample power and test
   specification". Pre-test tutorials should teach the limits of pre-tests,
   not paper over them.

Also extracts a `yatchew_side_panel_inputs` fixture in the drift test to
deduplicate post_dose / dy construction across the two side-panel tests.

Numerical pins unchanged; all 15 drift tests still pass on both backends;
notebook executes cleanly; T20 drift unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed tutorial

Two stale shorthand phrasings inconsistent with the revised methodology framing:

- Section 7 Extensions: "single Design 1 panel" → "single panel where QUG
  led the workflow to select the continuous_at_zero (Design 1) identification
  path" (matches the corrected Section 2 wording).
- `test_event_study_pretrends_fails_to_reject` docstring quoted "close to
  alpha = 0.05 but conclusive"; the user-facing text now says
  "warrants scrutiny" - update internal docstring to match.

No methodology change, no new pins; all 15 drift tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 — Decouple QUG from HAD's `design="auto"` selector across all surfaces.
The two are independent rules: QUG is a statistical pre-test on
`H0: d_lower = 0`; `_detect_design()` is a min/median heuristic
(`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`). On
T21's panel both checks point to the same identification path but the
mechanisms are independent — `_detect_design()` does not consume the
QUG p-value. Reword tutorial Section 2 + Section 3, CHANGELOG entry,
and drift-test docstrings to reflect this.

Add `test_had_design_auto_lands_on_continuous_at_zero`: explicitly
fits `HAD(design="auto")` on the two-period panel and asserts
`design == "continuous_at_zero"` and `target_parameter == "WAS"`,
locking the prose claim independently of the QUG-test pins.

P2 — Update REGISTRY.md to mark T21 shipped (PR #409); leave T22 row
queued.

All 16 drift tests pass on both backends; notebook executes cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
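The min/median rule quoted above can be illustrated with a simplified sketch (this is not the library's _detect_design(); it only mirrors the single quoted inequality, and the sample sizes are arbitrary):

```python
import numpy as np

def detect_design_sketch(d):
    """Simplified sketch of the quoted heuristic: the continuous_at_zero
    path fires when the minimum dose is below 1% of the median absolute
    dose; otherwise the support is treated as bounded away from zero."""
    if d.min() < 0.01 * np.median(np.abs(d)):
        return "continuous_at_zero"
    return "continuous_near_d_lower"

rng = np.random.default_rng(0)
d_t21 = rng.uniform(0.01, 50.0, size=5000)  # T21-style support, very near zero
d_t20 = rng.uniform(5.0, 50.0, size=5000)   # T20-style support, bounded away

print(detect_design_sketch(d_t21))  # continuous_at_zero
print(detect_design_sketch(d_t20))  # continuous_near_d_lower
```

Note the sketch never sees a p-value: the heuristic and the QUG pre-test are independent rules that happen to agree on this panel, which is exactly the separation the commit enforces.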
- REGISTRY.md L2509: practitioner_next_steps + T21 tutorial were marked
  "queued for Phase 5"; both now landed (PR #402 + PR #409). Update to
  reflect actual status; T22 remains queued.
- CHANGELOG.md L11 (T21 entry): drift-test count was "15 tests"; now 16
  (after the new test_had_design_auto_lands_on_continuous_at_zero added
  in R1).
- CHANGELOG.md L15 (PR #402 entry, retroactive): said "T21 pretest
  tutorial and T22 weighted/survey tutorial remain queued"; T21 has
  since landed in PR #409. Update to reflect that.

No methodology change; no test surface changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2 — CELL_07 first bullet had a conceptual error in describing the QUG
mechanic: "D_(1) is small relative to the gap D_(2)-D_(1)" — actually
D_(1) ≈ 0.181 and the gap ≈ 0.047, so D_(1) is 3.86x LARGER than the
gap. The reason QUG fails-to-reject is that T = D_(1)/(D_(2)-D_(1)) =
3.86 lands below the critical value 19, NOT because of any "small
relative to the gap" relationship. Rewrote to state the test statistic
and critical value directly.

P3 polish:
- CELL_03: "approximately 0.007" → "below 0.01" (avoids numerical drift
  on a stat that scales with seed; the heuristic threshold itself is
  what matters).
- CELL_07: added a one-line aside reconciling `all_pass=True` with
  Step 2 deferral on the overall path: `all_pass` aggregates only the
  steps that ran on each dispatch, so True here means "of the two
  steps run, neither rejected" — not that Assumption 7 has been cleared.
- CELL_09: explained the very-large-negative `T_hr` ≈ -35,000 as a
  scale artifact (sigma2_diff scales with the squared dose-step gap;
  on Uniform[0.01, 50] doses with a true slope of 100, adjacent-by-dose
  units have dy gaps that swamp sigma2_lin). Adds explicit reference
  forward to the side panel where a different input gives T_hr ≈ 0
  as a sanity check.
- CELL_17: tightened mean_independence vs linearity framing to "linear
  fit absorbs any apparent slope (real or sample noise)" — the
  pre-period has no real signal so the original "absorbs the
  dose-response signal" wording was off-target on this panel.

No methodology change; all 16 drift tests still pass; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
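The corrected CELL_07 arithmetic can be checked directly from the numbers quoted in the commit (the statistic form, order-statistic values, and critical value are all taken from the commit text; this is not the library's QUG implementation):

```python
# QUG test-statistic arithmetic, using the commit's rounded values.
d1 = 0.181           # smallest observed dose, D_(1)
gap = 0.047          # gap to the second-smallest, D_(2) - D_(1)

T = d1 / gap         # quoted statistic: T = D_(1) / (D_(2) - D_(1))
critical_value = 19  # quoted critical value

print(f"T = {T:.2f}")  # ~3.85 on these rounded inputs
# D_(1) is ~3.86x LARGER than the gap; the fail-to-reject comes from
# T landing below the critical value, not from D_(1) being "small".
print("reject H0: d_lower = 0" if T > critical_value else "fail to reject")
```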
The CI AI reviewer's diff-build excludes `docs/tutorials/*.ipynb`
(`.github/workflows/ai_pr_review.yml:151-156` + reviewer prompt's DO-NOT
list at `.github/codex/prompts/pr_review.md:87-91`), so the actual T21
notebook prose has not been visible to the CI reviewer through three
review rounds. The notebook content was reviewed once via a standalone
notebook-aware Agent (which caught a P2 conceptual error in CELL_07 +
4 P3 polish items, all addressed in `d9ea86a`), but the CI reviewer
itself has only seen the adjacent surfaces (CHANGELOG, drift test,
README, REGISTRY).

This commit lands a one-shot markdown extract at
`docs/_review/t21_notebook_extract.md` that mirrors the notebook's full
narrative (markdown cells + code cells + executed outputs) so the CI
reviewer can audit the prose directly on this PR. Regenerate via
`python _scratch/t21_pretests/70_extract_for_review.py` from the
notebook source-of-truth at `_scratch/t21_pretests/60_build_notebook.py`.

Adds `_review` to Sphinx `exclude_patterns` in `docs/conf.py` so the
docs build doesn't pick the file up.

A follow-on PR will (a) remove this extract file + the Sphinx
exclude_patterns entry and (b) replace the blanket `.ipynb` exclusion
in the CI workflow with a markdown-only extraction (jq one-liner) wired
into the diff-build itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 — HAD design label convention was reversed across T21. Per
REGISTRY:2267 + had.py:7-33, the convention is:
  - Design 1' = continuous_at_zero (d_lower = 0, QUG case) — that's T21
  - Design 1  = continuous_near_d_lower (d_lower > 0)     — that's T20
T21 had Design 1 / Design 1' swapped throughout. Fixed in the build
script (Section 1 paper-step taxonomy, Section 2 panel framing,
Section 3 reading-the-verdict, Section 7 Extensions). Notebook
re-executed and review extract regenerated.

Two residual "QUG selects/picks the identification path" leakages from
the original prose also surfaced (Section 7 + Summary checklist). Both
contradicted the explicit QUG-vs-_detect_design separation locked by
test_had_design_auto_lands_on_continuous_at_zero. Reworded to keep the
two rules independent ("QUG fail-to-reject and `design="auto"`
heuristic both pointed independently"; "QUG is a statistical test on
H0; `design="auto"` calls _detect_design() which uses a min/median
heuristic — both pointed to continuous_at_zero on this panel").

P2 (MT1) — T21 was mapped under had_pretests.py in doc-deps.yaml but
the drift test now also locks HAD(design="auto") / _detect_design()
behavior from had.py via test_had_design_auto_lands_on_continuous_at_zero.
Add T21 entry to the had.py docs block with a note on the
_detect_design() drift coverage so a future had.py design-selection
change does not miss T21 in the manual docs-impact map.

All 16 drift tests still pass on Rust; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber and others added 3 commits May 10, 2026 11:27
Two methodology framing errors conflated in the original tutorial:

- "Paper Step 4" was described as "Boundary continuity (Assumptions 5/6)"
  in the workflow taxonomy. Per REGISTRY's pretest workflow (lines
  2482-2487 surrounding the four-step enumeration), Step 4 is actually
  the DECISION RULE: "if Steps 1-3 don't reject, TWFE may be used."
  Boundary-continuity assumptions are a separate concern.

- Assumptions 5/6 are Design 1 (continuous_near_d_lower / mass_point)
  identification caveats — the library emits a UserWarning citing them
  on Design 1 fits and stays silent on Design 1' (continuous_at_zero)
  fits per REGISTRY:2532 and had.py. T21's panel resolves to Design 1'
  via QUG fail-to-reject + the _detect_design() heuristic, so the
  relevant non-testable caveat is **Assumption 3** (uniform continuity
  of d -> Y_2(d) at zero, REGISTRY:2270), NOT Assumptions 5/6.
  Inherited the 5/6 framing from T20 (which IS Design 1) inappropriately.

Reframed across 7 surfaces in the build script:
- Section 1 four-step enumeration: Step 4 is now the decision rule
- Section 1: added a separate paragraph for the non-testable
  identification caveat that's design-path-specific (Assumption 3 for
  Design 1', Assumptions 5/6 for Design 1) and explicitly notes the
  library's UserWarning behavior matches this split
- Section 4 event-study verdict reading: separated Step 4 (decision
  rule) from the Design 1' caveat
- Section 4 horizon-detail closing: same split
- Section 6 leadership template: replaced "Step 4 / Assumptions 5/6"
  caveat with the correct Design 1' caveat (Assumption 3); explicit
  parenthetical noting T20's caveat was different because T20 was
  Design 1
- Section 6 bottom line: same split (decision rule vs caveat)
- Section 8 summary checklist: replaced single Step-4-as-caveat
  bullet with a two-part bullet on the workflow vs caveat distinction

Notebook re-executed, review extract regenerated. All 16 drift tests
still pass; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion rule

Per REGISTRY's pretest workflow (lines 2482-2487 surrounding the
four-step enumeration) and the same module's two correctly-framed
docstrings (module-level at line 54, `_compose_verdict_event_study` at
line 2735), paper Step 4 is the decision rule "use TWFE if none of the
tests rejects" — NOT a test or assumption. The Yatchew-HR test is a
Step 3 alternative (a single-horizon swap-in for Stute), and there is
no joint Yatchew variant in the paper.

The `did_had_pretest_workflow()` docstring at L4445-4446 was the only
place in the file that mislabeled the Yatchew alternative as "Step 4".
Reword to:
- correctly tag Yatchew as a step-3 alternative
- explicitly state Step 4 is the decision rule with no code path
- cross-reference the two existing correctly-framed docstrings in the
  same file so future readers can confirm the convention is consistent

Surfaced by PR #409 R6 review (T21 tutorial pre-test taxonomy work
made the inconsistency visible). All 42 workflow-specific tests in
`tests/test_had_pretests.py` still pass; T21 drift test (16 tests)
still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bootstrap p-value drift tests had lower-bound-only assertions:
- `test_overall_stute_fails_to_reject`: was `p > 0.50`, tutorial quotes
  ~0.686 → would silently pass if p drifted to 0.99
- `test_event_study_homogeneity_fails_to_reject`: was `p > 0.50`,
  tutorial quotes ~0.763 → same silent-stale risk

The third bootstrap test (`test_event_study_pretrends_fails_to_reject`)
already used a bounded band `0.0 <= p <= 0.25`. Mirror that pattern on
the other two with bounded bands per
`feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15
width):
- Stute: 0.53 <= p <= 0.84 (band ~0.31 around 0.686)
- Homogeneity: 0.61 <= p <= 0.92 (band ~0.31 around 0.763)

Both bands wide enough for Rust ↔ pure-Python RNG path differences;
both narrow enough that drift in either direction (toward rejection
or toward an even cleaner pass) flags the prose as stale.

All 16 drift tests pass on both backends within the new bands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the had-tutorial-21-pretest-workflow branch from 2f2c7ca to 3ab7a86 on May 10, 2026 15:28
@igerber igerber added the ready-for-ci Triggers CI test workflows label May 10, 2026
@igerber igerber merged commit 6a1e5b2 into main May 10, 2026
28 of 29 checks passed
@igerber igerber deleted the had-tutorial-21-pretest-workflow branch May 10, 2026 16:39