Add extend_single_year_dataset for fast dataset year projection #7700
Conversation
PR Review

🔴 Critical (Must Fix)
1. If both …
2. HDFStore (PyTables) files accessed via:

    with pd.HDFStore(file_path, mode="r") as store:
        return bool(entity_names & {k.strip("/") for k in store.keys()})

3. No handling of … The dual-path detection handles …

🟡 Should Address
4. …
5. …
6. Test mocking strategy is fragile
7. No tests for file I/O paths
8. …

🟢 Suggestions
Validation Summary
Recommendation: Address the …

To auto-fix issues: …
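The HDFStore key check quoted in the review above can be illustrated with a dependency-free sketch. The real `_is_hdfstore_format` opens the file with `pd.HDFStore`; here the top-level keys are passed in directly, and `US_ENTITIES` and `looks_like_hdfstore` are illustrative names, not the PR's exact API:

```python
# Sketch only: operates on a key set instead of opening a real HDF5 file.
US_ENTITIES = {"person", "household", "tax_unit", "spm_unit", "family", "marital_unit"}

def looks_like_hdfstore(top_level_keys) -> bool:
    """Entity names at the top level indicate an entity-level HDFStore file;
    variable names indicate a legacy variable-centric h5py file."""
    # HDFStore reports keys with a leading slash, e.g. "/person".
    keys = {k.strip("/") for k in top_level_keys}
    return bool(US_ENTITIES & keys)
```

A file whose top-level keys are `/person`, `/household`, etc. is treated as entity-level; one keyed by variable names falls through to the legacy path.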
Force-pushed from 1677593 to 4c98a8e (Compare)
Review fixes applied

All 8 review items have been addressed in commit …

Critical fixes
1. …
2. …
3. No …

Should-fix items
4. …
5. …
6. Test mocking strategy is fragile (…)
7. No tests for file I/O paths (…)
8. …

Summary

Total new tests added: 12 (34 total, up from 22). All pass in ~2s.
PR Review (Updated)

Previous review findings were mostly addressed in the "Fix review items" commit. This is a re-review of the current state.

🔴 Critical (Must Fix)
1. Missing …
2. Bare …
3. …

🟡 Should Address
4. …
5. …
6. …
7. Entity constant duplication
8. No validation for …
9. …
10. …
11. 14 unrelated whitespace-only changes

🟢 Suggestions
Validation Summary
Next Steps

To auto-fix issues: … Or address manually and re-request review.
Re-review fixes (commit a820388)

All 11 findings from the re-review have been addressed, plus 3 additional issues found during a follow-up code review.

Original review findings
…

Also extracted …

Additional issues found (pre-existing, not regressions)

These three issues were present in the original PR code and were caught by a follow-up code review. They predate the re-review:
…
Adds USSingleYearDataset and USMultiYearDataset schema classes, extend_single_year_dataset() with multiplicative uprating from the parameter tree, and dual-path loading in Microsimulation that auto-detects entity-level HDFStore files and extends them without routing through the simulation engine. Legacy h5py files continue to work via the existing code path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
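The multiplicative uprating from the parameter tree described above can be sketched with plain dicts standing in for the parameter tree and the entity DataFrames. `resolve_parameter` and `apply_uprating` are illustrative names (the real code's helpers are `_resolve_parameter` and `_apply_single_year_uprating`), and the flat `{year: value}` leaves are an assumption of this sketch:

```python
def resolve_parameter(params, dotted_path):
    """Walk a dotted path like 'calibration.employment_income' through a
    nested dict; return the {year: value} leaf, or None if any segment is missing."""
    node = params
    for segment in dotted_path.split("."):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node


def apply_uprating(columns, uprating_paths, params, current_year):
    """Multiply each uprated column by param(current_year) / param(current_year - 1).
    Columns without a resolvable uprating path (e.g. age, entity IDs) pass through."""
    out = {}
    for name, values in columns.items():
        path = uprating_paths.get(name)
        leaf = resolve_parameter(params, path) if path else None
        previous = leaf.get(current_year - 1) if leaf else None
        if not previous:  # unresolvable path, or division-by-zero guard
            out[name] = list(values)
            continue
        factor = leaf[current_year] / previous
        out[name] = [v * factor for v in values]
    return out
```

For example, with a parameter worth 100.0 in 2024 and 103.0 in 2025, employment income is scaled by 1.03 for 2025 while `age` passes through untouched.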
22 tests covering _resolve_parameter, _apply_single_year_uprating, and end-to-end extend_single_year_dataset. Uses mock system objects to avoid loading the full tax-benefit system (~0.3s total runtime). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix USMultiYearDataset.__init__ if/if bug (use if/elif/else, reject both or neither args)
- Fix validate_file_path to use pd.HDFStore instead of h5py
- Fix USSingleYearDataset.load() to detect duplicate column names
- Fix _is_hdfstore_format to use pd.HDFStore instead of h5py
- Fix _resolve_dataset_path to raise FileNotFoundError instead of returning None silently
- Add explicit USMultiYearDataset branch in Microsimulation.__init__
- Refactor test mocking to use patch.dict for thread safety
- Add 12 new tests: init validation, duplicate keys, format detection, path resolution, and file I/O roundtrips

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
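The duplicate-column-name detection mentioned in this commit can be sketched as follows: a column appearing in more than one entity table would make a flat `{variable: array}` view ambiguous. `find_duplicate_columns` is an illustrative name, not the PR's API, and plain dicts stand in for the entity DataFrames:

```python
from collections import Counter

def find_duplicate_columns(entity_tables: dict) -> list:
    """entity_tables maps entity -> {column_name: values}; return the
    column names that appear in more than one entity table, sorted."""
    counts = Counter(col for table in entity_tables.values() for col in table)
    return sorted(col for col, n in counts.items() if n > 1)
```

A loader can raise on a non-empty result instead of silently letting one entity's column shadow another's.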
- Add tables>=3.9 runtime dependency for pd.HDFStore (finding #1)
- Narrow bare except Exception to specific types in _is_hdfstore_format, validate_file_path (findings #2, reviewer #3)
- Open HDFStore in mode="r" in USSingleYearDataset and USMultiYearDataset constructors (findings #3, reviewer #2)
- Make optional entities (spm_unit, family, marital_unit) fall back to empty DataFrame when absent from HDF5 file (reviewer #1)
- Consolidate duplicate kwargs.get("dataset") in Microsimulation.__init__ and remove dead None check (findings #4, #5)
- Accept system=None in extend_single_year_dataset and _apply_uprating to allow direct injection, eliminating sys.modules patching in tests (#6)
- Import and reuse US_ENTITIES instead of inline duplication (#7)
- Add end_year >= start_year validation in extend_single_year_dataset (#8)
- Add duplicate-column detection in USMultiYearDataset.load() (#9)
- Remove unused validate() method (#10)
- Extract DEFAULT_END_YEAR constant (green suggestion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
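The "reject both or neither args" if/elif/else validation fixed in the earlier commit can be sketched like this. `resolve_init_args` and its argument names are illustrative; the real logic lives in `USMultiYearDataset.__init__`:

```python
def resolve_init_args(datasets_by_year=None, file_path=None):
    """Accept exactly one of the two construction arguments."""
    if datasets_by_year is not None and file_path is not None:
        raise ValueError("Pass either datasets_by_year or file_path, not both.")
    elif datasets_by_year is not None:
        return ("from_dict", datasets_by_year)
    elif file_path is not None:
        return ("from_file", file_path)  # the real code would open an HDFStore here
    else:
        raise ValueError("One of datasets_by_year or file_path is required.")
```

The original if/if version fell through when both arguments were supplied; the elif chain makes the four cases mutually exclusive.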
Force-pushed from a820388 to 2022c2a (Compare)
Program Review: PR #7700 — Add extend_single_year_dataset for fast dataset year projection

PR Type
Infrastructure — HDFStore dataset support and multiplicative uprating for year projection

CI Status
Critical (Must Fix)
Should Address
Suggestions
Validation Summary
Review Severity: COMMENT

Rationale: The append-mode bug in …

Next Steps

To auto-fix issues: …
- Add name, label, file_path properties to USSingleYearDataset and USMultiYearDataset for policyengine-core Simulation compatibility
- Fix USSingleYearDataset.save() append-mode bug (unlink before write)
- Extract _REQUIRED_ENTITIES and _DEFAULT_TIME_PERIOD constants
- Fix import shadowing in _apply_uprating (system -> _system)
- Remove dead elif-pass branch, add core-override documentation comment
- Create missing __init__.py files for test discovery
- Add 23 new tests: constructor edge cases, hf:// URL mock, legacy h5py compat, Microsimulation dataset routing integration, save regression

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
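The save() append-mode fix in this commit follows a simple pattern: `pd.HDFStore` opens an existing file in append mode by default, so keys from an earlier save can survive a re-save; deleting the file first guarantees a fresh write. In this sketch a plain text write stands in for HDFStore so there is no pandas/tables dependency, and `save_fresh` is an illustrative name:

```python
from pathlib import Path

def save_fresh(file_path, payload: str) -> None:
    path = Path(file_path)
    path.unlink(missing_ok=True)  # drop any stale file before writing
    path.write_text(payload)
```

Without the `unlink`, an HDFStore re-save to the same path could leave stale tables from the previous dataset alongside the new ones.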
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## main #7700 +/- ##
============================================
- Coverage 100.00% 55.83% -44.17%
============================================
Files 1 7 +6
Lines 34 120 +86
Branches 0 1 +1
============================================
+ Hits 34 67 +33
- Misses 0 53 +53
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@PavelMakarchuk made some updates and incorporated some of the review comments. I believe codecov fails like this because I made a couple minor edits to the US-specific …
The only remaining thing I worry about is the hard-coded: …
I chose to hard-code these because we have at times explicitly decided to generate up until a particular year, but not afterward. At the very least, I think the start year should be fixed, but if you think the end should just automatically be, e.g., 10 after current, I can adjust. This would mean that on January 1, 2027, we will automatically calculate 2037 uprating.
I think it should be based on our CPI projections as well, but yes, automatic would be great since I anticipate us forgetting about this by the end of 2026
Can you elaborate on basing it on the CPI projections? Do you have an envisioned method, or do you want me to propose one?
We have clear CPI projections which we track here - those are updated quarterly and we will need to extend those annually with a clean updating cadence |
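One way to read the suggestion above: derive the projection end year from the latest year covered by the CPI-U projection parameter, so the horizon advances automatically whenever the projections are extended. This is a proposal sketch, not anything in the PR; the flat `{year: index}` dict is a stand-in for the real parameter tree:

```python
def default_end_year(cpi_projections: dict) -> int:
    """Return the latest year for which a CPI-U projection exists."""
    if not cpi_projections:
        raise ValueError("No CPI projections available.")
    return max(cpi_projections)
```

With this, extending the quarterly-updated projections out another year would automatically move the dataset extension horizon with it.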
Which of the following are you saying:
We need USSingleYear datasets to automatically extend out to whatever year we update the CPI-U to (for now, 2035) |
Fixes #7699
Why this is needed
The API v2 alpha and the `policyengine` Python package require entity-level Pandas HDFStore datasets (one table per entity: person, household, tax_unit, etc.) to run microsimulations. The current US data pipeline (policyengine-us-data) publishes variable-centric h5py files (variable/year → array), so converting between the two formats currently requires routing every variable through `sim.calculate()` via `create_datasets()` — a process that takes over an hour per state and doesn't scale to the 500+ geographic datasets we need to serve.

The UK avoids this entirely: `policyengine-uk-data` publishes entity-level HDFStore files directly, and `policyengine-uk` has `extend_single_year_dataset()`, which projects a single base-year dataset to multiple years via simple multiplicative uprating on DataFrames — no simulation engine involved. This PR brings the same capability to the US.

How it works
Dataset schema classes (`dataset_schema.py`)

`USSingleYearDataset` holds six entity DataFrames (person, household, tax_unit, spm_unit, family, marital_unit) plus a `time_period`. It can load from / save to Pandas HDFStore files, and provides `.copy()` for deep-copying all DataFrames. `USMultiYearDataset` wraps a `dict[int, USSingleYearDataset]` keyed by year. Its `.load()` returns data in `{variable: {year: array}}` format (`time_period_arrays`), which is what policyengine-core's `Microsimulation` expects for multi-year datasets.

Uprating logic (`economic_assumptions.py`)

`extend_single_year_dataset(dataset, end_year=2035)` takes a single base-year dataset and produces a multi-year dataset by:

- iterating over each year from `base_year` through `end_year`
- for each uprated variable, consulting `system.variables[var].uprating` to get a dotted parameter path (e.g. `"calibration.gov.irs.soi.employment_income"`), resolving it against `system.parameters`, and computing `factor = param(current_year) / param(previous_year)`; the column values are then multiplied by that factor
- carrying variables without an uprating parameter (e.g. `age`, entity IDs) forward unchanged

This is the same approach used by
`policyengine-uk`. The uprating mapping is derived entirely from `system.variables` at runtime — the 62 variables with explicit `uprating = "..."` and the 108 variables assigned via `default_uprating.py` are all picked up automatically. No separate list to maintain.

Dual-path loading (`system.py`)

`Microsimulation.__init__` now auto-detects dataset format before calling `super().__init__()`:

- Entity-level HDFStore format (entity names like `person`, `household` as top-level HDF5 keys): loads as `USSingleYearDataset`, extends via `extend_single_year_dataset()`, and passes the resulting `USMultiYearDataset` to policyengine-core.
- Legacy h5py format: handled by the core `Microsimulation` code path, unchanged.

Format detection (
`_is_hdfstore_format`) inspects the top-level HDF5 keys — entity names indicate HDFStore, variable names indicate h5py.

How we verify correctness
Unit tests (22 tests, ~0.3s)
The test suite in `tests/microsimulation/data/` uses mock system objects (mock parameters, mock variables) to avoid loading the full tax-benefit system, keeping tests fast and deterministic. Coverage includes:

- `_resolve_parameter` (3 tests): valid dotted path, invalid path, partially valid path
- `_apply_single_year_uprating` (7 tests): correct multiplicative scaling, non-uprated variables unchanged, household entity uprating, unknown columns passed through, unresolvable uprating path, division-by-zero guard (previous param value = 0), zero base values preserved
- `extend_single_year_dataset` (12 tests): correct year count, single-year edge case, default end year (2035), base year values unchanged, year 1 uprating, year 2 chaining (verifies uprating compounds from year N to N+1 to N+2, not from base), non-uprated variable identical across all years, row counts preserved, time_period correctness per year, return type, input dataset immutability, multi-entity uprating (person + household)

Roundtrip validation (policyengine-us-data PR #568)
A separate one-off validation script in `policyengine-us-data` reads an existing h5py state dataset (e.g. NV.h5), converts it to HDFStore using the same splitting logic, and compares all ~183 variables between the two formats. This passed 183/183 on the Nevada dataset.

Depends on
Test plan
- `make test-other` passes (runs the 22 unit tests via pytest)
- `Microsimulation(dataset="path/to/STATE.hdfstore.h5")` — verify it loads and extends correctly
- `Microsimulation(dataset="path/to/STATE.h5")` — verify existing path still works
- Uprated variables (e.g. `employment_income`) grow year-over-year
- Non-uprated variables (e.g. `age`) are carried forward unchanged

🤖 Generated with Claude Code
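The variable-by-variable comparison used by the roundtrip validation can be sketched as follows. `compare_datasets` is an illustrative name and plain dicts of value lists stand in for the two file formats; the real script lives in policyengine-us-data:

```python
def compare_datasets(legacy: dict, converted: dict):
    """Compare {variable: values} mappings; return
    (number_of_matching_variables, sorted mismatched variable names)."""
    mismatches = [
        var for var in legacy
        if var not in converted or list(legacy[var]) != list(converted[var])
    ]
    return len(legacy) - len(mismatches), sorted(mismatches)
```

On the Nevada dataset the real check of this shape reported 183/183 variables matching between the h5py and HDFStore representations.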