Add extend_single_year_dataset for fast dataset year projection #7700
Conversation
PR Review

🔴 Critical (Must Fix)
1. If both …
2. HDFStore (PyTables) files accessed via:

    with pd.HDFStore(file_path, mode="r") as store:
        return bool(entity_names & {k.strip("/") for k in store.keys()})

3. No handling of … The dual-path detection handles …

🟡 Should Address
4. …
5. …
6. Test mocking strategy is fragile
7. No tests for file I/O paths
8. …

🟢 Suggestions
Validation Summary
Recommendation: Address the …

To auto-fix issues: …
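The HDFStore key check quoted in the review above can be illustrated with a dependency-free sketch. The real `_is_hdfstore_format` opens the file with `pd.HDFStore`; here the top-level keys are passed in directly, and `US_ENTITIES` and `looks_like_hdfstore` are illustrative names, not the PR's exact API:

```python
# Sketch only: operates on a key set instead of opening a real HDF5 file.
US_ENTITIES = {"person", "household", "tax_unit", "spm_unit", "family", "marital_unit"}

def looks_like_hdfstore(top_level_keys) -> bool:
    """Entity names at the top level indicate an entity-level HDFStore file;
    variable names indicate a legacy variable-centric h5py file."""
    # HDFStore reports keys with a leading slash, e.g. "/person".
    keys = {k.strip("/") for k in top_level_keys}
    return bool(US_ENTITIES & keys)
```

A file whose top-level keys are `/person`, `/household`, etc. is treated as entity-level; one keyed by variable names falls through to the legacy path.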
Force-pushed from 1677593 to 4c98a8e (Compare)
Review fixes applied

All 8 review items have been addressed in commit …

Critical fixes
1. …
2. …
3. No …

Should-fix items
4. …
5. …
6. Test mocking strategy is fragile (…)
7. No tests for file I/O paths (…)
8. …

Summary

Total new tests added: 12 (34 total, up from 22). All pass in ~2s.
PR Review (Updated)

Previous review findings were mostly addressed in the "Fix review items" commit. This is a re-review of the current state.

🔴 Critical (Must Fix)
1. Missing …
2. Bare …
3. …

🟡 Should Address
4. …
5. …
6. …
7. Entity constant duplication
8. No validation for …
9. …
10. …
11. 14 unrelated whitespace-only changes

🟢 Suggestions
Validation Summary
Next Steps

To auto-fix issues: … Or address manually and re-request review.
Re-review fixes (commit a820388)

All 11 findings from the re-review have been addressed, plus 3 additional issues found during a follow-up code review.

Original review findings
…

Also extracted …

Additional issues found (pre-existing, not regressions)

These three issues were present in the original PR code and were caught by a follow-up code review. They predate the re-review:
…
Adds USSingleYearDataset and USMultiYearDataset schema classes, extend_single_year_dataset() with multiplicative uprating from the parameter tree, and dual-path loading in Microsimulation that auto-detects entity-level HDFStore files and extends them without routing through the simulation engine. Legacy h5py files continue to work via the existing code path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
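The multiplicative uprating from the parameter tree described above can be sketched with plain dicts standing in for the parameter tree and the entity DataFrames. `resolve_parameter` and `apply_uprating` are illustrative names (the real code's helpers are `_resolve_parameter` and `_apply_single_year_uprating`), and the flat `{year: value}` leaves are an assumption of this sketch:

```python
def resolve_parameter(params, dotted_path):
    """Walk a dotted path like 'calibration.employment_income' through a
    nested dict; return the {year: value} leaf, or None if any segment is missing."""
    node = params
    for segment in dotted_path.split("."):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node


def apply_uprating(columns, uprating_paths, params, current_year):
    """Multiply each uprated column by param(current_year) / param(current_year - 1).
    Columns without a resolvable uprating path (e.g. age, entity IDs) pass through."""
    out = {}
    for name, values in columns.items():
        path = uprating_paths.get(name)
        leaf = resolve_parameter(params, path) if path else None
        previous = leaf.get(current_year - 1) if leaf else None
        if not previous:  # unresolvable path, or division-by-zero guard
            out[name] = list(values)
            continue
        factor = leaf[current_year] / previous
        out[name] = [v * factor for v in values]
    return out
```

For example, with a parameter worth 100.0 in 2024 and 103.0 in 2025, employment income is scaled by 1.03 for 2025 while `age` passes through untouched.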
22 tests covering _resolve_parameter, _apply_single_year_uprating, and end-to-end extend_single_year_dataset. Uses mock system objects to avoid loading the full tax-benefit system (~0.3s total runtime). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix USMultiYearDataset.__init__ if/if bug (use if/elif/else, reject both or neither args)
- Fix validate_file_path to use pd.HDFStore instead of h5py
- Fix USSingleYearDataset.load() to detect duplicate column names
- Fix _is_hdfstore_format to use pd.HDFStore instead of h5py
- Fix _resolve_dataset_path to raise FileNotFoundError instead of returning None silently
- Add explicit USMultiYearDataset branch in Microsimulation.__init__
- Refactor test mocking to use patch.dict for thread safety
- Add 12 new tests: init validation, duplicate keys, format detection, path resolution, and file I/O roundtrips

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
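The duplicate-column-name detection mentioned in this commit can be sketched as follows: a column appearing in more than one entity table would make a flat `{variable: array}` view ambiguous. `find_duplicate_columns` is an illustrative name, not the PR's API, and plain dicts stand in for the entity DataFrames:

```python
from collections import Counter

def find_duplicate_columns(entity_tables: dict) -> list:
    """entity_tables maps entity -> {column_name: values}; return the
    column names that appear in more than one entity table, sorted."""
    counts = Counter(col for table in entity_tables.values() for col in table)
    return sorted(col for col, n in counts.items() if n > 1)
```

A loader can raise on a non-empty result instead of silently letting one entity's column shadow another's.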
- Add tables>=3.9 runtime dependency for pd.HDFStore (finding #1)
- Narrow bare except Exception to specific types in _is_hdfstore_format, validate_file_path (findings #2, reviewer #3)
- Open HDFStore in mode="r" in USSingleYearDataset and USMultiYearDataset constructors (findings #3, reviewer #2)
- Make optional entities (spm_unit, family, marital_unit) fall back to empty DataFrame when absent from HDF5 file (reviewer #1)
- Consolidate duplicate kwargs.get("dataset") in Microsimulation.__init__ and remove dead None check (findings #4, #5)
- Accept system=None in extend_single_year_dataset and _apply_uprating to allow direct injection, eliminating sys.modules patching in tests (#6)
- Import and reuse US_ENTITIES instead of inline duplication (#7)
- Add end_year >= start_year validation in extend_single_year_dataset (#8)
- Add duplicate-column detection in USMultiYearDataset.load() (#9)
- Remove unused validate() method (#10)
- Extract DEFAULT_END_YEAR constant (green suggestion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
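The "reject both or neither args" if/elif/else validation fixed in the earlier commit can be sketched like this. `resolve_init_args` and its argument names are illustrative; the real logic lives in `USMultiYearDataset.__init__`:

```python
def resolve_init_args(datasets_by_year=None, file_path=None):
    """Accept exactly one of the two construction arguments."""
    if datasets_by_year is not None and file_path is not None:
        raise ValueError("Pass either datasets_by_year or file_path, not both.")
    elif datasets_by_year is not None:
        return ("from_dict", datasets_by_year)
    elif file_path is not None:
        return ("from_file", file_path)  # the real code would open an HDFStore here
    else:
        raise ValueError("One of datasets_by_year or file_path is required.")
```

The original if/if version fell through when both arguments were supplied; the elif chain makes the four cases mutually exclusive.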
Force-pushed from a820388 to 2022c2a (Compare)
Program Review: PR #7700 — Add extend_single_year_dataset for fast dataset year projection

PR Type
Infrastructure — HDFStore dataset support and multiplicative uprating for year projection

CI Status
Critical (Must Fix)
Should Address
Suggestions
Validation Summary
Review Severity: COMMENT

Rationale: The append-mode bug in …

Next Steps

To auto-fix issues: …
- Add name, label, file_path properties to USSingleYearDataset and USMultiYearDataset for policyengine-core Simulation compatibility
- Fix USSingleYearDataset.save() append-mode bug (unlink before write)
- Extract _REQUIRED_ENTITIES and _DEFAULT_TIME_PERIOD constants
- Fix import shadowing in _apply_uprating (system -> _system)
- Remove dead elif-pass branch, add core-override documentation comment
- Create missing __init__.py files for test discovery
- Add 23 new tests: constructor edge cases, hf:// URL mock, legacy h5py compat, Microsimulation dataset routing integration, save regression

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
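The save() append-mode fix in this commit follows a simple pattern: `pd.HDFStore` opens an existing file in append mode by default, so keys from an earlier save can survive a re-save; deleting the file first guarantees a fresh write. In this sketch a plain text write stands in for HDFStore so there is no pandas/tables dependency, and `save_fresh` is an illustrative name:

```python
from pathlib import Path

def save_fresh(file_path, payload: str) -> None:
    path = Path(file_path)
    path.unlink(missing_ok=True)  # drop any stale file before writing
    path.write_text(payload)
```

Without the `unlink`, an HDFStore re-save to the same path could leave stale tables from the previous dataset alongside the new ones.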
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## main #7700 +/- ##
============================================
- Coverage 100.00% 55.83% -44.17%
============================================
Files 1 7 +6
Lines 34 120 +86
Branches 0 1 +1
============================================
+ Hits 34 67 +33
- Misses 0 53 +53
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@PavelMakarchuk made some updates and incorporated some of the review comments. I believe codecov fails like this because I made a couple minor edits to the US-specific …
The only remaining thing I worry about is the hard-coded: …
I chose to hard-code these because we have at times explicitly decided to generate up until a particular year, but not afterward. At the very least, I think the start year should be fixed, but if you think the end should just automatically be, e.g., 10 after current, I can adjust. This would mean that on January 1, 2027, we will automatically calculate 2037 uprating.
I think it should be based on our CPI projections as well, but yes, automatic would be great since I anticipate us forgetting about this by the end of 2026
Can you elaborate on basing it on the CPI projections? Do you have an envisioned method, or do you want me to propose one?
We have clear CPI projections which we track here - those are updated quarterly and we will need to extend those annually with a clean updating cadence |
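One way to read the suggestion above: derive the projection end year from the latest year covered by the CPI-U projection parameter, so the horizon advances automatically whenever the projections are extended. This is a proposal sketch, not anything in the PR; the flat `{year: index}` dict is a stand-in for the real parameter tree:

```python
def default_end_year(cpi_projections: dict) -> int:
    """Return the latest year for which a CPI-U projection exists."""
    if not cpi_projections:
        raise ValueError("No CPI projections available.")
    return max(cpi_projections)
```

With this, extending the quarterly-updated projections out another year would automatically move the dataset extension horizon with it.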
Which of the following are you saying:
We need USSingleYear datasets to automatically extend out to whatever year we update the CPI-U to (for now, 2035) |
Fixes #7699
Why this is needed
The API v2 alpha and the `policyengine` Python package require entity-level Pandas HDFStore datasets (one table per entity: person, household, tax_unit, etc.) to run microsimulations. The current US data pipeline (policyengine-us-data) publishes variable-centric h5py files (variable/year → array), so converting between the two formats currently requires routing every variable through `sim.calculate()` via `create_datasets()` — a process that takes over an hour per state and doesn't scale to the 500+ geographic datasets we need to serve.

The UK avoids this entirely: `policyengine-uk-data` publishes entity-level HDFStore files directly, and `policyengine-uk` has `extend_single_year_dataset()`, which projects a single base-year dataset to multiple years via simple multiplicative uprating on DataFrames — no simulation engine involved. This PR brings the same capability to the US.

How it works
Dataset schema classes (`dataset_schema.py`)

`USSingleYearDataset` holds six entity DataFrames (person, household, tax_unit, spm_unit, family, marital_unit) plus a `time_period`. It can load from / save to Pandas HDFStore files, and provides `.copy()` for deep-copying all DataFrames. `USMultiYearDataset` wraps a `dict[int, USSingleYearDataset]` keyed by year. Its `.load()` returns data in `{variable: {year: array}}` format (`time_period_arrays`), which is what policyengine-core's `Microsimulation` expects for multi-year datasets.

Uprating logic (`economic_assumptions.py`)

`extend_single_year_dataset(dataset, end_year=2035)` takes a single base-year dataset and produces a multi-year dataset by:

- iterating over each year from `base_year` through `end_year`
- for each uprated variable, consulting `system.variables[var].uprating` to get a dotted parameter path (e.g. `"calibration.gov.irs.soi.employment_income"`), resolving it against `system.parameters`, and computing `factor = param(current_year) / param(previous_year)`; the column values are then multiplied by that factor
- carrying variables without an uprating parameter (e.g. `age`, entity IDs) forward unchanged

This is the same approach used by
`policyengine-uk`. The uprating mapping is derived entirely from `system.variables` at runtime — the 62 variables with explicit `uprating = "..."` and the 108 variables assigned via `default_uprating.py` are all picked up automatically. No separate list to maintain.

Dual-path loading (`system.py`)

`Microsimulation.__init__` now auto-detects dataset format before calling `super().__init__()`:

- Entity-level HDFStore format (entity names like `person`, `household` as top-level HDF5 keys): loads as `USSingleYearDataset`, extends via `extend_single_year_dataset()`, and passes the resulting `USMultiYearDataset` to policyengine-core.
- Legacy h5py format: handled by the core `Microsimulation` code path, unchanged.

Format detection (
`_is_hdfstore_format`) inspects the top-level HDF5 keys — entity names indicate HDFStore, variable names indicate h5py.

How we verify correctness
Unit tests (22 tests, ~0.3s)
The test suite in `tests/microsimulation/data/` uses mock system objects (mock parameters, mock variables) to avoid loading the full tax-benefit system, keeping tests fast and deterministic. Coverage includes:

- `_resolve_parameter` (3 tests): valid dotted path, invalid path, partially valid path
- `_apply_single_year_uprating` (7 tests): correct multiplicative scaling, non-uprated variables unchanged, household entity uprating, unknown columns passed through, unresolvable uprating path, division-by-zero guard (previous param value = 0), zero base values preserved
- `extend_single_year_dataset` (12 tests): correct year count, single-year edge case, default end year (2035), base year values unchanged, year 1 uprating, year 2 chaining (verifies uprating compounds from year N to N+1 to N+2, not from base), non-uprated variable identical across all years, row counts preserved, time_period correctness per year, return type, input dataset immutability, multi-entity uprating (person + household)

Roundtrip validation (policyengine-us-data PR #568)
A separate one-off validation script in `policyengine-us-data` reads an existing h5py state dataset (e.g. NV.h5), converts it to HDFStore using the same splitting logic, and compares all ~183 variables between the two formats. This passed 183/183 on the Nevada dataset.

Depends on
Test plan
- `make test-other` passes (runs the 22 unit tests via pytest)
- `Microsimulation(dataset="path/to/STATE.hdfstore.h5")` — verify it loads and extends correctly
- `Microsimulation(dataset="path/to/STATE.h5")` — verify existing path still works
- Uprated variables (e.g. `employment_income`) grow year-over-year
- Non-uprated variables (e.g. `age`) are carried forward unchanged

🤖 Generated with Claude Code
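The variable-by-variable comparison used by the roundtrip validation can be sketched as follows. `compare_datasets` is an illustrative name and plain dicts of value lists stand in for the two file formats; the real script lives in policyengine-us-data:

```python
def compare_datasets(legacy: dict, converted: dict):
    """Compare {variable: values} mappings; return
    (number_of_matching_variables, sorted mismatched variable names)."""
    mismatches = [
        var for var in legacy
        if var not in converted or list(legacy[var]) != list(converted[var])
    ]
    return len(legacy) - len(mismatches), sorted(mismatches)
```

On the Nevada dataset the real check of this shape reported 183/183 variables matching between the h5py and HDFStore representations.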