Expose virtual columns from the Arrow Parquet reader in datasource-parquet #20133
It would be useful to expose the virtual columns we added to the arrow Parquet reader in apache/arrow-rs#8715 through the datasource-parquet `ParquetSource`. Engines could then use both DataFusion's partition value machinery and the virtual columns. I made a go at it in this PR, but hit some rough edges. This is closer to an issue than a PR, but it is easier to explain with code.

The virtual columns we added are a bit difficult to integrate cleanly today. They are part of the physical schema of the Parquet reader, but cannot currently be projected. We need additional handling to avoid predicate pushdown for virtual columns, to build the correct projection mask, and to build the correct stream schema. See the changes to `opener.rs` in this PR.

One alternative would be to modify the arrow-rs implementation to remove these workarounds. Then the only change to `opener.rs` would be `.with_virtual_columns(virtual_columns.to_vec())?` (and maybe even that could be avoided; see the discussion below).

What would be the best way forward here?
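To make the projection-mask issue concrete, here is a minimal, self-contained sketch of the bookkeeping involved. The types and function are hypothetical stand-ins, not the actual `opener.rs` code: the idea is that a projection over the full logical schema must be remapped to indices in the physical Parquet schema, because virtual columns have no backing Parquet leaf.

```rust
/// Hypothetical column descriptor; the real code derives this from the
/// Arrow schema plus the configured virtual columns.
#[allow(dead_code)]
struct Column {
    name: &'static str,
    is_virtual: bool,
}

/// Map a projection over the full logical schema to indices in the
/// physical Parquet schema, dropping virtual columns (which have no
/// corresponding Parquet leaf and so cannot appear in the mask).
fn physical_projection(schema: &[Column], requested: &[usize]) -> Vec<usize> {
    // For each logical index, its index among physical columns (if any).
    let mut physical_index = Vec::with_capacity(schema.len());
    let mut next = 0usize;
    for col in schema {
        if col.is_virtual {
            physical_index.push(None);
        } else {
            physical_index.push(Some(next));
            next += 1;
        }
    }
    requested.iter().filter_map(|&i| physical_index[i]).collect()
}
```

The same index-shifting shows up when avoiding predicate pushdown on virtual columns and when assembling the stream schema, which is why the current workarounds in `opener.rs` are spread across several places.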
Related to #20132
## Aside on `.with_virtual_columns`

It is redundant that the user needs to both specify `Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber)` and add the column in a special way to the reader options with `.with_virtual_columns(virtual_columns.to_vec())?`. When the extension type `RowNumber` is present, we already know the column is virtual.

All users of the `TableSchema`/`ParquetSource` must know that a schema is built out of three parts: the physical Parquet columns, the virtual columns, and the partition columns. From a user's perspective, they would just like to supply a single schema.

One alternative is to indicate the column kind only through extension types, so the user supplies just a schema: an extension type would mark a column as a partition column or virtual column, instead of the user supplying this information piecemeal. This may have a performance impact, as we would likely need to scan for the different extension-type columns during planning, which could be problematic for large schemas.
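A rough sketch of that alternative: classify each column by its extension-type metadata rather than by separate virtual/partition lists. `Field` below is a simplified stand-in for `arrow_schema::Field`; `ARROW:extension:name` is the standard Arrow metadata key, but the specific extension names used here (especially the partition marker) are hypothetical.

```rust
use std::collections::HashMap;

/// Simplified stand-in for arrow_schema::Field.
#[allow(dead_code)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

enum ColumnKind {
    Physical,
    Virtual,
    Partition,
}

/// Derive the column kind from extension-type metadata alone, so the
/// user only supplies one schema. The extension names are illustrative.
fn column_kind(field: &Field) -> ColumnKind {
    match field.metadata.get("ARROW:extension:name").map(String::as_str) {
        Some("arrow.row_number") => ColumnKind::Virtual,
        Some("datafusion.partition") => ColumnKind::Partition,
        _ => ColumnKind::Physical,
    }
}
```

The planning-time cost mentioned above comes from exactly this scan: for a wide schema, every field's metadata must be inspected to split the schema back into its three parts.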