[AURON #2015] Add Native Scan Support for Apache Iceberg Copy-On-Write Tables. #2016

Open
slfan1989 wants to merge 3 commits into apache:master from slfan1989:auron-2015

@slfan1989 (Contributor) commented Feb 18, 2026

Which issue does this PR close?

Closes #2015

Rationale for this change

This PR adds native scan support for Apache Iceberg Copy-On-Write (COW) tables to improve query performance. Auron currently has no direct integration with Iceberg, so every Iceberg query runs on Spark's own (non-native) execution path and gets no acceleration from Auron's native engine.

Key Motivations:

  • Enable Auron's native execution engine to read Iceberg tables directly
  • Leverage native performance optimizations for Iceberg COW tables
  • Provide automatic fallback to Spark scan for unsupported scenarios
  • Lay the foundation for future Iceberg feature enhancements (MOR tables, pruning predicates, etc.)

What changes are included in this PR?

Core Implementation:

  • IcebergConvertProvider - SPI extension point that detects Iceberg scans and decides whether to use native execution
  • IcebergScanSupport - Decision logic that validates scan plans and checks for COW table eligibility
  • NativeIcebergTableScanExec - Native execution node that converts Iceberg FileScanTask to native scan plans
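
The division of labour between the three components can be illustrated with a minimal, self-contained sketch. Every type and method name below (`Plan`, `SparkScan`, `NativeScan`, `ConvertProvider`, `convertOrFallback`) is hypothetical and chosen only for illustration; this is not Auron's actual SPI, only the general shape of "detect, decide, convert, otherwise fall back".

```scala
// Hypothetical, simplified model of the provider pattern described above.
// None of these types are Auron's real classes; they only mirror the roles.
object ProviderSketch {
  sealed trait Plan
  final case class SparkScan(source: String, hasDeleteFiles: Boolean) extends Plan
  final case class NativeScan(source: String) extends Plan

  // Extension point: a provider reports whether it can handle a node, then converts it.
  trait ConvertProvider {
    def isSupported(plan: Plan): Boolean
    def convert(plan: Plan): Plan
  }

  // Stand-in for the Iceberg provider: accepts only Iceberg scans without delete files.
  object IcebergProviderSketch extends ConvertProvider {
    def isSupported(plan: Plan): Boolean = plan match {
      case SparkScan("iceberg", hasDeleteFiles) => !hasDeleteFiles
      case _                                    => false
    }

    def convert(plan: Plan): Plan = plan match {
      case SparkScan(source, _) => NativeScan(source)
      case other                => other
    }
  }

  // Use the first provider that supports the node; otherwise keep the Spark plan.
  def convertOrFallback(plan: Plan, providers: Seq[ConvertProvider]): Plan =
    providers.find(_.isSupported(plan)).map(_.convert(plan)).getOrElse(plan)
}
```

In this sketch, an unsupported node is simply returned unchanged, which is one plausible way to model the automatic fallback to Spark's scan.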

Build & Configuration:

  • Updated pom.xml with Iceberg version management and Maven enforcer rules
  • Modified auron-build.sh to support Iceberg build parameters
  • Added configuration option: spark.auron.enable.iceberg.scan (default: true)
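
As a rough illustration of how a boolean option with a default of `true` behaves, the sketch below resolves the flag from a plain `Map`. The key and default come from this description; the object name, the method name, and the use of a `Map` in place of a real Spark configuration are assumptions made to keep the example self-contained.

```scala
// Hypothetical helper; a plain Map stands in for the Spark configuration.
object IcebergScanConfigSketch {
  // Key and default value as stated in the PR description.
  val EnableIcebergScanKey = "spark.auron.enable.iceberg.scan"
  val EnableIcebergScanDefault = true

  // An unset key yields the default; an explicit value overrides it.
  def nativeIcebergScanEnabled(conf: Map[String, String]): Boolean =
    conf.get(EnableIcebergScanKey)
      .map(_.trim.toBoolean)
      .getOrElse(EnableIcebergScanDefault)
}
```

In a real job the option would presumably be passed like any other Spark setting, for example `--conf spark.auron.enable.iceberg.scan=false` to force the Spark scan path.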

Supported Features:

  • Iceberg COW tables (Parquet and ORC formats)
  • Projection pushdown (column pruning)
  • Partitioned and non-partitioned tables
  • Automatic fallback for unsupported scenarios

Version Support:

  • Spark: 3.4, 3.5, 4.0 only
  • Iceberg: 1.10.1 only (enforced by Maven)
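
The version constraint can be expressed as a small predicate. This is only a sketch of the rule stated above (Spark 3.4, 3.5, or 4.0 together with Iceberg 1.10.1); the object and function names are hypothetical, and the PR itself enforces the constraint through Maven enforcer rules rather than through this code.

```scala
// Hypothetical version gate mirroring the support matrix listed above.
object VersionGateSketch {
  val SupportedSparkLines: Set[String] = Set("3.4", "3.5", "4.0")
  val SupportedIcebergVersion = "1.10.1"

  // Reduce a full version such as "3.5.8" to its "major.minor" line.
  def majorMinor(version: String): String =
    version.split('.').take(2).mkString(".")

  // Both the Spark line and the exact Iceberg version must match.
  def isSupported(sparkVersion: String, icebergVersion: String): Boolean =
    SupportedSparkLines.contains(majorMinor(sparkVersion)) &&
      icebergVersion == SupportedIcebergVersion
}
```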

Are there any user-facing changes?

No Breaking Changes: Existing functionality remains unchanged. Iceberg support is additive: the native scan is enabled by default (spark.auron.enable.iceberg.scan=true) and falls back to Spark's scan automatically in unsupported scenarios.

How was this patch tested?

Unit & Integration Tests:

  • Added 10 integration test cases in AuronIcebergIntegrationSuite:
    • Simple COW table scan
    • Projection pushdown
    • Partitioned table with partition filter
    • ORC format support
    • Empty table handling
    • Residual filters fallback
    • Metadata columns fallback
    • Decimal type fallback
    • Delete files (MOR) fallback
    • Configuration toggle functionality

Test Environment:

  • Spark versions: 3.4.4, 3.5.8, 4.0.2
  • Iceberg version: 1.10.1
  • File formats: Parquet, ORC
  • Scala versions: 2.12, 2.13

@slfan1989 (Contributor, Author)

@cxzl25 @richox I’ve submitted the first version of the Iceberg support code. It can already read COW tables in the basic cases, and I’ve added unit tests that pass in CI. If you have some time, could you please take a look and share any feedback? Thank you very much!

dev/reformat Outdated
Comment on lines 51 to 52
# Check or format all code, including third-party code, with spark-3.4
sparkver=spark-3.5

Contributor

Should the comment say spark-3.5, to match the sparkver value below it?

Copilot AI left a comment

Pull request overview

This PR adds native scan support for Apache Iceberg Copy-On-Write (COW) tables to the Auron execution engine, enabling direct reads of Iceberg data files through Auron's native path for improved performance. The implementation follows the established SPI (Service Provider Interface) pattern used by other data source integrations like Paimon, with automatic fallback to Spark's execution path for unsupported scenarios.

Changes:

  • Adds IcebergConvertProvider SPI extension to detect and convert Iceberg BatchScanExec nodes to native execution
  • Implements validation logic to determine COW table eligibility (no delete files, no metadata columns, supported data types)
  • Creates NativeIcebergTableScanExec to execute native Iceberg scans with Parquet/ORC format support
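
A minimal sketch of what such an eligibility check might look like follows. The conditions are taken from the summary above and from the fallback cases named in the PR's test list (residual filters, metadata columns, decimal types, delete files); the `ScanInfo` type, its field names, the order of the checks, and the reason strings are all assumptions, not the PR's actual code.

```scala
// Hypothetical eligibility check; ScanInfo is an illustrative stand-in for the
// information the real implementation extracts from the Iceberg scan.
object EligibilitySketch {
  final case class ScanInfo(
      fileFormats: Set[String],
      hasDeleteFiles: Boolean,
      readsMetadataColumns: Boolean,
      hasResidualFilters: Boolean,
      columnTypes: Seq[String])

  val SupportedFormats: Set[String] = Set("parquet", "orc")

  // Returns None when the native scan may be used, or Some(reason) to fall back to Spark.
  def fallbackReason(scan: ScanInfo): Option[String] =
    if (scan.hasDeleteFiles) Some("delete files present (MOR)")
    else if (scan.readsMetadataColumns) Some("metadata columns requested")
    else if (scan.hasResidualFilters) Some("residual filters present")
    else if (scan.columnTypes.exists(_.toLowerCase.startsWith("decimal"))) Some("decimal type")
    else if (!scan.fileFormats.forall(f => SupportedFormats.contains(f.toLowerCase)))
      Some("unsupported file format")
    else None
}
```

Returning a reason rather than a bare boolean is a design choice of this sketch; it makes a fallback decision easy to log or assert on in tests.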

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/auron/iceberg/IcebergConvertProvider.scala | SPI provider that checks version compatibility and delegates to IcebergScanSupport |
| thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/auron/iceberg/IcebergScanSupport.scala | Core validation logic to determine native scan eligibility and extract FileScanTask metadata via reflection |
| thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/execution/auron/plan/NativeIcebergTableScanExec.scala | Native execution node that converts Iceberg tasks to FilePartitions and generates protobuf scan plans |
| thirdparty/auron-iceberg/src/test/scala/org/apache/auron/iceberg/AuronIcebergIntegrationSuite.scala | Integration tests covering COW tables, projections, partitioning, ORC format, and fallback scenarios |
| thirdparty/auron-iceberg/src/test/scala/org/apache/auron/iceberg/BaseAuronIcebergSuite.scala | Test base configuration with Auron and Iceberg extensions enabled |
| thirdparty/auron-iceberg/src/main/resources/META-INF/services/org.apache.spark.sql.auron.AuronConvertProvider | SPI registration file for IcebergConvertProvider |
| thirdparty/auron-iceberg/pom.xml | Maven enforcer rules to validate Iceberg version (1.10.1) and Spark version (3.4-4.0) compatibility |
| spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java | Adds ENABLE_ICEBERG_SCAN configuration option |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/AuronConverters.scala | Adds default value handling for shuffle manager configuration |
| spark-extension/pom.xml | Adds arrow-memory-core and arrow-memory-netty dependencies |
| pom.xml | Adds Iceberg version properties and enforcer rules for all Spark version profiles |
| auron-build.sh | Updates Iceberg version support to 1.10.1 and Spark version range to 3.4-4.0 |
| dev/reformat | Updates formatting script to include Iceberg module with version 1.10.1 |
| .github/workflows/iceberg.yml | CI workflow for testing Iceberg integration across Spark 3.4, 3.5, 4.0 with multiple Java versions |

…n-Write Tables.

Signed-off-by: slfan1989 <slfan1989@apache.org>
