@liqiangxl
Collaborator

No description provided.

@liqiangxl
Collaborator Author

!test

@github-actions

github-actions bot commented Feb 6, 2026

Review updated until commit 13c94a2

Description

  • Remove shared memory bank conflicts in transpose scheduler by adding swizzle patterns

  • Limit broadcast aliasing to IO tensors only to avoid unnecessary transpose scheduling

  • Filter reduction domains when comparing tensor groups for scheduler selection

  • Add support for non-square tile configurations with proper swizzle handling
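The third item above, filtering reduction domains before comparing tensor groups, can be illustrated with a small standalone sketch. It is not the code in csrc/scheduler/tools/domain_map.cpp: the `Domain` struct and the helpers `withoutReductions` and `sameLoopOrder` are hypothetical stand-ins, and the sketch assumes that two groups need no transpose when their loop domains match once reduction domains are dropped.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for an iteration domain: a name plus a flag saying
// whether the domain is a reduction domain.
struct Domain {
  std::string name;
  bool is_reduction;
};

// Returns the domains of `domains` that are not reductions, in original order.
std::vector<Domain> withoutReductions(const std::vector<Domain>& domains) {
  std::vector<Domain> kept;
  std::copy_if(domains.begin(), domains.end(), std::back_inserter(kept),
               [](const Domain& d) { return !d.is_reduction; });
  assert(kept.size() <= domains.size());
  return kept;
}

// Compares two loop-domain lists by name after dropping reduction domains.
bool sameLoopOrder(const std::vector<Domain>& a, const std::vector<Domain>& b) {
  const std::vector<Domain> fa = withoutReductions(a);
  const std::vector<Domain> fb = withoutReductions(b);
  return fa.size() == fb.size() &&
      std::equal(fa.begin(), fa.end(), fb.begin(),
                 [](const Domain& x, const Domain& y) {
                   return x.name == y.name;
                 });
}
```

In this sketch a group with domains {i0, r1, i2}, where r1 is a reduction, compares equal to a group with {i0, i2}, so the extra reduction domain alone would not suggest a transpose.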

Changes walkthrough

Relevant files
Enhancement
alias_analysis.cpp
Limit broadcast aliasing to IO tensors only                           

csrc/alias_analysis.cpp

  • Added check to only consider broadcast aliasing when IO tensors are
    involved
  • Added early return for alias analysis when no IO tensors are involved
  • Added explanatory comments about limiting aliasing to fusion
    boundaries
+12/-0

domain_map.cpp
Filter reduction domains in group comparison

csrc/scheduler/tools/domain_map.cpp

  • Added filtering of reduction domains before comparing reference loops
  • Prevents false positives when deciding between transpose and pointwise
    scheduler
+9/-3

transpose.cpp
Add shared memory swizzle to reduce bank conflicts

csrc/scheduler/transpose.cpp

  • Modified hasSmallTransposeDimensions function signature to take
    pointer instead of unique_ptr
  • Moved bits-in-flight calculation logic and added shared memory swizzle
    support
  • Added logic to handle non-square tiles and disable swizzle when
    appropriate
  • Implemented swizzle scheduling for cached input tensors to reduce bank
    conflicts
  • Added debug print statements for monitoring bits-in-flight
    calculations
+90/-33

Tests
test_gpu3.cpp
Remove invalid transpose test case

tests/cpp/test_gpu3.cpp

  • Removed test case FusionScheduleTransposeRepro1_CUDA related to issue
    #1925
+0/-22

test_persistent_buffer.cpp
Add scheduler expectation validation

tests/cpp/test_persistent_buffer.cpp

  • Added expectation check for reduction and pointwise scheduler in 3D
    reduction test
+7/-0

test_rng.cpp
Remove invalid non-square tile test

tests/cpp/test_rng.cpp

  • Removed test case BroadcastingRNGSmemNonSquareTile related to issue
    #1926
+0/-36

test_transpose.cpp
Update expected scheduler type in test

tests/cpp/test_transpose.cpp

  • Changed expected heuristic from Transpose to PointWise in reduction
    test
  • Reflects improved scheduler selection after filtering reduction
    domains
+1/-1

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Debug output in production code

    Multiple std::cout statements were added for debugging (lines 747-752). These should be removed or replaced with proper logging infrastructure before merging to main.

    std::cout << "total_input_bits_per_elem: " << total_input_bits_per_elem
              << std::endl;
    std::cout << "num_elems_per_tile: " << num_elems_per_tile << std::endl;
    std::cout << "max_blocks_per_sm: " << max_blocks_per_sm << std::endl;
    std::cout << "bits_in_flight_per_sm: " << bits_in_flight_per_sm << std::endl;
    std::cout << "required_bits_per_sm: " << required_bits_per_sm << std::endl;
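One way to keep these diagnostics without printing unconditionally is to gate them behind a runtime switch. The sketch below is only an illustration: the environment variable `TRANSPOSE_SCHEDULER_DEBUG` and the helpers `transposeDebugEnabled` and `debugPrint` are hypothetical names, not part of nvFuser, and the project's existing logging facilities would be the preferable target.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical switch: diagnostics are on only when the environment variable
// TRANSPOSE_SCHEDULER_DEBUG is set to "1".
bool transposeDebugEnabled() {
  const char* value = std::getenv("TRANSPOSE_SCHEDULER_DEBUG");
  return value != nullptr && std::string(value) == "1";
}

// Writes "name: value" followed by a newline to `os` when `enabled` is true.
// Returns whether anything was written.
template <typename T>
bool debugPrint(std::ostream& os, bool enabled, const std::string& name,
                const T& value) {
  assert(!name.empty());
  if (!enabled) {
    return false;
  }
  os << name << ": " << value << '\n';
  return true;
}
```

A call such as `debugPrint(std::cerr, transposeDebugEnabled(), "bits_in_flight_per_sm", bits_in_flight_per_sm)` would then stay silent unless the switch is set.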
    Complex swizzle scheduling logic

    The new smem swizzle scheduling logic (lines 1352-1373) is complex and handles multiple edge cases. The logic for determining use_smem_swizzle and the actual swizzle scheduling should be carefully reviewed for correctness, especially the conditions that disable swizzle for non-square tiles and cached outputs.

    if (use_smem_swizzle) {
      for (auto tv : smem_cached_input_tvs) {
        std::cout << "scheduling smem_cached_tv: " << tv->toString() << std::endl;
        int64_t pos = tv->nDims() - 2;
        bool is_group2 = group2_and_cached_inputs.count(tv) > 0;
        int64_t tile2_factor =
            is_group2 ? tparams->vectorize_factor2 : tparams->vectorize_factor1;
        int64_t tile1_factor =
            tparams->tile_size1 * tile2_factor / tparams->tile_size2;
        // [BIDx, UnSwitch, tile1, tile2]
        tv->split(pos + 1, tile2_factor);
        tv->split(pos, tile1_factor);
        tv->swizzle(SwizzleType::XOR, pos, pos + 2);
        tv->merge(pos);
        tv->merge(pos);
        tv->split(pos, tparams->getThreadsPerBlock());
        tv->axis(pos)->parallelize(ParallelType::Unroll);
        tv->axis(pos + 1)->parallelize(ParallelType::TIDx);
        tv->axis(pos + 2)->parallelize(ParallelType::Vectorize);
        std::cout << "scheduled smem_cached_tv: " << tv->toString() << std::endl;
      }
    }
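To make the bank-conflict argument behind the XOR swizzle concrete, the following host-side C++ sketch counts the worst-case number of threads in a warp that hit the same shared memory bank. It is not nvFuser code: it assumes 32 banks of 4-byte words and a 32x32 tile of 32-bit elements stored row-major, and the names `bankOf` and `maxConflictDegree` are hypothetical. The vectorized, split-axis layout that the PR schedules is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumption: 32 shared memory banks, each 4 bytes wide.
constexpr int kNumBanks = 32;

// Bank holding logical element (row, col) of a kNumBanks x kNumBanks tile of
// 4-byte words stored row-major. With swizzle on, the column index is XORed
// with the row index before the element is placed.
int bankOf(int row, int col, bool swizzle) {
  assert(row >= 0 && row < kNumBanks);
  assert(col >= 0 && col < kNumBanks);
  const int physical_col = swizzle ? (col ^ row) : col;
  return (row * kNumBanks + physical_col) % kNumBanks;
}

// Worst-case number of threads of one warp that map to the same bank.
// column_access == true:  thread t touches element (t, fixed), the
//                         transposed access.
// column_access == false: thread t touches element (fixed, t), the
//                         row-wise access.
int maxConflictDegree(bool swizzle, bool column_access) {
  int worst = 0;
  for (int fixed = 0; fixed < kNumBanks; ++fixed) {
    std::vector<int> hits(kNumBanks, 0);
    for (int t = 0; t < kNumBanks; ++t) {
      const int bank = column_access ? bankOf(t, fixed, swizzle)
                                     : bankOf(fixed, t, swizzle);
      worst = std::max(worst, ++hits[bank]);
    }
  }
  return worst;
}
```

Under these assumptions, `maxConflictDegree(false, true)` is 32 because every thread of the warp lands in one bank, while `maxConflictDegree(true, true)` is 1. Row-wise access stays conflict-free in both layouts.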
    Test case removal

    Two test cases were removed (FusionScheduleTransposeRepro1_CUDA and BroadcastingRNGSmemNonSquareTile). The reason for removal should be documented, and equivalent test coverage should be maintained to prevent regressions.

    Test failures

    • (Medium, 10) nvFuser swizzle-on-broadcast assertion failures in RNGTest, test_repro, and ThunderFX MoE suites

      Test Name A100 GB200 H100 Source
      RNGTest.BroadcastingRNGSmem Link
      tests.python.direct.test_repro.test_domain_map_hang[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_domain_map_hang[nvfuser_direct_test=lru_cache]
      tests.python.test_moe.test_llama4_moe_thunderfx
    • (Medium, 3) nvFuser aliasing mismatch in AliasTest.NotAllOutputsAlias_Pointwise across multiple GPU runners

      Test Name A100 GB200 H100 Source
      AliasTest.NotAllOutputsAlias_Pointwise Link
    • (Medium, 1) Thunder nvFuser scalar mismatch in nanogpt autograd test

      Test Name A100 Source
      thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32

    @liqiangxl
    Collaborator Author

    !test

    @liqiangxl force-pushed the llu/transpose_bank_conflict branch from 5bbb6fa to 6e53109 on February 8, 2026 16:37
    @liqiangxl
    Collaborator Author

    !test

    @liqiangxl
    Collaborator Author

    !test

    @liqiangxl
    Collaborator Author

    !test

    @liqiangxl
    Collaborator Author

    !test
