remove input cached smem bank conflicts in transpose scheduler #5930

liqiangxl · 2026-02-06T17:44:06Z

No description provided.

liqiangxl · 2026-02-06T17:45:01Z

!test

github-actions · 2026-02-06T17:46:59Z · edited by xwang233

Review updated until commit 13c94a2

Description

Remove shared memory bank conflicts in transpose scheduler by adding swizzle patterns
Limit broadcast aliasing to IO tensors only to avoid unnecessary transpose scheduling
Filter reduction domains when comparing tensor groups for scheduler selection
Add support for non-square tile configurations with proper swizzle handling
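
As background for the first item, here is a minimal standalone sketch (not code from this PR) of why an XOR swizzle removes bank conflicts when a shared memory tile is read transposed. It assumes 32 banks of 4-byte words and a square 32x32 tile of 4-byte elements; `plainOffset`, `swizzledOffset`, and `maxWarpConflict` are hypothetical names chosen for the illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 32;  // assumed: 32 banks, each one 4-byte word wide
constexpr int kTile = 32;   // assumed: square 32x32 tile of 4-byte elements

// Word offset of element (row, col) in a plain row-major tile.
inline int plainOffset(int row, int col) {
  return row * kTile + col;
}

// Word offset with an XOR swizzle: the column index is XOR-ed with the row
// index, so each row stores its elements in a different column permutation.
inline int swizzledOffset(int row, int col) {
  return row * kTile + (col ^ row);
}

// Worst-case number of lanes in one 32-lane warp that hit the same bank.
// read_column == true:  lane i reads (row = i, col = fixed), a transposed read.
// read_column == false: lane i reads (row = fixed, col = i), a row-wise access.
template <typename OffsetFn>
int maxWarpConflict(OffsetFn offset, bool read_column) {
  int worst = 0;
  for (int fixed = 0; fixed < kTile; ++fixed) {
    std::array<int, kBanks> hits{};  // zero-initialized hit counter per bank
    for (int lane = 0; lane < kTile; ++lane) {
      const int row = read_column ? lane : fixed;
      const int col = read_column ? fixed : lane;
      ++hits[offset(row, col) % kBanks];
    }
    worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
  }
  return worst;
}
```

With the plain layout, all 32 lanes reading one column land in the same bank, so `maxWarpConflict(plainOffset, true)` is 32. With the XOR layout the lanes land in 32 distinct banks and the value drops to 1, while row-wise access stays conflict-free in both layouts.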

Changes walkthrough

Relevant files

Enhancement

alias_analysis.cpp (csrc/alias_analysis.cpp, +12/-0): Limit broadcast aliasing to IO tensors only
Added check to only consider broadcast aliasing when IO tensors are involved
Added early return for alias analysis when no IO tensors are involved
Added explanatory comments about limiting aliasing to fusion boundaries

domain_map.cpp (csrc/scheduler/tools/domain_map.cpp, +9/-3): Filter reduction domains in group comparison
Added filtering of reduction domains before comparing reference loops
Prevents false positives when deciding between transpose and pointwise scheduler

transpose.cpp (csrc/scheduler/transpose.cpp, +90/-33): Add shared memory swizzle to reduce bank conflicts
Modified hasSmallTransposeDimensions function signature to take pointer instead of unique_ptr
Moved bits-in-flight calculation logic and added shared memory swizzle support
Added logic to handle non-square tiles and disable swizzle when appropriate
Implemented swizzle scheduling for cached input tensors to reduce bank conflicts
Added debug print statements for monitoring bits-in-flight calculations

Tests

test_gpu3.cpp (tests/cpp/test_gpu3.cpp, +0/-22): Remove invalid transpose test case
Removed test case FusionScheduleTransposeRepro1_CUDA related to issue #1925

test_persistent_buffer.cpp (tests/cpp/test_persistent_buffer.cpp, +7/-0): Add scheduler expectation validation
Added expectation check for reduction and pointwise scheduler in 3D reduction test

test_rng.cpp (tests/cpp/test_rng.cpp, +0/-36): Remove invalid non-square tile test
Removed test case BroadcastingRNGSmemNonSquareTile related to issue #1926

test_transpose.cpp (tests/cpp/test_transpose.cpp, +1/-1): Update expected scheduler type in test
Changed expected heuristic from Transpose to PointWise in reduction test
Reflects improved scheduler selection after filtering reduction domains

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Debug output in production code

Multiple std::cout statements were added for debugging (lines 747-752). These should be removed or replaced with proper logging infrastructure before merging to main.

std::cout << "total_input_bits_per_elem: " << total_input_bits_per_elem
          << std::endl;
std::cout << "num_elems_per_tile: " << num_elems_per_tile << std::endl;
std::cout << "max_blocks_per_sm: " << max_blocks_per_sm << std::endl;
std::cout << "bits_in_flight_per_sm: " << bits_in_flight_per_sm << std::endl;
std::cout << "required_bits_per_sm: " << required_bits_per_sm << std::endl;
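
One possible way to address this, sketched under stated assumptions rather than taken from the PR: route the same values through a helper that prints only when a debug switch is on. The environment variable `TRANSPOSE_SCHED_DEBUG` and the helpers `transposeDebugEnabled`, `formatBitsInFlight`, and `logBitsInFlight` are hypothetical names for this illustration. If the project already has a debug-dump facility suited to scheduler output, that would be the more natural home.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical switch: debug output is on only when the environment variable
// TRANSPOSE_SCHED_DEBUG is set to something other than "0".
inline bool transposeDebugEnabled() {
  const char* value = std::getenv("TRANSPOSE_SCHED_DEBUG");
  return value != nullptr && std::string(value) != "0";
}

// Formats the same five quantities the std::cout statements print,
// one "name: value" pair per line.
inline std::string formatBitsInFlight(
    int64_t total_input_bits_per_elem,
    int64_t num_elems_per_tile,
    int64_t max_blocks_per_sm,
    int64_t bits_in_flight_per_sm,
    int64_t required_bits_per_sm) {
  std::ostringstream os;
  os << "total_input_bits_per_elem: " << total_input_bits_per_elem << '\n'
     << "num_elems_per_tile: " << num_elems_per_tile << '\n'
     << "max_blocks_per_sm: " << max_blocks_per_sm << '\n'
     << "bits_in_flight_per_sm: " << bits_in_flight_per_sm << '\n'
     << "required_bits_per_sm: " << required_bits_per_sm << '\n';
  return os.str();
}

// Writes the formatted values to `out` only when the debug switch is on,
// so normal runs produce no output.
inline void logBitsInFlight(
    std::ostream& out,
    int64_t total_input_bits_per_elem,
    int64_t num_elems_per_tile,
    int64_t max_blocks_per_sm,
    int64_t bits_in_flight_per_sm,
    int64_t required_bits_per_sm) {
  if (!transposeDebugEnabled()) {
    return;
  }
  out << formatBitsInFlight(
      total_input_bits_per_elem,
      num_elems_per_tile,
      max_blocks_per_sm,
      bits_in_flight_per_sm,
      required_bits_per_sm);
}
```

Keeping formatting separate from the enable check makes the message easy to test and lets the call site pass `std::cerr` or any other stream.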

Complex swizzle scheduling logic

The new smem swizzle scheduling logic (lines 1352-1373) is complex and handles multiple edge cases. The logic for determining use_smem_swizzle and the actual swizzle scheduling should be carefully reviewed for correctness, especially the conditions that disable swizzle for non-square tiles and cached outputs.

if (use_smem_swizzle) {
  for (auto tv : smem_cached_input_tvs) {
    std::cout << "scheduling smem_cached_tv: " << tv->toString() << std::endl;
    int64_t pos = tv->nDims() - 2;
    bool is_group2 = group2_and_cached_inputs.count(tv) > 0;
    int64_t tile2_factor =
        is_group2 ? tparams->vectorize_factor2 : tparams->vectorize_factor1;
    int64_t tile1_factor =
        tparams->tile_size1 * tile2_factor / tparams->tile_size2;
    // [BIDx, UnSwitch, tile1, tile2]
    tv->split(pos + 1, tile2_factor);
    tv->split(pos, tile1_factor);
    tv->swizzle(SwizzleType::XOR, pos, pos + 2);
    tv->merge(pos);
    tv->merge(pos);
    tv->split(pos, tparams->getThreadsPerBlock());
    tv->axis(pos)->parallelize(ParallelType::Unroll);
    tv->axis(pos + 1)->parallelize(ParallelType::TIDx);
    tv->axis(pos + 2)->parallelize(ParallelType::Vectorize);
    std::cout << "scheduled smem_cached_tv: " << tv->toString() << std::endl;
  }
}
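
The choice of `tile1_factor` in the snippet above can be checked with a little arithmetic. Assuming `split(axis, factor)` is an inner split that produces `[extent / factor, factor]`, the two splits yield `[tile1 / tile1_factor, tile1_factor, tile2 / tile2_factor, tile2_factor]`, and the swizzle is applied to the two outer extents at `pos` and `pos + 2`. The standalone sketch below (hypothetical names `SwizzleExtents` and `swizzleExtents`, not nvFuser code) shows that the formula makes those two extents equal whenever the divisions are exact.

```cpp
#include <cassert>
#include <cstdint>

// Extents of the four axes after the two splits in the snippet above,
// assuming each split is an inner split: [extent / factor, factor].
struct SwizzleExtents {
  int64_t tile1_outer;  // tile_size1 / tile1_factor, swizzled axis at pos
  int64_t tile1_inner;  // tile1_factor
  int64_t tile2_outer;  // tile_size2 / tile2_factor, swizzled axis at pos + 2
  int64_t tile2_inner;  // tile2_factor (the vectorization width)
};

inline SwizzleExtents swizzleExtents(
    int64_t tile_size1,
    int64_t tile_size2,
    int64_t tile2_factor) {
  // The sketch only covers the case where every division is exact.
  assert(tile2_factor > 0 && tile_size2 % tile2_factor == 0);
  assert((tile_size1 * tile2_factor) % tile_size2 == 0);
  // Same formula as in the reviewed snippet.
  const int64_t tile1_factor = tile_size1 * tile2_factor / tile_size2;
  assert(tile1_factor > 0 && tile_size1 % tile1_factor == 0);
  return SwizzleExtents{
      tile_size1 / tile1_factor,
      tile1_factor,
      tile_size2 / tile2_factor,
      tile2_factor};
}
```

For example, a 64x32 tile with `tile2_factor = 4` gives `tile1_factor = 8`, so both swizzled extents are 8. A square 32x32 tile with the same factor gives `tile1_factor = 4` and again two extents of 8.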

Test case removal

Two test cases were removed (FusionScheduleTransposeRepro1_CUDA and BroadcastingRNGSmemNonSquareTile). The reason for removal should be documented, and equivalent test coverage should be maintained to prevent regressions.

Test failures

(Medium, 10) nvFuser swizzle-on-broadcast assertion failures in RNGTest, test_repro, and ThunderFX MoE suites

Test Name	A100	GB200	H100	Source
RNGTest.BroadcastingRNGSmem	❌	❌	❌	Link
tests.python.direct.test_repro.test_domain_map_hang[nvfuser_direct_test=eager]	❌		❌
tests.python.direct.test_repro.test_domain_map_hang[nvfuser_direct_test=lru_cache]	❌		❌
tests.python.test_moe.test_llama4_moe_thunderfx	❌	❌	❌

(Medium, 3) nvFuser aliasing mismatch in AliasTest.NotAllOutputsAlias_Pointwise across multiple GPU runners

Test Name	A100	GB200	H100	Source
AliasTest.NotAllOutputsAlias_Pointwise	❌	❌	❌	Link

(Medium, 1) Thunder nvFuser scalar mismatch in nanogpt autograd test

Test Name	A100	Source
thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32	❌

liqiangxl · 2026-02-06T22:28:16Z

!test

liqiangxl · 2026-02-08T16:45:49Z

!test

liqiangxl · 2026-02-08T20:43:27Z

!test

liqiangxl · 2026-02-09T15:19:20Z

!test

liqiangxl · 2026-02-10T14:31:31Z

!test

tile size and rm bank conflicts

ff62c9e

liqiangxl added 4 commits February 8, 2026 07:12

small transpose

902b273

skip reduction dim

3c561b5

fix transpose

a0d20ff

remove invalid test

6e53109

liqiangxl force-pushed the llu/transpose_bank_conflict branch from 5bbb6fa to 6e53109 on February 8, 2026 16:37

remove invalid test

a9de12f

liqiangxl added 2 commits February 9, 2026 07:15

limit bcast aliasing to io tensors

8e9d908

Merge branch 'llu/transpose_bank_conflict' of https://github.com/nvid…

a8dd28d

…ia/fuser into llu/transpose_bank_conflict

update alias

13c94a2
