
nccl_ep: Low-Latency kernel memory footprint optimization #2040

Open
artpol84 wants to merge 1 commit into NVIDIA:master from artpol84:topic/ncclEP/ll_mem_opt_v2

Conversation

@artpol84
Collaborator

@artpol84 artpol84 commented Mar 9, 2026

Description

Reduce LL kernel memory consumption by including the top-k indices in the token message payloads.

On dispatch, this makes it possible to avoid maintaining a separate buffer space per local-expert/remote-rank pair and instead keep one space per remote rank.
This reduces the memory overhead from O(E x B x H) to O(N x B x H), where E is the number of experts, N the number of ranks, B the batch size, and H the token hidden dimension.
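The dispatch-side idea can be sketched as follows. This is an illustrative Python model only, not the actual NCCL EP kernel code; all names (`TokenMsg`, `route_to_local_experts`) are hypothetical. Because each token message carries its own top-k expert indices, the receiver can scatter tokens to local experts out of a single per-rank buffer instead of keeping one buffer per (local expert, remote rank) pair.

```python
# Hypothetical sketch of dispatch-side routing. Names and types are
# illustrative, not the NCCL EP implementation.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TokenMsg:
    topk_experts: List[int]   # global expert ids selected for this token
    hidden: List[float]       # token hidden state (H values)


def route_to_local_experts(msgs: List[TokenMsg], rank: int,
                           experts_per_rank: int) -> Dict[int, List[List[float]]]:
    """Scatter tokens from one per-rank receive buffer to local expert queues.

    The embedded top-k indices make per-expert receive buffers unnecessary:
    routing happens after receipt, from a single buffer per remote rank.
    """
    lo = rank * experts_per_rank
    queues: Dict[int, List[List[float]]] = {e: [] for e in range(lo, lo + experts_per_rank)}
    for m in msgs:
        for e in m.topk_experts:
            if lo <= e < lo + experts_per_rank:
                queues[e].append(m.hidden)
    return queues
```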

On combine, the top-k indices are used to reduce the communication buffer from O(E x B x H) to O(K x B x H), where K is the number of experts selected per token.
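The buffer-size arithmetic behind both reductions can be checked with a small sketch. The sizes below are illustrative placeholders, not values from the PR; the helper name is hypothetical. With E total experts spread over N ranks, the dispatch buffer shrinks by a factor of E/N (the experts-per-rank count), and the combine buffer by E/K.

```python
# Illustrative buffer-size arithmetic; sizes are made-up examples.
def buf_bytes(slots: int, batch: int, hidden: int, dtype_bytes: int = 2) -> int:
    """Generic communication-buffer size: slots x batch x hidden tokens."""
    return slots * batch * hidden * dtype_bytes

N = 8        # ranks
E = 256      # total experts (32 per rank in this example)
K = 8        # experts selected per token (top-k)
B, H = 128, 7168  # batch size, hidden dimension

dispatch_before = buf_bytes(E, B, H)   # O(E x B x H)
dispatch_after  = buf_bytes(N, B, H)   # O(N x B x H)
combine_before  = buf_bytes(E, B, H)   # O(E x B x H)
combine_after   = buf_bytes(K, B, H)   # O(K x B x H)

# Both reductions are E/N = E/K = 32x for these example sizes,
# matching the claimed order-of-magnitude savings.
```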

Related Issues

N/A

Changes & Impact

Changes: Reorganize NCCL EP communication buffer layout.
Impact: Order of magnitude reduction in memory consumption.

Performance Impact

No impact observed


Signed-off-by: Artem Y. Polyakov <artemp@nvidia.com>
@xiaofanl-nvidia
Collaborator

@jskrobola can you help start the mirror?

@artpol84
Collaborator Author

@xiaofanl-nvidia, @sb17v

Update: This change was tested with:

  • ep_bench microbenchmark, and
  • internal vLLM integration

showing no performance degradation compared to the pre-optimization code.
