Skip to content

DPDK: Add symmetric_mp hotplug test#4348

Draft
mcgov wants to merge 8 commits intomainfrom
mcgov/symmetric-hotplug
Draft

DPDK: Add symmetric_mp hotplug test#4348
mcgov wants to merge 8 commits intomainfrom
mcgov/symmetric-hotplug

Conversation

@mcgov
Copy link
Copy Markdown
Collaborator

@mcgov mcgov commented Mar 18, 2026

Adds the dpdk symmetric_mp test with hotplugging, stress and regular.

Passing this test on MANA is unreliable without some patches from Long Li. These have merged into dpdk-next-net and will make their way into the next release.
https://git.dpdk.org/next/dpdk-next-net/commit/?id=ea9d4ee993eec0123b0e1cd227cb5743876ae677

This test will fail all versions below the next release 26.07. I have a seperate patch for LISA's pipeline which will allow testing upstream versions of DPDK easily; this will be required for running validation on this test.

Copilot AI review requested due to automatic review settings March 18, 2026 00:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds new DPDK symmetric_mp hotplug test coverage and adjusts supporting infrastructure to better tolerate Azure NIC update timing during repeated SR-IOV toggles.

Changes:

  • Add two new DPDK test cases to run symmetric_mp with SR-IOV hotplug (single-run + stress loop).
  • Update run_dpdk_symmetric_mp process timeouts and add short delays around SR-IOV toggling.
  • Add a small delay inside Azure switch_sriov after NIC update to reduce transient timing issues.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
lisa/sut_orchestrator/azure/features.py Imports sleep and adds a short delay after SR-IOV NIC updates.
lisa/microsoft/testsuites/dpdk/dpdkutil.py Adjusts hugepage allocation baseline; adds dynamic timeouts and delays around hotplug loop.
lisa/microsoft/testsuites/dpdk/dpdksuite.py Introduces new symmetric_mp hotplug functional + stress test cases.

Key Test Cases:
verify_dpdk_symmetric_mp|verify_dpdk_symmetric_mp_hotplug|stress_dpdk_symmetric_mp_hotplug

Impacted LISA Features:
NetworkInterface, Sriov

Tested Azure Marketplace Images:

  • canonical 0001-com-ubuntu-server-jammy 22_04-lts latest

Comment on lines +165 to +167
netvsc pmd version.
This test case checks dpdk symmetic mp app, plus an sriov hotplug.
More details refer https://docs.microsoft.com/en-us/azure/virtual-network/setup-dpdk#prerequisites # noqa: E501
log: Logger,
variables: Dict[str, Any],
result: TestResult,
) -> None:
Comment on lines +1796 to 1799
timeout=test_timeout,
signal=SIGINT,
kill_timeout=30,
kill_timeout=test_timeout + 5,
)
thank u copilot

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 24, 2026 23:50
Will copilots suggestion work.... drumroll!

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds DPDK symmetric_mp SR-IOV hotplug coverage (regular + stress) and introduces small throttling mitigations around SR-IOV toggling to improve stability in Azure/MANA environments.

Changes:

  • Add new DPDK test cases for symmetric_mp with SR-IOV hotplug and a long-running stress variant.
  • Adjust symmetric_mp runner timeouts and add brief sleeps around hotplug operations.
  • Add a small delay in Azure switch_sriov to reduce the chance of ARM/network API throttling during repeated NIC updates.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
lisa/sut_orchestrator/azure/features.py Adds a short delay after toggling accelerated networking per NIC during SR-IOV switching.
lisa/microsoft/testsuites/dpdk/dpdkutil.py Updates hugepage minimum allocation, introduces dynamic timeouts for symmetric_mp, and adds hotplug pacing sleeps.
lisa/microsoft/testsuites/dpdk/dpdksuite.py Adds two new symmetric_mp hotplug test cases (regular + stress) with metadata/requirements.

Key Test Cases:
verify_dpdk_symmetric_mp|verify_dpdk_symmetric_mp_hotplug

Impacted LISA Features:
NetworkInterface, Sriov

Tested Azure Marketplace Images:

  • canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest

@TestCaseMetadata(
description="""
netvsc pmd version.
This test case checks dpdk symmetic mp app, plus an sriov hotplug.
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the test description: "symmetic" → "symmetric".

Suggested change
This test case checks dpdk symmetic mp app, plus an sriov hotplug.
This test case checks dpdk symmetric mp app, plus an sriov hotplug.

Copilot uses AI. Check for mistakes.
Comment on lines 481 to 482
hugepages.init_hugepages(hugepage_size, minimum_gb=8 * numa_nodes)
except NotEnoughMemoryException as err:
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing hugepage allocation from minimum_gb=4 * numa_nodes to 8 * numa_nodes affects all DPDK tests that call initialize_node_resources (including the 1GiB hugepages variants) and is likely to cause widespread NotEnoughMemoryException skips on common VM sizes. Consider keeping the previous default and making the larger allocation opt-in (e.g., controlled by a variable or a per-test parameter for symmetric_mp/hotplug only).

Copilot uses AI. Check for mistakes.
Comment on lines +1796 to 1799
timeout=test_timeout,
signal=SIGINT,
kill_timeout=30,
kill_timeout=test_timeout + 5,
)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout.start_with_timeout() passes kill_timeout to GNU timeout --kill-after, which is a grace period after the initial signal (not an overall timeout). Setting it to test_timeout + 5 can make hung symmetric_mp runs wait an extra ~minutes (e.g., 40-hotplug stress) before being force-killed. Consider using a small, fixed grace period (e.g., 30–60s) independent of test_timeout.

Copilot uses AI. Check for mistakes.
Comment on lines +1810 to 1813
timeout=test_timeout,
signal=SIGINT,
kill_timeout=35,
kill_timeout=test_timeout + 5,
)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same kill_timeout issue as the primary process: --kill-after is a post-signal grace period, so tying it to test_timeout can significantly extend worst-case hangs. Use a small fixed grace period here as well.

Copilot uses AI. Check for mistakes.
Comment on lines +159 to +161
run_dpdk_symmetric_mp(
node, log, variables, trigger_hotplug=True, hotplug_times=1
)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description notes this hotplug test will fail on DPDK versions below 26.07. As written, the test will still run against older DPDK sources and fail noisily. Consider adding an explicit version gate (skip with a clear message) or a runbook variable/requirement that ensures only compatible DPDK sources/branches enable this test.

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 25, 2026 17:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +1013 to +1016
poller = network_client.network_interfaces.begin_create_or_update(
self._resource_group_name, updated_nic.name, updated_nic
)
updated_nic = network_client.network_interfaces.get(
self._resource_group_name, nic_name
)
updated_nic = poller.result()
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes switch_sriov effectively blocking on the Azure NIC update every time because it unconditionally calls poller.result(). In this PR, call sites explicitly pass wait=False (e.g., the hotplug test), so this likely breaks the intended non-wait semantics and can significantly slow down hotplug loops. Consider honoring the wait parameter: only call poller.result() (and any subsequent state assertions) when wait=True; otherwise return after starting the update (or return the poller for the caller to await).

Copilot uses AI. Check for mistakes.
result: TestResult,
) -> None:
run_dpdk_symmetric_mp(
node, log, variables, trigger_hotplug=True, hotplug_times=40
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With hotplug_times=40, run_dpdk_symmetric_mp will loop many times and (per current implementation) starts a new ping_async each iteration without any visible join/wait. This can create many concurrent ping processes during the stress run, increasing flakiness and resource pressure. Consider changing the runner to wait for each async ping to finish before starting the next iteration (or make the stress case use synchronous ping), and/or cap concurrent ping jobs.

Suggested change
node, log, variables, trigger_hotplug=True, hotplug_times=40
node, log, variables, trigger_hotplug=True, hotplug_times=4

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants