Pull request overview
Adds new DPDK symmetric_mp hotplug test coverage and adjusts supporting infrastructure to better tolerate Azure NIC update timing during repeated SR-IOV toggles.
Changes:
- Add two new DPDK test cases to run `symmetric_mp` with SR-IOV hotplug (single-run + stress loop).
- Update `run_dpdk_symmetric_mp` process timeouts and add short delays around SR-IOV toggling.
- Add a small delay inside Azure `switch_sriov` after NIC update to reduce transient timing issues.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| lisa/sut_orchestrator/azure/features.py | Imports sleep and adds a short delay after SR-IOV NIC updates. |
| lisa/microsoft/testsuites/dpdk/dpdkutil.py | Adjusts hugepage allocation baseline; adds dynamic timeouts and delays around hotplug loop. |
| lisa/microsoft/testsuites/dpdk/dpdksuite.py | Introduces new symmetric_mp hotplug functional + stress test cases. |
Key Test Cases:
verify_dpdk_symmetric_mp|verify_dpdk_symmetric_mp_hotplug|stress_dpdk_symmetric_mp_hotplug
Impacted LISA Features:
NetworkInterface, Sriov
Tested Azure Marketplace Images:
- canonical 0001-com-ubuntu-server-jammy 22_04-lts latest
| netvsc pmd version. | ||
| This test case checks dpdk symmetic mp app, plus an sriov hotplug. | ||
| More details refer https://docs.microsoft.com/en-us/azure/virtual-network/setup-dpdk#prerequisites # noqa: E501 |
| log: Logger, | ||
| variables: Dict[str, Any], | ||
| result: TestResult, | ||
| ) -> None: |
| timeout=test_timeout, | ||
| signal=SIGINT, | ||
| kill_timeout=30, | ||
| kill_timeout=test_timeout + 5, | ||
| ) |
thank u copilot Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Will copilots suggestion work.... drumroll! Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
Adds DPDK symmetric_mp SR-IOV hotplug coverage (regular + stress) and introduces small throttling mitigations around SR-IOV toggling to improve stability in Azure/MANA environments.
Changes:
- Add new DPDK test cases for `symmetric_mp` with SR-IOV hotplug and a long-running stress variant.
- Adjust `symmetric_mp` runner timeouts and add brief sleeps around hotplug operations.
- Add a small delay in Azure `switch_sriov` to reduce the chance of ARM/network API throttling during repeated NIC updates.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| lisa/sut_orchestrator/azure/features.py | Adds a short delay after toggling accelerated networking per NIC during SR-IOV switching. |
| lisa/microsoft/testsuites/dpdk/dpdkutil.py | Updates hugepage minimum allocation, introduces dynamic timeouts for symmetric_mp, and adds hotplug pacing sleeps. |
| lisa/microsoft/testsuites/dpdk/dpdksuite.py | Adds two new symmetric_mp hotplug test cases (regular + stress) with metadata/requirements. |
Key Test Cases:
verify_dpdk_symmetric_mp|verify_dpdk_symmetric_mp_hotplug
Impacted LISA Features:
NetworkInterface, Sriov
Tested Azure Marketplace Images:
- canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
| @TestCaseMetadata( | ||
| description=""" | ||
| netvsc pmd version. | ||
| This test case checks dpdk symmetic mp app, plus an sriov hotplug. |
Typo in the test description: "symmetic" → "symmetric".
| This test case checks dpdk symmetic mp app, plus an sriov hotplug. | |
| This test case checks dpdk symmetric mp app, plus an sriov hotplug. |
| hugepages.init_hugepages(hugepage_size, minimum_gb=8 * numa_nodes) | ||
| except NotEnoughMemoryException as err: |
Changing hugepage allocation from minimum_gb=4 * numa_nodes to 8 * numa_nodes affects all DPDK tests that call initialize_node_resources (including the 1GiB hugepages variants) and is likely to cause widespread NotEnoughMemoryException skips on common VM sizes. Consider keeping the previous default and making the larger allocation opt-in (e.g., controlled by a variable or a per-test parameter for symmetric_mp/hotplug only).
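One way to make the larger allocation opt-in, as suggested above, is to read the per-node size from the test variables and fall back to the previous default. This is only a sketch: the variable name `dpdk_hugepage_gb_per_node` and the helper `hugepage_minimum_gb` are hypothetical and do not exist in LISA.

```python
from typing import Any, Dict

# Previous default cited in the review comment: 4 GiB per NUMA node.
DEFAULT_GB_PER_NUMA_NODE = 4


def hugepage_minimum_gb(variables: Dict[str, Any], numa_nodes: int) -> int:
    """Return the total hugepage GiB to request across all NUMA nodes.

    "dpdk_hugepage_gb_per_node" is a hypothetical runbook variable used
    here for illustration; tests that need more memory would set it.
    """
    per_node = int(
        variables.get("dpdk_hugepage_gb_per_node", DEFAULT_GB_PER_NUMA_NODE)
    )
    if per_node < 1:
        raise ValueError("hugepage GiB per NUMA node must be at least 1")
    return per_node * numa_nodes
```

With this shape, only the symmetric_mp hotplug runs would pass the larger value, and every other DPDK test keeps the 4 GiB per node baseline.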
| timeout=test_timeout, | ||
| signal=SIGINT, | ||
| kill_timeout=30, | ||
| kill_timeout=test_timeout + 5, | ||
| ) |
Timeout.start_with_timeout() passes kill_timeout to GNU timeout --kill-after, which is a grace period after the initial signal (not an overall timeout). Setting it to test_timeout + 5 can make a hung symmetric_mp run wait several extra minutes (e.g., in the 40-hotplug stress case) before being force-killed. Consider using a small, fixed grace period (e.g., 30–60s) independent of test_timeout.
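To illustrate the difference, the sketch below builds a GNU `timeout` command line with a fixed grace period and computes the worst-case wall time for a process that ignores the first signal. The helper names are hypothetical, and this is not how LISA's Timeout tool assembles its command; it only shows the arithmetic behind the comment.

```python
from typing import List

# Fixed grace period between the first signal and SIGKILL,
# independent of how long the test itself is allowed to run.
KILL_GRACE_SECONDS = 30


def build_timeout_argv(
    command: List[str],
    test_timeout: int,
    kill_grace: int = KILL_GRACE_SECONDS,
) -> List[str]:
    """Build argv for: timeout --signal=INT --kill-after=N DURATION CMD..."""
    return [
        "timeout",
        "--signal=INT",
        f"--kill-after={kill_grace}",
        str(test_timeout),
        *command,
    ]


def worst_case_seconds(test_timeout: int, kill_grace: int) -> int:
    """Wall time until SIGKILL when the process ignores the first signal."""
    return test_timeout + kill_grace
```

For a 600 second test, `worst_case_seconds(600, 600 + 5)` is 1205 seconds, while `worst_case_seconds(600, 30)` is 630 seconds.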
| timeout=test_timeout, | ||
| signal=SIGINT, | ||
| kill_timeout=35, | ||
| kill_timeout=test_timeout + 5, | ||
| ) |
Same kill_timeout issue as the primary process: --kill-after is a post-signal grace period, so tying it to test_timeout can significantly extend worst-case hangs. Use a small fixed grace period here as well.
| run_dpdk_symmetric_mp( | ||
| node, log, variables, trigger_hotplug=True, hotplug_times=1 | ||
| ) |
The PR description notes this hotplug test will fail on DPDK versions below 26.07. As written, the test will still run against older DPDK sources and fail noisily. Consider adding an explicit version gate (skip with a clear message) or a runbook variable/requirement that ensures only compatible DPDK sources/branches enable this test.
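A version gate along these lines could be a small check that runs before the hotplug test starts. The sketch below assumes the DPDK version is available as a string such as `26.07.0-rc1`; `SkipTest`, `parse_dpdk_version`, and `require_hotplug_support` are hypothetical stand-ins, and a real implementation would raise the framework's own skip exception.

```python
import re
from typing import Tuple

# Minimum release named in the PR description (26.07).
MINIMUM_HOTPLUG_VERSION = (26, 7)


class SkipTest(Exception):
    """Local stand-in for the test framework's skip exception."""


def parse_dpdk_version(text: str) -> Tuple[int, int]:
    """Extract (year, month) from strings like 'DPDK 26.07.0-rc1'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"cannot parse DPDK version from {text!r}")
    return int(match.group(1)), int(match.group(2))


def require_hotplug_support(version_text: str) -> None:
    """Raise SkipTest with a clear message on DPDK older than 26.07."""
    version = parse_dpdk_version(version_text)
    if version < MINIMUM_HOTPLUG_VERSION:
        raise SkipTest(
            f"symmetric_mp hotplug needs DPDK >= 26.07, found {version_text}"
        )
```

Tuple comparison handles the year and month ordering, so `25.11` is skipped and `26.07` or later proceeds.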
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| poller = network_client.network_interfaces.begin_create_or_update( | ||
| self._resource_group_name, updated_nic.name, updated_nic | ||
| ) | ||
| updated_nic = network_client.network_interfaces.get( | ||
| self._resource_group_name, nic_name | ||
| ) | ||
| updated_nic = poller.result() |
This change makes switch_sriov effectively blocking on the Azure NIC update every time because it unconditionally calls poller.result(). In this PR, call sites explicitly pass wait=False (e.g., the hotplug test), so this likely breaks the intended non-wait semantics and can significantly slow down hotplug loops. Consider honoring the wait parameter: only call poller.result() (and any subsequent state assertions) when wait=True; otherwise return after starting the update (or return the poller for the caller to await).
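The non-wait semantics described above can be sketched in isolation. `FakePoller` below is a stand-in that models only the `result()` method of a long-running-operation poller, and `update_nic` is a hypothetical helper, not the actual `switch_sriov` code.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class FakePoller(Generic[T]):
    """Stand-in for a long-running-operation poller (result() only)."""

    def __init__(self, value: T) -> None:
        self._value = value
        self.result_calls = 0

    def result(self) -> T:
        # A real poller would block here until the operation completes.
        self.result_calls += 1
        return self._value


def update_nic(
    begin_update: Callable[[], FakePoller[T]], wait: bool
) -> Optional[T]:
    """Start the update; block on the result only when wait is True."""
    poller = begin_update()
    if not wait:
        # Caller asked for fire-and-forget: return without blocking.
        return None
    return poller.result()
```

Returning the poller itself instead of `None` is the alternative the comment mentions; it would let the caller decide when to block.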
| result: TestResult, | ||
| ) -> None: | ||
| run_dpdk_symmetric_mp( | ||
| node, log, variables, trigger_hotplug=True, hotplug_times=40 |
With hotplug_times=40, run_dpdk_symmetric_mp will loop many times and (per current implementation) starts a new ping_async each iteration without any visible join/wait. This can create many concurrent ping processes during the stress run, increasing flakiness and resource pressure. Consider changing the runner to wait for each async ping to finish before starting the next iteration (or make the stress case use synchronous ping), and/or cap concurrent ping jobs.
| node, log, variables, trigger_hotplug=True, hotplug_times=40 | |
| node, log, variables, trigger_hotplug=True, hotplug_times=4 |
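Independently of the iteration count, the serialization suggested in the comment above (wait for each async ping before starting the next iteration) could look roughly like the following. The `ping` and `toggle` callables are hypothetical placeholders for the real ping and SR-IOV toggle steps; this is not the `run_dpdk_symmetric_mp` implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_hotplug_iterations(
    hotplug_times: int,
    ping: Callable[[int], bool],
    toggle: Callable[[int], None],
) -> List[bool]:
    """Run one background ping per hotplug iteration, joining each one."""
    results: List[bool] = []
    # A single worker caps concurrency at one in-flight ping.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for iteration in range(hotplug_times):
            pending = pool.submit(ping, iteration)  # ping runs in background
            toggle(iteration)  # toggle SR-IOV while the ping is in flight
            results.append(pending.result())  # join before the next loop
    return results
```

Because each future is joined before the loop continues, at most one ping is outstanding at any time, even with `hotplug_times=40`.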
Adds the DPDK symmetric_mp test with hotplugging, in regular and stress variants.
Passing this test on MANA is unreliable without some patches from Long Li. These have merged into dpdk-next-net and will make their way into the next release.
https://git.dpdk.org/next/dpdk-next-net/commit/?id=ea9d4ee993eec0123b0e1cd227cb5743876ae677
This test will fail on all DPDK versions below the next release, 26.07. I have a separate patch for LISA's pipeline which will allow testing upstream versions of DPDK easily; this will be required for running validation on this test.