Anuket Project
Cross-NUMA performance measurements with VSPERF
Introduction:
Cross-NUMA tests as part of OPNFV Plugfest (Gambia) - January 2019, by Sridhar K. N. Rao (sridhar.rao@spirent.com) and Al Morton (acmorton@att.com)
- VSPERF-Scenarios: V2V, P2P and PVP.
- Workloads: vSwitch daemon, PMD threads and VNF.
- VNF: L2 Forwarding
- vswitch: OVS and VPP.
Testcases Run:
Frame sizes (bytes): 64, 128, 256, 512, 1024, 1280, 1518
- RFC2544 Throughput Test - NDR.
- Continuous Traffic Test - 100% of line rate.
Testbed:
Node-4 (DUT), Node-5 (Software Traffic Generators) and H/W Traffic Generator.
CPU Topology on DUT
V2V Scenarios
Summary of V2V Scenarios
Scenario | Possible Core-allocations (pmd-cpu-mask) | TGen Ports Info |
---|---|---|
1 | PMDs: 4, 5 (0x30) | 2 Virtual Ports 10G |
2 | PMDs: 22, 23 (0xC00000) | 2 Virtual Ports 10G |
3 | PMDs: 4, 22 (0x400010) | 2 Virtual Ports 10G |
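The hexadecimal values in the core-allocation column are OVS-DPDK pmd-cpu-mask values, in which bit N selects core N. As a minimal sketch (assuming a plain OVS-DPDK setup rather than the VSPERF harness itself), the mask for Scenario 1 (cores 4 and 5) can be derived and applied as follows; the same computation gives 0xC00000 for cores 22-23 and 0x400010 for cores 4 and 22.
# bits 4 and 5 set -> 0x30
printf '0x%X\n' $(( (1 << 4) | (1 << 5) ))
# apply the mask to OVS-DPDK
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x30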
P2P Scenarios
Summary of P2P Scenarios:
Scenario | Possible Core-allocations (pmd-cpu-mask) | DUT Ports, TGen (Hardware) Ports |
---|---|---|
1 | PMDs: 4, 5 (0x30) | DUT: eno5, eno6 TGEN: 5, 6 |
2 | PMDs: 22, 23 (0xC00000) | DUT: eno5, eno6 TGEN: 5, 6 |
3 | PMDs: 4, 22 (0x400010) | DUT: eno5, eno6 TGEN: 5, 6 |
4 | PMDs: 4, 5 (0x30) | DUT: eno5, ens801f2 TGEN: 5, 7 |
5 | PMDs: 22, 23 (0xC00000) | DUT: eno5, ens801f2 TGEN: 5, 7 |
6 | PMDs: 4, 22 (0x400010) | DUT: eno5, ens801f2 TGEN: 5, 7 |
7 | PMDs: 4, 5 (0x30) | DUT: ens801f2, ens802f3 TGEN: 7, 8 |
8 | PMDs: 22, 23 (0xC00000) | DUT: ens801f2, ens802f3 TGEN: 7, 8 |
9 | PMDs: 4, 22 (0x400010) | DUT: ens801f2, ens802f3 TGEN: 7, 8 |
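Whether a P2P scenario is cross-NUMA depends on the NUMA node each DUT NIC is attached to. A quick way to check this on the DUT is shown below for the interface names above; once a NIC is bound to DPDK, query its PCI address under /sys/bus/pci/devices/ instead.
# NUMA node of each NIC (-1 means the device reports no NUMA affinity)
cat /sys/class/net/eno5/device/numa_node
cat /sys/class/net/ens801f2/device/numa_node
# core-to-NUMA layout of the host
lscpu | grep -i numa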
PVP Scenarios
Summary of PVP Scenarios:
Assumptions: NUMA-0 = cores 0-21, NUMA-1 = cores 22-43; vSwitch core: 2.
Scenario | Possible Core-allocations (pmd-cpu-mask) | VNF Cores | DUT Ports, TGen (Hardware) Ports |
---|---|---|---|
1 | PMDs: 4, 5, 6, 7 (0xF0) | VNF: 8,9 | DUT: eno5, eno6 TGEN: 5, 6 |
2 | PMDs: 4, 5, 6, 7 (0xF0) | VNF: 22, 23 | DUT: eno5, eno6 TGEN: 5, 6 |
3 | PMDs: 4, 5, 6, 7 (0xF0) | VNF: 8, 22 | DUT: eno5, eno6 TGEN: 5, 6 |
4 | PMDs: 4,5,22,23 (0xC00030) | VNF: 8,9 | DUT: eno5, ens801f2 TGEN: 5, 7 |
5 | PMDs: 4,5, 22, 23 (0xC00030) | VNF: 24, 25 | DUT: eno5, ens801f2 TGEN: 5, 7 |
6 | PMDs: 4, 5, 22, 23 (0xC00030) | VNF: 8, 24 | DUT: eno5, ens801f2 TGEN: 5, 7 |
7 | PMDs: 22, 23, 24, 25 (0x3C00000) | VNF: 26, 27 | DUT: ens801f2, ens802f3 TGEN: 7, 8 |
8 | PMDs: 22, 23, 24, 25 (0x3C00000) | VNF: 4, 5 | DUT: ens801f2, ens802f3 TGEN: 7, 8 |
9 | PMDs: 22, 23, 24, 25 (0x3C00000) | VNF: 4, 26 | DUT: ens801f2, ens802f3 TGEN: 7, 8 |
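In the PVP scenarios, the VNF column lists the host cores on which the guest's vCPU threads run. VSPERF's guest configuration normally handles this pinning; a rough, illustrative way to verify or override it manually on the DUT (core ids 8 and 9 taken from Scenario 1):
# show each qemu thread and the CPU it last ran on (PSR column)
ps -eLo pid,tid,psr,comm | grep -i qemu
# pin all qemu threads of the guest to cores 8 and 9
for tid in $(ps -eLo tid,comm | awk '/qemu/ {print $1}'); do taskset -cp 8,9 "$tid"; done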
Results: V2V
RFC2544 Throughput Test Results
RFC2544 With Loss Verification Throughput Test Results
Continuous Throughput Test Results
Results: P2P
RFC2544 Throughput Test Results
Continuous Throughput Test Results (Max Received Frame Rate at 100% of Line rate offered load)
Results: PVP
RFC2544 Throughput Test Results
Continuous Throughput Test Results (Max Received Frame Rate at 100% of Line rate offered load)
PVP Latency Results
Inferences
Theme: what was expected and what was unexpected.
V2V:
- Performance differences can be seen for packet sizes up to 1024 bytes.
- A single core serving multiple interfaces is worse than having a core on the other NUMA node serve the interfaces; this pattern is also seen in the other (P2P and PVP) scenarios.
- RFC2544 with Loss Verification is more consistent across runs than RFC2544 without loss verification.
P2P:
- Only the smaller packet sizes (64 and 128 bytes) matter; above 128 bytes the throughput performance remains similar across scenarios.
- Scenarios 2 and 7 were expected to be the worst cases, with both PMD cores running on a different NUMA node than the NICs. As expected, performance is consistently low for both.
- The interesting cases are Scenarios 3 and 9, where a single PMD core ends up serving both NICs. This results in even poorer performance than Scenarios 2 and 7.
- Scenarios 1, 6, and 8 can be seen as the good cases, where each NIC is served by its own PMD core.
- Scenarios 4 and 5, where one NIC is served by a PMD core on the same NUMA node and the other NIC by a PMD core on a different NUMA node, are average cases: lower performance than 1, 6, and 8, but not as low as 2, 3, 7, and 9.
- There is no difference in performance between the continuous and RFC2544 throughput traffic tests.
PVP:
Note: In these scenarios we ensure there is always at least one PMD mapped to the NUMA node to which a physical NIC is attached; that is, the situation of P2P Scenarios 2 and 7 does not occur here.
- Continuous traffic results are more consistent across runs than the RFC2544 throughput test results.
- The inconsistency across runs in the RFC2544 cases can be explained by the way the binary-search algorithm works; this can be used to argue for the importance of an adaptive RFC2544 binary-search algorithm in virtualized environments.
- Due to cross-NUMA traffic flow, Scenarios 2, 3, and 8 perform worse than the other scenarios, as expected.
- When the NICs are attached to both NUMA nodes, with PMD cores also present on both, performance is similar regardless of where the VNF cores are placed; Scenarios 4, 5, and 6 represent these cases. Among them, Scenario 6 is relatively poorer because its VNF cores are split across NUMA nodes and chances are that only one of them is used effectively.
- Scenarios 1, 7, and 9 are the best cases, with minimal to no cross-NUMA effects.
Generic:
- Cross-NUMA instantiation is a very realistic scenario. If we seek more realism, we might add a stressor load to a few of the interesting scenarios; this might amplify the effects of cross-NUMA deployment.
Observations
V2V Scenarios OVS-PMD and Interfaces (virtual) Mappings
Scenarios | Mappings |
---|---|
Virtual Interfaces | Bridge trex_br |
Scenario-1 | pmd thread numa_id 0 core_id 4: |
Scenario-2 | pmd thread numa_id 1 core_id 22: |
Scenario-3 | pmd thread numa_id 0 core_id 4: |
PVP Scenarios OVS-PMD and Interfaces (physical and virtual) mappings
Scenario | Mappings |
---|---|
1/2/3 | pmd thread numa_id 0 core_id 4: |
4 | pmd thread numa_id 0 core_id 4: pmd thread numa_id 1 core_id 23: |
5 | pmd thread numa_id 0 core_id 4: pmd thread numa_id 1 core_id 22: |
6 | pmd thread numa_id 0 core_id 4: pmd thread numa_id 1 core_id 23: |
7/8/9 | pmd thread numa_id 1 core_id 22: |
P2P Scenarios OVS-PMDs and Physical-Interface Mappings
Scenario | Mappings |
---|---|
1 | pmd thread numa_id 0 core_id 4: |
2 | pmd thread numa_id 1 core_id 22: |
3 | pmd thread numa_id 0 core_id 4: |
4 | pmd thread numa_id 0 core_id 4: |
5 | pmd thread numa_id 1 core_id 22: |
6 | pmd thread numa_id 0 core_id 4: |
7 | pmd thread numa_id 0 core_id 4: |
8 | pmd thread numa_id 1 core_id 22: |
9 | pmd thread numa_id 0 core_id 4: pmd thread numa_id 1 core_id 22: |
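The mapping entries in the tables above are taken from the OVS PMD-to-Rx-queue report; the same view can be reproduced on a running DUT with:
# per-PMD port and rx-queue assignments (source of the "pmd thread numa_id ... core_id ..." lines)
ovs-appctl dpif-netdev/pmd-rxq-show
# per-PMD usage statistics, useful for spotting an idle PMD on the wrong NUMA node
ovs-appctl dpif-netdev/pmd-stats-show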
Possible Variations
- Increase the Number of CPUs to 4 for the VNF.
- Phy2phy case (no VNF).
- Try different forwarding VNF
- Different Virtual Switch (VPP)
- RxQ Affinity.
Summary of Key Results and Points of Learning
1. Performance degradation due to cross-NUMA-node instantiation of NIC, vSwitch, and VNF can range from 50-60% for smaller packet sizes (64, 128, 256 bytes) down to 0-20% for larger packet sizes (> 256 bytes).
2. The worst performance was observed with the PVP setup, in scenarios where all PMD cores and NICs are on the same NUMA node but the VNF cores are spread across NUMA nodes. Hence, VNF cores are best allocated within a single NUMA node. If the VIM prevents VNF instantiation across multiple NUMA nodes, this issue is effectively avoided. However, at the time of this testing, K8s is believed to be NUMA-node agnostic when placing Pods, both in its normal mode and when using DPDK and SR-IOV, which makes this testing more relevant until the situation changes.
3. Variations in CPU assignment under the P2P setup have no effect on performance for packet sizes above 128 bytes. However, the V2V setups show performance differences even at the larger packet sizes of 512 and 1024 bytes.
4. Continuous traffic tests and RFC2544 Throughput tests using Binary Search with Loss Verification provide more consistent results across multiple runs. Hence, these methods should be preferred over legacy RFC2544 methods using plain binary-search or linear-search algorithms.
5. A single PMD core serving multiple interfaces is worse than the cross-NUMA-node performance degradation, so it is better to avoid such configurations. For example, if both physical NICs are assigned to NUMA node 0 (core ids 0-21), then configuration-a below will lead to poorer performance than configuration-b.
Configuration-a
pmd thread numa_id 0 core_id 4:
isolated : false
port: dpdk0 queue-id: 0
port: dpdk1 queue-id: 0
pmd thread numa_id 1 core_id 22:
isolated : false
Configuration-b
pmd thread numa_id 1 core_id 22:
isolated : false
port: dpdk0 queue-id: 0
pmd thread numa_id 1 core_id 23:
isolated : false
port: dpdk1 queue-id: 0
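As an alternative to changing only the mask, the Rx queues can be pinned to specific PMD cores so that the configuration-b layout is produced explicitly; this is the "RxQ Affinity" item listed under Possible Variations. A sketch, assuming the port names dpdk0 and dpdk1 shown above and that cores 22 and 23 are present in pmd-cpu-mask:
# pin dpdk0's queue 0 to core 22 and dpdk1's queue 0 to core 23
ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:22"
ovs-vsctl set Interface dpdk1 other_config:pmd-rxq-affinity="0:23"
# verify the resulting assignment
ovs-appctl dpif-netdev/pmd-rxq-show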
6. The average latencies show exactly opposite patterns under the PVP setup for the continuous traffic test and the RFC2544 throughput test (with the BSwLV search algorithm). That is, for the RFC2544 throughput test, average latency is lower for smaller packet sizes and higher for larger packet sizes, while the trend is the opposite for the continuous traffic test.
Note: For result 6, could this be the result of the continuous traffic testing filling all queues for the duration of the trial? The RFC 2544 Throughput methods (and those of the present document) allow the queues to empty and the DUT to stabilize between trials.
Notes on Documentation
- Must view the log files; the qemu threads need to match the intended scenario for the VM.
- Christian created the qemu command (and documentation); check this for the VM mapping.
- SR: CT's command covers only the host.
- The qemu command line (-smp 2, together with -numa node options) should do this and simulate two NUMA nodes; need to see how the VM sees its architecture: numactl --hardware.
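A minimal sketch of the qemu options that expose two NUMA nodes to the guest, for checking the last point above (legacy -numa syntax as used around the time of these tests; newer qemu replaces mem= with memdev=, and the disk and vhost-user options of the real VNF image are omitted):
qemu-system-x86_64 -m 2048 -smp 2,sockets=2,cores=1,threads=1 \
    -numa node,nodeid=0,cpus=0,mem=1024 \
    -numa node,nodeid=1,cpus=1,mem=1024 \
    -nographic
# inside the guest, confirm how the VM sees its topology
numactl --hardware
lscpu | grep -i numa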