Table of contents:

Introduction

VSPERF CI consists of several jobs, which are integrated into OPNFV infrastructure. It means that jobs are triggered by OPNFV jenkins (daily job) or OPNFV gerrit (verify and merge jobs). The comprehensive list of jobs, their status and history is visible in VSPERF specific dashboard at https://build.opnfv.org/ci/view/vswitchperf/

There are two versions of each job, one is created for current stable branch and second for the master branch.

In case of the daily job, which executes a set of performance tests, the results are available also in the graphical form at VSPERF CI Results and test results, reports and logs are stored inside OPNFV artifacts at http://artifacts.opnfv.org/logs_vswitchperf_intel-pod12.html.

OPNFV Jenkins is operated by releng team and the configuration of jobs is stored in releng git repository. VSPERF specific part can be found at YAML file vswitchperf.yml. For more info on writing and using jjbs see Jenkins Wow.

In order to have more flexible way of job configuration, VSPERF project stored detailed job configuration in VSPERF repository into build-vsperf.sh script, which is invoked by generic YAML job configuration above.

Links summary:

CI Dashboard: https://build.opnfv.org/ci/view/vswitchperf/

Daily job results:

VSPERF output: check "console output" of selected job type at jenkins, e.g. for a daily job https://build.opnfv.org/ci/view/vswitchperf/job/vswitchperf-daily-master/lastSuccessfulBuild/consoleFull
report, results & logs: http://artifacts.opnfv.org/logs_vswitchperf_intel-pod12.html
graphs: VSPERF CI Results

Job definition scripts:

generic YAML (releng git repo): vswitchperf.yml
job details (vsperf repo): build-vsperf.sh

JOBs:

The VSPERF CI jobs are broken down into:

Daily job:
1. It is executed at INTEL POD.
2. Requires a traffic generator (Ixia)
3. Runs everyday in case that new change was merged into particular branch since the last daily job execution; Daily job runs about 14 hours, but it can take over a day in case that VM running IxNetwork is slow. Please see FAQ section below for details.
4. A set of performance tests is executed for OVS with DPDK support, Vanilla OVS, VPP and SRIOV. Ixia traffic generator is used to generate RFC2544 Throughput and Back2Back traffic.
Merge job (similar to verify job):
1. It is executed at INTEL POD or at Ericsson PODs.
2. Does not require a traffic generator.
3. Runs whenever patches are merged to the particular branch.
4. Runs a basic set of integration testcases for OVS with DPDK support, Vanilla OVS and VPP.
5. in case that documentation files were modified, then documentation is built.
Verify job (similar to merge job):
1. It is executed at INTEL POD or at Ericsson PODs.
2. Does not require a traffic generator.
3. Runs every time a patch is pushed to gerrit. On success, the patch will be marked as verified (+1 for verification).
4. Runs a basic set of integration testcases for OVS with DPDK support, Vanilla OVS and VPP.
5. in case that documentation files were modified, then documentation is built

NOTE: The list of testcases to be executed for particular job type is configured inside build-vsperf.sh. Please refer to configuration options TESTCASES_* and TESTPARAM_* for additional details.

Where do VSPERF CI jobs run?

VSPERF project has a dedicated POD hosted at Intel LAB. Please check Intel POD12 for details.

DAILY JOB:

It requires a traffic generator in order to execute the performance testcases. Thus this job is executed at POD12.

The status of Intel POD12 is visible in jenkins at: https://build.opnfv.org/ci/computer/intel-pod12/

VERIFY and MERGE JOB:

They are executed at POD12 or at ericsson pods as they don't require a traffic generator. POD12 is used as a primary jenkins slave, because execution at ericsson build machines was not reliable since other projects start to use it more extensively. It seems, that there is a clash on resources (hugepages). There was a attempt to avoid a parallel execution of VSPERF and other jobs, but it didn't help.

FAQ:

Q: VEFIY JOB has failed and my patch got -1 for verification (a red cross in "V" column in a patch summar). How can I fix it?

A: Please check "console output" of failed job to find out a cause of failure. Them most common failures are:

DPDK, OVS, QEMU or VPP can't be cloned from it's repository and thus job fails. Example of console output in that case:
```
Cloning into 'dpdk'...
error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
make[1]: *** [dpdk] Error 128
```
This is often a temporary case and it is enough to re-trigger the job, e.g. by inserting a comment "reverify" into gerrit review in question. If problem will persist, please get in touch with admins responsible for particular server to verify, that connection to the failing site is not blocked by firewall.
PYLINT execution failed. Please note, that all files have to pass with score "10" from pylint. Please check a console output for pylint verification details. Correct values are "OK" (i.e. score 10), "NA" (not a pylint code, e.g. a configuration file) or EXCLUDED (e.g. python 2.7. library). In case of pylint error, you will see a score (e.g. 9.64) and a list of detected pylint errors. It is essential to use the same version of pylint at your server. This is ensured by installation of vsperf requirements into your virtual environment by vsperf installation scripts or by execution of "pip install -r requirements.txt" from vsperf repository when your vsperf python virtual environment is active.
Jenkins slave went offline during job execution. Example of a console output in that case:
```
FATAL: command execution failed
java.nio.channels.ClosedChannelException
	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 10.30.0.3/10.30.0.3:34322' is disconnected.
...
```
There are two common causes:
1. A connection between Jenkins server and jenkins slave (a server where tests where physically executed) was terminated. Please check Jenkins GUI to find out the status of server in question, e.g. at https://build.opnfv.org/ci/computer/. If it is "offline" then it is always good to check a status of other servers in the same lab. For example in case that Intel POD12 is offline and other intel PODxx are offline too, then it is some generic connection issue between Intel and OPNFV LAB. If such problem persists, then you should rise a INFRA ticket (if it is not there already). On the other hand if server is online ("idle"), then it was a temporary problem and job can be re-triggered again. In case that particular jenkins slave is offline for a long duration, then contact responsible administrator (in case of Intel-POD12 vsperf community can reboot it or access a console over web) for help.
2. A server was rebooted. It means, that shortly (up to 10 minutes) after the job failure the server is up and running (a status "idle" is visible in jenkins GUI). In that case, re-trigger the job. In case that job fails again with another reboot, then go ahead with inspection of "console output". In the past a reboots were observed at Intel-POD12 during execution of OVS Vanilla testcases. If this is the case, then please login at Intel-POD12 and check what kernel version is running. In case that CentOS specific kernel 3.10.xxx is active (uname -a), then you should update grub to use kernel 4.4. installed from epel repo. As of now (Apr 2018), recent versions of OVS and especially openvswitch.ko kernel module are having issues with recent modifications of 3.10.xxx kernels modified by RHT. Thus kernel 4.4. is used by default at Intel-POD12, however in case of OS update (yum update), the default kernel can be updated and selected by default by GRUB. It is often enough to update grub config to use kernel 4.4. by default and reboot the server. In case that regular reboots are observed at other PODs (e.g. ericsson), then you should get in touch with responsible admins. Hint: In case that default version of tool causing reboot (e.g. OVS) was changed, then you could try to push a (temporary?) patch to gerrit with older or newer version of tools to find out the version, which is "compatible" with OS at given jenkins slave. This information will be helpful during debugging and discussion with responsible administrators.

Q: DAILY JOB has failed.

A: Please firstly check answer to "VERIFY JOB has failed" above for causes common for all jobs. Please note then in case of DAILY job, INTEL POD12 is used as a jenkins slave (a job executor) and VSPERF community does OS administration of this server themselves. So you can login and investigate issues directly. In case of daily job, it is possible to re-trigger it from jenkins GUI, but only in case that jenkins user is logged in and he has appropriate privileges. Get in touch with VSPERF PTL and Linux Fundation helpdesk in order to get these privileges. In case that generic issues above didn't occur, then following DAILY job specific issues can occur:

If all testcases listed in console output have status "FAILED", then it is possible that Ixia traffic generator is causing issues. This can be detected in "console output". In case that all TCs (OVS with DPDK, Vanilla OVS and VPP) have FALIED, then you should look for part of Ixia execution starting with log entry "Connecting to IxNetwork machine...". If there are messages like "Can't connect to socket", "couldn't open socket: host is unreachable" or "Failed to parse output", then you should check following:
1. that VM where IxNetwork GUI is executed is up & running (try remote desktop connection to 10.10.120.6). If it is not reachable, then you should file a jira ticket to INFRA
2. after logging into VM (above) as "vsperf_ci" user, please check that IxNetwork is up & running; If not (e.g. due to VM reboot) please execute it by appropriate icon and verify that it runs at correct port (see Intel POD12 page for port assignment. In case of CI DAILY job it should be port 9126)
3. after execution of IxNetwork (above), please check, that connection to IXIA chassis (as of now 10.10.50.6 is possible); If not, please file a jira ticket to INFRA. Note: if don't know, how to check that chassis is up, then you could try to execute a vsperf TC (e.g. manually from Intel POD3) and check "ports" after TC will finish (i.e. fail). In case that ports can't be reserved (they are red) you could inspect their details. In case that you'll see unsuccessful connections to chassis (with many retries), then chassis is not reachable from IxNetwork GUI app (i.e. chassis down or connection between VM and chassis is broken).
4. after connection to chassis is verified and OK (above), you should check IxNetwork GUI for any error indication. For example after TC execution, there can be "license" related errors listed in Ixia log file (visible in GUI) or errors and warnings indications (see icons at IxNetwork GUI status bar). In case of license related errors (they expires every 6 months), please file a jira ticket to INFRA.
if only a few testcases have status FAILED (e.g. pvvp_tput or pvvp_back2back), then it can be caused by TC timeout. It has been observed several times, that server used for daily job execution gets slower during execution of daily job. In case that the same TC is executed manually it mostly works correctly, but execution of a lot of TCs at once can slow down the machine. It is not clear, what is the root cause. It can be caused by the lack or resources (please check, that enough hugepages are allocated, but at the same time enough memory is available at both nodes to execute processes). There is also a suspicion, that java application, which is used as node health check by OPNFV Jenkins (slave-agent.jnlp) is consuming a lot of resources and causing a machine slowness. The same application was executed at Intel-POD3 without any issues. However at Intel-POD12, it sometimes runs many threads and consumes significant amount of memory. It might be caused by some connection issues between OPNFV and Intel lab. In case that, that it happens, please use "monit" to restart "jenkins" servis or "systemctl" to restart "monit".

Q: DAILY JOB takes too long to finish (e.g. 1 or 2 days)

A: This is caused by VM where IxNetwork GUI application is executed. In the past, VSPERF used Intel-POD3, where execution of DAILY job was stable. It means, that performance results were stable among Daily job executions and the execution always took about 12 hours. After the move to a different Intel LAB and to Intel-POD12, the performance started to fluctuate and daily job execution takes longer time by each execution. Several attempts to fix these issues were made, but issues still persists. In order to shorten DAILY job execution, it is required to login into VM as "vsperf_ci" user via remote desktop and to restart IxNetwork GUI application.