...
- generic YAML (releng git repo): vswitchperf.yml
- job details (vsperf repo): build-vsperf.sh
CI JOBs
The VSPERF CI jobs are broken down into:
...
VSPERF project has a dedicated POD hosted at Intel LAB. Please check Intel POD12 and VSPERF in Intel Pharos Lab - Pod 12 for details.
DAILY JOB:
It requires a traffic generator in order to execute the performance testcases. Thus this job is executed at POD12.
...
They are executed at POD12 or at ericsson pods, as they don't require a traffic generator. POD12 is used as the primary jenkins slave, because execution at the ericsson build machines became unreliable once other projects started to use them more extensively. There appears to be a clash on resources (hugepages). An attempt was made to avoid parallel execution of VSPERF and other jobs, but it didn't help.
FAQ
Q: Why VERIFY JOB has failed and my patch got -1 for verification
...
?
A: Please check the "console output" of the failed job to find out the cause of the failure. The most common failures are:
- DPDK, OVS, QEMU or VPP can't be cloned from its repository and thus the job fails. Example of console output in that case:
    Cloning into 'dpdk'...
    error: RPC failed; result=18, HTTP code = 200
    fatal: The remote end hung up unexpectedly
    fatal: early EOF
    fatal: index-pack failed
    make[1]: *** [dpdk] Error 128
This is often a temporary issue and it is enough to re-trigger the job, e.g. by posting a comment "reverify" into the gerrit review in question. If the problem persists, please get in touch with the admins responsible for the particular server to verify that the connection to the failing site is not blocked by a firewall.
- PYLINT execution failed. Please note that all files have to pass with score "10" from pylint. Please check the console output for pylint verification details. Correct values are "OK" (i.e. score 10), "NA" (not pylint-checked code, e.g. a configuration file) or "EXCLUDED" (e.g. a python 2.7 library). In case of a pylint error, you will see a score (e.g. 9.64) and a list of detected pylint errors. It is essential to use the same version of pylint at your server. This is ensured by installing the vsperf requirements into your virtual environment via the vsperf installation scripts, or by executing "pip install -r requirements.txt" from the vsperf repository while your vsperf python virtual environment is active.
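The pylint check can be reproduced locally before pushing a patch. A minimal sketch follows; the virtualenv path "vsperfenv" and the checked module name are assumptions, adjust them to your own checkout:

```shell
# Activate the vsperf virtualenv (path is an assumption) so the pinned
# pylint version from requirements.txt is used, matching the CI.
. ./vsperfenv/bin/activate
pip install -r requirements.txt            # installs the pinned pylint
pylint vsperf 2>&1 | tee pylint.log        # module name is an assumption
# a file passes verification only with a perfect score:
grep -q "rated at 10\.00/10" pylint.log && echo "OK" || echo "FIX NEEDED"
```

Running this against every file touched by the patch avoids a -1 from the VERIFY job for score reasons.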
- Jenkins slave went offline during job execution. Example of console output in that case:
    FATAL: command execution failed
    java.nio.channels.ClosedChannelException
        at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
        at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
        at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 10.30.0.3/10.30.0.3:34322' is disconnected. ...
There are two common causes:
- A connection between the Jenkins server and the jenkins slave (the server where tests were physically executed) was terminated. Please check the Jenkins GUI to find out the status of the server in question, e.g. at https://build.opnfv.org/ci/computer/. If it is "offline", then it is always good to check the status of other servers in the same lab. For example, in case Intel POD12 is offline and other intel PODxx are offline too, then it is some generic connection issue between the Intel and OPNFV labs. If such a problem persists, you should raise an INFRA ticket (if it is not there already). On the other hand, if the server is online ("idle"), then it was a temporary problem and the job can be re-triggered again. In case a particular jenkins slave is offline for a long duration, contact the responsible administrator for help (in case of Intel-POD12, the vsperf community can reboot it or access a console over the web).
- A server was rebooted. It means that shortly (up to 10 minutes) after the job failure the server is up and running again (the status "idle" is visible in the jenkins GUI). In that case, re-trigger the job. In case the job fails again with another reboot, go ahead with inspection of the "console output". In the past, reboots were observed at Intel-POD12 during execution of OVS Vanilla testcases. If this is the case, please login at Intel-POD12 and check which kernel version is running. In case the CentOS specific kernel 3.10.xxx is active (uname -a), you should update grub to use the kernel 4.4 installed from the epel repo. As of now (Apr 2018), recent versions of OVS and especially the openvswitch.ko kernel module have issues with recent modifications of the 3.10.xxx kernels modified by RHT. Thus kernel 4.4 is used by default at Intel-POD12; however, in case of an OS update (yum update), the default kernel can be changed and selected by default by GRUB. It is often enough to update the grub config to use kernel 4.4 by default and reboot the server. In case regular reboots are observed at other PODs (e.g. ericsson), you should get in touch with the responsible admins. Hint: in case the default version of the tool causing the reboot (e.g. OVS) was changed, you could try to push a (temporary?) patch to gerrit with an older or newer version of the tools to find out which version is "compatible" with the OS at the given jenkins slave. This information will be helpful during debugging and discussions with the responsible administrators.
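Switching the default kernel back to 4.4 can be done along these lines. This is a sketch for CentOS 7 with GRUB2; the kernel version string and the entry index shown are assumptions, so check the output of the listing step on the actual machine first:

```shell
# Confirm which kernel is currently running (3.10.xxx indicates the
# problematic RHT-modified kernel described above):
uname -r
# List the boot entries with their indices; titles are quoted with
# single quotes in grub2.cfg, so split on that character:
awk -F"'" '/^menuentry/ {print i++ " : " $2}' /etc/grub2.cfg
# Pick the index of the 4.4.x entry from the listing above
# (index 1 here is just an example, not the real value):
grub2-set-default 1
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
```

After the reboot, `uname -r` should report the 4.4.x kernel; remember that a later `yum update` can reorder the entries again.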
Q: Why DAILY JOB has failed
...
?
A: Please first check the answer to "VERIFY JOB has failed" above for causes common to all jobs. Note that in case of the DAILY job, Intel POD12 is used as the jenkins slave (the job executor) and the VSPERF community does the OS administration of this server themselves, so you can login and investigate issues directly. A daily job can be re-triggered from the jenkins GUI, but only when the jenkins user is logged in and has the appropriate privileges; get in touch with the VSPERF PTL and the Linux Foundation helpdesk in order to obtain these privileges. If none of the generic issues above occurred, the following DAILY-job-specific issues are possible:
- If all testcases listed in the console output have status "FAILED", then it is possible that the Ixia traffic generator is causing issues. This can be detected in the "console output". In case all TCs (OVS with DPDK, Vanilla OVS and VPP) have FAILED, you should look for the part of the Ixia execution starting with the log entry "Connecting to IxNetwork machine...". If there are messages like "Can't connect to socket", "couldn't open socket: host is unreachable" or "Failed to parse output", then you should check the following:
- that the VM where the IxNetwork GUI is executed is up & running (try a remote desktop connection to 10.10.120.6). If it is not reachable, file a jira ticket to INFRA
- after logging into the VM (above) as the "vsperf_ci" user, check that IxNetwork is up & running; if not (e.g. due to a VM reboot), start it via the appropriate icon and verify that it runs at the correct port (see the Intel POD12 page for port assignment; in case of the CI DAILY job it should be port 9126)
- after starting IxNetwork (above), check that a connection to the IXIA chassis (as of now 10.10.50.6) is possible; if not, file a jira ticket to INFRA. Note: if you don't know how to check that the chassis is up, you could try to execute a vsperf TC (e.g. manually from Intel POD3) and check "ports" after the TC finishes (i.e. fails). In case the ports can't be reserved (they are red), inspect their details. If you see unsuccessful connections to the chassis (with many retries), then the chassis is not reachable from the IxNetwork GUI app (i.e. the chassis is down or the connection between the VM and the chassis is broken).
- after the connection to the chassis is verified and OK (above), check the IxNetwork GUI for any error indication. For example, after TC execution there can be "license" related errors listed in the Ixia log file (visible in the GUI) or error and warning indications (see the icons in the IxNetwork GUI status bar). In case of license related errors (licenses expire every 6 months), file a jira ticket to INFRA.
- if all of the above is correct but "Failed to parse output" is still observed, then the shared samba folder between the VM and node3 should be verified. Please check the vsperf documentation for details about the shared folder between the Win machine running the IxNetwork GUI and the DUT. The folder should be properly mounted at node3 and its content must be readable by the jenkins user account.
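The checklist above can be partially automated from a shell on the DUT. A sketch follows; the IPs and port are the POD12 assignments mentioned on this page and the mount details are assumptions, so verify both before relying on the output:

```shell
# Reachability of the VM running the IxNetwork GUI:
ping -c 3 10.10.120.6
# Is the IxNetwork API port (9126 for the CI DAILY job) accepting connections?
nc -z -w 5 10.10.120.6 9126 && echo "IxNetwork port open"
# Reachability of the Ixia chassis:
ping -c 3 10.10.50.6
# Is the shared samba folder mounted (cifs) and readable?
mount | grep -qi cifs && echo "samba share mounted" || echo "share NOT mounted"
```

Any failing step maps directly to one of the bullets above (VM down, IxNetwork not running, chassis unreachable, or broken shared folder).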
- if only a few testcases have status FAILED (e.g. pvvp_tput or pvvp_back2back), then it can be caused by a TC timeout. It has been observed several times that the server used for daily job execution gets slower during the daily job run. When the same TC is executed manually it mostly works correctly, but execution of many TCs at once can slow down the machine. The root cause is not clear. It can be caused by a lack of resources (check that enough hugepages are allocated, but at the same time that enough memory is available at both nodes to execute processes). There is also a suspicion that the java application used as a node health check by OPNFV Jenkins (slave-agent.jnlp) consumes a lot of resources and slows the machine down. The same application was executed at Intel-POD3 without any issues; however, at Intel-POD12 it sometimes runs many threads and consumes a significant amount of memory. This might be caused by connection issues between the OPNFV and Intel labs. If that happens, please use "monit" to restart the "jenkins" service or "systemctl" to restart "monit".
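The resource check and the restart described above can be sketched as follows; the service names "jenkins" and "monit" follow this page's description of Intel-POD12 and are assumptions for any other slave:

```shell
# Is there hugepage headroom left? (both Total and Free matter here)
awk '/HugePages_(Total|Free)/ {print $1, $2}' /proc/meminfo
# Overall memory headroom on the node:
free -g
# If the slave-agent.jnlp java process is hogging resources,
# restart the jenkins health-check service supervised by monit:
monit restart jenkins
# If monit itself is stuck, restart it via systemd:
systemctl restart monit
```

If HugePages_Free is close to zero while testcases are queued, the hugepage clash described above is the more likely culprit than the java agent.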
Q: Why DAILY JOB takes too long to finish (e.g. 1 or 2 days)?
A: This is caused by the VM where the IxNetwork GUI application is executed. In the past, VSPERF used Intel-POD3, where execution of the DAILY job was stable. It means that performance results were stable among daily job executions and the execution always took about 12 hours. After the move to a different Intel LAB and to Intel-POD12, the performance started to fluctuate and the daily job execution takes longer with each run. Several attempts to fix these issues were made, but the issues still persist. In order to shorten the DAILY job execution, it is required to login into the VM as the "vsperf_ci" user via remote desktop and restart the IxNetwork GUI application.
Ideas
Configure 2nd jenkins slave for execution of VSPERF JOBs.
There are several nodes available at Intel-POD12 (see Intel POD12). Currently there are two sandboxes; the second sandbox, using node1 and node2, was created recently. It would be possible to reconfigure the 2nd sandbox to be used as another (or even two) jenkins slaves. This would speed up execution of VSPERF jobs. However, the releng team must be consulted regarding proper naming, as two different jenkins slaves would be hosted at the same Intel POD12.
Reconfigure VERIFY & MERGE jobs to utilize ericsson PODs more often.
In the past, VERIFY & MERGE jobs were executed at the opnfv-build-ubuntu group of slaves, which consists of the ericsson-build3 and ericsson-build4 machines. The execution was reliable at both of these servers for several months, but later it started to fail. There were several issues, some of them related to hugepages allocation and usage and to VPP. In case of VPP, it happened several times that it stopped working at all at one of the ericsson servers. Responsible admins were asked for help, but they were not able to find a root cause. The only solution was to reboot the affected server, and it worked for some time again. There is a suspicion that both the hugepages and VPP issues are caused by parallel execution of jobs for vsperf and other projects. As debugging such a race condition at a server without any access is hardly possible, both VERIFY & MERGE jobs are primarily executed at Intel-POD12. The idea was to execute VERIFY & MERGE jobs at POD12 when it is not occupied by the DAILY job, and otherwise to move to an ericsson POD. However, the current YAML file definition doesn't work that way: it switches to an ericsson POD only in case Intel POD12 is offline. Releng engineers can help us with the YAML file definition to achieve better utilization of the available PODs.
Minimize impact of jenkins health check application
Consider pinning the jenkins health check application to the second numa slot, which is not used for performance test execution. Even better would be to move that application to a jumphost. However, one would have to solve how to execute vsperf "remotely" and how to configure multiple slaves at the same pod (it is probably not possible to run multiple healthchecks at the same machine; maybe a container would help).
This won't be needed if we configure more jenkins slaves at Intel POD12.