Executive Summary Table of Contents
Key Points
Telco Requirements And Definitions
Key terminology for Service Availability in NFV / networking services context
Service Availability and Continuity Targets and their Impacts on Infrastructure services
Fault Management Cycle
Fault Management Cycle Timeline
Infrastructure Services Relative Criticality
Different infrastructure services can have different impacts on the reliability of the infrastructure as a whole. Service components such as network switches can have a disproportionately high impact on the reliability of the entire infrastructure due to their connective nature. Conversely, traditionally emphasized components such as compute nodes mostly affect the VMs / Containers running on them.
The following table provides a high level view of the categories of the services / entities in the NFVI and VIM sublayers of the system SW stack. This high level analysis is based on the expected risk, which is determined based on entities (guest VMs as a proxy) at risk and the fault activation probability from the guest's perspective.
Service Availability and Upgrades
Upgrade Process
Container Upgrades
NFV Enhanced HA Architecture
Addressing Single Point of Failure
Fast Fault Detection and Notification Framework
Telemetry Collection
RT Fault Mgmt Agent (RFE x)
Background
RT Fault Mgmt Bus (RFE x)
Anchor | ||||
---|---|---|---|---|
|
...
High Level Availability Allocation Calculation; all causes | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Atarget | DPM-Tot (s/y) | %HW | %ISW (≤VIM) | %ISW (>VIM) | %APP (SW) | %OPER | %TOT | HW DPM | ISW (≤VIM) DPM | ISW (>VIM) DPM | APPSW DPM | OPER DPM |
|
0.99999 | 315.3599999986 | 10 | 10 | 10 | 50 | 20 | 100 | 31.54 | 31.54 | 31.54 | 157.68 | 63.07 | DPM (sec/NE/y) |
|
|
|
|
|
|
|
| 0.999999 | 0.999999 | 0.999999 | 0.999995 | 0.999998 | A |
5 |
|
|
|
|
|
|
| 6.00 | 6.00 | 6.00 | 5.30 | 5.70 | #NINES |
0.999999 | 31.54 | 10 | 10 | 10 | 50 | 20 | 100 | 3.15 | 3.15 | 3.15 | 15.77 | 6.31 | DPM (sec/NE/y) |
|
|
|
|
|
|
|
| 0.9999999 | 0.9999999 | 0.9999999 | 0.9999995 | 0.9999998 | A |
6 |
|
|
|
|
|
|
| 7.00 | 7.00 | 7.00 | 6.30 | 6.70 | #NINES |
...
1.) ETSI REL-001 statements that goal is to achieve Service Level that are equivalent or better than service levels in the physical systems, and especially the REL-001 problem statement in section 4.1.: "Whereas in the IT domain outages lasting seconds are tolerable and the service user typically initiates retries, in the telecom domain there is an underlying service expectation that outages will be below recognizable level (i.e. milliseconds), and service remediation (recovery) is performed automatically."
2.) ETSI IFA requirements to support anti-affinity policies in context of allocation of virtualized network resources (which is prerequisite for being able to support redundancy mechanisms based on path diversity - i.e. anti-affinitized backup paths)
3.) Examples of SLAs of end customer services, which require same order of magnitude failover performance (60-100ms, depending on service); note that as SLAs are essentially components of bilateral contracts, they are seldom made public. Here is one example of public document from US Government's common SLA requirements, which includes Service Availability and various KPIs, including failover times for select services: GSA Networx SLAs
4.) Failover time requirements in 5G crosshaul project - 4 out of 5 use cases states "time of recovery after failure" of 50 ms or less, while remaining one has 100ms target: 5G CrossHaul.
5.) An aggregated responses of Red Hat commissioned Heavy Reading NFV operator survey of 128 service providers, "Telco Requirements for NFVI", November 2016, which had a specific question on the failover time. Associated graph on response distribution is replicated below for convenience; 40% of respondents were asking for 50ms. In addition, the associated clarification text on the responses stated "Those already executing their NFV strategies were much more likely than those who are not to say that recovery time should be within 50 ms"
To summarize, the high level engineering targets derived from the application level requirements that we use as service independent availability metrics for the NFVI infrastructure, as well as to establish target values for parameters to be measured on testing are:
...
Anchor | ||||
---|---|---|---|---|
|
The OPNFV Barometer project is focused on telemetry and event collection and transport in a cloud environment. Most of the work to date has focused on defining what should be collected and adding improvements to collectd which Barometer has chosen as the telemetry collection agent. The Barometer group is currently defining changes for collectd to allow for sample rates to be higher than transport rates as well as the addition of additional thresholding capabilities. In addition, collectd has been been enhanced such that is possible to write both polled plugins and event-based plugins. Part of the prototyping phase of the Service Assurance epic will be to evaluate if collectd can be used as the basis for the real-time fault management agent described in the following section.
...