Telecommunication service providers (Telcos, also referred to as Communications Service Providers or CoSPs) and enterprises are following an industry transformation toward a cloud-based IT infrastructure to reduce cost by leveraging a shared execution environment based on Common Off The Shelf (COTS) servers. CoSPs have the additional burden of moving from dedicated, fixed-function hardware with a vertically integrated software stack to COTS hardware on which independent software applications must integrate seamlessly, inter-operate, and meet the required stringent Service Level Agreements (SLAs), thereby ensuring the required Service Assurance.
The requirement for Highly Available (HA) applications and services does not change during this migration, adding the burden that the target cloud infrastructure must provide equivalent mechanisms for High Availability. For example, CoSPs will still require 5NINES (99.999%) availability for services; a feature that Telco customers rely upon. In addition, enterprises often rely on applications where a single second of unavailability could incur millions of dollars in lost revenue. For many large service providers and enterprises the public cloud is not an option due to cost or security concerns. A private cloud infrastructure provider that can deliver the level of high availability expected by these large customers stands to acquire a large and loyal following.
- Provide traditional levels of availability (5NINES) in a cloud environment.
- Provide a monitoring and event detection framework that distributes fault information to various listeners with very low latency (< tens of milliseconds).
- Provide a local remediation controller that can react quickly (< tens of milliseconds) to host faults and communicate the filtered fault information to local VMs or containers for quick reaction by the application / service (a sketch of such a fault message appears after this list).
- For a cloud infrastructure, the network assumes a larger role in availability than individual compute hosts. Provide fast detection and localization (host vs. switch) of network failures through interface monitoring, path fault detection, and interaction with the underlying software-defined network (SDN) controller (for example, ODL).
- Provide availability mechanisms for both current virtualization environments and future containerization environments orchestrated by container management applications such as Kubernetes.
- Leverage the fast fault relay platform to provide a common transport mechanism for faults, events, telemetry, and logging.
- A high-availability framework cannot have humans in the remediation and recovery process if 5NINES of availability is the goal. Logging information is provided for detailed post-mortem analysis of infrastructure problems.
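To make the "filtered fault information" mentioned above concrete, the following is a purely illustrative sketch (not a format defined by this document) of the kind of payload a host-local remediation controller might relay to guest VMs or containers; all field names and values are hypothetical.

```python
import json
import time

# Hypothetical fault notification that a host-level remediation controller might relay to
# local VMs/containers after filtering raw monitoring events; all fields are illustrative.
fault_event = {
    "event_type": "fault",
    "source_host": "compute-0",      # host where the fault was detected (example value)
    "component": "nic/eth2",         # affected component (example value)
    "severity": "critical",
    "detected_at": time.time(),      # detection timestamp, useful for tracking end-to-end latency
    "suggested_action": "failover",  # hint for the application / service HA logic
}
print(json.dumps(fault_event, indent=2))
```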
The following sections briefly introduce definitions and terminology that will be used in the remainder of the document.
...
Network Element Outage - an event where a network element fails to provide a service as a result of a failure for a period longer than the minimum outage threshold. A network element outage will not necessarily directly cause a service outage if redundant network element(s) are available to perform the associated function(s); hence the necessity of high availability.
Network Element Availability - the probability that a network element (either PNF or VNF) can provide service at any instant. NE availability is an average availability for the total population of NEs of the same type. Note that the statistical nature of this metric implies that some instances of the network element in the total population of NEs may experience zero downtime per year, while others may experience much greater downtime. NE availability, unless otherwise stated, is typically expressed as the percentage of total time (e.g. 99.999%) that its associated service(s) is/are available.
Network Element Un-Availability - the probability that a network element (PNF or VNF) cannot provide service at any instant, i.e. U = 1 - NE Availability; for engineering purposes, the Downtime Performance Measurement (DPM) is generally used instead of the time-fractional form of the target, as it is a form that can be used for setting engineering-level targets and can be tested against.
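As a worked example (assuming a 365-day year of 31,536,000 seconds): DPM (s/NE/y) = (1 - A) × 31,536,000, so an availability target of 99.999% corresponds to (1 - 0.99999) × 31,536,000 ≈ 315.36 seconds of downtime per NE per year, which is the DPM-Tot value used in the allocation table below.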
Network Element Reliability - the probability that a network element (PNF or VNF) will continue to operate for a specified period of time. Reliability can also be expressed in terms of Mean Time Between Failures (MTBF) or outage frequency (e.g. Outage Frequency Measurement (OFM), outages per year).
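For a repairable network element, availability and MTBF are related through the Mean Time To Repair (MTTR, introduced here only for illustration) by the standard expression A = MTBF / (MTBF + MTTR); shortening either failure detection or repair/remediation time therefore directly improves availability.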
Failure - an event where an entity (e.g. system, equipment, subsystem or component) fails to perform its intended function. A failure does not necessarily directly result in a Service Outage, depending on aspects such as redundancy and usage state at the time of the failure event.
Failure Mode - a mechanism or category of causes that may lead to a failure. Examples include Hardware (HW) failures, Software (SW) failures, procedural errors, unmanaged overload conditions, security-related incidents, or environmental causes (power outages, cooling failures, fires, floods, storms, earthquakes, tsunamis, etc.).
...
Service Continuity - the ability of a system to maintain current services without disruption; for example, the ability to continue an established call without interruption, or to maintain an established TCP session through a session-stateful firewall. Service continuity for stateful services (which are typical in networking) requires protection of the associated state with mechanisms such as checkpointing. State protection is primarily the responsibility of the VNF implementer.
Service Outage (SO) - an event where a service provider fails to provide a service as a result of a failure for a period longer than the minimum outage threshold.
...
Reportable Service Outage - an outage of a specific service which exceeds the outage threshold associated with a specific SO reporting authority. Reportable Service Outages for network services are associated with voluntary SO reporting organizations (e.g. Quest Forum outage reporting) or regulatory reporting requirements (e.g. US FCC reporting requirements, European Community reporting requirements, and other national / regional authorities' reporting requirements). While specific requirements with respect to reportable outages vary by authority and by specific service, the associated reporting requirements are usually expressed through some combination of service impact (e.g. thousands of users affected) and outage duration. For the purposes of this document, we use 15 seconds as the RSO time-threshold target (based on the TL9000 reporting threshold). The implied objective is that, for all outages, we always set the remediation time objectives to stay well under the 15 second threshold.
...
High Level Availability Allocation Calculation; all causes

Atarget | DPM-Tot (s/y) | %HW | %ISW (≤VIM) | %ISW (>VIM) | %APP (SW) | %OPER | %TOT | HW | ISW (≤VIM) | ISW (>VIM) | APP SW | OPER | Metric |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.99999 | 315.36 | 10 | 10 | 10 | 50 | 20 | 100 | 31.54 | 31.54 | 31.54 | 157.68 | 63.07 | DPM (sec/NE/y) |
 | | | | | | | | 0.999999 | 0.999999 | 0.999999 | 0.999995 | 0.999998 | A |
5 | | | | | | | | 6.00 | 6.00 | 6.00 | 5.30 | 5.70 | #NINES |
0.999999 | 31.54 | 10 | 10 | 10 | 50 | 20 | 100 | 3.15 | 3.15 | 3.15 | 15.77 | 6.31 | DPM (sec/NE/y) |
 | | | | | | | | 0.9999999 | 0.9999999 | 0.9999999 | 0.9999995 | 0.9999998 | A |
6 | | | | | | | | 7.00 | 7.00 | 7.00 | 6.30 | 6.70 | #NINES |
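The arithmetic behind the table above is straightforward; the following is a minimal sketch (assuming a 365-day year and the allocation percentages shown in the table) that reproduces the per-category DPM, availability, and "number of nines" values.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

def allocate(a_target, allocation_pct):
    """Split the yearly downtime budget for an availability target across failure categories."""
    dpm_total = (1.0 - a_target) * SECONDS_PER_YEAR  # total downtime budget, s/NE/y
    rows = {}
    for category, pct in allocation_pct.items():
        dpm = dpm_total * pct / 100.0                # category downtime budget, s/NE/y
        availability = 1.0 - dpm / SECONDS_PER_YEAR  # equivalent category availability
        nines = -math.log10(1.0 - availability)      # "number of nines" for the category
        rows[category] = (round(dpm, 2), availability, round(nines, 2))
    return dpm_total, rows

# Allocation percentages from the table: HW 10%, ISW(<=VIM) 10%, ISW(>VIM) 10%, APP SW 50%, OPER 20%
allocation = {"HW": 10, "ISW(<=VIM)": 10, "ISW(>VIM)": 10, "APP SW": 50, "OPER": 20}
dpm_total, rows = allocate(0.99999, allocation)
print(f"DPM-Tot: {dpm_total:.2f} s/NE/y")            # ~315.36 for the 5NINES target
for name, (dpm, a, nines) in rows.items():
    print(f"{name}: DPM={dpm} s/NE/y  A={a:.7f}  #NINES={nines}")
```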
...
1.) ETSI REL-001 statements that the goal is to achieve service levels that are equivalent to or better than service levels in the physical systems, and especially the REL-001 problem statement in section 4.1: "Whereas in the IT domain outages lasting seconds are tolerable and the service user typically initiates retries, in the telecom domain there is an underlying service expectation that outages will be below recognizable level (i.e. milliseconds), and service remediation (recovery) is performed automatically."
2.) ETSI IFA requirements to support anti-affinity policies in the context of allocation of virtualized network resources (which is a prerequisite for being able to support redundancy mechanisms based on path diversity, i.e. anti-affinitized backup paths).
3.) Examples of SLAs for end-customer services, which require the same order of magnitude of failover performance (60-100 ms, depending on the service); note that as SLAs are essentially components of bilateral contracts, they are seldom made public. One example of a public document is the US Government's common SLA requirements, which include Service Availability and various KPIs, including failover times for select services: GSA Networx SLAs.
4.) Failover time requirements in the 5G CrossHaul project - 4 out of 5 use cases state a "time of recovery after failure" of 50 ms or less, while the remaining one has a 100 ms target: 5G CrossHaul.
5.) Aggregated responses from the Red Hat-commissioned Heavy Reading NFV operator survey of 128 service providers, "Telco Requirements for NFVI", November 2016, which had a specific question on failover time. The associated graph of the response distribution is replicated below for convenience; 40% of respondents were asking for 50 ms. In addition, the associated clarification text on the responses stated "Those already executing their NFV strategies were much more likely than those who are not to say that recovery time should be within 50 ms".
To summarize, the high-level engineering targets derived from the application-level requirements, which we use as service-independent availability metrics for the NFVI infrastructure and to establish target values for the parameters to be measured during testing, are:
...
The OPNFV Barometer project is focused on telemetry and event collection and transport in a cloud environment. Most of the work to date has focused on defining what should be collected and adding improvements to collectd, which Barometer has chosen as the telemetry collection agent. The Barometer group is currently defining changes to collectd that allow sample rates to be higher than transport rates, as well as additional thresholding capabilities. In addition, collectd has been enhanced so that it is possible to write both polled plugins and event-based plugins. Part of the prototyping phase of the Service Assurance epic will be to evaluate whether collectd can be used as the basis for the real-time fault management agent described in the following section.
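To illustrate the distinction between polled and event-based plugins mentioned above, the following is a minimal sketch of a collectd Python plugin (loaded via collectd's python plugin; the plugin name "sa_demo" and its behaviour are illustrative, not part of Barometer): one read callback that dispatches a polled metric and one notification callback that reacts to failure notifications.

```python
# Minimal collectd Python plugin sketch; runs inside collectd's python plugin, so the
# "collectd" module is provided by the daemon itself rather than installed separately.
import collectd

def read_callback():
    # Polled path: sample a value and dispatch it into collectd's metric pipeline.
    val = collectd.Values(plugin="sa_demo", type="gauge", type_instance="heartbeat")
    val.dispatch(values=[1])

def notification_callback(notification):
    # Event-based path: react to notifications (e.g. threshold crossings) with low latency;
    # a real fault management agent would forward these to a remediation controller.
    if notification.severity == collectd.NOTIF_FAILURE:
        collectd.warning("sa_demo: FAILURE from %s/%s: %s"
                         % (notification.host, notification.plugin, notification.message))

collectd.register_read(read_callback)
collectd.register_notification(notification_callback)
```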
...