Stock Neutron OVS – Not the Telco Cloud you were looking for?

Do I need a 3rd party SDN controller for my OpenStack based “Telco Cloud” DC?  Isn’t the default Neutron networking stack that comes with OpenStack considered “SDN”?  What benefits does a 3rd party Controller bring that would make it worth the additional time, money and effort to deploy? These are important questions that come up when a Service Provider is looking to build a cloud DC to host virtual workloads.

The reality is that a lot of Service Provider Clouds have not yet ramped up to the scale or complexity of VNF requirements where they have fully experienced the pain points that a 3rd party SDN controller solves for an OpenStack based Cloud DC. Many have started out cautiously (smartly) by virtualizing the proverbial “low hanging fruit” applications which don’t require complex, high performance virtual networking at scale. For these initial efforts the challenges of Neutron with OVS likely haven’t reared their ugly heads. Unfortunately, in the world of Telco Clouds there are still a lot of “snowflake” applications with special requirements that challenge the basic networking capabilities of OpenStack Neutron Networking with OVS. The purpose of this blog post is to delve into the benefits a 3rd party SDN Controller provides when the number and complexity of hosted virtual applications start to increase in a Telco Cloud.

What are we building?

Here are some immediate considerations that come to mind when building a Telco Cloud:

  • Are the virtual workloads you will be hosting Virtualized Network Functions (VNFs) requiring high forwarding performance, low latency and high availability?
  • Do the virtual workloads need to participate in the service provider routing domain outside the DC?
  • Are any of the tenants that need to access these virtual workloads coming from legacy service provider networks?
  • Do you need to scale to more than a couple hundred tenant virtual networks?
  • Do you need to implement complex chains of service functions as part of these virtual workloads?
  • Do you want to use a virtual network overlay to avoid configuring DC switch and router hardware with per tenant state?
  • What type of virtual network security policies do you need and where are you going to be applying them?
  • Do you need end to end visibility of where your traffic is going in the DC and real time metrics on how your virtual applications are performing?

If you answered yes to several of the questions above, then you should be taking a closer look at a 3rd party SDN Controller, and here is why…

Telco Cloud IS Different

Building a Service Provider or “Telco Cloud” is very different from your traditional virtualized DC hosting Enterprise applications. The following are a few key distinctions between Telco Cloud applications and traditional virtualized applications.

  • Legacy network interconnection – a lot of these VNFs require connectivity to the Service Provider WAN, where BGP and MPLS BGP VPNs are the norm. You will need a way to dynamically advertise the reachability of these VNFs to the WAN outside the DC network (see the sketch after this list).
  • VNFs typically have very high packet throughput and low latency requirements which the default OpenStack Neutron vSwitch OVS struggles to handle.
  • Telco Cloud virtualized applications are typically VNFs that are deployed as a chain of services. For example, think about how one might configure the virtual networking required to move traffic through all of the virtual components of a 4G or 5G vEPC.
  • High Availability schemes and VNF health checking are often required to quickly move traffic to local or remote backup instances of the service function.
  • How do I implement QoS? For example, if I host a virtual Route-Reflector that is the ‘brains’ of my external physical legacy network inside the shared infrastructure that is my virtualized Telco Cloud DC, how do I ensure prioritization of this vRR workload’s control plane packets to and from the legacy external infrastructure?
  • IPv6 – it’s very real in Telco NFV deployments
  • Connectivity of virtualized functions to non-virtualized functions inside the same DC for brownfield deployments
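
To make the legacy interconnection point above a bit more concrete, here is a minimal sketch of the bookkeeping an SDN controller automates whenever a VNF spins up: map the new VNF interface to its tenant’s VRF/route-target and originate the route advertisement the WAN needs in order to reach it. The tenant names, route-targets and addresses below are made up for illustration, and the printed “announcement” strings are just a stand-in for the real BGP/MPLS VPN updates a controller or DC gateway would send.

```python
# Illustration only: the per-tenant route-target mapping and the printed
# "announcements" stand in for the BGP/MPLS VPN updates that an SDN controller
# (or a manually configured DC gateway) would originate toward the SP WAN
# whenever a VNF instance is created or removed.

# Hypothetical tenant -> route-target mapping (values are made up).
TENANT_ROUTE_TARGETS = {
    "mobility-vepc": "target:64512:100",
    "voice-ims": "target:64512:200",
}

def vnf_spun_up(tenant: str, vnf_name: str, vnf_ip: str, vrouter_ip: str) -> str:
    """Return the VPN route advertisement needed so the WAN can reach this VNF."""
    rt = TENANT_ROUTE_TARGETS[tenant]
    # In a real deployment this would be a BGP VPNv4/VPNv6 UPDATE carrying the
    # VNF host route, a next hop of the compute node / tunnel endpoint, an MPLS
    # or VXLAN label, and the tenant's route-target.
    return (f"announce vpn route {vnf_ip}/32 next-hop {vrouter_ip} "
            f"extended-community {rt}  # {vnf_name}")

if __name__ == "__main__":
    # Two vEPC components come up on different compute nodes.
    print(vnf_spun_up("mobility-vepc", "vMME-1", "203.0.113.10", "10.10.1.21"))
    print(vnf_spun_up("mobility-vepc", "vSGW-1", "203.0.113.11", "10.10.1.22"))
```

Doing this by hand for every VNF, per tenant, on every gateway is exactly the kind of toil a controller with native BGP support removes.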

Neutron with OVS still isn’t fully there yet for Telco Cloud

Isn’t the default Neutron networking stack that comes with OpenStack considered “SDN”?  Yes, but unfortunately, even after many years of continuous improvements, the stock Neutron OVS networking solution that comes with OpenStack still struggles to orchestrate the required virtual networking functions at scale and with the visibility required to deploy, monitor, troubleshoot and capacity plan a Telco Cloud.

The existence of the OVN project, which is essentially an overhaul of OVS networking, pretty well summarizes the issues that exist in OVS; from its charter:

“OVN will put users in control over cloud network resources, by allowing users to connect groups of VMs or containers into private L2 and L3 networks, quickly, programmatically, and without the need to provision VLANs or other physical network resources…”

https://networkheresy.com/2015/01/13/ovn-bringing-native-virtual-networking-to-ovs/

Perhaps the most important issue is that gaps in OVS capabilities can result in the Telco Cloud Provider still needing to manually provision parts of the physical DC network infrastructure with tenant reachability information using a vendor specific CLI or NMS. So when a VM/VNF spins up, I now need the networking team to go statically configure VLANs and routes on all the DC infrastructure gear to properly isolate and forward my tenant traffic. This is the same way we have built DC networks for the last 20+ years. A Software Defined Network shouldn’t require tenant state to be configured in the underlying DC fabric equipment. Putting tenant state into the DC network switches means bigger state tables to handle a lot of logical scale, which in turn means purchasing a higher class (i.e. more expensive) of DC switch. The switching layer should just be transport, allowing the provider to purchase the most cost effective DC switching fabric.
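
To illustrate the difference, here is a small sketch using the OpenStack SDK (the cloud name, physnet label, VLAN ID and CIDR are placeholders I have assumed) contrasting a provider VLAN network, which only forwards traffic once the networking team has trunked that VLAN across the fabric, with a VXLAN tenant network whose isolation lives entirely in the overlay:

```python
# Minimal sketch using openstacksdk; cloud name, physnet label and the
# VLAN/CIDR values are assumptions for illustration.
import openstack

conn = openstack.connect(cloud="mycloud")  # entry assumed to exist in clouds.yaml

# Option 1: provider VLAN network. The fabric must already carry VLAN 1234 on
# every switch and trunk between the compute racks and the DC gateway, i.e.
# per-tenant state configured by hand in the underlay.
vlan_net = conn.network.create_network(
    name="tenant-a-vlan",
    provider_network_type="vlan",
    provider_physical_network="physnet1",
    provider_segmentation_id=1234,
)

# Option 2: VXLAN tenant network. Tenant isolation lives in the overlay; the
# fabric only needs IP reachability between the compute nodes' tunnel endpoints.
overlay_net = conn.network.create_network(
    name="tenant-a-overlay",
    provider_network_type="vxlan",
)
conn.network.create_subnet(
    network_id=overlay_net.id, ip_version=4, cidr="10.20.0.0/24",
)
```

Even in the VXLAN case, stock Neutron OVS still leans on centralized network nodes and on fabric configuration for external connectivity, which is where the gaps described above show up.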

Visibility into where your traffic is going is critically important. Stock Neutron networking with OVS lacks complete visibility into how your traffic is flowing throughout the virtualized DC. OVS doesn’t correlate overlay tunnel flow paths to underlay hops, identify hot spots or let you see which security rules and service chains your traffic is traversing. This is still a very manual, device by device troubleshooting and data collection process with OVS today.
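
As a taste of what “manual, device by device” looks like in practice, the sketch below simply collects the OVS flow tables from every hypervisor so an operator can grep through them for a tenant’s traffic (the hostnames are placeholders and it assumes SSH access plus the standard ovs-ofctl tool on each compute node). Nothing here correlates those flows to the underlay path, hot spots or the security policy they hit – that is left to the operator.

```python
# Rough troubleshooting helper: dump the OVS flow table from each compute node
# over SSH. Hostnames are placeholders; assumes passwordless SSH/sudo and the
# standard ovs-ofctl tool on the hypervisors.
import subprocess

COMPUTE_NODES = ["compute-01", "compute-02", "compute-03"]  # placeholders

def dump_flows(host: str, bridge: str = "br-int") -> str:
    """Return the raw OpenFlow table of one bridge on one compute node."""
    result = subprocess.run(
        ["ssh", host, f"sudo ovs-ofctl dump-flows {bridge}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    for node in COMPUTE_NODES:
        flows = dump_flows(node)
        # The operator is left to manually map these flow entries back to
        # tenants, overlay tunnels and the underlay hops they traverse.
        print(f"===== {node}: {len(flows.splitlines())} flow entries =====")
```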

Another consideration is a Cloud Provider hosting virtualized network functions (VNFs) that require SR-IOV to achieve high packet per second throughput and low latency. SR-IOV bypasses OVS and delivers VM packets directly to the NIC. This also contributes to putting tenant state into the underlay switches, requiring complex configuration on those switches in order to implement multi-tenant routing, security policy and service chaining functions.
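
From the OpenStack side, requesting SR-IOV is as simple as asking for a “direct” VNIC on the Neutron port (sketch below with openstacksdk; the cloud and network names are placeholders). The catch is that the resulting traffic never touches the vSwitch, so any routing, security policy or service chaining for that port has to be pushed somewhere else – typically into the physical NIC and switch layer.

```python
# Sketch: create an SR-IOV ("direct") port with openstacksdk.
# Cloud and network names are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")
net = conn.network.find_network("vepc-userplane")  # assumed network name

sriov_port = conn.network.create_port(
    network_id=net.id,
    name="vsgw-1-sriov-port",
    binding_vnic_type="direct",  # maps to binding:vnic_type=direct (SR-IOV VF)
)
# A VM booted with this port sends packets straight to the NIC's virtual
# function, bypassing OVS -- so OVS-based security groups and virtual routing
# no longer apply to this traffic.
```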

There is a good reason that a whole ecosystem of 3rd party OpenStack Neutron SDN Controller plug-ins exists to fill these gaps in stock Neutron OVS.

This link is to a previous OpenStack Summit Session which does a fantastic job covering the deficiencies and complexity involved in implementing an OpenStack Cloud DC using stock Neutron OVS Networking.  I would highly suggest spending the 35 minutes to get a much more detailed view of the performance and troubleshooting challenges in stock OVS that I won’t fully cover in this post.

“Hey over there, isn’t that the fully automated cloud we were looking for”

So what benefits does a 3rd party SDN Controller bring?

To get a better understanding of the gaps and accompanying challenges I’m referring to above, let’s take a look at what a commercial SDN Controller provides in the OpenStack DC on top of a stock Neutron OVS deployment. I’ll use elements of Juniper’s Contrail Networking for this, as it’s obviously the implementation I’m most familiar with as a Juniper Networks CSE, though solutions like Nokia Nuage VSG and others should be able to deliver similar advanced virtual network orchestration capabilities.

In a nutshell, Contrail uses a centralized SDN controller to program advanced virtual-networking overlay topologies that happen completely in SW at the vRouter level directly on the compute nodes. All Contrail Networking needs to create its overlay networks is IP reachability between all of the servers in a DC, which can be achieved using basic routing between the racks of underlay switches. There is no tenant state or complex configuration required in the switch fabric hardware, and all security and forwarding decisions are made directly at the virtual-networking layer in SW on the servers hosting the containers or VMs, without the need for hairpin routing through centralized service nodes.
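
Because the underlay’s only job is to provide IP reachability (with enough MTU headroom for the overlay encapsulation) between the compute nodes, validating it can be as simple as the sketch below. The vRouter fabric IPs and payload size are placeholders; it just shells out to standard ping with the don’t-fragment bit set between every pair of nodes.

```python
# Quick underlay sanity check between compute nodes: can each pair of tunnel
# endpoints exchange full-size, unfragmented packets? Addresses and payload
# size are placeholders; 8972 bytes corresponds to a 9000-byte jumbo MTU.
import itertools
import subprocess

COMPUTE_NODES = ["10.10.1.21", "10.10.1.22", "10.10.1.23"]  # vRouter fabric IPs

def path_ok(src: str, dst: str, payload: int = 8972) -> bool:
    """Ping dst from src with DF set; True if the path carries the full size."""
    cmd = ["ssh", src, f"ping -M do -s {payload} -c 3 -q {dst}"]
    return subprocess.run(cmd, capture_output=True).returncode == 0

if __name__ == "__main__":
    for a, b in itertools.combinations(COMPUTE_NODES, 2):
        print(f"{a} <-> {b}: {'ok' if path_ok(a, b) else 'FAILED'}")
```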

Contrail leverages BGP for dynamic tenant reachability advertisements towards the legacy WAN network infrastructure outside the DC. It also has a “SmartNIC” offload option for ensuring that SR-IOV enabled workloads still leverage all of the Contrail vRouter’s overlay networking functions. Along with complex virtual network capabilities come detailed analytics that show you the performance of your virtual application stacks. All of this leads to efficient, high performance packet forwarding through dynamically created, complex virtual networking topologies at extremely high scale, with full visibility into where your traffic is going.
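
As a small taste of the analytics side, the sketch below pulls the list of virtual-network UVEs (the operational state and statistics objects Contrail keeps per object) from the analytics REST API. The host name is a placeholder, and the port and URL path are my assumptions based on a typical Contrail deployment, so treat them as such and check your release’s API documentation.

```python
# Hedged sketch: list the virtual-network UVEs exposed by the Contrail
# analytics REST API. Host is a placeholder; the port (8081) and URL path are
# assumptions based on a typical deployment -- verify against your release.
import requests

ANALYTICS = "http://contrail-analytics.example.net:8081"  # placeholder host

resp = requests.get(f"{ANALYTICS}/analytics/uves/virtual-networks", timeout=10)
resp.raise_for_status()

for vn in resp.json():
    # Each entry carries a name plus an href to the full UVE, which includes
    # per-VN traffic counters, connected virtual networks, ACL hits and more.
    print(vn.get("name"), "->", vn.get("href"))
```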

Here is a non-exhaustive list of important capabilities that a Contrail virtual networking overlay solution provides for a Telco Cloud DC and where a standalone Neutron OVS based solution still lags behind:

  • No tenant networking state is populated into the underlying DC switch hardware – you are free to build a high speed low latency transport fabric using whatever DC switch OS and HW you want at each tier of your design.
  • Native integration with L3 VPNs in the WAN – This is important for a carrier which desires to offer virtualized services to its existing base of L3 VPN customers.
  • Ability to dynamically insert service chains between containers/VMs without having to provision complicated ACL rules into the entire underlying physical infrastructure
  • Service health checking, where a VM/VNF’s route reachability is automatically withdrawn from the overlay upon health check failure, allowing active/backup service instances to exist via automated fail-over
  • Ability to dynamically mirror a VM interface or specific 5-tuples of traffic transiting between 2 virtual networks without configuring span ports in the underlying network fabric
  • BGPaaS, where the VM/VNF can run BGP directly with the Contrail overlay and dynamically advertise its own loopback addresses or other secondary IPs, subnet pool IPs/VIPs, etc. without having to provision BGP sessions on the physical network HW in the DC (see the sketch after this list)
  • Single pane of glass detailed analytics reporting down to the flow level including correlation of overlay network flow path with the underlay network path
  • QoS marking and queuing ensuring fairness of traffic scheduling between containers/VMs running on the same compute host and through the underlay fabric
  • SmartNIC offload of SDN overlay encapsulation + network policy for VNF workloads that require SR-IOV in order to achieve desired performance. This prevents the manual network configuration that normally would have happened when a traditional compute node vSwitch or vRouter is bypassed by SR-IOV.
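
To illustrate the BGPaaS item above from the VNF’s point of view, here is a small sketch that renders the kind of BGP configuration a VNF might run (FRRouting-style syntax, with made-up ASNs and addresses, purely for illustration): the VNF peers with its local gateway address, which the vRouter answers on behalf of the overlay, and advertises its own loopback/VIP prefixes – no BGP session ever needs to be provisioned on the physical DC gear.

```python
# Illustration only: render an FRRouting-style BGP stanza for a VNF using
# BGPaaS. ASNs and addresses are made up; the vRouter terminates this session
# at the VNF's default gateway address on behalf of the overlay.

def render_vnf_bgp_config(vnf_asn: int, overlay_asn: int,
                          gateway_ip: str, advertised_prefixes: list) -> str:
    lines = [
        f"router bgp {vnf_asn}",
        f" neighbor {gateway_ip} remote-as {overlay_asn}",
        " address-family ipv4 unicast",
    ]
    lines += [f"  network {prefix}" for prefix in advertised_prefixes]
    lines.append(" exit-address-family")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_vnf_bgp_config(
        vnf_asn=65001,
        overlay_asn=64512,
        gateway_ip="10.20.0.1",  # the VNF's default gateway (answered by the vRouter)
        advertised_prefixes=["192.0.2.1/32", "198.51.100.0/28"],  # loopback + VIP pool
    ))
```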

Here is a video presentation from NANOG 70 with some great insights from AT&T Mobility into the criteria required to build a vEPC that can address the new requirements and challenges that 5G and IoT impose, and why Juniper Networks’ Contrail SDN Overlay Controller was essential to enabling a flexible, high performance Telco Cloud.

In summary, an OpenStack based Telco Cloud DC requires an SDN Controller to handle the advanced virtual-networking requirements at scale. Without an SDN controller, the Cloud Provider is forced to attempt to provision legacy network interconnect, tenant network isolation, redundancy/HA, security rules/ACLs and hairpin routing through service nodes and appliances by programming these tenant specific rules at scale directly in the DC switching and DC Gateway router gear. This means cobbling together the same legacy constructs that have been used to build DC networks for the past 20+ years, combined with expensive DC switches that can hold the logical scale required of a large virtualized DC. That is not network virtualization moving at the speed of compute virtualization, which is the promise of SDN. Let your DC fabric be simple transport and leverage a fully functional 3rd party SDN controller to deliver secure and fully automated virtual tenant networking at scale.

Disclaimer: The views expressed here are my own and do not necessarily reflect the views of Juniper Networks