Understanding optics and cabling choices in the Data Center

Do you find the dizzying array of optics form factors, connector types, cabling choices and device connectivity options in the Data Center difficult to make sense of? In this segment we will look at some of the most popular types of Data Center optics and cabling options, as well as examples of commercially available switching platforms that leverage them. We will then cover the factors that influence when one might choose a specific type of optic and cabling for connecting end devices to these switches, and wrap up with a table that summarizes what we have discussed.

When it comes to optics, some important concepts to knock out right at the start are form factor, transceiver, connector type and cabling.

The form factor of an optic essentially defines the supported data transfer rate (speed) per lane, number of data transfer lanes, physical size and power characteristics of the optical transceiver.

Within each specific form factor there are multiple optical transceiver options which differ in the supported distance range and type of connectors and cabling they support.

You will see transceivers rated for specific distances when paired with a certain cabling type. Some distance designations that commonly appear are SR (short-reach), IR (intermediate-reach) and LR (long-reach), which, when combined with a supported cabling type, range from as little as 100m on Multi-mode fiber (MMF) for SR to upwards of 10km on Single-Mode Fiber (SMF) in the case of LR.

When it comes to cable types, Multi-mode fiber (MMF) cable is used in the Data Center for distances of less than 400m. Single-Mode Fiber (SMF) cable is used for distances greater than 400m, where the connected end device is across the Data Center, or for much longer distances like interconnects between Data Centers (DCI) that may span multiple kilometers. Cat6 and Cat7 copper cabling is still used for very short distance 1G and 10G connections.
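To make those rules of thumb concrete, here is a minimal Python sketch that picks a cable type from a link distance. It is purely illustrative: the thresholds simply restate the distances above and the function name is made up for this example.

```python
def pick_cable(distance_m, speed_gbps=10):
    """Rough Data Center cable-type rule of thumb (illustrative only)."""
    if speed_gbps <= 10 and distance_m <= 100:
        return "Cat6/Cat7 copper or DAC/AOC"   # very short 1G/10G runs
    if distance_m <= 400:
        return "Multi-mode fiber (MMF)"        # in-row / across-the-row runs
    return "Single-mode fiber (SMF)"           # cross-DC or DCI spans

print(pick_cable(3))      # in-rack
print(pick_cable(250))    # across the row
print(pick_cable(2000))   # Data Center interconnect
```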

In the Data Center the transceiver’s connector type is typically either an LC connector or an MPO/MTP connector. Duplex Single-Mode Fiber (SMF) and duplex Multi-Mode Fiber (MMF) cable types both use LC connectors, while parallel MMF trunks utilize MPO/MTP connectors (more on MPO/MTP below). RJ45 copper connectors for use with Cat6/7 cabling are also possible with some transceivers.

Figure 1: SFP transceiver (left) which accepts a cable with an LC connector (right)

 

MPO/MTP Connector

Multi-fiber Push-On (MPO) is the standard connector type and MTP is a popular brand of MPO connector. These connectors terminate multiple parallel multi-mode fiber strands, called ‘MTP trunks’. The most commonly seen parallel multi-mode trunk for interconnecting two devices is the 12-fiber MTP trunk, which consists of a length of 6 fiber pairs with MTP connectors on each end. In practice only 8 of the 12 fibers are actually used, which is enough to provide 4 lanes of fiber pairs. More efficient 8-fiber MTP trunks also now exist and are gaining in popularity. These MTP connectors commonly plug into either a QSFP+ 40G or QSFP28 100G form factor optical transceiver, both of which use 4 parallel data transfer lanes and are used for short connections of less than 400m.

Figure 2: Male and Female (top) MTP connectors, MTP trunk Cable (bottom)

 

MTP harness cables also exist which, for example, can take an 8-fiber MTP and break it out into 4xLC connectors. These are typically used to break out a 40G or 100G port on a switch into 4x10G or 4x25G endpoint connections respectively. Such a harness might plug directly into a switch port to break out connections to a server in the same rack as the switch, or be used in conjunction with an in-rack patch panel that provides connectivity to all the servers in the rack.
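As a quick sanity check on the lane and fiber math above, here is a small illustrative Python sketch. The figures simply restate the text (4 lanes per QSFP optic, two fibers per lane, 12- or 8-fiber trunks); nothing here is vendor-specific.

```python
# Each QSFP+/QSFP28 optic uses 4 parallel lanes; each lane needs a TX and an RX fiber.
LANES = 4
FIBERS_PER_LANE = 2                           # one transmit + one receive strand

fibers_used = LANES * FIBERS_PER_LANE         # 8 fibers actually lit
trunk_sizes = {"12-fiber MTP trunk": 12, "8-fiber MTP trunk": 8}

for name, strands in trunk_sizes.items():
    print(f"{name}: {fibers_used} of {strands} fibers used, {strands - fibers_used} dark")

# Breakout mapping: the 4 lanes of a 40G/100G port become 4 lower-speed ports.
for total_g, per_lane_g in [(40, 10), (100, 25)]:
    print(f"{total_g}G QSFP breaks out to {LANES} x {per_lane_g}G ({LANES * per_lane_g}G aggregate)")
```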

Figure 3: 8-fiber MTP to 4 LC duplex harness

 

There are also versions of transceivers that come with a cable permanently pre-attached and therefore have no separate connector. These Direct Attach Copper (DAC) and Active Optical Cables (AOC) are for very short connections in the range of 1 to 30m and are described in more detail further below.

Direct Attach Copper (DAC)

This is essentially a cable with 10G SFP+, 40G QSFP+ or 100G QSFP28 transceivers pre-attached on both ends. DACs come in passive and active variants, with passive having an effective reach of 1-5m while active can cover 5-10m between the switch port and the connected device. DAC is mostly used for in-rack cabling when the switch and the connected device are in the same rack (Top-of-Rack model). It is also possible to use an active DAC to extend the reach between a connected device and a middle-of-row or end-of-row switch location.

 Figure 4: DAC Cable

 

Direct Attach Copper Break-Out (DACBO)

This functions just like the DAC above, with the key difference being that the switch-port end of the cable has a 40G QSFP+ or 100G QSFP28 transceiver while the connected-device end provides 4x10G SFP+ or 4x25G SFP28 connections. These are typically used within a rack (TOR model).

Figure 5: DAC Break-out Cable

 

Active Optical Cables (AOC)

AOCs are just like DACs in that the transceivers and the cable are a single fixed assembly. The key differences are that AOCs are fiber, which is thinner and more flexible, and they have a much longer effective reach in the 10 to 30m range, allowing them to be used with middle-of-row and end-of-row switch placement designs. The drawback to a really long AOC that runs from an end-of-row switch location to a device in a remote rack is that the entire cable assembly needs to be re-run in the event of a failure, which may prove cumbersome. AOCs also have breakout options for taking 40G and 100G ports to 4x10G and 4x25G respectively. AOCs are more expensive than DAC cables due to both the active components and the longer cable lengths.

Figure 6: Active Optical Cable
Figure 7: Active Optical Break-out Cable

 

Commonly used form factors

Next let’s look at the commonly used optics form factors, how they have historically been used in DC switching gear, and their performance, size, power and cost trends.

SFP+ – 10G Small Form Factor Pluggable Transceivers

SFP+ is a single lane of 10Gbps which utilizes 1.5W of power.   SFP+ transceivers can support RJ45 copper, LC fiber connectors or Direct Attach Copper (DAC) and Active Optical Cables (AOC). Typical 1RU switch configurations which leverage SFP+ have 48 SFP+ ports with 4 to 6 ports of QSFP+ 40G or QSFP28 100G for uplinks.  10G Data Center switches produced with SFP+ ports are now starting to give way to 25G switches with SFP28 ports and QSFP28 uplinks.

Figure 8:  Juniper QFX5100-48s 1RU 48x10G SFP+ and 6x40G QSFP+ uplink Broadcom Trident 2 based switch

 

QSFP+ – Quad 10G Small Form Factor Pluggable Transceivers

The QSFP+ is 4 lanes of 10Gbps in a package slightly wider than an SFP+ and utilizes 3.5W of power. Compared to SFP+ that is 4x the bandwidth at roughly 2.5x the power consumed. QSFP+ transceivers support LC fiber connectors, MPO/MTP connectors, Direct Attach Copper (DAC) and Active Optical Cables (AOC). It’s common to see 32 QSFP+ ports on a 1RU Ethernet switch and 36 QSFP+ ports on a modular chassis line card, as this is typically the maximum front panel real estate available for ports.

Figure 9: EdgeCore 1RU 32x40G QSFP+ Broadcom Trident 2 based switch

 

SFP28 – 25G Small Form Factor Pluggable Transceivers

The SFP28 is a single lane of 28Gbits, which is 25Gbps plus error correction for an effective data rate of 25Gbps. SFP28 is the same size form factor as SFP+, so it’s 2.5 times the bandwidth of SFP+ in the same amount of space and at roughly the same price point. In addition, SFP28 is backwards compatible with 10GE, which allows for upgrading the DC switching infrastructure to support 25G without immediately having to upgrade all of the devices that will plug into it, and allows for reuse of existing 2-pair MMF cabling. 1RU switching products have been introduced with 48x25G SFP28 densities plus 4-6 additional QSFP28 ports used as 100G uplinks.

Figure 10: Dell S5148F-ON 1RU 48x25G SFP28 and 6x100G QSFP28 uplink Cavium Xpliant based switch

 

QSFP28 – Quad 25G Small Form Factor Pluggable Transceivers

The QSFP28 is 4 lanes of 28Gbits (25Gbps + error correction) providing either full 100Gbps operation or a break-out of (4) 25Gbps interfaces. It’s important to note that the QSFP28 is the same size form factor and power draw as the QSFP+, yet provides 2.5x the bandwidth of QSFP+. Because the form factor is the same, switching equipment can flexibly support either QSFP+ or QSFP28 pluggable optics in the same port(s). Most 1RU switches have 32 QSFP28 ports which support breaking out each QSFP28 to 4x25G. Modular switch chassis sport line cards with 32 to 36 QSFP28 ports. QSFP28 is also popular as a 100G uplink port in newer 48x25G 1RU switches.
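To put the bandwidth and power figures from the last few sections side by side, here is a small illustrative Python sketch. The SFP+, QSFP+ and QSFP28 wattages are the nominal values quoted above, the SFP28 value is an assumption (similar to SFP+), and real transceivers vary by type and vendor.

```python
# Nominal per-port figures from the text; actual draw varies by transceiver type.
form_factors = {
    "SFP+":   {"lanes": 1, "gbps_per_lane": 10, "watts": 1.5},
    "QSFP+":  {"lanes": 4, "gbps_per_lane": 10, "watts": 3.5},
    "SFP28":  {"lanes": 1, "gbps_per_lane": 25, "watts": 1.5},   # assumed similar to SFP+
    "QSFP28": {"lanes": 4, "gbps_per_lane": 25, "watts": 3.5},   # same draw as QSFP+ per the text
}

for name, ff in form_factors.items():
    gbps = ff["lanes"] * ff["gbps_per_lane"]
    print(f"{name:7s} {gbps:4d}G total, {gbps / ff['watts']:5.1f} Gbps per watt")
```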

Figure 11:  Dell Z9100-ON 1RU Broadcom Tomahawk based switch with 32x100G QSFP28 ports

 

Device placement in the Data Center

Another key concept to understand is the typical switch placement locations in the Data Center.

Top-of-Rack (TOR) Designs

The Top-of-Rack (TOR) model consists of placing a switch within the same rack as the devices connecting to it. The switch might be placed in the topmost portion of the rack or in the middle of the rack with half the servers above it and the other half below it. In this case it’s common to see either RJ45 copper or DAC cables used to connect in-rack devices to the switch due to the very short intra-rack cabling distances. This short intra-rack wiring scheme also obviates the need for the additional space and cost of patch panels in the rack. The only connections that leave the rack should be uplinks from the TOR switch to middle-of-row or end-of-row spine or aggregation switches, which would typically be fiber or even AOC based. From a maintenance or failure domain perspective we are dealing with isolation at the rack level, so only devices within the rack will be impacted.

Due to the high front panel density of 1RU switches, deploying a pair of switches in each rack for redundant server connectivity potentially strands a lot of excess ports on each switch. For example, if you want to multi-home your servers to a pair of 48-port switches yet only have 24 servers per rack, you could cross-connect your servers to a 48-port TOR switch in an adjacent rack. While this reduces the total number of switches and leaves no unused ports, it makes for slightly more cumbersome cross-rack cabling.
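The port-stranding trade-off is easy to quantify. The sketch below just works through the 24-servers-per-rack example from the paragraph above; the numbers are illustrative only.

```python
servers_per_rack = 24
ports_per_switch = 48        # 48-port TOR switches, one NIC per switch per server (dual-homed)

# Option A: a dedicated pair of 48-port TOR switches in every rack.
used_per_switch = servers_per_rack
stranded_pair = 2 * (ports_per_switch - used_per_switch)
print(f"Pair of TORs per rack: {stranded_pair} stranded ports per rack")   # 48

# Option B: cross-connect two adjacent 24-server racks to one shared pair of switches.
used_per_switch = 2 * servers_per_rack
stranded_xrack = 2 * (ports_per_switch - used_per_switch)
print(f"Shared pair across 2 racks: {stranded_xrack} stranded ports")      # 0
```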

Middle-of-Row or End-of-Row Designs

Middle-of-row and End-of-row designs typically consist of a pair of large chassis based switches that aggregate the server connections from all of the racks/cabinets in the row. In this model there is no switching device within the rack itself. The connections between the large chassis based switches and each rack would typically be fiber runs that terminate onto in-rack patch panels. From the patch panel, individual patch cables are run to each server in the rack. It is technically possible to run cables the full distance from the middle-of-row or end-of-row chassis based switch pair directly to each server, however when a cable fault occurs the entire length of the run needs to be replaced, which can be cumbersome to deal with.

From a maintenance or failure domain perspective we are dealing with the potential to impact multiple racks in the event of a large chassis based switch failure.

 

Let’s conclude by putting it all together with a table that illustrates how the various form factor, transceiver, connector and cable types combine to offer choices in the placement of the end devices connected to Data Center switches.

Commonly used 10/25/40/100G Data Center Connectivity Options

| Media | Form Factor | Cabling and Connector Type | Max Distance | Type of connection |
|---|---|---|---|---|
| 10GBase-T (Copper) | SFP+ | CAT6/CAT7 | 100m | server/appliance within the same rack (10G TOR switch) |
| 10G-USR (Fiber) | SFP+ | duplex MMF w/LC connector | 100m | server/appliance within the same rack (10G TOR switch) |
| 10GBase-SR (Fiber) | SFP+ | duplex MMF w/LC connector | 400m | server/appliance within the same rack (10G TOR switch) or to chassis based switch in end of row (EOR) |
| 10GBase-LR (Fiber) | SFP+ | duplex SMF w/LC connector | 10km | device is cross-DC, between floors, external to DC/DCI |
| | SFP+ | Direct Attach Copper | 1 to 5m | server/appliance within the same rack (10G TOR switch) |
| | SFP+ | Active Optical Cable | 5 to 30m | server/appliance within the same rack (10G TOR switch) or to chassis based switch in end of row (EOR) |
| 25GBase-SR | SFP28 | duplex MMF w/LC connector | 100m | server/appliance within the same rack (25G TOR switch) |
| | SFP28 | Direct Attach Copper | 1 to 5m | server/appliance within the same rack (25G TOR switch) |
| | SFP28 | Active Optical Cable | 5 to 30m | server/appliance within the same rack (25G TOR switch) or to 25G chassis based switch in end of row (EOR) |
| 40GBase-SR4 (Fiber) | QSFP+ | 12-fiber MMF w/MPO connector | 150m | server/appliance within the same rack (40G TOR) or to chassis based switch in end of row (EOR) |
| 40Gx10G-ESR4 (Fiber) | QSFP+ | 12-fiber MMF w/MPO connector | 400m | server/appliance within the same rack (40G TOR) or to chassis based switch in end of row (EOR) |
| 40GBASE-CR4 | QSFP+ | Direct Attach Copper | 1 to 5m | server/appliance within the same rack or 40G leaf to 40G spine switch |
| | QSFP+ | Active Optical Cable | 5 to 30m | 40G leaf to 40G spine switch or to 40G chassis based switch in end of row (EOR) |
| | QSFP+ | QSFP+ to 4xSFP+ DACBO | 1 to 5m | 10G server/appliance within the same rack |
| | QSFP+ | QSFP+ to 4xSFP+ AOC | 5 to 30m | 10G server/appliance within the same rack (40G TOR) or to chassis based switch in end of row (EOR) |
| 100GBase-SR4 (Fiber) | QSFP28 | 12-fiber MMF w/MPO connector | 100m | server/appliance within the same rack (TOR) |
| 100GBase-LR4 (Fiber) | QSFP28 | duplex SMF w/LC connector | 10km | device is cross-DC, between floors, external to DC/DCI |
| 100GBASE-CR4 | QSFP28 | Direct Attach Copper | 1 to 5m | server/appliance within the same rack, device to device 100G connectivity (100G leaf to 100G spine) |
| | QSFP28 | Active Optical Cable | 5 to 30m | device to device 100G connectivity (100G leaf to 100G spine switch) or 100G chassis based switch in end of row (EOR) |
| | QSFP28 | QSFP28 to 4xSFP28 DACBO | 1 to 5m | 25G server/appliance within the same rack (100G TOR) |
| | QSFP28 | QSFP28 to 4xSFP28 AOC | 5 to 30m | 25G server/appliance to 100G chassis based switch in end of row |

I hope you found this useful and now have a better understanding of how to use these components to construct a Data Center cabling scheme.

 

Disclaimer: The views expressed here are my own and do not necessarily reflect the views of my employer Juniper Networks

Spine-leaf versus large chassis switch fabric designs

With every mega scale web company talking about their spine-leaf fabric designs it sounds as if everyone who is building a Data Center switch fabric needs and is going to implement one.  Is that really the case though?

In this blog post we will look at whether spine-leaf designs are always better than large chassis based Data Center switch fabric designs. We will touch on some important factors to consider when choosing between which fabric design to implement.  Next I will provide a simple example for comparison that will better illustrate why several of these factors are critically important.

First off, let’s start with what a spine-leaf fabric actually is. It’s a physical topology design where endpoints connect to leaf switches and all leaf switches connect to a spine switch layer. Every leaf connects via an uplink port to each of the spines. The spines don’t connect to each other, just to the leaf switches. This physical topology results in every endpoint being an equidistant number of hops away from every other endpoint with consistent latency. There are effectively 3 stages in this design: an ingress leaf stage, a middle stage via the spine and an egress leaf stage. This design is sometimes referred to as a Clos fabric, after Charles Clos who formalized it back in 1953 in order to scale a network larger than the radix of the largest telephone switch available at the time. This design being rooted in the need to scale past the largest available switching component will be a key factor we consider later in this post when comparing a large chassis based design to a spine-leaf design.

A spine-leaf topology can be utilized for a layer 2 or a layer 3 fabric. It is most often coupled with an L3 design, commonly referred to as an ‘IP Fabric’, where routing is employed on all of the interconnection links between the leaf and spine switches. Every endpoint connected to a leaf switch is reachable via N equidistant ECMP paths, where N is the number of spine switches in use. In the layer 3 “IP Fabric” paradigm the result is effectively a distributed control plane.
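As a concrete illustration of the leaf-and-spine arithmetic (hypothetical port counts, not tied to any particular product), this small Python sketch computes the leaf oversubscription ratio and the ECMP path count for a leaf with 48x25G server-facing ports and a variable number of 100G uplinks, one per spine:

```python
def leaf_spine_math(server_ports=48, server_speed_g=25, uplinks=6, uplink_speed_g=100):
    """Illustrative leaf/spine sizing; one uplink per spine, so uplinks == spine count."""
    downlink_bw = server_ports * server_speed_g      # bandwidth toward the servers
    uplink_bw = uplinks * uplink_speed_g             # bandwidth toward the spine layer
    oversub = downlink_bw / uplink_bw                # ingress:egress ratio at the leaf
    return {
        "ecmp_paths": uplinks,                       # N spines -> N equal-cost paths
        "oversubscription": f"{oversub:.1f}:1",
    }

print(leaf_spine_math(uplinks=6))   # {'ecmp_paths': 6, 'oversubscription': '2.0:1'}
print(leaf_spine_math(uplinks=4))   # {'ecmp_paths': 4, 'oversubscription': '3.0:1'}
```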

In contrast a large modern chassis based switch is actually a Clos fabric enclosed in a single piece of sheet metal with a shared control plane, power and cooling subsystem. The line cards act like leaf switches and the fabric modules act like spine switches. The fabric links that interconnect all of the line cards (leafs) are internal to the switch fabric modules (spine) and don’t require a conventional routing protocol between them as there is a single control plane in the chassis based system which updates the forwarding tables on the line cards.

Note: It’s worth mentioning that there is technically a hybrid model of the single control plane chassis based switch and the distributed control plane spine-leaf fabric. An Open Compute Project (OCP) design called “Backpack”, submitted by Facebook, now exists for a 4-slot chassis comprised of 32x100GE line cards and integrated fabric modules. The difference in this design is that each component inside the shared chassis is a separate switch from a control plane and management plane standpoint. This basically acts like a Clos fabric in a box, whereby the fabric cards act as discrete spine switches and the line cards act as discrete leaf switches running routing protocols between each other. So while the packaging looks like an integrated modular chassis design, it’s really comprised of a dozen individually managed components. Facebook is actually using tens (if not hundreds) of these chassis together as building blocks in a massive spine-leaf fabric design. This means lots of individual devices to manage and monitor. The major benefits to this approach are improved cable management, since fabric interconnects between spine and leaf are internal to the chassis, as well as better cooling.

Now that we have a better understanding of what a spine-leaf fabric is let’s look at the big picture view. What are the actual business requirements that you are building a Data Center network towards? Are you building your network according to the requirements of an application that you are developing or are you building it to run “anything” that might get thrown at it? The former is the case when you look at companies like Facebook or LinkedIn that seek to make the network technology interface with the business better. The latter occurs when you are building an Infrastructure as a Service (IaaS) Data Center fabric that will host someone else’s applications.  Is the application designed with scale out and fault tolerance built in, or is it expected that the network will deliver this on its behalf? You need to know the requirements and behavior of the application. Is ultra-low latency required? Is packet loss tolerated? Is every end host consistently using all of its available bandwidth or is the traffic pattern less deterministic?

How big does your Data Center fabric need to be? Don’t just assume that you need to do a spine-leaf design because that’s what all the massive scale players are doing. The key here is ‘massive scale’. These companies have unique problems to solve at the largest known scale in the industry. You need to be realistic here and ask yourself: is your Data Center network actually going to need to scale as large as a Google, Facebook or LinkedIn? If you are only talking about 2,000 servers versus 20,000 to upwards of 100,000, then you are building a very different network than the massive scale players in the industry.

What tech is in your network team’s wheelhouse? Are they comfortable with a routing protocol like BGP and proficient with the configuration automation and telemetry collection tools required to deploy and monitor a very large scale Data Center fabric comprised of tens if not hundreds of individual network devices?

Another interesting concern is whether or not you intend to pursue a disaggregated software and hardware strategy, whereby the switch Network OS (NOS) and the switch hardware come from different vendors, as opposed to the traditional ‘vertically integrated’ model of SW and HW coming from the same vendor. This is important to know up front as there currently appear to be far fewer options for switch NOSes that will run a large chassis based switch versus the smaller 1 or 2RU Top-of-Rack (TOR) switches. At the time of this writing the only OCP submitted designs for a large chassis based switch are the Edgecore Networks OMP 256 and 512 models. What this inevitably means is that a 1/2RU open networking switch will give you a much greater choice of switch NOS and switch hardware options.

If you decide to go with a large chassis based switch from a traditional ‘vertically integrated’ network vendor you are likely going to stay with that vendor for 5-7 years and upgrade the line cards in the chassis to the next highest interface speed and density when needed. With the spine-leaf design everything is effectively a distributed line card, and you replace individual 1RU switches with the next switching silicon generation when needed. At that point, given the broader choices in the open ecosystem for smaller 1 or 2RU fixed switches, you can elect to change out the hardware or software vendor.

| Consideration | Large Chassis Switch Pair | Spine-Leaf “IP Fabric” |
|---|---|---|
| Fabric Construct | Fabric links between line cards are internal to the chassis. Usually proprietary, hard to debug without vendor TAC, hard to see what is really going on. | Fabric is composed of external links between switches running standards based protocols. Easy to monitor and understand what’s happening on the links using a well-established troubleshooting methodology. |
| Failure Domain | Large fault domain. Losing an entire switch means losing ½ your fabric bandwidth. For SW upgrades you really need mechanisms like NSR and ISSU to work reliably, else you will have long reboot times. | Small failure domain. Individual switches can fail or be upgraded separately to reduce impact. Shorter reboot times as we are dealing with individual switches. |
| Rack Units (RU) and Power | Chassis based switches will consume some extra rack units and power for routing engines and other shared system components. | Rack units and power consumed largely depend on how much oversubscription is acceptable for your application, as increasing fabric bandwidth involves adding more spine switches (which means more optics and cabling as well). |
| Optics/Cabling | Chassis design will use less optics and cabling due to the fabric interconnection links being internal to the chassis. | S/L requires optics and cables for constructing the external fabric interconnections between switches. |
| Config Management | Only 2 devices to manage as long as we are building to less than 2,000 server facing ports. | 10’s of devices to manage. Need to implement a config automation tool/process. Uses more IP addresses for P2P links between switches. Needs a routing protocol configured as the distributed control plane of the fabric. |
| Oversubscription | In a typical chassis based switch there should be no oversubscription. | The acceptable level of oversubscription is key to the number of overall spine switches, fabric links, optics/cables, IP addresses, rack units and power. |
| Packet Buffers | Custom ASIC based chassis switches typically have deep buffers (GB), though there are some that leverage merchant silicon with shallow buffers (MB). | 1 and 2RU merchant silicon switches typically come with shallow buffers (MB), though there are some exceptions here as well. |

Let’s take a look at how some of the factors listed above combine to play out in a simplified example. In this case I’m simply going to determine the maximum number of dual-homed servers with 25G NICs that can be supported in a large chassis based system and then attempt to build a spine-leaf fabric design with the same level of reach and oversubscription. For this example I’ve decided to compare a large 8-slot chassis fabric based on Broadcom Tomahawk merchant silicon with a spine-leaf fabric of 1RU switches based on the same merchant silicon switching ASIC, to make this as much of an ‘apples to apples’ comparison as possible, since each 1RU switch is equivalent to a line card or fabric module in the chassis based system. I’ve reserved a reasonable amount of ports in each fabric design to account for exit links towards the DC Edge Routers and some physical appliances like Firewalls, anti-DDOS, Monitoring Probes and other devices that may need to exist in the Data Center. The legacy hosted applications in this DC are not distributed fault-tolerant applications and therefore will require active/active 25G LAG connections to the switches for providing higher availability.

 

Note: this example utilizes combinations of Active Optical Cables (AOC), Direct Attach Copper (DAC) and Direct Attach Copper Break-Out (DACBO) cables.

| Consideration | Large Chassis Switch Pair | Spine-Leaf “IP Fabric” |
|---|---|---|
| Switch Hardware | (2) 8-slot chassis w/32x100G line cards | (46) total 1RU 32x100G switches: (30) 1RU compute leafs, (2) 1RU service leafs (host DC GW/PNFs), (14) 1RU spine switches |
| Total Rack Units | 26RU | 46RU |
| Total (Max) Power | 12KW | 17.5KW |
| Hardware location | Middle-of-row (MOR) in network cabinet | Leafs in Top-of-Rack (TOR) in server cabinets; MOR network cabinet for spine switches |
| Number of dual-homed 25G servers | 960 (30 server racks of 32 servers) | 960 (30 server racks of 32 servers) |
| Oversubscription | None | ~1.2:1 |
| Total non-server ports (to DC GW / MLAG ICLs / Appliances / Free ports) | 8x100G / 4x100G / 10x100G / 0 | 8x100G / 64x100G / 10x100G / 18x100G |
| Cabling | (480) QSFP28 to 4xSFP28 AOCs to the servers; (480) + (8 to DC GW + 2 DACs for MLAG Inter-Chassis-Links (ICLs) + 10 QSFP28 to 4xSFP28 DACs for appliances) = 500 physical cables | (448) QSFP28 AOCs from middle-of-row spines to leafs in the top of rack; leaf to servers via (16) QSFP28 to 4xSFP28 DACBOs per leaf per rack; (448) 100GE fabric links + (16) DACBOs per leaf x 30 racks = 928 + (8 to DC GW + 32 DACs for MLAG ICLs + 10 for appliances) = 978 physical cables |
| Optics | 8 to DC GW = 8 (100G QSFP28 SR4); QSFP28 to 4xSFP28 AOC and DAC cables mean no 25G optics on the servers, appliances or MLAG ICLs | 8 to DC GW = 8 (100G QSFP28 SR4); QSFP28 DAC and QSFP28 to 4xSFP28 DACBO mean no 25G optics on the servers, appliances or MLAG ICLs |
| Total addressable links (IPv4 addresses) | 30 IRBs + 2 for MLAG ICL + 2 for LAGs to DC GW + 40 for PNFs = 74 | 896 (p2p fabric links) + 32 for MLAG ICLs + 2 for LAGs to DC GW + 40 for PNFs = 970 |
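For anyone who wants to check the arithmetic behind the spine-leaf column, here is a small Python sketch that reproduces the headline numbers from the table above. It assumes 14 uplinks, 16 breakout ports and 2 MLAG ICL ports per 32x100G leaf, which is one port allocation consistent with the figures given; it is a sketch of the math, not a design tool.

```python
# One way to reproduce the spine-leaf figures in the table above (illustrative only).
compute_leafs, service_leafs, spines = 30, 2, 14
leafs = compute_leafs + service_leafs                       # 32 x (1RU 32x100G switches)
servers_per_rack = 32
breakout_ports, uplinks, icl_ports = 16, 14, 2              # assumed per-leaf port allocation
assert breakout_ports + uplinks + icl_ports == 32           # every leaf port accounted for

fabric_links = leafs * uplinks                              # 448 spine-leaf 100G links
cables = (fabric_links                                      # QSFP28 AOC fabric links
          + compute_leafs * breakout_ports                  # 480 QSFP28-to-4xSFP28 DACBOs
          + 8                                               # links to the DC GW
          + leafs * icl_ports // 2                          # 32 MLAG ICL DACs (1 cable per link)
          + 10)                                             # appliance connections
oversub = (breakout_ports * 4 * 25) / (uplinks * 100)       # 1600G down / 1400G up per leaf
p2p_ips = fabric_links * 2                                  # two addresses per point-to-point link

print(compute_leafs * servers_per_rack, "dual-homed 25G servers")        # 960
print(fabric_links, "fabric links,", cables, "physical cables")          # 448, 978
print(f"oversubscription ~{oversub:.2f}:1,", p2p_ips, "fabric p2p IPs")  # ~1.14:1 (rounded to ~1.2:1 above), 896
```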

There are some interesting observations that can quickly be made here.

  • First, to get to nearly the same level of oversubscription with the spine-leaf fabric design I need to consume 20 more rack units and 5.5KW more power. Most of this is attributable to the need to have physical links between the spine and leaf switches, which consume (16) 100GE ports on the leaf switches that are not required on the line cards inside of a chassis based switch, as those connections are internal to the switch fabric in the chassis. The ‘external switch fabric’ of the spine-leaf design requires (14) rack units worth of spine switches to deliver the same function as the fabric modules which are internal to the chassis based system.
  • The requirement to dual-home the servers’ 25G NICs to a pair of switches via LAG requires burning 2 additional interfaces between each pair of leaf switches for the MLAG/MC-LAG inter-chassis links. In the chassis based design this only requires a single pair of links running the MLAG/MC-LAG ICL between the two chassis. Also important to note is that MLAG/MC-LAG only works between a pair of switches, so when you connect servers directly to a pair of large chassis based switches you have already capped how big the fabric can get at 2 switches.
  • The total amount of IP addressing required to build the spine-leaf design is significantly higher, primarily due to the need to address all of the P2P links that comprise the physical fabric between the leaf and spine switches.

Another interesting observation is that the choice of switch location and the type of cabling in use can make a huge difference in the number of cables and the type and cost of the optics required. There are several different permutations that could have been chosen here, and the decision tree for where to place the switches and how to run the cabling warrants its own blog post. The use of Direct Attach Copper (DAC), Active Optical Cable (AOC) and Direct Attach Copper Break-Out (DACBO) cables created a large reduction in the 100G optics required. I think it’s safe to say that an optimum rack placement and cabling solution can be found for either the chassis based fabric design or the spine-leaf fabric design, such that neither solution has a tremendous advantage or disadvantage here.

 

Conclusion:

If you can handle having a potentially very large failure domain, then a traditional large chassis based system can be a convenient option for deploying a non-blocking switch fabric POD of less than 2,000 ports (1,000 dual-homed servers). It requires fewer devices to configure and monitor, less physical cabling, doesn’t need a routing protocol running between elements to bootstrap the fabric and consumes less IP addressing.

However, if one intends to extend the Data Center fabric to 3 to 5 times this size, then N more large chassis based systems would need to be deployed, along with another layer of chassis based switches introduced to interconnect all of the PODs, which would constitute a spine layer. What you end up with in that case is a spine-leaf fabric constructed entirely of chassis based switches. The failure domain in this scenario now becomes incredibly important: if the spine layer that interconnects all of these PODs is a pair of 100G chassis based switches, you can lose a large portion of bandwidth between any 2 PODs in your Data Center. The configuration management and monitoring advantages of the large chassis based system model also go away in a Data Center of this size.

 

Disclaimer: The views expressed here are my own and do not necessarily reflect the views of my employer Juniper Networks

Is the future of security cloud based delivery?

It seems every single news article these days contains numerous vendors espousing how they could have prevented the latest malware threat. Every software and hardware vendor seems to have a solution that could have stopped the WannaCry ransomware outbreak and will protect against the new Petya variant, and then the next variant, and so on.

Naturally this piqued my curiosity and begged the question: if every vendor has a preventative solution, then why do exploits like this continue to happen at such alarming rates and with such devastating financial impact?

I recently spent some time talking with folks who are at the forefront of this, who helped me understand that while preventative measures do exist, they are traditionally very complicated to deploy at scale and are only as good as the coverage applied. First off, end-system anti-virus software is utterly ineffective at keeping up with adaptive persistent threats in today’s landscape. You are just chasing your tail trying to keep up with bad actors who actually test their exploits against your commercial anti-virus software. I’m not saying don’t use it or don’t bother to keep it updated, just pointing out that this is not going to save us. Next, and really most important, is the reality that not every enterprise has the same security posture in every location that their users access the internet and cloud based applications from. For various reasons it’s very difficult to have the same level of advanced security applied in all locations.

In order to really grasp this you first have to look at the history of traditional enterprise WAN design and where the security perimeter got applied.  The legacy enterprise WAN was a Hub-and-Spoke topology designed to provide connectivity between branch offices (spokes) and the Corporate DC (Hub) because that’s where all the applications were running that you needed to access.  With the advent of a mobile workforce VPN concentrators also got added to allow connections to these Corporate DC Hub hosted applications from anywhere.  Internet access breakout was typically implemented at the Corporate DC Hub.  With this Hub-and-Spoke model all end user traffic was coming into the Corporate DC so this is effectively the ‘chokepoint’ where all of the security measures were implemented.

So what do the security measures actually look like in one of these Corporate DCs? It’s pretty complex, as no single security appliance can handle all of the functions required. Attempting to deliver comprehensive security at scale requires multiple disparate components from multiple vendors. This means forcing your end user traffic through separate appliances for URL filtering, IDP/IPS, anti-malware, Data Loss Prevention (DLP), next-gen firewall, sandboxing and SSL inspection. This complicated and expensive array of appliances all needs to be managed, updated and capacity planned independently, as each scales differently depending on the type of heavy lifting it is doing. Then there is the need to interpret logs and threat data coming from all these devices in different formats in order to see what’s happening and how effective these security measures really are.

The reality is that not every enterprise has or can deploy all of the above security measures at scale and make them available to every single end user. Some don’t have an expensive WAN circuit from every one of their remote branch offices to the Corporate DC and instead have deployed at the branch a local subset of the security measures that are normally found in the Corporate DC. Others may not be able to inspect all SSL encrypted traffic at scale, creating a huge blind spot when looking for threats.

Enter WAN transformation…if you read the same tech trade rags that I do you may have heard about this thing called SD-WAN about a hundred times a day.  With ever increasing Enterprise adoption of cloud based SaaS applications the end destination of most user traffic is the cloud and not the Corporate DC Hub where the security perimeter was built.  Maintaining this Hub-and-Spoke model is costly from a WAN circuits perspective and highly inefficient leading to poor cloud based application performance.  This is leading to Enterprises wanting to implement local internet access breakouts at each branch to allow for lower cost yet higher performance access to critically important cloud based applications like Office 365.

So if most of the applications my end users access are in the cloud and I want to provide direct internet access to those applications for high performance, how do I secure the traffic headed directly to the internet? As mentioned above, stamping out a copy of the patchwork of security appliances typically deployed in the Corporate DC security perimeter is cost prohibitive and an administrative nightmare. Shortcuts will be taken, coverage won’t be comprehensive and, as expected, the security posture of the entire Enterprise is only as good as its weakest link.

What would be really useful is the ability to point all of my end user locations, whether branch offices or my mobile workforce, to a cloud based security on-ramp. Hmm…isn’t this just another version of the hub-and-spoke design? If done poorly then yeah, it would be. To do this right you would need a cloud based security platform that has a global footprint of DCs colocated at IXPs (Internet Exchange Points) where all the major cloud providers interconnect as well. This provides high availability as well as high performance, in that each end user location is serviced by the closest cloud DC based security platform. The security platform itself should efficiently scan ALL (including encrypted) end user traffic through a comprehensive and optimized pipeline of security functions. What this would essentially provide is an elastically scalable, high performance, low latency, advanced security platform that is always on, with single pane of glass management and reporting and of course utility based pricing. Basically all of the promises of the cloud, just now applied to advanced network security. Adding new branch sites or mobile workforce users in this model and calculating future costs is incredibly simple. You would no longer need to worry about procuring appliances, capacity planning, designing for HA, software updates, licensing or any of the other hurdles encountered in attempting to implement this in your own environment.

It sounds like unicorns and rainbows…however this cloud based security platform model already exists. Zscaler with its Internet Access service appears to have pioneered the approach of going “all in” on completely cloud based delivery, with other companies like Opaq adopting a cloud first security platform model as well. Traditional security vendors like Palo Alto came on board last week with GlobalProtect, their own version of a cloud based offering. Juniper, through their Software Defined Secure Networking (SDSN) solution, is delivering sandboxing in the cloud via Sky ATP combined with automated mitigation and quarantining via their traditional SW or HW based on-prem security appliances and a growing ecosystem of multi-vendor switches. Besides aspects of their solutions being delivered from the cloud, what is also common across all these offerings is the ability to share detected threat data immediately across their entire customer base.

My goal was not to list every vendor or get into the merits of each vendor’s specific implementation or who you should evaluate…let’s leave that as an exploratory exercise for the Enterprise looking for a security solution to accommodate their WAN transformation projects and up-level their existing security posture across the Enterprise. I just wanted to acknowledge that moving advanced security measures to the cloud appears to be the future of Enterprise security, and for really good reasons.

 

Disclaimer: The views expressed here are my own and do not necessarily reflect the views of my employer Juniper Networks

AWS Certified Solutions Architect Study Prep

After recently passing the AWS Certified Solutions Architect – Associate level exam several folks had asked what I did in preparation for the exam and whether I had any advice.  Figured I might as well just post this for others who may be interested.

By blocking out 2-3 hour chunks of focused time every few days, I figure about 1.5 months is a safe ballpark for how long the prep work took, between ~15 hours of video lectures and a 14-chapter certification guide plus lab exercises.

One thing I do have to say about this particular industry certification is that it’s never been easier to quickly gain access to the end product you are learning and prepping to certify on. I mean, let’s be honest, the ease with which you can quickly spin up resources to practice and tear them down when you are done, all while minimizing the cost of certification prep, is a perfect illustration of the power and value of cloud computing.

A couple of suggestions:

1.) Go open up an AWS account so you can get some real stick time. You can utilize a lot of services in the free tier just to get your feet wet. Remember, the stuff you create that is not free only needs to exist for a few minutes, which results in minimal damage to the wallet as you are learning.
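For example, here is a minimal boto3 sketch of the spin-up-and-tear-down workflow. The AMI ID is a placeholder, and it assumes your AWS credentials and default VPC are already configured.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single free-tier-eligible instance (replace the placeholder AMI ID).
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # placeholder: pick a current Amazon Linux AMI
    InstanceType="t2.micro",         # free-tier eligible
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
print("launched", instance_id)

# ...experiment for a few minutes, then clean up so the meter stops running.
ec2.terminate_instances(InstanceIds=[instance_id])
```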

2.) Know the core services inside and out. You really have to understand things like VPC Networking constructs, S3/Glacier, RDS, CloudFront, Route53, Auto Scaling and ELB.

3.) Understand how to secure what YOU put into the cloud. Know the differences between what you need to manage the security of versus what AWS manages the security of.

4.) Understand AWS best practices as this is a large part of the exam. As a Solutions Architect you are expected to be able to design highly-available, scalable, fault tolerant and cost-efficient systems. When you take the exam pay very close attention to these themes as you answer the questions. The right implementation choice answer absolutely depends on the theme mentioned.

5.) If you do get the certification guide spend a solid day before the exam reviewing the end of chapter summaries of the key concepts for the exam. If it takes you as long as it did me to get through all the material (while still managing my day job and being a sports parent) then a quick refresher will certainly help.

The following are the particular test prep resources that I decided to use:

AWS Certified Solutions Architect Official Study Guide: Associate Exam

A few people left crappy reviews for the book, but I found it to be a very good in-depth look at AWS in general. For someone looking for organized guidance it provides a solid outline of the concepts you will need to know for the exam. There are quizzes at the end of each chapter and several chapters have hands-on exercises at the end. If you decide not to do all of the available exercises in the book, I would suggest you at least do the exercises at the end of Chapter 14. Chapter 14 is pretty critical as it ties together all of AWS’s recommended best practices, which are a large part of the exam.

Pluralsight Courses

Architecting Highly Available Systems on AWS
AWS Security Fundamentals
AWS VPC Operations
AWS Certified SysOps Administrator – Associate

Stock Neutron OVS – Not the Telco Cloud you were looking for?

Do I need a 3rd party SDN controller for my OpenStack based “Telco Cloud” DC?  Isn’t the default Neutron networking stack that comes with OpenStack considered “SDN”?  What benefits does a 3rd party Controller bring that would make it worth the additional time, money and effort to deploy? These are important questions that come up when a Service Provider is looking to build a cloud DC to host virtual workloads.

The reality is that a lot of Service Provider Clouds have not yet ramped up to the scale or VNF requirements complexity where they have fully experienced the pain points that a 3rd party SDN controller will solve for their OpenStack based Cloud DC.  Many have started out cautiously (smartly) by virtualizing the proverbial “low hanging fruit” applications which don’t require complex, high performance virtual networking at scale.  For these initial efforts the challenges of Neutron with OVS likely haven’t reared their ugly head.  Unfortunately in the world of Telco Clouds there are still a lot of “snowflake” applications that have special requirements that challenge the basic networking capabilities of OpenStack Neutron Networking with OVS. The purpose of this blog post is to delve into what benefits a 3rd party SDN Controller will provide when the number and complexity of hosted virtual applications start to increase in a Telco Cloud.

What are we building?

Here are some immediate considerations that come to mind when building a Telco Cloud:

  • Are the virtual workloads you will be hosting Virtualized Network Functions (VNFs) requiring high forwarding performance, low latency and high availability?
  • Do the virtual workloads need to participate in the service provider routing domain outside the DC?
  • Are any of the tenants that need to access these virtual workloads coming from legacy service provider networks?
  • Do you need to scale to more than a couple hundred tenant virtual networks?
  • Do you need to implement complex chains of service functions as part of these virtual workloads?
  • Do you want to use a virtual network overlay to avoid configuring DC switch and router hardware with per tenant state?
  • What type of virtual network security policies do you need and where are you going to be applying them?
  • Do you need end to end visibility of where your traffic is going in the DC and real time metrics on how your virtual applications are performing?

If you answered yes to several of the questions above then you should be taking a closer look at a 3rd party SDN Controller and here is why…Telco Cloud Is Different.

Telco Cloud IS Different

Building a Service Provider or “Telco Cloud” is very different than your traditional virtualized DC hosting Enterprise applications. The following is a list of a few key distinctions when dealing with Telco Cloud applications versus traditional virtualized applications.

  • Legacy network interconnection – a lot of these VNFs require connectivity to the Service Provider WAN for which BGP and MPLS BGP VPNs are the norm. You will need a way to dynamically advertise the reachability of these VNFs to the WAN outside the DC network.
  • VNFs typically have very high packet throughput and low latency requirements which the default OpenStack Neutron vSwitch OVS struggles to handle.
  • Telco Cloud virtualized applications are typically VNFs that are deployed as a chain of services. For example think about how one might configure the virtual networking required to move traffic through all of the virtual components of a 4G or 5G vEPC.
  • High Availability schemes and VNF health checking are often required to quickly move traffic to local or remote backup instances of the service function
  • How do I implement QoS? For example, if I host a virtual Route-Reflector that is the ‘brains’ of my external physical legacy network inside the shared infrastructure that is my virtualized Telco Cloud DC, how do I ensure prioritization of this vRR workload’s control plane packets to and from the legacy external infrastructure?
  • IPv6 – it’s very real in Telco NFV deployments
  • Connectivity of virtualized functions to non-virtualized functions inside the same DC for brownfield deployments

Neutron with OVS still isn’t fully there yet for Telco Cloud

Isn’t the default Neutron networking stack that comes with OpenStack considered “SDN”? Yes, but unfortunately, even after many years of continuous improvements, the stock Neutron OVS networking solution that comes with OpenStack still struggles to orchestrate the required virtual networking functions at scale and with the visibility required to deploy, monitor, troubleshoot and capacity plan a Telco Cloud.

The existence of the OVN project, which is essentially an overhaul of OVS, pretty well summarizes the issues that exist in OVS in its charter:

“OVN will put users in control over cloud network resources, by allowing users to connect groups of VMs or containers into private L2 and L3 networks, quickly, programmatically,  and without the need to provision VLANs or other physical network resources….

https://networkheresy.com/2015/01/13/ovn-bringing-native-virtual-networking-to-ovs/

Perhaps the most important issue is that gaps in OVS capabilities can result in the Telco Cloud provider still needing to manually provision parts of the physical DC network infrastructure with tenant reachability information using a vendor specific CLI or NMS. So when a VM/VNF spins up, I now need the networking team to go statically configure VLANs and routes on all the DC infrastructure gear to properly isolate and forward my tenant traffic. This is the same way we have built DC networks for the last 20+ years. A Software Defined Network shouldn’t require tenant state to be configured in the underlying DC fabric equipment. Putting tenant state into the DC network switches means having bigger state tables to handle lots and lots of logical scale, which leads to having to purchase a higher class (i.e. more expensive) of DC switch. The switching layer should just be transport, allowing the provider to purchase the most cost effective DC switching fabric.

Visibility into where your traffic is going is critically important. Stock Neutron networking with OVS lacks complete visibility into how your traffic flows throughout the virtualized DC. OVS doesn’t correlate overlay tunnel flow paths to underlay hops, identify hot spots, or let you see which security rules and service chains your traffic traverses. This is still a very manual, device-by-device troubleshooting and data collection process with OVS today.

Another consideration is when a Cloud Provider is hosting virtualized network functions (VNFs) that require SR-IOV to achieve high packet per second throughput and low latency.  SR-IOV bypasses OVS and delivers VM packets directly to the NIC. This also contributes to putting tenant state into the underlay switches requiring complex configuration on these switches in order to implement multi-tenant routing, security policy and service chaining functions.

There is good reason that there is a whole ecosystem of 3rd party OpenStack Neutron Networking SDN Controller plug-ins that exist to fill these gaps in stock Neutron OVS.

This link is to a previous OpenStack Summit Session which does a fantastic job covering the deficiencies and complexity involved in implementing an OpenStack Cloud DC using stock Neutron OVS Networking.  I would highly suggest spending the 35 minutes to get a much more detailed view of the performance and troubleshooting challenges in stock OVS that I won’t fully cover in this post.

“Hey over there, isn’t that the fully automated cloud we were looking for”

So what benefits does a 3rd party SDN Controller bring?

To get a better understanding of the gaps and accompanying challenges I’m referring to above, let’s take a look at what a commercial SDN Controller provides in the OpenStack DC over the top of a stock Neutron OVS deployment.  I’ll use elements of Juniper’s Contrail Networking for this as it’s obviously the implementation that I’m most familiar with as a Juniper Networks CSE though solutions like Nokia Nuage VSG and others should be able to deliver similar advanced virtual network orchestration capabilities.

In a nutshell, Contrail uses a centralized SDN controller to program advanced virtual-networking overlay topologies that happen completely in SW at the vRouter level directly on the compute nodes. All Contrail Networking needs to create its overlay networks is IP reachability between all of the servers in a DC, which can be achieved using basic routing between the racks of underlay switches. There is no tenant state or complex configuration required in the switch fabric hardware, and all security and forwarding decisions are made directly at the virtual-networking layer in SW on the servers hosting the containers or VMs, without the need for hairpin routing through centralized service nodes.
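To make that concrete, the tenant-facing workflow with a full overlay solution is just a couple of standard Neutron API calls, with nothing pushed to the fabric switches. Here is a minimal sketch using the openstacksdk Python library; the cloud name, network names and CIDR are made up for this example.

```python
import openstack

conn = openstack.connect(cloud="mycloud")   # credentials come from clouds.yaml

# Create an isolated tenant network and subnet; with an overlay SDN controller this
# is the entire provisioning step -- no VLANs or routes pushed to the fabric switches.
net = conn.network.create_network(name="tenant-blue")
subnet = conn.network.create_subnet(
    network_id=net.id,
    name="tenant-blue-subnet",
    ip_version=4,
    cidr="10.10.10.0/24",
)
print("network", net.id, "subnet", subnet.cidr)
```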

Contrail leverages BGP for dynamic tenant reachability advertisements towards the legacy WAN network infrastructure outside the DC. It also has a “SmartNIC” offload option for ensuring that SR-IOV enabled workloads still leverage all of the Contrail vRouter’s overlay networking functions. Along with complex virtual network capabilities come detailed analytics that show you the performance of your virtual application stacks. All of this leads to efficient, high performance packet forwarding through dynamically created, complex virtual networking topologies at extremely high scale, with full visibility into where your traffic is going.

Here is a non-exhaustive list of important capabilities that a Contrail virtual networking overlay solution would provide for a Telco Cloud DC that a standalone Neutron OVS based solution still lags behind in:

  • No tenant networking state is populated into the underlying DC switch hardware – you are free to build a high speed low latency transport fabric using whatever DC switch OS and HW you want at each tier of your design.
  • Native integration with L3 VPNs in the WAN – This is important for a carrier which desires to offer virtualized services to its existing base of L3 VPN customers.
  • Ability to dynamically insert service chains between containers/VMs without having to provision complicated ACL rules into the entire underlying physical infrastructure
  • Service health checking where a VM/VNF route reachability is auto-withdrawn from the overlay upon health check failure allowing active/backup service instances to exist via automated fail-over
  • Ability to dynamically mirror a VM interface or specific 5-tuples of traffic transiting between 2 virtual networks without configuring span ports in the underlying network fabric
  • BGPaaS where the VM/VNF can run BGP with Contrail Network overlay directly and dynamically advertise its own loopback addresses or other secondary IPs, subnet pool IPs/VIPs etc without having to provision BGP sessions on the physical network HW in the DC
  • Single pane of glass detailed analytics reporting down to the flow level including correlation of overlay network flow path with the underlay network path
  • QOS marking and queuing ensuring fairness of traffic scheduling between containers/VMs running on the same compute host and through the underlay fabric
  • SmartNIC offload of SDN overlay encapsulation + network policy for VNF workloads that require SR-IOV in order to achieve desired performance. This prevents the manual network configuration that normally would have happened when a traditional compute node vSwitch or vRouter is bypassed by SR-IOV.

Here is a video presentation with some great insights from AT&T Mobility at NANOG 70 into the criteria required to build a vEPC which can address the new requirements and challenges that 5G and IOT impose and why Juniper Networks Contrail SDN Overlay Controller was essential to enabling a flexible high performance Telco Cloud.

In summary, an OpenStack based Telco Cloud DC requires an SDN controller to handle the advanced virtual-networking requirements at scale. Without an SDN controller, the Cloud Provider is forced to attempt to provision legacy network interconnect, tenant network isolation instances, redundancy/HA, security rules/ACLs and hairpin routing through service nodes and appliances by programming these tenant specific rules at scale directly in the DC switching and DC Gateway router gear. This is done by cobbling together the same legacy constructs that have been used to build DC networks for the past 20+ years, combined with expensive DC switches that can hold the logical scale required of a large virtualized DC. This is not network virtualization moving at the speed of compute virtualization, which is the promise of SDN. Let your DC fabric be simple transport and leverage a fully functional 3rd party SDN controller to deliver secure and fully automated virtual tenant networking at scale.

Disclaimer: The views expressed here are my own and do not necessarily reflect the views of Juniper Networks