Posts Tagged ‘datacenter fabric’

Mice and Elephants in my Data Center

September 8, 2014 1 comment

Elephant flows

A long-lived flow transferring a large volume of data is referred to as an Elephant flow. Compared to the smaller sized, short lived flows referred to as ‘mice’ and of which we have numerous in the data center. Elephant flows, though not that numerous, can monopolize network links and will consume all allocated buffer for a certain port. This can cause temporal starvation of mice flows and disrupt the overall performance of the data center fabric. Juniper will soon introduce Dynamic Load Balancing (DLB) based on Adaptive Flowlet Splicing as a configuration option in its Virtual Chassis Fabric (VCF). DLB provides a much more effective load balancing compared to a traditional hash-based load distribution. Together with the existing end-to-end path-weight based load balancing mechanism, VCF has a strong load distribution capability that will help network architects drive their networks harder than ever before.

Multi-path forwarding

Multi-path forwarding refers to the balancing of packets across multiple active links between 2 network devices or 2 network stages. Consider a spine and leaf network with 4 spines, all traffic from a single leaf is spread across all the links in order to use as much as possible the available aggregate bandwidth and provide redundancy in case of link failure. 

Multi-path forwarding is typically based on a hash function. A hash function maps a virtually infinite collection of data into a finite or limited set of buckets or hashes, as illustrated below.

img1Image source: Wikipedia (

In networking terms: the hash function will take multiple fields of the Ethernet, IP and TCP/UDP header and use these to map all sessions onto a limited collection of 2 or more links. Because of the static nature of the fields used for the hashing function, a single flow will be mapped onto exactly 1 link and stay there for the lifetime of the flow. A good hashing function should be balanced, meaning that it should equally fill all hashing buckets and by doing so provide an equal distribution of the flows across the available links.

One of the reasons to use static fields for the hashing function is to avoid reordering of packets as they travel through a network when paths might not all be of equal distance or latency. Even if by design all paths are equal, different buffer fill patterns on different paths will cause differences in latency. Reordering can be provided by the end-point or in the network, but it always comes at a cost which is why a network ensuring in-order delivery is preferred.

Because of that static nature, the distribution of packets will be poor balanced when a few flows are disproportionally larger than the others. A long-lived, high volume flow will be mapped to a single link for the whole of its life-time and will cause the network buffer of that link to get exhausted with packet drops as a result.

TCP as in keeping the data flowing

To understand the mechanism of Adaptive Flowlet Splicing, we need to understand some of the dynamics of how data is transmitted through the network. TCP has been architected to try to avoid network congestion and keep a steady flow of data over a wide range of networks and links. One provision that enables this is the TCP window size. The TCP window size specifies how much data can be in-flight in the network before expecting a receiver acknowledgement. The TCP Window pretty much tells the sender to blindly send a number of packets and when an acknowledgement is received from the receiver the sender can slide the window down for the size of one packet for each received ‘ack’. The size of the window is not fixed but dynamic and self-tuning in nature. TCP uses what we call AIMD (Additive Increase, Multiplicative Decrease) congestion control. In AIMD congestion control, the window size is increased additively for each acknowledgement received and cut by half whenever a few acknowledgements were missed (indicating loss of packets). The resulting traffic pattern is the typical saw-tooth:

img2 Adaptive Flowlet Splicing (AFS)

From the above it should be apparent that elephant flows will result in repeating patterns of short bursts followed by a quiet period. This characteristic pattern divides the single long-lived flow over time into smaller pieces, which we refer to as ‘flowlets’. Below picture, courtesy of the blog article by Yafan An [*1], visually represents what flowlets look like within elephant flows when we look at them through a time microscope:

img3Now suppose that the quiet times between the flowlets are larger than the biggest difference in latency between different paths in the network fabric, in that case load balancing based on flowlets will always ensure in-order arrival.

To distribute the flowlets more evenly across the member links of a multi-path, it would be good to keep some kind of relative quality measure for each link depending on its history. This measure is implemented using the moving average of the link’s load and its queue depth. Using this metric, the least utilized and least congested link among the members will be selected for assigning new flowlets.

Is this elephant flow handling unique to VCF ?

By all means, no. But the controlled environment and the imposed topology of the VCF solution allows Juniper to get the timings right without having to resort to heuristics and complex analytics. In a VCF using only QFX5100 switches in a spine and leaf topology, each path is always 3 hops. The latency between two ports across the fabric is between 2µs and 5µs resulting in a latency skew of max 3µs. By consequence, any inter-arrival time between flowlets larger than the 3µs latency skew will allow flowlets to be reassigned member links without impacting the order of arrival of packets.

img4In an arbitrary network topology using a mix of switch technologies and vendors, every variable introduced makes it exponentially more complex to get the timings right and find the exact point where to split the elephant flow into adequate flowlets for distributing them across multiple links.

Another problem we did not have to address, in the case of VCF, is how to detect or differentiate an elephant from a mice. AFS records the timestamp of the last received packet of a flow. In combination with the known latency skew of 3µs, the timestamp is enough to provide the indicator for the reassignment of the flow to another member link. It is less important for AFS to be aware of the flow’s actual nature.

In arbitrary network architectures however, as Martin Casado and Justin Pettit describe in their blog post ‘Of Mice and Elephants’ [*2], it might be helpful to be able to differentiate the elephants from the mice and have them treated differently. Whether this should be done through distinct queues or different routes for mice and elephants or turning the elephants into mice, or some other clever mechanism is a topic of debate and network design. Another point to consider is where to differentiate between them? The vswitch is certainly a good candidate, but in the end the underlay will be the one that handles the flows according to their nature and hence a standardized signaling interface between overlay and underlay must be considered.


By introducing AFS in VCF, data center workloads that run on top of the data center fabric will be distributed more evenly and congestion on single paths that might be caused by elephant flows will be avoided. If a customer has no needs for a certain topology or for a massively scalable solution, a practical and effective solution like VCF brings a lot of value to their data centers.


[*1] Yafan An, Flowlet Splicing – VCF’s Fine-Grained Dynamic Load Balancing Without Packet Re-ordering –

[*2] Martin Casado and Justin Pettit – Of Mice and Elephants –



OpenFlow in the Data Center

October 2, 2012 Leave a comment

A QFabric perspective on the emerging network virtualization technologies

What is OpenFlow?

Literally quoting the website : “OpenFlow is an open standard that enables researchers to run experimental protocols in the campus networks we use every day. OpenFlow is added as a feature to commercial Ethernet switches, routers and wireless access points – and provides a standardized hook to allow researchers to run experiments, without requiring vendors to expose the internal workings of their network devices. OpenFlow is currently being implemented by major vendors, with OpenFlow-enabled switches now commercially available.”

In a router or switch, the fast packet forwarding (data plane) and the high level routing decisions (control plane) occur on the same device. In an OpenFlow Switch the data plane portion resides on the switch, while the high-level routing decisions are moved to a separate controller. The communication between the OpenFlow Switch and the OpenFlow Controller uses the OpenFlow protocol.

An OpenFlow Switch presents a clean flow table abstraction; each flow table entry contains a set of packet fields to match, and an action (such as send-out-port, modify-field, or drop). When an OpenFlow Switch receives a packet that does not match its flow table entries, it sends this packet to the controller. The controller makes the decision on how to handle this packet and adds a flow entry to the switch’s flow table directing the switch on how to forward similar packets in the future.

What is QFabric?

QFabric is a distributed device that creates a single switch abstraction, using a central control plane with smart edge devices representing the data plane. Multiple edge devices are interconnected through a common backplane implemented by 2 or more dedicated interconnect devices. All high-level layer 2 and layer 3 decisions are controlled by a central director (control plane) which supplies the edge devices with information on how to forward packets. Edge devices are smart in a sense that they make their own forwarding decisions for local forwarding while informing the control plane on their local topology and state, and taking input from the central director for making inter edge device forwarding decisions. The communication between the control plane and distributed data planes is implemented using the mature and standardized MBGP protocol (IETF RFC 4760).

The smart edge devices allow for better scalability. The backplane of the distributed device is implemented by very fast interconnects providing the edge devices with a consistent latency between any 2 ports across the whole fabric. Management is abstracted in a central control plane and leaves the administrator with a single switch view.

QFabric is a distributed device, performing and acting like a single switch, implemented using top of rack edge nodes for deployment flexibility. All components of the QFabric are fully redundant and the central control plane provides the single switch abstraction and management view. Using QFabric, it is possible to create one flat network with consistent latency and performance scaling up to 6144 ports.

OpenFlow in the datacenter

OpenFlow is a SDN protocol that I would position as a L2 network virtualization solution in the datacenter world, much like NVGRE, VXLAN, EVB and others. It provides a way to scale beyond the infamous 4095 VLAN restriction imposed by most of the datacenter network hardware in use today.

I see OpenFlow as one of the potential solutions for multitenant cloud datacenters. In public cloud datacenters the primary concern is scaling isolated environments as far as possible, with the option to go well beyond the 4095 VLAN limit. Second concern in multitenant clouds is the dynamic provisioning of services. In an IaaS public cloud for example, it is imperative to have dynamic provisioning of the network layer as new virtual machines are created and deployed – software orchestration using OpenStack, CloudStack, vCloud and others need to integrate with the network and OpenFlow is probably one of the most dynamic solutions available today to achieve this integration easily and in a vendor agnostic way.

Though dynamic in nature, scale is one of the issues that might impact an OpenFlow network. The central controller is the single brain and decision point for all the devices in the network. Except for elephant flows, which are more typical for big data synchronization and backup applications, lots of short lived connections are made across the datacenter network. Any new sessions or flows have to synchronize around the central OpenFlow controller, making this controller the choke point and virtually limiting the scale and performance of the datacenter.

Network BubbleFor flexible deployment of cloud services, when scaling beyond one rack, it is imperative to have a flat network architecture. This flat network architecture can only be implemented by a fabric that provides consistent latency, not favoring flows between two devices located in the same bubble. Traditional network design requires multiple layers to scale, resulting in bubbles at the hardware layer, and location awareness becomes the main restriction for flexible management. See also the IBM Redbook – Build a Smarter Data Center with Juniper Networks QFabric for a more elaborate description of the bubble issue.

The only solution available today, providing a flat and scalable architecture with consistent low latency across all ports and racks is QFabric. This requirement becomes even more apparent when thinking about OpenFlow. OpenFlow delivers dynamic provisioning for a virtual network across physical servers. In order to not suffer a management nightmare and having to confine virtual machine motion to only a subset of the network where optimal performance and latency exists between different servers (bubbles), one requires a flat network architecture. This makes Qfabric the best architecture to run OpenFlow, not requiring location awareness and providing true dynamic scalability across the datacenter.

L2 Virtualization in the Datacenter

L2 virtualization provides a solution to the question how to create virtualized networks on top of a physical network infrastructure matching the dynamic nature of server virtualization in datacenters today. Think about a couple of virtual servers, all residing in the same L2 VLAN, but that can move from one physical host to another. How do we handle the dynamic nature of VLANs moving from one physical host to another host on the physical network ports? Look at this as a mapping problem between virtual and physical VLANs, the VLANs known and living in the virtual switch and those known and residing on the physical network ports.

The most obvious approach to this mapping question is to define all used VLANs on every port connected to a physical server. By far the easiest solution, but… as so many times, the easiest not always being the best… The most important issue faced when defining all VLANs as a trunk on every server access port is MAC flooding. Since all VLANs are defined on all physical ports, whenever a MAC address is unknown to the switch, the switch will flood the ARP request to all ports carrying that specific VLAN where this MAC address is supposed to live. This means that all servers will receive ARP requests flooded by the switch for all VLANs, even if the physical server is currently not hosting any virtual machines that participate in this VLAN. As such, there is no issue as the server will drop the unwanted packets; however the packet needs to come up the network interface device driver in software before it is discarded and in doing so will waste CPU cycles that could otherwise be allocated to virtual machines.

The problem described above is more apparent when you have a lot of L2 isolated networks (VLANs) and lots of VMs joining and leaving the L2 network (eg starting, resuming, stopping machines) which is typical for VDI environments. If you have a limited set of VLANs and servers running 24/7, this problem is much less apparent as flooding will be limited.

Another problem is the limitation on the number of VLANs. If you are running a multitenant environment with many customers and allocate one or even more VLANs per customer, your scalability will be limited by the max number of VLANs on traditional networking equipment (max 4095).

To overcome the above limitations, a number of solutions emerged and have been forming and standardizing in the last few months/years. They can be classified in a few different approaches:

  • Dynamic VLAN assignment solutions
  • Layer 2 encapsulation over L3 networks (overlay networks)
  • OpenFlow (which is a solution that might also fit in the overlay networks class, but treated independently)

Dynamic VLAN assignment solutions

Moving VLANs on physical port trunks depending on where virtual machines are active in the physical servers is an easy solution to overcome the flooding issue. Dynamic motion of VLANs can be achieved using software controllers which integrate with virtualization platforms (eg the RESTful/SOAP API provided by VMware) in order gather knowledge about which machines are running on which physical hardware and also which VLANs a particular vSwitch is serving. The software controllers can then dynamically reconfigure the physical VLANs on the switch port trunks. This controller software can be running on a separate server or it can be embedded in the switch’s control plane. Several vendors use this approach today. Arista VM Tracer and Force10 HyperLink are examples of such controller embedded in the switch’s control plane while Juniper provides an application in its Junos Space network management platform called Virtual Control which runs in a separate server.

The emerging Edge Virtual Bridging (EVB ; 802.1Qbg) standard has a component addressing the above described dynamic VLAN assignment through the standardization of a negotiation protocol between the physical switch and the virtual switch. This protocol is called VSI Discovery and Configuration Protocol (VDP) and in my opinion the most elegant solution for small to medium sized cloud datacenters available today.

The market adoption of EVB and VDP is growing, but today it is limited to a few vendors that expressed commitment to this standard. One of which is Juniper on the physical side and on the virtual side there is Open vSwitch and IBM’s Distributed Virtual Switch 5000V. VMWare has not yet expressed interest for this open standard as of yet and is proposing a solution based on a collaboration with Cisco called VXLAN. As such VMWare virtualization deployments need to replace the standard distributed vSwitch by IBM’s 5000V offering to use VDP. KVM and XEN are compatible with Open vSwitch and are compliant as such. Microsoft with Hyper-V has not expressed specific interest yet and is proposing their proper solution based on GRE (see below). In the newest Microsoft Server 2012 however, Hyper-V provides a flexible and extensible virtual switch which allows third parties to code extensions to the switch using WFP and NDIS filter drivers (known as extensible switch extensions). It is certainly imaginable that with time a VDP extension will be available for Hyper-V.

Layer 2 encapsulation across L3 networks

A second approach to the L2 virtualization problem is to create isolated overlay L2 networks between virtual machines on top of an L3 IP based network (MAC-over-IP). Each physical machine hosting multiple virtual machines has only one IP address from the physical network point of view and the virtual switches on the different physical servers create tunnels between themselves for each L2 virtual network. The virtual switches dynamically build the tunnels for each VLAN required by the virtual machines running on the virtual switch, merely creating overlay networks on top of the physical network. As of today, none of the implementations are providing an intelligent MAC distribution mechanism built into their control planes and as such the ARP protocol and MAC flooding mechanism from the physical world are conserved leading to broadcasts and multicasts being encapsulated inside the overlay tunnel, which in turn are translated into L3 multicasts on the physical network. Two very similar solutions are emerging in this area, NVGRE from Microsoft and VXLAN from VMWare/Cisco.

It is imperative that the physical network provides a good and performing L3 multicast implementation in order to transport the overlay L2 networks. The one flat network requirement also stands for this scenario. Avoiding bubbles is imperative to any network virtualization technology used. L3 multicast was taken into account during the design of QFabric, as such QFabric excels at handling multicast and has provides the flat network architecture making it the best choice for overlay virtual networks.


Another emerging solution is provided through OpenFlow. The dynamic characteristic of the per flow forwarding of OpenFlow and the central control plane approach allow for easy management and design of overlay networks between virtual machines on top of a physical infrastructure. The traditional VLAN approach is left and isolation is provided through flows on the OpenFlow Switches.

For OpenFlow to succeed as a L2 virtualization solution in the datacenter, it needs to be present at all levels of the architecture. At the virtual switch level, Open vSwitch supports OpenFlow, NEC has an extension for the Microsoft Hyper-V virtual switch which integrates with its OpenFlow controller and VMWare recently acquired Nicira so OpenFlow might be part of their strategy as well.

At the physical switch level, if no fabric technology has been considered like QFabric, FEX, TRILL or SPB ; it is imperative that all the layers of interconnection support OpenFlow, not only the server access switches.

Juniper supports OpenFlow, is an active contributor in the standards process and plans to bring OpenFlow to QFabric.  As QFabric provides the architecture of choice (one flat network) for an OpenFlow based datacenter implementation, it can uniquely position them in the SDN marketplace. Juniper’s Software Solution EVP, Bob Muglia recently talked with Jim Duffy about Juniper’s SDN strategy here.

QFabric and L2 virtualization

For every L2 virtualization technology known today, the requirement to simplify the datacenter network connecting the physical servers to one flat network stands. If the network consists of multiple hops and inconsistent latency between network ports, the transparent overlay network idea will fail and careful planning and management for location awareness will be required to make this a success.

In all 3 scenarios, today, QFabric comes out as the architecture of choice to support network virtualization. Be it using EVB VDP, L2 overlay networks or OpenFlow; QFabric will provide investment protection whatever direction or technology you decide to run in the future.

IBM and Juniper QFabric

IBM is committed to QFabric and the one flat network architecture as the future of datacenter networking. See here and referring to 2 recent Redbooks that IBM published in this regard:

Considerations on deploying OpenFlow in the Data Center

A flat network that is scalable, performing and provides consistent latency is the foundation for a good network virtualization strategy. If you do not want to be restricted or confined to designing and managing bubbles in your network, this is unavoidable.

If no fabric technology has been considered like QFabric, FEX, TRILL or SPB ; all the layers of interconnection need to support OpenFlow. Make sure all proposed devices support OpenFlow today, from the core to the access including the virtual switch.

When considering storage convergence and especially FCoE, compliance with DCB is required. Best practice is to split off the FCoE in a completely separate VLAN and use DCBx for ease of deployment, avoiding OpenFlow in the storage VLAN. For pure OpenFlow devices, the OpenFlow controller would require some level of insight in the FC world and extensions for QoS in the OpenFlow device.

When planning for OpenFlow, it is best to consider devices that provide “traditional” L2/L3 mode besides pure OpenFlow. It will allow the use of OpenFlow for parts of the datacenter where they have their best use case and at the same time mix in the “traditional” L2/L3 for critical and latency sensitive protocols such as storage (FCoE, iSCSI, NAS).

Troubleshooting OpenFlow based networks can be a daunting task because of the distributed nature of the data and control planes. In QFabric this is solved by providing troubleshooting tools that mirror the traditional troubleshooting available in traditional switches. Check the OpenFlow devices and controllers for troubleshooting tools and options.

Availability and scalability of the OpenFlow controller is another concern that cannot be taken lightly. In case of an unreachable controller, the OpenFlow devices cannot function and in best case fall back to their traditional L2/L3 functionality (if the device is L2/L3 and OpenFlow capable at the same time).

QFabric is fully redundant by design, from the control network up to the directors, nodes and interconnects. When deciding to run OpenFlow you should design with this same redundancy in mind, meaning fully redundant, physically separated out-of-band connectivity between the controller and the OpenFlow devices for the control plane.

Finally, one more point that needs looking into is how the OpenFlow Controller and the OpenFlow Switches handle multicast. As you know, QFabric is designed with multicast in mind and use multicast trees in the interconnect layer. Multicast being one of the foundations of overlay networks for network virtualization.

%d bloggers like this: