Arista flattens networks for large enterprises with splines
by Timothy Prickett Morgan
Arista Networks, which boasts serial entrepreneur Andy Bechtolsheim as its chief development officer and chairman, has been selling products for five years against Cisco Systems, Juniper Networks, Mellanox Technologies, and others in the networking arena. Its largest customer has 200 000 servers connected on a single network.
So Arista can hardly be thought of as a start-up. But the company is still an upstart, and is launching a new set of switches that deliver an absolutely flat network, called a spline, to customers who need moderate scale, simplicity, and low cost.
Arista has actually trademarked the term "spline" to keep others from using it, just as it did with the leaf-spine naming convention it came up with for its original products back in 2008. A spline network is aimed at the low end of extreme scale computing, at just over 2 000 server nodes, but it is going to provide some significant advantages over leaf-spine networks, Anshul Sadana, senior vice-president of customer engineering, explains.
The innovation five years ago with leaf-spine networks, which other switch vendors now mimic, was to create a Layer 2 network with only two tiers, basically merging the core and aggregation layers of a more traditional network set-up into a spine, with top-of-rack switches as the leaves. This cuts costs, but more importantly, a leaf-spine network is architected to handle a substantial amount of traffic between interconnected server nodes (what is called east-west traffic), while a three-tier network is really architected to connect a wide variety of devices out to the Internet or an internal network (what is called north-south traffic). Arista correctly saw that clouds, high frequency trading networks, and other kinds of compute clusters were going to have far more east-west traffic than north-south traffic.
With a leaf-spine network, every server on the network is exactly the same distance away from all other servers - three port hops, to be precise. The benefit of this architecture is that you can just add more spines and leaves as you expand the cluster and you don't have to do any re-cabling. You also get more predictable latency between the nodes.
But not everyone needs to link 100 000 servers together, which a leaf-spine set-up can do using Arista switches. And just to be clear, that particular 200 000 node network mentioned above has 10Gb/sec ports down from the top-of-rack switches to the servers and 40Gb/sec uplinks to the spine switches; to scale beyond 100 000 nodes, Arista had to create a spine of spine switches based on its 40Gb/sec devices. Sadana says only about four customers in the world build networks that large; they are very much the exception, not the rule.
Arista has over 2 000 customers. Its earliest came from the financial services sector, followed by public cloud builders and service providers, Web 2.0 application providers, and super-computing clusters. Among cloud builders, service providers, and very large enterprises, Sadana says that somewhere between 150 and 200 racks is a common network size, with 40 server nodes per rack being typical among the Arista customer base. That works out to 6 000 to 8 000 servers.
"And then you move one level down, and you hit the large enterprises that have many small clusters, or a medium-sized company where the entire data centre has 2 000 servers," Sadana explains. "Many companies have been doing three-tier networks, but only now are we seeing enterprises migrate to 10Gb/sec Ethernet, and as part of that migration they are using new network architecture. We have seen networks with between 200 to 2 000 servers among large enterprises, with 500 servers in a cluster being a very common design. These enterprises have many more servers in total - often 50 000 to 100 000 machines - but each cluster has to be separate for compliance reasons."
This segregation of networks among enterprises was one of the main motivations behind the development of that new one-tier spline network architecture and a set of new Ethernet switches that deliver it. And by the way, you can build more scalable two-tier leaf-spine networks out of these new Arista switches if you so choose.
With the spline set-up, you put two redundant switches in the middle of the row and link each server to both switches using two server ports. The dual connections are purely for high availability. The important thing is that each server is precisely one hop from any other server on the network.
You also don't have stranded ports with a spline network. Large enterprises often have a mix of fat and skinny nodes in their racks, and they typically have somewhere between 24 and 32 servers per rack. If you put a 48-port top-of-rack switch in to connect those machines up to a spine, you will have anywhere from 16 to 24 ports on that switch that will never be used. These are stranded ports, and they are an abomination before the eyes of the chief financial officer. What the CFO is going to love is that a spline network takes fewer switches and cables than either a three-tier or two-tier network, and basically you can get 10Gb/sec connectivity for the price of 1Gb/sec Ethernet switching. Here's how Arista is doing the math comparing a one-tier spline network for 1 000 servers to a three-tier network based on Cisco gear and its own leaf-spine set-up:
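The stranded-port arithmetic is worth spelling out. This is a back-of-envelope sketch of our own, not Arista's comparison worksheet, using the rack sizes and switch port count quoted above:

```python
# Stranded ports: top-of-rack switch ports that can never be used
# because the rack holds fewer servers than the switch has ports.

TOR_PORTS = 48  # typical 48-port top-of-rack switch, per the article

def stranded_ports(servers_per_rack: int, tor_ports: int = TOR_PORTS) -> int:
    """Ports left over on a top-of-rack switch after cabling every server."""
    return tor_ports - servers_per_rack

for servers in (24, 32):
    print(f"{servers} servers/rack -> {stranded_ports(servers)} stranded ports")
# 24 servers/rack -> 24 stranded ports
# 32 servers/rack -> 16 stranded ports
```

That 16-to-24 range per rack, multiplied across 150 or 200 racks, is the waste the spline design eliminates by pulling the switches into the middle of the row.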
The power draw on the new Arista 7250X and 7300X switches is also 25% lower per 10Gb/sec port than on the existing machines in the Arista line, says Sadana. This will matter to a lot of shops working in confined spaces and on tight electrical and thermal budgets.
The original 64-port 10Gb/sec low latency 7050 switch that put Arista on the map five years ago was based on Broadcom's Trident+ network ASIC, and its follow-on 40Gb/sec model used Broadcom's Trident-II ASIC. The 7150 switches used Intel's Fulcrum Bali and Alta ASICs. The new 7250X and 7300X switches put multiple Trident-II ASICs into a single switch to boost the throughput and port scalability. All of the switches run Arista's EOS network operating system, which is a hardened variant of the Linux operating system designed just for switching.
At the low end of the new switch line-up from Arista is the 7250QX-64, which has multiple Trident-II chips for Layer 2 and Layer 3 switching as well as a quad-core Xeon processor with 8GB of its own memory for running network applications inside of the switch. The 7250QX-64, as the name suggests, has 64 40Gb/sec ports and uses QSFP+ cabling. You can use cable splitters to make it into a switch that supports 256 10Gb/sec ports. This device has 5Tb/sec of aggregate switching throughput and can handle up to 3.84 billion packets per second. Depending on the packet sizes, port-to-port latencies range from 550 nanoseconds to 1.8 microseconds, and the power draw under typical loads is under 14 watts per 40Gb/sec port. The switch supports VXLAN Layer 3 overlays and it also supports OpenFlow protocols, so its control plane can be managed by an external SDN controller. The 7250QX-64 has 4GB of flash storage and a 100GB solid state drive for the switch functions, as well as a 48MB packet buffer.
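Those 7250QX-64 figures hang together as simple arithmetic; here is that arithmetic spelled out as a sketch, on the assumption that the quoted 5Tb/sec aggregate is the full-duplex total rounded down:

```python
# Port and throughput math for the 7250QX-64, as we read the figures.

QSFP_PORTS = 64       # 40Gb/sec QSFP+ ports on the switch
GBPS_PER_PORT = 40

# Each QSFP+ port splits into four 10Gb/sec lanes via a cable splitter.
ten_gig_ports = QSFP_PORTS * 4

# Aggregate switching throughput, counting both directions (full duplex).
agg_tbps = QSFP_PORTS * GBPS_PER_PORT * 2 / 1000

print(ten_gig_ports)  # 256
print(agg_tbps)       # 5.12 -- consistent with the quoted 5Tb/sec
```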
If you want to build a leaf-spine network from the 7250QX-64 and use cable splitters to convert it into a 10Gb/sec switch, you can have up to 49 152 server nodes linked. That configuration assumes you have 64 spine switches and 256 leaf switches and a 3:1 over-subscription level on the bandwidth. Sadana says the typical application in the data centre is only pushing 2Gb/sec to 3Gb/sec of bandwidth, so there is enough bandwidth in a 10Gb/sec port to not worry about over-subscription too much. If you wanted to use the top-end 7508E modular switch as the spine and the 7250QX-64 as the leaves, you can support up to 221 184 ports running at 10Gb/sec with the same 3:1 bandwidth over-subscription. This network has 64 spines and 1 152 leaves.
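The two scaling figures above fall straight out of the 3:1 over-subscription ratio. A quick sketch of that arithmetic (our own sanity check, not Arista's sizing tool):

```python
# Leaf-spine scaling with the 7250QX-64 as the leaf switch.
# Split into 10Gb/sec ports, each leaf has 256 ports; at 3:1
# over-subscription, three quarters face the servers and one
# quarter faces the spine.

PORTS_PER_LEAF_10G = 64 * 4   # 64 QSFP+ ports, each split four ways
OVERSUB = 3                   # downlink:uplink bandwidth ratio

downlinks = PORTS_PER_LEAF_10G * OVERSUB // (OVERSUB + 1)  # 192 server-facing ports

print(downlinks)           # 192
print(256 * downlinks)     # 49152 servers with 7250QX-64 spines and leaves
print(1152 * downlinks)    # 221184 servers with 7508E spines, 7250QX-64 leaves
```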
The 7300X modular switches, also based on multiple Trident-II ASICs, come in three chassis sizes and support three different line cards. These are used for larger spline networks, and to get to the 2 048 server count, you need the top-end 7300X chassis. Here are the enclosure speeds and feeds:
And here are the line cards that can be slotted into the enclosures:
Here's the bottom line. The 7300X tops out at over 40Tb/sec of aggregate switching bandwidth, with up to 2.56Tb/sec per line card, and can handle up to 30 billion packets per second. It can have up to 2 048 10Gb/sec ports or 512 40Gb/sec ports, and latencies are again below two microseconds on port-to-port hops.
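Those bottom-line numbers are internally consistent if you assume a 16-slot top-end chassis with 32 QSFP+ ports per line card; the article does not state the slot counts, so treat those two figures as our assumption in the sketch below:

```python
# 7300X top-end chassis math, assuming 16 line-card slots and
# 32x 40Gb/sec ports per card (not stated in the article).

SLOTS = 16
QSFP_PER_CARD = 32
TBPS_PER_CARD = 2.56   # per-line-card switching bandwidth, per the article

print(SLOTS * QSFP_PER_CARD)       # 512 x 40Gb/sec ports
print(SLOTS * QSFP_PER_CARD * 4)   # 2048 x 10Gb/sec ports via splitters
print(SLOTS * TBPS_PER_CARD)       # 40.96 Tb/sec -- "over 40Tb/sec"
```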
The 7300X series can be ordered, and it costs $500 per 10Gb/sec port or just under $2 000 per 40Gb/sec port for the various configurations of the machine, according to Sadana. The top-end 7500E switch, which was chosen by BP for its 2.2 petaflops cluster for seismic simulations, costs around $3 000 per 40Gb/sec port, by the way. This beast supports more protocols and has more bandwidth per slot than the 7300X. The 7500E also has 100Gb/sec line cards, which the 7300X does not - yet. But Sadana says they are in the works.