In late 2012 I wrote about the migration of DataSift's Hadoop cluster to Arista switches but what I didn't mention was that we also moved our real-time systems over to Arista too.
Within the LAN
During our fact-finding trek through the Cisco portfolio we acquired a bunch of 4948 and 3750 switches which were re-purposed into the live platform. Unfortunately, the live platform (as opposed to Hadoop-sourced historical data) would occasionally experience performance issues due to the fan-out design of our distributed architecture amplifying the impact of micro-bursts during high traffic events. Every interaction we receive is augmented with additional meta data such as language detection, sentiment analysis, trend analysis and more. To acquire these values an interaction is tokenized into the relevant parts (for example, a Twitter user name for Klout score, sentences for sentiment analysis, trigrams for language analysis and so on). Each of those tokens is then dispatched to the service endpoints for processing. A stream of 15,000 interactions a second can instantly become 100,000+ additional pieces of data traversing the network which puts load on NICs, switch backplanes and core uplinks.
If a particular request were to fail then precious time is wasted on waiting for the reply, processing the failure and then re-processing the request. To combat this you might duplicate calls to service endpoints (for example, speculative execution in Hadoop parlance) which doubles your chances of success and those ~100,000 streams would become ~200,000 putting further stress on all your infrastructure.
At DataSift we discuss internal platform latency in terms of microseconds and throughput in tens of gigabits so adding an unnecessary callout here or a millisecond extra there isn't acceptable. We want to be as efficient, fast and reliable as possible. When we started looking at ways of improving the performance of the real-time platform it was obvious that many of the arguments that made Arista an obvious choice for Hadoop also meant it was ideal for our real-time system too. The Arista 7050 switches we'd already deployed have some impressive statistics in regards to latency so we needed little more convincing that we were on the right path (although the 1.28 Tbps and 960,000,000 packets per second statistics don't hurt either). For truly low-latency switching at the edge, one would normally look at the 7150 series but from our testing the 7048 switches were well within the performance threshold we wanted and enabled us to standardize our edge. We made use of our failure-intolerant platform design (detailed further below) to move entire cabs at a time over to the Arista 7048 switches with no interruption of service to customers. Once all cabinets were migrated and with no other optimizations at that point we saw an immediate difference in key metrics:
Simply by deploying Arista switches for our 'real-time' network we decreased augmentation latency from ~15,000µs down to 2200µs. Further optimizations to our stack and how we leverage the Linux kernels myriad options have improved things even more.
Epic Switches are only half the story
One of the great features of the Arista 7048 switches is their deep buffer architecture but in certain circumstances another buffer in the path is that last thing you want. Each buffer potentially adds latency to the system before the upstream can detect the congestion and react to it. The stack needs to be free of bottlenecks to prevent the buffers from filling up and the 7048 switches can provide up to 40Gb/s of throughput to the core which fits nicely with 40 1u servers in a 44u cabinet. With that said we're not ones to waste time and bandwidth by leaving the TOR switch if we don't have to. By pooling together resources into 'cells' we can reduce uplink utilization and decrease the RTT of fan out operations by splitting the workload into per-cabinet pools.
With intelligent health checks and resource routing coupled with the Aristas non-blocking full wire speed forwarding in the event of a resource pool suffering failures the processing servers can call out cross-rack with very little penalty.
That's great but I'm on the other side of the Internet
We are confident in our ability to provide low-latency, real-time content that is filtered and augmented. This enables us to publish live latency statistics of a stream consumed being consumed by an EC2 node from the other side of the planet on our status site. We can be this confident because we control and manage every aspect of our platform from influencing how data traverses the Internet to reach us, our routers, our switches all the way down to the SSD chipset or SAS drive spindle speed in the servers. (You can't say that if you're on someone's public cloud!)
(They could be next door to a social platform DC or over in Antarctica)
|10ms - 150ms|
Source Platform Processing time
(For example, the time taken for Facebook or Twitter to process & send it on)
(For example, San Jose to our furthest European processing node)
(For example, from a European processing node to a customer in Japan)
When dealing with social data on a global scale there can be a lot of performance uncertainty with under-sea fiber cuts, carrier issues and entire IX outages but we can rest assured that once that data hits our edge we know we can process it with low latencies and high throughput. In conclusion I've once again been impressed by Arista and would whole heartedly recommend their switches to anyone else working with high volume, latency sensitive data.
Arista switches were already a joy to work with (access to bash on a switch, what's not to love?) but Gary's insights and advice makes it all the better. Arista Warrior - Gary A. Donahue
Even with all the epicness of this hardware, if you're lazy with how you treat the steps your data goes through before it becomes a frame on the switch you're gonna have a bad time so for heavy duty reading The Linux TCP/IP stack book may help. The Linux TCP/IP Stack: Networking for Embedded Systems - Thomas F Herbert