When it comes to network-attached storage, forget Moore’s Law… Amdahl’s Law rules.
- 07.19.13
- 1:20 PM
Shortly before I left Auspex in 1997, we achieved the fastest NFS benchmark results in the industry. The system we used to achieve these results comprised 102 disk drives and included five I/O processing boards with two processors each. Those state-of-the-art processors were always the bottleneck in the system, a problem that we worked tirelessly to solve. Despite countless efforts to remedy the problem, the system always hit a CPU limit when performing network processing.
In the years that followed, the system bottleneck gradually shifted. Between 1996 and 2008, disk-drive speeds increased by a factor of eleven. Over that same period, Moore’s Law stood its ground and processor speeds increased by a factor of 180. Slowly but surely, the processor became less of a concern.
By today’s standards, the ratio of processors to disks used in that 1997 Auspex SPEC release was absurd. We devoted ten processors to network I/O processing, and another five to storage processing in order to drive just 102 disks. Where today it’s common for a mid-level filer to require more than 200 disk drives to saturate just two processors, we had one processor for every seven drives!
Of course, if you add enough disk drives, the processor once again becomes the bottleneck. As Alacritech CEO Larry Boucher puts it, nobody goes out and buys a new server until they have to, and they don’t have to until their current one runs out of CPU. Still, when it takes several hundred disk drives to push your processor to its limit, it’s hard to argue that there is a CPU problem. In the first decade of this century, it was common to hear data center managers say, “I’ve got more CPU than I know what to do with.” This largely explains the massive adoption of server virtualization to harness those excess CPU cycles. As long as processor speeds were increasing exponentially and storage was still, by and large, spinning mechanical platters, it was easy to assume that this growing disparity in performance between processor and storage would continue forever. But it didn’t.
Enter Flash Memory
Lurking in the background since the 1970s was solid-state storage, a technology that had been far too expensive, too small, and too impractical for widespread use. By 2009, however, companies were demonstrating 1TB flash drives. Solid-state drives have a random access time that is roughly 200 times faster than that of a spinning disk, and they are now cheap enough to cache the entire working set of a server.
In a relative instant, the world had changed. The trajectory of processor performance relative to disk performance, which had held steady for decades, was turned upside down, and the processor, which everyone had come to take for granted, became the bottleneck again. A serious bottleneck.
Larry Boucher and a handful of ex-Auspex engineers founded Alacritech to solve the problem of the CPU bottleneck in network processing. We invented dynamic TCP offload to solve it, and have spent the years since creating new ways to further optimize network processing and alleviate the system CPU bottleneck. Now, with the prevalence of flash, these inventions have become invaluable.
How Big a Bottleneck Has the Processor Become?
To illustrate this, we ran a modified mix of a well-known benchmark against two systems: one running ANX bridge software and the other running CentOS Linux 6.4. The hardware platforms were identical, except that the Linux system was equipped with Intel 82598 10 gigabit Ethernet adapters, while the ANX system was equipped with Alacritech 10 gigabit Ethernet TCP offload engine (TOE) adapters.
Otherwise, the systems included:
- 2 Quad-core Intel Xeon processors running at 2.27GHz
- 96GB of DRAM
- 20 SATA 6Gbps 400GB SSDs
In the case of Linux, the 20 SSDs were configured as RAID 0 with a single ext4 filesystem.
The benchmark consisted of a mix of read and read-metadata operations as follows:
- READ – 22%
- GETATTR – 33%
- LOOKUP – 30%
- ACCESS – 14%
- READLINK – 1%
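To make the mix concrete, here is a minimal sketch of how a load generator might pick operations in these proportions. It is not the benchmark we ran; the names `OP_MIX` and `sample_operations` are hypothetical and exist only for illustration, and the sketch picks operation names without issuing any NFS traffic.

```python
import random
from collections import Counter

# Operation mix from the list above, in percent.
OP_MIX = {
    "READ": 22,
    "GETATTR": 33,
    "LOOKUP": 30,
    "ACCESS": 14,
    "READLINK": 1,
}


def sample_operations(count: int, seed: int = 0) -> list[str]:
    """Draw `count` operation names at random, weighted by OP_MIX.

    Hypothetical helper: a real load generator would issue the
    corresponding NFS calls instead of returning names.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    names = list(OP_MIX)
    weights = list(OP_MIX.values())
    return rng.choices(names, weights=weights, k=count)


if __name__ == "__main__":
    total = 100_000
    observed = Counter(sample_operations(total))
    for name, pct in OP_MIX.items():
        share = 100 * observed[name] / total
        print(f"{name:<9} target {pct:>2}%  observed {share:.1f}%")
```

Over a large sample the observed shares should settle close to the targets, which is all a weighted mix like this promises.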
The results of this test were stunning. The Linux system, equipped with SSDs, was able to achieve an impressive 174k operations per second before saturating the CPU. Running our ANX software and equipped with our network accelerator adapters, we were able to achieve 810k operations per second, at which point we saturated the four 10GbE networks. More impressive still was that the processors in the system were only 35% busy at that peak throughput.
This isn’t just an issue of performance. The cost of flash technology has dropped substantially in recent years, and will continue to do so, but flash is still by far the most expensive component in this system. Clearly, spending that kind of money on SSDs and then realizing less than 25% of their performance capability is foolish at best.
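The “less than 25%” figure follows directly from the two peak throughput numbers reported above. The short calculation below is only a sketch of that arithmetic; the constant and function names are made up for illustration.

```python
# Peak throughput figures reported in the benchmark results above.
LINUX_OPS_PER_SEC = 174_000  # stock Linux system, limited by the CPU
ANX_OPS_PER_SEC = 810_000    # ANX system with TOE adapters, limited by the network


def speedup(accelerated: float, baseline: float) -> float:
    """How many times more throughput the accelerated system delivered."""
    return accelerated / baseline


def fraction_realized(baseline: float, accelerated: float) -> float:
    """Share of the accelerated system's throughput that the baseline reached."""
    return baseline / accelerated


if __name__ == "__main__":
    print(f"speedup: {speedup(ANX_OPS_PER_SEC, LINUX_OPS_PER_SEC):.2f}x")  # 4.66x
    print(f"realized: {fraction_realized(LINUX_OPS_PER_SEC, ANX_OPS_PER_SEC):.1%}")  # 21.5%
```

Because the ANX run stopped at the network limit rather than at the SSDs, 21.5% is an upper bound on how much of the drives’ capability the Linux system actually used.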
You might ask: why not simply add more processors? The answer is Amdahl’s Law. Amdahl’s Law predicts the performance improvement that can be realized by adding processors, based on the portion of a program that can be parallelized. If a fraction P of the work can be parallelized, the speedup on N processors is at most 1 / ((1 - P) + P / N), which can never exceed 1 / (1 - P) no matter how many processors are added. As a result, performance benefits fall off very quickly as processors are added, even when only a small amount of processing cannot be parallelized, and with NAS there is a fair amount of processing that cannot be parallelized. Historically speaking, most NAS systems can’t take advantage of more than eight processor cores before they hit the point of diminishing returns.
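To see how quickly the returns diminish, the sketch below evaluates the standard form of Amdahl’s Law for a range of core counts. The parallel fraction of 0.90 is an assumption chosen purely for illustration; it is not a measured value for any NAS system, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup from Amdahl's Law: 1 / ((1 - P) + P / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_ceiling(parallel_fraction: float) -> float:
    """Speedup limit as the core count grows without bound: 1 / (1 - P)."""
    if not 0.0 <= parallel_fraction < 1.0:
        raise ValueError("parallel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    P = 0.90  # ASSUMPTION: illustrative parallel fraction, not a measurement
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:>2} cores: {amdahl_speedup(P, n):.2f}x")
    print(f"ceiling: {amdahl_ceiling(P):.1f}x")
```

With that assumed fraction, eight cores yield roughly a 4.7x speedup, 64 cores only about 8.8x, and no number of cores can exceed 10x. A workload with a larger serial portion flattens out sooner.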
To address this issue, many NAS providers have turned to scale-out solutions instead. To be sure, scale-out solutions have tremendous value, but that value comes at a price. There is the obvious initial cost of additional platform hardware, but also the downstream cost of power, cooling and footprint. Lastly, there is a latency penalty, due to the amount of interaction that has to occur between nodes in a scale-out cluster. Put more simply, surely it makes more sense to achieve the roughly 800k operations per second above using a single system than using more than four systems.
The solution to this processor bottleneck is not the brute-force addition of processors or the ganging up of multiple systems, but rather a more intelligent approach to NAS processing that makes systems work more efficiently. So to return to the question: how big a bottleneck has the processor become? In this case, thanks to solid-state drives, the processor has held the system performance back by more than a factor of four, a very big bottleneck indeed. We are no longer in an era of CPU to spare, and as flash continues to become more affordable, network acceleration technology plays a critical role in maximizing the performance of network-attached storage systems.
Peter Craft is co-founder and systems architect at Alacritech.