Google gave the world a deeper dive into its data center network with a paper presented at the ACM Sigcomm conference in London.
Google’s Amin Vahdat had already presented much of this information at the Open Networking Summit (ONS) in June. Everyone knew Google had built its own networking infrastructure, but that presentation was the first time Google had explained its work in detail.
The Sigcomm paper covers the same ground as the ONS talk, including the five-generation progression of Google’s data centers, but offers more detail about Google’s switches, topology, and customized control plane. Many of the ideas seem commonplace today, with the advent of software-defined networking (SDN), but they were radical 10 years ago.
It was all born of necessity as Google struggled to deliver bandwidth across a network that was growing virally. Even at the largest scale, what vendors offered still came down to managing “individual boxes with individual command-line interfaces,” Vahdat said at the ONS. That seemed “unnecessarily hard” compared to Google’s practice of managing 10,000 servers as a single machine.
It comes down to a complaint that’s been at the heart of the SDN movement: Compute and storage could accommodate scale-out operations, but networking had not advanced comparably.
Google developed its own networking based on three principles. The first was a Clos topology, or leaf/spine architecture: a well-known design in which relatively small switches are interconnected to create a huge, nonblocking fabric.
Second, Google wanted to use off-the-shelf chips. The result was the Pluto switch that came to light about two years ago.
Finally, the network operated with centralized control — a precursor to SDN. If something, somewhere, knew what the network should look like, it would become easier to keep the network in that state, Google’s engineers reasoned.
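To make the first of those principles concrete, here is a rough sizing sketch for a textbook two-tier leaf/spine fabric built from identical switches. It is an illustration only, not Google's actual topology: the function name, the even split of leaf ports between hosts and uplinks, and the 32-port example are all assumptions.

```python
def leaf_spine_capacity(ports_per_switch: int, port_gbps: float) -> dict:
    """Size a nonblocking two-tier leaf/spine fabric made of identical switches.

    Assumption (hypothetical, textbook design): every leaf uses half its ports
    for hosts and half for uplinks, with exactly one uplink to each spine.
    """
    if ports_per_switch < 2 or ports_per_switch % 2:
        raise ValueError("ports_per_switch must be an even number >= 2")

    half = ports_per_switch // 2
    spines = half               # one spine per leaf uplink
    leaves = ports_per_switch   # each spine port connects to a different leaf
    hosts = leaves * half       # host-facing ports across all leaves

    return {
        "spines": spines,
        "leaves": leaves,
        "hosts": hosts,
        "host_bandwidth_gbps": hosts * port_gbps,
    }


if __name__ == "__main__":
    # Hypothetical example: 32-port switches with 40-Gb/s ports.
    print(leaf_spine_capacity(32, 40))
```

Under those assumptions, 48 switches with 32 ports each (16 spines and 32 leaves) attach 512 hosts at full line rate, which is the appeal of building one big fabric out of many small boxes.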
What resulted was five generations of data-center networking at Google: Firehose 1.0, Firehose 1.1, Watchtower, Saturn, and the current Jupiter, all built on merchant switching chips. The iterations started 10 years ago with aggregate bandwidth of 2 Tb/s per cluster; with Jupiter, that figure has risen to 1.3 Pb/s, thanks in part to merchant chips supporting 40-Gb/s connections.
On the management side, Google famously treats its data-center network as a single fabric rather than “a collection of hundreds of autonomous switches that had to dynamically discover information about the fabric,” as the Sigcomm paper puts it. Google drew its inspiration from distributed storage systems, which were already using centralized management. Elements of what it developed here would later find their way into B4, Google’s global wide-area network (WAN).
Want to know more about the control architecture? “Details of Jupiter’s control architecture are beyond the scope of this paper,” the Sigcomm paper reads. Dang.
But the paper does deliver some details about Firepath, Google’s homegrown routing architecture. It’s very SDN-like: There’s a master element in the topology that can dictate global state to all the switches. The switches start with a base topology and check in with their neighbors to learn about the status of local links. They then exchange this information with the Firepath master.
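The report-and-distribute loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Firepath itself: the `Switch` and `Master` classes, the `probe` callback, and the rule that a link counts as down if either endpoint reports it down are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple

Link = Tuple[str, str]


def link(a: str, b: str) -> Link:
    """Canonical (sorted) name for the link between two switches."""
    return tuple(sorted((a, b)))


@dataclass
class Switch:
    name: str
    base_topology: Set[Link]                  # links the switch expects to exist
    global_down: Set[Link] = field(default_factory=set)

    def local_report(self, probe: Callable[[Link], bool]) -> Dict[Link, bool]:
        """Check only this switch's own links; probe(link) returns True if up."""
        return {l: probe(l) for l in self.base_topology if self.name in l}

    def usable_links(self) -> Set[Link]:
        """Base topology minus whatever the master says is down."""
        return self.base_topology - self.global_down


class Master:
    """Central element that merges local reports into one global view."""

    def __init__(self) -> None:
        self._reports: Dict[Tuple[str, Link], bool] = {}

    def receive(self, switch_name: str, report: Dict[Link, bool]) -> None:
        for l, up in report.items():
            self._reports[(switch_name, l)] = up

    def down_links(self) -> Set[Link]:
        # Assumption: a link is down if any endpoint reports it down.
        return {l for (_, l), up in self._reports.items() if not up}

    def push(self, switches) -> None:
        down = self.down_links()
        for s in switches:
            s.global_down = set(down)


if __name__ == "__main__":
    base = {link(t, a) for t in ("tor1", "tor2") for a in ("agg1", "agg2")}
    switches = [Switch(n, base) for n in ("tor1", "tor2", "agg1", "agg2")]
    broken = {link("tor1", "agg2")}           # simulated failure

    master = Master()
    for s in switches:
        master.receive(s.name, s.local_report(lambda l: l not in broken))
    master.push(switches)
    print(sorted(switches[1].usable_links()))
```

The point of the design is visible even in the toy: `tor2` never touches the failed link, yet after the master's push it knows to route around it.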
Google uses Layer 3 routing down to the top-of-rack switches (ToRs). All the machines under any given ToR are part of one Layer 2 subnet — one broadcast domain.
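As a rough illustration of that addressing model, the sketch below carves one subnet per ToR out of a cluster-wide block using Python's standard `ipaddress` module. The address block, prefix length, ToR names, and helper functions are hypothetical; the paper does not publish Google's actual addressing plan.

```python
import ipaddress


def tor_subnets(cluster_block: str, prefix_len: int, num_tors: int) -> dict:
    """Assign one subnet (one Layer 2 broadcast domain) to each ToR."""
    block = ipaddress.ip_network(cluster_block)
    subnets = block.subnets(new_prefix=prefix_len)
    plan = {f"tor{i}": net for i, net in zip(range(num_tors), subnets)}
    if len(plan) < num_tors:
        raise ValueError("cluster block is too small for that many ToRs")
    return plan


def tor_for(host: str, plan: dict):
    """Return the ToR whose subnet contains the host address, or None."""
    addr = ipaddress.ip_address(host)
    for tor, net in plan.items():
        if addr in net:
            return tor
    return None


def same_broadcast_domain(a: str, b: str, plan: dict) -> bool:
    """Two hosts share Layer 2 only if they sit under the same ToR."""
    tor_a = tor_for(a, plan)
    return tor_a is not None and tor_a == tor_for(b, plan)


if __name__ == "__main__":
    # Hypothetical private address block, split into /24s.
    plan = tor_subnets("10.0.0.0/16", 24, 3)
    print(plan)
    print(same_broadcast_domain("10.0.1.5", "10.0.1.200", plan))
    print(same_broadcast_domain("10.0.1.5", "10.0.2.5", plan))
```

Traffic between hosts under different ToRs crosses a subnet boundary and is therefore routed at Layer 3, which keeps broadcast traffic confined to a single rack.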
Google presented three other papers at Sigcomm this week: