Tuesday, September 29, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

[paper]

This paper looks at the same problem as the previous one (TCP incast collapse in data centers) and proposes a similar fix: reduce the RTO timer so it is closer to the RTT, randomize the minimum RTO timer value, and use a smaller & randomized backoff multiplier. They found that a smaller & randomized multiplier for the exponential backoff didn't make any difference, nor did randomizing the minimum RTO timer value.

-- This paper is extremely similar to the previous paper. Both examine incast collapse in data centers running highly parallel applications, with a proposed solution of reducing the RTO timer to be closer to the RTT. The difference is the workload (this paper uses a fixed fragment size, the other a fixed block size), and the two papers consequently show slightly different results. However, the work is still so similar that I am surprised both were published. (This paper does not feel to me like it was framed as a follow-up, although I can imagine thinking of it that way? It seems like they were concurrent projects, and once the other one got published they threw in some disclaimers about how their workloads are different. The only section that makes me think this might actually be a follow-up is 5.2; the introduction and background did not give me that impression.) One other difference between the experimental setups of the two papers is the RTT. The other paper reports really small RTTs (hundreds of microseconds), which is why they need high-resolution timers; this paper reports higher RTTs (milliseconds). This matters since both are trying to adjust the RTO relative to the RTT.

-- I think this paper gives a better introduction to the problem, although the other paper is more specific about the terms that are necessary for the problem to arise (see their 3 conditions).

-- The graphs in Figure 2 are very ugly...and the scale in Figure 3 does not seem very good.

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

[paper]

Problem: TCP incast collapse in data centers occurs when (1) the workload is highly parallel & barrier-synchronized, (2) cheap switches have small buffers on a high-bandwidth network, and (3) servers are returning small amounts of data per request. Bursty traffic fills switch buffers, causing packet loss & TCP timeouts. Timeouts last hundreds of milliseconds, which is huge in a data center where the round trip time is tiny (hundreds of microseconds or less). Protocols that require synchronization block waiting on these timeouts before issuing any new requests. The timeouts plus the subsequent delay severely reduce application throughput.

Previous solutions: Buy bigger buffers ($$). Ethernet flow control ("dangerous across inter-switch trunks because of head-of-line blocking"). Reducing the minimum retransmission timeout has not been thoroughly explored, so that is what this paper tries.

Their solution: Make retransmission timeouts much finer-grained by modifying the TCP implementation: decrease or get rid of the minimum RTO, and desynchronize retransmissions so the fix scales. The funny thing is that the timeout granularity they need is finer than the Linux kernel's standard timer, so they implement TCP's retransmission timers using high-resolution timers.
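To make the difference concrete, here is a tiny back-of-the-envelope sketch (my own, not code from the paper) of what happens to the retransmission timeout once you drop the 200 ms minimum and compute it at microsecond granularity from the measured RTT. The 200 ms floor and the srtt + 4*rttvar form are the standard TCP values; the RTT numbers are just made up to be data-center-like.

```python
# Toy comparison of a stock RTO (200 ms floor, coarse timer granularity)
# versus a fine-grained RTO computed from microsecond-scale RTT estimates.
# Numbers are illustrative, not measurements from the paper.

def rto(srtt_us, rttvar_us, min_rto_us, granularity_us):
    """Standard-style RTO = srtt + 4*rttvar, clamped to a minimum and
    rounded up to the timer granularity."""
    raw = max(srtt_us + 4 * rttvar_us, min_rto_us)
    # Round up to the coarseness of the underlying timer.
    return ((raw + granularity_us - 1) // granularity_us) * granularity_us

srtt, rttvar = 100, 20  # a ~100 us data-center RTT with some variance

stock = rto(srtt, rttvar, min_rto_us=200_000, granularity_us=10_000)  # jiffy-ish
fine  = rto(srtt, rttvar, min_rto_us=0,       granularity_us=100)     # hrtimer-ish

print(f"stock RTO: {stock/1000:.1f} ms")        # 200.0 ms -- dominated by the floor
print(f"fine-grained RTO: {fine/1000:.3f} ms")  # ~0.2 ms  -- tracks the RTT
```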

Evaluation: Making the RTO very small (as small as the avg RTT) works well in both their simulated data & real cluster data, although it is less effective as the number of concurrent senders scales up. They find that the cause of this is many flows timing out and retransmitting simultaneously; they fix this by desynchronizing retransmission.
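The desynchronization fix, as I understand it, just adds a random component to each flow's timeout so that flows which lost packets in the same burst don't all fire their retransmissions at the same instant. A minimal sketch of the idea (the 0.5 scaling factor is my assumption, not necessarily the value used in the paper):

```python
import random

def desynchronized_timeout(rto_us):
    """Spread retransmissions out by scaling the RTO by a random factor,
    so synchronized losses don't lead to synchronized retransmissions."""
    return rto_us * (1.0 + random.uniform(0.0, 0.5))

# Ten flows that all timed out together now retransmit at slightly
# different times instead of re-colliding at the switch buffer.
print(sorted(round(desynchronized_timeout(200), 1) for _ in range(10)))
```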

-- Question: won't more memory become cheap soon enough that this won't be a problem anymore? Or will the communication speed keep increasing faster than memory's cheapness?

Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

[paper]

A basic tree topology for a data center partitions intra-DC communication: it's slow for the "far left" server to talk to the "far right" server. The authors of this paper want a uniform communication cost between all hosts in the data center to promote agility, which means that any host can run any service at any given time. (This means that a DC can dynamically handle spikes in load; no pre-allocation is necessary.) To this end, they suggest using a Clos network topology, which essentially means that each aggregation switch is connected to all of the (many) intermediate switches. Shortest-path-first routing algorithms are inadequate here; they use Valiant Load Balancing (VLB) so that traffic is spread across all of the equivalent paths rather than concentrating on a few. VLB basically chooses a random path from the set of equivalent paths for each flow so that no particular path becomes overwhelmed. They call the use of VLB over a Clos network "VL2".
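My rough mental model of flow-level VLB, sketched below: hash each flow to one of the intermediate switches, so over many flows the load spreads evenly across all of the equivalent paths while packets within one flow stay on one path. This is my own illustration of the idea, not VL2's actual forwarding code (which does this with IP-in-IP encapsulation plus ECMP); the switch names are hypothetical.

```python
import hashlib

INTERMEDIATE_SWITCHES = [f"int-{i}" for i in range(10)]  # hypothetical names

def pick_intermediate(flow_5tuple):
    """Valiant Load Balancing at flow granularity: hash the flow to a
    random-looking but stable intermediate switch, so one flow is never
    reordered while different flows spread across all paths."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).hexdigest()
    return INTERMEDIATE_SWITCHES[int(digest, 16) % len(INTERMEDIATE_SWITCHES)]

flow = ("10.0.1.5", 51000, "10.0.9.7", 80, "tcp")
print(pick_intermediate(flow))  # the same flow always maps to the same switch
```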

They do some measurements on an unspecified Microsoft data center to motivate their work. They show that 80% of DC traffic is internal, meaning you do want to focus on decreasing intra-DC transit costs. They also show that flows are small and uniformly distributed, meaning that (a) the system can work by load balancing flows instead of individual packets and (b) you don't need to support locality for the intra-DC traffic. They did a traffic analysis and found no regularity or patterns in the traffic, and they also found that traffic needs shifted a lot. I think this is a really good approach to motivating research -- they have actual hard evidence that the problem they are trying to solve exists. (When reading papers outside of my own area, I am nervous about accepting whether a problem really exists -- since I have no knowledge to judge it myself.) The one downside is that they don't say WHAT application this is for, so maybe you would see totally different patterns with a different type of application.

Monday, September 21, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

[paper]

PortLand is a follow-up to SEATTLE. They address the issue of networking within a data center. They want to have both the plug-and-play properties of Ethernet and the ability to support a huge number of end hosts. (Layer 2 networks are easier to administer than Layer 3 networks since they can use flat MAC addresses, but Ethernet bridging doesn't scale because of its broadcasting and because forwarding is restricted to a spanning tree, which wastes capacity in topologies with multiple equal-cost paths.) They improve on SEATTLE by adding plug-and-play switches (not just plug-and-play hosts) and removing forwarding loops and all-to-all broadcasts. Their improvement over SEATTLE stems from the observation that data centers are physically organized into multi-rooted trees; SEATTLE was intended for the more general case. The topology is the tree structure we discussed in the last class.

With PortLand, each end host gets a hierarchically-assigned PMAC (pseudo MAC) address that encodes the location of the host in the topology. ARP requests return PMAC addresses rather than actual MAC addresses, and egress switches rewrite PMACs back to actual MACs (AMACs). Edge switches get "pod" and position numbers. Forwarding uses this topological information, which avoids loops. PortLand also has a centralized "fabric manager" that maintains soft state about the network topology. Since it's soft state, (A) the fabric manager does not need to be configured, and (B) replicas of the fabric manager do not require complete consistency with the original. When a switch sees an ARP request, it forwards it to the fabric manager, which does a lookup in its PMAC table and sends the answer back to the switch; this avoids broadcast ARPs. If the fabric manager doesn't have the PMAC information due to a recent failure, it falls back to a broadcast.
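A quick sketch of how I picture the PMAC encoding: pack pod, position, port, and a VM id into the 48 bits of a MAC address, so that forwarding can work off the prefix of the destination address. The 16/8/8/16 field split below is how I remember the paper describing it, so treat the exact widths as an assumption.

```python
def encode_pmac(pod, position, port, vmid):
    """Pack location info into a 48-bit pseudo MAC: pod.position.port.vmid
    (assumed 16/8/8/16-bit split). Core switches can then forward by
    matching on the pod prefix of the destination PMAC."""
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

# Host behind the edge switch in pod 3, position 1, attached port 2, VM 5:
print(encode_pmac(pod=3, position=1, port=2, vmid=5))  # 00:03:01:02:00:05
```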

-- Unlike the SEATTLE paper, the authors evaluated their work using microbenchmarks (or what I'd call microbenchmarks -- not sure if that's actually the appropriate term) rather than a full simulation. I'm not sure if this approach is very valid since there is no other traffic on the network, so nothing is being stressed. However...I guess the purpose of a full simulation isn't to provide stress, since the loads for stress testing would be different from regular traffic.

-- I like this paper; their use of topology information is simple and clever.

POST-CLASS EDIT:
I didn't realize that the fat tree was crucial to their design, since they try to play it off as something more generalizable ("the fat tree is simply an instance of the traditional data center multi-rooted tree topology....We present the fat tree because our available hardware/software evaluation platform is built as a fat tree."). I understand that maybe they were trying to forestall criticisms about tying their work too closely to the fat tree, but I think it's kind of deceptive and it hurt my understanding of the paper. I took their downplaying of the fat tree topology fairly seriously and didn't pay much attention to it, whereas after our discussion in class it seems like routing for a fat tree in particular is their main contribution.

Thursday, September 17, 2009

Detailed Diagnosis in Enterprise Networks

[paper]

NetMedic is a network diagnostic tool that is supposed to find the source of network faults at the granularity of a responsible process or firewall configuration. (So not only does it identify the host, but it identifies the culpable application running on the host.) I found this paper confusing because I couldn't figure out where they were planning on placing NetMedic -- is it somewhere in the network, or is it running on host machines? If it's the former, then how does it get enough application information? If it's the latter, then how can you expect it to work correctly given that the host machine is the one experiencing the problem (e.g., it may have a lack of network connectivity)? Furthermore, if it's on the host, what if NetMedic itself interferes with other host applications in some unexpected way? The implementation section finally concretely states this (it was implemented on the host using OS tools), but I was confused until then.

Basically they are trying to correlate the fault with component state and the interaction of components, and then a high correlation is treated as causation. A "component" here is any part of the system, and their tool automatically infers the relationships/dependencies between components. This seems like it would not work well (given that correlation != causation). However, they did find the correct component in 80% of their cases. Interestingly enough, they didn't have any direct competitors to compare their tool against, so they implemented a different technique based loosely on previous work and compared themselves to that. Although unusual, I kind of like this approach to comparison...of course the downside is that, since they implemented both, they are biased and won't try as hard on the competitor.
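As a toy illustration of "high correlation treated as causation" (my own strawman, definitely not NetMedic's actual inference, which uses joint state history in a much more careful way): rank candidate components by how well their abnormality over time lines up with the victim's. The numbers and component names are made up.

```python
from statistics import correlation  # Python 3.10+

# Per-time-window "abnormality" scores for the victim component and some
# candidates (fabricated values; NetMedic derives these from OS-level state).
victim = [0.1, 0.2, 0.9, 0.8, 0.9, 0.2]
candidates = {
    "web server process": [0.1, 0.1, 0.8, 0.9, 0.7, 0.1],
    "firewall config":    [0.5, 0.5, 0.5, 0.5, 0.5, 0.6],
}

# Blame the component whose abnormality correlates most with the victim's.
ranked = sorted(candidates, key=lambda c: correlation(victim, candidates[c]),
                reverse=True)
print(ranked[0])  # -> "web server process"
```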

Floodless in SEATTLE

[Paper]

SEATTLE claims to "...[achieve] the best of both worlds: The scalability of IP combined with the simplicity of Ethernet." On one hand, Ethernet is simple to manage thanks to flat addressing. On the other hand, Ethernet doesn't scale well: Ethernet bridges rely on broadcast messages (e.g. ARP requests and DHCP), and forwarding paths are restricted to a spanning tree. To deal with this, network admins break their large networks up into small Ethernet networks connected by IP, or use VLANs. The authors assert that both of those solutions are not good enough (inefficient, and harder to manage since you need to worry about addressing). SEATTLE is intended to provide the same simple plug-and-play semantics as Ethernet while also scaling to large networks.

[ to be done: further summarization ]

My criticisms of their simulations:

-- For the packet-level simulation, they start with a real LBNL trace and then inject extra traffic into it. They have fake hosts sending traffic to random destinations. I find this curious considering they make this earlier claim: "In enterprise networks, hosts typically communicate with a small number of other hosts [5], making caching highly effective."
-- They also have to make some kludgy assumptions about the number of hosts connected to each switch due to the anonymization of the traces. I suppose they can't really help that; it's a problem for the field in general.
-- They have really huge error bars in many of their graphs. Is this normal for network simulations due to the nature of error in the physical links, etc.? Notably, the SEATTLE results have tight error bars even when ROFL and ETH have huge ones (e.g. Figures 5a and 5c)...is their simulation somehow skewed in favor of their own results?


Sorry about the lack of summaries......I had prelims on Tuesday and have two papers due Friday......multi-tasking is not going so well.

Monday, September 7, 2009

Congestion Avoidance and Control

[Paper Link]

Networks at the time were suffering from widespread congestion, despite the fact that the TCP specification had a theoretically correct description of how to avoid it. The principle the authors wish to adhere to is the 'conservation of packets' principle, which states that a new packet should not be put into the network until an old packet has left. Since the protocol itself adheres to conservation of packets but networks at the time were still failing under load, they observe that the failures must be due to implementation bugs. Thanks to those bugs, one of three things happens: a connection doesn't get to equilibrium, a sender injects a new packet before an old one has exited, or equilibrium can't be reached because of resource limits along the path.

Descriptions of the changes they made to the 4BSD TCP implementation:

Slow-start. A connection can fail to reach equilibrium if it acts wonky on startup. A host doesn't send its next packet until it has received an ACK; but if a restarting host starts out firing packets as fast as it can and then throttles down, or starts at zero with no defined way of ramping up, the connection may never settle into equilibrium. Instead, slow-start gradually increases the number of packets in flight: the window starts at 1 and grows as more and more ACKs come back.
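A minimal sketch of the ramp-up, just to make "gradually increases" concrete (window counted in packets, one increment per ACK, which works out to doubling per round trip; the ssthresh cap and the loss-free assumption are mine):

```python
def slow_start_windows(ssthresh, rtts):
    """Yield the congestion window at the start of each round trip.
    Each ACK in a round trip grows cwnd by 1, so cwnd doubles per RTT
    until it hits ssthresh (where congestion avoidance would take over)."""
    cwnd = 1
    for _ in range(rtts):
        yield cwnd
        cwnd = min(cwnd * 2, ssthresh)

print(list(slow_start_windows(ssthresh=64, rtts=8)))
# [1, 2, 4, 8, 16, 32, 64, 64]
```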

Round-trip-time variance estimation and exponential retransmit timer backoff. Sometimes packet transmission fails and therefore no ACK is generated. At what point should the sender decide to retransmit? Their round-trip-time variance estimation algorithm addresses this. And, if a packet needs to be retransmitted repeatedly, their exponential retransmit timer backoff algorithm handles that.
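The estimator, as I remember it from the paper's appendix (minus the fixed-point tricks): keep exponentially weighted averages of the RTT and of its deviation, set RTO = srtt + 4*rttvar, and double the timer on each repeated retransmission. A floating-point sketch:

```python
class RtoEstimator:
    """Jacobson-style mean/deviation RTT estimator; a rough sketch of the
    integer version in the paper's appendix, not a faithful port."""
    def __init__(self, g=0.125, h=0.25):
        self.srtt = None     # smoothed RTT
        self.rttvar = 0.0    # smoothed mean deviation
        self.g, self.h = g, h
        self.backoff = 1     # doubles on each repeated retransmission

    def on_rtt_sample(self, rtt):
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            err = rtt - self.srtt
            self.srtt += self.g * err
            self.rttvar += self.h * (abs(err) - self.rttvar)
        self.backoff = 1  # a fresh sample resets the exponential backoff

    def on_timeout(self):
        self.backoff *= 2  # exponential retransmit timer backoff

    def rto(self):
        return (self.srtt + 4 * self.rttvar) * self.backoff

est = RtoEstimator()
for sample_ms in (100, 110, 95, 300):  # the spike inflates the variance term
    est.on_rtt_sample(sample_ms)
print(round(est.rto(), 1))
```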

More aggressive receiver ack policy and dynamic window sizing on congestion. If recipients ALWAYS send an ACK upon receipt, then the lack of an ACK effectively notifies the sender of congestion on that link. That near-guarantee of an ACK if all is well is what they mean by an aggressive receiver ACK policy. Additionally, they describe 'dynamic window sizing', which is how the window grows and shrinks to fill up a connection: multiplicative decrease when congestion is signaled, additive increase otherwise.
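And a sketch of that window policy, with the decrease multiplicative and the increase additive (the constants below are the usual textbook ones, not values I checked against the paper's code):

```python
def update_cwnd(cwnd, congested, increase=1.0, decrease=0.5):
    """Additive increase, multiplicative decrease: back off sharply when the
    network signals congestion (a missing ACK), probe gently otherwise."""
    if congested:
        return max(1.0, cwnd * decrease)
    return cwnd + increase  # +1 packet per round trip, in window units

cwnd = 16.0
for congested in [False, False, True, False, False, False]:
    cwnd = update_cwnd(cwnd, congested)
    print(congested, round(cwnd, 1))
# grows to 17, 18, halves to 9, then 10, 11, 12
```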

-- Unlike the previous paper, none of the main schemes they discuss need the cooperation of anything in the core. They can infer congestion by dropped packets, so no specific indicator (or view of the network state) is necessary. The end-to-end argument is satisfied here.

-- In their future work, they address the issue of misbehaving hosts. They argue that ONLY with the participation of gateways can you ensure fairness, since you can't trust hosts to be fair without another agent to police the situation. The idea is that if a host starts using more than its fair share, the gateway starts dropping its packets.

-- The paper takes a very practical tone. I especially like how they explain their logic, the major takeaway points, the relevant math, AND the actual algorithm. It's a well-written, easy-to-read paper, in my opinion.

About Me

Berkeley EECS PhD student