Tuesday, September 29, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

[paper]

This paper looks at the same problem as the previous one (TCP incast collapse in data centers) with a similar solution (reduce the RTO timer to be closer to the RTT, randomize min RTO timer value, and set a smaller & randomized backoff). They found that a smaller & randomized multiplier for the exponential backoff didn't make any difference, nor did randomizing the min RTO timer value.
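To make the knobs concrete, here is a rough Python sketch of the modifications being compared. The constants, and the fact that I compute the timeout in user space at all, are my own illustrative assumptions; the actual work modifies the kernel's TCP stack.

    import random

    # Illustrative stand-ins for the knobs the paper tests, not the authors' code.
    BASE_MIN_RTO = 0.001                   # assume a ~1 ms floor instead of ~200 ms
    RANDOMIZE_MIN_RTO = True               # jitter the floor so flows don't act in lockstep
    BACKOFF_LOW, BACKOFF_HIGH = 1.5, 2.0   # smaller, randomized multiplier vs. a fixed 2x

    def retransmission_timeout(rtt_estimate, attempt):
        """Timeout (seconds) before the attempt-th retransmission of a segment."""
        min_rto = BASE_MIN_RTO
        if RANDOMIZE_MIN_RTO:
            min_rto *= random.uniform(0.5, 1.5)
        backoff = random.uniform(BACKOFF_LOW, BACKOFF_HIGH) ** attempt
        return max(rtt_estimate, min_rto) * backoff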

-- This paper is extremely similar to the previous paper. They are both examining incast collapse in data centers running highly parallel applications, with a proposed solution of reducing the RTO timer to be closer to the RTT. The difference is the workload (this paper has a fixed fragment size, the other a fixed block size), and the two papers consequently show slightly different results. However, the work is still so similar that I am surprised both were published. (This paper does not feel to me like it was framed as a follow-up, although I can imagine thinking of it that way? It seems like they were concurrent projects, and then the other one got published, so they threw in some disclaimers about how their workloads are different. The only section that makes me think this might actually be a follow-up is 5.2; the introduction and background did not give me that impression.) One other difference between the experimental setups of the two papers is the RTT. The other paper reports really small RTTs (hundreds of microseconds), so they need high-resolution timers; this paper reports higher RTTs (milliseconds). This matters since both are trying to adjust the RTO relative to the RTT.

-- I think this paper gives a better introduction to the problem, although the other paper is more specific about the terms that are necessary for the problem to arise (see their 3 conditions).

-- The graphs in Figure 2 are very ugly...and the scale in Figure 3 does not seem very good.

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

[paper]

Problem: TCP incast collapse in data centers occurs when (1) the workload is highly parallel & barrier-synchronized, (2) cheap switches with small buffers sit on a high-bandwidth network, and (3) servers return small amounts of data per request. The resulting bursts fill switch buffers, causing packet loss & TCP timeouts. Timeouts last hundreds of milliseconds, which is enormous in a DC where round-trip times are on the order of hundreds of microseconds. Protocols that require synchronization block waiting on these timeouts before issuing any new requests, so the timeouts + subsequent idle time severely reduce application throughput.

Previous solutions: buy bigger buffers ($$), or use Ethernet flow control ("dangerous across inter-switch trunks because of head-of-line blocking"). Reducing the minimum retransmission timeout had not been thoroughly explored, so that is what this paper tries.

Their solution: make timeouts much finer-grained by modifying the TCP implementation -- decrease or eliminate the minimum RTO, and desynchronize retransmissions so the fix keeps working at scale. The funny thing is that the required timeout granularity is finer than the Linux kernel's standard timer, so they have to implement high-resolution timers.

Evaluation: Making the RTO very small (on the order of the average RTT) works well both in simulation & on a real cluster, although it is less effective as the number of concurrent senders scales up. They find that the cause is many flows timing out and then retransmitting simultaneously; they fix this by desynchronizing the retransmissions.
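Here's how I picture the fix, as a sketch. The jitter formula is my guess at the flavor of their randomization, not their exact code.

    import random

    def microsecond_rto(srtt_us, rttvar_us, desynchronize=True):
        # Fine-grained RTO (microseconds) using the usual srtt + 4*rttvar rule with
        # no multi-hundred-millisecond floor. The random addend is one plausible way
        # to desynchronize retransmissions; the exact form here is my assumption.
        rto = srtt_us + 4 * rttvar_us
        if desynchronize:
            rto += random.uniform(0, rto / 2.0)   # spread simultaneous timeouts apart
        return rto

    # With hundreds of senders that timed out together, identical RTOs would mean
    # identical retransmit instants; the jitter spreads them over about half an RTO.
    retransmit_offsets = [microsecond_rto(200, 50) for _ in range(100)]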

-- Question: won't switch buffer memory become cheap enough soon that this stops being a problem? Or will link speeds keep growing faster than memory gets cheaper?

Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

[paper]

A basic tree topology for a data center partitions intra-DC communication: it's slow for the "far left" server to talk to the "far right" server. The authors of this paper want a uniform communication cost between all hosts in the data center to promote agility, which means that any host can run any service at any given time. (This lets a DC dynamically handle spikes in load; no pre-allocation is necessary.) To this end, they suggest using a Clos network topology, which essentially means that each aggregation switch is connected to all of the (many) intermediate switches. Shortest-path-first routing alone is inadequate here; for routing they use Valiant Load Balancing (VLB), which basically picks a random path from the set of equivalent paths for each flow, so that no particular path becomes overwhelmed even as the traffic matrix shifts. They call the use of VLB over a Clos network "VL2".
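My mental model of the VLB step, as a toy sketch. The switch names and flow tuple are invented, and VL2 realizes this inside the network (with encapsulation and ECMP) rather than in application code.

    import hashlib

    # Toy model of VLB-style path selection over a Clos topology: each flow is
    # hashed to one of the intermediate switches, spreading load without any
    # per-path state.
    INTERMEDIATE_SWITCHES = ["int-%d" % i for i in range(16)]

    def pick_intermediate(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
        flow = "%s:%s:%s:%s:%s" % (src_ip, dst_ip, src_port, dst_port, proto)
        digest = hashlib.md5(flow.encode()).hexdigest()
        return INTERMEDIATE_SWITCHES[int(digest, 16) % len(INTERMEDIATE_SWITCHES)]

    # Every packet of a flow bounces off the same randomly chosen intermediate
    # switch (keeping the flow in order), while different flows land on different paths.
    print(pick_intermediate("10.0.0.1", "10.0.1.9", 5123, 80))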

They do some measurements on an unspecified Microsoft data center to motivate their work. They show that 80% of DC traffic is internal, meaning you do want to focus on decreasing intra-DC transit costs. They also show that flows are small and uniformly distributed, meaning that (a) the system can work by load balancing flows instead of individual packets and (b) you don't need to support locality for the intra-DC traffic. They did a traffic analysis and found no regularity or patterns in the traffic, and they also found that the traffic matrix shifted a lot. I think this is a really good approach to motivating research -- they have actual hard evidence that the problem they are trying to solve exists. (When reading papers outside of my own area, I am nervous about accepting whether a problem really exists -- since I have no knowledge to judge it myself.) The one downside is that they don't say WHAT application this is for, so maybe you would see totally different patterns with a different type of application.

Monday, September 21, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

[paper]

PortLand is a follow-up to SEATTLE. The authors address the issue of networking within a data center. They want both the plug-and-play properties of Ethernet and the ability to support a huge number of end hosts. (Layer 2 networks are easier to administer than Layer 3 networks since they can use flat MAC addresses, but Ethernet bridging doesn't scale because of its broadcasting and its reliance on spanning-tree forwarding, which wastes capacity in topologies with multiple equal-cost paths.) They improve on SEATTLE by adding plug-and-play switches (not just plug-and-play hosts) and removing forwarding loops and all-to-all broadcasts. Their improvement over SEATTLE stems from the observation that data centers are physically organized as multi-rooted trees; SEATTLE was intended for a more general case. The topology is the tree structure we discussed in the last class.

With PortLand, each end host gets a hierarchically-assigned PMAC (pseudo MAC) address that encodes the location of the host in the topology. ARP requests return PMAC addresses rather than actual MAC addresses, and egress switches rewrite PMACs back into actual MACs (AMACs). Edge switches get "pod" and position numbers. Forwarding uses this topological information to avoid loops. PortLand also has a centralized "fabric manager" that maintains soft state about the network topology. Since it's soft state, (a) the fabric manager does not need to be configured, and (b) replicas of the fabric manager do not need to be kept completely consistent with the original. If a switch sees an ARP request, it forwards it to the fabric manager, which does a lookup in its PMAC table and sends the answer back to the switch; this removes the need for broadcast ARPs. If the fabric manager doesn't have the PMAC information due to a recent failure, it falls back to a broadcast.
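A rough sketch of what the hierarchical PMAC encoding looks like to me. The paper packs pod, position, port, and a VM id into the 48-bit address (pod.position.port.vmid); the exact field widths below are from my reading and should be treated as illustrative.

    def make_pmac(pod, position, port, vmid):
        # 16 bits of pod, 8 of position, 8 of port, 16 of VM id = 48-bit address.
        value = (pod << 32) | (position << 24) | (port << 16) | vmid
        return ":".join("%02x" % ((value >> shift) & 0xFF)
                        for shift in range(40, -1, -8))

    # A host behind port 1 of the edge switch at position 5 in pod 2:
    print(make_pmac(pod=2, position=5, port=1, vmid=1))   # -> 00:02:05:01:00:01

Because the location is readable right out of the destination address, a core switch only needs a prefix-style rule per pod rather than per-host forwarding state.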

-- Unlike the SEATTLE paper, the authors evaluated their work using microbenchmarks (or what I'd call microbenchmarks -- not sure if that's actually the appropriate term) rather than a full simulation. I'm not sure how valid this approach is, since there is no other traffic on the network, so nothing is being stressed. However...I guess the purpose of a full simulation isn't to provide stress, since the loads used for stress testing would be different from regular traffic anyway.

-- I like this paper, I think their use of topology information is simple and clever.

POST-CLASS EDIT:
I didn't realize that the fat tree idea was crucial to their idea, since they try to play it off as more generalizable ("the fat tree is simply an instance of the traditional data center multi-rooted tree topology....We present the fat tree because our available hardware/software evaluation platform is built as a fat tree."). I understand that maybe they were trying to forestall criticisms about tying their work too closely to fat tree, but I think that it's kind of deceptive and it hurt my understanding of the paper. I took their dismissal of the fat tree topology fairly seriously and didn't pay much attention to it, whereas after our discussion in class it seems like routing for a fat tree in particular is their main contribution.

Thursday, September 17, 2009

Detailed Diagnosis in Enterprise Networks

[paper]

NetMedic is a network diagnostic tool that is supposed to find the source of network faults at the granularity of a responsible process or firewall configuration. (So not only does it identify the host, but it identifies the culpable application running on the host.) I found this paper confusing because I couldn't figure out where they were planning on placing NetMedic -- is it somewhere in the network, or is it running on host machines? If it's the former, then how does it get enough application information? If it's the latter, then how can you expect it to work correctly given that the host machine is the one experiencing the problem (e.g., it may have a lack of network connectivity)? Furthermore, if it's on the host, what if NetMedic itself interferes with other host applications in some unexpected way? The implementation section finally concretely states this (it was implemented on the host using OS tools), but I was confused until then.

Basically they are trying to correlate the fault with component state and the interaction of components, and then a high correlation is treated as causation. A "component" here is any part of the system, and their tool automatically infers relationships/dependencies between components. This seems like it would not work well (given that correlation != causation). However, they did find the correct component in 80% of their cases. Interestingly enough, they didn't have any direct competitors to compare their tool against, so they implemented a different technique based loosely on previous work and compared themselves to that. Although unusual, I kind of like this approach for comparison...of course the downside is that since they are implementing both, they are biased and won't try as hard on the competitor.
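Here's roughly how I picture the correlation step, as a hedged sketch. NetMedic's real inference is more sophisticated (it uses joint history to weight dependency edges), so this is only the naive version of the idea, with made-up component names.

    # Rank components by how strongly their state deviations line up with the fault.
    def rank_suspects(fault_signal, component_histories):
        # fault_signal: 0/1 per time step; component_histories: name -> list of floats
        def corr(xs, ys):
            n = len(xs)
            mx, my = sum(xs) / n, sum(ys) / n
            cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            vx = sum((x - mx) ** 2 for x in xs) ** 0.5
            vy = sum((y - my) ** 2 for y in ys) ** 0.5
            return cov / (vx * vy) if vx and vy else 0.0
        return sorted(component_histories,
                      key=lambda name: corr(fault_signal, component_histories[name]),
                      reverse=True)

    suspects = rank_suspects([0, 0, 1, 1], {"web server": [0.1, 0.2, 0.9, 0.8],
                                            "firewall": [0.5, 0.5, 0.5, 0.5]})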

Floodless in SEATTLE

[Paper]

SEATTLE claims to "...[achieve] the best of both worlds: The scalability of IP combined with the simplicity of Ethernet." On one hand, Ethernet is simple to manage thanks to flat addressing. On the other hand, Ethernet doesn't scale well -- Ethernet bridges rely on broadcast messages (e.g., ARP requests and DHCP) & paths must form a spanning tree. To deal with this, network admins break up their large networks into small Ethernet networks connected by IP, or use VLANs. The authors assert that both of those workarounds fall short (they are inefficient and harder to manage, since you have to worry about addressing). SEATTLE is intended to provide the same simple plug-and-play semantics as Ethernet while also scaling to large networks.

[ to be done: further summarization ]

My criticisms of their simulations:

-- For the packet-level simulation, they start with a real LBNL trace and then inject extra traffic into it. They have fake hosts sending traffic to random destinations. I find this curious considering they make this earlier claim: "In enterprise networks, hosts typically communicate with a small number of other hosts [5], making caching highly effective."
-- They also have to make some kludgy assumptions about the number of hosts connected to each switch due to the anonymization of the traces. I suppose they can't really help that; it's a problem for the whole field.
-- They have really huge error bars in many of their graphs. Is this normal for network simulations, due to the nature of error in the physical links etc.? Notably, the SEATTLE results have tight error bars even when ROFL and ETH have huge ones (e.g., Figs. 5a and 5c)...is their simulation somehow skewed in favor of their own results?


Sorry about the lack of summaries......I had prelims on Tuesday and have two papers due Friday......multi-tasking is not going so well.

Monday, September 7, 2009

Congestion Avoidance and Control

[Paper Link]

Networks at the time were suffering from widespread congestion, despite the fact that the TCP specification had a theoretically correct description of how to avoid it. The principle the authors wish to adhere to is 'conservation of packets': a new packet should not be put into the network until an old one has left. Since the protocol itself obeys conservation of packets, yet networks were still failing under load, they observe that the failures must come from implementation bugs. Thanks to these bugs, one of three things happens: a connection never reaches equilibrium, a sender injects a new packet before an old one has exited, or equilibrium can't be reached because of resource limits along the path.

Descriptions of the changes they made to the 4BSD TCP implementation:

Slow-start. A connection can fail to reach equilibrium if it acts wonky on startup. A host doesn't send its next packet until it has received an ACK, but a restarting host might fire off packets really fast and then throttle downwards, or start at zero with no defined way of ramping up; either behavior keeps the connection away from equilibrium. Slow-start instead increases the number of packets in flight gradually: it starts at 1 and grows as more and more ACKs come back.
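A minimal sketch of the idea (mine, not the 4BSD code): the window starts at one segment and grows by one per ACK, which roughly doubles it every round trip until it hits the slow-start threshold.

    def slow_start(acks, initial_cwnd=1, ssthresh=64):
        cwnd = initial_cwnd
        for _ in range(acks):
            if cwnd < ssthresh:
                cwnd += 1        # +1 per ACK ~ exponential growth per RTT
        return cwnd

    print(slow_start(acks=10))   # window (in segments) after the first ten ACKs -> 11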

Round-trip-time variance estimation and exponential retransmit timer backoff. Sometimes packet transmission fails and therefore no ACK is generated. At what point should the sender decide to retransmit? Their round-trip-time variance estimation algorithm addresses this. And, if a packet needs to be retransmitted repeatedly, their exponential retransmit timer backoff algorithm handles that.
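A sketch of the estimator being described: exponentially weighted averages of the RTT and its deviation, an RTO derived from both, and doubling on repeated timeouts. The gains of 1/8 and 1/4 and the 4x deviation term follow the usual formulation, but this is an illustration rather than the 4BSD implementation.

    class RttEstimator:
        def __init__(self, first_sample):
            self.srtt = first_sample
            self.rttvar = first_sample / 2.0
            self.backoff = 1

        def on_rtt_sample(self, rtt):
            err = rtt - self.srtt
            self.srtt += err / 8.0                          # smoothed mean RTT
            self.rttvar += (abs(err) - self.rttvar) / 4.0   # smoothed mean deviation
            self.backoff = 1                                # fresh sample resets backoff

        def rto(self):
            return (self.srtt + 4 * self.rttvar) * self.backoff

        def on_timeout(self):
            self.backoff = min(self.backoff * 2, 64)        # exponential retransmit backoff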

More aggressive receiver ACK policy and dynamic window sizing on congestion. If recipients ALWAYS send an ACK upon receipt, then the lack of an ACK is equivalent to the network notifying the sender of congestion on that link. That near-guarantee of an ACK if all is well is what they mean by an aggressive receiver ACK policy. Additionally, they describe "dynamic window sizing", which covers how to grow and shrink the window (additive increase, multiplicative decrease) to fill up a connection without overloading it.
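The window-sizing rule, sketched with the conventional constants (not necessarily exactly what the 4BSD changes used): additive increase while ACKs flow, multiplicative decrease when a loss signals congestion.

    def on_ack(cwnd):
        return cwnd + 1.0 / cwnd      # ~ +1 segment per RTT (additive increase)

    def on_loss(cwnd):
        return max(cwnd / 2.0, 1.0)   # halve the window (multiplicative decrease)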

-- Unlike the previous paper, none of the main schemes they discuss need the cooperation of anything in the core. They can infer congestion by dropped packets, so no specific indicator (or view of the network state) is necessary. The end-to-end argument is satisfied here.

-- In their future work, they address the issue of misbehaving hosts. They argue that ONLY with the participation of gateways can you ensure fairness, since you can't trust hosts to be fair without another agent to police the situation. The idea is that if a host starts using more than its fair share, the gateway starts dropping its packets.

-- The paper takes a very practical tone. I especially like how they explain their logic, the major takeaway points, the relevant math, AND the actual algorithm. It's a well-written, easy-to-read paper, in my opinion.

Thursday, September 3, 2009

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

[Paper link]

Congestion control is designed to keep network devices from being overwhelmed by traffic; a congested network sees high rates of dropped packets as queues overflow. It is a dynamic resource management problem: hosts should modulate how fast they send to reflect the current state of the network. This paper focuses on congestion avoidance, which tries to keep the network operating near the "knee" of the load curve, before congestion sets in. (This is opposed to congestion recovery, which kicks in once the network is already congested and tries to bring the load back down from the "cliff".) More specifically, the authors analyze the increase/decrease algorithms of the binary feedback scheme, a decentralized decision-making system. In the binary feedback scheme, network resources monitor their own state, and when they are above their optimal operating range they signal congestion by setting a feedback bit to 1; users then adjust their sending according to an increase/decrease algorithm.

They look at 4 criteria for algorithm controls: efficiency, fairness, distributedness (binary feedback vs universal knowledge), and convergence. Their questions based on these criteria are: (1) "What are all the possible solutions that converge to efficient and fair states?" and (2) "How do we compare those controls that converge?" Their main result is that increase should be additive and decrease should be multiplicative.
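The linear controls being analyzed, as a sketch: each user updates its load x as x <- a + b*x depending on the feedback bit. The constants below are illustrative; the paper's conclusion is that additive increase (a_i > 0, b_i = 1) plus multiplicative decrease (a_d = 0, 0 < b_d < 1) is the combination that converges to an efficient, fair operating point.

    def update_load(x, congested, a_i=1.0, b_i=1.0, a_d=0.0, b_d=0.5):
        if congested:                # feedback bit = 1: above the optimal range
            return a_d + b_d * x     # multiplicative decrease
        return a_i + b_i * x         # additive increase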

- Notably, this is an example of something that goes against the (strict?) end-to-end argument. Although the hosts are doing the actual throttling according to the increase/decrease algorithm, this still relies on the addition of a binary feedback bit in the header whose value is set by routers (in the "dumb" core). However, it seems unreasonable to think that any given host could see enough of the network to be able to make such a judgement without advice from routers (as they discuss when they say they desire "distributedness"). It seems like sometimes routers know best about routing (e.g., traffic conditions).

- This model assumes that users will respond honestly and in everyone's best interest when told that the network is being overwhelmed. What if a user free-rides on the fact that other users will rate-limit, and doesn't limit itself so that it can grab more bandwidth? Oddly enough, they don't mention this in their "Practical Considerations" section.

- Not a huge fan of this paper. It's mathy and details-oriented, but it's not clear to me that understanding those details (and not just the principles/guidelines derived from them) is useful unless you are directly doing research in this area, especially since they fail to consider practical things such as misbehaving hosts.

Understanding BGP Misconfiguration

[Paper link]

This paper presents a study of BGP misconfiguration problems. The authors watched route advertisements from 23 vantage points and checked them for incorrect information, and then contacted the ISPs responsible for the incorrect broadcasts to find out what had caused the problem. They look at two types of problems: 1) "origin misconfiguration", the accidental injection of routes into the global BGP tables (e.g., the Pakistan/YouTube problem), and 2) "export misconfiguration", which means exporting routes against an ISP's policy. (From the previous paper, one should recall that ISPs have a financial motivation for withholding some of the routes in their tables.)

They determined that an event in the BGP update stream was a misconfiguration based on time: if a policy suddenly changes (and then is quickly changed back), that is indicative of a misconfiguration rather than a policy shift. They look at changes that lasted less than a day. They can miss potential misconfigurations if those misconfigurations go undetected by the ISP for a while, but I suppose those misconfigurations couldn't be too important if nobody noticed them. They can also potentially miscategorize a legitimate policy shift as a misconfiguration, but their communication with the ISP about the cause should set the record straight. They also limit the scope to the two types I listed above, because they aren't able to detect all kinds of misconfigurations (e.g., the inverse of #2 -- filtering something that should have been exported -- looks identical to a failed node/link). They observed misconfigurations for 21 days.

The results for origin misconfiguration are astounding. Their results indicate that 72% of new route announcements per day actually result from misconfiguration, and 13% of those misconfigurations result in a connectivity problem. Does that mean that 9.4% of new announcements overall lead to connectivity problems?? That's insanely high...so high that it actually makes me suspicious of their results. They present this as being a low number compared to actual failures, but it's really high (in my mind) compared to the total number of announcements sent. If about 10% of new announcements are causing connectivity problems...well, that seems terrible to me!

Criticisms:

-- They present data showing that most of the misconfigurations are really short-lived. I'm not sure why they present this information, because they have already selected for short misconfigurations: their categorization method labels short-lived updates as misconfigurations and never considers longer-lived ones. So if there were misconfigurations of longer duration (e.g., a set of subtler misconfigurations), they would have been thrown out anyway.

-- A note on their presentation: I find their result tables really hard to read. It took me a few minutes to figure out that the table alternated between paths and incidents.

-- A general philosophical point is that this paper is about small, relatively minor errors. Are we really concerned about those? Should we be? Or have people learned to live with this background noise, in which case we should be looking at what causes the disastrous failures? The disastrous failures might be caused by entirely different factors in entirely different contexts. Since the scale of the failures is different, it seems like the other factors would be different too.

Interdomain Internet Routing

[Paper link]

This paper is informational (part of a textbook?), rather than making an argument or evaluating a system. It explains BGP and how inter-ISP route selection is influenced by $$. Routing is not done in a way to maximize global efficiency and performance; instead it is done in a cooperative-competitive fashion where money is a large factor in routing decisions. Notably, transit routing is profitable whereas peering usually is not, and ISPs will value their customers' routing announcements over other routing announcements. The design of BGP has space for attributes that can be used to decide between routes based on factors relevant to profitability.

My specific notes on the paper follow...

- Topology of the Internet. Topology is made of Autonomous Systems (ASes). BGP routes between ASes. Different IGPs (Interior Gateway Protocols) like RIP run inside ASes. A small AS is a university/corporation, a large AS is a nation-wide ISP.

- Two types of AS-AS interconnection:
1. Provider-customer transit. "One ISP provides access to all destinations in its routing tables." Customer is paying provider for this universal access. If there were only transit relationships then there would be a single huge Tier-1 AS at the root, and it would control all Internet traffic.
2. Peering. "Two ASes provide mutual access to a subset of each other's routing tables." Sometimes paid, often reciprocal. Currently there are 9 big Tier-1 ISPs who do peering so that they can have explicit routes to all destinations. Smaller ASes do peering simply to save money, if there's a lot of traffic going between the two ASes.
- "A route advertisement from B to A for a destination prefix is an agreement by B that it will forward packets sent via A destined for any destination in the prefix." Other way to think about it is that ISPs charge customers for entries in routing tables.

- How BGP works. Router A establishes a TCP connection to router B and sends B an OPEN message. They then exchange tables of all the active routes they want to share, and each router adds the routes it wants into its own routing table. Later, routers may send UPDATE messages, which can be either announcements or withdrawals. eBGP is used between ASes; iBGP is used within an AS, because an AS may have several eBGP-speaking routers that need to be kept consistent. iBGP is *not* the same as the IGP. Notable BGP attributes are LOCAL PREF (a local policy for choosing between routes), ASPATH (the list of ASes a route has traversed, which gives a length for comparison and a way to discard circular announcements), and MED (used for comparing two routes from the same neighboring AS).
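To make the attribute ordering concrete, here is a simplified sketch (my own, not from the paper) of how a router might rank routes using just these attributes. Real BGP has several more tie-breakers, and MED strictly only applies between routes from the same neighboring AS; the routes and AS numbers below are just examples.

    def best_route(routes):
        # Highest LOCAL_PREF wins, then shortest AS path, then lowest MED.
        return min(routes, key=lambda r: (-r["local_pref"],
                                          len(r["as_path"]),
                                          r["med"]))

    routes = [
        {"local_pref": 100, "as_path": [7018, 15169], "med": 10},
        {"local_pref": 200, "as_path": [3356, 2914, 15169], "med": 0},
    ]
    print(best_route(routes))   # the LOCAL_PREF 200 route wins despite its longer path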

- Problems with BGP. BGP does not support origin authentication, meaning there's no way to tell which AS owns a prefix, so any AS can claim that it owns any prefix. Traffic from other networks would then be sent to that AS instead of where it should actually go.
- With the Pakistan/YouTube problem, a Pakistani ISP was redirecting internal Pakistani traffic to its own error page to block YouTube. It accidentally leaked this route to its upstream provider, which started propagating it, and YouTube traffic began flowing to the Pakistani ISP. Since the Pakistani ISP was a customer of the larger ISP, the provider preferred that route over the existing one. Additionally, since the announced prefix was MORE specific than YouTube's actual announcement, routers everywhere preferred it via longest-prefix match (see the sketch after this list), so the hijack took effect very quickly.
- This can also be used for spam. A bad guy claims to own a prefix and announces it upstream, and a questionable or careless ISP propagates it. The spammers then send spam "from" IPs in that prefix and are able to receive responses for a short period of time. Then a few hours or days later, they disappear!
- Perhaps this is part of the larger problem of accountability/authentication not being part of the original goals of the Internet.
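Toy illustration of why the more-specific announcement won: forwarding uses longest-prefix match, so a /24 inside YouTube's /22 captures the traffic no matter which route is "right". The prefixes are the ones commonly cited for the incident; treat them as illustrative.

    import ipaddress

    table = {
        ipaddress.ip_network("208.65.152.0/22"): "YouTube (legitimate /22)",
        ipaddress.ip_network("208.65.153.0/24"): "Pakistani ISP (leaked /24)",
    }

    def lookup(dst):
        addr = ipaddress.ip_address(dst)
        matches = [net for net in table if addr in net]
        return table[max(matches, key=lambda net: net.prefixlen)]

    print(lookup("208.65.153.238"))   # -> "Pakistani ISP (leaked /24)"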

Tuesday, September 1, 2009

End-to-End Arguments in System Design

[Paper link]

This paper discusses the question of what responsibilities should be given to a communication system vs. the applications using it. They believe that most functions are more suited to the applications rather than the communication system, and they call this the "end-to-end argument." A motivating example they give is file transfer reliability: should the communication system worry about checksumming and internally confirming each packet that is sent across the network, or should the host application checksum the files once and then request a retransmit of the whole file if the recipient application's checksum fails to match? The high cost of executing the former option outweighs the low odds of a file transmission error occurring. Additionally, one must consider problems external to the communication system, such as the reliability of the hard drives; in order to account for this, the host applications would need to perform their own integrity check even if the communication system provided its own guarantee of reliability. The "end-to-end argument" is that many functions (e.g., reliability) are better performed by the end applications rather than the communication system itself. The communication system should only offer these functions as performance enhancements, and one needs to be careful that the performance "enhancement" actually works out to be beneficial and not overly complex.
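A toy version of the file-transfer example, just to make the end-to-end check concrete: the endpoints verify a whole-file checksum and retransfer on a mismatch, regardless of any per-packet reliability the network provides. The function names are mine, not from the paper.

    import hashlib

    def send_file(data):
        return data, hashlib.sha256(data).hexdigest()

    def receive_file(data, expected_digest):
        if hashlib.sha256(data).hexdigest() != expected_digest:
            raise IOError("end-to-end check failed; request a retransmission")
        return data

    payload, digest = send_file(b"file contents")
    receive_file(payload, digest)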

In the previous paper on DARPA Internet design, the author also talked about this issue as a motivating factor for splitting up TCP/IP. Different applications (e.g., voice vs. rlogin) need different reliability guarantees; rlogin will wait to receive its packets in order, whereas a voice application would rather drop a packet and insert silence than delay playback for an out-of-order packet. For this reason, TCP and IP were separated so that IP became a lower-level building block, with TCP and alternatives like UDP on top of it, each handling reliability as it saw fit. I like this paper because it formalizes the rationale behind the TCP/IP separation and makes it a more general principle.

The authors also apply the end-to-end argument to encryption, duplicate message suppression, guaranteeing FIFO message delivery, and transaction management. The general theme is the same as with the file transfer example: the end hosts need to perform the function anyway (and probably can do it better, since they have more semantic information), and doing the work in the communication system would just be wasteful duplication. Another important point is that not all applications need the same underlying "help" from the communication system, so what is a useful performance enhancement for one type of application might slow down another.

This same type of discussion could be applied to NIDSs -- should the work of intrusion detection be done on the network or by the host? Doing work on a dedicated NIDS will save hosts computation time, and it's easier to keep a NIDS programmed and updated than it is to keep host security software updated. On the other hand, understanding network traffic sometimes means the NIDS has to infer application behavior, which can become very expensive and lead to DOS attacks. Instead of having a NIDS do bifurcation analysis, it might make more sense to have the host be responsible for filtering its own incoming traffic.

The paper does acknowledge that not EVERYTHING should be moved to the end applications -- there is always a tradeoff. For example, retransmitting a single packet makes more sense than retransmitting an entire file. I would be interested in seeing a formal cost/benefit analysis that could be used to make the decision of when one approach makes more sense.
