Blog Archive

Friday, November 20, 2009

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

A "sleeping" device uses less power than an idle-but-turned-on machine but loses its ability to respond to network events. This paper explores different ways of setting a machine to sleep without losing network presence, so that computer users will be more willing to actually use sleep modes. Notably, they look at two approaches: Wake-on-Lan (WoL), which allows machines to be woken up remotely with a special magic packet; and the use of a "proxy" that handles traffic on behalf of a sleeping host (potentially waking the machine).

1. Is the problem worth solving? They look at 250 machines (a mix of desktops and notebooks) in both home and work settings. If a person hasn't used a machine for 15 mins they classify it as "idle", and the machines in their study are idle a huge percentage of the time. That's a big opportunity for sleep-but-present.

2. What network traffic do idle machines see? They use Bro to record the traffic. They find that both at home and at work, machines are receiving nearly constant network traffic (although home networks are quieter). If they were going to wake up a machine for EVERY PACKET, machines wouldn't get to sleep much: not at all in the office, and only about 50% of idle time at home. This implies that a selective wakeup mechanism (only wake for some packets) would be better.

To try to figure out what kind of policy would be good (which packets should they wake for?), they broke down idle-time traffic by communication type (multi/uni/broadcast), incoming vs outgoing, and protocol. Outgoing traffic is mostly unicast, incoming is multi/uni/broadcast (so a proxy would need to handle all 3). Incoming broadcast and multicast chatter are largely responsible for poor sleep; proxying & ignoring them would save a lot of sleep time. Most broadcast and multicast protocols can be ignored or responded to by the proxy (eg the proxy can respond to ARP queries or SSDP packets without waking the host).

3. What is the design space for a proxy? They come up with 4 example policies that wake/respond/ignore in different ways. "Proxy_3" performs the best, reclaiming 70% of idle time for sleeping; it only wakes up for telnet, ssh, VNC, SMB file-sharing, and NetBIOS. The proxy will also respond to certain packets. This is "non-transparent" because the machine will appear to be asleep (unreachable) for protocols outside the set the policy cares about.
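To make the wake/respond/ignore split concrete, here is a rough sketch of what a Proxy_3-style classifier could look like. The port list mirrors the services named above, but the rule table, the packet representation, and the function names are my own illustration, not the paper's implementation.

    # Hypothetical sketch of a Proxy_3-style policy: wake the host only for a
    # short list of services, answer a few chatty protocols on its behalf,
    # and silently drop everything else.

    WAKE_PORTS = {22, 23, 139, 445, 5900}    # ssh, telnet, NetBIOS session, SMB, VNC
    RESPOND_PROTOCOLS = {"ARP", "SSDP"}      # proxy answers these without waking the host

    def classify(packet):
        """Return 'wake', 'respond', or 'ignore' for an incoming packet.

        `packet` is a dict like {"protocol": "TCP", "dst_port": 22} --
        an assumed representation, not the paper's data structure.
        """
        if packet.get("protocol") in RESPOND_PROTOCOLS:
            return "respond"                 # e.g., the proxy replies to ARP queries itself
        if packet.get("protocol") == "TCP" and packet.get("dst_port") in WAKE_PORTS:
            return "wake"                    # send a Wake-on-LAN magic packet to the host
        return "ignore"                      # broadcast/multicast chatter, everything else

    # An inbound ssh connection wakes the machine; ARP is answered by the proxy;
    # other chatter (e.g., mDNS) is dropped.
    assert classify({"protocol": "TCP", "dst_port": 22}) == "wake"
    assert classify({"protocol": "ARP"}) == "respond"
    assert classify({"protocol": "UDP", "dst_port": 5353}) == "ignore"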

It seems like administrator-customized proxy policies might be a good idea; administrators would have their own insight into what is important. For example, I would happily sleep my machine if I could selectively wake it up just for ssh! Perhaps the EECS admins might ask me to have it respond to a few other things.

All in all, I think this is a really cool paper. I had never read anything about energy saving before these two energy-related papers, and I think it's a neat topic. I'd like to hear more about this topic area.

Wednesday, November 18, 2009

Cutting the Electric Bill for Internet-Scale Systems

This is a REALLY cool paper!

Their idea is to cut a company's electric bill by servicing traffic at data centers with low energy prices. This might mean servicing a host in Boston from a data center in Virginia (rather than NY) if VA is cheaper than NY at that time. "Preferred data centers" would be recalculated every hour based on that hour's regional energy prices. By shifting traffic around based on energy price, the total electricity cost of the system is minimized. They look at energy prices for different regions and note that not all preferences can be pre-determined; sometimes static routing might work (always prefer one region over another, e.g., Chicago vs Virginia), but other times dynamic routing would be preferable (e.g., Boston vs NY). They use a real, large Akamai traffic data set to examine the usefulness of their idea. They find that energy savings could be up to 40% under perfect conditions.
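As a toy illustration of the core mechanism (a real system would also have to respect the bandwidth and latency caveats discussed below), the hourly re-ranking might look like this; the prices, regions, and latency table are made up:

    def pick_datacenter(client_region, hourly_price, latency_ms, max_latency_ms=100):
        """Choose the cheapest data center that still meets a latency budget."""
        candidates = [dc for dc, lat in latency_ms[client_region].items()
                      if lat <= max_latency_ms]
        return min(candidates, key=lambda dc: hourly_price[dc])

    hourly_price = {"VA": 31.0, "NY": 55.0, "CHI": 40.0}       # $/MWh this hour (illustrative)
    latency_ms = {"Boston": {"VA": 18, "NY": 8, "CHI": 35}}    # made-up client->DC latencies

    # Boston traffic goes to Virginia this hour: it is cheaper than NY and
    # still within the latency budget. Re-run every hour as prices change.
    print(pick_datacenter("Boston", hourly_price, latency_ms))  # -> "VA"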

Their idea has four caveats:

(1) Electrical elasticity captures how much less power a server consumes when idle than when active. If idle & active power draw are identical, then this plan doesn't work at all. If idle uses zero power, then you'd have ideal conditions. The reality is somewhere in the middle. Google's elasticity rate is 65%. With perfect elasticity, they find 40% savings; with Google's elasticity rate, they only see about 5% savings.

(2) This could increase latency, which might not be tolerable for some applications. Looking at normal Akamai traffic (Akamai is a latency-sensitive CDN), they see that geo-locality is not always adhered to anyway; perhaps geo-locality does not actually correspond to performance. I would be interested in seeing more work on this point.

(3) Lengthening the routing path could have the side effect of incurring extra energy expenditure along the way. However, they say this is not a big problem because routers' energy use is not proportional to traffic.

(4) Under the 95/5 rule of bandwidth billing, traffic is sampled in 5-minute intervals and billed at the 95th percentile sample; if that percentile exceeds the committed rate, the consumer has to pay substantially more for the overage. This means that their work is constrained by the 95/5 rule and they can't increase traffic to one data center beyond that point; the extra bandwidth would be more expensive than the energy they are saving. If the 95/5 rule is adhered to, perfect elasticity only yields 15% savings; without the 95/5 rule, you see their 40% max savings figure. Note that with both the 95/5 rule and 65% elasticity, Google only sees a 2% savings (which is still $1M+).

One question I have is whether this system would be self-affecting. That is to say: if a large company like Google actually moved its computation to its cheaper DCs, would power costs in those regions increase? (Conversely, if all the DCs in Silicon Valley moved their servers to Oklahoma, would Silicon Valley prices go down?)

------------------

Notes -

Google #s should be higher now. Their machines used to be single-app, however now they are multi-app. Regardless though, utilization is likely under 60%...over 60% you see an increase in latency because of competition for resources.

Scalable Application Layer Multicast

This paper presents a system, NICE, for application-layer multicast. They look at a class of applications known as "data stream applications", which simultaneously deliver streaming data to large numbers of people. They view multicast as an overlay network, with all the recipients of the multicast communication being the nodes in the overlay. They set out 3 ways to evaluate the "performance" of an application-layer multicast scheme: (1) Quality of the data delivery path using metrics like stress, stretch, and node degrees, (2) Robustness of the overlay, (3) Control overhead.

Their multicast system is hierarchical. End hosts are arranged in layers. The bottom layer (L0) has all hosts in it, organized into small clusters. Hosts are grouped into clusters based on "closeness." Each cluster has a cluster leader who is in the "center"; that leader also belongs to the next layer up (L1). L1 members are clustered into small groups and then each one of those has a cluster leader, who belongs to L2, etc. recursively up to a single node at the top. At the very top of the pyramid/tree is the data provider. The data delivery path goes down the layers, with each cluster leader distributing data to its cluster members.

Members of a cluster exchange "heartbeat" messages. The cluster leader sends out a digest with info about all the cluster members. Members of a cluster tell each other the distance estimate to every other group member. Based on this info, the cluster leader can be changed if the cluster leader is no longer in the "center" based on changing network conditions. The cluster leader will check on cluster size and split a cluster if it gets too big.

When a new node joins the overlay network, first it will contact the data provider ("RP" for rendezvous point) at the top. The RP will respond with the hosts in the highest layer; the new host will contact those hosts to find out which one is the best match. It will then choose one and contact that leader's children to pick among those, until finally it is mapped to an L0 cluster in the hierarchy. Nodes may leave the network either intentionally (with a REMOVE message) or due to failure (in which case their loss is noted by a lack of heartbeats). New cluster leaders may be chosen when nodes enter or leave a cluster.
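A minimal sketch of that join descent, as I understand it; the Node class, the distance function, and the names are my own stand-ins, not the paper's protocol messages:

    class Node:
        """A cluster leader; `children` are the leaders of the next layer down."""
        def __init__(self, name, children=None):
            self.name = name
            self.children = children or []
            self.cluster = []               # L0 members attached to this leader

    def join(new_host, rendezvous_point, distance):
        """Walk down the hierarchy from the RP, moving toward the closest leader
        at each layer, then attach to that leader's bottom-layer cluster."""
        leader = rendezvous_point
        while leader.children:
            leader = min(leader.children, key=lambda c: distance(new_host, c.name))
        leader.cluster.append(new_host)
        return leader

    # Toy topology: the RP sits above two L1 leaders; pretend "distance" is the
    # alphabetical gap between first letters.
    rp = Node("rp", [Node("bob"), Node("zoe")])
    dist = lambda a, b: abs(ord(a[0]) - ord(b[0]))
    print(join("alice", rp, dist).name)     # -> "bob"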

I understand their overlay network topology and maintenance protocols, but I feel like I'm missing something in terms of how the actual data is transferred. OK, so data goes down the tree. What happens with failures/retransmissions? How are those relayed, who is responsible for them in the hierarchy? Are they just assuming that reliability is being handled beneath the network by TCP? What if one node is very faulty (thereby generating lots of failures/retransmissions), won't nodes higher up in the hierarchy see their bandwidth suffer because the cluster leaders will need to handle the failure/retransmission traffic from the flaky node in L0?

-------------

Notes -

app layer with NICE
----
UDP
IP

Tuesday, November 17, 2009

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

Implementing reliable multicast is not as easy as making slight changes to unicast protocols. With TCP, the sender is responsible for loss detection & recovery. This does not work for multicast because (1) if all recipients constantly send the sender ACKs, then the sender will be overwhelmed with traffic (ACK implosion), and (2) the set of recipients is unknown and may be constantly shifting. Essentially, the issue is that multicast does not imply "fate-sharing", whereas unicast does. The authors conclude that receiver-based reliability is a better approach to reliable multicast. The authors also note that the use of communication state (sequence numbers) for naming data does not work well for multicast because state synchronization is weaker; instead, the authors advocate "naming in application data units (ADUs)".

This paper (from 1995) presents Scalable Reliable Multicast, a framework that provides reliable (but not necessarily ordered) multicast delivery over IP. They suggest that ordered delivery and other application-specific details could be built on top of SRM. If I understand it correctly, they are splitting the transport layer in half: SRM would be on the bottom of the transport layer, and then some number of ordering protocols would be on top of it. In my mind, this is equivalent to making the unicast assumptions of UDP and TCP explicit and then building UDP and TCP on top of the unicast framework.
-- Note that I did not understand their description of "Application-level framing". Do they simply mean that there need to be many different transport layer protocols for different types of applications (much like there is UDP vs TCP), or do they mean that you shouldn't really have anything on the transport layer, and applications should be handling this directly?

They use SRM with a whiteboard conferencing tool named "wb" from LBNL. Many users (maybe thousands) simultaneously use a wb instance. Wb *does not* require ordered delivery because its operations are either idempotent or easy to fix when applied out of order. Every change to the whiteboard is multicast to all of the users of a whiteboard, and whiteboard users (the system receivers) are responsible for their own loss and retransmission. If a receiver detects a loss, it schedules a request timer that is drawn from a uniform distribution based on the receiver's estimate of the one-way delay to the source of the data. The request is then multicast, as is the response to the request. If a receiver sees *another* request for the same data before its own request timer expires, then it will do a random exponential backoff and reset its timer. If it gets the missing data, it will cancel its request timer altogether. This provides reliability without the ACK implosion. The point of the randomization & timer scheduling based on location is to de-synchronize host actions to prevent unnecessary duplicates.
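A rough sketch of that timer logic; C1 and C2 are placeholders rather than the paper's tuned constants, and the code below only covers the request side:

    # Rough sketch of SRM's receiver-driven request timer: on detecting a loss,
    # schedule the request at a random point in an interval scaled by the
    # estimated one-way delay to the data's source; hearing someone else's
    # request for the same data backs the timer off exponentially.
    import random

    C1, C2 = 2.0, 2.0          # placeholder constants, not the paper's values

    def request_wait(delay_to_source, backoff_round=0):
        """Seconds to wait before multicasting a repair request."""
        lo = C1 * delay_to_source
        hi = (C1 + C2) * delay_to_source
        return (2 ** backoff_round) * random.uniform(lo, hi)

    print(round(request_wait(0.05), 3))                   # initial timer
    print(round(request_wait(0.05, backoff_round=1), 3))  # after seeing a duplicate request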

They ran some simulations of their request & repair algorithms on small contrived topologies. They say the point of the simulations is to look at the performance for individual packet drops; "We do not claim to be presenting realistic topologies or typical patterns of packet loss." In their scenarios, they show that the request & repair algorithms work well in random and bounded-degree trees when every node is a member of the multicast session; they don't work as well in a sparse session. This is a bit concerning since the sparse session is the expected real-world scenario, so they make some changes to their request & repair algorithms. Notably, they add adaptive adjustment of timer parameters based on the past behavior of the algorithms in the multicast session. If the average number of dupe requests is high, then the request timer interval increases; if the average number is OK but the delay in sending a request is too high, they decrease the request timer interval. They then do some new simulations and show that the adaptive algorithm works better ("well") under many conditions. However, in their simulations, note that none of the requests or repairs are dropped, which would happen in the real world.
-- Honestly, I am not really sure what the value of their simulations is.

They also suggest (but do not implement or simulate) "local recovery". The idea is that a receiver belongs to a "loss neighborhood" where a set of nodes are all experiencing the same losses. Receivers learn about their loss neighborhoods from session messages (which are sent out periodically). Requests & repairs could be broadcast only to a loss neighborhood, thereby reducing the overall bandwidth used.

Different apps have different tradeoffs between delay and bandwidth. wb wants to minimize delay but isn't worried about bandwidth, whereas something like Usenet would want to minimize bandwidth without concern for delay. This is a little confusing to me -- unlike ordering, these properties aren't something you can build "on top" of SRM. So I am back to my original confusion of whether this is meant to be transport-layer or part of an application itself.

-----------

SRM - transport
IP - routing

Thursday, November 12, 2009

NetFPGA: A Tool for Network Research and Education

(paper)

This paper is about a resource for teaching/learning about networks. The idea is to expose students to L1 and L2 technologies. It can also be used in research to build networks cheaply and easily.

Tuesday, November 10, 2009

A Policy-aware Switching Layer for Data Centers

(paper)

Data centers try to route their traffic through middleboxes (firewalls, load balancers, NATs, etc). They typically do this by putting them in-line on the network topology and trying to ensure there are no other paths. Sometimes some network traffic doesn't need to go through a particular box but it will anyway because that's how they've set up the topology. This paper proposes an L2 approach that accomplishes this better; they call it "the policy-aware switching layer", aka PLayer. The properties they want are correctness (can't circumvent), flexibility, and efficiency. Pswitches classify traffic and redirect it to the appropriate middlebox. Middleboxes do not need to be modified, but switches would need to be changed. There is a central policy server that sends policies to all the pswitches so they know how to classify and direct traffic.
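For intuition, the classification step could look something like the sketch below; the rule format and names are mine, not PLayer's actual policy language:

    # Toy sketch of pswitch classification: match a frame against rules pushed
    # by the central policy server and get back the middlebox sequence it must
    # traverse before reaching its destination.

    POLICIES = [
        # (predicate over frame fields, middlebox sequence)
        (lambda f: f["dst_port"] == 80,  ["firewall", "load_balancer"]),
        (lambda f: f["dst_port"] == 443, ["firewall"]),
    ]

    def next_hops(frame):
        """Return the middlebox sequence this frame must go through (may be empty)."""
        for predicate, sequence in POLICIES:
            if predicate(frame):
                return sequence
        return []        # traffic with no matching policy goes straight to its destination

    print(next_hops({"dst_port": 80}))    # -> ['firewall', 'load_balancer']
    print(next_hops({"dst_port": 22}))    # -> []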

They implemented prototype pswitches using Click, and then "validated" its functionality on the DETER testbed. They make the disclaimer about their work being a prototype which of course tips you off that their performance is bad: 40% of normal throughput with 2x the latency. Owch!!!

They then provide a formal analysis that I do not buy at all. I personally think that a formal analysis of a system like this is BS. They are just rephrasing what they have said earlier in the paper with subscripts.

Interesting paper but honestly they went into too much detail and lost my attention in places. This might be because this is a tech report. In a few places it seemed like they were unnecessarily belaboring obvious points that I didn't need convincing about.

Monday, November 9, 2009

Internet Indirection Infrastructure

The Internet is built to provide point-to-point, unicast service: the sender sends a packet to one addressee. IP routers in the middle will direct the packet to the addressee. This paper discusses how to support multicast, anycast, and host mobility: the sender doesn't necessarily know the identities and/or location of the receiving hosts. They propose an overlay network (Internet Indirection Infrastructure aka i3) to support these tasks.

i3 decouples sending & receiving. Sources send packets to an identifier, and receivers can ask for all packets sent to that identifier. So -- anytime something is sent to a given identifier, it is distributed to everyone who has registered for that identifier. They map each identifier to an i3 server; there's one node for each ID. When a packet is sent, it goes to that server; then that server distributes it to all hosts registered to that ID. This is useful for mobility because triggers just need to be updated with the new address; multicast can be done by all of the recipients registering for the trigger; and honestly I am not sure I follow their explanation of how to apply this to anycast. Public triggers could be used for public servers, and private triggers can be used for short-term flows.
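Here is a tiny sketch of the rendezvous idea as I read it. Real i3 spreads identifiers over Chord nodes; this toy keeps everything in one dictionary, and the names are mine:

    from collections import defaultdict

    triggers = defaultdict(set)       # identifier -> set of receiver addresses

    def insert_trigger(identifier, receiver_addr):
        """A receiver registers interest in an identifier (refreshed periodically)."""
        triggers[identifier].add(receiver_addr)

    def send(identifier, payload):
        """Deliver payload to everyone currently registered for this identifier."""
        return [(addr, payload) for addr in triggers[identifier]]

    # Multicast falls out for free: every receiver registers the same identifier.
    insert_trigger("video-42", "10.0.0.5")
    insert_trigger("video-42", "10.0.0.9")
    print(send("video-42", "frame-1"))
    # Mobility: a host that moves simply re-inserts its trigger with the new address.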

-- If there's only one node for each ID, how does this work in case of failures? Isn't non-replicated "hard state" generally a bad idea? They later say that end-hosts use periodic refreshing...which makes it soft state...but then your outage time would be the same length as your refresh period. If you are not a fault-tolerant application, do you want to be refreshing every 50ms? OK...so then they try to mitigate this later in the paper by saying (1) an application can maintain a backup trigger or (2) the overlay network can automatically manage replicas.

The underlying overlay is responsible for robustness, scalability, efficiency, and stability. They use the Chord lookup protocol. One thing to note is that routing using i3 is less efficient than point-to-point addressing. They can alleviate this by having senders cache the i3 server's IP address. You can still run into the "triangle problem" where the i3 server ends up being farther away than the recipient nodes; they say this can be solved by having receivers choose i3 servers close to them. This could worsen scalability problems...seems like certain i3 nodes could easily end up overwhelmed. They suggest hot spots could be avoided by replicating the trigger at multiple nodes.

Potential security problems....
-- Eavesdropping: anyone who knows a trigger can eavesdrop. They suggest that private triggers should be randomly chosen & kept secret. Plus periodically change it. Not sure if eavesdropping is actually a huge concern given that most people assume eavesdropping on the network is possible in other ways.
-- Trigger hijacking: a malicious user that removes a public trigger. They suggest an extra level of indirection.
-- DoS attacks by making lots of triggers that point to a victim & amplify the sending of a single packet.

Thursday, November 5, 2009

DNS Performance and the Effectiveness of Caching

[paper]

As mentioned in the previous paper, DNS servers aggressively cache address data. The previous paper also suggests that negative caching could give a performance boost. This paper questions and tests the effectiveness of both of these mechanisms using DNS and associated TCP traffic from MIT CSAIL and KAIST, circa 2000 and 2001.

Background: Many servers ("stub servers" or "stub resolvers") do nothing but cache responses and act as proxies for resolvers (they will query another server if they do not already have the data cached). (Stub servers answer "recursive queries", whereas authoritative servers receive "iterative queries.")

They collected outgoing DNS queries, incoming DNS responses, and TCP session start and end packets. Notably they cannot observe DNS lookups cached inside their network (eg on the client itself). They removed all TCP sessions that weren't preceded by a DNS lookup; I assume this removes the effect of client-cached addresses. They also removed all DNS A-record lookups not part of a TCP session; I don't understand why, or how this skews their findings, and this bothers me.


Their results:

-- About 80% of DNS lookups don't require a referral; this means the first NS contacted has the answer.
-- Total number of DNS queries is much higher than DNS lookups, meaning that many lookups require query retransmission. Despite retransmissions, about 20% of clients never got any answer (not even an error). They suggest that it is better to give up sooner than to keep retransmitting, and let the application figure out what else to do. Between 12-19% of their lookups did not have any retransmissions. Aggressive retransmissions can cause a lot of traffic -- 63% of all DNS queries they saw were for lookups that never received a response!
-- 13% of client lookups result in errors (most say the name does not exist).
-- The distributions of both successful and failed DNS requests are heavy-tailed. For one set of traces, 10% of names account for 68% of answers, while the remaining 32% of answers are spread across a very long tail of names.
-- They do believe that NS record caching is important. Only about 20% of lookups currently reach a root or gTLD server; if TTLs on NS records were short enough to defeat caching, nearly every lookup would, so root server load would increase roughly 5x.

Fact to remember: Address (A) records store IP addresses, and NS records say which name server is authoritative for a domain (zone)

---------------

Class Notes:

-- Top sites accessed a lot, but low TTL; then many sites accessed infrequently. Reminds me of the 80:20 rule.
-- Key point is to stop with aggressive retransmissions -- if multiple attempts get no response, give up because it's not helping.

Development of the Domain Name System

[paper]

This paper tells the story of the invention of DNS. Prior to DNS, every host had a plain text file named HOSTS.TXT that contained addressing information for all places they might want to reach. Clearly this was not scalable.

DNS has two components: name servers and resolvers. Name servers hold the addressing information & answer queries, and resolvers find name servers. Data for each DNS name is a "resource record" (RR) that carries a type, a class, and data. There are two mechanisms for transferring data from source to destination: zones and caching. A zone is a section of the tree controlled, updated, and maintained by a single organization. Resolvers (which can be centrally located or on a host itself) cache responses for as long as the TTL attached to each record allows. A resolver is configured to point at servers for the root node and the top of the local domain. It then searches "downwards."

Surprises they encountered when DNS use became widespread:

-- Refinement of semantics. The form of data specification ended up being confusing, eg how to order multiple IP addresses for a single host.
-- Performance. Initial performance was much worse than they expected. Root servers would get multiple copies of the same query from the same source because their responses were slower than resend timers. Measuring DNS performance precisely, however, proved problematic because of gateway changes, software releases, etc.
-- Negative caching. DNS can respond negatively to a query in 2 ways: name does not exist, or name exists but they don't have data for it. They expected negative responses to be rare but they ended up being extremely common (20-60% of all queries) because of transition confusion (use of old addresses and shortcuts, etc). They expected this would drop off but it actually stayed between 15-50%. This is because a user would type one bad address and then their application would automatically try other potential addresses, in the process making a bunch of negative queries. They suggest that negative answers should be cached like positive answers to address this issue. I'm not sure what kind of a TTL should be put on negative query caches; perhaps there could be a force update command that would clear a negative query cache when that domain name got registered.
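A minimal sketch of what negative caching with a TTL could look like; the 5-minute TTL is an arbitrary choice of mine, not something the paper prescribes:

    import time

    NEGATIVE_TTL = 300                # seconds; placeholder value

    _negative_cache = {}              # name -> expiry timestamp

    def cache_nxdomain(name):
        """Remember that this name was reported as nonexistent."""
        _negative_cache[name] = time.time() + NEGATIVE_TTL

    def is_known_missing(name):
        """True if we recently learned this name does not exist."""
        expiry = _negative_cache.get(name)
        if expiry is None:
            return False
        if time.time() >= expiry:
            del _negative_cache[name]     # stale entry; ask the name servers again
            return False
        return True

    cache_nxdomain("no-such-host.example.com")
    print(is_known_missing("no-such-host.example.com"))   # -> True (until the TTL expires)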

Their successes:

-- variable depth hierarchy
-- organizational structuring of names
-- datagram access
-- caching proved crucial!!

Shortcomings:

-- type & class growth -- either demand for new types & classes was misunderstood, or DNS makes new definitions too hard
-- upgrading from HOSTS.TXT -> DNS was hard, but seems like that isn't a long-term issue
-- distribution of control

I like this paper a lot, especially the "surprises" section. The negative caching is really interesting, I would never have guessed that.

Related paper that might be interesting: Protecting Browsers from DNS Rebinding Attacks. This paper is interesting, but I've never fully understood how it works. The basic idea is that Attacker.com wants to try to gain access to an internal resource (eg internal.corporate.com). The attacker answers DNS queries for attacker.com with a really short TTL. After the user navigates correctly to attacker.com, the attacker then starts answering DNS queries for attacker.com with the IP address for internal.corporate.com (eg 10.10.10.105). A script on the loaded attacker.com website will issue a request for materials at attacker.com; this will then actually point to 10.10.10.105 and the attacker will successfully get the content. (Normally this operation should be denied, because attacker.com and internal.corporate.com are not the same domain.) The part I am confused about here is the administration of DNS. Why can any arbitrary person claim that any random IP address belongs to their domain name?

---------------

Class notes:

-- DNS is eventually consistent
-- Caching: scalability, reliability in outages, speed up lookups
-- Caching done in client resolution software + also in name servers

Wednesday, November 4, 2009

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

[paper]

This paper presents Chord, a distributed lookup protocol. (Their routing algorithm was described as "skiplist-like routing" by the previous paper.) They use "consistent hashing" to assign IDs to nodes, in an attempt to load balance key assignments over nodes. One of their goals is scalability: each node should only need to know about a small number of other nodes. Also, they need to handle nodes leaving and entering the network without complete degradation of performance.

Each key has a "successor node", which is the node whose ID is equal to or most closely follows the key's ID. When a node joins the network, keys previously assigned to other nodes now become assigned to n if the new node's ID is closer to theirs. When a node leaves the network, its keys are reassigned to the next closest. Nodes maintain the ID of the next successor as well as a routing table that helps find a node closer to the key during a lookup propagation operation.

Looking Up Data in P2P Systems

P2P networks raise the "lookup problem": how can you locate data on the P2P network without a centralized server or hierarchy? Simplest form of lookup: searcher broadcasts a message to all neighbors with a request for X. Each neighbor checks its local database; if it doesn't have the item, it forwards to its neighbors, and repeat. However, all these broadcast messages will take up a lot of bandwidth.

Distributed hash tables (DHTs) offer a solution. In a DHT, all keys & nodes have IDs. Each key is stored at a node with an ID that's "close" to the key's ID. Each node sends a lookup request to another node "closer" in the ID space until the lookup request gets to the right place. However, nodes need a routing algorithm to be able to *find* a closer node.

-- Skiplist-like routing: Each node has a table of node IP & ID pairs of nodes whose IDs are 1/2 around the ID space, 1/4 around, 1/8 around, etc. A searcher node will send a lookup request to the node with the closest ID that does not exceed the key ID.
-- Tree-like routing: Each node has a randomly assigned ID, and each node knows other nodes whose IDs share prefixes of various lengths with its own. A lookup is forwarded at each step to a node whose ID shares a longer prefix with the key, using this tree-like structure to home in on the responsible node.
-- Multi-dimensional routing: Not sure I quite follow this. A d-dimensional Cartesian coordinate space is partitioned into hyper-rectangles (zones); each node owns a zone and is identified by the boundaries of its zone. A key is mapped to a point in the d-dimensional space and is stored at the node that owns the zone containing that point. The lookup follows the straight-line path through the coordinate space from the searcher to the key.

I found this paper interesting because I've been wondering for a while how DHTs work, and this is a review of what seem like important topics for understanding anything about DHTs in more detail. I really like that this paper was on the reading list.

Wednesday, October 28, 2009

Active network vision and reality: lessons from a capsule-based system

[paper]

In an active network, packets carry code that is executed by the routers that handle the packet while it is in transmission. Nodes in an active network need to be able to understand and execute packet code, but it is possible to mix and match active and normal routers. Packets can therefore react to changing network conditions and choose new routing paths for themselves. This is very much the opposite of the "dumb core" from the end-to-end argument.

ANTS is the active network system developed by the author. A "capsule" is an extended packet that has code associated with it. In their design, capsules do not actually carry code. Rather, each active node has a set list of forwarding functions that it will execute, and each capsule says which forwarding function should be used for the packet. New forwarding functions can be provided for new services; these functions can query the router about network state. They suggest that code should be certified by an overseeing authority. Code is always executed in a sandbox.

- They do not actually want to load all possible functions in all active nodes (memory constraints), nor do they want packets to actually carry the executable code (certifying the code every time would be slow). Instead, they cache the most recently seen functions. Since they expect a small number of functions to be very popular, this would be good for those popular functions. For less-popular functions, the code is requested from the last active node that the packet was at, which (hopefully?) will still have the code.

- Forwarding functions are identified by their MD5 hash. This is an old paper, so this was probably OK at the time. However, MD5 shouldn't be used anymore. Also -- at first I was concerned because a hash function alone will only provide integrity but say nothing about authenticity -- you need an HMAC if you want both. However, after thinking about it, authenticity doesn't really matter in this scenario, we really do just care about integrity.

- They duly note that active networks won't work very well when you have end-to-end encryption and your function depends on packet contents. Seems like this will be an increasingly big problem.

They do an evaluation. Their use of Java is a significant cause of poor performance; they compare a Java relay to a C relay and the C relay has a lot more throughput (4x) and much lower latency (~5x). However, processing forwarding functions doesn't add much on top of that, although it does seem like a poorly-written forwarding function could do much worse than their example forwarding function. Use of C would make it harder to sandbox the code but it is certainly still possible.
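To pin down the code-distribution idea from a couple of paragraphs up, here is a toy LRU cache of forwarding functions keyed by their hash; the class, the capacity, and the fetch callback are hypothetical, not ANTS's actual implementation:

    from collections import OrderedDict

    class FunctionCache:
        """Demand-loaded cache of forwarding functions, keyed by code hash."""
        def __init__(self, capacity=64):
            self.capacity = capacity
            self._cache = OrderedDict()       # code hash -> verified forwarding function

        def get(self, code_hash, fetch_from_previous_node):
            """Return the forwarding function, fetching and caching it on a miss."""
            if code_hash in self._cache:
                self._cache.move_to_end(code_hash)        # mark as recently used
                return self._cache[code_hash]
            code = fetch_from_previous_node(code_hash)    # ask the upstream active node
            self._cache[code_hash] = code
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)           # evict the least recently used
            return code

    cache = FunctionCache(capacity=2)
    fn = cache.get("hash:abcd", lambda h: (lambda capsule: "forward " + capsule))
    print(fn("capsule-1"))                                # -> "forward capsule-1"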

Comment: I don't think the paper explained the concept of an "active network" well. I was confused about whether the point was to use the network for arbitrary computation or to change the way routing works. I ended up reading the Wikipedia article and this GE page. I think the authors should have defined "capsule" in the introduction so that I didn't have to read the third section to understand the intro. Sections 3 and onwards are quite good, but I found the intro & background less than helpful for someone who didn't know anything about active networks.

--------------

Class Notes

- Things you might want to do in a network: app-relevant metrics; DOS prevention with capabilities; NAT address translation; media transcoders; compression/decompression.
- Many of those apps can be done with a box on either end rather than in every router.
- No killer app! However, this does sort of happen near the edges, but at the application layer.
- PROS of active networks vs application-layer services: common way of implementing services; potentially more efficient since packets stay in their "natural path"
- CONS of active networks vs application-layer services: performance (esp latency), security, isolation (if one crashes, do they all?), state

Resilient Overlay Networks

[paper]

Problem: BGP's fault recovery mechanisms take several minutes; new routes are slow to converge. Outages can be 10+ minutes. This is a problem for distributed applications.

Solution: RON, a Resilient Overlay Network. A RON node constantly monitors the quality of the path ("virtual link") between it and other RON nodes in terms of latency, throughput, and packet loss rate. RON networks are meant to be small so each node can keep track of alternate routes as well as primary routes. Each RON node is a forwarder as well as a client. Interesting point: some applications may be more fault tolerant than others, so applications can define their own "failure" settings and prioritize certain metrics over others.

- Fast detection & recovery. They want to recover from link or route failures in seconds. When a probe is lost, a series of high-frequency probes are sent. If the link is dead or bandwidth is degraded by 50%+, and a good alternative path is available, then it will use the alternative path to send data. A potential problem is "flapping" between two routes; you don't want to frequently switch because that could cause a lot of reordering. To solve this, they favor the "last good route."

Evaluation: Looked at two RON networks, one with 12 nodes and one with 16. In an experiment averaging packet loss rate over 30-min intervals, they showed that BGP routing suffered 32 significant outages and RON suffered none. They also present RON Win/Loss/No change statistics based on loss rates and it seems like RON does very well.

An interesting point about this paper is that they assume that RON networks will be small. I question whether this makes a lot of sense given that they are targeting distributed applications. Does Google do distributed computing across different data centers? That would be much larger scale. I guess Google has their own fiberoptic network though so they wouldn't need to worry about these BGP-based routing concerns. But there must be large distributed applications that don't have their own networks, too.

--------------------

Class Notes

- overlay: network on top of a network
- application-specific routing means that you get to pick your own routing metric
- RON's concern is reachability, period, not latency.
- BGP doesn't broadcast alternate routes between ASes. this is how RON gets its improvement: it can identify a path through a different set of ASes that avoids the problem. this means that you need one overlay node in each AS that you care about; two nodes in the same AS won't see any improvement.
- overlays over overlays cause fairness problems. if they all see an area with no congestion and try to use it at once, then all of a sudden that will be congested.
- scalability issue with RON because of their active probing. won't scale up to more than 50 nodes.

Tuesday, October 20, 2009

White Space Networking with Wi-Fi like Connectivity

[paper]

The problem: How can we use UHF white spaces (unregistered portions of the wireless spectrum) for network connectivity? This is more challenging than standard WiFi because:
  • Spatial variation in spectrum matters more in white spaces because the FCC requires non-interference with wireless transmissions of incumbent users (e.g., TV).
  • White spaces are more likely to have spectrum fragmentation, with each fragment potentially being a different width.
  • Temporal variations because of wireless microphones. Need to immediately switch to another channel when wireless mic is transmitting.
Their proposal: WhiteFi adaptively configures itself to handle the above challenges.
  • They propose an adaptive spectrum assignment algorithm that periodically checks space availability of both the access point AND the clients. Every AP & client maintains a spectrum map that knows what channels are in use and an estimate of how much each channel has been used. Clients periodically send the AP updates on their spectrum map. An AP looks for a new channel if an incumbent starts to use part of the channel: this can either be voluntary (incumbent is causing minor performance degradation) or involuntary (a wireless mic has completely killed the channel). When a switch happens, the AP looks for the best available space as determined by both the AP's and its clients' spectrum & availability maps. The AP then broadcasts the choice to all users. (Question: how does this broadcast work if the wireless mic is totally killing the channel?)
  • Variable width channels make AP discovery hard. They propose SIFT (Signal Interpretation before Fourier Transform) to determine packet widths based on signal amplitudes. They can then search the space using SIFT, which will figure out the width of the channels.
  • This one answers my earlier question about how to handle an involuntary disconnection. Each AP maintains a backup channel. If there is a disconnection, the AP will broadcast the new channel information on the backup channel and/or the client will chirp that it's been disconnected (depending on who first discovers the disconnection).
Evaluation: In a small experiment, they showed that switching on a mic causes a disconnection between a client and an AP of at most 4 seconds. They also used a simulator to examine their spectrum assignment algorithm and find that, in practice, it performs nearly as well as the theoretical optimal. (Figures 11 and 12 look pretty impressive.)

--------------------

Class notes:

White space: Lower frequency. Many frequencies are assigned but not used widely. Digital TV opened up a lot of space between channels because it is compressed with coding. Great frequencies for wide area transmissions.

Wifi = ISM

Interactive WiFi Connectivity For Moving Vehicles

[paper]

Problem: This paper looks at how to best provide Internet connectivity to moving (i.e., vehicular) clients in a city where WiFi is ubiquitous. Their work is tested on and tuned for a city network with many overlapping base stations and bursty and independent packet losses across senders and receivers. They assume base stations can talk to each other but are bandwidth-constrained. They also want to avoid batching since they claim it increases latency & they want to support highly interactive mobile applications.

Past approaches: The standard "hard handoff" approach is to connect to a single base station and then switch to a new, "better" one when the first connection becomes too "bad". This paper evaluates five different hard handoff strategies: RSSI, BRR, Sticky, History, and BestBS. These strategies differ in how they determine when to switch between connections:
  • RSSI - base station strength (note: standard familiar metric),
  • BRR - highest exponentially averaged beacon reception ratio,
  • Sticky - stays with a base station until disconnected for a set time period, then picks highest signal strength,
  • History - historic best average performance for a location,
  • BestBS - for each 1sec period, uses BS with best predicted future 1sec reception ratio.
They compare these hard handoff strategies to a macrodiversity handoff method, AllBSes, which opportunistically uses all BSes in the vicinity.

Number of packets delivered: AllBSes had the highest average number of packets delivered per day, but is within 25% of all the other methods except for Sticky, which (perhaps predictably) had by far the lowest. Distribution of session length: Median session length of AllBSes is 2x BestBS, 7x over BRR. However looking at the full distribution graph (Fig 3d), I'm not sure whether "median" properly captures the difference; AllBSes doesn't look that much better than BestBS on the graph. Periods of disconnection: On their drive, BRR had 6 disconnections, BestBS had 2, AllBSes had 0.

Their proposal: Inspired by the success of AllBSes, the authors propose ViFi. ViFi picks one main "anchor" base station using any strategy (their implementation uses the highest exponentially averaged beacon reception ratio like BRR). The mobile client can then designate other BSes as "auxiliary" connection points. If an auxiliary point hears a packet sent between the client & anchor but not an ACK, the aux BS probabilistically relays it. This is often better than retransmission by the original sender because (assuming bursty losses) if the original was lost then the resent packet is likely to be lost too. However, an aux tries to only relay (as modeled by the probability computation) if other auxiliaries are unlikely to relay, given that auxes with better connectivity have preference. Each BS knows the anchor and aux set for the client because it is periodically broadcast. Also, a new anchor will check for packets at the old anchor, which they call "salvaging." Note that ACKs may take longer than usual since there's an optional relay step in the middle that doesn't require retransmission. Also, packet reordering may occur; they claim that this is infrequent enough to not hurt TCP performance.
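A very loose sketch of the flavor of the relaying decision: weight each auxiliary by its delivery quality to the destination and scale so the expected number of relays is about one. This is my simplification, not the paper's exact probability computation:

    import random

    def relay_probability(my_quality, all_aux_qualities):
        """Probability that this auxiliary relays the overheard, un-ACKed packet."""
        total = sum(all_aux_qualities)
        if total <= 0:
            return 0.0
        # If every auxiliary heard the packet, the expected number of relays is ~1.
        return min(1.0, my_quality / total)

    def maybe_relay(my_quality, all_aux_qualities):
        return random.random() < relay_probability(my_quality, all_aux_qualities)

    # Two auxiliaries with delivery qualities 0.9 and 0.3 to the destination:
    # the better-connected one relays three times as often.
    print(round(relay_probability(0.9, [0.9, 0.3]), 2))   # -> 0.75
    print(round(relay_probability(0.3, [0.9, 0.3]), 2))   # -> 0.25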

Comment: This reminds me of ExOR, although here they don't go for "luckiness" (the case where a packet goes unexpectedly far) -- they just try to fix "unluckiness" (a packet doesn't make it far enough) with eavesdropping on broadcast packets. Both need to pick a set of points to interact with, and each node needs to determine whether other nodes might be transmitting. ViFi is simpler though: it chooses a primary point (the anchor) and doesn't require slotting.

-------------------

Class notes:

Probabilistic choosing:
-- Which node is supposed to forward the packet?
-- Try to make probability that it is relayed = 1.
-- Trying to minimize # of redundant transmissions, but that is LESS important than making sure it is relayed -- transport layer protocols are usually resilient to duplication.

Salvaging has a big impact on performance because TCP is loss intolerant.

Wednesday, October 14, 2009

XORs in The Air: Practical Wireless Network Coding

[paper]

This paper uses the idea of network coding to improve throughput on wireless networks in the form of COPE. The introductory example they give is Alice and Bob each sending each other packets through 1 router. This takes 4 transmissions. Instead, the router XORs the two packets together and then broadcasts the XOR value. This is 3 transmissions. Basically, their idea is to reduce the number of packets sent by doing opportunistic coding so that end points can "figure out" what the message is without actually needing to see the message. The goal is to combine as many packets as possible into one transmission. Other approaches try to minimize the number of broadcasts to raise throughput; they instead capitalize on the idea of broadcasts as a way to minimize the overall number of transmissions. XOR is a quick, cheap operation with hardware support.
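The Alice-and-Bob example is easy to see in code: the relay broadcasts one XOR-ed packet, and each endpoint XORs it with the packet it sent to recover the other's packet. Equal-length byte strings are a simplification of mine (COPE has to deal with unequal packet sizes, among other things):

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    alice_pkt = b"hello bob!"        # Alice -> Bob, via the relay
    bob_pkt   = b"hi, alice!"        # Bob -> Alice, via the relay

    coded = xor(alice_pkt, bob_pkt)  # the relay makes one broadcast instead of two unicasts

    # Bob already knows bob_pkt (he sent it), so he can recover Alice's packet:
    assert xor(coded, bob_pkt) == alice_pkt
    # And Alice recovers Bob's packet the same way:
    assert xor(coded, alice_pkt) == bob_pkt
    print("3 transmissions instead of 4")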

Something that was not immediately clear to me is how this works with hidden terminals. I think the solution is that nodes learn neighbor state with "reception reports", so the encoder does not assume that 2 nodes it can communicate with can also see each other's transmissions. It still seems to me though that they are going to have hidden terminal issues.

To decode, each node has a packet pool with packets it has recently received or sent; the packets are hashed and keyed by packet ID. When it receives an encoded packet, it gets the relevant packets from the packet pool if they have not yet been garbage collected. Also, they do hop-by-hop acks and retransmissions.

Their results from different toy topologies are a 1.1-1.3 throughput gain for Alice-and-Bob and X-topologies with long-lived TCP flows, and 1.35-1.5 throughput gain for the cross topology. They do a lot better with UDP flows -- 1.58-2 for Alice-and-Bob, 1.5-1.75 for X-topology, and 2-4 for cross. All of these throughput gains are less than the theoretical Coding+MAC gains because of header overhead. In a 20-node wireless ad hoc network, TCP didn't have much improvement with coding (2-3%) because of TCP backoff after a collision loss. This is a hidden terminal problem. When they bring all the nodes closer together and remove hidden terminals, they get a 38% improvement. They get a 3-4x improvement for UDP when congestion is high.

-----------------

Class notes:
-- topology matters - wheel is best
-- symmetry also matters -- need symmetry to do coding
-- hosts need to cache for long enough, and the sending proxy needs to keep track of cached state correctly

ExOR: Opportunistic Multi-Hop Routing for Wireless Networks

[paper]

"Traditional" routing protocols choose a single path for a packet before sending the packet. In contrast, "cooperative diversity schemes" simultaneously transmit information through multiple paths and the route may dynamically change. ExOR is a cooperative diversity scheme geared towards highly lossy wireless connections.

At each hop, ExOR broadcasts each packet and then chooses the next hop in the path based on who actually received the packet. The benefit of broadcasting is to increase the chance that at least one node successfully receives the packet. Unlike standard cooperative diversity schemes, only one node is chosen to forward the packet (the recipient nodes agree among themselves on which of them should forward it). ExOR batches packets together, and each packet carries a list of preferred next hops. If not all of the batch makes it to one node, then other nodes that have the remaining packets will forward those packets separately. Once 90% of the packets have reached the destination, traditional routing is used to send the last 10%. The forwarder priority list is built using ETX computed with only the forward delivery probability.

Their evaluation was conducted on Roofnet. They measure throughput between 65 random node pairs. First to kickstart ETX they broadcast 1500-byte packets every 10 seconds for 10 minutes and distributed the delivery probabilities. Then each selected pair transferred a 1MB file using traditional routing and then 1.1MB using ExOR. They used the different sizes because in actual ExOR the last 10% is transferred using traditional routing; they chose to omit that part and only transfer the first 90% of the 1.1MB. This is kind of weird because .9*1.1=.99, not quite 1MB. This is also weird because their actual protocol isn't pure ExOR, it's a 90/10 mix of ExOR and traditional routing; you'd think they would want to measure the actual end-to-end performance of their system.

I couldn't figure out what the heck Figure 6 is showing. Why did they only show part of the second batch? Why does it look like 2 packets left N5 and 3 ended up at N24? Shouldn't N24 have only received one copy of each packet? In general though their other results show that ExOR improves throughput, with more improvement for longer traditional routes. The main benefit seems to come from ExOR's ability to use long-distance, lossy, and asymmetric links that otherwise wouldn't have been chosen by a traditional routing protocol. I wonder though whether that would be true if the traditional routing protocol used something other than ETX?

When I first read the design of ExOR, I had a question: At what point is a longer successful path (including the cost of an agreement protocol) faster than the cost of retransmission? This seems like it would be a function of how long the path is and how likely it is for the retransmission to also fail. However looking at their evaluation I get the sense that their routes must not be much longer than the traditional routes.

Also -- I am wondering about doing traditional routing but with retransmission for each hop (meaning each interior node checks to see if the next hop received it or not and it retransmits in case of failure; but only one copy is sent out). How would that compare with ExOR?

----------------------

Class notes:
-- trying to exploit luckiness - maybe a packet will make it extra far! also unluckiness - maybe it didn't make it as far as you'd expect, but still made it a bit closer to destination.

0. determine batch size / forwarding group
1. forwarding order. topo order based on ETX to DST
2. schedule transmission of local cached packets
3. approx shared view of which node has which packets

Problems with TCP:
- batching requires more packets than typical TCP window
- changes RTT estimates a lot; will make TCP retransmission timer confused
- lots of packet reordering probably

Thursday, October 8, 2009

A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols

This paper compares four different routing protocols for ad hoc networks: DSDV, TORA, DSR, and AODV. They extend a simulator to model node mobility, a radio propagation model, radio network interfaces, and MAC DCF (which, like MACA(W), tries to reduce the probability of collisions due to hidden terminals).

DSDV: "hop-by-hop distance vector routing protocol requiring each node to periodically broadcast routing updates." Each node maintains a routing table with info about its neighbors.

TORA: goal is to find routes quickly and react to mobility changes; shortest path routing is secondary to this goal. When a node needs a route, it broadcasts a packet asking for the route which propagates through the network until it reaches the destination or a neighbor of the destination. At that point, the destination/neighbor sends back a packet with a "height" parameter that is incremented as it goes back to the requester.

DSR: source routing, so intermediate nodes don't need to maintain a whole bunch of state. Each packet has all the routing information it needs (this seems like it might be dangerous in a mobile network with rapidly changing topology). For path discovery, route requests are broadcast & the destination responds. Route maintenance occurs periodically to discover broken paths.

AODV: combination of DSR and DSDV. When a node needs a route, it broadcasts a ROUTE REQUEST. Anyone with a path will respond with that route and create a reverse route along the way to connect the two.

Simulation results... (1) Unsurprisingly, mobility impairs the performance of all of the routing protocols. (2) Huge variation between protocols in terms of overhead -- TORA (most) has an order of magnitude more than DSR (least). (3) DSDV does poorly with high rates of mobility when links are broken. (4) Despite its huge overhead, TORA is the most reliable, delivering 90+% of packets.

A High-Throughput Path Metric for Multi-Hop Wireless Routing

This paper presents the ETX (Expected Transmission Count) metric, which finds the most reliable path on a multi-hop wireless network. This metric takes into account problems that cause loss on the network, unlike a simple minimum hop-count. The idea is that a longer path might still be faster than a shorter path if the longer path is more reliable. They attempt to calculate the expected number of retransmissions using packet loss measurements.

They set up a 29-node wireless testbed across five floors of an office building. They then showed that a minimum hop-count metric did not find the highest-throughput route. I'm a little suspicious of their calculation of the highest-throughput route after the fact; they have to estimate the interference on each route. I also found Figure 2 really darn confusing. It seems like the min hop-count metric gets closer to the highest-throughput as the packets per second increases, and at one point (around 200-250 pps) EXCEEDS the highest throughput??

Their experiment on the testbed shows that their design of a new metric needs to account for a wide range of loss ratios; asymmetric loss ratios; and interference between successive hops in a multi-hop path. The ETX for a single link is the predicted number of data transmissions based on the forward and reverse delivery ratios of the link. The ETX for a route is the sum of the ETX for each link in that route.
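The arithmetic is simple enough to write down; the delivery ratios below are invented for illustration:

    def link_etx(forward_ratio, reverse_ratio):
        """Expected transmissions (incl. retransmissions) for one link: 1 / (d_f * d_r)."""
        return 1.0 / (forward_ratio * reverse_ratio)

    def route_etx(links):
        """links is a list of (forward_ratio, reverse_ratio) pairs along the path."""
        return sum(link_etx(df, dr) for df, dr in links)

    # A 3-hop path of decent links can have a lower ETX than a single very lossy hop:
    lossy_one_hop  = route_etx([(0.25, 0.9)])        # ~4.4 expected transmissions
    good_three_hop = route_etx([(0.9, 0.9)] * 3)     # ~3.7 expected transmissions
    print(round(lossy_one_hop, 2), round(good_three_hop, 2))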

This paper fits into our ongoing discussion of how the network stack is broken up into layers. ETX assumes that the link layer is making retransmissions (since they are trying to minimize the number of retransmissions). ETX doesn't route around congestion; they say this is a bonus because it avoids oscillations, but I wonder if this actually could be a useful feature?

They then implement ETX (as part of both DSDV and DSR) and compare the paths it finds to min hop-count. I find their graphs pretty impossible to read -- I can't tell which line is which in Figure 6. Looking at Figure 7, it appears that ETX did perform better (at least on paths that were longer than one hop). One question I'd have about their methodology is whether a network should really be modeled as just random pairs of communicating nodes; it seems like network usage might be clustered in certain areas (i.e., one node is getting a lot more traffic), which would change interference patterns.

Tuesday, September 29, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

[paper]

This paper looks at the same problem as the previous one (TCP incast collapse in data centers) with a similar solution (reduce the RTO timer to be closer to the RTT, randomize min RTO timer value, and set a smaller & randomized backoff). They found that a smaller & randomized multiplier for the exponential backoff didn't make any difference, nor did randomizing the min RTO timer value.

-- This paper is extremely similar to the previous paper. They are both examining incast collapse in data centers running highly parallel applications, with a proposed solution of reducing the RTO timer to be closer to the RTT. The difference is the workload (this paper has fixed fragment size, the other has fixed block size), and the two papers consequently show slightly different results. However the work is still so similar that I am surprised that both are published. (This paper does not feel to me like it was framed as a follow-up although I can imagine thinking of it that way? It seems like they were concurrent projects and then the other one got published so they threw in some disclaimers about how their workloads are different. The only section that makes me think this might actually be a follow-up is 5.2; but the introduction and background did not give me that impression.) One other difference between the experimental setup of the two papers was the RTT. In the other paper they report really small (hundreds of microseconds) RTT so they need high-resolution timers. In this paper they have higher (milliseconds) RTT numbers. This matters since both are trying to adjust the RTO relative to the RTT.

-- I think this paper gives a better introduction to the problem, although the other paper is more specific about the terms that are necessary for the problem to arise (see their 3 conditions).

-- The graphs in Figure 2 are very ugly...and the scale in Figure 3 does not seem very good.

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

[paper]

Problem: TCP incast collapse in data centers occurs when (1) workload is highly parallel & barrier-synchronized, (2) cheap switches have small buffers on a high-bandwidth network, and (3) servers are returning small amounts of data per request. Bursty traffic fills switch buffers, causing packet loss & TCP timeouts. Timeouts last hundreds of milliseconds (notable in a DC where round trip time is very very small due to #2 and #3). Protocols that require synchronization block waiting on these timeouts before issuing any new requests. Timeouts + subsequent delay severely reduce application throughput.

Previous solutions: Get bigger buffers ($$). Ethernet flow control ("dangerous across inter-switch trunks because of head-of-line blocking"). Reducing the minimum retransmission timeout had not been thoroughly explored, so that is what this paper tries.

Their solution: Make timeouts much finer-grained by modifying the TCP implementation -- decrease or eliminate the minimum RTO, and desynchronize retransmissions so the fix scales. The funny thing is that the retransmission granularity they need is finer than the standard Linux kernel timer, so they implement high-resolution timers.

Evaluation: Making the RTO very small (as small as the avg RTT) works well in both their simulated data & real cluster data, although it is less effective as the number of concurrent senders scales up. They find that the cause of this is many flows timing out and retransmitting simultaneously; they fix this by desynchronizing retransmission.
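To make the proposed mechanism concrete, here is a minimal sketch of a microsecond-granularity, randomized RTO computation (my own illustration in Python with hypothetical parameter values, not the authors' kernel code):

```python
import random

def fine_grained_rto(srtt_us, rttvar_us, min_rto_us=200, desynchronize=True):
    """Sketch of a fine-grained retransmission timeout, in microseconds.

    srtt_us / rttvar_us: smoothed RTT and RTT variation, in microseconds.
    min_rto_us: minimum RTO, shrunk from the usual ~200 ms down to ~200 us.
    desynchronize: add a random component so that flows whose packets were
                   dropped in the same burst don't all retransmit at once.
    """
    rto = max(min_rto_us, srtt_us + 4 * rttvar_us)   # standard RTO formula
    if desynchronize:
        rto *= 1 + random.uniform(0, 0.5)            # randomize the retransmission time
    return rto

# With a ~100 us RTT the timeout stays in the microsecond range,
# instead of being floored at the default 200 ms minimum RTO.
print(fine_grained_rto(srtt_us=100, rttvar_us=20))
```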

-- Question: won't more memory become cheap soon enough that this won't be a problem anymore? Or will the communication speed keep increasing faster than memory's cheapness?

Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

[paper]

A basic tree topology for a data center partitions intra-DC communication. It's slow for the "far left" server to talk to the "far right" server. The authors of this paper want a uniform communication cost between all hosts in the data center to promote agility, which means that any host can run any service at any given time. (This means that a DC can dynamically handle spikes in load; no pre-allocation is necessary.) To this end, they suggest using a Clos network topology, which essentially means that each aggregation switch is connected to all of the (many) intermediate switches. Shortest-path-first routing is inadequate here; they use Valiant Load Balancing (VLB) so that traffic is spread across all the available paths rather than piling onto a single shortest path. VLB basically chooses a random path from the set of equivalent paths for each flow so that no particular path becomes overwhelmed. They call the use of VLB over a Clos network "VL2".
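Here's a minimal sketch of the flow-level random path selection idea (my own illustration with made-up switch names, not VL2's implementation): hash each flow's 5-tuple to pick an intermediate switch, so different flows spread randomly across the Clos core while all packets of a given flow stay on one path and avoid reordering.

```python
import hashlib

INTERMEDIATE_SWITCHES = ["int-0", "int-1", "int-2", "int-3"]  # hypothetical Clos core

def pick_intermediate(flow_five_tuple):
    """Valiant-style load balancing at flow granularity:
    hash the flow's 5-tuple to choose a random-but-stable intermediate switch."""
    key = "|".join(str(field) for field in flow_five_tuple).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return INTERMEDIATE_SWITCHES[digest % len(INTERMEDIATE_SWITCHES)]

# Two different flows likely bounce off different intermediate switches,
# but every packet of a given flow takes the same path.
print(pick_intermediate(("10.0.0.1", "10.0.1.2", 51000, 80, "tcp")))
print(pick_intermediate(("10.0.0.3", "10.0.1.9", 51001, 443, "tcp")))
```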

They do some measurements on an unspecified Microsoft data center to motivate their work. They show that 80% of DC traffic is internal, meaning you do want to focus on decreasing intra-DC transit costs. They also show that flows are small and uniformly distributed, meaning that (A) the system can work by load balancing flows instead of individual packets and (B) you don't need to support locality for the intra-DC traffic. They did a traffic analysis and found no regularity or patterns in the traffic, and they also found that traffic demands shifted a lot. I think this is a really good approach to motivating research -- they have actual hard evidence that the problem they are trying to solve exists. (When reading papers outside of my own area, I am nervous about accepting whether a problem really exists -- since I have no knowledge to judge it myself.) The one downside is that they don't say WHAT application this is for, so maybe you would see totally different patterns with a different type of application.

Monday, September 21, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

[paper]

PortLand is a follow-up to SEATTLE. They address the issue of networking within a data center. They want both the plug-and-play properties of Ethernet and the ability to support a huge number of end hosts. (Layer 2 networks are easier to administer than Layer 3 networks since they can use flat MAC addresses, but Ethernet bridging doesn't scale because of its broadcasting and because spanning-tree forwarding can't take advantage of topologies with multiple equal-cost paths.) They improve on SEATTLE by adding plug-and-play switches (not just plug-and-play hosts) and removing forwarding loops and all-to-all broadcasts. Their improvement over SEATTLE stems from the observation that data centers are physically organized into multi-rooted trees; SEATTLE was intended for the more general case. The topology is the tree structure we discussed in the last class.

With PortLand, each end host gets a hierarchically-assigned PMAC (pseudo MAC) address that encodes the location of the host in the topology. ARP requests return PMAC addresses rather than actual MAC addresses, and egress switches rewrite PMACs back to AMACs (actual MACs). Edge switches get "pod" and position numbers. Forwarding is done using this topological information to avoid loops. PortLand also has a centralized "fabric manager" that maintains soft state about the network topology. Since it's soft state, (A) the fabric manager does not need to be configured, and (B) replicating the fabric manager does not require complete consistency with the original fabric manager. If a switch sees an ARP request, it forwards it to the fabric manager, which looks up the PMAC in its table and sends it back to the switch; this avoids the need for broadcast ARPs. If the fabric manager doesn't have the PMAC information due to a recent failure, it resorts to a broadcast.
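Here's a small sketch of how a location-encoding pseudo MAC could be packed (my own illustration; the field widths are my recollection of the paper's layout, so treat them as an assumption): because the pod and position are encoded in the address, switches can forward on an address prefix instead of keeping per-host state.

```python
def make_pmac(pod, position, port, vmid):
    """Pack location fields into a 48-bit pseudo MAC:
    16 bits pod | 8 bits position | 8 bits port | 16 bits vmid
    (field widths are my assumption about the PortLand layout)."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    raw = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{(raw >> shift) & 0xff:02x}" for shift in range(40, -1, -8))

# A host on pod 2, edge switch position 1, switch port 3, first VM:
print(make_pmac(pod=2, position=1, port=3, vmid=1))  # 00:02:01:03:00:01
```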

-- Unlike the SEATTLE paper, the authors evaluated their work using microbenchmarks (or what I'd call microbenchmarks -- not sure if that's actually the appropriate term) rather than a full simulation. I'm not sure if this approach is very valid since there is no other traffic on the network, so nothing is being stressed. However...I guess the purpose of a full simulation isn't to provide stress, since the loads for stress testing would be different from regular traffic.

-- I like this paper, I think their use of topology information is simple and clever.

POST-CLASS EDIT:
I didn't realize that the fat tree idea was crucial to their idea, since they try to play it off as more generalizable ("the fat tree is simply an instance of the traditional data center multi-rooted tree topology....We present the fat tree because our available hardware/software evaluation platform is built as a fat tree."). I understand that maybe they were trying to forestall criticisms about tying their work too closely to fat tree, but I think that it's kind of deceptive and it hurt my understanding of the paper. I took their dismissal of the fat tree topology fairly seriously and didn't pay much attention to it, whereas after our discussion in class it seems like routing for a fat tree in particular is their main contribution.

Thursday, September 17, 2009

Detailed Diagnosis in Enterprise Networks

[paper]

NetMedic is a network diagnostic tool that is supposed to find the source of network faults at the granularity of a responsible process or firewall configuration. (So not only does it identify the host, but it identifies the culpable application running on the host.) I found this paper confusing because I couldn't figure out where they were planning on placing NetMedic -- is it somewhere in the network, or is it running on host machines? If it's the former, then how does it get enough application information? If it's the latter, then how can you expect it to work correctly given that the host machine is the one experiencing the problem (e.g., it may have a lack of network connectivity)? Furthermore, if it's on the host, what if NetMedic itself interferes with other host applications in some unexpected way? The implementation section finally concretely states this (it was implemented on the host using OS tools), but I was confused until then.

Basically they try to correlate the fault with component state and with the interactions between components, and a high correlation is then treated as causation. A "component" here is any part of the system, and their tool automatically infers relationships/dependencies between components. This seems like it would not work well (given that correlation != causation). However, they did find the correct component in 80% of their cases. Interestingly enough, they didn't have any direct competitors to compare their tool against, so they implemented a different technique based loosely on previous work and compared themselves to that. Although unusual, I kind of like this approach to comparison... of course the downside is that since they are implementing both, they are biased and won't try as hard on the competitor.
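To make the correlation idea more concrete, here's a rough sketch of how a history-based dependency weight could be computed (this is my own simplification with made-up variable names and a made-up distance metric, not NetMedic's actual algorithm):

```python
import numpy as np

def edge_weight(src_history, dst_abnormality, src_now, k=10):
    """Find the k past time windows in which the source component's state
    vector looked most like its current (faulty) state, and report how
    abnormal the destination component was during those windows.
    A high value suggests the source is a likely culprit."""
    src_history = np.asarray(src_history, dtype=float)          # T x d state vectors
    dst_abnormality = np.asarray(dst_abnormality, dtype=float)  # length-T abnormality scores
    dists = np.linalg.norm(src_history - np.asarray(src_now, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return float(dst_abnormality[nearest].mean())

# Toy usage: 100 historical windows of a 3-variable component state.
rng = np.random.default_rng(0)
history = rng.normal(size=(100, 3))
abnormality = rng.uniform(size=100)
print(edge_weight(history, abnormality, src_now=[2.0, 2.0, 2.0], k=5))
```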

Floodless in SEATTLE

[Paper]

SEATTLE claims to "...[achieve] the best of both worlds: The scalability of IP combined with the simplicity of Ethernet." On one hand, Ethernet is simple to manage thanks to flat addressing. On the other hand, Ethernet doesn't scale well -- Ethernet bridges rely on broadcast messages (e.g., ARP requests and DHCP) & paths are confined to a spanning tree. To deal with this, network admins break up their large networks into small Ethernet networks connected by IP, or use VLANs. The authors assert that both of those solutions are not good enough (inefficient, and harder to manage since you need to worry about addressing). SEATTLE is intended to provide the same simple plug-and-play semantics as Ethernet while also scaling to large networks.

[ to be done: further summarization ]

My criticisms of their simulations:

-- For the packet-level simulation, they start with a real LBNL trace and then inject extra traffic into it. They have fake hosts sending traffic to random destinations. I find this curious considering they make this earlier claim: "In enterprise networks, hosts typically communicate with a small number of other hosts [5], making caching highly effective."
-- They also have to make some kludgy assumptions about the number of hosts connected to each switch due to the anonymization of the traces. I suppose they can't really help that; it's a problem for the field in general.
-- They have really huge error bars in many of their graphs. Is this normal for network simulations due to the nature of error in the physical links etc.? Notably, the SEATTLE results have tight error bars even when ROFL and ETH have huge ones (e.g., Figures 5a and 5c)... is their simulation somehow skewed in favor of their own system?


Sorry about the lack of summaries......I had prelims on Tuesday and have two papers due Friday......multi-tasking is not going so well.

Monday, September 7, 2009

Congestion Avoidance and Control

[Paper Link]

Networks at the time were suffering from widespread congestion, despite the fact that the TCP specification had a theoretically correct description of how to avoid it. The principle they wish to adhere to is the 'conservation of packets' principle, which states that a new packet should not be injected into the network until another packet has left it. Since the protocol itself adheres to conservation of packets but networks at the time were still failing under load, they observe that the failures must be due to implementation bugs. Because of these bugs, one of three things happens: a connection doesn't get to equilibrium, a sender injects a new packet before an old one has exited, or equilibrium can't be reached because of resource limits along the path.

Descriptions of the changes they made to the 4BSD TCP implementation:

Slow-start. A connection can fail to reach equilibrium if it acts wonky on startup. In steady state a host doesn't send its next packet until it has received an ACK; but a restarting host that fires off a full window of packets at once and then throttles downwards, or that starts at zero with no defined way of ramping up, can keep the connection from ever reaching equilibrium. Slow-start instead increases the sending rate gradually: the window starts at one packet and grows by one packet for each ACK that comes back.
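A tiny sketch of the slow-start ramp-up (my own toy model, not the 4BSD code): the congestion window starts at one segment and grows by one per ACK, which roughly doubles it every round trip until it hits a threshold (or, in reality, until loss or the receiver's window caps it).

```python
def slow_start(rounds, ssthresh=64):
    """Toy model: cwnd (in segments) starts at 1 and grows by 1 per ACK,
    i.e., it doubles each RTT, until it reaches ssthresh."""
    cwnd = 1
    history = [cwnd]
    for _ in range(rounds):
        acks = cwnd                      # one ACK per segment sent this RTT
        cwnd = min(cwnd + acks, ssthresh)
        history.append(cwnd)
    return history

print(slow_start(7))  # [1, 2, 4, 8, 16, 32, 64, 64]
```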

Round-trip-time variance estimation and exponential retransmit timer backoff. Sometimes a packet is lost and no ACK is ever generated. At what point should the sender decide to retransmit? Their round-trip-time variance estimation algorithm addresses this: the retransmit timer is derived from estimates of both the mean RTT and its variation. And if a packet needs to be retransmitted repeatedly, their exponential retransmit timer backoff algorithm handles that.
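Here's a minimal sketch of the mean/variation estimator plus backoff as I understand it (the EWMA gains and the 4x deviation term are the commonly cited values, so treat the exact constants as assumptions rather than quotes from the paper):

```python
class RttEstimator:
    """Jacobson-style RTT mean/deviation estimator with exponential backoff."""

    def __init__(self):
        self.srtt = None    # smoothed RTT
        self.rttvar = None  # smoothed mean deviation
        self.backoff = 1    # doubles on each successive retransmission

    def on_ack(self, measured_rtt):
        if self.srtt is None:
            self.srtt, self.rttvar = measured_rtt, measured_rtt / 2
        else:
            err = measured_rtt - self.srtt
            self.srtt += 0.125 * err                        # gain 1/8 on the mean
            self.rttvar += 0.25 * (abs(err) - self.rttvar)  # gain 1/4 on the deviation
        self.backoff = 1                                    # fresh ACK resets the backoff

    def on_timeout(self):
        self.backoff *= 2                                   # exponential retransmit backoff

    def rto(self):
        return (self.srtt + 4 * self.rttvar) * self.backoff
```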

More aggressive receiver ACK policy and dynamic window sizing on congestion. If recipients ALWAYS send an ACK upon receipt, then the lack of an ACK is effectively the network notifying the sender of congestion on that path. That near-guarantee of an ACK if all is well is what they mean by an aggressive receiver ACK policy. Additionally, they describe 'dynamic window sizing', i.e., how to grow and shrink the congestion window (additive increase, multiplicative decrease) to fill up the connection without overloading it.

-- Unlike the previous paper, none of the main schemes they discuss need the cooperation of anything in the core. They can infer congestion by dropped packets, so no specific indicator (or view of the network state) is necessary. The end-to-end argument is satisfied here.

-- In their future work, they address the issue of misbehaving hosts. They argue that ONLY with the participation of gateways can you ensure fairness, since you can't trust hosts to be fair without another agent to police the situation. The idea is that if a host starts using more than its fair share, the gateway starts dropping its packets.

-- The paper takes a very practical tone. I especially like how they explain their logic, the major takeaway points, the relevant math, AND the actual algorithm. It's a well-written, easy-to-read paper, in my opinion.

Thursday, September 3, 2009

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

[Paper link]

Congestion control is designed to keep network devices from being overwhelmed by traffic. A congested network will see high rates of dropped packets as queues overflow. Congestion control is a dynamic resource management problem, where hosts should modulate the frequency of their packet transmissions to reflect the current state of the network. This paper focuses on congestion avoidance, which tries to keep traffic at a manageable level well before the network becomes overloaded. (This is opposed to congestion recovery, which kicks in to bring the load back down once the network is already congested.) More specifically, the authors analyze the increase/decrease algorithms of the binary feedback scheme, a decentralized decision-making system. In the binary feedback scheme, network resources monitor their state, and when they are operating above their optimal range they set the congestion bit to 1 in the feedback given to users. Users then adjust their sending according to an increase/decrease algorithm.

They look at 4 criteria for algorithm controls: efficiency, fairness, distributedness (binary feedback vs universal knowledge), and convergence. Their questions based on these criteria are: (1) "What are all the possible solutions that converge to efficient and fair states?" and (2) "How do we compare those controls that converge?" Their main result is that increase should be additive and decrease should be multiplicative.
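As a toy illustration of that result (my own sketch, with arbitrary parameter values): two users applying additive increase / multiplicative decrease to a shared link converge toward fairness, because the multiplicative cuts shrink the gap between them while the additive steps leave it unchanged.

```python
def aimd(x1=0.1, x2=0.9, capacity=1.0, alpha=0.01, beta=0.5, steps=2000):
    """Two users adjust their sending rates on binary feedback:
    the congestion bit is set when total demand exceeds capacity.
    Additive increase (+alpha) when uncongested,
    multiplicative decrease (*beta) when congested."""
    for _ in range(steps):
        congested = (x1 + x2) > capacity
        if congested:
            x1, x2 = x1 * beta, x2 * beta
        else:
            x1, x2 = x1 + alpha, x2 + alpha
        # Each multiplicative cut halves the gap between the users,
        # while the additive steps keep it constant -> fairness.
    return x1, x2

print(aimd())  # the two rates end up (nearly) equal, oscillating below capacity
```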

- Notably, this is an example of something that goes against the (strict?) end-to-end argument. Although the hosts do the actual throttling according to the increase/decrease algorithm, the scheme still relies on a binary feedback bit in the header whose value is set by routers (in the "dumb" core). However, it seems unreasonable to think that any given host could see enough of the network to make such a judgement without advice from routers (as they discuss when they say they desire "distributedness"). Sometimes the routers simply know more about the state of the network (e.g., current traffic conditions).

- This model assumes that users will respond honestly and in everyone's best interests when told that the network is being overwhelmed. What if a user free-rides on the fact that other users will rate-limit themselves, and refuses to limit itself in order to grab more bandwidth? Oddly enough, they don't mention this in their "Practical Considerations" section.

- Not a huge fan of this paper. It's mathy and details-oriented, but it's not clear to me that understanding those details (and not just the principles/guidelines derived from them) is useful unless you are directly doing research in this area, especially since they fail to consider practical things such as misbehaving hosts.

Understanding BGP Misconfiguration

[Paper link]

This paper presents a study of BGP misconfiguration problems. The authors watched route advertisements from 23 vantage points and checked them for incorrect information, and then contacted the ISPs responsible for the incorrect announcements to find out what had caused the problem. They look at two types of problems: 1) "origin misconfiguration", the accidental injection of routes into the global BGP tables (e.g., the Pakistan-YouTube problem), and 2) "export misconfiguration", exporting routes in violation of an ISP's policy. (From the previous paper, one should recall that ISPs have a financial motivation for withholding some of the routes in their tables.)

They determined whether an event in the BGP update stream was a misconfiguration based on its duration. If a policy suddenly changes (and then is quickly changed back), that is indicative of a misconfiguration rather than a policy shift; they look at changes that lasted less than a day. They can miss potential misconfigurations if those misconfigurations go undetected by the ISP for a while, but I suppose that those misconfigurations couldn't be too important if nobody noticed them. They can also miscategorize a legitimate policy shift as a misconfiguration, but their communication with the ISP about the cause should set the record straight. They also limit the scope of misconfigurations to the two types I listed above, because they aren't able to detect all kinds of misconfigurations (e.g., the inverse of #2 -- filtering something that should have been exported -- looks identical to a failed node/link). They observed misconfigurations for 21 days.
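A rough sketch of the duration-based heuristic as I understand it (the one-day threshold is from their methodology; the data layout and helper names are my own assumptions): a newly announced route that disappears within a day is flagged as a candidate misconfiguration, to be confirmed by emailing the operator.

```python
from datetime import timedelta

SHORT_LIVED = timedelta(days=1)

def candidate_misconfigs(events):
    """events: list of (prefix, origin_as, announce_time, withdraw_time) tuples
    extracted from the BGP update stream. Flag new announcements that
    disappear within a day as candidate misconfigurations, to be confirmed
    by asking the responsible operator."""
    flagged = []
    for prefix, origin_as, announced, withdrawn in events:
        if withdrawn is not None and (withdrawn - announced) < SHORT_LIVED:
            flagged.append((prefix, origin_as))
    return flagged
```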

The results for origin misconfiguration are astounding. Their results indicate that 72% of route updates per day are actually from a misconfiguration. 13% of the misconfigurations result in a connectivity problem. Does that mean that 9.4% of route updates overall lead to connectivity problems?? That's insanely high...so high that it actually makes me suspicious of their results. They present this as being a low number compared to actual failures, but it's really high (in my mind) compared to the total number of route updates sent. If about 10% of all route updates are causing connectivity problems...well, that seems terrible to me!

Criticisms:

-- They present data showing that most of the misconfigurations are really short-lived. I'm not sure why they present this information, because they have already selected for short misconfigurations. Their categorization method decides that short-lived updates are misconfigurations and never considers longer-lived updates. So if there were misconfigurations of longer duration (e.g., a set of subtler misconfigurations), they would have been thrown out anyway.

-- A note on their presentation: I find their result tables really hard to read. It took me a few minutes to figure out that the table alternated between paths and incidents.

-- A general philosophical point is that this paper is about small, relatively minor errors. Are we really concerned about those? Should we be? Or have people learned to live with this background noise, and should we instead be looking at what causes the disastrous failures? The disastrous failures might be caused by entirely different factors in entirely different contexts; since the scale of the failures is different, it seems like the other factors would be different too.

Interdomain Internet Routing

[Paper link]

This paper is informational (part of a textbook?), rather than making an argument or evaluating a system. It explains BGP and how inter-ISP route selection is influenced by $$. Routing is not done in a way to maximize global efficiency and performance; instead it is done in a cooperative-competitive fashion where money is a large factor in routing decisions. Notably, transit routing is profitable whereas peering usually is not, and ISPs will value their customers' routing announcements over other routing announcements. The design of BGP has space for attributes that can be used to decide between routes based on factors relevant to profitability.

My specific notes on the paper follow...

- Topology of the Internet. Topology is made of Autonomous Systems (ASes). BGP routes between ASes. Different IGPs (Interior Gateway Protocols) like RIP run inside ASes. A small AS is a university/corporation, a large AS is a nation-wide ISP.

- Two types of AS-AS interconnection:
1. Provider-customer transit. "One ISP provides access to all destinations in its routing tables." Customer is paying provider for this universal access. If there were only transit relationships then there would be a single huge Tier-1 AS at the root, and it would control all Internet traffic.
2. Peering. "Two ASes provide mutual access to a subset of each other's routing tables." Sometimes paid, often reciprocal. Currently there are 9 big Tier-1 ISPs who do peering so that they can have explicit routes to all destinations. Smaller ASes do peering simply to save money, if there's a lot of traffic going between the two ASes.
- "A route advertisement from B to A for a destination prefix is an agreement by B that it will forward packets sent via A destined for any destination in the prefix." Other way to think about it is that ISPs charge customers for entries in routing tables.

- How BGP works. Router A establishes a TCP connection to Router B. A sends B an "OPEN" message. They then exchange tables of all the active routes they want to share, and each router adds the routes it wants into its own routing table. Routers may later send UPDATE messages, which can be either announcements or withdrawals. eBGP is used between ASes. iBGP is also used within an AS, because an AS may have several eBGP routers that need to be kept consistent; iBGP is *not* the same as an IGP. Notable BGP attributes are LOCAL_PREF (a local policy for choosing between routes), ASPATH (route length plus a way to discard circular route announcements), and MED (used for comparing two routes from the same neighboring AS). A sketch of how these attributes feed into route selection follows below.
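Here is a hedged sketch of the route-selection ordering those attributes feed into (a simplified subset of the real BGP decision process, with made-up route records and AS numbers): higher LOCAL_PREF wins first, then shorter AS path, then lower MED.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Route:
    prefix: str
    local_pref: int                  # set by local policy (e.g., prefer customer routes)
    as_path: List[int] = field(default_factory=list)
    med: int = 0                     # only meaningful between routes from the same neighbor AS
    neighbor_as: int = 0

def best_route(routes):
    """Simplified BGP decision: LOCAL_PREF, then AS-path length, then MED.
    (The real process has more tie-breakers; this is just an illustration.)"""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.med))

customer = Route("203.0.113.0/24", local_pref=200, as_path=[65001], neighbor_as=65001)
peer = Route("203.0.113.0/24", local_pref=100, as_path=[65002, 65001], neighbor_as=65002)
print(best_route([customer, peer]).neighbor_as)  # 65001 -- the customer route wins on LOCAL_PREF
```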

- Problems with BGP. BGP does not support origin authentication, meaning there's no way to tell which AS owns a prefix, so any AS can claim that it owns any prefix. Traffic from other networks would then be sent to that AS instead of where it should actually go.
- With the Pakistan/YouTube problem, a Pakistani ISP was redirecting internal Pakistani traffic to its own error message in order to block YouTube. It accidentally leaked this announcement to its parent ISP, which started to propagate it, and YouTube traffic started going to the Pakistani ISP. Since the Pakistani ISP was a customer of the larger ISP, the larger ISP prioritized that announcement over the existing one. Additionally, since the announcement was MORE specific than YouTube's actual prefix, longest-prefix matching made routers everywhere prefer it (see the sketch after these notes).
- This can also be used for spam. A bad guy claims it owns a prefix and sends it upstream, and a questionable or careless ISP propagates it. They then send spam "from" IPs in that prefix and are able to receive responses for a short period of time. Then a few hours or days later, they disappear!
- Perhaps this is part of the larger problem of accountability/authentication not being part of the original goals of the Internet.
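Here's a small sketch of why the more-specific announcement wins (the prefixes are hypothetical placeholders, not the ones actually involved): forwarding uses longest-prefix match, so a hijacked /24 covering part of a legitimate /22 captures all traffic for that /24 no matter where the /22 is announced from.

```python
import ipaddress

# Hypothetical routing table: a legitimate /22 and a hijacked, more-specific /24.
routing_table = {
    ipaddress.ip_network("198.51.100.0/22"): "legitimate origin AS",
    ipaddress.ip_network("198.51.100.0/24"): "hijacker AS",
}

def next_hop(dst):
    """Longest-prefix match: among all table entries containing dst,
    pick the one with the longest prefix length."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routing_table if addr in net]
    return routing_table[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("198.51.100.25"))   # -> "hijacker AS" (covered by the /24)
print(next_hop("198.51.102.10"))   # -> "legitimate origin AS" (only the /22 matches)
```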

Tuesday, September 1, 2009

End-to-End Arguments in System Design

[Paper link]

This paper discusses the question of what responsibilities should be given to a communication system vs. the applications using it. The authors believe that most functions are better suited to the applications than to the communication system, and they call this the "end-to-end argument." A motivating example they give is file transfer reliability: should the communication system checksum and internally confirm each packet that is sent across the network, or should the sending application checksum the file once and request a retransmit of the whole file if the receiving application's checksum fails to match? The high cost of the former option is not justified by the low odds of a transmission error occurring. Additionally, one must consider problems external to the communication system, such as the reliability of the hard drives; to account for these, the host applications would need to perform their own integrity check even if the communication system provided its own guarantee of reliability. The "end-to-end argument" is that many functions (e.g., reliability) are better performed by the end applications than by the communication system itself. The communication system should offer these functions only as performance enhancements, and one needs to be careful that the performance "enhancement" actually works out to be beneficial and not overly complex.
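As a tiny illustration of the end-to-end version of that check (my own sketch, not from the paper): the sender checksums the whole file, the receiver recomputes the checksum after the transfer, and a mismatch triggers a retransmission of the entire file, regardless of whatever guarantees the network did or didn't provide.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def transfer_with_e2e_check(file_bytes, send, max_retries=3):
    """send(data) models the (possibly unreliable) communication system and
    returns whatever bytes arrived. The end hosts do the real reliability
    check: compare checksums and retransmit the whole file on a mismatch."""
    expected = sha256(file_bytes)
    for _ in range(max_retries):
        received = send(file_bytes)
        if sha256(received) == expected:
            return received
    raise IOError("file transfer repeatedly failed the end-to-end check")

# Example with a perfectly reliable 'network':
print(len(transfer_with_e2e_check(b"some file contents", send=lambda d: d)))
```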

In the previous paper on DARPA Internet design, the author also talked about this issue as a motivating factor for splitting up TCP/IP. Different applications (e.g., voice vs. rlogin) need different reliability guarantees; rlogin will wait to receive its packets in order, whereas a voice application would rather drop a packet and insert silence than delay playback waiting for an out-of-order packet. For this reason, TCP and IP were separated so that IP became a lower-level building block, with TCP and alternatives like UDP on top of it handling reliability as they see fit. I like this paper because it formalizes the rationale behind the TCP/IP separation and makes it a more general principle.

The authors also apply the end-to-end argument to encryption, duplicate message suppression, guaranteeing FIFO message delivery, and transaction management. The general theme is the same as with the file transfer example: the end hosts need to perform the function anyway (and can probably do it better, since they have more semantic information), so doing the work in the communication system would just be wasteful duplication. Another important point is that not all applications need the same underlying "help" from the communication system, so what is a useful performance enhancement for one type of application might slow down another.

This same type of discussion could be applied to NIDSs -- should the work of intrusion detection be done on the network or by the host? Doing work on a dedicated NIDS will save hosts computation time, and it's easier to keep a NIDS programmed and updated than it is to keep host security software updated. On the other hand, understanding network traffic sometimes means the NIDS has to infer application behavior, which can become very expensive and lead to DOS attacks. Instead of having a NIDS do bifurcation analysis, it might make more sense to have the host be responsible for filtering its own incoming traffic.

The paper does acknowledge that not EVERYTHING should be moved to the end applications -- there is always a tradeoff. For example, retransmitting a single packet makes more sense than retransmitting an entire file. I would be interested in seeing a formal cost/benefit analysis that could be used to make the decision of when one approach makes more sense.

Monday, August 31, 2009

Design Philosophy of the DARPA Internet Protocols

[Paper link]

This paper explains the rationale behind DARPA's design decisions for TCP/IP. The main point of the paper is to explain how the military's priorities and application suite influenced the design of the Internet. Since many of these goals are no longer applicable to the widespread commercial Internet, the choices may be worth reconsidering in future designs: if you can identify the priorities and goals that have changed, you might be able to find areas of improvement in modern contexts. They place a very high priority on survivability and the ability to support mixed-media networks, and a very low priority on accountability and cost effectiveness. This clearly reflects a military mindset rather than the modern commercial one. Since survivability was the #1 priority, stateless packet switching was chosen as the core architecture of the Internet; if parts of the Internet died and had to be replaced or restarted, no state would need to be restored.

Their original goal was to give users of a packet radio network access to the resources of the main network. This had three important ramifications:
(1) These were two separate administrative domains, hence the idea of the Internet being locally rather than universally administered.
(2) They needed to support both rlogin and voice over these different types of networks (with their different speed and reliability constraints). To do this, they moved reliability out of the common layer and up the stack: TCP for applications like rlogin that need in-order, reliable delivery, and UDP for applications like voice that don't. IP provides flexible, "best effort" transmission of datagrams, and TCP and UDP handle reliability (or not) however they like. The fact that networks only need to support the switching of datagram packets means that lots of different network types of varying quality can support the Internet.
(3) Since the specifications and requirements placed on networks are so minimal, a lot of re-engineering has to be done every time a new portion of the Internet is set up.

For context, this paper was published in 1988, 15 years after the inception of TCP/IP.

-- Since it was a military network, survivability was the #1 goal and accountability was #7. This has interesting implications for network security. In the original threat model, all threats were physically external (trying to damage the infrastructure of the network). In a modern threat model, network threats come from inside the Internet, and accountability would be very useful (e.g., for tracing and discouraging DoS attacks); the designers just saw accountability as a resource-allocation detail. Accountability is hard with the stateless datagram model because packets carry all their own state and the packet switches are stateless. There is no place in a packet's header for recording accounting information except the source field, and the creator of a packet can lie about the source field when building the packet. We need egress/ingress filtering to detect these lies, but that requires ISP cooperation. Related paper: Network support for IP traceback.
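A minimal sketch of what ingress filtering at an ISP edge might look like (hypothetical prefixes and port names, not from the paper): the ISP drops packets arriving from a customer whose source address is not inside the prefixes delegated to that customer, which catches the most naive source-address lies.

```python
import ipaddress

# Hypothetical delegation: which source prefixes are legitimate on each customer port.
DELEGATED = {
    "customer-port-1": [ipaddress.ip_network("192.0.2.0/24")],
    "customer-port-2": [ipaddress.ip_network("198.51.100.0/24")],
}

def ingress_ok(port, src_ip):
    """Accept a packet only if its source address falls inside a prefix
    delegated to the port it arrived on (BCP 38-style ingress filtering)."""
    src = ipaddress.ip_address(src_ip)
    return any(src in net for net in DELEGATED.get(port, []))

print(ingress_ok("customer-port-1", "192.0.2.55"))   # True  -- legitimate source
print(ingress_ok("customer-port-1", "203.0.113.9"))  # False -- spoofed source, drop
```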

-- In section 7, the author writes: "In the large Internet being currently operated, routing decisions need to be constrained by policies...Today this can be done only in a very limited way, which requires manual setting of tables." Is this manual setting of tables what BGP now does?

-- In section 8, the author talks about how simulators can/should be used to figure out the "best" design of a network. He seems chagrined that most actual designs are "best guesses" rather than supported by simulated data. However -- it's my understanding that simulating realistic network traffic is a very hard problem that still hasn't been solved.

-- Are flows implemented now? I am sort of confused by his discussion in the conclusion. I thought a network flow was just, generally, the packets traveling from one host to another over a multiplexed connection...and therefore still packet-based. Is the difference that, in his model of "flows", intermediate switches & routers maintain state about the flow?

-- I liked this paper a lot. Definitely should keep in the syllabus. It's nice to know the rationale behind past design decisions.

FIRST POST

FIRST POST!

About Me

Berkeley EECS PhD student