APF's Network Summaries

Friday, November 20, 2009

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

A "sleeping" device uses less power than an idle-but-turned-on machine but loses its ability to respond to network events. This paper explores different ways of setting a machine to sleep without losing network presence, so that computer users will be more willing to actually use sleep modes. Notably, they look at two approaches: Wake-on-Lan (WoL), which allows machines to be woken up remotely with a special magic packet; and the use of a "proxy" that handles traffic on behalf of a sleeping host (potentially waking the machine).

1. Is the problem worth solving? They look at 250 machines (a mix of desktops and notebooks) in both home and work settings. If a person hasn't used a machine for 15 minutes, they classify it as "idle", and the machines in their study are idle a huge percentage of the time. That's a big opportunity for sleep-but-present.

2. What network traffic do idle machines see? They use Bro to record the traffic. They find that both at home and at work, machines receive nearly constant network traffic (although home networks are quieter). If they were going to wake up a machine for EVERY PACKET, machines wouldn't get to sleep much: not at all in the office, and only about 50% of idle time at home. This implies that a selective wakeup mechanism (only wake for some packets) would be better.

To try to figure out what kind of policy would be good (which packets should they wake for?), they broke down idle-time traffic by communication type (multi/uni/broadcast), incoming vs. outgoing, and protocol. Outgoing traffic is mostly unicast; incoming is a mix of all three (so a proxy would need to handle all of them). Incoming broadcast and multicast chatter are largely responsible for poor sleep; proxying & ignoring them would save a lot of sleep time. Most broadcast and multicast protocols can be ignored or answered by the proxy (e.g. the proxy can respond to ARP queries or SSDP packets without waking the host).

3. What is the design space for a proxy? They come up with 4 example policies that wake/respond/ignore in different ways. "Proxy_3" performs the best, reclaiming 70% of idle time for sleeping; it only wakes the host for telnet, ssh, VNC, SMB file-sharing, and NetBIOS, and the proxy responds to certain other packets on the host's behalf. This is "non-transparent" because the machine will appear to be asleep (unreachable) for protocols outside the set they chose to handle.
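
To illustrate the wake/respond/ignore split, here is a strawman classifier (my own sketch; the port numbers are my mapping of the services named above and the packet fields are assumptions, not the paper's rule set):

    # Strawman of a wake/respond/ignore policy in the spirit of their proxy_3.
    # The port numbers are my own mapping of the services named above, and the
    # packet fields are assumptions; this is not the paper's rule set.
    WAKE_PORTS = {22, 23, 139, 445, 137, 138, 5900}   # ssh, telnet, SMB/NetBIOS, VNC

    PROXY_RESPOND = {"ARP", "SSDP"}                   # chatter the proxy can answer itself

    def classify(pkt) -> str:
        """Return 'RESPOND', 'WAKE', or 'IGNORE' for an incoming packet.
        `pkt` is assumed to expose .protocol (a string) and .dst_port."""
        if pkt.protocol in PROXY_RESPOND:
            return "RESPOND"                          # keep network presence; host stays asleep
        if pkt.dst_port in WAKE_PORTS:
            return "WAKE"                             # e.g. an incoming ssh or SMB connection attempt
        return "IGNORE"                               # remaining broadcast/multicast chatter, scans, etc.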

It seems like administrator-customized proxy policies might be a good idea; administrators would have their own insights into what is important. For example, I would happily sleep my machine if I could selectively wake it up just for ssh! Perhaps the EECS admins would ask me to have it respond to a few other things as well.

All in all, I think this is a really cool paper. I had never read anything about energy saving before these two energy-related papers, and I think it's a neat topic. I'd like to hear more about this area.

Wednesday, November 18, 2009

Cutting the Electric Bill for Internet-Scale Systems

This is a REALLY cool paper!

Their idea is to cut a company's electric bill by servicing traffic at the data centers with the lowest energy prices. This might mean serving a host in Boston from a data center in Virginia (rather than New York) if Virginia is cheaper at that time. "Preferred data centers" would be recalculated every hour based on that hour's regional energy prices. By shifting traffic around based on energy price, the total electricity cost of the system is minimized. They look at energy prices for different regions and note that not all preferences can be pre-determined: sometimes a static preference works (always prefer one region over another, e.g., Chicago vs. Virginia), but other times the routing needs to be dynamic (e.g., Boston vs. NY). They use a real, large Akamai traffic data set to examine the usefulness of their idea, and find that energy savings could be up to 40% under perfect conditions.
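
To make the mechanism concrete, the hourly decision might look something like this toy sketch (entirely mine, not their system; `prices` and `can_serve` are stand-ins for the price feeds and the latency/bandwidth constraints they discuss):

    # Toy sketch of the hourly "preferred data center" computation: among the
    # data centers allowed to serve this client region (a stand-in for the
    # latency and bandwidth constraints), pick the one with the cheapest
    # energy price this hour.
    def preferred_dc(client_region: str,
                     prices: dict[str, float],        # $/MWh per data center, this hour
                     can_serve: dict[str, set[str]]   # data center -> client regions it may serve
                     ) -> str:
        candidates = [dc for dc, regions in can_serve.items() if client_region in regions]
        return min(candidates, key=lambda dc: prices[dc])

    # e.g. serve Boston from Virginia instead of New York when VA is cheaper this hour:
    #   preferred_dc("boston", {"virginia": 35.0, "new_york": 55.0},
    #                {"virginia": {"boston"}, "new_york": {"boston"}})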

Their idea has four caveats:

(1) Electrical elasticity refers to how much power a server consumes when idle relative to when it's active. If the idle and active power draws are identical, this plan doesn't work at all; if idle draw were zero, you'd have ideal conditions. The reality is somewhere in the middle: Google's elasticity rate is 65%. With perfect elasticity, they find 40% savings; with Google's elasticity rate, they only see 5% savings.

(2) This could increase latency, which might not be tolerable for some applications. Looking at normal Akamai traffic (Akamai is a latency-sensitive CDN), they see that geo-locality is not always adhered to anyway; perhaps geo-locality does not actually correspond to performance. I would be interested in seeing more work on this point.

(3) Lengthening the routing path could have the side effect of incurring extra energy expenditures along the way. However, they say this is not a big problem because routers' energy use is not proportional to the traffic they carry.

(4) Under 95/5 bandwidth billing, usage is billed at the 95th percentile of traffic (sampled over 5-minute intervals), so if that percentile exceeds the committed rate, the customer pays substantially more for the overage. This constrains their scheme: they can't push traffic to one data center beyond that point, because the extra bandwidth would cost more than the energy they are saving. If the 95/5 rule is adhered to, perfect elasticity only yields 15% savings; without the 95/5 constraint, you see their 40% max-savings figure. Note that with both the 95/5 rule and 65% elasticity, Google only sees a 2% savings (which is still $1M+).
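
Putting (1) and (4) together, here's a back-of-the-envelope sketch (entirely my own simplification, not the paper's model) of why elasticity and the 95/5 rule cap the savings:

    # Back-of-the-envelope cost model (my own simplification, not the paper's):
    # the idle draw stays at its home data center no matter what, and only the
    # load-proportional draw gets billed wherever the load actually ran.
    def hourly_cost(load_kw_by_dc: dict[str, float],     # load-proportional kW placed at each DC
                    idle_kw_by_dc: dict[str, float],     # fixed idle kW at each DC
                    price_by_dc: dict[str, float]        # $/kWh at each DC, this hour
                    ) -> float:
        return sum(price_by_dc[dc] * (idle_kw_by_dc[dc] + load_kw_by_dc.get(dc, 0.0))
                   for dc in idle_kw_by_dc)

    # With idle draw around 65% of active (their Google number), only the
    # remaining slice of the bill is even eligible for price arbitrage, which
    # is why the estimate drops from 40% to ~5%; the 95/5 rule then further
    # caps how much load any single data center is allowed to absorb.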

One question I have is whether this system would be self-affecting. That is to say: if a large company like Google actually moved its computation to its cheaper DCs, would power costs in those regions increase? (Conversely, if all the DCs in Silicon Valley moved their servers to Oklahoma, would Silicon Valley prices go down?)

------------------

Notes -

Google #s should be higher now. Their machines used to be single-app; now they are multi-app. Regardless, utilization is likely under 60%...over 60% you see an increase in latency because of competition for resources.

Scalable Application Layer Multicast

This paper presents a system, NICE, for application-layer multicast. They look at a class of applications known as "data stream applications", which simultaneously deliver streaming data to large numbers of people. They view multicast as an overlay network, with all the recipients of the multicast communication being the nodes in the overlay. They set out 3 ways to evaluate the "performance" of an application-layer multicast scheme: (1) quality of the data delivery path, using metrics like stress (how many duplicate copies of a packet cross a single physical link), stretch (how much longer the overlay path is than the direct unicast path), and node degrees; (2) robustness of the overlay; (3) control overhead.

Their multicast system is hierarchical. End hosts are arranged in layers. The bottom layer (L0) has all hosts in it, organized into small clusters. Hosts are grouped into clusters based on "closeness." Each cluster has a cluster leader who is in the "center"; that leader also belongs to the next layer up (L1). L1 members are clustered into small groups and then each one of those has a cluster leader, who belongs to L2, etc. recursively up to a single node at the top. At the very top of the pyramid/tree is the data provider. The data delivery path goes down the layers, with each cluster leader distributing data to its cluster members.
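
Here's a toy sketch of how I picture the structure and the downward data path (my reading, not their implementation; `unicast` is just a stand-in for a real send):

    # Toy model of the layered hierarchy (my reading, not their code):
    # `leads[h]` lists the clusters that host h is the leader of, one per layer
    # it appears in. A host that has the data forwards it to the members of
    # every cluster it leads; each member heads a cluster one layer further
    # down, so the data fans out from the provider at the top toward L0.
    from collections import defaultdict

    leads = defaultdict(list)                    # leader -> list of clusters (lists of hosts) it heads

    def unicast(host: str, data) -> None:
        print(f"send {data!r} to {host}")        # stand-in for an actual point-to-point send

    def deliver(sender: str, data) -> None:
        for cluster in leads[sender]:
            for member in cluster:
                if member != sender:
                    unicast(member, data)
                    deliver(member, data)        # the member repeats this for any cluster it leads

    # e.g. leads["provider"] = [["provider", "A", "B"]]; leads["A"] = [["A", "a1", "a2"]]
    #      deliver("provider", "frame-0")  # reaches A and B directly, and a1/a2 via A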

Members of a cluster exchange "heartbeat" messages. The cluster leader sends out a digest with info about all the cluster members. Members of a cluster tell each other their distance estimates to every other group member. Based on this info, the leader can be replaced if it is no longer at the "center" under changing network conditions. The leader also keeps an eye on cluster size and splits the cluster if it gets too big.

When a new node joins the overlay network, it first contacts the data provider (the "RP", for rendezvous point) at the top. The RP responds with the hosts in the highest layer; the new node contacts those hosts to find out which one is the best match. It then chooses one and contacts that leader's children to pick among those, and so on, until it is finally mapped to an L0 cluster in the hierarchy. Nodes may leave the network either intentionally (with a REMOVE message) or due to failure (in which case their loss is noticed by the absence of heartbeats). New cluster leaders may be chosen when nodes enter or leave a cluster.
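
The join walk might look roughly like this (again just a sketch of my reading; `members_of` and `distance` are placeholders for the membership queries and RTT probes a real node would perform):

    # Sketch of the join walk as described above (my reading): the RP hands
    # back the top layer; at each layer the new node probes the candidates,
    # keeps the closest one, and asks it for the members of its cluster one
    # layer down, bottoming out at an L0 cluster.
    def join(new_host: str,
             top_layer: list[str],       # hosts in the highest layer, from the RP
             num_layers: int,
             members_of,                 # members_of(leader, layer) -> that leader's cluster at `layer`
             distance                    # distance(a, b) -> estimated distance between hosts a and b
             ) -> list[str]:
        """Descend one layer at a time, always toward the closest host, and
        return the L0 cluster the new node should join."""
        candidates = top_layer
        for layer in range(num_layers - 1, 0, -1):
            closest = min(candidates, key=lambda h: distance(new_host, h))
            candidates = members_of(closest, layer - 1)
        return candidates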

I understand their overlay network topology and maintenance protocols, but I feel like I'm missing something in terms of how the actual data is transferred. OK, so data goes down the tree. What happens with failures/retransmissions? How are those relayed, and who is responsible for them in the hierarchy? Are they just assuming that reliability is being handled beneath the overlay by TCP? And what if one node is very faulty (thereby generating lots of failures/retransmissions)? Won't nodes higher up in the hierarchy see their bandwidth suffer, because the cluster leaders will need to handle the failure/retransmission traffic from the flaky node in L0?

-------------

Notes -

app layer with NICE
----
UDP
IP

Tuesday, November 17, 2009

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

Implementing reliable multicast is not as easy as making slight changes to unicast protocols. With TCP, the sender is responsible for loss detection & recovery. This does not work for multicast because (1) if all recipients constantly send the sender ACKs, the sender will be overwhelmed with traffic (ACK implosion), and (2) the set of recipients is unknown and may be constantly shifting. Essentially, the issue is that multicast does not imply "fate-sharing", whereas unicast does. The authors conclude that receiver-based reliability is a better approach to reliable multicast. They also note that using communication state (sequence numbers) for naming data does not work well for multicast because state synchronization is weaker; instead, they advocate naming data in application data units (ADUs).

This paper (from 1995) presents Scalable Reliable Multicast, a framework that provides reliable (but not necessarily ordered) multicast delivery over IP. They suggest that ordered delivery and other application-specific details could be built on top of SRM. If I understand it correctly, they are splitting the transport layer in half: SRM would be on the bottom of the transport layer, and then some number of ordering protocols would be on top of it. In my mind, this is equivalent to making the unicast assumptions of UDP and TCP explicit and then building UDP and TCP on top of the unicast framework.
-- Note that I did not understand their description of "Application-level framing". Do they simply mean that there need to be many different transport layer protocols for different types of applications (much like there is UDP vs TCP), or do they mean that you shouldn't really have anything on the transport layer, and applications should be handling this directly?

They use SRM with a whiteboard conferencing tool named "wb" from LBNL. Many users (maybe thousands) simultaneously use a wb instance. Wb *does not* require ordered delivery because its operations are either idempotent or incorrect ordering can be easily fixed. Every change to the whiteboard is multicast to all of the users of a whiteboard, and whiteboard users (the receivers in this system) are responsible for their own loss detection and retransmission. If a receiver detects a loss, it schedules a request timer that is drawn from a uniform distribution based on the receiver's estimate of the one-way delay to the source of the data. The request is then multicast, as is the response to the request. If a receiver sees *another* request for the same data before its own request timer expires, then it will do a random exponential backoff and reset its timer. If it gets the missing data, it will cancel its request timer altogether. This provides reliability without the ACK implosion. The point of the randomization & timer scheduling based on location is to de-synchronize host actions to prevent unnecessary duplicates.
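
For concreteness, the timer scheduling could be sketched like this (my paraphrase of the scheme; the constants C1/C2 are illustrative rather than the paper's tuned values):

    # My paraphrase of the request-timer scheduling: on detecting a loss, pick
    # a delay uniformly from an interval scaled by the estimated one-way delay
    # to the source, and double that interval each time we back off after
    # hearing someone else's request for the same data.
    import random

    def request_delay(one_way_delay: float, backoff_round: int = 0,
                      C1: float = 2.0, C2: float = 2.0) -> float:
        lo = C1 * one_way_delay
        hi = (C1 + C2) * one_way_delay
        return (2 ** backoff_round) * random.uniform(lo, hi)

    # On detecting a loss:               schedule a request after request_delay(d)
    # On hearing someone else's request: cancel, reschedule with backoff_round incremented
    # On receiving the missing data:     cancel the timer entirely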

They ran some simulations of their request & repair algorithms on small contrived topologies. They say the point of the simulations is to look at the performance for individual packet drops; "We do not claim to be presenting realistic topologies or typical patterns of packet loss." In their scenarios, they show that the request & repair algorithms work well in random and bounded-degree trees when every node is a member of the multicast session; they don't work as well in a sparse session. This is a bit concerning since the sparse session is the expected real-world scenario, so they make some changes to their request & repair algorithms. Notably, they add adaptive adjustment of the timer parameters based on the past behavior of the algorithms in the multicast session. If the average number of duplicate requests is high, the request timer interval is increased; if the duplicates are OK but the delay in sending a request is too high, the request timer interval is decreased. They then run new simulations and show that the adaptive algorithm works better ("well") under many conditions. Note, however, that in their simulations none of the requests or repairs are ever dropped, which is unrealistic; in the real world they would be.
-- Honestly, I am not really sure what the value of their simulations is.
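
That said, the adaptive tweak itself is simple enough to sketch (my paraphrase; the thresholds and step sizes here are made up for illustration):

    # My paraphrase of the adaptive adjustment: widen the request-timer interval
    # when there are too many duplicate requests, and narrow it when duplicates
    # are fine but requests are going out too slowly.
    def adapt_request_timer(C1: float, C2: float,
                            avg_dup_requests: float,
                            avg_request_delay: float    # assumed normalized, e.g. in RTTs
                            ) -> tuple[float, float]:
        if avg_dup_requests > 1.0:                      # too many duplicates: spread requests out more
            return C1 + 0.1, C2 + 0.5
        if avg_request_delay > 1.0:                     # duplicates OK but too slow: tighten the interval
            return max(C1 - 0.1, 0.5), max(C2 - 0.5, 1.0)
        return C1, C2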

They also suggest (but do not implement or simulate) "local recovery". The idea is that a receiver belongs to a "loss neighborhood" where a set of nodes are all experiencing the same losses. Receivers learn about their loss neighborhoods from session messages (which are sent out periodically). Requests & repairs could be broadcast only to a loss neighborhood, thereby reducing the overall bandwidth used.

Different apps have different tradeoffs between delay and bandwidth. wb wants to minimize delay but isn't worried about bandwidth, whereas something like Usenet would want to minimize bandwidth without concern for delay. This is a little confusing to me -- unlike ordering, these properties aren't something you can build "on top" of SRM. So I am back to my original confusion of whether this is meant to be transport-layer or part of an application itself.

-----------

SRM - transport
IP - routing

Thursday, November 12, 2009

NetFPGA: A Tool for Network Research and Education

(paper)

This paper is about a resource for teaching and learning about networks. The idea is to expose students to L1 and L2 technologies. It can also be used in research to build networks cheaply and easily.

Tuesday, November 10, 2009

A Policy-aware Switching Layer for Data Centers

(paper)

Data centers try to route their traffic through middleboxes (firewalls, load balancers, NATs, etc.). They typically do this by putting the middleboxes in-line in the network topology and trying to ensure there are no other paths. Sometimes traffic doesn't need to go through a particular box but passes through it anyway, because that's how the topology was set up. This paper proposes an L2 way to accomplish this more cleanly, which they call "the policy-aware switching layer", aka PLayer. The properties they want are correctness (policies can't be circumvented), flexibility, and efficiency. Pswitches classify traffic and redirect it to the appropriate middleboxes. Middleboxes do not need to be modified, but switches would need to be changed. A central policy server sends policies to all the pswitches so they know how to classify and direct traffic.
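
To give a flavor of what a pswitch does with a frame, here's a strawman of the policy lookup (my own simplification, not the PLayer rule format; in particular I'm hand-waving how the switch knows which middleboxes a frame has already visited, which the paper infers from the frame's previous hop):

    # Strawman of a pswitch's policy lookup: the policy server pushes rules
    # mapping a traffic class (here just destination port and direction) to
    # the sequence of middleboxes the frame must traverse.
    POLICIES = [
        # (match predicate, middlebox sequence)
        (lambda f: f["inbound"] and f["dst_port"] == 80, ["firewall", "load_balancer"]),
        (lambda f: f["inbound"],                         ["firewall"]),
    ]

    def next_hop(frame: dict, already_visited: list[str]) -> str:
        """Return the next middlebox this frame still needs to visit, or
        'destination' once its policy's sequence is complete."""
        for matches, sequence in POLICIES:
            if matches(frame):
                remaining = [m for m in sequence if m not in already_visited]
                return remaining[0] if remaining else "destination"
        return "destination"    # no policy applies to this frame

    # e.g. next_hop({"inbound": True, "dst_port": 80}, []) -> "firewall"
    #      next_hop({"inbound": True, "dst_port": 80}, ["firewall"]) -> "load_balancer"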

They implemented prototype pswitches using Click and then "validated" the functionality on the DETER testbed. They make the usual disclaimer about the work being a prototype, which of course tips you off that the performance is bad: 40% of normal throughput with 2x the latency. Ouch!!!

They then provide a formal analysis that I do not buy at all. I personally think that a formal analysis of a system like this is BS. They are just rephrasing what they have said earlier in the paper with subscripts.

Interesting paper but honestly they went into too much detail and lost my attention in places. This might be because this is a tech report. In a few places it seemed like they were unnecessarily belaboring obvious points that I didn't need convincing about.

Monday, November 9, 2009

Internet Indirection Infrastructure

The Internet is built to provide point-to-point, unicast service: the sender sends a packet to one addressee. IP routers in the middle will direct the packet to the addressee. This paper discusses how to support multicast, anycast, and host mobility: the sender doesn't necessarily know the identities and/or location of the receiving hosts. They propose an overlay network (Internet Indirection Infrastructure aka i3) to support these tasks.

i3 decouples sending & receiving. Sources send packets to an identifier, and receivers can ask for all packets sent to that identifier (by inserting a "trigger"). So -- anytime something is sent to a given identifier, it is distributed to everyone who has registered for that identifier. They map each identifier to an i3 server; there's one node for each ID. When a packet is sent, it goes to that server, which then distributes it to all hosts registered for that ID. This is useful for mobility because a trigger just needs to be updated with the new address; multicast can be done by all of the recipients registering for the same identifier; and honestly I am not sure I follow their explanation of how to apply this to anycast. Public triggers could be used for public servers, and private triggers can be used for short-term flows.
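
A toy model of the rendezvous idea (mine, not the real i3 API; it ignores Chord, trigger expiry/refresh, inexact identifier matching, and so on):

    # Tiny toy model: receivers insert triggers (identifier -> their address),
    # senders just send to an identifier, and the i3 node responsible for that
    # identifier forwards the packet to every registered receiver.
    from collections import defaultdict

    triggers = defaultdict(set)                # identifier -> set of registered receiver addresses

    def insert_trigger(ident: str, receiver_addr: str) -> None:
        triggers[ident].add(receiver_addr)     # receivers re-insert periodically to keep it alive

    def send(ident: str, packet: bytes) -> None:
        for addr in triggers[ident]:           # multicast falls out for free: everyone registered gets a copy
            forward(addr, packet)              # mobility: just re-insert the trigger with your new address

    def forward(addr: str, packet: bytes) -> None:
        print(f"forward {len(packet)} bytes to {addr}")    # stand-in for actual IP forwarding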

-- If there's only one node for each ID, how does this work in case of failures? Isn't non-replicated "hard state" generally a bad idea? They later say that end-hosts use periodic refreshing...which makes it soft state...but then your outage time would be the same length as your refresh period. If you are not a fault-tolerant application, do you want to be refreshing every 50ms? OK...so then they try to mitigate this later in the paper by saying (1) an application can maintain a backup trigger or (2) the overlay network can automatically manage replicas.

The underlying overlay is responsible for robustness, scalability, efficiency, and stability. They use the Chord lookup protocol. One thing to note is that routing using i3 is less efficient than point-to-point addressing. They can alleviate this by having senders cache the i3 server's IP address. You can still run into the "triangle problem" where the i3 server ends up being farther away than the recipient nodes; they say this can be solved by having receivers choose i3 servers close to them. This could worsen scalability problems...seems like certain i3 nodes could easily end up overwhelmed. They suggest hot spots could be avoided by replicating the trigger at multiple nodes.

Potential security problems....
-- Eavesdropping: anyone who knows a trigger can eavesdrop on it. They suggest that private triggers should be randomly chosen, kept secret, and changed periodically. I'm not sure eavesdropping is actually a huge concern, given that most people assume eavesdropping on the network is possible in other ways anyway.
-- Trigger hijacking: a malicious user removes someone else's public trigger. They suggest an extra level of indirection.
-- DoS attacks: a malicious user creates lots of triggers that all point to a victim, amplifying a single packet into many.
