A "sleeping" device uses less power than an idle-but-turned-on machine but loses its ability to respond to network events. This paper explores different ways of setting a machine to sleep without losing network presence, so that computer users will be more willing to actually use sleep modes. Notably, they look at two approaches: Wake-on-Lan (WoL), which allows machines to be woken up remotely with a special magic packet; and the use of a "proxy" that handles traffic on behalf of a sleeping host (potentially waking the machine).
1. Is the problem worth solving? They look at 250 machines (a mix of desktops and notebooks) in both home and work settings. If a person hasn't used a machine for 15 mins they classify it as "idle", and the machines in their study are idle a huge percentage of the time. That's a big opportunity for sleep-but-present.
2. What network traffic do idle machines see? They use BRO to record the traffic. They find that both at home and at work, machines are receiving nearly constant network traffic (although home networks are quieter). If they were going to wake up a machine for EVERY PACKET, machines wouldn't get to sleep much: not at all in the office, and only about 50% of idle time at home. This implies that a selective wakeup mechanism (only wake for some packets) would be better.
To try to figure out what kind of policy would be good (which packets should they wake for?), they broke down idle-time traffic by communication type (multi/uni/broadcast), incoming vs outgoing, and protocol. Outgoing traffic is mostly unicast, incoming is multi/uni/broadcast (so a proxy would need to handle all 3). Incoming broadcast and multicast chatter are largely responsible for poor sleep; proxying & ignoring them would save a lot of sleep time. Most broadcast and multicast protocols can be ignored or responded to by the proxy (eg the proxy can respond to ARP queries or SSDP packets without waking the host).
3. What is the design space for a proxy? They come up with 4 example policies that wake/respond/ignore in different ways. "Proxy_3" performs the best, reclaiming 70% of idle time for sleeping; it only wakes up for telnet, ssh, VNC, SMB file-sharing, and NetBIOS. The proxy will also respond to certain packets. This is "non-transparent" because it will make the machine seem asleep for certain protocols that don't fall in their interest range.
It seems like administrator-customized proxy policies might be a good idea; administrator would have their own insights on what is important. For example, I would happily sleep my machine if I could selectively wake it up just for ssh! Perhaps the EECS admins might ask me to have it respond to a few other things.
All-in-all I think this is a really cool paper, I have never read anything about energy saving before these two energy-related papers and I think it's a neat topic. I'd like to hear more about this topic area.
Blog Archive
-
▼
2009
(32)
-
▼
November
(11)
- Skilled in the Art of Being Idle: Reducing Energy ...
- Cutting the Electric Bill for Internet-Scale Systems
- Scalable Application Layer Multicast
- A Reliable Multicast Framework for Light-weight Se...
- NetFPGA: A Tool for Network Research and Education
- A Policy-aware Switching Layer for Data Centers
- Internet Indirection Infrastructure
- DNS Performance and the Effectiveness of Caching
- Development of the Domain Name System
- Chord: A Scalable Peer-to-peer Lookup Service for ...
- Looking Up Data in P2P Systems
-
▼
November
(11)
Friday, November 20, 2009
Wednesday, November 18, 2009
Cutting the Electric Bill for Internet-Scale Systems
This is a REALLY cool paper!
Their idea is to cut a company's electric bill by servicing traffic at data centers with low energy prices. This might mean servicing a host in Boston from a data center in Virginia (rather than NY) if VA is cheaper than NY at that time. "Preferred data centers" would be recalculated every hour based on that hour's regional energy prices. By shifting traffic around based on energy price, the total electricity cost of the system is minimized. They look at energy prices for different regions and note that not all preferences can be pre-determined; sometimes static routing might work (always prefer one region over another, e.g., Chicago vs Virginia), but other times dynamic routing would be preferential (e.g. Boston vs NY). They use a real, large Akamai traffic data set to examine the usefulness of their idea. They find that energy savings could be up to 40% with perfect conditions.
Their idea has four caveats:
(1) Electrical elasticity refers to how much power a server consumes when idle. If idle & active costs are identical, then this plan doesn't work at all. If idle uses 0 power, then you'd have ideal conditions. The reality is somewhere in the middle. Google's elasticity rate is 65%. With perfect elasticity, they find 40% savings; with Google's elasticity rate, they only see 5% savings.
(2) This could increase latency, which might not be tolerable for some applications. Looking at normal Akamai traffic (Akamai is a latency-sensitive CDN), they see that geo-locality is not always adhered to anyway; perhaps geo-locality does not actually correspond to performance. I would be interested in seeing more work on this point.
(3) Increasing the routing path could have the side effect of incurring extra energy expenditures along the routing path. However, they say this is not a big problem because routers energy use is not proportional to traffic.
(4) The 95/5 rule of bandwidth billing generally states that if the 95th percentile of traffic exceeds the committed rate, then the consumer has to pay substantially more for the overage. This means that their work is constrained by the 95/5 rule and they can't increase traffic to one data center beyond that point. The extra bandwidth would be more expensive than the energy they are saving. If the 95/5 rule is adhered to, perfect elasticity only yields 15% savings; without the 95/5 rule, you see their 40% max savings figure. Note that with both the 95/5 rule and 65% elasticity, Google only sees a 2% savings (which is still $1M+).
One question I have is whether this system would be self-affecting. That is, to say: if a large company like Google actually moving its computation to its cheaper DCs, would power costs in those regions increase? (Conversely, if all the DCs in Silicon Valley moved their servers to Oklahoma, would Silicon Valley prices go down?)
------------------
Notes -
Google #s should be higher now. Their machines used to be single-app, however now they are multi-app. Regardless though, utilization is likely under 60%...over 60% you see an increase in latency because of competition for resources.
Their idea is to cut a company's electric bill by servicing traffic at data centers with low energy prices. This might mean servicing a host in Boston from a data center in Virginia (rather than NY) if VA is cheaper than NY at that time. "Preferred data centers" would be recalculated every hour based on that hour's regional energy prices. By shifting traffic around based on energy price, the total electricity cost of the system is minimized. They look at energy prices for different regions and note that not all preferences can be pre-determined; sometimes static routing might work (always prefer one region over another, e.g., Chicago vs Virginia), but other times dynamic routing would be preferential (e.g. Boston vs NY). They use a real, large Akamai traffic data set to examine the usefulness of their idea. They find that energy savings could be up to 40% with perfect conditions.
Their idea has four caveats:
(1) Electrical elasticity refers to how much power a server consumes when idle. If idle & active costs are identical, then this plan doesn't work at all. If idle uses 0 power, then you'd have ideal conditions. The reality is somewhere in the middle. Google's elasticity rate is 65%. With perfect elasticity, they find 40% savings; with Google's elasticity rate, they only see 5% savings.
(2) This could increase latency, which might not be tolerable for some applications. Looking at normal Akamai traffic (Akamai is a latency-sensitive CDN), they see that geo-locality is not always adhered to anyway; perhaps geo-locality does not actually correspond to performance. I would be interested in seeing more work on this point.
(3) Increasing the routing path could have the side effect of incurring extra energy expenditures along the routing path. However, they say this is not a big problem because routers energy use is not proportional to traffic.
(4) The 95/5 rule of bandwidth billing generally states that if the 95th percentile of traffic exceeds the committed rate, then the consumer has to pay substantially more for the overage. This means that their work is constrained by the 95/5 rule and they can't increase traffic to one data center beyond that point. The extra bandwidth would be more expensive than the energy they are saving. If the 95/5 rule is adhered to, perfect elasticity only yields 15% savings; without the 95/5 rule, you see their 40% max savings figure. Note that with both the 95/5 rule and 65% elasticity, Google only sees a 2% savings (which is still $1M+).
One question I have is whether this system would be self-affecting. That is, to say: if a large company like Google actually moving its computation to its cheaper DCs, would power costs in those regions increase? (Conversely, if all the DCs in Silicon Valley moved their servers to Oklahoma, would Silicon Valley prices go down?)
------------------
Notes -
Google #s should be higher now. Their machines used to be single-app, however now they are multi-app. Regardless though, utilization is likely under 60%...over 60% you see an increase in latency because of competition for resources.
Scalable Application Layer Multicast
This paper presents a system, NICE, for application-layer multicast. They look at a class of applications known as "data stream applications", which simultaneously deliver streaming data to large numbers of people. They view multicast as an overlay network, with all the recipients of the multicast communication being the nodes in the overlay. They set out 3 ways to evaluate the "performance" of an application-layer multicast scheme: (1) Quality of the data delivery path using metrics like stress, stretch, and node degrees, (2) Robustness of the overlay, (3) Control overhead.
Their multicast system is hierarchical. End hosts are arranged in layers. The bottom layer (L0) has all hosts in it, organized into small clusters. Hosts are grouped into clusters based on "closeness." Each cluster has a cluster leader who is in the "center"; that leader also belongs to the next layer up (L1). L1 members are clustered into small groups and then each one of those has a cluster leader, who belongs to L2, etc. recursively up to a single node at the top. At the very top of the pyramid/tree is the data provider. The data delivery path goes down the layers, with each cluster leader distributing data to its cluster members.
Members of a cluster exchange "heartbeat" messages. The cluster leader sends out a digest with info about all the cluster members. Members of a cluster tell each other the distance estimate to every other group member. Based on this info, the cluster leader can be changed if the cluster leader is no longer in the "center" based on changing network conditions. The cluster leader will check on cluster size and split a cluster if it gets too big.
When a new node joins the overlay network, first it will contact the the data provider ("RP" for rendezvous point) on the top. The RP will respond with the hosts in the highest layer; the host will contact those hosts to find out which one is the best match. It will then choose one and contact that leader's children to pick among those, until finally it is mapped to a L0 cluster in the hierarchy. Nodes may leave the network either intentionally (with a REMOVE message) or due to failure (in which case their loss is noted by a lack of heartbeats). New cluster leaders may be chosen when nodes enter or leave a cluster.
I understand their overlay network topology and maintenance protocols, but I feel like I'm missing something in terms of how the actual data is transferred. OK, so data goes down the tree. What happens with failures/retransmissions? How are those relayed, who is responsible for them in the hierarchy? Are they just assuming that reliability is being handled beneath the network by TCP? What if one node is very faulty (thereby generating lots of failures/retransmissions), won't nodes higher up in the hierarchy see their bandwidth suffer because the cluster leaders will need to handle the failure/retransmission traffic from the flaky node in L0?
-------------
Notes -
app layer with NICE
----
UDP
IP
Their multicast system is hierarchical. End hosts are arranged in layers. The bottom layer (L0) has all hosts in it, organized into small clusters. Hosts are grouped into clusters based on "closeness." Each cluster has a cluster leader who is in the "center"; that leader also belongs to the next layer up (L1). L1 members are clustered into small groups and then each one of those has a cluster leader, who belongs to L2, etc. recursively up to a single node at the top. At the very top of the pyramid/tree is the data provider. The data delivery path goes down the layers, with each cluster leader distributing data to its cluster members.
Members of a cluster exchange "heartbeat" messages. The cluster leader sends out a digest with info about all the cluster members. Members of a cluster tell each other the distance estimate to every other group member. Based on this info, the cluster leader can be changed if the cluster leader is no longer in the "center" based on changing network conditions. The cluster leader will check on cluster size and split a cluster if it gets too big.
When a new node joins the overlay network, first it will contact the the data provider ("RP" for rendezvous point) on the top. The RP will respond with the hosts in the highest layer; the host will contact those hosts to find out which one is the best match. It will then choose one and contact that leader's children to pick among those, until finally it is mapped to a L0 cluster in the hierarchy. Nodes may leave the network either intentionally (with a REMOVE message) or due to failure (in which case their loss is noted by a lack of heartbeats). New cluster leaders may be chosen when nodes enter or leave a cluster.
I understand their overlay network topology and maintenance protocols, but I feel like I'm missing something in terms of how the actual data is transferred. OK, so data goes down the tree. What happens with failures/retransmissions? How are those relayed, who is responsible for them in the hierarchy? Are they just assuming that reliability is being handled beneath the network by TCP? What if one node is very faulty (thereby generating lots of failures/retransmissions), won't nodes higher up in the hierarchy see their bandwidth suffer because the cluster leaders will need to handle the failure/retransmission traffic from the flaky node in L0?
-------------
Notes -
app layer with NICE
----
UDP
IP
Tuesday, November 17, 2009
A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing
Implementing reliable multicast is not as easy as making slight changes to unicast protocols. With TCP, the sender is responsible for loss detection & recovery. This does not work for multicast because (1) if all recipients constantly send the sender ACKS, then the sender will be overwhelmed with traffic (ACK implosion), and (2) the set of recipients is unknown and may be constantly shifting. Essentially, the issue is that multicast does not imply "fate-sharing", whereas unicast does. The authors conclude that receiver-based reliability is a better approach to reliable multicast. The authors also note that the use of communication state (sequence numbers) for naming data does not work well for multicast because state synchronization is weaker; instead, the authors advocate "naming in application data units (ADUs)".
This paper (from 1995) presents Scalable Reliable Multicast, a framework that provides reliable (but not necessarily ordered) multicast delivery over IP. They suggest that ordered delivery and other application-specific details could be built on top of SRM. If I understand it correctly, they are splitting the transport layer in half: SRM would be on the bottom of the transport layer, and then some number of ordering protocols would be on top of it. In my mind, this is equivalent to making the unicast assumptions of UDP and TCP explicit and then building UDP and TCP on top of the unicast framework.
-- Note that I did not understand their description of "Application-level framing". Do they simply mean that there need to be many different transport layer protocols for different types of applications (much like there is UDP vs TCP), or do they mean that you shouldn't really have anything on the transport layer, and applications should be handling this directly?
They use SRM with a whiteboard conferencing tool named "wb" from LBNL. Many users (maybe thousands) simultaneously use a wb instance. Wb *does not* require ordered delivery because its operations are either idempotent or incorrect ordering can be easily fixed. Every change to the whiteboard is multicast to all of the users of a whiteboard, and whiteboard users (the system receivers) are responsible for their own loss and retransmission. If a receiver detects a loss, it schedules a request timer that is drawn from a uniform distribution based on the receiver's estimate of the one-way delay to the source of the data. The request is then multicast, as is the response to the request. If a receiver sees *another* request for the same data before its own request timer expires, then t will do a random exponential backoff and reset its timer. If it gets the missing data, it will cancel its request timer altogether. This provides reliability without the ACK implosion. The point of the randomization & timer scheduling based on location is to de-synchronize host actions to prevent unnecessary duplicates.
They ran some simulations of their request & repair algorithms on small contrived topologies. They say the point of the simulations is to look at the performance for individual packet drops; "We do not claim to be presenting realistic topologies or typical patterns of packet loss." In their scenarios, they show that request & repair algorithms work well in random and bounded-degree trees when every node is a member of the multicast session; it doesn't work as well in a sparse session. This is a bit concerning since the sparse session is the expected real-world scenario, so they make some changes to their request & repair algorithms. Notably, they add adaptive adjustment of timer parameters based on the past behavior of the algorithms in the multicast session. If the average number of dupe requests is high, then the request timer interval increases; if the average number is OK but the delay in sending a request is too high, they decrease the request timer interval. They then do some new simulations and show that that adaptive algorithm works better ("well") under many conditions. However, in their simulations, note that none of the requests or repairs are dropped, which you would see in the real world.
-- Honestly, I am not really sure what the value of their simulations is.
They also suggest (but do not implement or simulate) "local recovery". The idea is that a receiver belongs to a "loss neighborhood" where a set of nodes are all experiencing the same losses. Receivers learn about their loss neighborhoods from session messages (which are sent out periodically). Requests & repairs could be broadcast only to a loss neighborhood, thereby reducing the overall bandwidth used.
Different apps have different tradeoffs between delay and bandwidth. wb wants to minimize delay but isn't worried about bandwidth, whereas something like Usenet would want to minimize bandwidth without concern for delay. This is a little confusing to me -- unlike ordering, these properties aren't something you can build "on top" of SRM. So I am back to my original confusion of whether this is meant to be transport-layer or part of an application itself.
-----------
SRM - transport
IP - routing
This paper (from 1995) presents Scalable Reliable Multicast, a framework that provides reliable (but not necessarily ordered) multicast delivery over IP. They suggest that ordered delivery and other application-specific details could be built on top of SRM. If I understand it correctly, they are splitting the transport layer in half: SRM would be on the bottom of the transport layer, and then some number of ordering protocols would be on top of it. In my mind, this is equivalent to making the unicast assumptions of UDP and TCP explicit and then building UDP and TCP on top of the unicast framework.
-- Note that I did not understand their description of "Application-level framing". Do they simply mean that there need to be many different transport layer protocols for different types of applications (much like there is UDP vs TCP), or do they mean that you shouldn't really have anything on the transport layer, and applications should be handling this directly?
They use SRM with a whiteboard conferencing tool named "wb" from LBNL. Many users (maybe thousands) simultaneously use a wb instance. Wb *does not* require ordered delivery because its operations are either idempotent or incorrect ordering can be easily fixed. Every change to the whiteboard is multicast to all of the users of a whiteboard, and whiteboard users (the system receivers) are responsible for their own loss and retransmission. If a receiver detects a loss, it schedules a request timer that is drawn from a uniform distribution based on the receiver's estimate of the one-way delay to the source of the data. The request is then multicast, as is the response to the request. If a receiver sees *another* request for the same data before its own request timer expires, then t will do a random exponential backoff and reset its timer. If it gets the missing data, it will cancel its request timer altogether. This provides reliability without the ACK implosion. The point of the randomization & timer scheduling based on location is to de-synchronize host actions to prevent unnecessary duplicates.
They ran some simulations of their request & repair algorithms on small contrived topologies. They say the point of the simulations is to look at the performance for individual packet drops; "We do not claim to be presenting realistic topologies or typical patterns of packet loss." In their scenarios, they show that request & repair algorithms work well in random and bounded-degree trees when every node is a member of the multicast session; it doesn't work as well in a sparse session. This is a bit concerning since the sparse session is the expected real-world scenario, so they make some changes to their request & repair algorithms. Notably, they add adaptive adjustment of timer parameters based on the past behavior of the algorithms in the multicast session. If the average number of dupe requests is high, then the request timer interval increases; if the average number is OK but the delay in sending a request is too high, they decrease the request timer interval. They then do some new simulations and show that that adaptive algorithm works better ("well") under many conditions. However, in their simulations, note that none of the requests or repairs are dropped, which you would see in the real world.
-- Honestly, I am not really sure what the value of their simulations is.
They also suggest (but do not implement or simulate) "local recovery". The idea is that a receiver belongs to a "loss neighborhood" where a set of nodes are all experiencing the same losses. Receivers learn about their loss neighborhoods from session messages (which are sent out periodically). Requests & repairs could be broadcast only to a loss neighborhood, thereby reducing the overall bandwidth used.
Different apps have different tradeoffs between delay and bandwidth. wb wants to minimize delay but isn't worried about bandwidth, whereas something like Usenet would want to minimize bandwidth without concern for delay. This is a little confusing to me -- unlike ordering, these properties aren't something you can build "on top" of SRM. So I am back to my original confusion of whether this is meant to be transport-layer or part of an application itself.
-----------
SRM - transport
IP - routing
Thursday, November 12, 2009
NetFPGA: A Tool for Network Research and Education
(paper)
This paper is about a resource for teaching/learning about networks. The idea is to expose students to L1 and L2 technologies. It's also possible to use for research for cheaply and easily building networks.
This paper is about a resource for teaching/learning about networks. The idea is to expose students to L1 and L2 technologies. It's also possible to use for research for cheaply and easily building networks.
Tuesday, November 10, 2009
A Policy-aware Switching Layer for Data Centers
(paper)
Data centers try to route their traffic through middleboxes (firewalls, load balancers, NATs, etc). They typically do this by putting them in-line on the network topology and trying to ensure there are no other paths. Sometimes some network traffic doesn't need to go through a particular box but it will anyway because that's how they've set up the topology. This paper proposes a L2 way to accomplish this better; they call it "the policy-aware switching layer", aka PLayer. The properties they want are correctness (can't circumvent), flexibility, and efficiency. pswitches classify traffic and redirect it to the appropriate middlebox. Middleboxes do not need to be modified, but switches would need to be changed. There is a central policy server that sends policies to all the pswitches so they know how to classify and direct traffic.
They implemented prototype pswitches using Click, and then "validated" its functionality on the DETER testbed. They make the disclaimer about their work being a prototype which of course tips you off that their performance is bad: 40% of normal throughput with 2x the latency. Owch!!!
They then provide a formal analysis that I do not buy at all. I personally think that a formal analysis of a system like this is BS. They are just rephrasing what they have said earlier in the paper with subscripts.
Interesting paper but honestly they went into too much detail and lost my attention in places. This might be because this is a tech report. In a few places it seemed like they were unnecessarily belaboring obvious points that I didn't need convincing about.
Data centers try to route their traffic through middleboxes (firewalls, load balancers, NATs, etc). They typically do this by putting them in-line on the network topology and trying to ensure there are no other paths. Sometimes some network traffic doesn't need to go through a particular box but it will anyway because that's how they've set up the topology. This paper proposes a L2 way to accomplish this better; they call it "the policy-aware switching layer", aka PLayer. The properties they want are correctness (can't circumvent), flexibility, and efficiency. pswitches classify traffic and redirect it to the appropriate middlebox. Middleboxes do not need to be modified, but switches would need to be changed. There is a central policy server that sends policies to all the pswitches so they know how to classify and direct traffic.
They implemented prototype pswitches using Click, and then "validated" its functionality on the DETER testbed. They make the disclaimer about their work being a prototype which of course tips you off that their performance is bad: 40% of normal throughput with 2x the latency. Owch!!!
They then provide a formal analysis that I do not buy at all. I personally think that a formal analysis of a system like this is BS. They are just rephrasing what they have said earlier in the paper with subscripts.
Interesting paper but honestly they went into too much detail and lost my attention in places. This might be because this is a tech report. In a few places it seemed like they were unnecessarily belaboring obvious points that I didn't need convincing about.
Monday, November 9, 2009
Internet Indirection Infrastructure
The Internet is built to provide point-to-point, unicast service: the sender sends a packet to one addressee. IP routers in the middle will direct the packet to the addressee. This paper discusses how to support multicast, anycast, and host mobility: the sender doesn't necessarily know the identities and/or location of the receiving hosts. They propose an overlay network (Internet Indirection Infrastructure aka i3) to support these tasks.
i3 decouples sending & receiving. Sources send packets to an identifier, and receivers can ask for all packets sent to that identifier. So --- anytime something is sent to a given identifier, it is distributed to everyone who has registered for that identifier. They map each identifier to an i3 server; there's one node for each ID. When a packet is sent, it goes to that server; then that server distributes it to all hosts registered to that ID. This is useful for mobility because triggers just need to be updated with the new address; multicast can be done by all of the recipients registering for the trigger; and honestly I am not sure I follow their explanation of how to apply this to anycast. Public triggers could be used for public severs, and private triggers can be used for short-term flows.
-- If there's only one node for each ID, how does this work in case of failures? Isn't non-replicated "hard state" generally a bad idea? They later say that end-hosts use periodic refreshing...which makes it soft state...but then your outage time would be the same length as your refresh period. If you are not a fault-tolerant application, do you want to be refreshing every 50ms? OK...so then they try to mitigate this later in the paper by saying (1) an application can maintain a backup trigger or (2) the overlay network can automatically manage replicas.
The underlying overlay is responsible for robustness, scalability, efficiency, and stability. They use the Chord lookup protocol. One thing to note is that routing using i3 is less efficient than point-to-point addressing. They can alleviate this by having senders cache the i3 server's IP address. You can still run into the "triangle problem" where the i3 server ends up being farther away than the recipient nodes; they say this can be solved by having receivers choose i3 servers close to them. This could worsen scalability problems...seems like certain i3 nodes could easily end up overwhelmed. They suggest hot spots could be avoided by replicating the trigger at multiple nodes.
Potential security problems....
-- Eavesdropping: anyone who knows a trigger can eavesdrop. They suggest that private triggers should be randomly chosen & kept secret. Plus periodically change it. Not sure if eavesdropping is actually a huge concern given that most people assume eavesdropping on the network is possible in other ways.
-- Trigger hijacking: a malicious user that removes a public trigger. They suggest an extra level of indirection.
-- DoS attacks by making lots of triggers that point to a victim & amplify the sending of a single packet.
i3 decouples sending & receiving. Sources send packets to an identifier, and receivers can ask for all packets sent to that identifier. So --- anytime something is sent to a given identifier, it is distributed to everyone who has registered for that identifier. They map each identifier to an i3 server; there's one node for each ID. When a packet is sent, it goes to that server; then that server distributes it to all hosts registered to that ID. This is useful for mobility because triggers just need to be updated with the new address; multicast can be done by all of the recipients registering for the trigger; and honestly I am not sure I follow their explanation of how to apply this to anycast. Public triggers could be used for public severs, and private triggers can be used for short-term flows.
-- If there's only one node for each ID, how does this work in case of failures? Isn't non-replicated "hard state" generally a bad idea? They later say that end-hosts use periodic refreshing...which makes it soft state...but then your outage time would be the same length as your refresh period. If you are not a fault-tolerant application, do you want to be refreshing every 50ms? OK...so then they try to mitigate this later in the paper by saying (1) an application can maintain a backup trigger or (2) the overlay network can automatically manage replicas.
The underlying overlay is responsible for robustness, scalability, efficiency, and stability. They use the Chord lookup protocol. One thing to note is that routing using i3 is less efficient than point-to-point addressing. They can alleviate this by having senders cache the i3 server's IP address. You can still run into the "triangle problem" where the i3 server ends up being farther away than the recipient nodes; they say this can be solved by having receivers choose i3 servers close to them. This could worsen scalability problems...seems like certain i3 nodes could easily end up overwhelmed. They suggest hot spots could be avoided by replicating the trigger at multiple nodes.
Potential security problems....
-- Eavesdropping: anyone who knows a trigger can eavesdrop. They suggest that private triggers should be randomly chosen & kept secret. Plus periodically change it. Not sure if eavesdropping is actually a huge concern given that most people assume eavesdropping on the network is possible in other ways.
-- Trigger hijacking: a malicious user that removes a public trigger. They suggest an extra level of indirection.
-- DoS attacks by making lots of triggers that point to a victim & amplify the sending of a single packet.
Thursday, November 5, 2009
DNS Performance and the Effectiveness of Caching
[paper]
As mentioned in the previous paper, DNS servers aggressively cache address data. The previous paper also suggests that negative caching could give a performance boost. This paper questions and tests the effectiveness of both of these mechanisms using DNS and associated TCP traffic from MIT CSAIL and KAIST, circa 2000 and 2001.
Background: Many servers ("stub servers" or "stub resolvers") do nothing but cache responses and act as proxies for resolvers (they will query another server if they do not already have the data cached). (Stub servers answer "recursive queries", whereas authoritative servers receive "iterative queries.")
They collected outgoing DNS queries, incoming DNS responses, and TCP session start and end packets. Notably they cannot observe DNS lookups cached inside their network (eg on the client itself). They removed all TCP sessions that didn't first generate a TCP lookup; I assume this removes the effect of client-cached addresses. They also removed all DNS A-record lookups not part of a TCP session; I don't understand why not or how this skews their findings, and this bothers me.
Their results:
-- About 80% of DNS lookups don't require a referral; this means the first NS contacted has the answer.
-- Total number of DNS queries is much higher than DNS lookups, meaning that many lookups require query retransmission. Despite retransmissions, about 20% of clients never got any answer (not even an error). They suggest that it is better to give up sooner than to keep retransmitting, and let the application figure out what else to do. Between 12-19% of their lookups did not have any retransmissions. Aggressive retransmissions can cause a lot of traffic -- 63% of all DNS queries they saw were for lookups that never received a response!
-- 13% of client lookups result in errors (most say the name does not exist).
-- Distribution of both successful and failed DNS requests are heavy-tailed. For one set of traces, 10% of names account for 68% of answers but then the remaining 32% are very distributed.
-- They do believe that NS record caching is important. Only 20% of responses came from a root server, which means that if TTLs on NS records were shorter, the root server load would increase by 5x.
Fact to remember: Address (A) records store IP addresses, and NS records say what name server is responsible for an IP address
---------------
Class Notes:
-- Top sites accessed a lot, but low TTL; then many sites accessed infrequently. Reminds me of the 80:20 rule.
-- Key point is to stop with aggressive retransmissions -- if multiple attempts get no response, give up because it's not helping.
As mentioned in the previous paper, DNS servers aggressively cache address data. The previous paper also suggests that negative caching could give a performance boost. This paper questions and tests the effectiveness of both of these mechanisms using DNS and associated TCP traffic from MIT CSAIL and KAIST, circa 2000 and 2001.
Background: Many servers ("stub servers" or "stub resolvers") do nothing but cache responses and act as proxies for resolvers (they will query another server if they do not already have the data cached). (Stub servers answer "recursive queries", whereas authoritative servers receive "iterative queries.")
They collected outgoing DNS queries, incoming DNS responses, and TCP session start and end packets. Notably they cannot observe DNS lookups cached inside their network (eg on the client itself). They removed all TCP sessions that didn't first generate a TCP lookup; I assume this removes the effect of client-cached addresses. They also removed all DNS A-record lookups not part of a TCP session; I don't understand why not or how this skews their findings, and this bothers me.
Their results:
-- About 80% of DNS lookups don't require a referral; this means the first NS contacted has the answer.
-- Total number of DNS queries is much higher than DNS lookups, meaning that many lookups require query retransmission. Despite retransmissions, about 20% of clients never got any answer (not even an error). They suggest that it is better to give up sooner than to keep retransmitting, and let the application figure out what else to do. Between 12-19% of their lookups did not have any retransmissions. Aggressive retransmissions can cause a lot of traffic -- 63% of all DNS queries they saw were for lookups that never received a response!
-- 13% of client lookups result in errors (most say the name does not exist).
-- Distribution of both successful and failed DNS requests are heavy-tailed. For one set of traces, 10% of names account for 68% of answers but then the remaining 32% are very distributed.
-- They do believe that NS record caching is important. Only 20% of responses came from a root server, which means that if TTLs on NS records were shorter, the root server load would increase by 5x.
Fact to remember: Address (A) records store IP addresses, and NS records say what name server is responsible for an IP address
---------------
Class Notes:
-- Top sites accessed a lot, but low TTL; then many sites accessed infrequently. Reminds me of the 80:20 rule.
-- Key point is to stop with aggressive retransmissions -- if multiple attempts get no response, give up because it's not helping.
Development of the Domain Name System
[paper]
This paper tells the story of the invention of DNS. Prior to DNS, every host had a plain text file named HOSTS.TXT that contained addressing information for all places they might want to reach. Clearly this was not scalable.
DNS has two components: name servers and resolvers. Name servers hold the addressing information & answer queries, and resolvers find name servers. Data for each DNS name is a "resource record" (RR) that carries a type, class field, and data. There are two mechanisms for transferring data from source to destination: zones and caching. A zone is a section of the tree controlled, updated, and maintained by a single organization. Resolvers (which can be centrally located or on a host itself) cache responses for each zone for as long as the TTL for that zone. A resolver is configured to point at servers for the root node and the top of the local domain. It then searches "downwards."
Surprises they encountered when DNS use became widespread:
-- Refinemenet of semantics. The form of data specification ended up confusing, eg how to order multiple IP addresses for a single host.
-- Performance. Initial performance was much worse than they expected. Root servers would get multiple copies of the same query from the same source because their responses were slower than resend timers. Measuring DNS performance precisely, however, proved problematic because of gateway changes, software releases, etc.
-- Negative caching. DNS can respond negatively to a query in 2 ways: name does not exist, or name exists but they don't have data for it. They expected negative responses to be rare but they ended up being extremely popular (20-60% of all queries) because of transition confusion (use of old addresses and shortcuts, etc). They expected it would drop off but it actually stayed between 15-50%. This is because a user would type one bad address and then their application would automatically try other potential addresses, in the process making a bunch of negative queries. They suggest that negative queries should be cached like positive queries to address this issue. I'm not sure what kind of a TTL should be put on negative query caches; perhaps there could be a force update command that would clear a negative query cache when that domain name got registered.
Their successes:
-- variable depth hierarchy
-- organizational structuring of names
-- datagram access
-- caching proved crucial!!
Shortcomings:
-- type & class growth -- either demand for new types & classes was misunderstood, or DNS makes new definitions too hard
-- upgrading from HOSTS.TXT -> DNS was hard, but seems like that isn't a long-term issue
-- distribution of control
I like this paper a lot, especially the "surprises" section. The negative caching is really interesting, I would never have guessed that.
Related paper that might be interesting: Protecting Browsers from DNS Rebinding Attacks. This paper is interesting, but I've never fully understood how it works. The basic idea is that Attacker.com wants to try to gain access to an internal resource (eg iternal.corporate.com). The attacker answers DNS queries for attacker.com with a really short TTL. After the user navigates correctly to attacker.com, the attacker then starts answering DNS queries for attacker.com with the IP address for internal.corporate.com (eg 10.10.10.105). A script on the loaded attacker.com website will issue a request for materials at attacker.com; this will then actually point to 10.10.10.105 and the attacker will successfully get the content. (Normally this operation should be denied, because attacker.com and internal.corporate.com are not the same domain.) The part I am confused about here is the administration of DNS. Why can any arbitrary person claim that any random IP address belongs to their domain name?
---------------
Class notes:
-- DNS is eventually consistent
-- Caching: scalability, reliability in outages, speed up flookup
-- Caching done in client resolution software + also in name servers
This paper tells the story of the invention of DNS. Prior to DNS, every host had a plain text file named HOSTS.TXT that contained addressing information for all places they might want to reach. Clearly this was not scalable.
DNS has two components: name servers and resolvers. Name servers hold the addressing information & answer queries, and resolvers find name servers. Data for each DNS name is a "resource record" (RR) that carries a type, class field, and data. There are two mechanisms for transferring data from source to destination: zones and caching. A zone is a section of the tree controlled, updated, and maintained by a single organization. Resolvers (which can be centrally located or on a host itself) cache responses for each zone for as long as the TTL for that zone. A resolver is configured to point at servers for the root node and the top of the local domain. It then searches "downwards."
Surprises they encountered when DNS use became widespread:
-- Refinemenet of semantics. The form of data specification ended up confusing, eg how to order multiple IP addresses for a single host.
-- Performance. Initial performance was much worse than they expected. Root servers would get multiple copies of the same query from the same source because their responses were slower than resend timers. Measuring DNS performance precisely, however, proved problematic because of gateway changes, software releases, etc.
-- Negative caching. DNS can respond negatively to a query in 2 ways: name does not exist, or name exists but they don't have data for it. They expected negative responses to be rare but they ended up being extremely popular (20-60% of all queries) because of transition confusion (use of old addresses and shortcuts, etc). They expected it would drop off but it actually stayed between 15-50%. This is because a user would type one bad address and then their application would automatically try other potential addresses, in the process making a bunch of negative queries. They suggest that negative queries should be cached like positive queries to address this issue. I'm not sure what kind of a TTL should be put on negative query caches; perhaps there could be a force update command that would clear a negative query cache when that domain name got registered.
Their successes:
-- variable depth hierarchy
-- organizational structuring of names
-- datagram access
-- caching proved crucial!!
Shortcomings:
-- type & class growth -- either demand for new types & classes was misunderstood, or DNS makes new definitions too hard
-- upgrading from HOSTS.TXT -> DNS was hard, but seems like that isn't a long-term issue
-- distribution of control
I like this paper a lot, especially the "surprises" section. The negative caching is really interesting, I would never have guessed that.
Related paper that might be interesting: Protecting Browsers from DNS Rebinding Attacks. This paper is interesting, but I've never fully understood how it works. The basic idea is that Attacker.com wants to try to gain access to an internal resource (eg iternal.corporate.com). The attacker answers DNS queries for attacker.com with a really short TTL. After the user navigates correctly to attacker.com, the attacker then starts answering DNS queries for attacker.com with the IP address for internal.corporate.com (eg 10.10.10.105). A script on the loaded attacker.com website will issue a request for materials at attacker.com; this will then actually point to 10.10.10.105 and the attacker will successfully get the content. (Normally this operation should be denied, because attacker.com and internal.corporate.com are not the same domain.) The part I am confused about here is the administration of DNS. Why can any arbitrary person claim that any random IP address belongs to their domain name?
---------------
Class notes:
-- DNS is eventually consistent
-- Caching: scalability, reliability in outages, speed up flookup
-- Caching done in client resolution software + also in name servers
Wednesday, November 4, 2009
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
[paper]
This paper presents Chord, a distributed lookup protocol. (Their routing algorithm was described as "skiplist-like routing" by the previous paper.) They use "consistent hashing" to assign IDs to nodes, in an attempt to load balance key assignments over nodes. One of their goals is scalability: each node should only need to know about a small number of other nodes. Also, they need to handle nodes leaving and entering the network without complete degradation of performance.
Each key has a "successor node", which is the node whose ID is equal to or most closely follows the key's ID. When a node joins the network, keys previously assigned to other nodes now become assigned to n if the new node's ID is closer to theirs. When a node leaves the network, its keys are reassigned to the next closest. Nodes maintain the ID of the next successor as well as a routing table that helps find a node closer to the key during a lookup propagation operation.
This paper presents Chord, a distributed lookup protocol. (Their routing algorithm was described as "skiplist-like routing" by the previous paper.) They use "consistent hashing" to assign IDs to nodes, in an attempt to load balance key assignments over nodes. One of their goals is scalability: each node should only need to know about a small number of other nodes. Also, they need to handle nodes leaving and entering the network without complete degradation of performance.
Each key has a "successor node", which is the node whose ID is equal to or most closely follows the key's ID. When a node joins the network, keys previously assigned to other nodes now become assigned to n if the new node's ID is closer to theirs. When a node leaves the network, its keys are reassigned to the next closest. Nodes maintain the ID of the next successor as well as a routing table that helps find a node closer to the key during a lookup propagation operation.
Looking Up Data in P2P Systems
P2P networks raise the "lookup problem": how can you locate data on the P2P network without a centralized server or hierarchy? Simplest form of lookup: searcher broadcasts a message to all neighbors with a request for X. Each neighbor checks its local database; if it doesn't have the item, it forwards to its neighbors, and repeat. However, all these broadcast messages will take up a lot of bandwidth.
Distributed hash tables (DHTs) offer a solution. In a DHT, all keys & nodes haves IDs. Each key is stored at a node with an ID that's "close" to the ID's key. Each node sends a lookup request to another node "closer" in the ID space until the lookup request gets to the right place. However, nodes need a routing algorithm to be able to *find* a closer node.
-- Skiplist-like routing: Each node has a table of node IP & ID pairs of nodes whose IDs are 1/2 around the ID space, 1/4 around, 1/8 around, etc. A searcher node will send a lookup request to the node with the closest ID that does not exceed the key ID.
-- Tree-like routing: Each node has a randomly assigned ID, and each one knows a node with a certain "prefix". It uses knowledge of the tree structure to find nodes.
-- Multi-dimensional routing: Not sure I quite follow this. D-dimensional Cartesian coordinate space partitioned into hyper-rectangles (zones), and each node has a zone & identified by the boundaries of its zone. A key gets a point in the d-dimensional space and is stored in the node that owns the key's point's zone. The lookup follows the straight line path from the searcher to the key.
I found this paper interesting because I've been wondering for a while how DHTs work, and this is a review of what seem like important topics for understanding anything about DHTs in more detail. I really like that this paper was on the reading list.
Distributed hash tables (DHTs) offer a solution. In a DHT, all keys & nodes haves IDs. Each key is stored at a node with an ID that's "close" to the ID's key. Each node sends a lookup request to another node "closer" in the ID space until the lookup request gets to the right place. However, nodes need a routing algorithm to be able to *find* a closer node.
-- Skiplist-like routing: Each node has a table of node IP & ID pairs of nodes whose IDs are 1/2 around the ID space, 1/4 around, 1/8 around, etc. A searcher node will send a lookup request to the node with the closest ID that does not exceed the key ID.
-- Tree-like routing: Each node has a randomly assigned ID, and each one knows a node with a certain "prefix". It uses knowledge of the tree structure to find nodes.
-- Multi-dimensional routing: Not sure I quite follow this. D-dimensional Cartesian coordinate space partitioned into hyper-rectangles (zones), and each node has a zone & identified by the boundaries of its zone. A key gets a point in the d-dimensional space and is stored in the node that owns the key's point's zone. The lookup follows the straight line path from the searcher to the key.
I found this paper interesting because I've been wondering for a while how DHTs work, and this is a review of what seem like important topics for understanding anything about DHTs in more detail. I really like that this paper was on the reading list.
Subscribe to:
Posts (Atom)
About Me
- Adrienne
- Berkeley EECS PhD student