APF's Network Summaries: Resilient Overlay Networks

[paper]

Problem: BGP's fault recovery takes several minutes because new routes are slow to converge, so outages can last 10 minutes or more. That is a problem for distributed applications.

Solution: RON, a Resilient Overlay Network. A RON node constantly monitors the quality of the path ("virtual link") between it and every other RON node in terms of latency, throughput, and packet loss rate. RONs are meant to be small so that each node can keep track of alternate routes as well as primary routes. Each RON node is a forwarder as well as a client. Interesting point: some applications are more fault tolerant than others, so applications can define their own "failure" settings and prioritize certain metrics over others.
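To make the "applications pick their own metric" idea concrete, here is a minimal sketch, not RON's actual code. It assumes each node holds a table of per-link latency and loss, and that a path is either direct or goes through one other RON node. The names `LinkStats`, `path_stats`, and `best_path` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class LinkStats:
    latency_ms: float
    loss_rate: float  # fraction of packets lost, in [0, 1]


Links = Dict[Tuple[str, str], LinkStats]


def path_stats(links: Links, path: List[str]) -> LinkStats:
    """Combine per-link stats: latencies add, delivery probabilities multiply."""
    latency = 0.0
    delivered = 1.0
    for a, b in zip(path, path[1:]):
        stats = links[(a, b)]
        latency += stats.latency_ms
        delivered *= 1.0 - stats.loss_rate
    return LinkStats(latency, 1.0 - delivered)


def best_path(links: Links, nodes: List[str], src: str, dst: str,
              cost: Callable[[LinkStats], float]) -> List[str]:
    """Pick the cheapest of the direct path and every one-intermediate path.

    `cost` is the application's own metric (lower is better).
    Raises ValueError if no candidate path has measurements for all its links.
    """
    candidates = [[src, dst]]
    candidates += [[src, mid, dst] for mid in nodes if mid not in (src, dst)]
    usable = [p for p in candidates
              if all((a, b) in links for a, b in zip(p, p[1:]))]
    return min(usable, key=lambda p: cost(path_stats(links, p)))


# Hypothetical measurements: the direct A->B link is fast but lossy.
links = {
    ("A", "B"): LinkStats(latency_ms=20.0, loss_rate=0.30),
    ("A", "C"): LinkStats(latency_ms=30.0, loss_rate=0.01),
    ("C", "B"): LinkStats(latency_ms=30.0, loss_rate=0.01),
}
nodes = ["A", "B", "C"]

latency_choice = best_path(links, nodes, "A", "B", lambda s: s.latency_ms)
loss_choice = best_path(links, nodes, "A", "B", lambda s: s.loss_rate)
```

With these made-up numbers, a latency-sensitive application stays on the direct path while a loss-sensitive one detours through C, which is the whole point of letting the application supply the metric.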

- Fast detection & recovery. They want to recover from link or route failures in seconds. When a probe is lost, a series of high-frequency probes is sent. If the link is dead or its bandwidth is degraded by 50%+, and a good alternative path is available, the node sends data over the alternative path. A potential problem is "flapping" between two routes; you don't want to switch frequently because that could cause a lot of packet reordering. To avoid this, they favor the "last good route."
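The detection and anti-flapping logic above can be sketched roughly as follows. This is my own simplification, not the paper's implementation: `probe_link`, `choose_route`, the number of follow-up probes, and the 50% switching margin are all illustrative assumptions, and route quality is collapsed into a single score where higher is better and 0 means dead.

```python
from typing import Callable, Dict, Optional


def probe_link(send_probe: Callable[[], bool], follow_ups: int = 4) -> bool:
    """Return True if the link looks alive.

    A lost probe triggers up to `follow_ups` extra probes (sent at high
    frequency in a real system); the link is declared dead only if all
    of them are lost too. `follow_ups=4` is an illustrative value.
    """
    if send_probe():
        return True
    return any(send_probe() for _ in range(follow_ups))


def choose_route(current: Optional[str], scores: Dict[str, float],
                 min_gain: float = 0.5) -> str:
    """Favor the last good route to avoid flapping.

    `scores` maps route name -> quality (higher is better, 0 means dead).
    Stay on `current` unless it is dead or some other route is at least
    `min_gain` (here 50%, an assumed margin) better.
    """
    best = max(scores, key=lambda r: scores[r])
    cur_score = scores.get(current, 0.0) if current is not None else 0.0
    if cur_score <= 0.0:
        return best                      # current route is dead or unset
    if scores[best] >= cur_score * (1.0 + min_gain):
        return best                      # clearly better alternative
    return current                       # small gains are not worth reordering
```

For example, `choose_route("direct", {"direct": 10.0, "via_C": 12.0})` keeps `"direct"`, because a 20% gain is below the margin, while a dead direct route (score 0) switches immediately.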

Evaluation: They looked at two RON networks, one with 12 nodes and one with 16. In an experiment averaging packet loss rate over 30-minute intervals, they showed that BGP routing suffered 32 significant outages while RON suffered none. They also present RON Win/Loss/No change statistics based on loss rates, and by that measure RON seems to do very well.

An interesting point about this paper is that they assume RONs will be small. I question whether this makes sense given that they are targeting distributed applications. Does Google do distributed computing across different data centers? That would be a much larger scale. I guess Google has its own fiber-optic network, though, so it wouldn't need to worry about these BGP-based routing concerns. But there must be large distributed applications that don't have their own networks, too.

--------------------

Class Notes

- overlay: network on top of a network
- application-specific routing means that you get to pick your own routing metric
- RON's concern is reachability, period, not latency.
- BGP doesn't broadcast alternate routes between ASes. this is how RON gets its improvement: it can identify a path through a different set of ASes that avoids the problem. this means that you need one overlay node in each AS that you care about; two nodes in the same AS won't see any improvement.
- overlays over overlays cause fairness problems. if they all see an area with no congestion and try to use it at once, then all of a sudden that will be congested.
- scalability issue with RON because of their active probing. won't scale up to more than 50 nodes.
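A quick back-of-the-envelope on why active probing limits the size: if every node probes every other node, the number of probes per round grows quadratically with the number of nodes. The sketch below just does that arithmetic; the probe size and probing interval are parameters I made up, not values from the paper.

```python
def probes_per_round(n: int) -> int:
    """Each of n nodes probes the other n - 1 nodes once per round."""
    return n * (n - 1)


def probe_bps_per_node(n: int, probe_bytes: int, interval_s: float) -> float:
    """Outgoing probe traffic per node, in bits per second.

    `probe_bytes` and `interval_s` are placeholders for whatever a real
    deployment would use.
    """
    return (n - 1) * probe_bytes * 8 / interval_s


for n in (12, 16, 50):
    print(n, probes_per_round(n))
```

The two evaluated networks need 132 and 240 probes per round; at 50 nodes it is 2450. This counts probes only, not the cost of sharing the measurements among nodes.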


Wednesday, October 28, 2009
