Thursday, September 3, 2009

Understanding BGP Misconfiguration

[Paper link]

This paper presents a study of BGP misconfiguration problems. The authors watched route advertisements from 23 vantage points, checked them for incorrect information, and then contacted the ISPs responsible for incorrect announcements to find out what had caused each problem. They look at two types of problems: 1) "origin misconfiguration", the accidental injection of a prefix into the global BGP tables (e.g., the Pakistan Telecom/YouTube incident), and 2) "export misconfiguration", exporting a route to a neighbor in violation of the ISP's policy. (From the previous paper, one should recall that ISPs have a financial motivation for withholding some of the routes in their tables.)
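
To make the first category concrete, here is a minimal Python sketch of what an origin check could look like. This is just my illustration, not the paper's detection method (they flagged suspects in the update stream and then confirmed the cause with the operators); the EXPECTED_ORIGIN table and the AS paths are made up for the example, loosely modeled on the 2008 YouTube incident.

    # Hypothetical origin-misconfiguration check: compare the origin AS of an
    # announcement against a table of expected origins. The mapping and paths
    # below are illustrative, not real registry data.
    EXPECTED_ORIGIN = {
        "208.65.152.0/22": 36561,   # a YouTube prefix and its origin AS (illustrative)
    }

    def origin_as(as_path):
        # In BGP, the origin AS is the last AS on the AS_PATH attribute.
        return as_path[-1]

    def check_announcement(prefix, as_path):
        expected = EXPECTED_ORIGIN.get(prefix)
        if expected is not None and origin_as(as_path) != expected:
            return "possible origin misconfiguration"
        return "looks fine"

    # An announcement of that prefix originated by AS 17557 (Pakistan Telecom),
    # roughly the shape of the 2008 incident, gets flagged:
    print(check_announcement("208.65.152.0/22", [3491, 17557]))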

They determine whether an event in the BGP update stream is a misconfiguration based on its duration: if a policy suddenly changes and is quickly changed back, that is indicative of a misconfiguration rather than a policy shift, so they consider only changes that last less than a day. They can miss misconfigurations that go undetected by the ISP for a while, but I suppose those couldn't be too important if nobody noticed them. They can also miscategorize a legitimate short-lived policy shift as a misconfiguration, but their follow-up communication with the ISP about the cause should set the record straight. They also limit the scope to the two types listed above, because they aren't able to detect every kind of misconfiguration (e.g., the inverse of #2, filtering a route that should have been exported, looks identical to a failed node or link). They observed misconfigurations for 21 days.
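
As a rough illustration of that duration heuristic (my own sketch, not the authors' pipeline; the timestamps and prefixes are invented):

    # Classify a (prefix, origin) event by how long it stayed in the table:
    # short-lived changes are treated as misconfiguration suspects, to be
    # confirmed by asking the ISP; longer-lived ones are assumed intentional.
    ONE_DAY = 24 * 60 * 60  # seconds

    def classify(announced_at, withdrawn_at):
        if withdrawn_at is None:
            return "persistent (assumed intentional)"
        if withdrawn_at - announced_at < ONE_DAY:
            return "short-lived (misconfiguration suspect)"
        return "long-lived (assumed policy change)"

    events = [
        ("208.65.153.0/24", 1000000, 1003600),  # withdrawn after one hour
        ("192.0.2.0/24",    1000000, None),     # still present at the end of the window
    ]
    for prefix, start, end in events:
        print(prefix, "->", classify(start, end))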

The results for origin misconfiguration are astounding: they indicate that 72% of route updates per day are actually from a misconfiguration, and that 13% of those misconfigurations result in a connectivity problem. Does that mean that 9.4% of route updates overall (0.72 × 0.13 ≈ 0.094) lead to connectivity problems?? That's insanely high... so high that it actually makes me suspicious of their results. They present this as a low number compared to actual failures, but it's really high (in my mind) compared to the total number of route updates sent. If about 10% of all route updates are causing connectivity problems... well, that seems terrible to me!

Criticisms:

-- They present data showing that most of the misconfigurations are really short-lived. I'm not sure why they present this, because they have already selected for short misconfigurations: their categorization method treats short-lived updates as misconfigurations and never considers longer-lived ones. So if there were misconfigurations of longer duration (e.g., a set of subtler misconfigurations), they would have been thrown out anyway.

-- A note on their presentation: I find their result tables really hard to read. It took me a few minutes to figure out that the tables alternate between paths and incidents.

-- A general philosophical point: this paper focuses on small, relatively minor errors. Are we really concerned about those? Should we be? Or have people learned to live with this background noise, in which case we should be looking at what causes the disastrous failures? Those failures might be caused by entirely different factors in entirely different contexts; since their scale is different, it seems like the contributing factors would be different too.

1 comment:

  1. To me, it is surprising how many misconfigurations happen each and every day. The real routing meltdowns, such as the YouTube or Renesys examples, are things that happen a small number of times per year. I think the bottom line is that we need a better way to define routing policies than the current hand-edited approach.

