SDN & BGP: Resilience

SDN

The Resilience of BGP

The current BGP-based Internet has proven simultaneously resilient and fragile. It is resilient in its ability to resist, adapt to and recover from attacks and failures. It is most fragile in the nature of these recoveries: recovery depends on coordinated ad-hoc human responses, as critical as knowing who to call to run a fiber across an exchange center or shut off a machine. The Internet control plane has failed to be resilient to some classes of attacks, such as route leaks, denial-of-service attacks and botnets.

The Internet, as a network of networks, is also a network of trust, instantiated in the practical motto "send conservatively, accept liberally". The clearest example is the updating of router tables based on unsubstantiated announcements from other networks. Thanks to this trust, the control plane can be extremely responsive to failures and recover quickly. The tragic attack on the World Trade Center led not only to a horrific loss of life but also to the loss of more than three million data circuits and three hundred thousand (recoverable) voice circuits, resulting in a 6% loss of connectivity. In addition to the physical destruction of switching locations, there were cascading failures from power loss. Yet the Internet, by accepting updates, maintained connectivity. Much of this resilience was the result of engineers who trusted each other, and of executives willing to forgo negotiation before connection. As networks become more automated, that response is not necessarily reproducible, because earlier failures have not been subject to complete analysis.

The very trust that enables resilience can lead to failures when principals lack competence or benevolence.

Some major failures have been caused by network configuration errors. China Telecom announced 15% of all IPv4 space in April of 2010, resulting in loss of traffic for 18 minutes. Some commentators thought this could have been a cyber-war exercise, and the rhetoric escalated to the 'testing a cybernuke' level of excitability. However, most observers accepted China Telecom's explanation that it was an error. Given that the traffic did not reach its intended destination, it would have been a very clumsy attack. It would also have been erratic: Level 3 and AT&T differed notably in their responses.
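An incident of this kind turns on routers accepting an origin claim that nobody has checked. As a minimal sketch of what such a check could look like, assume a hypothetical registry that maps each prefix to the AS number allowed to originate it. The registry contents, the function name `check_origin` and the three result labels are illustrative assumptions, not a description of any deployed system; the prefixes and AS numbers come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical registry: prefix -> AS number allowed to originate it.
# All entries are IPv4; subnet_of() raises TypeError if IP versions are mixed.
REGISTRY = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}


def check_origin(prefix: str, origin_as: int) -> str:
    """Classify an announcement as 'valid', 'invalid' or 'unknown'."""
    announced = ipaddress.ip_network(prefix)
    # Collect the origin of every registered prefix that covers the announcement.
    covering = [asn for net, asn in REGISTRY.items() if announced.subnet_of(net)]
    if not covering:
        return "unknown"  # no registry entry covers this prefix
    return "valid" if origin_as in covering else "invalid"


print(check_origin("192.0.2.0/24", 64500))    # registered origin
print(check_origin("192.0.2.128/25", 64510))  # covered prefix, wrong origin
print(check_origin("203.0.113.0/24", 64500))  # prefix not in the registry
```

A router that accepts liberally treats all three cases alike; a stricter policy would have to decide what to do with "unknown", which is where the trade-off between responsiveness and safety reappears.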

Another route leak that denied service at scale was the misconfiguration of a small Australian ISP in 2012, which took Australia down for hours after it announced all the routes from two larger ISPs to each other. This is the most common routing failure: a straightforward failure of human factors, which vendors blame on operator error and operators blame on poorly designed and error-prone control interfaces. The ISP which suffered the outage can be blamed for not filtering route announcements appropriately. There are open economic questions around liability: do such outages cause users to switch ISPs? What is the optimal level of route filtering, for each ISP and for the Internet as a whole? As the outages that result from route leaks are immediately apparent and repaired within hours, such questions rumble in the background rather than becoming major industry issues. SDN may change this game.
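The leak described above, in which routes learned from one larger ISP are re-announced to another, is what a conventional export filter is meant to stop. The sketch below illustrates the commonly used "valley-free" export rule: routes learned from a customer may be exported to anyone, while routes learned from a peer or provider may be exported only to customers. The names `Rel`, `may_export` and `filter_exports` and the example prefixes are hypothetical; this is an illustration under those assumptions, not the configuration of any ISP involved.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship with a neighbouring network."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def may_export(learned_from: Rel, export_to: Rel) -> bool:
    """Valley-free rule: customer routes go to everyone,
    peer and provider routes go only to customers."""
    if learned_from is Rel.CUSTOMER:
        return True
    return export_to is Rel.CUSTOMER


def filter_exports(routes, export_to: Rel):
    """Keep only the prefixes that may be announced to this kind of neighbour."""
    return [prefix for prefix, learned_from in routes
            if may_export(learned_from, export_to)]


# A small ISP with two upstream providers and one customer (documentation prefixes).
routes = [
    ("192.0.2.0/24", Rel.PROVIDER),     # learned from the first large ISP
    ("198.51.100.0/24", Rel.PROVIDER),  # learned from the second large ISP
    ("203.0.113.0/24", Rel.CUSTOMER),   # the small ISP's own customer
]

# Only the customer route should be announced upstream.
print(filter_exports(routes, Rel.PROVIDER))
```

With this rule in place the small ISP announces only its customer's prefix to each provider, so neither large ISP would see the other's routes through it. The rule is simple to state; the human-factors problem is that it has to be configured correctly on every session.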

Malice can also change the game. When an entity misrepresents its location in a path, rather than claiming to own a destination, the errors are less obvious. In this case, traffic continues to be delivered and such a routing configuration could remain stable for long periods. This occurred again between China Telecom and AT&T, this time for a period of some months. In that particular case, Facebook traffic was routed through China. Note that while the login to Facebook is protected by TLS, no updates are encrypted. Thus a significant amount of global traffic was routed via a nation where Facebook adoption is remarkably low.

In addition to errors and odd incidents, there has also been a wide range of political attacks. The most famous is from Pakistan in 2008, where Pakistan objected to several YouTube videos sufficiently to block all of YouTube. An internal address for YouTube was intended to be announced within Pakistan Telecom; however, it was broadcast across northern Africa and Europe, leading to a service outage lasting several hours. It may be argued that the intention to block within Pakistan was a sovereign political decision, and the leak itself was a human factors problem: a blunder rather than an attack. Yet Internet blocking incidents during the Arab Spring tend to have been seen as attacks by governments on their populations, as in the case of the rapid drop-offs for Egyptian and Libyan populations. In fact the earliest political Internet blocking may have occurred amid the Serbian atrocities during the dissolution of Yugoslavia, and the subsequent war-crimes trials of Serbian leaders allow us to unambiguously describe such actions as 'attacks'; similar arguments can be made in the case of Libya and Egypt.

The most straightforward way to limit network access is destruction of the infrastructure, as shown by the cable cuts in January of 2008. These attacks on the physical infrastructure near Egypt affected reliable service in large parts of the world, but reachability was generally maintained during the ten days of repairs. The dependence of the information infrastructure on fragile physical infrastructure was also highlighted by the disruption of network traffic to Armenia. An elderly woman, surviving by scavenging scrap metal, discovered what could have been her finest meal when she found a large copper cable close to the surface. That so much of the nation's connectivity depended on this one cable was not apparent until she sliced out a few meters.

It is a general problem that the opacity of BGP makes it difficult to understand the redundancy, or lack of redundancy, in a network. Another example was the Buncefield incident in the UK, where an explosion at a fuel storage depot destroyed a number of fibres. This led to surprising network outages, because the primary and secondary network connections of firms and hospitals had been routed through adjacent fibres without anyone being aware. There are so many layers of subcontracting and outsourcing in the telecoms business that tracking the physical infrastructure on which a system relies is both difficult and expensive. The abstractions in SDN may make such weaknesses more visible.
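If the physical route of each connection were known, the hidden dependency in a case like Buncefield would reduce to a simple intersection test. The following sketch assumes, hypothetically, that each connection can be described as a list of named physical segments (ducts, conduits, exchanges); the segment names and the function `shared_segments` are invented for illustration. Obtaining that inventory is the hard part, not the comparison.

```python
def shared_segments(primary, secondary):
    """Return the physical segments that both connections depend on."""
    return sorted(set(primary) & set(secondary))


# Hypothetical inventories for a site's primary and secondary connections.
primary = ["site-duct-1", "depot-road-conduit", "exchange-a"]
secondary = ["site-duct-2", "depot-road-conduit", "exchange-b"]

common = shared_segments(primary, secondary)
if common:
    print("Not independent; shared segments:", common)
else:
    print("No shared physical segments found")
```

Here the two connections leave the site by different ducts and terminate at different exchanges, yet both pass through the same conduit, so a single physical event would sever both.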

Besides failures of data and failures of updates, there is one notable failure of software, which points to another potential vulnerability. For the better part of an hour in August 2010, BGP routes carried an attribute attached to a RIPE-announced address, and Cisco routers interpreted this as a command to drop the route.