[ih] Large scale BGP failure in early 90's

Jack Haverty jack at 3kitty.org
Wed May 28 11:17:12 PDT 2025


Dave Mills was the master researcher, always trying out new ideas on the 
early 80s Internet.   Sometimes, of course, there were bugs.  At the same
time, Vint had tasked my "Gateway Group" at BBN to make the "core 
gateways" into a reliable 24x7 network service, much as the ARPANET had 
become over the previous decade.

Those two goals - research and reliability - conflicted.  After yet 
another Internet disruption, I recruited Dr. Eric Rosen to help figure 
out a solution.  Eric was one of the "ARPANET gurus" and knew a lot 
about what had been done in the ARPANET IMP algorithms to transition it 
from research to operations.

We sat down for an afternoon of brainstorming.  That led to the 
invention of the concept of "Autonomous Systems" (AS), and the 
rudimentary "firewall" protocol of EGP, documented in RFC 827 in October
1982.   EGP and AS enabled the "core" gateways to insulate themselves
from whatever bugs emerged from research activities.   Once EGP was
implemented in the core machines, whatever Dave, or anyone else, did in
their research was much less likely to affect the operation of the
"core" part of The Internet.

The notion of AS and EGP was envisioned to be temporary.   When research 
advanced to rough consensus and solutions, those algorithms and 
techniques would be included in the next version of TCP/IP.  We were 
rather naive in 1982; the 'net evolved quite differently.

The incidents I recall with Dave's Fuzzballs were mostly related to 
routing.   There were two general kinds of incidents.  One involved 
"black hole" gateways, where some gateway somehow became the best route 
to everywhere.  Another kind was the "counting to infinity" scenario,
where a routing change sometimes took 15 minutes to take effect, as the
gateways involved slowly added one to their hop counts until reaching 
"infinity" (set to some number, perhaps 32?).   While a 
count-to-infinity was in progress, datagrams tended to just "loop" 
around between gateways.   Users saw that as a minutes-long outage of 
the Internet.
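
As a rough illustration of the count-to-infinity behavior described
above, here is a minimal Python sketch.  It is not the Fuzzball or core
gateway code; the two-gateway topology, the names A, B and N, and the
alternating update order are all assumptions made for the example, and
INFINITY = 32 simply follows the "perhaps 32?" recollection.

```python
INFINITY = 32  # assumed value; the post only recalls "perhaps 32?"


def count_to_infinity(infinity=INFINITY):
    """Simulate gateways A and B after B loses its direct link to net N.

    Hypothetical setup: before the failure B reaches N in 1 hop and A
    reaches N through B in 2 hops.  After the failure each gateway, in
    turn, believes its neighbor's stale advertisement and adds one hop.
    Returns a list of (round, hops_at_A, hops_at_B) tuples.
    """
    dist = {"A": 2, "B": 1}
    history = []
    rnd = 0
    turn = "B"  # B notices the failure first and falls back on A's route
    while dist["A"] < infinity or dist["B"] < infinity:
        rnd += 1
        other = "A" if turn == "B" else "B"
        # Distance-vector rule: my distance = neighbor's distance + 1.
        dist[turn] = min(infinity, dist[other] + 1)
        history.append((rnd, dist["A"], dist["B"]))
        # Until infinity is reached, A routes via B and B routes via A,
        # so datagrams for N bounce between the two gateways.
        turn = other
    return history
```

Calling count_to_infinity() yields 31 update rounds before both
gateways give up on N.  If updates were exchanged every 30 seconds (an
interval chosen only for illustration), that would be roughly 15
minutes of looping datagrams.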

One incident I recall happened when two universities (can't remember 
which) were involved in some kind of joint project and needed to 
exchange lots of data.  Although the Internet worked for that, it was 
somewhat slow to transfer their files.   So they decided to put in their 
own leased circuit (9.6 kb/s, IIRC) directly between their two sites,
expecting traffic between them to utilize that line and transfers to 
complete more quickly.

They were surprised to see that their file transfers actually became 
much slower with the additional circuit.   The culprit was the Internet 
routing scheme, which was (still is?) based on "hop counts" rather than 
on datagram transit time.

The ARPANET had been using transit time as the metric for routing for 
many years.  Transit time included time spent in buffers inside IMPs, 
using a real-time clock which had been added to the first IMP 
minicomputers.  Early gateways however lacked such hardware and couldn't 
measure transit time.  Hop counts were an interim approach until the 
hardware was replaced, when a time-based routing approach could be 
introduced.

In the universities scenario, the additional line reduced the "hop 
count" for all sorts of other data flows through the Internet, 
attracting such flows to their new "private" line between their sites.  
So everything became slower.
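
The effect can be sketched with a small shortest-path comparison in
Python.  Everything here is hypothetical: the topology (sites X and Y,
core gateways C1..C5, university gateways U1 and U2), the delay values,
and the function name best_path are invented for illustration, and
congestion is not modeled at all.  The sketch only shows which path
each metric selects.

```python
import heapq

# Hypothetical link delays in arbitrary units; the leased line is slower.
FAST, SLOW = 1, 4
LINKS = [
    ("X", "C1", FAST), ("C1", "C2", FAST), ("C2", "C3", FAST),
    ("C3", "C4", FAST), ("C4", "C5", FAST), ("C5", "Y", FAST),
    ("U1", "C1", FAST), ("U2", "C5", FAST),
    ("U1", "U2", SLOW),  # the new "private" line between the universities
]


def best_path(links, src, dst, metric):
    """Dijkstra's algorithm; metric is "hops" (cost 1 per link) or "delay"."""
    graph = {}
    for a, b, delay in links:
        cost = 1 if metric == "hops" else delay
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, link_cost in graph[node]:
            if nxt not in seen:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return None  # dst unreachable
```

With hop counts, best_path(LINKS, "X", "Y", "hops") routes the
unrelated X-to-Y traffic over the universities' slow line (5 hops
instead of 6 through the core).  With the delay metric the same call
keeps that traffic on the core path, while U1-to-U2 traffic still uses
the direct line.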

Hope this helps explain some of the History.
Jack Haverty




On 5/28/25 00:16, Craig Partridge via Internet-history wrote:
> Wasn't that the same pattern that took down the AT&T network c. 1990 (a
> busted update that was propagated before being fully processed)?
>
> I seem to recall Dave Mills actually triggered a similar issue in the early
> Internet (I do recall he talked about some bug in his code and that some
> authority figure, Vint?, was mad at him at the time).
>
> Craig
>
> On Tue, May 27, 2025 at 2:32 PM Matt Mathis via Internet-history <
> internet-history at elists.isoc.org> wrote:
>
>> Is there a writeup (or does anybody recall) a large-scale BGP failure in
>> the early 90s, when one vendor was testing a feature to make routes less
>> preferred (AS prepending or doubling) which caused all gated-based BGP
>> implementations to lock up?   Unfortunately the processing order was: parse
>> the test route, forward the test route, and then lock up.  So the test route
>> successfully flooded the entire routing system before locking up all of the
>> NSFnet and many other networks.
>>
>> I am looking for a more complete and accurate description of this event.
>>
>> Thanks,
>> --MM--
>> Evil is defined by mortals who think they know "The Truth" and use force to
>> apply it to others.
>> -------------------------------------------
>> Matt Mathis  (Email is best)
>> Home & mobile: 412-654-7529 please leave a message if you must call.
