[ih] Large scale BGP failure in early 90's
Jack Haverty
jack at 3kitty.org
Wed May 28 11:17:12 PDT 2025
Dave Mills was the master researcher, always trying out new ideas on the
early 80s Internet. Sometimes of course there were bugs. At the same
time, Vint had tasked my "Gateway Group" at BBN to make the "core
gateways" into a reliable 24x7 network service, much as the ARPANET had
become over the previous decade.
Those two goals - research and reliability - conflicted. After yet
another Internet disruption, I recruited Dr. Eric Rosen to help figure
out a solution. Eric was one of the "ARPANET gurus" and knew a lot
about what had been done in the ARPANET IMP algorithms to transition it
from research to operations.
We sat down for an afternoon of brainstorming. That led to the
invention of the concept of "Autonomous Systems" (AS) and the
rudimentary "firewall" protocol EGP, documented in RFC 827 in October
1982. EGP and AS enabled the "core" gateways to insulate themselves
from whatever bugs emerged from research activities. Once EGP was
implemented in the core machines, whatever Dave, or anyone else, did
in their research was much less likely to affect the operation of the
"core" part of The Internet.
The notion of AS and EGP was envisioned to be temporary. When research
advanced to rough consensus and solutions, those algorithms and
techniques would be included in the next version of TCP/IP. We were
rather naive in 1982; the 'net evolved quite differently.
The incidents I recall with Dave's Fuzzballs were mostly related to
routing. There were two general kinds of incidents. One involved
"black hole" gateways, where some gateway somehow became the best route
to everywhere. Another kind was the "counting to infinity" scenario,
where a routing change sometimes took 15 minutes to take effect, as the
gateways involved slowly added one to their hop counts until reaching
"infinity" (set to some number, perhaps 32?). While a
count-to-infinity was in progress, datagrams tended to just "loop"
around between gateways. Users saw that as a minutes-long outage of
the Internet.
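
To make the count-to-infinity behavior concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reconstruction of the actual gateway code. It assumes two hypothetical gateways, A and B, exchanging hop counts with no loop protection after B loses its direct link to some network N. The value 32 for "infinity" is only the guess from the paragraph above.

```python
INFINITY = 32  # assumed value; the post only guesses "perhaps 32"


def count_to_infinity(infinity=INFINITY):
    """Simulate gateways A and B after B loses its direct link to net N.

    Before the failure B reached N in 1 hop and A reached N via B in 2.
    Returns a list of (round, hops_at_A, hops_at_B), one entry per
    exchange of routing updates, until both gateways reach "infinity".
    """
    a_hops = 2  # A's stale route to N, learned from B
    history = []
    rnd = 0
    while True:
        rnd += 1
        # B hears A's stale advertisement and believes A can reach N.
        b_hops = min(infinity, a_hops + 1)
        # A's only route to N is via B, so it adopts B's new count plus one.
        a_hops = min(infinity, b_hops + 1)
        history.append((rnd, a_hops, b_hops))
        if a_hops >= infinity and b_hops >= infinity:
            return history


if __name__ == "__main__":
    for rnd, a, b in count_to_infinity():
        print(f"round {rnd:2d}: A={a:2d} hops, B={b:2d} hops")
```

During every round before the last, A points at B and B points at A, so a datagram for N just bounces between the two. The outage lasts for the number of rounds times whatever the update interval happens to be.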
One incident I recall happened when two universities (can't remember
which) were involved in some kind of joint project and needed to
exchange lots of data. Although the Internet worked for that, it was
somewhat slow to transfer their files. So they decided to put in their
own leased circuit (9.6 kb/s, IIRC) directly between their two sites,
expecting traffic between them to utilize that line and transfers to
complete more quickly.
They were surprised to see that their file transfers actually became
much slower with the additional circuit. The culprit was the Internet
routing scheme, which was (still is?) based on "hop counts" rather than
on datagram transit time.
The ARPANET had been using transit time as the metric for routing for
many years. Transit time included time spent in buffers inside IMPs,
using a real-time clock which had been added to the first IMP
minicomputers. Early gateways, however, lacked such hardware and couldn't
measure transit time. Hop counts were meant as an interim approach until
the hardware could be replaced and a time-based routing approach
introduced.
In the universities scenario, the additional line reduced the "hop
count" for all sorts of other data flows through the Internet,
attracting such flows to their new "private" line between their sites.
So everything became slower.
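
The effect can be sketched with a toy example in Python. Everything in it is hypothetical: the topology, the node names (X and Y for two unrelated sites, U1 and U2 for the universities' gateways, C1 to C3 for core gateways), and the transit times are invented for illustration. It also uses a plain shortest-path search as a stand-in for whatever the gateways of the day actually computed. The only point is that the same graph yields different "best" routes depending on the metric.

```python
import heapq

# Hypothetical links: (node, node, transit time in ms). All values invented.
LINKS = [
    ("X", "U1", 10),
    ("U1", "U2", 500),  # the slow "private" leased line
    ("U2", "Y", 10),
    ("X", "C1", 10),
    ("C1", "C2", 10),
    ("C2", "C3", 10),
    ("C3", "Y", 10),
]


def build_graph(links):
    """Turn the link list into an undirected adjacency map."""
    graph = {}
    for a, b, ms in links:
        graph.setdefault(a, []).append((b, ms))
        graph.setdefault(b, []).append((a, ms))
    return graph


def best_path(graph, src, dst, use_hops):
    """Dijkstra search; each link costs 1 if use_hops, else its transit time."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, ms in graph[node]:
            if nxt not in seen:
                step = 1 if use_hops else ms
                heapq.heappush(queue, (cost + step, nxt, path + [nxt]))
    return None


if __name__ == "__main__":
    g = build_graph(LINKS)
    print("by hop count:   ", best_path(g, "X", "Y", use_hops=True))
    print("by transit time:", best_path(g, "X", "Y", use_hops=False))
```

With these made-up numbers, the hop-count metric sends X-to-Y traffic over the universities' slow line (3 hops, 520 ms), while a transit-time metric would have kept it on the core path (4 hops, 40 ms).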
Hope this helps explain some of the history.
Jack Haverty
On 5/28/25 00:16, Craig Partridge via Internet-history wrote:
> Wasn't that the same pattern that took down the AT&T network c. 1990 (a
> busted update that was propagated before being fully processed)?
>
> I seem to recall Dave Mills actually triggered a similar issue in the early
> Internet (I do recall he talked about some bug in his code and that some
> authority figure, Vint?, was mad at him at the time).
>
> Craig
>
> On Tue, May 27, 2025 at 2:32 PM Matt Mathis via Internet-history <
> internet-history at elists.isoc.org> wrote:
>
>> Is there a writeup (or does anybody recall) a large-scale BGP failure in
>> the early 90s, when one vendor was testing a feature to make routes less
>> preferred (AS prepending or doubling) which caused all gated-based BGP
>> implementations to lock up? Unfortunately the processing order was: parse
>> the test route, forward test route, and then lock up. So the test route
>> successfully flooded the entire routing system before locking up all of the
>> NSFnet and many other networks.
>>
>> I am looking for a more complete and accurate description of this event.
>>
>> Thanks,
>> --MM--
>> Evil is defined by mortals who think they know "The Truth" and use force to
>> apply it to others.
>> -------------------------------------------
>> Matt Mathis (Email is best)
>> Home & mobile: 412-654-7529 please leave a message if you must call.