[ih] Global congestion collapse

Mon Dec 13 22:47:01 PST 2004

Folks,

Thanks a lot for answering my original question; this
discussion is getting more and more exciting  :)

Cheers,
Michael


On Tue, 2004-12-14 at 05:37, David L. Mills wrote:
> Perry,
> 
> Not so fast. Steve Wolff of NSF and I had a nasty little secret we did 
> not tell the NSFnet maintenance crew who could never keep a secret. I 
> built in priority queueing and preemption in the fuzzball routers. The 
> former wiretapped the telnet port and made it just below NTP on the 
> priority scale. We put mail on the bottom just below ftp. A lot of 
> telnet users stopped complaining because they thought we "fixed" the 
> network.
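A minimal Python sketch of the strict priority ordering described above: NTP first, telnet just below it, ftp near the bottom and mail last. The port-to-class table, the extra "other" class placed between telnet and ftp, and the names `classify` and `PriorityScheduler` are assumptions for illustration, not the fuzzball code. The sketch shows only the service order. It does not model the preemption the message also mentions.

```python
import heapq
from itertools import count

# Hypothetical priority levels; a lower number is transmitted first.
# NTP > telnet > (assumed) everything else > ftp > mail.
PRIORITY = {"ntp": 0, "telnet": 1, "other": 2, "ftp": 3, "mail": 4}

# Assumed classifier: well-known port numbers mapped to traffic classes.
PORT_CLASS = {123: "ntp", 23: "telnet", 20: "ftp", 21: "ftp", 25: "mail"}


def classify(port):
    """Map a port number to a priority level; unknown ports get 'other'."""
    return PRIORITY[PORT_CLASS.get(port, "other")]


class PriorityScheduler:
    """Strict priority output queue: the highest class is always sent first."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # arrival counter keeps FIFO order within a class

    def enqueue(self, packet, port):
        heapq.heappush(self._heap, (classify(port), next(self._seq), packet))

    def dequeue(self):
        """Return the next packet to transmit, or None if nothing is queued."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

If mail, ftp, telnet, NTP and a second telnet packet are enqueued in that order, they come out as NTP, telnet, telnet, ftp, mail. Interactive traffic jumps the bulk transfers, which is why the telnet users perceived the network as "fixed".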
> 
> The other thing was to shoot the elephants. When a new packet arrived 
> and no buffer space was available, the output queues were scanned 
> looking for the biggest elephant (total byte count on all queues from 
> the same IP address) and killed its biggest packet. Gunshots continued
> until either the arriving packet got shot or there was enough room to 
> save it. It all worked gangbusters and the poor ftpers never found out.
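The drop policy above can be sketched in Python as follows. It assumes a single shared byte budget across all output queues and counts the arriving packet toward its own source's total. `ElephantShooter`, `Packet`, and the tie-break that shoots the arriving packet when it is at least as large as the elephant's biggest queued packet are hypothetical choices, not taken from the original routers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Packet:
    src: str   # source IP address
    size: int  # length in bytes (assumed positive)


class ElephantShooter:
    """Shared buffer that drops from the heaviest source when it fills up."""

    def __init__(self, capacity, queue_names):
        self.capacity = capacity  # total bytes allowed across all queues
        self.queues = {name: [] for name in queue_names}
        self.used = 0  # bytes currently queued

    def _source_totals(self, arriving):
        """Bytes per source IP over all queues, counting the arriving packet."""
        totals = defaultdict(int)
        for queue in self.queues.values():
            for p in queue:
                totals[p.src] += p.size
        totals[arriving.src] += arriving.size
        return totals

    def arrive(self, packet, queue_name):
        """Queue packet, shooting elephants as needed; False if it was shot."""
        while self.used + packet.size > self.capacity:
            totals = self._source_totals(packet)
            elephant = max(totals, key=totals.get)
            # Find the elephant's biggest packet already sitting in a queue.
            victim, victim_size = None, -1
            for name, queue in self.queues.items():
                for i, p in enumerate(queue):
                    if p.src == elephant and p.size > victim_size:
                        victim, victim_size = (name, i), p.size
            # If the arriving packet is the elephant's biggest, it gets shot.
            if packet.src == elephant and packet.size >= victim_size:
                return False
            name, i = victim
            del self.queues[name][i]
            self.used -= victim_size
        self.queues[queue_name].append(packet)
        self.used += packet.size
        return True
```

Each pass of the loop either removes one queued packet or rejects the arrival, so it always terminates. Rescanning every queue per shot is linear in the number of buffered packets, which is acceptable for a sketch but not how a fast router would keep its per-source counts.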
> 
> Dave
> 
> Perry E. Metzger wrote:
> 
> >"David L. Mills" <mills at udel.edu> writes:
> >  
> >
> >>Well, if your incident was during 1986-1988 and involved transit of
> >>the NSFnet Phase-I backbone, I'm the perp. The NSFnet routers ran my
> >>code, which was horribly overrun by supercomputer traffic. I found the
> >>best way to deal with the problem was to find the supercomputer
> >>elephants and shoot them. More is in a 1988 SIGCOMM Symposium
> >>paper. More recently the USNO and NIST time servers are being overrun
> >>with NTP traffic. See my recent PTTI paper at
> >>www.eecis.udel.edu/+mills/papers.html.
> >>
> >>The NSFnet meltdown occurred primarily because the fuzzball routers
> >>used smart interfaces that retransmitted when either an error occurred
> >>or the receiver ran dry of buffers. The entire network locked up for a
> >>time because all the buffers in all six machines filled up with
> >>retransmit traffic and nothing could get in or out. As I recall, the
> >>ARPAnet also had a similar problem with reassembly buffers.
> >>    
> >>
> >
> >Interesting. Bellcore switched from a 56k link to the IMP at Columbia
> >to NSFnet towards the end (latter half?) of that time, but I can't
> >remember if the horrible congestion was before or after our switch.
> >
> >Either way, though, it was pretty shortly thereafter that I remember
> >getting my first replacement .o files with yummy new TCP congestion
> >control algorithms in them.
> >
> >Perry