From feinler at earthlink.net Sat May 17 10:06:47 2014 From: feinler at earthlink.net (Elizabeth Feinler) Date: Sat, 17 May 2014 10:06:47 -0700 Subject: [ih] SAIL vending machine In-Reply-To: References: Message-ID: <5E2586FE-9895-4897-8C0E-43E15402BBD8@earthlink.net> Yes, there was a vending machine at SAIL that was accessed from a terminal. It was set up (and I assume programmed) by Les Earnest. If you recall the AI Center at SAIL in those days was way up on a hill away from the campus. People hung out there day and night and drove Les nuts asking for change to put into the vending machine, so he let people have "charge accounts" that they paid each month in return for automatic access to the goodies. You received your monthly bill via email. This was no ordinary vending machine. I remember at one point it was stocked daily with pot stickers from Louie's Chinese restaurant. And yes, you could try "double or nothing" if you felt lucky. I'm sure Les can fill us in on the technical details. Regards, Jake On Jan 18, 2014, at 6:11 PM, internet-history-request at postel.org wrote: > Send internet-history mailing list submissions to > internet-history at postel.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://mailman.postel.org/mailman/listinfo/internet-history > or, via email, send a message with subject or body 'help' to > internet-history-request at postel.org > > You can reach the person managing the list at > internet-history-owner at postel.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of internet-history digest..." > > > Today's Topics: > > 1. Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Ian Peter) > 2. Re: Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Noel Chiappa) > 3. Re: Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Jack Haverty) > 4. Re: Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Ian Peter) > 5. Re: Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Noel Chiappa) > 6. Re: Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Vint Cerf) > 7. Re: Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Randy Bush) > 8. Re: Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) (Jack Haverty) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sun, 19 Jan 2014 07:28:20 +1100 > From: "Ian Peter" > Subject: [ih] Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) > To: > Message-ID: > Content-Type: text/plain; format=flowed; charset="iso-8859-1"; > reply-type=original > > But wasn't the first example of this the Carnegie-Mellon Coke machine in > 1982? > > http://knowyourmeme.com/memes/internet-coke-machine > > > > Message: 1 > Date: Sat, 18 Jan 2014 08:30:39 -0800 > From: Jack Haverty > Subject: [ih] Internet milestone - The Refrigerator Strikes Back > To: "internet-history at postel.org" > Message-ID: > > Content-Type: text/plain; charset="iso-8859-1" > > This has just got to be a milestone for Internet historians: > > http://zeenews.india.com/news/net-news/hackers-use-connected-home-appliances-to-launch-global-cyberattack_905067.html > > When I saw this report, the final scene of the recent Hobbit movie flashed > into my brain -- where the hero (Bilbo) laments "What have we done!!!?" > > Excuse me, I have to go configure my router so that my refrigerator can't > talk to the outside world. I'm assuming of course that my router is still > actually doing what I tell it to do. > > /Jack Haverty > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > http://mailman.postel.org/pipermail/internet-history/attachments/20140118/24f923f3/attachment-0001.html > > ------------------------------ > > _______________________________________________ > internet-history mailing list > internet-history at postel.org > http://mailman.postel.org/mailman/listinfo/internet-history > > > End of internet-history Digest, Vol 82, Issue 2 > *********************************************** > > > > ------------------------------ > > Message: 2 > Date: Sat, 18 Jan 2014 16:26:42 -0500 (EST) > From: jnc at mercury.lcs.mit.edu (Noel Chiappa) > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back > (Jack Haverty) > To: internet-history at postel.org > Cc: jnc at mercury.lcs.mit.edu > Message-ID: <20140118212642.D697618C152 at mercury.lcs.mit.edu> > >> From: "Ian Peter" > >> But wasn't the first example of this [CMU] Coke machine in 1982? > > The Coke machine didn't, AFAIK, mount attacks on other ARPANet hosts (which > was the point of Jack's message). > > Noel > > > ------------------------------ > > Message: 3 > Date: Sat, 18 Jan 2014 15:55:16 -0800 > From: Jack Haverty > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back > (Jack Haverty) > To: Noel Chiappa > Cc: "internet-history at postel.org" > Message-ID: > > Content-Type: text/plain; charset="iso-8859-1" > > Exactly right - this is the first (that I've heard about) instance of our > household appliances being taken over by a malevolent force and used to > attack others, all via the Internet. You kind of expect your personal > computers to be recruited -- after all they sit there saying "Program me!!" > all day. But refrigerators, TVs, et al were more loyal, until now. > Where's an exorcist when you need one? > > Actually, this is seriously a real problem....how do I get anti-virus > software into my kitchen appliances? > > BTW, there was a Coke machine attached to the ARPANET in the mid 70s, well > before IP was deployed, or the 1982 CMU machine. IIRC it had a specific > IMP/port address (on MIT-AI I believe) to which you could Telnet and get > back the current temperature of the contents of the machine. No one likes > warm soda....or a long fruitless walk to a too recently stocked machine. > > /Jack > > > On Sat, Jan 18, 2014 at 1:26 PM, Noel Chiappa wrote: > >>> From: "Ian Peter" >> >>> But wasn't the first example of this [CMU] Coke machine in 1982? >> >> The Coke machine didn't, AFAIK, mount attacks on other ARPANet hosts (which >> was the point of Jack's message). >> >> Noel >> > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: http://mailman.postel.org/pipermail/internet-history/attachments/20140118/b6ea186e/attachment-0001.html > > ------------------------------ > > Message: 4 > Date: Sun, 19 Jan 2014 11:12:42 +1100 > From: "Ian Peter" > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back > (Jack Haverty) > To: , "Noel Chiappa" > > Message-ID: <8135FB2848E943CA8F968E03B24F4855 at Toshiba> > Content-Type: text/plain; format=flowed; charset="iso-8859-1"; > reply-type=original > > true, but there are some origins there > > -----Original Message----- > From: Noel Chiappa > Sent: Sunday, January 19, 2014 8:26 AM > To: internet-history at postel.org > Cc: jnc at mercury.lcs.mit.edu > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back (Jack > Haverty) > >> From: "Ian Peter" > >> But wasn't the first example of this [CMU] Coke machine in 1982? > > The Coke machine didn't, AFAIK, mount attacks on other ARPANet hosts (which > was the point of Jack's message). > > Noel > > > > ------------------------------ > > Message: 5 > Date: Sat, 18 Jan 2014 19:15:41 -0500 (EST) > From: jnc at mercury.lcs.mit.edu (Noel Chiappa) > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back > (Jack Haverty) > To: internet-history at postel.org > Cc: jnc at mercury.lcs.mit.edu > Message-ID: <20140119001541.1FBAF18C16E at mercury.lcs.mit.edu> > >> From: Jack Haverty > >> there was a Coke machine attached to the ARPANET in the mid 70s, well >> before IP was deployed, or the 1982 CMU machine. IIRC it had a specific >> IMP/port address (on MIT-AI I believe) to which you could Telnet and >> get back the current temperature of the contents of the machine. > > I think that was actually SAIL, wasn't it? And it wouldn't have been > connected directly to the IMP (that would have required an IMP interface, > and a mini to run it); it was a peripheral on the PDP-10. (There may have been > an NCP server that returned the status of the Coke machine, though.) > > Just like the elevator call hack at MIT... Oh, better not talk about that! :-) > > Noel > > > ------------------------------ > > Message: 6 > Date: Sat, 18 Jan 2014 19:50:25 -0500 > From: Vint Cerf > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back > (Jack Haverty) > To: Noel Chiappa > Cc: internet history > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > Stanford had a vending machine called the Prancing Pony that you would > order from... > > v > > > > On Sat, Jan 18, 2014 at 7:15 PM, Noel Chiappa wrote: > >>> From: Jack Haverty >> >>> there was a Coke machine attached to the ARPANET in the mid 70s, well >>> before IP was deployed, or the 1982 CMU machine. IIRC it had a >> specific >>> IMP/port address (on MIT-AI I believe) to which you could Telnet and >>> get back the current temperature of the contents of the machine. >> >> I think that was actually SAIL, wasn't it? And it wouldn't have been >> connected directly to the IMP (that would have required an IMP interface, >> and a mini to run it); it was a peripheral on the PDP-10. (There may have >> been >> an NCP server that returned the status of the Coke machine, though.) >> >> Just like the elevator call hack at MIT... Oh, better not talk about that! >> :-) >> >> Noel >> > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: http://mailman.postel.org/pipermail/internet-history/attachments/20140118/e752e95b/attachment-0001.html > > ------------------------------ > > Message: 7 > Date: Sun, 19 Jan 2014 07:56:47 +0600 > From: Randy Bush > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back > (Jack Haverty) > To: Vint Cerf > Cc: internet history , Noel Chiappa > > Message-ID: > Content-Type: text/plain; charset=US-ASCII > >> Stanford had a vending machine called the Prancing Pony that you would >> order from... > > and it would bet you doible or nothing > > > ------------------------------ > > Message: 8 > Date: Sat, 18 Jan 2014 18:10:31 -0800 > From: Jack Haverty > Subject: Re: [ih] Internet milestone - The Refrigerator Strikes Back > (Jack Haverty) > To: Noel Chiappa > Cc: "internet-history at postel.org" > Message-ID: > > Content-Type: text/plain; charset="iso-8859-1" > > Correct. It was hooked somehow to a PDP-10. No one wrote an NCP for the > coke machine. Although it *did* have quite a few FIFO buffers - might have > made a decent gateway... > > I think lots of places had some kind of vending machine attached to their > ARPANET host and somehow accessible via the net. Not sure who did it > first. > > /Jack > > > > On Sat, Jan 18, 2014 at 4:15 PM, Noel Chiappa wrote: > >>> From: Jack Haverty >> >>> there was a Coke machine attached to the ARPANET in the mid 70s, well >>> before IP was deployed, or the 1982 CMU machine. IIRC it had a >> specific >>> IMP/port address (on MIT-AI I believe) to which you could Telnet and >>> get back the current temperature of the contents of the machine. >> >> I think that was actually SAIL, wasn't it? And it wouldn't have been >> connected directly to the IMP (that would have required an IMP interface, >> and a mini to run it); it was a peripheral on the PDP-10. (There may have >> been >> an NCP server that returned the status of the Coke machine, though.) >> >> Just like the elevator call hack at MIT... Oh, better not talk about that! >> :-) >> >> Noel >> > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: http://mailman.postel.org/pipermail/internet-history/attachments/20140118/05c5cf27/attachment.html > > ------------------------------ > > _______________________________________________ > internet-history mailing list > internet-history at postel.org > http://mailman.postel.org/mailman/listinfo/internet-history > > > End of internet-history Digest, Vol 82, Issue 3 > *********************************************** From randy at psg.com Sat May 17 12:12:02 2014 From: randy at psg.com (Randy Bush) Date: Sat, 17 May 2014 21:12:02 +0200 Subject: [ih] SAIL vending machine In-Reply-To: <5E2586FE-9895-4897-8C0E-43E15402BBD8@earthlink.net> References: <5E2586FE-9895-4897-8C0E-43E15402BBD8@earthlink.net> Message-ID: and pretty good empanadas From dhc2 at dcrocker.net Sat May 17 13:46:16 2014 From: dhc2 at dcrocker.net (Dave Crocker) Date: Sat, 17 May 2014 13:46:16 -0700 Subject: [ih] SAIL vending machine In-Reply-To: <5E2586FE-9895-4897-8C0E-43E15402BBD8@earthlink.net> References: <5E2586FE-9895-4897-8C0E-43E15402BBD8@earthlink.net> Message-ID: <5377CA98.7000308@dcrocker.net> On 5/17/2014 10:06 AM, Elizabeth Feinler wrote: > I remember at one point it was stocked daily with pot stickers from Louie's Chinese restaurant. In fact on one visit out there from grad school, your crew introduced me to pot stickers and 'ant climb tree' at Louies and you guys mentioned that the pot stickers were available in the vending machine. Those pot stickers remain among the best I've ever had. But I immediately went to the operations question of how to keep the machine stocked and you guys said that anyone at SAIL who noticed that re-stocking was needed merely called Louie for a refill. I then asked the obvious next question, which was how the pot stickers traveled from Louies to SAIL and you said there was always someone from SAIL eating at Louies, so whoever was there would take them back. On it's face, that's a pretty unbelievable scenario -- there was /always/ someone from SAIL there??? So of course during our meal, a call came in and indeed, someone from SAIL was just finishing up, getting ready to head back... d/ -- Dave Crocker Brandenburg InternetWorking bbiw.net From detlef.bosau at web.de Sun May 18 06:07:39 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Sun, 18 May 2014 15:07:39 +0200 Subject: [ih] When was Go Back N adopted by TCP Message-ID: <5378B09B.1000406@web.de> if at all? To my understanding, TCP Tahoe was based upon a go back n retransmission strategy, which was not yet part of RFC 793. In RFC 793, necessary retransmissions have been started by individual RTO timers for each packet. So, there must have been a change here in the 80s. When did this happen and why? -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From craig at aland.bbn.com Sun May 18 07:03:39 2014 From: craig at aland.bbn.com (Craig Partridge) Date: Sun, 18 May 2014 10:03:39 -0400 Subject: [ih] When was Go Back N adopted by TCP Message-ID: <20140518140340.037C328E137@aland.bbn.com> I'll try to be brief (a multi-page essay keeps wanting to break out as I write this). First, worth remembering that ARQ research (Automatic Repeat reQuest), the work that first developed go-back-n, was happening concurrently with TCP development. Sometimes TCP was ahead of the theory. Indeed, the theory never quite caught up as all the ARQ research is based on a slotted transmission channel and the question is always "what is the most efficient packet to fill the current slot." Put another way, ARQ doesn't do flow control. Second, formally, go-back-n says that whenever you detect a loss, you restart transmission of *all* data beginning with the lost item. So if you sent bytes 1000 through 8000 and learn that there was a loss at 2000, you resend 2000 through 8000. Some of the first TCP implementations did something like this (cf. David Plummer's note "Re: Interesting Happenings" on the TCP-IP list of 8 June 1984). Many other TCPs would only retransmit a chunk of data at 2000 and wait for an ACK to see where the next data gap was (also noted in Plummer's note). At some point in the early 1980s Jon Postel sent out a note saying that the retransmission only of the data immediately around the known loss was the right thing to do, but I can't find the note in my (limited) archives. I believe by about 1986 or so, all TCPs were only retransmitting the data known to be lost. Certainly, the 4.3bsd release of June 1986 only retransmitted data that was known to be lost. It is my recollection that 4.2bsd only only retransmitted data that was lost as it was based on the BBN 4.1c TCP, which only retransmitted lost data. So by 1986 and probably as early as 1979 if not before, BSD TCP was *not* go back n. (One might say it was a flow controlled variant of what theory called "selective repeat ARQ"). TCP Tahoe was released in June 1988 and added the initial Van Jacobson versions of slow start and the like. So, in short, TCP Tahoe was never based on go back n. The question of how to do retransmission RTO timers has a parallel history until it converges with Van's work in 1988. The broad point is that people did not necessarily track each packet's round-trip separately and TCP doesn't carry a retransmission indicator in each segment. As a result, there were various issues in accuracy. For that, you can go read Robert Morris' paper in AFIPS 1979, Craig Milo Rogers' notes to TCP-IP around 1983, Raj Jain's paper at 5th Phoenix Conference on Computers and Communications in 1986, Lixia Zhang's paper at SIGCOMM '86, my paper with Phil Karn at SIGCOMM '87 and then Van's paper at SIGCOMM '88. Thanks! Craig > if at all? > > To my understanding, TCP Tahoe was based upon a go back n retransmission > strategy, which was not yet part of RFC 793. > > In RFC 793, necessary retransmissions have been started by individual > RTO timers for each packet. > > So, there must have been a change here in the 80s. When did this happen > and why? From detlef.bosau at web.de Sun May 18 07:26:34 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Sun, 18 May 2014 16:26:34 +0200 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: <20140518140340.037C328E137@aland.bbn.com> References: <20140518140340.037C328E137@aland.bbn.com> Message-ID: <5378C31A.3000108@web.de> Am 18.05.2014 16:03, schrieb Craig Partridge: > I'll try to be brief (a multi-page essay keeps wanting to break out > as I write this). > > First, worth remembering that ARQ research (Automatic Repeat reQuest), the > work that first developed go-back-n, was happening concurrently with TCP > development. Sometimes TCP was ahead of the theory. Indeed, the theory > never quite caught up as all the ARQ research is based on a slotted > transmission channel and the question is always "what is the most > efficient packet to fill the current slot." Put another way, ARQ > doesn't do flow control. Period :-) I think, this is a very important note and as far as I see, the term "flow control" is sometimes used in a multi meaning manner. > Second, formally, go-back-n says that whenever you detect a loss, you > restart transmission of *all* data beginning with the lost item. In terms of TCP and the variables of RFC 793, you set snd.nxt to the value of snd.una. Doing so, you accept that some packets are retransmitted without necessity. > So if > you sent bytes 1000 through 8000 and learn that there was a loss at 2000, > you resend 2000 through 8000. O.k., I oversimplified. When you wrote the paper together with Phil Karn, did you assume a retransmission queue as in RFC 793? In that case, you have one RTO timer per packet and hence can determine the very packet which is not acknowledged timely. In more recent RFC, we use only one RTO timer which "slides" with the window (i.e. when a new ack arrives which does not ack all outstanding data, you the pending RTO is cancelled and a new one is started waiting for at least the value "snd.nxt"), hence you will restart the transmission at snd.una. When you use the original concept of a retransmission queue, you can restart the transmission, referring to your example, with 2000 when 2000 was the first loss. > > Some of the first TCP implementations did something like this (cf. David > Plummer's note "Re: Interesting Happenings" on the TCP-IP list of 8 June 1984). > Many other TCPs would only retransmit a chunk of data at 2000 and wait for > an ACK to see where the next data gap was (also noted in Plummer's note). > At some point in the early 1980s Jon Postel sent out a note saying that the > retransmission only of the data immediately around the known loss was the > right thing to do, but I can't find the note in my (limited) archives. That make sense, particularly as it makes flow control difficult to retransmit data unnecessarily and cause duplicates. > > I believe by about 1986 or so, all TCPs were only retransmitting the data > known to be lost. I.e. they followed the scheme of a retransmission queue as outlined in RFC 793? > Certainly, the 4.3bsd release of June 1986 only > retransmitted data that was known to be lost. It is my recollection that > 4.2bsd only only retransmitted data that was lost as it was based on the > BBN 4.1c TCP, which only retransmitted lost data. So by 1986 and probably > as early as 1979 if not before, BSD TCP was *not* go back n. (One might say > it was a flow controlled variant of what theory called "selective repeat ARQ"). > > TCP Tahoe was released in June 1988 and added the initial Van Jacobson > versions of slow start and the like. So, in short, TCP Tahoe was never based > on go back n. However, I just had a look at the congavoid paper yesterday, the congavoid paper does not mention a retransmission queue. In addition: When Tahoe does selective ARQ, how is it ensured that any sent (or re-sent) packets stays within the limites of the actual CWND? Is it done on per packet basis? (This would require some more lines than the "three lines of code" mentioned by VJ.) > > The question of how to do retransmission RTO timers has a parallel history > until it converges with Van's work in 1988. The broad point is that > people did not necessarily track each packet's round-trip separately ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ exactly that's the point. Detlef From jack at 3kitty.org Sun May 18 11:46:15 2014 From: jack at 3kitty.org (Jack Haverty) Date: Sun, 18 May 2014 11:46:15 -0700 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: <5378C31A.3000108@web.de> References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> Message-ID: Since this is a "history" forum, I'll offer my perspective as one who was there in the 80s and involved in the TCP work... IMHO, it's important to make the distinction between the protocol and the implementations of that protocol. The protocol defines the formats of the data passing back and forth "on the wire", and the required actions that the computer at each and take in response to receiving that data. How a particular implementation performs that response is totally up to that particular implementer. So, when you're talking about ARQ, packet timers, retransmission algorithms, et al, you're talking about the *implementation*, rather than the TCP protocol itself. I wrote a TCP back in the 1979 timeframe - the first one for a Unix system, running on a PDP-11/40. It first implemented TCP version 2.5, and later evolved to version 4. It was a very basic implementation, no "slow start" or any other such niceties that were created as the Internet grew. As far as I know, that 1979 protocol is the same protocol as is in use today (pending IPV6 of course). So if my 1979 TCP could somehow be loaded into a PDP-11 today, it should still be able to communicate with all the other TCPs out there. Of course I haven't been tracking all the details of the TCP work over the last few decades, so someone will tell me if I just missed it, but I don't think anything in the actual protocol has changed. True? IMHO, TCP itself, i.e. the protocol, hasn't changed at all in the last 30+ years. However, there has been a lot of work inventing new algorithms and putting them into implementations of TCP, probably starting with van Jacobsen's work. RFC 793 makes the distinction between the protocol and the algorithms used in the implementation: "Because of the variability of the networks that compose an internetwork system and the wide range of uses of TCP connections the retransmission timeout must be dynamically determined. One procedure for determining a retransmission time out is given here as an illustration." (page 41) The "Example Retransmission Timeout Procedure" which follows in the RFC 793 spec is an *example*. It is not required as part of the protocol. Much of the detail in RFC 793 about such algorithms and implementation strategies was presented as an example. The core of the protocol itself is the packet formats and state diagram, which all implementations must follow. That approach provided implementers with a lot of flexibility. This flexibility was intentional, for two reasons. I recall some of the meetings where retransmission was discussed as TCP 4 congealed. Basically we decided that we didn't have a clue, or at least didn't agree, what "the" right answer should be, and much experimentation would be needed and appropriate. The "rough consensus" criteria wasn't met, so the specific retransmission algorithm of RFC 793 was included only as an example. It was intended as a starting point for experimentation. However, the protocol was structured to essentially preclude certain implementation strategies. ARQ, for example, involves Requesting a Repeat. But there is no guaranteed back-channel in TCP whereby you could reliably make such a request, other than the very rudimentary Window and Sequence Number mechanisms. No way to say "send that last packet again" for example. We did talk about such things, a lot, but decided it was too complex, especially when you had routers doing things like fragmentation, or the possibility that the reverse traffic flow would be cut off. There was also discussion of adding an "out of band" channel to TCP to enable implementations to reliably do negotiations like ARQ but that was also excluded to reduce complexity. ICMP was introduced as an in-band control channel of sorts, but such packets were by definition unreliable and therefore appropriate only for cases where losing a packet wouldn't cause the connection to lock up or misbehave. The second reason was that there were many conflicting goals that different implementers faced. Some had to shoehorn the TCP into a very limited computer (that would be me). Others were pressured to avoid using precious computer cycles that otherwise would generate revenue. Some TCPs were used in situations dominated by character-at-a-time Telnet activity, and users wanted what they typed to echo immediately - so hanging on to data hoping for more to arrive before sending it out was unacceptable. There were lots of forces pulling different ways. The result was that there were a lot of TCP implementations, all conforming to the protocol, but with widely different internal algorithms. I encountered one such implementation that was likely the simplest possible. It would only accept the next sequential bytes in the byte stream that would fit in its (small) buffer, and simply discard any packet that arrived out-of-order, or any other bytes in the packet it received, knowing they would be sent again. Another implementation would retransmit all of its unacknowledged data immediately whenever it received a Source Quench, following the philosophy that a "Source Quench", despite the name, actually told it that its previous transmission had been discarded by some router along the way and therefore had to be retransmitted. When I was at Oracle, we had to test our software with all the TCPs that a customer might use. I recall that, at the time (1990 or so), there were more than 30 unique and different implementations of TCP available just for DOS! So, there were (and probably still are) a lot of algorithms within different TCP implementations that you wouldn't give any "best practice" medals. But they are all legal implementations of the same TCP protocol. I still have a listing of that ancient Unix TCP, written in Macro-11, dated March 30, 1979. One of these days I'll get it scanned! /Jack Haverty On Sun, May 18, 2014 at 7:26 AM, Detlef Bosau wrote: > Am 18.05.2014 16:03, schrieb Craig Partridge: > > I'll try to be brief (a multi-page essay keeps wanting to break out > > as I write this). > > > > First, worth remembering that ARQ research (Automatic Repeat reQuest), > the > > work that first developed go-back-n, was happening concurrently with TCP > > development. Sometimes TCP was ahead of the theory. Indeed, the theory > > never quite caught up as all the ARQ research is based on a slotted > > transmission channel and the question is always "what is the most > > efficient packet to fill the current slot." Put another way, ARQ > > doesn't do flow control. > > Period :-) > > I think, this is a very important note and as far as I see, the term > "flow control" is sometimes used in a multi meaning manner. > > > Second, formally, go-back-n says that whenever you detect a loss, you > > restart transmission of *all* data beginning with the lost item. > > In terms of TCP and the variables of RFC 793, you set snd.nxt to the > value of snd.una. > Doing so, you accept that some packets are retransmitted without necessity. > > So if > > you sent bytes 1000 through 8000 and learn that there was a loss at 2000, > > you resend 2000 through 8000. > > O.k., I oversimplified. When you wrote the paper together with Phil > Karn, did you assume a retransmission queue as in RFC 793? > In that case, you have one RTO timer per packet and hence can determine > the very packet which is not acknowledged timely. > In more recent RFC, we use only one RTO timer which "slides" with the > window (i.e. when a new ack arrives which does not ack all outstanding > data, you the pending RTO is cancelled and a new one is started waiting > for at least the value "snd.nxt"), hence you will restart the > transmission at snd.una. > > When you use the original concept of a retransmission queue, you can > restart the transmission, referring to your example, with 2000 when 2000 > was the first loss. > > > > Some of the first TCP implementations did something like this (cf. David > > Plummer's note "Re: Interesting Happenings" on the TCP-IP list of 8 June > 1984). > > Many other TCPs would only retransmit a chunk of data at 2000 and wait > for > > an ACK to see where the next data gap was (also noted in Plummer's note). > > At some point in the early 1980s Jon Postel sent out a note saying that > the > > retransmission only of the data immediately around the known loss was the > > right thing to do, but I can't find the note in my (limited) archives. > > That make sense, particularly as it makes flow control difficult to > retransmit data unnecessarily and cause duplicates. > > > > > I believe by about 1986 or so, all TCPs were only retransmitting the data > > known to be lost. > > I.e. they followed the scheme of a retransmission queue as outlined in > RFC 793? > > > Certainly, the 4.3bsd release of June 1986 only > > retransmitted data that was known to be lost. It is my recollection that > > 4.2bsd only only retransmitted data that was lost as it was based on the > > BBN 4.1c TCP, which only retransmitted lost data. So by 1986 and > probably > > as early as 1979 if not before, BSD TCP was *not* go back n. (One might > say > > it was a flow controlled variant of what theory called "selective repeat > ARQ"). > > > > TCP Tahoe was released in June 1988 and added the initial Van Jacobson > > versions of slow start and the like. So, in short, TCP Tahoe was never > based > > on go back n. > > However, I just had a look at the congavoid paper yesterday, the > congavoid paper does not mention a retransmission queue. > In addition: When Tahoe does selective ARQ, how is it ensured that any > sent (or re-sent) packets stays within the limites of the actual CWND? > Is it done on per packet basis? (This would require some more lines than > the "three lines of code" mentioned by VJ.) > > > > > The question of how to do retransmission RTO timers has a parallel > history > > until it converges with Van's work in 1988. The broad point is that > > people did not necessarily track each packet's round-trip separately > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > exactly that's the point. > > > Detlef > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeanjour at comcast.net Sun May 18 12:40:35 2014 From: jeanjour at comcast.net (John Day) Date: Sun, 18 May 2014 15:40:35 -0400 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> Message-ID: At 11:46 AM -0700 5/18/14, Jack Haverty wrote: >Since this is a "history" forum, I'll offer my perspective as one >who was there in the 80s and involved in the TCP work... > >IMHO, it's important to make the distinction between the protocol >and the implementations of that protocol. The protocol defines the >formats of the data passing back and forth "on the wire", and the >required actions that the computer at each and take in response to >receiving that data. > >How a particular implementation performs that response is totally up >to that particular implementer. > >So, when you're talking about ARQ, packet timers, retransmission >algorithms, et al, you're talking about the *implementation*, rather >than the TCP protocol itself. > >I wrote a TCP back in the 1979 timeframe - the first one for a Unix >system, running on a PDP-11/40. It first implemented TCP version >2.5, and later evolved to version 4. It was a very basic >implementation, no "slow start" or any other such niceties that were >created as the Internet grew. I think we went over this earlier and the conclusion was, we weren't sure. But I can say Jack's was probably the first on an 11/40. By 1979, we were on our second TCP implementation on Unix on an 11/45 and 11/70, and would start our third soon. Take care, John From detlef.bosau at web.de Sun May 18 13:49:39 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Sun, 18 May 2014 22:49:39 +0200 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> Message-ID: <53791CE3.30208@web.de> Am 18.05.2014 20:46, schrieb Jack Haverty: > Since this is a "history" forum, I'll offer my perspective as one who > was there in the 80s and involved in the TCP work... Nevertheless, my question has a background from my work in this days. > > IMHO, it's important to make the distinction between the protocol and > the implementations of that protocol. The protocol defines the > formats of the data passing back and forth "on the wire", and the > required actions that the computer at each and take in response to > receiving that data. I agree - however, care must be taken that a protocol's semantics are not affected. My question arouse because I'm just about to reimplement my own network simulator - and (for the 2nd or 3rd time) I want to understand how VJCC is integrated in TCP and particularly Karn's algorithm. >From my knowledge up to know I always assumed TCP would do go back n typically - as I learned now, this assumption was wrong. > > How a particular implementation performs that response is totally up > to that particular implementer. Yes for the implementation, however a node's behaviour as seen from the outside should follow the standards quite strict. "Be liberal in what you accept but conservative in what you send." > > So, when you're talking about ARQ, packet timers, retransmission > algorithms, et al, you're talking about the *implementation*, rather > than the TCP protocol itself. Not quite. The internet is always a well behaved community. E.g.: Delayed ACK. I would have to look it up in RFC 793, but IIRC delayed ACK is not mandatory there - actually it is highly recommended. A TCP socket which would send a pure ACK for each received segment would certainly work - but it would offer load to the shared resources in a network. > > I wrote a TCP back in the 1979 timeframe - the first one for a Unix > system, running on a PDP-11/40. It first implemented TCP version 2.5, > and later evolved to version 4. It was a very basic implementation, > no "slow start" or any other such niceties that were created as the > Internet grew. > > As far as I know, that 1979 protocol is the same protocol as is in use > today (pending IPV6 of course). So if my 1979 TCP could somehow be > loaded into a PDP-11 today, it should still be able to communicate > with all the other TCPs out there. Of course I haven't been tracking > all the details of the TCP work over the last few decades, so someone > will tell me if I just missed it, but I don't think anything in the > actual protocol has changed. True? I think it would be compatible. (Although I didn't know anything about TCP in 1979, I was born in 1963 and I did not know anything about packet switching in 1979.) > > IMHO, TCP itself, i.e. the protocol, hasn't changed at all in the last > 30+ years. However, there has been a lot of work inventing new > algorithms and putting them into implementations of TCP, probably > starting with van Jacobsen's work. > > RFC 793 makes the distinction between the protocol and the algorithms > used in the implementation: > > "Because of the variability of the networks that compose an > internetwork system and the wide range of uses of TCP connections the > retransmission timeout must be dynamically determined. One procedure > for determining a retransmission time out is given here as > anillustration." (page 41) > > The "Example Retransmission Timeout Procedure" which follows in the > RFC 793 spec is an *example*. It is not required as part of the protocol. However, when we pursue "fairness" for TCP rates, it makes sense to have identical time out mechanisms. > > However, the protocol was structured to essentially preclude certain > implementation strategies. ARQ, for example, involves Requesting a Repeat. Not quite. You may well request a repeat by a NACK, you may resend a packet when there is no ACK. However, the semantics of a present ACK is not the same as the semantics of a missing NACK. > But there is no guaranteed back-channel in TCP whereby you could > reliably make such a request, other than the very rudimentary Window > and Sequence Number mechanisms. We work around this lack of reliability by TCP's cumulative ACK scheme. > No way to say "send that last packet again" for example. We did talk > about such things, a lot, but decided it was too complex, especially > when you had routers doing things like fragmentation, or the > possibility that the reverse traffic flow would be cut off. There was > also discussion of adding an "out of band" channel to TCP to enable > implementations to reliably do negotiations like ARQ but that was also > excluded to reduce complexity. The "push" flag and the "urgent pointer" ;-) Back to my actual work. I wanted to reproduce the step from RFC 793 to Tahoe - and I ever thought, both did go back n - and learned: Neither does so. > ICMP was introduced as an in-band control channel of sorts, but such > packets were by definition unreliable and therefore appropriate only > for cases where losing a packet wouldn't cause the connection to lock > up or misbehave. > > The second reason was that there were many conflicting goals that > different implementers faced. Some had to shoehorn the TCP into a very > limited computer (that would be me). Others were pressured to avoid > using precious computer cycles that otherwise would generate revenue. > Some TCPs were used in situations dominated by character-at-a-time > Telnet activity, and users wanted what they typed to echo immediately > - so hanging on to data hoping for more to arrive before sending it > out was unacceptable. There were lots of forces pulling different ways. My basic interest is TCP flow control, what hardware limitations are concerned, these are - at least to a huge degree - historical considerations. > > The result was that there were a lot of TCP implementations, all > conforming to the protocol, but with widely different internal algorithms. > > I encountered one such implementation that was likely the simplest > possible. It would only accept the next sequential bytes in the byte > stream that would fit in its (small) buffer, and simply discard any > packet that arrived out-of-order, or any other bytes in the packet it > received, knowing they would be sent again. > > Another implementation would retransmit all of its unacknowledged data > immediately whenever it received a Source Quench, following the > philosophy that a "Source Quench", despite the name, actually told it > that its previous transmission had been discarded by some router along > the way and therefore had to be retransmitted. > Is this the semantics of a Source Quench? > When I was at Oracle, we had to test our software with all the TCPs > that a customer might use. I recall that, at the time (1990 or so), > there were more than 30 unique and different implementations of TCP > available just for DOS! Yes. And they were as compatible as men and women. > > So, there were (and probably still are) a lot of algorithms within > different TCP implementations that you wouldn't give any "best > practice" medals. But they are all legal implementations of the same > TCP protocol. O.k., but at one point in time, computer science must turn from a number of kludges to an engineering science. Detlef -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From jack at 3kitty.org Sun May 18 14:10:49 2014 From: jack at 3kitty.org (Jack Haverty) Date: Sun, 18 May 2014 14:10:49 -0700 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> Message-ID: Hi John, Just FYI, the Unix TCP listing that survives implemented TCP version 4 and is dated March 30, 1979. It is however a descendant of earlier versions that started with TCP 2.5. As far as I know, there were no other Unix TCPs at the time - mid 1977. At least neither BBN nor (D)ARPA knew of any (and I'd expect Vint would have known). We needed a TCP for a Unix system as a part of another ARPA project, so I got the task to take Jim Mathis' existing LSI-11 TCP code and tweak it as needed to get a TCP functional on Unix on a PDP-11/40 for that project to use. That ended up involving mostly kernel hacking to get the right primitives into Unix and creating interfaces for user processes. It was never intended to be a general purpose Unix implementation, and the design choices necessary to get everything into a PDP-11/40 were not what anyone would want for a more capable computer, so other implementations were subsequently started (funded) by DARPA and DCEC at BBN to create "from scratch" general purpose TCPs for PDP11/45 and /70 Unix systems (Rob Gurwitz, Al Nemeth, Mike Wingfield and others I can't remember...). My yellowing lab notebook saved with the listing contains diary entries, e.g., July 27, 1977: "got TCP11 Unix version to assemble", and September 16, 1977: "TCP and Al Spector's TCP can talk fine" (Al was using Mathis' TCP on an LSI-11 communicating with the Unix system). Most of the intervening entries had to do with recovering from disk crashes and other such annoyances. By 1979 the TCP working group had gotten to the TCP4 stage and I modified the original code as needed along the way as we made changes to get from 2.5 to 2.5 plus epsilon to eventually 4 as captured by the 1979 listing. Fun times! /Jack Haverty On Sun, May 18, 2014 at 12:40 PM, John Day wrote: > At 11:46 AM -0700 5/18/14, Jack Haverty wrote: > >> Since this is a "history" forum, I'll offer my perspective as one who was >> there in the 80s and involved in the TCP work... >> >> IMHO, it's important to make the distinction between the protocol and the >> implementations of that protocol. The protocol defines the formats of the >> data passing back and forth "on the wire", and the required actions that >> the computer at each and take in response to receiving that data. >> >> How a particular implementation performs that response is totally up to >> that particular implementer. >> >> So, when you're talking about ARQ, packet timers, retransmission >> algorithms, et al, you're talking about the *implementation*, rather than >> the TCP protocol itself. >> >> I wrote a TCP back in the 1979 timeframe - the first one for a Unix >> system, running on a PDP-11/40. It first implemented TCP version 2.5, and >> later evolved to version 4. It was a very basic implementation, no "slow >> start" or any other such niceties that were created as the Internet grew. >> > > I think we went over this earlier and the conclusion was, we weren't sure. > But I can say Jack's was probably the first on an 11/40. > > By 1979, we were on our second TCP implementation on Unix on an 11/45 and > 11/70, and would start our third soon. > > Take care, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeanjour at comcast.net Sun May 18 14:33:57 2014 From: jeanjour at comcast.net (John Day) Date: Sun, 18 May 2014 17:33:57 -0400 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> Message-ID: Yes, we have discussed this on this list before. And the memory in our group is fuzzy or no one has delved into their attic. But we put the first Unix on a PDP-11/45 on the Net with NCP in the summer of 1975 and then immediately started work on TCP. We were working for DCA and JTSA (remember them). As I said, we were on our second implementation by 77 or so and were doing our 3rd by 1978 when I returned from Houston. I forget when we took delivery of our 11/70 but it was certainly about this time. DARPA may not have known about it, but I doubt that. I remember Grossman returning from TCP meetings in Cambridge where there had been long discussions about whether things in the spec could work and it turned out we were further along on implementation than BBN was. One of these days, I may get a chance to dig into Grossman's attic. ;-) Take care, John At 2:10 PM -0700 5/18/14, Jack Haverty wrote: >Hi John, > >Just FYI, the Unix TCP listing that survives implemented TCP version >4 and is dated March 30, 1979. It is however a descendant of >earlier versions that started with TCP 2.5. > >As far as I know, there were no other Unix TCPs at the time - mid >1977. At least neither BBN nor (D)ARPA knew of any (and I'd expect >Vint would have known). We needed a TCP for a Unix system as a >part of another ARPA project, so I got the task to take Jim Mathis' >existing LSI-11 TCP code and tweak it as needed to get a TCP >functional on Unix on a PDP-11/40 for that project to use. That >ended up involving mostly kernel hacking to get the right primitives >into Unix and creating interfaces for user processes. It was never >intended to be a general purpose Unix implementation, and the design >choices necessary to get everything into a PDP-11/40 were not what >anyone would want for a more capable computer, so other >implementations were subsequently started (funded) by DARPA and DCEC >at BBN to create "from scratch" general purpose TCPs for PDP11/45 >and /70 Unix systems (Rob Gurwitz, Al Nemeth, Mike Wingfield and >others I can't remember...). > >My yellowing lab notebook saved with the listing contains diary >entries, e.g., July 27, 1977: "got TCP11 Unix version to assemble", >and September 16, 1977: "TCP and Al Spector's TCP can talk fine" (Al >was using Mathis' TCP on an LSI-11 communicating with the Unix >system). Most of the intervening entries had to do with recovering >from disk crashes and other such annoyances. By 1979 the TCP >working group had gotten to the TCP4 stage and I modified the >original code as needed along the way as we made changes to get from >2.5 to 2.5 plus epsilon to eventually 4 as captured by the 1979 >listing. > >Fun times! > >/Jack Haverty > > > >On Sun, May 18, 2014 at 12:40 PM, John Day ><jeanjour at comcast.net> wrote: > >At 11:46 AM -0700 5/18/14, Jack Haverty wrote: > >Since this is a "history" forum, I'll offer my perspective as one >who was there in the 80s and involved in the TCP work... > >IMHO, it's important to make the distinction between the protocol >and the implementations of that protocol. The protocol defines the >formats of the data passing back and forth "on the wire", and the >required actions that the computer at each and take in response to >receiving that data. > >How a particular implementation performs that response is totally up >to that particular implementer. > >So, when you're talking about ARQ, packet timers, retransmission >algorithms, et al, you're talking about the *implementation*, rather >than the TCP protocol itself. > >I wrote a TCP back in the 1979 timeframe - the first one for a Unix >system, running on a PDP-11/40. It first implemented TCP version >2.5, and later evolved to version 4. It was a very basic >implementation, no "slow start" or any other such niceties that were >created as the Internet grew. > > >I think we went over this earlier and the conclusion was, we weren't >sure. But I can say Jack's was probably the first on an 11/40. > >By 1979, we were on our second TCP implementation on Unix on an >11/45 and 11/70, and would start our third soon. > >Take care, >John -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Sun May 18 14:49:41 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Sun, 18 May 2014 23:49:41 +0200 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> Message-ID: <53792AF5.8070409@web.de> Just to make my thoughts more clear. Let A,B,C end nodes, S some switch. A \ S---------------------------------C / B Assume one data flow from a to C, a competing one from B to C. I expect that it makes a difference, whether both TCPs employ the same RTO scheme or, e.g. A, uses a more aggressive one that does earlier retransmissions than its competitor. At least when we expect both TCPs to use a fair share of resources, both TCPs should employ the same scheme. Do you agree? Detlef -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From vint at google.com Sun May 18 15:26:48 2014 From: vint at google.com (Vint Cerf) Date: Sun, 18 May 2014 18:26:48 -0400 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: <53792AF5.8070409@web.de> References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> <53792AF5.8070409@web.de> Message-ID: i think this is not an adequate analysis. Delay variations between A, S, B and C might produce different timeouts even if the RTT measurement algorithms are the same. "fair"is not a simple notion in the dynamic and variable world of the Internet. v On Sun, May 18, 2014 at 5:49 PM, Detlef Bosau wrote: > Just to make my thoughts more clear. > > Let A,B,C end nodes, S some switch. > > A > \ > S---------------------------------C > / > B > > Assume one data flow from a to C, a competing one from B to C. > > I expect that it makes a difference, whether both TCPs employ the same > RTO scheme or, e.g. A, uses a more aggressive one that does earlier > retransmissions than its competitor. At least when we expect both TCPs > to use a fair share of resources, both TCPs should employ the same scheme. > > Do you agree? > > Detlef > > -- > ------------------------------------------------------------------ > Detlef Bosau > Galileistra?e 30 > 70565 Stuttgart Tel.: +49 711 5208031 > mobile: +49 172 6819937 > skype: detlef.bosau > ICQ: 566129673 > detlef.bosau at web.de http://www.detlef-bosau.de > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jack at 3kitty.org Sun May 18 18:49:50 2014 From: jack at 3kitty.org (Jack Haverty) Date: Sun, 18 May 2014 18:49:50 -0700 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> <53792AF5.8070409@web.de> Message-ID: I thinkl Vint's right - the Internet is very complex. it's very difficult to draw conclusions about its behavior. In addition to the variance in delay on those "wires", there's other unpredictable influences on the behavior of the algorithms. For example, random events that happen may effect A and B differently, even though they are implemented identically. I saw an example of this in the Oracle internal internet-clone back in the 90s. Users had been complaining about performance instability. So we instrumented the paths along the network (like A,B to S and C, but somewhat more complex. A and B were in our building near San Francisco, and C was somewhere far away, I think in Asia. Lots of SNMP gathering lots of data from the equipment along the path. We had two identical A and B computers, who were presumably therefore using identical TCP implementations (the TCPs were not ours, so we couldn't "look inside"). When two simultaneous jobs were fired up, they started A-C and B-C interactions, which were essentially the same, i.e., it was "fair". They each got similar throughput and delay behavior, as seen by the users. However, eventually something somewhere would drop a packet, e.g., caused by a noise burst on a satellite circuit. That would of course trigger retransmission and other error recovery mechanisms. What we observed was that, after the dust settled, the two virtual circuits would return to a stable flow. The TCP connections remained intact. But one flow was about twice the delay and half the throughput of the other. Very unfair. Investigating further, we determined that, after the disruption cleared, one of the TCPs had settled into a new stable pattern where every packet was being sent twice. I.E., it was retransmitting everything. This was rather annoying since not only was the user seeing less throughput and higher delay, the network was using twice as much resource to deliver that poor service. Intercontinental circuits are expensive! Presumably one of the TCPs was the one which had a packet destroyed, and the other one didn't. But they both experienced a disruption as the satellite circuit reset, so they both may have decided to retransmit. Obviously one of them didn't handle the situation as well as the other, even though they were identical. Since the TCP involved was out of our control, we passed the problem to the computer vendor. Whether or not they ever fixed it is anybody's guess. It might be tempting to characterize this as a "bug", either in the code or in the algorithm. But it illustrates that, even with identical implementations of the same algorithm, "fairness" is an elusive goal. /Jack Haverty On Sun, May 18, 2014 at 3:26 PM, Vint Cerf wrote: > i think this is not an adequate analysis. Delay variations between A, S, B > and C might produce different timeouts even if the RTT measurement > algorithms are the same. "fair"is not a simple notion in the dynamic and > variable world of the Internet. > > v > > > > On Sun, May 18, 2014 at 5:49 PM, Detlef Bosau wrote: > >> Just to make my thoughts more clear. >> >> Let A,B,C end nodes, S some switch. >> >> A >> \ >> S---------------------------------C >> / >> B >> >> Assume one data flow from a to C, a competing one from B to C. >> >> I expect that it makes a difference, whether both TCPs employ the same >> RTO scheme or, e.g. A, uses a more aggressive one that does earlier >> retransmissions than its competitor. At least when we expect both TCPs >> to use a fair share of resources, both TCPs should employ the same scheme. >> >> Do you agree? >> >> Detlef >> >> -- >> ------------------------------------------------------------------ >> Detlef Bosau >> Galileistra?e 30 >> 70565 Stuttgart Tel.: +49 711 5208031 >> mobile: +49 172 6819937 >> skype: detlef.bosau >> ICQ: 566129673 >> detlef.bosau at web.de http://www.detlef-bosau.de >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Sun May 18 23:33:19 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Mon, 19 May 2014 08:33:19 +0200 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> <53792AF5.8070409@web.de> Message-ID: <5379A5AF.5090709@web.de> Am 19.05.2014 00:26, schrieb Vint Cerf: > i think this is not an adequate analysis. Delay variations between A, > S, B and C might produce different timeouts even if the RTT > measurement algorithms are the same. "fair"is not a simple notion in > the dynamic and variable world of the Internet. On the one hand, I agree. On the other: You put in question basically all phd theses on congestion control from the last 20 years. I just had a look at the ns2 code, I always (if unsaid) assumed, it did go back n as retransmission strategy. (And from what I've seen 5 minutes ago, it apparently does. However, I should take the time to have a closer look at the code.) => Quite a lot of work in the past apparently made quite strong assumptions, which may simply not hold in any cases, => many phd-theses were theses on "ns2" but not on "TCP". What you say is no contradiction to my analysis. What you say is that an identical rtt measurement algorithm may be insufficient to produce "faire shares". (The term is not my invention, btw.) Detlef > > v > > > > On Sun, May 18, 2014 at 5:49 PM, Detlef Bosau > wrote: > > Just to make my thoughts more clear. > > Let A,B,C end nodes, S some switch. > > A > \ > S---------------------------------C > / > B > > Assume one data flow from a to C, a competing one from B to C. > > I expect that it makes a difference, whether both TCPs employ the same > RTO scheme or, e.g. A, uses a more aggressive one that does earlier > retransmissions than its competitor. At least when we expect both TCPs > to use a fair share of resources, both TCPs should employ the same > scheme. > > Do you agree? > > Detlef > > -- > ------------------------------------------------------------------ > Detlef Bosau > Galileistra?e 30 > 70565 Stuttgart Tel.: +49 711 5208031 > > mobile: +49 172 6819937 > > skype: detlef.bosau > ICQ: 566129673 > detlef.bosau at web.de > http://www.detlef-bosau.de > > -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Sun May 18 23:39:51 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Mon, 19 May 2014 08:39:51 +0200 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: <5379A5AF.5090709@web.de> References: <20140518140340.037C328E137@aland.bbn.com> <5378C31A.3000108@web.de> <53792AF5.8070409@web.de> <5379A5AF.5090709@web.de> Message-ID: <5379A737.7060701@web.de> put another way: I said that identical time out algorithms on A and B are necessary have both TCPs use their fair share of resources on the bottleneck (unsaid: link between s and c). Vint said: the use of an identical time out mechanisms is not sufficient for this purpose. (at least this is my understanding ...) -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From detlef.bosau at web.de Mon May 19 07:51:13 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Mon, 19 May 2014 16:51:13 +0200 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: <20140518140340.037C328E137@aland.bbn.com> References: <20140518140340.037C328E137@aland.bbn.com> Message-ID: <537A1A61.8070107@web.de> Am 18.05.2014 16:03, schrieb Craig Partridge: > > TCP Tahoe was released in June 1988 and added the initial Van Jacobson > versions of slow start and the like. So, in short, TCP Tahoe was never based > on go back n. And that's what confuses me. Particularly as I thought for years, TCP would use GBN as "default procedure" when a timeout occurs, obviously I was wrong. I had a closer look at the ns2 sources today (I don't work with the ns2 for about 5 years now, so I'm admittedly a bit out of practice) but as far as I see, the ns-2 uses GBN. So, we have (for the n-th time) the question: What is simulated at all in the ns-2? Unfortunately, I don't have the BSD 4.3 source code handy, when I can obtain it somewhere (under non disclosure agreement, if necessary) I would be interested in looking it up in the source code there. However, I'm more than confused that obviously the two dozens of industrial TCP flavours and the simulations in the ns-2 world simply talk at cross purposes for about two decades. Perhaps, we should correct our regulations that way, that PhD students are awarded a PhD in computer science - when they work with TCP, or a PhD in Cartoons und Urban Legends, when they work with the NS-2. (Yes, this sounds harsh. It doesn't only sound harsh. It IS harsh.) Computer scientists are engineers in the first. And for engineers, the only important thing are (and we keep it that way since the time of the Codex Hammurabi) proper standards and proper definitions. And I'm still to dig deeper into the NS2 code, because I simply cannot believe that TCP basically is not based on GBN - and AT THE SAME TIME the ns2 is. Unforatunately, I don't have a running ns2 version at the moment, this is my fault and I still think I have to spend some nights of work to keep up here. But at the moment, this possible divergence leaves me shocked. And that's decently spoken. But most likely, it is my own fault. And that leaves my angry about myself. If this is due to my negligence, this would be embarrassing for me :-( Detlef -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From detlef.bosau at web.de Mon May 19 08:02:19 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Mon, 19 May 2014 17:02:19 +0200 Subject: [ih] One clarification: Re: When was Go Back N adopted by TCP In-Reply-To: <537A1A61.8070107@web.de> References: <20140518140340.037C328E137@aland.bbn.com> <537A1A61.8070107@web.de> Message-ID: <537A1CFB.1020206@web.de> to avoid further confusion: With respect to TCP GBN means to do RTO handling as defined in RFC 2988 (and more recent), i.e. we have one RTO per CONNECTION - in contrast to one RTO per PACKET (at least, this is my understanding of RFC 793). Perhaps, I make a fuss of nothing, than I have to learn and to apologize, however at the moment, this causes some confusion.... Detlef -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From craig at aland.bbn.com Mon May 19 08:02:59 2014 From: craig at aland.bbn.com (Craig Partridge) Date: Mon, 19 May 2014 11:02:59 -0400 Subject: [ih] When was Go Back N adopted by TCP Message-ID: <20140519150259.35C1E28E137@aland.bbn.com> Hi Detlef: I don't keep the 4.3bsd code around anymore, but here's my recollection of what the code did. 4.3BSD had one round-trip timeout (RTO) counter per TCP connection. On round-trip timeout, send 1MSS of data starting at the lowest outstanding sequence number. Set the RTO counter to the next increment. Once an ack is received, update the sequence numbers and begin slow start again. What I don't remember is whether 4.3bsd kept track of multiple outstanding losses and fixed all of them before slow start or not. Others probably remember (and may also have corrections to the above). Thanks! Craig > Am 18.05.2014 16:03, schrieb Craig Partridge: > > > > TCP Tahoe was released in June 1988 and added the initial Van Jacobson > > versions of slow start and the like. So, in short, TCP Tahoe was never bas > ed > > on go back n. > > And that's what confuses me. Particularly as I thought for years, TCP > would use GBN as "default procedure" when a timeout occurs, > obviously I was wrong. > > I had a closer look at the ns2 sources today (I don't work with the ns2 > for about 5 years now, so I'm admittedly a bit out of practice) but as > far as I see, the ns-2 uses GBN. > > So, we have (for the n-th time) the question: What is simulated at all > in the ns-2? > > Unfortunately, I don't have the BSD 4.3 source code handy, when I can > obtain it somewhere (under non disclosure agreement, if necessary) I > would be interested in looking it up in the source code there. > > However, I'm more than confused that obviously the two dozens of > industrial TCP flavours and the simulations in the ns-2 world simply > talk at cross purposes for about two decades. > > Perhaps, we should correct our regulations that way, that PhD students > are awarded a PhD in computer science - when they work with TCP, or > a PhD in Cartoons und Urban Legends, when they work with the NS-2. > > (Yes, this sounds harsh. It doesn't only sound harsh. It IS harsh.) > > Computer scientists are engineers in the first. > > And for engineers, the only important thing are (and we keep it that way > since the time of the Codex Hammurabi) proper standards and proper > definitions. And I'm still to dig deeper into the NS2 code, because I > simply cannot believe that TCP basically is not based on GBN > - and AT THE SAME TIME the ns2 is. > > Unforatunately, I don't have a running ns2 version at the moment, this > is my fault and I still think I have to spend some nights of work to > keep up here. > > But at the moment, this possible divergence leaves me shocked. And > that's decently spoken. But most likely, it is my own fault. And that > leaves my angry about myself. If this is due to my negligence, this > would be embarrassing for me :-( > > Detlef > > -- > ------------------------------------------------------------------ > Detlef Bosau > Galileistra?e 30 > 70565 Stuttgart Tel.: +49 711 5208031 > mobile: +49 172 6819937 > skype: detlef.bosau > ICQ: 566129673 > detlef.bosau at web.de http://www.detlef-bosau.de ******************** Craig Partridge Chief Scientist, BBN Technologies E-mail: craig at aland.bbn.com or craig at bbn.com Phone: +1 517 324 3425 From vint at google.com Mon May 19 09:59:10 2014 From: vint at google.com (Vint Cerf) Date: Mon, 19 May 2014 12:59:10 -0400 Subject: [ih] One clarification: Re: When was Go Back N adopted by TCP In-Reply-To: <537A1CFB.1020206@web.de> References: <20140518140340.037C328E137@aland.bbn.com> <537A1A61.8070107@web.de> <537A1CFB.1020206@web.de> Message-ID: one RTO per connection makes sense: calculate and monitor the RTT for the connection and use that value to timeout and retransmit the oldest, unacknowledged packet. This is NOT GBN. It makes no sense to do RTO per packet calculation especially if the packet had to be retransmitted since you then get into double delay affecting the RTT computation.. vint On Mon, May 19, 2014 at 11:02 AM, Detlef Bosau wrote: > to avoid further confusion: > > With respect to TCP GBN means to do RTO handling as defined in RFC 2988 > (and more recent), > i.e. we have one RTO per CONNECTION - in contrast to one RTO per PACKET > (at least, this is my understanding of RFC 793). > > Perhaps, I make a fuss of nothing, than I have to learn and to > apologize, however at the moment, this causes some confusion.... > > > Detlef > > > -- > ------------------------------------------------------------------ > Detlef Bosau > Galileistra?e 30 > 70565 Stuttgart Tel.: +49 711 5208031 > mobile: +49 172 6819937 > skype: detlef.bosau > ICQ: 566129673 > detlef.bosau at web.de http://www.detlef-bosau.de > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From braden at isi.edu Mon May 19 11:43:30 2014 From: braden at isi.edu (Bob Braden) Date: Mon, 19 May 2014 11:43:30 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: References: Message-ID: <537A50D2.3010807@meritmail.isi.edu> Jack, You wrote: I wrote a TCP back in the 1979 timeframe - the first one for a Unix system, running on a PDP-11/40. It first implemented TCP version 2.5, and later evolved to version 4. It was a very basic implementation, no "slow start" or any other such niceties that were created as the Internet grew. I have been trying to recall where my TCP/IP for UCLA's IBM 360/91 ran in this horse race. The best I can tell from IEN 70 and IEN 77 is that my TCP-4 version made it between Dec 1978 and Jan 1979, although I think I had an initial TP-2.5 version talkng to itself in mid 1978. Bob Braden From braden at isi.edu Mon May 19 12:36:36 2014 From: braden at isi.edu (Bob Braden) Date: Mon, 19 May 2014 12:36:36 -0700 Subject: [ih] Detlef's TCP questions In-Reply-To: References: Message-ID: <537A5D44.30908@meritmail.isi.edu> Detlef, As Craig and Vint has pointed out, TCP never was GBN. Yes, any network researcher who wants to call him/herself a computer scientist should take seriously the experimentalist's task of fully understanding the assumptions and implementations of their test environment. That includes NS-2 simulations of TCP. Yes, in broad generality, the level of network science taught in many graduate schools is abysmal. How can those with clue resist the temptation of real mony in industry or getting rich from a startup? So the next generation of largely clueless PhDs learn from clueless predecessors. During the period of Van Jacobson's development of the algorithms that bear his name, he wrote many lengthy, pithy, and informative messages to various public mailing lists about the hazards of the Internet and how his algorithms cope. Maybe some of these lists are archived somewhere. Bob Braden From craig at aland.bbn.com Mon May 19 13:11:50 2014 From: craig at aland.bbn.com (Craig Partridge) Date: Mon, 19 May 2014 16:11:50 -0400 Subject: [ih] Detlef's TCP questions Message-ID: <20140519201150.9C03628E137@aland.bbn.com> > During the period of Van Jacobson's development of the algorithms that > bear his > name, he wrote many lengthy, pithy, and informative messages to various > public > mailing lists about the hazards of the Internet and how his algorithms cope. > Maybe some of these lists are archived somewhere. > > Bob Braden Hi Bob: Turns out the late Rich Stevens had a small collection of notes he considered Van's "greatest hits" from that time and was kind enough to share it with me. Here are the notes. Craig > To: markl at PTT.LCS.MIT.EDU > Subject: Re: interpacket arrival variance and mean > In-reply-to: Your message of Mon, 08 Jun 87 12:31:33 EDT. > Date: Mon, 15 Jun 87 06:08:01 PDT > From: Van Jacobson > > Mark - > > I'm working on long paper about transport protocol timers (models, > estimation, implementations, common implementation mistakes, etc.). > This has me all primed on the subject so attached is a long-winded > explanation and example C code for estimating the mean and variance > of a series of measurements. > > Though I know your application isn't tcp, I'm going to use a new > tcp retransmit timer algorithm as an example. The algorithm > illustrates the method and, based on 3 months of experiment, is > much better than the algorithm in rfc793 (though it isn't as good > as the algorithm I talked about at IETF and am currently debugging). > > Theory > ------ > You're probably familiar with the rfc793 algorithm for estimating > the mean round trip time. This is the simplest example of a > class of estimators called "recursive prediction error" or > "stochastic gradient" algorithms. In the past 20 years these > algorithms have revolutionized estimation and control theory so > it's probably worth while to look at the rfc793 estimator in some > detail. > > Given a new measurement, Meas, of the rtt, tcp updates an > estimate of the average rtt, Avg, by > > Avg <- (1 - g) Avg + g Meas > > where g is a "gain" (0 < g < 1) that should be related to the > signal- to-noise ratio (or, equivalently, variance) of Meas. > This makes a bit more sense (and computes faster) if we rearrange > and collect terms multiplied by g to get > > Avg <- Avg + g (Meas - Avg) > > We can think of "Avg" as a prediction of the next measurement. > So "Meas - Avg" is the error in that prediction and the > expression above says we make a new prediction based on the old > prediction plus some fraction of the prediction error. The > prediction error is the sum of two components: (1) error due to > "noise" in the measurement (i.e., random, unpredictable effects > like fluctuations in competing traffic) and (2) error due to a > bad choice of "Avg". Calling the random error RE and the > predictor error PE, > > Avg <- Avg + g RE + g PE > > The "g PE" term gives Avg a kick in the right direction while the > "g RE" term gives it a kick in a random direction. Over a number > of samples, the random kicks cancel each other out so this > algorithm tends to converge to the correct average. But g > represents a compromise: We want a large g to get mileage out of > the PE term but a small g to minimize the damage from the RE > term. Since the PE terms move Avg toward the real average no > matter what value we use for g, it's almost always better to use > a gain that's too small rather than one that's too large. > Typical gain choices are .1 - .2 (though it's always a good > idea to take long look at your raw data before picking a gain). > > It's probably obvious that Avg will oscillate randomly around > the true average and the standard deviation of Avg will be > something like g sdev(Meas). Also that Avg converges to the > true average exponentially with time constant 1/g. So > making g smaller gives a stabler Avg at the expense of taking > a much longer time to get to the true average. > > [My paper makes the preceding hand-waving a bit more rigorous > but I assume no one cares about rigor. If you do care, check > out almost any text on digital filtering or estimation theory.] > > If we want some measure of the variation in Meas, say to compute > a good value for the tcp retransmit timer, there are lots of > choices. Statisticians love variance because it has some nice > mathematical properties. But variance requires squaring (Meas - > Avg) so an estimator for it will contain two multiplies and a > large chance of integer overflow. Also, most of the applications > I can think of want variation in the same units as Avg and Meas, > so we'll be forced to take the square root of the variance to use > it (i.e., probably at least a divide, multiply and two adds). > > A variation measure that's easy to compute, and has a nice, > intuitive justification, is the mean prediction error or mean > deviation. This is just the average of abs(Meas - Avg). > Intuitively, this is an estimate of how badly we've blown our > recent predictions (which seems like just the thing we'd want to > set up a retransmit timer). Statistically, standard deviation > (= sqrt variance) goes like sqrt(sum((Meas - Avg)^2)) while mean > deviation goes like sum(sqrt((Meas - Avg)^2)). Thus, by the > triangle inequality, mean deviation should be a more "conservative" > estimate of variation. > > If you really want standard deviation for some obscure reason, > there's usually a simple relation between mdev and sdev. Eg., > if the prediction errors are normally distributed, mdev = > sqrt(pi/2) sdev. For most common distributions the factor > to go from sdev to mdev is near one (sqrt(pi/2) ~= 1.25). Ie., > mdev is a good approximation of sdev (and is much easier to > compute). > > Practice > -------- > So let's do estimators for Avg and mean deviation, Mdev. Both > estimators compute means so we get two instances of the rfc793 > algorithm: > > Err = abs (Meas - Avg) > Avg <- Avg + g (Meas - Avg) > Mdev <- Mdev + g (Err - Mdev) > > If we want to do the above fast, it should probably be done in > integer arithmetic. But the expressions contain fractions (g < 1) > so we'll have to do some scaling to keep everything integer. If > we chose g so that 1/g is an integer, the scaling is easy. A > particularly good choice for g is a reciprocal power of 2 > (ie., g = 1/2^n for some n). Multiplying through by 1/g gives > > 2^n Avg <- 2^n Avg + (Meas - Avg) > 2^n Mdev <- 2^n Mdev + (Err - Mdev) > > To minimize round-off error, we probably want to keep the scaled > versions of Avg and Mdev rather than the unscaled versions. If > we pick g = .125 = 1/8 (close to the .1 that rfc793 suggests) and > express all the above in C: > > Meas -= (Avg >> 3); > Avg += Meas; > if (Meas < 0) > Meas = -Meas; > Meas -= (Mdev >> 3); > Mdev += Meas; > > I've been using a variant of this to compute the retransmit timer > for my tcp. It's clearly faster than the two floating point > multiplies that 4.3bsd uses and gives you twice as much information. > Since the variation is estimated dynamically rather than using > the wired-in beta of rfc793, the retransmit performance is also > much better: faster retransmissions on low variance links and > fewer spurious retransmissions on high variance links. > > It's not necessary to use the same gain for Avg and Mdev. > Empirically, it looks like a retransmit timer estimate works > better if there's more gain (bigger g) in the Mdev estimator. > This forces the timer to go up quickly in response to changes in > the rtt. (Although it may not be obvious, the absolute value in > the calculation of Mdev introduces an asymmetry in the timer: > Because Mdev has the same sign as an increase and the opposite > sign of a decrease, more gain in Mdev makes the timer go up fast > and come down slow, "automatically" giving the behavior Dave > Mills suggests in rfc889.) > > Using a gain of .25 on the deviation and computing the retransmit > timer, rto, as Avg + 2 Mdev, my tcp timer code looks like: > > Meas -= (Avg >> 3); > Avg += Meas; > if (Meas < 0) > Meas = -Meas; > Meas -= (Mdev >> 2); > Mdev += Meas; > rto = (Avg >> 3) + (Mdev >> 1); > > Hope this helps. > > - Van > To: jain%erlang.DEC at decwrl.dec.com (Raj Jain, LKG1-2/A19, DTN: 226-7642) > Cc: ietf at gateway.mitre.org > Subject: Re: Your congestion scheme > In-Reply-To: Your message of 03 Nov 87 12:51:00 GMT. > Date: Mon, 16 Nov 87 06:03:29 PST > From: Van Jacobson > > Raj, > > Thanks for the note. I hope you'll excuse my taking the liberty > of cc'ing this reply to the ietf: At the last meeting there was a > great deal of interest in your congestion control scheme and our > adaptation of it. > > > I am curious to know what values of increase amount and decrease > > factor did you use. In our previous study on congestion using > > timeout (CUTE scheme, IEEE JSAC, Oct 1986), we had found that the > > decrease factor should be small since packet losses are > > expensive. In fact, we found that a decrease factor of zero > > (decreasing to one) was the best. > > We use .5 for the decrease factor and 1 for the increase factor. > We also use something very close to CUTE (Mike Karels and I were > behind on our journal reading so we independently rediscovered > the algorithm and called it slow-start). Since we use a lost > packet as the "congestion experienced" indicator, the CUTE > algorithm and the congestion avoidance algorithm (BTW, have you > picked a name for it yet?) get run together, even though they are > logically distinct. > > The sender keeps two state variables for congestion control: a > congestion window, cwnd, and a threshhold size, ssthresh, to > switch between the two algorithms. The sender's output routine > sends the minimum of cwnd and the window advertised by the > receiver. The rest of the congestion control sender code looks > like: On a timeout, record half the current window size as > "ssthresh" (this is the multiplicative decrease part of the > congestion avoidance algorithm), then set the congestion window > to 1 packet and call the output routine (this initiates > slowstart/CUTE). When new data is acked, the sender does > something like > > if (cwnd < ssthresh) // if we're still doing slowstart > cwnd += 1 packet // open the window exponentially > else > cwnd += 1/cwnd // otherwise do the Congestion > // Avoidance increment-by-1 > > Notice that our variation of CUTE opens the window exponentially > in time, as opposed to the linear scheme in your JSAC article. > We looked at a linear scheme but were concerned about the > performance hit on links with a large bandwidth-delay product > (ie., satellite links). An exponential builds up fast enough to > accomodate any bandwidth-delay and our testing showed no more > packet drops with exponential opening that with linear opening. > (My model of what slowstart is doing -- starting an ack "clock" > to meter out packets -- suggested that there should be no > loss-rate difference between the two policies). > > The reason for the 1/2 decrease, as opposed to the 7/8 you use, > was the following hand-waving: When a packet is dropped, you're > either starting (or restarting after a drop) or steady-state > sending. If you're starting, you know that half the current > window size "worked", ie., that a window's worth of packets were > exchanged with no drops (slowstart guarantees this). Thus on > congestion you set the window to the largest size that you know > works then slowly increase the size. If the connection is > steady-state running and a packet is dropped, it's probably > because a new connection started up and took some of your > bandwidth. We usually run our nets with rho <= .5 so it's > probable that there are now exactly two conversations sharing the > bandwidth. Ie., that you should reduce your window by half > because the bandwidth available to you has been reduced to half. > And, if there are more than two conversations sharing the > bandwidth, halving your window is conservative -- and being > conservative at high traffic intensities is probably wise. > > Obviously, this large decrease term is accounting for the high > cost of our "congestion experienced" indicator compared to yours -- > a dropped packet vs. a bit in the ack. If the DEC bit were > available, the preceding "logic" should be ignored. But I wanted > tcp congestion avoidance that we could deploy immediately and > incrementally, without adding code to the hundreds of Internet > gateways. So using dropped packets seemed like the only choice. > And, in system terms, the cost isn't that high: Currently packets > are dropped only when a large queue has formed. If we had a bit > to force senders to reduce their windows, we'd still be stuck > with the queue since we'd still be running the bottleneck at 100% > utilization so there's no excess bandwidth available to dissipate > the queue. If we toss a packet, a sender will shut up for 2 rtt, > exactly the time we need to empty the queue (in the ususal case). > If that sender restarts with the correct window size the queue > won't reform. Thus we've reduced the delay to minimum without > the system losing any bottleneck bandwidth. > > The 1-packet increase has less justification that the .5 > decrease. In fact, it's almost certainly too large. If the > algorithm converges to a window size of w, you get O(w^2) packets > between drops with an additive increase policy. We were shooting > for an average drop rate of <1% and found that on the Arpanet > (the worst case of the 4 networks we tested), windows converged > to 8 - 12 packets. This yields 1 packet increments for a 1% > average drop rate. > > But, since we've done nothing in the gateways, the window we > converge to is the maximum the gateway can accept without dropping > packets. I.e., in the terms you used, we are just to the left of > the cliff rather than just to the right of the knee. If we > now fix the gateways so they start dropping packets when the > queue gets pushed past the knee, our increment will be much too > agressive and we'll have to drop it by at least a factor of 4 > (since all my measurements on an unloaded Arpanet or Milnet > place their "pipe size" at 4-5 packets). Mike and I have talked > about a second order control loop to adaptively determine the > appropriate increment to use for a path (there doesn't seem to > be much need to adjust the decrement). It looks trivial to > implement such a loop (since changing the increment affects > the frequency of oscillation but not the mean window size, > the loop would affect rate of convergence but not convergence > and (first-order) stability). But we think 2nd order stuff > should wait until we've spent some time on the 1st order part > of the algorithm for the gateways. > > I'm tired and probably no longer making sense. I think I'll > head home and get some sleep. Hope to hear from you soon. > > Cheers. > > - Van > To: tcp-ip at sri-nic.arpa > Subject: Dynamic Congestion Avoidance / Control (long message) > Date: Thu, 11 Feb 88 22:17:04 PST > From: Van Jacobson > > A dozen people forwarded Phil Karn's question about TCP > congestion control to me, usually with pithy subject lines like > "how much longer till you publish something?". I do have three > RFCs and two papers in various stages of preparation, but innate > laziness combined with this semester's unexpectedly large > teaching load means it will probably be late spring before > anything gets finished. In the interim, here's a sort-of > overview of our congestion control work. > > I should point out these algorithms are joint work with Mike > Karels of UC Berkeley (I guess my name got stuck on things > because I make the presentations while Mike is off fighting the > battles on EGP or routing or domains or ...). I should also > mention that there's not a whole lot that's original. In > particular, the two most important algorithms are quite close to > (prior) work done by DEC's Raj Jain. This was by accident in > one case and deliberate in the other. > > This stuff has been discussed on the ietf and end2end lists > (Phil participated in some of those discussions and was, in > fact, one of the early beta testers for our new tcp -- I have > this nagging suspicion I was set up). I've appended a couple of > those mail messages. > > > Mike and I have put a number (six, actually) of new algorithms > into the 4bsd tcp. Our own measurements and the reports of our > beta testers suggest that the result is quite good at dealing > with current, congested conditions on the Internet. The various > algorithms were developed and presented at different times over > the past year and it may not be clear to anyone but the > developers how, or if, the pieces relate. > > Most of the changes spring from one observation: The flow on a > TCP connection (or ISO TP-4 or XNS SPP connection) should, by > the design of the protocol, obey a `conservation of packets' > principle. And, if this principle were obeyed, congestion > collapse would become the exception rather than the rule. > Thus congestion control turns into finding places where > conservation is violated and fixing them. > > By `conservation of packets' I mean the following: > If you look at the connection "in equilibrium", i.e., running > stably with a full window of data in transit, you notice that > the flow is what a physicist would call "conservative": A new > packet isn't put into the network until an old packet leaves. > Two scientific disciplines that deal with flow, hydrodynamics > and thermodynamics, predict that systems with this conservation > property should be extremely robust in the face of congestion. > Observation of the Arpanet or NSFNet suggests that they were not > particularly robust in the face of congestion. From whence > comes the discrepancy? > > [Someone asked me for a simple explanation of why a conservative > flow system should be stable under load. I've been trying to > come up with one, thus far without success -- absolutely nothing > in hydrodynamics admits to simple explanations. The shortest > explanation is that the inhomogenous forcing terms in the > Fokker-Planck equation vanish and the remaining terms are highly > damped. But I don't suppose this means much to anyone (it > doesn't mean much to me). My intuition is that conservation > means there's never an `energy difference' between the outside > world and the system that the system could use to `pump up' an > instability to a state of collapse. Thus the only mechanism the > system can use for self-destruction is to re-arrange its > internal energy and create a difference large enough to break > something. But entropy (or diffusion) always trys to erase > large internal energy differences so collapse via these > mechanisms is improbable. Possible, but improbable, at least > until the load gets so outrageous that collective, non-ergodic > effects start to dominate the overall behavior.] > > Packet conservation can fail three ways: > > 1) the connection doesn't get to equilibrium, or > > 2) a sender injects a packet before an old packet has exited, or > > 3) the equilibrium can't be reached because of resource > limits along the path. > > (1) has to be from a connection starting or restarting after a > packet loss. Another way to look at the conservation property > is to say that the sender uses acks as a "clock" to strobe new > packets into the network. Since the receiver can generate acks > no faster than data packets can get through the network, the > protocol is "self clocking" (an ack can't go in until a packet > comes out, a packet can't go in until an ack comes out). Self > clocking systems automatically adjust to bandwidth and delay > variations and have a wide dynamic range (an important issue > when you realize that TCP spans everything from 800 Mbit/sec > Cray channels to 1200 bit/sec packet radio links). But the same > thing that makes a self-clocked system stable when it's running > makes it hard to get started -- to get data flowing you want acks > to clock packets out but to get acks you have to have data flowing. > > To get the `clock' started, we came up with an algorithm called > slow-start that gradually increases the amount of data flowing. > Although we flatter ourselves that the design of this algorithm > is quite subtle, the implementation is trivial -- 3 lines of > code and one new state variable in the sender: > > 1) whenever you're starting or restarting after a loss, > set your "congestion window" to 1 packet. > 2) when you get an ack for new data, increase the > congestion window by one packet. > 3) when you send, send the minimum of the receiver's > advertised window and the congestion window. > > (This is quite similar to Raj Jain's CUTE algorithm described in > IEEE Transactions on Communications, Oct, '86, although we didn't > know about CUTE at the time we were developing slowstart). > > Actually, the slow-start window increase isn't all that gradual: > The window opening takes time proportional to log2(W) where W is > the window size in packets. This opens the window fast enough > to have a negligible effect on performance, even on links that > require a very large window. And the algorithm guarantees that > a connection will source data at a rate at most twice the > maximum possible on the path. (Without slow-start, by contrast, > when 10Mb ethernet hosts talk over the 56Kb Arpanet via IP > gateways, the gateways see a window's worth of packets (8-16) > delivered at 200 times the path bandwidth.) > > Once you can reliably start data flowing, problems (2) and (3) > have to be addressed. Assuming that the protocol implementation > is correct, (2) must represent a failure of sender's retransmit > timer. A good round trip time estimator (the core of the > retransmit timer) is the single most important feature of any > protocol implementation that expects to survive heavy load. And > it almost always gets botched. > > One mistake seems to be not estimating the variance of the rtt. > >From queuing theory we know that rtt and rtt variation increase > very quickly with load. Measuring the load by rho (the ratio of > average arrival rate to average departure rate), the rtt goes up > like 1/(1-rho) and the variation in rtt goes like 1/(1-rho)^2. > To make this concrete, if the network is running at 75% of > capacity (as the Arpanet was in last April's collapse), you > should expect rtt to vary by a factor of 16 around its mean > value. The RFC793 parameter "beta" accounts for rtt variation. > The suggested value of "2" can adapt to loads of at most 30%. > Above this point, a connection will respond to load increases by > retransmitting packets that have only been delayed in transit. > This makes the network do useless work (wasting bandwidth on > duplicates of packets that have been or will be delivered) at a > time when it's known to be having trouble with useful work. > I.e., this is the network equivalent of pouring gasoline on a > fire. > > We came up a cheap method for estimating variation (see first of > attached msgs) and the resulting retransmit timer essentially > eliminates spurious retransmissions. A pleasant side effect of > estimating "beta" rather than using a fixed value is that low > load as well as high load performance improves, particularly > over high delay paths such as satellite links. > > Another timer mistake seems to be in the backoff after a > retransmit: If we have to retransmit a packet more than once, > how should the retransmits be spaced? It turns out there's only > one scheme that's likely to work, exponential backoff, but > proving this is a bit involved (one of the two papers alluded to > in to opening goes through a proof). We might finesse a proof > by noting that a network is, to a very good approximation, a > linear system. (That is, it is composed of elements that behave > like linear operators -- integrators, delays, gain stages, > etc.). Linear system theory says that if a system is stable, > the stability is exponential. This suggests that if we have a > system that is unstable (a network subject to random load shocks > and prone to congestive collapse), the way to stabilize it is to > add some exponential damping (read: exponential timer backoff) > to its primary excitation (read: senders, traffic sources). > > Once the timers are in good shape, you can state with some > confidence that a timeout really means a lost packet and not a > busted timer. At this point you can do something about (3). > Packets can get lost for two reasons: they are damaged in > transit or the network is congested and somewhere on the path > there was insufficient buffer capacity. On most of the network > paths we use, loss due to damage is rare (<<1%) so it is very > probable that a packet loss is due to congestion in the network. > > Say we try to develop a `congestion avoidance' strategy like the > one Raj Jain, et.al., propose in DEC TR506 and ISO 8473. It > must have two components: The network must be able to signal > the transport endpoints that congestion is occurring (or about > to occur). And the endpoints must have a policy that decreases > utilization if this signal is received and increases utilization > if the signal isn't received. > > If packet loss is (almost) always due to congestion and if a > timeout is (almost) always due to a lost packet, we have a good > candidate for the `network is congested' signal. Particularly > since this signal is delivered automatically by all existing > networks, without special modification (as opposed to, say, ISO > 8473 which requires a new bit in the packet headers and a mod to > *all* existing gateways to set this bit). > > [As a (lengthy) aside: > Using `lost packets' to communicate information seems to bother > people. The second of the two papers mentioned above is devoted > to analytic and simulation investigations of our algorithm under > various network conditions. I'll briefly mention some of the > objections we've heard and, I think, addressed. > > There have been claims that signaling congestion via dropping > packets will adversely affect total throughput (it's easily > proved that the opposite is true) or that this signal will be > `too slow' compared to alternatives (The fundamental frequency > of the system is set by its average round trip time. From > control theory and Nyquist's theorem, any control strategy that > attempts to respond in less than two round trip times will be > unstable. A packet loss is detected in at most two rtt, the > minimum stable response time). > > There have also been claims that this scheme will result in > unfair or sub-optimal resource allocation (this is untrue if > equivalent implementations of ISO and our schemes are compared) > or that mistaking damage for congestion will unnecessarily > throttle the endpoints on some paths with a high intrinsic loss > rate (this is mostly untrue -- the scheme we propose is > analytically tractable so we've worked out its behavior under > random loss. It turns out that window flow control schemes just > don't handle high loss rates well and throughput of a vanilla > TCP under high, random loss is abysmal. Adding our congestion > control scheme makes things slightly worse but the practical > difference is negligible. As an example (that we have calculated > and Jon Crowcroft at UCL has measured), it takes 40 256-byte > packets to fill the Satnet pipe. Satnet currently shows a > random, 1% average packet loss. The maximum throughput a > vanilla tcp could achieve at this loss rate is 56% of the 40kbs > channel bandwidth. The maximum throughput our TCP can achieve at > this loss rate is 49% of the channel bandwidth. > > In theory, the use of dynamic congestion control should allow > one to achieve much better throughput under high loss than is > possible with normal TCP -- performance comparable to, say, > NETBLT over the same lossy link. The reason is that regular TCP > tries to communicate two different things with one number (the > window field in the packet): the amount of buffer the receiver > has available and the delay-bandwidth product of the pipe. Our > congestion control scheme separates these two numbers. The > sender estimates the pipe size so the receiver window need only > describe the receiver's buffer size. As long as the receiver > advertises a sufficiently large buffer, (twice the delay- > bandwidth product) a 1% loss rate would translate into a 1% > throughput effect, not a factor-of-two effect. Of course, we > have yet to put this theory to test.] > > The other part of a congestion avoidance strategy, the endnode > action, is almost identical in Jain's DEC/ISO scheme and our > TCP. This is not an accident (we copied Jain's scheme after > hearing his presentation at the Boston IETF & realizing that the > scheme was, in a sense, universal): The endnode strategy > follows directly from a first-order time-series model of the > network. Say we measure network `load' by average queue length > over fixed intervals of some appropriate length (something near > the rtt). If L(i) is the load at interval i, a network not > subject to congestion can be modeled by saying L(i) changes > slowly compared to the sampling time. I.e., > > L(i) = N > > (the average queue length doesn't change over time). If the > network is subject to congestion, this zero'th order model > breaks down. The average queue length becomes the sum of two > terms, the N above that accounts for the average arrival rate > of new traffic and a new term that accounts for the left-over > traffic from the last time interval: > > L(i) = N + a L(i-1) > > (As pragmatists, we extend the original model just enough to > explain the new behavior. What we're doing here is taking the > first two terms in a taylor series expansion of L(t) when we > find that just the first term is inadequate. There is reason to > believe one would eventually need a three term, second order > model, but not until the Internet has grown to several times its > current size.) > > When the network is congested, the `a' term must be large and > the load (queues) will start increasing exponentially. The only > way to stabilize the system is if the traffic sources throttle > back at least as fast as the queues are growing. Since the way > a source controls load in a window-based protocol is to adjust > the size of the window, W, we end up with the sender policy > > On congestion: W(i) = d W(i-1) (d < 1) > > I.e., a multiplicative decrease of the window size. (This will > turn into an exponential decrease over time if the congestion > persists.) > > If there's no congestion, the `a' term must be near zero and the > network load is about constant. You have to try to increase the > bandwidth you're using to find out the current limit (e.g., you > could have been sharing the path with someone else and converged > to a window that gives you each half the available bandwidth. If > he shuts down, 50% of the bandwidth will get wasted unless you > make some effort to increase your window size.) What should the > increase policy be? > > The first thought is to use a symmetric, multiplicative increase, > possibly with a longer time constant. E.g., W(i) = b W(i-1), > 1 < b <= 1/d. This is a mistake. The result will oscillate > wildly and, on the average, deliver poor throughput. The reason > why is tedious to explain. It has to do with that fact that it > is easy to drive the net into saturation but hard for the net > to recover (what Kleinrock, vol.2, calls the "rush-hour effect"). > Thus overestimating the available bandwidth is costly. But an > exponential, almost regardless of its time constant, increases > so quickly that large overestimates are inevitable. > > Without justification, I'll simply state that the best increase > policy is to make small, constant changes to the window size (if > you've had a control theory course, you've seen the justification): > > On no congestion: W(i) = W(i-1) + u (u << the path delay- > bandwidth product) > > I.e., an additive increase policy. This is exactly the policy that > Jain, et.al., suggest in DEC TR-506 and exactly the policy we've > implemented in TCP. The only difference in our implementations is > the choice of constants for `d' and `u'. We picked .5 and 1 for > reasons that are partially explained in the second of the attached > messages. A more complete analysis is in the second in-progress > paper (and may be discussed at the upcoming IETF meeting). > > All of the preceding has probably made it sound as if the > dynamic congestion algorithm is hairy but it's not. Like > slow-start, it turns out to be three lines of code and one new > connection state variable (the combined slow-start/congestion > control algorithm is described in the second of the attached > msgs). > > > That covers about everything that's been done to date. It's > about 1/3 of what we think clearly needs to be done. The next > big step is to do the gateway `congestion detection' algorithms > so that a signal is sent to the endnodes as early as possible > (but not so early that the gateway ends up starved for traffic). > The way we anticipate doing these algorithms, gateway `self > protection' from a mis-behaving host will fall-out for free (that > host will simply have most of its packets dropped as the gateway > trys to tell it that it's using more than its fair share). Thus, > like the endnode algorithm, the gateway algorithm will improve > things even if no endnode is modified to do dynamic congestion > avoidance. And nodes that do implement congestion avoidance > will get their fair share of bandwidth and a minimum number of > packet drops. > > Since congestion grows exponentially, detecting it early is > important. (If it's detected early, small adjustments to the > senders' windows will cure it. Otherwise massive adjustments > will be necessary to give the net enough spare capacity to pump > out the backlog.) But, given the bursty nature of traffic, > reliable detection is a non-trivial problem. There is a scheme > proposed in DEC TR-506 but we think it might have convergence > problems under high load and/or significant second-order dynamics > in the traffic. We are thinking of using some of our earlier > work on ARMAX models for rtt/queue length prediction as the > basis of the detection. Preliminary results suggest that this > approach works well at high load, is immune to second-order > effects in the traffic and is cheap enough to compute that it > wouldn't slow down thousand-packet-per-second gateways. > > In addition to the gateway algorithms, we want to apply the > endnode algorithms to connectionless traffic (e.g., domain > server queries, RPC requests). We have the rate-based > equivalent of the TCP window algorithm worked out (there should > be a message describing it on the tcp list sometime in the near > future). Sun Microsystems has been very interested, and > supportive, during the development of the TCP congestion control > (I believe Sun OS 4.0 will ship with our new TCP) and Sun has > offered to cooperate in developing the RPC dynamic congestion > control algorithms, using NFS as a test-bed (since NFS is known > to have congestion problems). > > The far future holds some work on controlling second-order, non- > ergodic effects, adaptively determining the rate constants in > the control algorithm, and some other things that are too vague > to mention. > > - Van > From: van at HELIOS.EE.LBL.GOV (Van Jacobson) > Newsgroups: comp.protocols.tcp-ip > Subject: some interim notes on the bsd network speedups > Message-ID: <8807200426.AA01221 at helios.ee.lbl.gov> > Date: 20 Jul 88 04:26:17 GMT > > I told the potential beta-tests for our new 4bsd network code > that I hoped to have a version of the code out by the end of > July. (BTW, we've got all the beta testers we can handle -- > please don't apply.) It looks like that's going to turn into > the end of August, in part because of SIGCOMM and in part > because Sun puts sending source to academic customers on the > bottom of its priority list. I thought I'd flame about the > latter and give a bit of a status report on the new code. > > I've been trying to put together a beta distribution for the > "header prediction" bsd network code. This effort has been > stymied because it's impossible to get current source from Sun. > The code involves major changes to the 4.3bsd kernel. The only > local machines I can use to develop new kernels are Suns -- > everything else is either multi-user or has pathetic ethernet > hardware. But the only Sun kernel source I've got is the doubly > obsolete Sun OS 3.5/4.2 BSD. It would be a massive waste of > time to upgrade this system to 4.3 BSD just so I can develop > the next BSD -- Bill Nowicki did the upgrade a year ago and > binaries of the new system (totally worthless to me) are > shipping as Sun OS 4.0. [I'm not the only one suffering this > problem -- I understand Craig Partridge's multicast work is > suffering because he can't get 4.3-compatible Sun source. I > think he gave up & decided to do all his development on 4.3bsd > Vaxen. And I think I heard Chuck Hedrick say 4.0 has all the > rlogin, URG and nameserver bugs that we fondly remember fixing > in 3.x. And he has to get source before the academic year > starts or they won't be able to switch until a semester break. > And Mike Karels is saying "I told you so" and suggesting I buy > some CCIs. Pity that Sun can't figure out that it's in their > best interest to do TIMELY source distribution to the academic > and research community -- their software development base gets > expanded a few hundred-fold for the cost of making tapes.] > > Anyway, now that I've vented my spleen, there are some interim > results to talk about. While waiting for either useful source > or a better hardware platform to show up, I've been cleaning up > my original mods and backing out changes one and two at a time > to gauge their individual effect. Because network performance > seems to rest on getting a lot of things happening in parallel, > this leave-one-out testing doesn't give simple good/bad answers > (I had one test case that went like > > Basic system: 600 KB/s > add feature A: 520 KB/s > drop A, add B: 530 KB/s > add both A & B: 700 KB/s > > Obviously, any statement of the form "feature A/B is good/bad" > is bogus.) But, in spite of the ambiguity, some of the network > design folklore I've heard seems to be clearly wrong. > > In the process of cleaning things up, they slowed down. Task- > to-task data throughput using TCP between two Sun 3/60s dropped > from 1 MB/s (about 8.6 Mb/s on the wire) to 890 KB/s (about 7.6 > Mb/s on the wire). I know where the 11% was lost (an > interaction between "sosend" and the fact that an AMD LANCE chip > requires at least 100 bytes in the first chunk of data if you > ask it to chain -- massive braindamage on AMD's part) and how to > get it back (re-do the way user data gets handed to the > transport protocol) but need to talk with Mike Karels about the > "right" way to do this. > > Still, 890 KB/s represents a non-trivial improvement over the > stock Sun/4bsd system: Under identical test conditions (same > socket & user buffer sizes, same MSS and MTU, same machines), > the best tcp throughput I could get with an out-the-box Sun OS > 3.5 was 380 KB/s. I wish I could say "make these two simple > changes and your throughput will double" but I can't. There > were lots and lots of fiddley little changes, none made a huge > difference and they all seemed to be necessary. > > The biggest single effect was a change to sosend (the routine > between the user "write" syscall and tcp_output). Its loop > looked something like: > > while there is user data & space in the socket buffer > copy from user space to socket > call the protocol "send" routine > > After hooking a scope to our ethernet cable & looking at the > packet spacings, I changed this to > > while there is user data & space in the socket buffer > copy up to 1K (one cluster's worth) from user space to socket > call the protocol "send" routine > > and the throughput jumped from 380 to 456 KB/s (+20%). There's > one school of thought that says the first loop was better > because it minimized the "boundary crossings", the fixed costs > of routine calls and context changes. This same school is > always lobbying for "bigger": bigger packets, bigger windows, > bigger buffers, for essentially the same reason: the bigger > chunks are, the fewer boundary crossings you pay for. The > correct school, mine :-), says there's always a fixed cost and a > variable cost (e.g., the cost of maintaining tcp state and > tacking a tcp packet header on the front of some data is > independent of the amount of data; the cost of filling in the > checksum field in that header scales linearly with the amount of > data). If the size is large enough to make the fixed cost small > compared to the variable cost, making things bigger LOWERS > throughput because you throw away opportunities for parallelism. > Looking at the ethernet, I saw a burst of packets, a long dead > time, another burst of packets, ... . It was clear that while > we were copying 4 KB from the user, the processor in the LANCE > chip and tcp_input on the destination machine were mostly > sitting idle. > > To get good network performance, where there are guaranteed to > be many processors that could be doing things in parallel, you > want the "units of work" (loop sizes, packet sizes, etc.) to be > the SMALLEST values that amortise the fixed cost. In Berkeley > Unix, the fixed costs of protocol processing are pretty low and > sizes of 1 - 2 KB on a 68020 are as large as you'd want to get. > (This is easy to determine. Just do a throughput vs. size test > and look for the knee in the graph. Best performance is just to > the right of the knee.) And, obviously, on a faster machine > you'd probably want to do things in even smaller units (if the > overhead stays the same -- Crays are fast but hardware > strangeness drives the fixed costs way up. Just remember that > if it takes large packets and large buffers to get good > performance on a fast machine, that's because it's broken, not > because it's fast.) > > Another big effect (difficult to quantify because several > changes had to be made at once) was to remove lots of > more-or-less hidden data copies from the protocol processing. > It's a truism that you never copy data in network code. Just > lay the data down in a buffer & pass around pointers to > appropriate places in that buffer. But there are lots of places > where you need to get from a pointer into the buffer back to a > pointer to the buffer itself (e.g., you've got a pointer to the > packet's IP header, there's a header checksum error, and you > want to free the buffer holding the packet). The routine "dtom" > converts a data pointer back to a buffer pointer but it only > works for small things that fit in a single mbuf (the basic > storage allocation unit in the bsd network code). Incoming > packets are never in an mbuf; they're in a "cluster" which the > mbuf points to. There's no way to go from a pointer into a > cluster to a pointer to the mbuf. So, anywhere you might need > to do a dtom (anywhere there's a detectable error), there had to > be a call to "m_pullup" to make sure the dtom would work. > (M_pullup works by allocating a fresh mbuf, copying a bunch of > data from the cluster to this mbuf, then chaining the original > cluster behind the new mbuf.) > > So, we were doing a bunch of copying to expedite error handling. > But errors usually don't happen (if you're worried about > performance, first you make sure there are very, very few > errors), so we were doing a bunch of copying for nothing. But, > if you're sufficiently insomniac, in about a week you can track > down all the pullup's associated with all the dtom's and change > things to avoid both. This requires massive recoding of both > the TCP and IP re-assembly code. But it was worth it: TCP > throughput climbed from around 600 KB/s to 750 KB/s and IP > forwarding just screamed: A 3/60 forwarding packets at the 9 > Mb/s effective ethernet bandwidth used less than 50% of the CPU. > > [BTW, in general I didn't find anything wrong with the BSD > mbuf/cluster model. In fact, I tried some other models (e.g., > do everything in packet sized chunks) and they were slower. > There are many cases where knowing that you can grab an mbuf and > chain it onto a chunk of data really simplifies the protocol > code (simplicity == speed). And the level of indirection and > fast, reference counted allocation of clusters can really be a > win on data transfers (a la kudp_fastsend in Sun OS). The > biggest problem I saw, other than the m_pullup's, was that > clusters are too small: They need to be at least big enough for > an ethernet packet (2K) and making them page sized (8K on a Sun) > doesn't hurt and would let you do some quick page swap tricks in > the user-system data copies (I didn't do any of these tricks in > the fast TCP. I did use 2KB clusters to optimize things for the > ethernet drivers).] > > An interesting result of the m_pullup removals was the death of > another piece of folklore. I'd always heard that the limiting > CPU was the receiver. Wrong. After the pullup changes, the > sender would be maxed out at 100% CPU utilization while the > receiver loafed along at 65-70% utilization (utilizations > measured with a microprocessor analyzer; I don't trust the > system's stats). In hindsight, this seems reasonable. At the > receiver, a packet comes in, wanders up to the tcp layer, gets > stuck in the socket buffer and an ack is generated (i.e., the > processing terminates with tcp_input at the socket buffer). At > the sender, the ack comes in, wanders up to the tcp layer, frees > some space, then the higher level socket process has to be woken > up to fill that space (i.e., the processing terminates with > sosend, at the user socket layer). The way Unix works, this > forces a boundary crossing between the bottom half (interrupt > service) and top half (process context) of the kernel. On a > Sun, and most of the other Unix boxes I know of, this is an > expensive crossing. [Of course, the user process on the > receiver side has to eventually wake up and empty the socket > buffer but it gets to do that asynchronously and the dynamics > tend to arrange themselves so it processes several packets on > each wakeup, minimizing the boundary crossings.] > > Talking about the bottom half of the kernel reminds me of > another major effect. There seemed to be a "sound barrier" at > 550 KB/s. No matter what I did, the performance stuck at 550 > KB/s. Finally, I noticed that Sun's LANCE ethernet driver, > if_le.c, would only queue one packet to the LANCE at a time. > Picture what this means: (1) driver hands packet to LANCE, (2) > LANCE puts packet on wire, (3) end of packet, LANCE interrupts > processor, (4) interrupt dispatched to driver, (5) go back to > (1). The time involved in (4) is non-trivial, more than 150us, > and represents a lot of idle time for the LANCE and the wire. > So, I rewrote the driver to queue an arbitrary number of packets > to the LANCE, the sound barrier disappeared, and other changes > started making the throughput climb (it's not clear whether this > change had any effect on throughput or just allowed other > changes to have an effect). > > [Shortly after making the if_le change, I learned why Sun might > have written the driver the silly way they did: Before the > change, the 6 back-to-back IP fragments of an NFS write would > each be separated by the 150us interrupt service time. After > the change, they were really back-to-back, separated by only the > 9.6us minimum ethernet spacing (and, no, Sun does NOT violate > the ethernet spec in any way, shape or form. After my last > message on this stuff, Apollo & DEC people kept telling me Sun > was out of spec. I've been watching packets on our ethernet for > so long, I'm starting to learn the middle name of every bit. > Sun bits look like DEC bits and Sun, or rather, the LANCE in the > 3/50 & 3/60, completely complys with the timings in the blue > book.) Anyway, the brain-dead Intel 82586 ethernet chip Sun > puts in all its 3/180, 3/280 and Sun-4 file servers can't hack > back-to-back, minimum spacing packets. Every now and again it > drops some of the frags and wanders off to never-never land > ("iebark reset"). Diskless workstations don't work well when > they can't write to their file server and, until I hit on the > idea of inserting "DELAY" macros in kudp_fastsend, it looked > like I could have a fast TCP or a functional workstation but not > both.] > > Probably 30% of the performance improvements came from fixing > things in the Sun kernel. I mean like, really, guys: If the > current task doesn't change, and it doesn't 80% of the time > swtch is called, there's no need to do a full context save and > restore. Adding the two lines > > cmpl _masterprocp,a0 > jeq 6f ? restore of current proc is easy > > just before the call to "resume" in sun3/vax.s:swtch got me a > quick 70 KB/s performance increase but felt more like a bug fix > than progress. And a kernel hacker that does 16-bit "movw"s and > "addw"s on a 68020, or writes 2 instruction dbra loops designed > to put a 68010 in loop mode, should be shot. The alu takes the > same 3 clocks for a 2 byte add or a 4 byte add so things will > finish a lot quicker if you give it 4 bytes at a time. And > every branch stalls the pipe, so unrolling that loop to cut down > on branches is a BIG win. > > Anyway, I recoded the checksum routine, ocsum.s (now oc_cksum.s > because I found the old calling sequence wasn't the best way to > do things) and its caller, in_cksum.c, and got the checksumming > time down from 490us/KB to 130us/KB. Unrolling the move loop in > copyin/copyout (the routines that move data user to kernel and > kernel to user), got them down from 200us/KB to 140us/KB. (BTW, > if you combine the move with the checksum, which I did in the > kludged up, fast code that ran 1 MB/s on a 15MHz 3/50, it costs > 200us/KB, not the 300us/KB you'd expect from adding the move > and sum times. Pipelined processors are strange and wonderful > beasts.) > > From these times, you can work out most of the budgets and > processing details: I was using 1408 data byte packets (don't > ask why 1408). It takes 193us to copy a packet's worth of data > and 184us to checksum the packet and its 40 byte header. From > the logic analyzer, the LANCE uses 300us of bus and memory > bandwidth reading or writing the packet (I don't understand why, > it should only take half this). So, the variable costs are > around 700us per packet. When you add the 18 byte ethernet > header and 12 byte interpacket gap, to run at 10 Mb/s I'd have > to supply a new packet every 1200us. I.e., there's a budget of > 500us for all the fixed costs (protocol processing, interrupt > dispatch, device setup, etc.). This is do-able (I've done it, > but not very cleanly) but what I'm getting today is a packet > every 1500us. I.e., 800us per packet fixed costs. When I look > with our analyzer, 30% of this is TCP, IP, ARP and ethernet > protocol processing (this was 45% before the "header prediction" > tcp mods), 15% is stalls (idle time that I don't currently > understand but should be able to eventually get rid of) and 55% > is device setup, interrupt service and task handling. I.e., > protocol processing is negligible (240us per packet on this > processor and this isn't a fast processor in today's terms). To > make the network go faster, it seems we just need to fix the > operating system parts we've always needed to fix: I/O service, > interrupts, task switching and scheduling. Gee, what a surprise. > > - Van > > > BBoard-ID: 7621 > BB-Posted: Tue, 25 Oct 88 2:06:08 EDT > To: tcp-ip at sri-nic.ARPA > Subject: 4BSD TCP Ethernet Throughput > Date: Mon, 24 Oct 88 13:33:13 PDT > From: Van Jacobson > > Many people have asked for the Ethernet throughput data I > showed at Interop so it's probably easier to post it: > > These are some throughput results for an experimental version of > the 4BSD (Berkeley Unix) network code running on a couple of > different MC68020-based systems: Sun 3/60s (20MHz 68020 with AMD > LANCE Ethernet chip) and Sun 3/280s (25MHz 68020 with Intel > 82586 Ethernet chip) [note again the tests were done with Sun > hardware but not Sun software -- I'm running 4.?BSD, not Sun > OS]. There are lots and lots of interesting things in the data > but the one thing that seems to have attracted people's > attention is the big difference in performance between the two > Ethernet chips. > > The test measured task-to-task data throughput over a TCP > connection from a source (e.g., chargen) to a sink (e.g., > discard). The tests were done between 2am and 6am on a fairly > quiet Ethernet (~100Kb/s average background traffic). The > packets were all maximum size (1538 bytes on the wire or 1460 > bytes of user data per packet). The free parameters for the > tests were the sender and receiver socket buffer sizes (which > control the amount of 'pipelining' possible between the sender, > wire and receiver). Each buffer size was independently varied > from 1 to 17 packets in 1 packet steps. Four tests were done at > each of the 289 combinations. Each test transferred 8MB of data > then recorded the total time for the transfer and the send and > receive socket buffer sizes (8MB was chosen so that the worst > case error due to the system clock resolution was ~.1% -- 10ms > in 10sec). The 1,156 tests per machine pair were done in random > order to prevent any effects from fixed patterns of resource > allocation. > > In general, the maximum throughput was observed when the sender > buffer equaled the receiver buffer (the reason why is complicated > but has to do with collisions). The following table gives the > task-to-task data throughput (in KBytes/sec) and throughput on > the wire (in MBits/sec) for (a) a 3/60 sending to a 3/60 and > (b) a 3/280 sending to a 3/60. > > _________________________________________________ > | 3/60 to 3/60 | 3/280 to 3/60 | > | (LANCE to LANCE) | (Intel to LANCE) | > | socket | | > | buffer task to | task to | > | size task wire | task wire | > |(packets) (KB/s) (Mb/s) | (KB/s) (Mb/s) | > | 1 384 3.4 | 337 3.0 | > | 2 606 5.4 | 575 5.1 | > | 3 690 6.1 | 595 5.3 | > | 4 784 6.9 | 709 6.3 | > | 5 866 7.7 | 712 6.3 | > | 6 904 8.0 | 708 6.3 | > | 7 946 8.4 | 710 6.3 | > | 8 954 8.4 | 718 6.4 | > | 9 974 8.6 | 715 6.3 | > | 10 983 8.7 | 712 6.3 | > | 11 995 8.8 | 714 6.3 | > | 12 1001 8.9 | 715 6.3 | > |_____________________________|__________________| > > The theoretical maximum data throughput, after you take into > account all the protocol overheads, is 1,104 KB/s (this > task-to-task data rate would put 10Mb/s on the wire). You can > see that the 3/60s get 91% of the the theoretical max. The > 3/280, although a much faster processor (the CPU performance is > really dominated by the speed of the memory system, not the > processor clock rate, and the memory system in the 3/280 is > almost twice the speed of the 3/60), gets only 65% of > theoretical max. > > The low throughput of the 3/280 seems to be entirely due to the > Intel Ethernet chip: at around 6Mb/s, it saturates. (I put the > board on an extender and watched the bus handshake lines on the > 82586 to see if the chip or the Sun interface logic was pooping > out. It was the chip -- it just stopped asking for data. (The > CPU was loafing along with at least 35% idle time during all > these tests so it wasn't the limit). > > [Just so you don't get confused: Stuff above was measurements. > Stuff below includes opinions and interpretation and should > be viewed with appropriate suspicion.] > > If you graph the above, you'll see a large notch in the Intel > data at 3 packets. This is probably a clue to why it's dying: > TCP delivers one ack for every two data packets. At a buffer > size of three packets, the collision rate increases dramatically > since the sender's third packet will collide with the receiver's > ack for the previous two packets (for buffer sizes of 1 and 2, > there are effectively no collisions). My suspicion is that the > Intel is taking a long time to recover from collisions (remember > that you're 64 bytes into the packet when you find out you've > collided so the chip bus logic has to back up 64 bytes -- Intel > spent their silicon making the chip "programmable", I doubt they > invested as much as AMD in the bus interface). This may or may > not be what's going on: life is too short to spend debugging > Intel parts so I really don't care to investigate further. > > The one annoyance in all this is that Sun puts the fast Ethernet > chip (the AMD LANCE) in their slow machines (3/50s and 3/60s) > and the slow Ethernet chip (Intel 82586) in their fast machines > (3/180s, 3/280s and Sun-4s, i.e., all their file servers). > [I've had to put delay loops in the Ethernet driver on the 3/50s > and 3/60s to slow them down enough for the 3/280 server to keep > up.] Sun's not to blame for anything here: It costs a lot > to design a new Ethernet interface; they had a design for the > 3/180 board set (which was the basis of all the other VME > machines--the [34]/280 and [34]/110); and no market pressure to > change it. If they hadn't ventured out in a new direction with > the 3/[56]0 -- the LANCE -- I probably would have thought > 700KB/s was great Ethernet throughput (at least until I saw > Dave Boggs' DEC-Titan/Seeq-chip throughput data). > > But I think Sun is overdue in offering a high-performance VME > Ethernet interface. That may change though -- VME controllers > like the Interphase 4207 Eagle are starting to appear which > should either put pressure on Sun and/or offer a high > performance 3rd party alternative (I haven't actually tried an > Eagle yet but from the documentation it looks like they did a > lot of things right). I'd sure like to take the delay loops out > of my LANCE driver... > > - Van > > ps: I have data for Intel-to-Intel and LANCE-to-Intel as well as > the Intel-to-LANCE I listed above. Using an Intel chip on the > receiver, the results are MUCH worse -- 420KB/s max. I chose > the data that put the 82586 in its very best light. > > I also have scope pictures taken at the transceivers during all > these tests. I'm sure there'll be a chorus of "so-and-so violates > the Ethernet spec" but that's a lie -- NONE OF THESE CHIPS OR > SYSTEMS VIOLATED THE ETHERNET SPEC IN ANY WAY, SHAPE OR FORM. > I looked very carefully for violations and have the pictures to > prove there were none. > > Finally, all of the above is Copyright (c) 1988 by Van Jacobson. > If you want to reproduce any part of it in print, you damn well > better ask me first -- I'm getting tired of being misquoted in > trade rags. > > From van at helios.ee.lbl.gov Mon Apr 30 01:44:05 1990 > To: end2end-interest at ISI.EDU > Subject: modified TCP congestion avoidance algorithm > Date: Mon, 30 Apr 90 01:40:59 PDT > From: Van Jacobson > Status: RO > > This is a description of the modified TCP congestion avoidance > algorithm that I promised at the teleconference. > > BTW, on re-reading, I noticed there were several errors in > Lixia's note besides the problem I noted at the teleconference. > I don't know whether that's because I mis-communicated the > algorithm at dinner (as I recall, I'd had some wine) or because > she's convinced that TCP is ultimately irrelevant :). Either > way, you will probably be disappointed if you experiment with > what's in that note. > > First, I should point out once again that there are two > completely independent window adjustment algorithms running in > the sender: Slow-start is run when the pipe is empty (i.e., > when first starting or re-starting after a timeout). Its goal > is to get the "ack clock" started so packets will be metered > into the network at a reasonable rate. The other algorithm, > congestion avoidance, is run any time *but* when (re-)starting > and is responsible for estimating the (dynamically varying) > pipesize. You will cause yourself, or me, no end of confusion > if you lump these separate algorithms (as Lixia's message did). > > The modifications described here are only to the congestion > avoidance algorithm, not to slow-start, and they are intended to > apply to large bandwidth-delay product paths (though they don't > do any harm on other paths). Remember that with regular TCP (or > with slow-start/c-a TCP), throughput really starts to go to hell > when the probability of packet loss is on the order of the > bandwidth-delay product. E.g., you might expect a 1% packet > loss rate to translate into a 1% lower throughput but for, say, > a TCP connection with a 100 packet b-d p. (= window), it results > in a 50-75% throughput loss. To make TCP effective on fat > pipes, it would be nice if throughput degraded only as function > of loss probability rather than as the product of the loss > probabilty and the b-d p. (Assuming, of course, that we can do > this without sacrificing congestion avoidance.) > > These mods do two things: (1) prevent the pipe from going empty > after a loss (if the pipe doesn't go empty, you won't have to > waste round-trip times re-filling it) and (2) correctly account > for the amount of data actually in the pipe (since that's what > congestion avoidance is supposed to be estimating and adapting to). > > For (1), remember that we use a packet loss as a signal that the > pipe is overfull (congested) and that packet loss can be > detected one of two different ways: (a) via a retransmit > timeout or (b) when some small number (3-4) of consecutive > duplicate acks has been received (the "fast retransmit" > algorithm). In case (a), the pipe is guaranteed to be empty so > we must slow-start. In case (b), if the duplicate ack > threshhold is small compared to the bandwidth-delay product, we > will detect the loss with the pipe almost full. I.e., given a > threshhold of 3 packets and an LBL-MIT bandwidth-delay of around > 24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75% > full when fast-retransmit detects a loss (actually, until > gateways start doing some sort of congestion control, the pipe > is overfull when the loss is detected so *at least* 75% of the > packets needed for ack clocking are in transit when > fast-retransmit happens). Since the pipe is full, there's no > need to slow-start after a fast-retransmit. > > For (2), consider what a duplicate ack means: either the > network duplicated a packet (i.e., the NSFNet braindead IBM > token ring adapters) or the receiver got an out-of-order packet. > The usual cause of out-of-order packets at the receiver is a > missing packet. I.e., if there are W packets in transit and one > is dropped, the receiver will get W-1 out-of-order and > (4.3-tahoe TCP will) generate W-1 duplicate acks. If the > `consecutive duplicates' threshhold is set high enough, we can > reasonably assume that duplicate acks mean dropped packets. > > But there's more information in the ack: The receiver can only > generate one in response to a packet arrival. I.e., a duplicate > ack means that a packet has left the network (it is now cached > at the receiver). If the sender is limitted by the congestion > window, a packet can now be sent. (The congestion window is a > count of how many packets will fit in the pipe. The ack says a > packet has left the pipe so a new one can be added to take its > place.) To put this another way, say the current congestion > window is C (i.e, C packets will fit in the pipe) and D > duplicate acks have been received. Then only C-D packets are > actually in the pipe and the sender wants to use a window of C+D > packets to fill the pipe to its estimated capacity (C+D sent - > D received = C in pipe). > > So, conceptually, the slow-start/cong.avoid/fast-rexmit changes > are: > > - The sender's input routine is changed to set `cwnd' to `ssthresh' > when the dup ack threshhold is reached. [It used to set cwnd to > mss to force a slow-start.] Everything else stays the same. > > - The sender's output routine is changed to use an effective window > of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition > of the `dupacks*mss' term.] `Dupacks' is zero until the rexmit > threshhold is reached and zero except when receiving a sequence > of duplicate acks. > > The actual implementation is slightly different than the above > because I wanted to avoid the multiply in the output routine > (multiplies are expensive on some risc machines). A diff of the > old and new fastrexmit code is attached (your line numbers will > vary). > > Note that we still do congestion avoidance (i.e., the window is > reduced by 50% when we detect the packet loss). But, as long as > the receiver's offered window is large enough (it needs to be at > most twice the bandwidth-delay product), we continue sending > packets (at exactly half the rate we were sending before the > loss) even after the loss is detected so the pipe stays full at > exactly the level we want and a slow-start isn't necessary. > > Some algebra might make this last clear: Say U is the sequence > number of the first un-acked packet and we are using a window > size of W when packet U is dropped. Packets [U..U+W) are in > transit. When the loss is detected, we send packet U and pull > the window back to W/2. But in the round-trip time it takes > the U retransmit to fill the receiver's hole and an ack to get > back, W-1 dup acks will arrive (one for each packet in transit). > The window is effectively inflated by one packet for each of > these acks so packets [U..U+W/2+W-1) are sent. But we don't > re-send packets unless we know they've been lost so the amount > actually sent between the loss detection and the recovery ack is > U+W/2+W-1 - U+W = W/2-1 which is exactly the amount congestion > avoidance allows us to send (if we add in the rexmit of U). The > recovery ack is for packet U+W so when the effective window is > pulled back from W/2+W-1 to W/2 (which happens because the > recovery ack is `new' and sets dupack to zero), we are allowed > to send up to packet U+W+W/2 which is exactly the first packet > we haven't yet sent. (I.e., there is no sudden burst of packets > as the `hole' is filled.) Also, when sending packets between > the loss detection and the recovery ack, we do nothing for the > first W/2 dup acks (because they only allow us to send packets > we've already sent) and the bottleneck gateway is given W/2 > packet times to clean out its backlog. Thus when we start > sending our W/2-1 new packets, the bottleneck queue is as empty > as it can be. > > [I don't know if you can get the flavor of what happens from > this description -- it's hard to see without a picture. But I > was delighted by how beautifully it worked -- it was like > watching the innards of an engine when all the separate motions > of crank, pistons and valves suddenly fit together and > everything appears in exactly the right place at just the right > time.] > > Also note that this algorithm interoperates with old tcp's: Most > pre-tahoe tcp's don't generate the dup acks on out-of-order packets. > If we don't get the dup acks, fast retransmit never fires and the > window is never inflated so everything happens in the old way (via > timeouts). Everything works just as it did without the new algorithm > (and just as slow). > > If you want to simulate this, the intended environment is: > > - large bandwidth-delay product (say 20 or more packets) > > - receiver advertising window of two b-d p (or, equivalently, > advertised window of the unloaded b-d p but two or more > connections simultaneously sharing the path). > > - average loss rate (from congestion or other source) less than > one lost packet per round-trip-time per active connection. > (The algorithm works at higher loss rate but the TCP selective > ack option has to be implemented otherwise the pipe will go empty > waiting to fill the second hole and throughput will once again > degrade at the product of the loss rate and b-d p. With selective > ack, throughput is insensitive to b-d p at any loss rate.) > > And, of course, we should always remember that good engineering > practise suggests a b-d p worth of buffer at each bottleneck -- > less buffer and your simulation will exhibit the interesting > pathologies of a poorly engineered network but will probably > tell you little about the workings of the algorithm (unless the > algorithm misbehaves badly under these conditions but my > simulations and measurements say that it doesn't). In these > days of $100/megabyte memory, I dearly hope that this particular > example of bad engineering is of historical interest only. > > - Van > > [...code diffs deleted...] > Received: from rx7.ee.lbl.gov by uu2.psi.com (5.65b/4.0.071791-PSI/PSINet) via SMTP; > id AA12583 for popbbn; Wed, 8 Sep 93 01:29:46 -0400 > Received: by rx7.ee.lbl.gov for craig at aland.bbn.com (5.65/1.44r) > id AA05271; Tue, 7 Sep 93 22:30:15 -0700 > Message-Id: <9309080530.AA05271 at rx7.ee.lbl.gov> > To: Craig Partridge > Cc: David Clark > Subject: Re: query about TCP header on tcp-ip > In-Reply-To: Your message of Tue, 07 Sep 93 09:48:00 PDT. > Date: Tue, 07 Sep 93 22:30:14 PDT > From: Van Jacobson > > Craig, > > As you probably remember from the "High Speed TCP" CNRI meeting, > my kernel looks nothing at all like any version of BSD. Mbufs > no longer exist, for example, and `netipl' and all the protocol > processing that used to be done at netipl interrupt level are > gone. TCP receive packet processing in the new kernel really is > about 30 instructions on a RISC (33 on a sparc but three of > those are compiler braindamage). Attached is the C code & the > associated sparc assembler. > > A brief recap of the architecture: Packets go in 'pbufs' which > are, in general, the property of a particular device. There is > exactly one, contiguous, packet per pbuf (none of that mbuf > chain stupidity). On the packet input interrupt, the device > driver upcalls through the protocol stack (no more of that queue > packet on the netipl software interrupt bs). The upcalls > percolate up the stack until the packet is completely serviced > (e.g., NFS requests that can be satisfied from in-memory data & > data structures) or they reach a routine that knows the rest of > the processing must be done in some process's context in which > case the packet is laid on some appropriate queue and the > process is unblocked. In the case of TCP, the upcalls almost > always go two levels: IP finds the datagram is for this host & > it upcalls a TCP demuxer which hashes the ports + SYN to find a > PCB, lays the packet on the tail of the PCB's queue and wakes up > any process sleeping on the PCB. The IP processing is about 25 > instructions & the demuxer is about 10. > > As Dave noted, the two processing paths that need the most > tuning are the data packet send & receive (since at most every > other packet is acked, there will be at least twice as many data > packets as ack packets). In the new system, the receiving > process calls 'tcp_usrrecv' (the protocol specific part of the > 'recv' syscall) or is already blocked there waiting for new > data. So the following code is embedded in a loop at the start of > tcp_usrrecv that spins taking packets off the pcb queue until > there's no room for the next packet in the user buffer or the > queue is empty. The TCP protocol processing is done as we > remove packets from the queue & copy their data to user space > (and since we're in process context, it's possible to do a > checksum-and-copy). > > Throughout this code, 'tp' points to the pcb and 'ti' points to > the tcp header of the first packet on the queue (the ip header was > stripped as part of interrupt level ip processing). The header info > (excluding the ports which are implicit in the pcb) are sucked out > of the packet into registers [this is to minimize cache thrashing and > possibly to take advantage of 64 bit or longer loads]. Then the > header checksum is computed (tp->ph_sum is the precomputed pseudo-header > checksum + src & dst ports). > > int tcp_usrrecv(struct uio* uio, struct socket* so) > { > struct tcpcb *tp = (struct tcpcb *)so->so_pcb; > register struct pbuf* pb; > > while ((pb = tp->tp_inq) != 0) { > register int len = pb->len; > struct tcphdr *ti = (struct tcphdr *)pb->dat; > > u_long seq = ((u_long*)ti)[1]; > u_long ack = ((u_long*)ti)[2]; > u_long flg = ((u_long*)ti)[3]; > u_long sum = ((u_long*)ti)[4]; > u_long cksum = tp->ph_sum; > > /* NB - ocadd is an inline gcc assembler function */ > cksum = ocadd(ocadd(ocadd(ocadd(cksum, seq), ack), flg), sum); > > Next is the header prediction check which is probably the most > opaque part of the code. tp->pred_flags contains snd_wnd (the > window we expect in incoming packets) in the bottom 16 bits and > 0x4x10 in the top 16 bits. The 'x' is normally 0 but will be > set non-zero if header prediction shouldn't be done (e.g., if > not in established state, if retransmitting, if hole in seq > space, etc.). So, the first term of the 'if' checks four > different things simultaneously: > - that the window is what we expect > - that there are no tcp options > - that the packet has ACK set & doesn't have SYN, FIN, RST or URG set > - that the connection is in the right state > and the 2nd term of the if checks that the packet is in sequence: > > #define FMASK (((0xf000 | TH_SYN|TH_FIN|TH_RST|TH_URG|TH_ACK) << 16) | 0xffff) > > if ((flg & FMASK) == tp->pred_flags && seq == tp->rcv_nxt) { > > The next few lines are pretty obvious -- we subtract the header > length from the total length and if it's less than zero the packet > was malformed, if it's zero we must have a pure ack packet & we > do the ack stuff otherwise if the ack field didn't move we have > a pure data packet which we copy to the user's buffer, checksumming > as we go, then update the pcb state if everything checks: > > len -= 20; > if (len <= 0) { > if (len < 0) { > /* packet malformed */ > } else { > /* do pure ack things */ > } > } else if (ack == tp->snd_una) { > cksum = in_uiomove((u_char*)ti + 20, len, uio, cksum); > if (cksum != 0) { > /* packet or user buffer errors */ > } > seq += len; > tp->rcv_nxt = seq; > if ((int)(seq - tp->rcv_acked) >= 0) { > /* send ack */ > } else { > /* free pbuf */ > } > continue; > } > } > /* header prediction failed -- take long path */ > ... > > That's it. On the normal receive data path we execute 16 lines of > C which turn into 33 instructions on a sparc (it would be 30 if I > could trick gcc into generating double word loads for the header > & my carefully aligned pcb fields). I think you could get it down > around 25 on a cray or big-endian alpha since the loads, checksum calc > and most of the compares can be done on 64 bit quantities (e.g., > you can combine the seq & ack tests into one). > > Attached is the sparc assembler for the above receive path. Hope > this explains Dave's '30 instruction' assertion. Feel free to > forward this to tcp-ip or anyone that might be interested. > > - Van > > ---------------- > ld [%i0+4],%l3 ! load packet tcp header fields > ld [%i0+8],%l4 > ld [%i0+12],%l2 > ld [%i0+16],%l0 > > ld [%i1+72],%o0 ! compute header checksum > addcc %l3,%o0,%o3 > addxcc %l4,%o3,%o3 > addxcc %l2,%o3,%o3 > addxcc %l0,%o3,%o3 > > sethi %hi(268369920),%o1 ! check if hdr. pred possible > andn %l2,%o1,%o1 > ld [%i1+60],%o2 > cmp %o1,%o2 > bne L1 > ld [%i1+68],%o0 > cmp %l3,%o0 > bne L1 > addcc %i2,-20,%i2 > bne,a L3 > ld [%i1+36],%o0 > ! packet error or ack processing > ... > L3: > cmp %l4,%o0 > bne L1 > add %i0,20,%o0 > mov %i2,%o1 > call _in_uiomove,0 > mov %i3,%o2 > cmp %o0,0 > be L6 > add %l3,%i2,%l3 > ! checksum error or user buffer error > ... > L6: > ld [%i1+96],%o0 > subcc %l3,%o0,%g0 > bneg L7 > st %l3,[%i1+68] > ! send ack > ... > br L8 > L7: > ! free pbuf > ... > L8: ! done with this packet - continue > ... > > L1: ! hdr pred. failed - do it the hard way From jack at 3kitty.org Mon May 19 13:45:45 2014 From: jack at 3kitty.org (Jack Haverty) Date: Mon, 19 May 2014 13:45:45 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <537A50D2.3010807@meritmail.isi.edu> References: <537A50D2.3010807@meritmail.isi.edu> Message-ID: Hi Bob, That sounds about right. IIRC, there were a lot of TCP implementations in various stages of progress, as well as in various stages of protocol genealogy - 2.5, 3, 4, and many could communicate with themselves or selected others prior to January 1979. Jon's "bakeoff" on the Saturday preceding the January 1979 TCP Meeting at ISI was the first time a methodical test was done to evaluate the NxN interoperability of a diverse collection of implementations. I remember that you were one of the six implementations in that test session. We each had been given an office at ISI for the day and kept at it until everyone could establish a connection with everyone else and pass data. There were a lot of issues resolved that day, mostly having to do with ambiguities in the then-current spec we had all been coding to meet. As we all finally agreed (or our code agreed) on all the details, Jon tweaked the spec to reflect what the collected software was now doing. So I've always thought that those six implementations were the first TCP4 implementations to successfully interoperate. Yours was one of them. There was a lot of pressure at the time to get the spec of TCP4 nailed down and published, and that test session was part of the process. Subsequently that TCP4 spec became an RFC, and a DoD Standard, and The Internet started to grow, and the rest is history.... I wonder if Dave Clark ever forgave Bill Plummer for crashing the Multics TCP by innocently asking Dave to temporarily disable his checksumming code....and then sending a kamikaze packet from Tenex. /Jack On Mon, May 19, 2014 at 11:43 AM, Bob Braden wrote: > > Jack, > > You wrote: > > I wrote a TCP back in the 1979 timeframe - the first one for a Unix > system, running on a PDP-11/40. It first implemented TCP version > 2.5, and later evolved to version 4. It was a very basic > implementation, no "slow start" or any other such niceties that were > created as the Internet grew. > > I have been trying to recall where my TCP/IP for UCLA's IBM 360/91 ran in > this horse race. The best I can tell from IEN 70 and IEN 77 is that my > TCP-4 version made it between Dec 1978 and Jan 1979, although I think I had > an initial TP-2.5 version talkng to itself in mid 1978. > > Bob Braden > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Tue May 20 03:48:14 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Tue, 20 May 2014 12:48:14 +0200 Subject: [ih] One clarification: Re: When was Go Back N adopted by TCP In-Reply-To: References: <20140518140340.037C328E137@aland.bbn.com> <537A1A61.8070107@web.de> <537A1CFB.1020206@web.de> Message-ID: <537B32EE.7090403@web.de> Am 19.05.2014 18:59, schrieb Vint Cerf: > one RTO per connection makes sense: Sorry, I think you got me wrong there. One _value_ for the retransmission time out makes sense, that goes without discussion. The very point is that RFC 793 (and some off list discussions with Jon and Dave) made clear, that a TCP implementations have one _timer_ per packet. Both, Jon and David, pointed me to the concept of timer wheels, as the number of timers may grow large here. > calculate and monitor the RTT for the connection and use that value to > timeout and retransmit the oldest, unacknowledged packet. This is NOT > GBN. It makes no sense to do RTO per packet calculation especially if > the packet had to be retransmitted since you then get into double > delay affecting the RTT computation.. The interesting point is, that my understanding of RFC 2988 is to maintain only one retransmission timer, When I refer to section 5, it says: > > 5 Managing the RTO > Timer > > > > An implementation MUST manage the retransmission timer(s) in such a > way that a segment is never retransmitted too early, i.e. less than > one RTO after the previous transmission of that segment. > > The following is the RECOMMENDED algorithm for managing the > retransmission timer: > > (5.1) Every time a packet containing data is sent (including a > retransmission), if the timer is not running, start it running > so that it will expire after RTO seconds (for the current value > of RTO). > > (5.2) When all outstanding data has been acknowledged, turn off the > retransmission timer. > > (5.3) When an ACK is received that acknowledges new data, restart the > retransmission timer so that it will expire after RTO seconds > (for the current value of RTO). > > > > > > > > Paxson & Allman Standards Track [Page 4] > > RFC 2988 Computing TCP's Retransmission Timer November 2000 > > > When the retransmission timer expires, do the following: > > (5.4) Retransmit the earliest segment that has not been acknowledged > by the TCP receiver. > > (5.5) The host MUST set RTO <- RTO * 2 ("back off the timer"). The > maximum value discussed in (2.5) above may be used to provide an > upper bound to this doubling operation. > > (5.6) Start the retransmission timer, such that it expires after RTO > seconds (for the value of RTO after the doubling operation > outlined in 5.5). > > Note that after retransmitting, once a new RTT measurement is > obtained (which can only happen when new data has been sent and > acknowledged), the computations outlined in section 2 are performed, > including the computation of RTO, which may result in "collapsing" > RTO back down after it has been subject to exponential backoff > (rule 5.5). > > Note that a TCP implementation MAY clear SRTT and RTTVAR after > backing off the timer multiple times as it is likely that the > current SRTT and RTTVAR are bogus in this situation. Once SRTT and > RTTVAR are cleared they should be initialized with the next RTT > sample taken per (2.2) rather than using (2.3). This may imply (and having looked at the NS2 source, I state that it is so understood in the NS2) that there is only one timout up and running per connection. I repeat: There might be a severe misumderstanding on my side, when I have questions and things are not clear to me, it's up to me to make sure that I understand things correctly. -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Tue May 20 04:00:32 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Tue, 20 May 2014 13:00:32 +0200 Subject: [ih] When was Go Back N adopted by TCP In-Reply-To: <20140519150259.35C1E28E137@aland.bbn.com> References: <20140519150259.35C1E28E137@aland.bbn.com> Message-ID: <537B35D0.9040400@web.de> Am 19.05.2014 17:02, schrieb Craig Partridge: > Hi Detlef: > > I don't keep the 4.3bsd code around anymore, but here's my recollection > of what the code did. > > 4.3BSD had one round-trip timeout (RTO) counter per TCP connection. That's the way I find it in the NS2. > > On round-trip timeout, send 1MSS of data starting at the lowest outstanding > sequence number. Which is not yet GBN in its "pure" form, but actually it is, because CWND is increased with every new ack. And when you call "send_much" when a new ack arrives (I had a glance at the BSD code myself some years ago, the routines are named equally there, as far as I've seen, the ns2 code and the BSD code are extremely similar) the behaviour resembles GBN very much. > Set the RTO counter to the next increment. > > Once an ack is received, update the sequence numbers and begin slow start > again. > > What I don't remember is whether 4.3bsd kept track of multiple outstanding > losses and fixed all of them before slow start or not. OMG. ;-) Who else should remember this, if not Van himself our you? However, first of all I have to thank for all the answers here. Detlef -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From jnc at mercury.lcs.mit.edu Tue May 20 11:39:00 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Tue, 20 May 2014 14:39:00 -0400 (EDT) Subject: [ih] And I thought _I_ was crazy... Message-ID: <20140520183900.1AD2E18C0CD@mercury.lcs.mit.edu> Since PDP-10's were important in the early days of the Internet, this site: http://www.retrocmp.com/stories/the-pdp-10-ki10-console-panel/ might amuse some of you. Noel From jeanjour at comcast.net Tue May 20 12:08:35 2014 From: jeanjour at comcast.net (John Day) Date: Tue, 20 May 2014 15:08:35 -0400 Subject: [ih] And I thought _I_ was crazy... In-Reply-To: <20140520183900.1AD2E18C0CD@mercury.lcs.mit.edu> References: <20140520183900.1AD2E18C0CD@mercury.lcs.mit.edu> Message-ID: Early imprinting is so powerful. ;-) At 2:39 PM -0400 5/20/14, Noel Chiappa wrote: >Since PDP-10's were important in the early days of the Internet, this site: > > http://www.retrocmp.com/stories/the-pdp-10-ki10-console-panel/ > >might amuse some of you. > > Noel From mfidelman at meetinghouse.net Tue May 20 15:12:05 2014 From: mfidelman at meetinghouse.net (Miles Fidelman) Date: Tue, 20 May 2014 18:12:05 -0400 Subject: [ih] And I thought _I_ was crazy... In-Reply-To: References: <20140520183900.1AD2E18C0CD@mercury.lcs.mit.edu> Message-ID: <537BD335.10703@meetinghouse.net> John Day wrote: > Early imprinting is so powerful. ;-) > > > At 2:39 PM -0400 5/20/14, Noel Chiappa wrote: >> Since PDP-10's were important in the early days of the Internet, this >> site: >> >> http://www.retrocmp.com/stories/the-pdp-10-ki10-console-panel/ >> >> might amuse some of you. >> >> Noel Then again: telnet://DEC-10.PDPplanet.com telnet://xkleten.paulallen.com Real, LIVE PDP-10s! (Now what I really want is a PDP-1!). Cheers, Miles Fidelman -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra From detlef.bosau at web.de Tue May 20 15:19:56 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 21 May 2014 00:19:56 +0200 Subject: [ih] Detlef's TCP questions In-Reply-To: <537A5D44.30908@meritmail.isi.edu> References: <537A5D44.30908@meritmail.isi.edu> Message-ID: <537BD50C.1010205@web.de> Am 19.05.2014 21:36, schrieb Bob Braden: > Detlef, > > As Craig and Vint has pointed out, TCP never was GBN. Interestingly, even some lectures take a different position. It would be interesting to have an overview, how TCP is typically presented. > > Yes, any network researcher who wants to call him/herself a computer > scientist should take seriously the experimentalist's task of fully > understanding the assumptions and implementations of their test > environment. That includes NS-2 simulations of TCP. Myself, I have to admit that I trusted too much in the NS-2. And quite some papers I've read rely heavily on the NS-2. > > Yes, in broad generality, the level of network science taught in many > graduate schools is abysmal. How can those with clue resist the > temptation of real mony in industry or getting rich from a startup? So > the next generation of largely clueless PhDs learn from clueless > predecessors. When I refer to the aforementioned slides, even the typical "model" of TCP delay (serialization latency, MAC latency, propagation latency, queueing latency) is taken as "word of god". That doesn't mean, simulations were worthless. But I think, we should treat them with a certain professional distance. In my own simulator, I used GBN for TCP. After some discussions, I decided to re-write my TCP code completely and eventually read RFC 793 a bit more thoroughly than before, and eventually detected that RFC 793 explicitely requests an individual timeout for each packt. I'm afraid, I'm not the only one who detects this discrepancy between RFC 793 and many implementations quite late... The reason for doing this work is, once again, TCP flow control - which I think is not completely understood. Admittedly, I did not always make friends with my colleagues. But this is not my primary goal. My goal is to understand TCP and TCP flow control and resource allocation. And as for the first time, I did a TCP implementation for a simulator, I thought, I understood TCP. And learned, I only understand a little part. > > During the period of Van Jacobson's development of the algorithms that > bear his > name, he wrote many lengthy, pithy, and informative messages to > various public > mailing lists about the hazards of the Internet and how his algorithms > cope. Some of these are (so my impression ;-)) im Craigs memory :-) And perhaps, we can even ask Van himself :-) What I want to do is to understand the questions and challenges during the development, the alternatives - and the choices made. And the reason for the decisions. Detlef -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From jnc at mercury.lcs.mit.edu Tue May 20 16:50:52 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Tue, 20 May 2014 19:50:52 -0400 (EDT) Subject: [ih] And I thought _I_ was crazy... Message-ID: <20140520235052.A9CB118C0D0@mercury.lcs.mit.edu> > From: Miles Fidelman > Then again: > ... > telnet://xkleten.paulallen.com TWENEX. Blech. Try: telnet://its.svensson.org :-) > Real, LIVE PDP-10s! Over the net, how can you tell it's not an emulator? :-) > (Now what I really want is a PDP-1!). I think the only real ones left are at the Computer History Museum, of which one is actually working: http://pdp-1.computerhistory.org/pdp-1/ but if an emulated one will do you (and I'm totally happy with an emulated PDP-11 to run V6 Unix on, enjoying myself no end fiddling with it :-), there's this: http://www.aracnet.com/~healyzh/pdp1emu.html (and apologies if you already knew of it). Noel From mfidelman at meetinghouse.net Tue May 20 18:33:55 2014 From: mfidelman at meetinghouse.net (Miles Fidelman) Date: Tue, 20 May 2014 21:33:55 -0400 Subject: [ih] And I thought _I_ was crazy... In-Reply-To: <20140520235052.A9CB118C0D0@mercury.lcs.mit.edu> References: <20140520235052.A9CB118C0D0@mercury.lcs.mit.edu> Message-ID: <537C0283.8020607@meetinghouse.net> Noel Chiappa wrote: > > From: Miles Fidelman > > > Then again: > > ... > > telnet://xkleten.paulallen.com > > TWENEX. Blech. Try: > > telnet://its.svensson.org > > :-) > > > Real, LIVE PDP-10s! > > Over the net, how can you tell it's not an emulator? :-) > > > > (Now what I really want is a PDP-1!). > > I think the only real ones left are at the Computer History Museum, of which > one is actually working: > > http://pdp-1.computerhistory.org/pdp-1/ > > but if an emulated one will do you (and I'm totally happy with an emulated > PDP-11 to run V6 Unix on, enjoying myself no end fiddling with it :-), there's > this: > > http://www.aracnet.com/~healyzh/pdp1emu.html > Hmmm.... now if we can hack it to emulate the PDP-1X in its heydey in Bldg 20 - what with all the wire-wrapped instructions folks added, and all. :-) Miles -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra From nigel at channelisles.net Tue May 20 23:34:18 2014 From: nigel at channelisles.net (Nigel Roberts) Date: Wed, 21 May 2014 07:34:18 +0100 Subject: [ih] And I thought _I_ was crazy... In-Reply-To: <20140520235052.A9CB118C0D0@mercury.lcs.mit.edu> References: <20140520235052.A9CB118C0D0@mercury.lcs.mit.edu> Message-ID: <537C48EA.7080207@channelisles.net> WOW! That transported me back to 1979!! On 05/21/2014 12:50 AM, Noel Chiappa wrote: > > Then again: From detlef.bosau at web.de Wed May 21 05:37:46 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 21 May 2014 14:37:46 +0200 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: References: <537A50D2.3010807@meritmail.isi.edu> Message-ID: <537C9E1A.2070209@web.de> This does not really answer my original question, I consider asking Van directly, but I see that TCP resembles swabian "K?ssp?tzle". (cheesy noodles.) Everyone has his own recipe, there is not "that one standard" and the real clues in preparing them aren't written in any textbook. Am 19.05.2014 22:45, schrieb Jack Haverty: > Hi Bob, > > That sounds about right. IIRC, there were a lot of TCP > implementations in various stages of progress, as well as in various > stages of protocol genealogy - 2.5, 3, 4, and many could communicate > with themselves or selected others prior to January 1979. Jon's > "bakeoff" on the Saturday preceding the January 1979 TCP Meeting at > ISI was the first time a methodical test was done to evaluate the NxN > interoperability of a diverse collection of implementations. > > I remember that you were one of the six implementations in that test > session. We each had been given an office at ISI for the day and > kept at it until everyone could establish a connection with everyone > else and pass data. > > There were a lot of issues resolved that day, mostly having to do with > ambiguities in the then-current spec we had all been coding to meet. > As we all finally agreed (or our code agreed) on all the details, Jon > tweaked the spec to reflect what the collected software was now doing. > So I've always thought that those six implementations were the first > TCP4 implementations to successfully interoperate. Yours was one of them. > > There was a lot of pressure at the time to get the spec of TCP4 nailed > down and published, and that test session was part of the process. > Subsequently that TCP4 spec became an RFC, and a DoD Standard, and > The Internet started to grow, and the rest is history.... > > I wonder if Dave Clark ever forgave Bill Plummer for crashing the > Multics TCP by innocently asking Dave to temporarily disable his > checksumming code....and then sending a kamikaze packet from Tenex. > > /Jack > > > > On Mon, May 19, 2014 at 11:43 AM, Bob Braden > wrote: > > > Jack, > > You wrote: > > I wrote a TCP back in the 1979 timeframe - the first one for a > Unix > system, running on a PDP-11/40. It first implemented TCP version > 2.5, and later evolved to version 4. It was a very basic > implementation, no "slow start" or any other such niceties > that were > created as the Internet grew. > > I have been trying to recall where my TCP/IP for UCLA's IBM 360/91 > ran in this horse race. The best I can tell from IEN 70 and IEN 77 > is that my TCP-4 version made it between Dec 1978 and Jan 1979, > although I think I had an initial TP-2.5 version talkng to itself > in mid 1978. > > Bob Braden > > -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From galmes at tamu.edu Wed May 21 06:25:07 2014 From: galmes at tamu.edu (Guy Almes) Date: Wed, 21 May 2014 08:25:07 -0500 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <537C9E1A.2070209@web.de> References: <537A50D2.3010807@meritmail.isi.edu> <537C9E1A.2070209@web.de> Message-ID: <537CA933.1090600@tamu.edu> Detlef et al., The subtlety and difficulty and importance of TCP congestion control algorithms are indeed worthy of discussion: more now with our 100-Gb/s wide-area networks than in the early days of TCP/IP. But I'd suggest that, for this list, attention be focused on a few issues. <> Clarity on the degree to which the early TCP RFCs were pretty clear about the protocol, but only suggestive about an example congestion control algorithm. <> Clarity on the degree to which the authors of the early TCP RFCs did not recognize the importance of developing very good congestion control algorithms. <> Clarity on the degree to which the early TCP developers did or did not view as of any importance conformity by different TCP implementations of the best (or set of almost best) practices in congestion control algorithms. <> Clarity on how/when it began to become evident that the naive algorithms documented in the TCP RFCs and used in early testing would themselves become the source of trouble. Even today, confusion between "TCP the protocol" vs "TCP the set of common congestion control algorithms used in practice" persists. But, for this list, I'm interested in the state of clarity vs confusion in these matters early on. Regards, -- Guy On 5/21/14, 7:37 AM, Detlef Bosau wrote: > This does not really answer my original question, I consider asking Van > directly, but I see that TCP resembles swabian "K?ssp?tzle". (cheesy > noodles.) Everyone has his own recipe, there is not "that one standard" > and the real clues in preparing them aren't written in any textbook. > > > > Am 19.05.2014 22:45, schrieb Jack Haverty: >> Hi Bob, >> >> That sounds about right. IIRC, there were a lot of TCP >> implementations in various stages of progress, as well as in various >> stages of protocol genealogy - 2.5, 3, 4, and many could communicate >> with themselves or selected others prior to January 1979. Jon's >> "bakeoff" on the Saturday preceding the January 1979 TCP Meeting at >> ISI was the first time a methodical test was done to evaluate the NxN >> interoperability of a diverse collection of implementations. >> >> I remember that you were one of the six implementations in that test >> session. We each had been given an office at ISI for the day and >> kept at it until everyone could establish a connection with everyone >> else and pass data. >> >> There were a lot of issues resolved that day, mostly having to do with >> ambiguities in the then-current spec we had all been coding to meet. >> As we all finally agreed (or our code agreed) on all the details, Jon >> tweaked the spec to reflect what the collected software was now doing. >> So I've always thought that those six implementations were the first >> TCP4 implementations to successfully interoperate. Yours was one of them. >> >> There was a lot of pressure at the time to get the spec of TCP4 nailed >> down and published, and that test session was part of the process. >> Subsequently that TCP4 spec became an RFC, and a DoD Standard, and >> The Internet started to grow, and the rest is history.... >> >> I wonder if Dave Clark ever forgave Bill Plummer for crashing the >> Multics TCP by innocently asking Dave to temporarily disable his >> checksumming code....and then sending a kamikaze packet from Tenex. >> >> /Jack >> >> >> >> On Mon, May 19, 2014 at 11:43 AM, Bob Braden > > wrote: >> >> >> Jack, >> >> You wrote: >> >> I wrote a TCP back in the 1979 timeframe - the first one for a >> Unix >> system, running on a PDP-11/40. It first implemented TCP version >> 2.5, and later evolved to version 4. It was a very basic >> implementation, no "slow start" or any other such niceties >> that were >> created as the Internet grew. >> >> I have been trying to recall where my TCP/IP for UCLA's IBM 360/91 >> ran in this horse race. The best I can tell from IEN 70 and IEN 77 >> is that my TCP-4 version made it between Dec 1978 and Jan 1979, >> although I think I had an initial TP-2.5 version talkng to itself >> in mid 1978. >> >> Bob Braden >> >> > > > -- > ------------------------------------------------------------------ > Detlef Bosau > Galileistra?e 30 > 70565 Stuttgart Tel.: +49 711 5208031 > mobile: +49 172 6819937 > skype: detlef.bosau > ICQ: 566129673 > detlef.bosau at web.de http://www.detlef-bosau.de > From jnc at mercury.lcs.mit.edu Wed May 21 07:31:31 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Wed, 21 May 2014 10:31:31 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 4 Message-ID: <20140521143131.5577A18C0E0@mercury.lcs.mit.edu> > From: Guy Almes > Clarity on the degree to which the authors of the early TCP RFCs did > not recognize the importance of developing very good congestion control > algorithms. I think it was as much (if not more) an issue of 'we didn't have the capability to do one as good as Van's' as "recogniz[ing] the importance of developing [a] very good" one. To what degree that was the lack of a good understanding of the problem, and to what degree simply that Van was better at control theory and analysis of the system than the rest of us, is a good question, and one I don't have a ready answer too. But if you look at something like "Why TCP Timers Don't Work Well", it's clear we all just didn't understand what could be done. We did understand that congestion control was important (although my recollection is that I don't think we clearly foresaw the severe congestive collapse which the ARPANET-based section of the Internet suffered not too long before Van started working on the problem). Hence, we did put a certain amount of thought into congestion control (Source Quench, the Nagle algorithm, etc). My vague recollection is that in the very early days we were more focused on flow control in the hosts, rather than congestion control in the network, but I think we did understand that congestion in the network was also aan issue (hence SQ, etc). The thing is that we understand all this so much better now - the importance of congestion control, source algorithms to control it, etc - and we were really groping in the dark back then. The ARPANET (because of its effective VC nature, with flow and thus congestion control built into the network itself) hadn't given us much in the way of advance experience in this particular area. So, as with many things, what is crystal clear in hindsight was rather obscured without the mental frameworks, etc that we have now (e.g. F=ma). > Clarity on how/when it began to become evident that the naive > algorithms documented in the TCP RFCs and used in early testing would > themselves become the source of trouble. Not just testing, but early service! (Q.v. the ARPANET-local congestive collapse.) But your wording makes it sound like they were positively incorrect. Well, not really (to my eyes); they mostly simply were not _always effective_ at controlling congestion (although they did generate some useless, duplicate packets). But they were not positively defective, the way TFTP was, with Sorcerer's Apprentice Syndrome: http://en.wikipedia.org/wiki/Sorcerer's_Apprentice_Syndrome Noel From woolf at isc.org Wed May 21 10:16:54 2014 From: woolf at isc.org (Suzanne Woolf) Date: Wed, 21 May 2014 13:16:54 -0400 Subject: [ih] notable "bakeoffs" Re: internet-history Digest, Vol 84, Issue 4 In-Reply-To: References: <537A50D2.3010807@meritmail.isi.edu> Message-ID: <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> Side question: On May 19, 2014, at 4:45 PM, Jack Haverty wrote: > That sounds about right. IIRC, there were a lot of TCP implementations in various stages of progress, as well as in various stages of protocol genealogy - 2.5, 3, 4, and many could communicate with themselves or selected others prior to January 1979. Jon's "bakeoff" on the Saturday preceding the January 1979 TCP Meeting at ISI was the first time a methodical test was done to evaluate the NxN interoperability of a diverse collection of implementations. > > I remember that you were one of the six implementations in that test session. We each had been given an office at ISI for the day and kept at it until everyone could establish a connection with everyone else and pass data. > > There were a lot of issues resolved that day, mostly having to do with ambiguities in the then-current spec we had all been coding to meet. As we all finally agreed (or our code agreed) on all the details, Jon tweaked the spec to reflect what the collected software was now doing. So I've always thought that those six implementations were the first TCP4 implementations to successfully interoperate. Yours was one of them. > > There was a lot of pressure at the time to get the spec of TCP4 nailed down and published, and that test session was part of the process. Subsequently that TCP4 spec became an RFC, and a DoD Standard, and The Internet started to grow, and the rest is history?. Was this the first such "bakeoff" for test/debug of interoperability for TCP/IP or its ancestor protocols? Occasionally I try to explain Internet history and processes to people outside of engineering culture. In that context, what we mean by "interoperability" and its role in usable standards is hard to explain, but keeps turning out to be important?. thanks, Suzanne From mfidelman at meetinghouse.net Wed May 21 10:34:28 2014 From: mfidelman at meetinghouse.net (Miles Fidelman) Date: Wed, 21 May 2014 13:34:28 -0400 Subject: [ih] notable "bakeoffs" Re: internet-history Digest, Vol 84, Issue 4 In-Reply-To: <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> References: <537A50D2.3010807@meritmail.isi.edu> <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> Message-ID: <537CE3A4.6040209@meetinghouse.net> Suzanne Woolf wrote: > SOccasionally I try to explain Internet history and processes to people outside of engineering culture. In that context, what we mean by "interoperability" and its role in usable standards is hard to explain, but keeps turning out to be important?. > > English, as spoken in, say, America vs. Jamaica. :-) -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra From braden at isi.edu Wed May 21 11:21:44 2014 From: braden at isi.edu (Bob Braden) Date: Wed, 21 May 2014 11:21:44 -0700 Subject: [ih] notable "bakeoffs" Re: internet-history Digest, Vol 84, Issue 4 In-Reply-To: <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> References: <537A50D2.3010807@meritmail.isi.edu> <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> Message-ID: <537CEEB8.4070006@meritmail.isi.edu> On 5/21/2014 10:16 AM, Suzanne Woolf wrote: > Was this the first such "bakeoff" for test/debug of interoperability > for TCP/IP or its ancestor protocols? Occasionally I try to explain > Internet history and processes to people outside of engineering > culture. In that context, what we mean by "interoperability" and its > role in usable standards is hard to explain, but keeps turning out to > be important?. thanks, Suzanne Suzanne, As recorded in IEN70 by our compulsive record keeper, Jon Postel, in the minutes of the 4 December 1978 Internet Meeting: "In the afternoon we met at DCEC to test or demonstrate the TCP-4 implementations. The four programs that were in a state to attempt interconnections were Jim Mathis', Bob Braden's, Mike Wingfield's, and Dave Clark's. " I am pretty sure this Dec 78 testing session was the first interoperability event("bakefoff") for TCP/IP. We were testing TCP version 2 (or 2.5?). I think, and some of our implementations of this moving spec were a bit on the buggy side :-( But I am a little bit puzzled by the difficulty of explaining the interoperability requirement. It takes 2 to communicate (although TCP had/has the neat symmetry property that allows loop back. I expect that all of us did our initial testing using loop back. But of course that did not guarantee interoperability with others.) Bob Braden From jnc at mercury.lcs.mit.edu Wed May 21 12:08:17 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Wed, 21 May 2014 15:08:17 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 4 Message-ID: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> PS: > From: jnc at mercury.lcs.mit.edu (Noel Chiappa) >> Clarity on how/when it began to become evident that the naive >> algorithms documented in the TCP RFCs and used in early testing would >> themselves become the source of trouble. > Well, not really (to my eyes); they mostly simply were not _always > effective_ at controlling congestion (although they did generate some > useless, duplicate packets). > ... > So, as with many things, what is crystal clear in hindsight was rather > obscured without the mental frameworks, etc that we have now (e.g. > F=ma). For an illuminating take on this all, try (re-)reading RFC-793. It only contains the word 'congestion' twice - both in sections of generic text (e.g. about how a packet could be lost). It does spend a while on RTT estimation algorithms - but _only_ to find out when data needs to be re-transmitted. (IOW, we were so focused on 'getting the data through ASAP' that that's all we saw the RTT as being for - so that as soon as the ACK was seen to be missing, we could retransmit the packet and get the transfer going again). And as to what to do when a timeout happened (which usually, although not always, indicates that a packet has been dropped due to congestion), it says: if the retransmission timeout expires on a segment in the retransmission queue, send the segment at the front of the retransmission queue again [and] reinitialize the retransmission timer That's it! Again, note the focus on 'hey, we gotta get the user's data there as fast as we can'. Absolutely no thought given to 'hey, maybe that packet was lost through congestion, maybe I ought to consider that, and if so, how to respond'. The later 'if you have a congesitve loss, back off exponentially to prevent congestive overload' stuff is completely missing (and would be for some time, probably until Van, I would guess). I fairly vividly remember being the IETF where Van gave his first talk about his congestion work, and when he started talking about how a lost packet was a congestion signal, I think we all went 'wow, that's so obvious - how come we never thought of that'! The TCP RFC does, however, spend a great deal of time talking about the window (which is purely a destination host buffering thing at that point). Looking at RFC-792 (ICMP) there's a fair chunk in the Source Quench section about congestion, but little in the way of a solid algorithmic suggestion: the source host should cut back the rate at which it is sending traffic to the specified destination until it no longer receives source quench messages from the gateway (And I still think SQ has gotten a bad rap about being ineffective and/or making things worse; I would love to see some rigorous investigation of SQ. But I digress...) The 'Dave Clark 5' RFCs are similarly thin on congestion-related content: RFC-813, "Window and Acknowledgement Strategy in TCP" (which one would assume would be the place to find it) doesn't even contain the word 'congestion'! It's all about host buffering, etc. (Indeed, it suggests delayed ACKs, which interrupts the flow of ACKs which are an important signal in VJCC.) And one also finds this gem: the disadvantage of burstiness is that it may cause buffers to overflow, either in the eventual recipient .. or in an intermediate gateway, a problem ignored in this paper. It's interesting to see what _is_ covered in the DC5 set, and similar writings: Dave goes into Silly Window Syndome at some length, but there's nothing about congestive losses. Lixia's "Why TCP Timers Don't Work Well" paper is, I think, a valuable snap-shot of thinking about related topics pre-Van; it too doesn't have much about congestive losses, mentioning them only briefly. The general sense one gets from reading it is that 'the increased measured RTT caused by congestive losses will cause people to back off enough to get rid of the congestion' (which wasn't true, of course). I haven't read Nagle's thing, but that would also be interesting to look at, to see how much we understood at that point. So I think congestion control was so lacking, in part, because we just hadn't run into it as a serious problem. Yes, we knew that _in theory_ congestion was possible, and we'd added some stuff for it (SQ), but we just hadn't seen it a lot - and we probably hadn't seen how _bad_ it could get back then. (Although experience at MIT with Sorcerer's Apprentice had shown us how bad congestive collapse _could_ get - and I seem to recall hearing that PARC had seen a similar thing. But I suspect the particular circumstances of SAS, with the exponential increases in volume, even though it was - in theory! - a 'single packet outstanding' protocol, might have led us to believe that it was a pathological case, one that didn't have any larger lessons.) We were off fixing other alligators (SWS, etc) that actually had bitten us at that point... So I suspect it was only when congestive collapse hit on the ARPANET section of the Internet (shortly before Van's work) that it really got a focus. Noel From brian.e.carpenter at gmail.com Wed May 21 13:02:01 2014 From: brian.e.carpenter at gmail.com (Brian E Carpenter) Date: Thu, 22 May 2014 08:02:01 +1200 Subject: [ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4] In-Reply-To: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> Message-ID: <537D0639.6030604@gmail.com> On 22/05/2014 07:08, Noel Chiappa wrote: > I fairly vividly remember being the IETF where Van gave his first talk about > his congestion work, and when he started talking about how a lost packet was > a congestion signal, I think we all went 'wow, that's so obvious - how come > we never thought of that'! I wasn't there so I have no right to comment on this, but it's surely the case that actual bit destruction causing non-congestive packet loss was a much bigger worry in the 1970s than it was ten years later? And indeed when actual packet loss became a significant factor with the rise of wireless networks some years ago, it proved that treating it mainly as a congestion signal was (and is) problematic. If you have a path that includes both loss-prone and congestion-prone segments, TCP doesn't work so well. (See http://www.ietf.org/proceedings/87/slides/slides-87-nwcrg-4.pdf for example.) Brian From jack at 3kitty.org Wed May 21 13:13:22 2014 From: jack at 3kitty.org (Jack Haverty) Date: Wed, 21 May 2014 13:13:22 -0700 Subject: [ih] notable "bakeoffs" Re: internet-history Digest, Vol 84, Issue 4 In-Reply-To: <537CEEB8.4070006@meritmail.isi.edu> References: <537A50D2.3010807@meritmail.isi.edu> <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> <537CEEB8.4070006@meritmail.isi.edu> Message-ID: >From the horse's mouth - http://stuff.mit.edu/afs/sipb/user/rlk/hack has Jon Postel's thoughts. Also, RFC1025 - http://tools.ietf.org/html/rfc1025is a good summary. IEN70 and IEN77 captured the events surrounding the "first bakeoff" that Postel held at ISI. I think that Jon distinguished a "bakeoff" from other "testing sessions" in that a bakeoff had a set of rules, a list of specific test scenarios, and a scoring scheme. Jon's 1985 email in that MIT archive recounts the scoring structure of the second bakeoff, in 1980, and characterizes the January 1979 event at ISI as the "first bakeoff". That was the timeframe when there was a lot of pressure to nail down the specification so that it could become a DoD standard. I think the bakeoffs were a crucial part of that process. Jon used the results of the bakeoffs to drive the creation of the documents in RFC 761 and 793, focused on making the documents and the actual implementations match as exactly as possible. I've found in my boxes of history the scoring document that Jon handed out to all of us at the first bakeoff at ISI, on 27 January 1979. I can't find it online (more accurately, Google can't find it). Since it may be of historical interest, I've scanned it and attached to this message. Hope it makes it through the email... It's interesting to see how the scoring rules evolved from the first to second bakeoff. /Jack Haverty On Wed, May 21, 2014 at 11:21 AM, Bob Braden wrote: > On 5/21/2014 10:16 AM, Suzanne Woolf wrote: > >> Was this the first such "bakeoff" for test/debug of interoperability for >> TCP/IP or its ancestor protocols? Occasionally I try to explain Internet >> history and processes to people outside of engineering culture. In that >> context, what we mean by "interoperability" and its role in usable >> standards is hard to explain, but keeps turning out to be important?. >> thanks, Suzanne >> > Suzanne, > > As recorded in IEN70 by our compulsive record keeper, Jon Postel, in the > minutes of the 4 December 1978 Internet Meeting: > > "In the afternoon we met at DCEC to test or demonstrate the TCP-4 > implementations. > The four programs that were in a state to attempt interconnections were Jim > Mathis', Bob Braden's, Mike Wingfield's, and Dave Clark's. " > > I am pretty sure this Dec 78 testing session was the first > interoperability event("bakefoff") for TCP/IP. We were testing TCP version > 2 (or 2.5?). I think, and some of our implementations of this moving spec > were a bit on the buggy side :-( > > But I am a little bit puzzled by the difficulty of explaining the > interoperability requirement. It takes 2 to communicate (although TCP > had/has the neat symmetry property that allows loop back. I expect that > all of us did our initial testing using loop back. But of course that did > not guarantee interoperability with others.) > > Bob Braden > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: TCPBakeoff1979.pdf Type: application/pdf Size: 70711 bytes Desc: not available URL: From detlef.bosau at web.de Wed May 21 13:14:10 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 21 May 2014 22:14:10 +0200 Subject: [ih] Detlef's TCP questions In-Reply-To: <923b00f71cc486202b097fb4327f9d20@seas.upenn.edu> References: <537A5D44.30908@meritmail.isi.edu> <537BD50C.1010205@web.de> <923b00f71cc486202b097fb4327f9d20@seas.upenn.edu> Message-ID: <537D0912.3090400@web.de> Am 21.05.2014 18:38, schrieb Michael Greenwald: > On 2014-05-20 15:19, Detlef Bosau wrote: >> Am 19.05.2014 21:36, schrieb Bob Braden: >>> Detlef, >>> >>> Yes, any network researcher who wants to call him/herself a computer >>> scientist should take seriously the experimentalist's task of fully >>> understanding the assumptions and implementations of their test >>> environment. That includes NS-2 simulations of TCP. >> >> Myself, I have to admit that I trusted too much in the NS-2. And quite >> some papers I've read rely heavily on the NS-2. > > NS-2 is a simulator. As such it is, at best, an approximation of a > real network --- including approximate implementation of various > protocols. No discussion about that. > One of the first rules about using simulation results is that you > must always validate your results in the real system. (Not necessarily > *every* result, but compare enough runs to know when the simulation > becomes > inaccurate). I am surprised to hear you say (or at least imply) that a > reasonable fraction of people studying networks believe that NS-2 is or > was "truth". (In fact, I am skeptical of this claim.) One of the oldest questions of mankind, I think it is even mentioned in the bible, is "What is truth?" However, the particular problem is that at least myself learned quite a lot about TCP from the NS-2 code. Hence, perhaps the NS-2 may have some influence on practical protocol implemenation. Even that is no problem. But it IS a problem, when RFC and some implementations, one of which is the NS-2, diverge. Especially when this is not said appropriately. > >> >> That doesn't mean, simulations were worthless. But I think, we should >> treat them with a certain professional distance. > > I think it is already the case that people treat simulation results > with "a certain professional distance." At least I know several colleagues who do so. Unfortunately, I know some colleagues as well who mix up simulations with reality. > Simulators *model* the real world. Our abstract models (whether when > making back of the envelope estimates, or simulating real phenomena) > commonly trade off accuracy for ease of modeling (whether the "ease" > is because of performance, obtaining closed form solutions, or > generality, > or...). I think it is/was well understood that NS-2 simplified and > abstracted things, and was therefore inaccurate. The question is always > whether the accuracy is good enough to support any claims you make based > on the simulation results. > > I completely agree that researchers need to experiment, analyze, and > understand > TCP and other network protocols more carefully; I just don't think > people were > as confused by NS-2 (or other simulations) as much as you seem to think. > Admittedly, I was extremely surprised when I noticed this very point with the RTO ;-) It is my fault. However, I still don't know for sure, whether VJs original code for TCP/Tahoe did GBN or not.... -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From dhc2 at dcrocker.net Wed May 21 13:19:27 2014 From: dhc2 at dcrocker.net (Dave Crocker) Date: Wed, 21 May 2014 13:19:27 -0700 Subject: [ih] notable "bakeoffs" Re: internet-history Digest, Vol 84, Issue 4 In-Reply-To: <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> References: <537A50D2.3010807@meritmail.isi.edu> <477F5B93-F5A3-4075-AEDE-77CDB1F645B4@isc.org> Message-ID: <537D0A4F.9080302@dcrocker.net> On 5/21/2014 10:16 AM, Suzanne Woolf wrote: > Occasionally I try to explain Internet history and processes to people outside of engineering culture. In that context, what we mean by "interoperability" and its role in usable standards is hard to explain, but keeps turning out to be important?. I would assume two kinds of difficulty, for folks not familiar with interop testing: 1. There's a spec; implement it. The assumption would be that all that's needed is to follow the spec. Of course, this misses a) the ambiguities of (all) specs, which produces individual interpretations and hence individual variations and hence non-interoperability; and b) the complexity of specs and the inevitable array of bugs. Interop testing isn't exhaustive, but it makes sure that the basic code paths (inter-)work. 2. Conformance testing is sufficient. This assumes that an independent test engine will suffice. What it misses is the impressively unpredictable interaction problems that can occur between any two, independent implementations. Conformance testing can be useful for shaking out the 'normal' bugs, but never assures actual interoperability. There are two other benefits of interop testing that are easily missed: 1. Cost. Compared with more formal testing disciplines, interop events can be remarkably inexpensive, especially given the level of their efficacy. 2. Community. Especially for the early stages of a new technology, an interop event usually turns the independent implementers into a collaborative community. I've tended to claim that it moves adoption forward by at least 6 month, compared with only doing ad hoc testing. d/ -- Dave Crocker Brandenburg InternetWorking bbiw.net From braden at isi.edu Wed May 21 13:22:42 2014 From: braden at isi.edu (Bob Braden) Date: Wed, 21 May 2014 13:22:42 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 10 In-Reply-To: References: Message-ID: <537D0B12.4080605@meritmail.isi.edu> On 5/21/2014 12:00 PM, internet-history-request at postel.org wrote: > Send internet-history mailing list submissions to > internet-history at postel.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://mailman.postel.org/mailman/listinfo/internet-history > or, via email, send a message with subject or body 'help' to > internet-history-request at postel.org > > You can reach the person managing the list at > internet-history-owner at postel.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of internet-history digest..." > > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 21 May 2014 10:31:31 -0400 (EDT) > From: jnc at mercury.lcs.mit.edu (Noel Chiappa) > Subject: Re: [ih] internet-history Digest, Vol 84, Issue 4 > To: internet-history at postel.org > Cc: jnc at mercury.lcs.mit.edu > Message-ID: <20140521143131.5577A18C0E0 at mercury.lcs.mit.edu> > > > From: Guy Almes > > > Clarity on the degree to which the authors of the early TCP RFCs did > > not recognize the importance of developing very good congestion control > > algorithms. > > I think it was as much (if not more) an issue of 'we didn't have the > capability to do one as good as Van's' as "recogniz[ing] the importance of > developing [a] very good" one. Very definitely the latter. Van brought an entirely new perspective to the problem, from his experience designing digital control systems for LBL. > > To what degree that was the lack of a good understanding of the problem, and > to what degree simply that Van was better at control theory and analysis of > the system than the rest of us, is a good question, and one I don't have a > ready answer too. But if you look at something like "Why TCP Timers Don't > Work Well", it's clear we all just didn't understand what could be done. It is not a "good question", there is no doubt. > > We did understand that congestion control was important (although my > recollection is that I don't think we clearly foresaw the severe congestive > collapse which the ARPANET-based section of the Internet suffered not too > long before Van started working on the problem). Hence, we did put a certain > amount of thought into congestion control (Source Quench, the Nagle > algorithm, etc). Nagle and Partridge had only recently made us clearly aware of the congestion collapse problem; Mills had already produced the disasterous consequences in NSFnet ;-) > My vague recollection is that in the very early days we were more focused on > flow control in the hosts, rather than congestion control in the network, but Yes, and to the extent we did recognize the problem, we had no clue about how to cure it. Van worked out, and taught the rest of us, the fundamentals of "packet physics". Once he explained about ack clocking, slow start, etc., it made sense, but that does not mean we figured it out ourselves. I think we did understand that congestion in the network was also aan issue (hence SQ, Yes, and Source Quench was a perfect example of our pre-VJ cluelesssness. Wwhat seemed a plausible congestion control mechanism was in fact completely broken. etc). The thing is that we understand all this so much better now - the importance of congestion control, source algorithms to control it, etc - and we were really groping in the dark back then. The ARPANET (because of its effective VC nature, with flow and thus congestion control built into the network itself) hadn't given us much in the way of advance experience in this particular area. So, as with many things, what is crystal clear in Quite true. Also, in fairness, we were being stressed by the complexity of making the entire system work at all in the face of exponential growth. We struggled to make routing actually work despite repeated routing table overflows, and we had to solve the network management problem. Until we began to experience congestion collapse in NSFnet, congestion control seemed more an academic problem. The early Internet was a continuing "success disaster". Bob Braden to solve hindsight was rather obscured without the mental frameworks, etc that we have now (e.g. F=ma). > Clarity on how/when it began to become evident that the naive > algorithms documented in the TCP RFCs and used in early testing would > themselves become the source of trouble. Not just testing, but early service! (Q.v. the ARPANET-local congestive collapse.) But your wording makes it sound like they were positively incorrect. Well, not really (to my eyes); they mostly simply were not _always effective_ at controlling congestion (although they did generate some useless, duplicate packets). But they were not positively defective, the way TFTP was, with Sorcerer's Apprentice Syndrome: http://en.wikipedia.org/wiki/Sorcerer's_Apprentice_Syndrome Noel ------------------------------ ------------------------------ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jnc at mercury.lcs.mit.edu Wed May 21 13:44:56 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Wed, 21 May 2014 16:44:56 -0400 (EDT) Subject: [ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4] Message-ID: <20140521204456.5C34B18C0E4@mercury.lcs.mit.edu> > From: Brian E Carpenter > it's surely the case that actual bit destruction causing non-congestive > packet loss was a much bigger worry in the 1970s than it was ten years > later? I don't recall us worrying about damaged packets, to be honest. If they happened, they were re-transmitted, and you just didn't notice. The one exception I can remember is when a _particular packet_ could not be sent along the CHAOSNET (think Ethernet, but over heavy CATV cable) at MIT. Every time the source tried to send it, it got damaged. We only noticed because the email queue (it was from a piece of email) to that machine got wedged! :-) Other than that, errors were rare enough, even back then, that they just weren't an issue. > when actual packet loss became a significant factor with the rise of > wireless networks some years ago, it proved that treating it mainly as a > congestion signal was (and is) problematic. If you have a path that > includes both loss-prone and congestion-prone segments, TCP doesn't work > so well. Which is a good part of why I lament the loss of SQ! The commonly-heard reason for getting rid of it ('It increases congestion') doesn't make sense to me, because unless something's unusual about the path (and return path) between the source and the congestion point, that section of the path is by definition un-congested (since the user's packet made it to the congestion point OK). Maybe I'm missing something? And there are several good reasons to like SQ: First, and quite importantly, as you point out, it's an un-ambiguous congestion signal. Second, it's a slightly/somewhat faster congestion signal (since it's only the actual RTT from the source to the congestion point, not the end-end RTT plus a fudge factor for variability plus (potentially) wakeup clock delay/quantization/etc. My memory of control theory is dim, but I seem to remember that faster feedbacks are always better (although the response has to be suitably damped, of course). Although it's probably a second-order effect compared to the first one. Noel From louie at transsys.com Wed May 21 13:52:28 2014 From: louie at transsys.com (Louis Mamakos) Date: Wed, 21 May 2014 16:52:28 -0400 Subject: [ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4] In-Reply-To: <537D0639.6030604@gmail.com> References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> <537D0639.6030604@gmail.com> Message-ID: <31378858-5C2F-458E-8383-721A32506286@transsys.com> On May 21, 2014, at 4:02 PM, Brian E Carpenter wrote: > On 22/05/2014 07:08, Noel Chiappa wrote: > >> I fairly vividly remember being the IETF where Van gave his first talk about >> his congestion work, and when he started talking about how a lost packet was >> a congestion signal, I think we all went 'wow, that's so obvious - how come >> we never thought of that'! > > I wasn't there so I have no right to comment on this, but it's > surely the case that actual bit destruction causing non-congestive > packet loss was a much bigger worry in the 1970s than it was > ten years later? > > And indeed when actual packet loss became a significant factor > with the rise of wireless networks some years ago, it proved > that treating it mainly as a congestion signal was (and is) > problematic. If you have a path that includes both loss-prone > and congestion-prone segments, TCP doesn't work so well. > (See http://www.ietf.org/proceedings/87/slides/slides-87-nwcrg-4.pdf > for example.) > > Brian I think there was some awareness of this at the time. When the NSFNET phase-1 deployment planning begin, some of us at U of MD were involved in the procurement of the LSI-11 fuzzball systems. Of course, Dave Mills was deeply sucked into this at the time, and I vaguely remember a conversation with him regarding the use of DMV-11 Q-bus synchronous serial interfaces. These devices did some sort of ARQ between themselves, and being then still wet-behind-the-ears, I asked Dave why we didn?t just rely on IP and TCP checksums for reliable transport? I think based on his experience over very lossy packet radio paths, it was revealed that eliminating hop-by-hop loss due to damage would be value to avoid provoking TCP?s retransmit mechanism. And with only 56kb/s trunks between the fuzzballs in the NSFNET phase 1 network, there was ample opportunity to explore the congestion domain.. I also remember being there in Cambridge for Van?s early presentation on TCP congestion and the application of control theory. At the time I thought I was fortunate because the IP/TCP implementation for our UNIVAC 1108 system was ?working?, and then it became clear there was much more work to be done! I don?t think that people realize how much they take for granted the interoperability in protocol implementation we generally enjoy today. Certainly we all learned quite a bit with the early implementations; heck, who even worries about sequence space wrapping around and getting the math to work right on your 36-bit one?s-complement CPU. louie From jack at 3kitty.org Wed May 21 15:45:23 2014 From: jack at 3kitty.org (Jack Haverty) Date: Wed, 21 May 2014 15:45:23 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <537CA933.1090600@tamu.edu> References: <537A50D2.3010807@meritmail.isi.edu> <537C9E1A.2070209@web.de> <537CA933.1090600@tamu.edu> Message-ID: Guy - good list of issues. As one of those "early TCP developers" I can at least provide my answers. Not sure how they generalize. Everything is of course IIRC - it's been almost 35 years.... There were a variety of things that we knew were important, but were specifically and intentionally excluded from specification. In many cases it was because we didn't know what "the right answer" was, or at least didn't agree. The "congestion control algorithm" is one such example. The Internet was new, it was tiny, and we had had little operational experience with it. We of course did experiments, but it wasn't obvious that you could draw any generalizations from the results in "toy" configurations that would apply to a larger net, with real traffic patterns from real-world users. Additionally, the core idea of an architecture like TCP was new. Although there were several pre-existing communications mechanisms in use at the time (in particular IBM's), all of them were very homogeneous - i.e., designed and implemented by a single organization. The ARPANET had existed for a decade, but it was also homogeneous - all the IMPs and software they contained were implemented and operated by BBN. Additionally, the ARPANET, and other contemporary networks, tended to use a hop-by-hop approach to implementing a reliable byte-stream service analogous to TCP's core service. TCP/IP chose a very different end-to-end approach, allowing packets to be discarded in transit, or even carved up into smaller pieces anywhere along the way. Existing algorithms and knowledge from the traditional "virtual circuit" networks simply didn't obviously apply to the "datagram" approach of TCP/IP. We were in uncharted territory. That was of course ARPA's charter - "Advanced Research", meaning trying things that hadn't been tried before. There was a point in time where the Internet hit a fork in the road. One fork was to follow the tried-and-proven ARPANET path, where all of the routers in the Internet would be designed and built by one organization. The other fork was to structure the design so that many different designs and implementations could be used, by different programmers, organizations, etc. We knew we could take the former fork and get an Internet to work, by building from the ARPANET experience, incorporating the internal mechanisms (including congestion control and flow control) of the ARPANET into the routers. We didn't know if it was even possible to go down the other fork, which was "where no man had gone before". As we all know, we're on that second fork and it's been working far better than I think anyone anticipated. Google "Kahn Haverty subway strap" if you'd like the story in an old posting to this list. There had been a lot of work on the internal algorithms of the ARPANET, i.e., the protocols used between IMPs, and the algorithms used for traffic management, error control, et al. Congestion control was one of the hot topics, since the ARPANET had grown large and complex enough, with large and diverse data flows associated with many users, and had been experiencing congestion events. I was probably more aware of the ARPANET work than others, since I was working at BBN and in the same group that was responsible for the ARPANET. There was a large group of scientists and mathematicians involved in observing ARPANET behavior, creating and deploying new algorithms, and evaluating the results in the live net. Congestion control was not a solved problem in the ARPANET. Even if it was, it wasn't clear that the techniques used in the ARPANET et al would be applicable, or effective, in the TCP/IP world. The various "suggestions" in the early TCP RFCs were explicit cases of the specification declining to select any particular design. Rather it was important to have a specification that permitted the research in areas such as congestion control to continue, developing new ideas and testing them out in the TCP/IP environment. In other words, the specification explicitly did not specify any particular required algorithm. If it did, we had no confidence that it would be the right one, and such a constraint might have ruled out later work done by Van and others. The Internet was based on interoperability, and the continued ability to communicate in spite of differences between the parties involved. Different computers, different networks, different software, different algorithms, different people, different organizations, different technologies.... Conformity to core design elements, like the basic packet formats and the TCP state machine, was needed to achieve interoperability. In other design choices, like congestion control algorithms, user APIs, packetization algorithms, etc., interoperability could still be attained while permitting diversity. i think (and thought at the time) that the ability to have such diversity was very important to allow new ideas to be introduced and to allow the Internet to evolve as it grew in operational use and the "lab" of our research work hit the "real world". The Internet was after all explicitly a Research project. The focus was on getting it to work at all, i.e., getting data to flow, and getting a system operational so that other people could get involved, try out new ideas, and collectively figure out what worked and what didn't. It would be interesting historically to note when The Internet stopped being a Research project and became Operational. Or has that not yet happened? Re: Did We Recognize It Was Important -- At every ICCB (precursor to IAB) meeting there was a list on the corner of the whiteboard of ongoing issues that Vint selected and Jon recorded. It listed the important things that needed to be done, but that we didn't know how to do. Congestion control was one of them. Others I recall were Expressway Routing, Multiple-Homing, and Multi-Path. There were maybe a dozen in all. I wonder how many of them could now be checked off as done. We knew that such things were going to cause problems and that the "naive algorithms" would cause trouble. We also knew that we didn't know "the" right answer (or even any answer at all) and that experimentation would be useful. We also knew that TCP4 had a limited lifetime, and such problems could be addressed with enhancements in TCP5, TCP6, etc. We had gone from TCP2 to TCP3 to TCP4 in a year or two, so TCP 5 could be expected within a year, and hopefully by then the naive algorithms wouldn't have wreaked too much havoc and could be replaced. But of course we really got that schedule wrong...I at least didn't appreciate how making something a Standard would cause it to set in concrete so quickly and firmly. I recall one meeting where we groused to Vint that "we're not done yet!" and TCP wasn't ready to Standardize. Of course researchers always say that, and like others we lost too. Regarding "best practices" ... there seems to be an implicit assumption that there is always such a thing as a "best practice" and you just have to find it. The Internet as a technology is very complex, and it is used in a wide range of situations. The 1980-ish Internet was designed with specific scenarios in mind, reflecting systems that would actually use TCP. For example, one scenario involved military personnel in aircraft or jeeps (packet radio for comms) interacting with command staff at HQ (land and satellite based comms) as well as with ships at sea (satellite based comms from unstable in-motion platforms). Toss in some electronic countermeasures and the general chaos of a battlefield situation, all of which cause packet loss, unpredictable connectivity, and changing traffic patterns. The 2014-ish Internet seems to now be dominated by email, streaming video, web browsing, multi-gigabit LANs and WANs, and the other activities of several billion users, virtually all of whom have computer power exceeding the aggregate of all the users on the 1980-ish Internet and generating huge amounts of traffic (my speculation only, but you get the idea). I'm not convinced that *any* congestion control algorithm is applicable to such a wide range of environments. Even in the 1980 timeframe we had different TCP implementations using different algorithms that their authors thought were appropriate for the environment in which that implementation would be used. The environments of a high-speed (for the time) LAN versus a lengthy string of terrestrial and satellite networks have very different characteristic behaviors in terms of delay, packet loss rates, variance, and other such parameters that are important to something like a congestion control algorithm. In some of the later RFCs, the IETF seems to have picked certain algorithms and declared them to be Required. So maybe it is possible to nail down "the" correct algorithm. I can't tell if today's TCPs in use actually conform to that Requirement though. If so, I guess it works - at least the Internet still seems to work amazingly well, at least from my perspective now as a User. But does it work because there's a single best practice algorithm in universal use? Or because there's not...? My $0.02, /Jack Haverty On Wed, May 21, 2014 at 6:25 AM, Guy Almes wrote: > Detlef et al., > The subtlety and difficulty and importance of TCP congestion control > algorithms are indeed worthy of discussion: more now with our 100-Gb/s > wide-area networks than in the early days of TCP/IP. > > But I'd suggest that, for this list, attention be focused on a few > issues. > > <> Clarity on the degree to which the early TCP RFCs were pretty clear > about the protocol, but only suggestive about an example congestion control > algorithm. > > <> Clarity on the degree to which the authors of the early TCP RFCs did > not recognize the importance of developing very good congestion control > algorithms. > > <> Clarity on the degree to which the early TCP developers did or did not > view as of any importance conformity by different TCP implementations of > the best (or set of almost best) practices in congestion control algorithms. > > <> Clarity on how/when it began to become evident that the naive > algorithms documented in the TCP RFCs and used in early testing would > themselves become the source of trouble. > > Even today, confusion between "TCP the protocol" vs "TCP the set of > common congestion control algorithms used in practice" persists. But, for > this list, I'm interested in the state of clarity vs confusion in these > matters early on. > > Regards, > -- Guy > > > On 5/21/14, 7:37 AM, Detlef Bosau wrote: > >> This does not really answer my original question, I consider asking Van >> directly, but I see that TCP resembles swabian "K?ssp?tzle". (cheesy >> noodles.) Everyone has his own recipe, there is not "that one standard" >> and the real clues in preparing them aren't written in any textbook. >> >> >> >> Am 19.05.2014 22:45, schrieb Jack Haverty: >> >>> Hi Bob, >>> >>> That sounds about right. IIRC, there were a lot of TCP >>> implementations in various stages of progress, as well as in various >>> stages of protocol genealogy - 2.5, 3, 4, and many could communicate >>> with themselves or selected others prior to January 1979. Jon's >>> "bakeoff" on the Saturday preceding the January 1979 TCP Meeting at >>> ISI was the first time a methodical test was done to evaluate the NxN >>> interoperability of a diverse collection of implementations. >>> >>> I remember that you were one of the six implementations in that test >>> session. We each had been given an office at ISI for the day and >>> kept at it until everyone could establish a connection with everyone >>> else and pass data. >>> >>> There were a lot of issues resolved that day, mostly having to do with >>> ambiguities in the then-current spec we had all been coding to meet. >>> As we all finally agreed (or our code agreed) on all the details, Jon >>> tweaked the spec to reflect what the collected software was now doing. >>> So I've always thought that those six implementations were the first >>> TCP4 implementations to successfully interoperate. Yours was one of >>> them. >>> >>> There was a lot of pressure at the time to get the spec of TCP4 nailed >>> down and published, and that test session was part of the process. >>> Subsequently that TCP4 spec became an RFC, and a DoD Standard, and >>> The Internet started to grow, and the rest is history.... >>> >>> I wonder if Dave Clark ever forgave Bill Plummer for crashing the >>> Multics TCP by innocently asking Dave to temporarily disable his >>> checksumming code....and then sending a kamikaze packet from Tenex. >>> >>> /Jack >>> >>> >>> >>> On Mon, May 19, 2014 at 11:43 AM, Bob Braden >> > wrote: >>> >>> >>> Jack, >>> >>> You wrote: >>> >>> I wrote a TCP back in the 1979 timeframe - the first one for a >>> Unix >>> system, running on a PDP-11/40. It first implemented TCP version >>> 2.5, and later evolved to version 4. It was a very basic >>> implementation, no "slow start" or any other such niceties >>> that were >>> created as the Internet grew. >>> >>> I have been trying to recall where my TCP/IP for UCLA's IBM 360/91 >>> ran in this horse race. The best I can tell from IEN 70 and IEN 77 >>> is that my TCP-4 version made it between Dec 1978 and Jan 1979, >>> although I think I had an initial TP-2.5 version talkng to itself >>> in mid 1978. >>> >>> Bob Braden >>> >>> >>> >> >> -- >> ------------------------------------------------------------------ >> Detlef Bosau >> Galileistra?e 30 >> 70565 Stuttgart Tel.: +49 711 5208031 >> mobile: +49 172 6819937 >> skype: detlef.bosau >> ICQ: 566129673 >> detlef.bosau at web.de http://www.detlef-bosau.de >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From jnc at mercury.lcs.mit.edu Wed May 21 17:06:31 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Wed, 21 May 2014 20:06:31 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 10 Message-ID: <20140522000631.D3CBE18C0E4@mercury.lcs.mit.edu> > From: Bob Braden >> To what degree that was the lack of a good understanding of the >> problem, and to what degree simply that Van was better at control >> theory and analysis of the system than the rest of us, is a good >> question > It is not a "good question", there is no doubt. I was being polite... :-) > Source Quench was a perfect example of our pre-VJ cluelesssness. Wwhat > seemed a plausible congestion control mechanism was in fact completely > broken. Here is the 'SQ is wrong' meme again. But was it really broken, or did we just not know how to use it? E.g. if TCP reacted the same way to an SQ as if did to a missing ACK, would that in fact prevent congestive collapses? I'm not sure anyone knows for sure. Perhaps Van felt he didn't need SQ, that the missing ACK was a good enough signal? (Experience shows that there is a lot to that position.) So maybe there was this feeling that 'perfection has been attained ... when there is nothing left to take away', and SQ was ditched as un-needed complexity? I lived through all that, I should know, but I don't! As best I can recall,a I think there was just this feeling of 'we tried SQ and it didn't work'. Does anyone know the whole story of how SQ got dumped? But, as I say, I think it was more that we didn't know how to use it properly - the fault was in us, not SQ. Noel From faber at isi.edu Wed May 21 17:42:38 2014 From: faber at isi.edu (Ted Faber) Date: Wed, 21 May 2014 17:42:38 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 10 In-Reply-To: <20140522000631.D3CBE18C0E4@mercury.lcs.mit.edu> References: <20140522000631.D3CBE18C0E4@mercury.lcs.mit.edu> Message-ID: <537D47FE.5090802@isi.edu> On 05/21/14 17:06, Noel Chiappa wrote: > > From: Bob Braden > > > Source Quench was a perfect example of our pre-VJ cluelesssness. Wwhat > > seemed a plausible congestion control mechanism was in fact completely > > broken. > > Here is the 'SQ is wrong' meme again. But was it really broken, or did we > just not know how to use it? The SQ ICMP packet is a signal from the network to the endpoint that congestion exists. That's what the endpoint wants to know (in the world of congestion control), of course, so sending the message seems a straightforward plan. The explicit message can have all kinds of additional data in it that the endpoint can use to change plans. So far, the fault's in us. Consider, however the case where the path that the SQ will take back to the endpoint is congested itself. People have waved their hands about considerably about how likely that is, but it certainly happens. If the path is congested, the endpoint needs machinery to infer congestion exists and react to it in the absence of quench packets. IMHO, that inference and reaction is required. Not reacting to congestion results in congestion collapse. In a system that relies on explicit congestion signals, attackers may even preferentially corrupt SQ's to cause collapse. Implementing SQ, or any explicit congestion signal, is at best a hint. It doesn't save any code or thinking in the endpoints, because the congestion control system has to work when all SQs get lost or (dun, dun, dun: murdered!). Furthermore several good systems for piggybacking significant congestion information on existing packets exist (from DECBit to XCP with a couple interesting points between). The upshot is that I find SQ pedagogically useful, but practically subsumed by other systems. -- Ted Faber http://www.isi.edu/~faber PGP: http://www.isi.edu/~faber/pubkeys.asc Unexpected attachment on this mail? See http://www.isi.edu/~faber/FAQ.html#SIG -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 246 bytes Desc: OpenPGP digital signature URL: From wes at mti-systems.com Wed May 21 18:40:25 2014 From: wes at mti-systems.com (Wesley Eddy) Date: Wed, 21 May 2014 21:40:25 -0400 Subject: [ih] internet-history Digest, Vol 84, Issue 10 In-Reply-To: <20140522000631.D3CBE18C0E4@mercury.lcs.mit.edu> References: <20140522000631.D3CBE18C0E4@mercury.lcs.mit.edu> Message-ID: <537D5589.4040207@mti-systems.com> On 5/21/2014 8:06 PM, Noel Chiappa wrote: > Does anyone know the whole story of how SQ got dumped? A decent summary of rationale is in RFC 6633 (though it had been out of use long before that was written): http://tools.ietf.org/html/rfc6633 -- Wes Eddy MTI Systems From jack at 3kitty.org Wed May 21 19:06:09 2014 From: jack at 3kitty.org (Jack Haverty) Date: Wed, 21 May 2014 19:06:09 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 10 In-Reply-To: <20140522000631.D3CBE18C0E4@mercury.lcs.mit.edu> References: <20140522000631.D3CBE18C0E4@mercury.lcs.mit.edu> Message-ID: On Wed, May 21, 2014 at 5:06 PM, Noel Chiappa wrote: > Here is the 'SQ is wrong' meme again. But was it really broken, or did we > just not know how to use it? > Personally, I always thought SQ was broken, and said so when we first put it in. The problem was that SQs were to be sent when a packet was discarded somewhere in transit. So a gateway (router) that had to discard a packet because no buffers were available sent a SQ to the Host that had sent that packet. That's the best it could do since it didn't remember any kind of state information or flows etc. SQ was more accurately an "I dropped your packet, sorry about that" report that was just called Source Quench, launched at some possibly irrelevant user process. Since the gateways had no state information about connections, that SQ could have gone to some Host that really had nothing to do with the excessive traffic that was causing the problem. It could also have gone to some process with a TCP connection that had nothing to do with the congestion. It made little sense for a TCP connection that had just opened, or that was already sending only a little data (a User Telnet) to be told to slow down. Dave Mills figured out that an appropriate response to receiving an SQ was to immediately retransmit, since you knew that your packet had been dropped. This was especially appropriate if your system was hung out on the end of a low speed dialup line, and thus very very unlikely to be sending enough traffic to be causing congestion. This of course did nothing to reduce traffic at all. The problem was that an SQ could easily go to a user process that had nothing to do with the congestion being experienced, and could do nothing useful to alleviate that congestion. SQs of course could also create more congestion themselves. I think it would have been possible to make smarter gateways that remembered a lot about recent traffic flows, and could thereby deduce which ones were causing the problem, directing an SQ to a source that would actually be appropriate to slow down. But that would start to look a lot like a virtual circuit net, where the internal mechanisms knew about flows and connections, rather than a datagram one. We already had the ARPANET with such internal mechanisms. IP was supposed to be different, lean and mean with very simple very fast switches and a mix of TCP and UDP traffic. So, yes we didn't know how to use it, but I think it was also inappropriate for a "datagram network". /Jack -------------- next part -------------- An HTML attachment was scrubbed... URL: From sob at harvard.edu Wed May 21 19:39:46 2014 From: sob at harvard.edu (Bradner, Scott) Date: Thu, 22 May 2014 02:39:46 +0000 Subject: [ih] internet-history Digest, Vol 84, Issue 10 Message-ID: <66F2FCCE-1969-43ED-B6CD-488BC464C4EE@harvard.edu> On May 21, 2014, at 8:06 PM, Noel Chiappa wrote: >> Does > anyone know the whole story of how SQ got dumped? fwiw the first dumping I saw was in Router Requirements (RFC 1812) (section 4.3.3.3) "A router SHOULD NOT originate ICMP Source Quench messages." I recall talking at length with Craig Partridge about a draft of that doc on a trans continental flight and specifically talking about SQ Scott From jnc at mercury.lcs.mit.edu Wed May 21 20:15:38 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Wed, 21 May 2014 23:15:38 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 10 Message-ID: <20140522031538.7CD0318C0E7@mercury.lcs.mit.edu> > From: Ted Faber > that inference and reaction is required. Not reacting to congestion > results in congestion collapse. ... > Implementing SQ, or any explicit congestion signal, is at best a hint. > It doesn't save any code or thinking in the endpoints, because the > congestion control system has to work when all SQs get lost Good point. So I'm not sure SQ really solves Brian's problem (how to sort out congestion drops from packets lost because of errors). If you _do_ get an SQ, you _can_ be sure it was congestion - but otherwise... who knows? > Furthermore several good systems for piggybacking significant > congestion information on existing packets exist (from DECBit to XCP Right, but do they solve Brian's problem? I don't remember enough about XCP - I keep re-reading it and then promptly forgetting it again! :-) Whatever the solution is, it _has_ to be something that involves the routers in the middle - because only they know whether, when a packet is tossed in the middle, if it's due to congestion or error. > From: Wesley Eddy > A decent summary of rationale is in RFC 6633 Ah, there's no answer there to my question about 'does SQ work if it is used correctly'. It basically seems to say 'we stopped using this aeons ago, this is the formal death notice'. It sent me to RFC-1812, but that also didn't say directly; _it_ sent me off to two other papers, one which seems not to be online, and the other is only available behind a paywall. It does mention some good reasons not to use SQ (e.g. it might be an attack vector), but again, this doesn't answer the fundamental question (above) - and if it was really useful, there might be away around the DoS etc issues. Noel From jnc at mercury.lcs.mit.edu Wed May 21 20:28:17 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Wed, 21 May 2014 23:28:17 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 10 Message-ID: <20140522032817.9928518C0E7@mercury.lcs.mit.edu> > From: Jack Haverty > So a gateway (router) that had to discard a packet because no buffers > were available Ah, no. I don't know about other routers, but the one I did discarded a packet _when the output queue on a particular network got too long_. My router in fact worked very hard never to run out of buffers, and in fact would start tossing packets _before_ it used up the last buffer. Basically it tried to prevent one backed-up interface from interfering with the the operation of the other interfaces, so once an interface used up more than its 'fair share' of buffers, the code started keeping a rein on it. (I will pass over the details of its algorithm for the moment, I'm not sure it's relevant.) > sent a SQ to the Host that had sent that packet. That's the > best it could do ... that SQ could have gone to some Host that really > had nothing to do with the excessive traffic that was causing the > problem. But that's pretty much what congestion drop is (at least, in the early days; now we know about the evils of tail-drop, and we have RED and all that stuff). You took some random packet, which might or might not be from a problem source, and shot it. And even those new target selection mechanims are still probabilistic - you take some packet and shoot it, and _hope_ you got someone who's _a_ cause of the problem, and not some innocent by-stander, but there's no guarantee - it's just that these new algorithms are _more likely_ to zap a packet from a problem source. > Dave Mills figured out that an appropriate response to receiving an SQ > was to immediately retransmit, since you knew that your packet had been > dropped. I'm tempted to say something really snarky, but I will refrain. :-) > I think it would have been possible to make smarter gateways that > remembered a lot about recent traffic flows, and could thereby deduce > which ones were causing the problem You're talking about a mostly orthogonal problem (identification of actual congestion sources) - see above. That's a non-trivial problem, and there has been a lot of work on it, but I do think it's separable from what I'm focusing on here. Assume for the sake of this discussion that we have a perfect crystal ball procedure that will identify the 'bad packets'. Now, does SQ work if it is used 'correctly' (i.e. as an input to a working congestion control mechanism on the source host, an input that says 'congestion is happening')? Noel From craig at aland.bbn.com Thu May 22 04:25:13 2014 From: craig at aland.bbn.com (Craig Partridge) Date: Thu, 22 May 2014 07:25:13 -0400 Subject: [ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4] Message-ID: <20140522112513.EFDD828E137@aland.bbn.com> > > From: Brian E Carpenter > > > it's surely the case that actual bit destruction causing non-congestive > > packet loss was a much bigger worry in the 1970s than it was ten years > > later? > > I don't recall us worrying about damaged packets, to be honest. If they > happened, they were re-transmitted, and you just didn't notice. I remember damaged packets. They usually came from serial links and wireless (largely satellite links). Louie observed Dave Mills' hard work on the satellite front, but serial links were the norm and often pretty lossy. As I recall, the 1822 spec for ARPANET host connection had several variations, with varying degress of CRCs (and probably other stuff) depending on how far the serial line from the IMP to your host adapter was. Then of course, there was SLIP -- which ran over dialup modem lines with no CRC... Finally, host adapters were (and still are) of varying quality and buffer over/underruns, DMA drops, etc., were common. As an illustration of the severity of the error problem, it was possible to run the NFS distributed file system over UDP with checksums off and checksums on. Checksums off was much faster in the day, and many people believed errors were rare enough this was OK. Many stories of folks who after a few months realized that their filesystem had substantial numbers of corrupted files and switched to checksums on. (Still scary that the TCP checksum was considered strong enough, but...) Thanks! Craig From scott.brim at gmail.com Thu May 22 05:29:20 2014 From: scott.brim at gmail.com (Scott Brim) Date: Thu, 22 May 2014 08:29:20 -0400 Subject: [ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4] In-Reply-To: <20140522112513.EFDD828E137@aland.bbn.com> References: <20140522112513.EFDD828E137@aland.bbn.com> Message-ID: Then there was packet splicing (ref Craig). From jnc at mercury.lcs.mit.edu Thu May 22 08:21:42 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Thu, 22 May 2014 11:21:42 -0400 (EDT) Subject: [ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4] Message-ID: <20140522152142.BFAB318C0BF@mercury.lcs.mit.edu> > From: Craig Partridge > I remember damaged packets. I've been mildly racking my brain, and I just don't recall them being an issue at MIT. I'm sure we had them, but I can only think they didn't pose a big enough issue (performance-wise) that we noticed, and bothered to look into it. Of course, we did always run with checksums on - in fact, we were so devoted to the end-end principle that we even did a ring (the 10mbit/second one) that didn't have a hardware checksum! Had we realized just how useless the TCP/UDP checksum was... :-) (It did have a parity bit, but that was re-calculated at every station - it was designed to find flaky links, not for data safety. I'm not sure we ever bothered to collect statistics and look for flaky links, though! Too many fish...) > the 1822 spec for ARPANET host connection had several variations, with > varying degress of CRCs IIRC, VDH and the VDH replacement, HDH (I think that was the name), had CRCs. LH and DH (the two we used for all our machines at MIT) did not have a checksum. > Then of course, there was SLIP -- which ran over dialup modem lines > with no CRC... We ran an oddball serial line (leased line, but with asynchronous 9600 bps modems :-) between MIT and Proteon; it implemented an idea Dave Reed had for header compression. He noticed that sequential packets were mostly the same in the header, so we kept a copy of the first 32 bytes of the previous packet, and prepended a 32-bit vector, with a bit set in each to indicate that the byte was the same, and only included the bytes that differed in the packet sent over the line. Of course, if the two ends got out of sync, things went south, so the format included a one-byte (minimize line overhead) checksum; when the checksum failed, a 'sync reset' was sent back to the other end, and the next packet down the line was sent un-compressed. But I don't recall that we ever looked at it to see how many packets we were losing, etc, etc! In fact, looking at the code, it doesn't even seem to have counted them (although it did print a log message when it lost sync). And it also didn't count to see how much we were saving with the compression! Sigh! Noel From faber at isi.edu Thu May 22 08:46:16 2014 From: faber at isi.edu (Ted Faber) Date: Thu, 22 May 2014 08:46:16 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 10 In-Reply-To: <20140522031538.7CD0318C0E7@mercury.lcs.mit.edu> References: <20140522031538.7CD0318C0E7@mercury.lcs.mit.edu> Message-ID: <537E1BC8.9000505@isi.edu> On 05/21/14 20:15, Noel Chiappa wrote: > > From: Ted Faber > > > that inference and reaction is required. Not reacting to congestion > > results in congestion collapse. ... > > Implementing SQ, or any explicit congestion signal, is at best a hint. > > It doesn't save any code or thinking in the endpoints, because the > > congestion control system has to work when all SQs get lost > > Good point. > > So I'm not sure SQ really solves Brian's problem (how to sort out congestion > drops from packets lost because of errors). If you _do_ get an SQ, you _can_ > be sure it was congestion - but otherwise... who knows? I don't want to go too far on this list, because I'm not really talking about the history of the thinking. That said :-), there are two problems here: what does a network element know that an endpoint would benefit from knowing, and how does the element communicate that info if it knows any. I don't think there's a compelling case to use a separate packet a la SQ for doing the communication. The info would have to be both extremely important and the benefit of sending it immediately rather than piggybacking extremely high to justify sending a separate packet. That packet's at best a hint so unless the info is unbelievably valuable just send it along in due course. > > > Furthermore several good systems for piggybacking significant > > congestion information on existing packets exist (from DECBit to XCP > > Right, but do they solve Brian's problem? I don't remember enough about > XCP - I keep re-reading it and then promptly forgetting it again! :-) There's a lot in there, IMHO, but I think the XCP model of congestion is fairly conventional - too many packets for the store and forward engine to keep up with. > > Whatever the solution is, it _has_ to be something that involves the > routers in the middle - because only they know whether, when a packet > is tossed in the middle, if it's due to congestion or error. My 2 cents: the congestion/corruption distinction sounds very interesting to explore, but the urge to prematurely optimize is almost pathological. The corruption/congestion line can be blurry depending on the technology - load can make corruption more likely or not - and a good architecture should consider those possibilities. A lot of people seem to come at it bottom up with the attendant preconceptions from their technology and shut the door to interactions that seem obvious to people from other backgrounds. -- Ted Faber http://www.isi.edu/~faber PGP: http://www.isi.edu/~faber/pubkeys.asc Unexpected attachment on this mail? See http://www.isi.edu/~faber/FAQ.html#SIG -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 246 bytes Desc: OpenPGP digital signature URL: From braden at isi.edu Thu May 22 09:56:08 2014 From: braden at isi.edu (Bob Braden) Date: Thu, 22 May 2014 09:56:08 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 11 In-Reply-To: References: Message-ID: <537E2C28.1050206@meritmail.isi.edu> On 5/21/2014 1:14 PM, internet-history-request at postel.org wrote: > Send internet-history mailing list submissions to > internet-history at postel.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://mailman.postel.org/mailman/listinfo/internet-history > or, via email, send a message with subject or body 'help' to > internet-history-request at postel.org > > You can reach the person managing the list at > > > That's it! Again, note the focus on 'hey, we gotta get the user's data there > as fast as we can'. That is basically a correct statement of our focus. And the narrowness of our vision (we were really groping in the dark in an unmapped territory) is well illustrated by our surprise to discover the Silly Window Syndrome. I haven't read Nagle's thing, but that would also be interesting to look at, to see how much we understood at that point. We were of course aware of Nagle's conclusions at the time, but I don't think I actually read his RFC on congestion collapse until a few months ago. It is well worth reading. Bob -------------- next part -------------- An HTML attachment was scrubbed... URL: From louie at transsys.com Thu May 22 10:26:26 2014 From: louie at transsys.com (Louis Mamakos) Date: Thu, 22 May 2014 13:26:26 -0400 Subject: [ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4] In-Reply-To: <20140522112513.EFDD828E137@aland.bbn.com> References: <20140522112513.EFDD828E137@aland.bbn.com> Message-ID: On May 22, 2014, at 7:25 AM, Craig Partridge wrote: >>> From: Brian E Carpenter >> >>> it's surely the case that actual bit destruction causing non-congestive >>> packet loss was a much bigger worry in the 1970s than it was ten years >>> later? >> >> I don't recall us worrying about damaged packets, to be honest. If they >> happened, they were re-transmitted, and you just didn't notice. > > I remember damaged packets. > > They usually came from serial links and wireless (largely satellite links). > Louie observed Dave Mills' hard work on the satellite front, but serial > links were the norm and often pretty lossy. As I recall, the 1822 spec > for ARPANET host connection had several variations, with varying degress of > CRCs (and probably other stuff) depending on how far the serial line from > the IMP to your host adapter was. > > Then of course, there was SLIP -- which ran over dialup modem lines > with no CRC... > > Finally, host adapters were (and still are) of varying quality and > buffer over/underruns, DMA drops, etc., were common. > > As an illustration of the severity of the error problem, it was possible > to run the NFS distributed file system over UDP with checksums off and > checksums on. Checksums off was much faster in the day, and many people > believed errors were rare enough this was OK. Many stories of folks who > after a few months realized that their filesystem had substantial numbers > of corrupted files and switched to checksums on. (Still scary that the > TCP checksum was considered strong enough, but...) > > Thanks! > > Craig Yeah, DMA drops and other hardware problem were not unknown. I recall a time in the mid 1980?s as we were bringing up an Ethernet interface on our UNIVAC mainframe. I was testing with a VAX running (I think) 4.2BSD at the time. Curiously, we were seeing occasional IP and TCP checksum errors on packets that traversed exactly one Ethernet segment. (This was at a time when Ethernet was REAL Ethernet on big fat yellow coaxial cables.) After much investigation we eventually discovered a bug in the DEUNA ethernet interface plugged into a UNIBUS adapter on the VAX. Every so often, a burst of DMA would fail, and the bytes not actually end up in the receive buffer. Apparently even weak checksums can discover and protect against this class of error. Other big fun when a Sun on the LAN had a defective/missing Ethernet address ROM. Reading it yielded all 1 bits. Someone ARP?s for the Sun?s IP address, gets ff:ff:ff:ff:ff:ff (ethernet broadcast address) and now sends packets to that destination. At the time, many hosts helpfully defaulted to IP forwarding being turned on. They?d each receive a packet, decide to helpfully forward it along, ARP, get broadcast, big melt down. So more sanity checks (ARP mappings don?t go to multicast MAC addresses) and maybe defaulting ip_forwarding on isn?t the best decision. It all seems so obvious now. This is the stuff that never ends up being in protocol specifications, but maybe in best practices documents. louie From vint at google.com Thu May 22 11:22:39 2014 From: vint at google.com (Vint Cerf) Date: Thu, 22 May 2014 14:22:39 -0400 Subject: [ih] internet-history Digest, Vol 84, Issue 10 In-Reply-To: <20140522032817.9928518C0E7@mercury.lcs.mit.edu> References: <20140522032817.9928518C0E7@mercury.lcs.mit.edu> Message-ID: Source Quench had the problem right but the solution wrong. If a packet was discarded because of congestion, it was not clear that the party sending THAT packet was the source of congestion. Even if the source backed off, it might still not be solving the problem if the real source of the congestion was a different packet stream. Random dropping had a better probability of "hitting" the real source of congestion but it was still "hit or miss" If there is a solution it may lie in monitoring specific flows - this is something that some routers now do. v -------------- next part -------------- An HTML attachment was scrubbed... URL: From wes at mti-systems.com Thu May 22 11:38:14 2014 From: wes at mti-systems.com (Wesley Eddy) Date: Thu, 22 May 2014 14:38:14 -0400 Subject: [ih] internet-history Digest, Vol 84, Issue 10 In-Reply-To: <20140522031538.7CD0318C0E7@mercury.lcs.mit.edu> References: <20140522031538.7CD0318C0E7@mercury.lcs.mit.edu> Message-ID: <537E4416.5080609@mti-systems.com> On 5/21/2014 11:15 PM, Noel Chiappa wrote: > > From: Wesley Eddy > > > A decent summary of rationale is in RFC 6633 > > Ah, there's no answer there to my question about 'does SQ work if it is used > correctly'. It basically seems to say 'we stopped using this aeons ago, this > is the formal death notice'. It sent me to RFC-1812, but that also didn't say > directly; _it_ sent me off to two other papers, one which seems not to be > online, and the other is only available behind a paywall. > > It does mention some good reasons not to use SQ (e.g. it might be an attack > vector), but again, this doesn't answer the fundamental question (above) - > and if it was really useful, there might be away around the DoS etc issues. Another important point is that we have ECN now. -- Wes Eddy MTI Systems From brian.e.carpenter at gmail.com Thu May 22 13:03:53 2014 From: brian.e.carpenter at gmail.com (Brian E Carpenter) Date: Fri, 23 May 2014 08:03:53 +1200 Subject: [ih] Broadcast storms [Loss as a congestion signal] In-Reply-To: References: <20140522112513.EFDD828E137@aland.bbn.com> Message-ID: <537E5829.6090008@gmail.com> On 23/05/2014 05:26, Louis Mamakos wrote: ... > Other big fun when a Sun on the LAN had a defective/missing Ethernet > address ROM. Reading it yielded all 1 bits. Someone ARP?s for the Sun?s > IP address, gets ff:ff:ff:ff:ff:ff (ethernet broadcast address) and now > sends packets to that destination. At the time, many hosts helpfully > defaulted to IP forwarding being turned on. They?d each receive a > packet, decide to helpfully forward it along, ARP, get broadcast, > big melt down. So more sanity checks (ARP mappings don?t go to > multicast MAC addresses) and maybe defaulting ip_forwarding on isn?t > the best decision. Ah. I wonder if you didn't just debug some of the broadcast storms we used to get at CERN in the mid-80s, when we first ran our home-made site-wide Ethernet bridges. Sometimes we just never found the original source. Brian From jnc at mercury.lcs.mit.edu Fri May 23 11:04:37 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Fri, 23 May 2014 14:04:37 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 11 Message-ID: <20140523180437.58C3E18C110@mercury.lcs.mit.edu> > From: Bob Braden >> I haven't read Nagle's thing, but that would also be interesting to >> look at, to see how much we understood at that point. > We were of course aware of Nagle's conclusions at the time, but I don't > think I actually read his RFC on congestion collapse until a few months > ago. It is well worth reading. Indeed it is. Numerous gems; he probably understood the issues better than anyone else, before Van. Some of them: Adding additional memory to the gateways will not solve the problem. How true! A lesson some people have just (re)-learned recently! the ICMP Source Quench message. With careful handling, we find this adequate to prevent serious congestion problems. We do find it necessary to be careful about the behavior of our hosts .. regarding Source Quench messages. Interesting! This appears to be experimental data to support my thesis, that SQ was not, in fact, fundamentally broken as a congestion signal - although one has to use it properly, as he indicates (above), and here: Implementations of Source Quench entirely within the IP layer are usually unsuccessful because IP lacks enough information to throttle a connection properly. Making the point that it was the congestion response mechanism _in TCP_ that was lacking. And finally this: All our switching nodes send ICMP Source Quench messages well before buffer space is exhausted; they do not wait until it is necessary to drop a message before sending an ICMP Source Quench. which is again, a _very early_ recognition of things like today's congestion bit, etc. Alas, I'm afraid we probably didn't pay as close attention to his work as we should have: John, if you're out there somewhere, my deepest apologies! Noel From jeanjour at comcast.net Fri May 23 13:53:49 2014 From: jeanjour at comcast.net (John Day) Date: Fri, 23 May 2014 16:53:49 -0400 Subject: [ih] internet-history Digest, Vol 84, Issue 11 In-Reply-To: <20140523180437.58C3E18C110@mercury.lcs.mit.edu> References: <20140523180437.58C3E18C110@mercury.lcs.mit.edu> Message-ID: It wasn't just Nagle. LeLann, Gelenbe, and others were doing research on this topic in the early 1970s, it is discussed in IEN#1 in 1977 with the conjecture that ingress flow control was likely to be part of the solution, there was a conference held on the topic in 1979, and Jain's work begins about this time and is published in '82. It seems lots of people understood the problem and were investigating possible solutions. Take care, John At 2:04 PM -0400 5/23/14, Noel Chiappa wrote: > > From: Bob Braden > > >> I haven't read Nagle's thing, but that would also be interesting to > >> look at, to see how much we understood at that point. > > > We were of course aware of Nagle's conclusions at the time, but I don't > > think I actually read his RFC on congestion collapse until a few months > > ago. It is well worth reading. > >Indeed it is. Numerous gems; he probably understood the issues better than >anyone else, before Van. Some of them: > > Adding additional memory to the gateways will not solve the problem. > >How true! A lesson some people have just (re)-learned recently! > > the ICMP Source Quench message. With careful handling, we find this > adequate to prevent serious congestion problems. We do find it necessary > to be careful about the behavior of our hosts .. regarding Source Quench > messages. > >Interesting! This appears to be experimental data to support my thesis, that >SQ was not, in fact, fundamentally broken as a congestion signal - although >one has to use it properly, as he indicates (above), and here: > > Implementations of Source Quench entirely within the IP layer are usually > unsuccessful because IP lacks enough information to throttle a connection > properly. > >Making the point that it was the congestion response mechanism _in TCP_ that >was lacking. > >And finally this: > > All our switching nodes send ICMP Source Quench messages well before buffer > space is exhausted; they do not wait until it is necessary to drop a > message before sending an ICMP Source Quench. > >which is again, a _very early_ recognition of things like today's congestion >bit, etc. > > >Alas, I'm afraid we probably didn't pay as close attention to his work as we >should have: John, if you're out there somewhere, my deepest apologies! > > Noel From jnc at mercury.lcs.mit.edu Sat May 24 08:57:41 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Sat, 24 May 2014 11:57:41 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 11 Message-ID: <20140524155741.A8E8918C0D5@mercury.lcs.mit.edu> > From: John Day >> he probably understood the issues better than anyone else, before Van. > It wasn't just Nagle. LeLann, Gelenbe, and others were doing research > on this topic in the early 1970s, it is discussed in IEN#1 in 1977 > ... there was a conference held on the topic in 1979, and Jain's work > begins about this time and is published in '82. Good point; let me re-phrase my comment: "better than anyone else who was active in the early TCP/IP world". :-) Noel PS: This pre-dates the IETF, of course, so one can't say "in the IETF"! From mfidelman at meetinghouse.net Sat May 24 09:21:04 2014 From: mfidelman at meetinghouse.net (Miles Fidelman) Date: Sat, 24 May 2014 12:21:04 -0400 Subject: [ih] the state of protocol R&D? Message-ID: <5380C6F0.4030909@meetinghouse.net> Hi Folks, For a while, it's been kind of bugging me that the Internet ecosystem is increasingly a world of API's tied to proprietary systems - quite different than the world of interoperable protocols. Sure, every once in a while something new comes along - like RSS and XMPP, but that's more at the fringes - and in a lot of cases we see attempts at things by folks who really don't have a clue (open social comes to mind). Obviously, a lot of that is driven by commercial factors - there's money to be made in centralizing systems and monetizing APIs; not so much for protocols. And it seems like there isn't a lot of R&D funding for such things. Which leads me to wonder - is there much of a protocol r&d community left - academic or otherwise? Or funders? And if so, where do folks "congregate?" For programming languages, there's http://lambda-the-ultimate.org/, conferences like OOPSLA, and there seems to be a steady stream of academic papers. Is there anything left like that for protocol R&D? Miles Fidelman -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra From vint at google.com Sat May 24 14:17:22 2014 From: vint at google.com (Vint Cerf) Date: Sat, 24 May 2014 17:17:22 -0400 Subject: [ih] the state of protocol R&D? In-Reply-To: <5380C6F0.4030909@meetinghouse.net> References: <5380C6F0.4030909@meetinghouse.net> Message-ID: Dtnrg is one such group and a BOF is planned at next IETF for a WG. On May 24, 2014 12:37 PM, "Miles Fidelman" wrote: > Hi Folks, > > For a while, it's been kind of bugging me that the Internet ecosystem is > increasingly a world of API's tied to proprietary systems - quite different > than the world of interoperable protocols. Sure, every once in a while > something new comes along - like RSS and XMPP, but that's more at the > fringes - and in a lot of cases we see attempts at things by folks who > really don't have a clue (open social comes to mind). > > Obviously, a lot of that is driven by commercial factors - there's money > to be made in centralizing systems and monetizing APIs; not so much for > protocols. And it seems like there isn't a lot of R&D funding for such > things. > > Which leads me to wonder - is there much of a protocol r&d community left > - academic or otherwise? Or funders? And if so, where do folks > "congregate?" For programming languages, there's > http://lambda-the-ultimate.org/, conferences like OOPSLA, and there seems > to be a steady stream of academic papers. Is there anything left like that > for protocol R&D? > > Miles Fidelman > > -- > In theory, there is no difference between theory and practice. > In practice, there is. .... Yogi Berra > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brian.e.carpenter at gmail.com Sat May 24 15:26:11 2014 From: brian.e.carpenter at gmail.com (Brian E Carpenter) Date: Sun, 25 May 2014 10:26:11 +1200 Subject: [ih] the state of protocol R&D? In-Reply-To: References: <5380C6F0.4030909@meetinghouse.net> Message-ID: <53811C83.9050705@gmail.com> Isn't this what the IRTF is for? It currently has 9 research groups and there's a procedure for proposing new ones. I also see quite a lot of protocol R&D in SIGCOMM agendas, and CCR generally. I also used to be deluged with MSc and PhD applications from people interested in layer 2 mesh networks and the like, but that probably isn't what you mean. Brian On 25/05/2014 09:17, Vint Cerf wrote: > Dtnrg is one such group and a BOF is planned at next IETF for a WG. > On May 24, 2014 12:37 PM, "Miles Fidelman" > wrote: > >> Hi Folks, >> >> For a while, it's been kind of bugging me that the Internet ecosystem is >> increasingly a world of API's tied to proprietary systems - quite different >> than the world of interoperable protocols. Sure, every once in a while >> something new comes along - like RSS and XMPP, but that's more at the >> fringes - and in a lot of cases we see attempts at things by folks who >> really don't have a clue (open social comes to mind). >> >> Obviously, a lot of that is driven by commercial factors - there's money >> to be made in centralizing systems and monetizing APIs; not so much for >> protocols. And it seems like there isn't a lot of R&D funding for such >> things. >> >> Which leads me to wonder - is there much of a protocol r&d community left >> - academic or otherwise? Or funders? And if so, where do folks >> "congregate?" For programming languages, there's >> http://lambda-the-ultimate.org/, conferences like OOPSLA, and there seems >> to be a steady stream of academic papers. Is there anything left like that >> for protocol R&D? >> >> Miles Fidelman >> >> -- >> In theory, there is no difference between theory and practice. >> In practice, there is. .... Yogi Berra >> >> > From craig at aland.bbn.com Sat May 24 17:24:08 2014 From: craig at aland.bbn.com (Craig Partridge) Date: Sat, 24 May 2014 20:24:08 -0400 Subject: [ih] the state of protocol R&D? Message-ID: <20140525002408.84B9728E137@aland.bbn.com> Hi Miles: There's is a protocol R&D community. Mostly academic folks (and the list of universities is long). A few industry folks (BBN has about 100 folks doing networking research, many of them protocol work; ISI has, I think, about 35; IBM has some; Google has a few; Telefonica has some). In terms of where folks publish -- the venues have been diffuse. SIGCOMM and its workshops are the best starting point, but some of the best protocol research discussions I've seen over the past 5 years or so have been at Dagstuhl and Monte Verita and NSF program meetings (cf. the next generation Internet meetings). Thanks! Craig > Which leads me to wonder - is there much of a protocol r&d community > left - academic or otherwise? Or funders? And if so, where do folks > "congregate?" For programming languages, there's > http://lambda-the-ultimate.org/, conferences like OOPSLA, and there > seems to be a steady stream of academic papers. Is there anything left > like that for protocol R&D? > > Miles Fidelman > > -- > In theory, there is no difference between theory and practice. > In practice, there is. .... Yogi Berra ******************** Craig Partridge Chief Scientist, BBN Technologies E-mail: craig at aland.bbn.com or craig at bbn.com Phone: +1 517 324 3425 From mfidelman at meetinghouse.net Sat May 24 17:33:47 2014 From: mfidelman at meetinghouse.net (Miles Fidelman) Date: Sat, 24 May 2014 20:33:47 -0400 Subject: [ih] the state of protocol R&D? In-Reply-To: <20140525002408.84B9728E137@aland.bbn.com> References: <20140525002408.84B9728E137@aland.bbn.com> Message-ID: <53813A6B.1070500@meetinghouse.net> Craig, Thanks! Good to hear that things are still going on, if not as visibly as in the past. Are there any good email lists or blogs where people discuss such things on an ongoing basis? At least, to a degree, I guess I'm bemoaning what I see as a general shift in thinking - away from "let's solve problem with a new protocol" and towards "let's build a new platform, with an exposed API." It seems to be shaping the approaches people take to designing distributed system, as well as the components and tools available for doing so. Just a few years ago, I was able to find funding for some work on protocols for distributed planning and collaboration (first AFRL, then CECOM) - but it seems like a lot of the funding has shifted elsewhere (mostly cyber security), and, things have become, as you put it, "diffuse." I worry more than a little that knowledge and approaches to things are starting to get lost - in part because the work isn't more visible. Miles Craig Partridge wrote: > Hi Miles: > > There's is a protocol R&D community. Mostly academic folks (and the list > of universities is long). A few industry folks (BBN has about 100 folks > doing networking research, many of them protocol work; ISI has, I think, > about 35; IBM has some; Google has a few; Telefonica has some). > > In terms of where folks publish -- the venues have been diffuse. SIGCOMM > and its workshops are the best starting point, but some of the best > protocol research discussions I've seen over the past 5 years or so have > been at Dagstuhl and Monte Verita and NSF program meetings (cf. the next > generation Internet meetings). > > Thanks! > > Craig > >> Which leads me to wonder - is there much of a protocol r&d community >> left - academic or otherwise? Or funders? And if so, where do folks >> "congregate?" For programming languages, there's >> http://lambda-the-ultimate.org/, conferences like OOPSLA, and there >> seems to be a steady stream of academic papers. Is there anything left >> like that for protocol R&D? >> >> Miles Fidelman >> >> -- >> In theory, there is no difference between theory and practice. >> In practice, there is. .... Yogi Berra > ******************** > Craig Partridge > Chief Scientist, BBN Technologies > E-mail: craig at aland.bbn.com or craig at bbn.com > Phone: +1 517 324 3425 -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra From dhc2 at dcrocker.net Sat May 24 19:18:40 2014 From: dhc2 at dcrocker.net (Dave Crocker) Date: Sat, 24 May 2014 19:18:40 -0700 Subject: [ih] the state of protocol R&D? In-Reply-To: <20140525002408.84B9728E137@aland.bbn.com> References: <20140525002408.84B9728E137@aland.bbn.com> Message-ID: <53815300.3080700@dcrocker.net> On 5/24/2014 5:24 PM, Craig Partridge wrote: > There's is a protocol R&D community. Mostly academic folks (and the list > of universities is long). A few industry folks (BBN has about 100 folks > doing networking research, many of them protocol work; ISI has, I think, > about 35; IBM has some; Google has a few; Telefonica has some). > > In terms of where folks publish -- the venues have been diffuse. SIGCOMM > and its workshops are the best starting point, but some of the best > protocol research discussions I've seen over the past 5 years or so have > been at Dagstuhl and Monte Verita and NSF program meetings (cf. the next > generation Internet meetings). Hmmm. Should the IRTF attempt a kind of open-source listing of places and activities that could be classed as 'networking research'? For new protocols, I've found it helpful to have an open-to-anyone registry. Claim that you support the protocol and you get listed. (cf, http://dkim.org/deploy). The registry does not vet claimants; just lists them. This gives interested parties a place for finding implementations or consultant. It probably would be easy and probably would be helpful, for irtf.org to set up something similar, for research, such as a "Community Research Activities" trac wiki. d/ -- Dave Crocker Brandenburg InternetWorking bbiw.net From paul at redbarn.org Sat May 24 20:10:39 2014 From: paul at redbarn.org (Paul Vixie) Date: Sat, 24 May 2014 20:10:39 -0700 Subject: [ih] the state of protocol R&D? In-Reply-To: <53815300.3080700@dcrocker.net> References: <20140525002408.84B9728E137@aland.bbn.com> <53815300.3080700@dcrocker.net> Message-ID: <53815F2F.4080603@redbarn.org> distributed system research tends to focus on measurable results, which means it's glued to the side of the DIY hardware world (raspberry pi, etc) and the open source software world. within those worlds there are two relevant signalling levels: applications and physical/network/transport. applications, to be practical at scale, have to live inside HTTP or HTTPS, or at worst, SSH. (think of rsync or git here.) when we build things that we want people to be able to use, we pick a common denominator like "web" or "reliable stream". new UDP applications in the style of DNS or NTP cannot be created in today's global internet since so many edge and middle boxes won't allow anything they don't understand. physical research, for practicality purposes, is focused on "how else could we carry IEEE 802 around?" network research, for practical purposes, is dead. IPv6 might catch on some day but it's the last change we'll ever see. transport research, for practical purposes, is dead. SCTP is better in every way than TCP, but see "edge and middle box" comment above. we are in other words in the post-research phase of planet scale networking, shackled to the wildcat success of the lab grade toy technology called "the internet". that to me is the state of "protocol R&D", and those are the underlying reasons why there's not a lot of it going on. (yes, i am still bitter about RFC 6013 not being allowed to have a TCP type code. bear with me while i get over it.) vixie From lars at netapp.com Mon May 26 01:40:12 2014 From: lars at netapp.com (Eggert, Lars) Date: Mon, 26 May 2014 08:40:12 +0000 Subject: [ih] the state of protocol R&D? In-Reply-To: <53815300.3080700@dcrocker.net> References: <20140525002408.84B9728E137@aland.bbn.com> <53815300.3080700@dcrocker.net> Message-ID: <53830894-ADB7-46FB-BEF3-DB28C0A79831@netapp.com> Hi, On 2014-5-25, at 4:18, Dave Crocker wrote: > Hmmm. Should the IRTF attempt a kind of open-source listing of places > and activities that could be classed as 'networking research'? so that would certainly be something that could be done under the IRTF umbrella. We have wikis and can get other tooling set up as needed. But: As with anything, it requires someone to feel strongly enough about it to spend their own time on it and make it happen. Unless you mean "I want do to X in the IRTF" when you write "the IRTF should", it's unlikely that anyone else will do it. Lars (as IRTF chair) PS: irtf-discuss at irtf.org might be a better list for discussing this, however. > > For new protocols, I've found it helpful to have an open-to-anyone > registry. Claim that you support the protocol and you get listed. (cf, > http://dkim.org/deploy). The registry does not vet claimants; just > lists them. This gives interested parties a place for finding > implementations or consultant. > > It probably would be easy and probably would be helpful, for irtf.org to > set up something similar, for research, such as a "Community Research > Activities" trac wiki. > > d/ > -- > Dave Crocker > Brandenburg InternetWorking > bbiw.net -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 273 bytes Desc: Message signed with OpenPGP using GPGMail URL: From jack at 3kitty.org Mon May 26 10:50:02 2014 From: jack at 3kitty.org (Jack Haverty) Date: Mon, 26 May 2014 10:50:02 -0700 Subject: [ih] The History of Protocol R&D? Message-ID: All the discussion of algorithms and protocols has me wondering about the History of such research over the 30+ year experiment of The Internet. I apologize if this is common knowledge, but I haven't been tracking such work for many years now, so I may easily have missed it. But, since this is "internet-history", it seems like a good place to ask. Research is usually characterized by the use of the Scientific Method, where someone has a new idea and then performs experiments to validate the expected results. That often involves a series of steps - e.g., maybe a mathematical model, perhaps a simulation, and eventually a real-world live test, instrumented so that the validity of the new idea can be observed in action. In the 1980s early TCP work, there was a lot of discussion about algorithms and protocols for things like retransmission techniques. We also necessarily made a lot of assumptions about how the real world would or should behave. E.G., one question was "What percentage of packets should we expect to lose in transit?" - from discards, or checksum errors, or whatever. The consensus was 1% - although that number was totally just pulled out of the air. But it helped focus the discussions about appropriate algorithms and expected results. TCP and router implementations had various tools for looking at behavior in live situations, e.g,. counters of numbers of packets discarded in routers, or counters of numbers of duplicate packets received by a TCP. When we tried out some new idea, we could observe its actual effects by looking at such data collected, and judge whether or not the new idea was actually a good one. The Internet was of course tiny in those early days, and thus much easier to observe as it operated. AFAIK, those counters evolved into a more formal and organized mechanism for collecting data, e.g., by the definition of a MIB circa 1988, and a series of additions and refinements in later RFCs. These presumably made it possible to collect such data in a cohesive way as The Internet has grown and evolved over the last 25 years. But I have no idea whether or not anyone has been doing such work, or even whether or not such mechanisms as MIBs have been actually deployed or used in any significant scale. So, my basic questions are: - "How do researchers now do protocol R&D, and validate a new idea by measurements in the live Internet?" - "What have been the results over the life of The Internet so far?" I assume that use of mathematical models, simulations, and anecdotal experiments is common and publicized in papers, theses, et al, but how are ideas subsequently validated in the broad Internet world, and the results of models and simulations and lab tests verified in the large scale world of The Internet? Looking back at the History of The Internet, there have been a stream of new ideas, e.g., in the algorithms for congestion control, to name just one research topic. How was each idea validated in the live Internet? What metrics were observed to improve? Taking a specific concrete case - was our guesstimate of 1% as a "normal" packet loss rate valid? We used to look at the counters and if the rate was much higher than that we took it as an indication of a problem to be investigated and addressed. Has the packet-loss-rate of The Internet been going up, or down, over the last 30 years? Has the duplicate-packets rate improved? (Or whatever other metrics might have surfaced as a measure of proper behavior) Did the metric change positively in response to deploying some new idea (e.g., a new congestion control algorithm)? Today, TCP is everywhere, and packets are presumably getting discarded, retransmitted, mangled, and delivered, with the power of TCP still hiding most of the carnage inside from the users outside. Have our improved algorithms been getting better at delivering the user data with less and less carnage? I vaguely recall that we invented a metric called something like "byte-miles-per-user-byte", that would simply measure how many bytes were transported how many miles, for each byte of data successfully delivered to the user process by a TCP connection. The ultimate theoretical goal of course was just the line-of-sight distance in miles between the two endpoints - i.e., the metric for an actual physical error-free real circuit limited by the physics of the speed of light. But retransmissions, congestion discards, routing decisions, and other such internal mechanisms of the Virtual Circuits of The Internet, would dictate how close reality came to that theoretical limit. My TVs have TCP, and it can stream video from halfway around the world. So can my phone(s). So can the other millions of devices out there. If I could look at the Wastebins of the Internet after a typical day, how big a pile of discarded packets would I find, in the various hosts, routers, etc. out there? Over the History of The Internet, how has that daily operational experience been changing? How much observed effect have the new algorithms had on getting closer to that theoretical ideal behavior of one byte-mile per user byte-mile? Are the new algorithms even implemented in those devices? Is anybody watching the gauges and dials of The Internet? /Jack Haverty -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Mon May 26 14:20:12 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Mon, 26 May 2014 23:20:12 +0200 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> Message-ID: <5383B00C.9040602@web.de> Am 21.05.2014 21:08, schrieb Noel Chiappa: > And as to what to do when a timeout happened (which usually, although not > always, indicates that a packet has been dropped due to congestion), it says: > > if the retransmission timeout expires on a segment in the retransmission > queue, send the segment at the front of the retransmission queue again > [and] reinitialize the retransmission timer thanks a lot, I should read texts much more carefully. > That's it! Again, note the focus on 'hey, we gotta get the user's data there > as fast as we can'. However, at a first glance (I will implement this) this appears as a "quasi GBN". Do I understand this correctly: When the packet is retransmitted, a copy is _appended_ to the retransmission queue? > > Absolutely no thought given to 'hey, maybe that packet was lost through > congestion, maybe I ought to consider that, and if so, how to respond'. The > later 'if you have a congesitve loss, back off exponentially to prevent > congestive overload' stuff is completely missing (and would be for some time, > probably until Van, I would guess). > > I fairly vividly remember being the IETF where Van gave his first talk about > his congestion work, and when he started talking about how a lost packet was > a congestion signal, I think we all went 'wow, that's so obvious - how come > we never thought of that'! > > The TCP RFC does, however, spend a great deal of time talking about the window > (which is purely a destination host buffering thing at that point). > > > Looking at RFC-792 (ICMP) there's a fair chunk in the Source Quench > section about congestion, but little in the way of a solid algorithmic > suggestion: > > the source host should cut back the rate at which it is sending traffic to > the specified destination until it no longer receives source quench > messages from the gateway > > (And I still think SQ has gotten a bad rap about being ineffective and/or > making things worse; I would love to see some rigorous investigation of > SQ. But I digress...) > > > The 'Dave Clark 5' RFCs are similarly thin on congestion-related content: > RFC-813, "Window and Acknowledgement Strategy in TCP" (which one would assume > would be the place to find it) doesn't even contain the word 'congestion'! > It's all about host buffering, etc. (Indeed, it suggests delayed ACKs, > which interrupts the flow of ACKs which are an important signal in VJCC.) > And one also finds this gem: > > the disadvantage of burstiness is that it may cause buffers to overflow, > either in the eventual recipient .. or in an intermediate gateway, a > problem ignored in this paper. > > It's interesting to see what _is_ covered in the DC5 set, and similar > writings: Dave goes into Silly Window Syndome at some length, but there's > nothing about congestive losses. > > Lixia's "Why TCP Timers Don't Work Well" paper is, I think, a valuable > snap-shot of thinking about related topics pre-Van; it too doesn't have much > about congestive losses, mentioning them only briefly. The general sense one > gets from reading it is that 'the increased measured RTT caused by congestive > losses will cause people to back off enough to get rid of the congestion' > (which wasn't true, of course). > > I haven't read Nagle's thing, but that would also be interesting to look at, > to see how much we understood at that point. > > > So I think congestion control was so lacking, in part, because we just hadn't > run into it as a serious problem. Yes, we knew that _in theory_ congestion > was possible, and we'd added some stuff for it (SQ), but we just hadn't seen > it a lot - and we probably hadn't seen how _bad_ it could get back then. > > (Although experience at MIT with Sorcerer's Apprentice had shown us how bad > congestive collapse _could_ get - and I seem to recall hearing that PARC had > seen a similar thing. But I suspect the particular circumstances of SAS, with > the exponential increases in volume, even though it was - in theory! - a > 'single packet outstanding' protocol, might have led us to believe that it > was a pathological case, one that didn't have any larger lessons.) > > We were off fixing other alligators (SWS, etc) that actually had bitten us > at that point... > > So I suspect it was only when congestive collapse hit on the ARPANET section > of the Internet (shortly before Van's work) that it really got a focus. > > Noel -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From louie at transsys.com Mon May 26 16:46:20 2014 From: louie at transsys.com (Louis Mamakos) Date: Mon, 26 May 2014 19:46:20 -0400 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <5383B00C.9040602@web.de> References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> <5383B00C.9040602@web.de> Message-ID: <6BF9716E-D958-4758-9C93-AB19E5930667@transsys.com> On May 26, 2014, at 5:20 PM, Detlef Bosau wrote: > Am 21.05.2014 21:08, schrieb Noel Chiappa: > > >> And as to what to do when a timeout happened (which usually, although not >> always, indicates that a packet has been dropped due to congestion), it says: >> >> if the retransmission timeout expires on a segment in the retransmission >> queue, send the segment at the front of the retransmission queue again >> [and] reinitialize the retransmission timer > > thanks a lot, I should read texts much more carefully. >> That's it! Again, note the focus on 'hey, we gotta get the user's data there >> as fast as we can'. > > However, at a first glance (I will implement this) this appears as a > "quasi GBN". > > Do I understand this correctly: When the packet is retransmitted, a copy > is _appended_ to the retransmission queue? Just to be clear, of the 4 or 5 different TCP stacks I?ve crawled around in and/or co-authored in one case, the contents of the send window are retransmitted, not the packet. I?ve not seen a particular TCP implementation that keeps previously transmitted segments around for retransmission. (I can see how some low memory, constrained implementations might make a choice to keep previously transmitted packets around, however, and this lets them re-use the same fragmentation ID in the IP header, too. Every TCP stack I?ve seen just regenerates segments, and the retransmit queue is really the TCP send window. Certainly the IP stack we did for the UNIVAC 1100 did this, the various 2.{8,9,10,11} BSD and 4.x BSD Berkeley stacks did this, as probably did the BBN TCP stack for 4.1. Pretty sure the Fuzzball TCP stack also used this strategy. I can often be the case, of course, that the retransmission attempt generates a segment larger the the one that wasn?t acknowledged if additional data was placed into the send window by the local application. Might was well fill up the retransmitted packet to the the MSS. This was pretty obvious to see with interactive (e.g., telnet) traffic and Nagle?s algorithm that would suppress additional tinygrams until the ACK was returned. louie From jack at 3kitty.org Mon May 26 18:04:14 2014 From: jack at 3kitty.org (Jack Haverty) Date: Mon, 26 May 2014 18:04:14 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <6BF9716E-D958-4758-9C93-AB19E5930667@transsys.com> References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> <5383B00C.9040602@web.de> <6BF9716E-D958-4758-9C93-AB19E5930667@transsys.com> Message-ID: My recollections agree with Louie's - TCPs created a new packet from the unacked window contents when a retransmission was generated. However, I think it's still an interesting question if there are TCPs out "in the wild" of the global Internet that were implemented as the designer interpreted the published spec, and thus may have kept packets for retransmission when needed. I suspect all of the TCPs that "we" know about don't do that. But "we" were involved in the Internet research community and the discussions on mail lists, RFCs, conferences, etc. Many engineers were not. What do TCPs look like internally when they were implemented by someone who only had access to the published specs - i.e., the "Official Standard" RFCs of the day? How did an engineer at a cellular radio company, or a consumer appliance company, implement a TCP, especially if their experience had been with other technologies? I encountered something like this back in the 1990 timeframe, when my cousin called to ask me about some aspects of the X.25 specification that were unclear to him. He was an RF engineer, but had gotten the assignment to add X.25 to some system doing radio stuff. As I talked with him, it became clear that he was interpreting the tome of the X.25 spec quite precisely and literally, and what he was focused on had little relationship to how X.25 was actually being used in real deployed networks. His approach would have created an implementation that was functionally correct, but far from the "best practice" that had developed in the networking community. I can easily see how a similar situation could occur with TCP and its related protocols. This is what motivated my question in that other thread about observing today's live large-scale Internet and its behavior over history. I have no idea how the millions (billions?) of TCPs out there have actually been implemented, what algorithms they chose to incorporate, and how it's been working out. Related curiousity question - does Internet traffic today actually get Fragmented? How's that been working? /Jack Haverty On Mon, May 26, 2014 at 4:46 PM, Louis Mamakos wrote: > > On May 26, 2014, at 5:20 PM, Detlef Bosau wrote: > > > Am 21.05.2014 21:08, schrieb Noel Chiappa: > > > > > >> And as to what to do when a timeout happened (which usually, although > not > >> always, indicates that a packet has been dropped due to congestion), it > says: > >> > >> if the retransmission timeout expires on a segment in the > retransmission > >> queue, send the segment at the front of the retransmission queue again > >> [and] reinitialize the retransmission timer > > > > thanks a lot, I should read texts much more carefully. > >> That's it! Again, note the focus on 'hey, we gotta get the user's data > there > >> as fast as we can'. > > > > However, at a first glance (I will implement this) this appears as a > > "quasi GBN". > > > > Do I understand this correctly: When the packet is retransmitted, a copy > > is _appended_ to the retransmission queue? > > Just to be clear, of the 4 or 5 different TCP stacks I?ve crawled around in > and/or co-authored in one case, the contents of the send window are > retransmitted, > not the packet. I?ve not seen a particular TCP implementation that keeps > previously transmitted segments around for retransmission. (I can see how > some low memory, constrained implementations might make a choice to keep > previously transmitted packets around, however, and this lets them re-use > the same fragmentation ID in the IP header, too. > > Every TCP stack I?ve seen just regenerates segments, and the retransmit > queue is really the TCP send window. Certainly the IP stack we did for > the UNIVAC 1100 did this, the various 2.{8,9,10,11} BSD and 4.x BSD > Berkeley > stacks did this, as probably did the BBN TCP stack for 4.1. Pretty sure > the Fuzzball TCP stack also used this strategy. > > I can often be the case, of course, that the retransmission attempt > generates > a segment larger the the one that wasn?t acknowledged if additional data > was > placed into the send window by the local application. Might was well fill > up > the retransmitted packet to the the MSS. This was pretty obvious to see > with > interactive (e.g., telnet) traffic and Nagle?s algorithm that would > suppress > additional tinygrams until the ACK was returned. > > louie > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfidelman at meetinghouse.net Mon May 26 19:12:57 2014 From: mfidelman at meetinghouse.net (Miles Fidelman) Date: Mon, 26 May 2014 22:12:57 -0400 Subject: [ih] the state of protocol R&D? In-Reply-To: <53830894-ADB7-46FB-BEF3-DB28C0A79831@netapp.com> References: <20140525002408.84B9728E137@aland.bbn.com> <53815300.3080700@dcrocker.net> <53830894-ADB7-46FB-BEF3-DB28C0A79831@netapp.com> Message-ID: <5383F4A9.5030905@meetinghouse.net> I'm kind of thinking more like the Command and Control Research Program (dodccrp.org) - which sponsors(ed?) a lot of network centric warfare related work, published the C2 Journal, and still seems to organize the ICCRTS and CCRTS conferences - lots of work on network interoperability for warfighting applications. There's also used to be a pretty active community around distributed simulation protocols. Somehow, one would think that either IRTF would be providing some kind of broader-based focal point. Eggert, Lars wrote: > Hi, > > On 2014-5-25, at 4:18, Dave Crocker wrote: >> Hmmm. Should the IRTF attempt a kind of open-source listing of places >> and activities that could be classed as 'networking research'? > so that would certainly be something that could be done under the IRTF umbrella. We have wikis and can get other tooling set up as needed. > > But: As with anything, it requires someone to feel strongly enough about it to spend their own time on it and make it happen. Unless you mean "I want do to X in the IRTF" when you write "the IRTF should", it's unlikely that anyone else will do it. > > Lars > (as IRTF chair) > > PS: irtf-discuss at irtf.org might be a better list for discussing this, however. > >> For new protocols, I've found it helpful to have an open-to-anyone >> registry. Claim that you support the protocol and you get listed. (cf, >> http://dkim.org/deploy). The registry does not vet claimants; just >> lists them. This gives interested parties a place for finding >> implementations or consultant. >> >> It probably would be easy and probably would be helpful, for irtf.org to >> set up something similar, for research, such as a "Community Research >> Activities" trac wiki. >> >> d/ >> -- >> Dave Crocker >> Brandenburg InternetWorking >> bbiw.net -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra From jnc at mercury.lcs.mit.edu Mon May 26 20:30:30 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Mon, 26 May 2014 23:30:30 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 4 Message-ID: <20140527033030.C5E1418C125@mercury.lcs.mit.edu> > From: Detlef Bosau > I should read texts much more carefully. This is rather an ironic comment, given the next... >> if the retransmission timeout expires on a segment in the >> retransmission queue, send the segment at the front of the >> retransmission queue again [and] reinitialize the retransmission >> timer > Do I understand this correctly: When the packet is retransmitted, a > copy is _appended_ to the retransmission queue? If you would examine RFC-793, you would see that by "retransmission queue" it means 'un-acknowledged data queue'. So your question makes no sense, on two grounds: first, the data being re-transmitted is _already_ in that queue; second, it is at the _start_ of that 'queue' (buffer, actually - although it's implementation-specific how un-acknowledge data is held - Dave Clark did a TCP for use with User TELNET awhich kept un-acknowledged data in a shift register :-), adding it at the end would be an error. Look, you need to understand how primitive our understanding, algorithms, etc were when RFC-793 was written. There was no sophisticated algorithm associated with re-transmission: when the (sole) re-transmission timer went off, the code re-sent the oldest un-acknowledged data again. That's all. Full stop. If you read RFC-793 (and the other ones I have mentioned), you will get a good sense of our level of understanding, how primitive many of our algorithms, etc were at that point. Noel From lars at netapp.com Mon May 26 23:34:09 2014 From: lars at netapp.com (Eggert, Lars) Date: Tue, 27 May 2014 06:34:09 +0000 Subject: [ih] the state of protocol R&D? In-Reply-To: <5383F4A9.5030905@meetinghouse.net> References: <20140525002408.84B9728E137@aland.bbn.com> <53815300.3080700@dcrocker.net> <53830894-ADB7-46FB-BEF3-DB28C0A79831@netapp.com> <5383F4A9.5030905@meetinghouse.net> Message-ID: Hi Miles, On 2014-5-27, at 4:12, Miles Fidelman wrote: > I'm kind of thinking more like the Command and Control Research Program (dodccrp.org) - which sponsors(ed?) a lot of network centric warfare related work, published the C2 Journal, and still seems to organize the ICCRTS and CCRTS conferences - lots of work on network interoperability for warfighting applications. There's also used to be a pretty active community around distributed simulation protocols. Somehow, one would think that either IRTF would be providing some kind of broader-based focal point. the IRTF does what its participants are interested in spending their time on. It operates completely by volunteer cycles. The IRTF can do many things under its charter, probably including what you suggest above, but it always starts and ends with volunteers who want to see something happen and put their own time towards it. Lars -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 273 bytes Desc: Message signed with OpenPGP using GPGMail URL: From lars at netapp.com Mon May 26 23:43:56 2014 From: lars at netapp.com (Eggert, Lars) Date: Tue, 27 May 2014 06:43:56 +0000 Subject: [ih] The History of Protocol R&D? In-Reply-To: References: Message-ID: <2E47785F-3263-4266-A3BA-87091B5BCC21@netapp.com> Hi, good questions. Two quick points in reply: On 2014-5-26, at 19:50, Jack Haverty wrote: > I assume that use of mathematical models, simulations, and anecdotal experiments is common and publicized in papers, theses, et al, but how are ideas subsequently validated in the broad Internet world, and the results of models and simulations and lab tests verified in the large scale world of The Internet? this has been getting increasingly difficult with the commercialization of the Internet. Unless you have close ties with an entity that is running productions networks or datacenters, or is controlling a sizable fraction of the end systems, it is very difficult to do any realistic verification. (I used to think the situation in the 90s was bad, where only researchers with ties to operators could really do practically meaningful routing work, but compared to the folks operating datacenters or controlling mobile platforms, operators are a pleasure to deal with.) > Taking a specific concrete case - was our guesstimate of 1% as a "normal" packet loss rate valid? We used to look at the counters and if the rate was much higher than that we took it as an indication of a problem to be investigated and addressed. > > Has the packet-loss-rate of The Internet been going up, or down, over the last 30 years? Has the duplicate-packets rate improved? (Or whatever other metrics might have surfaced as a measure of proper behavior) > > Did the metric change positively in response to deploying some new idea (e.g., a new congestion control algorithm)? In terms of measuring various aspects of the (publicly observable) Internet, the proceedings of the ACM IMC conference have been consistently having relevant papers: http://www.sigcomm.org/events/imc-conference > My TVs have TCP, and it can stream video from halfway around the world. So can my phone(s). So can the other millions of devices out there. If I could look at the Wastebins of the Internet after a typical day, how big a pile of discarded packets would I find, in the various hosts, routers, etc. out there? Over the History of The Internet, how has that daily operational experience been changing? How much observed effect have the new algorithms had on getting closer to that theoretical ideal behavior of one byte-mile per user byte-mile? > > Are the new algorithms even implemented in those devices? Is anybody watching the gauges and dials of The Internet? In terms of TCP, the IETF's working groups around the protocol (TCPM, MPTCP, etc.) usually get a pretty good understanding of what's seeing deployment where. (Again, not so much in terms of inside datacenters though - it's considered the secret sauce.) Lars -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 273 bytes Desc: Message signed with OpenPGP using GPGMail URL: From detlef.bosau at web.de Tue May 27 01:59:07 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Tue, 27 May 2014 10:59:07 +0200 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <6BF9716E-D958-4758-9C93-AB19E5930667@transsys.com> References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> <5383B00C.9040602@web.de> <6BF9716E-D958-4758-9C93-AB19E5930667@transsys.com> Message-ID: <538453DB.8060502@web.de> Am 27.05.2014 01:46, schrieb Louis Mamakos: > Just to be clear, of the 4 or 5 different TCP stacks I?ve crawled around in > and/or co-authored in one case, the contents of the send window are retransmitted, > not the packet. I just postponed my suicide plans and know, you write that :-( *sniff* What I thought would be true until about 4 weeks ago was: 1. There is basically some routine "send_much" which, when called, sends any data provided by the application from snd.nxt up to (snd.next+snd.wnd-1). 2. send_much() is called when on either one of the following three events: a) When the sending application provides data (iow: when a write() call occurs), b) When a socket receives a packet from its peer (and hence may have updated snd.una) c) When a retransmission time out occurs. (Or in case of Tahoe: On a 3DA. Reno behaves a bit differently here.) > I?ve not seen a particular TCP implementation that keeps > previously transmitted segments around for retransmission. (I can see how > some low memory, constrained implementations might make a choice to keep > previously transmitted packets around, however, and this lets them re-use > the same fragmentation ID in the IP header, too. Louis, for years I got nuts about Karn's algorithm and how this is implemented correctly, and actually, this was the question that initiated the whole discussion ;-) What I just wanted to do before reading your post was to investigate, how a retransmission scheme using a retransmission queue deals with varying window sizes which may result from slow start or congestion avoidance. What happens, when the first packet in a retransmission queue is beyond the allowed window? > Every TCP stack I?ve seen just regenerates segments, and the retransmit > queue is really the TCP send window. And so behaves the TCP in the NS-2, IIRC. However, this would support my conjecture, we did basically GBN in TCP. -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From detlef.bosau at web.de Tue May 27 02:18:54 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Tue, 27 May 2014 11:18:54 +0200 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: <20140527033030.C5E1418C125@mercury.lcs.mit.edu> References: <20140527033030.C5E1418C125@mercury.lcs.mit.edu> Message-ID: <5384587E.9090804@web.de> Am 27.05.2014 05:30, schrieb Noel Chiappa: > > From: Detlef Bosau > > > I should read texts much more carefully. > > This is rather an ironic comment, given the next... > > >> if the retransmission timeout expires on a segment in the > >> retransmission queue, send the segment at the front of the > >> retransmission queue again [and] reinitialize the retransmission > >> timer > > > Do I understand this correctly: When the packet is retransmitted, a > > copy is _appended_ to the retransmission queue? > > If you would examine RFC-793, you would see that by "retransmission queue" it > means 'un-acknowledged data queue'. May I just quote RFC-793, section 2.6: " When the TCP transmits a segment containing data, it puts a copy on a retransmission queue and starts a timer; when the acknowledgment for that data is received, the segment is deleted from the queue. " > So your question makes no sense, on two grounds: first, the data being > re-transmitted is _already_ in that queue; second, it is at the _start_ of > that 'queue' (buffer, actually - although it's implementation-specific how > un-acknowledge data is held - Dave Clark did a TCP for use with User TELNET > awhich kept un-acknowledged data in a shift register :-), adding it at the end > would be an error. Perhaps we talk at cross purposes here. What you name "unacknowledged data queue" is a vector or arry, whatever, which is basically a FIFO queue of bytes. The first byte in this queue is the one which corresponds to "snd.una", the first unacknowledged byte in flight. (There is no acknowledged byte in the unacknowledged dataqueue, otherwise it wouldn't be the unacknowledged data queue.) However, this leads to the conjecture, that unacknowledged packets are not stored in some "retransmission queue" but (as I implemented this myself in my simulator) we keep an unacknowledged data queue and packets, which require retransmission, a reconstructed from this queue. (Just as Louis pointed out.) The one question is: how do we implement Karn's algorithm correctly? (As we simply don't know whether a RTT sample belongs to a retransmitted packet because we don't keep a list of retransmitted packets. We keep a queue of unacknowledged data, no more!) And the other question is, refer to my response to Louis some minutes ago, what is the difference then to a GBN strategy, which was "never done" by TCP Tahoe, as I was told off list? > > Look, you need to understand how primitive our understanding, algorithms, etc > were when RFC-793 was written. There was no sophisticated algorithm associated > with re-transmission: when the (sole) re-transmission timer went off, the code > re-sent the oldest un-acknowledged data again. That's all. Full stop. > > If you read RFC-793 (and the other ones I have mentioned), you will get a > good sense of our level of understanding, how primitive many of our > algorithms, etc were at that point. I have no problems with that. But I have problems when I submit papers and these are rejected because reviewers complain about missing comparisons to other TCP flavours. And my general impression is that we are better than lawyers here. You know the saying: If you ask two lawyers one question, you will get three answers. If we would rephrase this for CS guys, I would ask: "Only four?" Perhaps, we should introduce a understanding disambiguator as in C++: Noel_Chiappa::Tahoe () Detlef_Bosau::Tahoe() Craig_Partridge::Tahoe() ..... ;-) > Noel -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From craig at aland.bbn.com Tue May 27 04:04:03 2014 From: craig at aland.bbn.com (Craig Partridge) Date: Tue, 27 May 2014 07:04:03 -0400 Subject: [ih] The History of Protocol R&D? Message-ID: <20140527110403.D203328E137@aland.bbn.com> Hi Jack: Some quick answers. > - "How do researchers now do protocol R&D, and validate a new idea by > measurements in the live Internet?" The answer depends on whether you are trying to measure stuff on the current Internet or trying to innovate new protocols for the Internet. If you are measuring in the current network, measurements are done much as they always have been, by instrumenting boxes and end systems. The challenge is that fewer folks have open access to the boxes and end systems and access is often limited by confidentiality agreements. This sometimes leads to situations where a researcher has access to a valuable data set but doesn't know what questions to ask of it (i.e. right data, wrong researcher). Most of this kind of measurement work is published at the Internet Measurement Conference. If you are trying to innovate new protocols, the community has developed some large-scale test infrastructure: PlanetLab is an example, as is GENI. Some of these infrastructures allow you to run repeatable experiments, others do not. > - "What have been the results over the life of The Internet so far?" You'd have to narrow the question -- the volume of what has been learned is large. And we'd probably search for the answer in the IMC proceedings. Thanks! Craig From vint at google.com Tue May 27 04:31:45 2014 From: vint at google.com (Vint Cerf) Date: Tue, 27 May 2014 07:31:45 -0400 Subject: [ih] The History of Protocol R&D? In-Reply-To: <20140527110403.D203328E137@aland.bbn.com> References: <20140527110403.D203328E137@aland.bbn.com> Message-ID: has anyone mentioned the GENI and FIND programs at NSF? OpenFlow emerged from the FIND effort. GENI is a major laboratory facility for experimenting with new protocols. v On Tue, May 27, 2014 at 7:04 AM, Craig Partridge wrote: > > Hi Jack: > > Some quick answers. > > > - "How do researchers now do protocol R&D, and validate a new idea by > > measurements in the live Internet?" > > The answer depends on whether you are trying to measure stuff on the > current > Internet or trying to innovate new protocols for the Internet. If you are > measuring in the current network, measurements are done much as they always > have been, by instrumenting boxes and end systems. The challenge is that > fewer folks have open access to the boxes and end systems and access is > often limited by confidentiality agreements. This sometimes leads to > situations where a researcher has access to a valuable data set but doesn't > know what questions to ask of it (i.e. right data, wrong researcher). > > Most of this kind of measurement work is published at the Internet > Measurement > Conference. > > If you are trying to innovate new protocols, the community has developed > some large-scale test infrastructure: PlanetLab is an example, as is GENI. > Some of these infrastructures allow you to run repeatable experiments, > others > do not. > > > - "What have been the results over the life of The Internet so far?" > > You'd have to narrow the question -- the volume of what has been learned > is large. And we'd probably search for the answer in the IMC proceedings. > > Thanks! > > Craig > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jnc at mercury.lcs.mit.edu Tue May 27 06:04:51 2014 From: jnc at mercury.lcs.mit.edu (Noel Chiappa) Date: Tue, 27 May 2014 09:04:51 -0400 (EDT) Subject: [ih] internet-history Digest, Vol 84, Issue 4 Message-ID: <20140527130451.2450418C0DE@mercury.lcs.mit.edu> > From: Detlef Bosau > Perhaps we talk at cross purposes here. Perhaps! :-) > What you name "unacknowledged data queue" is a vector or arry, > whatever, which is basically a FIFO queue of bytes. The first byte in > this queue is the one which corresponds to "snd.una", the first > unacknowledged byte in flight. Right. That's what RFC-793 calls the "retransmission queue" (a poor name, perhaps). If you look for the word "queue" in the text, there are only two queues on the output side (other than queues in the interface between the application and the OS): the "retransmission queue" (perhaps better named the 'unacknowledged data queue'), and the "network output queue", i.e. the device driver queue. > However, this leads to the conjecture, that unacknowledged packets are > not stored in some "retransmission queue" but (as I implemented this > myself in my simulator) we keep an unacknowledged data queue and > packets, which require retransmission, a reconstructed from this queue. > (Just as Louis pointed out.) Yes... And? > The one question is: how do we implement Karn's algorithm correctly? > ... > what is the difference then to a GBN strategy, which was "never done" > by TCP Tahoe, as I was told off list? You're asking about much later TCP practice and implementations; sorry, I cannot help with those. Noel From tony.li at tony.li Tue May 27 07:49:06 2014 From: tony.li at tony.li (Tony Li) Date: Tue, 27 May 2014 07:49:06 -0700 Subject: [ih] internet-history Digest, Vol 84, Issue 4 In-Reply-To: References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> <5383B00C.9040602@web.de> <6BF9716E-D958-4758-9C93-AB19E5930667@transsys.com> Message-ID: <98534602-A861-4058-9D36-A8D555E19A32@tony.li> On May 26, 2014, at 6:04 PM, Jack Haverty wrote: > Related curiousity question - does Internet traffic today actually get Fragmented? How's that been working? Effectively, fragmentation is has been a total bust. The good news is that the world has standardized on Ethernet, so the only real MTU is 1500B. Nevertheless, there are many islands of Jumbo Ethernet. First, the performance of fragmentation has always been given short shrift. Router vendors never had significant motivation to make this fast. Second, Path MTU Detection (PMTUD) largely doesn?t work. It got added to the stack too late, and there are too few implementations of it. Before we could get it deployed, the Great DoS Wars started, with ICMP as the primary weapon of choice, and filtering kicked in. Today, ICMP is largely useless and reaches only a very small proportion of the net. Future network design either requires that we operate without feedback at all, or we provide a cryptographically secure way of authenticating arbitrary nodes rapidly and without subjecting ourselves to authentication DoS attacks. Third, IEEE refuses to standardize Jumbo Ethernet. Basically, their attitude is that anything that?s above 1500B is non-standard, non-interoperable, and evil. The IETF refuses to touch it because it?s clearly a link layer issue. It?s become an SDO no-man?s-land. So folks out there select large MTUs for their private data centers, but have to do strange things for departing traffic. And every data center is different. And so it goes, Tony From brian.e.carpenter at gmail.com Tue May 27 13:26:02 2014 From: brian.e.carpenter at gmail.com (Brian E Carpenter) Date: Wed, 28 May 2014 08:26:02 +1200 Subject: [ih] Fragmentation [internet-history Digest, Vol 84, Issue 4] In-Reply-To: <98534602-A861-4058-9D36-A8D555E19A32@tony.li> References: <20140521190817.5212618C0E4@mercury.lcs.mit.edu> <5383B00C.9040602@web.de> <6BF9716E-D958-4758-9C93-AB19E5930667@transsys.com> <98534602-A861-4058-9D36-A8D555E19A32@tony.li> Message-ID: <5384F4DA.90302@gmail.com> Tony, On 28/05/2014 02:49, Tony Li wrote: > On May 26, 2014, at 6:04 PM, Jack Haverty wrote: > >> Related curiousity question - does Internet traffic today actually get Fragmented? How's that been working? > > Effectively, fragmentation is has been a total bust. > > The good news is that the world has standardized on Ethernet, so the only real MTU is 1500B. Nevertheless, there are many islands of Jumbo Ethernet. > > First, the performance of fragmentation has always been given short shrift. Router vendors never had significant motivation to make this fast. > > Second, Path MTU Detection (PMTUD) largely doesn?t work. It got added to the stack too late, and there are too few implementations of it. Before we could get it deployed, the Great DoS Wars started, with ICMP as the primary weapon of choice, and filtering kicked in. Today, ICMP is largely useless and reaches only a very small proportion of the net. Future network design either requires that we operate without feedback at all, or we provide a cryptographically secure way of authenticating arbitrary nodes rapidly and without subjecting ourselves to authentication DoS attacks. > > Third, IEEE refuses to standardize Jumbo Ethernet. Basically, their attitude is that anything that?s above 1500B is non-standard, non-interoperable, and evil. The IETF refuses to touch it because it?s clearly a link layer issue. It?s become an SDO no-man?s-land. So folks out there select large MTUs for their private data centers, but have to do strange things for departing traffic. And every data center is different. I would add: Fourth, fragmentation breaks deep packet inspection, so fragments simply get dropped by many firewalls and server load balancers. > And so it goes, Or rather, doesn't go... Brian > Tony > > > From detlef.bosau at web.de Tue May 27 15:39:02 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 28 May 2014 00:39:02 +0200 Subject: [ih] O.k., so BSD 4.3 Tahoe is the version of interest. Message-ID: <53851406.5010507@web.de> http://www.informatica.co.cr/bsd/research/1988/0615.htm Does anybody happen to have a link where I can obtain the sources, which are free as I'm told? I deal with the question if I implement GBN or not for weeks now, in addition I would like to see the implementation of Karn's algorithm (which is later, I presume) to continue my work. I'm actually surprised, that this discussion hat so much ramifications but I miss a bit a central theme. Detlef -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From detlef.bosau at web.de Tue May 27 16:08:27 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 28 May 2014 01:08:27 +0200 Subject: [ih] O.k., so BSD 4.3 Tahoe is the version of interest. In-Reply-To: <53851406.5010507@web.de> References: <53851406.5010507@web.de> Message-ID: <53851AEB.3030806@web.de> Aha!!!!!! http://www.tamacom.com/tour/kernel/bsd/S/23.html Oh yes: > 54 //*/ > 55 / * Determine length of data that should be transmitted,/ > 56 / * and flags that will be used./ > 57 / * If there is some data or critical controls (SYN, RST)/ > 58 / * to send, then transmit; otherwise, investigate further./ > 59 / *// > 60 idle = (tp ->snd_max == tp ->snd_una ); > 61 again : > 62 sendalot = 0; > 63 off = tp ->snd_nxt - tp ->snd_una ; > 64 win = MIN (tp ->snd_wnd , tp ->snd_cwnd ); > 65 And sendalot tells the loop whether only one packet is sent or anything from snd_nxt to what is available AND allowed? Next one: http://www.tamacom.com/tour/kernel/bsd/S/26.html > *case* TCPT_REXMT : > 148 tp ->t_rxtshift ++; > 149 *if* (tp ->t_rxtshift > TCP_MAXRXTSHIFT ) /{/ > 150 tp = tcp_drop (tp , ETIMEDOUT ); > 151 *break*; > 152 /}/ > 153 *if* (tp ->t_srtt == 0) > 154 rexmt = tcp_beta * TCPTV_SRTTDFLT ; > 155 *else* > 156 rexmt = (*int*)(tcp_beta * tp ->t_srtt ); > 157 rexmt *= tcp_backoff [tp ->t_rxtshift - 1]; > 158 TCPT_RANGESET (tp ->t_timer [TCPT_REXMT ], rexmt , > 159 TCPTV_MIN , TCPTV_MAX ); > 160 //*/ > 161 / * If losing, let the lower level know/ > 162 / * and try for a better route./ > 163 / *// > 164 *if* (tp ->t_rxtshift >= TCP_MAXRXTSHIFT / 4 || > 165 rexmt >= 10 * PR_SLOWHZ ) > 166 in_losing (tp ->t_inpcb ); > 167 tp ->snd_nxt = tp ->snd_una ; > 168 //*/ > 169 / * If timing a segment in this window,/ > 170 / * and we have already gotten some timing estimate,/ > 171 / * stop the timer./ > 172 / *// > 173 *if* (tp ->t_rtt && tp ->t_srtt ) > 174 tp ->t_rtt = 0; > 175 (*void*) tcp_output (tp ); > 176 *break*; now, what happens in line 175? Do we send 1 packet? At least, I think, tcp_output reads from an array with unacknowledged data here and not from a seperate queue? O.k., it is 1.00 h here, I'm not going to read this the whole night through, but gut feeling tells me, that I will find GBN here... -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Wed May 28 03:34:25 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 28 May 2014 12:34:25 +0200 Subject: [ih] Where is VJCC? Re: O.k., so BSD 4.3 Tahoe is the version of interest. In-Reply-To: <53851AEB.3030806@web.de> References: <53851406.5010507@web.de> <53851AEB.3030806@web.de> Message-ID: <5385BBB1.9090809@web.de> My joy was a bit to early, this version doesn't contain the VJCC actions but uses cwnd for source quench. (To my understanding, from the congavoid paper, VJ introduced a state variable cwnd, obviously it is already present here?) However, I don't find the actions "window halving" on time out and for the window increase, I find a probing in case of missing source quenchs. Anyway, as far as I see at the moment, this code clearly does Go Back N in case of an expiring retransmission timer and timer backoff. -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de From vint at google.com Wed May 28 03:55:20 2014 From: vint at google.com (Vint Cerf) Date: Wed, 28 May 2014 06:55:20 -0400 Subject: [ih] Where is VJCC? Re: O.k., so BSD 4.3 Tahoe is the version of interest. In-Reply-To: <5385BBB1.9090809@web.de> References: <53851406.5010507@web.de> <53851AEB.3030806@web.de> <5385BBB1.9090809@web.de> Message-ID: GBN is a funny way to characterize the process - the retransmission has to be limited by window permissions at least. v On Wed, May 28, 2014 at 6:34 AM, Detlef Bosau wrote: > My joy was a bit to early, this version doesn't contain the VJCC actions > but uses cwnd for source quench. (To my understanding, from the > congavoid paper, VJ introduced a state variable cwnd, obviously it is > already present here?) However, I don't find the actions > "window halving" on time out and for the window increase, I find a > probing in case of missing source quenchs. > > Anyway, as far as I see at the moment, this code clearly does Go Back N > in case of an expiring retransmission timer and timer backoff. > > > -- > ------------------------------------------------------------------ > Detlef Bosau > Galileistra?e 30 > 70565 Stuttgart Tel.: +49 711 5208031 > mobile: +49 172 6819937 > skype: detlef.bosau > ICQ: 566129673 > detlef.bosau at web.de http://www.detlef-bosau.de > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From louie at transsys.com Wed May 28 05:32:16 2014 From: louie at transsys.com (Louis Mamakos) Date: Wed, 28 May 2014 08:32:16 -0400 Subject: [ih] Where is VJCC? Re: O.k., so BSD 4.3 Tahoe is the version of interest. In-Reply-To: References: <53851406.5010507@web.de> <53851AEB.3030806@web.de> <5385BBB1.9090809@web.de> Message-ID: <40658437-7EBA-413A-A23B-A57FB2E1DF49@transsys.com> While I was a little late to the party (our TCP didn?t start to get written until early 1981), I don?t recall any discussion of GBN at the time. Perhaps that was a term of art invented later to describe all this. The retransmission mechanism was to send unacknowledged data from the "left window edge? of the TCP ?send window?. By definition, that?s where all of the unacknowledged data in the sliding window existed. The decision of how much data to send was implementation specific. I can?t imagine sending less than a whole TCP MSS on a retransmission attempt. louie On May 28, 2014, at 6:55 AM, Vint Cerf wrote: > GBN is a funny way to characterize the process - the retransmission has to be limited by window permissions at least. > > v > > > > On Wed, May 28, 2014 at 6:34 AM, Detlef Bosau wrote: > My joy was a bit to early, this version doesn't contain the VJCC actions > but uses cwnd for source quench. (To my understanding, from the > congavoid paper, VJ introduced a state variable cwnd, obviously it is > already present here?) However, I don't find the actions > "window halving" on time out and for the window increase, I find a > probing in case of missing source quenchs. > > Anyway, as far as I see at the moment, this code clearly does Go Back N > in case of an expiring retransmission timer and timer backoff. > > > -- > ------------------------------------------------------------------ > Detlef Bosau > Galileistra?e 30 > 70565 Stuttgart Tel.: +49 711 5208031 > mobile: +49 172 6819937 > skype: detlef.bosau > ICQ: 566129673 > detlef.bosau at web.de http://www.detlef-bosau.de > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vint at google.com Wed May 28 05:51:07 2014 From: vint at google.com (Vint Cerf) Date: Wed, 28 May 2014 08:51:07 -0400 Subject: [ih] Where is VJCC? Re: O.k., so BSD 4.3 Tahoe is the version of interest. In-Reply-To: <40658437-7EBA-413A-A23B-A57FB2E1DF49@transsys.com> References: <53851406.5010507@web.de> <53851AEB.3030806@web.de> <5385BBB1.9090809@web.de> <40658437-7EBA-413A-A23B-A57FB2E1DF49@transsys.com> Message-ID: GBN is indeed a term of art and was used in link protocols that maintained packet boundaries and retransmitted them in that form. TCP was different because it was delivering a bytestream. louie is correct in his characterization. Whether all unacked data was resent was partly idiosyncratic to the implementation of TCP and partly a function of the current window size. As i recall, it was recommended not to reduce the window to less than previously permitted transfers. v On Wed, May 28, 2014 at 8:32 AM, Louis Mamakos wrote: > While I was a little late to the party (our TCP didn?t start to get > written until early 1981), I don?t recall any discussion of GBN at the > time. Perhaps that was a term of art invented later to describe all this. > The retransmission mechanism was to send unacknowledged data from the > "left window edge? of the TCP ?send window?. By definition, that?s where > all of the unacknowledged data in the sliding window existed. > > The decision of how much data to send was implementation specific. I > can?t imagine sending less than a whole TCP MSS on a retransmission > attempt. > > louie > > > On May 28, 2014, at 6:55 AM, Vint Cerf wrote: > > GBN is a funny way to characterize the process - the retransmission has to > be limited by window permissions at least. > > v > > > > On Wed, May 28, 2014 at 6:34 AM, Detlef Bosau wrote: > >> My joy was a bit to early, this version doesn't contain the VJCC actions >> but uses cwnd for source quench. (To my understanding, from the >> congavoid paper, VJ introduced a state variable cwnd, obviously it is >> already present here?) However, I don't find the actions >> "window halving" on time out and for the window increase, I find a >> probing in case of missing source quenchs. >> >> Anyway, as far as I see at the moment, this code clearly does Go Back N >> in case of an expiring retransmission timer and timer backoff. >> >> >> -- >> ------------------------------------------------------------------ >> Detlef Bosau >> Galileistra?e 30 >> 70565 Stuttgart Tel.: +49 711 5208031 >> mobile: +49 172 6819937 >> skype: detlef.bosau >> ICQ: 566129673 >> detlef.bosau at web.de http://www.detlef-bosau.de >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Wed May 28 07:01:13 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 28 May 2014 16:01:13 +0200 Subject: [ih] Where is VJCC? Re: O.k., so BSD 4.3 Tahoe is the version of interest. In-Reply-To: References: <53851406.5010507@web.de> <53851AEB.3030806@web.de> <5385BBB1.9090809@web.de> Message-ID: <5385EC29.3020009@web.de> Am 28.05.2014 12:55, schrieb Vint Cerf: > GBN is a funny way to characterize the process - the retransmission > has to be limited by window permissions at least. > > v > Exactly that's why I'm interested in this question. Ant the only reliable way to find out whether or not BSD does GBN is -- reading the source ;-) -- ------------------------------------------------------------------ Detlef Bosau Galileistra?e 30 70565 Stuttgart Tel.: +49 711 5208031 mobile: +49 172 6819937 skype: detlef.bosau ICQ: 566129673 detlef.bosau at web.de http://www.detlef-bosau.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From detlef.bosau at web.de Wed May 28 07:04:12 2014 From: detlef.bosau at web.de (Detlef Bosau) Date: Wed, 28 May 2014 16:04:12 +0200 Subject: [ih] Where is VJCC? Re: O.k., so BSD 4.3 Tahoe is the version of interest. In-Reply-To: <40658437-7EBA-413A-A23B-A57FB2E1DF49@transsys.com> References: <53851406.5010507@web.de> <53851AEB.3030806@web.de> <5385BBB1.9090809@web.de> <40658437-7EBA-413A-A23B-A57FB2E1DF49@transsys.com> Message-ID: <5385ECDC.9030404@web.de> Am 28.05.2014 14:32, schrieb Louis Mamakos: > While I was a little late to the party (our TCP didn?t start to get > written until early 1981), I don?t recall any discussion of GBN at the > time. Perhaps that was a term of art invented later to describe all > this. The retransmission mechanism was to send unacknowledged data > from the "left window edge? of the TCP ?send window?. By definition, > that?s where all of the unacknowledged data in the sliding window > existed. > > The decision of how much data to send was implementation specific. I > can?t imagine sending less than a whole TCP MSS on a retransmission > attempt. > As far as I had a glance at the sources yesterday late in the evening (or today, early in the morning ;-)) this seems to be avoided. (In the context of the "silly window syndrome".) -------------- next part -------------- An HTML attachment was scrubbed... URL: