[ih] Detlef's TCP questions
Craig Partridge
craig at aland.bbn.com
Mon May 19 13:11:50 PDT 2014
> During the period of Van Jacobson's development of the algorithms that
> bear his name, he wrote many lengthy, pithy, and informative messages
> to various public mailing lists about the hazards of the Internet and
> how his algorithms cope. Maybe some of these lists are archived
> somewhere.
>
> Bob Braden
Hi Bob:
It turns out the late Rich Stevens had a small collection of notes he
considered Van's "greatest hits" from that time and was kind enough to
share them with me. Here are the notes.
Craig
> To: markl at PTT.LCS.MIT.EDU
> Subject: Re: interpacket arrival variance and mean
> In-reply-to: Your message of Mon, 08 Jun 87 12:31:33 EDT.
> Date: Mon, 15 Jun 87 06:08:01 PDT
> From: Van Jacobson <van at lbl-csam.arpa>
>
> Mark -
>
> I'm working on long paper about transport protocol timers (models,
> estimation, implementations, common implementation mistakes, etc.).
> This has me all primed on the subject so attached is a long-winded
> explanation and example C code for estimating the mean and variance
> of a series of measurements.
>
> Though I know your application isn't tcp, I'm going to use a new
> tcp retransmit timer algorithm as an example. The algorithm
> illustrates the method and, based on 3 months of experiment, is
> much better than the algorithm in rfc793 (though it isn't as good
> as the algorithm I talked about at IETF and am currently debugging).
>
> Theory
> ------
> You're probably familiar with the rfc793 algorithm for estimating
> the mean round trip time. This is the simplest example of a
> class of estimators called "recursive prediction error" or
> "stochastic gradient" algorithms. In the past 20 years these
> algorithms have revolutionized estimation and control theory so
> it's probably worthwhile to look at the rfc793 estimator in some
> detail.
>
> Given a new measurement, Meas, of the rtt, tcp updates an
> estimate of the average rtt, Avg, by
>
> Avg <- (1 - g) Avg + g Meas
>
> where g is a "gain" (0 < g < 1) that should be related to the
> signal-to-noise ratio (or, equivalently, variance) of Meas.
> This makes a bit more sense (and computes faster) if we rearrange
> and collect terms multiplied by g to get
>
> Avg <- Avg + g (Meas - Avg)
>
> We can think of "Avg" as a prediction of the next measurement.
> So "Meas - Avg" is the error in that prediction and the
> expression above says we make a new prediction based on the old
> prediction plus some fraction of the prediction error. The
> prediction error is the sum of two components: (1) error due to
> "noise" in the measurement (i.e., random, unpredictable effects
> like fluctuations in competing traffic) and (2) error due to a
> bad choice of "Avg". Calling the random error RE and the
> predictor error PE,
>
> Avg <- Avg + g RE + g PE
>
> The "g PE" term gives Avg a kick in the right direction while the
> "g RE" term gives it a kick in a random direction. Over a number
> of samples, the random kicks cancel each other out so this
> algorithm tends to converge to the correct average. But g
> represents a compromise: We want a large g to get mileage out of
> the PE term but a small g to minimize the damage from the RE
> term. Since the PE terms move Avg toward the real average no
> matter what value we use for g, it's almost always better to use
> a gain that's too small rather than one that's too large.
> Typical gain choices are .1 - .2 (though it's always a good
> idea to take a long look at your raw data before picking a gain).
>
> It's probably obvious that Avg will oscillate randomly around
> the true average and the standard deviation of Avg will be
> something like g sdev(Meas). Also that Avg converges to the
> true average exponentially with time constant 1/g. So
> making g smaller gives a stabler Avg at the expense of taking
> a much longer time to get to the true average.
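>
> (For example, with g = 1/8 the influence of old samples dies off
> like (1 - 1/8)^k, roughly e^(-k/8), so a step change in the true
> rtt takes on the order of 8 samples to show up fully in Avg.)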
>
> [My paper makes the preceding hand-waving a bit more rigorous
> but I assume no one cares about rigor. If you do care, check
> out almost any text on digital filtering or estimation theory.]
>
> If we want some measure of the variation in Meas, say to compute
> a good value for the tcp retransmit timer, there are lots of
> choices. Statisticians love variance because it has some nice
> mathematical properties. But variance requires squaring (Meas -
> Avg) so an estimator for it will contain two multiplies and a
> large chance of integer overflow. Also, most of the applications
> I can think of want variation in the same units as Avg and Meas,
> so we'll be forced to take the square root of the variance to use
> it (i.e., probably at least a divide, multiply and two adds).
>
> A variation measure that's easy to compute, and has a nice,
> intuitive justification, is the mean prediction error or mean
> deviation. This is just the average of abs(Meas - Avg).
> Intuitively, this is an estimate of how badly we've blown our
> recent predictions (which seems like just the thing we'd want to
> set up a retransmit timer). Statistically, standard deviation
> (= sqrt variance) goes like sqrt(sum((Meas - Avg)^2)) while mean
> deviation goes like sum(sqrt((Meas - Avg)^2)). Thus, by the
> triangle inequality, mean deviation should be a more "conservative"
> estimate of variation.
>
> If you really want standard deviation for some obscure reason,
> there's usually a simple relation between mdev and sdev. E.g.,
> if the prediction errors are normally distributed, sdev =
> sqrt(pi/2) mdev. For most common distributions the factor
> to go from mdev to sdev is near one (sqrt(pi/2) ~= 1.25). I.e.,
> mdev is a good approximation of sdev (and is much easier to
> compute).
>
> Practice
> --------
> So let's do estimators for Avg and mean deviation, Mdev. Both
> estimators compute means so we get two instances of the rfc793
> algorithm:
>
>     Err = abs (Meas - Avg)
>     Avg <- Avg + g (Meas - Avg)
>     Mdev <- Mdev + g (Err - Mdev)
>
> If we want to do the above fast, it should probably be done in
> integer arithmetic. But the expressions contain fractions (g < 1)
> so we'll have to do some scaling to keep everything integer. If
> we chose g so that 1/g is an integer, the scaling is easy. A
> particularly good choice for g is a reciprocal power of 2
> (i.e., g = 1/2^n for some n). Multiplying through by 1/g gives
>
>     2^n Avg <- 2^n Avg + (Meas - Avg)
>     2^n Mdev <- 2^n Mdev + (Err - Mdev)
>
> To minimize round-off error, we probably want to keep the scaled
> versions of Avg and Mdev rather than the unscaled versions. If
> we pick g = .125 = 1/8 (close to the .1 that rfc793 suggests) and
> express all the above in C:
>
>     Meas -= (Avg >> 3);     /* Meas becomes Err = Meas - Avg        */
>     Avg += Meas;            /* Avg <- Avg + Err/8 (Avg scaled by 8) */
>     if (Meas < 0)
>         Meas = -Meas;       /* Meas = |Err|                         */
>     Meas -= (Mdev >> 3);
>     Mdev += Meas;           /* Mdev <- Mdev + (|Err| - Mdev)/8      */
>
> I've been using a variant of this to compute the retransmit timer
> for my tcp. It's clearly faster than the two floating point
> multiplies that 4.3bsd uses and gives you twice as much information.
> Since the variation is estimated dynamically rather than using
> the wired-in beta of rfc793, the retransmit performance is also
> much better: faster retransmissions on low variance links and
> fewer spurious retransmissions on high variance links.
>
> It's not necessary to use the same gain for Avg and Mdev.
> Empirically, it looks like a retransmit timer estimate works
> better if there's more gain (bigger g) in the Mdev estimator.
> This forces the timer to go up quickly in response to changes in
> the rtt. (Although it may not be obvious, the absolute value in
> the calculation of Mdev introduces an asymmetry in the timer:
> Because Mdev has the same sign as an increase and the opposite
> sign of a decrease, more gain in Mdev makes the timer go up fast
> and come down slow, "automatically" giving the behavior Dave
> Mills suggests in rfc889.)
>
> Using a gain of .25 on the deviation and computing the retransmit
> timer, rto, as Avg + 2 Mdev, my tcp timer code looks like:
>
>     Meas -= (Avg >> 3);     /* Err = Meas - Avg (Avg scaled by 8)   */
>     Avg += Meas;            /* gain of 1/8 on the average           */
>     if (Meas < 0)
>         Meas = -Meas;
>     Meas -= (Mdev >> 2);    /* Mdev is scaled by 4 (gain of 1/4)    */
>     Mdev += Meas;
>     rto = (Avg >> 3) + (Mdev >> 1);     /* rto = Avg + 2 Mdev       */
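>
> Packaged as a self-contained routine (illustrative names; a sketch
> of the computation above, not the actual code), it might look like:
>
>     /* sa holds 8*Avg, sv holds 4*Mdev; meas is a new rtt sample.
>      * Returns the retransmit timeout, rto = Avg + 2*Mdev. */
>     int
>     rtt_update(int *sa, int *sv, int meas)
>     {
>         int err = meas - (*sa >> 3);        /* prediction error */
>
>         *sa += err;                         /* gain 1/8 on the average   */
>         if (err < 0)
>             err = -err;
>         *sv += err - (*sv >> 2);            /* gain 1/4 on the deviation */
>         return (*sa >> 3) + (*sv >> 1);
>     }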
>
> Hope this helps.
>
> - Van
>
> To: jain%erlang.DEC at decwrl.dec.com (Raj Jain, LKG1-2/A19, DTN: 226-7642)
> Cc: ietf at gateway.mitre.org
> Subject: Re: Your congestion scheme
> In-Reply-To: Your message of 03 Nov 87 12:51:00 GMT.
> Date: Mon, 16 Nov 87 06:03:29 PST
> From: Van Jacobson <van at lbl-csam.arpa>
>
> Raj,
>
> Thanks for the note. I hope you'll excuse my taking the liberty
> of cc'ing this reply to the ietf: At the last meeting there was a
> great deal of interest in your congestion control scheme and our
> adaptation of it.
>
> > I am curious to know what values of increase amount and decrease
> > factor did you use. In our previous study on congestion using
> > timeout (CUTE scheme, IEEE JSAC, Oct 1986), we had found that the
> > decrease factor should be small since packet losses are
> > expensive. In fact, we found that a decrease factor of zero
> > (decreasing to one) was the best.
>
> We use .5 for the decrease factor and 1 for the increase factor.
> We also use something very close to CUTE (Mike Karels and I were
> behind on our journal reading so we independently rediscovered
> the algorithm and called it slow-start). Since we use a lost
> packet as the "congestion experienced" indicator, the CUTE
> algorithm and the congestion avoidance algorithm (BTW, have you
> picked a name for it yet?) get run together, even though they are
> logically distinct.
>
> The sender keeps two state variables for congestion control: a
> congestion window, cwnd, and a threshold size, ssthresh, to
> switch between the two algorithms. The sender's output routine
> sends the minimum of cwnd and the window advertised by the
> receiver. The rest of the congestion control sender code looks
> like: On a timeout, record half the current window size as
> "ssthresh" (this is the multiplicative decrease part of the
> congestion avoidance algorithm), then set the congestion window
> to 1 packet and call the output routine (this initiates
> slowstart/CUTE). When new data is acked, the sender does
> something like
>
>     if (cwnd < ssthresh)    // if we're still doing slowstart
>         cwnd += 1 packet    // open the window exponentially
>     else
>         cwnd += 1/cwnd      // otherwise do the Congestion
>                             // Avoidance increment-by-1
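>
> A rough C rendering of the sender-side actions just described
> (hypothetical names, window counted in packets; a sketch rather
> than the actual code):
>
>     static double cwnd = 1.0;        /* congestion window, in packets  */
>     static double ssthresh = 64.0;   /* slowstart/avoidance threshold  */
>
>     /* on a retransmit timeout: multiplicative decrease, then slowstart */
>     void
>     on_timeout(void)
>     {
>         ssthresh = cwnd / 2.0;       /* half the window is known to work */
>         if (ssthresh < 2.0)
>             ssthresh = 2.0;
>         cwnd = 1.0;                  /* restart the ack clock            */
>     }
>
>     /* on an ack for new data: open the window */
>     void
>     on_new_ack(void)
>     {
>         if (cwnd < ssthresh)
>             cwnd += 1.0;             /* slowstart: exponential opening   */
>         else
>             cwnd += 1.0 / cwnd;      /* avoidance: ~1 packet per rtt     */
>     }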
>
> Notice that our variation of CUTE opens the window exponentially
> in time, as opposed to the linear scheme in your JSAC article.
> We looked at a linear scheme but were concerned about the
> performance hit on links with a large bandwidth-delay product
> (ie., satellite links). An exponential builds up fast enough to
> accommodate any bandwidth-delay and our testing showed no more
> packet drops with exponential opening than with linear opening.
> (My model of what slowstart is doing -- starting an ack "clock"
> to meter out packets -- suggested that there should be no
> loss-rate difference between the two policies).
>
> The reason for the 1/2 decrease, as opposed to the 7/8 you use,
> was the following hand-waving: When a packet is dropped, you're
> either starting (or restarting after a drop) or steady-state
> sending. If you're starting, you know that half the current
> window size "worked", ie., that a window's worth of packets were
> exchanged with no drops (slowstart guarantees this). Thus on
> congestion you set the window to the largest size that you know
> works then slowly increase the size. If the connection is
> steady-state running and a packet is dropped, it's probably
> because a new connection started up and took some of your
> bandwidth. We usually run our nets with rho <= .5 so it's
> probable that there are now exactly two conversations sharing the
> bandwidth. Ie., that you should reduce your window by half
> because the bandwidth available to you has been reduced to half.
> And, if there are more than two conversations sharing the
> bandwidth, halving your window is conservative -- and being
> conservative at high traffic intensities is probably wise.
>
> Obviously, this large decrease term is accounting for the high
> cost of our "congestion experienced" indicator compared to yours --
> a dropped packet vs. a bit in the ack. If the DEC bit were
> available, the preceding "logic" should be ignored. But I wanted
> tcp congestion avoidance that we could deploy immediately and
> incrementally, without adding code to the hundreds of Internet
> gateways. So using dropped packets seemed like the only choice.
> And, in system terms, the cost isn't that high: Currently packets
> are dropped only when a large queue has formed. If we had a bit
> to force senders to reduce their windows, we'd still be stuck
> with the queue since we'd still be running the bottleneck at 100%
> utilization so there's no excess bandwidth available to dissipate
> the queue. If we toss a packet, a sender will shut up for 2 rtt,
> exactly the time we need to empty the queue (in the usual case).
> If that sender restarts with the correct window size the queue
> won't reform. Thus we've reduced the delay to minimum without
> the system losing any bottleneck bandwidth.
>
> The 1-packet increase has less justification than the .5
> decrease. In fact, it's almost certainly too large. If the
> algorithm converges to a window size of w, you get O(w^2) packets
> between drops with an additive increase policy. We were shooting
> for an average drop rate of <1% and found that on the Arpanet
> (the worst case of the 4 networks we tested), windows converged
> to 8 - 12 packets. This yields 1 packet increments for a 1%
> average drop rate.
>
> But, since we've done nothing in the gateways, the window we
> converge to is the maximum the gateway can accept without dropping
> packets. I.e., in the terms you used, we are just to the left of
> the cliff rather than just to the right of the knee. If we
> now fix the gateways so they start dropping packets when the
> queue gets pushed past the knee, our increment will be much too
> aggressive and we'll have to drop it by at least a factor of 4
> (since all my measurements on an unloaded Arpanet or Milnet
> place their "pipe size" at 4-5 packets). Mike and I have talked
> about a second order control loop to adaptively determine the
> appropriate increment to use for a path (there doesn't seem to
> be much need to adjust the decrement). It looks trivial to
> implement such a loop (since changing the increment affects
> the frequency of oscillation but not the mean window size,
> the loop would affect rate of convergence but not convergence
> and (first-order) stability). But we think 2nd order stuff
> should wait until we've spent some time on the 1st order part
> of the algorithm for the gateways.
>
> I'm tired and probably no longer making sense. I think I'll
> head home and get some sleep. Hope to hear from you soon.
>
> Cheers.
>
> - Van
>
> To: tcp-ip at sri-nic.arpa
> Subject: Dynamic Congestion Avoidance / Control (long message)
> Date: Thu, 11 Feb 88 22:17:04 PST
> From: Van Jacobson <van at lbl-csam.arpa>
>
> A dozen people forwarded Phil Karn's question about TCP
> congestion control to me, usually with pithy subject lines like
> "how much longer till you publish something?". I do have three
> RFCs and two papers in various stages of preparation, but innate
> laziness combined with this semester's unexpectedly large
> teaching load means it will probably be late spring before
> anything gets finished. In the interim, here's a sort-of
> overview of our congestion control work.
>
> I should point out these algorithms are joint work with Mike
> Karels of UC Berkeley (I guess my name got stuck on things
> because I make the presentations while Mike is off fighting the
> battles on EGP or routing or domains or ...). I should also
> mention that there's not a whole lot that's original. In
> particular, the two most important algorithms are quite close to
> (prior) work done by DEC's Raj Jain. This was by accident in
> one case and deliberate in the other.
>
> This stuff has been discussed on the ietf and end2end lists
> (Phil participated in some of those discussions and was, in
> fact, one of the early beta testers for our new tcp -- I have
> this nagging suspicion I was set up). I've appended a couple of
> those mail messages.
>
>
> Mike and I have put a number (six, actually) of new algorithms
> into the 4bsd tcp. Our own measurements and the reports of our
> beta testers suggest that the result is quite good at dealing
> with current, congested conditions on the Internet. The various
> algorithms were developed and presented at different times over
> the past year and it may not be clear to anyone but the
> developers how, or if, the pieces relate.
>
> Most of the changes spring from one observation: The flow on a
> TCP connection (or ISO TP-4 or XNS SPP connection) should, by
> the design of the protocol, obey a `conservation of packets'
> principle. And, if this principle were obeyed, congestion
> collapse would become the exception rather than the rule.
> Thus congestion control turns into finding places where
> conservation is violated and fixing them.
>
> By `conservation of packets' I mean the following:
> If you look at the connection "in equilibrium", i.e., running
> stably with a full window of data in transit, you notice that
> the flow is what a physicist would call "conservative": A new
> packet isn't put into the network until an old packet leaves.
> Two scientific disciplines that deal with flow, hydrodynamics
> and thermodynamics, predict that systems with this conservation
> property should be extremely robust in the face of congestion.
> Observation of the Arpanet or NSFNet suggests that they were not
> particularly robust in the face of congestion. From whence
> comes the discrepancy?
>
> [Someone asked me for a simple explanation of why a conservative
> flow system should be stable under load. I've been trying to
> come up with one, thus far without success -- absolutely nothing
> in hydrodynamics admits to simple explanations. The shortest
> explanation is that the inhomogeneous forcing terms in the
> Fokker-Planck equation vanish and the remaining terms are highly
> damped. But I don't suppose this means much to anyone (it
> doesn't mean much to me). My intuition is that conservation
> means there's never an `energy difference' between the outside
> world and the system that the system could use to `pump up' an
> instability to a state of collapse. Thus the only mechanism the
> system can use for self-destruction is to re-arrange its
> internal energy and create a difference large enough to break
> something. But entropy (or diffusion) always tries to erase
> large internal energy differences so collapse via these
> mechanisms is improbable. Possible, but improbable, at least
> until the load gets so outrageous that collective, non-ergodic
> effects start to dominate the overall behavior.]
>
> Packet conservation can fail three ways:
>
> 1) the connection doesn't get to equilibrium, or
>
> 2) a sender injects a packet before an old packet has exited, or
>
> 3) the equilibrium can't be reached because of resource
> limits along the path.
>
> (1) has to be from a connection starting or restarting after a
> packet loss. Another way to look at the conservation property
> is to say that the sender uses acks as a "clock" to strobe new
> packets into the network. Since the receiver can generate acks
> no faster than data packets can get through the network, the
> protocol is "self clocking" (an ack can't go in until a packet
> comes out, a packet can't go in until an ack comes out). Self
> clocking systems automatically adjust to bandwidth and delay
> variations and have a wide dynamic range (an important issue
> when you realize that TCP spans everything from 800 Mbit/sec
> Cray channels to 1200 bit/sec packet radio links). But the same
> thing that makes a self-clocked system stable when it's running
> makes it hard to get started -- to get data flowing you want acks
> to clock packets out but to get acks you have to have data flowing.
>
> To get the `clock' started, we came up with an algorithm called
> slow-start that gradually increases the amount of data flowing.
> Although we flatter ourselves that the design of this algorithm
> is quite subtle, the implementation is trivial -- 3 lines of
> code and one new state variable in the sender:
>
>     1) whenever you're starting or restarting after a loss,
>        set your "congestion window" to 1 packet.
>     2) when you get an ack for new data, increase the
>        congestion window by one packet.
>     3) when you send, send the minimum of the receiver's
>        advertised window and the congestion window.
>
> (This is quite similar to Raj Jain's CUTE algorithm described in
> IEEE JSAC, Oct. '86, although we didn't
> know about CUTE at the time we were developing slowstart).
>
> Actually, the slow-start window increase isn't all that gradual:
> The window opening takes time proportional to log2(W) where W is
> the window size in packets. This opens the window fast enough
> to have a negligible effect on performance, even on links that
> require a very large window. And the algorithm guarantees that
> a connection will source data at a rate at most twice the
> maximum possible on the path. (Without slow-start, by contrast,
> when 10Mb ethernet hosts talk over the 56Kb Arpanet via IP
> gateways, the gateways see a window's worth of packets (8-16)
> delivered at 200 times the path bandwidth.)
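>
> (Worked example: to fill a path that needs a 16-packet window,
> slowstart sends 1, 2, 4, 8 and then 16 packets on successive round
> trips -- log2(16) = 4 rtt to open up -- and on no round trip does
> it inject more than twice what the path actually delivered on the
> previous one.)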
>
> Once you can reliably start data flowing, problems (2) and (3)
> have to be addressed. Assuming that the protocol implementation
> is correct, (2) must represent a failure of the sender's retransmit
> timer. A good round trip time estimator (the core of the
> retransmit timer) is the single most important feature of any
> protocol implementation that expects to survive heavy load. And
> it almost always gets botched.
>
> One mistake seems to be not estimating the variance of the rtt.
> From queuing theory we know that rtt and rtt variation increase
> very quickly with load. Measuring the load by rho (the ratio of
> average arrival rate to average departure rate), the rtt goes up
> like 1/(1-rho) and the variation in rtt goes like 1/(1-rho)^2.
> To make this concrete, if the network is running at 75% of
> capacity (as the Arpanet was in last April's collapse), you
> should expect rtt to vary by a factor of 16 around its mean
> value. The RFC793 parameter "beta" accounts for rtt variation.
> The suggested value of "2" can adapt to loads of at most 30%
> (the load where 1/(1-rho)^2 reaches 2, i.e., rho ~= .3).
> Above this point, a connection will respond to load increases by
> retransmitting packets that have only been delayed in transit.
> This makes the network do useless work (wasting bandwidth on
> duplicates of packets that have been or will be delivered) at a
> time when it's known to be having trouble with useful work.
> I.e., this is the network equivalent of pouring gasoline on a
> fire.
>
> We came up with a cheap method for estimating variation (see first of
> attached msgs) and the resulting retransmit timer essentially
> eliminates spurious retransmissions. A pleasant side effect of
> estimating "beta" rather than using a fixed value is that low
> load as well as high load performance improves, particularly
> over high delay paths such as satellite links.
>
> Another timer mistake seems to be in the backoff after a
> retransmit: If we have to retransmit a packet more than once,
> how should the retransmits be spaced? It turns out there's only
> one scheme that's likely to work, exponential backoff, but
> proving this is a bit involved (one of the two papers alluded to
> in the opening goes through a proof). We might finesse a proof
> by noting that a network is, to a very good approximation, a
> linear system. (That is, it is composed of elements that behave
> like linear operators -- integrators, delays, gain stages,
> etc.). Linear system theory says that if a system is stable,
> the stability is exponential. This suggests that if we have a
> system that is unstable (a network subject to random load shocks
> and prone to congestive collapse), the way to stabilize it is to
> add some exponential damping (read: exponential timer backoff)
> to its primary excitation (read: senders, traffic sources).
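>
> In code terms the backoff itself is tiny; a sketch (with an assumed
> ceiling on the timer) looks like:
>
>     #define RTO_MAX 64              /* assumed ceiling, e.g. 64 sec */
>
>     /* exponential backoff: double the timeout on each successive
>      * retransmission of the same packet, up to RTO_MAX */
>     int
>     backoff(int rto)
>     {
>         rto *= 2;
>         return rto > RTO_MAX ? RTO_MAX : rto;
>     }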
>
> Once the timers are in good shape, you can state with some
> confidence that a timeout really means a lost packet and not a
> busted timer. At this point you can do something about (3).
> Packets can get lost for two reasons: they are damaged in
> transit or the network is congested and somewhere on the path
> there was insufficient buffer capacity. On most of the network
> paths we use, loss due to damage is rare (<<1%) so it is very
> probable that a packet loss is due to congestion in the network.
>
> Say we try to develop a `congestion avoidance' strategy like the
> one Raj Jain, et al., propose in DEC TR-506 and ISO 8473. It
> must have two components: The network must be able to signal
> the transport endpoints that congestion is occurring (or about
> to occur). And the endpoints must have a policy that decreases
> utilization if this signal is received and increases utilization
> if the signal isn't received.
>
> If packet loss is (almost) always due to congestion and if a
> timeout is (almost) always due to a lost packet, we have a good
> candidate for the `network is congested' signal. Particularly
> since this signal is delivered automatically by all existing
> networks, without special modification (as opposed to, say, ISO
> 8473 which requires a new bit in the packet headers and a mod to
> *all* existing gateways to set this bit).
>
> [As a (lengthy) aside:
> Using `lost packets' to communicate information seems to bother
> people. The second of the two papers mentioned above is devoted
> to analytic and simulation investigations of our algorithm under
> various network conditions. I'll briefly mention some of the
> objections we've heard and, I think, addressed.
>
> There have been claims that signaling congestion via dropping
> packets will adversely affect total throughput (it's easily
> proved that the opposite is true) or that this signal will be
> `too slow' compared to alternatives (The fundamental frequency
> of the system is set by its average round trip time. From
> control theory and Nyquist's theorem, any control strategy that
> attempts to respond in less than two round trip times will be
> unstable. A packet loss is detected in at most two rtt, the
> minimum stable response time).
>
> There have also been claims that this scheme will result in
> unfair or sub-optimal resource allocation (this is untrue if
> equivalent implementations of ISO and our schemes are compared)
> or that mistaking damage for congestion will unnecessarily
> throttle the endpoints on some paths with a high intrinsic loss
> rate (this is mostly untrue -- the scheme we propose is
> analytically tractable so we've worked out its behavior under
> random loss. It turns out that window flow control schemes just
> don't handle high loss rates well and throughput of a vanilla
> TCP under high, random loss is abysmal. Adding our congestion
> control scheme makes things slightly worse but the practical
> difference is negligible. As an example (that we have calculated
> and Jon Crowcroft at UCL has measured), it takes 40 256-byte
> packets to fill the Satnet pipe. Satnet currently shows a
> random, 1% average packet loss. The maximum throughput a
> vanilla tcp could achieve at this loss rate is 56% of the 40kbs
> channel bandwidth. The maximum throughput our TCP can achieve at
> this loss rate is 49% of the channel bandwidth.)
>
> In theory, the use of dynamic congestion control should allow
> one to achieve much better throughput under high loss than is
> possible with normal TCP -- performance comparable to, say,
> NETBLT over the same lossy link. The reason is that regular TCP
> tries to communicate two different things with one number (the
> window field in the packet): the amount of buffer the receiver
> has available and the delay-bandwidth product of the pipe. Our
> congestion control scheme separates these two numbers. The
> sender estimates the pipe size so the receiver window need only
> describe the receiver's buffer size. As long as the receiver
> advertises a sufficiently large buffer, (twice the delay-
> bandwidth product) a 1% loss rate would translate into a 1%
> throughput effect, not a factor-of-two effect. Of course, we
> have yet to put this theory to test.]
>
> The other part of a congestion avoidance strategy, the endnode
> action, is almost identical in Jain's DEC/ISO scheme and our
> TCP. This is not an accident (we copied Jain's scheme after
> hearing his presentation at the Boston IETF & realizing that the
> scheme was, in a sense, universal): The endnode strategy
> follows directly from a first-order time-series model of the
> network. Say we measure network `load' by average queue length
> over fixed intervals of some appropriate length (something near
> the rtt). If L(i) is the load at interval i, a network not
> subject to congestion can be modeled by saying L(i) changes
> slowly compared to the sampling time. I.e.,
>
> L(i) = N
>
> (the average queue length doesn't change over time). If the
> network is subject to congestion, this zero'th order model
> breaks down. The average queue length becomes the sum of two
> terms, the N above that accounts for the average arrival rate
> of new traffic and a new term that accounts for the left-over
> traffic from the last time interval:
>
> L(i) = N + a L(i-1)
>
> (As pragmatists, we extend the original model just enough to
> explain the new behavior. What we're doing here is taking the
> first two terms in a Taylor series expansion of L(t) when we
> find that just the first term is inadequate. There is reason to
> believe one would eventually need a three term, second order
> model, but not until the Internet has grown to several times its
> current size.)
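>
> (To see why this matters: as long as `a' is safely below one, the
> recursion L(i) = N + a L(i-1) settles at the fixed point N/(1-a);
> as `a' approaches or passes one there is no fixed point and the
> queue grows without bound.)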
>
> When the network is congested, the `a' term must be large and
> the load (queues) will start increasing exponentially. The only
> way to stabilize the system is if the traffic sources throttle
> back at least as fast as the queues are growing. Since the way
> a source controls load in a window-based protocol is to adjust
> the size of the window, W, we end up with the sender policy
>
>     On congestion:    W(i) = d W(i-1)       (d < 1)
>
> I.e., a multiplicative decrease of the window size. (This will
> turn into an exponential decrease over time if the congestion
> persists.)
>
> If there's no congestion, the `a' term must be near zero and the
> network load is about constant. You have to try to increase the
> bandwidth you're using to find out the current limit (e.g., you
> could have been sharing the path with someone else and converged
> to a window that gives you each half the available bandwidth. If
> he shuts down, 50% of the bandwidth will get wasted unless you
> make some effort to increase your window size.) What should the
> increase policy be?
>
> The first thought is to use a symmetric, multiplicative increase,
> possibly with a longer time constant. E.g., W(i) = b W(i-1),
> 1 < b <= 1/d. This is a mistake. The result will oscillate
> wildly and, on the average, deliver poor throughput. The reason
> why is tedious to explain. It has to do with the fact that it
> is easy to drive the net into saturation but hard for the net
> to recover (what Kleinrock, vol.2, calls the "rush-hour effect").
> Thus overestimating the available bandwidth is costly. But an
> exponential, almost regardless of its time constant, increases
> so quickly that large overestimates are inevitable.
>
> Without justification, I'll simply state that the best increase
> policy is to make small, constant changes to the window size (if
> you've had a control theory course, you've seen the justification):
>
>     On no congestion: W(i) = W(i-1) + u
>                       (u << the path delay-bandwidth product)
>
> I.e., an additive increase policy. This is exactly the policy that
> Jain, et al., suggest in DEC TR-506 and exactly the policy we've
> implemented in TCP. The only difference in our implementations is
> the choice of constants for `d' and `u'. We picked .5 and 1 for
> reasons that are partially explained in the second of the attached
> messages. A more complete analysis is in the second in-progress
> paper (and may be discussed at the upcoming IETF meeting).
>
> All of the preceding has probably made it sound as if the
> dynamic congestion algorithm is hairy but it's not. Like
> slow-start, it turns out to be three lines of code and one new
> connection state variable (the combined slow-start/congestion
> control algorithm is described in the second of the attached
> msgs).
>
>
> That covers about everything that's been done to date. It's
> about 1/3 of what we think clearly needs to be done. The next
> big step is to do the gateway `congestion detection' algorithms
> so that a signal is sent to the endnodes as early as possible
> (but not so early that the gateway ends up starved for traffic).
> The way we anticipate doing these algorithms, gateway `self
> protection' from a mis-behaving host will fall out for free (that
> host will simply have most of its packets dropped as the gateway
> tries to tell it that it's using more than its fair share). Thus,
> like the endnode algorithm, the gateway algorithm will improve
> things even if no endnode is modified to do dynamic congestion
> avoidance. And nodes that do implement congestion avoidance
> will get their fair share of bandwidth and a minimum number of
> packet drops.
>
> Since congestion grows exponentially, detecting it early is
> important. (If it's detected early, small adjustments to the
> senders' windows will cure it. Otherwise massive adjustments
> will be necessary to give the net enough spare capacity to pump
> out the backlog.) But, given the bursty nature of traffic,
> reliable detection is a non-trivial problem. There is a scheme
> proposed in DEC TR-506 but we think it might have convergence
> problems under high load and/or significant second-order dynamics
> in the traffic. We are thinking of using some of our earlier
> work on ARMAX models for rtt/queue length prediction as the
> basis of the detection. Preliminary results suggest that this
> approach works well at high load, is immune to second-order
> effects in the traffic and is cheap enough to compute that it
> wouldn't slow down thousand-packet-per-second gateways.
>
> In addition to the gateway algorithms, we want to apply the
> endnode algorithms to connectionless traffic (e.g., domain
> server queries, RPC requests). We have the rate-based
> equivalent of the TCP window algorithm worked out (there should
> be a message describing it on the tcp list sometime in the near
> future). Sun Microsystems has been very interested, and
> supportive, during the development of the TCP congestion control
> (I believe Sun OS 4.0 will ship with our new TCP) and Sun has
> offered to cooperate in developing the RPC dynamic congestion
> control algorithms, using NFS as a test-bed (since NFS is known
> to have congestion problems).
>
> The far future holds some work on controlling second-order, non-
> ergodic effects, adaptively determining the rate constants in
> the control algorithm, and some other things that are too vague
> to mention.
>
> - Van
>
> From: van at HELIOS.EE.LBL.GOV (Van Jacobson)
> Newsgroups: comp.protocols.tcp-ip
> Subject: some interim notes on the bsd network speedups
> Message-ID: <8807200426.AA01221 at helios.ee.lbl.gov>
> Date: 20 Jul 88 04:26:17 GMT
>
> I told the potential beta-testers for our new 4bsd network code
> that I hoped to have a version of the code out by the end of
> July. (BTW, we've got all the beta testers we can handle --
> please don't apply.) It looks like that's going to turn into
> the end of August, in part because of SIGCOMM and in part
> because Sun puts sending source to academic customers on the
> bottom of its priority list. I thought I'd flame about the
> latter and give a bit of a status report on the new code.
>
> I've been trying to put together a beta distribution for the
> "header prediction" bsd network code. This effort has been
> stymied because it's impossible to get current source from Sun.
> The code involves major changes to the 4.3bsd kernel. The only
> local machines I can use to develop new kernels are Suns --
> everything else is either multi-user or has pathetic ethernet
> hardware. But the only Sun kernel source I've got is the doubly
> obsolete Sun OS 3.5/4.2 BSD. It would be a massive waste of
> time to upgrade this system to 4.3 BSD just so I can develop
> the next BSD -- Bill Nowicki did the upgrade a year ago and
> binaries of the new system (totally worthless to me) are
> shipping as Sun OS 4.0. [I'm not the only one suffering this
> problem -- I understand Craig Partridge's multicast work is
> suffering because he can't get 4.3-compatible Sun source. I
> think he gave up & decided to do all his development on 4.3bsd
> Vaxen. And I think I heard Chuck Hedrick say 4.0 has all the
> rlogin, URG and nameserver bugs that we fondly remember fixing
> in 3.x. And he has to get source before the academic year
> starts or they won't be able to switch until a semester break.
> And Mike Karels is saying "I told you so" and suggesting I buy
> some CCIs. Pity that Sun can't figure out that it's in their
> best interest to do TIMELY source distribution to the academic
> and research community -- their software development base gets
> expanded a few hundred-fold for the cost of making tapes.]
>
> Anyway, now that I've vented my spleen, there are some interim
> results to talk about. While waiting for either useful source
> or a better hardware platform to show up, I've been cleaning up
> my original mods and backing out changes one and two at a time
> to gauge their individual effect. Because network performance
> seems to rest on getting a lot of things happening in parallel,
> this leave-one-out testing doesn't give simple good/bad answers
> (I had one test case that went like
>
>     Basic system:    600 KB/s
>     add feature A:   520 KB/s
>     drop A, add B:   530 KB/s
>     add both A & B:  700 KB/s
>
> Obviously, any statement of the form "feature A/B is good/bad"
> is bogus.) But, in spite of the ambiguity, some of the network
> design folklore I've heard seems to be clearly wrong.
>
> In the process of cleaning things up, the code slowed down. Task-
> to-task data throughput using TCP between two Sun 3/60s dropped
> from 1 MB/s (about 8.6 Mb/s on the wire) to 890 KB/s (about 7.6
> Mb/s on the wire). I know where the 11% was lost (an
> interaction between "sosend" and the fact that an AMD LANCE chip
> requires at least 100 bytes in the first chunk of data if you
> ask it to chain -- massive braindamage on AMD's part) and how to
> get it back (re-do the way user data gets handed to the
> transport protocol) but need to talk with Mike Karels about the
> "right" way to do this.
>
> Still, 890 KB/s represents a non-trivial improvement over the
> stock Sun/4bsd system: Under identical test conditions (same
> socket & user buffer sizes, same MSS and MTU, same machines),
> the best tcp throughput I could get with an out-the-box Sun OS
> 3.5 was 380 KB/s. I wish I could say "make these two simple
> changes and your throughput will double" but I can't. There
> were lots and lots of fiddley little changes, none made a huge
> difference and they all seemed to be necessary.
>
> The biggest single effect was a change to sosend (the routine
> between the user "write" syscall and tcp_output). Its loop
> looked something like:
>
>     while there is user data & space in the socket buffer
>         copy from user space to socket
>         call the protocol "send" routine
>
> After hooking a scope to our ethernet cable & looking at the
> packet spacings, I changed this to
>
>     while there is user data & space in the socket buffer
>         copy up to 1K (one cluster's worth) from user space to socket
>         call the protocol "send" routine
>
> and the throughput jumped from 380 to 456 KB/s (+20%). There's
> one school of thought that says the first loop was better
> because it minimized the "boundary crossings", the fixed costs
> of routine calls and context changes. This same school is
> always lobbying for "bigger": bigger packets, bigger windows,
> bigger buffers, for essentially the same reason: the bigger
> chunks are, the fewer boundary crossings you pay for. The
> correct school, mine :-), says there's always a fixed cost and a
> variable cost (e.g., the cost of maintaining tcp state and
> tacking a tcp packet header on the front of some data is
> independent of the amount of data; the cost of filling in the
> checksum field in that header scales linearly with the amount of
> data). If the size is large enough to make the fixed cost small
> compared to the variable cost, making things bigger LOWERS
> throughput because you throw away opportunities for parallelism.
> Looking at the ethernet, I saw a burst of packets, a long dead
> time, another burst of packets, ... . It was clear that while
> we were copying 4 KB from the user, the processor in the LANCE
> chip and tcp_input on the destination machine were mostly
> sitting idle.
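>
> (In symbols: if each chunk of S bytes costs F fixed plus V*S
> variable, the per-byte cost is F/S + V. Once S is a few multiples
> of F/V the fixed cost is already amortised, and growing S further
> just lengthens the dead time during which the LANCE and the
> receiving machine have nothing to do.)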
>
> To get good network performance, where there are guaranteed to
> be many processors that could be doing things in parallel, you
> want the "units of work" (loop sizes, packet sizes, etc.) to be
> the SMALLEST values that amortise the fixed cost. In Berkeley
> Unix, the fixed costs of protocol processing are pretty low and
> sizes of 1 - 2 KB on a 68020 are as large as you'd want to get.
> (This is easy to determine. Just do a throughput vs. size test
> and look for the knee in the graph. Best performance is just to
> the right of the knee.) And, obviously, on a faster machine
> you'd probably want to do things in even smaller units (if the
> overhead stays the same -- Crays are fast but hardware
> strangeness drives the fixed costs way up. Just remember that
> if it takes large packets and large buffers to get good
> performance on a fast machine, that's because it's broken, not
> because it's fast.)
>
> Another big effect (difficult to quantify because several
> changes had to be made at once) was to remove lots of
> more-or-less hidden data copies from the protocol processing.
> It's a truism that you never copy data in network code. Just
> lay the data down in a buffer & pass around pointers to
> appropriate places in that buffer. But there are lots of places
> where you need to get from a pointer into the buffer back to a
> pointer to the buffer itself (e.g., you've got a pointer to the
> packet's IP header, there's a header checksum error, and you
> want to free the buffer holding the packet). The routine "dtom"
> converts a data pointer back to a buffer pointer but it only
> works for small things that fit in a single mbuf (the basic
> storage allocation unit in the bsd network code). Incoming
> packets are never in an mbuf; they're in a "cluster" which the
> mbuf points to. There's no way to go from a pointer into a
> cluster to a pointer to the mbuf. So, anywhere you might need
> to do a dtom (anywhere there's a detectable error), there had to
> be a call to "m_pullup" to make sure the dtom would work.
> (M_pullup works by allocating a fresh mbuf, copying a bunch of
> data from the cluster to this mbuf, then chaining the original
> cluster behind the new mbuf.)
>
> So, we were doing a bunch of copying to expedite error handling.
> But errors usually don't happen (if you're worried about
> performance, first you make sure there are very, very few
> errors), so we were doing a bunch of copying for nothing. But,
> if you're sufficiently insomniac, in about a week you can track
> down all the pullup's associated with all the dtom's and change
> things to avoid both. This requires massive recoding of both
> the TCP and IP re-assembly code. But it was worth it: TCP
> throughput climbed from around 600 KB/s to 750 KB/s and IP
> forwarding just screamed: A 3/60 forwarding packets at the 9
> Mb/s effective ethernet bandwidth used less than 50% of the CPU.
>
> [BTW, in general I didn't find anything wrong with the BSD
> mbuf/cluster model. In fact, I tried some other models (e.g.,
> do everything in packet sized chunks) and they were slower.
> There are many cases where knowing that you can grab an mbuf and
> chain it onto a chunk of data really simplifies the protocol
> code (simplicity == speed). And the level of indirection and
> fast, reference counted allocation of clusters can really be a
> win on data transfers (a la kudp_fastsend in Sun OS). The
> biggest problem I saw, other than the m_pullup's, was that
> clusters are too small: They need to be at least big enough for
> an ethernet packet (2K) and making them page sized (8K on a Sun)
> doesn't hurt and would let you do some quick page swap tricks in
> the user-system data copies (I didn't do any of these tricks in
> the fast TCP. I did use 2KB clusters to optimize things for the
> ethernet drivers).]
>
> An interesting result of the m_pullup removals was the death of
> another piece of folklore. I'd always heard that the limiting
> CPU was the receiver. Wrong. After the pullup changes, the
> sender would be maxed out at 100% CPU utilization while the
> receiver loafed along at 65-70% utilization (utilizations
> measured with a microprocessor analyzer; I don't trust the
> system's stats). In hindsight, this seems reasonable. At the
> receiver, a packet comes in, wanders up to the tcp layer, gets
> stuck in the socket buffer and an ack is generated (i.e., the
> processing terminates with tcp_input at the socket buffer). At
> the sender, the ack comes in, wanders up to the tcp layer, frees
> some space, then the higher level socket process has to be woken
> up to fill that space (i.e., the processing terminates with
> sosend, at the user socket layer). The way Unix works, this
> forces a boundary crossing between the bottom half (interrupt
> service) and top half (process context) of the kernel. On a
> Sun, and most of the other Unix boxes I know of, this is an
> expensive crossing. [Of course, the user process on the
> receiver side has to eventually wake up and empty the socket
> buffer but it gets to do that asynchronously and the dynamics
> tend to arrange themselves so it processes several packets on
> each wakeup, minimizing the boundary crossings.]
>
> Talking about the bottom half of the kernel reminds me of
> another major effect. There seemed to be a "sound barrier" at
> 550 KB/s. No matter what I did, the performance stuck at 550
> KB/s. Finally, I noticed that Sun's LANCE ethernet driver,
> if_le.c, would only queue one packet to the LANCE at a time.
> Picture what this means: (1) driver hands packet to LANCE, (2)
> LANCE puts packet on wire, (3) end of packet, LANCE interrupts
> processor, (4) interrupt dispatched to driver, (5) go back to
> (1). The time involved in (4) is non-trivial, more than 150us,
> and represents a lot of idle time for the LANCE and the wire.
> So, I rewrote the driver to queue an arbitrary number of packets
> to the LANCE, the sound barrier disappeared, and other changes
> started making the throughput climb (it's not clear whether this
> change had any effect on throughput or just allowed other
> changes to have an effect).
>
> [Shortly after making the if_le change, I learned why Sun might
> have written the driver the silly way they did: Before the
> change, the 6 back-to-back IP fragments of an NFS write would
> each be separated by the 150us interrupt service time. After
> the change, they were really back-to-back, separated by only the
> 9.6us minimum ethernet spacing (and, no, Sun does NOT violate
> the ethernet spec in any way, shape or form. After my last
> message on this stuff, Apollo & DEC people kept telling me Sun
> was out of spec. I've been watching packets on our ethernet for
> so long, I'm starting to learn the middle name of every bit.
> Sun bits look like DEC bits and Sun, or rather, the LANCE in the
> 3/50 & 3/60, completely complies with the timings in the blue
> book.) Anyway, the brain-dead Intel 82586 ethernet chip Sun
> puts in all its 3/180, 3/280 and Sun-4 file servers can't hack
> back-to-back, minimum spacing packets. Every now and again it
> drops some of the frags and wanders off to never-never land
> ("iebark reset"). Diskless workstations don't work well when
> they can't write to their file server and, until I hit on the
> idea of inserting "DELAY" macros in kudp_fastsend, it looked
> like I could have a fast TCP or a functional workstation but not
> both.]
>
> Probably 30% of the performance improvements came from fixing
> things in the Sun kernel. I mean like, really, guys: If the
> current task doesn't change, and it doesn't 80% of the time
> swtch is called, there's no need to do a full context save and
> restore. Adding the two lines
>
>     cmpl    _masterprocp,a0
>     jeq     6f                  | restore of current proc is easy
>
> just before the call to "resume" in sun3/vax.s:swtch got me a
> quick 70 KB/s performance increase but felt more like a bug fix
> than progress. And a kernel hacker that does 16-bit "movw"s and
> "addw"s on a 68020, or writes 2 instruction dbra loops designed
> to put a 68010 in loop mode, should be shot. The alu takes the
> same 3 clocks for a 2 byte add or a 4 byte add so things will
> finish a lot quicker if you give it 4 bytes at a time. And
> every branch stalls the pipe, so unrolling that loop to cut down
> on branches is a BIG win.
>
> Anyway, I recoded the checksum routine, ocsum.s (now oc_cksum.s
> because I found the old calling sequence wasn't the best way to
> do things) and its caller, in_cksum.c, and got the checksumming
> time down from 490us/KB to 130us/KB. Unrolling the move loop in
> copyin/copyout (the routines that move data user to kernel and
> kernel to user), got them down from 200us/KB to 140us/KB. (BTW,
> if you combine the move with the checksum, which I did in the
> kludged up, fast code that ran 1 MB/s on a 15MHz 3/50, it costs
> 200us/KB, not the 300us/KB you'd expect from adding the move
> and sum times. Pipelined processors are strange and wonderful
> beasts.)
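>
> To give the flavor of the recoding (a C sketch, not the actual
> oc_cksum.s assembler, which works 32 bits at a time): accumulate
> into a wide register and unroll the inner loop so branches are rare:
>
>     /* ones-complement sum of len bytes at p (len assumed to be a
>      * multiple of 16 just to keep the sketch short) */
>     unsigned short
>     cksum(unsigned short *p, int len)
>     {
>         unsigned long sum = 0;
>         int n = len >> 4;               /* 8 shorts (16 bytes) per pass */
>
>         while (n-- > 0) {
>             sum += p[0]; sum += p[1]; sum += p[2]; sum += p[3];
>             sum += p[4]; sum += p[5]; sum += p[6]; sum += p[7];
>             p += 8;
>         }
>         while (sum >> 16)               /* fold the carries back in */
>             sum = (sum & 0xffff) + (sum >> 16);
>         return ~sum & 0xffff;
>     }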
>
> From these times, you can work out most of the budgets and
> processing details: I was using 1408 data byte packets (don't
> ask why 1408). It takes 193us to copy a packet's worth of data
> and 184us to checksum the packet and its 40 byte header. From
> the logic analyzer, the LANCE uses 300us of bus and memory
> bandwidth reading or writing the packet (I don't understand why,
> it should only take half this). So, the variable costs are
> around 700us per packet. When you add the 18 byte ethernet
> header and 12 byte interpacket gap (1478 bytes, or roughly 11,800
> bit times, per packet on the wire), to run at 10 Mb/s I'd have
> to supply a new packet every 1200us. I.e., there's a budget of
> 500us for all the fixed costs (protocol processing, interrupt
> dispatch, device setup, etc.). This is do-able (I've done it,
> but not very cleanly) but what I'm getting today is a packet
> every 1500us. I.e., 800us per packet fixed costs. When I look
> with our analyzer, 30% of this is TCP, IP, ARP and ethernet
> protocol processing (this was 45% before the "header prediction"
> tcp mods), 15% is stalls (idle time that I don't currently
> understand but should be able to eventually get rid of) and 55%
> is device setup, interrupt service and task handling. I.e.,
> protocol processing is negligible (240us per packet on this
> processor and this isn't a fast processor in today's terms). To
> make the network go faster, it seems we just need to fix the
> operating system parts we've always needed to fix: I/O service,
> interrupts, task switching and scheduling. Gee, what a surprise.
>
> - Van
>
>
> BBoard-ID: 7621
> BB-Posted: Tue, 25 Oct 88 2:06:08 EDT
> To: tcp-ip at sri-nic.ARPA
> Subject: 4BSD TCP Ethernet Throughput
> Date: Mon, 24 Oct 88 13:33:13 PDT
> From: Van Jacobson <van at helios.ee.lbl.gov>
>
> Many people have asked for the Ethernet throughput data I
> showed at Interop so it's probably easier to post it:
>
> These are some throughput results for an experimental version of
> the 4BSD (Berkeley Unix) network code running on a couple of
> different MC68020-based systems: Sun 3/60s (20MHz 68020 with AMD
> LANCE Ethernet chip) and Sun 3/280s (25MHz 68020 with Intel
> 82586 Ethernet chip) [note again the tests were done with Sun
> hardware but not Sun software -- I'm running 4.?BSD, not Sun
> OS]. There are lots and lots of interesting things in the data
> but the one thing that seems to have attracted people's
> attention is the big difference in performance between the two
> Ethernet chips.
>
> The test measured task-to-task data throughput over a TCP
> connection from a source (e.g., chargen) to a sink (e.g.,
> discard). The tests were done between 2am and 6am on a fairly
> quiet Ethernet (~100Kb/s average background traffic). The
> packets were all maximum size (1538 bytes on the wire or 1460
> bytes of user data per packet). The free parameters for the
> tests were the sender and receiver socket buffer sizes (which
> control the amount of 'pipelining' possible between the sender,
> wire and receiver). Each buffer size was independently varied
> from 1 to 17 packets in 1 packet steps. Four tests were done at
> each of the 289 combinations. Each test transferred 8MB of data
> then recorded the total time for the transfer and the send and
> receive socket buffer sizes (8MB was chosen so that the worst
> case error due to the system clock resolution was ~.1% -- 10ms
> in 10sec). The 1,156 tests per machine pair were done in random
> order to prevent any effects from fixed patterns of resource
> allocation.
>
> In general, the maximum throughput was observed when the sender
> buffer equaled the receiver buffer (the reason why is complicated
> but has to do with collisions). The following table gives the
> task-to-task data throughput (in KBytes/sec) and throughput on
> the wire (in MBits/sec) for (a) a 3/60 sending to a 3/60 and
> (b) a 3/280 sending to a 3/60.
>
>      ________________________________________________
>     |        3/60 to 3/60         |  3/280 to 3/60   |
>     |     (LANCE to LANCE)        | (Intel to LANCE) |
>     | socket                      |                  |
>     | buffer    task to           |  task to         |
>     |  size      task     wire    |   task    wire   |
>     |(packets)  (KB/s)   (Mb/s)   |  (KB/s)  (Mb/s)  |
>     |    1        384      3.4    |   337     3.0    |
>     |    2        606      5.4    |   575     5.1    |
>     |    3        690      6.1    |   595     5.3    |
>     |    4        784      6.9    |   709     6.3    |
>     |    5        866      7.7    |   712     6.3    |
>     |    6        904      8.0    |   708     6.3    |
>     |    7        946      8.4    |   710     6.3    |
>     |    8        954      8.4    |   718     6.4    |
>     |    9        974      8.6    |   715     6.3    |
>     |   10        983      8.7    |   712     6.3    |
>     |   11        995      8.8    |   714     6.3    |
>     |   12       1001      8.9    |   715     6.3    |
>     |_____________________________|__________________|
>
> The theoretical maximum data throughput, after you take into
> account all the protocol overheads, is 1,104 KB/s (this
> task-to-task data rate would put 10Mb/s on the wire). You can
> see that the 3/60s get 91% of the theoretical max. The
> 3/280, although a much faster processor (the CPU performance is
> really dominated by the speed of the memory system, not the
> processor clock rate, and the memory system in the 3/280 is
> almost twice the speed of the 3/60), gets only 65% of
> theoretical max.
>
> The low throughput of the 3/280 seems to be entirely due to the
> Intel Ethernet chip: at around 6Mb/s, it saturates. (I put the
> board on an extender and watched the bus handshake lines on the
> 82586 to see if the chip or the Sun interface logic was pooping
> out. It was the chip -- it just stopped asking for data.) (The
> CPU was loafing along with at least 35% idle time during all
> these tests so it wasn't the limit).
>
> [Just so you don't get confused: Stuff above was measurements.
> Stuff below includes opinions and interpretation and should
> be viewed with appropriate suspicion.]
>
> If you graph the above, you'll see a large notch in the Intel
> data at 3 packets. This is probably a clue to why it's dying:
> TCP delivers one ack for every two data packets. At a buffer
> size of three packets, the collision rate increases dramatically
> since the sender's third packet will collide with the receiver's
> ack for the previous two packets (for buffer sizes of 1 and 2,
> there are effectively no collisions). My suspicion is that the
> Intel is taking a long time to recover from collisions (remember
> that you're 64 bytes into the packet when you find out you've
> collided so the chip bus logic has to back up 64 bytes -- Intel
> spent their silicon making the chip "programmable", I doubt they
> invested as much as AMD in the bus interface). This may or may
> not be what's going on: life is too short to spend debugging
> Intel parts so I really don't care to investigate further.
>
> The one annoyance in all this is that Sun puts the fast Ethernet
> chip (the AMD LANCE) in their slow machines (3/50s and 3/60s)
> and the slow Ethernet chip (Intel 82586) in their fast machines
> (3/180s, 3/280s and Sun-4s, i.e., all their file servers).
> [I've had to put delay loops in the Ethernet driver on the 3/50s
> and 3/60s to slow them down enough for the 3/280 server to keep
> up.] Sun's not to blame for anything here: It costs a lot
> to design a new Ethernet interface; they had a design for the
> 3/180 board set (which was the basis of all the other VME
> machines--the [34]/280 and [34]/110); and no market pressure to
> change it. If they hadn't ventured out in a new direction with
> the 3/[56]0 -- the LANCE -- I probably would have thought
> 700KB/s was great Ethernet throughput (at least until I saw
> Dave Boggs' DEC-Titan/Seeq-chip throughput data).
>
> But I think Sun is overdue in offering a high-performance VME
> Ethernet interface. That may change though -- VME controllers
> like the Interphase 4207 Eagle are starting to appear which
> should put pressure on Sun and/or offer a high
> performance 3rd party alternative (I haven't actually tried an
> Eagle yet but from the documentation it looks like they did a
> lot of things right). I'd sure like to take the delay loops out
> of my LANCE driver...
>
> - Van
>
> ps: I have data for Intel-to-Intel and LANCE-to-Intel as well as
> the Intel-to-LANCE I listed above. Using an Intel chip on the
> receiver, the results are MUCH worse -- 420KB/s max. I chose
> the data that put the 82586 in its very best light.
>
> I also have scope pictures taken at the transceivers during all
> these tests. I'm sure there'll be a chorus of "so-and-so violates
> the Ethernet spec" but that's a lie -- NONE OF THESE CHIPS OR
> SYSTEMS VIOLATED THE ETHERNET SPEC IN ANY WAY, SHAPE OR FORM.
> I looked very carefully for violations and have the pictures to
> prove there were none.
>
> Finally, all of the above is Copyright (c) 1988 by Van Jacobson.
> If you want to reproduce any part of it in print, you damn well
> better ask me first -- I'm getting tired of being misquoted in
> trade rags.
>
> From van at helios.ee.lbl.gov Mon Apr 30 01:44:05 1990
> To: end2end-interest at ISI.EDU
> Subject: modified TCP congestion avoidance algorithm
> Date: Mon, 30 Apr 90 01:40:59 PDT
> From: Van Jacobson <van at helios.ee.lbl.gov>
> Status: RO
>
> This is a description of the modified TCP congestion avoidance
> algorithm that I promised at the teleconference.
>
> BTW, on re-reading, I noticed there were several errors in
> Lixia's note besides the problem I noted at the teleconference.
> I don't know whether that's because I mis-communicated the
> algorithm at dinner (as I recall, I'd had some wine) or because
> she's convinced that TCP is ultimately irrelevant :). Either
> way, you will probably be disappointed if you experiment with
> what's in that note.
>
> First, I should point out once again that there are two
> completely independent window adjustment algorithms running in
> the sender: Slow-start is run when the pipe is empty (i.e.,
> when first starting or re-starting after a timeout). Its goal
> is to get the "ack clock" started so packets will be metered
> into the network at a reasonable rate. The other algorithm,
> congestion avoidance, is run any time *but* when (re-)starting
> and is responsible for estimating the (dynamically varying)
> pipesize. You will cause yourself, or me, no end of confusion
> if you lump these separate algorithms together (as Lixia's message did).
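>
> [For reference, in BSD-derived senders both adjustments end up in the
> same per-ack window-opening routine, selected by whether cwnd has
> reached ssthresh; a sketch (illustrative, not the actual tahoe
> source, all quantities in bytes):
>
>     unsigned long
>     opencwnd(unsigned long cwnd, unsigned long ssthresh, unsigned long mss)
>     {
>         if (cwnd < ssthresh)
>             return cwnd + mss;              /* slow-start: ~doubles per rtt */
>         return cwnd + mss * mss / cwnd;     /* cong. avoid: +1 mss per rtt  */
>     }
>
> Even though they may share a routine, the two branches are the
> logically separate algorithms described above.]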
>
> The modifications described here are only to the congestion
> avoidance algorithm, not to slow-start, and they are intended to
> apply to large bandwidth-delay product paths (though they don't
> do any harm on other paths). Remember that with regular TCP (or
> with slow-start/c-a TCP), throughput really starts to go to hell
> when the probability of packet loss is on the order of the
> reciprocal of the bandwidth-delay product (in packets). E.g.,
> you might expect a 1% packet
> loss rate to translate into a 1% lower throughput but for, say,
> a TCP connection with a 100 packet b-d p. (= window), it results
> in a 50-75% throughput loss. To make TCP effective on fat
> pipes, it would be nice if throughput degraded only as a function
> of loss probability rather than as the product of the loss
> probability and the b-d p. (Assuming, of course, that we can do
> this without sacrificing congestion avoidance.)
>
> These mods do two things: (1) prevent the pipe from going empty
> after a loss (if the pipe doesn't go empty, you won't have to
> waste round-trip times re-filling it) and (2) correctly account
> for the amount of data actually in the pipe (since that's what
> congestion avoidance is supposed to be estimating and adapting to).
>
> For (1), remember that we use a packet loss as a signal that the
> pipe is overfull (congested) and that packet loss can be
> detected one of two different ways: (a) via a retransmit
> timeout or (b) when some small number (3-4) of consecutive
> duplicate acks has been received (the "fast retransmit"
> algorithm). In case (a), the pipe is guaranteed to be empty so
> we must slow-start. In case (b), if the duplicate ack
> threshold is small compared to the bandwidth-delay product, we
> will detect the loss with the pipe almost full. I.e., given a
> threshold of 3 packets and an LBL-MIT bandwidth-delay of around
> 24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75%
> full (16 packets out, less the dropped one and the 3 that produced
> the dup acks) when fast-retransmit detects a loss (actually, until
> gateways start doing some sort of congestion control, the pipe
> is overfull when the loss is detected so *at least* 75% of the
> packets needed for ack clocking are in transit when
> fast-retransmit happens). Since the pipe is full, there's no
> need to slow-start after a fast-retransmit.
>
> For (2), consider what a duplicate ack means: either the
> network duplicated a packet (e.g., the NSFNet braindead IBM
> token ring adapters) or the receiver got an out-of-order packet.
> The usual cause of out-of-order packets at the receiver is a
> missing packet. I.e., if there are W packets in transit and one
> is dropped, the receiver will get W-1 out-of-order and
> (4.3-tahoe TCP will) generate W-1 duplicate acks. If the
> `consecutive duplicates' threshold is set high enough, we can
> reasonably assume that duplicate acks mean dropped packets.
>
> But there's more information in the ack: The receiver can only
> generate one in response to a packet arrival. I.e., a duplicate
> ack means that a packet has left the network (it is now cached
> at the receiver). If the sender is limited by the congestion
> window, a packet can now be sent. (The congestion window is a
> count of how many packets will fit in the pipe. The ack says a
> packet has left the pipe so a new one can be added to take its
> place.) To put this another way, say the current congestion
> window is C (i.e., C packets will fit in the pipe) and D
> duplicate acks have been received. Then only C-D packets are
> actually in the pipe and the sender wants to use a window of C+D
> packets to fill the pipe to its estimated capacity (C+D sent -
> D received = C in pipe).
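>
> [For instance: if C is 8 and 3 dup acks have arrived, only 5 of the
> 8 packets are still in the pipe; sending with an effective window of
> C+D = 11 puts 3 more packets out and brings the pipe back to its
> estimated capacity of 8.]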
>
> So, conceptually, the slow-start/cong.avoid/fast-rexmit changes
> are:
>
> - The sender's input routine is changed to set `cwnd' to `ssthresh'
> when the dup ack threshold is reached. [It used to set cwnd to
> mss to force a slow-start.] Everything else stays the same.
>
> - The sender's output routine is changed to use an effective window
> of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition
> of the `dupacks*mss' term.] `Dupacks' is zero until the rexmit
> threshold is reached, and is non-zero only while a sequence of
> duplicate acks is being received.
>
> The actual implementation is slightly different from the above
> because I wanted to avoid the multiply in the output routine
> (multiplies are expensive on some risc machines). A diff of the
> old and new fastrexmit code is attached (your line numbers will
> vary).
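>
> [Since the diffs are omitted at the end of this note, here is a rough
> sketch of the two changes in C.  The field names follow the BSD
> sources but the code below is illustrative, not the actual diff:
>
>     struct cb {             /* stand-in for the relevant tcpcb fields */
>         unsigned long snd_wnd, snd_cwnd, snd_ssthresh;
>         unsigned long t_maxseg, t_dupacks;
>     };
>     #define REXMT_THRESH 3
>
>     /* input side: called for each ack that doesn't advance snd_una */
>     void dup_ack(struct cb *tp)
>     {
>         if (++tp->t_dupacks == REXMT_THRESH) {
>             unsigned long win = tp->snd_cwnd < tp->snd_wnd ?
>                 tp->snd_cwnd : tp->snd_wnd;
>             tp->snd_ssthresh = win / 2;         /* halve the window...     */
>             if (tp->snd_ssthresh < 2 * tp->t_maxseg)
>                 tp->snd_ssthresh = 2 * tp->t_maxseg;
>             tp->snd_cwnd = tp->snd_ssthresh;    /* ...but don't slow-start */
>             /* (retransmit the segment at snd_una here) */
>         } else if (tp->t_dupacks > REXMT_THRESH)
>             tp->snd_cwnd += tp->t_maxseg;       /* each dup ack says a pkt
>                                                    left the net: inflate   */
>     }
>
>     /* input side: called when an ack finally advances snd_una */
>     void recovery_ack(struct cb *tp)
>     {
>         if (tp->t_dupacks >= REXMT_THRESH)
>             tp->snd_cwnd = tp->snd_ssthresh;    /* deflate back to half    */
>         tp->t_dupacks = 0;
>     }
>
> The output side still uses min(snd_wnd, snd_cwnd): the `dupacks*mss'
> term is folded into snd_cwnd itself, which is how the multiply is
> avoided.]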
>
> Note that we still do congestion avoidance (i.e., the window is
> reduced by 50% when we detect the packet loss). But, as long as
> the receiver's offered window is large enough (it needs to be at
> most twice the bandwidth-delay product), we continue sending
> packets (at exactly half the rate we were sending before the
> loss) even after the loss is detected so the pipe stays full at
> exactly the level we want and a slow-start isn't necessary.
>
> Some algebra might make this last clear: Say U is the sequence
> number of the first un-acked packet and we are using a window
> size of W when packet U is dropped. Packets [U..U+W) are in
> transit. When the loss is detected, we send packet U and pull
> the window back to W/2. But in the round-trip time it takes
> the U retransmit to fill the receiver's hole and an ack to get
> back, W-1 dup acks will arrive (one for each packet in transit).
> The window is effectively inflated by one packet for each of
> these acks so packets [U..U+W/2+W-1) are sent. But we don't
> re-send packets unless we know they've been lost so the amount
> actually sent between the loss detection and the recovery ack is
> (U+W/2+W-1) - (U+W) = W/2-1, which is exactly the amount congestion
> avoidance allows us to send (if we add in the rexmit of U). The
> recovery ack is for packet U+W so when the effective window is
> pulled back from W/2+W-1 to W/2 (which happens because the
> recovery ack is `new' and sets dupack to zero), we are allowed
> to send up to packet U+W+W/2 which is exactly the first packet
> we haven't yet sent. (I.e., there is no sudden burst of packets
> as the `hole' is filled.) Also, when sending packets between
> the loss detection and the recovery ack, we do nothing for the
> first W/2 dup acks (because they only allow us to send packets
> we've already sent) and the bottleneck gateway is given W/2
> packet times to clean out its backlog. Thus when we start
> sending our W/2-1 new packets, the bottleneck queue is as empty
> as it can be.
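>
> [A concrete instance: with W = 16, packets U..U+15 are outstanding
> when U is lost, and the receiver returns 15 dup acks.  The third one
> triggers the retransmit of U and pulls the window back to 8.  Dup
> acks 4 through 8 only re-authorize packets already sent (8 plus the
> dup-ack count doesn't pass the 16 outstanding until the 9th), so the
> bottleneck gets 8 packet times to drain; dup acks 9 through 15 then
> release the 7 = W/2-1 new packets U+16..U+22.  The recovery ack
> (for U+16) clears dupack, and the ordinary window of 8 from the new
> snd_una reaches just past U+22, so transmission continues with no
> burst.]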
>
> [I don't know if you can get the flavor of what happens from
> this description -- it's hard to see without a picture. But I
> was delighted by how beautifully it worked -- it was like
> watching the innards of an engine when all the separate motions
> of crank, pistons and valves suddenly fit together and
> everything appears in exactly the right place at just the right
> time.]
>
> Also note that this algorithm interoperates with old tcp's: Most
> pre-tahoe tcp's don't generate the dup acks on out-of-order packets.
> If we don't get the dup acks, fast retransmit never fires and the
> window is never inflated so everything happens in the old way (via
> timeouts). Everything works just as it did without the new algorithm
> (and just as slow).
>
> If you want to simulate this, the intended environment is:
>
> - large bandwidth-delay product (say 20 or more packets)
>
> - receiver advertising window of two b-d p (or, equivalently,
> advertised window of the unloaded b-d p but two or more
> connections simultaneously sharing the path).
>
> - average loss rate (from congestion or other source) less than
> one lost packet per round-trip-time per active connection.
> (The algorithm works at higher loss rates but the TCP selective
> ack option has to be implemented, otherwise the pipe will go empty
> waiting to fill the second hole and throughput will once again
> degrade as the product of the loss rate and b-d p. With selective
> ack, throughput is insensitive to b-d p at any loss rate.)
>
> And, of course, we should always remember that good engineering
> practice suggests a b-d p worth of buffer at each bottleneck --
> less buffer and your simulation will exhibit the interesting
> pathologies of a poorly engineered network but will probably
> tell you little about the workings of the algorithm (unless the
> algorithm misbehaves badly under these conditions but my
> simulations and measurements say that it doesn't). In these
> days of $100/megabyte memory, I dearly hope that this particular
> example of bad engineering is of historical interest only.
>
> - Van
>
> [...code diffs deleted...]
> Received: from rx7.ee.lbl.gov by uu2.psi.com (5.65b/4.0.071791-PSI/PSINet) via SMTP;
> id AA12583 for popbbn; Wed, 8 Sep 93 01:29:46 -0400
> Received: by rx7.ee.lbl.gov for craig at aland.bbn.com (5.65/1.44r)
> id AA05271; Tue, 7 Sep 93 22:30:15 -0700
> Message-Id: <9309080530.AA05271 at rx7.ee.lbl.gov>
> To: Craig Partridge <craig at aland.bbn.com>
> Cc: David Clark <ddc at lcs.mit.edu>
> Subject: Re: query about TCP header on tcp-ip
> In-Reply-To: Your message of Tue, 07 Sep 93 09:48:00 PDT.
> Date: Tue, 07 Sep 93 22:30:14 PDT
> From: Van Jacobson <van at ee.lbl.gov>
>
> Craig,
>
> As you probably remember from the "High Speed TCP" CNRI meeting,
> my kernel looks nothing at all like any version of BSD. Mbufs
> no longer exist, for example, and `netipl' and all the protocol
> processing that used to be done at netipl interrupt level are
> gone. TCP receive packet processing in the new kernel really is
> about 30 instructions on a RISC (33 on a sparc but three of
> those are compiler braindamage). Attached is the C code & the
> associated sparc assembler.
>
> A brief recap of the architecture: Packets go in 'pbufs' which
> are, in general, the property of a particular device. There is
> exactly one contiguous packet per pbuf (none of that mbuf
> chain stupidity). On the packet input interrupt, the device
> driver upcalls through the protocol stack (no more of that queue
> packet on the netipl software interrupt bs). The upcalls
> percolate up the stack until the packet is completely serviced
> (e.g., NFS requests that can be satisfied from in-memory data &
> data structures) or they reach a routine that knows the rest of
> the processing must be done in some process's context in which
> case the packet is laid on some appropriate queue and the
> process is unblocked. In the case of TCP, the upcalls almost
> always go two levels: IP finds the datagram is for this host &
> it upcalls a TCP demuxer which hashes the ports + SYN to find a
> PCB, lays the packet on the tail of the PCB's queue and wakes up
> any process sleeping on the PCB. The IP processing is about 25
> instructions & the demuxer is about 10.
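>
> [A very rough sketch of the structures this implies; the real pbuf
> and pcb layouts and the hash details aren't given in the message, so
> everything here is illustrative:
>
>     struct pbuf {
>         struct pbuf *next;          /* link on a pcb's input queue     */
>         int          len;           /* the single, contiguous packet   */
>         char        *dat;           /*   held in this buffer           */
>     };
>
>     struct pcb {
>         struct pcb  *hashnext;      /* chain for the port-hash lookup  */
>         unsigned short lport, fport;
>         struct pbuf *tp_inq;        /* queue tcp_usrrecv() spins on    */
>         int          wanted;        /* a reader is asleep on this pcb  */
>     };
>
>     extern struct pcb *pcb_hash[64];    /* illustrative hash table       */
>     extern void wakeup();               /* BSD-style sleep/wakeup        */
>
>     /* upcalled from ip at packet-input interrupt time, once the
>      * datagram is known to be for this host */
>     void tcp_demux(struct pbuf *pb, unsigned short lport, unsigned short fport)
>     {
>         struct pcb *tp;
>         struct pbuf *q;
>
>         for (tp = pcb_hash[(lport ^ fport) & 63]; tp; tp = tp->hashnext)
>             if (tp->lport == lport && tp->fport == fport)
>                 break;
>         if (tp == 0)
>             return;                 /* no connection -- drop (or rst)  */
>         pb->next = 0;               /* lay packet on tail of pcb queue */
>         if ((q = tp->tp_inq) == 0)
>             tp->tp_inq = pb;
>         else {
>             while (q->next)
>                 q = q->next;
>             q->next = pb;
>         }
>         if (tp->wanted)
>             wakeup(tp);             /* unblock the sleeping process    */
>     }
>
> Whatever the real kernel's details, the point is that the whole
> interrupt-level TCP path is just a lookup, an append and a wakeup.]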
>
> As Dave noted, the two processing paths that need the most
> tuning are the data packet send & receive (since at most every
> other packet is acked, there will be at least twice as many data
> packets as ack packets). In the new system, the receiving
> process calls 'tcp_usrrecv' (the protocol specific part of the
> 'recv' syscall) or is already blocked there waiting for new
> data. So the following code is embedded in a loop at the start of
> tcp_usrrecv that spins taking packets off the pcb queue until
> there's no room for the next packet in the user buffer or the
> queue is empty. The TCP protocol processing is done as we
> remove packets from the queue & copy their data to user space
> (and since we're in process context, it's possible to do a
> checksum-and-copy).
>
> Throughout this code, 'tp' points to the pcb and 'ti' points to
> the tcp header of the first packet on the queue (the ip header was
> stripped as part of interrupt level ip processing). The header info
> (excluding the ports, which are implicit in the pcb) is sucked out
> of the packet into registers [this is to minimize cache thrashing and
> possibly to take advantage of 64 bit or longer loads]. Then the
> header checksum is computed (tp->ph_sum is the precomputed pseudo-header
> checksum + src & dst ports).
>
> int tcp_usrrecv(struct uio* uio, struct socket* so)
> {
>     struct tcpcb *tp = (struct tcpcb *)so->so_pcb;
>     register struct pbuf* pb;
>
>     while ((pb = tp->tp_inq) != 0) {
>         register int len = pb->len;
>         struct tcphdr *ti = (struct tcphdr *)pb->dat;
>
>         u_long seq = ((u_long*)ti)[1];      /* sequence number */
>         u_long ack = ((u_long*)ti)[2];      /* ack number */
>         u_long flg = ((u_long*)ti)[3];      /* offset, flags, window */
>         u_long sum = ((u_long*)ti)[4];      /* checksum, urgent ptr */
>         u_long cksum = tp->ph_sum;
>
>         /* NB - ocadd is an inline gcc assembler function */
>         cksum = ocadd(ocadd(ocadd(ocadd(cksum, seq), ack), flg), sum);
>
> Next is the header prediction check which is probably the most
> opaque part of the code. tp->pred_flags contains snd_wnd (the
> window we expect in incoming packets) in the bottom 16 bits and
> 0x5x10 in the top 16 bits. The 'x' is normally 0 but will be
> set non-zero if header prediction shouldn't be done (e.g., if
> not in established state, if retransmitting, if hole in seq
> space, etc.). So, the first term of the 'if' checks four
> different things simultaneously:
> - that the window is what we expect
> - that there are no tcp options
> - that the packet has ACK set & doesn't have SYN, FIN, RST or URG set
> - that the connection is in the right state
> and the 2nd term of the if checks that the packet is in sequence:
>
> #define FMASK (((0xf000 | TH_SYN|TH_FIN|TH_RST|TH_URG|TH_ACK) << 16) | 0xffff)
>
>         if ((flg & FMASK) == tp->pred_flags && seq == tp->rcv_nxt) {
>
> The next few lines are pretty obvious -- we subtract the header
> length from the total length. If it's less than zero, the packet
> was malformed; if it's zero, we must have a pure ack packet & we
> do the ack stuff; otherwise, if the ack field didn't move, we have
> a pure data packet which we copy to the user's buffer, checksumming
> as we go, then update the pcb state if everything checks:
>
>             len -= 20;
>             if (len <= 0) {
>                 if (len < 0) {
>                     /* packet malformed */
>                 } else {
>                     /* do pure ack things */
>                 }
>             } else if (ack == tp->snd_una) {
>                 cksum = in_uiomove((u_char*)ti + 20, len, uio, cksum);
>                 if (cksum != 0) {
>                     /* packet or user buffer errors */
>                 }
>                 seq += len;
>                 tp->rcv_nxt = seq;
>                 if ((int)(seq - tp->rcv_acked) >= 0) {
>                     /* send ack */
>                 } else {
>                     /* free pbuf */
>                 }
>                 continue;
>             }
>         }
>         /* header prediction failed -- take long path */
>         ...
>
> That's it. On the normal receive data path we execute 16 lines of
> C which turn into 33 instructions on a sparc (it would be 30 if I
> could trick gcc into generating double word loads for the header
> & my carefully aligned pcb fields). I think you could get it down
> around 25 on a cray or big-endian alpha since the loads, checksum calc
> and most of the compares can be done on 64 bit quantities (e.g.,
> you can combine the seq & ack tests into one).
>
> Attached is the sparc assembler for the above receive path. Hope
> this explains Dave's '30 instruction' assertion. Feel free to
> forward this to tcp-ip or anyone that might be interested.
>
> - Van
>
> ----------------
>         ld [%i0+4],%l3          ! load packet tcp header fields
>         ld [%i0+8],%l4
>         ld [%i0+12],%l2
>         ld [%i0+16],%l0
>
>         ld [%i1+72],%o0         ! compute header checksum
>         addcc %l3,%o0,%o3
>         addxcc %l4,%o3,%o3
>         addxcc %l2,%o3,%o3
>         addxcc %l0,%o3,%o3
>
>         sethi %hi(268369920),%o1 ! check if hdr. pred possible
>         andn %l2,%o1,%o1
>         ld [%i1+60],%o2
>         cmp %o1,%o2
>         bne L1
>         ld [%i1+68],%o0
>         cmp %l3,%o0
>         bne L1
>         addcc %i2,-20,%i2
>         bne,a L3
>         ld [%i1+36],%o0
>         ! packet error or ack processing
>         ...
> L3:
>         cmp %l4,%o0
>         bne L1
>         add %i0,20,%o0
>         mov %i2,%o1
>         call _in_uiomove,0
>         mov %i3,%o2
>         cmp %o0,0
>         be L6
>         add %l3,%i2,%l3
>         ! checksum error or user buffer error
>         ...
> L6:
>         ld [%i1+96],%o0
>         subcc %l3,%o0,%g0
>         bneg L7
>         st %l3,[%i1+68]
>         ! send ack
>         ...
>         br L8
> L7:
>         ! free pbuf
>         ...
> L8:                             ! done with this packet - continue
>         ...
>
> L1:                             ! hdr pred. failed - do it the hard way