[ih] Loss as a congestion signal [internet-history Digest, Vol 84, Issue 4]

Noel Chiappa jnc at mercury.lcs.mit.edu
Tue Jun 3 19:41:03 PDT 2014


    > From: Detlef Bosau <detlef.bosau at web.de>

    > it can well make sense to do error correction at the transport layer,
    > particularly when retransmissions on demand aren't feasible or too
    > expensive
    > However, both retransmission and error correction are annoying for
    > the rest of the world: both require resources which are then no
    > longer available to others.

Well, yes and no. Here are some thoughts I've had thinking about this.


First, start with the point that the endpoint _pretty much_ _has_ to have the
mechanism to recognize that a packet has been lost, and retransmit it - no
matter what the rest of the design looks like.

Why? Because otherwise the network must never, ever, lose data: if the host,
once it has sent a packet, cannot reliably notice that the packet has been
lost and re-send it, then the network is simply not allowed to lose that
packet.

That means the network has to be a lot more complex: switches have to have a
lot of state, they have to have their own mechanism for doing
acknowledgements - since an upstream switch cannot discard its copy of a
packet until the downstream has definitely gotten a copy - and the upstream
has to hold the packet until the downstream acks, etc. etc.

(In fact, you wind up with something that looks a lot like the ARPANET.)
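The endpoint-side mechanism the argument assumes can be sketched in a few lines. This is purely illustrative (the class and its names are mine, not from any real stack): the sender keeps a copy of every unacknowledged packet and re-sends on timeout, which is exactly what lets the network drop packets freely.

```python
import time

class ReliableSender:
    """Endpoint-side reliability (illustrative sketch): keep a copy of
    every unacked packet and re-send it when no ack arrives in time.
    The network itself is then free to lose packets."""

    def __init__(self, send, timeout=1.0):
        self.send = send          # callable that puts a packet on the (lossy) network
        self.timeout = timeout    # how long to wait before re-sending
        self.unacked = {}         # seq -> (packet, time last sent)

    def transmit(self, seq, packet):
        self.send(seq, packet)
        self.unacked[seq] = (packet, time.time())

    def on_ack(self, seq):
        # Only an ack from the far endpoint lets us discard our copy.
        self.unacked.pop(seq, None)

    def tick(self):
        # Re-send anything that has been outstanding too long.
        now = time.time()
        for seq, (packet, sent_at) in list(self.unacked.items()):
            if now - sent_at > self.timeout:
                self.send(seq, packet)
                self.unacked[seq] = (packet, now)
```

Note that all the state lives at the sender; nothing in the middle of the network has to remember anything for this to work.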

And even if the design has all that mechanism/state/complexity built in, it's
_still_ not really guaranteed: what happens if the switch with the only copy
of a packet dies? (Unless the design adopts the rule that there must always be
_two_ switches with copies of a packet - even more complexity/etc.)

There are good architectural reasons why the endpoint is given the ultimate
responsibility for making sure the data gets through: For one, it's really
not possible to get someone else to do the job as well as the endpoint can
(see above). This is fate-sharing / the end-end principle.

For another, once the design does that, the switches become a _lot_ simpler -
an additional benefit. When you see things start to line up that way, it's
probably a sign that you have found what Erdos would have called 'the design
in The Book'.


So, since the design _has_ to have end-end retransmission, adding any other
kind of re-transmission is - necessarily - just an optimization.

And to decide whether an optimization is worth it, one has to look at a
number of aspects: how much complexity it adds, how much it improves
performance, etc., etc.

I understand your feeling that 'doing the retransmission on an end-end basis
wastes resources', but... doing local retransmission _as well_ as end-end
retransmission (which one _has_ to have - see above) is going to make things
more complicated - perhaps _significantly_ more complicated. Just how much
more, depends on exactly what is done, and how.

E.g., does the mechanism only re-send a packet when a damaged packet is
received at the down-stream, the damage is not so bad that the down-stream
cannot figure out which packet it was, and the down-stream then asks the
up-stream for a re-transmission? That is less complex than the down-stream
acking every packet, _but_ ... the up-stream _still_ has to keep a copy of
every packet (in case it needs to re-transmit) - and with no acks, how does
it decide when to discard that copy? Add acks, and the up-stream knows for
sure when it can ditch its copy - but now there's a lot more complexity,
state, etc.
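The discard problem can be made concrete. In this hypothetical hop-by-hop sketch (the class and method names are mine, for illustration only), a NACK can trigger a local resend, but only a per-packet ack ever tells the upstream with certainty that a held copy may be freed:

```python
class UpstreamHop:
    """Illustrative hop-by-hop retransmission buffer. With NACKs only,
    the upstream never knows when a copy is safe to drop; per-packet
    acks fix that, at the cost of ack state for every packet."""

    def __init__(self, forward):
        self.forward = forward    # send toward the downstream switch
        self.held = {}            # seq -> packet: copies we may need to resend

    def relay(self, seq, packet):
        self.held[seq] = packet   # must keep a copy until... when, exactly?
        self.forward(seq, packet)

    def on_nack(self, seq):
        # Downstream got a damaged packet but could still identify it.
        if seq in self.held:
            self.forward(seq, self.held[seq])

    def on_ack(self, seq):
        # Only with acks can the copy be discarded with certainty.
        self.held.pop(seq, None)
```

Without `on_ack`, `held` grows without bound unless the hop guesses with a hold timer - which is precisely the extra state and mechanism the text is pointing at.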

Doing local re-transmission is a _lot_ more complexity (and probably limits
performance) - and does it buy you enough to make it worth it? Probably not...


This is, I gather, basically the line of reasoning that the original
designers went through, which led them to the 'smart endpoints, stupid
switches' approach.

Yes, it means that sometimes you wind up using 'extra' resources, but,
_overall_, the design _as a whole_ is simpler, can have much higher
performance (a switch sends a packet, then _immediately_ throws away its
copy), etc.
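By contrast, the whole forwarding path of a 'stupid switch' is trivial - again a purely illustrative sketch, with the function and parameter names being my own:

```python
def stateless_switch(packet, route, links):
    """'Stupid switch' forwarding (illustrative): pick the output link
    and enqueue the packet. No copy is kept, so there is nothing to
    ack, time out, buffer, or clean up afterwards."""
    links[route(packet)].append(packet)
    # ...and we're done: the packet is no longer this switch's problem.
```

All the reliability machinery from the earlier sketches lives at the endpoints; the switch itself keeps no per-packet state at all.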


    > In my view, it is one of our central fallacies that we intertwined
    > error _correction_, error _recovery_ and congestion detection.
    > These are completely different issues and should be carefully
    > distinguished.

Perhaps. But one needs to think through all the details, and I'm still
doing that.

(E.g. with Source Quench, the previous discussion showed that the
security/etc issues surrounding it as a 'reliable' congestion signal are
non-trivial when you start to look into it.)

    > When in a chain of 100 hops and links the last link is lossy, so a
    > packet cannot be delivered, it is not always the best solution
    > - to retransmit it locally,
    > - to retransmit it end to end, if the link is lossy, the packet will be
    > corrupted anyway, no matter whether it was sent end to end or locally

Sure, but (as I showed above), doing local retransmission means a lot of
extra complexity - state, buffering, acks, yadda-yadda.

Doing local retransmission may make sense on _some_ links, but doing it _as a
system-wide architecture_ probably doesn't make sense - because it can only
be an optimization, and it's one with severe costs (complexity, etc).

	Noel


