[ih] error detection

Thu Oct 1 15:22:23 PDT 2020

This is a fascinating discussion.

There was a time, I think, when people thought the hardware link checksum was sufficient, and indeed CRC-32 is much better than the TCP software checksum.

Folks quickly realized that important parts of the overall system were not protected: datapaths inside adapters, memory bus transfers, bad memory in hosts and routers, etc.

Consequently, an end-to-end checksum is essential.

Craig put succinctly the properties of CRC-32 (and of cyclic codes in general): with an n-bit checksum, they detect 100% of single burst errors shorter than the checksum, and 1-2^-n of all other errors.  It is not even clear
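The burst-error property quoted above can be checked empirically. This is only a sketch, assuming Python's zlib.crc32 as the CRC-32 implementation; the helper names flip_burst and undetected_bursts are made up for illustration. Bits are numbered LSB-first within each byte, since that is the order in which zlib's reflected CRC-32 consumes them.

```python
import random
import zlib


def flip_burst(msg: bytes, start: int, pattern: int) -> bytes:
    """XOR `pattern` into `msg` at bit offset `start` (LSB-first bit order)."""
    as_int = int.from_bytes(msg, "little")
    as_int ^= pattern << start
    return as_int.to_bytes(len(msg), "little")


def undetected_bursts(trials: int = 2000, msg_len: int = 256, seed: int = 1) -> int:
    """Count random bursts of 1..32 bits that leave the CRC-32 unchanged."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        msg = bytes(rng.getrandbits(8) for _ in range(msg_len))
        width = rng.randint(1, 32)
        # A burst of `width` bits: first and last bit flipped, the rest random.
        pattern = rng.getrandbits(width) | 1 | (1 << (width - 1))
        start = rng.randint(0, msg_len * 8 - width)
        corrupted = flip_burst(msg, start, pattern)
        if zlib.crc32(corrupted) == zlib.crc32(msg):
            missed += 1
    return missed


if __name__ == "__main__":
    print("undetected bursts:", undetected_bursts())
```

Every such burst should be caught, because the error polynomial has degree below 32 after factoring out a power of x and so cannot be a multiple of the degree-32 generator. Errors longer than the checksum get no such guarantee.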

The properties that make a good end-to-end checksum are a little different:

* you’d like to detect all or nearly all the common types of errors, such as memory addressing errors, core clobbers, etc.
* you’d like them to be <modifiable> if possible, so that a router can calculate a change to a checksum without recomputing the whole thing, possibly from erroneous data.
* you’d like them to be very fast, so they can run at memory bandwidth speeds
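As one illustration of the <modifiable> property, the 16-bit ones' complement Internet checksum can be updated from just the old and new values of a changed word, using the incremental formula from RFC 1624 (HC' = ~(~HC + ~m + m')). The sketch below is not taken from the discussion; the function names and the sample IPv4 header are assumptions chosen for the example.

```python
def fold16(total: int) -> int:
    """Fold carries back in until the ones' complement sum fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Full computation: ones' complement of the ones' complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    return ~fold16(total) & 0xFFFF


def incremental_update(old_cksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 eqn. 3: update the checksum when one 16-bit word changes."""
    total = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    return ~fold16(total) & 0xFFFF


if __name__ == "__main__":
    # Hypothetical IPv4 header with the checksum field zeroed; TTL=64, proto=1.
    header = bytes.fromhex("450000540000400040010000c0a80001c0a800c7")
    old = internet_checksum(header)
    # A router decrements TTL: the TTL/protocol word goes 0x4001 -> 0x3f01.
    new = incremental_update(old, 0x4001, 0x3F01)
    print(hex(old), hex(new))
```

The update touches only the changed word, so the router never re-reads (and never re-blesses) the rest of the data. Cryptographic hashes offer no comparable shortcut.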

The last requirement is a real problem, because we now have things like 100G interfaces, RDMA, and zero-copy data delivery to the end application.  When exactly is the software going to pick up every byte?

I think preferring error-detecting codes over cryptographic hashes is a mistake.  CPUs now include AES hardware, and it can be faster than any software alternative.  Sure, it doesn’t catch every burst error less than the block size, but who cares?  You get protection against actual adversaries in addition to protection against random and (most) burst errors.

In practice, all that is necessary is to push the undetected error rate below that of the next most likely cause of trouble.  Undetected disk read errors, for example, are around 10^-14 to 10^-16, which is the equivalent of about 48-50 bit CRCs.  It seems likely that AES-computed hashes, at 2^-128, are not going to be a problem for a long time.  (And this is why people with a lot of disks use end-to-end file checksums as well.)
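The correspondence between an error rate and a checksum width is just a base-2 logarithm: an n-bit check that misses a fraction 2^-n of errors matches a rate p when n = -log2(p). A quick sketch (the function name is made up for illustration):

```python
import math


def equivalent_check_bits(undetected_rate: float) -> float:
    """Width n of a check whose miss rate 2^-n equals `undetected_rate`."""
    return -math.log2(undetected_rate)


if __name__ == "__main__":
    for rate in (1e-14, 1e-16, 2.0 ** -128):
        print(f"{rate:.3g} -> {equivalent_check_bits(rate):.1f} bits")
```

By this arithmetic 10^-14 to 10^-16 works out to roughly 46.5 to 53 bits, which brackets the 48-50 bit figure, and either way is far short of 128.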

https://www.jandrewrogers.com/2019/03/06/aquahash/ says that several-year-old CPUs like Skylake can do AES hashes at 15 bytes/cycle, which is... impressively fast.

Cryptographic hashes don’t solve the modifiable issue, but I suspect their other benefits are more important.

Other useful references: https://www.ieee802.org/3/hssg/public/nov07/gustlin_01_1107.pdf