[ih] Yet another subject change: Testing (Was Re: Gateway Issue: Certification (was Re: booting linux on a 4004))
Karl Auerbach
karl at iwl.com
Thu Oct 3 11:31:57 PDT 2024
My grandfather was a radio repair guy; my father repaired TVs that
other repair people could not fix. So I grew up with my hands inside
electronics learning how to figure out what was going wrong and what to
do about it. (I also learned a lot about keeping my fingers clear of
high voltages - some day ask me about how the phrase "beating the
bounds" [with regard to land titles] came about, and yes, there is an
analogy to high voltage shocks.)
I've carried that family history (of repairing, not shocking) into the
land of networks.
I am extremely concerned, and I mean *extremely* concerned, that our
race to lock and secure things is slowly making it increasingly
difficult for us to monitor, diagnose, and repair the Internet (and the
increasing number of other important infrastructures that have become
intermeshed with the net.)
I wrote a note about this issue:
Is The Internet At Risk From Too Much Security
https://www.cavebear.com/cavebear-blog/netsecurity/
My experience with designing, deploying, and running the Interop show
networks taught me that we have few decent tools. I looked in awe at
the collection of well designed tools that AT&T guys (they were
always guys in that era) had dangling from their tool belts. So I
designed and sold the first Internet buttset - a tool to get one up and
running within seconds to do testing and evaluation of an IP (and
Netware) network. (The tool was "Dr. Watson, The Network Detective's
Assistant" - https://www.cavebear.com/archive/dwtnda/ . However, I was
learning about how to run a company at that time and I didn't watch,
much less control, what my marketing group was spending - so we went
under. I then helped Fluke pick up some of the remnant ideas for their
products.)
Anyway, I have been bothered at how few test points we build into
network software. Even one of the most fundamental - remote loopback -
is barely present in network equipment (yes, we have ICMP Echo/ping,
but that's rather primitive). And I've long worked with SNMP and MIBs.
(I wrote and implemented an alternative to SNMP and Netconf that I
thought was much more useful than either: KNMP at
https://www.iwl.com/idocs/knmp-overview )
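To make the remote-loopback point concrete: about the best a generic
IP host offers today, beyond ping, is the old UDP Echo service of RFC
862 - and even that is almost always switched off. A minimal sketch of
such a probe in Python (illustrative only; the payload and timeout are
my own assumptions):

    import socket

    def udp_loopback_probe(host, port=7, payload=b"loopback-test",
                           timeout=2.0):
        """Crude remote loopback check using the RFC 862 UDP Echo
        service: send a payload and see if the far end reflects it."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _ = s.recvfrom(2048)
            return data == payload
        except socket.timeout:
            return None   # no echo service answered - the usual case

That this is roughly the state of the art is exactly my complaint.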
My wife (Chris Wellens) and wrote up a paper in 1996 titled "Towards
Useful Management" in which we made several proposals to improve our
means to monitor and test networks.
https://www.cavebear.com/docs/simple-times-vol4-num3.pdf
In the meantime Marshall Rose and my wife spun a new company,
Interworking Labs, out from the Interop company. The initial purpose
was to develop test suites for network protocols. (These suites still
exist and often reveal mistakes in network code. One of my favorites
is to repackage Ethernet frames that have short IP packets inside them.
The IP packet is put into an Ethernet frame that is larger than it
needs to be to hold that IP packet. (Some vendors have used that space
to do things like announcing license identifiers in the unused space
in an Ethernet frame after an ARP packet.) Far too much code uses the
Ethernet frame length rather than properly using the IP length fields -
bad things can happen as a result. And there is still code out there
that uses signed integer math on unsigned integer packet fields - so a
lot of code still wobbles if one tickles packets with numbers just
below or just above the point where that high-order bit toggles.)
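The frame-length mistake is easy to show. Here is a rough Python
sketch (the names and checks are mine, for illustration) of the right
way to pull an IPv4 packet out of a padded Ethernet frame:

    import struct

    def ip_packet_from_frame(frame: bytes) -> bytes:
        """Extract the IPv4 packet, trusting the IP Total Length
        field rather than the (possibly padded) frame length."""
        if len(frame) < 14 + 20:
            raise ValueError("frame too short")
        ethertype = struct.unpack("!H", frame[12:14])[0]
        if ethertype != 0x0800:
            raise ValueError("not IPv4")
        ip = frame[14:]
        # Total Length: an unsigned 16-bit field at IP header offset 2.
        total_len = struct.unpack("!H", ip[2:4])[0]
        if total_len < 20 or total_len > len(ip):
            raise ValueError("bad IP Total Length")
        # The buggy code I describe would just return all of `ip`,
        # silently treating any Ethernet padding as IP payload.
        return ip[:total_len]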
Jon Postel came up with a testing idea for the bakeoff test events we
had at places like FTP Software and ISI - a router that does things
wrong in a controlled way. A few years later Steve Casner and I were
working to develop a portable RTP/RTCP engine for entertainment grade
audio/video (on IP multicast); we longed for a device such as Jon's
"flakeway" because of the need to evaluate all of the potential race
conditions that can happen when running several related media streams in
real time.
So a few years later at Interworking Labs we started to develop Jon's
flakeway into a real tool. We called the line "Maxwell" after James
Clerk Maxwell's thought experiment about a daemon that could select and
control the flow of hot and cold particles, seemingly violating the laws
of Thermodynamics. It is still rather surprising how much code out
there wobbles (or worse) when faced with simple network behaviour such
as packet order resequencing (such as can happen when there are
parallel, load-balanced, or bonded network paths), or when packets are
accumulated for a short while and then suddenly released (as if a dam,
holding back a lake of packets, suddenly burst).
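At its heart the idea is almost embarrassingly simple. A toy Python
model of that dam-burst-plus-reorder behaviour (a sketch of the
concept, not our Maxwell code; the parameters are arbitrary):

    import random
    from collections import deque

    class Flakeway:
        """Hold packets back, then release them in a burst,
        sometimes reordered - misbehaviour in a controlled way."""
        def __init__(self, hold_max=8, reorder_prob=0.2):
            self.queue = deque()
            self.hold_max = hold_max
            self.reorder_prob = reorder_prob

        def offer(self, pkt):
            """Accept one packet; return a (possibly empty) list
            of packets to forward right now."""
            self.queue.append(pkt)
            if len(self.queue) < self.hold_max:
                return []               # keep building the lake
            burst = list(self.queue)
            self.queue.clear()
            if random.random() < self.reorder_prob:
                random.shuffle(burst)   # resequence the burst
            return burst                # the dam bursts

What surprises me is not that one can build this, but how much
deployed code falls over when it meets it.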
I have seen many network test suites that check that a protocol
implementation complies with the mandatory or suggested parts of RFCs.
Those are nice. But my concern is on the other side of the RFCs: what
about the DO NOT cases and the undefined cases, and what happens when
those situations arise?
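One concrete habit from that other side: when a header field is an
unsigned 16-bit integer, probe right around the spot where the
high-order bit toggles - that is where the signed-arithmetic bugs I
mentioned above like to hide. A tiny Python helper (illustrative):

    def boundary_values(bits=16):
        """Values just below and above the high-order-bit toggle,
        plus the extremes - the classic hiding places for code that
        does signed math on unsigned packet fields."""
        top = 1 << (bits - 1)            # 0x8000 for 16-bit fields
        return [0, top - 1, top, top + 1, (1 << bits) - 1]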
For instance, I remember Dave Bridgham (FTP Software) one afternoon
saying "You know, if I received the last IP fragment first I would have
information that let me do better receive buffer allocation." So he
changed the FTP Software IP stack to send last fragment first. It
worked. That is it worked until an FTP Software based machine was added
to a network running competitor Netmanage TCP/IP code. That latter code
simply up and died when it got the last fragment first.
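Dave's trick works because an IPv4 fragment carries its own offset,
and the final fragment is marked by a clear More Fragments flag - so
that one fragment tells you the whole datagram's size. A sketch of the
receive-side arithmetic (my own illustrative code):

    import struct

    def datagram_size_from_last_fragment(ip_header: bytes,
                                         fragment_payload_len: int):
        """If the final fragment (MF clear, nonzero offset) arrives
        first, its offset plus its payload length is exactly the
        reassembly buffer size needed."""
        flags_frag = struct.unpack("!H", ip_header[6:8])[0]
        mf = flags_frag & 0x2000            # More Fragments flag
        offset = (flags_frag & 0x1FFF) * 8  # offset, 8-byte units
        if mf == 0 and offset > 0:          # the last fragment
            return offset + fragment_payload_len
        return None                         # can't tell yet

And, as the Netmanage stack demonstrated, code on the other end may
never have been tested against that perfectly legal arrival order.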
And at a TCP bakeoff I had a tool to test ARP, a protocol that has many
knobs and levers that are rarely used. I managed to generate a
broadcast ARP packet that used some of those knobs and levers. That ARP
hit the router between our test networks and the host company's main
network - that router crashed, but before it did, it (for some reason)
propagated that ARP further along, causing every other (I believe
Proteon) router in the company to also crash.
We found a lot of things like that on the Interop show network. (I
usually got blamed because I was usually near, if not operating, the
device that triggered the flaws.) One of the worst was a difference in
opinion between Cisco and Wellfleet routers about how to expand IP
multicast packets into Ethernet frames (in particular, what group MAC
addresses to use), resulting in an infinite IP multicast routing loop
across the show net - every load LED on every one of our
hundreds of routers and switches turned red. (And, of course, all
fingers pointed at me. ;-)
The Interop show net was a wonderful place to discover flaws in protocol
standards and implementations. One of our team members (who I believe
is on this list) found a flaw in the FDDI standard. I have a memory of
companies reworking their code and blasting new firmware overnight in
their hotel rooms.
The point of this long note is that the state of the art of testing
Internet protocol implementations is weak. It's not an exciting
field, QA people are not honored employees, and as more and more
people believe (often quite wrongly) that they can write code, we are
actually moving
backwards in some regards.
In addition, we do not adequately consider monitoring, testing, and
repair in our work defining protocols.
In 2003 I gave a long talk with a title that is now a bit misleading:
From Barnstorming to Boeing –
Transforming the Internet Into a Lifeline Utility.
(The slides are at
https://www.cavebear.com/archive/rw/Barnstorming-to-Boeing-slides.pdf
and the speaker notes at
https://www.cavebear.com/archive/rw/Barnstorming-to-Boeing.pdf )
(One of my suggestions was the imposition of legal, civil tort,
liability for network design, implementation, and operational errors -
using a negligence standard so that simple mistakes would not incur
liability. Wow, the groans from the audience were quite loud.)
I had other suggestions as well - such as design rules and operational
practices that must be followed unless the person looking to deviate
could express a compelling, cogent, argument why deviation is
appropriate. This is the norm in many engineering disciplines, but
not for software, where we are largely still in the anything-goes
wild west.
By-the-way, I have over the years been working on ideas to advance our
testing/repair capabilities.
One piece that we are missing is a database of network pathology. I am
thinking here of a database of symptoms that are tied to possible causes
and tests to distinguish among those causes. (Yes, I am taking a cue
from the practice of medicine.) Once we have such a database one could
build tools to do symptom-to-cause reasoning, including running
diagnostic tests to work through the branches of the possible
causation
tree. To do this right one needs trusted test agents disseminated
throughout the network - the word "trusted" is important because network
tests can be intrusive, sharp, and dangerous, like a surgeon's scalpel.
(Imagine a world where surgeons were required to use dull, but safe
plastic butter knives rather than sharp scalpels.)
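I imagine an entry in such a database looking roughly like this (a
hypothetical shape, in Python, not a real schema - the symptom and
tests are invented examples):

    # One pathology entry: a symptom, candidate causes, and the
    # tests that discriminate among those causes.
    PATHOLOGY_ENTRY = {
        "symptom": "TCP throughput collapses on long transfers",
        "causes": [
            {"cause": "path MTU black hole",
             "test": "send DF-flagged probes of decreasing size"},
            {"cause": "bufferbloat in an intermediate queue",
             "test": "compare idle vs. loaded round-trip times"},
        ],
    }

    def next_tests(entry):
        """Walk the causation tree: yield the next discriminating
        test for each still-plausible cause."""
        for candidate in entry["causes"]:
            yield candidate["test"]

A diagnostic engine would run those tests (via the trusted agents),
prune the causes the results rule out, and recurse down the tree.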
Baseline records are important - and we do gather some of that, but we
always want more detail. But the amount of data to be collected is
voluminous and is subject to concerns about how it could be used
competitively. (This is why in our Interworking Labs test contracts we
prohibit the publishing of results to the public - we want to encourage
correction for the benefit of us all rather than creation of competitive
cudgels.)
(One element that I've slowly been working on in my zero free time is a
precisely timed beacon and precisely timed listeners - all tightly
synchronized to GPS time. The idea is for beacons to take subscriptions
from listeners and then to emit highly predictable patterns of packets
of various sizes and timings. I've been meaning to corner some of my
astrophysicist friends to adopt some of their methods of using that kind
of predictable behaviour, observed at a distance, to evaluate what lies
between the beacon's hither and the listener's yon. [And yes, I did
pick up some ideas from Van J's pathchar and Bruce Mah's
re-implementation as pchar.])
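The transmit side is conceptually trivial - the hard work is in the
clock discipline and the listener's analysis. A Python sketch of the
beacon's schedule (assuming the system clock is already
GPS-disciplined; the sizes and period are arbitrary choices of mine):

    import socket
    import time

    def run_beacon(listeners, period=1.0, sizes=(64, 512, 1400)):
        """Emit a fixed, predictable pattern of UDP packets on
        whole-period boundaries of the (assumed GPS-disciplined)
        clock, so distant listeners can compare arrival times
        against the known schedule."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        seq = 0
        while True:
            # sleep until the next whole-period boundary
            time.sleep(period - (time.time() % period))
            for size in sizes:
                payload = seq.to_bytes(8, "big").ljust(size, b"\x00")
                for addr in listeners:
                    sock.sendto(payload, addr)
            seq += 1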
I am also thinking that we need some legal and accounting rule changes
so that vendors are more able to share improvements and tests without
running afoul of restraint-of-trade laws or damaging their balance
sheets and that ever-present, false fable of "shareholder value".
--karl--