[ih] Yet another subject change: Testing (Was Re: Gateway Issue: Certification (was Re: booting linux on a 4004))
Jack Haverty
jack at 3kitty.org
Thu Oct 3 18:21:45 PDT 2024
Hi Karl,
You made me drag out my ancient notebooks to look at the "Problem List"
that I wrote down from the first ICCB meeting back on September 21,
1981. The list included "test and verification of components" and
"instrumentation and operational support".
So yes, there should also be "Gateway Issue: Operations Tools". Perhaps
it should be "Internet Issue:...", since Gateways (routers) are just one
component involved.
You and I at least are on the same page, extremely concerned about tools
and techniques for operating networks.
When Vint gave me the assignment to "make the core gateways an
operational 24x7 service" (also September 1981, though I had been
informed earlier that year), there were virtually no "operations tools"
available. IIRC, the original 1974 paper defining TCP didn't address
the issue at all. It focused on how the system would work when
everything was implemented and operating correctly.
As the first TCP/IP software was implemented, few if any mechanisms such
as "test points" or "loopbacks" had been included - e.g., the
implementation I did for Unix had no such features. It simply wasn't a
priority for a research environment, especially when the computer
involved was sitting in front of you and all of its regular debugging
tools were readily usable. IIRC, none of the people implementing those
first TCPs had ever been involved in any network operations.
I didn't have any experience either in operating a 24x7 service. But
the Arpanet NOC was literally down the hall, and by that time the
Arpanet had been operating for over a decade. Many tools, procedures,
and mechanisms had been created over that time.
The research community didn't seem to have any ideas about operating a
network, or much interest in researching that area. So, ... the
obvious way to operate the Internet as a 24x7 service was to simply
steal the mechanisms that were already running the Arpanet successfully
as a 24x7 service. So that's what we did.
For example, the Arpanet had an internal mechanism called "Traps",
whereby IMPs scattered around the network reported anomalous events,
traffic statistics, and other such data back to the NOC at BBN. All
such Traps were printed out (on a Model 33 TTY IIRC) and eventually
ended up in a large and ever-growing stack of paper off in a corner of
the NOC. But an Operator, or IMP programmer, could look back at the
paper logs and often discover an imminent problem, or see the events
leading up to a problem that had been reported. The log was the first
stop for anyone called in to fix a problem.
IMPs had "Fake Hosts", which were simply hosts implemented inside the
IMP software, but able to do things that a normal host might do. One
example was simply generating, or sinking, a flow of traffic. Another
"Fake Host" contained a built-in DDT (a common debugger program of the
era). By connecting to the DDT Fake Host, an IMP programmer could
examine the IMP memory, make patches or changes, load new software, or
do whatever else you could normally do with DDT and a local machine.
But the IMP might be many miles away.
Within the Internet, we lobbied, cajoled, encouraged, and implemented
similar tools to replicate the Arpanet operations functionality.
Arpanet Traps evolved into Internet SNMP mechanisms, and were extended
to end-user computers, to access functionality (flow control,
retransmissions, et al) that had been moved to the Hosts by the TCP
architecture. Fake Hosts such as DDT evolved into XNet, which IIRC was
a Ray Tomlinson project. I ended up documenting after it was updated
for TCP/IP Version 4 (RFC 643 and IEN 158). As such tools were created,
operating Gateways became very similar to operating IMPs. At some
point (I can't remember exactly when), the Arpanet NOC also began
operating the "core gateways", able to perform simple tasks like
reloading software, and able to call the programmers when a more serious
situation was detected.
At the time, IIRC the "Gateway Group" was Bob Hinden, Mike Brescia, and
Alan Sheltzer, all of whom interacted with the NOC and kept the Internet
running. David Floodpage built a system called the CMCC (Catenet
Monitoring and Control Center) as a tool analogous to the NOC "U"
program (Utilities) that was used to do maintenance activities.
When TCP/IP was standardized by DoD, Jon prepared the RFCs. But he
forgot to include some of the pieces that we, as operators, considered
mandatory. In particular, ICMP was absent from the Specification, so
government contractors felt no need to implement it. That meant that
tools such as ECHO and SQ (Source Quench) wouldn't be available for use
in operating and debugging. We always thought that such mechanisms were
just a part of IP. After much grousing and complaining, ICMP was
documented in an RFC and IIRC DoD contracts started requiring it.
We developed other tools as the need and inspiration allowed. For
example, we used the "hooks" still present in the IMP code which had
permitted the NMC (Network Monitoring Center) at UCLA to collect
performance data about the infant Arpanet. That code had long been
unused, but we noticed that we could easily point it to another network
address and send reports wherever we liked.
That enabled the creation of a "Remote Datascope" (RD) tool, which
(IIRC) was a program running on a Sun SPARC. A remote IMP could be
patched (using DDT in the IMP) to send reports to the RD computer. One
very valuable use of that was to capture the beginning of an Arpanet
"message", of sufficient size to contain an IP and TCP header. Thus a
"Internet Engineer" debugging some problem at a remote site could "hook
a datascope" to that host's traffic flow and see exactly what was going on.
Of course, such a Datascope would also be a great Spy tool. But we
avoided mentioning that....
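In today's terms, that kind of header capture might look something like
this Python sketch - the interface name "eth0" is an illustrative
assumption, and it needs Linux (AF_PACKET) and root privileges:

    import socket, struct

    # Sketch of a "datascope" listener: read the first bytes of each
    # frame, enough to cover the IP and TCP headers, and print the
    # endpoints of the flow.
    ETH_P_IP = 0x0800
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                      socket.htons(ETH_P_IP))
    s.bind(("eth0", 0))
    while True:
        frame = s.recv(2048)
        ip = frame[14:]                    # skip 14-byte Ethernet header
        ihl = (ip[0] & 0x0F) * 4           # IP header length, in bytes
        src = socket.inet_ntoa(ip[12:16])
        dst = socket.inet_ntoa(ip[16:20])
        if ip[9] == 6:                     # protocol 6 is TCP
            sport, dport = struct.unpack("!HH", ip[ihl:ihl + 4])
            print(f"{src}:{sport} -> {dst}:{dport}")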
Another tool was the "Flakeway". I don't know the timing, or whether
Jon Postel's flakeway idea arose independently of ours. It's likely
that I at some point told the ICCB and Jon about our Flakeway and how we
were using it in operating the core part of the Internet.
Our "Flakeway" was built by Dan Wood on a Sun Sparc, which by then had
become pretty common. The problem we were facing was that the Arpanet
was too reliable. It never dropped, reordered, delayed, duplicated, or
corrupted anything that one Host sent to another. So the Arpanet was a
poor testbed for TCP implementations.
The Flakeway that Dan built was a weekend project, made possible by a
quirk we had noticed in the Internet protocols. Flakeway took advantage
of what was probably a serious vulnerability in the Internet protocols
and their implementations. On a LAN (almost always Ethernet at the
time), IP addresses were converted into LAN addresses using ARP.
Basically, a Host needing to send a packet to another computer on the
LAN would broadcast a query saying "Does anyone know where x.x.x.x
is?" The host which believed it was x.x.x.x would reply and say "It's
me! I'm at Ethernet address XX:XX:XX:XX:XX:XX."
The Flakeway inserted itself into traffic flows by watching for ARP
exchanges, and then immediately contradicting the "It's me!" message
with another of its own - "No, it's ME! I'm really at
YY:YY:YY:YY:YY:YY." That would effectively direct all subsequent
traffic to the Flakeway. By performing the same exchange with both a
user computer and a gateway on the LAN, the Flakeway could insert itself
into the bidirectional flow of IP datagrams. Nothing needed to be
changed on either the Hosts or Gateways to accomplish this.
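The forged reply itself is tiny. Here is a sketch of that "No, it's
ME!" frame as one might build it today in Python on Linux - every
address below is an illustrative assumption, and this is the trick in
outline, not the actual Flakeway code:

    import socket, struct

    # Forge an ARP reply claiming that CLAIMED_IP lives at MY_MAC, and
    # send it straight to the victim host (root privileges required).
    IFACE = "eth0"
    MY_MAC = bytes.fromhex("02aabbccddee")      # the Flakeway's MAC
    VICTIM_MAC = bytes.fromhex("020102030405")  # host being misled
    CLAIMED_IP = socket.inet_aton("192.0.2.1")  # address we claim to be
    VICTIM_IP = socket.inet_aton("192.0.2.99")

    eth = VICTIM_MAC + MY_MAC + struct.pack("!H", 0x0806)  # EtherType ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)  # hw/proto, op 2 = reply
    arp += MY_MAC + CLAIMED_IP                       # sender: "It's ME!"
    arp += VICTIM_MAC + VICTIM_IP                    # target of the lie

    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
    s.bind((IFACE, 0))
    s.send(eth + arp)   # most stacks believe the most recent answer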
Flakeway could then do whatever it wanted with the datagram flows.
Reorder, duplicate, modify, delay, etc., were all easy. Delays in
particular were far easier to produce than with traditional methods -
which typically involved a *huge* roll of cable to recreate in a lab the
delays that would normally be seen in a trans-continental circuit.
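The impairment logic itself needs very little code. A Python sketch -
the probabilities and delay range here are arbitrary illustrations, not
Flakeway's actual parameters:

    import random

    def flake(packets, drop=0.05, dup=0.05, jitter=0.2):
        # Drop, duplicate, and delay packets; the random per-packet
        # delay re-orders them as a side effect - the software
        # equivalent of that huge roll of cable.
        out = []
        for i, pkt in enumerate(packets):
            arrival = i * 0.01                 # nominal spacing, seconds
            if random.random() < drop:
                continue                       # dropped on the floor
            copies = 2 if random.random() < dup else 1
            for c in range(copies):
                release = arrival + random.uniform(0, jitter)
                out.append((release, i, c, pkt))
        # emit in release order rather than arrival order
        return [pkt for *_, pkt in sorted(out)]

    print(flake([f"pkt{n}" for n in range(10)]))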
It was even possible for Flakeway to alter where a new TCP connection
went - so that when a user tried to connect to some particular IP
address, the connection would instead go somewhere else. Flakeway would
modify the IP addresses in the headers as needed to make it all work. I
suspect it's similar to how NAT operates.
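The address rewriting is straightforward; the fiddly part is repairing
the checksums afterwards. A sketch of the IP-header half in Python
(note that a real tool must also patch the TCP checksum, since it
covers the addresses via the pseudo-header):

    import socket, struct

    def rewrite_dst(ip_header: bytes, new_dst: str) -> bytes:
        # Swap in a new destination address, then recompute the IPv4
        # header checksum so the packet still verifies downstream.
        hdr = bytearray(ip_header)
        hdr[16:20] = socket.inet_aton(new_dst)  # destination address
        hdr[10:12] = b"\x00\x00"                # zero the old checksum
        total = 0
        for i in range(0, len(hdr), 2):         # 16-bit one's-complement sum
            total += (hdr[i] << 8) + hdr[i + 1]
        while total >> 16:                      # fold carries back in
            total = (total & 0xFFFF) + (total >> 16)
        hdr[10:12] = struct.pack("!H", ~total & 0xFFFF)
        return bytes(hdr)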
I don't know what the Specifications for ARP said about how ARPs SHOULD,
MUST, or MAY be handled. But in practice, the Flakeway worked for all
the hosts we tried. They all seemed to simply believe whoever most
recently answered their ARP query.
When I migrated to the West Coast and "up the stack" to Webs and
Databases, I took the Flakeway idea with me. We used it extensively in
operating our own intranet, to be able to see what all the different
computers and their TCP/IP implementations were actually doing, without
changing a thing on any of those computers. It was a valuable tool.
We didn't talk about the Flakeway a lot, since it seemed like a
dangerous tool to have around in a Hackers' Toolbox. I do remember that
we reported it to someone in the IETF world as a serious vulnerability.
But I don't know if anything ever changed. It also seemed to be
becoming less useful as LANs became switched and it was hard to find a
place to plug in a Flakeway so that it could do its thing. But that
was before Wi-Fi became dominant. Computers now are often on a
broadcast channel. Perhaps a Flakeway would still work.
Thanks for "dropping the dime" on the culprit of the great Reverse
Packet Caper. I recall one operational incident in the core Internet
when a gateway suddenly started reporting all sorts of IP errors.
Investigation revealed that some computer out there was sending out IP
datagrams in some kind of reverse order. I thought we traced it to a
BSD machine somewhere, but it may have been FTP Software.
Some years ago, one of our government clients struggling with operating
their own (inter)network asked me to write up a report outlining how to
do it. I remember writing a report, called something like "Managing
Large Network Systems", but it was delivered and disappeared into the
bowels of the government. Haven't found it since.
The report outlined basic elements of an operators' toolbox. For
example, one component was an "Anchor Host", which was simply a computer
installed at each remote location and remotely accessible to the
operators. Anchor Hosts were to be well-understood by the operators,
operators, so they could run familiar tests, tools such as Flakeway, or
whatever else was needed remotely to debug a problem. Similarly, some
kind of database would be used to collect data about performance, not
only during problem conditions but also during normal everyday
operation. Being able to compare what's happening when it's broken to
what happens when it's working was a very useful tool - very similar to
your "pathology database". It also included things like conformance
tests, not to verify that the implementation met a specification, but
rather to collect measurements characterizing how the system behaved
when it was officially "working", for use later when it wasn't.
All of the ideas contained in that report stemmed from years of watching
network problems and observing how people (including myself) attacked
such problems.
IMHO, researchers, and anyone creating protocols, algorithms, and
implementations, should spend some time involved in actually operating
one of these beasts, taking user complaints on the "network help line",
and figuring out what the problem is, and what components of hardware
and software have to be changed. With no finger-pointing allowed.
But I agree, it's not easy to do that, and it's probably getting
harder. Today's systems' complexity seems to offer an increasing
opportunity for finger-pointing, in addition to the security efforts you
mentioned.
In addition to talking with vendors, who are often competing, I suggest
talking to Users, who buy all those vendors' products and are faced with
somehow getting it all to work. Is there an "Internet Users
Community"? Almost every organization, company, government and even
individual on the planet might be a member.
Jack Haverty
On 10/3/24 11:31, Karl Auerbach wrote:
> My grandfather was a radio repair guy, my father repaired TVs that
> other repair people could not fix. So I grew up with my hands inside
> electronics learning how to figure out what was going wrong and what
> to do about it. (I also learned a lot about keeping my fingers clear
> of high voltages - some day ask me about how the phrase "beating the
> bounds" [with regard to land titles] came about, and yes, there is an
> analogy to high voltage shocks.)
>
> I've carried that family history (of repairing, not shocking) into the
> land of networks.
>
> I am extremely concerned, and I mean *extremely* concerned, that our
> race to lock and secure things is slowly making it increasingly
> difficult for us to monitor, diagnose, and repair the Internet (and
> the increasing number of other important infrastructures that have
> become intermeshed with the net.)
>
> I wrote a note about this issue:
>
> Is The Internet At Risk From Too Much Security
> https://www.cavebear.com/cavebear-blog/netsecurity/
>
> My experience with designing, deploying, and running the Interop show
> networks informed me that we have few decent tools. I looked in awe
> at the collection of well-designed tools that AT&T guys (they were
> always guys in that era) had dangling from their tool belts. So I
> designed and sold the first Internet buttset - a tool to get one up
> and running within seconds to do testing and evaluation of an IP (and
> Netware) network. (The tool was "Dr. Watson, The Network Detective's
> Assistant" - https://www.cavebear.com/archive/dwtnda/ . However, I
> was learning about how to run a company at that time and I didn't
> watch, much less control, what my marketing group was spending - so we
> went under. I then helped Fluke pick up some of the remnant ideas for
> their products.)
>
> Anyway, I have been bothered by how few test points we build into
> network software. Even one of the most fundamental - remote loopback
> - is barely present in network equipment (yes, we have ICMP Echo/ping,
> but that's rather primitive). And I've long worked with SNMP and MIBs.
> (I wrote and implemented an alternative to SNMP and Netconf that I
> thought was much more useful than either: KNMP at
> https://www.iwl.com/idocs/knmp-overview )
>
> My wife (Chris Wellens) and I wrote up a paper in 1996 titled "Towards
> Useful Management" in which we made several proposals to improve our
> means to monitor and test networks.
> https://www.cavebear.com/docs/simple-times-vol4-num3.pdf
>
> In the meantime Marshall Rose and my wife spun a new company,
> Interworking Labs, out from the Interop company. The initial purpose
> was to develop test suites for network protocols. (These suites still
> exist and often reveal mistakes in network code. One of my favorites
> is to repackage Ethernet frames that have short IP packets inside
> those Ethernet frames. The IP packet is put into an Ethernet frame
> that is larger than it needs to be to hold that IP packet. (Some
> vendors have used that space to do things like announcing license
> identifiers in the unused space in an Ethernet frame after an ARP
> packet.) Far too much code uses the Ethernet frame length rather than
> properly using the IP length fields - bad things can happen as a
> result. And there is still code out there that uses signed integer
> math on unsigned integer packet fields - so a lot of code still
> wobbles if one tickles packets with numbers just below or just above
> the point where that high order bit toggles.)
>
> Jon Postel came up with a testing idea for the bakeoff test events we
> had at places like FTP Software and ISI - a router that does things
> wrong in a controlled way. A few years later Steve Casner and I were
> working to develop a portable RTP/RTCP engine for entertainment grade
> audio/video (on IP multicast); we longed for a device such as Jon's
> "flakeway" because of the need to evaluate all of the potential race
> conditions that can happen when running several related media streams
> in real time.
>
> So a few years later at Interworking Labs we started to develop Jon's
> flakeway into a real tool. We called the line "Maxwell" after James
> Clerk Maxwell's thought experiment about a daemon that could select
> and control the flow of hot and cold particles, seemingly violating
> the laws of Thermodynamics. It is still rather surprising how much
> code out there wobbles (or worse) when faced with simple network
> behaviour such as packet order resequencing (such as can happen when
> there are parallel/load-balanced/bound network paths), or when packets
> are accumulated for a short while and then suddenly released (as if a
> dam, holding back a lake of packets, suddenly bursts.)
>
> I have seen many network test suites that check that a protocol
> implementation complies with the mandatory or suggested parts of
> RFCs. Those are nice. But my concern is on the other side of the
> RFCs - what about the DO NOT cases or undefined cases: what happens
> when those situations arise?
>
> For instance, I remember Dave Bridgham (FTP Software) one afternoon
> saying "You know, if I received the last IP fragment first I would
> have information that let me do better receive buffer allocation." So
> he changed the FTP Software IP stack to send last fragment first. It
> worked. That is, it worked until an FTP Software-based machine was
> added to a network running competitor Netmanage TCP/IP code. That
> latter code simply up and died when it got the last fragment first.
>
> And at a TCP bakeoff I had a tool to test ARP, a protocol that has
> many knobs and levers that are rarely used. I managed to generate a
> broadcast ARP packet that used some of those knobs and levers. That
> ARP hit the router between our test networks and the host company's
> main network - that router crashed, but before it did it (for some
> reason) propagated that ARP further along, causing every other (I
> believe Proteon) router in the company to also crash.
>
> We found a lot of things like that on the Interop show network. (I
> usually got blamed because I was usually near, if not operating, the
> device that triggered the flaws.) One of the worst was a difference
> in opinion between Cisco and Wellfleet routers about what to do with
> expansion of IP multicast packets into Ethernet frames (in particular
> what group MAC addresses to use) resulting in infinite IP multicast
> routing across the show net - every load LED on every one of our
> hundreds of routers and switches turned red. (And, of course, all
> fingers pointed at me. ;-)
>
> The Interop show net was a wonderful place to discover flaws in
> protocol standards and implementations. One of our team members (who
> I believe is on this list) found a flaw in the FDDI standard. I have a
> memory of companies reworking their code and blasting new firmware
> overnight in their hotel rooms.
>
> The point of this long note is that the state of the art of testing
> Internet protocol implementations is weak. It's not an exciting field,
> QA people are not honored employees, and as more and more people
> believe (often quite wrongly) that they can write code, we are actually
> moving backwards in some regards.
>
> In addition, we do not adequately consider monitoring, testing, and
> repair in our work defining protocols.
>
> In 2003 I gave a long talk with a title that is now a bit
> misleading: From Barnstorming to Boeing –
> Transforming the Internet Into a Lifeline Utility.
>
> (The slides are at
> https://www.cavebear.com/archive/rw/Barnstorming-to-Boeing-slides.pdf
> and the speaker notes at
> https://www.cavebear.com/archive/rw/Barnstorming-to-Boeing.pdf )
>
> (One of my suggestions was the imposition of legal, civil tort,
> liability for network design, implementation, and operational errors -
> using a negligence standard so that simple mistakes would not suffer
> liability. Wow, the groans from the audience were quite loud.)
>
> I had other suggestions as well - such as design rules and operational
> practices that must be followed unless the person looking to deviate
> could express a compelling, cogent argument why deviation is
> appropriate. This is the norm in many engineering disciplines, but
> not for software, where we are largely still in the anything-goes wild
> west.
>
> By the way, I have over the years been working on ideas to advance our
> testing/repair capabilities.
>
> One piece that we are missing is a database of network pathology. I am
> thinking here of a database of symptoms that are tied to possible
> causes and tests to distinguish among those causes. (Yes, I am taking
> a cue from the practice of medicine.) Once we have such a database one
> could build tools to do symptom-to-cause reasoning, including running
> of diagnostic tests to work through the branches of the possible
> causation tree. To do this right one needs trusted test agents
> disseminated throughout the network - the word "trusted" is important
> because network tests can be intrusive, sharp, and dangerous, like a
> surgeon's scalpel. (Imagine a world where surgeons were required to
> use dull, but safe plastic butter knives rather than sharp scalpels.)
>
> Baseline records are important - and we do gather some of that, but we
> always want more detail. But the amount of data to be collected is
> voluminous and is subject to concerns about how it could be used
> competitively. (This is why in our Interworking Labs test contracts
> we prohibit the publishing of results to the public - we want to
> encourage correction for the benefit of us all rather than creation of
> competitive cudgels.)
>
> (One element that I've slowly been working on in my zero free time is
> a precisely timed beacon and precisely timed listeners - all tightly
> synchronized to GPS time. The idea is for beacons to take
> subscriptions from listeners and then to emit highly predictable
> patterns of packets of various sizes and timings. I've been meaning to
> corner some of my astrophysicist friends to adopt some of their
> methods of using that kind of predictable behaviour, observed at a
> distance, to evaluate what lies between the beacon's hither and the
> listener's yon. [And yes, I did pick up some ideas from Van J's
> pathchar and Bruce Mah's re-implementation as pchar.]
>
> I am also thinking that we need some legal and accounting rule changes
> so that vendors are more able to share improvements and tests without
> running afoul of restraint of trade laws or damaging their balance
> sheets and that ever-present, false fable of "shareholder value".
>
> --karl--
>
>