[ih] inter-network communication history

Jack Haverty via Internet-history internet-history at elists.isoc.org
Sat Nov 9 22:26:42 PST 2019


On 11/9/19 4:23 AM, John Day wrote:

> As they say, Jack, ignorance is bliss!  ;-)  Were you doing configuration with it? Or was it just monitoring?

As I recall, configuration wasn't a big deal.  Nodes were typically
routers with Ethernets facing toward the users at the site and several
interfaces the other way for long-haul circuits.  Our approach was to
collect all the appropriate equipment for the next site in our
California lab, configure it and test it out on the live network, and
then ship it all to wherever it was to go.  So, for example, the "New
Zealand" node might actually have been sitting in California at first,
but when it got to NZ it worked the same.

IIRC, there was lots of stuff that could be configured and tweaked in
the routers.   There was even a little documentation on what some of
those "virtual knobs" affected.   There was essentially nothing on why
you might want to set some knob to any particular position, what
information you needed to make such decisions, or how to predict
results.    Anything could happen.   So there was a strong incentive never
to change the default configuration parameters after the site equipment
left our lab.

I don't remember any concerns about database performance.  But we only
had a hundred or so boxes out in our net.   Perhaps the Network
Management vendors had visions of customers with thousands of their
boxes, so we didn't see the same problems.  Also, we only collected the
specific data from sources like SNMP that we expected we could actually
use.  We thought our network was pretty big for the time, spanning 5
continents and thousands of users and computers.  The database we had
worked fine for that.   Compared to other situations, like processing
credit card or bank transactions, it didn't seem like a big load.  I
think it all went into a Sparc.  But there were bigger machines around
if we needed one.

The vendor-supplied tools did provide some monitoring.  E.g., it was
fairly easy to see problems like a dead router or line, and pick up the
phone to call the right TelCo or local site tech to reboot the box. 
With alternate routing, often the users didn't even notice.  Just like
in the ARPANET...(Yay packet switching!)

To make things extra interesting, that was the era of "multi-protocol
routers", since TCP hadn't won the network wars quite yet.  Our
corporate product charter was to provide software that ran on any
computer, over any kind of network.  So our net carried not only TCP/IP,
but also other stuff - e.g., DECNet, AppleTalk, SPX/IPX, and maybe one
or two I don't remember.  SNA/LU6.2 anyone...?   Banyan Vines?

Most of our more challenging "network management" work involved fault
isolation and diagnosis, plus trend analysis and planning.

A typical problem would start with an urgent call from some user who was
having trouble doing something.  It might be "The network is way too
slow.   It's broken."  or "I can't get my quarterly report to go in".  
Often the vendor system would show that all routers were up and running
fine, and all lines were up.  But from the User's perspective, the
network was broken.

Figuring out what was happening was where the ad-hoc tools came in. 
Sometimes it was User Malfunction, but often there was a real issue in
the network that just didn't appear in any obvious way to the
operators.   But the Users saw it.

"You say the Network is running fine.....but it doesn't work!"

To delve into Users' problems, we needed to go beyond just looking at
the routers and circuits.  Part of the problem might be in the Host
computers where TCP lived, or in the Application, e.g., email.  

We ran the main data center in addition to the network.  There wasn't
anyone else for us to point the finger at.

We used simple shell scripts and common Unix programs to gather
SNMP-available data and stuff it into the database, parsing as much as
we could into appropriate tables with useful columns like Time, Router#,
ReportType, etc.   That provided data about how the routers saw the
network world, capturing status and behavior over whatever period of
time we ran the collector.
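
In modern terms, one of those collector scripts might look roughly like
this - the router names, community string, file paths, and the net-snmp
command syntax are all illustrative stand-ins rather than what we
actually ran in 1990:

    #!/bin/sh
    # Poll each router for MIB-II interface octet counters and append
    # rows (Time, Router, IfIndex, InOctets) to a flat file that a
    # separate job bulk-loads into the database.
    # Router names, community string, and paths are hypothetical.

    ROUTERS="router-nz router-syd router-lon"
    OUT=/var/netmgmt/ifstats.$(date +%Y%m%d).csv
    NOW=$(date +%Y-%m-%dT%H:%M:%S)

    for r in $ROUTERS; do
        # -Oq prints "OID value" pairs; the OID suffix is the ifIndex.
        snmpwalk -v1 -c public -Oq "$r" IF-MIB::ifInOctets |
        awk -v t="$NOW" -v r="$r" -F'[. ]' \
            '{ print t "," r "," $2 "," $3 }' >> "$OUT"
    done

A cron-driven loader could then sweep those files into tables along the
lines of the Time/Router#/ReportType layout mentioned above.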

Following the "Standard Node" approach, wherever we placed a network
node we also made sure to have some well-understood machine on the User
side that we could use remotely from the NOC.  Typically it would be
some kind of Unix workstation, attached to the site's Ethernet close to
the router.   Today, I'd probably just velcro a Raspberry Pi to the router.

I used to call this an Anchor Host, since it provided a stable,
well-understood (by us at the NOC) machine out in the network.   This
was really just copying the ARPANET approach from the early 70s, where a
"Fake Host" inside the IMP could be used to do network management things
like generate test traffic or snoop on regular network traffic.   We
couldn't change the router code to add a Fake Host, but we could put a
Real Host next to it.

From that Fake (Real) Host, we could run Ping tests across the network
to measure RTT, measure bandwidth between 2 points during a test FTP,
generate traffic, and such stuff, simply using the tools that commonly
come in Unix boxes.  The results similarly made their way into tables in
the database.   Some tests were run continuously, e.g., ping tests every
5 minutes.  Others were enabled on demand to help figure out some
problem, avoiding burdening the network (and database I guess) with
extra unneeded traffic.
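
A rough sketch of what one of those continuous ping collectors looks
like in today's terms (the target names, paths, and output parsing are
assumptions; ping's summary format varies between Unix flavors):

    #!/bin/sh
    # rtt-probe.sh -- run from cron on the Anchor Host, e.g. every
    # 5 minutes (the */5 syntax is modern Vixie cron):
    #   */5 * * * * /usr/local/netmgmt/rtt-probe.sh
    # Appends (Time, Target, AvgRTTms, Loss) rows for the database loader.

    TARGETS="anchor-nz anchor-syd anchor-lon"   # hypothetical far-end hosts
    OUT=/var/netmgmt/rtt.$(date +%Y%m%d).csv
    NOW=$(date +%Y-%m-%dT%H:%M:%S)

    for t in $TARGETS; do
        ping -c 5 "$t" | awk -v now="$NOW" -v tgt="$t" '
            /packet loss/       { for (i = 1; i <= NF; i++)
                                      if ($i ~ /%/) loss = $i }
            /^(rtt|round-trip)/ { split($4, a, "/"); avg = a[2] }
            END { print now "," tgt "," avg "," loss }
        ' >> "$OUT"
    done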

Also  from that Fake Host, we could run TCPDUMP, which captured traffic
flowing across that Ethernet and produced reams of output with a melange
of multi-protocol packet headers.  Again, all of that could make its way
into the database on demand, organized into useful Tables, delayed if
necessary to avoid impacting the network misbehavior we were trying to
debug.   Give a Unix guru awk, sed, cron and friends and amazing things
can happen.
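
As a modern-day approximation of that kind of pipeline (interface name,
paths, and the awk field positions are assumptions; tcpdump's output
format has changed in detail since then):

    #!/bin/sh
    # Capture a batch of headers off the Anchor Host's Ethernet and
    # reduce them to (Time, Src, Dst) rows suitable for bulk-loading.

    IFACE=eth0                                  # hypothetical interface
    RAW=/var/netmgmt/dump.$(date +%Y%m%d%H%M).txt

    # -n: no name resolution, which keeps the output regular for awk.
    tcpdump -n -i "$IFACE" -c 100000 > "$RAW" 2>/dev/null

    # Typical line: "12:34:56.789 IP 10.1.2.3.1023 > 10.4.5.6.25: Flags [S], ..."
    awk '$2 == "IP" {
        src = $3; dst = $5
        sub(/:$/, "", dst)          # strip the trailing colon
        print $1 "," src "," dst
    }' "$RAW" > "${RAW%.txt}.csv"

This sketch only handles the IP lines; the rest of the multi-protocol
melange needs its own parsing rules.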

We could even run a Flakeway on that Anchor Host, to simulate network
glitches for experimentation, but I can't recall ever having to do
that.  But perhaps the ops did and I never knew.

Once all that stuff got into the database, it became data.   Not a
problem.  I was a network guy afloat in an ocean of database gurus, and
I was astonished at the way they could manipulate that data and turn it
into Information.

I didn't get involved much in everyday network operations, but when
weird things happened I'd stick my nose in. 

Once there was an anomaly in a trans-Pacific path, where there was a
flaky circuit that would go down and up annoyingly often.  The carrier
was "working on it..." 

What the ops had noticed was that after such a glitch finished, the
network would settle down as expected.  But sometimes, the RTT delay and
bandwidth measurements would settle down to a new stable level
noticeably different from before the line glitch.   They even had
brought up a rolling real-time graph of the data, kind of like a
hospital heart-monitor, that clearly showed the glitch and the change in
behavior.

Using our ad-hoc tools, we traced the problem down to a bug in some
vendor's Unix system.  That machine's TCP retransmission timer algorithm
was reacting to the glitch, and adapting as the rerouting occurred.  But
after the glitch, the TCP had settled into a new stable pattern where
the retransmission timer fired just a little too soon, and every packet
was getting sent twice.   The network anomaly would show up if a line
glitch occurred, but only if that Unix user was in the middle of doing
something like a file transfer across the Pacific at the time.   The
Hosts and TCPs were both happy, the Routers were blissfully ignorant,
and half that expensive trans-Pacific circuit was being wasted carrying
duplicate packets.

With the data all sitting in the database, we had the tools to figure
that out.   We reported the TCP bug to the Unix vendor.  I've always
wondered if it ever got fixed, since most customers would probably never
notice.
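
For what it's worth, the symptom itself is easy to spot from captured
headers by counting how often the same connection repeats the same
sequence number.  A sketch against a saved capture file (the file name
and tcpdump output format are modern-day assumptions; back then the
equivalent was a query over the header tables in the database):

    #!/bin/sh
    # Count TCP segments that appear more than once in a capture; if
    # nearly every data segment shows up twice, the sender's
    # retransmission timer is firing too early.

    tcpdump -nn -r transfer.pcap 'tcp' 2>/dev/null |
    awk '
        # Typical line: "... IP 10.1.2.3.20 > 10.4.5.6.1050: Flags [.], seq 1:1461, ..."
        {
            for (i = 1; i <= NF; i++)
                if ($i == "seq") { seen[$3 ">" $5 " " $(i+1)]++ }
        }
        END {
            for (k in seen) { total++; if (seen[k] > 1) dups++ }
            printf "data segments: %d   sent more than once: %d\n", total, dups
        }'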

Another weird thing was that "my quarterly report won't go" scenario. 
That turned out to be a consequence of the popularity of the "Global
LAN" idea in the network industry at the time.  IIRC, someone in some
office in Europe had just finished putting together something like a
library of graphics and photos for brochures et al, and decided to send
it over to the colleagues who were waiting for it.   Everybody was on
the department "LAN", so all you had to do was drag this folder over
there to those guys' icons and it would magically appear on their
desktops.  Of course it didn't matter that those other servers were in
the US, Australia, and Asia - it's a Global LAN, right!

The network groaned, but all the routers and lines stayed up, happily
conveying many packets per second.  For hours.  Unfortunately, too few
of the packets were carrying the email traffic - including that
quarterly report.

We turned off "Global LAN" protocols in the routers ... but of course
today such LAN-type services all run over TCP, so it might not be quite
as easy.

The other important but less urgent Network Management activity involved
things like Capacity Planning.  With the data in the database, it was
pretty easy to get reports or graphs of trends over a month/quarter, and
see the need to order more circuits or equipment.

We could also run various tests like traffic generators and such and
gather data when there were no problems in the network.   That collected
data provided a "baseline" of how things looked when everything was
working.  During problem times, it was straightforward to run similar
tests and compare the results against the baselines; the significant
differences pointed toward where the source of a problem might be.
The ability to compare "working" and "broken" data is a powerful Network
Management tool.
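
Mechanically the comparison can be as simple as averaging the same
measurement over the two periods and putting the numbers side by side.
A toy version over the kind of RTT log sketched earlier (column layout
and file names assumed):

    #!/bin/sh
    # Average RTT per target for a baseline period vs. today, from CSV
    # rows of the form Time,Target,AvgRTTms,Loss.

    avg_by_target() {
        awk -F, '{ sum[$2] += $3; n[$2]++ }
                 END { for (t in sum) printf "%s %.1f\n", t, sum[t]/n[t] }' "$1"
    }

    avg_by_target /var/netmgmt/rtt.baseline.csv | sort > /tmp/base.$$
    avg_by_target /var/netmgmt/rtt.today.csv    | sort > /tmp/now.$$

    # One line per target: name, baseline average RTT, today's average RTT.
    join /tmp/base.$$ /tmp/now.$$
    rm -f /tmp/base.$$ /tmp/now.$$

A target whose RTT or loss has jumped well above its baseline is
usually the place to start looking.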

So that's what we did.  I'm not sure I'd characterize all that kind of
activity as either Configuration or Monitoring.  I've always thought it
was just Network Management.

There's a lot of History of the Internet protocols, equipment, software,
etc., but I haven't seen much of a historical account of how the various
pieces of the Internet have been operated and managed, and how the tools
and techniques have evolved over time.

If anybody's up for it, it would be interesting to see how other people
did such "Network Management" activities with their own ad-hoc tools as
the Internet evolved.

It would also be fascinating to see how useful today's expensive Network
Management Systems would be in the scenarios above.   I.e., how effective
would today's tools be if used by network operators to deal with those
kinds of problems - along the lines of RFC 1109's observations about how
to evaluate Network Management technology.

BTW, everything I wrote above occurred in 1990-1991.

/Jack


