[ih] inter-network communication history
John Day
jeanjour at comcast.net
Sun Nov 10 13:45:25 PST 2019
This is getting pretty long and we are probably the only ones interested in it.
Configuration Management:
You were lucky. We had customers whose configurations changed between day and night, at end of month, and on holidays versus non-holidays (both for less traffic and for much more); in Hong Kong, race day was a killer. There are also networks that have to reconfigure quickly when new conditions arise (some predictable, some not).
Database Performance: All of that did make database performance an issue. But a database that is 20+ times slower can also make even ordinary things far too slow. SNMP’s inherent dribble approach probably covered up the database performance issues. The idea that one couldn’t snapshot a router table before it was updated was a pain.
General: One of the problems we saw was getting it across to operators that they were *managing* the network, not controlling it. There was a situation in the UK where the number of switch crashes dropped precipitously for 6 weeks and then came back up. No changes had been made. It was very strange until they realized it was the 6 weeks the operators were on strike. They had been trying to control the network and were making it worse.
I contend that network management is basically Monitor and Repair, not control.
Multi-protocol: Yes, this is where leveraging the commonality of structure paid off. We had to support all of those too. Given our approach, it was relatively straightforward.
Fault Management: Yes, this is another place where the commonality paid off, as did recognizing that doing diagnosis is basically creating a small management domain for the equipment under test. I was never able to get as far on this problem as I thought we could, but one could do a lot with common tools to generate test patterns, test dictionaries, etc.
Most of this was on the market by 1986.
John
> On Nov 10, 2019, at 01:26, Jack Haverty <jack at 3kitty.org> wrote:
>
> On 11/9/19 4:23 AM, John Day wrote:
>
>> As they say, Jack, ignorance is bliss! ;-) Were you doing configuration with it? Or was it just monitoring?
>
> As I recall, configuration wasn't a big deal. Nodes were typically
> routers with Ethernets facing toward the users at the site and several
> interfaces the other way for long-haul circuits. Our approach was to
> collect all the appropriate equipment for the next site in our
> California lab, configure it and test it out on the live network, and
> then ship it all to wherever it was to go. So, for example, New Zealand
> might have actually been in California at first, but when it got to NZ
> it worked the same.
>
> IIRC, there was lots of stuff that could be configured and tweaked in
> the routers. There was even a little documentation on what some of
> those "virtual knobs" affected. There was essentially nothing on why
> you might want to set some knob to any particular position, what
> information you needed to make such decisions, or how to predict
> results. Anything could happen. So there was strong incentive never
> to change the default configuration parameters after the site equipment
> left our lab.
>
> I don't remember any concerns about database performance. But we only
> had a hundred or so boxes out in our net. Perhaps the Network
> Management vendors had visions of customers with thousands of their
> boxes so we didn't see the same problems. Also, we only collected the
> specific data from sources like SNMP that we expected we could actually
> use. We thought our network was pretty big for the time, spanning 5
> continents and thousands of users and computers. The database we had
> worked fine for that. Compared to other situations, like processing
> credit card or bank transactions, it didn't seem like a big load. I
> think it all went into a Sparc. But there were bigger machines around
> if we needed one.
>
> The vendor-supplied tools did provide some monitoring. E.g., it was
> fairly easy to see problems like a dead router or line, and pick up the
> phone to call the right TelCo or local site tech to reboot the box.
> With alternate routing, often the users didn't even notice. Just like
> in the ARPANET...(Yay packet switching!)
>
> To make things extra interesting, that was the era of "multi-protocol
> routers", since TCP hadn't won the network wars quite yet. Our
> corporate product charter was to provide software that ran on any
> computer, over any kind of network. So our net carried not only TCP/IP,
> but also other stuff - e.g., DECNet, AppleTalk, SPX/IPX, and maybe one
> or two I don't remember. SNA/LU6.2 anyone...? Banyan Vines?
>
> Most of our more challenging "network management" work involved fault
> isolation and diagnosis, plus trend analysis and planning.
>
> A typical problem would start with an urgent call from some user who was
> having trouble doing something. It might be "The network is way too
> slow. It's broken." or "I can't get my quarterly report to go in".
> Often the vendor system would show that all routers were up and running
> fine, and all lines were up. But from the User's perspective, the
> network was broken.
>
> Figuring out what was happening was where the ad hoc tools came in.
> Sometimes it was User Malfunction, but often there was a real issue in
> the network that just didn't appear in any obvious way to the
> operators. But the Users saw it.
>
> "You say the Network is running fine.....but it doesn't work!"
>
> To delve into Users' problems, we needed to go beyond just looking at
> the routers and circuits. Part of the problem might be in the Host
> computers where TCP lived, or in the Application, e.g., email.
>
> We ran the main data center in addition to the network. There wasn't
> anyone else for us to point the finger at.
>
> We used simple shell scripts and common Unix programs to gather
> SNMP-available data and stuff it into the database, parsed as much as we
> could into appropriate tables with useful columns like Time, Router#,
> ReportType, etc. That provided data about how the routers saw the
> network world, capturing status and behavior over whatever period of
> time we ran the collector.
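>
> A rough sketch of the sort of thing those scripts did -- the router
> names, OIDs, community string, and file layout here are all made up,
> and it assumes a Net-SNMP-style snmpget rather than whatever we
> actually had on hand:
>
>     #!/bin/sh
>     # Poll each router's interface counters (run from cron every few
>     # minutes) and append rows that a loader later bulk-inserts into
>     # the database as Time, Router, ReportType, values.
>     ROUTERS="rtr-nz rtr-syd rtr-hk"        # illustrative names
>     COMMUNITY=public
>     OUT=/var/noc/snmp/`date +%Y%m%d`.csv
>     for r in $ROUTERS; do
>         ts=`date +%Y-%m-%dT%H:%M:%S`
>         # IF-MIB ifInOctets.1 and ifOutOctets.1
>         in=`snmpget -v1 -c $COMMUNITY -Ovq $r 1.3.6.1.2.1.2.2.1.10.1`
>         out=`snmpget -v1 -c $COMMUNITY -Ovq $r 1.3.6.1.2.1.2.2.1.16.1`
>         echo "$ts,$r,ifOctets,$in,$out" >> $OUT
>     done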
>
> Following the "Standard Node" approach, wherever we placed a network
> node we also made sure to have some well-understood machine on the User
> side that we could use remotely from the NOC. Typically it would be
> some kind of Unix workstation, attached to the site's Ethernet close to
> the router. Today, I'd probably just velcro a Raspberry Pi to the router.
>
> I used to call this an Anchor Host, since it provided a stable,
> well-understood (by us at the NOC) machine out in the network. This
> was really just copying the ARPANET approach from the early 70s, where a
> "Fake Host" inside the IMP could be used to do network management things
> like generate test traffic or snoop on regular network traffic. We
> couldn't change the router code to add a Fake Host, but we could put a
> Real Host next to it.
>
> From that Fake (Real) Host, we could run Ping tests across the network
> to measure RTT, measure bandwidth between 2 points during a test FTP,
> generate traffic, and such stuff, simply using the tools that commonly
> come in Unix boxes. The results similarly made their way into tables in
> the database. Some tests were run continuously, e.g., ping tests every
> 5 minutes. Others were enabled on demand to help figure out some
> problem, avoiding burdening the network (and database I guess) with
> extra unneeded traffic.
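>
> The continuous probe needn't have been anything fancier than
> something like this, run from a cron entry every 5 minutes (host
> names are hypothetical, and it assumes the usual "min/avg/max"
> summary line from ping):
>
>     #!/bin/sh
>     # Ping a far-end Anchor Host, pull out the average RTT, and
>     # append a row for the database loader.  Unreachable targets
>     # get logged too.
>     TARGET=anchor-nz.example.com
>     ts=`date +%Y-%m-%dT%H:%M:%S`
>     avg=`ping -c 5 -q $TARGET | awk -F/ '/^(rtt|round-trip)/ {print $5}'`
>     echo "$ts,`hostname`,$TARGET,ping_avg_ms,${avg:-unreachable}" \
>         >> /var/noc/rtt/`date +%Y%m%d`.csv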
>
> Also from that Fake Host, we could run TCPDUMP, which captured traffic
> flowing across that Ethernet and produced reams of output with a melange
> of multi-protocol packet headers. Again, all of that could make its way
> into the database on demand, organized into useful Tables, delayed if
> necessary to avoid impacting the network misbehavior we were trying to
> debug. Give a Unix guru awk, sed, cron and friends and amazing things
> can happen.
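>
> The capture step was similarly simple in outline -- roughly this,
> though the interface name, the row layout, and the exact tcpdump
> output format are all assumptions here, and older tcpdumps printed
> things a bit differently:
>
>     #!/bin/sh
>     # Grab a bounded sample of packet headers off the local Ethernet
>     # and flatten them into timestamp,src,dst rows for the loader.
>     IFACE=eth0                     # it was le0 on the Suns
>     tcpdump -i $IFACE -n -tt -c 5000 2>/dev/null |
>     awk '$4 == ">" {
>         # typical line: "943920000.12 IP 10.1.2.3.1023 > 10.9.8.7.21: ..."
>         sub(/:$/, "", $5)
>         printf "%s,%s,%s\n", $1, $3, $5
>     }' >> /var/noc/captures/`date +%Y%m%d-%H%M`.csv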
>
> We could even run a Flakeway on that Anchor Host, to simulate network
> glitches for experimentation, but I can't recall ever having to do
> that. But perhaps the ops did and I never knew.
>
> Once all that stuff got into the database, it became data. Not a
> problem. I was a network guy afloat in an ocean of database gurus, and
> I was astonished at the way they could manipulate that data and turn it
> into Information.
>
> I didn't get involved much in everyday network operations, but when
> weird things happened I'd stick my nose in.
>
> Once there was an anomaly in a trans-pacific path, where there was a
> flaky circuit that would go down and up annoyingly often. The carrier
> was "working on it..."
>
> What the ops had noticed was that after such a glitch finished, the
> network would settle down as expected. But sometimes, the RTT delay and
> bandwidth measurements would settle down to a new stable level
> noticeably different from before the line glitch. They had even
> brought up a rolling real-time graph of the data, kind of like a
> hospital heart-monitor, that clearly showed the glitch and the change in
> behavior.
>
> Using our ad hoc tools, we traced the problem down to a bug in some
> vendor's Unix system. That machine's TCP retransmission timer algorithm
> was reacting to the glitch, and adapting as the rerouting occurred. But
> after the glitch, the TCP had settled into a new stable pattern where
> the retransmission timer fired just a little too soon, and every packet
> was getting sent twice. The network anomaly would show up if a line
> glitch occurred, but only if that Unix user was in the middle of doing
> something like a file transfer across the Pacific at the time. The
> Hosts and TCPs were both happy, the Routers were blissfully ignorant,
> and half that expensive trans-pacific circuit was being wasted carrying
> duplicate packets.
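>
> For what it's worth, once you suspect that kind of bug it's easy to
> confirm from the captured headers by counting how often the same TCP
> sequence range reappears within a flow. A hypothetical check,
> assuming a saved capture file and a tcpdump new enough to print a
> "seq" keyword:
>
>     tcpdump -n -tt -r transpacific.cap tcp 2>/dev/null |
>     awk '/ seq / {
>         for (i = 1; i <= NF; i++) if ($i == "seq") seq = $(i + 1)
>         key = $3 " " $5 " " seq
>         if (seen[key]++) dups++
>         total++
>     }
>     END { if (total) printf "%d of %d data segments were repeats\n", dups, total }'
>
> Near-100% repeats point at a too-short retransmission timer rather
> than at loss in the network.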
>
> With the data all sitting in the database, we had the tools to figure
> that out. We reported the TCP bug to the Unix vendor. I've always
> wondered if it ever got fixed, since most customers would probably never
> notice.
>
> Another weird thing was that "my quarterly report won't go" scenario.
> That turned out to be a consequence of the popularity of the "Global
> Lan" idea in the network industry at the time. IIRC, someone in some
> office in Europe had just finished putting together something like a
> library of graphics and photos for brochures et al, and decided to send
> it over to the colleagues who were waiting for it. Everybody was on
> the department "LAN", so all you had to do was drag this folder over
> there to those guys' icons and it would magically appear on their
> desktops. Of course it didn't matter that those other servers were in
> the US, Australia, and Asia - it's a Global LAN, right!
>
> The network groaned, but all the routers and lines stayed up, happily
> conveying many packets per second. For hours. Unfortunately too few of
> the packets were carrying that email traffic.
>
> We turned off "Global LAN" protocols in the routers ... but of course
> today such LAN-type services all run over TCP, so it might not be quite
> as easy.
>
> The other important but less urgent Network Management activity involved
> things like Capacity Planning. With the data in the database, it was
> pretty easy to get reports or graphs of trends over a month/quarter, and
> see the need to order more circuits or equipment.
>
> We could also run various tests like traffic generators and such and
> gather data when there were no problems in the network. That collected
> data provided a "baseline" of how things looked when everything was
> working. During problem times, it was straightforward to run similar
> tests and compare the results with the baselines to figure out where the
> source of a problem might be by highlighting significant differences.
> The ability to compare "working" and "broken" data is a powerful Network
> Management tool.
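>
> Even the comparison could stay in the same flat-file, awk-and-cron
> spirit when the database wasn't handy. A made-up example, assuming
> RTT rows laid out as in the ping sketch above
> (timestamp,source,target,metric,value):
>
>     # Average the per-target RTTs from a known-good week and from the
>     # problem window, then print how far each path has moved.
>     awk -F, '$4 == "ping_avg_ms" && $5 + 0 > 0 {
>         if (FILENAME == ARGV[1]) { bsum[$3] += $5; bn[$3]++ }
>         else                     { csum[$3] += $5; cn[$3]++ }
>     }
>     END {
>         for (t in csum) if (t in bsum)
>             printf "%-28s baseline %6.1f ms  now %6.1f ms  (%+.0f%%)\n",
>                    t, bsum[t] / bn[t], csum[t] / cn[t],
>                    100 * (csum[t]/cn[t] - bsum[t]/bn[t]) / (bsum[t]/bn[t])
>     }' baseline-week.csv problem-window.csv
>
> Paths whose numbers moved the most were usually where to start digging.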
>
> So that's what we did. I'm not sure I'd characterize all that kind of
> activity as either Configuration or Monitoring. I've always thought it
> was just Network Management.
>
> There's a lot of History of the Internet protocols, equipment, software,
> etc., but I haven't seen much of a historical account of how the various
> pieces of the Internet have been operated and managed, and how the tools
> and techniques have evolved over time.
>
> If anybody's up for it, it would be interesting to see how other people
> did such "Network Management" activities with their own ad hoc tools as
> the Internet evolved.
>
> It would also be fascinating to see how today's expensive Network
> Management Systems tools would be useful in my scenarios above. I.e.,
> how effective would today's tools be if used by network operators to
> deal with my example network management scenarios - along the lines of
> RFC1109's observations about how to evaluate Network Management technology.
>
> BTW, everything I wrote above occurred in 1990-1991.
>
> /Jack
>
>
>