[ih] inter-network communication history
John Day
jeanjour at comcast.net
Sun Nov 10 13:45:25 PST 2019
This is getting pretty long and we are probably the only ones interested in it.
Configuration Management:
You were lucky. We had customers whose configurations changed between day and night, at end of month, and on holidays versus non-holidays (both for less traffic and for much more); in Hong Kong, race day was a killer. There are also networks that have to reconfigure quickly when new conditions arise (some predictable, some not).
Database Performance: All of that did make database performance an issue. But a database that is 20+ times slower can also make even ordinary things far too slow. SNMP’s inherent dribble approach probably covered up the database performance issues. The idea that one couldn’t snapshot a router table before it was updated was a pain.
General: One of the problems we saw was getting it across to operators that they were *managing* the network, not controlling it. There was a situation in the UK where the number of switch crashes dropped precipitously for 6 weeks and then came back up. No changes had been made. It was very strange until they realized it was the 6 weeks the operators were on strike. They had been trying to control the network and were making it worse.
I contend that network management is basically Monitor and Repair, not control.
Multi-protocol: Yes, this is where leveraging the commonality of structure paid off. We had to support all of those too. Given our approach, it was relatively straightforward.
Fault Management: Yes, this is another place where the commonality paid off, as did recognizing that doing diagnosis is basically creating a small management domain for the equipment under test. I was never able to get as far on this problem as I thought we could, but one could do a lot with common tools to generate test patterns, test dictionaries, etc.
Most of this was on the market by 1986.
John
> On Nov 10, 2019, at 01:26, Jack Haverty <jack at 3kitty.org> wrote:
>
> On 11/9/19 4:23 AM, John Day wrote:
>
>> As they say, Jack, ignorance is bliss! ;-) Were you doing configuration with it? Or was it just monitoring?
>
> As I recall, configuration wasn't a big deal. Nodes were typically
> routers with Ethernets facing toward the users at the site and several
> interfaces the other way for long-haul circuits. Our approach was to
> collect all the appropriate equipment for the next site in our
> California lab, configure it and test it out on the live network, and
> then ship it all to wherever it was to go. So, for example, New Zealand
> might have actually been in California at first, but when it got to NZ
> it worked the same.
>
> IIRC, there was lots of stuff that could be configured and tweaked in
> the routers. There was even a little documentation on what some of
> those "virtual knobs" affected. There was essentially nothing on why
> you might want to set some knob to any particular position, what
> information you needed to make such decisions, or how to predict
> results. Anything could happen. So there was strong incentive never
> to change the default configuration parameters after the site equipment
> left our lab.
>
> I don't remember any concerns about database performance. But we only
> had a hundred or so boxes out in our net. Perhaps the Network
> Management vendors had visions of customers with thousands of their
> boxes so we didn't see the same problems. Also, we only collected the
> specific data from sources like SNMP that we expected we could actually
> use. We thought our network was pretty big for the time, spanning 5
> continents and thousands of users and computers. The database we had
> worked fine for that. Compared to other situations, like processing
> credit card or bank transactions, it didn't seem like a big load. I
> think it all went into a Sparc. But there were bigger machines around
> if we needed one.
>
> The vendor-supplied tools did provide some monitoring. E.g., it was
> fairly easy to see problems like a dead router or line, and pick up the
> phone to call the right TelCo or local site tech to reboot the box.
> With alternate routing, often the users didn't even notice. Just like
> in the ARPANET...(Yay packet switching!)
>
> To make things extra interesting, that was the era of "multi-protocol
> routers", since TCP hadn't won the network wars quite yet. Our
> corporate product charter was to provide software that ran on any
> computer, over any kind of network. So our net carried not only TCP/IP,
> but also other stuff - e.g., DECNet, AppleTalk, SPX/IPX, and maybe one
> or two I don't remember. SNA/LU6.2 anyone...? Banyan Vines?
>
> Most of our more challenging "network management" work involved fault
> isolation and diagnosis, plus trend analysis and planning.
>
> A typical problem would start with an urgent call from some user who was
> having trouble doing something. It might be "The network is way too
> slow. It's broken." or "I can't get my quarterly report to go in".
> Often the vendor system would show that all routers were up and running
> fine, and all lines were up. But from the User's perspective, the
> network was broken.
>
> Figuring out what was happening was where the ad hoc tools came in.
> Sometimes it was User Malfunction, but often there was a real issue in
> the network that just didn't appear in any obvious way to the
> operators. But the Users saw it.
>
> "You say the Network is running fine.....but it doesn't work!"
>
> To delve into Users' problems, we needed to go beyond just looking at
> the routers and circuits. Part of the problem might be in the Host
> computers where TCP lived, or in the Application, e.g., email.
>
> We ran the main data center in addition to the network. There wasn't
> anyone else for us to point the finger at.
>
> We used simple shell scripts and common Unix programs to gather
> SNMP-available data and stuff it into the database, parsed as much as we
> could into appropriate tables with useful columns like Time, Router#,
> ReportType, etc. That provided data about how the routers saw the
> network world, capturing status and behavior over whatever period of
> time we ran the collector.
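>
> A rough sketch of the sort of thing those scripts did -- the router
> names, OIDs, community string, and file layout here are all made up,
> and it assumes a Net-SNMP-style snmpget rather than whatever we
> actually had on hand:
>
>     #!/bin/sh
>     # Poll each router's interface counters (run from cron every few
>     # minutes) and append rows that a loader later bulk-inserts into
>     # the database as Time, Router, ReportType, values.
>     ROUTERS="rtr-nz rtr-syd rtr-hk"        # illustrative names
>     COMMUNITY=public
>     OUT=/var/noc/snmp/`date +%Y%m%d`.csv
>     for r in $ROUTERS; do
>         ts=`date +%Y-%m-%dT%H:%M:%S`
>         # IF-MIB ifInOctets.1 and ifOutOctets.1
>         in=`snmpget -v1 -c $COMMUNITY -Ovq $r 1.3.6.1.2.1.2.2.1.10.1`
>         out=`snmpget -v1 -c $COMMUNITY -Ovq $r 1.3.6.1.2.1.2.2.1.16.1`
>         echo "$ts,$r,ifOctets,$in,$out" >> $OUT
>     done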
>
> Following the "Standard Node" approach, wherever we placed a network
> node we also made sure to have some well-understood machine on the User
> side that we could use remotely from the NOC. Typically it would be
> some kind of Unix workstation, attached to the site's Ethernet close to
> the router. Today, I'd probably just velcro a Raspberry Pi to the router.
>
> I used to call this an Anchor Host, since it provided a stable,
> well-understood (by us at the NOC) machine out in the network. This
> was really just copying the ARPANET approach from the early 70s, where a
> "Fake Host" inside the IMP could be used to do network management things
> like generate test traffic or snoop on regular network traffic. We
> couldn't change the router code to add a Fake Host, but we could put a
> Real Host next to it.
>
> From that Fake (Real) Host, we could run Ping tests across the network
> to measure RTT, measure bandwidth between 2 points during a test FTP,
> generate traffic, and such stuff, simply using the tools that commonly
> come in Unix boxes. The results similarly made their way into tables in
> the database. Some tests were run continuously, e.g., ping tests every
> 5 minutes. Others were enabled on demand to help figure out some
> problem, avoiding burdening the network (and database I guess) with
> extra unneeded traffic.
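>
> The continuous probe needn't have been anything fancier than
> something like this, run from a cron entry every 5 minutes (host
> names are hypothetical, and it assumes the usual "min/avg/max"
> summary line from ping):
>
>     #!/bin/sh
>     # Ping a far-end Anchor Host, pull out the average RTT, and
>     # append a row for the database loader.  Unreachable targets
>     # get logged too.
>     TARGET=anchor-nz.example.com
>     ts=`date +%Y-%m-%dT%H:%M:%S`
>     avg=`ping -c 5 -q $TARGET | awk -F/ '/^(rtt|round-trip)/ {print $5}'`
>     echo "$ts,`hostname`,$TARGET,ping_avg_ms,${avg:-unreachable}" \
>         >> /var/noc/rtt/`date +%Y%m%d`.csv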
>
> Also from that Fake Host, we could run TCPDUMP, which captured traffic
> flowing across that Ethernet and produced reams of output with a melange
> of multi-protocol packet headers. Again, all of that could make its way
> into the database on demand, organized into useful Tables, delayed if
> necessary to avoid impacting the network misbehavior we were trying to
> debug. Give a Unix guru awk, sed, cron and friends and amazing things
> can happen.
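>
> The capture step was similarly simple in outline -- roughly this,
> though the interface name, the row layout, and the exact tcpdump
> output format are all assumptions here, and older tcpdumps printed
> things a bit differently:
>
>     #!/bin/sh
>     # Grab a bounded sample of packet headers off the local Ethernet
>     # and flatten them into timestamp,src,dst rows for the loader.
>     IFACE=eth0                     # it was le0 on the Suns
>     tcpdump -i $IFACE -n -tt -c 5000 2>/dev/null |
>     awk '$4 == ">" {
>         # typical line: "943920000.12 IP 10.1.2.3.1023 > 10.9.8.7.21: ..."
>         sub(/:$/, "", $5)
>         printf "%s,%s,%s\n", $1, $3, $5
>     }' >> /var/noc/captures/`date +%Y%m%d-%H%M`.csv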
>
> We could even run a Flakeway on that Anchor Host, to simulate network
> glitches for experimentation, but I can't recall ever having to do
> that. But perhaps the ops did and I never knew.
>
> Once all that stuff got into the database, it became data. Not a
> problem. I was a network guy afloat in an ocean of database gurus, and
> I was astonished at the way they could manipulate that data and turn it
> into Information.
>
> I didn't get involved much in everyday network operations, but when
> weird things happened I'd stick my nose in.
>
> Once there was an anomaly in a trans-pacific path, where there was a
> flaky circuit that would go down and up annoyingly often. The carrier
> was "working on it..."
>
> What the ops had noticed was that after such a glitch finished, the
> network would settle down as expected. But sometimes, the RTT delay and
> bandwidth measurements would settle down to a new stable level
> noticeably different from before the line glitch. They had even
> brought up a rolling real-time graph of the data, kind of like a
> hospital heart-monitor, that clearly showed the glitch and the change in
> behavior.
>
> Using our ad hoc tools, we traced the problem down to a bug in some
> vendor's Unix system. That machine's TCP retransmission timer algorithm
> was reacting to the glitch, and adapting as the rerouting occurred. But
> after the glitch, the TCP had settled into a new stable pattern where
> the retransmission timer fired just a little too soon, and every packet
> was getting sent twice. The network anomaly would show up if a line
> glitch occurred, but only if that Unix user was in the middle of doing
> something like a file transfer across the Pacific at the time. The
> Hosts and TCPs were both happy, the Routers were blissfully ignorant,
> and half that expensive trans-pacific circuit was being wasted carrying
> duplicate packets.
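>
> For what it's worth, once you suspect that kind of bug it's easy to
> confirm from the captured headers by counting how often the same TCP
> sequence range reappears within a flow. A hypothetical check,
> assuming a saved capture file and a tcpdump new enough to print a
> "seq" keyword:
>
>     tcpdump -n -tt -r transpacific.cap tcp 2>/dev/null |
>     awk '/ seq / {
>         for (i = 1; i <= NF; i++) if ($i == "seq") seq = $(i + 1)
>         key = $3 " " $5 " " seq
>         if (seen[key]++) dups++
>         total++
>     }
>     END { if (total) printf "%d of %d data segments were repeats\n", dups, total }'
>
> Near-100% repeats point at a too-short retransmission timer rather
> than at loss in the network.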
>
> With the data all sitting in the database, we had the tools to figure
> that out. We reported the TCP bug to the Unix vendor. I've always
> wondered if it ever got fixed, since most customers would probably never
> notice.
>
> Another weird thing was that "my quarterly report won't go" scenario.
> That turned out to be a consequence of the popularity of the "Global
> Lan" idea in the network industry at the time. IIRC, someone in some
> office in Europe had just finished putting together something like a
> library of graphics and photos for brochures et al, and decided to send
> it over to the colleagues who were waiting for it. Everybody was on
> the department "LAN", so all you had to do was drag this folder over
> there to those guys' icons and it would magically appear on their
> desktops. Of course it didn't matter that those other servers were in
> the US, Australia, and Asia - it's a Global LAN, right!
>
> The network groaned, but all the routers and lines stayed up, happily
> conveying many packets per second. For hours. Unfortunately too few of
> the packets were carrying that email traffic.
>
> We turned off "Global LAN" protocols in the routers ... but of course
> today such LAN-type services all run over TCP, so it might not be quite
> as easy.
>
> The other important but less urgent Network Management activity involved
> things like Capacity Planning. With the data in the database, it was
> pretty easy to get reports or graphs of trends over a month/quarter, and
> see the need to order more circuits or equipment.
>
> We could also run various tests like traffic generators and such and
> gather data when there were no problems in the network. That collected
> data provided a "baseline" of how things looked when everything was
> working. During problem times, it was straightforward to run similar
> tests and compare the results with the baselines to figure out where the
> source of a problem might be by highlighting significant differences.
> The ability to compare "working" and "broken" data is a powerful Network
> Management tool.
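>
> Even the comparison could stay in the same flat-file, awk-and-cron
> spirit when the database wasn't handy. A made-up example, assuming
> RTT rows laid out as in the ping sketch above
> (timestamp,source,target,metric,value):
>
>     # Average the per-target RTTs from a known-good week and from the
>     # problem window, then print how far each path has moved.
>     awk -F, '$4 == "ping_avg_ms" && $5 + 0 > 0 {
>         if (FILENAME == ARGV[1]) { bsum[$3] += $5; bn[$3]++ }
>         else                     { csum[$3] += $5; cn[$3]++ }
>     }
>     END {
>         for (t in csum) if (t in bsum)
>             printf "%-28s baseline %6.1f ms  now %6.1f ms  (%+.0f%%)\n",
>                    t, bsum[t] / bn[t], csum[t] / cn[t],
>                    100 * (csum[t]/cn[t] - bsum[t]/bn[t]) / (bsum[t]/bn[t])
>     }' baseline-week.csv problem-window.csv
>
> Paths whose numbers moved the most were usually where to start digging.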
>
> So that's what we did. I'm not sure I'd characterize all that kind of
> activity as either Configuration or Monitoring. I've always thought it
> was just Network Management.
>
> There's a lot of History of the Internet protocols, equipment, software,
> etc., but I haven't seen much of a historical account of how the various
> pieces of the Internet have been operated and managed, and how the tools
> and techniques have evolved over time.
>
> If anybody's up for it, it would be interesting to see how other people
> did such "Network Management" activities with their own ad hoc tools as
> the Internet evolved.
>
> It would also be fascinating to see how today's expensive Network
> Management Systems tools would be useful in my scenarios above. I.e.,
> how effective would today's tools be if used by network operators to
> deal with my example network management scenarios - along the lines of
> RFC1109's observations about how to evaluate Network Management technology.
>
> BTW, everything I wrote above occurred in 1990-1991.
>
> /Jack
>
>
>