[ih] GOSIP & compliance

Karl Auerbach karl at cavebear.com
Sat Mar 26 13:12:55 PDT 2022


On 3/26/22 10:30 AM, Jack Haverty via Internet-history wrote:
> SNMP et al are mechanisms for data collection, i.e., retrieving all 
> sorts of metrics about how things out in the network are behaving. But 
> has there been much thought and effort about how to *use* all that 
> data to operate, troubleshoot, plan or otherwise manage all the 
> technology involved in whatever the users are doing?

The short answer is "yes".  I've been thinking about it for a long time, 
since the 1980's.  I tend to use the phrase "homeostatic networking".

I helped with a DARPA project about "smart networks".  (They weren't 
really "smart" in the switching plane; the "smarts" were in the control 
plane.)  In that project we fed a bunch of information into a modelling 
system that produced MPLS paths, including backups, so that we could do 
things like switching over within a few milliseconds.  The modelling was 
done externally; results would be disseminated into the network.  The 
idea was to put somewhat autonomous smarts into the routers so that they 
could manage themselves (to a very limited degree) in accord with the 
model by watching things like queue lengths, internal drops, etc., and 
decide when to switch over to a new path definition.  (I was going to 
use JVMs embedded in Cisco IOS - someone had already done that - to run 
this code.)
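
(For concreteness, here is a rough sketch - in Python only for brevity, 
with invented names and thresholds, and emphatically not the actual 
project code - of the kind of local decision loop I have in mind:)

    import time

    QUEUE_DEPTH_LIMIT = 500    # packets - an invented threshold
    DROP_RATE_LIMIT   = 0.01   # fraction dropped per interval - invented

    def watch_and_switch(iface, primary_path, backup_path,
                         poll_seconds=0.001):
        """Poll local counters; fall over to the precomputed backup path
        when the primary looks unhealthy.  'iface' stands in for whatever
        gives access to the router's own counters and forwarding table."""
        active = primary_path
        prev_sent, prev_dropped = iface.sent(), iface.dropped()
        while True:
            time.sleep(poll_seconds)
            sent, dropped = iface.sent(), iface.dropped()
            drop_rate = (dropped - prev_dropped) / max(1, sent - prev_sent)
            unhealthy = (iface.queue_depth() > QUEUE_DEPTH_LIMIT
                         or drop_rate > DROP_RATE_LIMIT)
            if unhealthy and active is primary_path:
                iface.install_path(backup_path)   # switch in milliseconds
                active = backup_path
            prev_sent, prev_dropped = sent, dropped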

We realized, of course, that we were on thin ice - an error could bring 
down an otherwise operational network in milliseconds.

My part was based on the question "what we are doing isn't improving 
things, so what do we do now?"  To me that was a sign of one of several 
possible things:

   a) Our model was wrong.

   b) The network topology was different than we thought it was (whether 
due to failure, error, or security penetration).

   c) Something was not working properly (or had been penetrated).

   d) A new thing had arrived in the structure of the net (for all kinds 
of reasons, including security penetration).

   etc.

In our view that would trigger entry into a "troubleshooting" mode 
rather than a control/management mode.  That would invoke all kinds of 
tools, some of which would scare security managers (and thus would need 
to be carefully wielded by a limited cadre of people).

One of the things that fell out of this is that we lack something that I 
call a database of network pathology.  It would begin with a collection 
of anecdotal data about symptoms and the reasoning chain (including 
tests that would need to be performed) to work backwards towards 
possible causes.

(Back in the 1990's I began some test implementations of pieces of this 
- originally in Prolog.  I've since taken a teensy tiny part of that and 
incorporated it into one of our protocol testing products.  But it is 
just a tiny piece, mainly some data structures to represent the head of 
a reverse-reasoning chain of logic.)
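
(Here is a bare-bones sketch, in Python rather than Prolog, of what I 
mean by the head of such a chain; the symptom, tests, and causes shown 
are invented purely for illustration:)

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Cause:
        description: str

    @dataclass
    class Test:
        description: str    # what to check, e.g. a probe to perform
        implicates: Cause   # the cause implicated if the test is positive

    @dataclass
    class Symptom:
        description: str
        tests: List[Test] = field(default_factory=list)

    # One anecdotal entry, reasoning backwards from symptom toward causes:
    stall = Symptom("TCP sessions stall shortly after opening")
    stall.tests.append(Test("probe the path MTU with the DF bit set",
                            Cause("ICMP 'fragmentation needed' filtered")))
    stall.tests.append(Test("compare advertised MSS values at both ends",
                            Cause("a middlebox clamping or rewriting MSS")))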

In a broader sense, several other things were revealed.

One was that we are deeply under-investing in our network diagnostic and 
repair technology.  And as we build ever higher and thicker security 
walls we are making it more and more difficult to figure out what is 
going awry and to correct it.  That, in turn, raises the question of 
whether we are going to need to create a kind of network priesthood of 
people who are privileged to go into the depths of networks, often 
across administrative boundaries, going where privacy and security 
concerns must be honored.  As a lawyer who has lived for decades legally 
bound to such obligations, I do not feel that this is a bad thing, but 
many others do not feel the same way that I do about a highly privileged 
class of network repair people.

Another thing that I have realized along the way is that we need to look 
to biology for guidance.  Living things are very robust; they survive 
changes that would collapse many of our human creations.  How do they do 
that?  Well, first we have to realize that in biology, death is a useful 
tool that we often can't accept in our technical systems.

But as we dig deeper into why biological things survive while human 
things don't, we find that evolution usually does not throw out existing 
solutions to problems, but layers new solutions on top of them.  All of 
these are always active, pulling with and against one another, but the 
newer ones tend to dominate.  So as a tree faces a 1000-year drought it 
first pulls the latest solutions from its genetic bag of tricks, like 
folding its leaves down to reduce evaporation.  But when that doesn't do 
the job, older solutions start to become top dog and exercise control.

It is that competition of solutions in biology that provides 
robustness.  The goal is survival.  Optimal use of resources comes into 
play only as an element of survival.
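
(A toy sketch, with invented names, of that layering: every solution 
remains present, and the newest one that can still cope gets control:)

    def fold_leaves(state):    # the newest trick: cheap and targeted
        return state["water"] > 20

    def shed_leaves(state):    # an older, more drastic fallback
        return state["water"] > 5

    def go_dormant(state):     # the oldest solution: survive at any cost
        return True

    LAYERED_SOLUTIONS = [fold_leaves, shed_leaves, go_dormant]  # newest first

    def respond(state):
        """The newest solution that still copes exercises control; as it
        fails, the older ones beneath it become top dog."""
        for solution in LAYERED_SOLUTIONS:
            if solution(state):
                return solution.__name__

    print(respond({"water": 30}))   # -> fold_leaves
    print(respond({"water": 10}))   # -> shed_leaves
    print(respond({"water": 2}))    # -> go_dormant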

But on our networks we too often have exactly one solution.  And if that 
solution is brittle or does not extend to a new situation, then we have 
a potential failure.  An example of this is how TCP congestion detection 
and avoidance ran into something new - too many buffers in switching 
devices - and caused a failure mode: bufferbloat.


> We also discovered quite a few bugs in various SNMP implementations, 
> where the data being provided were actually quite obviously incorrect. 
> I wondered at the time whether anyone else had ever tried to actually 
> use the SNMP data, more than just writing it into a log file.

I still make a surprisingly large part of my income from helping people 
find and fix SNMP errors.  It's an amazingly difficult protocol to 
implement properly.
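
(As a tiny illustration of what "actually using" the data can look like, 
here is a made-up sanity check on polled ifInOctets values; it catches 
the obviously-incorrect cases, such as deltas that imply a rate faster 
than the link itself:)

    COUNTER32_MAX = 2**32   # ifInOctets is a Counter32; it wraps at 2^32

    def octet_delta(prev, curr):
        """Octets since the last poll, allowing one legitimate wrap."""
        return curr - prev if curr >= prev else curr + COUNTER32_MAX - prev

    def plausible(prev, curr, seconds, link_bits_per_sec):
        """Reject values a correct agent could not have produced: the
        implied rate must not exceed the link speed."""
        return octet_delta(prev, curr) * 8 <= link_bits_per_sec * seconds

    # e.g. a wrapped but plausible pair of polls on a 1 Gb/s link:
    print(plausible(prev=4294967000, curr=120, seconds=30,
                    link_bits_per_sec=1e9))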

My wife and I wrote a paper back in 1996, "Towards Useful Network 
Management", that remains, even 26 years later, in my opinion, a rather 
useful guide to some of the things we need:

https://www.iwl.com/idocs/towards-useful-network-management

         --karl--



