[ih] Fwd: The IMP Lights story (Was: Nit-picking an origin story) -- with interview text
Steve Crocker
steve at shinkuro.com
Sun Aug 24 12:38:09 PDT 2025
[Apparently the attachment got stripped when I sent this to the
Internet-History list. This is a resend, with the text of the interview
copied directly into this message at the end.]
Folks,
On Monday, 18 August 2025, I described how the lights on the IMPs often
burned out and caused a noticeable amount of downtime. Geoff
Goodfellow asked for more details. That exchange is copied below.
I learned of the problem with the IMP lights during a virtual roundtable
with Ben Barker and others. We published the roundtable in [1].
I later interviewed Ben and Scott Bradner to learn more details. The
interview [2] is appended below.
In the process of checking, Alex McKenzie sent me a more recent article that
Dave Walden, he, and Ben wrote, covering several incidents related to
reliability, and he sent a reference to the article to the list. See [3]
below. I also learned that Ben passed away two years ago. I'm sad. He
was a delightful and always positive guy.
After further discussion with Alex, we agreed [1] has the least detail.
[3] is best, but it's behind a paywall. The interview [2] is a
close second.
I think this is all the information that's available.
Thanks to Ben for the delightful story, to Geoff for asking for the
details, to Scott for permission to use the interview, and to Alex for the
recent article and advice on how to proceed.
Steve
[1] "The Arpanet and Its Impact on the State of Networking," Stephen D.
Crocker, Shinkuro, Inc., Computer, October 2019. This was a virtual
roundtable with Ben Barker, Vint Cerf, Bob Kahn, Len Kleinrock and Jeff
Rulifson. Ben mentioned the problem with the IMP lights. It's only a
small portion of the overall roundtable. The next two references have more
detail.
[2] "Fixing the lights on the IMPs," an unpublished interview with Ben
Barker and Scott Bradner, 3 July 2020. It's appended below.
[3] "Seeking High IMP Reliability in the 1970' ARPAnet" by Walden,
McKenzie, and Barker, published in Vol 44, No 2 (April - June 2022) of IEEE
Annals of the History of Computing.
---------- Forwarded message ---------
From: the keyboard of geoff goodfellow <geoff at iconia.com>
Date: Mon, Aug 18, 2025 at 8:54 PM
Subject: Re: [ih] Nit-picking an origin story
To: Steve Crocker <steve at shinkuro.com>
Cc: John Day <jeanjour at comcast.net>, Dave Crocker <dhc at dcrocker.net>,
Internet-history <internet-history at elists.isoc.org>, <dcrocker at bbiw.net>
[I] am innately curious about the ARPANET "The IMPs Lights Reliability
Issue" you mention here and wonder if some additional color could be
elucidated to the colorful story as to just HOW "the lights on the IMP
panel being a major source of outages" and specifically what
"re-engineering" was effectuated to ameliorate them from crashing the IMPs?
On Mon, Aug 18, 2025 at 7:22 AM Steve Crocker via Internet-history <
internet-history at elists.isoc.org> wrote:
> ... Ben Barker has a colorful
> story about the lights on the IMP panel being a major source of outages.
> The IMPs had 98% uptime at first. 98% was astonishingly good
> compared to other machines of the day, but intolerably poor in terms of
> providing an always available service. Ben re-engineered the lights and
> brought the reliability up to 99.98%. How's that for a small thing having
> a big effect!
>
Fixing the lights on the IMPs
Below is an exchange with Ben Barker, stimulated by a comment by Scott
Bradner. I had been talking to Scott about another project but I opened by
asking a bit about his early years. He worked at Harvard in several
capacities over many decades. He started as a programmer in the psychology
department. Ben Barker had been a student at Harvard and later joined
BBN. Ben was a hardware guy. Scott mentioned Ben hired him to develop
circuit boards for the front panels of the IMPs. Ben participated in a
virtual roundtable last year, and I recall his vivid story of improving the
reliability of the IMPs.
For context, the following details are alluded to but not explained.
· The first several IMPs were built using ruggedized Honeywell 516
computers. I believe the cost to ARPA was $100K each. Production later
shifted to using regular, i.e. not ruggedized, Honeywell 316 computers. I
believe this dropped the cost to $50K each. I believe the architecture was
identical, though the 316 was probably slightly slower. Apparently, speed
wasn’t an issue,
so this change saved a noticeable amount of money. Also, as Ben makes
clear, there were some unfortunate changes to the front panel that may have
saved some cost in the production but were problematic in operation.
· The software in the IMP included the routines for receiving and
forwarding packets and retransmitting if they hadn’t been received
correctly. It also included the distributed algorithm for computing the
routing tables based on periodic exchange of tables with neighboring IMPs.
In addition to these primary functions, there were also four background
processes that implemented “fake hosts,” i.e. processes that were
addressable over the network as if they were hosts. Each one implemented a
specific function. One was DDT, the standard debugging technology of the
day. I say “standard” because I had encountered other implementations of
DDT while at MIT. I have no idea whether similar software was used in
other environments, but the concept was both simple and powerful, and I
assume it would have been copied widely. In brief, it’s an interactive
program that has access to the memory of one or more processes. There are
commands for starting and stopping the execution of the subject process,
examining or changing memory and setting breakpoints. AI Memo AIM-147 by
Tom Knight in 1968 describes DDT for the MIT AI Lab PDP-6. An earlier 1963
memo, Recent Improvements in DDT, by D. J. Edwards and M. L. Minsky makes
it clear DDT had been around for several years.
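[To make the DDT description above a bit more concrete, here is a toy sketch
in Python of an examine/deposit/breakpoint command loop. The command names,
the octal conventions, and the memory model are all invented for
illustration; this is not the actual IMP or PDP-6 DDT.]

    # Toy model of a DDT-style debugger: commands to examine and change the
    # memory of a subject process and to set breakpoints. All names invented.
    memory = [0] * 64              # stand-in for the subject process's memory
    breakpoints = set()

    def ddt_command(line):
        parts = line.split()
        if not parts:
            return
        cmd, args = parts[0], parts[1:]
        if cmd == "examine":                   # examine <octal addr>
            addr = int(args[0], 8)
            print(f"{addr:o}/ {memory[addr]:o}")
        elif cmd == "deposit":                 # deposit <octal addr> <octal value>
            addr, value = int(args[0], 8), int(args[1], 8)
            memory[addr] = value
        elif cmd == "break":                   # break <octal addr>
            breakpoints.add(int(args[0], 8))
        elif cmd == "go":                      # resume the subject process
            print("resuming (not modeled in this sketch)")

    # Example session:
    ddt_command("deposit 17 777")
    ddt_command("examine 17")                  # prints "17/ 777"
    ddt_command("break 42")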
See my comments after the exchange.
Date: Jul 3, 2020, 2:37 PM
From: Steve Crocker <steve at shinkuro.com>
To: Scott, Ben, me
Ben,
I was chatting with Scott about his early days. He mentioned doing the
circuit boards for the IMP front panels, and I recognized it was part of
the same story you told me about fixing the lights.
Scott,
Thanks for your time today. You mentioned doing the circuit boards for the
IMP front panels. As I mentioned, I listened to Ben Barker vividly
describe how this made a big difference in reducing the number of service
calls and improving the uptime of the IMPs. Ben participated in a virtual
roundtable last year that was published in the IEEE's Computer magazine. A
copy is attached. Ben mentions reliability in both his brief bio and later
on page 22. I've copied the text from that page.
*BARKER:* Reliability surprised me. I was surprised to find that the
reliability requirements for a network are much more extreme than the
reliability requirements for a computer, even for the computer upon which
the network switch is built. When we first started operating the Arpanet,
on average each node was down about a half hour a day. And people were
saying, the net’s always down and there was talk of canceling it, shutting
down the net because it just wasn’t good enough to be useful. We had a
meeting with our subcontractor for hardware to present our complaint to
them that nodes were down a half hour a day. Their reaction was, “You’re
down a half hour out of 24. That’s about 2%. You’re getting 98% up time.
We’ve never gotten 98% uptime on anything we’ve built. How are you doing
it?”
Eventually, I took over field operations of the network. They thought it
was strange to have a Ph.D. running a field service operation, but, you
know, we were weird guys. But little by little, we chipped away at it over
the course of a year and a half. We got the availability from 98 to 99.98%,
and the user community reaction went from “the net’s always down” to “the
net’s never down.” But that change is something that would have been
written into the spec if they were talking about that kind of application,
nuclear survivability.
Ben doesn't mention the lights in the article but I definitely remember him
describing this to me. It might have been in a separate conversation that
wasn't recorded.
Cheers,
Steve
Date: Fri, Jul 3, 4:01 PM
From: Ben Barker
To: me, sob at sobco.com
Indeed! And hello you old SOB. How have you been doing the last half
century?
Honeywell built the 516s and 316s using low-voltage incandescent bulbs for
the displays. On the ruggedized 516s, they were in sockets with a screw-on
clouded cover. Not too bad. On the 316, the bulbs were mounted inside the
momentary contact rocker switches that were used to input data.
Unfortunately, these switches were actuated by pressing and releasing them,
allowing the switch to pop back to the resting position. There was a
strong spring pushing the switch back out resulting in a pretty strong
mechanical snap on release. More unfortunately, this mechanical shock was
simultaneous with the bulb turning on – the inrush of maximum current into
the cold filament – ideal conditions and timing for burning out a bulb.
More unfortunately yet, the bulbs were mounted in the switch by soldering
the leads to the connector. This meant that the bulbs burnt out very
frequently and a dead bulb required taking the IMP down, disassembling the
front panel, unsoldering the dead bulb, and soldering in a new one.
But wait! It gets worse! The IMPs were fragile: once a machine was taken
down, it typically took hours – sometimes days – to get it back up.
I asked Scott to come up with a design that would replace the switches and
bulbs, using red LED bulbs in their place. Scott found a switch / LED
combo that fit just right into the holes in the 316 front panel and
designed a PC card that carried the switch / lights in just the right
places to fit in. Scott went into production and we retrofitted them in all
the 316s in the field. Down time dropped amazingly.
The other half of the strategy was dealing with the fragile IMPs that took
a long time to bring back up. Scott’s switch / light panel was the first
big step in eliminating probably the majority of the times we had to take
the IMPs down. The next was stand-up PMs – leaving the machines up and
running while performing preventative maintenance – mostly cleaning the air
filters and checking and adjusting the power supply voltages and performing
a visual check. It eliminated one IMP down episode per IMP per month and
helped enormously in eliminating the extended effort to bring the machines
back up. I have recently been informed that this is a known phenomenon,
the “Waddington Effect”: eliminating unnecessary PM on World War 2 bombers
produced a 60% increase in the number of effective flying hours.
The third leg of the stool was using self-diagnostic and remote diagnostic
techniques to find problems early on before the network users were aware of
a problem and scheduling a tech to go out to replace an already-identified
card, typically off-hours when nobody from that site was using the net.
Sorry to ramble…
/b
Date: Jul 3, 2020, 4:15 PM
From: Steve Crocker
To: Ben, me, sob at sobco.com
Very cool stuff. Question: what sorts of things could be diagnosed
remotely?
Steve
Date: Jul 3, 2020, 6:49 PM
From: Ben Barker
To: me, sob at sobco.com
Mostly it was figuring out how to find problems that had brought one or
more IMPs down previously. One was an IMP that was just running slow. We
figured out to check a count of how many background loops the machine would
complete in a given length of time – a minute? Easy to do with a program on
our PDP-1 reaching out through the IMP’s DDT to read the register that was
incremented by the background loop. If something was keeping the machine
inordinately busy, it would show up as low loop counts. If it was low, we
would check the return address of the interrupt routines which would show
us what the machine was doing the last time the interrupt happened. Then
there was just debugging using the DDT.
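[A rough Python sketch of the loop-count check Ben describes above. The
read_word callback stands in for reading one memory word through an IMP's
DDT fake host; the counter address and the threshold are invented for
illustration.]

    import time

    LOOP_COUNTER_ADDR = 0o1234    # invented address of the background-loop counter
    SLOW_THRESHOLD = 10_000       # invented floor on background loops per minute

    def loops_per_minute(imp, read_word):
        # Read the counter twice, a minute apart, and difference the readings.
        first = read_word(imp, LOOP_COUNTER_ADDR)
        time.sleep(60)
        second = read_word(imp, LOOP_COUNTER_ADDR)
        return (second - first) % (1 << 16)   # allow a 16-bit counter to wrap

    def check_imp(imp, read_word):
        rate = loops_per_minute(imp, read_word)
        if rate < SLOW_THRESHOLD:
            print(f"IMP {imp}: only {rate} loops/minute -- check interrupt returns")
        else:
            print(f"IMP {imp}: {rate} loops/minute looks healthy")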
We had a machine that was confused about time. It turned out its Real Time
Clock card was bad. I wrote a PDP-1 routine called RTCHEK that would
trivially check an IMP’s RTC.
There was the infamous Harvard crash, wherein a memory failure occurred in
the area used by the IMP to store its routing tables. John McQuillan modified the
IMP code to checksum the table before using it. Told us instantly of
memory failures in that stack.
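[A minimal sketch of the checksum-the-table-before-using-it idea. The table
layout and the simple additive checksum are stand-ins; the real IMP code
used its own in-memory format.]

    def table_checksum(table):
        # Simple additive checksum over 16-bit words, as a stand-in.
        return sum(table) & 0xFFFF

    def store_routing_table(words):
        return list(words), table_checksum(words)

    def use_routing_table(table, stored_sum):
        if table_checksum(table) != stored_sum:
            raise RuntimeError("routing table corrupted -- memory failure suspected")
        return table                    # safe to route with it

    table, sum0 = store_routing_table([3, 1, 4, 1, 5, 9])
    table[2] ^= 0o100                   # simulate a flipped bit in memory
    try:
        use_routing_table(table, sum0)
    except RuntimeError as err:
        print(err)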
The modem interfaces generated and checked 24-bit checksums on all packets
on the line. We sometimes would get packets that passed the checksum check
but whose contents were in error. We started having the IMPs send such
packets to a teletype in Cambridge where we would print them out in octal
and I would examine them. The most common packets were routing packets and
they were very stylized. Sometimes a given bit would not always get
recorded in memory properly and it would be clear which one from looking at
the packet. If it was a problem in the input or output shift register, it
would show up on a given bit and on the bits to its left. More typically
it was a problem gating the data onto the data bus. In any case, you could
pretty well identify which gate was failing and schedule our service tech
out to replace that card at night.
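[A toy illustration of spotting a stuck bit by comparing received, highly
stylized packets against what they should have contained. The sample words
are invented; the real diagnosis was done by eye from octal printouts.]

    def stuck_bit_report(expected_words, received_words, width=16):
        report = []
        for bit in range(width):
            mask = 1 << bit
            exp = [(w & mask) != 0 for w in expected_words]
            rcv = [(w & mask) != 0 for w in received_words]
            if all(rcv) and not all(exp):
                report.append(f"bit {bit} looks stuck at 1")
            if not any(rcv) and any(exp):
                report.append(f"bit {bit} looks stuck at 0")
        return report

    expected = [0o170017, 0o000377, 0o052525]      # stylized routing-packet words
    received = [w | 0o000040 for w in expected]    # simulate bit 5 stuck at 1
    print(stuck_bit_report(expected, received))    # ['bit 5 looks stuck at 1']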
At times, we would patch the IMP to software checksum the packets on the
line to find out if the check registers were failing. At times we would
turn on the software checksum and turn off the hardware checksum to see
problems in the AT&T equipment.
These are random examples. We did lots of such. Mostly all done from our
PDP-1 DDT. It was pretty cool.
😊
/b
Date: Fri, Jul 3, 7:27 PM
From: Steve Crocker
To: Ben, Scott
Cool stuff. Did you guys ever write up these details? A bigger question is
whether the techniques you developed ever influenced or got used by others. I
would imagine that virtually every network operator and router vendor
needed to develop similar tools.
Thanks,
Steve
Date: Fri, Jul 3, 7:32 PM
From: Ben Barker
To: me, sob at sobco.com
1 – No. This thread is the most detail I know of.
2 – Not to my knowledge. I believe that I was told that DECnet later
incorporated remote diagnostics, but I don’t think they had something like
the remote DDT in the switch that was the basis for most of what we did.
But I am only speculating here.
Reflective comments
· These details bring a bit of life into the seemingly ordinary
process of fielding IMPs.
· The BBN IMP group was small, smart and agile. Most were software or
hardware engineers who had been at MIT, Harvard and Lincoln Laboratory.
· Barker says the improvement from 98% uptime to 99.98%, i.e. a
reduction of downtime from 2% to 0.02% (the hundredfold improvement he
refers to in his bio section of the virtual roundtable), made a
qualitative difference in the perception within the user community
about the reliability of the network. This speaks directly to the dual
nature of the project, i.e. a research project to some like Kleinrock and
Kahn, versus a durable platform for others to build applications upon and
get work done.
To press this point a bit further, there were several time-sharing projects
pursued with IPTO support. These ranged from Multics at the high end down
to GENIE at Berkeley, Tenex at BBN, ITS at MIT and various others over the
years. IPTO didn’t put all of its eggs into any single basket. Some of
these, particularly GENIE on the SDS 940 and Tenex on the PDP-10, were
adopted by others in the community and became workhorses. In the network
arena, however, IPTO did not sponsor multiple projects. Hence, there was
more emphasis for the Arpanet to be a usable system and not just one of
several possible systems.
· The learning curve Barker describes is not a surprise. It’s exactly
what’s to be expected once an initial system is put into operation.
However, the fact these techniques were not documented and promulgated
suggests either or both of:
a. Although the BBN group published multiple papers about their work,
there may have been less publication than there would have been in a
university.
b. The remote debugging and other aspects of improving the reliability
might not have seemed special enough to be worth publishing.