[ih] Design choices in SMTP

Thu Feb 9 14:18:48 PST 2023

RJE was another priority of Lick at MIT.   So I implemented an RJE for 
our PDP-10 to enable users to submit their "card decks" and get their 
"printout" after their job was run.   Checking to see if the job had 
been run was part of the RJE protocol, but there was no way to have the 
CCN system proactively tell you your output was ready. So I had to build 
a "daemon" for the PDP-10 (ITS OS) to interact with the CCN over the 
ARPANET, submitting the card deck, polling periodically to see if the 
job was done (there was no way to ask about the job queue), and then 
retrieve the output (whatever would have come out on the CCN printer).
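
The submit / poll / retrieve cycle described above can be sketched in
Python. This is only an illustration, not the original PDP-10 code:
FakeHost, run_job, the job identifier and the listing text are all
hypothetical names invented here, and the remote system is simulated
in memory rather than reached over a network.

```python
import time


class FakeHost:
    """Hypothetical in-memory stand-in for the remote batch system."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done
        self.deck = None

    def submit(self, deck):
        self.deck = deck
        return "JOB1"  # hypothetical job identifier

    def is_done(self, job_id):
        # The host never announces completion; the client has to ask.
        self.remaining -= 1
        return self.remaining <= 0

    def retrieve(self, job_id):
        return f"LISTING FOR {job_id}: {len(self.deck)} cards read"


def run_job(host, deck, poll_interval=0.0, max_polls=100):
    """Submit a deck, poll until the job finishes, then fetch the output."""
    job_id = host.submit(deck)
    for _ in range(max_polls):
        if host.is_done(job_id):
            return host.retrieve(job_id)
        time.sleep(poll_interval)
    raise TimeoutError(f"{job_id} not finished after {max_polls} polls")


output = run_job(FakeHost(polls_until_done=3), ["//JOB", "//EXEC", "/*"])
print(output)
```

A real daemon would use a much longer poll interval and would hand the
retrieved listing to whatever delivers it to the user.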

AFAIK, there was very little if any use of this system at MIT.  Most 
people had had their fill of punch cards and waiting for listings. But 
the testing I did indicated an annoyingly high failure rate. Not as bad 
as 50% failures, but far short of 99% success.

So, debugging time.....

Since the polling nature of the interaction meant that the human user 
might not be online, a daemon process had to run on the system to do the 
interactions with CCN whenever CCN was ready.   I had already written a 
mail server daemon, so it was an obvious design choice to simply add RJE 
functionality to the mail server.   To "submit a card deck", you would 
prepare the card deck in your favorite editor, then use it as the text 
of a "message".  But instead of "sending" the message, you would "submit 
to CCN".   The mail server would then fire up an RJE connection rather 
than an SMTP one, and carry out all of the needed interactions with the 
CCN system.   When the job output was ready, the daemon would retrieve 
the listing from CCN, and inform the originating user that the output 
was available.  That output looked like just another message, which 
could be archived, forwarded, sent to the printer, sent to the 
Datacomputer for posterity, or whatever else you could do with 
"email".   Essentially from a user's view you emailed your card deck to 
CCN, and somewhat later got a reply containing the printer output.

The use of a daemon meant that all of the interactions over the ARPANET 
were computer-computer rather than human-computer.   This followed 
Lick's vision of having the computer and network help people do stuff.   
We assumed people would have their local computers connected to the 
network, where traditionally people would have their terminal connected 
to the network, logged in to some remote computer over a Telnet-style 
connection.

I found the source of the unreliability.

Sadly, although our design was based on computer-computer interactions, 
the RJE protocol mechanisms were based on human-computer interactions.  
RJE assumed that there was a human at the other end of the connection.  
The commands and responses were pretty well defined, so my code could 
generally imitate a human pretty effectively and handle submitting a 
card deck and retrieving the results.

But since the system assumed there was a human in the picture....

Looking at the logs, I noticed that the protocol interactions, all those 
nicely formatted commands and responses, would occasionally be 
interrupted by "noise" on the connection.   You would occasionally see 
something like:

MESSAGE FROM SYSOP!!!   Remember CCN will be down this Saturday and 
Sunday for installation of additional disk drives!

Such "noise" would really screw up the state machine dealing with the 
RJE exchanges.   It could happen at any time, even in the middle of an 
important protocol word, and would simply appear in the ASCII stream 
that was carrying all of the RJE protocol commands.
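
A minimal Python sketch of that failure mode, under loudly stated
assumptions: the three-digit reply codes, the state names and the
reply texts below are hypothetical and are not the actual CCN RJE
dialogue. The point is only that a state machine expecting one
well-formed reply per step is derailed by a single unsolicited line.

```python
import re

# Hypothetical reply format: a three-digit code, a space, then text.
REPLY = re.compile(r"^(\d{3}) (.*)$")

# Hypothetical dialogue: the reply code expected in each state, and the
# state the client moves to when it sees that code.
EXPECTED = {
    "connect": ("220", "submit"),
    "submit": ("250", "poll"),
    "poll": ("251", "done"),
}


def run_session(lines):
    """Feed server lines to a strict state machine; return the final state."""
    state = "connect"
    for line in lines:
        if state in ("done", "failed"):
            break
        match = REPLY.match(line)
        code, next_state = EXPECTED[state]
        if match is None or match.group(1) != code:
            # Anything other than the expected reply derails the exchange.
            state = "failed"
        else:
            state = next_state
    return state


clean = ["220 READY", "250 JOB ACCEPTED", "251 OUTPUT FOLLOWS"]
noisy = [
    "220 READY",
    "MESSAGE FROM SYSOP!!!   Remember CCN will be down this Saturday",
    "250 JOB ACCEPTED",
    "251 OUTPUT FOLLOWS",
]

print(run_session(clean))  # done
print(run_session(noisy))  # failed
```

Skipping lines that do not look like replies would rescue this
particular case, but not noise that lands in the middle of a reply,
which is why the problem could not simply be filtered away.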

This experience and other similar encounters with trying to use a 
human-computer connection for computer-computer interactions (email 
headers were a big one, as well as the idiosyncrasies of the FTP "MAIL" 
command) motivated me to write RFC 722 -- 
https://www.rfc-editor.org/rfc/rfc722 to try and capture some design 
principles for computer-computer interactions that were crucial to 
Lick's vision.

Fun times, if sometimes a bit frustrating.

Jack

On 2/9/23 11:22, Dave Crocker via Internet-history wrote:
>
>> NETRJE didn’t get a lot of use because the systems that could support 
>> a Server FTP didn’t need the NETRJE. (Can someone correct me about 
>> that?)
>>
>> What did get a lot of use was CCNRJE written by Bob Braden for the 
>> UCLA CCN 360/91.
>
>
> I used the RJE program, available at ISI, to submit jobs to the UCLA 
> CCN 360/91.  But I would have sworn the author was someone other than 
> Braden.
>
>
> The nature of the use was not as straightforward as Steve's example.
>
> My job was user support and documentation at the UCLA Arpanet 
> project.  A year or two before getting hired, I'd become a fan of the 
> NLS system at SRI, even including the text-only interface. (Prior to 
> dropping out and getting hired for this job, I wrote a text formatter, 
> with inline commands, that emulated the hierarchical text model in 
> NLS.  I developed and ran it at my place of work which was the other 
> 360/91 on campus, the NIH-funded Health Sciences Computing.  There 
> were only 18 of those machines built. This was also my only major 
> foray into using PL/1.)
>
> Anyhow, I taught the department secretaries -- remember when that was 
> what they were called? -- to use the remote system.  After editing, we 
> needed to be able to print documents.  The ones they'd been editing 
> and the ones from the NIC, of course.
>
> So I had them FTP the document down to ISI and run the RJE program to 
> send the document to the high-speed, upper and lower case printer at 
> CCN.  The fastest U/L printer we had in the department was 120 cps, so 
> this was /much/ better.
>
> This also wound up generating a serious bit of learning about computer 
> science and statistics. (Before and after dropping out, I studied 
> Psychology and had taken 1 really basic stats course and no CS courses.)
>
> The secretaries quickly got facile with the process, but they started 
> complaining that the sequence would often fail.  I turned to my office 
> mate, Jon Postel, and asked whether he had any suggestions. He had me 
> explain the total sequence being used and asked how often things were 
> failing and what the symptoms were.
>
> I noted that there were widely different systems, but that the overall 
> failure rate seemed to be about 50%.
>
> He asked how reliable our department Sigma 7 was. I suggested a good, 
> but not outstanding number and he agreed.  Then he asked about the net 
> itself, and we agreed it was highly reliable, maybe 90%. Then SRI, 
> which wasn't great, and ISI, which had gotten quite good, then CCN, 
> which was not great.
>
> Cumulative probability came out almost exactly at 50%.
>
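A rough sketch of that arithmetic, in Python. Only the roughly 90%
figure for the network comes from the message; the other per-system
numbers are hypothetical placeholders chosen for illustration, and
multiplying them assumes the failures are independent.

```python
from math import prod

# Hypothetical per-component success rates. Only the ~90% network figure
# appears in the message; the rest are illustrative guesses.
reliability = {
    "Sigma 7": 0.90,
    "ARPANET": 0.90,
    "SRI": 0.80,
    "ISI": 0.95,
    "CCN": 0.80,
}

# Every step in the chain has to succeed, so the success rates multiply
# (assuming independent failures).
overall = prod(reliability.values())
print(f"{overall:.2f}")  # 0.49
```

Five individually tolerable systems in series are enough to bring the
end-to-end success rate down to about one in two.
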
> I later heard that the failure to perform a similar, aggregate failure 
> rate exercise was the reason the Russians beat us to space...
>
> d/
>