[ih] Evolution of Internet audio and video
Jack Haverty
jack at 3kitty.org
Mon Sep 29 18:21:36 PDT 2025
FYI, multimedia was on the Internet radar very early. It was probably
the most important driving force for the evolution of TCP2 to become
TCP/IP4 in the late 1970s and early 1980s.
Shortly after I got the assignment to implement the first TCP for Unix
in 1977, I started attending Internet meetings. At one of the early
ones, I remember Vint describing a bunch of "scenarios" that the
Internet was expected to handle. One was especially memorable. It
involved a teleconference among a group of military officers spread
over a broad geographic area: some perhaps in the Pentagon or regional
command centers, others far away in jeeps, tanks, or even helicopters,
in action on a battlefield.
The gist of the teleconference was to collect information about what was
happening, make decisions, and issue orders to the field units. At the
time, video was not even a dream, but it was deemed feasible even in the
near term to use the multimedia then available. For example, everyone
might have some kind of display device, enabling them all to see the
same map or graphic. A pointing device would let whoever was speaking
point at the graphic, and everyone else would see the same motions on
their displays. The teleconference would be conducted by voice, which
of course had to be interactive. The voice also had to stay
synchronized with the graphics, so that orders like "Move your
battalion here; we're going to bomb over here" didn't cause serious
problems if transmission delays in the Internet let the voice and
graphics drift out of sync.
Such scenarios drove the thinking about what the Internet technology had
to be able to do. They led to a consensus that the virtual connection
service of TCP2 was insufficient, due to its likelihood of delays that
would disrupt interactive voice. In addition, the consensus was that
multiple types of service should be provided by the Internet. One type
might be appropriate for interactive voice, where getting most of the
data delivered promptly was more important than eventually getting all
of it delivered. Another would suit large data transfers, such as
high-resolution graphics, which had to arrive intact but did not need
to arrive within milliseconds.
That led to the split of TCP into TCP and IP, and the introduction of
UDP as a possible vehicle for carrying interactive content with a need
for low latency. In addition, it might be useful for different types of
traffic to follow different routes through the Internet. Interactive
traffic might use a terrestrial route, while bulk traffic such as
graphics might travel through long-delay but high-bandwidth
geosynchronous satellite networks. The TOS field was added to the IP
header so that a teleconferencing program could tell the Internet how to
handle its traffic.
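In today's terms, that knob still exists: a program can set the TOS
(now DSCP) byte on a UDP socket and hope the path honors it. A minimal
sketch, purely illustrative and not taken from any of the early
implementations:

    /* Mark a UDP socket's packets with a TOS/DSCP value so that routers
     * which honor the field can treat interactive voice differently from
     * bulk transfers.  0xB8 is DSCP "EF" (expedited forwarding) placed in
     * the old TOS byte; whether a given path honors it is another matter. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP socket */
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        int tos = 0xB8;   /* DSCP EF shifted into the old TOS byte */
        if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
            perror("setsockopt(IP_TOS)");

        /* ... then send the voice datagrams over fd as usual ... */

        close(fd);
        return 0;
    }

These days the byte is interpreted as DSCP plus ECN rather than the
original TOS semantics, and many networks simply ignore it.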
TCP/IP4 created an experimental environment where such approaches could
be tried. Various researchers used to come to the Internet meetings to
report on their experiments and lobby for new mechanisms. (I recall
Steve Casner and Jim Forgie as frequent attendees with those
interests.) Experimentation later produced the MBONE, whose multicast
helped reduce traffic loads through the Internet. The MBONE
seems to have faded away over the years, and various "silos" of
proprietary teleconferencing mechanisms have popped up to provide such
functionality, but unfortunately seem to have done so in a
non-interoperable way.
Today, I use teleconferencing with Zoom, FaceTime, and several others.
There seem to be plenty of choices, and it seems to work pretty well, at
least for my personal scenarios. But a few years ago I was asked to
give a presentation over the Internet to a conference halfway around the
planet, and we decided that it was too risky to count on that Internet
path being good enough at the scheduled time. So we prerecorded the
presentation and transferred it via FTP well ahead of time. Perhaps it
would have worked, but we couldn't be confident.
Recently I heard anecdotal reports that the Internet on cruise ships
works well - but is reliable only when the ship is far out to sea. When
it's in port, or even just approaching port, teleconferencing is
unreliable. My speculation is that the traffic load near a port
includes all the land-based users and the network may be overwhelmed.
But that's just speculation; I have no data.
So I wonder: is the multimedia-on-the-Internet problem now solved?
As near as I can tell, the Internet today provides only one type of
service, with all datagrams following the same route. Did the
introduction of fiber make the concerns of the 1980s moot? Does
teleconferencing now work well throughout the Internet? Do users simply
abandon the idea of using the Internet for teleconferencing when they
discover it doesn't work for them (as I did for my presentation)? Does
the military now do what the 1970s scenarios envisioned over the Internet?
How did multimedia on the Internet evolve over the last 45+ years?
Jack Haverty
On 9/29/25 14:59, Karl Auerbach via Internet-history wrote:
> On 9/29/25 2:13 PM, Craig Partridge wrote:
>
>> * How to persuade video to deal with occasional loss. Dave Clark did
>> early outreach to codec experts and said that, in response to the
>> question "What do we do if some of your data has to be dropped?",
>> they were told "Don't. We're good at compression and if the data
>> could be dropped, we'd have removed it." As I recall, it was Facebook
>> that led to codecs that could deal with loss?
>>
> Steve Casner and I worked really hard on these issues. And because we
> often moved audio and video via different packet streams,
> loss/delay/duplication/re-sequencing on one of the streams had an
> impact on the other stream.
>
> Many codecs are not friendly to loss or underrunning their input
> buffers. And with cipher chained (aka block-chained) streams it can
> get harder to pick up the sticks when a packet is lost.
>
> We were working with UDP so we did not have TCP trying to do
> reliability and sequencing.
>
> Some of the issues we faced were "what do we do when we don't have a
> video or audio packet at the time we need to feed it to the rendering
> hardware?" For audio there was "redundant audio transport", aka "RAT"
> in which the data in packet N was carried in lower quality in packet
> N+1 (or N+2).
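>
> Roughly, the idea looks like the sketch below (this is only an
> illustration of the scheme, not the real RAT wire format, and the
> frame sizes are made-up):
>
>     #include <stdint.h>
>     #include <string.h>
>
>     #define PRIMARY_BYTES   160  /* e.g. 20 ms of 64 kb/s audio (assumed) */
>     #define REDUNDANT_BYTES  20  /* the same 20 ms at a much lower rate   */
>
>     /* Each outgoing packet carries this frame's primary encoding plus a
>      * low-rate copy of the previous frame, so a single lost packet can
>      * be papered over from its successor. */
>     struct red_packet {
>         uint16_t seq;                        /* sequence number        */
>         uint8_t  primary[PRIMARY_BYTES];     /* frame N, full quality  */
>         uint8_t  redundant[REDUNDANT_BYTES]; /* frame N-1, low quality */
>     };
>
>     /* Build packet N from frame N and a low-rate copy of frame N-1. */
>     static void build_red_packet(struct red_packet *pkt, uint16_t seq,
>                                  const uint8_t *frame_n,
>                                  const uint8_t *prev_frame_lowrate)
>     {
>         pkt->seq = seq;
>         memcpy(pkt->primary, frame_n, PRIMARY_BYTES);
>         memcpy(pkt->redundant, prev_frame_lowrate, REDUNDANT_BYTES);
>     }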
>
> For video we had to deal with 30-per-second freight trains of closely
> spaced large packets.
>
> There were demarcations in the streams marking where sound spurts
> began and where video frames ended. Loss of those packets forced us
> to develop heuristics to infer where those boundaries were and what
> to do about it.
>
> Out-of-order packets were a bane.
>
> Patching voice/video data is hard because it can create artifacts,
> sometimes unexpected ones, such as synthetic tones when audio was
> being patched (and patched with what - we experimented with silence
> [doesn't work well] or averaging the prior/next [worked better], etc.)
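>
> The "average the prior/next" patch was about as simple as it sounds; a
> sketch of the idea (sample format and framing assumed, not our actual
> code):
>
>     #include <stdint.h>
>     #include <stddef.h>
>
>     /* Fill a lost audio frame by averaging the corresponding samples of
>      * the frames on either side of the gap.  Note that using the "next"
>      * frame means holding it back, i.e. extra delay before rendering. */
>     static void conceal_by_averaging(const int16_t *prev, const int16_t *next,
>                                      int16_t *out, size_t nsamples)
>     {
>         for (size_t i = 0; i < nsamples; i++)
>             out[i] = (int16_t)(((int32_t)prev[i] + (int32_t)next[i]) / 2);
>     }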
>
> Things are worse these days because of the games that "smart" Ethernet
> NICs play with Ethernet frames - such as combining several small
> Ethernet frames and delivering them to the receiving operating system
> as one large (up to 64 Kbyte!) Ethernet frame. One's software has to
> approach a modern Ethernet NIC with a software sledge hammer to turn
> off all of the "offloads".
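>
> On Linux, one form of that sledge hammer is the ethtool ioctl; a
> sketch only (interface name "eth0" assumed, and only the
> receive-merging offload shown):
>
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
>     #include <sys/ioctl.h>
>     #include <sys/socket.h>
>     #include <net/if.h>
>     #include <linux/ethtool.h>
>     #include <linux/sockios.h>
>
>     int main(void)
>     {
>         int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket will do */
>         if (fd < 0) { perror("socket"); return 1; }
>
>         /* Ask the kernel to turn generic receive offload off, so small
>          * received frames stop being merged into jumbo deliveries. */
>         struct ethtool_value ev = { .cmd = ETHTOOL_SGRO, .data = 0 };
>         struct ifreq ifr;
>         memset(&ifr, 0, sizeof(ifr));
>         strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
>         ifr.ifr_data = (char *)&ev;
>
>         if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
>             perror("ioctl(SIOCETHTOOL, ETHTOOL_SGRO)");
>
>         close(fd);
>         return 0;
>     }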
>
> All in all the cure for many things was to add delay before rendering
> content. But that affected conversational uses where, according to
> the ITU, we have a round-trip budget of only about 140 milliseconds
> before people go into half-duplex/walkie-talkie mode. I really wanted
> to get my physicist friends to consider increasing the speed of light,
> but they were resistant to the idea.
>
> I began work on a meta stream to carry information about objects in
> the video stream (in order to do fast, set-top product placements and
> such) and with scripted morphing to react to events in the viewer's
> space. (E.g. morph Alan Arkin's eyes onto the source of a viewer's
> gasp, such as when he sneaks up on Audrey Hepburn in the film Wait
> Until Dark.) This was part of my notion about breaking down the 4th
> wall. I hypothesized a video conferencing system in which each person
> posted a series of photos in a set of patterned poses - then the
> conference would proceed by sending small morphing instructions rather
> than full images. One could turn a knob to change from "staid
> English" to "hand waving Italian" modes of presentation. (This came
> out of my work on communications with submarines, in which voice was
> converted into tokenized words rather than conveyed as voice itself -
> that saved a lot of bandwidth on our 300 bits/second path, and the
> resulting voice was much clearer and more comprehensible, even if the
> speaker was synthetic - and it was something we suggested to the FCC
> for air traffic control. I had pieces of these things running, but
> only small pieces. It is an area that is waiting for further work.)
>
> Tools to test and exercise this stuff were hard to come by. Jon had
> proposed his "flakeway" and a few years later I built one (operating
> as a malicious Ethernet switch rather than as a router). I now sell
> that, or a distant successor, as a product.
>
> --karl--
>