[ih] Evolution of Internet audio and video
Jack Haverty
jack at 3kitty.org
Mon Sep 29 18:21:36 PDT 2025
FYI, multimedia was on the Internet radar very early. It was probably
the most important driving force for the evolution of TCP2 to become
TCP/IP4 in the late 1970s and early 1980s.
Shortly after I got the assignment to implement the first TCP for Unix
in 1977, I started attending Internet meetings. At one of the early
ones, I remember Vint describing a bunch of "scenarios" that the
Internet was expected to handle. One was especially memorable. It
involved a teleconference among a group of military officers spread
over a broad geographic area: some perhaps in the Pentagon or regional
command centers, others far away in jeeps, tanks, or even helicopters,
in action on a battlefield.
The gist of the teleconference was to collect information about what was
happening, make decisions, and issue orders to the field units. At the
time, video was not even a dream, but it was deemed feasible even in the
near term to use the multimedia then available. For example, everyone
might have some kind of display device, enabling them all to see the
same map or graphic. A pointing device would let whoever was speaking
point at the graphic, and everyone else would see the same motions on
their displays. The teleconference would be conducted by voice, which
of course had to be interactive. The voice also had to stay
synchronized with the graphics, so that orders like "Move your
battalion here; we're going to bomb over here" didn't cause serious
problems if transmission delays in the Internet let the voice and
graphics drift out of sync.
Such scenarios drove the thinking about what the Internet technology had
to be able to do. They led to a consensus that the virtual connection
service of TCP2 was insufficient, due to its likelihood of delays that
would disrupt interactive voice. In addition, the consensus was that
multiple types of service should be provided by the Internet. One type
might be appropriate for interactive voice, where getting most of the
data delivered promptly was more important than eventually getting all
of it delivered. Another would suit large data transfers, such as
high-resolution graphics, which had to arrive intact but did not need
to arrive within milliseconds.
That led to the split of TCP into TCP and IP, and the introduction of
UDP as a possible vehicle for carrying interactive content with a need
for low latency. In addition, it might be useful for different types of
traffic to follow different routes through the Internet. Interactive
traffic might use a terrestrial route, while bulk traffic such as
graphics might travel through long-delay but high-bandwidth
geosynchronous satellite networks. The TOS field was added to the IP
header so that a teleconferencing program could tell the Internet how to
handle its traffic.
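In today's terms, that knob still exists: a program can set the TOS
(now DSCP) byte on a UDP socket and hope the path honors it. A minimal
sketch, purely illustrative and not taken from any of the early
implementations:

    /* Mark a UDP socket's packets with a TOS/DSCP value so that routers
     * which honor the field can treat interactive voice differently from
     * bulk transfers.  0xB8 is DSCP "EF" (expedited forwarding) placed in
     * the old TOS byte; whether a given path honors it is another matter. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP socket */
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        int tos = 0xB8;   /* DSCP EF shifted into the old TOS byte */
        if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
            perror("setsockopt(IP_TOS)");

        /* ... then send the voice datagrams over fd as usual ... */

        close(fd);
        return 0;
    }

These days the byte is interpreted as DSCP plus ECN rather than the
original TOS semantics, and many networks simply ignore it.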
TCP/IP4 created an experimental environment where such approaches could
be tried. Various researchers used to come to the Internet meetings to
report on their experiments and lobby for new mechanisms. (I recall
Steve Casner and Jim Forgie as frequent attendees with those
interests.) Experimentation later produced the MBONE, whose multicast
helped reduce traffic loads through the Internet. The MBONE
seems to have faded away over the years, and various "silos" of
proprietary teleconferencing mechanisms have popped up to provide such
functionality, but unfortunately seem to have done so in a
non-interoperable way.
Today, I use teleconferencing with Zoom, FaceTime, and several others.
There seem to be plenty of choices, and it seems to work pretty well, at
least for my personal scenarios. But a few years ago I was asked to
give a presentation over the Internet to a conference halfway around the
planet, and we decided that it was too risky to count on that Internet
path being good enough at the scheduled time. So we prerecorded the
presentation and transferred it via FTP well ahead of time. Perhaps it
would have worked, but we couldn't be confident.
Recently I heard anecdotal reports that the Internet on cruise ships
works well - but is reliable only when the ship is far out to sea. When
it's in port, or even just approaching port, teleconferencing is
unreliable. My speculation is that the traffic load near a port
includes all the land-based users and the network may be overwhelmed.
But that's just speculation; I have no data.
So I wonder: is the multimedia-on-the-Internet problem now solved?
As near as I can tell, the Internet today provides only one type of
service, with all datagrams following the same route. Did the
introduction of fiber make the concerns of the 1980s moot? Does
teleconferencing now work well throughout the Internet? Do users simply
abandon the idea of using the Internet for teleconferencing when they
discover it doesn't work for them (as I did for my presentation)? Does
the military now do what the 1970s scenarios envisioned over the Internet?
How did multimedia on the Internet evolve over the last 45+ years?
Jack Haverty
On 9/29/25 14:59, Karl Auerbach via Internet-history wrote:
> On 9/29/25 2:13 PM, Craig Partridge wrote:
>
>> * How to persuade video to deal with occasional loss. Dave Clark did
>> early outreach to codec experts and said that, in response to the
>> question "What do we do if some of your data has to be dropped?",
>> they were told "Don't. We're good at compression and if the data
>> could be dropped, we'd have removed it." As I recall, it was Facebook
>> that led to codecs that could deal with loss?
>>
> Steve Casner and I worked really hard on these issues. And because we
> often moved audio and video via different packet streams,
> loss/delay/duplication/re-sequencing on one of the streams had an
> impact on the other stream.
>
> Many codecs are not friendly to loss or underrunning their input
> buffers. And with cipher chained (aka block-chained) streams it can
> get harder to pick up the sticks when a packet is lost.
>
> We were working with UDP so we did not have TCP trying to do
> reliability and sequencing.
>
> Some of the issues we faced were "what do we do when we don't have a
> video or audio packet at the time we need to feed it to the rendering
> hardware?" For audio there was "redundant audio transport", aka "RAT"
> in which the data in packet N was carried in lower quality in packet
> N+1 (or N+2).
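>
> Roughly, the idea looks like the sketch below (this is only an
> illustration of the scheme, not the real RAT wire format, and the
> frame sizes are made-up):
>
>     #include <stdint.h>
>     #include <string.h>
>
>     #define PRIMARY_BYTES   160  /* e.g. 20 ms of 64 kb/s audio (assumed) */
>     #define REDUNDANT_BYTES  20  /* the same 20 ms at a much lower rate   */
>
>     /* Each outgoing packet carries this frame's primary encoding plus a
>      * low-rate copy of the previous frame, so a single lost packet can
>      * be papered over from its successor. */
>     struct red_packet {
>         uint16_t seq;                        /* sequence number        */
>         uint8_t  primary[PRIMARY_BYTES];     /* frame N, full quality  */
>         uint8_t  redundant[REDUNDANT_BYTES]; /* frame N-1, low quality */
>     };
>
>     /* Build packet N from frame N and a low-rate copy of frame N-1. */
>     static void build_red_packet(struct red_packet *pkt, uint16_t seq,
>                                  const uint8_t *frame_n,
>                                  const uint8_t *prev_frame_lowrate)
>     {
>         pkt->seq = seq;
>         memcpy(pkt->primary, frame_n, PRIMARY_BYTES);
>         memcpy(pkt->redundant, prev_frame_lowrate, REDUNDANT_BYTES);
>     }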
>
> For video we had to deal with 30-per-second freight trains of closely
> spaced large packets.
>
> There were demarcations in the streams marking where sound spurts
> began and where video frames ended. Loss of those packets forced us
> to develop heuristics to infer where those boundaries were and what
> to do about it.
>
> Out-of-order packets were a bane.
>
> Patching voice/video data is hard because it can create artifacts,
> sometimes unexpected ones, such as synthetic tones when audio was
> being patched (and patched with what - we experimented with silence
> [doesn't work well] or averaging the prior/next [worked better], etc.)
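>
> The "average the prior/next" patch was about as simple as it sounds; a
> sketch of the idea (sample format and framing assumed, not our actual
> code):
>
>     #include <stdint.h>
>     #include <stddef.h>
>
>     /* Fill a lost audio frame by averaging the corresponding samples of
>      * the frames on either side of the gap.  Note that using the "next"
>      * frame means holding it back, i.e. extra delay before rendering. */
>     static void conceal_by_averaging(const int16_t *prev, const int16_t *next,
>                                      int16_t *out, size_t nsamples)
>     {
>         for (size_t i = 0; i < nsamples; i++)
>             out[i] = (int16_t)(((int32_t)prev[i] + (int32_t)next[i]) / 2);
>     }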
>
> Things are worse these days because of the games that "smart" Ethernet
> NICs play with Ethernet frames - such as combining several small
> Ethernet frames and delivering them to the receiving operating system
> as one large (up to 64 Kbyte!) Ethernet frame. One's software has to
> approach a modern Ethernet NIC with a software sledge hammer to turn
> off all of the "offloads".
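>
> On Linux, one form of that sledge hammer is the ethtool ioctl; a
> sketch only (interface name "eth0" assumed, and only the
> receive-merging offload shown):
>
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
>     #include <sys/ioctl.h>
>     #include <sys/socket.h>
>     #include <net/if.h>
>     #include <linux/ethtool.h>
>     #include <linux/sockios.h>
>
>     int main(void)
>     {
>         int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket will do */
>         if (fd < 0) { perror("socket"); return 1; }
>
>         /* Ask the kernel to turn generic receive offload off, so small
>          * received frames stop being merged into jumbo deliveries. */
>         struct ethtool_value ev = { .cmd = ETHTOOL_SGRO, .data = 0 };
>         struct ifreq ifr;
>         memset(&ifr, 0, sizeof(ifr));
>         strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
>         ifr.ifr_data = (char *)&ev;
>
>         if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
>             perror("ioctl(SIOCETHTOOL, ETHTOOL_SGRO)");
>
>         close(fd);
>         return 0;
>     }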
>
> All in all the cure for many things was to add delay before rendering
> content. But that affected conversational uses where, according to
> the ITU, we have a round-trip budget of only about 140 milliseconds
> before people go into half-duplex/walkie-talkie mode. I really wanted
> to get my physicist friends to consider increasing the speed of light,
> but they were resistant to the idea.
>
> I began work on a meta stream to carry information about objects in
> the video stream (in order to do fast, set-top product placements and
> such) and with scripted morphing to react to events in the viewer's
> space. (E.g. morph Alan Arkin's eyes onto the source of a viewer's
> gasp, such as when he sneaks up on Audrey Hepburn in the film Wait
> Until Dark.) This was part of my notion about breaking down the 4th
> wall. I hypothesized a video conferencing system in which each person
> posted a series of photos in a set of patterned poses - then the
> conference would proceed by sending small morphing instructions rather
> than full images. One could turn a knob to change from "staid
> English" to "hand waving Italian" modes of presentation. (This came
> out of my work on communications with submarines, in which voice was
> converted into tokenized words rather than conveyed as voice itself -
> that saved a lot of bandwidth on our 300 bits/second path, and the
> resulting voice was much clearer and more comprehensible, even if the
> speaker was synthetic - and it was something we suggested to the FCC
> for air traffic control. I had pieces of these things running, but
> only small pieces. It is an area that is waiting for further work.)
>
> Tools to test and exercise this stuff were hard to come by. Jon had
> proposed his "flakeway" and a few years later I built one (operating
> as a malicious Ethernet switch rather than as a router). I now sell
> that, or a distant successor, as a product.
>
> --karl--
>