[ih] Evolution of Internet audio and video

Karl Auerbach karl at iwl.com
Mon Sep 29 18:51:14 PDT 2025


Today's Internet multimedia (which mostly means videos and Zoom-like 
conferences) does not work all that well.

Original IP multicast did not work well in a multi-administrative 
environment.  There were a lot of problems when the distribution tree 
crossed administrative boundaries, and there was pressure on carriers to 
hot-potato the traffic onto another provider.

(One day at Precept I was installing a new Cisco router - a small one, a 
2514 - and it was not yet configured with its addresses. But the MBone 
DVMRP routing found it and started to send the entire MBone traffic over 
our T-1 link while our poor router tried to scream "prune", "prune!", 
"PRUNE!!!" but could not be heard, because those prune messages could 
not be sent: the unicast routing had not yet been configured.  It was 
like when one of Dave Mills' PDP-11/03 boxes became the destination for 
all destinations on the net.)

That problem was significantly reduced when one of Dave Cheriton's 
students came up with the idea of single-source multicast (later 
standardized as Source-Specific Multicast, SSM).  This changed the 
original multiple-source IP multicast into something far more 
manageable and stable.  But I have not seen it used much in commercial 
products, nor do I know how well supported it is in routers and edge 
devices.

Single source works well for presentation-style audio/video or for 
systems in which every participant feeds into a mixing engine that 
resolves things like data formats (Zoom does this, but it uses direct 
TCP connections to deliver the mixed content to the users.)
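For the single-source model, the receiver-side join is small.  A minimal sketch of a Source-Specific Multicast (SSM) join, assuming a Linux-style sockets API; the group, source address, and port here are made-up example values:

```python
import socket

# In SSM a receiver joins a (source, group) pair rather than just a
# group, so routers build one tree rooted at that single sender and
# never forward traffic from anyone else.
GROUP = "232.1.2.3"       # 232/8 is the address range reserved for SSM
SOURCE = "198.51.100.7"   # the one permitted sender (example address)

# Fall back to the Linux value if the constant is missing on this platform.
IP_ADD_SOURCE_MEMBERSHIP = getattr(socket, "IP_ADD_SOURCE_MEMBERSHIP", 39)

def ssm_membership(group, source, interface="0.0.0.0"):
    """Pack a struct ip_mreq_source: group, interface, source (12 bytes)."""
    return (socket.inet_aton(group)
            + socket.inet_aton(interface)
            + socket.inet_aton(source))

# A receiver on a multicast-capable network would then do:
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.bind(("", 5004))
#   sock.setsockopt(socket.IPPROTO_IP, IP_ADD_SOURCE_MEMBERSHIP,
#                   ssm_membership(GROUP, SOURCE))
```

Anyone else sending to that group simply never reaches the receiver, which is what makes the routing manageable.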

One of the difficulties is when there is a mix of well-provisioned 
user/clients and some poorly provisioned ones.  The question becomes 
"who waits?"  (And there is that problem of one stream, e.g. voice, 
being out of sync with visual pointers, such as someone pointing to a 
map and saying "we meet here at dawn".)
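The usual RTP-era mitigation for that cross-stream skew is to map each stream's timestamps onto a common sender wallclock (RTCP sender reports pair an RTP timestamp with an NTP time) and render both streams against that clock plus one shared playout delay.  A minimal sketch; the numbers in the comment are illustrative:

```python
def playout_time(rtp_ts, sr_rtp_ts, sr_ntp, clock_rate, shared_delay):
    """Map an RTP timestamp to sender wallclock via the latest RTCP
    sender report (which pairs sr_rtp_ts with NTP time sr_ntp), then
    add one playout delay shared by audio and video so the two streams
    render in sync."""
    wallclock = sr_ntp + (rtp_ts - sr_rtp_ts) / clock_rate
    return wallclock + shared_delay

# Audio at 8 kHz, 4000 ticks (0.5 s) after a sender report stamped at
# sender time 100.0, with a 0.2 s shared playout delay, renders at 100.7.
```

The voice and the moving pointer stay aligned because both streams are scheduled against the same wallclock, not against their own arrival times.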

Present-day conferencing works because most conferences have relatively 
few users.  When Steve Casner and I did the Precept RTP stack we went 
full bore and supported large client populations - that required a lot 
of code to scale back the client feedback in order to avoid packet 
implosions crushing the sources.  (That was hard code to test!)
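That backoff survives in RFC 3550's RTCP timing rules: each receiver stretches its report interval linearly with the group size, so aggregate feedback stays within a fixed (conventionally 5%) slice of session bandwidth.  A simplified sketch, omitting the sender/receiver bandwidth split and the interval randomization the RFC also requires:

```python
def rtcp_interval(members, avg_rtcp_packet_bytes, session_bw_bytes_per_s,
                  t_min=5.0):
    # RTCP gets ~5% of the session bandwidth; dividing that budget by
    # the group size makes each member report less often as the group
    # grows, so the source never sees a feedback implosion.
    rtcp_bw = 0.05 * session_bw_bytes_per_s
    return max(t_min, members * avg_rtcp_packet_bytes / rtcp_bw)

# With 10 members, everyone can report every t_min seconds; with a
# million members, each one reports only rarely, yet the source still
# sees a steady, bounded trickle of feedback overall.
```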

And things like Netflix and other streaming work basically because we 
are willing to dedicate a lot of bandwidth to those streams and to 
distribute the origination of the traffic across many, often widely 
dispersed, servers.

(I notice that the net on this very day seems to be having stuttering 
problems - I've observed it from many points of view, including 
point-of-sale devices.  So it does seem that our assumption of plenty of 
bandwidth may at times be optimistic.)

One of the interesting variations on IP multicast - a variation that 
never took off - was reliable multicast: multicast file transfers and 
reliable, open-ended multicast streams.  Multicast allowed 
expanding-ring (TTL-based) searches of "nearby" other recipients in 
order to obtain copies of lost data.  It was kinda cool (but did create 
security issues regarding malicious introduction of modified data.)
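The expanding-ring search is just a loop over the multicast TTL.  A sketch of a hypothetical repair protocol (not any specific product): ask the nearest ring first, and widen the scope only when nobody answers:

```python
import socket

def expanding_ring_repair(group, port, request, max_ttl=32, timeout=0.25):
    """Multicast `request` (e.g. a NACK naming a lost block) with a
    small TTL; any nearby receiver holding the data replies by unicast.
    No answer -> double the TTL, widening the ring of potential helpers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    ttl = 1
    try:
        while ttl <= max_ttl:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
            sock.sendto(request, (group, port))
            try:
                data, peer = sock.recvfrom(65536)
                return data, peer        # someone in this ring had a copy
            except OSError:              # timed out: widen the ring
                ttl *= 2
        return None, None                # nobody within max_ttl answered
    finally:
        sock.close()
```

The security issue follows directly from the sketch: any host inside the ring can answer, and nothing here authenticates the repair data it sends back.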

Given that broadcast TV and radio are slowly dying (and their RF bands 
snapped up for other purposes) we may need to revisit how we do Internet 
multimedia.

         --karl--


On 9/29/25 6:21 PM, Jack Haverty via Internet-history wrote:
> FYI, multimedia was on the Internet radar very early.  It was probably 
> the most important driving force for the evolution of TCP2 to become 
> TCP/IP4 in the late 1970s and early 1980s.
>
> Shortly after I got the assignment to implement the first TCP for Unix 
> in 1977, I started attending Internet meetings.  At one of the early 
> ones, I remember Vint describing a bunch of "scenarios" that the 
> Internet was expected to handle.  One was especially memorable. It 
> involved a teleconference with a group of military officers, located 
> over a broad geographic area including some perhaps in the Pentagon or 
> regional command centers, and others far away in jeeps or tanks, or 
> even helicopters, in action on a battlefield.
>
> The gist of the teleconference was to collect information about what 
> was happening, make decisions, and issue orders to the field units. At 
> the time, video was not even a dream, but it was deemed feasible even 
> in the near term to use the multimedia then available.  For example, 
> everyone might have some kind of display device, enabling them all to 
> see the same map or graphic.  A pointer device would allow anyone 
> while speaking to point to the graphic and everyone else would see the 
> same motions on their displays.  The teleconference would be conducted 
> by voice, which of course had to be interactive.  It also had to be 
> synchronized with the graphics, so that orders like "Move your 
> battalion here; we're going to bomb over here." didn't cause serious 
> problems if transmission delays were happening in the Internet and the 
> voice and graphics became unsynchronized.
>
> Such scenarios drove the thinking about what the Internet technology 
> had to be able to do.  It led to a consensus that the virtual 
> connection service of TCP2 was insufficient, due to its likelihood of 
> delays that would disrupt interactive voice.  In addition, the 
> consensus was that multiple types of service should be provided by the 
> Internet.  One type might be appropriate for interactive voice, where 
> getting as much data delivered as possible was more important than 
> getting all the data delivered eventually.  Similarly, large data 
> transfers, such as high-resolution graphics, could be delivered 
> intact, but it was less important that they arrive within milliseconds.
>
> That led to the split of TCP into TCP and IP, and the introduction of 
> UDP as a possible vehicle for carrying interactive content with a need 
> for low latency.  In addition, it might be useful for different types 
> of traffic to follow different routes through the Internet. 
> Interactive traffic might use a terrestrial route, where bulk traffic 
> such as graphics might travel through long-delay, but high bandwidth, 
> geosynchronous satellite networks.  The TOS field was added to the IP 
> header so that a teleconferencing program could tell the Internet how 
> to handle its traffic.
>
> TCP/IP4 created an experimental environment where such approaches 
> could be tried.  Various researchers used to come to the Internet 
> meetings to report on their experiments and lobby for new mechanisms.  
> (I recall Steve Casner and Jim Forgie as being frequent attendees with 
> those interests).   Experimentation later produced the MBONE, with 
> multicast which helped reduce the traffic loads through the 
> Internet.   MBONE seems to have faded away over the years, and various 
> "silos" of proprietary teleconferencing mechanisms have popped up to 
> provide such functionality, but unfortunately seem to have done so in 
> a non-interoperable way.
>
> Today, I use teleconferencing with Zoom, Facetime, and several 
> others.   There seems to be a lot of choices.   It seems to work 
> pretty well, at least for my personal scenarios.  But a few years ago 
> I was asked to give a presentation over the Internet to a conference 
> halfway around the planet, and we decided that it was too risky to 
> count on that Internet path being good enough at the scheduled time.  
> So we prerecorded the presentation and transferred it via FTP well 
> ahead of time.   Perhaps it would have worked, but we couldn't be 
> confident.
>
> Recently I heard anecdotal reports that the Internet on cruise ships 
> works well - but is reliable only when the ship is far out to sea. 
> When it's in port, or even just approaching port, teleconferencing is 
> unreliable.   My speculation is that traffic loads when near a port 
> include all the land-based users and the network may be overwhelmed.  
> But that's just speculation, I have no data.
>
> So I wonder - is the multimedia on the Internet problem now solved?    
> As near as I can tell, the Internet today only provides one type of 
> service, with all datagrams following the same route. Did the 
> introduction of fiber make the concerns of the 1980s moot? Does 
> teleconferencing now work well throughout the Internet?  Do users 
> simply abandon the idea of using the Internet for teleconferencing 
> when they discover it doesn't work for them (as I did for my 
> presentation)?   Does the military now do what the 1970s scenarios 
> envisioned over the Internet?
>
> How did multimedia on the Internet evolve over the last 45+ years?
>
> Jack Haverty
>
>
> On 9/29/25 14:59, Karl Auerbach via Internet-history wrote:
>> On 9/29/25 2:13 PM, Craig Partridge wrote:
>>
>>>   * How to persuade video to deal with occasional loss. Dave Clark did
>>>     early outreach to codec experts and said that, in response to the
>>>     question "What do we do if some of your data has to be dropped",
>>>     they were told "Don't.  We're good at compression and if the data
>>>     could be dropped, we'd have removed it."  As I recall, it was Facebook
>>>     that led to codecs that could deal with loss?
>>>
>> Steve Casner and I worked really hard on these issues.  And because 
>> we often moved audio and video via different packet streams, 
>> loss/delay/duplication/re-sequencing on one of the streams had an 
>> impact on the other stream.
>>
>> Many codecs are not friendly to loss or underrunning their input 
>> buffers.  And with cipher chained (aka block-chained) streams it can 
>> get harder to pick up the sticks when a packet is lost.
>>
>> We were working with UDP so we did not have TCP trying to do 
>> reliability and sequencing.
>>
>> Some of the issues we faced were "what do we do when we don't have a 
>> video or audio packet at the time we need to feed it to the rendering 
>> hardware?"  For audio there was "redundant audio transport", aka 
>> "RAT" in which the data in packet N was carried in lower quality in 
>> packet N+1 (or N+2).
>>
>> For video we had to deal with 30-per-second freight trains of 
>> closely spaced large packets.
>>
>> There were demarcations in the streams about where sound spurts began 
>> and where video frames ended.  Loss of those packets forced us to 
>> develop heuristics about how to imply where those packets were and 
>> what to do about it.
>>
>> Out of order packets were a bane.
>>
>> Patching voice/video data is hard because it can create artifacts, 
>> sometimes unexpected ones, such as synthetic tones when audio was 
>> being patched (and patched with what - we experimented with silence 
>> [doesn't work well] or averaging the prior/next [worked better], etc.)
>>
>> Things are worse these days because of the games that "smart" 
>> Ethernet NICs play with Ethernet frames - such as combining several 
>> small Ethernet frames and delivering them to the receiving operating 
>> system as one large (up to 64 Kbyte!) Ethernet frame.  One's software 
>> has to approach a modern Ethernet NIC with a software sledge hammer 
>> to turn off all of the "offloads".
>>
>> All in all the cure for many things was to add delay before rendering 
>> content.  But that affected conversational uses where, according to 
>> the ITU we have a round trip budget of only about 140 milliseconds 
>> before people go into half-duplex/walkie-talkie mode.  I really 
>> wanted to get my physicist friends to consider increasing the speed 
>> of light, but they were resistant to the idea.
>>
>> I began work on a meta stream to carry information about objects in 
>> the video stream (in order to do fast, set top product placements and 
>> such) and with scripted morphing to react to events in the 
>> viewer's space.  (E.g. morph Alan Arkin's eyes onto the source of a 
>> viewer gasp, such as when he sneaks up on Audrey Hepburn in the film 
>> Wait Until Dark.)  This was part of my notion about breaking down the 
>> 4th wall.  I hypothesized a video conferencing system in which each 
>> person posted a series of photos in a set of patterned poses - then 
>> the conference would proceed by sending small morphing instructions 
>> rather than full images.  One could turn a knob to change from "staid 
>> English" to "hand waving Italian" modes of presentation. (This came 
>> out of my work with communications with submarines in which voice was 
>> converted into tokenized words rather than conveyed as voice itself - 
>> that saved a lot of bandwidth on our 300 bits/second path and the 
>> resulting voice was much clearer and comprehensible, even if the 
>> speaker was synthetic - and it was something we suggested to the FCC 
>> for air traffic control.  I had pieces of these things running, but 
>> only small pieces.  It is an area that is waiting for further work.)
>>
>> Tools to test and exercise this stuff were hard to come by.  Jon had 
>> proposed his "flakeway" and a few years later I built one (operating 
>> as a malicious Ethernet switch rather than as a router.)  I now sell 
>> that, or a distant successor, as a product.
>>
>>         --karl--
>>
>
>

