[ih] "how better protocols could solve those problems better"

John Gilmore gnu at toad.com
Wed Sep 30 18:59:51 PDT 2020


Craig Partridge <craig at tereschau.net> wrote:
> * Naming and organizing big data.  We are generating big data in many areas
> faster than we can name it.  And by "name" I don't simply mean giving
> something a filename but creating an environment to find that name,
> including the right metadata, and storing the data in places folks can
> easily retrieve it.  You can probably throw archiving into that too (when
> should data with this name be kept or discarded over time?).  What good are
> FTP, SCP, HTTPS, if you can't find or retrieve the data?

The Internet Archive has this problem.  I'm not the right expert to talk
about what they've done, but I can introduce you.

Their current storage design keeps all their data in "items": files of
up to multi-gigabyte size, each named with an arbitrary ASCII string and
containing metadata from the source.  These items are
replicated onto multiple drives in multiple data centers for resilience
and performance.  When drives fail, their former contents are replicated
onto new drives from other copies.  The items contain the definitive
metadata; the indices are caches.  Indices are built for high
performance searching, originally by reading all the metadata from all
the items, and then by incremental updating.  The main "catalog" index
allows finding which server(s) and drive(s) contain each individual
item, by its name.
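
For a concrete sense of that name-to-location mapping from the outside,
the Archive exposes a public metadata endpoint at
https://archive.org/metadata/<identifier>.  A minimal Python sketch
(assuming the JSON field names I understand that public endpoint to
return, not their internal catalog schema) might look like:

  # Rough sketch: look up where an item lives via the public metadata API.
  # Field names ("server", "workable_servers", "dir") are my reading of
  # that API's JSON, not a description of the internal catalog.
  import requests

  def locate_item(identifier):
      resp = requests.get("https://archive.org/metadata/%s" % identifier,
                          timeout=30)
      resp.raise_for_status()
      doc = resp.json()
      return {
          "primary_server": doc.get("server"),             # node serving the item now
          "replica_servers": doc.get("workable_servers"),  # other nodes holding copies
          "directory": doc.get("dir"),                     # path on those nodes
          "metadata": doc.get("metadata", {}),             # the item's own metadata
      }

  info = locate_item("charlottesweb00whit")
  print(info["primary_server"], info["replica_servers"])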

Their metadata is organized around three main types of items:

  *  Uploaded files from arbitrary users who offered them into the
     Archive's collection.  (archive.org, millions of items)

  *  Copies of web pages downloaded by Heritrix, the web crawler
     that feeds the Wayback Machine, which provides a user interface
     for seeing what web pages looked like in the past.  (web.archive.org,
     containing about 477 billion pages.)
     
  *  Books, shellac and vinyl records, and CDs that are owned by
     the Archive or partner libraries, and scanned in professionally.
     (archive.org or openlibrary.org)  Millions of items, but many
     are not publicly accessible until their copyrights expire.

All of these are stored in items, but each type has different metadata
contained within it, with additional index databases organized for the
kinds of searches expected (such as by author, title, musician, web
address, date, etc.).
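
As a small, simplified illustration of querying such indices from the
outside, the Archive's public advancedsearch.php endpoint accepts
fielded queries.  The parameters and fields in this sketch reflect my
understanding of that public interface, not the internal index layout:

  # Sketch: fielded search against the public advancedsearch.php endpoint.
  # The parameters (q, fl[], rows, output) and the Solr-style response
  # shape are my reading of the public interface.
  import requests

  def search_items(query, rows=10):
      params = {
          "q": query,
          "fl[]": ["identifier", "title", "creator", "date"],
          "rows": rows,
          "output": "json",
      }
      resp = requests.get("https://archive.org/advancedsearch.php",
                          params=params, timeout=30)
      resp.raise_for_status()
      return resp.json()["response"]["docs"]

  # e.g. find texts by a particular author
  for doc in search_items('creator:"White, E. B." AND mediatype:texts'):
      print(doc.get("identifier"), "-", doc.get("title"))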

Finding things in the Archive is considered difficult.  Their first few
decades focused on saving things for posterity, figuring that better
search methods could only be developed if the data itself was first
preserved.  They are now branching out into improved searchability.  For
example, they now offer full-text searching of their entire book
collection (using an Elasticsearch index built from the OCR'd text of
the photographic scans of each page of each book, all of which are
stored in items).
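
As a generic illustration of that approach (not their actual cluster or
schema; the local URL, index name, and fields below are invented for the
example), OCR'd page text can be indexed into Elasticsearch and searched
with a phrase query over its REST API:

  # Generic sketch of OCR full-text search using Elasticsearch's REST API.
  # The cluster URL, index name ("ocr_pages"), and field names are made up;
  # only the build-an-index-from-OCR-text idea is the point.
  import requests

  ES = "http://localhost:9200"

  def index_page(item_id, page, ocr_text):
      doc = {"item": item_id, "page": page, "text": ocr_text}
      # refresh=true makes the document searchable immediately (fine for a demo)
      r = requests.post(ES + "/ocr_pages/_doc?refresh=true", json=doc, timeout=10)
      r.raise_for_status()

  def search_pages(phrase):
      query = {"query": {"match_phrase": {"text": phrase}}}
      r = requests.get(ES + "/ocr_pages/_search", json=query, timeout=10)
      r.raise_for_status()
      return [hit["_source"] for hit in r.json()["hits"]["hits"]]

  index_page("charlottesweb00whit", 1, "Where's Papa going with that ax?")
  print(search_pages("Papa going"))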

Clearly, a key question is how the arbitrary strings that name items are
generated.  I don't know details about this, though for books, some
algorithm hints come through in item names like "charlottesweb00whit";
see for example:

  https://archive.org/details/charlottesweb00whit
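
Reading that identifier as a pattern is purely a guess on my part, but
it looks like a squashed lowercase title, a two-digit volume or copy
number, and the first few letters of the author's surname, roughly:

  # A guess at the apparent book-identifier pattern ("charlottesweb00whit"):
  # squashed lowercase title + two-digit volume/copy number + first four
  # letters of the author's surname.  Purely illustrative; this is not
  # the Archive's documented naming algorithm.
  import re

  def guess_book_identifier(title, author_surname, volume=0):
      squashed = re.sub(r"[^a-z0-9]", "", title.lower())
      return "%s%02d%s" % (squashed, volume, author_surname.lower()[:4])

  print(guess_book_identifier("Charlotte's Web", "White"))  # -> charlottesweb00whit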

	John
	


