
Technology

Back to the e-future

- Benjie Villa -
Anybody who wants to view The Philippine STAR’s original website cannot do so in our office. Sorry, nobody ever thought of storing the simple site that was launched in the wilds of cyberspace on March 26, 1997. For one thing, not even our most veteran editor can remember what the site looked like.

But STAR fans need not sulk. The Internet is not a huge digital library for nothing. Through the so-called "Wayback Machine," named after the time machine popularized by the 1960s cartoon Rocky and Bullwinkle, anyone with access to the Internet can view the first-ever STAR website and more.

The Wayback Machine can be accessed through the Internet Archive (www.archive.org), which styles itself as an online library whose purpose is to offer researchers, historians, and scholars permanent access to historical collections that exist in digital format.

Established in 1996 during the early days of the Internet, Internet Archive is more of a digital museum, serving as a rich depository of web pages and other digital artifacts, some of them documenting the 1960s pre-Internet days of the Advanced Research Projects Agency Network (Arpanet) of the United States Department of Defense, the forerunner of today’s Internet. You can watch movies as old as those filmed in 1903, digitized into video formats such as MPEG (.mpg) and AVI (.avi) for downloading and playing on a computer. And you can also access a digital library of children’s literature from around the world.

Recognizing the role of libraries (those that have "hard copy" documents, of course) in preserving society’s cultural artifacts, the Internet Archive aims to extend those functions into the digital world to foster education and scholarship in this era of digital technology.

Regarding the Internet as "a new medium with major historical significance," the Internet Archive further aims to prevent momentous web pages and other "born-digital" materials from disappearing into the past.

Internet Archive considers the present to be still the Digital Dark Ages, apparently because a massive effort to document all activities taking place in cyberspace is still wanting. "Without preserving cultural artifacts, civilization will have no memory and no mechanism to learn from its successes and failures," its creators explained.
Massive storage
Internet Archive’s Wayback Machine now holds 10 billion web pages, which include those of Philstar.com. Excluded from its collection are the 200 million web pages that pertain to the United States elections in 2000 and another 500 million web pages on the September 11 terrorist attacks on US soil.

Also stored in the website’s servers are about 16 million postings made from October 1996 to late 1998, and from the year 2000 up to the present, in various bulletin board systems (BBS); about 2,000 digital and digitized movies that date back to 1903; and Arpanet historical documents.

Internet Archive also has 50,000 file transfer protocol (FTP) sites, although that number is still small since the collection covers only the period from July to October 1996.

All in all, there are more than 100 terabytes of digital artifacts archived in the Wayback Machine.

And just what in the world is a terabyte? Way back in the early 1990s, when hard disks became widely used in place of floppy disks, people came to know the megabyte, which is about a million bytes, or exactly 1024 x 1024 bytes.

Before the turn of the millennium, hard disks capable of holding files in the gigabytes hit the markets. A gigabyte is about a thousand megabytes (or approximately 1 billion bytes or exactly 1024 x 1024 x 1024 bytes).

A terabyte, therefore, is about a million megabytes (approximately a trillion bytes) or exactly 1024 x 1024 x 1024 x 1024 bytes.
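For readers who want to check the arithmetic, the definitions above can be worked out in a few lines of Python. This is only an illustrative sketch: the names are made up for this example, and it assumes the binary convention used in this article, in which each unit is 1,024 times the one before it.

```python
# Binary storage units, following the article's definitions:
# each unit is 1,024 times larger than the previous one.
BYTES_PER_MEGABYTE = 1024 ** 2
BYTES_PER_GIGABYTE = 1024 ** 3
BYTES_PER_TERABYTE = 1024 ** 4

# The article's round figure for the Wayback Machine's holdings.
ARCHIVE_TERABYTES = 100

archive_bytes = ARCHIVE_TERABYTES * BYTES_PER_TERABYTE
megabytes_per_terabyte = BYTES_PER_TERABYTE // BYTES_PER_MEGABYTE

print(f"1 megabyte = {BYTES_PER_MEGABYTE:,} bytes")
print(f"1 gigabyte = {BYTES_PER_GIGABYTE:,} bytes")
print(f"1 terabyte = {BYTES_PER_TERABYTE:,} bytes")
print(f"1 terabyte = {megabytes_per_terabyte:,} megabytes")
print(f"{ARCHIVE_TERABYTES} terabytes = {archive_bytes:,} bytes")
```

Worked out this way, a terabyte comes to 1,048,576 megabytes, which is where the "about a million megabytes" figure comes from.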

And presently, files in the Wayback Machine are growing at a rate of 12 terabytes per month.

"This eclipses the amount of text contained in the world’s largest libraries, including the Library of Congress," said the Internet Archive. "If you tried to place the entire content of the archive onto floppy disks and laid them end to end, it would stretch from New York, past Los Angeles, and halfway to Hawaii."
Web pioneers
Featured in the Archive’s Wayback Machine are the earliest pioneers of the cyber-frontier. Among these is the Internet Movie Database (www.imdb.com), which first appeared on the Web in 1993, and has been dispensing movie trivia since the time Brooke Shields starred as Brenda Starr.

But even before that, there was already the WELL or Whole Earth ’Lectronic Link, which started as a BBS in 1985. First connected to the Net in 1992 and given its first home page in 1994, the WELL was at the forefront of the Internet revolution.

Of course, there was the classic Jerry’s Guide to the World Wide Web, which was initiated in 1994 by two Stanford computer geeks and is now popularly known as Yahoo!; the first online store, Amazon.com; and the WebCrawler, which, however, failed to keep up with the birth and advancement of other search engines.

Salvaged artifact pages of these sites, and countless others, can be browsed through the Wayback Machine.

Internet history, meanwhile, is incomplete without including its predecessor, the Arpanet.

Arpanet files stored in Internet Archive consist of memoranda, interview notes, periodicals, papers, and other materials documenting its development.

All of the digital materials were collected by Katie Hafner, who, together with Matthew Lyon, came out with the book Where Wizards Stay Up Late: The Origins of the Internet, in 1996. The Arpanet files can be viewed by using the browser plug-in DjVu, which can be downloaded from www.djvu.com.
Movies and children’s lit
Lamenting that much of early 20th-century media, particularly television and radio, was not being saved, Internet Archive has partnered with Prelinger.com in digitizing and storing movies from 35mm, 16mm, 8mm, Super 8mm, and various obsolete (film) formats like 28mm and 9.5mm.

The effort is obviously taxing. As Internet Archive explained, films were first transferred to Betacam SP videotape, a widely used analog broadcast video standard, on telecine machines. The film-to-tape transfer was not a real-time process. It required inspection of the film, repair of any physical damage, and supervision by a skilled operator who manipulated color, contrast, speed, and video controls.

Then the videotape masters created in the film-to-tape transfer were digitized, using an encoding workstation, a 550 MHz PC with a FutureTel NS320 MPEG encoder card. Custom software drove the Betacam SP playback deck and managed the encoding process. The files were then uploaded to a hard disk.

After being encoded at constant bitrates ranging from 2.75 Mbps to 3.5 Mbps, the files were converted to MPEG-2 video format, then to MPEG-4.
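To get a feel for what those bitrates mean in storage terms, the size of a constant-bitrate file can be estimated by multiplying the bitrate by the running time and dividing by eight bits per byte. The sketch below is a rough illustration only: the function name and the 10-minute running time are hypothetical, it assumes 1 Mbps means one million bits per second, and it ignores audio tracks and file-format overhead.

```python
def encoded_size_bytes(bitrate_mbps, duration_seconds):
    """Approximate size in bytes of a constant-bitrate video stream.

    Assumes 1 Mbps = 1,000,000 bits per second and 8 bits per byte;
    audio and container overhead are ignored.
    """
    total_bits = bitrate_mbps * 1_000_000 * duration_seconds
    return total_bits / 8


FILM_MINUTES = 10  # hypothetical running time, not a figure from the Archive

# The two ends of the bitrate range quoted in the article.
for bitrate in (2.75, 3.5):
    size = encoded_size_bytes(bitrate, FILM_MINUTES * 60)
    print(f"{bitrate} Mbps x {FILM_MINUTES} min = {size:,.0f} bytes")
```

Under these assumptions, a 10-minute film would take up roughly 206 million bytes at the low end of the range and about 262 million bytes at the high end.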

Most of the digitized films belong to a movie genre called "ephemeral films," those that were produced for specific purposes at specific times, not intended for long-term preservation. Ephemeral films are educational, industrial, advertising, and amateur films and home movies – genres that film historians and archives have largely ignored, despite their profound influence on the minds of spectators.

"Better than many other kinds of historical documents, ephemeral films document past persuasions and anxieties," Internet Archive explained. "They show us not only how we were, but how we were supposed to be."

Through a partnership with the Human-Computer Interaction Lab of the University of Maryland and a grant from the US National Science Foundation, Internet Archive is now building what it calls the International Children’s Digital Library (ICDL).

Its objective is to provide the world with a comprehensive library of international children’s literature that is available in digital format to all children, rich and poor. The digital collection has been envisioned to include the best books from all cultures.

Internet Archive said it further intends to create the mechanism, technologically and philosophically, that can stem the information divide and provide access to all.

At the moment, the people behind the Internet Archive aim to digitize 100,000 books and make them available throughout the world.
Drawbacks of Wayback
The Wayback Machine is, well, Internet Archive’s coolest tool, for it stores all the billions of artifact web pages.

Unlike the common search engines (e.g., Yahoo!, Altavista, Google, Yehey!, etc.), however, the Wayback Machine can only search web addresses, such as http://www.philstar.com. It cannot do a "full-text" search.
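Because lookups are keyed on the web address, a request for an archived page boils down to combining that address with a date. The sketch below assumes the address pattern the Wayback Machine uses for archived copies (a web.archive.org prefix, a 14-digit timestamp, then the original address); that pattern and the helper name are not taken from this article, and whether a copy exists for any given date is not guaranteed.

```python
from datetime import datetime

# Assumed prefix for archived copies; not stated in the article.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build the address of an archived copy of original_url near 'when'.

    The timestamp is written as YYYYMMDDhhmmss.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Ask for Philstar.com as of its launch date, March 26, 1997.
print(wayback_url("http://www.philstar.com", datetime(1997, 3, 26)))
```

When no copy was saved at the exact moment requested, the archive is expected to serve the capture closest to it, if one exists.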

Aside from this, it can hardly archive so-called dynamic pages properly, such as those that contain forms, JavaScript and other elements that require interaction between a web page and the originating host.

This is evidently why the archived artifact web page of Philstar.com in the Internet Archive does not show the current news supposedly scrolling below the collage logo. For the same reason, most of the web pages in Internet Archive’s collection have broken links and missing graphics.

Other drawbacks of the Wayback are the respect it accords to "robots.txt," a text file uploaded to a server that can instruct most search engines not to spider files from a specified website for future searches; its trouble with "server-side image maps," which require a web page to have contact with the original server; and "orphan pages," which no other page links to.
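How a robots.txt file shuts a crawler out can be illustrated with Python’s standard robots.txt parser. This is a minimal sketch, not the Archive’s own code: the sample file, the example.com addresses and the helper name are hypothetical, and "ia_archiver" is assumed here to be the name under which the Archive’s crawler identified itself.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one crawler is barred from the whole site,
# while every other crawler is barred only from the /private/ folder.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("ia_archiver", "SomeOtherBot"):
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/page.html"):
        print(agent, url, may_crawl(ROBOTS_TXT, agent, url))
```

A well-behaved crawler checks these rules before fetching anything, which is why a site that posts such a file can end up missing from the archive altogether.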

In other words, web pages properly archived by the Wayback are only those with the simplest text encoding.

But all these do not distract Internet Archive from working towards its vision, which is to save the past in hard drive servers not only for posterity’s sake, but also for future historians, researchers and journalists to have reference to the new civilization and cultures evolving at this point in time from the advent of the information superhighway.

The task is definitely formidable, for data formats and digital artifacts are prone to damage and obsolescence. Those behind the Internet Archive anticipate these. At the moment, to avoid data damage due to accidents and natural disasters, digital artifacts are stored in multiple servers.

The aging of storage media is also a risk, as media can degrade to a point where the data becomes permanently irretrievable. Thus, the Internet Archive follows the industry’s rule of thumb, which is to migrate data every 10 years and, given developments in computer hardware, migrate them more often after that.

The Internet Archive’s biggest problem, though, is the continuing advancement in software applications, which can make most of the data formats presently stored in the Wayback Machine obsolete. This will definitely cost the site hugely, what with the commercialization of the information superhighway, as it also intends to collect software and emulators that can read digital artifacts in, say, five or 10 years from now.
