Two Cheers For Google

Yes, Google’s newly launched News Archive Search is a great boon to those lacking subscriptions to super-expensive public record, newspaper and academic databases such as LexisNexis and JSTOR – all the news that’s unclassified and fit to print, going back decades.

For a few dollars a shot, bloggers can now sample what journalists have become totally hooked on.

Click over to one place and search. Cut and paste from a clutch of database cuttings. Leave no citations to indicate that your great thoughts aren’t your great thoughts alone and, damn, you sound authoritative.

Due respect to old skool Google, but you won’t want to go back. It’s like coca leaves v. crack cocaine.

And you won’t talk about it. In the last month, no journalist at any British quality newspaper, not the Guardian nor the Times nor the Telegraph, has mentioned, casually, in passing, that he or she uses LexisNexis. No mention in any US newspaper either. But everyone’s doing it. Quick and easy access to vast databases of information must be one of the most significant changes to journalistic practice in recent years.

Once the Google News Archive Search becomes more compelling – the timeline becomes more intuitive and there’s more content and more of it is free – and once bloggers, the end users that turned, start mining archives in the same way they mine their RSS aggregators, the standard news story format, in blogs and then in newspapers, is bound to change.

The prosumer news blogger, brought up to link and link again, is likely to introduce links to old newspaper stories. Iraq is the new Vietnam? Or the new Suez? Why not compare and contrast in great detail?

Great. The mainstream media shudders as the bloggers at the gate get fourth dimensional. Breadth is easy to provide thanks to broadband and Google. Now, users can expect depth as well, the historical context for any breaking news story.

But there is a flaw. You’re unlikely ever to touch bottom and get fully searchable depth across the full 200 years that Google claims it can deliver (LexisNexis, in comparison, goes back only about 25 years).
BBC

Before getting out the tinfoil over Google Earth and the emergence of a Google cosmology, or worrying about the recent airbrushing of a New York Times entry in LexisNexis, concerned citizens should be concerned about the technological limitations of news databases.

A search for “Internet” finds 15 news stories for 1819 and earlier:

He had heard a great deal about lhe .hipping internet…
The Times, 6 October 1812

Sadly, this isn’t evidence of any early success with steam-driven computers, the spawn of some Charles Babbage abandonware. It’s due to the limitations of Optical Character Recognition (OCR) technology.

For your average machine, Times Roman on yellowing newsprint is difficult to read. “Internet” and “interest” look pretty much the same.
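To see just how close the two words are, here is a minimal sketch in Python that counts the single-character edits separating them. The function name edit_distance is my own, and this is the textbook Levenshtein distance, not anything an OCR engine or Google is known to use internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i chars of a into an empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("internet", "interest"))  # 2
```

Two misread letters out of eight are all it takes to turn an 1812 shipping story into a hit for “Internet”.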

According to Nicholson Baker, searchable OCR text is often “intolerably corrupt”. A typical JSTOR article has a new typo every 2,000 characters – every page or two.
Nicholson Baker, Double Fold, p71

Enough to throw any serious research.
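A back-of-the-envelope sketch of what that rate means for a searcher. It assumes, purely for illustration, that typos land independently and evenly across the text, which real OCR errors almost certainly don't; the function names and the 30,000-character article length are hypothetical.

```python
CHARS_PER_TYPO = 2000  # Baker's figure for a typical JSTOR article


def expected_typos(article_chars: int) -> float:
    """Average number of typos in an article of the given length."""
    return article_chars / CHARS_PER_TYPO


def miss_probability(word_length: int) -> float:
    """Chance that at least one character of a word is corrupted,
    so that an exact-match search fails to find that occurrence.
    Assumes independent, evenly spread errors (an illustration only)."""
    per_char_ok = 1 - 1 / CHARS_PER_TYPO
    return 1 - per_char_ok ** word_length


# A hypothetical 30,000-character article
print(expected_typos(30_000))  # 15.0
# Share of occurrences of an eight-letter word that a search would miss
print(f"{miss_probability(8):.2%}")
```

Under those assumptions a search misses only a small fraction of any one word’s appearances, but every long article carries a handful of errors, and a researcher never knows which names and dates they fell on.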

Maybe the OCR technology will get better. Google is on the case. It recently released Tesseract OCR, an open-source version of an old OCR engine.

But, as Baker laments, the scanning of newspapers is mostly done. The cost of rescanning would be prohibitive. And, in any case, scanned newspapers tend to pass out of the archival system. They tend to get pulped or turned into decor for Dad’s den wall.