Saturday, December 18, 2010

The Cthulhu Ngram

From Ars Technica:
Is it possible to study something as ill-defined as culture in a quantitative manner? Researchers from Harvard have collaborated with Google and some traditional publishers to answer that question with a qualified "yes." By leveraging a portion of Google's massive library of digitized books, the team has created what they call a "culturome," with which they can track the use of language and terms across hundreds of years. This lets them track not only trends in language and usage, but the rise and fall of celebrities and historic events in the books of many eras. And, thanks to Google, the underlying data has been exposed via a Web interface, allowing others to perform their own analysis.

The authors didn't work with the full complement of Google's digitized texts, but the amount of material they did use is staggering: over 5 million books. They estimate that's about four percent of the books ever published. Google has about three times as many works scanned, but the scan quality and metadata on these—date and location of publication, etc.—isn't uniformly good, so the research has focused on the material with the best quality. Works start appearing in 1500, and include significant contributions in seven languages. They estimate that it would take someone 80 years to read it all, assuming said individual didn't eat or sleep.

Given this data source, the authors performed a simple analysis, identifying the frequency of a word relative to the total number of words produced that year. This compensates for the vast expansion in the printed word. In 1900, about 1.4 billion words a year were being printed; by 2000, that figure had reached 8 billion. With this data in hand, they performed a similar process for what they call "n-grams," or short phrases of up to five words.
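The normalization they describe is easy to picture in code. Here's a minimal sketch with made-up counts (the real dataset ships raw per-year match counts plus per-year totals; everything below is illustrative, including the words and numbers):

from collections import Counter

# Made-up per-year unigram counts. The real data gives a raw match count
# for every n-gram and year, plus the total words printed that year.
counts = {
    1900: Counter({"the": 70_000, "phonograph": 12}),
    2000: Counter({"the": 380_000, "phonograph": 3}),
}

def relative_frequency(term, year):
    """Occurrences of `term` divided by all words printed that year,
    which compensates for the huge growth in printing over time."""
    total = sum(counts[year].values())
    return counts[year][term] / total

def ngrams(tokens, n):
    """Short phrases ('n-grams', up to five words) are counted the same way."""
    return list(zip(*(tokens[i:] for i in range(n))))

for year in sorted(counts):
    print(year, f"{relative_frequency('phonograph', year):.2e}")

The only real trick is the division: raw counts always climb as more books get printed, so it's the relative frequency that makes 1900 comparable to 2000.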
And it's fun to play with. You can try it yourself at the Google Books Ngram Viewer.


[Figure: Ngram Viewer chart for "Cthulhu"]

I don't know where that little blip at 1900 came from. Curious.
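One way to chase the blip is to pull the raw yearly values out of the viewer. Its front end talks to an unofficial JSON endpoint; since that endpoint is undocumented, the parameters below are an assumption on my part and may break without notice. A sketch in Python:

import json
import urllib.parse
import urllib.request

# Unofficial endpoint behind the Ngram Viewer front end (an assumption:
# it is undocumented, so both the URL and parameters may change).
BASE = "https://books.google.com/ngrams/json"

def fetch_ngram(phrase, year_start=1880, year_end=1920, corpus=26, smoothing=0):
    """Fetch the relative-frequency timeseries for one phrase."""
    query = urllib.parse.urlencode({
        "content": phrase,
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,        # 26 = the "English 2019" corpus id
        "smoothing": smoothing,  # 0 = raw yearly values, no moving average
    })
    with urllib.request.urlopen(f"{BASE}?{query}") as resp:
        data = json.load(resp)
    series = data[0]["timeseries"]  # one value per year, year_start first
    return {year_start + i: v for i, v in enumerate(series)}

if __name__ == "__main__":
    for year, freq in fetch_ngram("Cthulhu").items():
        if freq:  # print only the years where the phrase appears at all
            print(year, f"{freq:.3e}")

With smoothing turned off, any pre-Lovecraft hits around 1900 would show up as isolated nonzero years, which is usually the signature of OCR errors or bad publication dates in the metadata.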
