The Book Genome Project was created to identify, track, measure, and study the multitude of features that make up a book. Components such as language, character, and theme are mined and analyzed in order to sift, organize, categorize, and ultimately distinguish one book from another in a crowded and complex "bookosphere."
Begun by students at the University of Idaho, the project grew quickly. By 2008 the team included researchers and programmers from Stanford University, Florida State University, and Boise State University. Over time, partnerships were formed with commercial publishers, and the project expanded rapidly. In 2010 BookLamp and the Book Genome Project were self-sustaining and included collaborators from locations as diverse as New York, Idaho, California, and the United Kingdom.
The fundamental goal of the project remains the same as it was in 2003: Develop ways to intelligently understand the content and make-up of books and then apply that knowledge to the problems of book discovery and publishing.
Much as Pandora.com was created to provide a practical outlet for the Music Genome Project, we created BookLamp.org to allow readers and writers to use the tools that we've developed. It represents the public face of the project, so please check it out and let us know what you think.
The Book Genome is derived through the analysis and comparison of books, specifically the Language, Theme, and Characters of books. In order to survive in the marketplace, a book has to appeal to readers: a reader must find what happens in a book interesting (theme), care about whom it happens to (characters or actors), and have both aspects conveyed through a writing style that is palatable (language).
The key focus of the Book Genome Project is to use computer intelligence to extract and quantify, on a scene-by-scene basis, useful information about these key elements of books. Consequently, we created a "gene structure" for each of the three primary elements that we analyze.
The genomic analogy is imperfect but useful nevertheless: we defined the three elements of Language, Theme (also called Story), and Character as the literary equivalent of DNA and RNA classifications. Each gene category contains its own subset of measurements specific to its branch of the book genome structure.
Language DNA, for example, is made up of components that we call Pacing, Perspective, Description, Density, Motion, and Dialog, and each of these is an amalgamation of alleles which capture the expressions of different aspects of linguistic style.
Thematic DNA, sometimes referred to as StoryDNA, is directly connected to the thematic content of a book and is composed of more than 2,000 individual thematic ingredients. Each ingredient is measured, and each text is broken down and categorized in terms of its thematic "expressions." Some books express more Romance than Crime, others more Nature than Cities. Each "gene" of the StoryDNA is measured first relative to the other genes (or memes) in a given book and then in relation to the dominant themes of the entire corpus.
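To make the two-step comparison concrete, here is a minimal sketch of what "relative to the other genes in a book, then relative to the corpus" could look like. It is an illustration only, not our production method: the function names, the use of simple proportions, and the ratio against a corpus average are all assumptions chosen for clarity, and the numbers are invented.

```python
from typing import Dict


def within_book_shares(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Step 1 (assumed): express each theme as a share of the book's total."""
    total = sum(raw_scores.values())
    if total == 0:
        # A book with no measured themes gets a share of zero everywhere.
        return {theme: 0.0 for theme in raw_scores}
    return {theme: score / total for theme, score in raw_scores.items()}


def relative_to_corpus(
    shares: Dict[str, float], corpus_mean_shares: Dict[str, float]
) -> Dict[str, float]:
    """Step 2 (assumed): divide each share by the corpus-wide average share.

    A value above 1 means the book expresses the theme more strongly
    than a typical book in the corpus; below 1 means less strongly.
    Themes with no positive corpus average are skipped.
    """
    return {
        theme: share / corpus_mean_shares[theme]
        for theme, share in shares.items()
        if corpus_mean_shares.get(theme, 0.0) > 0
    }


# Hypothetical raw theme scores for one book and hypothetical corpus averages.
book = {"Romance": 30.0, "Crime": 10.0, "Nature": 60.0}
corpus = {"Romance": 0.25, "Crime": 0.25, "Nature": 0.5}

shares = within_book_shares(book)
profile = relative_to_corpus(shares, corpus)
```

In this made-up example the book expresses Romance and Nature somewhat more than the corpus average and Crime considerably less, which is the kind of contrast described above.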
Each individual book produces 32,162 genomic measurements, creating an ever-expanding database of hundreds of millions of data points that are used for classification and comparison.
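As a rough illustration of how two books' measurements might be compared, the sketch below treats each book as a list of numbers and computes the cosine similarity between them. This is a standard, generic technique that we use here purely as an example; it should not be read as a description of the comparison the project actually performs, and the short vectors are invented stand-ins for the 32,162 measurements of a real book.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length measurement vectors.

    Returns 0.0 if either vector is all zeros, since the angle is
    undefined in that case.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of measurements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, heavily shortened measurement vectors for two books.
book_a = [0.30, 0.10, 0.60, 0.00]
book_b = [0.25, 0.15, 0.55, 0.05]

similarity = cosine_similarity(book_a, book_b)
```

For non-negative measurements, a result near 1 suggests the two books have a similar mix of measurements, while a result near 0 suggests they have little in common.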
We created BookLamp.org as a place to demonstrate the practical applications of the Book Genome Project. Please feel free to continue to explore our work there.
Thank you very much for your interest, and feel free to contact us with any questions.
The Book Genome Team