Choosing what to digitize

If you want to digitize a book, first check to see if it's already been digitized. Currently, The Online Books Page only lists a fraction of the books now online, but we're happy to list any significant, complete English-language book in any subject. Check Google Books, the Internet Archive, and other likely archives and indexes. If you find that a book you're interested in has already been digitized, and we don't already list it, tell us about it so that we can add it to our listings.

Don't know what to work on? We know of some books that people are looking for, or have partly digitized, but that aren't fully digitized anywhere that we know of. If you digitize a book from our requests list (or find an already digitized copy), that will definitely help out someone who's been seeking the book. In general, books that don't tend to be collected by academic libraries, or that are otherwise rare or obscure, stand a decent chance of not being digitized yet.

Any text you choose must either not be copyrighted, or be approved for free online use by the copyright holder. In the United States, any work published before 1923 is no longer copyrighted, and many lesser-known books from 1923 to as late as 1963 (or even 1989, in some cases) are also out of copyright. (In other countries, copyright usually lasts at least 50 years after the author's death, but laws vary.) Note that revised texts, translations, and other derivative works can get a new copyright from the date of their creation. Check the copyright information (usually on the back of the title page) to see what copyrights are claimed. For more details on copyrights and permissions, see this page.

If there's a particular title you know you want to digitize, but you're having trouble finding a copy to work from, see this page for some suggestions on where to find copies.

Creating digital images

It's possible to just type a book straight into the computer. If you're doing this, see "Providing a transcription" below for what to do next. But most people first create images of the pages with a scanner or with a digital camera. Page images provide a facsimile of the original that can be used to correct errors, or to show elements like illustrations, ornamentation, or layout that might not be represented in a simple transcription.

You can digitize a book with a simple flatbed scanner, if the book and the scanner are durable enough. Flatbed scanners are available in many schools, libraries, and workplaces, and are sold in electronics stores. Many consumer-grade scanners nowadays are designed for photos and single pages, and are not built to have books lie on them. If you're considering buying a scanner for book digitization, check out its features and durability first. For best results with a flatbed, lay your book flat on the scanner, and close the top lid as much as possible. Then experiment with the brightness level until you find a level that captures all of the letters and few of the stray marks found in books.

Some scanners will take sheet feeders, which will work for books you don't mind cutting up. You cut off the binding, and then feed sheets into them one by one. Don't do this for rare or valuable books.

If you can't get a good flat scan on a flatbed model without damaging the book or the scanner, you can use a scanner that allows you to open the book partway. There's at least one consumer-grade scanner that scans right up to the edge of the top surface, allowing you to open a book 90 degrees, with the scanned page on top and the opposite page hanging down the side. Better yet are "cradle" scanners that hold a book open partway and take pictures of the left and right pages from an angle. These tend to be more expensive than consumer-grade scanners, but some libraries have these for their own scanning projects, and might let you use them (or scan your book themselves). If you like do-it-yourself construction projects, you can also build your own.

If you're trying to scan a book you can't easily bring to a scanner (such as a book in a rare book room) you might also get decent results with a high-megapixel digital camera.

Most scanners come with optical character recognition (OCR) software, enabling you to get a first version of a transcription of the book as well. The quality of the text generated by the OCR software will vary depending on the age and condition of the book, the settings, and the quality of the images, but under good conditions, modern OCR programs can do a very good job at recognizing text.

Once you have a book set up, it need not take very long to digitize a full book. Back in 1995, it took me about 3 hours to scan in all of E. Nesbit's Five Children and It using a flatbed Silverscan II with OmniPage Professional software. A modern scanning setup could probably digitize the book considerably faster.

Sharing the image set online

Once you have a full set of page images, you can put them online if you like. There are various ways you can do this-- you can post them on your own site, or someone else's, in various formats.

If you want to package up the page images in a file that can be easily downloaded and read, you may want to put them all into a PDF file. There are various PDF creation software packages that will do this for you. The resulting PDF files can sometimes be quite large, but can be read on most computers.

Before you post the image set, double-check that you have all the pages in the right order, and that they're all legible.
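Part of that check can be automated. The following is a minimal Python sketch, assuming your scans are saved with a sequence number in each filename (the page_0001.png naming scheme and the find_page_gaps helper are hypothetical, not something your scanner software necessarily produces). It reports missing or repeated sequence numbers; it cannot tell you whether a page is legible, so you still need to look through the images yourself.

```python
import re


def find_page_gaps(filenames):
    """Return (missing, duplicates) sequence numbers inferred from filenames.

    Assumption: the first run of digits in each filename is the scan's
    sequence number, e.g. 'page_0007.png'. Names without digits are ignored.
    """
    numbers = []
    for name in filenames:
        match = re.search(r"\d+", name)
        if match:
            numbers.append(int(match.group(0)))
    if not numbers:
        return [], []

    seen = set()
    duplicates = set()
    for n in numbers:
        if n in seen:
            duplicates.add(n)
        seen.add(n)

    # Any number between the lowest and highest that never appeared is a gap.
    missing = [n for n in range(min(seen), max(seen) + 1) if n not in seen]
    return missing, sorted(duplicates)


if __name__ == "__main__":
    scans = ["page_0001.png", "page_0002.png", "page_0004.png", "page_0004.png"]
    missing, duplicates = find_page_gaps(scans)
    print("Missing:", missing)
    print("Duplicated:", duplicates)
```

In practice you might pass in the sorted output of os.listdir() for your scan folder. A gap usually means a skipped page, and a duplicate usually means a page was scanned twice.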

If you'd like to have the book images live somewhere besides your own web site, there are various other sites that will host it. I recommend uploading to the Internet Archive text collection. They've been hosting books reliably for a long time, keep overhead to a minimum, and welcome user contributions. They might even provide your book in a variety of other formats besides the original PDF.

Many people find an online book transcription-- a file that encodes the actual words of the book, and not just the images of the book-- easier to deal with than a page-image form. If you'd like to produce a transcription, see below.

Clearing a book's copyright

There are now millions of books that have been digitized, but that are not freely readable online. That's often because the book's copyright status is uncertain. It's often difficult to determine whether a book is still under copyright, and if it is, who controls the rights.

But you might be able to research a book's copyright, and discover or verify that it's actually in the public domain, or can be used freely for some other reason. Or you might know who controls the rights to a book, and obtain permission for a free online copy. (Or you might control the rights to a book and give permission yourself.) If you can manage to "clear" the book's copyright in any of these ways, the book digitizer might make it available for all to read.

For information on how to clear a copyright, see our page How Can I Tell Whether a Book Can Go Online?

After you've cleared a copyright on a digitized book, you still need to convince the digitizer to open up access to it. They will probably want to be fairly sure that the copyright is in fact cleared, to avoid the risk of a copyright infringement suit, so make sure you keep good records of your research that you can show them. If all else fails, you can re-digitize the book yourself, but we hope that will not be necessary in most cases.

Sometimes you may find it easier to have the book made openly accessible on another site. As I write this, for instance, Google Books is very conservative about opening access to many 20th century books. However, many of their scans also get copied onto Hathi Trust and the Internet Archive. These organizations accept user feedback, and have been known to open access to titles shown to be in the public domain, or otherwise authorized for open access.

Providing a transcription

Preparing the transcription

As I mentioned above, a transcription of a book is a file that contains a record of the actual text of a book, and not just images of the pages. (Transcribed online books may also include embedded illustrations or other additional content, but they primarily contain directly encoded, searchable, and copyable text.) Online book transcriptions can be easier for many uses than online book page images.

The text of a book can be produced by OCR software (as noted above), or by typing from the book directly. Scanning and OCR is usually faster than typing for most people, though typing requires no special equipment other than a computer.

If the text includes characters beyond the usual set of unaccented English letters and the other characters usually found on an American keyboard, the more exotic characters will need to be encoded in some fashion. The most reliable way to do this is with Unicode, a character encoding scheme that handles every major language and script the world has ever seen. Some computers and operating systems handle Unicode automatically. Unfortunately, many do not, and either use a region-specific encoding, or only support a subset of Unicode.

If you're producing your text in HTML, XML, or a format like Epub that's based on these, it's possible to represent unusual Unicode characters with special entities using only standard ASCII characters that can be typed in from any keyboard. You'll see the Unicode codes in the underlying file, but the actual characters will display properly in a Web browser or an Epub reader.
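As an illustration of those entities, here is a minimal Python sketch that turns plain text into ASCII-only text suitable for pasting into an HTML file. The to_ascii_html function name and the sample sentence are made up for this example. It relies on Python's built-in "xmlcharrefreplace" error handler, which writes each non-ASCII character as a numeric character reference such as &#233;, and on html.escape to protect the characters &, <, and > that have special meaning in HTML.

```python
import html


def to_ascii_html(text):
    """Convert plain text to ASCII-only text that is safe to embed in HTML.

    Markup characters (&, <, >) are escaped first, then every non-ASCII
    character is replaced by a decimal numeric character reference.
    """
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "Brontë & the café"
    encoded = to_ascii_html(sample)
    print(encoded)  # Bront&#235; &amp; the caf&#233;

    # A browser (or html.unescape) turns the references back into characters.
    assert html.unescape(encoded) == sample
```

This sketch is meant for plain text only. If your file already contains HTML tags, skip the html.escape step, since it would escape the tags as well.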

If you need fonts to display the full range of Unicode characters, here are some pointers. (I'm told that recent editions of Windows and MacOS do have at least one font that covers most Unicode characters.)

Checking for accuracy

Errors can -- and inevitably do -- creep into a text, whether it's been OCR'd or typed in. So you'll want to proofread the transcription, or have someone else proofread it, before posting it.

When academics or professional publishers prepare a research-quality text, they usually have it proofread at least twice, by different people, each carefully comparing the transcription with the original source. If you're just planning on supplying the text informally to Internet readers, you don't have to be that rigorous. You should, however, go through the entire text at least once, with the original book (or at least the page scans) handy to check consistency. With scanned works, it may be sufficient just to read the electronic text through at a reasonable speed, checking the book whenever something looks strange and making corrections as needed. Also run the text through a spelling checker for good measure. Errors in a typed text are often less obvious than those in a scanned text, so you may want to be more careful to compare the two texts as you go along. (The proofreading process can be a pleasant opportunity to read or re-read the book yourself.)
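If you happen to have two versions of the same passage, such as raw OCR output and a typed or proofread copy, a program can point out where they disagree so that you know where to look in the book. The following is a rough Python sketch using the standard difflib module; the word_differences function and the sample sentences are invented for illustration, and it only finds disagreements, so it will not catch an error that both versions share.

```python
import difflib


def word_differences(first, second):
    """Return a list of (first_words, second_words) where the texts disagree.

    Both texts are split on whitespace, so line breaks and spacing
    differences are ignored and only differing words are reported.
    """
    a = first.split()
    b = second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return differences


if __name__ == "__main__":
    ocr_text = "It was the first tirne they had seen the sea"
    typed_text = "It was the first time they had seen the sea"
    for ocr_words, typed_words in word_differences(ocr_text, typed_text):
        print(repr(ocr_words), "vs", repr(typed_words))
```

Each reported pair is only a place to check against the original book or page scan; the program has no way of knowing which version, if either, is correct.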

Occasionally, you (or your spelling checker) will come across something that looks like an error in the original source text. We recommend being very cautious about correcting any "errors" in the original book. Writers through history use many spellings and idioms that are not familiar to modern American readers or spell-checking programs. Text, particularly dialogue, can intentionally involve non-standard usage or mechanics. For editions meant for research, many scholars prefer that no changes whatsoever be made in the electronic version of a text, or at least that any changes be explicitly noted. If you want your electronic text to be used for scholarly research, or for preservation, Marc Demarest's essay The Responsible Preparation of Electronic Literary Texts describes what many serious scholars look for in electronic versions of previously published books.

If you mean to prepare texts for a casual reader, you needn't be as picky. To us, corrections of obvious typographical or printing errors, or shifts in line breaks (particularly those that split a word) can be useful if done with care. There can also be good reasons to prepare an electronic version of a text that does not exactly match any previous print edition. Choose the policy that makes the most sense to you. In any case, it's a good idea to include some brief transcriber's notes at the start or end of the text, explaining what you've done and giving publication information on the source text(s) you used. What I've done for some of the texts I've prepared is to correct obvious typographical errors from the original source inline, but then note my corrections at the end of the file. That way, people who care about the exact form of the original source can see what I changed, and other people don't get their reading experience disrupted by having to wade through typos.
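One kind of line-break shift, rejoining words that were split across lines, can be sketched in Python as follows. This is only an illustration under simple assumptions: the join_split_words function is hypothetical, and it treats every hyphen at the end of a line as a split word, which is wrong for genuinely hyphenated words like "well-known". For that reason it also returns a list of every change it made, which you can review by hand and then use as the basis for your transcriber's notes.

```python
import re

# A word fragment, a hyphen at the end of the line, then the rest of the
# word plus any punctuation attached to it on the following line.
SPLIT_WORD = re.compile(r"(\w+)-\n(\w+)(\S*)[ \t]*\n?")


def join_split_words(text):
    """Rejoin words hyphenated across line breaks.

    Returns (new_text, changes), where changes is a list of
    (split_form, joined_form) pairs for review and for transcriber's notes.
    """
    changes = []

    def _join(match):
        head, tail, trailing = match.groups()
        changes.append((head + "-" + tail, head + tail))
        # Keep the whole word on the first line and break the line after it.
        return head + tail + trailing + "\n"

    return SPLIT_WORD.sub(_join, text), changes


if __name__ == "__main__":
    page = "It was a wonder-\nful day, some-\nthing new.\n"
    fixed, changes = join_split_words(page)
    print(fixed)
    for before, after in changes:
        print("Joined", before, "->", after)
```

Moving the whole word up to the first line keeps the remaining line breaks in place, which makes the result easier to compare against the page scans.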

If proofreading a whole book seems too onerous, you can work with others. For instance, the Distributed Proofreaders project lets you proofread individual pages of a previously scanned text. When all the pages have been proofread twice, the book gets posted to Project Gutenberg, a widely distributed etext collection that gets indexed on The Online Books Page. To get more information about Distributed Proofreaders, or to join the project, see their site.

Sharing your transcription online

When you post a book online, you'll want to provide it in a format that people can easily and conveniently use, and that's portable and long-lasting. If you post the book in Microsoft Word, it will only be readable by people who have a copy of Word (or Word Reader) from around the same time as your copy. (Proprietary formats like Microsoft's often become unreadable over time.)

The most common format on the Web, and the one most likely to endure in its basic form, is HTML. That's the format used by ordinary Web pages. Plain text is also generally readable (if not as reflowable or expressive as HTML). PDF can be read on most computers, can be produced by most modern word processing programs, allows detailed control over page layout, and is described by an open, if complicated, standard. (However, PDF designed for large pages might not be so readable on small displays.) Some academic projects use TEI for computer-assisted text analysis, but it's not widely used by the general public. The Epub format, which can be derived from HTML, may become a standard way to read books on portable reader devices, but it's not yet clear if the reading world will adopt it in a big way.

You can post book transcriptions in various places, just as with book images. One very well-known, long-lasting site for book transcriptions is Project Gutenberg; see their submission instructions if you'd like to have them host your book. Depending on the subject and nature of the book, it might also be of interest to various specialty archives. And you can also just place the book on your own website-- some people even post them on their blog. But it's nice to also have a copy somewhere else, in case your own site ever goes away.

Again, wherever you post the book, we'd be interested in hearing about it, assuming it meets our selection criteria, so we can point to it if we don't already list that title.

Those are the basics of putting books online. Please let us know if you have any questions, or if you have a book we should list.

(Portions of this guide were adapted from a guide written for the Celebration of Women Writers by Mary Mark Ockerbloom.)

Edited by John Mark Ockerbloom


