Scan - and - Share - 1.07 - Printer (How To Scan and Create Quality DJVU Books)
version 1.07
2008
Contents
1 Introduction
2 Scanning a book
2.1 Setting up IrfanView for scanning
2.2 Handwork while scanning
1 Introduction
This is a mini-tutorial about scanning books and making high-quality files.
This tutorial is intended for newbies who would like to make good-quality
electronic books but do not know where to start. There are many ways to get
good results by scanning; this text shows you one reasonably easy way. The
tutorial has step-by-step screenshots and assumes some familiarity with
Windows. You may need to download and install a few programs (see Appendix A).
We will mostly target the digitization of old books on science, mathematics,
and technology. For these books, converting the pages to text by OCR is
pointless because they contain too many equations, diagrams, graphs, etc. The
only solution is to scan and make images of all pages. Such books are almost
always printed purely in black/white, with perhaps very few pages having color
illustrations. For books of that kind, the highest quality of scanned e-books
is achieved if one uses 600dpi black/white images for most pages.[1] So you
need to scan either directly in 600dpi black/white, or in 300dpi greyscale and
then process the scans to make them into 600dpi black/white.[2] If the book
has a few pages with color illustrations, you will need to scan them
separately in 300dpi 24-bit color mode. The same applies to colorful book
covers that you may also want to scan.
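The scanning modes mentioned above differ a lot in the amount of raw data they produce. The following minimal sketch computes the uncompressed size of one page in each mode. The page size of 6 x 9 inches and the function name are assumptions chosen only for illustration.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Hypothetical page of 6 x 9 inches.
grey_300 = raw_page_bytes(6, 9, 300, 8)    # 300dpi greyscale, 8 bits per pixel
bw_600 = raw_page_bytes(6, 9, 600, 1)      # 600dpi black/white, 1 bit per pixel
color_300 = raw_page_bytes(6, 9, 300, 24)  # 300dpi 24-bit color

for label, size in [("300dpi greyscale", grey_300),
                    ("600dpi black/white", bw_600),
                    ("300dpi color", color_300)]:
    print(f"{label}: {size / 1_000_000:.2f} MB uncompressed")
```

Under these assumptions a 600dpi black/white page holds four times as many pixels as a 300dpi greyscale page, yet needs only half the raw storage, because each pixel takes one bit instead of eight.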
Please note:
A high-quality scanned e-book is small in size, looks good both on the
screen and in print, and has searchable text. There are many ways to achieve
high quality in scanned e-books; all of them involve a resolution of 600dpi.
Output files are in the DJVU[3] format and typically take about 5KB to 10KB
per page.
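To get a feeling for these numbers, the expected size of a finished book can be estimated from its page count. This is only a rough sketch based on the 5KB to 10KB per page quoted above; the 400-page book is a made-up example.

```python
def djvu_size_range_kb(pages, low_kb=5, high_kb=10):
    """Return the (smallest, largest) expected DJVU size in kilobytes."""
    return pages * low_kb, pages * high_kb

# Hypothetical book with 400 black/white pages.
low, high = djvu_size_range_kb(400)
print(f"about {low / 1024:.1f} to {high / 1024:.1f} MB")
```

Pages with color illustrations will usually take more space than this estimate suggests.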
You may of course experiment on your own with other programs. For example,
some people use Photoshop with special plugins, Book Restorer, Corel
PhotoPaint, RasterID, even Matlab and IDL for picture processing. This tutorial
presents a particular method that practically guarantees good results. If you
are a beginner, please make a few books by closely following the instructions
in this tutorial. You will then see that you can achieve quite a high level
of quality. If you develop your own methods, for example by using different
ScanKromsator options or different programs, you will be able to decide which
method is best because you can then compare the quality of the results with
the “reference” quality obtained by the methods in this tutorial.
One word of warning concerns using FineReader for scanning. Please do
not use FineReader for scanning and processing e-books! FineReader is a
good program for OCR only; it is not optimal for scanning or for processing
the scans with the goal of making a digital scanned e-book. FineReader
attempts to give you a kind of all-in-one solution for scanning and
processing e-books; resist the temptation to use just one program for
everything. You will not get good results with FineReader; in any case,
nowhere near as good as when you follow this tutorial. FineReader has the
following technical drawbacks:
1) It sometimes uses JPEG for image compression. This is not appropriate for
black/white texts!
2) It stores images internally as black/white 300dpi TIFFs and auto-rotates
them. Black/white 300dpi is adequate for OCR but not optimal for digital
scanned e-books. The auto-rotate algorithm is faulty and produces defects in
the image (“broken” lines). The auto-rotation is hard-coded into FineReader
7.x and 8.x and cannot be disabled.[4]
3) If you scan in 300dpi greyscale, which is the procedure recommended here,
FineReader will perform all operations at 300dpi rather than resample to
600dpi. ScanKromsator will first resample to 600dpi and then perform
processing.
For these reasons, the results of FineReader processing are always going to
be inferior.
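The effect of resampling before the conversion to black/white can be seen on a single row of pixels crossing the edge of a letter. The sketch below is an illustration only: it uses plain linear interpolation and a fixed threshold of 128, the grey values are made up, and it is not the algorithm actually used by ScanKromsator or FineReader.

```python
def upsample_2x(row):
    """Double the resolution of a row of grey values by linear interpolation."""
    out = []
    for i, value in enumerate(row):
        nxt = row[i + 1] if i + 1 < len(row) else value
        out.append(value)
        out.append((value + nxt) / 2)
    return out

def threshold(row, level=128):
    """Convert grey values to black/white: 1 is black, 0 is white."""
    return [1 if value < level else 0 for value in row]

def double_pixels(row):
    """Double the resolution by simply repeating every pixel."""
    return [value for value in row for _ in range(2)]

# Hypothetical grey values across the left edge of a letter (255 is white).
edge = [255, 255, 180, 60, 0]

resample_first = threshold(upsample_2x(edge))     # resample, then convert
threshold_first = double_pixels(threshold(edge))  # convert at low resolution

print(resample_first)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(threshold_first)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

When the greyscale row is resampled first, the black/white edge can fall between two of the original pixels. When the conversion is done at the lower resolution, the edge can only move in steps of a whole original pixel, which is one way to see why letters look rougher.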
2 Scanning a book
You pick up a thick volume. Maybe you think that only a maniac could scan
it, page after page. Yes, you are right! But you can become that kind of
maniac and scan books of any size without much discomfort if you organize
your work well.
[3] If you don’t know what DJVU is, please use Google or Wikipedia to read
about it. The DJVU format was specially developed for high-compression storage
of scanned images. The PDF format was intended for documents created in a word
processor, i.e. for vector documents rather than scanned documents. Scanned
e-books in PDF format occupy much more space and/or display more slowly than
in the DJVU format.
[4] An option to disable this auto-rotation was added only in FineReader
version 9. However, FineReader version 9 cannot be used (yet) to produce the
OCR layer in DJVU files.
Figure 1: Two images of the same page, one made by a digital camera, another
by a cheap flatbed scanner. The image made by a flatbed scanner was scanned
at 300dpi greyscale and upsampled to 600dpi black/white. You can guess
which image that is! We recommend that you always use a flatbed scanner
and scan at 300dpi greyscale or higher resolution.
First note: Please do not use a digital camera for scanning books! You will
never get good results, even with an expensive camera of 10 megapixels or
more. Use an ordinary flatbed scanner; even a cheap one is adequate. Look at
figure 1 and guess which of the two images of the same page was made by a
digital camera.
For scanning, you need any program that can work with the TWAIN scanner
driver.[5] It is convenient to have a program that can save the scanned image
of every page to the hard disk, numbering the files like p0001.tif, p0002.tif,
etc. For example, the image viewers ACDSee, IrfanView, and XnView can also
scan images. VueScan is another convenient scanning program, if it works with
your scanner.
Start IrfanView. In the File menu, press "Choose TWAIN Source". Choose the
scanner that you need to use.
Here you can choose how to number the scanned files, where to store them,
and in which format to save them. As shown, the files will be named page0001.tif,
page0002.tif, etc. You should select TIFF as the image format. (Do not use
JPEG as the output format!)
Click on Options to the right of the “Save as” field. This lets you set the
options for the TIFF format.
You should select “LZW” compression; this will cut the TIFF file size roughly
in half, compared with no compression (“None”).[6] If you later find that you
have compatibility problems with these TIFF files (i.e. you later use a
program that cannot open them), then you need to change the compression
method. Do not use the JPEG compression method for black/white text! JPEG
compression introduces digital artifacts, that is, odd-looking shadows around
each letter (see figure 2). It is pointless to use JPEG for black/white
images.[7]
[6] Note that a typical page scanned in greyscale will occupy between 2 and 4
megabytes on the hard disk with LZW compression.
Figure 2: Digital artifacts appearing due to JPEG compression of black/white
text. (In this example, the quality setting for the JPEG encoding was very
low, so these artifacts are apparent to the eye.) At left: greyscale image with
unnatural wavy-looking shadows around the letters. These “digital shadows”
are typical for JPEG compression of black/white images. At right: the same
image converted back to black/white, resulting in “digital noise”.
Now press OK and go to the TWAIN driver window for your scanner.
In the TWAIN window (or other configuration window if you are not using
TWAIN drivers), set the resolution to 300dpi and the color mode to greyscale.
These are the most important settings.
[7] The JPEG format actually cannot handle black/white images; when one
converts black/white images to JPEG, the software must first convert them into
greyscale images. The JPEG compression then introduces a certain quality loss,
as shown in the figure. This quality loss is acceptable for photographs but
may degrade black/white text quite significantly unless a high-quality JPEG
mode is selected. (The quality of JPEG compression is usually selectable as a
number from 1% to 100%. No visible artifacts would appear at 90% quality or
higher. But some programs, especially those for making PDF files or for
“optimizing” images, may not allow you to set the JPEG quality manually.)
• First you need to try scanning some place in the book and check that
everything works well. Take a book, open it at a place where the pages
are full of text, and put it face down (both pages) on the scanner glass.
• If necessary, press with your hand so that the crease is as close to the
glass as possible. (You can also use a weight, e.g. another heavy book on
top, but that is slower than pressing by hand.)
• Do a “preview scan.” Then you can see what has been scanned in the
preview window. If needed, you can turn the page 90 degrees so that
the text is upright. You can also adjust contrast, brightness, and gamma
correction if necessary. Your goal is clearly visible text.
• Select the scanning region by using the mouse. You should select the
scanning region such that some white space is left around the text.
• Click the “Scan” button and wait until the scanner finishes scanning the
page. This will get the scan of one page (or two pages at once, if you can
fit the book onto the scanner). The scanned file will be saved to the disk.
• Now that the scanning program is set up, you can scan all the pages
with the same settings. While the scanner lamp is moving back, turn
the next page and put the book back in the same place on the scanner.
Then press the mouse button to scan again. (The mouse can be left
pointing at the “Scan” button, so you don’t need to look. Alternatively,
some scanners have buttons on them that start the next scan.)
This technique allows you to scan the entire book, one page after another,
without looking at the computer screen or at the keyboard. You can watch TV
or whatever while you are scanning. Depending on the scanner speed, you
can get between 100 and 200 scans per hour. Some scanners are particularly
fast (e.g. Plustek OpticBook).
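The scanning rate quoted above translates into a time budget for a whole book. A minimal sketch follows; the 400-page book and the function name are assumptions for illustration, and the rates are simply the 100 and 200 scans per hour mentioned in the text.

```python
import math

def scanning_hours(pages, pages_per_scan, scans_per_hour):
    """Estimate the hours needed to scan a book."""
    scans = math.ceil(pages / pages_per_scan)
    return scans / scans_per_hour

# Hypothetical 400-page book scanned two pages at a time.
slow = scanning_hours(400, 2, 100)
fast = scanning_hours(400, 2, 200)
print(f"between {fast:.1f} and {slow:.1f} hours")
```

Scanning one page per scan doubles the number of scans, so the same book would take twice as long at the same rate.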
It is not necessary to set the book onto the scanner absolutely straight (edge
of the book parallel to the edge of the scanner). You should try to put it
reasonably straight, but it is unavoidable that pages will not all be scanned
completely straight; many pages will be slightly skewed. This small skew is
okay and will be corrected later (after scanning) by software. Correcting this
skew is called deskewing.
When scanning, you just need to avoid very large skews and “cut” pages,
i.e. pages where some of the text falls outside the scanning region. The
text near the book crease is often difficult to scan. You can try scanning
one page at a time (rather than two pages) or pressing slightly harder onto
the book binding. It is important that the text is directly next to the scanner
glass. Even 1 mm distance between the glass and the paper will make a very
fuzzy scanned image in almost all scanners!
It is faster to scan a book two pages per scan rather than one page at a time.
But not all books can be scanned that way; some books are too large or don’t
open sufficiently to be scanned two pages per scan. You need to try and decide
how to proceed. Regardless of how you scan, the processing software will be
able to cut the images into single pages.
The result at this stage is a directory full of TIFF files. These files are the
raw material that you will start processing after you finish scanning. Note
that you need sufficient disk space to store all those scans (at least 4MB per
scanned image!). After you finish scanning, use a slideshow mode of some
picture viewer to quickly preview the scanned images to make sure that you
didn’t miss any pages and that every page is adequately scanned. It will be
too late when you discover that some pages are upside-down or missing at the
final processing stage, especially when the book has already left your hands!
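A slideshow is still needed to catch upside-down or badly scanned pages, but gaps in the file numbering can be found mechanically. The following is a minimal sketch, assuming the scans are named like page0001.tif; the function name and its defaults are illustrative only.

```python
import re

def missing_numbers(filenames, prefix="page", ext=".tif"):
    """Return the numbers missing between the first and last numbered file."""
    pattern = re.compile(
        re.escape(prefix) + r"(\d+)" + re.escape(ext) + r"$", re.IGNORECASE
    )
    found = set()
    for name in filenames:
        match = pattern.match(name)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    return [n for n in range(min(found), max(found) + 1) if n not in found]

# Example with a hypothetical directory listing; in practice the list
# could come from os.listdir() on the folder that holds the scans.
names = ["page0001.tif", "page0002.tif", "page0004.tif", "notes.txt"]
print(missing_numbers(names))  # [3]
```

Note that this only finds gaps in the file numbering. A page that was skipped while turning the pages leaves no gap in the file names, so it can only be noticed by looking at the images.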
Note: When you scan the book, please do not omit the title pages, the front
matter (including any information about the publisher), the table of
contents, the index, the bibliography, empty pages, page numbers, or anything
else! You will not save much time if you decide not to scan some 20 pages or
so, but a science book is almost unusable without its bibliography and index
and without exact information about its publication. Also, do not think that
you will make your life easier from the legal point of view if you don’t scan
the publication information. Do try, however, to avoid scanning the library
stamps (just cover them with paper, or remove them with a digital image
editor after scanning). Nobody wants to see those library stamps in the
e-book.
In the example shown, a book was scanned with two pages per scan, and
apparently there was some skewing. Our task now is to split, to deskew, and
to cut the page images so that every page has the same size and margins. If
your scan is single-page, you will not need to split, but you will still need to
deskew and cut. This operation is called “kromsating” in the program.[10]
The first step is a draft processing run, i.e. preparation for the final processing
of the raw files.
Click the “Files” tab in the toolbar. You get a dialog where you can set the
output resolution (very important!) to 600dpi, the folder for storing the
output files (by default, the subdirectory out in the current directory), and
the way of numbering the output files (prefix, number of digits, starting
number, step). Note the format for compressing the output files: it is TIFF
G4 encoding, which is optimal for black/white TIFF images. This will be the
output format after processing.
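The four numbering settings (prefix, number of digits, starting number, step) determine the output file names. The sketch below shows how such a scheme typically works; the default values and the function name are assumptions for illustration, not the program's actual defaults.

```python
def output_names(count, prefix="p", digits=4, start=1, step=1, ext=".tif"):
    """Generate numbered file names with zero-padded numbers."""
    return [f"{prefix}{start + i * step:0{digits}d}{ext}" for i in range(count)]

print(output_names(3))                   # ['p0001.tif', 'p0002.tif', 'p0003.tif']
print(output_names(3, start=1, step=2))  # ['p0001.tif', 'p0003.tif', 'p0005.tif']
```

Zero-padded numbers keep the files in the correct page order when a file manager or another program sorts them by name.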
[10] The pseudoword “kromsate” is a mangled Russian word meaning “to cut in
pieces.” Within ScanKromsator, “kromsating” means splitting a two-page
scanned image into individual page images, and also cutting page images so
that the margins become even and equal on all pages.
To start the draft processing run, click the “Draft kromsate” button, which
bears a pictogram of scissors and is located to the left of the “Process”
button in the toolbar. You will get the dialog shown at right. In this dialog
you need to set tick marks on “Split pages” and “Safe top/bottom.” The field
“Kromsate”=All means that the options are applied to all the pages. If some
pages do not need to be split, you can select “Kromsate”=Current and unset
“Split pages” for these pages.
Press OK and wait 10-15 minutes until the “Draft kromsate” operation is
finished. You will get the following screen.
Note that there are now green tick marks in the page list (top left column),
meaning that these pages have been “draft kromsated” successfully. On each
page you will see blue lines across the page. These lines are the cutters
that determine how the page image will be cut and split. Note that the
program attempts to determine automatically where to cut the margins and
where to split a two-page image into single pages. In some cases the program
may make a mistake and cut too much or too little; in that case you will later
be able to adjust the position of the cutters by hand.
3.2 Set options
The next important step is to go through the processing options and prepare
for the main (not “draft”) run of ScanKromsator. The processing options are
set in the many different tabs in the toolbar (left middle column).
Please note: Each option can be set either to apply to all pages at once, or only
to the currently shown page. To apply an option to all pages, hold the Ctrl key
while clicking the option box with the mouse. In this way, you can set some
common options quickly for the entire task and then go to some problematic
page and select other options just for that page.
First click the “Page” tab. Here you can set processing options for cutting
the pages. The option “Split” means to split the two-page image into single
pages. “Deskew” will deskew each single page image separately. “Despeckle”
removes small dots. Sometimes “Deskew” makes pages significantly skewed; this
is usually due to some complicated illustrations. In that case, check “Art”
for these pages. You can set “Ortho” if the page needs to be rotated by 90
degrees. You can set these options separately for left and right (L and R)
pages.
Now click on the “Book” tab. Here you set options related to the size and
layout of the pages in the final book. “H.Gap” is the size of the margins. A
value of 200 is good for 600dpi (it corresponds to 1/3 inch). Page width and
height can be set to Auto. You can also align the pages differently (align to
center/align to top/align to bottom).
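The margin value is given in pixels, so its physical size depends on the output resolution. A minimal sketch of the conversion follows; the function names are illustrative.

```python
MM_PER_INCH = 25.4

def pixels_to_inches(pixels, dpi):
    """Convert a length in pixels to inches at the given resolution."""
    return pixels / dpi

def pixels_to_mm(pixels, dpi):
    """Convert a length in pixels to millimetres at the given resolution."""
    return pixels / dpi * MM_PER_INCH

def inches_to_pixels(inches, dpi):
    """Convert a length in inches to the nearest whole number of pixels."""
    return round(inches * dpi)

print(f"{pixels_to_inches(200, 600):.3f} inch")  # 0.333 inch
print(f"{pixels_to_mm(200, 600):.2f} mm")        # 8.47 mm
```

The same margin of 1/3 inch would be 100 pixels at 300dpi, which is why the value has to match the output resolution set in the “Files” tab.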
We already visited the “Files” tab at the “draft” stage. It is very important to
have 600dpi as the output resolution in the “Files” tab!
Now click on the “Options” tab. Set “Deskew method” = Auto (shear) and
“Resample filter” = Lanczos3. The setting “Despeckle”=Fine+Normal or Safe
switches on an “intelligent” despeckle method that avoids removing the dots
over i or j, for example. “Text sensitivity” controls the logic of the
auto-cutting. Low sensitivity might cut off the page numbers if they are too
far away from the text. You may need to adjust the sensitivity settings a
little, but in most cases they do not need to be adjusted.
You can skip the “Options 2” tab for now. Click on the “Convert” tab. Here
you set the threshold for converting greyscale images to black/white. Do not
forget to hold the Ctrl key (to set this for all pages) as you select
“Threshold”=MiddleDark. Experiment with other settings if you don’t like the
results.
Click the “Quality” tab; there you can further control the conversion to
black/white. This is a very important function! Set Enhance image, Blur=1,
and Sharpen=1. What is important is that the image will become smoother with
this setting. The values of Blur and Sharpen could be 2 instead of 1,
although the value 1 is usually good. A larger value will make the letters
blacker. You may need to experiment depending on the quality of printing in a
particular book.
Another important option is “Gray enhance.” Switch it on, since you have
greyscale scans (which is what you should have!).
You can use the File→Options... menu to write the options to a file. This will
save you all this work for the next time.
The last step before the main processing is a visual checking of the position
of the cutters. You need to go through every page and check that the cutters
are correctly positioned. Yes, this is a bit boring... but you can make it quick.
Put two fingers of the left hand onto the keys q and w; pressing these keys
will go to the previous/next page. With the right hand, you hold the mouse
and adjust the position of the cutters wherever needed. Sometimes there is a
skewed shadow, or it is necessary for some reason to set the cutter line at an
angle rather than vertically or horizontally. Hold the Shift key and drag the
cutter by its end to achieve this.
Now that everything is ready, you can begin the main run of ScanKromsator.
Press the large button that says “Process” and bears the icon of a book, in the
main toolbar at top:
The program will ask you to confirm that you really are sure you want to
change the resolution of the images. Confirm! The process will then start.
Now you need to wait a while. The upsampling operation can be quite slow;
in recent versions of ScanKromsator (5.8 and up) this operation was made
faster. You may expect to process 5 pages per minute or so. When everything
is finished, you should view the output files in the output folder. You should
check that all pages are cut and deskewed correctly. If some pages are not
processed correctly, you can repeat processing of just those pages with some
other options.
The main processing run may take some hours on a slow computer. It is not
necessary to process the entire book in one run. One can process only some
portion of the pages; then one needs to set Book→Page width→Fixed to the
size determined in the previous portion of the pages (so that all pages have
equal size at the end of processing). It is usually sufficient to take 10 to 15
pages for determining page size.
If you like, you can use the powerful cleaning features of ScanKromsator to
remove the “digital dirt” from some pages. Typically, “digital dirt” means
extraneous spots on the paper, pencil or pen marks, and library stamps. Of
course, you can also use any graphics editor to clean the images by hand.
Hopefully, there will not be many pages to clean.
toolbar:
Picture zones can also be polygon-shaped. This is useful, for example, if the
page was scanned with a large skew. Use the star-shaped tool button to mark
such zones:
To set the options for a picture zone, double-click on the selected region. You
will see the dialog “Picture zone properties.”
You need to set the color of the illustration. For example, if the page contains
a greyscale photograph (rather than a color photograph or color diagram), set
Color=Gray.
We cannot discuss other zone options here; as you see, there are many options
intended for advanced users. But note that after “kromsating” the picture
zones will be saved to separate files. So after the main processing run you
will have to merge them with the page files. This is done by using the menu
command Zones→Picture zone→Merge zones. The resulting page files will be
TIFF files in which the text is black/white but the picture zones have color.
a special set of options (or “custom profile”) for the DJVU encoding job. Run
the Document Express Configuration Manager, choose the profile “Bitonal
(600dpi)” from the list of profiles, click “Advanced settings”, and you will see
the following dialog.
Now choose the “Text” tab as shown above. In that tab, set “Pages per
dictionary = 1000” (if this consumes too much RAM on your computer, or if it
is too slow, set it to 200 or 300 instead). Save the custom profile under
a new name, say Bitonal-1. Do the same for the “Scanned (600dpi)” profile if
you need to encode books with color drawings.
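The effect of the “Pages per dictionary” value can be pictured as grouping the pages of the book. The sketch below assumes that consecutive pages are grouped and that each group shares one dictionary of letter shapes; this is a reading of the setting's name, not documented behaviour of Document Express, and the 450-page book is a made-up example.

```python
def dictionary_groups(total_pages, pages_per_dictionary):
    """Return (first_page, last_page) ranges, 1-based and inclusive."""
    groups = []
    for first in range(1, total_pages + 1, pages_per_dictionary):
        last = min(first + pages_per_dictionary - 1, total_pages)
        groups.append((first, last))
    return groups

# Hypothetical 450-page book.
print(dictionary_groups(450, 1000))  # [(1, 450)]
print(dictionary_groups(450, 200))   # [(1, 200), (201, 400), (401, 450)]
```

Under this assumption, a value of 1000 lets a typical book share a single dictionary, while a smaller value trades some compression for lower memory use during encoding.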
Now run the Document Express Workflow Manager. Load all the TIFF pages
into it. In the “Job name” field, write the name of the book if you want. Choose
the previously created custom profile in the list “Raster profile”.
Then click the “Output” tab (the tabs are at the bottom of the window). In
the list “Separate document(s)” choose “One document only.” Tick the box
under “Enable” at far left. Wait until the encoding is finished. You can also
look at the “Log” tab to watch the progress. That’s all; the DJVU file is created.
Do not delete the TIFF files yet! You may need to encode again if the DJVU
file has some error. Also, the TIFF files are useful for OCR purposes (see
section 6).
The result of DJVU encoding is a multipage DJVU file containing the entire
e-book. You should rename that file to something sensible, not just
math1.djvu. At the very least, the file name should contain the author’s
name, the title of the book, the publication year, and/or the ISBN if
available. This is just a little work, but it will be so much easier to share
that file on the Internet if its name is sensibly chosen.
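If you make many books, the file name can be assembled from the book's data in a uniform way. The following is a minimal sketch of one possible naming scheme; the order of the parts, the separator, and the example book are assumptions, not a convention prescribed by this tutorial.

```python
import re

def book_filename(author, title, year, isbn=None, ext=".djvu"):
    """Build a descriptive file name from the book's publication data."""
    parts = [author, title, str(year)]
    if isbn:
        parts.append("ISBN" + isbn)
    name = " - ".join(parts)
    # Replace characters that are not allowed in Windows file names.
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    # Collapse repeated whitespace.
    name = re.sub(r"\s+", " ", name).strip()
    return name + ext

# Hypothetical book, for illustration only.
print(book_filename("Smith J.", "Introduction to Tensor Analysis", 1965))
```

This prints `Smith J. - Introduction to Tensor Analysis - 1965.djvu`.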
Suppose you have already created the DJVU file out of some TIFF files.
Hopefully, you didn’t delete the TIFF files. Load the TIFF files into a new
batch in FineReader (keep in mind the problem with selecting many files at
once!). Set the recognition language and press “Read all”. When the OCR
process is finished, click “Save batch”. It is not recommended to edit the
OCR text. Previous versions of DjvuOCR could not process FineReader batches
if the OCR text was edited. The most recent version, DjvuOCR 2.2, can deal
with small edits. If you do edit, you should not rewrite large blocks of
text, i.e. you should keep many of the original symbols in their original
positions. You should also not delete the end-of-line symbols, so that the
number of lines in a paragraph
remains the same. But we recommend that you do not edit the OCR text at
all. After saving the FineReader batch, you can quit FineReader and run the
program DjvuOCR.
This program has several functions; for example, “DjVu Decoder” will produce
TIFF files out of DJVU in case you deleted your TIFF files, or if you are working
with somebody else’s DJVU file. For now, you will use only the “Manual mode
OCR manager.” Click that, and you get the following window.
Select the directory where the FineReader batch is located in the “FineReader
Project directory” field. “Output OCR text file” will be the name of the new file;
it doesn’t matter what that name is. Tick the “Burn DJVU file” box and select
the DJVU file below; it means that the OCR data will be inserted (“burned”)
into the DJVU file. Click “Process”, wait a few minutes, and that’s all. Now
the DJVU file is full-text searchable!
8 Adding hyperlinks and bookmarks
After finishing all the preceding work with the DJVU file (including OCR),
you can add some hyperlink navigation to it. There are two ways of adding
hyperlinks.
The first is to use the DjvuSolo or Djvu Editor programs and add hyperlinks by
hand. Usually, one adds hyperlinks to pages in the table of contents for easier
navigation. In DjvuSolo or Djvu Editor you can select any rectangular area on
any page and then insert a hyperlink to a different page of the DJVU file. The
user will go to this page when clicking anywhere in the area. Note that the
hyperlink will point to a page number, so adding hyperlinks has to be done
after any changes to the page order or after inserting any additional pages
into the DJVU file. So if you want you can sit and make some rectangular
areas into hyperlinks until you are blue in the face.
The second way to add hyperlinks is semi-automatic, using the program DJVU
Hyperlinks Editor.[14] Run the program and you will see the following window.
[14] This program has only a Russian-language interface.
First you need to specify options for the hyperlinks. Then you need to
specify the page range in which the table of contents is located in the
DJVU file. These are DJVU page numbers, which may be different from the
page numbers printed in the book and in the table of contents (e.g. because
some pages are taken by the cover and by the front matter). To compensate
for this, one usually needs to add a certain offset to the page number; for
instance, page 10 in the printed book may actually be page 11 in the DJVU
file because one page is taken by the cover.[15] Then you need to enter the
corresponding offset into the “offset” box. Now that all options are
entered, press the button that means “Add”. This will add a new DJVU file to
the list in the left panel; the current options will apply to that file. You
can now set different options and add a different file. Finally, press the
“create” button. This will insert the hyperlink information into all the
DJVU files.
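The offset is simply the difference between a page's position in the DJVU file and the number printed on it. The sketch below works through the example from the text; the function names and the table of contents entries are hypothetical.

```python
def find_offset(printed_page, djvu_page):
    """Offset to add to a printed page number to get the DJVU page number."""
    return djvu_page - printed_page

def to_djvu_page(printed_page, offset):
    """Convert a printed page number to a DJVU page number."""
    return printed_page + offset

# Printed page 10 is page 11 of the DJVU file, as in the example above.
offset = find_offset(10, 11)

# Hypothetical table of contents entries: (title, printed page number).
contents = [("Introduction", 2), ("Scanning a book", 3)]
links = [(title, to_djvu_page(page, offset)) for title, page in contents]
print(offset)  # 1
print(links)   # [('Introduction', 3), ('Scanning a book', 4)]
```

A single offset only works if the difference is the same throughout the book. If unnumbered plates are inserted in the middle, the later pages need a larger offset.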
Similarly, one can create hyperlinks in the subject index. One needs to select
A Where to download software
Index
color plates
deskewing
DJVU
  dictionary
  OCR layer
  rearrange pages
FineReader
  problems
illustrations
IrfanView
JPEG
  digital artifacts
  problems
kromsating
quality
Russian screenshots
ScanKromsator
  cutters
  draft run
  main run
  picture zones
scanning
  disk space
  greyscale
  with digital camera
TIFF
upsampling
using Linux