On Creating your own collected works

Kapil Hari Paranjape

October 15, 2004

Introduction

The following short document describes how research mathematicians can go about creating an online archive or collected works for themselves. We assume that the following resources are available:

(1) A Web server where the documents can be placed.
(2) A fully configured GNU/Linux or Unix(TM) system with TeX and an editor. For papers that have to be scanned (see below), SANE must also be installed. The ImageMagick and netpbm graphical conversion tools should also be available. More recently the djvulibre tools have also become available; these greatly improve the storage and transmission of scanned documents.
(3) Print copies of those papers which are not available in electronic format.
(4) A scanner with at least 100dpi resolution (if some papers fall under the previous category).
(5) Access to the internet!

Note that the tools described in (2) are available with most GNU/Linux distributions, such as Debian and Red Hat.

1 The Easy Part

If you already have your paper available as a PostScript(TM), .dvi, TeX or LaTeX file, then life is easy, as one can link this file directly from one’s list of publications. Even then, you may want to examine the following packages to make the viewer’s job easier.

pdftex: This is a program derived from TeX that converts documents directly into PDF (Adobe(TM)’s Portable Document Format). In addition, by using the hyperref package and LaTeX’s ability to cross-reference, the pdflatex program will produce a “clickable” PDF which is also cross-referenced on-line.
latex2html: This provides a nice way to convert LaTeX documents into HTML. It even has some rudimentary support for mathematics through the -math option. The older version produced a lot of images; the more recent versions are better. The command line option -info 0 is useful to turn off the “About this document...” links that the program often produces.
hevea: Yet another converter from LaTeX to HTML. This goes to the other extreme: it does not divide the file into sections and produces no images unless specifically asked to. A nice alternative to the previous program.
dvipdf: Converts DVI files into PDF. The result is not as good in some ways as that obtained by pdftex.
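As a rough illustration of how these tools might be invoked, the sketch below writes a tiny LaTeX file that loads hyperref and then runs whichever of the converters happen to be installed. The file name sample.tex and the helper names write_sample and convert_sample are made up for this example; only the -info 0 option is taken from the description above.

```shell
#!/bin/sh
# Sketch only: sample.tex, write_sample and convert_sample are
# hypothetical names chosen for this illustration.

# Write a minimal LaTeX file that uses hyperref and one cross-reference.
write_sample() {
    cat > "$1" <<'EOF'
\documentclass{article}
\usepackage{hyperref} % makes cross-references clickable in the PDF
\begin{document}
\section{Main result}\label{sec:main}
See Section~\ref{sec:main}.
\end{document}
EOF
}

# Run each converter on BASE.tex in the current directory, skipping
# the ones that are not installed.
convert_sample() {
    base=$1
    if command -v pdflatex >/dev/null 2>&1; then
        # Run twice so that the cross-references are resolved
        pdflatex "$base.tex" && pdflatex "$base.tex"
    fi
    if command -v latex2html >/dev/null 2>&1; then
        latex2html -info 0 "$base.tex"
    fi
    if command -v hevea >/dev/null 2>&1; then
        hevea "$base.tex"
    fi
    if command -v latex >/dev/null 2>&1 && command -v dvipdf >/dev/null 2>&1; then
        latex "$base.tex" && dvipdf "$base.dvi" "$base-dvipdf.pdf"
    fi
}
```

One would typically call write_sample sample.tex followed by convert_sample sample in a scratch directory, and then compare the results in a browser and a PDF viewer.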

Occasionally, you may have a file in Microsoft(TM) Word, ChiWriter or some other format. The simplest procedure is to use the original program to generate PostScript(TM). However, it is also possible to convert these documents to text by using programs such as antiword or even strings. After that you are only slightly better off than you would be at the end of Optical Character Recognition applied to scanned documents (end of next section).
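The text-extraction fallback could be sketched as follows. The helper name doc_to_text is hypothetical; it tries antiword first and then strings, as suggested above. The final tr step is an addition of this sketch (not something the text recommends) so that the function still produces output on systems where neither program is installed.

```shell
#!/bin/sh
# doc_to_text INPUT OUTPUT: hypothetical helper that extracts whatever
# plain text it can from a word-processor file.
doc_to_text() {
    in=$1
    out=$2
    # antiword understands Word files; accept its result only if it
    # succeeded and actually produced some text.
    if command -v antiword >/dev/null 2>&1 &&
       antiword "$in" > "$out" 2>/dev/null && [ -s "$out" ]; then
        return 0
    fi
    # strings prints the runs of printable characters found in any file.
    if command -v strings >/dev/null 2>&1 &&
       strings "$in" > "$out" 2>/dev/null && [ -s "$out" ]; then
        return 0
    fi
    # Last resort (an assumption of this sketch): delete every byte
    # that is neither printable nor a newline.
    tr -cd '[:print:]\n' < "$in" > "$out"
}
```

For example, doc_to_text paper.doc paper.txt (with paper.doc standing in for your own file) leaves a rough text version in paper.txt, which will still need to be marked up by hand.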

2 The Hard Part

We now deal with papers that are only available in print form. First of all you need to make sure that SANE is properly installed and configured. Run the command scanimage -T from a command prompt. This will perform a sequence of tests and should give “PASS” for all these tests. On the other hand you may not be so lucky:

If you get an error like “no such command” or “bad command”, you should ask your system manager to install SANE. It is available in pre-packaged form for most GNU/Linux distributions. Alternatively, http://www.mostang.com/sane is the primary location for SANE.
If you get an error like “No Scanners found”, you should ask your system manager to provide access to the scanner. Usually, this is achieved automatically by the default configurations for most GNU/Linux distributions. On some systems this may mean that you have to type scanimage -d |devname| instead of scanimage, where |devname| is the device name provided by your system manager. Instructions can also be found at the web location above.
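The two failure cases above can be told apart with a small check like the following. This is only a sketch: check_scanner is a made-up helper name, and it assumes that scanimage -L (the option that lists the devices SANE can see) prints one line starting with the word “device” per scanner.

```shell
#!/bin/sh
# check_scanner: hypothetical helper that reports which of the two
# problems described above (if any) applies.
check_scanner() {
    if ! command -v scanimage >/dev/null 2>&1; then
        echo "scanimage not found: ask your system manager to install SANE"
        return 1
    fi
    # Keep only the lines that describe a device
    devices=$(scanimage -L 2>/dev/null | grep '^device')
    if [ -z "$devices" ]; then
        echo "no scanner visible: ask your system manager for access to it"
        return 2
    fi
    # Each line names a device that can be passed to scanimage -d
    echo "$devices"
}
```

If a device is listed, its name can be passed to scanimage -d together with -T to repeat the self-test on that particular scanner.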

Note that some people may suggest that you work with xscanimage at this point, but it is very inconvenient for what you want to do; the command line is quicker.

Next you need to find out the precise way in which to scan the paper so that one corner of the paper is at (0,0). For a “Flat Bed” scanner this is usually one of the corners of the glass “bed”. Measure the height h and width w of the paper to be scanned in millimeters. Now enter the command

scanimage --auto-threshold --mode lineart --resolution 50 \
-l 0 -t 0 -x w -y h > /tmp/test.pbm

replacing w and h by the values you measured. Now use your viewer with the command display /tmp/test.pbm to view /tmp/test.pbm. If you placed the top and left corner correctly, you should obtain the page with the correct orientation. With some minor experimentation you should be able to get this right. Don’t forget to remove the file /tmp/test.pbm when finished.
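A quick sanity check on the test scan is to compare its size in pixels with what the measurements predict: a length of m millimeters scanned at r dpi should come out at roughly m / 25.4 * r pixels. The sketch below does this arithmetic for an A4 sheet (210 by 297 millimeters, used here purely as an example) at the 50dpi of the test scan; expected_pixels is a hypothetical helper name.

```shell
#!/bin/sh
# expected_pixels WIDTH_MM HEIGHT_MM DPI: hypothetical helper that
# prints the approximate width and height of the scan in pixels.
expected_pixels() {
    awk -v w="$1" -v h="$2" -v dpi="$3" 'BEGIN {
        # 25.4 millimeters per inch; adding 0.5 makes %d round
        printf "%d %d\n", w / 25.4 * dpi + 0.5, h / 25.4 * dpi + 0.5
    }'
}

expected_pixels 210 297 50   # prints: 413 585
```

The figures can be compared with the size reported by ImageMagick’s identify /tmp/test.pbm. Expect approximate agreement only, since a scanner may round the dimensions slightly.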

We are now set to scan the pages one by one in a simple fashion that is only possible with command-line invocations. The following script was the originally suggested solution (before the availability of djvulibre); a better solution is outlined below. This script will automatically number your pages and convert them to a browser-friendly format; all you need to do is feed a new page when prompted and interrupt with a “Control-C” when done. Run this command with w and h as the command line parameters.

#!/bin/sh
# Usage: give the page width and height (in millimeters) as parameters
if [ $# -lt 2 ]
then
  echo Give Width and Height as parameters
  exit 1
fi
WD=$1
HT=$2
# This directory should not exist!
if [ -d /tmp/paperscan ]
then
  echo Someone has already created /tmp/paperscan
  echo Edit the script and change the directory name
  exit 1
fi
mkdir /tmp/paperscan
cd /tmp/paperscan || exit 1
i=1
while true
do
  echo Feed the next page into the scanner and press Enter, or Ctrl-C to exit
  read dummy
  # Scan one page as a black-and-white bitmap
  scanimage --auto-threshold --mode lineart --resolution 100 \
      -l 0 -t 0 -x $WD -y $HT > page.pbm
  # Full-size page for viewing
  convert -mono pbm:page.pbm png:page$i.png
  # Thumbnail, reduced by a factor of 5
  pbmreduce 5 page.pbm | convert -mono pbm:- png:thumbpage$i.png
  rm page.pbm
  i=`expr $i + 1`
done

To make these pages available via your web server move them to a separate directory under your home-page directory; don’t forget to clear out the directory /tmp/paperscan! Under your home-page you can create the HTML files as per the following templates:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>My fundamental paper</title>
</head>

<body>

<br>Each of the thumbnails below is a scan made for screen based
viewing of the paper on "My fundamental equation". Unfortunately
this is for graphical mode viewing only.
<hr>
<A href="node1.html"><img src="thumbpage1.png" alt="Page 1"></A>
<A href="node2.html"><img src="thumbpage2.png" alt="Page 2"></A>
..... and so on
<hr>

</body>

</html>

A template for index.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>My fundamental equation</TITLE>
<LINK REL="next" HREF="node3.html">
<LINK REL="previous" HREF="node1.html">
<LINK REL="up" HREF="index.html">
</HEAD>

<BODY >
<hr>
<img src="page2.png" alt="Page 2">
<hr>
</BODY>
</HTML>

A template for node2.html
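Writing one node file per page by hand quickly becomes tedious, so the two templates can be filled in by a loop. The following is a minimal sketch under the assumption that the pages are named page1.png, page2.png, ... with matching thumbpage files, as produced by the script above; make_pages is a hypothetical helper name, and the generated markup is a slightly condensed form of the templates.

```shell
#!/bin/sh
# make_pages N TITLE: hypothetical helper that writes index.html and
# node1.html ... nodeN.html into the current directory.
make_pages() {
    n=$1
    title=$2

    # index.html: one linked thumbnail per page
    {
        echo '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">'
        echo "<html><head><title>$title</title></head>"
        echo '<body>'
        echo '<hr>'
        i=1
        while [ "$i" -le "$n" ]; do
            echo "<A href=\"node$i.html\"><img src=\"thumbpage$i.png\" alt=\"Page $i\"></A>"
            i=$((i + 1))
        done
        echo '<hr>'
        echo '</body></html>'
    } > index.html

    # nodeK.html: the full-size scan of page K with navigation links
    i=1
    while [ "$i" -le "$n" ]; do
        {
            echo '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">'
            echo "<HTML><HEAD><TITLE>$title</TITLE>"
            # The last page has no successor, the first no predecessor
            if [ "$i" -lt "$n" ]; then
                echo "<LINK REL=\"next\" HREF=\"node$((i + 1)).html\">"
            fi
            if [ "$i" -gt 1 ]; then
                echo "<LINK REL=\"previous\" HREF=\"node$((i - 1)).html\">"
            fi
            echo '<LINK REL="up" HREF="index.html">'
            echo '</HEAD><BODY>'
            echo '<hr>'
            echo "<img src=\"page$i.png\" alt=\"Page $i\">"
            echo '<hr>'
            echo '</BODY></HTML>'
        } > "node$i.html"
        i=$((i + 1))
    done
}
```

Run from the directory holding the images, make_pages 14 "My fundamental equation" would create the index and fourteen node files for a 14-page paper.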

If you wish to additionally convert to PDF format you can employ the command convert -mono -adjoin -page A4 -cache 32 page*.png doc.pdf. Note that this requires quite a lot of spare room in the /tmp directory and a lot of memory (if you have less RAM you can experiment with -cache 16). The process is rather slow and produces large-ish PDF files (one 14 page document came out as a 2.5MB PDF file).
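One point to watch with the pattern page*.png is that the shell expands it in alphabetical order, so for documents of ten or more pages page10.png is placed before page2.png. A possible workaround, sketched below with a hypothetical helper called ordered_pages, is to build the list of pages in numeric order and pass that to convert instead.

```shell
#!/bin/sh
# ordered_pages: hypothetical helper that prints page1.png page2.png ...
# in numeric order, stopping at the first number with no file.
ordered_pages() {
    i=1
    list=""
    while [ -f "page$i.png" ]; do
        list="$list page$i.png"
        i=$((i + 1))
    done
    # Unquoted on purpose: this drops the leading space
    echo $list
}

# Usage, with the convert command from the text:
#   convert -mono -adjoin -page A4 -cache 32 $(ordered_pages) doc.pdf
```

This relies on the pages being numbered consecutively from 1, which is how the scanning script names them.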

2.1 DjVu-based solution

We can convert documents scanned at 300dpi directly into DjVu documents, which also have thumbnails! As above, we scan the pages one by one in a simple fashion that is only possible with command-line invocations. The following script is a better solution than the one outlined above. This script will automatically convert your pages into a browser-friendly format called DjVu; all you need to do is feed a new page when prompted and answer “n” when done. Run this command with w and h as the command line parameters.

#!/bin/sh
# Usage: give the page width and height (in millimeters) as parameters
if [ $# -lt 2 ]
then
  echo Give Width and Height as parameters
  exit 1
fi
WD=$1
HT=$2
# This directory should not exist!
if [ -d /tmp/paperscan ]
then
  echo Someone has already created /tmp/paperscan
  echo Edit the script and change the directory name
  exit 1
fi
mkdir /tmp/paperscan
cd /tmp/paperscan || exit 1
i=1; ans=y; pagelist=""
while [ "$ans" = "y" ]
do
  echo Scanning page $i
  # Scan one page and compress it straight into a DjVu page
  scanimage --auto-threshold --mode lineart --resolution 300 \
      -l 0 -t 0 -x $WD -y $HT | \
      cjb2 -dpi 300 -clean -loose - page$i.djvu
  pagelist="$pagelist page$i.djvu"
  i=`expr $i + 1`
  echo Feed the next page into the scanner before answering
  printf "Another page? (Y/n) "; read ans
  # An empty answer or anything starting with y or Y means yes
  case "$ans" in
    ""|y*|Y*) ans=y ;;
  esac
done
# Now convert the pages into a bundled document
djvm -c bundle.djvu $pagelist
# We could stop here but it is probably a good
# idea to create an unbundled document as well
djvused -e 'save-indirect index.djvu' bundle.djvu

Once this has completed, you can save the document in a separate directory under your home page and offer the file index.djvu as the index file for this directory. The file bundle.djvu can also be offered as an all-in-one document.

At some stage we will become more ambitious and adventurous and examine using Optical Character Recognition (OCR) programs to convert the scanned document to LaTeX!

3 Some examples

Chapter 1 of my thesis, “On the Canonical Ring of a curve”, and Chapter 2 of my thesis, “The Kuga-Satake correspondence”. The original was printed using ChiWriter on a dot-matrix printer. This copy was scanned at 200dpi and reduced as above (the pbmreduce option -value 0.75 was used to touch it up a bit).

The same are also available as DjVu documents “On the Canonical Ring ...” and “The Kuga-Satake correspondence”.