01.
What is Purging?
Part of the process of
converting paper to images requires an evaluation of
the material involved. Some file folders, documents,
or other repositories contain extraneous material, duplicates,
notes, and other information that need not be scanned.
In these instances, you must decide whether it is more
cost effective to purge files before scanning, or to
scan everything and purge extraneous images. In some
instances, purging requires the advice of personnel
with a knowledge of the documents being scanned; e.g.,
purging drafts or other old versions of documents. We
call this subjective purging. In other instances, purging
can be done by persons without such knowledge; e.g.,
purging all handwritten notes, Post-It notes, etc. We
call this objective purging.
We do not recommend subjective paper purging; instead,
we recommend that subjective purging be done with images
(subjective image purging). Subjective image purging means
that a cost is incurred in scanning images that are later
discarded; however, working from images makes the purge
process itself much more efficient and inexpensive. In other
words, scanning and throwing away images is less expensive
than purging paper files.
We recommend objective paper purging. Since objective
purging specifies exactly what is to be thrown away,
it can be done quickly and easily when paper is being
prepared for scanning.

02. How does
Organizing help in scanning?
Like pieces of paper,
images typically are grouped into documents. Accordingly,
the beginning and end of each paper document must be
clearly defined to maintain its integrity after conversion.
We do this with document separator pages.
Document separator pages are inserted between documents
during the document preparation phase. They typically
have a bar code or "patch code" printed on
them. This code tells our software
where one document ends and another begins.
In some cases, we encode indexing information in the
bar code on the separator page. This can be a file number,
a client name, a date, etc. When job constraints allow
us to use this technique, we can read the bar code and
use the information to automatically populate the index
fields associated with that document without human intervention.
This can greatly reduce the cost of indexing documents.
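As a rough illustration of this technique, the sketch below splits a decoded
separator-page bar code into index fields. The "KEY=value;KEY=value" payload
layout, the field names and the function name are assumptions made for the
example; they are not a description of our production format.

```python
def parse_separator_payload(payload: str) -> dict[str, str]:
    """Split a decoded separator-page bar code into index fields.

    Assumes a hypothetical 'KEY=value;KEY=value' payload layout.
    """
    fields = {}
    for part in payload.split(";"):
        if not part.strip():
            continue  # tolerate a trailing delimiter
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed field: {part!r}")
        fields[key.strip().lower()] = value.strip()
    return fields


# Hypothetical payload read from one separator page.
index_fields = parse_separator_payload("FILE=10042;CLIENT=Acme Corp;DATE=1999-03-15")
print(index_fields)
```

Every image that follows the separator page, up to the next separator, would
then inherit these index values.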

03. What do
you mean by Paper Preparation?
A certain amount of physical
preparation is required to prepare paper to be scanned.
For instance, if a job is to be undertaken using high-speed
autofeeders, binders like staples, brads, and paperclips
must be removed, as should Post-It notes and other
attachments. Depending on the job, it may be necessary
to rebind documents after scanning.
Physical preparation also typically requires that paper
be "jogged" so that all leading edges are
aligned and ready to be fed into the scanner. These
simple but necessary steps help eliminate scanner jams
and double-feeds.
Paper size and weight also should be considered. Many
scanner auto-feeders cannot handle mixed widths and
weights and have specific width and weight limitations
even when the paper is uniform. Accordingly, an appropriate
scanner and feeder must be selected to match the condition
of the purged, organized and physically prepared paper.
In some cases, flatbed scanners may be required.
Finally, consideration should be given to "batching"
documents. Batching helps to both control and improve
the efficiency of the conversion process. Batching provides
a convenient way to audit the process by matching scanned
batches of paper with corresponding batches of images.
Batching also can be used for other quality assurance
checks, such as batch scan count comparisons (paper
batch count compared to image batch count), batch tracking
through the conversion process and batch log files with
all information on images captured and indexed during
the conversion process.
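A batch scan count comparison can be sketched in a few lines. The example
below assumes one image per counted page (duplex or blank-page rules would
change the expected numbers), and the batch IDs and counts are invented for
illustration.

```python
def compare_batch_counts(paper_counts, image_counts):
    """Return batches whose image count differs from the paper count.

    Both arguments map a batch ID to a page (or image) count. A batch
    missing from either side is reported with None for that side.
    """
    mismatches = {}
    for batch_id in sorted(set(paper_counts) | set(image_counts)):
        paper = paper_counts.get(batch_id)
        images = image_counts.get(batch_id)
        if paper != images:
            mismatches[batch_id] = (paper, images)
    return mismatches


# Hypothetical counts: batch B002 lost a page, batch B003 was never scanned.
paper = {"B001": 250, "B002": 198, "B003": 75}
images = {"B001": 250, "B002": 197}
print(compare_batch_counts(paper, images))
```

Only the batches returned by the comparison need to be pulled and rechecked
by an operator.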

04. Image
Quality
Document scanners typically
produce a black and white (bitonal) image. Grayscale
scanning, while technologically simple, generally is
not employed because file sizes are orders of magnitude
larger than bitonal file sizes.
The initial question on document scanning is whether
a bitonal rendering of your paper will be satisfactory.
The answer is a simple "yes" where the paper
is white and the text, black. However, more careful
consideration must be given to documents like invoices,
where gray backgrounds or red or green colored boxes
are common; photos, where bitonal renditions offer significantly
less detail; or documents containing both photos and text.
Most image quality issues depend on the scanner selected
to do the job. Some scanners render black text on white
paper flawlessly, but do a very poor job where colors
or grays are part of the requirement. Other scanners
handle very difficult gray and color requirements nicely
using a process called dynamic thresholding. Still other
scanners allow two or more images to be captured from
a single piece of paper, with one image capturing the
whole page at a low resolution, and another image capturing
just a portion of the page at much higher resolution.
The choice of the scanner is critical to the issue of
efficiently producing high quality images of the particular
paper to be scanned.

05. What do
you mean by Paper Handling?
Paper handling also must
be considered. Different scanners use different paper
transports. Some use belts; others use ball bearings or
rollers. Auto-feeders also use various paper handling
techniques and have different limitations. How a scanner
handles paper has a direct impact on its suitability
for a particular project. For instance, many scanners
cannot handle onion skin, card stock or batches containing
mixed paper widths and weights. Others can't handle
small paper or paper wider than 8.5 inches.
Some scanners support manual feeding in a way that is
much faster than others. This is an important consideration
if you know that the paper to be scanned cannot be handled
by auto-feeders.
Like image quality, the key to fast, efficient paper
handling depends on selection of the right scanner.

06. Explain
Image Resolution
Image resolution determines
the number of pixels, or dots, per linear inch. Popular
resolutions include 200 dpi, 240 dpi, and 300 dpi, though
400, 600 and even 1200 dpi are also used.
A 300 dpi resolution renders 300 dots per inch. Higher
resolutions generally improve image readability, though
there are issues that must be taken into consideration.
For instance, many monitors display images at 72 dpi,
regardless of the resolution at which they were scanned.
Even high-resolution monitors display at only around
200 dpi. In both instances, you must "zoom in"
on the image to view it at the resolution actually available.
Similar considerations relate to printing. A 200 dpi
image will print the same as a high-resolution 600 dpi
image on a printer that only prints at 200 dpi. Scale-to-gray
technology makes this subject even more confusing. It
renders images more readable by using techniques like
dithering. Dithering extrapolates from available resolution
information to make images much more readable, with fewer
jagged lines and more complete characters.
Image resolution also significantly affects image file
sizes. For instance, the file size of a TIFF Group 4
compressed image at 200 dpi might be 60 KB,
while at 300 dpi, it might be 90 KB.
We make our resolution recommendations based on the
level of detail required to capture all needed substantive
data from the paper, while minimizing file sizes.
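The arithmetic behind resolution and size can be sketched as follows. The
function below is illustrative only: it computes pixel dimensions and the
uncompressed size of a bitonal page (one bit per pixel, rows padded to whole
bytes). It does not predict compressed sizes, which depend on page content
and the compression scheme.

```python
def bitonal_page_size(width_in, height_in, dpi):
    """Return (width_px, height_px, uncompressed_bytes) for a 1-bit page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8  # each row is padded to a whole byte
    return width_px, height_px, row_bytes * height_px


# Compare a letter-size page at three of the resolutions mentioned above.
for dpi in (200, 300, 600):
    w, h, size = bitonal_page_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} pixels, {size} bytes uncompressed")
```

Because both dimensions grow with resolution, the raw pixel count grows with
the square of the dpi value, even though compressed sizes usually grow more
slowly.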

07. What is
Image Deskew?
Two skew issues are involved
in document imaging: paper skew and print skew.
Paper skew relates to the relationship between the paper
and the scanner camera as the paper is scanned. If the
paper is skewed, the image is skewed. Paper skew typically
is introduced by scanner auto-feeders and, to a lesser
extent, their internal paper transports. Some scanners
control paper skew well. Others do not.
Print skew relates to how the print actually was deposited
on the paper; that is, the relationship between the print
and the paper on which it is printed.
Photocopied and faxed documents tend to have skewed
print.
Paper and print skew affect document images in two ways.
Skewed image text is less legible and is not processed
well by OCR engines.
The solution is to electronically deskew the image by
re-orienting image pixels along a corrected x/y axis.
This technology usually is very effective, though in
a small number of cases, it can introduce unacceptable
distortion.
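The coordinate math behind deskewing can be sketched as a plain rotation.
This is a simplified illustration, not a description of any particular
deskew product: it assumes the skew angle is already known (detecting it is
a separate problem), it treats a positive angle as a rotation from the x
axis toward the y axis, and it moves individual points rather than
resampling a whole raster.

```python
import math


def deskew_points(points, skew_degrees, center=(0.0, 0.0)):
    """Rotate (x, y) points by -skew_degrees about center."""
    theta = math.radians(-skew_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    corrected = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        corrected.append((cx + dx * cos_t - dy * sin_t,
                          cy + dx * sin_t + dy * cos_t))
    return corrected


# A text baseline skewed by 5 degrees is rotated back onto the x axis.
skew = 5.0
baseline = [(r * math.cos(math.radians(skew)), r * math.sin(math.radians(skew)))
            for r in (0, 50, 100)]
for x, y in deskew_points(baseline, skew):
    print(round(x, 3), round(y, 3))
```

Rotated coordinates rarely land exactly on the pixel grid, so a full
implementation has to round or interpolate. That step is where the
occasional distortion mentioned above can come from.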

08. What is
Image Border Cropping?
Depending on the scanner
and scanner control software employed, it may not be
possible to exactly and automatically match the size
of the captured image to that of the paper scanned.
For instance, a software solution that requires you
to manually define the image size to match the size
of the 8.5 by 11 inch paper you expect to scan will
capture an 8.5 by 11 image even if some 4 by 6 inch
cards are mixed in. In these cases, an ugly black area
will surround images.
Image cropping removes extraneous black borders--either
by requiring a human operator to manually define the
area to be cropped, or by employing sophisticated algorithms
to evaluate the image and automatically crop borders.
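A very simple form of automatic cropping can be sketched as follows. The
example assumes a bitonal image stored as a list of rows (1 = black,
0 = white) and assumes the border is solid black; real scans have noisy
borders and need more tolerant rules than this.

```python
def crop_black_border(image):
    """Crop away rows and columns that are entirely black.

    image is a list of equal-length rows, with 1 for black and 0 for white.
    Returns the smallest rectangle containing every white pixel.
    """
    rows = [y for y, row in enumerate(image) if 0 in row]
    cols = [x for x in range(len(image[0])) if any(row[x] == 0 for row in image)]
    if not rows or not cols:
        return []  # the whole image is black; nothing to keep
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# A small card (white area with one black text pixel) on a black background.
scan = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
print(crop_black_border(scan))
```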

09. What is
Noise Removal?
Scanners often interpret
minor paper imperfections or extraneous dots on paper
as small groups of black pixels called background noise.
Carbon forms are excellent examples of paper with significant
amounts of background noise.
Background noise makes bitonal images less legible
and image file compression schemes much less efficient.
Noise removal algorithms examine an image, identify
likely black pixels constituting background noise and
convert them to white pixels. The result is a much more
legible image and a much smaller compressed file size.
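One simple noise removal rule can be sketched like this. It assumes a bitonal
image stored as a list of rows (1 = black, 0 = white) and removes only black
pixels that have no black neighbor. Production algorithms use more
sophisticated rules, such as size thresholds on connected groups, so treat
this as an illustration only.

```python
def remove_isolated_pixels(image):
    """Turn black pixels with no black 8-neighbors into white pixels."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]  # work on a copy
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_black_neighbor = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbor:
                cleaned[y][x] = 0
    return cleaned


# A two-pixel stroke is kept; the lone speck in the corner is removed.
noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
for row in remove_isolated_pixels(noisy):
    print(row)
```

A rule this aggressive would also erase legitimate single-pixel marks such
as small periods at low resolution, which is why thresholds need to be tuned
against sample documents.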

10. What is
Background Removal?
Documents can contain
vertical lines, horizontal lines and background shading
that represent no substantive data. In these cases,
it can be desirable to remove them, since doing so can
make the image more legible and dramatically reduce
file size.
Background removal algorithms are available for this
purpose. Care must be taken in using them, however,
since it is not always easy to predict when a vertical
or horizontal line might in fact be critical in conveying
the data represented in an image. We recommend the technology
only where all images to be processed using background
removal techniques have been tested and the results
evaluated.

11. How does
Annapolis Technologies implement Process Quality Control?
Whenever possible within
a conversion process, Annapolis Technologies uses technology
to replace typically labor intensive quality control
processes.
When bar code separation sheets are used in processing
documents, we include a quality control process to verify
that each bar code has been read by the system. Since
we produced the bar codes, we know which bar codes we
should find in each batch. Even the best bar code readers
on the market will miss a small percentage of bar codes
just as the bar code reader at the grocery store will
miss some. The Annapolis Technologies quality process
is to compare the list of expected bar codes with the
list of captured bar codes. Human intervention is only
required if the two lists do not match.
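The list comparison can be sketched as a pair of set differences. The
function name and the bar code values below are invented for illustration;
the sketch shows only the general idea, not our production system.

```python
def reconcile_bar_codes(expected, captured):
    """Compare the bar codes we printed with the bar codes the system read."""
    expected_set, captured_set = set(expected), set(captured)
    return {
        "missing": sorted(expected_set - captured_set),
        "unexpected": sorted(captured_set - expected_set),
    }


# Hypothetical batch: one separator sheet was not read.
expected = ["DOC-0001", "DOC-0002", "DOC-0003"]
captured = ["DOC-0001", "DOC-0003"]
result = reconcile_bar_codes(expected, captured)
print(result)

# An operator is needed only when either list is non-empty.
needs_review = bool(result["missing"] or result["unexpected"])
```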
Annapolis Technologies uses a process called NIC-VIC
(Number Image Count - Verify Image Count) to ensure
that each and every page given to us for scanning is
scanned. The patent-pending process is simple yet powerful.
Annapolis Technologies uses the best production scanners
and counters on the market. Using high-quality equipment
to do the work at production speeds enables us to do
better quality work at more competitive
prices.
The key to effective image-enabled data entry is setting
the job up properly and employing all appropriate data
extraction and validation techniques.

12. Explain
Full Text OCR/ICR Processing
Images are useful only
if they can be found when needed. There are four common
ways to address this issue:
- "Filing" related images
in subdirectories or electronic folders.
- Matching images with index fields
or keywords in a structured database
- Linking images using hypertext
links.
- Matching images with text files
in a full-text database.
If images are to be found
by searching a full-text database, a machine readable
ASCII text version of the image must be created. This
is done using OCR (optical character recognition) or
ICR (intelligent character recognition) processing engines.
These engines are useful for full-text processing only
if the text is machine print. They will not produce
acceptable results from hand printed or cursive data.
The text produced by these
engines from machine printed data can range from extremely
accurate to very poor, depending on image quality, resolution,
type faces and the OCR or ICR engine employed. We recommend
a careful analysis of your documents before making a
decision on whether to full-text OCR process them.

13. Explain
Form OCR/ICR Processing
Form OCR/ICR processing
differs from full-text OCR/ICR processing in that it
does not attempt to translate an entire image into ASCII
text. Instead, form OCR/ICR processing attempts only
to translate image form fields located at specific defined
image coordinates. These field coordinates are always
located at the same places on a given form. Since fields
are part of a form, steps can be taken to control the
way data inside the coordinates is presented. For example,
the form can prompt users to print, to print within
boxes and to print only in black or blue ink. Handprint
recognition is quite practical under these kinds of
constraints.
Since form OCR/ICR processing
deals with known data fields, it can incorporate a number
of techniques to improve and ensure accuracy even from
poor quality images or difficult handprint. For instance,
data extracted by OCR/ICR engines from known data fields
can be compared to tables defining acceptable types
of data (alpha, numeric, date, ZIP code, etc.) to help
the OCR/ICR engine interpret the image pattern. Similarly,
post-OCR/ICR routines can be used to correct characters
flagged as questionable by the OCR/ICR engine. Finally,
human operators using a variety of display options can
quickly and efficiently review and correct OCR/ICR results.
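Checking extracted fields against their expected data types can be sketched
as follows. The field names, the type table and the MM/DD/YYYY date layout
are assumptions made for the example. Fields that fail the check would be
routed to a human operator for review.

```python
import re
from datetime import datetime


def _is_date(value):
    """Accept only real calendar dates in an assumed MM/DD/YYYY layout."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False


# Hypothetical table of acceptable data types.
VALIDATORS = {
    "alpha": lambda v: re.fullmatch(r"[A-Za-z .'-]+", v) is not None,
    "numeric": lambda v: re.fullmatch(r"\d+", v) is not None,
    "zip": lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v) is not None,
    "date": _is_date,
}


def flag_suspect_fields(record, field_types):
    """Return the names of fields whose value fails its type check."""
    return [name for name, value in record.items()
            if not VALIDATORS[field_types[name]](value)]


# The ZIP code contains a letter "I" misread for a digit; the date is impossible.
field_types = {"name": "alpha", "zip": "zip", "date": "date"}
record = {"name": "Smith", "zip": "2I401", "date": "02/30/1999"}
print(flag_suspect_fields(record, field_types))
```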
Form OCR/ICR processing
by itself or assisted by human data entry editors and
verifiers can reduce key entry chores by orders of magnitude.

14. What is
Form Dropout?
One of the difficulties
in using form OCR/ICR processing relates to the manner
in which people fill out forms. They often ignore form
instructions by printing "outside the lines"
or over portions of the form itself. Data "outside
the lines" will not be within the OCR/ICR zone.
If the zone were enlarged to include it, interpreting
the data still would be difficult because form lines
and instructions would degrade it.
Form dropout techniques
address this problem by removing the form.
This is accomplished by
matching each image scanned to a library of blank dropout
form patterns that have been scanned and stored for
comparison. When an image contains a dropout form pattern,
the form dropout algorithm removes it, leaving only
the data that was entered onto the form. At this point,
zone OCR/ICR techniques can be used to translate the
data in the identified coordinates into ASCII text.
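The removal step can be sketched as a pixel-by-pixel subtraction. This
simplified example assumes the scanned image is already perfectly aligned
with the blank form pattern and that both are bitonal rasters of the same
size (1 = black, 0 = white). Real form dropout also has to register the two
images and to protect handwriting that crosses form lines, which this sketch
does not attempt.

```python
def drop_out_form(filled, blank_form):
    """Keep only the black pixels that are not part of the blank form."""
    return [
        [1 if pixel == 1 and form_pixel == 0 else 0
         for pixel, form_pixel in zip(row, form_row)]
        for row, form_row in zip(filled, blank_form)
    ]


# The blank form is a box outline; the filled form adds one data pixel inside it.
blank = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
filled = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
for row in drop_out_form(filled, blank):
    print(row)
```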
Form dropout is useful
for two reasons:
- Like other removal techniques, it
dramatically reduces file sizes.
- OCR or ICR engines can much more efficiently
and accurately analyze and convert an image into
ASCII text with the form removed.
The actual images
in a form dropout scenario can be managed in three different
ways:
- The original image can be preserved, while
the dropout image is deleted after its data is
extracted.
- The original image can be deleted, while the
dropout image is preserved in a manner that combines
it with a single overlay image of the blank form
when the dropout image is retrieved and displayed.
- The original and dropout image can be deleted,
while the ASCII data extracted from the image is
stored and presented as either ASCII data or as
ASCII data with a single overlay image of the blank
form when the data is retrieved and displayed.

15. What is
the standard for image Files?
As a result of the trend
towards standardization in the imaging industry, the
overwhelmingly dominant image standard is TIFF G4 (tagged
image file format with ITU-T, formerly CCITT, Group 4
compression). However, there
are some significant exceptions, most of which involve
large imaging system architectures implemented in the
early 1990s; large IBM imaging systems still being
sold today (ImagePlus, Visual Info, etc.); very large
image files from E-size drawings; and JPG or GIF files
used with color and grayscale images.
We recommend checking your existing infrastructure to
verify that the TIFF format is appropriate. If it is
not, the required format must be identified. Matching
proprietary image headers is the common solution. This
is not particularly difficult. Matching proprietary
image compression schemes, though uncommon, can be extremely
difficult without the cooperation of the original developer.

16. How are
text files used in the Document Conversion process?
Text files, and variations
of text files like RTF (rich text format) files, store
ASCII output from images that have been OCR processed.
They also often are used as a temporary way to store
indexing information relating to an image or batches
of images until the information can be uploaded to a
database.
We can present OCR output in text files, RTF files and
other popular formats; however, the OCR engine employed
to produce the files directly impacts the file format
options available, as well as the speed at which OCR
processing can be performed.
Text files containing indexing information also can
be presented in a variety of formats employing carriage
return, comma and semicolon delimiters, specifically
defined row/column formats, and more. The proper format
choice will depend on the spreadsheet, database, or
imaging system into which the data is to be uploaded.
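A delimited index file can be sketched with Python's standard csv module.
The column names, file names and values below are invented for the example;
the actual layout depends on the system that will import the file.

```python
import csv
import io


def write_index_file(rows, fieldnames, delimiter=","):
    """Render index records as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            delimiter=delimiter, lineterminator="\r\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical index records for two scanned images.
records = [
    {"image": "0001.tif", "file_number": "10042", "client": "Smith, J."},
    {"image": "0002.tif", "file_number": "10043", "client": "Acme Corp"},
]
columns = ["image", "file_number", "client"]
print(write_index_file(records, columns))                  # comma delimited
print(write_index_file(records, columns, delimiter=";"))   # semicolon delimited
```

Values that contain the delimiter, such as the client name with a comma, are
quoted automatically by the csv module, so the target system's import tool
must follow the same quoting rule.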

17. What are
Objects?
Objects are files that
contain not only data, but information about or components
of the application software used to process the data.
Many imaging systems "wrap" TIFF image files
with code to turn them into objects. This allows the
imaging system to more easily associate system-related
data with the file.
Annapolis Technologies reformats images for object-oriented
imaging systems when the images are uploaded into the
system. This strategy has two advantages:
- It allows us to use TIFF images during the
image and data capture process, which means we can
take advantage of all of the industry-standard
TIFF-oriented image processing and data extraction
tools.
- It allows us to take advantage of the import
tools designed and developed specifically for the
object-oriented imaging system into which the images
are to be loaded. Using native tools in this fashion
ensures image and text compatibility.
