Clara OCR Glossary

WELCOME

This is the Clara OCR glossary. It's somewhat specific to Clara OCR. The entries that do not name an author were written by Ricardo Ueda Karpischek. Send new entries or suggestions to claraocr@claraocr.org. This glossary is part of the Clara OCR documentation. Clara OCR is distributed under the terms of the GNU GPL.

CONTENTS

1. algorithm

a well-defined procedure. The term "algorithm" is usually reserved for procedures whose properties can be assured, generally through a rigorous mathematical proof. For instance, the procedure learned by children to multiply two numbers from their multi-digit decimal representations is an algorithm (see heuristic).

2. binarization

the conversion from color or grayscale (PGM) to black-and-white. The Clara OCR classification heuristics currently available require black-and-white input, so when the input is grayscale (PGM), Clara OCR needs to convert it to black-and-white before OCR. Note that to binarize an image, some choice must be made on how to map colors or gray levels to either black or white. More importantly, the OCR results depend strongly on that choice (see thresholding).

3. bitmap

The Clara OCR documentation tries to use the term "bitmap" to mean only rectangular, black-and-white digital images. Grayscale rectangular digital images are called "graymaps" (see also pixel).

4. bitmap comparison

any method intended to decide if two given bitmaps are similar. Clara OCR implements three such methods: skeleton fitting, border mapping and pixel distance.

5. border

the line formed by the black pixels of a bitmap that have white neighbours. Note that the definition of "neighbour" may vary. Clara OCR generally considers the neighbours of a pixel to be all 8 pixels contiguous to it (top left, top, top right, left, right, bottom left, bottom, bottom right).
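
The definition above can be illustrated with a minimal sketch. It is not taken from the Clara OCR sources: the function name border_pixels, the list-of-rows representation (1 for black, 0 for white) and the choice to treat positions outside the bitmap as white are all assumptions made for the example.

```python
# Offsets of the 8 neighbours: top left, top, top right, left, right,
# bottom left, bottom, bottom right.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]


def border_pixels(bitmap):
    """Return the set of (row, col) black pixels having a white 8-neighbour.

    bitmap is a list of equally sized rows; 1 means black, 0 means white.
    Assumption: positions outside the bitmap count as white.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0

    def is_white(r, c):
        if r < 0 or r >= height or c < 0 or c >= width:
            return True
        return bitmap[r][c] == 0

    border = set()
    for r in range(height):
        for c in range(width):
            if bitmap[r][c] == 1 and any(
                is_white(r + dr, c + dc) for dr, dc in NEIGHBOUR_OFFSETS
            ):
                border.add((r, c))
    return border


# A 5x5 bitmap holding a solid 3x3 black square.
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
square_border = border_pixels(square)
```

For the 3x3 square, every black pixel except the central one at (2, 2) touches a white pixel, so the border has 8 pixels. With a 4-neighbour definition (top, left, right, bottom only) the result would be the same for this shape, but it can differ for others.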

6. border mapping

a bitmap comparison technique that builds a mapping from the border pixels of one bitmap to the border pixels of another bitmap. If this mapping is found to satisfy certain mathematical properties, the bitmaps are considered similar.

7. clara

Cooperative Lightweight Recognizer. "Clara" is also a personal name: Clara (Latin, Portuguese, Spanish), Chiara (Italian), Claire (English).

8. classification

the process that recognizes a given bitmap as being the letter "a" or the digit "5", etc. Instead of saying that the bitmap was "recognized" as a letter "a", it's common to say that it was "classified" as a letter "a". All Clara OCR classification methods are currently based on bitmap comparison techniques.

9. density

see dpi.

10. depth

the number of bits available to store the color of each pixel. Black-and-white images have depth 1. Graymaps usually have depth 8 (256 gray levels). The larger the depth, the larger the amount of disk or RAM space required to store a digital image. For instance, an image of size 100x100 and depth 8 requires 100*100*8 = 80000 bits = 10000 bytes to be stored.
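
The arithmetic can be checked with a short sketch. The function names are hypothetical, and the figures cover the raw pixel data only: file headers, compression and any per-row padding used by a particular graphic format are ignored.

```python
def storage_bits(width, height, depth):
    """Number of bits needed for the raw pixel data of an image."""
    return width * height * depth


def storage_bytes(width, height, depth):
    """Same quantity in bytes, rounded up to a whole byte."""
    return (storage_bits(width, height, depth) + 7) // 8


gray_bits = storage_bits(100, 100, 8)     # 80000 bits
gray_bytes = storage_bytes(100, 100, 8)   # 10000 bytes
bw_bytes = storage_bytes(100, 100, 1)     # 1250 bytes
```

The same 100x100 image at depth 1 takes one eighth of the space it takes at depth 8.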

11. digital image

see pixel.

12. dpi

dots-per-inch. A measure of linear image density. Example: scanning an A4 (210x297mm) page at 300 dpi results in an image of about 2480x3508 pixels (remember that 1 inch equals 25.4 millimeters). In most cases, all relevant visual details of printed characters can be conveniently captured at 600 dpi (in some cases, 300 dpi suffices). Some file formats, like TIFF or JPEG, include density information. Others, like PBM, PGM or PPM, don't. So when converting from TIFF to PGM, remember that the density information is dropped. If, for instance, you ask SANE to scan a page creating a TIFF file, subsequently convert it to PPM, and then from PPM to TIFF again, the last file will not be equal to the first one. Density information is usually irrelevant when displaying images on a computer monitor, because in this case a 1-1 mapping between image pixels and display pixels is assumed. However, density information is quite important when printing an image on paper, or when performing OCR. Clara OCR expects to be informed explicitly about the image density (default 600 dpi).
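
The relation between paper size, density and image size can be sketched as follows. The function names are hypothetical, and rounding to the nearest pixel is an assumption: scanners and conversion tools may round differently, so real figures can differ by a pixel.

```python
MM_PER_INCH = 25.4


def pixels_for(length_mm, dpi):
    """Pixels needed to cover length_mm at the given density."""
    return round(length_mm / MM_PER_INCH * dpi)


def scanned_size(width_mm, height_mm, dpi):
    """Image size (width, height) in pixels for a page scanned at dpi."""
    return pixels_for(width_mm, dpi), pixels_for(height_mm, dpi)


a4_at_300 = scanned_size(210, 297, 300)   # (2480, 3508)
a4_at_600 = scanned_size(210, 297, 600)   # (4961, 7016)
```

Doubling the density roughly quadruples the number of pixels, and therefore the storage required at a given depth.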

13. function

a rule that assigns, to each given element, another element, in a unique fashion. For instance, the equation y = x+1 defines a function that assigns to each number x the number x+1. A 2D digital image may be seen as a function that assigns to each dot, given by its horizontal and vertical coordinates, a color ("black", "white", "green", etc). Functions are also called "mappings".

14. graphic format

A standardised way to store the color of each pixel of a digital image in a disk file. The graphic format may include other information, like density and image annotations. Some graphic formats include a provision to compress the data. In some cases, this compression, if used, may change the color of some pixels or regions to colors close to the original ones, but different. So the usage of some graphic formats may imply data loss. Examples of graphic formats are TIFF, JPEG, GIF, BMP, PNM, etc.

15. graymap

see bitmap.

16. heuristic

a procedure whose properties are not assured. Heuristics are generally the expression of some more or less vague feeling, or a naive, initial approach to a complex problem. If a heuristic can be proven to satisfy some interesting property, then it can be referred to as an algorithm (with regard to that property). Some experts say that OCR is an engineering field, not a mathematical field. Perhaps we can express this same idea by saying that, by its own nature, OCR is a field where nothing other than heuristics can be stated.

17. image size

As a digital image is usually a rectangular matrix of pixels, its size in pixels can be conveniently described by giving the rectangle width and height, usually in the form WxH. For instance, a 200x100 image is a rectangle of pixels having width 200 and height 100.

18. mapping

see function.

19. OCR

Optical Character Recognition. Some people find it hard to understand what OCR is because they lack knowledge of how computers store and process text and image data. Most users think of OCR as a required step before editing and spell-checking documents obtained from a scanner (which is not wrong, though).

20. page

a scanned document. The Clara OCR documentation tries to avoid using terms like "document", "image" or "file" to signify a scanned document. "Page" is used instead.

21. pattern

in the Clara OCR context, it's a letter, digit or accent instance, used to classify the page symbols through bitmap comparison. Clara OCR builds a set of patterns based on manual training or automatic selection, and uses it to classify all page symbols.

22. pixel

each one of the individual dots that compose a digital image (quite frequently, the term "pixel" is used to refer only to the non-white dots of an image). A digital image is usually a rectangular matrix of dots. Each dot can be assigned one of many available colors, in order to form an image. If the available colors are only "black" and "white", the image thus formed is a "black-and-white image". Since one of two possible values can be represented using a single bit, and the assignment of colors to geometrically well-positioned dots may be seen as a function or mapping, a black-and-white image is also called a "bitmap". Similarly, if the colors available are only gray levels, usually from 0 (black) to 255 (white), then the image is a "grayscale image" or a graymap, and a generic assignment of colors to pixels is called a "pixmap".

23. pixel distance

a bitmap comparison technique that builds a mapping from all pixels of one bitmap to the pixels of another bitmap. If this mapping is found to satisfy certain mathematical properties, the bitmaps are considered similar.

24. pixmap

see pixel.

25. PBM

see PNM.

26. PGM

see PNM.

27. PNM

Portable aNyMap. PNM is a generic reference to the graphic file formats PBM, PGM and PPM defined by Jef Poskanzer. In other words, to say that a program supports PNM means that it handles PBM, PGM and PPM. PBM (Portable BitMap) files are black-and-white images, 1 bit per pixel. PGM (Portable GrayMap) files are grayscale images, 8 bits per pixel. PPM (Portable PixMap) files are color images, 24 bits per pixel. Currently Clara OCR accepts PBM and PGM files only. A scanned page stored in some format other than PBM or PGM can be converted to PBM or PGM using the netpbm tools, ImageMagick or others. PNM files may be "raw" or "plain". The plain versions are rarely used. Clara OCR supports neither plain PBM nor plain PGM.
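
To give an idea of what a raw PBM or PGM file looks like, here is a minimal header reader. It is only a sketch and is not the reader used by Clara OCR: the function name read_raw_pnm_header is hypothetical, and the code relies on the netpbm conventions that raw PBM starts with the magic number "P4", raw PGM with "P5", that header fields are decimal numbers separated by whitespace (with optional "#" comments), and that a single whitespace byte separates the header from the pixel data.

```python
def read_raw_pnm_header(data):
    """Parse the header of a raw PBM (P4) or raw PGM (P5) byte string.

    Returns (magic, width, height, maxval, offset), where offset is the
    index of the first byte of pixel data. maxval is None for PBM.
    """
    magic = data[:2]
    if magic not in (b"P4", b"P5"):
        raise ValueError("not a raw PBM or raw PGM file")
    wanted = 2 if magic == b"P4" else 3   # PGM also stores maxval
    pos = 2
    values = []
    while len(values) < wanted:
        # Skip whitespace and "#" comments (a comment runs to end of line).
        while data[pos:pos + 1].isspace() or data[pos:pos + 1] == b"#":
            if data[pos:pos + 1] == b"#":
                while pos < len(data) and data[pos:pos + 1] != b"\n":
                    pos += 1
            else:
                pos += 1
        start = pos
        while data[pos:pos + 1].isdigit():
            pos += 1
        if start == pos:
            raise ValueError("malformed header")
        values.append(int(data[start:pos]))
    offset = pos + 1   # exactly one whitespace byte precedes the pixels
    width, height = values[0], values[1]
    maxval = values[2] if wanted == 3 else None
    return magic, width, height, maxval, offset


# A 3x2 raw PGM built in memory: header followed by 6 gray levels.
pgm = b"P5\n# tiny example\n3 2\n255\n" + bytes([0, 128, 255, 255, 128, 0])
pgm_header = read_raw_pnm_header(pgm)
pgm_pixels = pgm[pgm_header[4]:]

# A 3x2 raw PBM: each row is packed into whole bytes, so 2 bytes here.
pbm = b"P4\n3 2\n" + bytes([0b10100000, 0b01000000])
pbm_header = read_raw_pnm_header(pbm)
```

In raw PGM with maxval 255, each pixel takes one byte. In raw PBM, each row is packed 8 pixels per byte and padded to a whole number of bytes, so a row of width W occupies (W + 7) // 8 bytes.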

28. PPM

see PNM.

29. resolution

this term is used throughout the Clara OCR documentation to refer to either the image size (for instance: 640x480 pixels) or the image density (for instance: 300 pixels per inch).

30. skeleton

ideally, it's a minimal structural bitmap. From an algorithmic standpoint, the skeleton of a symbol is the bitmap obtained by clearing a number of its peripheral pixels whose removal does not destroy the symbol shape.

31. skeleton fitting

a bitmap comparison technique that decides that two given bitmaps are similar if and only if the skeleton of each one fits into the other.

32. symbol

an instance of a letter or digit in a page. So if the word "classical" occurs in a page, all its letters ("c", "l", "a", "s", "s", "i", "c", "a", "l") are individual symbols. At the source code level, things that are neither letters nor digits are sometimes called symbols (for instance, pieces of broken symbols, dots, accents, noise, etc).

33. thresholding

a simple binarization method. It decides to map each pixel from a graymap to either black or white just by testing whether its gray level is smaller or larger than a given threshold. So, if the threshold is, say, 171, then all gray levels from 0 to 170 are mapped to 0 (black) and all gray levels from 171 to 255 are mapped to 255 (white). The thresholding is said to be global if one fixed (per-page) binarization threshold is used to decide the mapping of all page pixels. The thresholding is said to be local if the threshold is allowed to vary along the page, to compensate for irregular printing intensity.
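
Global thresholding as described above can be sketched in a few lines. The function name threshold_global and the list-of-rows graymap representation are assumptions made for the example, not the Clara OCR implementation.

```python
def threshold_global(graymap, threshold=171):
    """Map each gray level to 0 (black) or 255 (white).

    Levels below the threshold become black; levels equal to or above
    it become white. graymap is a list of rows of integers in 0..255.
    """
    return [
        [0 if level < threshold else 255 for level in row]
        for row in graymap
    ]


# One row with levels on both sides of the threshold 171.
sample = [[0, 170, 171, 255]]
binarized = threshold_global(sample)   # [[0, 0, 255, 255]]
```

A local variant would compute a different threshold for each region of the page instead of taking a single value for the whole page.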

34. Xlib

the low-level, standard X Window System library. It offers basic graphic primitives, similar to those found in most graphic environments, like "draw line", "draw pixel", "get next event", etc, as well as services more specific to the X Window System way of doing things, like "connect to an X display", properties (resources) handling, etc. The Xlib does not include facilities to create menus, buttons, etc. Application programs usually take these facilities from "toolkits" like Xt, GTK, Qt and others. Clara OCR creates the few facilities it needs using the Xlib primitives.