Abstract
To store the information found in paper documents efficiently, text and non-text regions need to be separated. Non-text regions include half-tone photographs and line diagrams. The text regions can be converted (via an optical character reader) to a computer-searchable form, and the non-text regions can be extracted and preserved in compressed form using image-compression algorithms. In this paper, an effective system for automatically segmenting a document image into text and non-text regions is proposed. The system first performs adaptive thresholding to obtain a binarized image. Subsequently, the binarized image is smeared using a run-length differential algorithm. The smeared image is then passed through a text characteristic filter to remove erroneous smearing of non-text regions. Next, baseline cumulative blocking is used to rectangularize the smeared regions. Finally, a text block growing algorithm is used to block out each text sentence. Text recognition is then carried out sentence by sentence.
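The first two stages of the pipeline (binarization, then smearing) can be illustrated with a minimal Python sketch. The abstract does not specify the paper's adaptive thresholding rule or its run-length differential algorithm, so this sketch substitutes two generic stand-ins: a local-mean threshold and classic horizontal run-length smearing, in which short background gaps between ink pixels are filled. All function names and the parameters `radius`, `offset`, and `max_gap` are assumptions made for illustration only.

```python
from typing import List

Gray = List[List[int]]    # grayscale rows, values 0..255
Binary = List[List[int]]  # 1 = ink (dark pixel), 0 = background


def adaptive_threshold(gray: Gray, radius: int = 1, offset: int = 10) -> Binary:
    """Mark a pixel as ink when it is darker than its local mean minus `offset`.

    Generic local-mean rule, assumed here; not the paper's own method.
    """
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Window is clipped at the image borders.
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out


def smear_row(row: List[int], max_gap: int) -> List[int]:
    """Fill background runs of length <= max_gap lying between two ink pixels."""
    out = row[:]
    last_ink = None  # index of the most recent ink pixel seen
    for x, value in enumerate(row):
        if value == 1:
            gap = x - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= max_gap:
                for i in range(last_ink + 1, x):
                    out[i] = 1
            last_ink = x
    return out


def smear_horizontal(binary: Binary, max_gap: int) -> Binary:
    """Apply run-length smearing to every row independently."""
    return [smear_row(row, max_gap) for row in binary]


if __name__ == "__main__":
    page = [
        [200, 200, 200, 200, 200, 200],
        [200, 40, 200, 200, 40, 200],
        [200, 200, 200, 200, 200, 200],
    ]
    binary = adaptive_threshold(page)
    for row in smear_horizontal(binary, max_gap=2):
        print(row)
```

With a small `max_gap`, characters in a word merge into one run; a larger value merges words into lines. Gaps that touch the image border are left unfilled, so margins do not get smeared into the text.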
| Original language | American English |
|---|---|
| Pages (from-to) | 639-651 |
| Number of pages | 13 |
| Journal | Engineering Applications of Artificial Intelligence |
| Volume | 7 |
| Issue number | 6 |
| State | Published - Dec 1994 |
| Externally published | Yes |
ASJC Scopus subject areas
- Control and Systems Engineering
- Artificial Intelligence
- Electrical and Electronic Engineering
Keywords
- Text and image segmentation
- character recognition
- document processing
- run-length differential algorithm