GVIP Journal    

ICGST

Issue(7)

An Efficient Text Segmentation Technique Based on Naive Bayes Classifier 
 
M. M. Haji, S. D. Katebi
Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran
Abstract:
This paper introduces the Naive Bayes Classifier (NBC) for text segmentation. Training data for the NBC are generated from a wide range of document images, which include both machine-printed and handwritten text with different fonts, sizes, intensity values and background models. A small subset of the Discrete Cosine Transform (DCT) coefficients of each image block is used to classify the block as text or non-text. The NBC decision threshold is optimized on a test set. The proposed segmentation method is tested on unseen documents and gives promising results.
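The pipeline summarized in the abstract (block-wise DCT, a few selected coefficients, a Naive Bayes decision with a threshold) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the 8x8 block size, the `SELECTED` coefficient positions, the Gaussian likelihood model, the default `threshold`, and the toy `stripes` and `flat` blocks that stand in for text and background.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)
    scale = [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
           for u in range(n)]
    return [[scale[u] * scale[v] * sum(block[x][y] * cos[u][x] * cos[v][y]
                                       for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]

# Hypothetical subset of low-frequency AC coefficients (row, column).
# The abstract does not say which coefficients the authors select.
SELECTED = [(0, 1), (1, 0), (1, 1)]

def features(block):
    """Magnitudes of the selected DCT coefficients of one block."""
    coeffs = dct2(block)
    return [abs(coeffs[u][v]) for u, v in SELECTED]

def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for c in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == c]
        dims = range(len(rows[0]))
        means = [sum(r[j] for r in rows) / len(rows) for j in dims]
        # A small constant keeps the variance strictly positive.
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-6
                     for j in dims]
        model[c] = (len(rows) / len(samples), means, variances)
    return model

def log_posterior(model, c, x):
    """Unnormalized log posterior under the feature-independence assumption."""
    prior, means, variances = model[c]
    return math.log(prior) + sum(
        -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        for xj, m, v in zip(x, means, variances))

def is_text(model, x, threshold=0.0):
    """Label a block as text when the log-posterior ratio exceeds the threshold."""
    ratio = log_posterior(model, "text", x) - log_posterior(model, "non-text", x)
    return ratio > threshold

# Toy stand-ins: vertical stripes mimic stroke-like text, flat blocks mimic background.
def stripes(lo, hi, n=8):
    return [[lo if y % 2 == 0 else hi for y in range(n)] for _ in range(n)]

def flat(gray, n=8):
    return [[gray] * n for _ in range(n)]

train_blocks = [stripes(0, 255), stripes(50, 200), stripes(30, 240),
                flat(10), flat(128), flat(250)]
train_labels = ["text"] * 3 + ["non-text"] * 3
model = fit_gaussian_nb([features(b) for b in train_blocks], train_labels)

print(is_text(model, features(stripes(20, 220))))  # unseen striped block
print(is_text(model, features(flat(90))))          # unseen flat block
```

In this sketch the `threshold` argument plays the role of the decision threshold that the abstract says is tuned on a test set; raising it trades missed text blocks for fewer false alarms.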
 
Keywords: Document Image Analysis, Text Segmentation, Content Based Image Retrieval, Naive Bayes Classifier, Discrete Cosine Transform, Feature Selection, Morphological Operations.
 
Biographies:

Mohammad Mehdi Haji received his B.Sc. in computer engineering from the Department of Computer Science and Engineering, Shiraz University, Iran in September 2002. His B.Sc. thesis was about optimal state assignment in sequential synchronous circuits using genetic algorithms. He was ranked in the top one percent of all students who graduated from the department. He was then admitted to the M.Sc. course in AI in the same department and defended his M.Sc. thesis, on handwritten word recognition using hidden Markov models, with distinction in January 2005. He ranked first among the graduates of the Engineering School of Shiraz University in 2004. He is currently a researcher in Iran and his research interests include pattern recognition, statistical language modeling, genetic programming, computer vision, handwriting recognition and document processing.

Seraj Dean Katebi graduated with an honours degree in Computer Systems Engineering from Coventry University, England in 1972. He obtained his M.Sc. and Ph.D. from the Control Systems Center, University of Manchester Institute of Science and Technology (UMIST) in 1973 and 1976 respectively. He has been a faculty member of the Department of Computer Science and Engineering, Shiraz University since 1976, teaching undergraduate and graduate courses and conducting research in various aspects of nonlinear control and AI. He is the author of several papers in cited journals and has been a full professor since 1993.

BibTeX:

@ARTICLE{P1150520001,
AUTHOR = {M. M. Haji and S. D. Katebi},
TITLE = {An Efficient Text Segmentation Technique Based on Naive Bayes Classifier},
JOURNAL = {ICGST International Journal on Graphics, Vision and Image Processing},
YEAR = {2005},
MONTH = {July},
VOLUME = {05},
ISSUE = {7},
PAGES = {27--36}
}
