An Efficient Text Segmentation Technique Based on Naive Bayes Classifier
M. M. Haji, S. D. Katebi
Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran
Abstract:
This paper introduces the Naive Bayes Classifier (NBC) for text segmentation in document images. Training data for the NBC are generated from a wide range of document images, including both machine-printed and handwritten text with different fonts, sizes, intensity values and background models. Each image block is transformed with the Discrete Cosine Transform (DCT), and a small subset of the resulting coefficients is used to classify the block as text or non-text. The NBC decision threshold is optimized on a test set. The proposed segmentation method is tested on unseen documents and gives promising results.
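The pipeline summarized in the abstract (block DCT, a small coefficient subset, a naive Bayes decision with a tuned threshold) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 8x8 block size, the choice of coefficients in `COEFFS`, the Gaussian class-conditional model, the synthetic blocks produced by `synthetic_block`, and the accuracy-based threshold search in `pick_threshold`.

```python
import numpy as np

# Hypothetical choice of low-frequency AC coefficients (row, column); the DC term is skipped.
COEFFS = [(0, 1), (1, 0), (1, 1), (0, 2), (2, 0)]

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def block_features(block, coeffs=COEFFS):
    """Magnitudes of the selected 2-D DCT coefficients of a square block."""
    d = dct_matrix(block.shape[0])
    c = d @ block @ d.T
    return np.array([abs(c[u, v]) for u, v in coeffs])

class GaussianNB2:
    """Two-class Gaussian naive Bayes (0 = non-text, 1 = text)."""

    def fit(self, X, y):
        self.mu = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
        # A small constant keeps the variances strictly positive.
        self.var = np.array([X[y == c].var(axis=0) for c in (0, 1)]) + 1e-6
        self.log_prior = np.log(np.array([(y == c).mean() for c in (0, 1)]))
        return self

    def log_odds(self, X):
        """log P(text | x) - log P(non-text | x), treating features as independent."""
        ll = []
        for c in (0, 1):
            per_feature = np.log(2 * np.pi * self.var[c]) + (X - self.mu[c]) ** 2 / self.var[c]
            ll.append(-0.5 * per_feature.sum(axis=1) + self.log_prior[c])
        return ll[1] - ll[0]

    def predict(self, X, threshold=0.0):
        return (self.log_odds(X) > threshold).astype(int)

def synthetic_block(rng, is_text, n=8):
    """Toy stand-in for real data: a smooth background, plus dark strokes for text."""
    block = np.full((n, n), rng.uniform(100, 200)) + rng.normal(0, 2, (n, n))
    if is_text:
        block[rng.random((n, n)) < 0.4] -= 80
    return block

def make_dataset(rng, n_blocks):
    """Feature matrix and labels for n_blocks random synthetic blocks."""
    y = rng.integers(0, 2, n_blocks)
    X = np.array([block_features(synthetic_block(rng, bool(t))) for t in y])
    return X, y

def pick_threshold(clf, X, y, candidates=np.linspace(-10, 10, 41)):
    """Return the candidate log-odds threshold with the best accuracy on (X, y)."""
    accs = [(clf.predict(X, t) == y).mean() for t in candidates]
    return candidates[int(np.argmax(accs))]

rng = np.random.default_rng(0)
X_train, y_train = make_dataset(rng, 400)
X_tune, y_tune = make_dataset(rng, 200)
clf = GaussianNB2().fit(X_train, y_train)
best_t = pick_threshold(clf, X_tune, y_tune)
```

In this sketch the threshold shifts the log-odds decision boundary, so raising it trades missed text blocks for fewer false alarms. Real document images would replace `synthetic_block`, and the coefficient subset would come from a feature selection step rather than being fixed by hand.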
Keywords:
Document Image Analysis, Text Segmentation, Content-Based Image Retrieval, Naive Bayes Classifier, Discrete Cosine Transform, Feature Selection, Morphological Operations.
Biographies:
Mohammad Mehdi Haji received his B.Sc. in computer engineering from the Department of Computer Science and Engineering, Shiraz University, Iran, in September 2002. His B.Sc. thesis was on optimal state assignment in sequential synchronous circuits using genetic algorithms. He was ranked in the top one percent of all students who graduated from the department. He was then admitted to the M.Sc. course in AI in the same department and defended his M.Sc. thesis, on handwritten word recognition using hidden Markov models, with distinction in January 2005. He ranked first among the graduating students of the Engineering School of Shiraz University in 2004. He is currently a researcher in Iran, and his research interests include pattern recognition, statistical language modeling, genetic programming, computer vision, handwriting recognition and document processing.

Seraj Dean Katebi graduated with an honours degree in Computer Systems Engineering from Coventry University, England, in 1972. He obtained his M.Sc. and Ph.D. from the Control Systems Center, University of Manchester Institute of Science and Technology (UMIST), in 1973 and 1976 respectively. He has been a faculty member of the Department of Computer Science and Engineering, Shiraz University, since 1976, teaching undergraduate and graduate courses and conducting research in various aspects of nonlinear control and AI. He is the author of several papers in cited journals and has been a full professor since 1993.

BibTex:
@ARTICLE{P1150520001,
  AUTHOR = {M. M. Haji and S. D. Katebi},
  TITLE = {An Efficient Text Segmentation Technique Based on Naive Bayes Classifier},
  JOURNAL = {ICGST International Journal on Graphics, Vision and Image Processing},
  YEAR = {2005},
  MONTH = {July},
  VOLUME = {05},
  ISSUE = {7},
  PAGES = {27--36}
}