Enhanced Segmentation And Feature Extraction For Sindhi Optical Character Recognition
Date
2015-06
Authors
HAKRO, DIL NAWAZ
Abstract
An Optical Character Recognition (OCR) system, which is an integral part of machine vision,
image processing, biomedical imaging, language processing and speech recognition, poses
many challenging problems. OCR systems for non-cursive scripts have achieved perfection, whereas
OCRs for cursive scripts still need further attention. OCR work on Sindhi, which is written in
a cursive script based on the Arabic alphabet, is still in its infancy, and there is no complete OCR
for the language which is spoken by over 60 million people in Pakistan and other parts of the
world. No text image database is available for training and testing on Sindhi characters, and
most text image databases are created for a single script only. The main goal of this research
is therefore to develop a complete OCR system for the Sindhi script. To that end, an in-depth
study is carried out on the issues and challenges posed by the Sindhi script with respect to its
OCR. In this research, a very large database containing 4 billion words and 15 billion characters is
created for training and testing on the Sindhi script with the help of custom-built software, together
with a multi-script database for various other scripts. The multi-script database, comprising
billions of words and characters for 84 scripts, can be used for OCR training and testing on
a single platform. Sindhi has the largest extension of the original Arabic alphabet among the
languages that have adopted the Arabic alphabet. An enhanced segmentation
algorithm and feature extraction algorithms are therefore proposed for Sindhi; these can also be
used for other scripts. The segmentation algorithm based on the energy level produces good
results for Sindhi characters as well as characters of other scripts. An enhanced feature extraction
algorithm based on character geometry is also proposed to select and extract features
from segmented Sindhi characters as well as other Arabic-related scripts and isolated scripts. The
enhanced method is based on individual 2 x 2, 3 x 3, and 4 x 4 zone formations, and on a combined
approach that uses all of these zone formations together. An integrated Sindhi OCR and
multi-script OCR is developed using the enhanced methods in this research. The OCR system
produced good results on Sindhi and multi-script characters with an average recognition rate of
89% for Sindhi, and recognition rates ranging from 90.33% to 99.90% for some of the other scripts.
The multi-script database created in this research can also be extended easily in many ways so
that training and testing of various scripts can be made available on a single platform.
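
The zone formations mentioned above can be illustrated with a minimal sketch. It assumes a segmented character is given as a binary image (1 for ink, 0 for background) and uses the ink density of each zone as a stand-in feature; the abstract does not specify which character-geometry features the thesis actually extracts, so the density feature and the function names `zone_densities` and `combined_zone_features` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def zone_densities(img: Image, n: int) -> List[float]:
    """Split img into an n x n grid and return the ink density of each zone.

    Zones are visited row by row, left to right. Ink density is an assumed
    placeholder for the geometry-based features described in the abstract.
    """
    rows, cols = len(img), len(img[0])
    feats: List[float] = []
    for zr in range(n):
        # Integer arithmetic keeps zone boundaries inside the image.
        r0, r1 = zr * rows // n, (zr + 1) * rows // n
        for zc in range(n):
            c0, c1 = zc * cols // n, (zc + 1) * cols // n
            area = (r1 - r0) * (c1 - c0)
            ink = sum(img[r][c] for r in range(r0, r1) for c in range(c0, c1))
            # Guard against empty zones when the image is smaller than the grid.
            feats.append(ink / area if area else 0.0)
    return feats


def combined_zone_features(img: Image, grids: Sequence[int] = (2, 3, 4)) -> List[float]:
    """Concatenate the per-zone features of every grid size in `grids`."""
    feats: List[float] = []
    for n in grids:
        feats.extend(zone_densities(img, n))
    return feats


if __name__ == "__main__":
    # A 12 x 12 test image whose top-left quadrant is fully inked.
    img = [[1 if r < 6 and c < 6 else 0 for c in range(12)] for r in range(12)]
    print(len(combined_zone_features(img)))
```

With the default grids, the combined vector holds 4 + 9 + 16 = 29 values per character. Concatenation is only one plausible way to combine the zone formations; the thesis may combine them differently.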
Keywords
Enhanced Segmentation And Feature Extraction, For Sindhi Optical Character Recognition