Enhanced Segmentation And Feature Extraction For Sindhi Optical Character Recognition

Date
2015-06
Authors
HAKRO, DIL NAWAZ
Abstract
Optical Character Recognition (OCR), which is an integral part of machine vision, image processing, biomedical imaging, language processing and speech recognition, poses many challenging problems. OCR systems for non-cursive scripts are highly mature, whereas OCR for cursive scripts still needs further attention. OCR work on Sindhi, a cursive script based on the Arabic alphabet, is still in its infancy, and there is no complete OCR system for the language, which is spoken by over 60 million people in Pakistan and other parts of the world. No text image database is available for training and testing on Sindhi characters, and most existing text image databases cover only a single script. The main goal of this research is therefore to develop a complete OCR system for the Sindhi script. An in-depth study is carried out on the issues and challenges that the Sindhi script poses for OCR. A large database containing 4 billion words and 15 billion characters is created with custom-built software for training and testing on the Sindhi script, together with a multi-script database for various other scripts. The multi-script database, comprising multi-billion words and characters for 84 scripts, can be used for OCR training and testing on a single platform. Among the languages that have adopted the Arabic alphabet, Sindhi has the largest extension of the original alphabet. This research therefore proposes an enhanced segmentation algorithm and enhanced feature extraction algorithms for Sindhi, which can also be used for other scripts. The segmentation algorithm, based on the energy level, produces good results for Sindhi characters as well as for characters of other scripts. An enhanced feature extraction algorithm based on character geometry is also proposed to select and extract features from segmented Sindhi characters, as well as from other Arabic-related scripts and isolated scripts. The enhanced method is based on individual 2 x 2, 3 x 3, and 4 x 4 zone formations, and on a combined approach that uses all of these zone formations together. An integrated Sindhi OCR and multi-script OCR system is developed using the enhanced methods. The OCR system produced good results on Sindhi and multi-script characters, with an average recognition rate of 89% for Sindhi and recognition rates ranging from 90.33% to 99.90% for some of the other scripts. The multi-script database created in this research can also be extended easily in many ways, so that training and testing of various scripts can be made available on a single platform.
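The zone formations mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which geometric features are computed in each zone, so the code below substitutes a simple assumption: the fraction of foreground pixels per zone. The function names (`zone_bounds`, `zone_densities`, `combined_zone_features`) and the sample `glyph` are hypothetical and are not taken from the thesis.

```python
# Illustrative sketch only: zone-based features for a binary character image.
# ASSUMPTION: each zone contributes its foreground-pixel density; the thesis
# uses character-geometry features whose details are not given in the abstract.

def zone_bounds(size, n):
    """Split range(size) into n contiguous, nearly equal (start, end) spans."""
    return [(i * size // n, (i + 1) * size // n) for i in range(n)]

def zone_densities(image, n):
    """Return n*n densities for an n x n zoning of a 2D list of 0/1 pixels."""
    rows, cols = len(image), len(image[0])
    features = []
    for r0, r1 in zone_bounds(rows, n):
        for c0, c1 in zone_bounds(cols, n):
            area = (r1 - r0) * (c1 - c0)
            ink = sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
            features.append(ink / area if area else 0.0)
    return features

def combined_zone_features(image, grids=(2, 3, 4)):
    """Concatenate the features of every zone formation (4 + 9 + 16 values)."""
    features = []
    for n in grids:
        features.extend(zone_densities(image, n))
    return features

# Hypothetical 12 x 12 glyph whose left half is foreground (1) and right half
# is background (0).
glyph = [[1 if c < 6 else 0 for c in range(12)] for _ in range(12)]
vector = combined_zone_features(glyph)
print(len(vector), vector[:4])
```

With the default grids, the combined vector has 29 entries, one per zone across the three formations. A classifier could be trained on either a single formation or the combined vector, which mirrors the individual and combined approaches described above.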
Keywords
Enhanced Segmentation And Feature Extraction, For Sindhi Optical Character Recognition