Latin Script Detection and Removal from Devanagari Document Image for OCR

Savita Pal Godara; Pratap Singh Patwal

doi:https://doi.org/10.14445/22492593/IJCOT-V6P308

Volume 4 | Issue 2 | Year 2014 | Article Id. IJCOT-V6P308 | DOI : https://doi.org/10.14445/22492593/IJCOT-V6P308

Citation :

Savita Pal Godara, Pratap Singh Patwal, "Latin Script Detection and Removal from Devanagari Document Image for OCR," International Journal of Computer & Organization Trends (IJCOT), vol. 4, no. 2, pp. 33-36, 2014. Crossref, https://doi.org/10.14445/22492593/IJCOT-V6P308

Abstract

Document image analysis comprises the techniques applied to images of documents to obtain a computer-readable description from pixel data. One product of document image analysis is Optical Character Recognition (OCR) software, which recognizes text in a scanned document image and makes it possible for the user to edit or search the document’s contents. In this paper we propose a novel method for identifying Latin text in Devanagari document images. Many Devanagari documents contain English text alongside Devanagari on a single page, and in such bilingual documents the two scripts are generally mixed within a single text line. Methods exist for recognizing each script on its own, but they lack the ability to recognize multiple scripts mixed within a single text line.
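To make the word-level problem concrete, the following is a minimal sketch of one commonly used cue for separating the two scripts. It is not the method proposed in the paper, which the abstract does not describe. It assumes that words have already been segmented into binary bitmaps, and it relies on the headline (shirorekha) that joins the characters of most Devanagari words: the densest pixel row in the upper half of a Devanagari word usually spans nearly the full word width, whereas in a Latin word it usually does not. The names `headline_ratio`, `guess_script`, and `remove_latin`, the toy bitmaps, and the threshold of 0.8 are all hypothetical choices made for illustration.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = ink pixel, 0 = background (assumed input format)


def headline_ratio(word: Bitmap) -> float:
    """Fraction of the word width covered by the densest row in the upper half."""
    if not word or not word[0]:
        return 0.0
    width = len(word[0])
    upper = word[: max(1, len(word) // 2)]  # look for a headline only near the top
    return max(sum(row) for row in upper) / width


def guess_script(word: Bitmap, threshold: float = 0.8) -> str:
    """Label a word bitmap; the threshold is an illustrative assumption."""
    return "devanagari" if headline_ratio(word) >= threshold else "latin"


def remove_latin(words: List[Bitmap], threshold: float = 0.8) -> List[Bitmap]:
    """Keep only the words that look like Devanagari."""
    return [w for w in words if guess_script(w, threshold) == "devanagari"]


# Toy 4x6 bitmaps: the first has a full-width bar near the top, the second does not.
devanagari_like: Bitmap = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
latin_like: Bitmap = [
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
]

kept = remove_latin([devanagari_like, latin_like])
```

In this toy run only `devanagari_like` is kept. On real scans a single cue of this kind would likely misclassify some words (for example, skewed text, broken headlines, or Latin words set in capitals with long top strokes), so it should be read as an illustration of the task and not as a complete detector.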

Keywords

OCR, Image Document, Devanagari Script, Latin Script
