Latin Script Detection and Removal from Devanagari Document Image for OCR

International Journal of Computer & Organization Trends (IJCOT)

© 2014 by IJCOT Journal
Volume - 4 Issue - 2
Year of Publication : 2014
Authors : Savita Pal Godara, Pratap Singh Patwal
DOI : 10.14445/22492593/IJCOT-V6P308

Citation

Savita Pal Godara, Pratap Singh Patwal. "Latin Script Detection and Removal from Devanagari Document Image for OCR", International Journal of Computer & Organization Trends (IJCOT), V4(2):33-36, Mar - Apr 2014, ISSN: 2249-2593, www.ijcotjournal.org. Published by Seventh Sense Research Group.

Abstract

Document image analysis comprises the techniques applied to images of documents to obtain a computer-readable description from pixel data. One product of document image analysis is Optical Character Recognition (OCR) software, which recognizes text in a scanned document image and so lets the user edit or search the document’s contents. In this paper we propose a novel method for identifying Latin text in Devanagari document images. Many Devanagari documents contain English text alongside Devanagari on the same page, and in such bilingual documents the two scripts are generally mixed within a single text line. Methods exist for recognizing each script on its own, but they lack the ability to recognize multiple scripts mixed within a single text line.
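
The abstract does not spell out the steps of the proposed method, so the following is only an illustrative sketch of the general task, not the authors' algorithm. It assumes a binarized text line (1 = ink, 0 = background) and relies on a commonly used cue: Devanagari words usually carry a continuous headline (shirorekha) across their top, while Latin words do not. The function names, the `min_gap` and `min_fill` thresholds, and the list-of-lists image representation are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary text-line image: 1 = ink, 0 = background


def split_words(line: Image, min_gap: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of words, end exclusive.

    Words are split wherever the vertical projection profile shows a run
    of at least `min_gap` empty columns (an assumed threshold).
    """
    width = len(line[0])
    ink_cols = [any(row[x] for row in line) for x in range(width)]
    words, start, gap = [], None, 0
    for x, has_ink in enumerate(ink_cols):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                words.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        words.append((start, width - gap))
    return words


def has_headline(line: Image, start: int, end: int,
                 min_fill: float = 0.8) -> bool:
    """True if some row in the upper half of the word is almost fully inked.

    Assumption: a row whose ink covers at least `min_fill` of the word
    width is treated as a Devanagari headline.
    """
    inked = [y for y, row in enumerate(line) if any(row[start:end])]
    top, bottom = inked[0], inked[-1]
    mid = top + (bottom - top + 1) // 2
    upper = [line[y][start:end] for y in range(top, max(mid, top + 1))]
    return any(sum(r) >= min_fill * len(r) for r in upper)


def remove_latin(line: Image, min_gap: int = 3,
                 min_fill: float = 0.8) -> Image:
    """Return a copy of the line with headline-less words blanked out."""
    out = [row[:] for row in line]
    for start, end in split_words(line, min_gap):
        if not has_headline(line, start, end, min_fill):
            for row in out:
                row[start:end] = [0] * (end - start)
    return out
```

Calling `remove_latin(line)` on a mixed-script line returns a new image in which words without a headline are erased, leaving the remaining words for a Devanagari OCR engine. The heuristic is deliberately simple: a short Latin word dominated by a top bar (for example a lone capital "T") could be mistaken for Devanagari, and the thresholds would need tuning for real scans.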

Keywords
OCR, Image Document, Devanagari Script, Latin Script