Enhanced portable text to speech converter for visually impaired

Chithra Selvaraj* and N. Bhalaji
Department of Information Technology, SSN College of Engineering, Chennai, 603110, India
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: The handwritten text reader is designed to help the visually impaired listen to an audio read-back of printed and handwritten scanned text. A hand-held page scanner is used to scan the text to be read. The image from the scanner is sent to the application on the paired Android phone over Bluetooth. An open source optical character recognition (OCR) engine, Tesseract, is used to extract the text from the image, and the extracted text is converted to speech. The Tesseract OCR engine is further trained with handwritten datasets to recognise the handwriting of a specific user. In addition to English, the application supports two regional languages, Hindi and Bengali.

Keywords: handwriting recognition; Tesseract; OCR; optical character recognition; text to speech; Android; visually impaired.

Reference to this paper should be made as follows: Selvaraj, C. and Bhalaji, N. (2018) 'Enhanced portable text to speech converter for visually impaired', Int. J. Intelligent Systems Technologies and Applications, Vol. 17, Nos. 1/2, pp.42–54.

Biographical notes: Chithra Selvaraj is an Associate Professor at SSN College of Engineering, Chennai, with over 18 years of teaching experience. She completed her Bachelor of Engineering degree in Computer Science and Engineering from Bharathiar University with University sixth rank, and received her Master's degree in Computer Science and Engineering from Sathyabama University. She received her PhD from Anna University, Chennai, where her research work is in the networks domain, and she is a recognised Supervisor at Anna University. She received the 'Best Paper' award at the conference ICCNT 2009, Singapore, has published over 30 research papers in refereed international and national journals, and is an IEEE Senior Member.

N. Bhalaji is an Associate Professor of Information Technology with over 13 years of teaching experience. He received his BE and ME, both in Computer Science and Engineering, and his PhD, specialising in a trust-based routing approach for MANETs, from Anna University, Chennai. His current research interests include security in distributed systems and wireless mobile networks. He has published over 25 research papers in refereed journals in the areas of ad hoc networks, soft computing and the Internet of Things. He is an active member of professional bodies such as IEEE, ACM, IACSIT, IAENG and ACEEE, and an active reviewer for reputed publishers such as Elsevier, IEEE and Springer.

This paper is a revised and expanded version of a paper entitled 'Portable text to speech converter for the visually impaired' presented at the International Conference on Soft Computing Systems, Thukulay, India, 20–21 April, 2015.

1 Introduction

Visually impaired people cannot read handwritten or printed text and depend on a third person. Although Braille books are available, they are expensive and most blind people cannot read Braille.
Using technologies like optical character recognition (OCR) and text to speech (TTS), a system can be developed to help the visually impaired listen to an audio read-back of any text from documents, books or newspapers. The system has two main modules: extracting the text from an image, and converting the extracted text to speech.

Tesseract is an optical character recognition engine that is considered one of the most accurate open source engines currently available. To recognise text in an image, Tesseract first applies adaptive thresholding to convert the input image to a binary image. Connected component analysis is then performed and the outlines of each component are stored. Lines of text are broken into words according to the character spacing, and recognition is carried out in two passes for higher accuracy. The same sequence of steps is followed for both printed and handwritten text; however, the OCR engine must be trained with user-specific training data to recognise handwritten text.

A text to speech (TTS) API is used to convert the extracted text into speech. The following operations are carried out on the input text: detection, analysis, normalisation and linearisation. Speech is then synthesised by performing phonetic analysis and acoustic processing, and a model of the vocal tract is used to produce the voice output (a minimal code sketch of this step is given at the end of Section 2).

The organisation of this paper is as follows. Section 2 presents a detailed literature survey of existing systems, Section 3 describes the system design, and Section 4 discusses the implementation of the proposed system. The results are discussed in Section 5. Finally, Section 6 concludes the work.

2 Related works

The handwritten text reader for the visually impaired is an extension of a system that was developed to help the visually impaired listen to an audio read-back of printed text only (Ragavi et al., 2016). This extension adds two main functionalities: recognition of handwritten text, and support for two regional languages (Hindi and Bengali).

An overview of the Tesseract OCR engine has been given by Smith (2007), and Sasirekha and Chandra (2012) have described the process of text to speech synthesis. Mithe et al. (2013) have proposed an Android application which obtains images from high-resolution mobile phone cameras and performs image to speech conversion. This application is proposed for use in fields like office automation and banking, but it is not suitable for visually impaired people because they have difficulty capturing pictures of the text with mobile phone cameras. Gaudissart et al. (2005) have proposed a similar system named SYPOLE which performs the same function but uses a personal digital assistant (PDA). The input image is captured by a camera embedded in the PDA, which also makes this system infeasible for visually impaired people.

Rakshit and Basu (2010) have described the process of training the Tesseract OCR engine to recognise handwritten text. Tesseract is trained with user-specific handwriting samples in two groups of data: the first group contains isolated lower case characters and the second contains free-flow text. The overall accuracy of their system was found to be 78.39%. The Tesseract training process can be split into three modules: collection of a dataset, labelling the training data, and training Tesseract with the labelled data.
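The paper does not list application code, but the speech-synthesis step described in Section 1 maps directly onto Android's standard TextToSpeech API, which the implementation names later. The following is a minimal sketch of that step; the class and method names other than the TextToSpeech calls are illustrative, and Hindi or Bengali output would additionally require a TTS engine with matching voices installed.

    import java.util.Locale;
    import android.content.Context;
    import android.speech.tts.TextToSpeech;

    // Minimal sketch of the speech-output step; "SpeechOutput" and
    // "speakOut" are illustrative names, not from the paper.
    public class SpeechOutput implements TextToSpeech.OnInitListener {

        private final TextToSpeech tts;
        private boolean ready = false;

        public SpeechOutput(Context context) {
            // The engine initialises asynchronously; onInit() fires when done.
            tts = new TextToSpeech(context, this);
        }

        @Override
        public void onInit(int status) {
            if (status == TextToSpeech.SUCCESS) {
                // new Locale("hi", "IN") or new Locale("bn", "IN") would be set
                // for Hindi or Bengali, if the installed engine supports them.
                tts.setLanguage(Locale.ENGLISH);
                ready = true;
            }
        }

        public void speakOut(String recognisedText) {
            if (ready) {
                // QUEUE_FLUSH interrupts any pending utterance and reads
                // the newly recognised text immediately.
                tts.speak(recognisedText, TextToSpeech.QUEUE_FLUSH, null, "ocr-read-back");
            }
        }
    }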
3 System design

Using the hand-held page scanner, the visually impaired can scan documents containing printed or handwritten text. The scanned image is then sent over Bluetooth to the Android mobile phone that is paired with the scanner.

Figure 1 Architecture of the handwritten-text reader

The Android application obtains the language of the text to be processed (English, Hindi or Bengali) from the user as voice input. The application then opens the most recently received scanned image from the Bluetooth folder and performs OCR and speech synthesis. Since Tesseract cannot recognise handwritten text out of the box, it has been trained to create a new language set for identifying handwritten text. If the scanned image contains handwritten English text, Tesseract performs character recognition using the .traineddata file created after the collection of datasets, the labelling of the training data and the training of the Tesseract OCR engine. If the scanned image contains printed English, Hindi or Bengali text, the existing .traineddata file available for that language is used while performing OCR.

The architecture of the system shown in Figure 1 describes the steps involved in training Tesseract to identify English handwritten text. This training step is skipped for scanned images with printed text in English, Hindi or Bengali, for which the existing .traineddata files are used for character recognition. Tesseract can be trained to identify and perform OCR on new languages, and it can also be trained to recognise handwritten text in languages other than English. Languages other than English, Hindi and Bengali can be included in the system provided the TextToSpeech library supports them; if a language is not supported by the TextToSpeech library, transliteration must be performed or external libraries must be imported.

3.1 Algorithm

The flow diagram of the handwritten-text reader is shown in Figure 2; a condensed code sketch of Steps 3 to 6 is given after the list.

Step 1: Scan the document containing printed or handwritten text using the hand-held page scanner and transfer the scanned image over Bluetooth to the Android mobile phone that is paired with the scanner.

Step 2: The Android application greets the user and obtains the input language, i.e., the language of the scanned text, from the user.

Step 3: The application then opens the most recently received scanned image from the Bluetooth folder using the LastModifiedFileComparator class of Apache Commons IO.

Step 4: If the scanned image contains printed text, the TessBaseAPI library of Tesseract uses the existing .traineddata file available for the input language to perform OCR.

Step 5: If the scanned image contains English handwritten text, the .traineddata file generated by the manual training of Tesseract is used for performing OCR.

Step 6: The editable text output of the OCR is converted to speech and conveyed to the visually impaired user using the Java TextToSpeech library.
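The sketch below condenses Steps 3 to 6, assuming the tess-two Android wrapper for Tesseract (which provides the TessBaseAPI class the paper names) and Apache Commons IO for LastModifiedFileComparator; the class name, method name and directory arguments are illustrative.

    import java.io.File;
    import java.util.Arrays;
    import org.apache.commons.io.comparator.LastModifiedFileComparator;
    import android.graphics.Bitmap;
    import android.graphics.BitmapFactory;
    import com.googlecode.tesseract.android.TessBaseAPI;  // tess-two wrapper

    public class ReaderPipeline {

        // Steps 3-5: open the newest scan in the Bluetooth folder and run OCR.
        // "lang" selects the .traineddata file, e.g. "eng", "hin", "ben", or the
        // handwriting set trained in Section 4.
        public String recogniseLatestScan(File bluetoothDir, String dataPath, String lang) {
            File[] scans = bluetoothDir.listFiles();
            // Sort so the most recently received image comes first (Step 3).
            Arrays.sort(scans, LastModifiedFileComparator.LASTMODIFIED_REVERSE);
            Bitmap image = BitmapFactory.decodeFile(scans[0].getAbsolutePath());

            TessBaseAPI ocr = new TessBaseAPI();
            ocr.init(dataPath, lang);   // dataPath must contain tessdata/<lang>.traineddata
            ocr.setImage(image);
            String text = ocr.getUTF8Text();
            ocr.end();
            return text;                // Step 6 hands this text to TextToSpeech
        }
    }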
3.2 Description of the algorithm

Tesseract assumes that its input is a binary image with optional polygonal text regions defined (Marosi, 2007). The first step of processing is a connected component analysis in which the outlines of the components are stored; Tesseract was probably the first OCR engine able to handle white-on-black text so trivially. At this stage, outlines are gathered together, purely by nesting, into blobs. Blobs are organised into text lines, and the lines and regions are analysed for fixed pitch or proportional text.

Text lines are broken into words differently according to the kind of character spacing. Fixed pitch text is chopped immediately by character cells, while proportional text is broken into words using definite spaces and fuzzy spaces. Recognition then proceeds as a two-pass process (Pazio et al., 2007). In the first pass, an attempt is made to recognise each word in turn, and each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to recognise text lower down the page more accurately. Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run over the page, in which words that were not recognised well enough are recognised again. A final phase resolves fuzzy spaces and checks alternative hypotheses for the x-height to locate small-cap text. The process is illustrated in Figure 3.

Figure 2 Flow diagram of the system

Figure 3 Block diagram showing the steps of the algorithm used for OCR (see online version for colours)

4 Implementation

For training Tesseract (Chandran et al., 2015; Rakshit et al., 2009), datasets containing isolated and free-flow text must be collected from different users using the portable scanner device, as shown in Figure 4. These datasets should then be labelled using box editors. Once the datasets are ready, Tesseract can be trained to recognise handwritten text. These three steps are explained in detail in this section.

Figure 4 Usage of the page scanner (see online version for colours)

4.1 Collection of dataset

The collection of the dataset concentrated on lower case characters. Thirteen handwritten document pages in lower case were collected from each user, divided into two types of handwritten datasets: the first dataset contains six pages of isolated handwritten lower case characters, as shown in Figure 5, and the second dataset contains seven pages of free-flow handwritten text, as shown in Figure 6.

Figure 5 Sample page containing the training set of isolated characters (see online version for colours)

Figure 6 Sample page containing the training set of free-flow characters (see online version for colours)

4.2 Labelling training data

The training samples must be labelled. This is done using a box editor called jTessBoxEditor. The following commands generate box files for each image:

    tesseract engp.hw.exp0.jpg engp.hw.exp0 batch.nochop makebox
    ...
    tesseract engp.hw.exp19.jpg engp.hw.exp19 batch.nochop makebox

A box file is a text file which contains each character in the training image, one per line, with the coordinates of the bounding box around the character. In jTessBoxEditor, the bounding boxes of incorrectly recognised characters can be merged, split, inserted or deleted as needed, as shown in Figure 7.

Figure 7 Screenshot of jTessBoxEditor (see online version for colours)
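For reference, the box file format described above stores one recognised symbol per line, followed by the bounding-box coordinates (left, bottom, right, top, measured from the bottom-left corner of the image) and a page number. The characters and coordinates in this short excerpt are invented for illustration:

    e 52 28 78 60 0
    n 84 26 108 60 0
    g 114 12 138 60 0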
4.3 Training Tesseract OCR engine

The steps in training the Tesseract OCR engine are explained in detail in this section (Banerjee, 2010). The first step is to run the Tesseract engine in training mode using the following command:

    tesseract engp.hw.exp0.jpg engp.hw.exp0 nobatch box.train

The next step is to generate the unicharset file:

    unicharset_extractor engp.hw.exp0.box engp.hw.exp1.box ... engp.hw.exp11.box engp.hw.exp19.box

The shapeclustering command generates the shape table:

    shapeclustering -F font_properties -U unicharset engp.hw.exp0.tr ... engp.hw.exp19.tr

The cntraining command is used for character normalisation:

    cntraining engp.hw.exp0.tr ... engp.hw.exp19.tr

Finally, the combine_tessdata command is given:

    combine_tessdata engp.

This command generates the engp.traineddata file, which is placed in the Assets folder of the Android Studio project (a sketch of loading this file at run time is given below). The system also supports two regional languages, Hindi and Bengali: the training data for Hindi and Bengali is added to the Assets folder, and when the user selects Hindi or Bengali, the appropriate training data is selected and optical character recognition is performed with it.
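In the tess-two wrapper, TessBaseAPI.init() reads .traineddata files from the device file system rather than from inside the APK, so in practice a file bundled in the Assets folder is copied out once before OCR. A hedged sketch of that step follows; the helper class, method name and target directory are illustrative, not from the paper.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import android.content.Context;

    public class TrainedDataInstaller {

        // Copy assets/tessdata/<lang>.traineddata to <filesDir>/tessdata/ so that
        // TessBaseAPI.init(ctx.getFilesDir().getPath(), lang) can locate it.
        public static void copyTrainedData(Context ctx, String lang) throws IOException {
            File tessdata = new File(ctx.getFilesDir(), "tessdata");
            tessdata.mkdirs();
            File out = new File(tessdata, lang + ".traineddata");
            if (out.exists()) return;   // already installed on an earlier run

            try (InputStream in = ctx.getAssets().open("tessdata/" + lang + ".traineddata");
                 OutputStream os = new FileOutputStream(out)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    os.write(buf, 0, n);
                }
            }
        }
    }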
5 Results

The results observed using the scanner have been compared with the existing system, which used a 5-megapixel phone camera to capture the page. A sample image with 90 words is considered for comparison, and the result of the comparison is shown in Table 1.

Table 1 Comparison of the proposed system with the existing system

    Comparison parameter    Portable scanner    Camera
    Image size              0.45 MB             1.4 MB
    Accuracy                98%                 85%
    Delay                   7 s                 12 s

The accuracy of the system is determined by the following formula:

    Accuracy (%) = ((Total number of words − Number of recognition errors) / Total number of words) × 100

For the 90-word sample file, the number of recognition errors is 1 word with the scanner and 13 words with the phone camera, giving (90 − 1)/90 × 100 ≈ 98% and (90 − 13)/90 × 100 ≈ 85%, respectively. The delay parameter specifies the time before speech output, which is 7 s for the scanner and 12 s for the phone camera. The observed results show that the scanned image is smaller than the camera image, that the processing time of the proposed system is lower, and that its recognition accuracy is higher than that of the existing system.

The screenshot of the application after performing OCR on a sample printed English text is shown in Figure 8. The scanned image used by the application is displayed as an image view at the bottom left corner of the screen for reference. On pressing the button at the top, the application performs OCR and speech synthesis of the most recently received image without any further manual intervention. Figure 9 shows the application after performing OCR on a sample handwritten English text, and Figures 10 and 11 show the application after performing OCR on a sample printed Hindi text and a sample printed Bengali text (Shahiduzzaman, 2015), respectively.

The accuracy of an OCR system can be measured at the character level or the word level. A system is said to be 99% accurate if 1 out of 100 characters is uncertain or incorrect, and 99.9% accurate if 1 out of 1000 characters is uncertain or incorrect. To attain word level accuracy, the system should allow the proper, relevant words to be found, such as city names and person names. There is no simple way to measure word level accuracy, so an intelligent, fuzzy search technology must be used. The system has been tested with three OCR engines: Tesseract, ABBYY and Microsoft Oxford. The accuracy of the system has been measured for the languages Hindi and Bengali, and character recognition of handwritten text has also been performed and its accuracy measured. Table 2 shows the accuracy comparison between the Tesseract OCR engine used in this project and the other commercial OCR engines.

Figure 8 Scanned image and screenshot showing the optical character recognition of English printed text (see online version for colours)

Figure 9 Scanned image and screenshot showing the optical character recognition of English handwritten text (see online version for colours)

Figure 10 Scanned image and screenshots showing the optical character recognition of Hindi printed text (see online version for colours)

Figure 11 Scanned image and screenshots showing the optical character recognition of Bengali printed text (see online version for colours)

Table 2 Comparison of OCR engines

    Language                Tesseract OCR engine    ABBYY OCR engine    Microsoft Oxford
    English handwriting     88.6%                   77.3%               73.5%
    Hindi                   82.4%                   Not supported       Not supported
    Bengali                 84.3%                   Not supported       Not supported

6 Conclusion

The portable text to speech converter has been implemented for handwritten text for visually impaired people. The system helps the visually impaired listen to an audio read-back of any scanned text by sending the image from the scanner to an Android mobile phone via Bluetooth. The major advantage of the system is that it uses a document scanner which can scan an entire page, so visually impaired people can easily scan a document without needing to aim a camera at it. Additionally, the application supports two regional languages, Hindi and Bengali. Moreover, the scanned image may contain text with background pictures; these are simply ignored, and only the text in the scanned image is extracted by the application and converted to speech. The project is implemented using a hand-held page or document scanner, an external Bluetooth module when the scanner does not have an inbuilt one, an Android mobile phone, and an Android application that performs OCR and speech synthesis. The cost involved in developing the system is significantly low, and the system provides a friendly user interface for visually impaired people.

References

Banerjee, S. (2010) A Study on Tesseract Open Source Optical Character Recognition Engine, Thesis, Jadavpur University.

Chandran, P., Aravind, S., Gopinath, J. and Saranya, S.S. (2015) 'Text to speech conversion system using OCR', International Journal of Emerging Technology and Advanced Engineering (IJETAE), Vol. 5, No. 1, pp.389–395.

Gaudissart, V., Ferreira, S., Mancas-Thillou, C. and Gosselin, B. (2005) 'SYPOLE: a mobile assistant for the blind', Proceedings of the European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey.

Marosi, I. (2007) 'Industrial OCR approaches: architecture, algorithms and adaptation techniques', Document Recognition and Retrieval XIV, SPIE, January 2007, 6500–01.

Mithe, R., Indalkar, S. and Divekar, N. (2013) 'Optical character recognition', International Journal of Recent Technology and Engineering (IJRTE), Vol. 2, pp.72–75.

Pazio, M., Niedzwiecki, M., Kowalik, R. and Lebiedz, J. (2007) 'Text detection system for the blind', 15th European Signal Processing Conference, IEEE, September, pp.272–276.

Ragavi, K., Radja, P. and Chithra, S. (2016) 'Portable text to speech converter for the visually impaired', Proceedings of the International Conference on Soft Computing Systems, Springer, pp.751–758.

Rakshit, S. and Basu, S. (2010) Development of a Multi-User Handwriting Recognition System Using Tesseract Open Source OCR Engine, arXiv preprint arXiv:1003.5886.

Rakshit, S., Kundu, A. and Maity, M. (2009) 'Recognition of handwritten Roman numerals using Tesseract open source OCR engine', Proc. Int. Conf. on Advances in Computer Vision and Information Technology, pp.572–577.

Sasirekha, D. and Chandra, E. (2012) 'Text to speech: a simple tutorial', International Journal of Soft Computing and Engineering, Vol. 2, No. 1, pp.275–278.

Shahiduzzaman, Md. (2015) 'Bangla handwritten character recognition', International Journal of Science and Research (IJSR), Vol. 4, No. 11, November.

Smith, R. (2007) 'An overview of the Tesseract OCR engine', ICDAR, IEEE, pp.629–633.