eBook - ePub

Document Processing Using Machine Learning

Name: Document Processing Using Machine Learning
Author: Sk Md Obaidullah, KC Santosh, Teresa Goncalves, Nibaran Das, Kaushik Roy

168 pages
English

About This Book

Document Processing Using Machine Learning presents a set of resources for students and researchers who apply machine learning in the document image analysis (DIA) domain, covering multiple document processing problems. Starting with an explanation of how Artificial Intelligence (AI) plays an important role in this domain, the book then discusses how different machine learning algorithms can be applied to classification/recognition and clustering problems regardless of the type of input data: images or text.

In brief, the book offers comprehensive coverage of the most essential topics, including:

· The role of AI for document image analysis

· Optical character recognition

· Machine learning algorithms for document analysis

· Extreme learning machines and their applications

· Mathematical foundation for Web text document analysis

· Social media data analysis

· Modalities for document dataset generation

This book serves both undergraduate and graduate scholars in Computer Science/Information Technology/Electrical and Computer Engineering. Further, it is a great fit for early career research scientists and industrialists in the domain.

Information

Publisher: Chapman and Hall/CRC
Year: 2019
ISBN: 9781000739831
Topic: Computer Science
Subtopic: Game Programming

1 Artificial Intelligence for Document Image Analysis

Himadri Mukherjee, Payel Rakshit, Ankita Dhar, Sk Md Obaidullah, KC Santosh, Santanu Phadikar and Kaushik Roy

CONTENTS

1.1 Introduction

1.2 Optical Character Recognition

1.2.1 Dealing with Noise

1.2.2 Segmentation

1.2.3 Applications

1.2.3.1 Legal Industry

1.2.3.2 Banking

1.2.3.3 Healthcare

1.2.3.4 CAPTCHA

1.2.3.5 Automatic Number Recognition

1.2.3.6 Handwriting Recognition

1.3 Natural Language Processing

1.3.1 Tokenization

1.3.2 Stop Word Removal

1.3.3 Stemming

1.3.4 Part of Speech Tagging

1.3.5 Parsing

1.3.6 Applications

1.3.6.1 Text Summarization

1.3.6.2 Question Answering

1.3.6.3 Text Categorization

1.3.6.4 Sentiment Analysis

1.3.6.5 Word Sense Disambiguation

1.4 Conclusion

References

1.1 Introduction

Rapid developments in technology have aided the digitization of documents, and the number of digital documents has increased significantly over time [1, 2]. Information is now readily available on the Internet and can be distributed with ease. Such a large volume of documents demands efficient processing. Digitized documents can be broadly categorized into two types, namely document images and text documents. In the case of document images, it is first essential to determine what is written. This requires optical character recognition (OCR) [3–5]. Once the characters are identified, approaches based on natural language processing (NLP) [6–8] need to be used to understand what is written. In the case of text documents, the characters are already machine-readable, so NLP can be applied directly. Research in the fields of OCR and NLP started in the last century, and different systems for languages like English are now commercially available [9–11], but there have not been significant developments for Indic languages. One reason for this is the complex nature of Indic scripts [12]. This is compounded by the unavailability of standard (and free) datasets for research.

1.2 Optical Character Recognition

Optical character recognition [13, 14] refers to the task of decoding what is written in a document. It does not involve understanding the written text. Instead, it takes a scan or a picture of a document, identifies the characters, and outputs the identified words and characters as text. The document can be either handwritten or printed. In the case of handwritten documents, there are several variations which need to be considered prior to recognition. Handwritten text often shows differing degrees of slant, and it is very important to deal with such slants while processing the documents. A document with characters having multiple degrees of slant is presented in Figure 1.1.

FIGURE 1.1
A document depicting multiple degrees of slant for different characters.
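The chapter does not say how slant should be handled. As one possible illustration, the sketch below uses a common moment-based shear correction, which is an assumption rather than the authors' method. It treats a binary image as a NumPy array in which nonzero pixels are ink, estimates how far the ink drifts horizontally per row, and shifts each row back. The function names `estimate_slant` and `deslant` are hypothetical.

```python
import numpy as np


def estimate_slant(img):
    """Estimate the slant as horizontal drift (in pixels) per image row.

    Uses second-order central moments of the ink pixels: the least-squares
    slope of x against y is mu11 / mu02. This is an illustrative choice,
    not a method taken from the chapter.
    """
    ys, xs = np.nonzero(img)
    dx = xs - xs.mean()
    dy = ys - ys.mean()
    mu11 = float((dx * dy).sum())
    mu02 = float((dy * dy).sum())
    if mu02 == 0.0:
        # All ink lies on a single row, so no slant can be measured.
        return 0.0
    return mu11 / mu02


def deslant(img):
    """Shear the ink pixels horizontally so the estimated slant is removed."""
    alpha = estimate_slant(img)
    height, width = img.shape
    ys, xs = np.nonzero(img)
    # Shift every ink pixel back by the drift expected for its row.
    new_xs = np.rint(xs - alpha * (ys - ys.mean())).astype(int)
    # Drop pixels that would fall outside the image after shearing.
    keep = (new_xs >= 0) & (new_xs < width)
    out = np.zeros_like(img)
    out[ys[keep], new_xs[keep]] = img[ys[keep], xs[keep]]
    return out


if __name__ == "__main__":
    # Toy input: a single stroke that leans to the right as y increases.
    stroke = np.zeros((20, 30), dtype=np.uint8)
    for y in range(20):
        stroke[y, 5 + y // 2] = 1
    print("slant before:", estimate_slant(stroke))
    print("slant after: ", estimate_slant(deslant(stroke)))
```

A single global shear like this only suits text with a roughly uniform slant. A page such as the one in Figure 1.1, where the slant differs between characters, would need the estimate applied per word or per character after segmentation.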

The second important factor which needs to be tackled is the similarity between different characters. For instance, the numeral “3” is similar to “ত” in Bangla. This is illustrated in Figure 1.2. It is very important to handle such cases, because a single misread character can change the meaning of the entire sentence.

FIGURE 1.2
Similarity between different characters in Bangla.

Another important aspect is inter-writer and intra-writer variation. Handwritten texts often show variation at the character level: the same character looks slightly different each time the same person writes it. This is known as intra-writer variation. Another variation is observed when two different writers write the same thing, since the handwriting of different people differs in most cases. This is known as inter-writer variation. A recognition system should be able to handle both kinds of difference. Inter- and intra-writer variations for a Bangla text are presented in Figures 1.3 and 1.4.
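To make the two kinds of variation concrete, the sketch below compares them numerically under a simplifying assumption that is not from the chapter: each handwritten sample of a character has already been reduced to a fixed-length feature vector. Intra-writer variation is then taken as the mean distance between samples from one writer, and inter-writer variation as the mean distance between samples from two writers. The function names and the toy vectors are hypothetical.

```python
from itertools import combinations

import numpy as np


def intra_writer_variation(samples):
    """Mean Euclidean distance between all pairs of samples from one writer."""
    distances = [np.linalg.norm(a - b) for a, b in combinations(samples, 2)]
    return float(np.mean(distances))


def inter_writer_variation(samples_a, samples_b):
    """Mean Euclidean distance between samples of two different writers."""
    distances = [np.linalg.norm(a - b) for a in samples_a for b in samples_b]
    return float(np.mean(distances))


if __name__ == "__main__":
    # Made-up 2-D feature vectors for the same character, two samples each.
    writer_a = np.array([[0.0, 0.0], [0.0, 2.0]])
    writer_b = np.array([[3.0, 4.0], [3.0, 6.0]])
    print("intra (writer A):", intra_writer_variation(writer_a))
    print("inter (A vs B):  ", inter_writer_variation(writer_a, writer_b))
```

In this toy data the distance between writers is larger than the distance within a writer, which is the typical situation the paragraph describes. Real features and real handwriting need not separate so cleanly.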

F...

Cover
Half-Title
Title
Copyright
Contents
Preface
Editors
Contributors
1. Artificial Intelligence for Document Image Analysis
2. An Approach toward Character Recognition of Bangla Handwritten Isolated Characters
3. Artistic Multi-Character Script Identification
4. A Study on the Extreme Learning Machine and Its Applications
5. A Graph-Based Text Classification Model for Web Text Documents
6. A Study of Distance Metrics in Document Classification
7. A Study of Proximity of Domains for Text Categorization
8. Supervised Learning for Aggression Identification and Author Profiling over Twitter Dataset
9. The Effect of Using Features Computed from Generated Offline Images for Online Bangla Handwritten Character Recognition
10. Handwritten Character Recognition for Palm-Leaf Manuscripts
Index