eBook - ePub

Multilingual Text Analysis

Name: Multilingual Text Analysis
Author: Marina Litvak, Natalia Vanetik

Challenges, Models, and Approaches

500 pages
English

About This Book

Text analytics (TA) covers a very wide research area. Its overarching goal is to discover and present knowledge — facts, rules, and relationships — that is otherwise hidden in the textual content. The authors of this book guide us in a quest to attain this knowledge automatically, by applying various machine learning techniques.

This book describes recent developments in multilingual text analysis. It covers several specific examples of practical TA applications, including their problem statements, theoretical background, and implementation of the proposed solution. The reader can see which preprocessing techniques and text representation models were used, how the evaluation process was designed and implemented, and how these approaches can be adapted to multilingual domains.

Contents:

Multilingual Text Analysis: History, Tasks, and Challenges (Natalia Vanetik and Marina Litvak)
Using a Polytope Model for Unsupervised Document Summarization (Natalia Vanetik and Marina Litvak)
MDL Approach for Unsupervised Multilingual Document Summarization (Natalia Vanetik and Marina Litvak)
Rich Feature Spaces and Regression Models in Single-Document Extractive Summarization (Alexander Dlikman, Marina Litvak, and Mark Last)
Hierarchical Topic Model and Summarization (Lei Li and Yazhao Zhang)
A Survey of Neural Models in Abstractive Summarization (Tal Baumel and Michael Elhadad)
Headline Generation as a Sequence Prediction with Conditional Random Fields (Carlos A Colmenares, Marina Litvak, Amin Mantrach, Fabrizio Silvestri, and Horacio Rodríguez)
Crowdsourcing in Single-document Summary Evaluation: The Argo Way (Nikiforos Pittaras, Stefano Montanelli, George Giannakopoulos, Alfio Ferrara, and Vangelis Karkaletsis)
Multilingual Summarization and Evaluation Using Wikipedia Featured Articles (John M Conroy, Jeff Kubina, Peter A Rankel, and Julia S Yang)
Are Better Summaries Also Easier to Understand? Analyzing Text Complexity in Automatic Summarization (Elena Lloret, Tatiana Vodolazova, Paloma Moreda, Rafael Muñoz, and Manuel Palomar)
Twitter Event Detection, Analysis, and Summarization (Natalia Vanetik, Marina Litvak, Efi Levi, and Andrey Vashchenko)
Linguistic Bias in Crowdsourced Biographies: A Cross-lingual Examination (Jahna Otterbacher, Ioannis Katakis, and Pantelis Agathangelou)
Multilingual Financial Narrative Processing: Analyzing Annual Reports in English, Spanish, and Portuguese (Mahmoud El-Haj, Paul Rayson, Paulo Alves, Carlos Herrero-Zorita, and Steven Young)

Readership: This book is intended for both students and professionals.

Keywords: Generation; Text Extraction; Deep Learning; Topic Model; Neural Model; Minimal Description Length principle; Regression Model; Summary Evaluation; Text Complexity; Social Media; Event Detection; Event Analysis; Linguistic Bias; Crowdsourcing; Financial Narrative Processing

Key Features:

Summarizes up-to-date research of the MultiLing scientific community
Focuses on language-independent methods and models
Covers a broad range of text analysis tasks

Information

Publisher: WSPC
ISBN: 9789813274891

Chapter 1 Multilingual Text Analysis: History, Tasks, and Challenges

Natalia Vanetik and Marina Litvak

Shamoon College of Engineering, Software Engineering Department, Byalik 56, Beer Sheva 84100

Text analytics (TA) is a very broad research area that deals with knowledge discovery in written text. Almost all techniques of machine learning, data mining, and information retrieval are applied to TA tasks, which include text categorization, summarization, question answering, and many more. Among the large variety of TA methods, multilingual techniques hold a special place. To be deemed multilingual, a system or algorithm must handle texts in several languages equally well; a very good method should produce good results for languages from different language families. Multilingual techniques and algorithms therefore need analysis that is not tied to the linguistic structure of text in one specific language but relies instead on general statistical and mathematical properties common to many languages.

In this chapter we provide an overview of the field of multilingual text analysis, starting with a description of various TA tasks and the history of TA. We then survey TA challenges related to the multilingual domain.

1. Introduction

Text analytics is a very wide research area. Its overarching goal is to discover and present knowledge — facts, rules, and relationships — that is otherwise hidden in textual content and unattainable by automated processing. Prior to applying analytical methods, text needs to be turned into structured data through the application of natural language processing (NLP). Then, data mining techniques, including link and association analysis, visualization, and predictive analytics, can be applied to the structured input and used to produce a requested output. Typical TA tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, question answering, slot filling, and entity relation modeling.

A list of possible subtasks composing the TA process includes but is not limited to:

• Information retrieval (IR) as a preparatory step: collecting or identifying a set of textual materials for analysis; the set may comprise material found in any number of places, including the Web or a file system, database, or content corpus manager;

• Advanced statistical methods, like computing word frequency distributions;

• Extensive NLP, such as part of speech (POS) tagging, syntactic parsing, and other types of linguistic analysis;

• Named entity recognition (NER) using gazetteers or statistical techniques to identify named text features such as people, organizations, places, or certain abbreviations;

• Disambiguation, which uses contextual clues to decide whether, for instance, “apple” refers to a fruit, a software company, a multimedia corporation, a movie, or some other entity;

• Recognition of pattern-identified entities: features such as telephone numbers, email addresses, or quantities (with units), can be discerned through regular expression or other pattern matches;

• Coreference resolution involves finding all expressions that refer to the same entity in a text;

• Relationship, fact, concept, and event extraction involve the identification of associations among entities and other information in text;

• Sentiment analysis involves discerning subjective material and extracting various forms of attitudinal information, such as sentiment, opinion, mood, and emotion;

• Topic modeling enables the discovery of the abstract “topics” that occur in a collection of documents;

• Quantitative TA is the process of extracting semantic or grammatical relationships between words.
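
Two of the simpler subtasks listed above, computing a word frequency distribution and recognizing pattern-identified entities, can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the function names, the sample text, and the regular expressions are hypothetical and not taken from the chapter, and the patterns are deliberately simplistic, so they would miss many real-world formats.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
QUANTITY_RE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:km|kg|m|g)\b")
# In Python 3 the word class matches Unicode letters, so this crude tokenizer
# is not tied to English (it is still inadequate for unsegmented scripts).
TOKEN_RE = re.compile(r"\w+")


def word_frequencies(text):
    """Count lowercased word tokens in the text."""
    return Counter(TOKEN_RE.findall(text.lower()))


def pattern_entities(text):
    """Return entities that can be found by pattern matching alone."""
    return {
        "emails": EMAIL_RE.findall(text),
        "quantities": QUANTITY_RE.findall(text),
    }


sample = "Send the 5 kg parcel to anna@example.org. The parcel travels 120 km."
freqs = word_frequencies(sample)
entities = pattern_entities(sample)
print(freqs.most_common(3))
print(entities)
```

Frequency counts of this kind rely only on token statistics, which is one reason such features carry over to other languages more easily than POS tagging or parsing, both of which need language-specific resources.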

This chapter introduces the main directions and challenges in TA, both in general and with respect to the multilingual domain. The next section summarizes the history of the TA area. Section 3 describes primary TA subareas and tasks. Section 4 provides an overview of challenges to the use of TA in the multilingual domain. Section 5 provides a brief overview of the remaining chapters in the book.

2. TA evolution

The ability to understand the key content of a text has recently become extremely important, as more and more sources in different languages become available on the net. New ideas for interesting, and even crucial, applications arise every day. Extracting the most critical facts and reducing information overload, mining opinions from social media and other domains, predicting important events, and detecting fraud and security threats are just a small sample of current “hot topics” in the TA area. Globalization dictates its own rules: more text sources are published in their original language rather than in English, the de facto international standard. Therefore, all proposed methodologies must meet an additional requirement: they must be able to process multiple languages.

The idea of using computers to analyze text and search for relevant pieces of information was raised for the first time in an article by Vannevar Bush in 1945.¹ In the 1950s, this idea was followed by several works. One of the most influential was the 1957 work of Luhn,² where he proposed to use words as indexing units for documents, and to measure word overlap as a criterion for retrieval. A year later, Luhn published the first work on automated summarization,³ where he proposed a statistical method for ranking sentences. Several key developments in the field happened in the 1960s. Most notable was the development of the SMART system by Gerard Salton and his students, first at Harvard University and later at Cornell University.⁴ The 1970s and 1980s showed many developments built on the advances of the 1960s. An example is the famous Vector Space Model that was proposed by Salton,⁵ which is still very powerful in multiple and diverse tasks of TA. However, due to the lack of large text collections, the question of whether the proposed models and techniques would scale to large corpora remained open. This changed in 1992 with the inception of the Text Retrieval Conference (TREC),^a followed by the Document Understanding Conference (DUC)^b in 2001, which was later transformed into the Text Analysis Conference (TAC)^c in 2008. Each of these is part of a series of evaluation conferences sponsored by various US government agencies under the auspices of the National Institute of Standards and Technology (NIST), which aims at encouraging research in different areas of information retrieval (IR) from large text collections. These conferences have branched IR into related and important fields like retrieval of spoken information, multilingual and cross-language retrieval, information filtering, summarization, information extraction, and automatic evaluation.

This book describes multiple approaches to different TA tasks. The main focus is the multilinguality of those approaches, specifically, their ability to be applied to multiple languages.
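
The Vector Space Model mentioned in this historical overview can be made concrete with a small sketch. The version below assumes raw term-frequency weights and cosine similarity; other weighting schemes, such as TF-IDF, are common, and nothing here is claimed to reproduce Salton's original system. The tokenizer, function names, and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse term-frequency vector (term -> count)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query, documents):
    """Return (score, document) pairs sorted by decreasing similarity to the query."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "automatic summarization of news articles",
    "retrieval of documents from large text collections",
    "cooking recipes for the weekend",
]
ranking = rank_documents("text retrieval", docs)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

Because documents and queries are reduced to vectors of term counts, the model makes no assumption about the grammar of any particular language, only about how the text is split into terms.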

One of the most representative examples of joint international effort in the field of multilingual TA is the series of MultiLing conferences.⁶ The first MultiLing was organized in 2011 as a summarization track of TAC 2011.⁷ It gathered scientists from different countries with a joint purpose: to create the first large collection of documents in multiple languages, one that would permit scientists around the world to evaluate their summarization systems on different languages. The secondary goal was to encourage people to work on summarization systems that can be applied to multiple languages. For example, in order to participate in the MultiLing contest, a team was required to apply its system to at least two languages.

3. TA overview

In this section, we describe the main areas of TA, text preprocessing methods, and the evaluation process for TA tasks. A good overview of the main tasks in the field of TA is given in Ref. 8.

3.1. TA areas

Text analysis is roughly divided into several broad areas, as follows.

Text mining (TM) (first mentioned in Ref. 9) is the process of seeking or extracting useful information from textual data. It is an exciting research area as it tries to discover knowledge from unstructured texts.¹⁰ The scope of TM is the treatment of textual data through the application or adaptation of general techniques for knowledge discovery in databases.¹¹,¹² In order to apply these techniques, a suitable procedure among knowledge discovery methods is selected, modified to fit and handle the text data, and applied to large amounts of text. In general, text data is assumed to be available as character-based data in a standard encoding, although in many cases ...
