Textual Data Science with R
Mónica Bécue-Bertaut
About This Book

Textual Data Science with R comprehensively covers the main multidimensional methods in textual statistics, supported by a specially written package in R. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables, each illuminated by applications. The book is aimed at researchers and students in statistics, the social sciences, history, literature, and linguistics. It will also be of interest to anyone, from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.

Frequently asked questions

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Yes, you can access Textual Data Science with R by Mónica Bécue-Bertaut in PDF and/or ePUB format, as well as other popular books in Economics & Statistics for Business & Economics. We have over one million books available in our catalogue for you to explore.

Information

Year: 2019
ISBN: 9781351816359
Edition: 1
1 Encoding: from a corpus to statistical tables
1.1 Textual and contextual data
In the field of linguistics, a body of written or spoken documents is called a corpus (plural corpora). A large range of corpora, although differing in nature and purpose, can be analyzed using the same exploratory multivariate statistical methods. This requires that they be encoded as two-way frequency tables with a consistent structure. Furthermore, any available contextual data can also be used to improve information extraction.
1.1.1 Textual data
The encoding step is particularly important in the field of textual data. Its aim is to convert a corpus into a lexical table of documents × words. Let us now clarify what we mean by document and word.
A corpus needs to be divided into documents, which correspond to the statistical units in the analysis. This division can sometimes be obvious but, at other times, choices have to be made. For example, when processing a corpus of free text answers collected by means of a questionnaire-based survey including open-ended questions, each individual answer is frequently considered as one document. Nevertheless, another possible choice is to aggregate the free text answers into category documents according to the values of a contextual qualitative variable (see Section 3.2). In the case of theater plays or film scripts, a division into scene documents is generally a good way to proceed. However, other options are possible. For instance, one of the first applications of correspondence analysis (CA), a core method in our conception of the statistical analysis of texts, was performed on Jean Racine’s play Phèdre. There, Brigitte Escofier grouped the dialogue of each character into a separate document. Another example is the analysis of the speech of the French Minister of Justice, Robert Badinter, championing the abolition of the death penalty in France in 1981. Here, the aim is to unveil the organizing principles of his argument, which are of particular importance in a rhetorical speech, i.e., one which aims to convince its audience. In this kind of analysis, it is necessary to divide the speech into sequences of text of roughly similar length. Each sequence, now treated as a document, has to be long enough (from 500 to 1000 occurrences) to obtain a meaningful analysis when applying CA (see Section 7.3).
We will explain the difference between documents and what are known as aggregate documents in Section 1.1.3.
Word is used here as a generic term for the textual unit chosen, which could be, for instance, graphical forms (= continuous character strings), lemmas (= dictionary entries), or stems (= word roots). These are described further in Section 1.3. Counting word frequencies first requires defining rules to segment a corpus into occurrences and matching each occurrence with a corresponding word.
When the documents have been defined, words identified, and counts performed, the lexical table (LT) to be analyzed can be constructed. This table contains the frequency with which the documents (rows) use the words (columns).
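As an illustration, here is a minimal sketch of this encoding step in base R. The toy corpus, document names, and segmentation rule are invented for the example; the book's companion R package provides dedicated functions for real corpora.

# Toy corpus: three short documents (invented for illustration)
corpus <- c(doc1 = "family health and work",
            doc2 = "health of the family",
            doc3 = "work and money")

# Segment each document into word occurrences (graphical forms):
# lower-case the text, then split on any non-letter character
tokens <- lapply(strsplit(tolower(corpus), "[^a-z]+"),
                 function(x) x[x != ""])

# Build the lexical table: documents in rows, words in columns,
# each cell giving the frequency of the word in the document
words <- sort(unique(unlist(tokens)))
lexical_table <- t(sapply(tokens,
                          function(x) table(factor(x, levels = words))))
lexical_table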
1.1.2 Contextual data
Contextual data are external information available about the documents. They are particularly important in the case of a corpus of free text answers collected through open-ended questions, where respondents have also answered many closed questions, later encoded in the form of quantitative or qualitative variables. However, any document can be linked to contextual data, such as the publishing date, the author, the chronological position in the whole corpus, etc.
For example, in the case of plays or screenplays, each scene includes, besides dialogue, information such as character names, locations (outdoors, indoors, specific places, etc.), and a description of the action.
Contextual data are encoded as a standard documents × variables table called a contextual table.
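In an R workflow, such a contextual table is simply a data frame whose rows match the documents of the lexical table; the variables shown below are hypothetical examples.

# Contextual table: one row per document, in the same order as the
# rows of the lexical table (variables are hypothetical examples)
contextual_table <- data.frame(
  row.names = c("doc1", "doc2", "doc3"),
  age_group = c("18-29", "30-59", "60+"),
  gender    = c("F", "M", "F")
)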
1.1.3 Documents and aggregate documents
The same corpus can be divided into documents in different ways. The finest possible segmentation of the corpus into documents consists in considering each row of the original database as a document, called in the following a source document. In this case, an LT is constructed and analyzed. Then, if relevant, these documents can be aggregated according to either the categories of a contextual variable or clusters obtained from a clustering method, leading to construction of an aggregate lexical table (ALT). Aggregation of source documents into larger ones is particularly of interest when they are numerous and short, such as with free text answers, dialogue elements in a film script, and so on.
Nevertheless, aggregation can be carried out even if documents are long and/or not plentiful, depending on what the goals are. In any case, running analyses at different levels of detail may be useful. An analysis performed on source documents is referred to as a direct analysis, whereas aggregate analysis concerns aggregate documents. It must not be forgotten that the latter are aggregates of statistical units. As contextual variables are defined at the source document level, direct and aggregate analyses differ depending on the role the variables play in a given statistical method. Some indications will be provided in the following chapters.
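Continuing the toy objects from the sketches above, the following rough sketch shows how the rows of a source-level lexical table can be summed within the categories of a contextual variable to produce an aggregate lexical table.

# Aggregate lexical table (ALT): sum the rows of the source lexical
# table within each category of a contextual variable (here the
# hypothetical "gender" variable from the contextual table above)
alt <- rowsum(lexical_table, group = contextual_table$gender)
alt  # one row per category, same word columns as the source table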
1.2 Examples and notation
To illustrate the first six chapters of the book, we use part of the data collected during the International Aspiration survey, an extension to several countries of the Aspiration survey, which inquired into the lifestyles, opinions, and aspirations of French people. Aspiration was created by Ludovic Lebart at the Centre de recherche pour l’étude et l’observation des conditions de vie (CREDOC). Since the first wave in 1978, various open-ended questions have been introduced, leading Ludovic Lebart to develop original statistical methodology to address this type of textual data. Later, collaboration between Ludovic Lebart and Chikio Hayashi (then Director of the Institute of Statistical Mathematics in Tokyo) led them to design the International Aspiration survey, which included closed-ended questions from the original French survey, as well as open-ended questions such as:
What is most important to you in life?
What are other very important things to you? (relaunch of the first question)
What do you think of the culture of your country?
International Aspiration was conducted in seven countries (Japan, France, Germany, the United Kingdom (UK), the United States (US), the Netherlands, and Italy) by Masamichi Sasaki and Tatsuzo Suzuki in the late 1980s. The questionnaire was translated into each of the respective languages. The respondents answered in their own language, thus providing the multilingual corpora Life (composed of the free text answers to the first two questions, concatenated as if they were a single answer) and Culture (answers to the third question), both distinctive of their genre. These corpora, used for 25 years, have resulted in numerous publications and reference works. In Chapters 1 to 5, we use the UK component of both corpora (Life_UK and Culture_UK), and in Chapter 6, the UK, French, and Italian components of the Life corpus (Life_UK, Life_F...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Table of Contents
  6. Foreword
  7. Preface
  8. 1 Encoding: from a corpus to statistical tables
  9. 2 Correspondence analysis of textual data
  10. 3 Applications of correspondence analysis
  11. 4 Clustering in textual data science
  12. 5 Lexical characterization of parts of a corpus
  13. 6 Multiple factor analysis for textual data
  14. 7 Applications and analysis workflows
  15. Appendix: Textual data science packages in R
  16. Bibliography
  17. Index