Data Mining Methods for the Content Analyst

An Introduction to the Computational Analysis of Content

About This Book

With continuous advancements and an increase in user popularity, data mining technologies serve as an invaluable resource for researchers across a wide range of disciplines in the humanities and social sciences. In this comprehensive guide, author and research scientist Kalev Leetaru introduces the approaches, strategies, and methodologies of current data mining techniques, offering insights for new and experienced users alike.

Designed as an instructive reference to computer-based analysis approaches, each chapter of this resource explains a set of core concepts and analytical data mining strategies, along with detailed examples and steps relating to current data mining practices. Every technique is considered with regard to context, theory of operation, and methodological concerns, with a focus on the capabilities and strengths of these technologies. In addressing critical methodologies and approaches to automated analytical techniques, this work provides an essential overview of a broad, innovative field.

Frequently asked questions

Simply head over to the account section in settings and click on ā€œCancel Subscriptionā€ - itā€™s as simple as that. After you cancel, your membership will stay active for the remainder of the time youā€™ve paid for. Learn more here.
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
Both plans give you full access to the library and all of Perlegoā€™s features. The only differences are the price and subscription period: With the annual plan youā€™ll save around 30% compared to 12 months on the monthly plan.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, weā€™ve got you covered! Learn more here.
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Yes, you can access Data Mining Methods for the Content Analyst by Kalev Leetaru in PDF and/or ePUB format, as well as other popular books in Languages & Linguistics & Communication Studies. We have over one million books available in our catalogue for you to explore.

Information

Publisher: Routledge
Year: 2012
ISBN: 9781136514586
Edition: 1

1

INTRODUCTION

Nearly all fields that involve the synthesis of large amounts of information have begun to explore digital analysis methodologies in recent years. Countless projects have been launched that exploit advances in computing techniques to perform innovative, but computationally complex, forms of analysis on large warehouses of textual data. The myriad underlying algorithms powering such analyses hail from fields ranging from linguistics to psychology, but the underlying theme is the same: to leverage large quantities of cheap computing power to perform previously unthinkable analyses at formerly intractable scales.
This book is designed to provide an introduction to current digital analysis techniques, focusing on these technologies through the capabilities they afford and the methodologies that underlie them. Topics include vocabulary analysis, lexicons, co-occurrence and concordance, spatial analysis, topic extraction, natural language processing, automatic categorization, network analysis, and hybrid applications like sentiment analysis. One purpose of this book is to bring together in a single work a cross-section of the automated analytical techniques in use today for content analysis. While some discussion will be given to the underlying technology, the primary focus will be on the broader implications of each technique: its theory of operation, the contexts in which it is appropriate or inappropriate, nuances that must be considered, and even, in some cases, deeper methodological concerns that must be addressed. New and experienced content analysis practitioners alike should find this book useful as a working guide for digital analysis in their field.

What Is Content Analysis?

As this book illustrates, content analysis is a broad term covering nearly every analytical technique that can be used to extract secondary meaning from information. Content analysis can refer to both human and machine techniques, but increasing computing power means that automated techniques are now able to process vast quantities of material very quickly, increasing the size and scope of the datasets that may be addressed. Computational tools can explore sentiment patterns in political speeches, representational issues in international news coverage, actor links in social networks, and even geographic diffusion in historical archives. Document clustering, geocoding, topic extraction, and categorization join the more traditional ranks of co-occurrence and vocabulary exploration. In essence, content analysis is a catch-all term that refers to any kind of analysis that attempts to derive new meaning from existing content.

Why Use Computerized Analysis Techniques?

Many of the capabilities offered by the computerized content analysis techniques described in this book have historically been available through laborious human coding. One might therefore ask what benefit automated techniques have over this long-established tradition. Computer-based analysis offers substantial advantages over human-based approaches in three key areas:
•   Reliability: Projects that rely on large teams of human coders face substantial difficulties in ensuring consistency of results across all coders. For example, what one coder might term a major act of violence, another might label a minor skirmish. The same coder might change their standards over the course of a project, subconsciously influenced by the material being coded. A battery of tests such as inter- and intra-coder reliability must be continually employed to ensure a baseline of consistency across results and offer some objective measure of accuracy. Computer-based techniques suffer from no such limitations, and it can be determined precisely why the machine chose a particular answer. A cluster of 20 computers processing tens of millions of articles will code every single article using the exact same set of standards, never varying once from its electronic colleagues.
•   Reproducibility: There is a substantial difference between establishing a set of coding guidelines and the way in which each coder actually implements them. Even with the use of reliability and consistency measures, the same project repeated multiple times with different groups of coders may yield slightly different results. A computer-based coding system using the same ruleset will never differ in its output, even if run millions of times.
•   Scale: Most content analysis projects have been forced to use small samples of texts in order to achieve reportable results in a reasonable time frame. Even well-funded projects with hundreds of human coders would have difficulty coding a corpus of tens of millions of news articles. Computer techniques operate at considerably greater speed, with a single machine able to do the work of thousands of coders without ever needing a break. Additional capacity can be added for the price of an additional computer (typically just a few thousand dollars) that can then run continuously for years.
The last few decades have seen astounding innovation in the field of computational linguistics, but unfortunately few of these techniques have yet found their way into mainstream use by content analysts. In the area of sentiment analysis, computer scientists focus on assigning textual sentiment to individual actors and enhancing the contextual abilities of their systems, while many social scientists still rely on coarse dictionaries of hand-built terms to divide articles into "mostly positive" or "mostly negative" bins. Audio transcripts are typed up by hand and examined purely on their textual content, while large networks of interrelated documents are explored based only on their local connections, rather than as actors in global environments. In short, as computer scientists engage in a virtual arms race to develop the latest advances in automated content processing, the majority of content analysis projects are still manual affairs involving large teams of human coders.
Unlike human subject experts, who can make use of a wide range of qualitative analytical techniques involving abstract reasoning, computers are limited to more precise quantitative methods that can be expressed in concrete mathematical terms. Qualitative research questions must therefore be reshaped into techniques based on counts and connections as opposed to deep subjective musings. For example, leveraging computational techniques to examine the portrayal of a religious group in an historical document archive would require translating the notion of portrayal into a concrete metric that can be represented in terms of quantitative information, such as counts of particular words. A technique like sentiment analysis can then be used to quantify the positivity of portrayal, not through a deep introspective reading, but rather by counting how many words about the group appear in dictionaries of positive and negative words. Computerized content analysis is therefore about "substitut[ing] controlled observation and systematic counting for impressionistic ways of observing frequencies of occurrence" (De Sola Pool, 1959, p. 8).
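As a minimal sketch of this dictionary-counting idea (the word lists are toy examples invented for illustration, not a published lexicon), a few lines of Perl are enough to tally how many words in a document fall into positive and negative lists and report a simple positivity ratio:

```perl
#!/usr/bin/perl
# Minimal sketch of dictionary-based sentiment scoring: count how many words
# in the input appear in small hand-built positive and negative word lists.
use strict;
use warnings;

# Toy lexicons invented for illustration; a real project would load much
# larger dictionaries from files.
my %positive = map { $_ => 1 } qw(peaceful prosperous tolerant celebrated admired);
my %negative = map { $_ => 1 } qw(violent corrupt hostile condemned feared);

my ($pos, $neg) = (0, 0);
while (my $line = <STDIN>) {
    # Lowercase the line and split on non-letter characters to get rough tokens.
    for my $word (split /[^a-z]+/, lc $line) {
        next unless length $word;
        $pos++ if $positive{$word};
        $neg++ if $negative{$word};
    }
}

my $total = $pos + $neg;
printf "positive=%d  negative=%d  positivity=%.2f\n",
    $pos, $neg, $total ? $pos / $total : 0;
```

Run as "perl sentiment_sketch.pl < article.txt" (the file names are placeholders), the script prints the raw counts and the share of dictionary hits that were positive. The methodological questions about what belongs in those word lists remain, but the counting itself is trivial to automate.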

Standalone Tools or Integrated Suites

Modern content analysts will find themselves confronted with a dizzying array of software packages touting every type of functionality imaginable. The increasing popularity of content analysis techniques across non-traditional disciplines, ranging from history to literature, has led to explosive growth in this market. Both commercial and open source offerings are available, and they fall largely into two categories: specialized tools and integrated suites. Specialized tools tend to focus on one specific domain area, such as sentiment analysis or document categorization, applying their algorithms to textual input documents and outputting numeric data. They do not perform statistical analysis themselves, instead generating output in a format ready for import into traditional statistical packages. A sentiment mining module, for example, might take a collection of documents and output a table of sentiment scores, leaving cross-correlation analysis to be done in an external statistics program. One of the greatest challenges in content analysis is the large assortment of specialty algorithms that must be used to transform text into numerical representations measuring different characteristics. Standalone tools that embody a particular algorithm and translate textual material into a form that can be processed using external tools allow the analyst to leverage the considerable statistical power of dedicated statistics packages, like STATA®, SPSS®, and SAS®, which offer a much greater array of numeric features than any integrated suite.
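To make that division of labor concrete, here is a hedged sketch of such a hand-off: the word-count and average-word-length measures are simple placeholders for whatever a real specialized module would compute, and the directory-of-.txt-files layout is an assumption. The script writes one CSV row per document, ready for import into a package such as STATA®, SPSS®, or SAS®.

```perl
#!/usr/bin/perl
# Sketch of the specialized-tool hand-off: compute a couple of numeric measures
# per document and emit a CSV table for import into a statistics package.
use strict;
use warnings;

my $dir = shift @ARGV or die "usage: $0 <directory of .txt documents>\n";

print "document,word_count,avg_word_length\n";
for my $path (sort glob "$dir/*.txt") {
    open my $fh, '<', $path or die "cannot open $path: $!";
    my ($words, $chars) = (0, 0);
    while (my $line = <$fh>) {
        for my $word (grep { length } split /\s+/, $line) {
            $words++;
            $chars += length $word;
        }
    }
    close $fh;
    printf "%s,%d,%.2f\n", $path, $words, $words ? $chars / $words : 0;
}
```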
Integrated suites attempt to bundle a cross-section of capabilities into a single package, combining the textual analysis algorithms of specialized tools with a variety of statistical features, scripting languages for extensibility, and graphing capability. A primary advantage of integrated suites, especially for analysts just beginning with computerized tools, is that they provide start-to-finish capability. A collection of documents can be imported into the system, analyzed using a variety of different modules, output to graphs and tables, and the results pasted directly into a document for publication. However, for more experienced researchers, or those who already have extensive familiarity with existing statistical analysis packages, specialized tools may be a better fit for their workflow. The statistical algorithms used in commercial statistics packages are often more robust and offer a greater number of options than those in integrated suites, where they may have been added as an afterthought. Researchers are also often familiar with the environments of the large statistical packages and don't necessarily have the time to learn entirely new software programs just for their content analysis tasks. Thus, users new to both content and statistical analysis may find integrated suites a good fit, while those with statistical experience looking to expand into content analysis will likely prefer specialized tools, using them like translators to integrate text analysis into their existing statistical modeling portfolio.
There are also many packages designed for use within a specific discipline. WordHoard (http://wordhoard.northwestern.edu/) is a widely known tool for literary analysis, and many of its features, such as detailed vocabulary and lemma analysis, have significant applications outside the literary discipline. However, WordHoard derives much of its power from the fact that it requires input documents to be pre-annotated by human editors with special tags that give it rich information about document structure. For many projects, the investment needed to code a document collection into the tool's proprietary language outweighs its possible benefits. It is therefore important to investigate the input requirements of a tool to make sure it is compatible with a project's specific needs.
Larger projects, or those with complex pipelined processing needs where the output of one tool feeds into the next, will usually require custom programming to merge all of the data together. Integrated suites tend to use their own proprietary scripting languages, each with its own unique syntax and methodology. Many suites have very small user communities with a limited set of developers, and so the analyst will often be left on their own to do all of the necessary architectural development and programming. Specialized tools simply perform one specific analysis and output their results in a format that makes it easy to load them into external tools as part of a larger pipeline. In many cases, the output of automated techniques can be dropped into the same workflows previously developed for human coding. A researcher used to working with human coders can simply substitute an automated module for a team of human coders, while keeping the rest of their workflow untouched. It is straightforward to design complex multi-method projects with these tools, relying on the large industry-standard programming environments of commercial statistical packages to do the data integration. STATA®, SPSS®, and SAS®, in particular, have large user and developer communities, with users able to submit custom modules to be shared with other researchers, developing a community-based collection of specialty tools.
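As a sketch of the glue code such a pipeline typically needs (the two headerless input files, sentiment.csv and topics.csv, and their doc_id,value layout are hypothetical), the following Perl script joins the per-document output of two separate tools on a shared document identifier:

```perl
#!/usr/bin/perl
# Pipeline-glue sketch: merge per-document tables produced by two separate
# tools (hypothetical sentiment scores and topic labels keyed by document ID)
# into one combined table for downstream statistical analysis.
use strict;
use warnings;

my ($sent_file, $topic_file) = @ARGV;
die "usage: $0 sentiment.csv topics.csv\n" unless $sent_file && $topic_file;

# Load the topic table (assumed headerless lines of "doc_id,label") into a hash.
my %topic;
open my $tfh, '<', $topic_file or die "cannot open $topic_file: $!";
while (my $line = <$tfh>) {
    chomp $line;
    my ($doc, $label) = split /,/, $line, 2;
    $topic{$doc} = $label;
}
close $tfh;

# Stream the sentiment table (assumed "doc_id,score") and append the topic.
open my $sfh, '<', $sent_file or die "cannot open $sent_file: $!";
print "document,sentiment,topic\n";
while (my $line = <$sfh>) {
    chomp $line;
    my ($doc, $score) = split /,/, $line, 2;
    printf "%s,%s,%s\n", $doc, $score, $topic{$doc} // 'NA';
}
close $sfh;
```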

Transitioning from Theory to Practice

This book is intended to provide an overview of the breadth of computer-based content analysis techniques available today and to act as a jumping-off point for further research in those areas. Each of the topics discussed here, from cluster analysis to lexicons and sentiment analysis, has an extensive literature covering its methodology and operation. The purpose of this book is not to delve too deeply into any given topic, but rather to provide a broad introduction to the entire array of techniques available today.
Unfortunately, the high degree of specialization of most commercial content analysis tools leads to very high price tags, with higher-end packages carrying additional per-year maintenance licensing costs and limits on the number of documents that can be processed per year. Open source data mining toolkits don't yet have the interfaces necessary for widespread adoption in the humanities and social sciences. Frameworks like the General Architecture for Text Engineering (GATE) (http://gate.ac.uk/) have been integrated into numerous research applications within the computer science and linguistics disciplines, but lack the ease of use necessary for broader adoption. Some disciplines, such as literature, have developed their own customized toolkits that address the subset of analytical problems of interest to them. WordHoard, as noted earlier, packages a number of vocabulary and lexical tools into a single program, with an interface tailored for that genre of scholarship. Yet these tools are often designed in such a discipline-specific way that it is hard to apply them to other fields.
Whether commercial or open source, most content analysis programs are designed for small-scale use. Even those that do not impose limits on the number of documents they can process simply do not have the workflow capabilities to process hundreds of thousands or even millions of documents robustly. Custom programming is required for such projects, yet the expanding popularity of computational techniques has led to a proliferation of prebuilt modules that can be dropped into many programming environments. The Practical Extraction and Report Language (PERL) is a programming language developed in 1987 specifically for the task of manipulating, processing, and presenting large amounts of textual and numeric data very efficiently. Beyond its native capabilities, the PERL language attracts a flourishing community of developers who have built thousands of plugin modules for nearly every imaginable task. Many of the techniques described in this book are available as such modules on the main PERL archive, CPAN (http://search.cpan.org/).
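As a small taste of why the language remains popular for this kind of work, the following self-contained sketch (plain Perl, no CPAN modules assumed) tallies word frequencies in whatever text it is fed and prints the ten most common words, the sort of building block that vocabulary analysis starts from:

```perl
#!/usr/bin/perl
# Vocabulary sketch: count word frequencies across the input text and print
# the ten most frequent words.
use strict;
use warnings;

my %count;
while (my $line = <STDIN>) {
    # Lowercase and split on anything that is not a letter or apostrophe.
    $count{$_}++ for grep { /[a-z]/ } split /[^a-z']+/, lc $line;
}

# Sort by descending frequency, breaking ties alphabetically.
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
for my $word (@ranked[0 .. 9]) {
    last unless defined $word;
    printf "%-15s %d\n", $word, $count{$word};
}
```

Run as "perl topwords.pl < corpus.txt" (the file names are placeholders); community modules layer more sophisticated processing on top of this kind of skeleton.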
To get you started right away with a subset of these techniques, point your browser to the National Resource for Computational Content Analysis (NRCCA) (http://contentanalysis.ncsa.illinois.edu/), a free web portal made available through a partnership of the Institute for Computing in the Humanities, Arts, and Social Science, the National Center for Supercomputing Applications, and the University of Illinois. NRCCA offers many of the techniques described in this book through a web-based interface that doesn't require installing any software. It also includes links to many other resources that new and experienced content analysts alike will find useful.

Chapter in Summary

Computerized content analysis exploits advances in computational techniques to perform innovative, but computationally complex, forms of analysis on large warehouses of textual data. Techniques range from traditional methods, like vocabulary analysis, lexicons, co-occurrence and concordance, to more advanced approaches, like spatial analysis, topic extraction, and sentiment analysis. Automated processing is more reliable and reproducible than traditional human coding, as well as orders of magnitude faster and more scalable. Current software packages for content analysis tend to take the form either of specialized tools that translate textual documents into numeric data for statistical processing, or of integrated suites that combine text algorithms, statistical analysis, graphing, and reporting all in one.

2

OBTAINING AND PREPARING DATA

The first step in any content analysis project is to obtain the data to be analyzed and perform any necessary preprocessing to ready it for analysis. Data collection is often regarded as the easiest step of the entire analytical process, yet it is actually one of the most complex, with the quality of the collection and preparation processes affecting every other stage of the project. Issues such as ensuring the data sample is meaningful, along with preparation tasks like cleaning and filtering, are critical concerns that are often overlooked. This chapter provides a basic introduction to the issues surrounding data collection and preparation, with a few words on advanced topics like the integration of multimedia sources and random sampling.

Collecting Data from Digital Text Repositories

One of the most popular sources of material for content analysis is the online searchable digital text repository. However, while these collections have substantial advantages over their print brethren, one must be careful to understand their limitations. The foremost danger is the all-too-common belief that a keyword query on a searchable database will return every single matching document. As discussed in this section, numerous complexities, ranging from OCR errors in historical collections to content blackouts stemming from licensing restrictions, may partially or entirely exclude works from the online edition of a print publication. Humans are also able to rely on a tremendous array of background knowledge when searching for relevant documents, giving them more flexibility in locating documents based on what they mean. Computers, on the other hand, rely on precise keyword matches, and automated searches must therefore be carefully designed around their more limited capabilities.
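One common way to design around this limitation is to search for a small set of anticipated variants rather than a single literal keyword. In the hedged Perl sketch below, the variant list is invented purely for illustration (alternate spellings plus a common OCR confusion of the letter "l" with the digit "1"); the script simply reports which input lines match any of the variants:

```perl
#!/usr/bin/perl
# Keyword-search sketch: a computer finds only what it is told to look for,
# so match a handful of anticipated spelling and OCR variants of a term.
use strict;
use warnings;

# Hypothetical variants: British/American spellings plus OCR "l" -> "1" errors.
my @variants = ('colour', 'color', 'co1our', 'co1or');
my $pattern  = join '|', map { quotemeta } @variants;
my $regex    = qr/\b(?:$pattern)\b/i;

while (my $line = <STDIN>) {
    print "$.: $line" if $line =~ $regex;
}
```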

Are the Data Meaningful?

One of the first issue...

Table of contents

  1. Front Cover
  2. DATA MINING METHODS FOR THE CONTENT ANALYST
  3. Title Page
  4. Copyright
  5. Dedication
  6. CONTENTS
  7. List of Tables and Figures
  8. Acknowledgments
  9. 1 Introduction
  10. 2 Obtaining and Preparing Data
  11. 3 Vocabulary Analysis
  12. 4 Correlation and Co-occurrence
  13. 5 Lexicons, Entity Extraction, and Geocoding
  14. 6 Topic Extraction
  15. 7 Sentiment Analysis
  16. 8 Similarity, Categorization and Clustering
  17. 9 Network Analysis
  18. References
  19. Index