R Web Scraping Quick Start Guide
eBook - ePub

R Web Scraping Quick Start Guide

Techniques and tools to crawl and scrape data from websites

  1. 114 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android


About This Book

Web scraping techniques are becoming more popular, since data is as valuable as oil in the 21st century. This book gives you key knowledge of XPath and regex, as well as web scraping libraries for R such as rvest and RSelenium.

Key Features

  • Techniques, tools and frameworks for web scraping with R
  • Scrape data effortlessly from a variety of websites
  • Learn how to selectively choose the data to scrape, and build your dataset

Book Description

Web scraping is a technique to extract data from websites. It simulates the behavior of a website user to turn the website itself into a web service to retrieve or introduce new data. This book gives you all you need to get started with scraping web pages using R programming.

You will learn about the rules of regex and XPath, key components for scraping website data. We will show you web scraping techniques, methodologies, and frameworks. With this book's guidance, you will become comfortable with the tools to write and test regex and XPath rules.

We will focus on examples of dynamic websites for scraping data and how to implement the techniques learned. You will learn how to collect URLs and then create XPath rules for your first web scraping script using rvest library. From the data you collect, you will be able to calculate the statistics and create R plots to visualize them.

Finally, you will discover how to use Selenium drivers with R for more sophisticated scraping. You will create AWS instances and use R to connect to a PostgreSQL database hosted on AWS. By the end of the book, you will be sufficiently confident to create end-to-end web scraping systems using R.

What you will learn

  • Write and test regex rules
  • Write XPath rules to query your data
  • Learn how web scraping methods work
  • Use rvest to crawl web pages
  • Store data retrieved from the web
  • Learn the key uses of RSelenium to scrape data

Who this book is for

This book is for R programmers who want to get started quickly with web scraping, as well as data analysts who want to learn scraping using R. Basic knowledge of R is all you need to get started with this book.

R Web Scraping Quick Start Guide by Olgun Aydin is available in PDF and ePUB format, under Computer Science & Data Processing.

Information

Year: 2018
ISBN: 9781788992633
Edition: 1

Web Scraping with rvest

All the data we need today is already available on the internet, which is great news for data scientists. The only barrier to using this data is the ability to access it. Some platforms (such as Twitter) even provide APIs that support data collection, but most web pages offer no such advantage and can only be reached by crawling.
Before we go on to scrape the web with R, note that web scraping is an advanced form of data collection and analysis. We will use Hadley Wickham's method for web scraping, the rvest package, which also requires the selectr and xml2 packages.
The way rvest operates is simple and straightforward. Just as we would when browsing manually, we first give the rvest package the link to the web page. After that, the appropriate tags have to be defined: HTML structures content using various tags and selectors, and these selectors must be identified so that the rvest package can store their contents. Then, all the scraped data can be transformed into an appropriate dataset, and analysis can be performed.
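These steps (define the page, apply the selectors, collect the contents into a dataset) can be sketched as follows. The inline HTML fragment here stands in for a fetched page, since read_html() accepts either a live URL or a string; the class and tag names are assumptions for illustration:

```r
library(rvest)

# Step 1: define the web page (read_html() also accepts a live URL)
page <- read_html("<div class='post'><h2>First post</h2><h2>Second post</h2></div>")

# Step 2: identify the tags/selectors that mark the content of interest
titles <- html_text(html_nodes(page, "div.post h2"))

# Step 3: transform the scraped data into a dataset for analysis
dataset <- data.frame(title = titles)
```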
In this section, we will discuss in detail how fast and practical it is to use R for web scraping. After this section, you will gain expertise in using R to collect data over the internet.
The topics to be covered in this chapter are as follows:
  • Introducing rvest
  • Step-by-step web scraping with rvest

Introducing rvest

Most of the data on the web is available at large scale as HTML. Although HTML is structured (hierarchical and tree-based), it is often not in a form that is useful for analysis:
<html>
<head>
<title>Looks like a title</title>
</head>
<body>
<p align="center">What's up ?</p>
</body>
</html>
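As a quick illustration of this tree structure, rvest (via xml2) can parse such a snippet directly from a string and pull out individual nodes:

```r
library(rvest)

# Parse the HTML snippet above from a string rather than a URL
page <- read_html("<html><head><title>Looks like a title</title></head>
  <body><p align='center'>What's up ?</p></body></html>")

html_text(html_nodes(page, "title"))  # "Looks like a title"
html_text(html_nodes(page, "p"))      # "What's up ?"
```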
rvest is a very useful R library that helps you collect information from web pages. It is designed to work with magrittr and is inspired by libraries such as BeautifulSoup.
To start the web scraping process, you first need to master the basics of R. In this section, we will perform web scraping step by step, using the rvest R package written by Hadley Wickham.
For more information about the rvest package, visit the following URLs:
CRAN page: https://cran.r-project.org/web/packages/rvest/index.html
rvest on GitHub: https://github.com/hadley/rvest
Make sure this package is installed. If you do not have this package right now, you can use the following code to install it: install.packages('rvest').
Let's take a look at some important functions in rvest:
  • read_html(): Creates an HTML document from a URL, a file on disk, or a string containing HTML.
  • html_nodes(doc, "table td"): Selects parts of a document using CSS selectors.
  • html_nodes(doc, xpath = "//table//td"): Selects parts of a document using XPath selectors.
  • html_tag(): Extracts components by tag name.
  • html_text(): Extracts text from an HTML document.
  • html_attr(): Gets a single HTML attribute.
  • html_attrs(): Gets all HTML attributes.
  • xml(): Works with XML files.
  • xml_node(): Extracts XML components.
  • html_table(): Parses HTML tables into a data frame.
  • html_form(), set_values(), submit_form(): Extract, modify, and submit forms.
  • guess_encoding(), repair_encoding(): Detect and repair problems regarding encoding.
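A short sketch exercising several of these functions on an inline HTML fragment; the table contents and link are made up for illustration:

```r
library(rvest)

doc <- read_html("
  <table>
    <tr><th>city</th><th>pop</th></tr>
    <tr><td>Istanbul</td><td>15</td></tr>
    <tr><td>Ankara</td><td>5</td></tr>
  </table>
  <a href='https://example.com'>link</a>")

# CSS and XPath selectors address the same nodes
cells_css   <- html_nodes(doc, "table td")
cells_xpath <- html_nodes(doc, xpath = "//table//td")
html_text(cells_css)  # "Istanbul" "15" "Ankara" "5"

# Attributes and tables
html_attr(html_nodes(doc, "a"), "href")        # "https://example.com"
tab <- html_table(html_nodes(doc, "table"))[[1]]  # a data frame
```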

Step-by-step web scraping with rvest

After covering the fundamentals of the rvest library, we are now going to dive deeper into web scraping with rvest. First, we will talk about how to collect URLs from the website we would like to scrape.
We will use some simple regex rules for this task. Since we have already learned how XPath works, it is time to write XPath rules. Once the XPath and regex rules are ready, we will jump into writing scripts that collect data from the website. It would be great to play with the data we collect, so don't worry; we will also draw some plots and create some charts.
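The URL-collection step can be sketched as follows. The menu snippet, its class name, and the category URLs are hypothetical stand-ins for the live page, chosen only to show how an XPath rule and a regex rule work together:

```r
library(rvest)

# Hypothetical snippet standing in for the blog's category menu
menu <- read_html("
  <ul class='menu'>
    <li><a href='http://devveri.com/category/big-data'>Big Data (12)</a></li>
    <li><a href='http://devveri.com/category/nosql'>NoSQL (7)</a></li>
    <li><a href='http://devveri.com/about'>About</a></li>
  </ul>")

# XPath rule: anchor tags inside the menu list
links <- html_attr(html_nodes(menu, xpath = "//ul[@class='menu']//a"), "href")

# Regex rule: keep only the category URLs
category_urls <- links[grepl("/category/", links)]
```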
We will collect a dataset from a blog about big data (www.devveri.com). This website provides useful information about the big data and data science domains, and it is completely free of charge. Visitors can find use cases, exercises, and discussions regarding big data technologies.
Let's start by collecting information to find out how many articles there are in each category. You can find this information on the main page of the blog at the following URL: http://devveri.com/. The screenshot shown is of the main page of the blog.
  • As you see...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. Packt Upsell
  5. Contributors
  6. Preface
  7. Introduction to Web Scraping
  8. XML Path Language and Regular Expression Language
  9. Web Scraping with rvest
  10. Web Scraping with Rselenium
  11. Storing Data and Creating Cronjob
  12. Other Books You May Enjoy