Go Web Scraping Quick Start Guide
eBook - ePub

Go Web Scraping Quick Start Guide

Implement the power of Go to scrape and crawl data from the web

  • 132 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android


About This Book

Learn how Go-specific language features help simplify building web scrapers, along with common pitfalls and best practices for web scraping.

Key Features

  • Use Go libraries like Goquery and Colly to scrape the web
  • Avoid common pitfalls and follow best practices to scrape and crawl effectively
  • Learn how to scrape using the Go concurrency model

Book Description

Web scraping is the process of extracting information from the web using various tools that perform scraping and crawling. Go is emerging as the language of choice for scraping using a variety of libraries. This book will quickly explain to you how to scrape data from various websites using Go libraries such as Colly and Goquery.

The book starts with an introduction to the use cases of building a web scraper and the main features of the Go programming language, along with setting up a Go environment. It then moves on to HTTP requests and responses and talks about how Go handles them. You will also learn the basics of web scraping etiquette.

You will be taught how to navigate through a website using breadth-first and depth-first searches, and how to find and follow links. You will learn ways to track history in order to avoid loops, and how to protect your web scraper using proxies.

Finally, the book will cover the Go concurrency model, how to run scrapers in parallel, and large-scale distributed web scraping.

What you will learn

  • Implement Cache-Control to avoid unnecessary network calls
  • Coordinate concurrent scrapers
  • Design a custom, larger-scale scraping system
  • Scrape basic HTML pages with Colly and JavaScript pages with chromedp
  • Discover how to search using the "strings" and "regexp" packages
  • Set up a Go development environment
  • Retrieve information from an HTML document
  • Protect your web scraper from being blocked by using proxies
  • Control web browsers to scrape JavaScript sites

Who this book is for

Data scientists and web developers with a basic knowledge of Golang who want to collect web data and analyze it for effective reporting and visualization.


Information

Author: Vincent Smith
Year: 2019
ISBN: 9781789612943
Edition: 1

Scraping at 100x

By now, you should have a very broad understanding of how to build a solid web scraper. Up to this point, you have learned how to collect information from the internet efficiently, safely, and respectfully. The tools that you have at your disposal are enough to build web scrapers on a small to medium scale, which may be just what you need to accomplish your goals. However, there may come a day when you need to upscale your application to handle large and production-sized projects. You may be lucky enough to make a living out of offering services, and, as that business grows, you will need an architecture that is robust and manageable. In this chapter, we will review the architectural components that make a good web scraping system, and look at example projects from the open source community. Here are the topics we will discuss:
  • Components of a web scraping system
  • Scraping HTML pages with colly
  • Scraping JavaScript pages with chrome-protocol
  • Distributed scraping with dataflowkit

Components of a web scraping system

In Chapter 7, Scraping with Concurrency, we saw how defining a clear separation of roles between the worker goroutines and the main goroutine helped mitigate issues in the program. By clearly giving the main goroutine the responsibility of maintaining the state of the target URLs, and allowing the scraper threads to focus on scraping, we laid the groundwork for a modular system whose components can easily scale independently. This separation of concerns is the foundation for building large-scale systems of any kind.
There are a few main components that make up a web scraper. Each of these components should be able to scale without affecting other parts of the system, provided they are properly decoupled. You will know the decoupling is solid if you can break a component out into its own package and reuse it for other projects. You might even want to release it to the open source community! Let's take a look at some of these components.

Queue

Before a web scraper can start collecting information, it needs to know where to go. It also needs to know where it has been. A proper queuing system will accomplish both of these goals. Queues can be set up in many different ways. In many of the previous examples, we used a []string or a map[string]string to hold the target URLs the scraper should pursue. This works for smaller scale web scrapers where the work is being pushed to the workers.
In larger applications, a work-stealing queue would be preferred. In a work-stealing queue, the worker threads take the first available job out of the queue as fast as they can accomplish their tasks. This way, if you need your system to increase throughput, you can simply add more worker threads. In this system, the queue does not need to concern itself with the status of the workers and focuses only on the status of the jobs. This is beneficial compared to systems that push work to the workers, which must be aware of how many workers there are, which workers are busy or free, and must handle workers coming online and going offline.
Queuing systems are not always a part of the main scraping application. There are many suitable solutions for external queues, from databases to streaming platforms such as Redis and Kafka. These tools will support your queuing system to the limits of your own imagination.
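To make the pull-based model concrete, here is a minimal sketch that uses a buffered channel as an in-memory job queue, with worker goroutines taking the next URL as soon as they are free. The URLs and the worker count are placeholders, not code from the book.

```go
package main

import (
	"fmt"
	"sync"
)

// worker pulls URLs off the shared queue until it is closed.
func worker(id int, jobs <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for url := range jobs {
		// A real scraper would fetch and parse the page here.
		fmt.Printf("worker %d scraping %s\n", id, url)
	}
}

func main() {
	jobs := make(chan string, 100) // in-memory queue of target URLs
	var wg sync.WaitGroup

	// Increasing throughput is simply a matter of starting more workers.
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go worker(i, jobs, &wg)
	}

	for _, u := range []string{
		"https://example.com/a", // placeholder URLs
		"https://example.com/b",
		"https://example.com/c",
	} {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```

Swapping the channel for an external queue such as Redis or Kafka would only change how the worker reads its jobs; the rest of the scraper stays the same.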

Cache

As we have seen in Chapter 3, Web Scraping Etiquette, caching web pages is an essential part of an efficient web scraper. With a cache, we are able to avoid requesting content from a website if we know nothing has changed. In our previous examples, we used a local cache which saves the content into a folder on the local machine. In larger web scrapers with multiple machines, this causes problems, as each machine would need to maintain its own cache. Having a shared caching solution would solve this problem and increase the efficiency of your web scraper.
There are many different ways to approach this problem. Much like the queuing system, a database can help store a cache of your information. Most databases support storage of binary objects, so whether you are storing HTML pages, images, or any other content, it is possible to put it into a database. You can also include a lot of metadata about a file, such as the date it was retrieved, the date it expires, the size, the ETag, and so on. Another caching solution you can use is a form of cloud object storage, such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage. These services typically offer low-cost storage solutions that mimic a file system and require a specific SDK or use of their APIs. A third solution you could use is a Network File System (NFS) to which each node would connect. Writing to the cache on an NFS is the same as writing to the local file system, as far as your scraper code is concerned. There can be challenges in configuring your worker machines to connect to an NFS. Each of these approaches has its own unique set of pros and cons, depending on your setup.
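To illustrate keeping the cache behind a small interface so the backing store can be swapped later, here is a minimal sketch with a local-disk implementation. The Cache interface and the hashed-filename layout are assumptions made for illustration, not the book's implementation.

```go
package cache

import (
	"crypto/sha1"
	"encoding/hex"
	"os"
	"path/filepath"
)

// Cache is a hypothetical interface; a database, object store, or NFS-backed
// implementation could replace DiskCache without touching the scraper code.
type Cache interface {
	Get(url string) ([]byte, bool)
	Put(url string, body []byte) error
}

// DiskCache stores each response body in a file named after the URL's hash.
type DiskCache struct {
	Dir string
}

func (d DiskCache) path(url string) string {
	sum := sha1.Sum([]byte(url))
	return filepath.Join(d.Dir, hex.EncodeToString(sum[:]))
}

// Get returns the cached body for a URL, if one exists.
func (d DiskCache) Get(url string) ([]byte, bool) {
	body, err := os.ReadFile(d.path(url))
	return body, err == nil
}

// Put writes the body for a URL to disk.
func (d DiskCache) Put(url string, body []byte) error {
	return os.WriteFile(d.path(url), body, 0o644)
}
```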

Storage

In most cases, when you are scraping the web, you will be looking for very specific information. This is probably going to be a very small amount of data relative to the size of the web page itself. Because the cache stores the entire contents of the web page, you will need some other storage system to store the parsed information. The storage component of a web scraper could be as simple as a text file, or as large as a distributed database.
These days, there are many database solutions available to satisfy different needs. If you have data that has many intricate relationships, then an SQL database might be a good fit for you. If you have data that has more of a nested structure, then you may want to look at NoSQL databases. There are also solutions that offer full-text indexing to make searching for documents easier, and time-series databases if you need to relate your data to some chronological order. Because there is no one-size-fits-all solution, the Go standard library only offers a package to handle the most common family of databases, through the sql package (import path database/sql).
The sql package was built to provide a common set of functions used to communicate with SQL databases such as MySQL, PostgreSQL, and SQLite. For each of these databases, a separate driver has been written to fit into the framework defined by the sql package. These drivers, along with various others, can be found on GitHub and easily integrated with your project. The core of the sql package provides methods for opening and closing database connections, querying the database, iterating through rows of results, and performing inserts and modifications to the data. By mandating a standard interface for drivers, Go allows you to swap out your database for another SQL database with minimal effort.
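As a brief sketch of the database/sql workflow just described, the following stores one parsed result and reads it back. The PostgreSQL driver, connection string, and products table are illustrative assumptions, not part of the book's code.

```go
package main

import (
	"database/sql"
	"log"

	// Any driver that registers itself with database/sql will work;
	// this PostgreSQL driver is just one example.
	_ "github.com/lib/pq"
)

func main() {
	// The connection string is a placeholder.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/scraper?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Insert one parsed result; the table and columns are hypothetical.
	_, err = db.Exec(
		`INSERT INTO products (url, title, price_cents) VALUES ($1, $2, $3)`,
		"https://example.com/item/1", "Example Widget", 1999,
	)
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through rows of results.
	rows, err := db.Query(`SELECT url, title FROM products`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var url, title string
		if err := rows.Scan(&url, &title); err != nil {
			log.Fatal(err)
		}
		log.Printf("%s: %s", title, url)
	}
}
```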

Logs

One system that is often overlooked during the design of a scraping system is the logging system. It is important, first and foremost, to have clear log statements without logging too many unnecessary items. These statements should inform the operator of the current status of scraping and of any errors or successes the scraper encounters. This helps you get a picture of the overall health of your web scraper.
The simplest logging that can be done is printing messages to the terminal with println() or fmt.Println() type statements. This works well enough for a single node but, as your scraper grows into a distributed architecture, it causes problems. In order to check how things are running in your system, an operator would need to log into each machine to look at the logs. If there is an actual problem in the system, it may be difficult to diagnose by trying to piece together logs from multiple sources. A logging system built for distributed computing would be ideal at this point.
There are many logging solutions available in the open source world. One of the more popular choices is Graylog. Setting up a Graylog server is a simple process, requiring a MongoDB database and an Elasticsearch database to support it. Graylog defines a JSON format called GELF for sending log data to its servers, and accepts a very flexible set of keys. Graylog servers can accept log streams from multiple sources and you can define post-processing actions as well, such as reformatting data and sending alerts based on user-defined rules. There are many other similar systems, as well as paid services, that offer very similar features.
As there are various logging solutions, the open source community has built a library that eases the burden of integrating with different systems. The logrus package by GitHub user sirupsen provides a standard utility for writing log statements, as well as a plugin architecture for log formatters. Many people have built formatters for logging statements, including one for GELF statements to be sent to a Graylog server. If you decide to change your logging server during the development of your scraper, you need only to change the formatter instead of replacing all of your log statements.
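Here is a minimal sketch of structured logging with logrus. The field names and messages are illustrative; the point is that the JSONFormatter could later be swapped for a GELF or other formatter without changing any of the log statements.

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Structured JSON output is easy for a central log collector to ingest.
	// Swapping in a different formatter only changes this one line.
	log.SetFormatter(&log.JSONFormatter{})

	log.WithFields(log.Fields{
		"url":    "https://example.com/item/1", // placeholder values
		"status": 200,
		"worker": 3,
	}).Info("page scraped")

	log.WithFields(log.Fields{
		"url": "https://example.com/item/2",
	}).Warn("request blocked, retrying through a proxy")
}
```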

Scraping HTML pages with colly

colly is one of the available projects on GitHub that covers most of the systems discussed earlier. This project is built to run on a single machine, due to its reliance on a local cache and queuing system.
The main worker object in colly, the Collector, is built to run in its own goroutine, allowing you to run multiple Collectors simultaneously. This design...
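As a minimal sketch of a single Collector that visits a start page and follows the links it finds (the domain and start URL are placeholders):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// One Collector; several can be run concurrently in their own goroutines.
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"), // placeholder domain
	)

	// Follow every link found on each page.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL)
	})

	if err := c.Visit("https://example.com/"); err != nil {
		log.Fatal(err)
	}
}
```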

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. About Packt
  4. Contributors
  5. Preface
  6. Introducing Web Scraping and Go
  7. The Request/Response Cycle
  8. Web Scraping Etiquette
  9. Parsing HTML
  10. Web Scraping Navigation
  11. Protecting Your Web Scraper
  12. Scraping with Concurrency
  13. Scraping at 100x
  14. Other Books You May Enjoy