Building Big Data Pipelines with Apache Beam

Jan Lukavsky

eBook - ePub, 342 pages, English
About This Book

Implement, run, operate, and test data processing pipelines using Apache Beam.

Key Features
  • Understand how to improve usability and productivity when implementing Beam pipelines
  • Learn how to use stateful processing to implement complex use cases with Apache Beam
  • Implement, test, and run Apache Beam pipelines with the help of expert tips and techniques

Book Description
Apache Beam is an open source, unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing.
This book will help you confidently build data processing pipelines with Apache Beam. You'll start with an overview of Apache Beam and learn how to use it to implement basic pipelines. You'll also learn how to test and run pipelines efficiently. As you progress, you'll explore how to structure your code for reusability and how to use various Domain-Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you'll cover advanced Apache Beam concepts, such as implementing your own I/O connectors.
By the end of this book, you'll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.

What you will learn
  • Understand the core concepts and architecture of Apache Beam
  • Implement stateless and stateful data processing pipelines
  • Use state and timers for real-time event processing
  • Structure your code for reusability
  • Use streaming SQL to process real-time data and increase productivity and data accessibility
  • Run a pipeline using a portable runner and implement data processing using the Apache Beam Python SDK
  • Implement Apache Beam I/O connectors using the Splittable DoFn API

Who this book is for
This book is for data engineers, data scientists, and data analysts who want to learn how Apache Beam works. Intermediate-level knowledge of the Java programming language is assumed.


Information

Year: 2022
ISBN: 9781800566569

Section 1 Apache Beam: Essentials

This section provides a general introduction to how most streaming data processing systems work, what the general properties of data streams are, and which problems need to be solved for computational correctness and for balancing throughput and latency in the context of Apache Beam. This section also covers how pipelines are implemented, tested, and run.
This section comprises the following chapters:
  • Chapter 1, Introduction to Data Processing with Apache Beam
  • Chapter 2, Implementing, Testing, and Deploying Basic Pipelines
  • Chapter 3, Implementing Pipelines Using Stateful Processing

Chapter 1: Introduction to Data Processing with Apache Beam

Data. Big data. Real-time data. Data streams. Many buzzwords to describe many things, and yet they have many common properties. Mind-blowing applications can be developed from the successful application of (theoretically) simple logic – take data and produce knowledge. However, a simple-sounding task can turn out to be difficult when the amount of data needed to produce knowledge is huge (and still growing). Given the vast volumes of data produced by humanity every day, which tools should we choose to turn our simple logic into scalable solutions? That is, solutions that protect our investment in creating the data extraction logic, even in the presence of new requirements arising or changing on a daily basis, and new data processing technologies being created? This book focuses on why Apache Beam might be a good solution to these challenges, and it will guide you through the Beam learning process.
In this chapter, we will cover the following topics:
  • Why Apache Beam?
  • Writing your first pipeline
  • Running a pipeline against streaming data
  • Exploring the key properties of unbounded data
  • Measuring the event time progress inside data streams
  • Assigning data to windows
  • Unifying batch and streaming data processing

Technical requirements

In this chapter, we will introduce some elementary pipelines written using Beam's Java Software Development Kit (SDK).
We will use the code located in the GitHub repository for this book: https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.
We will also need the following tools to be installed:
  • Java Development Kit (JDK) 11 (possibly OpenJDK 11), with JAVA_HOME set appropriately
  • Git
  • Bash
    Important note
    Although it is possible to run many tools in this book using the Windows shell, we will focus on using Bash scripting only. We hope Windows users will be able to run Bash using virtualization or Windows Subsystem for Linux (or any similar technology).
First of all, we need to clone the repository:
  1. To do this, we create a suitable directory, and then we run the following command:
    $ git clone https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.git
  2. This will result in a directory, Building-Big-Data-Pipelines-with-Apache-Beam, being created in the working directory. We then run the following command in this newly created directory:
    $ ./mvnw clean install
Throughout this book, the $ character denotes a Bash shell prompt. Therefore, $ ./mvnw clean install means running the ./mvnw command in the top-level directory of the git clone (that is, Building-Big-Data-Pipelines-with-Apache-Beam), while chapter1$ ../mvnw clean install means running the specified command in the chapter1 subdirectory.

Why Apache Beam?

There are two basic questions we might ask when considering a new technology to learn and apply in practice:
  • What problem am I struggling with that the new technology can help me solve?
  • What would the costs associated with the technology be?
Every sound technology has a well-defined selling point – that is, something that justifies its existence in the presence of competing technologies. In the case of Beam, this selling point could be reduced to a single word: portability. Beam is portable on several layers:
  • Beam's pipelines are portable between multiple runners (that is, a technology that executes the distributed computation described by a pipeline's author).
  • Beam's data processing model is portable between various programming languages.
  • Beam's data processing logic is portable between bounded and unbounded data.
Each of these points deserves a few words of explanation. By runner portability, we mean the ability to run existing pipelines, written in one of the supported programming languages (for instance, Java, Python, Go, Scala, or even SQL), against a data processing engine that can be chosen at runtime. Typical examples of runners are Apache Flink, Apache Spark, and Google Cloud Dataflow. However, Beam is by no means limited to these; new runners are created as new technologies arise, and it's very likely that many more will be developed.
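To illustrate what choosing a runner at runtime looks like, here is a minimal sketch using Beam's Java SDK (not taken from the book's repository; the input.txt and output paths are placeholders). The pipeline code never names a concrete runner; passing a command-line flag such as --runner=DirectRunner or --runner=FlinkRunner selects the execution engine when the program starts.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerPortabilityExample {
  public static void main(String[] args) {
    // The runner is never referenced in the pipeline code; it is picked from
    // the command line, for example --runner=DirectRunner or --runner=FlinkRunner.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // A trivial transform chain; the same code runs unchanged on any runner.
    // The file paths are placeholders for illustration only.
    pipeline
        .apply(TextIO.read().from("input.txt"))
        .apply(TextIO.write().to("output"));

    pipeline.run().waitUntilFinish();
  }
}

Switching runners therefore requires no code change, only a different flag and the corresponding runner dependency on the classpath.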
When we say Beam's data processing model is portable between various programming languages, we mean that it can support multiple SDKs, regardless of the language or technology used by the runner. This way, we can code Beam pipelines in the Go language and then run them against the Apache Flink runner, which is written in Java.
Last but not least, the core of Apache Beam's model is designed so that it is portable between bounded and unbounded data. Bounded data is what was historically called batch processing, while unbounded data refers to real-time processing (that is, an application crunching live data as it arrives in the system and producing a low-latency output).
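As a small illustration of this unification, the following sketch (again only an illustrative example, not code from the book's repository; GenerateSequence stands in for a real streaming source and input.txt is a placeholder file) applies the same counting logic to a bounded file read and to an unbounded generated stream. The only extra step on the unbounded side is assigning elements to fixed windows so that the grouping can emit results.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class BoundedAndUnboundedExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded input: a finite file, historically the "batch" case.
    PCollection<String> bounded =
        pipeline.apply("ReadFile", TextIO.read().from("input.txt"));

    // Unbounded input: an endless sequence emitted once per second,
    // standing in for a live stream such as a message queue.
    PCollection<String> unbounded =
        pipeline
            .apply("GenerateStream",
                GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
            .apply("FormatLongs",
                MapElements.into(TypeDescriptors.strings()).via((Long l) -> "element-" + l))
            // Unbounded data must be assigned to windows before a grouping
            // transform such as Count can produce results.
            .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

    // The same counting logic is applied to both collections unchanged.
    bounded.apply("CountBatch", Count.<String>perElement());
    unbounded.apply("CountStream", Count.<String>perElement());

    pipeline.run();
  }
}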
Putting these pieces together, we can describe Beam as a tool that lets you deal with your big data architecture with the following vision:
Choose your preferred language, write your data processing pipeline, run this pipeline using a runner of your choice, and do all of this for both batch and real-time data at the same time.
Because everything comes at a price, you should expect to pay for flexibility like this: the price is a somewhat higher overhead in terms of CPU and/or memory usage. The Beam community works...
