Data Pipelines with Apache Airflow
eBook - ePub

Data Pipelines with Apache Airflow

Julian de Ruiter, Bas Harenslak

  1. English
  2. ePUB (mobile-friendly)
  3. Available on iOS and Android

About This Book

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.


Information

Year: 2021
ISBN: 9781617296901

Part 1. Getting started

This part of the book will set the stage for your journey into building pipelines for all kinds of wonderful data processes using Apache Airflow. The first two chapters are aimed at giving you an overview of what Airflow is and what it can do for you.
First, in chapter 1, we’ll explore the concepts of data pipelines and sketch the role Apache Airflow plays in helping you implement these pipelines. To set expectations, we’ll also compare Airflow to several other technologies, and discuss when it might or might not be a good fit for your specific use case. Next, chapter 2 will teach you how to implement your first pipeline in Airflow. After building the pipeline, we’ll also examine how to run this pipeline and monitor its progress using Airflow’s web interface.
Chapters 3–5 dive deeper into key concepts of Airflow to give you a solid understanding of Airflow’s underpinnings.
Chapter 3 focuses on scheduling semantics, which allow you to configure Airflow to run your pipelines at regular intervals. This lets you (for example) write pipelines that load and process data efficiently on a daily, weekly, or monthly basis. Next, in chapter 4, we’ll discuss templating mechanisms in Airflow, which allow you to dynamically reference variables such as execution dates in your pipelines. Finally, in chapter 5, we’ll dive into different approaches for defining task dependencies in your pipelines, which allow you to define complex task hierarchies, including conditional tasks, branches, and so on.
If you’re new to Airflow, we recommend making sure you understand the main concepts described in chapters 3–5, as these are key to using it effectively. Airflow’s scheduling semantics (described in chapter 3) can be especially confusing for new users, as they can be somewhat counterintuitive when first encountered.
After finishing part 1, you should be well-equipped to write your own basic pipelines in Apache Airflow and be ready to dive into some more advanced topics in parts 2–4.

1 Meet Apache Airflow

This chapter covers
  • Showing how data pipelines can be represented as graphs of tasks
  • Understanding how Airflow fits into the ecosystem of workflow managers
  • Determining whether Airflow is a good fit for you
People and companies are continuously becoming more data-driven and are developing data pipelines as part of their daily business. Data volumes involved in these business processes have increased substantially over the years, from megabytes per day to gigabytes per minute. Though handling this data deluge may seem like a considerable challenge, these increasing data volumes can be managed with the appropriate tooling.
This book focuses on Apache Airflow, a batch-oriented framework for building data pipelines. Airflow’s key feature is that it enables you to easily build scheduled data pipelines using a flexible Python framework, while also providing many building blocks that allow you to stitch together the many different technologies encountered in modern technological landscapes.
Airflow is best thought of as a spider in a web: it sits in the middle of your data processes and coordinates work happening across the different (distributed) systems. As such, Airflow is not a data processing tool in itself but orchestrates the different components responsible for processing your data in data pipelines.
In this chapter, we’ll first give you a short introduction to data pipelines in Apache Airflow. Afterward, we’ll discuss several considerations to keep in mind when evaluating whether Airflow is right for you and demonstrate how to make your first steps with Airflow.

1.1 Introducing data pipelines

Data pipelines generally consist of several tasks or actions that need to be executed to achieve the desired result. For example, say we want to build a small weather dashboard that tells us what the weather will be like in the coming week (figure 1.1). To implement this live weather dashboard, we need to perform something like the following steps:
  1. Fetch weather forecast data from a weather API.
  2. Clean or otherwise transform the fetched data (e.g., converting temperatures from Fahrenheit to Celsius or vice versa), so that the data suits our purpose.
  3. Push the transformed data to the weather dashboard.
Figure 1.1 Overview of the weather dashboard use case, in which weather data is fetched from an external API and fed into a dynamic dashboard
As you can see, this relatively simple pipeline already consists of three different tasks that each perform part of the work. Moreover, these tasks need to be executed in a specific order, as it (for example) doesn’t make sense to try transforming the data before fetching it. Similarly, we can’t push any new data to the dashboard until it has undergone the required transformations. As such, we need to make sure that this implicit task order is also enforced when running this data process.
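The three tasks above, and the order they must run in, can be sketched as plain Python functions. This is only an illustration: the function names and sample data are hypothetical, and a real implementation would call an actual weather API and dashboard backend rather than the stubs shown here.

```python
# Hypothetical sketch of the weather dashboard pipeline as three ordered tasks.
# fetch/push are stubbed so the task ordering itself is the focus.

def fetch_weather():
    # Stand-in for a call to a weather API (temperatures in Fahrenheit).
    return [{"day": "mon", "temp_f": 68.0}, {"day": "tue", "temp_f": 77.0}]

def clean_weather(raw):
    # Transform step: convert Fahrenheit to Celsius so the data suits our purpose.
    return [
        {"day": r["day"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in raw
    ]

def push_to_dashboard(records):
    # Stand-in for writing the transformed data to the dashboard.
    return {"pushed": len(records)}

# The implicit task order must be enforced: fetch -> clean -> push.
raw = fetch_weather()
cleaned = clean_weather(raw)
result = push_to_dashboard(cleaned)
print(result)  # {'pushed': 2}
```

Running the steps in any other order fails for the same reason the text gives: there is nothing to transform before fetching, and nothing to push before transforming.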

1.1.1 Data pipelines as graphs

One way to make dependencies between tasks more explicit is to draw the data pipeline as a graph. In this graph-based representation, tasks are represented as nodes, while dependencies between tasks are represented by directed edges between the task nodes. The direction of an edge indicates the direction of the dependency: an edge pointing from task A to task B indicates that task A must be completed before task B can start. Because its edges have directions, this type of graph is generally called a directed graph.
Applying this graph representation to our weather dashboard pipeline, we can see that the graph provides a relatively intuitive representation of the overall pipeline (figure 1.2). By just quickly glancing at the graph, we can see that our pipeline consists of three different tasks, each corresponding to one of the tasks outlined. Other than this, the direction of the edges clearly indicates the order in which the tasks need to be executed: we can simply ...
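This graph view can be modeled directly in code. The sketch below (plain Python, with hypothetical task names) represents the weather pipeline as a set of directed edges and derives a valid execution order from them, i.e., a topological sort of the graph:

```python
# The weather pipeline as a directed graph: each edge (a, b) means
# "task a must be completed before task b can start".
tasks = {"fetch_forecast", "clean_forecast", "push_to_dashboard"}
edges = [
    ("fetch_forecast", "clean_forecast"),
    ("clean_forecast", "push_to_dashboard"),
]

def execution_order(tasks, edges):
    """Return a task order that respects every edge (a topological sort)."""
    upstream = {t: set() for t in tasks}
    for a, b in edges:
        upstream[b].add(a)

    order, done = [], set()
    while len(done) < len(tasks):
        # A task is runnable once all of its upstream tasks are done.
        ready = sorted(t for t in tasks if t not in done and upstream[t] <= done)
        if not ready:
            raise ValueError("Cycle detected: the graph is not a DAG")
        order.extend(ready)
        done.update(ready)
    return order

print(execution_order(tasks, edges))
# ['fetch_forecast', 'clean_forecast', 'push_to_dashboard']
```

The cycle check hints at why the direction of the edges matters: only a graph without cycles (a directed *acyclic* graph) admits such an order at all, which is exactly the structure Airflow requires of its pipelines.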

Table of Contents

  1. inside front cover
  2. Data Pipelines with Apache Airflow
  3. Copyright
  4. brief contents
  5. contents
  6. front matter
  7. Part 1. Getting started
  8. 1 Meet Apache Airflow
  9. 2 Anatomy of an Airflow DAG
  10. 3 Scheduling in Airflow
  11. 4 Templating tasks using the Airflow context
  12. 5 Defining dependencies between tasks
  13. Part 2. Beyond the basics
  14. 6 Triggering workflows
  15. 7 Communicating with external systems
  16. 8 Building custom components
  17. 9 Testing
  18. 10 Running tasks in containers
  19. Part 3. Airflow in practice
  20. 11 Best practices
  21. 12 Operating Airflow in production
  22. 13 Securing Airflow
  23. 14 Project: Finding the fastest way to get around NYC
  24. Part 4. In the clouds
  25. 15 Airflow in the clouds
  26. 16 Airflow on AWS
  27. 17 Airflow on Azure
  28. 18 Airflow in GCP
  29. appendix A. Running code samples
  30. appendix B. Package structures Airflow 1 and 2
  31. appendix C. Prometheus metric mapping
  32. index
  33. inside back cover
Citation Styles for Data Pipelines with Apache Airflow

APA 6 Citation

Ruiter, J., & Harenslak, B. (2021). Data Pipelines with Apache Airflow ([edition unavailable]). Manning Publications. Retrieved from https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf (Original work published 2021)

Chicago Citation

Ruiter, Julian, and Bas Harenslak. (2021) 2021. Data Pipelines with Apache Airflow. [Edition unavailable]. Manning Publications. https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf.

Harvard Citation

Ruiter, J. and Harenslak, B. (2021) Data Pipelines with Apache Airflow. [edition unavailable]. Manning Publications. Available at: https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf (Accessed: 15 October 2022).

MLA 7 Citation

Ruiter, Julian, and Bas Harenslak. Data Pipelines with Apache Airflow. [edition unavailable]. Manning Publications, 2021. Web. 15 Oct. 2022.