Data Pipelines with Apache Airflow

Julian de Ruiter, Bas Harenslak

About This Book

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

Information

Year
2021
ISBN
9781617296901

Part 1. Getting started

This part of the book will set the stage for your journey into building pipelines for all kinds of wonderful data processes using Apache Airflow. The first two chapters are aimed at giving you an overview of what Airflow is and what it can do for you.
First, in chapter 1, we’ll explore the concepts of data pipelines and sketch the role Apache Airflow plays in helping you implement these pipelines. To set expectations, we’ll also compare Airflow to several other technologies, and discuss when it might or might not be a good fit for your specific use case. Next, chapter 2 will teach you how to implement your first pipeline in Airflow. After building the pipeline, we’ll also examine how to run this pipeline and monitor its progress using Airflow’s web interface.
Chapters 3–5 dive deeper into key concepts of Airflow to give you a solid understanding of Airflow’s underpinnings.
Chapter 3 focuses on scheduling semantics, which allow you to configure Airflow to run your pipelines at regular intervals. This lets you (for example) write pipelines that load and process data efficiently on a daily, weekly, or monthly basis. Next, in chapter 4, we’ll discuss templating mechanisms in Airflow, which allow you to dynamically reference variables such as execution dates in your pipelines. Finally, in chapter 5, we’ll dive into different approaches for defining task dependencies in your pipelines, which allow you to define complex task hierarchies, including conditional tasks, branches, and so on.
If you’re new to Airflow, we recommend making sure you understand the main concepts described in chapters 3–5, as these are key to using it effectively. Airflow’s scheduling semantics (described in chapter 3) can be especially confusing for new users, as they can be somewhat counterintuitive when first encountered.
After finishing part 1, you should be well-equipped to write your own basic pipelines in Apache Airflow and be ready to dive into some more advanced topics in parts 2–4.

1 Meet Apache Airflow

This chapter covers
  • Showing how data pipelines can be represented in workflows as graphs of tasks
  • Understanding how Airflow fits into the ecosystem of workflow managers
  • Determining if Airflow is a good fit for you
People and companies are continuously becoming more data-driven and are developing data pipelines as part of their daily business. Data volumes involved in these business processes have increased substantially over the years, from megabytes per day to gigabytes per minute. Though handling this data deluge may seem like a considerable challenge, these increasing data volumes can be managed with the appropriate tooling.
This book focuses on Apache Airflow, a batch-oriented framework for building data pipelines. Airflow’s key feature is that it enables you to easily build scheduled data pipelines using a flexible Python framework, while also providing many building blocks that allow you to stitch together the many different technologies encountered in modern technological landscapes.
Airflow is best thought of as a spider in a web: it sits in the middle of your data processes and coordinates work happening across the different (distributed) systems. As such, Airflow is not a data processing tool in itself but orchestrates the different components responsible for processing your data in data pipelines.
In this chapter, we’ll first give you a short introduction to data pipelines in Apache Airflow. Afterward, we’ll discuss several considerations to keep in mind when evaluating whether Airflow is right for you and demonstrate how to take your first steps with Airflow.

1.1 Introducing data pipelines

Data pipelines generally consist of several tasks or actions that need to be executed to achieve the desired result. For example, say we want to build a small weather dashboard that tells us what the weather will be like in the coming week (figure 1.1). To implement this live weather dashboard, we need to perform something like the following steps:
  1. Fetch weather forecast data from a weather API.
  2. Clean or otherwise transform the fetched data (e.g., converting temperatures from Fahrenheit to Celsius or vice versa), so that the data suits our purpose.
  3. Push the transformed data to the weather dashboard.
Figure 1.1 Overview of the weather dashboard use case, in which weather data is fetched from an external API and fed into a dynamic dashboard
As you can see, this relatively simple pipeline already consists of three different tasks that each perform part of the work. Moreover, these tasks need to be executed in a specific order, as it (for example) doesn’t make sense to try transforming the data before fetching it. Similarly, we can’t push any new data to the dashboard until it has undergone the required transformations. As such, we need to make sure that this implicit task order is also enforced when running this data process.
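To make the ordering requirement concrete, the three steps can be sketched as plain Python functions that are called one after another. This is only an illustration under stated assumptions, not how Airflow itself runs tasks: the function names, the hard-coded sample readings (standing in for a real weather API), and the list used as a "dashboard" are all hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def fetch_forecast() -> List[Row]:
    """Step 1: stand-in for a weather API call (hypothetical sample data)."""
    return [
        {"day": 1, "temp_f": 68.0},
        {"day": 2, "temp_f": 50.0},
    ]


def clean_forecast(raw: List[Row]) -> List[Row]:
    """Step 2: convert temperatures from Fahrenheit to Celsius."""
    return [
        {"day": row["day"], "temp_c": round((row["temp_f"] - 32) * 5 / 9, 1)}
        for row in raw
    ]


def push_to_dashboard(rows: List[Row], dashboard: List[Row]) -> None:
    """Step 3: 'publish' the rows; here the dashboard is just a list."""
    dashboard.extend(rows)


def run_pipeline(dashboard: List[Row]) -> List[Row]:
    """Run the steps in the only order that makes sense."""
    raw = fetch_forecast()                 # must happen first
    cleaned = clean_forecast(raw)          # needs the fetched data
    push_to_dashboard(cleaned, dashboard)  # needs the cleaned data
    return dashboard
```

Here the order is enforced only by the sequence of calls inside run_pipeline, with each step consuming the output of the previous one. Nothing outside that function knows about the dependencies, which is the gap an explicit representation of task order is meant to close.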

1.1.1 Data pipelines as graphs

One way to make dependencies between tasks more explicit is to draw the data pipeline as a graph. In this graph-based representation, tasks are represented as nodes in the graph, while dependencies between tasks are represented by directed edges between the task nodes. The direction of an edge indicates the direction of the dependency: an edge pointing from task A to task B means that task A needs to be completed before task B can start. Note that this type of graph is generally called a directed graph, because its edges have a direction.
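As a rough sketch of this idea, a directed graph can be written down as a list of (upstream, downstream) edges, from which a valid execution order can be derived. The example below uses Python's standard-library graphlib module (Python 3.9+) rather than anything from Airflow. The task names are hypothetical labels for the three weather dashboard steps, and execution_order is a helper introduced here for illustration only.

```python
from graphlib import TopologicalSorter
from typing import List, Tuple

# Each edge points from the task that must finish first (upstream)
# to the task that depends on it (downstream). Names are hypothetical.
edges: List[Tuple[str, str]] = [
    ("fetch_forecast", "clean_forecast"),
    ("clean_forecast", "push_to_dashboard"),
]


def execution_order(edges: List[Tuple[str, str]]) -> List[str]:
    """Return the tasks in an order that respects every directed edge."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): downstream waits for upstream
        sorter.add(downstream, upstream)
    # static_order() raises graphlib.CycleError if the edges form a cycle
    return list(sorter.static_order())


order = execution_order(edges)
print(order)
```

Because the three tasks form a simple chain, only one order satisfies both edges. With branching graphs, several orders may be valid, and the sorter returns one of them.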
Applying this graph representation to our weather dashboard pipeline, we can see that the graph provides a relatively intuitive representation of the overall pipeline (figure 1.2). By just quickly glancing at the graph, we can see that our pipeline consists of three different tasks, each corresponding to one of the tasks outlined above. Beyond this, the direction of the edges clearly indicates the order in which the tasks need to be executed: we can simply ...

Table of contents

  1. inside front cover
  2. Data Pipelines with Apache Airflow
  3. Copyright
  4. brief contents
  5. contents
  6. front matter
  7. Part 1. Getting started
  8. 1 Meet Apache Airflow
  9. 2 Anatomy of an Airflow DAG
  10. 3 Scheduling in Airflow
  11. 4 Templating tasks using the Airflow context
  12. 5 Defining dependencies between tasks
  13. Part 2. Beyond the basics
  14. 6 Triggering workflows
  15. 7 Communicating with external systems
  16. 8 Building custom components
  17. 9 Testing
  18. 10 Running tasks in containers
  19. Part 3. Airflow in practice
  20. 11 Best practices
  21. 12 Operating Airflow in production
  22. 13 Securing Airflow
  23. 14 Project: Finding the fastest way to get around NYC
  24. Part 4. In the clouds
  25. 15 Airflow in the clouds
  26. 16 Airflow on AWS
  27. 17 Airflow on Azure
  28. 18 Airflow in GCP
  29. appendix A. Running code samples
  30. appendix B. Package structures Airflow 1 and 2
  31. appendix C. Prometheus metric mapping
  32. index
  33. inside back cover
Citation styles for Data Pipelines with Apache Airflow

APA 6 Citation

Ruiter, J., & Harenslak, B. (2021). Data Pipelines with Apache Airflow ([edition unavailable]). Manning Publications. Retrieved from https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf (Original work published 2021)

Chicago Citation

Ruiter, Julian, and Bas Harenslak. (2021) 2021. Data Pipelines with Apache Airflow. [Edition unavailable]. Manning Publications. https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf.

Harvard Citation

Ruiter, J. and Harenslak, B. (2021) Data Pipelines with Apache Airflow. [edition unavailable]. Manning Publications. Available at: https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf (Accessed: 15 October 2022).

MLA 7 Citation

Ruiter, Julian, and Bas Harenslak. Data Pipelines with Apache Airflow. [edition unavailable]. Manning Publications, 2021. Web. 15 Oct. 2022.