Data Pipelines with Apache Airflow
eBook - ePub

Julian de Ruiter, Bas Harenslak

About the book

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

Information

Year
2021
ISBN
9781617296901

Part 1. Getting started

This part of the book will set the stage for your journey into building pipelines for all kinds of wonderful data processes using Apache Airflow. The first two chapters are aimed at giving you an overview of what Airflow is and what it can do for you.
First, in chapter 1, we’ll explore the concepts of data pipelines and sketch the role Apache Airflow plays in helping you implement these pipelines. To set expectations, we’ll also compare Airflow to several other technologies, and discuss when it might or might not be a good fit for your specific use case. Next, chapter 2 will teach you how to implement your first pipeline in Airflow. After building the pipeline, we’ll also examine how to run this pipeline and monitor its progress using Airflow’s web interface.
Chapters 3–5 dive deeper into key concepts of Airflow to give you a solid understanding of Airflow’s underpinnings.
Chapter 3 focuses on scheduling semantics, which allow you to configure Airflow to run your pipelines at regular intervals. This lets you (for example) write pipelines that load and process data efficiently on a daily, weekly, or monthly basis. Next, in chapter 4, we’ll discuss templating mechanisms in Airflow, which allow you to dynamically reference variables such as execution dates in your pipelines. Finally, in chapter 5, we’ll dive into different approaches for defining task dependencies in your pipelines, which allow you to define complex task hierarchies, including conditional tasks, branches, and so on.
If you’re new to Airflow, we recommend making sure you understand the main concepts described in chapters 3–5, as these are key to using it effectively. Airflow’s scheduling semantics (described in chapter 3) can be especially confusing for new users, as they can be somewhat counterintuitive when first encountered.
After finishing part 1, you should be well-equipped to write your own basic pipelines in Apache Airflow and be ready to dive into some more advanced topics in parts 2–4.

1 Meet Apache Airflow

This chapter covers
  • Showing how data pipelines can be represented in workflows as graphs of tasks
  • Understanding how Airflow fits into the ecosystem of workflow managers
  • Determining if Airflow is a good fit for you
People and companies are continuously becoming more data-driven and are developing data pipelines as part of their daily business. Data volumes involved in these business processes have increased substantially over the years, from megabytes per day to gigabytes per minute. Though handling this data deluge may seem like a considerable challenge, these increasing data volumes can be managed with the appropriate tooling.
This book focuses on Apache Airflow, a batch-oriented framework for building data pipelines. Airflow’s key feature is that it enables you to easily build scheduled data pipelines using a flexible Python framework, while also providing many building blocks that allow you to stitch together the many different technologies encountered in modern technological landscapes.
Airflow is best thought of as a spider in a web: it sits in the middle of your data processes and coordinates work happening across the different (distributed) systems. As such, Airflow is not a data processing tool in itself but orchestrates the different components responsible for processing your data in data pipelines.
In this chapter, we’ll first give you a short introduction to data pipelines in Apache Airflow. Afterward, we’ll discuss several considerations to keep in mind when evaluating whether Airflow is right for you and demonstrate how to make your first steps with Airflow.

1.1 Introducing data pipelines

Data pipelines generally consist of several tasks or actions that need to be executed to achieve the desired result. For example, say we want to build a small weather dashboard that tells us what the weather will be like in the coming week (figure 1.1). To implement this live weather dashboard, we need to perform something like the following steps:
  1. Fetch weather forecast data from a weather API.
  2. Clean or otherwise transform the fetched data (e.g., converting temperatures from Fahrenheit to Celsius or vice versa), so that the data suits our purpose.
  3. Push the transformed data to the weather dashboard.
Figure 1.1 Overview of the weather dashboard use case, in which weather data is fetched from an external API and fed into a dynamic dashboard
As you can see, this relatively simple pipeline already consists of three different tasks that each perform part of the work. Moreover, these tasks need to be executed in a specific order, as it (for example) doesn’t make sense to try transforming the data before fetching it. Similarly, we can’t push any new data to the dashboard until it has undergone the required transformations. As such, we need to make sure that this implicit task order is also enforced when running this data process.
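The three steps above can be sketched as plain Python functions called in a fixed order. This is only an illustration of the ordering constraint, not Airflow code: the function names, the sample data, and the in-memory list standing in for the dashboard are all hypothetical, and a real pipeline would call an actual weather API and dashboard backend.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def fetch_forecast() -> List[Row]:
    """Step 1: stand-in for calling a weather API (hypothetical sample data)."""
    return [
        {"day": "Mon", "temp_f": 68.0},
        {"day": "Tue", "temp_f": 50.0},
    ]


def clean_forecast(raw: List[Row]) -> List[Row]:
    """Step 2: convert temperatures from Fahrenheit to Celsius."""
    return [
        {"day": row["day"], "temp_c": round((row["temp_f"] - 32) * 5 / 9, 1)}
        for row in raw
    ]


def push_to_dashboard(rows: List[Row], dashboard: List[Row]) -> int:
    """Step 3: stand-in for updating the dashboard (here, a plain list)."""
    dashboard.extend(rows)
    return len(rows)


def run_pipeline(dashboard: List[Row]) -> int:
    """Run the steps in the only order that makes sense: fetch, clean, push."""
    raw = fetch_forecast()
    cleaned = clean_forecast(raw)
    return push_to_dashboard(cleaned, dashboard)
```

Here the task order is enforced only by the order of the calls inside `run_pipeline`, so the dependencies remain implicit in the code.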

1.1.1 Data pipelines as graphs

One way to make dependencies between tasks more explicit is to draw the data pipeline as a graph. In this graph-based representation, tasks are represented as nodes in the graph, while dependencies between tasks are represented by directed edges between the task nodes. The direction of an edge indicates the direction of the dependency: an edge pointing from task A to task B indicates that task A needs to be completed before task B can start. Note that this type of graph is generally called a directed graph because its edges have a direction.
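As a minimal sketch of this representation, a directed graph of tasks can be stored as a mapping from each task to the tasks it depends on, and a valid execution order can be derived from it. The example below uses Python's standard-library `graphlib` module (available from Python 3.9) rather than Airflow itself, and the task names are hypothetical labels for the three weather dashboard steps from section 1.1.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each key is a task (node); each value is the set of tasks that must
# finish first. An entry {"B": {"A"}} encodes an edge pointing from A to B.
dependencies: Dict[str, Set[str]] = {
    "fetch_forecast": set(),
    "clean_forecast": {"fetch_forecast"},
    "push_to_dashboard": {"clean_forecast"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return the tasks in an order that respects every dependency edge."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because every task in this graph has at most one upstream task, only one ordering satisfies the edges: fetch, then clean, then push.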
Applying this graph representation to our weather dashboard pipeline, we can see that the graph provides a relatively intuitive representation of the overall pipeline (figure 1.2). By just quickly glancing at the graph, we can see that our pipeline consists of three different tasks, each corresponding to one of the tasks outlined. Other than this, the direction of the edges clearly indicates the order in which the tasks need to be executed: we can simply ...

Table of contents

  1. inside front cover
  2. Data Pipelines with Apache Airflow
  3. Copyright
  4. brief contents
  5. contents
  6. front matter
  7. Part 1. Getting started
  8. 1 Meet Apache Airflow
  9. 2 Anatomy of an Airflow DAG
  10. 3 Scheduling in Airflow
  11. 4 Templating tasks using the Airflow context
  12. 5 Defining dependencies between tasks
  13. Part 2. Beyond the basics
  14. 6 Triggering workflows
  15. 7 Communicating with external systems
  16. 8 Building custom components
  17. 9 Testing
  18. 10 Running tasks in containers
  19. Part 3. Airflow in practice
  20. 11 Best practices
  21. 12 Operating Airflow in production
  22. 13 Securing Airflow
  23. 14 Project: Finding the fastest way to get around NYC
  24. Part 4. In the clouds
  25. 15 Airflow in the clouds
  26. 16 Airflow on AWS
  27. 17 Airflow on Azure
  28. 18 Airflow in GCP
  29. appendix A. Running code samples
  30. appendix B. Package structures Airflow 1 and 2
  31. appendix C. Prometheus metric mapping
  32. index
  33. inside back cover
Citation styles for Data Pipelines with Apache Airflow

APA 6 Citation

Ruiter, J., & Harenslak, B. (2021). Data Pipelines with Apache Airflow ([edition unavailable]). Manning Publications. Retrieved from https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf (Original work published 2021)

Chicago Citation

Ruiter, Julian, and Bas Harenslak. (2021) 2021. Data Pipelines with Apache Airflow. [Edition unavailable]. Manning Publications. https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf.

Harvard Citation

Ruiter, J. and Harenslak, B. (2021) Data Pipelines with Apache Airflow. [edition unavailable]. Manning Publications. Available at: https://www.perlego.com/book/2606855/data-pipelines-with-apache-airflow-pdf (Accessed: 15 October 2022).

MLA 7 Citation

Ruiter, Julian, and Bas Harenslak. Data Pipelines with Apache Airflow. [edition unavailable]. Manning Publications, 2021. Web. 15 Oct. 2022.