Deep Dive: What the heck is Airflow?
This is the first installment in the Deep Dive series, where I go deep on a particular product or category. Some of these will be free, and some will be paid. This one is paid and was a special request by a paid subscriber. I hope you enjoy!
A Short History of Orchestration
Apache Airflow belongs to a class of tools called orchestrators, but to understand what it is and why people use it, we need to travel back to its origin at Airbnb.
Airflow was created at Airbnb in 2014 and released publicly in 2015. The original blog post announcing the release is still up and is a good reminder of where Airflow came from and what the world was like then.
At Airbnb, data engineers used tools like Apache Hive as a data warehouse, with much of their infrastructure built on Hadoop and Spark. There were many problems to be solved and jobs to be done: data extraction, cleaning, quality checks, and long-term storage.
Airbnb was also performing a lot of computation. They needed to know everything from how guests felt about their accommodations to how their hosts felt about their guests. They needed to understand how well their recommendations were doing and whether their experiments were working well. They needed to compute sessions from all the clickstream data on both their app and the web.
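To make the last of those jobs concrete, here is a minimal sketch of what sessionizing clickstream data can look like. This is an illustration only, not Airbnb's actual pipeline: the function name, the event timestamps, and the common 30-minute inactivity cutoff are all my assumptions.

```python
from datetime import datetime, timedelta

# Assumed cutoff: a gap of more than 30 minutes of inactivity starts a new session.
SESSION_GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Group a user's event timestamps into sessions, splitting on long gaps."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap too large (or first event): new session
    return sessions

# Hypothetical clickstream for one user
events = [
    datetime(2015, 6, 1, 9, 0),
    datetime(2015, 6, 1, 9, 10),
    datetime(2015, 6, 1, 11, 0),  # more than 30 minutes later -> new session
]
print(len(sessionize(events)))  # 2
```

At Airbnb's scale this kind of logic ran as distributed Hive or Spark jobs rather than a Python loop, but the grouping idea is the same — and jobs like these are exactly what an orchestrator schedules and chains together.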