Cleaning data using Airflow
Now that you can clean your data in Python, you can write functions to perform the different cleaning tasks. By combining those functions, you can build a data pipeline in Airflow. The following example will clean data, then filter it, and write the result to disk.
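Before wiring anything into Airflow, the three steps can be sketched as ordinary Python functions. This is a minimal sketch only: the column names, the filter condition, and the function names are assumptions for illustration, not the actual data used later in this section:

```python
import io
import pandas as pd

def clean_data(df):
    # Drop rows with missing values and normalize a text column
    # ('city' is an assumed column name).
    df = df.dropna().copy()
    df['city'] = df['city'].str.strip().str.title()
    return df

def filter_data(df):
    # Keep only rows for one city of interest (assumed condition).
    return df[df['city'] == 'Albuquerque']

def write_data(df, path_or_buffer):
    # Write the cleaned, filtered frame out as CSV.
    df.to_csv(path_or_buffer, index=False)

# Tiny made-up frame to exercise the pipeline.
df = pd.DataFrame({
    'name': ['Ann', 'Bob', None],
    'city': [' albuquerque ', 'santa fe', 'albuquerque'],
})
result = filter_data(clean_data(df))
buffer = io.StringIO()
write_data(result, buffer)
```

Each function takes a DataFrame and returns (or writes) one, so the steps compose cleanly; that is the same shape the Airflow tasks will take.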
Starting with the same Airflow code you have used in the previous examples, set up the imports and the default arguments, as shown:
import datetime as dt
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
import pandas as pd
default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2020, 4, 13),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}
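The default_args above will be passed to the DAG that ties the cleaning functions together. A rough sketch of that wiring, using the imports already shown; the DAG id, the schedule, and the callable names are assumptions here, since the real functions are written next:

```python
# Sketch only: 'CleanData', the schedule, and the task callables
# (clean_data, filter_data) are assumed names.
with DAG('CleanData',
         default_args=default_args,
         schedule_interval=timedelta(minutes=5)) as dag:
    cleanData = PythonOperator(task_id='clean',
                               python_callable=clean_data)
    filterData = PythonOperator(task_id='filter',
                                python_callable=filter_data)

    # The bitshift operator sets task order: clean runs before filter.
    cleanData >> filterData
```

Each PythonOperator simply names a task and points at a plain Python callable, which is why the cleaning logic can be written and tested as standalone functions first.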
Now you can write the functions that will perform the cleaning tasks. First...