This project automates an ETL pipeline (Extract, Transform, Load) using PySpark and Matplotlib. It extracts data from a raw dataset, transforms it into a clean format, and loads it into Snowflake for further analysis. Finally, it generates a basic visualization of the transformed data.
- First, clone the repository and install the dependencies listed in `requirements.txt` (for example, with `pip install -r requirements.txt`).
- Then go to `extract.py` and run it in the terminal using `python extract.py`.
  - It will ask you to provide the path to the raw data. For example, in your cloned repository, the raw data is stored in `data/raw_data/raw_messy_data100k.csv`.
  - After running, check the `logs` folder for messages indicating the extraction was successful.
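The extract step's prompt-and-log flow might look roughly like the following. This is a minimal stdlib sketch, not the project's actual code: the real `extract.py` uses PySpark, and the function name and log file name here are illustrative.

```python
import logging
from pathlib import Path

def extract(raw_path_str, log_dir="logs"):
    """Illustrative extract step: validate the raw-data path and log the outcome."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        filename=str(Path(log_dir) / "extract.log"),
        level=logging.INFO,
        force=True,  # reset handlers so repeated runs keep logging to this file
    )
    raw_path = Path(raw_path_str)
    if not raw_path.exists():
        logging.error("Raw data not found: %s", raw_path)
        raise FileNotFoundError(raw_path)
    rows = raw_path.read_text().splitlines()
    logging.info("Extraction successful: %d lines read from %s", len(rows), raw_path)
    return rows
```

After a run like this, the success message would appear in `logs/extract.log`, which matches the "check the logs folder" step above.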
- Next, run `transform.py` in the same way.
  - It will prompt you for the extracted data path.
  - At the end, it will ask where to save the cleaned data. Provide the path `data/cleaned_data/cleaned_data.csv`.
  - Once transformed, check the logs again for a message confirming the transformation was successful.
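A transform step of this shape can be sketched with the standard `csv` module. The cleaning rule below (drop any row with a missing value) is an assumption for illustration; the real `transform.py` uses PySpark and may apply different rules.

```python
import csv
from pathlib import Path

def transform(extracted_path, cleaned_path):
    """Illustrative transform step: drop rows with missing values, save the rest."""
    with open(extracted_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        # Keep only rows where every column has a non-empty value.
        clean_rows = [
            row for row in reader
            if all(v not in (None, "") for v in row.values())
        ]
    Path(cleaned_path).parent.mkdir(parents=True, exist_ok=True)
    with open(cleaned_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(clean_rows)
    return len(clean_rows)
```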
- Then, run `load.py` to load the cleaned data into Snowflake.
  - It will ask for the path to the cleaned data (use the `cleaned_data.csv` saved earlier).
  - You'll also be prompted for your Snowflake credentials.
  - After successful loading, check the logs to confirm the data was loaded into Snowflake.
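A common pattern for bulk-loading a CSV into Snowflake is to stage the file with `PUT` and then run `COPY INTO`. The sketch below only builds those SQL strings; the table and stage names are hypothetical, and `load.py` may use a different mechanism entirely.

```python
def build_load_statements(cleaned_csv_path, table="EMPLOYEES"):
    """Build PUT and COPY INTO statements for bulk-loading a CSV into Snowflake.
    The table name (and its implicit table stage @%TABLE) is illustrative,
    not taken from the project."""
    stage = f"@%{table}"  # the table's built-in stage
    put_stmt = f"PUT file://{cleaned_csv_path} {stage} AUTO_COMPRESS=TRUE"
    copy_stmt = (
        f"COPY INTO {table} FROM {stage} "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
    return put_stmt, copy_stmt
```

In practice these statements would be executed through a `snowflake.connector` cursor after connecting with the credentials the script prompts for.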
- Finally, to visualize the data, go to the `visualization` folder and run `analyze.py` in the terminal.
  - This will generate a bar chart showing department salaries for departments with more than 1000 employees.
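The aggregation behind that chart can be sketched in plain Python, assuming the chart plots average salary per department (the real `analyze.py` does the plotting with Matplotlib; the function name and input shape here are illustrative):

```python
from collections import defaultdict

def salary_by_large_department(rows, min_employees=1000):
    """Average salary per department, keeping only departments with more than
    `min_employees` employees -- the data a bar chart like this would plot."""
    totals = defaultdict(lambda: [0.0, 0])  # dept -> [salary sum, employee count]
    for dept, salary in rows:
        totals[dept][0] += salary
        totals[dept][1] += 1
    return {
        dept: total / count
        for dept, (total, count) in totals.items()
        if count > min_employees
    }
```

Feeding the resulting dict's keys and values to Matplotlib's `plt.bar` would produce the kind of chart described above.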
This ETL pipeline extracts raw data, transforms it, loads it into Snowflake, and generates a basic visualization.