joyboy123-coder/Big-Data-ETL-with-PySpark-Visual-Analytics


🚀 ETL PySpark Matplotlib Project

📋 Project Overview

This project automates an ETL pipeline (Extract, Transform, Load) using PySpark and Matplotlib. It extracts data from a raw dataset, transforms it into a clean format, and loads it into Snowflake for further analysis. Finally, it generates a basic visualization 📊 of the transformed data.


πŸ› οΈ How to Use the Project

  • First, clone the repository and install the dependencies listed in requirements.txt 📑 (for example, with pip install -r requirements.txt).

  • Run extract.py from the terminal with python extract.py 💻.

    • It will ask you to provide the path to the raw data.
    • In the cloned repository, the raw data is stored at data/raw_data/raw_messy_data100k.csv 📂.
    • After it finishes, check the logs folder 📂 for a message confirming that the extraction was successful ✅.
  • Next, run transform.py 🔄 in the same way.

    • It will prompt you for the path to the extracted data.
    • At the end, it will ask where to save the cleaned data. Provide the path data/cleaned_data/cleaned_data.csv 🗃️.
    • Once it finishes, check the logs again for a message confirming that the transformation was successful 🟢.
  • Then, run load.py to load the cleaned data into Snowflake ❄️.

    • It will ask for the path to the cleaned data (use the cleaned_data.csv saved earlier).
    • You'll also be prompted for your Snowflake credentials 🔑.
    • After successful loading, check the logs to confirm the data was loaded into Snowflake ✅.
  • Finally, to visualize the data 📊, go to the visualization folder and run analyze.py in the terminal.

    • This will generate a bar chart 📉 showing salaries by department, limited to departments with more than 1000 employees 🏢.
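
The chart's selection rule (keep only departments with more than 1000 employees) can be sketched in plain Python. This is a minimal illustration, not the code in analyze.py: the column names department and salary, the function name departments_to_plot, and the use of the mean as the salary aggregate are all assumptions, since the README does not specify them.

```python
from collections import defaultdict


def departments_to_plot(rows, min_employees=1000):
    """Return {department: mean salary} for departments with more than
    `min_employees` rows.

    `rows` is any iterable of dict-like records, e.g. csv.DictReader output.
    The keys "department" and "salary" are hypothetical column names.
    """
    totals = defaultdict(float)  # running salary sum per department
    counts = defaultdict(int)    # number of employees per department

    for row in rows:
        dept = row["department"]
        totals[dept] += float(row["salary"])
        counts[dept] += 1

    # Strictly "more than" the threshold, matching the wording above.
    return {
        dept: totals[dept] / counts[dept]
        for dept in counts
        if counts[dept] > min_employees
    }
```

The returned mapping could then be passed to a plotting call such as Matplotlib's bar function, with the keys as bar labels and the values as bar heights.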

📈 Summary

This ETL pipeline extracts raw data, transforms it, loads it into Snowflake ❄️, and generates a basic visualization 📊.
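
For convenience, the four stages could be chained from one small driver script. The sketch below is not part of the repository. It assumes that extract.py, transform.py, and load.py sit in the repository root and that analyze.py sits in the visualization folder. The names STEPS, build_commands, and run_pipeline are hypothetical. Each script still prompts for its own paths and credentials, so the driver only fixes the order and stops at the first failure.

```python
import subprocess
import sys
from pathlib import Path

# (stage name, working directory relative to the repo root, script file).
# The directory layout is an assumption about the cloned repository.
STEPS = [
    ("extract", ".", "extract.py"),
    ("transform", ".", "transform.py"),
    ("load", ".", "load.py"),
    ("visualize", "visualization", "analyze.py"),
]


def build_commands(python=sys.executable):
    """Return the ordered list of (name, workdir, argv) for the pipeline."""
    return [(name, workdir, [python, script]) for name, workdir, script in STEPS]


def run_pipeline(repo_root="."):
    """Run each stage in order; stdin is inherited so prompts still work."""
    root = Path(repo_root)
    for name, workdir, argv in build_commands():
        print(f"--- {name} ---")
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so later stages never run on top of a failed one.
        subprocess.run(argv, cwd=root / workdir, check=True)
```

Calling run_pipeline("path/to/clone") from a Python shell would then walk through extraction, transformation, loading, and visualization in sequence.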
