Task - Data Engineering
Create a scraper to get data from one of the websites listed below. The scraper must be a .py file
containing a single Python class that is called to get the required data.
The output should be in CSV format.
Requirements:
● Pick only one trial task from the sources listed below.
Note: Your choice is also a gauge of which type of data structure you are most comfortable with.
● Create the scraper and follow the evaluation guidelines below.
● Apply clean data standards: the output should contain metadata along with all the values present
in the dataset.
● Present your data in a simple way, using maps, graphs or charts, to provide synthesis and show
analytical skills in a short report.
The submission will be evaluated on the quality of the data output as well as the code. The scraper
should be well optimized and able to handle large amounts of data. The deadline for the task is 3 days.
Upload your code to your GitHub repo and push it for us to evaluate.
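One possible shape for the single-class requirement is sketched below. It is only an illustration under stated assumptions: the class name TenderScraper, the injected fetch_page callable, and the column names (including the metadata columns source and scraped_at) are hypothetical and are not taken from the Taiyo data standards. Records are yielded one at a time and streamed into the CSV writer, so memory use does not grow with the number of pages.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Callable, Dict, Iterator, List, TextIO


class TenderScraper:
    """Hypothetical single-class scraper: walk pages, clean rows, write CSV."""

    # Column names are assumptions for illustration only.
    FIELDS = ["title", "url", "source", "scraped_at"]

    def __init__(self, source: str,
                 fetch_page: Callable[[int], List[Dict[str, str]]]):
        # fetch_page is an injected callable (an assumption of this sketch).
        # It returns the raw rows of one listing page, or an empty list
        # once the last page has been passed.
        self.source = source
        self.fetch_page = fetch_page

    def records(self) -> Iterator[Dict[str, str]]:
        """Yield one cleaned record at a time instead of building a big list."""
        page = 1
        while True:
            rows = self.fetch_page(page)
            if not rows:
                break
            scraped_at = datetime.now(timezone.utc).isoformat()
            for row in rows:
                yield {
                    "title": row.get("title", "").strip(),
                    "url": row.get("url", ""),
                    "source": self.source,      # metadata column
                    "scraped_at": scraped_at,   # metadata column
                }
            page += 1

    def to_csv(self, out: TextIO) -> int:
        """Stream all records into an open text file; return the row count."""
        writer = csv.DictWriter(out, fieldnames=self.FIELDS)
        writer.writeheader()
        count = 0
        for record in self.records():
            writer.writerow(record)
            count += 1
        return count
```

In a real submission, fetch_page would wrap the HTTP request and HTML parsing for the chosen source. Passing it in as a callable keeps the class testable without network access, for example by writing to an io.StringIO buffer with a stub that returns a fixed list of rows.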
Learn more about our data standards: https://developer.taiyo.ai/api-doc/StandardLib/
1. Time Series Data (fork, branch, and push your code to: https://github.com/Taiyo-ai/ts-mesh-pipeline)
Time Series Data Standards (to follow): https://developer.taiyo.ai/api-doc/TimeSeries/
2. Projects and Tenders (fork, branch, and push your code to: https://github.com/Taiyo-ai/pt-mesh-pipeline)
Projects and Tenders Data Standards (to follow): https://developer.taiyo.ai/api-doc/ProjectsandTenders/
Scrape data from the following sources by getting the details of all the tenders present on the website:
● World Bank Evaluation and Ratings: https://ieg.worldbankgroup.org/data
● China Procurement Sources:
○ https://www.chinabidding.com/en
○ http://www.ggzy.gov.cn/
○ http://en.chinabidding.mofcom.gov.cn/
○ https://www.cpppc.org/en/PPPyd.jhtml
○ https://www.cpppc.org:8082/inforpublic/homepage.html#/searchresult
● E-procurement Government of India: https://etenders.gov.in/eprocure/app
Evaluation Guidelines:
Evaluation is based on the following parameters:
● Web Scraping Standards and Libraries used
○ Update requirements.txt for the packages used in the sample solution
● Modular, DRY Code
○ Follow the directory and package structure of the sample/dummy project
○ Python package handling and a client.py/main.py for calling the different steps/modules of
the code are a must
● Config params or control params supplied through external ENV variables, unit tests and logging standards
● A working solution whose config/params are driven and triggered through the client.py/main.py package
file.
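The guidelines above on ENV-driven config, logging and a main.py entry point could be met along the following lines. This is a minimal sketch, not the layout of the sample repository: the environment variable names (SCRAPER_BASE_URL, SCRAPER_OUTPUT, SCRAPER_MAX_PAGES, SCRAPER_LOG_LEVEL), the Config fields and the step names are all hypothetical.

```python
import logging
import os
from dataclasses import dataclass
from typing import Callable, List, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Config:
    """Control params for one run; field names are assumptions."""
    base_url: str
    output_path: str
    max_pages: int
    log_level: str


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Read params from the environment (or from a dict, which eases testing)."""
    env = os.environ if env is None else env
    return Config(
        base_url=env.get("SCRAPER_BASE_URL", "https://example.org"),
        output_path=env.get("SCRAPER_OUTPUT", "output.csv"),
        max_pages=int(env.get("SCRAPER_MAX_PAGES", "10")),
        log_level=env.get("SCRAPER_LOG_LEVEL", "INFO").upper(),
    )


Step = Tuple[str, Callable[[Config], None]]


def run_steps(config: Config, steps: List[Step]) -> List[str]:
    """Call each pipeline step in order, logging as it goes."""
    log = logging.getLogger("pipeline")
    completed = []
    for name, step in steps:
        log.info("starting step %s", name)
        step(config)
        log.info("finished step %s", name)
        completed.append(name)
    return completed


def main(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Entry point that a main.py or client.py might expose."""
    config = load_config(env)
    logging.basicConfig(
        level=config.log_level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    # Placeholder steps; a real pipeline would import them from its modules.
    steps: List[Step] = [
        ("scrape", lambda cfg: None),
        ("clean", lambda cfg: None),
        ("export", lambda cfg: None),
    ]
    return run_steps(config, steps)
```

Because load_config accepts any mapping, a unit test can pass a plain dict instead of touching the real environment, for example load_config({"SCRAPER_MAX_PAGES": "3"}).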