IF2106 – Data Engineering
Introduction to Data Science with Python (1)
• Data Science
• The Stages of Data Science
• Data Science Methodology
• Python
Undergraduate Computer Science
Overview
Introduces the main concepts of data science and its life
cycle
Demonstrates the importance of Python programming
and its main libraries for data science processing
Objectives
Upon completion of this Unit, you are expected to be able to:
• Understand and discuss data science and the stages of data science
project development
• Develop and practice building applications using Python
Contents
a. Data Science
b. The Stages of Data Science
c. Data Science Methodology
d. Python
Data Science
• The field that comprises everything related to cleaning, preparing, and
analyzing unstructured, semistructured, and structured data
• Uses a combination of:
• statistics
• mathematics
• programming
• problem-solving, and
• data capture
to extract insights and information from data
The Stages of Data Science
1. Understand Business Requirement
2. Data Acquisition
3. Data Preparation
4. Data Exploring
5. Data Modeling
6. Data Visualization
7. Decision Making
The Stages of Data Science
• Data requirement:
• Define what kind of data will be collected based
on the requirements or problem analysis
• Data acquisition:
• Read data from various sources of unstructured,
semistructured, or structured data that might be
stored in a spreadsheet, comma-separated values (CSV)
file, web page, database, etc.
• Data cleaning:
• Remove noisy data and perform the operations needed
to keep only the relevant data
• Exploratory analysis:
• Examine the cleaned data and apply statistical
processing suited to the specific analysis purpose
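The acquisition, cleaning, and exploratory analysis stages above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed workflow: the CSV content, the column names, and the helper functions `acquire`, `clean`, and `explore` are all hypothetical and exist only for this example.

```python
import csv
import io
import statistics

# Hypothetical CSV content standing in for a real file, web page, or database export.
RAW_CSV = """city,temperature_c
Jakarta,31.5
Bandung,24.0
Surabaya,
Medan,not available
Makassar,29.5
"""


def acquire(text):
    """Data acquisition: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data cleaning: keep only rows whose temperature parses as a number."""
    cleaned = []
    for row in rows:
        try:
            value = float(row["temperature_c"])
        except (TypeError, ValueError):
            continue  # drop missing or non-numeric readings
        cleaned.append({"city": row["city"], "temperature_c": value})
    return cleaned


def explore(rows):
    """Exploratory analysis: compute simple descriptive statistics."""
    temps = [row["temperature_c"] for row in rows]
    return {
        "count": len(temps),
        "mean": statistics.mean(temps),
        "min": min(temps),
        "max": max(temps),
    }


rows = acquire(RAW_CSV)
clean_rows = clean(rows)
summary = explore(clean_rows)
print(summary)
```

In practice the same three steps are often written with pandas; the standard library is used here only to keep the sketch dependency-free.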
The Stages of Data Science (Cont.)
• Data modeling:
• An analysis model needs to be created
• Advanced tools such as machine learning
algorithms can be used in this step
• Data visualization:
• The results are plotted using various systems to
help in the decision-making process
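As one possible illustration of the data modeling stage, the sketch below fits a straight line to a few points by ordinary least squares. Linear regression is used here only as a simple stand-in for the machine learning algorithms mentioned above; the data values and the helper names `fit_line` and `predict` are hypothetical.

```python
# Hypothetical (x, y) observations, e.g. hours studied vs. exam score.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted model to a new input value."""
    return slope * x + intercept


slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(6.0, slope, intercept))
```

The fitted line and the original points could then be plotted in the data visualization stage to support decision making.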
Data Science Methodology
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Microsoft Team Data Science Process (TDSP) Lifecycle
IBM Data Science Methodology
Python
Why Python?
• Easy to learn and use
• Expressive
• Interpreted
• Cross-platform
• Free and open source
• Object-oriented
• Extensible
• Large standard library
• GUI programming support
• Integrated (can be combined with other languages such as C and C++)
Python Environment: Anaconda Navigator
https://www.anaconda.com/products/distribution
Python Environment: Google Colaboratory
https://colab.research.google.com
Python Frameworks for Data Science
• NumPy:
• Python package whose name stands for “Numerical Python”
• Consists of multidimensional array objects and a collection of routines for
manipulating arrays
• Can be used to perform mathematical, logical, and linear algebra
operations on arrays
• pandas:
• Open source Python library used to load, organize, manipulate, model, and
analyze data by offering powerful data structures
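A short sketch of both libraries follows, assuming NumPy and pandas are installed (both ship with Anaconda and Google Colaboratory). The array values, the column names `city` and `sales`, and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: a 2-D array with mathematical, logical, and linear algebra operations.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
doubled = a * 2                         # element-wise arithmetic
mask = a > 2                            # element-wise logical comparison
identity_check = a @ np.linalg.inv(a)   # matrix product with the inverse

# pandas: hold tabular data in a DataFrame, then organize and analyze it.
df = pd.DataFrame(
    {
        "city": ["Jakarta", "Bandung", "Jakarta", "Bandung"],
        "sales": [120, 80, 150, 95],
    }
)
totals = df.groupby("city")["sales"].sum()               # total sales per city
top = df.sort_values("sales", ascending=False).head(1)   # row with the highest sales

print(doubled)
print(mask)
print(totals)
```

The DataFrame is one of the data structures pandas offers; data read from a CSV file or spreadsheet is typically loaded into the same structure.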
Python Frameworks for Data Science (Cont.)
• Matplotlib:
• Python library used to create 2D graphs and plots
• Supports a wide variety of graphs and plots such as histograms, bar
charts, power spectra, error charts, and so on, with formatting
controls such as line styles, font properties, axis formatting,
and more
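As a minimal sketch of two of the plot types listed above, the example below draws a histogram and a bar chart side by side, assuming Matplotlib is installed. The data values, the colors, and the output file name `example_plots.png` are hypothetical choices for this illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical data for the two plots.
values = [2, 3, 3, 4, 4, 4, 5, 5, 6]
categories = ["A", "B", "C"]
counts = [5, 3, 7]

# One figure with two side-by-side axes.
fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram with axis labels and a title (font size is a formatting option).
ax_hist.hist(values, bins=5, color="steelblue", edgecolor="black")
ax_hist.set_title("Histogram", fontsize=12)
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")

# Bar chart with an explicit y-axis range.
bars = ax_bar.bar(categories, counts, color="darkorange")
ax_bar.set_title("Bar chart", fontsize=12)
ax_bar.set_ylim(0, 8)

fig.tight_layout()
fig.savefig("example_plots.png")
```

In a notebook environment such as Google Colaboratory, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` called instead of saving to a file.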
Summary
This Unit introduced the data science field and the use of Python
programming for implementation. Let’s recap what was covered in this
Unit:
• The main concepts of data science and its life cycle
• The importance of Python programming and its main libraries for
data science processing
Discussion