Introduction to data science
Python for data science
Introduction to Python
Basic types and control flow
Intro to Python: part 1
Intro to IDLE, Python
Function: print
Types: str, int, float
Variables
User input
Saving your work
Comments
Conditionals
Type: bool
Keywords: and, or, not, if, elif, else
IDLE: Integrated Development and Learning Environment
Shell
Evaluates what you enter and displays output
Interactive
Type at “>>>” prompt
Editor
Create and save .py files
Can run files and display output in the shell
A simple example: Hello world!
Syntax highlighting:
IDLE colors code differently depending on
functionality: Keywords, string, results
Some common keywords:
if else and or class while for
break elif in def not from import
(print was a keyword in Python 2; in Python 3 it is a built-in function)
Fundamental concepts in Python
Variable: a memory space that is named.
Names of variables may include letters, digits or underscores (_);
names of variables cannot begin with a digit and:
Cannot be a keyword
Can use Unicode letters (from Python 3)
All variables in Python refer to objects. Each object has a type
and a location in memory (id)
Fundamental concepts in Python
Variables in Python:
Case sensitive
No need to declare in program
No need to specify the data type
Can be changed to another data type
A variable must be assigned a value before it is used
Basic data types:
String: str
Number: integer, float, fraction and complex
list, tuple, dictionary
Fundamental concepts in Python
Variables
To re-use a value in multiple computations, store it in a variable.
Python is "dynamically-typed", so you can change the type of value stored
(differs from Java, C#, C++, …).
>>> someVar = 2
>>> print(someVar)  # it's an int
2
>>> someVar = "Why hello there"
>>> print(someVar)  # now str
Why hello there
Fundamental concepts in Python
Input data for variables
<variable's name> = input("string")
input always returns a string; convert it to another data type when needed
<variable's name> = <data type>(input('title'))
A = int(input('a = '))
Display data
print(Value1, Value2, ...)
print("string", value)
Display several strings on one line: add end='' to the print call.
Examples:
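A minimal sketch of the input and display patterns above. Since input() needs a keyboard, the hypothetical variable typed_text stands in for the text a user would type.

```python
# typed_text is an assumption: it plays the role of the str returned by input('a = ').
typed_text = "12"

a = int(typed_text)    # convert the text to an int
b = float(typed_text)  # or to a float

print("a =", a)        # several values are separated by a space
print("b =", b)

# end='' keeps the next print on the same line
print("Hello, ", end="")
print("world!")
```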
Data types in Python
String
We've already seen one type in Python, used for words and phrases.
In general, this type is called "string".
In Python, it's referred to as str.
Data types in Python
Python also has types for numbers.
int – integers
float – floating point (decimal) numbers
>>> print(4)  # int
4
>>> print(6.)  # float
6.0
>>> print(2.3914)  # float
2.3914
Data types in Python
int
In Python 3.X, int has unlimited range.
Can be used for the computations on very large numbers.
Common operators
Operation                               Syntax    Example
Addition                                x + y     20 + 3 = 23
Subtraction                             x - y     20 - 3 = 17
Multiplication                          x * y     20 * 3 = 60
Division                                x / y     20 / 3 = 6.666...
Division with the rounded-down result   x // y    20 // 3 = 6
Remainder                               x % y     20 % 3 = 2
Exponent                                x ** y    20 ** 3 = 8000
Unary operators                         +x, -x
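The operators in the table can be tried directly. The sketch below is only an illustration; the variable names are chosen for this example.

```python
x, y = 20, 3

true_division = x / y   # always a float
quotient = x // y       # division with the rounded-down result
remainder = x % y
power = x ** y
negated = -x            # unary minus

# int has unlimited range, so very large values stay exact
big = 2 ** 100

print(true_division, quotient, remainder, power, negated)
print(big)
```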
Data types in Python
Float
Values: distinguished from integers by the decimal point. The integer part and
the fractional part are separated by '.'
Operators: +, –, *, /, ** and unary operators
Use decimal for higher precision: from decimal import *
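A short sketch of why decimal can help. It imports only the names it needs instead of using import *, and the precision of 30 digits is an arbitrary choice for this illustration.

```python
from decimal import Decimal, getcontext

# Binary floats cannot represent 0.1 or 0.2 exactly
float_sum = 0.1 + 0.2
print(float_sum == 0.3)

# Decimal values built from strings keep the exact decimal digits
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))

# The working precision (number of significant digits) is configurable
getcontext().prec = 30
third = Decimal(1) / Decimal(3)
print(third)
```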
Data types in Python - bool
Boolean values are true or false.
Python has the values True and False (note the capital letters!).
You can compare values with ==, !=, <, <=, >, >=, and the result of these
expressions is a bool.
>>> a = 2
>>> b = 5
>>> a > b
False
>>> a <= b
True
>>> a == b  # does a equal b?
False
>>> a != b  # does a not-equal b?
True
Data types in Python - bool
When combining Boolean expressions, parentheses are your friends.
>>> a = 2
>>> b = 5
>>> False == (a > b)
True
Keywords: and, or, not
and is True if both parts evaluate to True, otherwise False
or is True if at least one part evaluates to True, otherwise False
>>> a = 2
>>> b = 5
>>> a < b and False
False
>>> a < b or a == b
True
>>> a < b and a == b
False
>>> True and False
False
>>> True and True
True
>>> True or False
True
Keywords: and, or, not
and is True if both parts evaluate to True, otherwise False
or is True if at least one part evaluates to True, otherwise False
not is the opposite of its argument
>>> not True
False
>>> not False
True
>>> True and (False or not True)
False
>>> True and (False or not False)
True
Mathematics functions in Python
The list of numeric and mathematical modules
math: Mathematical functions (sin() etc.).
cmath: Mathematical functions for complex numbers.
decimal: Decimal fixed-point and floating-point arithmetic
random: Generates pseudo-random numbers with
various probability distributions.
itertools: Functions that create iterators for
efficient looping
functools: Higher-order functions and operations
on callable objects
operator: The standard Python operators as functions
Mathematics functions in Python
math
exp(x)
log(x[, base])
log10(x)
pow(x, y)
sqrt(x)
acos(x)
asin(x)
atan(x) atan2(y, x)
cos(x) hypot(x, y)
sin(x) tan(x)
degrees(x) radians(x)
cosh(x) sinh(x) tanh(x)
Constants: pi, e
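A brief sketch using a few of the math functions listed above. The input values are arbitrary examples.

```python
import math

hyp = math.hypot(3, 4)                    # length of the hypotenuse
angle = math.degrees(math.atan2(1, 1))    # angle of the point (1, 1) in degrees
log2_of_8 = math.log(8, 2)                # logarithm with an explicit base
root = math.sqrt(16)
circle_area = math.pi * 2 ** 2            # area of a circle with radius 2

print(hyp, angle, log2_of_8, root, circle_area)
```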
Built-in functions in Python
Conditionals: if, elif, else
The keywords if and else provide a way to
control the flow of your program.
Python checks each condition in order, and
executes the block (whatever’s indented) of the
first one to be True
Conditionals: if, elif, else
The keywords if, elif, and else provide a way to control
the flow of your program.
Python checks each condition in order, and executes the block
(whatever’s indented) of the first one to be True.
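A small sketch of "the first condition to be True wins". The function name size_label and its thresholds are made up for this illustration.

```python
def size_label(n):
    # Conditions are checked from top to bottom; only the first True block runs.
    if n < 10:
        return "small"
    elif n < 100:   # reached only when n >= 10
        return "medium"
    else:           # runs when no condition above was True
        return "large"

# 5 satisfies both n < 10 and n < 100, but only the first block is executed
for value in (5, 50, 500):
    print(value, "is", size_label(value))
```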
Conditionals: if, elif, else
Indentation is important in
Python!
Make sure each if, elif,
and else has a colon after it,
and its block is indented one
level (4 spaces by convention).
Conditionals: if, elif, else
Be careful what you compare to the result of input (raw_input in
Python 2). It is a string, not a number.
# The right way: str to str or int to int
>>> gradYear = input("When do you plan to graduate? ")
When do you plan to graduate? 2019
>>> gradYear == 2019  # gradYear is not an integer :(
False
>>> gradYear == "2019"  # gradYear is a string
True
>>> int(gradYear) == 2019  # cast gradYear to an int :)
True
Conditionals: if, elif, else
Be careful how you compare the result of input (raw_input in Python 2).
It is a string, not a number.
Doing it wrong leads to a ValueError:
>>> gradYear = input("When do you plan to graduate? ")
When do you plan to graduate? Sometime
>>> int(gradYear) == 2019
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    int(gradYear) == 2019
ValueError: invalid literal for int() with base 10: 'Sometime'
Nested if
Syntax:
if condition1:
    tasks_1
    if condition2:
        tasks_2
    elif condition3:
        tasks_3
    else:
        tasks
elif condition4:
    tasks_4
else:
    tasks_5
Example:
var = 100
if var < 200:
    print("The value of variable is less than 200")
    if var == 150:
        print("The value is 150")
    elif var == 100:
        print("The value is 100")
    elif var == 50:
        print("The value is 50")
elif var < 50:
    print("The value of variable is less than 50")
else:
    print("There is no true condition")
Nested if
Example:
var = int(input('Enter a value: '))
if var < 200:
    print("The value of variable is less than 200")
    if var == 150:
        print("The value is 150")
    elif var == 100:
        print("The value is 100")
    elif var == 50:
        print("The value is 50")
elif var < 50:  # never reached: any value below 50 is also below 200
    print("The value of variable is less than 50")
else:
    print("There is no true condition")
Exercise
Enter the coordinates of 3 points A, B and C on the 2-dimensional
plane. Check whether triangle ABC is an
equilateral triangle.
WHILE, FOR loops - Syntax
WHILE loop - Examples
Example 1: while without else
count = 0
while count < 5:
    print('Your sequence number is:', count)
    count = count + 1
WHILE loop - Examples
• Example 2: while with else
count = 0
while count < 5:
    print(count, "is less than 5")
    count = count + 1
else:
    print(count, "is not less than 5")
FOR loop - Examples
Example 1: FOR without else
for i in range(0, 10):
    print('The sequence number is:', i)
FOR loop - Examples
• Example 2: FOR with else
for i in range(0, 10):
    print('The sequence number is:', i)
else:
    print('The last number!')
Nested loops - Exercise
Example: Find all prime numbers that are less than 100
i = 2
while i < 100:
    j = 2
    # try divisors j while j * j <= i
    while j <= (i / j):
        if not (i % j): break  # j divides i, so i is not prime
        j = j + 1
    if j > i / j: print(i, "is a prime number!")
    i = i + 1
Lists
A sequence of items
Has the ability to grow (unlike a fixed-size array)
Use indexes to access elements (array notation)
Examples:
aList = []
another = [1, 2, 3]
You can print an entire list or an element
print(another)
print(another[0])
Index -1 accesses the last element of a list
List operation
append method to add elements (don't have to be the same
type)
aList.append(42)
aList.append(another)
del removes elements
del aList[0] # removes 42
Concatenate lists with +
Repeat the elements of a list with *
zerolist = [0] * 5
Multiple assignments
point = [1,2]
x , y = point
More operations can be found at
http://docs.python.org/lib/types-set.html
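The list operations above, collected into one runnable sketch. The variable names follow the slides where possible; combined and last are added for this illustration.

```python
a_list = []
another = [1, 2, 3]

a_list.append(42)         # elements do not have to share a type
a_list.append(another)    # a_list now holds an int and a list
del a_list[0]             # removes 42

combined = another + [4, 5]   # concatenation creates a new list
zero_list = [0] * 5           # repetition
last = combined[-1]           # index -1 is the last element

point = [1, 2]
x, y = point                  # multiple assignment (unpacking)

print(a_list, combined, zero_list, last, x, y)
```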
Exercises
1. Enter a string and check whether it is a valid email address
(for this exercise, a string counts as valid if it contains the @
character)
2. Given a random sequence A of 100 integer
elements (values in the range 1 to 300), copy all the
odd elements into another sequence (B)
Interacting with user
Obtaining data from a user
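Use the input function. In Python 3 it always returns a string, so convert the
result with int() or float() for numbers (in Python 2, raw_input returned
strings and input evaluated what was typed)
Example
name = input("What's your name? ")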
Command Line arguments
Example:
import sys
for item in sys.argv:
    print(item)
Remember sys.argv[0] is the program name
Libraries in Python
Data processing in Python
Numpy
Matplotlib
Pandas
Scikit-learn
Data processing in Python
Basic data types: string, number, boolean
Other data types: set, dictionary, tuple, list, file
The errors in Python
Syntax error: the code breaks the rules of the language, so the program
cannot be parsed and does not run.
Exception: an abnormal event that occurs while the program is running and
was not part of the intended flow
Data processing in Python
Deal with the exceptions: using up to 4 blocks
“try” block: code that is likely to cause an error. When an error
occurs, this block will stop at the line that caused the error
“except” block: error handling code, only executed if an error
occurs, otherwise it will be ignored
“else” block: can appear right after the last except block, the
code will be executed if no except is performed (the try block
has no errors)
“finally” block: also known as clean-up block, always executed
whether an error occurs or not
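The four blocks can be sketched as follows. The function to_int and the log list are hypothetical helpers for this illustration; log records which blocks actually ran.

```python
def to_int(text, log):
    try:
        value = int(text)        # may raise ValueError
    except ValueError:
        log.append("except")     # runs only if the try block failed
        value = None
    else:
        log.append("else")       # runs only if the try block had no error
    finally:
        log.append("finally")    # always runs
    return value

good_log, bad_log = [], []
print(to_int("2019", good_log), good_log)
print(to_int("Sometime", bad_log), bad_log)
```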
Data processing in Python
Numpy
The main object of numpy is the homogeneous multidimensional array:
The data types of elements in the array must be the same
Data can be one-dimensional or multi-dimensional arrays
The dimensions (axes) are numbered from 0 onwards
The number of dimensions is called rank.
There are up to 24 different number types
The ndarray type is the main class that handles multidimensional array data
Lots of functions and methods for handling matrices
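A minimal sketch of these properties on a small array. The values are arbitrary and the variable names are chosen for this example.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(type(a))    # the ndarray class
print(a.ndim)     # number of dimensions (rank)
print(a.shape)    # size along each axis
print(a.dtype)    # one data type shared by all elements

# Mixed int and float input is converted to a single common type
mixed = np.array([1, 2.5])
print(mixed.dtype)

# Axis 0 runs down the rows, axis 1 runs across the columns
column_sums = a.sum(axis=0)
row_sums = a.sum(axis=1)
print(column_sums, row_sums)
```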
Numpy
Syntax: import numpy [as <new name>]
Create array:
<variable name>=<library name>.array(<value>)
Access: <variable name>[<index>]
Examples:
import numpy as np
x = np.arange(3.0)
a = np.zeros((2, 2))
b = np.ones((1, 2))
c = np.full((3, 2, 2), 9)
d = np.eye(2)
e = np.random.random([3, 2])
Numpy
Access by index (slicing)
import numpy as np
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
row_r1 = a[1, :]    # 1-dimensional array of length 4
row_r2 = a[1:2, :]  # 2-dimensional array 1x4
print(row_r1, row_r1.shape)  # Display "[5 6 7 8] (4,)"
print(row_r2, row_r2.shape)  # Display "[[5 6 7 8]] (1, 4)"
col_r1 = a[:, 1]    # 1-dimensional array of length 3
col_r2 = a[:, 1:2]  # 2-dimensional array 3x1
print(col_r1, col_r1.shape)  # Display "[ 2  6 10] (3,)"
print(col_r2, col_r2.shape)  # Display "[[ 2]
                             #           [ 6]
                             #           [10]] (3, 1)"
Numpy
import numpy as np
x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)
print(x + y)       # same as print(np.add(x, y))
print(x - y)       # same as print(np.subtract(x, y))
print(x * y)       # same as print(np.multiply(x, y)), element by element
print(x / y)       # same as print(np.divide(x, y))
print(np.sqrt(x))  # applied to all elements of x
print(2**x)        # applied to all elements of x
matplotlib
"matplotlib" is a library specializing in plotting,
built on top of numpy
"matplotlib" aims to reduce
charting work to "just a few lines of code"
"matplotlib" supports a wide variety of chart types,
especially those used in research or economics,
such as line charts, histograms, spectra,
correlations, error charts, scatter plots, etc.
The structure of matplotlib consists of many parts,
serving different purposes
matplotlib
Prerequisite: the data is available
There are 4 basic steps:
Step 1: Choose the right chart type
Depends a lot on the type of data
Depends on the user's intended use
Step 2: Set parameters for the chart
Parameters of axes, meaning, scale,...
Highlights on the chart
Perspective, fill pattern, color and other details
Additional information
Step 3: Draw the chart
Step 4: Save to file
matplotlib
Some charts drawn by using matplotlib
matplotlib
The graph shows the relationship between X and Y
Syntax:
plot([x], y, [fmt], data=None, **kwargs)
plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
"fmt" is the format string for the line (color, marker, line style)
"data" is an optional object with labelled data (for example a dict or a DataFrame); x and y can then be given as labels
**kwargs: additional line properties (for example linewidth or label)
plot can be called multiple times on one chart
The returned result is a list of Line2D objects
matplotlib
fmt = '[color][marker][line]'
[color]:
'b' – blue
'g' – green
'r' – red
'c' – cyan
'm' – magenta
'y' – yellow
'k' – black
'w' – white
#rrggbb – specifies the color code in the RGB system
matplotlib
Line plot
[marker] – the symbol drawn at each data point:
'o' – circle
'v' – triangle (also '^', '<', '>' for the other directions)
'*' – star
'.' – point
'p' – pentagon
…
[line] – line type:
'-' solid line
'--' dashed line
'-.' dash-dot line
':' dotted line
matplotlib
Example – Line plot
import numpy as np
import matplotlib.pyplot as plt
# divide the interval 0-5 with the step of 0.2
t = np.arange(0., 5., 0.2)
# Draw three lines:
# - red dash line: y = x
# - blue, square marker : y = x^2
# - green, triangle marker: y = x^3
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
matplotlib
Example – Bar plot
import matplotlib.pyplot as plt
D = {'MIS': 60,
     'AC': 310,
     'AAI': 360,
     'BDA': 580,
     'FDB': 340, 'MKT': 290}
plt.bar(range(len(D)), list(D.values()), align='center')
plt.xticks(range(len(D)), list(D.keys()))
plt.title('The majors in IS')
plt.show()
matplotlib
Example – Pie plot
matplotlib
Example – subplot
import numpy as np
import matplotlib.pyplot as plt
x1 = np.linspace(0.0, 5.0)
x2 = np.linspace(0.0, 2.0)
y1 = np.cos(2 * np.pi * x1) * np.exp(-x1)
y2 = np.cos(2 * np.pi * x2)
plt.subplot(2, 1, 1)
plt.plot(x1, y1, 'o-')
plt.subplot(2, 1, 2)
plt.plot(x2, y2, '.-')
plt.show()
Pandas
"pandas" is a library built on top of numpy,
specializing in processing tabular data
The name "pandas" is derived from the term "panel
data"
Pandas
Read data from multiple formats
Data alignment and integrated handling of
missing data
Reshape and pivot data sets easily
Slice, index, and subset large data sets based on
labels
Data can be grouped for consolidation and
transformation purposes
Filter data and perform queries on the data
Time series data processing and sampling
Pandas
Pandas data has 3 main structures:
Series: 1-dimensional structure, uniform data array
Dataframe (frame): 2-dimensional structure, each
column holds data of a single type (somewhat like a table in
SQL, but with named rows)
Panel: 3-dimensional structure, can be viewed as
a set of dataframes with additional information
(deprecated and removed from recent pandas versions)
Series data is similar to the array type in
numpy, but there are two important differences:
Accepts missing data (NaN – unknown)
Rich indexing system (labels, like a dictionary)
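A small sketch of the two differences. The labels and values are sample data made up for this illustration.

```python
import numpy as np
import pandas as pd

# One value is missing (NaN); the labels work like dictionary keys
s = pd.Series([310, np.nan, 580], index=["KT", "CNTT", "Co khi"])

by_label = s["KT"]              # access by label
by_position = s.iloc[0]         # access by position
missing_count = s.isna().sum()  # how many values are missing
mean_value = s.mean()           # NaN is skipped by default

print(by_label, by_position, missing_count, mean_value)
```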
Pandas
General syntax:
pd.DataFrame(data, index, columns, dtype, copy)
Where:
'data' accepts values of many different types, such
as list, dictionary, ndarray, series,... and even other
DataFrames
'index' gives the row labels of the dataframe
'columns' gives the column labels of the dataframe
'dtype' is a single data type to force on the data
'copy' takes the value True/False to indicate whether
data is copied to a new memory area (in recent pandas versions the default depends on the type of 'data')
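A minimal sketch of the constructor with explicit row and column labels. The labels "a", "b", "name" and "rank" are assumptions chosen for this example.

```python
import pandas as pd

data = [["MIT", 1], ["Stanford", 2]]

# index labels the rows, columns labels the columns
df = pd.DataFrame(data, index=["a", "b"], columns=["name", "rank"])

print(df)
print(df.loc["b", "name"])   # row label first, then column label
```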
Pandas – Series
import pandas as pd
import numpy as np
chi_so = ["KT", "KT", "CNTT", "Co khi"]  # duplicated index labels
gia_tri = [310, 360, 580, 340]
S = pd.Series(gia_tri, index=chi_so)
print(S)
print(S.index)
print(S.values)
Output:
KT        310
KT        360
CNTT      580
Co khi    340
dtype: int64
Index(['KT', 'KT', 'CNTT', 'Co khi'], dtype='object')
[310 360 580 340]
Pandas – Series
Attributes and methods of Series
S.axes: returns a list containing the index of S
S.dtype: returns the data type of S's elements
S.empty: returns True if S is empty
S.ndim: returns the number of dimensions of S (1)
S.size: returns the number of elements of S
S.values: returns an array of the elements of S
S.head(n): returns the first n elements of S
S.tail(n): returns the last n elements of S
Pandas – Series
Operations on Series
import pandas as pd
import numpy as np
chi_so = ["Ke toan", "KT", "CNTT", "Co khi"]
gia_tri = [310, 360, 580, 340]
# If a label appears in both Series, the values are added, otherwise NaN
S = pd.Series(gia_tri, index=chi_so)
P = pd.Series([100, 100], ['CNTT', 'PM'])
Y = S + P
print(Y)
Output:
CNTT       680.0
Co khi       NaN
KT           NaN
Ke toan      NaN
PM           NaN
dtype: float64
Pandas - Frame
Create dataframe from list
names_rank = [['MIT', 1], ["Stanford", 2], ["DHTL", 200]]
df = pd.DataFrame(names_rank)
print(df)
Output:
          0    1
0       MIT    1
1  Stanford    2
2      DHTL  200
Pandas - Panel
• Panels are widely used in
econometrics
The data has 3 axes:
Items (axis 0): each item is an
internal dataframe
Major axis (axis 1 – main axis):
rows
Minor axis (axis 2 – minor axis):
columns
• No further development: Panel was deprecated and later
removed from pandas (replaced by MultiIndex)
Pandas - Panel
Syntax:
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
Where:
‘data’ can accept the following data types: ndarray,
series, map, lists, dict, constants and other dataframes
‘items’ is axis = 0
‘major_axis’ is axis = 1
‘minor_axis’ is axis = 2
‘dtype’ is the data type of each column
‘copy’ takes the value True/False to determine whether
the data shares memory or not
scikit-learn (sklearn)
Basic machine learning problem classes
scikit-learn (sklearn)
Linear regression
Data clustering
Data classification
Linear regression
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model, metrics
# reading data from file csv
df = pd.read_csv("nguoi.csv", index_col = 0)
print(df)
#Draw the figure
plt.plot(df.Cao, df.Nang, 'ro')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
Linear regression
Use the previous data, with an added
gender column
(Nam/Nu, i.e. male/female)
Apply the same method to
see how gender affects
weight
Linear regression
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model, metrics
df = pd.read_csv("nguoi2.csv", index_col = 0)
print(df)
df['GT'] = df.Gioitinh.apply(lambda x: 1 if x=='Nam' else 0)
print(df)
Linear regression
# Training the model
X = df.loc[:, ['Cao', 'GT']].values
y = df.Nang.values
model = linear_model.LinearRegression()
model.fit(X, y)
# Show the information of the model
mse = metrics.mean_squared_error(y, model.predict(X))
print("Mean squared error: ", mse)
print("Regression coefficients: ", model.coef_)
print("Intercept: ", model.intercept_)
print(f"[weight] = {model.coef_} x [height, sex] + {model.intercept_}")
Linear regression
# Applying the model to some cases
while True:
    x = float(input("Enter the height (0 to stop): "))
    if x <= 0: break
    print("Male with the height", x, "cm, will have the weight", model.predict([[x, 1]]))
    print("Female with the height", x, "cm, will have the weight", model.predict([[x, 0]]))
scikit-learn (sklearn)
Data clustering
from sklearn.cluster import KMeans
Data classification
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
Classification
Clustering
Exercises
Choose three of the following models:
- Linear regression
- Classification based on Naïve Bayes
- Classification based on SVM
- K-means clustering
- FCM clustering
Apply the selected models on 2 datasets taken
from the standard data set of the computer