3 Powerful Data Structure and Software Ecosystem
WHY
DICTIONARY?
Nanjing University
Why Dictionary?
Use Python to build a simple employee information table
containing names and salaries, then use the table to query the
salary of Niuyun.
File:
# Filename: info.py
names = ['Wangdachui', 'Niuyun', 'Linling', 'Tianqi']
salaries = [3000, 2000, 4500, 8000]
print(salaries[names.index('Niuyun')])
Output:
2000
Desired direct lookup by name: salaries['Niuyun']
Dictionary
– key
– value
– key-value pair
Create a Dictionary
• Create a dictionary
− directly
[Figure: positional list entries (0 'Wangdachui', 2 'Linling', 3 'Tianqi') contrasted with key-based access cInfo['Niuyun']]
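The slide's own code did not survive extraction, so the following is only a minimal sketch of common ways to create the employee dictionary: a literal written directly, and the dict() constructor. The names aInfo and bInfo are illustrative; cInfo follows the figure.

```python
# Create a dictionary directly with a literal.
aInfo = {'Wangdachui': 3000, 'Niuyun': 2000, 'Linling': 4500, 'Tianqi': 8000}

# Create the same dictionary with dict() from a list of (key, value) pairs.
pairs = [('Wangdachui', 3000), ('Niuyun', 2000), ('Linling', 4500), ('Tianqi', 8000)]
bInfo = dict(pairs)

# dict() also accepts keyword arguments when the keys are valid identifiers.
cInfo = dict(Wangdachui=3000, Niuyun=2000, Linling=4500, Tianqi=8000)

# Look up a salary by name instead of by position.
print(cInfo['Niuyun'])  # 2000
```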
Generate a Dictionary
How can a dictionary of company codes and stock
prices be generated from the data?
{'AXP': '78.51', 'BA': '184.76', 'CAT': '96.39', 'CSCO': '33.71', 'CVX': '106.09'}
File:
# Filename: createdict.py
pList = [('AXP', 'American Express Company', '78.51'),
         ('BA', 'The Boeing Company', '184.76'),
         ('CAT', 'Caterpillar Inc.', '96.39'), …]
aList = []  # company codes
bList = []  # stock prices
for i in range(5):
    aStr = pList[i][0]
    bStr = pList[i][2]
    aList.append(aStr)
    bList.append(bStr)
aDict = dict(zip(aList, bList))
print(aDict)
Output:
{'AXP': '78.51', 'BA': '184.76', 'CAT': '96.39', 'CSCO': '33.71', 'CVX': '106.09'}
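The same mapping can also be built in one step. This is a hedged alternative sketch using a dictionary comprehension, not the slide's method; it uses only the three records that are visible in the source, since the remaining records are elided there.

```python
# Only the three records visible in the source are used here.
pList = [('AXP', 'American Express Company', '78.51'),
         ('BA', 'The Boeing Company', '184.76'),
         ('CAT', 'Caterpillar Inc.', '96.39')]

# Unpack each record and keep the code (first field) and the price (third field).
aDict = {code: price for code, _name, price in pList}
print(aDict)  # {'AXP': '78.51', 'BA': '184.76', 'CAT': '96.39'}
```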
Data Processing Using
Python
USING
DICTIONARY
Basic Operation of Dictionary
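The code for this slide was lost in extraction. As a rough sketch of the basic operations a dictionary supports (lookup, update, insertion, membership test, deletion), assuming the employee table from earlier; the names 'Fuyun' and 'Mayun' and the new values are hypothetical.

```python
aInfo = {'Wangdachui': 3000, 'Niuyun': 2000, 'Linling': 4500, 'Tianqi': 8000}

salary = aInfo['Niuyun']        # look up a value by key
aInfo['Niuyun'] = 9999          # update the value of an existing key
aInfo['Fuyun'] = 1000           # add a new key-value pair (hypothetical member)
has_mayun = 'Mayun' in aInfo    # membership test checks keys, not values
del aInfo['Fuyun']              # delete one key-value pair

print(salary, has_mayun, len(aInfo))  # 2000 False 4
```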
Built-in Functions of Dictionary
>>> hash('Wangdachui')
7716305958664889313
>>> testList = [1, 2, 3]
>>> hash(testList)
Traceback (most recent call last):
  File "<pyshell#127>", line 1, in <module>
    hash(testList)
TypeError: unhashable type: 'list'
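The hash() result matters because dictionary keys must be hashable. A small sketch of the consequence, with illustrative keys and values that are not from the slides:

```python
# Immutable objects such as strings, numbers and tuples are hashable,
# so they can serve as dictionary keys.
valid = {'Wangdachui': 3000, 42: 'number key', ('AXP', 'BA'): 'tuple key'}

# A list is mutable and unhashable, so using it as a key raises TypeError.
try:
    invalid = {[1, 2, 3]: 'list key'}
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(len(valid))
print(error_message)
```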
Dictionary Methods
Given the information dictionary {'Wangdachui': 3000,
'Niuyun': 2000, 'Linling': 4500, 'Tianqi': 8000}, how can the
names and salaries of the employees be output separately?
>>> aInfo = {'Wangdachui': 3000, 'Niuyun': 2000, 'Linling': 4500, 'Tianqi': 8000}
>>> aInfo.keys()
dict_keys(['Wangdachui', 'Niuyun', 'Linling', 'Tianqi'])
>>> aInfo.values()
dict_values([3000, 2000, 4500, 8000])
>>> for k, v in aInfo.items():
...     print(k, v)
...
Wangdachui 3000
Niuyun 2000
Linling 4500
Tianqi 8000
There are two dictionaries: the first contains the original
information, while the second contains some new members
and updates. How can the two be merged so that the information is updated?
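One way to do this is the update() method. The slide's own data was lost, so the values and the new member 'Wangzi' below are hypothetical.

```python
aInfo = {'Wangdachui': 3000, 'Niuyun': 2000, 'Linling': 4500}   # original information
bInfo = {'Wangdachui': 4000, 'Niuyun': 9999, 'Wangzi': 6000}    # updates and a new member

# update() overwrites the values of shared keys and adds keys that are new.
aInfo.update(bInfo)
print(aInfo)
# {'Wangdachui': 4000, 'Niuyun': 9999, 'Linling': 4500, 'Wangzi': 6000}
```

Note that update() changes aInfo in place and returns None, while bInfo is left untouched.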
What is the difference between the two kinds of search
operations?
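The two code samples on this slide did not survive extraction. Assuming they contrast subscript lookup with the get() method (an assumption, not confirmed by the source), the difference shows up when the key is missing; the key 'Qiqi' is hypothetical.

```python
aInfo = {'Wangdachui': 3000, 'Niuyun': 2000, 'Linling': 4500, 'Tianqi': 8000}

# Subscript lookup raises KeyError when the key is missing.
try:
    aInfo['Qiqi']
    subscript_result = 'found'
except KeyError:
    subscript_result = 'KeyError'

# get() returns None, or a supplied default, instead of raising.
get_result = aInfo.get('Qiqi')
get_default = aInfo.get('Qiqi', 0)

print(subscript_result, get_result, get_default)  # KeyError None 0
```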
• Delete a dictionary
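A brief sketch of the usual deletion tools, since the slide's code is missing: pop() removes one entry, clear() empties the dictionary in place, and del unbinds the name. The variable names and values are illustrative.

```python
aStock = {'AXP': 78.51, 'BA': 184.76}
bStock = aStock               # a second name bound to the same dictionary object

removed = aStock.pop('BA')    # remove one key and return its value
aStock.clear()                # empty the dictionary in place
print(bStock)                 # {} because bStock refers to the same object

del aStock                    # unbind the name aStock; bStock still exists
print(removed)                # 184.76
```

Rebinding with aStock = {} instead of calling clear() would leave bStock holding the old contents, which is the main reason to prefer clear() when other references exist.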
Case Study
Variable-Length Keyword Parameter (dict)
Parameter types in Python functions:
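The slide's listing is missing, so here is a hedged sketch of how a variable-length keyword parameter is received as a dict. The function describe and its arguments are hypothetical.

```python
def describe(name, *scores, **details):
    """Return the three kinds of arguments so they can be inspected."""
    # name:    ordinary positional parameter
    # scores:  extra positional arguments, collected into a tuple
    # details: extra keyword arguments, collected into a dict
    return name, scores, details

name, scores, details = describe('Niuyun', 90, 85, dept='Sales', pay=2000)
print(type(scores))    # <class 'tuple'>
print(type(details))   # <class 'dict'>
print(details)         # {'dept': 'Sales', 'pay': 2000}
```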
SET
Set
How can duplicate values be removed from an information table?
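A minimal sketch of removing duplicates with set(), using a hypothetical list of names because the slide's data is not available:

```python
names = ['Wangdachui', 'Niuyun', 'Wangzi', 'Wangdachui', 'Linling', 'Niuyun']

# set() keeps one copy of each value; the resulting order is not guaranteed.
unique_names = set(names)
print(len(unique_names))  # 4

# If the original order matters, dict.fromkeys() preserves first occurrences.
ordered_unique = list(dict.fromkeys(names))
print(ordered_unique)  # ['Wangdachui', 'Niuyun', 'Wangzi', 'Linling']
```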
• What is a set?
An unordered collection of elements with no duplicates
– Mutable set (set)
– Immutable set (frozenset)
Create a Set
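The slide's example is missing; the sketch below shows typical ways to create both kinds of set, with illustrative values.

```python
aSet = set('hello')          # from a string: the duplicate 'l' is dropped
bSet = {1, 2, 3, 2}          # set literal; the duplicate 2 is dropped
cSet = set()                 # empty set ({} would create an empty dict)
fSet = frozenset('hello')    # immutable set

aSet.add('!')                # a mutable set can be modified
try:
    fSet.add('!')            # frozenset has no add() method
    frozen_error = None
except AttributeError:
    frozen_error = 'AttributeError'

print(len(aSet), bSet, len(cSet), frozen_error)
```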
Comparison between Sets
>>> aSet = set('sunrise')
>>> bSet = set('sunset')
>>> 'u' in aSet
True
>>> aSet == bSet
False
>>> aSet < bSet
False
>>> set('sun') < aSet
True

Standard type operators
Mathematical    Python
∈               in
∉               not in
=               ==
≠               !=
⊂               <
⊆               <=
⊃               >
⊇               >=
Relational Operation
>>> aSet = set('sunrise')
>>> bSet = set('sunset')
>>> aSet & bSet
{'u', 's', 'e', 'n'}
>>> aSet | bSet
{'e', 'i', 'n', 's', 'r', 'u', 't'}
>>> aSet - bSet
{'i', 'r'}
>>> aSet ^ bSet
{'i', 'r', 't'}
>>> aSet -= set('sun')
>>> aSet
{'e', 'i', 'r'}

Set type operators
Mathematical    Python
∩               &
∪               |
- or \          -
Δ               ^

Compound assignment operators: &=  |=  -=  ^=
Built-in Functions for Sets
• Functions (set methods) can also be used
to do work similar to the operators
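As a sketch of that correspondence (the slide's own listing was lost), each set method below is compared with the operator it mirrors, reusing the 'sunrise' and 'sunset' sets; the in-place calls at the end use illustrative values.

```python
aSet = set('sunrise')
bSet = set('sunset')

# Each method gives the same result as the matching operator.
same_intersection = aSet.intersection(bSet) == (aSet & bSet)
same_union = aSet.union(bSet) == (aSet | bSet)
same_difference = aSet.difference(bSet) == (aSet - bSet)
same_symmetric = aSet.symmetric_difference(bSet) == (aSet ^ bSet)
is_subset = set('sun').issubset(aSet)      # corresponds to <=
is_superset = aSet.issuperset(set('sun'))  # corresponds to >=

# Methods that modify a mutable set in place.
aSet.add('!')       # add a single element
aSet.remove('!')    # remove it again (KeyError if absent)
aSet.update('xy')   # like |=, adds 'x' and 'y'
```

Unlike the operators, these methods accept any iterable as their argument, not only another set.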
SCIPY
LIBRARY
SciPy
Features
• A software ecosystem based on Python
• Open source
• Serves mathematics, science, and engineering
Common Data Types in Python
Dict, Numeric, String, Set, Tuple, List
Other Data Structures
– ndarray (N-dimensional array)
– Series (variable-length dictionary)
– DataFrame
NumPy
Features
• Powerful ndarray object and ufunc() functions
• Sophisticated functions
• Suitable for scientific computing such as linear algebra and
random number handling
• Flexible, general-purpose multi-dimensional data structure
• Easy to connect with databases
SciPy
Features
• A key package for scientific computing in Python, built on
NumPy. It offers more functions and methods than NumPy,
and where the two overlap, the SciPy version is probably the
more capable.
• Operates efficiently on NumPy arrays, so NumPy and SciPy
work well together.
• A toolbox for different fields of scientific computing,
with modules for interpolation, integration, optimization,
and image processing.
pandas
Features
• Built on NumPy, within the SciPy ecosystem
• Efficient Series and DataFrame structures
• Powerful Python library for scalable data processing
• Efficient slicing of large datasets
• Optimized library functions to read and write many types of
files, such as CSV and HDF5
…
>>> df[2 : 5]
>>> df.head(4)
>>> df.tail(3)
NDARRAY
Array in Python
Format
• Array module
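The slide names the array module without showing code. Assuming it refers to the standard library's array module, a minimal sketch of its typed, one-dimensional arrays might look like this (the values are illustrative):

```python
from array import array

# 'i' is the type code for signed int; every element must match it.
aArr = array('i', [1, 2, 3, 4])
aArr.append(5)

try:
    aArr.append(1.5)          # a float does not fit the 'i' type code
    type_error = None
except TypeError:
    type_error = 'TypeError'

print(aArr.tolist())          # [1, 2, 3, 4, 5]
print(aArr.typecode)          # i
```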
Ndarray
• What is ndarray?
N-dimensional array
– Basic data type in NumPy
– Elements are of the same type
– Also known by the alias array
– Reduces memory cost and improves computational
efficiency
– Powerful functions
[Figure: a 5×5 grid of cells numbered 00 to 24]
Basic Concepts of Ndarray
– Dimensions are called axes; the number of axes is the rank.
– Basic attributes
• ndarray.ndim (rank)
• ndarray.shape (size of each dimension)
• ndarray.size (total number of elements)
• ndarray.dtype (type of the elements)
• ndarray.itemsize (size of each element, in bytes)
[Figure: the 5×5 grid of cells numbered 00 to 24, with axis = 0 labeled]
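These attributes can be read off a small array. The sketch below builds a 5×5 array like the one in the figure; the explicit dtype is an assumption made here so that the printed values do not depend on the platform.

```python
import numpy as np

# A 5x5 array holding 0..24, with an explicit element type.
aArray = np.arange(25, dtype=np.int32).reshape(5, 5)

print(aArray.ndim)      # 2  (number of axes)
print(aArray.shape)     # (5, 5)
print(aArray.size)      # 25
print(aArray.dtype)     # int32
print(aArray.itemsize)  # 4  (bytes per int32 element)
```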
Creation of Ndarray
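The creation examples on these slides were lost in extraction. As a hedged sketch, these are some commonly used NumPy constructors; the variable names and arguments are illustrative.

```python
import numpy as np

aArray = np.array([(1, 2, 3), (4, 5, 6)])   # from nested sequences
bArray = np.arange(1, 5, 0.5)               # start, stop (exclusive), step
cArray = np.linspace(0, 1, 5)               # 5 evenly spaced values from 0 to 1
dArray = np.zeros((2, 3))                   # 2x3 array filled with 0.0
eArray = np.ones((3, 2))                    # 3x2 array filled with 1.0
fArray = np.random.random((2, 2))           # 2x2 random floats in [0, 1)

print(aArray.shape, bArray.size, cArray.tolist())
```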
Ndarray Operations
>>> import numpy as np
>>> aArray = np.array([(1,2,3),(4,5,6)])
>>> aArray.shape
(2, 3)
>>> bArray = aArray.reshape(3,2)
>>> bArray
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> aArray
array([[1, 2, 3],
       [4, 5, 6]])

>>> aArray.resize(3,2)
>>> aArray
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> bArray = np.array([1,3,7])
>>> cArray = np.array([3,5,8])
>>> np.vstack((bArray, cArray))
array([[1, 3, 7],
       [3, 5, 8]])
>>> np.hstack((bArray, cArray))
array([1, 3, 7, 3, 5, 8])
Ndarray Calculation
ufunc() functions operate on each element of
an array. Because many ufunc()s in NumPy are
implemented in C, they can be fast.
add, all, any, arange, apply_along_axis,
argmax, argmin, argsort, average,
bincount, ceil, clip, conj, corrcoef, cov,
cross, cumprod, cumsum, diff, dot,
exp, floor, …

# Filename: math_numpy.py
import time
import math
import numpy as np

# Element-by-element loop using the math module
x = np.arange(0, 100, 0.01)
t_m1 = time.process_time()
for i, t in enumerate(x):
    x[i] = math.pow((math.sin(t)), 2)
t_m2 = time.process_time()

# The same computation on the whole array with NumPy
y = np.arange(0, 100, 0.01)
t_n1 = time.process_time()
y = np.power(np.sin(y), 2)
t_n2 = time.process_time()

print('Running time of math:', t_m2 - t_m1)
print('Running time of numpy:', t_n2 - t_n1)
SERIES
Series
• Basic features
− An object similar to a one-dimensional array
− Consists of data and an index
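To illustrate the pairing of data and index, here is a small sketch with pandas; the slide's own listing is missing, and the stock values are borrowed from the alignment example elsewhere in this deck.

```python
import pandas as pd

# Data plus explicit index labels.
aSer = pd.Series([86.40, 122.64, 99.44], index=['AXP', 'CSCO', 'BA'])

print(aSer.index.tolist())     # ['AXP', 'CSCO', 'BA']
print(aSer.values.tolist())    # [86.4, 122.64, 99.44]
print(aSer['CSCO'])            # 122.64, looked up by label like a dictionary

# Array-style operations keep the index attached to the results.
above_90 = aSer[aSer > 90]
print(above_90.index.tolist())  # ['CSCO', 'BA']
```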
Basic Operation of Series
Data Alignment of Series
• Important feature
− Align data with different indexes during computation
>>> sindex = ['AXP', 'CSCO', 'BA', 'AAPL']  # inferred from the output below
>>> data = {'AXP':86.40,'CSCO':122.64,'BA':99.44}
>>> aSer = pd.Series(data, index = sindex)
>>> aSer
AXP      86.40
CSCO    122.64
BA       99.44
AAPL       NaN
dtype: float64
>>> bSer = {'AXP':86.40,'CSCO':130.64,'CVX':23.78}
>>> cSer = pd.Series(bSer)
>>> (aSer+cSer)/2
AAPL       NaN
AXP      86.40
BA         NaN
CSCO    126.64
CVX        NaN
dtype: float64
DATAFRAME
DataFrame
• Basic features
− A table-like data structure
− Has ordered columns (like an index)
− Can be viewed as a collection of Series sharing the same index
>>> data = {'name': ['Wangdachui', 'Linling', 'Niuyun'], 'pay': [4000, 5000, 6000]}
>>> frame = pd.DataFrame(data)
>>> frame
         name   pay
0  Wangdachui  4000
1     Linling  5000
2      Niuyun  6000
Index and Value of DataFrame
>>> frame.pay
0    4000
1    5000
2    6000
Name: pay, dtype: int64
>>> frame.iloc[ : 2, 1]
0    4000
1    5000
Name: pay, dtype: int64
Basic Operation of DataFrame
• Modification and deletion of DataFrame objects
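The slide's listings are missing. As a hedged sketch of typical modifications and deletions, using the frame defined earlier; the raise of 500, the value 4800, and the 'tax' column are hypothetical.

```python
import pandas as pd

data = {'name': ['Wangdachui', 'Linling', 'Niuyun'], 'pay': [4000, 5000, 6000]}
frame = pd.DataFrame(data)

frame['pay'] = frame['pay'] + 500        # modify a whole column
frame.loc[0, 'pay'] = 4800               # modify a single cell by label
frame['tax'] = frame['pay'] * 0.05       # add a new column (hypothetical)

frame = frame.drop(columns='tax')        # delete a column; drop() returns a new frame
frame = frame.drop(index=2)              # delete a row by its index label

print(frame)
```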
Statistics with DataFrame
• Find the members with the lowest and highest salaries in a DataFrame object
         name   pay
0  Wangdachui  4000
1     Linling  5000
2      Niuyun  6000
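One possible approach, sketched here because the slide's code is missing, is to compare the pay column against its minimum and maximum with boolean masks:

```python
import pandas as pd

frame = pd.DataFrame({'name': ['Wangdachui', 'Linling', 'Niuyun'],
                      'pay': [4000, 5000, 6000]})

# Boolean masks select every row that ties for the extreme value.
lowest = frame[frame['pay'] == frame['pay'].min()]
highest = frame[frame['pay'] == frame['pay'].max()]

print(lowest['name'].tolist())    # ['Wangdachui']
print(highest['name'].tolist())   # ['Niuyun']
```

Using a mask rather than idxmin() or idxmax() returns all members when several share the same lowest or highest salary.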
Summary