Python for Data Analysis
CSC 430_530_DA_401_501
Summer 2023
Chaofan Sun (sunc@cua.edu)
Textbook
Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython 2nd Edition, Kindle Edition, by
Wes McKinney (Author)
• ISBN-13: 978-1491957660,
• ISBN-10: 1491957662
PDF version is available in Blackboard.
Outline of this class
1. Review (Chapters 1-4): three or four weeks, then Exam 1
• Jupyter Notebook
• Python (list, tuple, dictionary, loop, and if-elif-else)
• NumPy
2. Pandas (Chapters 5-8, 10): six weeks, then Exam 2
• Data extraction, parsing, joining, standardizing, cleaning
• Data consolidating and filtering
• Statistics
• Etc.
3. Visualization (Chapter 9 and online resources): two weeks, no exam
• pandas DataFrame plots
• matplotlib
• Seaborn
Python Libraries for Data Science
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
• SciKit-Learn
• Keras
Visualization libraries
• matplotlib
• Seaborn
and many more …
NumPy:
• introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
• provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
• many other Python libraries are built on NumPy
Link: http://www.numpy.org/
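A minimal sketch of what vectorization means in practice. The temperature readings and variable names are made up for illustration; only the NumPy calls themselves are real.

```python
import numpy as np

# Hypothetical sample data: five temperature readings in Celsius.
celsius = np.array([12.5, 18.0, 21.3, 25.1, 30.4])

# Vectorized arithmetic: one expression is applied to every element,
# with no explicit Python loop.
fahrenheit = celsius * 9 / 5 + 32

# The equivalent plain-Python loop, shown only for comparison.
fahrenheit_loop = [c * 9 / 5 + 32 for c in celsius]

# Statistical operations are available as array methods.
mean_c = celsius.mean()
std_c = celsius.std()

# Multidimensional arrays: a 2x3 matrix and its column sums.
matrix = np.arange(6).reshape(2, 3)
column_sums = matrix.sum(axis=0)

print(fahrenheit)
print(mean_c, std_c)
print(column_sums)
```

On large arrays the vectorized form is usually much faster than the loop, because the per-element work runs in compiled code rather than in the Python interpreter.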
SciPy:
• collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
• part of SciPy Stack
• built on NumPy
Link: https://www.scipy.org/scipylib/
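As a small illustration of two of the areas listed above (numerical integration and optimization), the sketch below uses `scipy.integrate.quad` and `scipy.optimize.minimize_scalar`. The functions being integrated and minimized are arbitrary examples chosen because their exact answers are known.

```python
from scipy import integrate, optimize

# Numerical integration: integrate x^2 from 0 to 1 (exact answer is 1/3).
# quad returns the estimate and an estimate of the absolute error.
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)

# Optimization: find the x that minimizes (x - 2)^2 (exact answer is x = 2).
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

print(area, abs_error)
print(result.x)
```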
Pandas:
• adds data structures (Series and DataFrame) and tools designed to work with table-like data (similar to data frames in R)
• provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
• allows handling missing data
Link: http://pandas.pydata.org/
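The operations named above can be sketched on a tiny table. The student names, sections, scores, and room labels are invented for illustration, and filling a missing score with the column mean is just one possible way to handle missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical exam scores; Ben's score is missing (NaN).
scores = pd.DataFrame({
    "student": ["Ana", "Ben", "Cara", "Dev"],
    "section": ["A", "B", "A", "B"],
    "exam1": [88.0, np.nan, 92.0, 75.0],
})

# Handling missing data: replace NaN with the mean of the known scores.
filled = scores.assign(exam1=scores["exam1"].fillna(scores["exam1"].mean()))

# Sorting: highest score first.
ranked = filled.sort_values("exam1", ascending=False)

# Aggregation: mean score per section.
section_means = filled.groupby("section")["exam1"].mean()

# Merging: attach a (hypothetical) room to each student via the section key.
sections = pd.DataFrame({"section": ["A", "B"], "room": ["Room 101", "Room 202"]})
merged = filled.merge(sections, on="section", how="left")

print(ranked)
print(section_means)
print(merged)
```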
SciKit-Learn:
• provides machine learning algorithms: classification, regression, clustering, model validation etc.
• built on NumPy, SciPy and matplotlib
Link: http://scikit-learn.org/
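A rough sketch of a classification workflow with model validation, using the iris dataset that ships with scikit-learn. The choice of logistic regression, the 25% test split, and the random seed are assumptions made for this example, not course requirements.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the built-in iris dataset as feature matrix X and label vector y.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows to validate the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier on the training rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model validation: fraction of held-out rows classified correctly.
accuracy = model.score(X_test, y_test)
print(accuracy)
```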
matplotlib:
• Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
• a set of functionalities similar to those of MATLAB
• line plots, scatter plots, bar charts, histograms, pie charts etc.
• relatively low-level; some effort needed to create advanced visualizations
Link: https://matplotlib.org/
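A short sketch of a line plot and a bar chart drawn side by side and saved to a file. The study-hours numbers and the output file name `study_hours.png` are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical data: hours studied in each of five weeks.
weeks = [1, 2, 3, 4, 5]
hours = [3, 4, 6, 5, 8]

# One figure with two side-by-side axes.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with a marker at each data point.
ax_line.plot(weeks, hours, marker="o")
ax_line.set_title("Line plot")
ax_line.set_xlabel("Week")
ax_line.set_ylabel("Hours")

# Bar chart of the same data.
ax_bar.bar(weeks, hours)
ax_bar.set_title("Bar chart")
ax_bar.set_xlabel("Week")

# Save the figure to an image file (the file name is an assumption).
fig.tight_layout()
fig.savefig("study_hours.png", dpi=150)
```

In a Jupyter notebook the figure is normally also displayed below the cell. Each label and title is set explicitly, which is what "relatively low-level" means here.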
Seaborn:
• based on matplotlib
• provides a high-level interface for drawing attractive statistical graphics
• similar (in style) to the popular ggplot2 library in R
Link: https://seaborn.pydata.org/
For Coding: Python
Jupyter notebook
Python/Pandas: Online Resources
Webpages:
• https://www.w3schools.com/python/python_syntax.asp
• https://www.geeksforgeeks.org/python-programming-language/
Video:
• https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y
• https://www.youtube.com/watch?v=ZyhVh-qRZPA
GitHub also hosts many example notebooks.