0% found this document useful (0 votes)
9 views

Python Datasci Slides

Uploaded by

mira.jeni1love
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views

Python Datasci Slides

Uploaded by

mira.jeni1love
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Teaching Data Science using Python II

The Python Data Science Ecosystem


Teaching Data Science using Python

Sandbox option: Berkeley's Data 8 course


• Uses the datascience package

Real world option uses a set of Python


packages:
• Standard Python libraries
• NumPy
• Pandas
• Matplotlib
• Also: seaborn, statsmodels, scikitlearn
Python Data Science packages

Going to give a basic overview of some of the main Python Data


Science packages

Will redo the avocado analyses using some of these packages


NumPy is a library that adds support for large, multi-dimensional arrays and
matrices, along with a large collection of high-level mathematical functions
to operate on these arrays.
• i.e., it is similar to MATLAB

The core data structure of NumPy is its "ndarray".

Ndarrays are similar to Python lists, except that all elements in an ndarray
must of the same type
• E.g., all elements are numbers, or all elements are strings, etc.
import numpy as np SciPy contains modules for
optimization, linear algebra,
x = np.array([1, 2, 3]) integration, interpolation, FFT, signal
2 * x and image processing, etc.
• Uses ndarrays as main data structure
# the numbers 0 to 9
x = np.arange(10)

# 3 x 3 matrix
M = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
pandas is a library for data manipulation and analysis that has two main
data structures:

1. Series: One-dimensional ndarray with an index for each value


• Similar to a named vector in R

2. DataFrame: Two-dimensional, size-mutable, potentially


heterogeneous tabular data.
• Similar to an R data frame
• (or multiple Series of the same length with the same index)
import pandas as pd
avocado = pd.read_csv("avocado.csv")
avocado.head(3) # show the first 3 rows

avocado["AveragePrice"] # returns a series

# Get the average value for all numerical


columns separately for each type of avocado
avocado.groupby("type").mean().reset_index()
Matplotlib is a plotting library. Each plot has a figure and a number of different
subplots (axes).
• somewhat similar to base R graphics

It has two interfaces for plotting:

1. A "pylab" procedural interface based on a state machine that closely resembles


MATLAB
• Updates are made to the most recent axis plotted on

2. An object-oriented API
• Updates are made to the axis that is selected

The objected oriented interface is preferred (not a big difference)


import matplotlib.pyplot as plt

# pylab interface (like matlab)


plt.plot([1,3,10]);

# object oriented interface


fig, ax = plt.subplots()
ax.plot([1,3,10]);
seaborn is a visualization library built off Matplotlib, but it provides a
higher level interface that uses Pandas DataFrames
• somewhat similar to ggplot
Figure level plots
There are "axes-level" functions that plot
on a single axis and "figure-level"
functions that plot across multiple axes

Figure level plots are grouped based on


the types of variables being plotted
• E.g., a single quantitative variable, two
quantitative variables, etc.
import seaborn as sns
penguins = sns.load_dataset("penguins")

# figure-level plot
sns.displot(data=penguins,
x="flipper_length_mm",
hue="species",
multiple="stack",
kind="kde");
Translation between Tables and DataFrames

Translation between datascience Tables and pandas DataFrames

Translation between datascience Tables and babypandas DataFrames


Let’s try it ourselves!

You might also like