Pandas Notes
import pandas as pd
0. INTRODUCTION
Pandas is a Python library for analyzing data. It lets us clean messy data sets
and make them readable and relevant.
1. PANDAS SERIES
2. PANDAS DATAFRAMES
We can return a single row with '.loc[index]', or several rows by passing a list
of their indices, '.loc[list]'. The returned value is a Series in the first case
and a DataFrame in the second. Recall that with a Series, plain indexing with a
label ('series[label]') returns the element at that row; with a DataFrame, plain
indexing ('df[label]') returns a column instead.
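A minimal sketch of the difference between these ways of indexing. The data, labels and variable names are made up for illustration and are not from the notes.

```python
import pandas as pd

# Hypothetical example data (not from the notes)
df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

row = df.loc["day2"]              # one label -> a Series
rows = df.loc[["day1", "day3"]]   # list of labels -> a DataFrame
column = df["calories"]           # plain indexing on a DataFrame -> a column

s = pd.Series([1, 7, 2], index=["x", "y", "z"])
element = s["y"]                  # plain indexing on a Series -> one element
```

Note that 'row' is a Series whose index holds the column labels of the DataFrame.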
4. ANALYZING DATA
To get a quick overview of the data we use '.head(n)', where n is the number of
rows that will be shown (5 by default). '.tail(n)' does the same but starting
from the end.
The function '.info()' prints some information about the data set, such as the
number of rows and columns, the column labels, the non-null counts and the data
types.
We can convert the data in a column to a NumPy array with the attribute
'.values', applied to a column of a DataFrame or to a Series.
We can get the labels of the columns with '.columns', which returns an Index
object. We convert it to a list with the function '.tolist()'.
The function '.describe()' gives summary statistics for each numeric column:
count, mean, std, min, max and the values at the 25%, 50% and 75% percentiles.
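The inspection tools of this section can be tried together on a small DataFrame. This is only a sketch; the column names and numbers are invented for the example.

```python
import pandas as pd

# Hypothetical example data (not from the notes)
df = pd.DataFrame({
    "duration": [60, 45, 30, 60, 45, 50],
    "pulse": [110, 117, 103, 109, 117, 102],
})

first_rows = df.head(3)            # first 3 rows (5 if no argument is given)
last_rows = df.tail(2)             # last 2 rows
df.info()                          # prints row count, columns, non-null counts, dtypes
pulse_array = df["pulse"].values   # column as a NumPy array
labels = df.columns.tolist()       # Index of labels converted to a plain list
summary = df.describe()            # count, mean, std, min, percentiles, max
```

Individual statistics can then be read from the summary with '.loc', for example 'summary.loc["mean", "duration"]'.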
5. CLEANING DATA
It is important to clean bad data from the set before computing with it. We
can find empty cells, data in wrong format, wrong data and duplicates.
Also, indexing with a mask does not change the format of the data: it does not
flatten it. It just gets rid of the rows that do not satisfy the mask.
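As a small illustration of mask indexing, with made-up data and an arbitrary condition:

```python
import pandas as pd

# Hypothetical example data (not from the notes)
df = pd.DataFrame({
    "duration": [60, 45, 30, 60],
    "pulse": [110, 117, 103, 130],
})

mask = df["pulse"] > 105   # boolean Series, one True/False per row
filtered = df[mask]        # still a DataFrame with both columns; row 2 is gone
```

The surviving rows keep their original index labels.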
For empty cells we can either remove the whole row with that value or replace
the value.
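Both options can be sketched as follows. The notes do not name the replacement function at this point; '.fillna()' is used here as one common choice, and filling with the column mean is just an assumption for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one empty cell (not from the notes)
df = pd.DataFrame({
    "duration": [60, 45, 30],
    "calories": [409.1, np.nan, 300.0],
})

# Option 1: remove every row that contains an empty cell
dropped = df.dropna()

# Option 2: replace the empty cell, here with the mean of the column (assumed choice)
filled = df.fillna({"calories": df["calories"].mean()})
```

Neither call modifies 'df' itself, because 'inplace=True' is not passed.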
Wrong formats are a little more complicated. We can either remove the rows or
try to correct the values. Dates stored in an inconsistent format, but whose
digits are valid, can often be recovered with the function
'pd.to_datetime(column)', rewriting the column with the result. Empty cells
become null values (NaT), and with 'errors="coerce"' so do values that cannot be
parsed; these rows can then be removed with the function
'.dropna(subset=[column], inplace=True)'.
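A sketch of this date clean-up, using invented data. The 'errors="coerce"' argument is an addition for the example, so that the unparsable entry becomes NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical example data: one empty date and one invalid date (not from the notes)
df = pd.DataFrame({
    "date": ["2020/12/01", "2020/12/02", None, "not a date"],
    "pulse": [110, 117, 103, 109],
})

# Rewrite the column; entries that cannot be parsed become NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Remove the rows whose date is now null
df.dropna(subset=["date"], inplace=True)
```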
To correct values at a larger scale, we loop over the indices (with '.index')
and use an if statement on '.loc[index, label]' to test each value and replace
it. We could also remove the row with '.drop(index, inplace=True)'.
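The loop can be sketched in two variants, one replacing the value and one dropping the row. The column name, the data and the threshold of 120 are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical example data: 450 is assumed to be a typo (not from the notes)
capped = pd.DataFrame({"duration": [60, 450, 45, 30]})
dropped = capped.copy()

# Variant 1: replace values above the assumed limit
for i in capped.index:
    if capped.loc[i, "duration"] > 120:
        capped.loc[i, "duration"] = 120

# Variant 2: remove the offending rows instead
for i in dropped.index:
    if dropped.loc[i, "duration"] > 120:
        dropped.drop(i, inplace=True)
```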
9. REMOVING DUPLICATES
We can easily add new data based on existing data in the DataFrame. We can
use this, for example, to create columns with normalized values. The procedure is
the following:
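import numpy as np
v = df[col].values  # assumed: v holds the values of the column labeled col
v_mean = np.mean(v)
v_rms = np.sqrt(np.mean((v - v_mean)**2))  # RMS deviation from the mean
df[col + '_normalized'] = (v - v_mean) / v_rms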
If the column may contain NaN, we first use 'np.isnan(array)' to find them and
replace them with 0. To store the result in a new column, we compute the array
of values and assign it as if we were changing an existing column, with the
usual syntax 'df[new_label] = array'.
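Put together, the normalization with the NaN replacement might look like the sketch below. The DataFrame and the column name are invented, and the array is copied so that the original column is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one NaN (not from the notes)
df = pd.DataFrame({"pulse": [100.0, np.nan, 120.0, 140.0]})
col = "pulse"

v = df[col].values.copy()   # copy, so the original column keeps its NaN
v[np.isnan(v)] = 0          # boolean mask from np.isnan selects the NaN entries

v_mean = np.mean(v)
v_rms = np.sqrt(np.mean((v - v_mean) ** 2))

# Assigning to a new label creates the column
df[col + "_normalized"] = (v - v_mean) / v_rms
```

By construction the new column has mean 0 and RMS deviation 1. Keep in mind that replacing NaN with 0 shifts the mean, so this choice affects the normalized values.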
Although Pandas can build plots, we need the Pyplot library to show them,
imported with 'import matplotlib.pyplot as plt'.
We can also plot a histogram of one column (it shows how often the values fall
in each interval) with 'column.plot(kind = 'hist')'. Histograms can also be made
with 'plt.hist(column)'.
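Other things are used in the code: 'import sys', 'import matplotlib',
'matplotlib.use('Agg')' (selects a non-interactive backend that renders to a
file instead of a window), 'plt.savefig(sys.stdout.buffer)' (writes the figure
to standard output) and 'sys.stdout.flush()' (flushes the output).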
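A sketch of both histogram calls with the 'Agg' backend. To keep the example self-contained, the figure is saved into an in-memory buffer rather than 'sys.stdout.buffer'; that substitution, the data and the variable names are assumptions for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")              # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical example data (not from the notes)
df = pd.DataFrame({"duration": [60, 45, 30, 60, 45, 50, 20, 60]})

# Histogram through Pandas
df["duration"].plot(kind="hist")
buffer = io.BytesIO()
plt.savefig(buffer)                # PNG by default; a file name would also work
png_bytes = buffer.getvalue()

# Histogram directly through Pyplot, on a new figure
plt.figure()
counts, edges, _ = plt.hist(df["duration"])
plt.close("all")
```

'plt.hist' also returns the counts per interval and the interval edges, which is handy when the numbers are needed and not only the picture.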
Other plotting options are, for example, the scatter matrices, imported with:
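The import line itself is not preserved in these notes. A sketch, assuming the intended function is 'scatter_matrix' from the 'pandas.plotting' module, with invented data:

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend, no window needed
import pandas as pd
from pandas.plotting import scatter_matrix   # assumed import, not confirmed by the notes

# Hypothetical example data (not from the notes)
df = pd.DataFrame({
    "duration": [60, 45, 30, 60, 45, 50],
    "pulse": [110, 117, 103, 109, 117, 102],
    "calories": [409, 479, 340, 282, 406, 300],
})

# One scatter plot per pair of columns, histograms on the diagonal
axes = scatter_matrix(df)
```

The result is a grid of plots with one row and one column per DataFrame column.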
12. PANDAS CORRELATION? Looks interesting but has not been done in class