Pandas Notes

Pandas is a Python library used for working with structured data and tables. It allows users to clean, analyze, and visualize data. Pandas provides Series and DataFrame objects for working with one-dimensional and two-dimensional labeled data structures. DataFrames can be created from various data sources like lists, dictionaries, and CSV/JSON files. Pandas offers methods for cleaning data by handling missing values, reformatting data types, and removing duplicates. It also provides functions for analyzing data through descriptive statistics, grouping, and plotting visualizations.

Uploaded by

Edu Costa

PANDAS NOTES

import pandas as pd

0. INTRODUCTION

Pandas is a Python library used to analyze tabular data. It allows us to clean messy
data sets and make them readable and relevant.

1. PANDAS SERIES

A Series is a column in a table and it is created with 'pd.Series(list, index=)' or
'pd.Series(dict, index=)'. If 'index' is not specified, the labels are the positions
0, 1, 2, ..., but we can write the exact label we want for each value. Then we can
access a value through its label, which is either the position or the label we gave
it. When the series is created from a dictionary, the keys become the labels, and
with 'index' we can select only the labels we want.
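
A minimal sketch of the three ways of creating a Series described above. The values
and labels are made up for illustration.

```python
import pandas as pd

# From a list: the default labels are 0, 1, 2
s_default = pd.Series([420, 380, 390])

# From a list with custom labels
s_labeled = pd.Series([420, 380, 390], index=["x", "y", "z"])

# From a dictionary: keys become labels; index= keeps only the chosen ones
calories = {"day1": 420, "day2": 380, "day3": 390}
s_subset = pd.Series(calories, index=["day1", "day2"])

print(s_default[0])     # access through the default label
print(s_labeled["y"])   # access through a custom label
print(s_subset)
```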

2. PANDAS DATAFRAMES

A DataFrame is a full table of data. We use 'pd.DataFrame(data, index=)' where data
is a dictionary whose keys are the column labels and whose values are lists with the
elements of each column. We can also label the rows with a list passed to 'index',
as done with the series, and then refer to the rows by those labels.

We can return a row with '.loc[label]' or several rows with a list of their
labels, '.loc[list]'. The returned value will be a Series and a DataFrame
respectively. Recall that with a series, indexing with square brackets selects rows
by label, but with a DataFrame it selects columns.
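
As an illustration, the following sketch builds a small DataFrame and compares the
different ways of indexing it. The column names and values are made up.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day2"]              # one label -> Series
rows = df.loc[["day1", "day3"]]   # list of labels -> DataFrame
column = df["calories"]           # square brackets select a column

print(type(row).__name__, type(rows).__name__)
print(column)
```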

We can use the method '.rename(columns=, inplace=)' to rename the columns, where
'columns' is a dictionary that maps each old name to its new name.
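
A short sketch of renaming, with made-up column names. Here the result is stored in
a new variable instead of using 'inplace=True', which is only one possible choice.

```python
import pandas as pd

df = pd.DataFrame({"cal": [420, 380], "dur": [50, 40]})

# columns= maps old names to new names; names not listed stay unchanged
renamed = df.rename(columns={"cal": "Calories", "dur": "Duration"})

print(renamed.columns.tolist())
```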

3. READ CSV (AND JSON)

A CSV file can be imported to a DataFrame with 'pd.read_csv(file)'. We can
print the entire DataFrame with 'print(df.to_string())', or we can print it directly,
but then it will not show all the rows if the table is long. We can check the maximum
number of rows displayed with 'pd.options.display.max_rows' and change it with
'pd.options.display.max_rows = 1000'. For JSON files the command is
'pd.read_json(file)'. JSON objects have a structure similar to Python dictionaries.
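
A sketch of both readers. To keep it self-contained it reads from in-memory text
with 'io.StringIO' instead of real files; the contents are made up, and a file path
can be passed in the same place.

```python
import io
import pandas as pd

csv_text = "Duration,Pulse,Calories\n60,110,409.1\n45,117,282.4\n"
df_csv = pd.read_csv(io.StringIO(csv_text))   # a file path works the same way

json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 117}}'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.to_string())            # prints every row
print(pd.options.display.max_rows)   # current display limit
```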

4. ANALYZING DATA

To get a quick overview of the data we use '.head(n)', where n is the number of
rows shown from the top (5 by default). '.tail(n)' does the same but starting
from the end.

The function '.info()' gives some information about the data set like rows,
columns, labels, non-null counts and data types.

We can convert a column of a DataFrame, or a Series, to a NumPy array with the
attribute '.values'.

We can get the labels of the columns with '.columns', which returns an Index
object. We convert it to a list with the method '.tolist()'.

The method '.describe()' gives information such as the count, mean, std, min,
max and the 25%, 50% and 75% percentiles of each numeric column.
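
The commands of this section can be tried together on a small table. This is only a
sketch with made-up values.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30, 60],
    "Calories": [409.1, 479.0, 340.0, 282.4, 195.1, 300.0],
})

print(df.head(3))   # first 3 rows (5 if no argument is given)
print(df.tail(2))   # last 2 rows
df.info()           # prints column names, non-null counts and dtypes

arr = df["Duration"].values    # NumPy array with the column values
labels = df.columns.tolist()   # column labels as a plain list
stats = df.describe()          # count, mean, std, min, 25%, 50%, 75%, max

print(arr, labels)
print(stats)
```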

5. CLEANING DATA

It is important to clean bad data from the set before computing with it. We
can find empty cells, data in wrong format, wrong data and duplicates.

Also, indexing with a boolean mask does not change the format of the data, and it
does not flatten it. It just drops the rows that do not satisfy the mask.
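
A small sketch of a mask. The limit of 120 is an arbitrary value chosen for the
example.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 450, 45],
    "Calories": [409.1, 479.0, 282.4],
})

mask = df["Duration"] <= 120   # boolean Series, one value per row
filtered = df[mask]            # still a DataFrame with the same columns

print(filtered)
```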

6. CLEANING EMPTY CELLS

For empty cells we can either remove the whole row with that value or replace
the value.

To remove the row, we use '.dropna(subset=, inplace=True)', where 'subset' lists
the columns that are checked for empty cells and 'inplace=True' changes the original
DataFrame instead of returning a new one.

Empty cells can be replaced with a desired value with '.fillna(value, inplace=)'.
If we want to do that for a certain column only, we select that column with the
indexing syntax, with the label inside the brackets, and assign the result back to
it. The value can be a chosen one, the '.mean()' (average value), the '.median()'
(value in the middle after sorting them in ascending order) or the '.mode()[0]'
(the value that appears most frequently).
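
A sketch of both options, removing and replacing, with made-up values. Filling with
the median is just one of the choices listed above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, 117, np.nan, 103],
    "Calories": [409.1, np.nan, 340.0, 340.0],
})

# Remove the rows that have an empty cell in the "Pulse" column
dropped = df.dropna(subset=["Pulse"])

# Replace the empty cells of one column, assigning the result back
filled = df.copy()
filled["Calories"] = filled["Calories"].fillna(filled["Calories"].median())

mean_value = df["Calories"].mean()
median_value = df["Calories"].median()
mode_value = df["Calories"].mode()[0]   # mode() returns a Series

print(dropped)
print(filled)
```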

7. CLEANING WRONG FORMAT

Data in the wrong format is a little more complicated. We can either remove the
rows or try to correct the values. Dates whose digits are correct but that are not
stored as proper dates can often be converted by re-writing the column with
'pd.to_datetime(column)'. Empty cells become NaT, a null value, and so do values
that cannot be converted if we pass the argument 'errors="coerce"'. These rows can
then be removed with '.dropna(subset=[column], inplace=)'.
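
A sketch of this clean-up with made-up dates. It assumes we prefer to turn
unreadable values into NaT with 'errors="coerce"' and drop them afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", "not a date", None],
    "Pulse": [110, 117, 103, 109],
})

# errors="coerce" turns values that cannot be parsed into NaT instead of raising
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

cleaned = df.dropna(subset=["Date"])

print(df)
print(cleaned)
```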

8. CLEANING WRONG DATA

If we spot a wrong value, we can just change it by indexing with
'.loc[indexoftherow, labelofcolumn] = newvalue'.

If we want to do this at a larger scale, we loop over the row labels (with
'.index') and use an if statement on '.loc[index, label]' to decide which values to
change. We could also remove the row with '.drop(index, inplace=)'.
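
A sketch of the three approaches. The data and the limit of 120 are assumptions
made for the example.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 450, 45, 300]})

# Fix one known wrong value
df.loc[1, "Duration"] = 45

# Larger scale: cap every value above the limit
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

# Alternative: drop the rows with wrong values instead of correcting them
df2 = pd.DataFrame({"Duration": [60, 450, 45, 300]})
for x in df2.index:
    if df2.loc[x, "Duration"] > 120:
        df2.drop(x, inplace=True)

print(df)
print(df2)
```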

9. REMOVING DUPLICATES

We find duplicates with '.duplicated()', which returns a boolean Series with True
for every row that repeats an earlier one. We remove them with
'.drop_duplicates(inplace=True)'.
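
A short sketch with a made-up table in which the third row repeats the first one.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Pulse": [110, 117, 110, 103],
})

dups = df.duplicated()          # True where a row repeats an earlier row
unique = df.drop_duplicates()   # returns a new DataFrame; the first copy is kept

print(dups)
print(unique)
```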

10. ADDING DATA

We can easily add new data based on existing data in the DataFrame, for example
to create columns with normalized values. The procedure is the following, where 'df'
is the DataFrame and 'col' is the label of the column:

import numpy as np
v = df[col].values                        # assumed: v holds the column values
v_mean = np.mean(v)
v_rms = np.sqrt(np.mean((v-v_mean)**2))   # standard deviation of the column
df[col+'_normalized'] = (v-v_mean)/v_rms

Before doing so, we can use 'np.isnan(array)' to replace possible NaN values with 0,
for example with 'np.where(np.isnan(v), 0, v)'. To store the result in a new column
we just calculate the array with the values and assign it to a new label, with the
same syntax used to change values.
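
A runnable sketch of the whole procedure on a made-up column, including the NaN
replacement. The names 'df', 'col' and 'v' follow the snippet above; the data is an
assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Pulse": [100.0, 110.0, np.nan, 130.0]})
col = "Pulse"

v = df[col].to_numpy()
v = np.where(np.isnan(v), 0, v)   # replace NaN with 0 before normalizing

v_mean = np.mean(v)
v_rms = np.sqrt(np.mean((v - v_mean) ** 2))
df[col + "_normalized"] = (v - v_mean) / v_rms

print(df)
```

The new column has zero mean and unit standard deviation by construction.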

11. PANDAS PLOTTING

Although Pandas allows building plots, we need the Pyplot module of Matplotlib to
show them, imported with 'import matplotlib.pyplot as plt'.

We can simply plot with '.plot()' and show with 'plt.show()'.

We can do scatter plots with an x and a y axis with
'.plot(kind = 'scatter', x = 'Duration', y = 'Calories')'.

We can also plot histograms with the information of one column (they show how many
values fall in each interval) with 'column.plot(kind = 'hist')'. Histograms can also
be done with 'plt.hist(column)'.

Other commands appear in the code to produce the plot without opening a window:
'import sys', 'import matplotlib', 'matplotlib.use('Agg')' (selects a
non-interactive backend), 'plt.savefig(sys.stdout.buffer)' (writes the image to the
standard output) and 'sys.stdout.flush()'.
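
A sketch of a scatter plot and a histogram side by side. The data is made up, and
the figure is saved to an in-memory buffer with the 'Agg' backend so that it runs
without a display; in an interactive session 'plt.show()' would be used instead.

```python
import io
import matplotlib
matplotlib.use("Agg")   # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30, 60],
    "Calories": [409.1, 479.0, 340.0, 282.4, 195.1, 300.0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
df.plot(kind="scatter", x="Duration", y="Calories", ax=ax1)
df["Duration"].plot(kind="hist", ax=ax2)

buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # a file name could be passed instead
plt.close(fig)
```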

Other plotting options are, for example, the scatter matrices, imported with:

from pandas.plotting import scatter_matrix

and used with 'scatter_matrix(df[normalized_cols][:2000], figsize=(12, 12),
alpha=0.2, s=50, diagonal='kde')'. This draws a grid of scatter plots, one for every
pair of columns, with a density estimate of each column on the diagonal.
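
A sketch of a scatter matrix on random, made-up data. It uses 'diagonal="hist"'
instead of 'kde' because the density estimate additionally requires SciPy.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend: no window is opened
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])

# One scatter plot per pair of columns, histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), alpha=0.2, diagonal="hist")

print(axes.shape)
```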

12. PANDAS CORRELATION? Looks interesting but has not been done in class
