0% found this document useful (0 votes)
16 views

Pandas Notes

Pandas Notes

Uploaded by

Qudrat Ullah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as RTF, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views

Pandas Notes

Pandas Notes

Uploaded by

Qudrat Ullah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as RTF, PDF, TXT or read online on Scribd
You are on page 1/ 4

Data Manipulation With Pandas

Transforming DataFrames:
Inspecting a DataFrame
.head() returns the first few rows (the “head” of the DataFrame).

.info() shows information on each of the columns, such as the data type and number of missing values.

.shape returns the number of rows and columns of the DataFrame.

.describe() calculates a few summary statistics for each column.

Parts of a DataFrame
.values: A two-dimensional NumPy array of values.

.columns: An index of columns: the column names.

.index: An index for the rows: either row numbers or row names.

Sorting rows
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You
can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may
wish to break the ties by sorting on another column. You can sort on multiple columns in this way by
passing a list of column names.

Sort on … Syntax

one column df.sort_values("breed")

multiple columns df.sort_values(["col1", "col2"])

homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True,


False])
Subsetting columns
When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can
be used to select only the columns that matter to you in an order that makes sense to you. To select only
"col_a" of the DataFrame df, use

df["col_a"]

To select "col_a" and "col_b" of df, use

df[["col_a", "col_b"]]

Subsetting rows
A large part of data science is about finding which bits of your dataset are interesting. One of the
simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known
as filtering rows or selecting rows.

There are many ways to subset a DataFrame, perhaps the most common is to use relational operators to
return True or False for each row, then pass that inside square brackets.

dogs[dogs["height_cm"] > 60]

dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &.

dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

Subsetting rows by categorical variables


Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows
from multiple categories. This can get tedious when you want all states in one of three different regions,
for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one
condition instead of three separate ones.

colors = ["brown", "black", "tan"]

condition = dogs["color"].isin(colors)
dogs[condition]

Adding new columns


You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This
has many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for
example, by adding columns together or by changing their units.

Df["new_col"] = Df["col_1"] + Df["col_2"]

Statistics
2)Agregating DataFrames
Efficient summaries
While pandas and NumPy have tons of functions, sometimes, you may need a different function
to summarize your data.

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as
apply functions to more than one column of a DataFrame at once, making your aggregations
super-efficient. For example,

df['column'].agg(function)

In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th
percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if
your data contains outliers.

Cumulative statistics
Cumulative statistics can also be helpful in tracking summary statistics over time. In this
exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly
sales, which will allow you to identify what the total sales were so far as well as what the
highest weekly sales were so far.

You might also like