Pandas basics
NumPy and pandas are two of the most widely used Python libraries for working with data, but they serve different purposes and have distinct features. Here’s a comparison of the two:
NumPy
Purpose:
• NumPy is designed for numerical computing. It provides an efficient multi-dimensional array object (ndarray) along with mathematical functions that operate on whole arrays at once.
pandas
Purpose:
• pandas is designed for data manipulation and analysis. It provides data structures and
functions needed to work with structured data, such as tables and time series.
• NOTE: A time series is a sequence of data points typically measured at successive points in
time, usually at equally spaced intervals. Time series data is often used in various fields such
as finance, economics, weather forecasting, signal processing, and many others to analyze
patterns, trends, and cycles.
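For illustration, a minimal time series in pandas is just a Series indexed by timestamps (the values and dates below are made up):
import pandas as pd
# Three daily readings at equally spaced intervals form a simple time series
ts = pd.Series([100, 102, 101], index=pd.date_range('2024-01-01', periods=3, freq='D'))
print(ts)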
Pandas is a powerful and widely-used data manipulation and analysis library in Python. It
provides data structures and functions needed to work with structured data seamlessly. Here's
an introduction to the core components of pandas:
• Series: A one-dimensional labeled array capable of holding any data type. It is similar
to a column in a spreadsheet.
• DataFrame: A two-dimensional labeled data structure with columns that can be of
different types, much like a table in a database or a data frame in R.
import pandas as pd
# Creating a Series
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print("Series:")
print(series)
# Creating a DataFrame
data = { 'Name': ['John', 'Anna', 'Peter', 'Linda'], 'Age': [28, 24, 35, 32] }
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
Series:
0    1
1    2
2    3
3    4
4    5
dtype: int64
DataFrame:
Name Age
0 John 28
1 Anna 24
2 Peter 35
3 Linda 32
The Series is a one-dimensional array with integer values and default index labels. The DataFrame
is a two-dimensional table with columns for 'Name' and 'Age', and it also has default integer index
labels.
A pandas Series is a one-dimensional labeled array capable of holding any data type
(integers, strings, floats, etc.). It is similar to a column in a table or a single column in an
Excel sheet.
1. pd: This is the common alias for the pandas library. Before you can use it, you need to
import pandas with import pandas as pd.
2. Series: This is a constructor for creating a Series object in pandas.
3. data: This is the input data that you want to convert into a Series. In this case, data is
a list [1, 2, 3, 4, 5].
Creating the Series
series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(series)
a    1
b    2
c    3
d    4
e    5
dtype: int64
To create a Series from a dictionary, the keys of the dictionary become the indices of the Series, and
the values of the dictionary become the values of the Series
import pandas as pd
# Creating a dictionary
data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
series = pd.Series(data)
print(series)
Explanation
1. Creating the dictionary: The dictionary data has keys 'a', 'b', 'c', 'd', 'e'
and corresponding values 1, 2, 3, 4, 5.
2. Creating the Series: When you pass this dictionary to pd.Series, pandas creates a
Series where the dictionary keys become the index, and the dictionary values become
the data of the Series.
Output
a    1
b    2
c    3
d    4
e    5
dtype: int64
• Indices (a, b, c, d, e): These are the keys from the dictionary.
• Values (1, 2, 3, 4, 5): These are the values from the dictionary.
• dtype: int64: This indicates the data type of the values in the Series. In this case, it is
int64, which means 64-bit integers.
This approach is useful when you have labeled data that you want to convert into a pandas
Series for further manipulation or analysis.
Creating a pandas Series from scalar values involves specifying the scalar value and an index. This
will generate a Series where each index label is associated with the same scalar value.
import pandas as pd
# Scalar value
scalar_value = 10
series = pd.Series(scalar_value, index=['a', 'b', 'c', 'd', 'e'])
print(series)
output
a 10
b 10
c 10
d 10
e 10
dtype: int64
Explanation of the Output
• Indices (a, b, c, d, e): These are the index labels specified in the index list.
• Values (10): Each index label is associated with the scalar value 10.
• dtype: int64: This indicates the data type of the values in the Series. In this case, it is
int64, which means 64-bit integers.
This approach is useful when you want to create a Series with a constant value for each
index, such as initializing a Series or filling a Series with a specific value.
Creating a pandas Series using a specified index involves associating values with specific indices.
import pandas as pd
# List of values
data = [10, 20, 30, 40, 50]
series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(series)
output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
An empty Series can also be created with a specified index. This is useful for initializing a Series to be filled later.
import pandas as pd
# Index labels, no data yet
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(index=index, dtype='float64')
print(series)
output:
a NaN
b NaN
c NaN
d NaN
e NaN
dtype: float64
Explanation
• NaN: Stands for "Not a Number" and is used to denote missing or undefined values in
pandas.
• dtype: float64: When no data is supplied, the Series is filled with NaN values; here the dtype is float64 (passing dtype explicitly, as above, avoids relying on the default, which newer pandas versions may warn about).
By specifying the index, you can create Series that are tailored to your data structure needs,
whether you are initializing with specific values, using a scalar value, or creating an empty
Series for later use.
import pandas as pd
data = [10, 20, 30, 40, 50]
series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print("Series:")
print(series)
print("Size of Series:", series.size)
print("Dimensions of Series:", series.ndim)
print("Shape of Series:", series.shape)
print("Index of Series:", series.index)
output:
Series:
a 10
b 20
c 30
d 40
e 50
dtype: int64
Size of Series: 5
Dimensions of Series: 1
Shape of Series: (5,)
Index of Series: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Explanation
Importance of dtype
• Efficiency: Knowing the dtype helps Pandas optimize storage and computation.
dtype='object' indicates that the index labels of the Series are stored as generic Python objects, which is appropriate for string labels. If the index labels were numeric, the dtype would instead be int64 or float64.
These attributes provide essential information about the Series and help you understand its structure and how to manipulate it.
Note: A tuple is an immutable, ordered collection of elements in Python. Tuples are similar to
lists but have some key differences:
1. Ordered: The elements in a tuple have a defined order, and this order will not
change.
2. Immutable: Once a tuple is created, you cannot modify, add, or remove elements
from it. This immutability makes tuples a good choice for read-only collections of
data.
3. Heterogeneous: Tuples can contain elements of different types, including other
tuples, lists, dictionaries, and more.
4. Indexable: You can access elements in a tuple using their index, starting from 0 for
the first element.
The comma in the shape (5,) signifies that the shape is a tuple with one element. This is required by Python's syntax to differentiate single-element tuples from regular parenthesized expressions.
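A quick illustration of these properties:
t = ('Alice', 25, 'New York')   # heterogeneous, ordered tuple
print(t[0])                     # indexable: prints 'Alice'
single = (5,)                   # single-element tuple: the trailing comma is required
print(len(single))              # prints 1
# t[0] = 'Bob' would raise a TypeError because tuples are immutable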
Creating an empty DataFrame can be useful in various scenarios when working with data in pandas, for example as a placeholder that will be filled with data later. The simplest case is a DataFrame with no columns and no rows:
import pandas as pd
df = pd.DataFrame()
print(df)
output:
Empty DataFrame
Columns: []
Index: []
Creating an empty DataFrame with specific columns is a way to initialize a DataFrame with
predefined column names but without any data. This can be useful when you know the
structure of your DataFrame in advance but will populate it with data later.
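A minimal sketch, assuming the same 'Name', 'Age', 'City' columns used elsewhere in these notes:
import pandas as pd
# Empty DataFrame with predefined column names but no rows
df = pd.DataFrame(columns=['Name', 'Age', 'City'])
print(df)
This prints an empty DataFrame whose Columns list shows the three names and whose Index is empty.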
Explanation
Creating a DataFrame:
df = pd.DataFrame(data)
data is assumed to be a predefined variable containing the data to convert into a DataFrame. A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
This function, extract_column, is defined to take two arguments: a DataFrame (dataframe) and
a column name (column_name). The purpose of the function is to check if the specified column
exists in the DataFrame and print its data. If the column does not exist, it prints a message indicating
that the column was not found.
if column_name in dataframe.columns:
This line checks if the provided column_name exists in the DataFrame's columns. If it does, the
function proceeds to the next step; otherwise, it goes to the else block.
Extracting and printing the column:
column_data = dataframe[column_name]
print(column_data)
If the column exists, the function extracts the column data from the DataFrame and prints it.
else:
print(f"Column '{column_name}' not found in DataFrame.")
If the column does not exist in the DataFrame, this block prints a message indicating that the column
was not found.
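Assembled from the fragments above, the complete function might look like this (a sketch; the full original listing is not shown):
def extract_column(dataframe, column_name):
    # Check whether the requested column exists in the DataFrame
    if column_name in dataframe.columns:
        column_data = dataframe[column_name]
        print(column_data)
    else:
        print(f"Column '{column_name}' not found in DataFrame.")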
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Example usage
row_2 = extract_row(df, 2)
print("Extracted row:")
print(row_2)
return dataframe.iloc[row_index]
If the row index exists, the function extracts the row data from the DataFrame using iloc,
which is an integer-location based indexing for selection by position, and returns the row
data.
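For completeness, here is a sketch of the extract_row function implied by the explanation above; the bounds check and exact message wording are assumptions, since the full listing is not shown:
def extract_row(dataframe, row_index):
    # Make sure the integer position is valid before using iloc
    if 0 <= row_index < len(dataframe):
        return dataframe.iloc[row_index]
    else:
        print(f"Row index '{row_index}' not found in DataFrame.")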
The f in the string print(f"Row index '{row}' not found in DataFrame.") represents
an f-string, which is a feature introduced in Python 3.6. An f-string (formatted string literal)
allows you to embed expressions inside string literals, using curly braces {}.
Working
• f-string: The f before the opening quote marks indicates that the string is an f-string.
• Curly braces {}: Any expression inside the curly braces {} will be evaluated and its
result will be inserted into the string at that position.
row = 5
print(f"Row index '{row}' not found in DataFrame.")
Benefits
• Readability: f-strings make it easier to format strings and read the code.
• Performance: f-strings are generally faster than other methods of string formatting,
like using % or str.format().
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
• len(df): This expression returns the number of rows in the DataFrame. If the
DataFrame initially has 3 rows, len(df) will return 3.
• df.loc[len(df)]: The loc accessor is used to access a group of rows and columns
by labels or a boolean array. In this case, it is used to add a new row at the index
position returned by len(df).
• ['David', 40, 'San Francisco']: This is the data for the new row being added to
the DataFrame. The new row will have 'David' as the Name, 40 as the Age, and 'San
Francisco' as the City.
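Putting the bullets together, the new row is appended with a single assignment (a sketch consistent with the explanation above):
# Append a new row at the next integer position
df.loc[len(df)] = ['David', 40, 'San Francisco']
print(df)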
Output
      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   40  San Francisco
Data Types
Pandas uses NumPy for its underlying data storage, which means it leverages NumPy's
efficient array-based storage. Common data types in pandas include:
• int64
• float64
• bool
• datetime64[ns]
• object (for string or mixed types)
Example:
print(df.dtypes)
Output:
Name object
Age int64
City object
dtype: object
3. Indexing
Indexes provide fast lookups and are essential for aligning data:
Example:
df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)
Output:
      Name  Age         City
a    Alice   25     New York
b      Bob   30  Los Angeles
c  Charlie   35      Chicago
4. Storage Formats
In-Memory Storage
Data is typically stored in memory in the form of DataFrame and Series objects, allowing for
fast data manipulation and analysis.
File-Based Storage
• CSV
• Excel
• HDF5
• Parquet
• SQL databases
• JSON
• and more
EXPLANATION:
df.to_csv('data.csv', index=False)
This call writes the contents of a pandas DataFrame to a CSV (Comma Separated Values) file.
• df: This is the pandas DataFrame that you want to write to a CSV file. In your context, df
contains data with columns such as 'Name', 'Age', and 'City'.
• .to_csv(): This is a pandas DataFrame method that exports the DataFrame to a CSV file.
The to_csv method has several optional parameters that allow you to control the output
format, but here we're using two of them: the file path and index.
• 'data.csv': This is the path to the file where the DataFrame will be written. If the file
does not exist, it will be created. If it does exist, it will be overwritten.
• index=False: This parameter specifies whether to write row indices to the CSV file. By
default, index=True, meaning the row indices are included in the CSV file. Setting
index=False excludes the row indices from the output file.
DELIMITER
A delimiter in a CSV (Comma Separated Values) file is a character that separates individual
data values within each row of the file. The delimiter tells the software reading the CSV file
how to split the data into individual fields or columns.
Common Delimiters
1. Comma (,): This is the most common delimiter and is the default in CSV files. Each
value is separated by a comma. Example:
Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
2. Semicolon (;): Sometimes used instead of a comma, especially in regions where the
comma is used as a decimal separator. Example:
Name;Age;City
Alice;25;New York
Bob;30;Los Angeles
Charlie;35;Chicago
3. Pipe (|): Occasionally used to avoid conflicts with data that may contain commas or
semicolons. Example:
Name|Age|City
Alice|25|New York
Bob|30|Los Angeles
Charlie|35|Chicago
When reading or writing CSV files with pandas, you can specify the delimiter using the sep
parameter in the read_csv and to_csv methods.
import pandas as pd
df = pd.read_csv('data.csv', sep=';')
Writing a CSV File with a Custom Delimiter
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
df.to_csv('data.csv', sep=';', index=False)
Summary
Pandas can read and write CSV files with any delimiter by passing the sep parameter to read_csv and to_csv; by default the comma is used, and row indices are included unless index=False is specified.
In Pandas, slicing rows from a DataFrame can be done in various ways depending on your
requirements.
Using iloc
import pandas as pd
# Sample DataFrame
data = {
    # Sample values (the same as in the example that follows)
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
print(df)
# Select the rows at integer positions 1 and 2 (the stop position 3 is exclusive)
sliced_df = df.iloc[1:3]
print(sliced_df)
import pandas as pd
# Sample DataFrame
data = {
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
print(df)
output
   A   B    C
0  1  10  100
1  2  20  200
2  3  30  300
3  4  40  400
4  5  50  500
# Slicing with a step of 2 keeps every other row
sliced_df = df.iloc[::2]
print(sliced_df)
   A   B    C
0  1  10  100
2  3  30  300
4  5  50  500
Explanation of df.iloc[::2]
The expression df.iloc[::2] is using the iloc indexer to slice the DataFrame. Here's a
breakdown of the syntax and how it works:
• iloc: This is a Pandas method used for integer-location based indexing. It allows you
to select rows and columns by their integer positions.
• ::2: This is Python's slicing syntax. In general, slicing has the form
start:stop:step, where:
o start is the index to start the slice (inclusive).
o stop is the index to end the slice (exclusive).
o step is the step size or stride between each index in the slice.
Since start and stop are omitted and step is 2, only every other row from the original DataFrame is included in the sliced DataFrame: rows at positions 0, 2, and 4 are selected, while rows at positions 1 and 3 are skipped. This kind of slicing is useful when you want to downsample your data by taking every nth row, in this case every 2nd row.
import pandas as pd
This dictionary data contains the keys 'Timestamp', 'Temperature', 'Vibration', and 'Position', each mapping to a list of ten sensor readings taken one minute apart.
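A reconstruction of that setup, continuing from the import above; the values come from the output shown below, and the use of pd.date_range to build the timestamps is an assumption:
data = {
    'Timestamp': pd.date_range('2023-01-01 00:00:00', periods=10, freq='min'),
    'Temperature': [20, 21, 19, 22, 20, 21, 19, 23, 20, 22],
    'Vibration': [0.01, 0.02, 0.03, 0.02, 0.01, 0.04, 0.02, 0.03, 0.01, 0.02],
    'Position': [5, 6, 5, 7, 5, 6, 5, 8, 5, 7]
}
df = pd.DataFrame(data)
print(df)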
Output Explanation
Timestamp Temperature Vibration Position
0 2023-01-01 00:00:00 20 0.01 5
1 2023-01-01 00:01:00 21 0.02 6
2 2023-01-01 00:02:00 19 0.03 5
3 2023-01-01 00:03:00 22 0.02 7
4 2023-01-01 00:04:00 20 0.01 5
5 2023-01-01 00:05:00 21 0.04 6
6 2023-01-01 00:06:00 19 0.02 5
7 2023-01-01 00:07:00 23 0.03 8
8 2023-01-01 00:08:00 20 0.01 5
9 2023-01-01 00:09:00 22 0.02 7
This DataFrame could be used for further analysis, such as plotting the sensor data, detecting
anomalies, or performing statistical analysis on the readings.
Loading data into a pandas DataFrame:
Loading data into a pandas DataFrame is a fundamental step in data analysis and
manipulation. Pandas provides a variety of functions to read data from different file formats
and data sources. Here’s a guide on how to load data into pandas DataFrames from common
sources. When loading, it is often worth specifying the data type of each column explicitly, for several reasons (see the example after this list):
• Accuracy: Ensures columns are interpreted correctly (e.g., 'Age' as integers, not
strings).
• Performance: Helps pandas optimize memory usage and improve performance by
using appropriate data types.
• Error Prevention: Prevents potential issues with data type mismatches during data
processing and analysis.
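A minimal sketch, assuming a file named data.csv with Name, Age, and City columns:
import pandas as pd
# Load a CSV file and tell pandas how to interpret each column
df = pd.read_csv('data.csv', dtype={'Name': str, 'Age': 'int64', 'City': str})
print(df.dtypes)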
Using SQLAlchemy
from sqlalchemy import create_engine
import pandas as pd
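A minimal sketch, assuming a local SQLite database file example.db containing a table called users:
# Create a database connection and load a query result into a DataFrame
engine = create_engine('sqlite:///example.db')
df = pd.read_sql('SELECT * FROM users', engine)
print(df.head())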
Reading JSON
import pandas as pd
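A minimal sketch, assuming a file named data.json whose top level is a list or dict of records:
# Read a JSON file into a DataFrame
df = pd.read_json('data.json')
print(df.head())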
From Dictionary
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
From List of Dictionaries
# Creating a DataFrame from a list of dictionaries
data = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}]
df = pd.DataFrame(data)
print(df)