FDS Module 2 Notes
pandas will be a major tool which contains data structures and data manipulation tools designed to
make data cleaning and analysis fast and convenient in Python. While pandas adopts many coding
idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or
heterogeneous data. NumPy, by contrast, is best suited for working with homogeneously typed
numerical array data.
1. The DataFrame:
At the heart of pandas is the DataFrame, a versatile and intuitive data structure. Think of it as a table
with rows and columns, much like a spreadsheet. Creating a DataFrame is straightforward:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)
You get a nice, organized table with columns for names, ages, and cities.
2. Reading and Writing Data:
Pandas makes it easy to import and export data in various formats, such as CSV, Excel, and more. For
instance, reading a CSV file into a DataFrame:
import pandas as pd
df = pd.read_csv('my_data.csv')
print(df)
And saving your DataFrame to a CSV file:
import pandas as pd
df.to_csv('my_output.csv', index=False)
Pandas is like a data middleman, effortlessly shuttling data between your code and external files.
3. Selection and Filtering:
When you want to slice and dice your data, pandas offers intuitive methods. For example, selecting
specific columns and filtering rows:
import pandas as pd
names = df['Name']
print(names)
# Filtering rows based on a condition
young_people = df[df['Age'] < 30]
print(young_people)
Pandas gives you the power to extract exactly what you need from your dataset.
4. Descriptive Statistics:
Need a quick overview of your data? Pandas has you covered with descriptive statistics.
import pandas as pd
# Descriptive statistics
statistics = df.describe()
print(statistics)
You get a summary of key statistics like mean, standard deviation, and quartiles.
In a nutshell, pandas is like a trusted companion in the world of data science, offering an intuitive and
efficient way to handle tabular data. Whether you're wrangling messy datasets, conducting analyses, or
preparing data for machine learning, pandas is your go-to tool, simplifying complex data tasks into easy-
to-understand Python code.
1. Series:
In pandas, a Series is a one-dimensional labeled array capable of holding any data type. It is essentially a
column in a DataFrame or a standalone data structure. Series provides a powerful and flexible way to
index and manipulate data.
Creating a Series:
You can create a Series using the pd.Series() constructor, passing a Python list or NumPy array.
import pandas as pd
data = [10, 20, 30, 40]
s = pd.Series(data)
Accessing Elements:
You can access elements in a Series using the index, similar to how you access elements in a Python list.
s[0]
You can also supply a custom index and access elements by label:
s_custom_index = pd.Series(data, index=['a', 'b', 'c', 'd'])
s_custom_index['b']
Attributes:
A Series exposes its underlying values, its index, and its data type as attributes:
s.values
s_custom_index.index
s.dtype
Series supports various operations and vectorized functions, making it convenient for element-wise
operations.
# Arithmetic operations
s * 2
# Mathematical functions
import numpy as np
np.sqrt(s)
You can use boolean indexing to filter elements based on certain conditions.
# Boolean indexing
s[s > 2]
2. DataFrames:
In pandas, a DataFrame is a two-dimensional labeled data structure with columns that can be of
different types. It is similar to a spreadsheet or a SQL table and provides a flexible and powerful tool for
data manipulation and analysis. Here are some key aspects of pandas DataFrames:
Creating a DataFrame:
You can create a DataFrame using the pd.DataFrame() constructor, passing a dictionary, a list of
dictionaries, a NumPy array, or another DataFrame.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
Inspecting a DataFrame:
head() shows the first few rows, info() provides a concise summary of the DataFrame (including data
types and missing values), and describe() generates summary statistics for numeric columns.
df.head()
df.info()
df.describe()
You can access and manipulate data in a DataFrame using various methods:
# Selecting a column
df['Name']
# Selecting multiple columns
df[['Name', 'Age']]
# Selecting a single value by row label and column name
df.loc[1, 'City']
You can combine multiple DataFrames using methods like merge() and concat().
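As a brief sketch of both methods (the DataFrames and column names here are illustrative), merge() joins DataFrames on a shared column, while concat() stacks them along an axis:

```python
import pandas as pd

# Two small DataFrames sharing a 'Name' column (illustrative data)
left = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
right = pd.DataFrame({'Name': ['Alice', 'Bob'], 'City': ['New York', 'San Francisco']})

# merge() joins on a common column, like a SQL join
merged = pd.merge(left, right, on='Name')

# concat() stacks DataFrames along an axis (rows by default)
more_rows = pd.DataFrame({'Name': ['Charlie'], 'Age': [22]})
stacked = pd.concat([left, more_rows], ignore_index=True)
```

merge() defaults to an inner join; the how parameter ('left', 'right', 'outer') controls which keys are kept.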
3. Index Objects:
In pandas, an Index object is an immutable array or an ordered set used to uniquely identify the rows or
columns of a DataFrame or Series. Indexing allows for efficient data retrieval and manipulation. Here are
some key points about Index objects in pandas:
Index objects are ordered, which means they have a specific order that can be utilized in various
operations.
For DataFrames, there are both row (axis 0) and column (axis 1) indexes; df.index and df.columns
are both Index objects.
For Series, the index is an Index object that labels the data.
For example:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
custom_index = ['a', 'b', 'c']
df = pd.DataFrame(data, index=custom_index)
print(df)
Here, 'a', 'b', and 'c' are the custom indices for the DataFrame.
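To illustrate the points above with a small sketch (the labels are illustrative), Index objects are immutable and support set-like operations:

```python
import pandas as pd

idx1 = pd.Index(['a', 'b', 'c'])
idx2 = pd.Index(['b', 'c', 'd'])

# Set-like operations between Index objects
common = idx1.intersection(idx2)
combined = idx1.union(idx2)

# Index objects are immutable: in-place assignment raises TypeError
immutable = False
try:
    idx1[0] = 'z'
except TypeError:
    immutable = True
```

Immutability is what lets the same Index be shared safely between multiple Series and DataFrames.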
These pandas data structures work together seamlessly. You can extract a Series from a DataFrame,
manipulate data within each structure, and even perform operations between them.
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
ages = df['Age']
print("DataFrame:")
print(df)
print("\nSeries (Ages):")
print(ages)
Pandas allows you to harness the power of these structures to make data manipulation and analysis
more intuitive and efficient. Whether you're handling a single column or managing an entire dataset,
Series, DataFrames, and Index objects are your trusty companions in the world of data science.
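As an illustration of operations flowing between these structures (a small sketch with illustrative values), a Series extracted from a DataFrame supports vectorized arithmetic, and the result can be assigned back as a new column:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22]})

# A Series extracted from the DataFrame supports vectorized arithmetic
ages = df['Age']
df['Age_next_year'] = ages + 1

# Operations between two Series align on the index
deviation = ages - ages.mean()
```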
Essential functionality:
In the world of data science basics, let's unravel some essential functionalities of pandas – a powerful
tool for handling and manipulating data. These functions make it easier to clean, analyze, and extract
insights from your datasets.
1. Reindexing:
Reindexing in pandas refers to the process of creating a new object with the data conformed to a new
index. It involves rearranging the existing data to match a new set of labels. This operation is useful in
various scenarios, such as aligning data from different sources or reshaping data to fit a desired
structure. Here are the key aspects of reindexing in pandas:
Reindexing a Series:
You can reindex a Series using the reindex() method. It creates a new Series with the specified
index.
import pandas as pd
# Original Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
# Reindexing with an additional label
s_reindexed = s.reindex(['a', 'b', 'c', 'd'])
The new index can contain additional labels, and missing labels will be filled with NaN by
default.
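Where NaN is not wanted, reindex() accepts a fill_value, and for ordered data a fill method such as 'ffill' (a brief sketch with illustrative values):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

# Fill missing labels with a constant instead of NaN
filled = s.reindex(range(6), fill_value=0)

# Forward-fill: carry the last valid value forward to new labels
ffilled = s.reindex(range(6), method='ffill')
```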
Reindexing a DataFrame:
For a DataFrame, reindex() reorders the rows by default; columns can be reindexed via the columns
parameter.
# Original DataFrame
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
df_reindexed = df.reindex(['y', 'x', 'z'])
Example 2:
import pandas as pd
# Creating a DataFrame with a custom index
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
custom_index = ['a', 'b', 'c']
df = pd.DataFrame(data, index=custom_index)
print(df)
# Reindexing the DataFrame
new_order = ['c', 'a', 'b']
df_reindexed = df.reindex(new_order)
print(df_reindexed)
2. Dropping Entries from an Axis:
In pandas, dropping entries from an axis typically refers to removing rows or columns from a DataFrame
or elements from a Series. The drop() method is commonly used for this purpose. Here are the key
aspects of dropping entries from an axis:
You can drop rows from a DataFrame using the drop() method and specifying the index labels of
the rows to be removed.
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
# Dropping a row by its index label
new_df = df.drop('b')
The drop() method returns a new DataFrame with the specified rows removed. To modify the
original DataFrame in place, you can use the inplace=True parameter.
df.drop('b', inplace=True)
You can drop columns from a DataFrame by specifying the column names and setting the axis
parameter to 1.
df.drop('A', axis=1)
For a Series, you can use the drop() method to remove specific elements by index label.
# Creating a Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
new_s = s.drop(['b'])
You can drop rows based on a certain condition using boolean indexing.
df_filtered = df[df['Age'] >= 25]
Example 2:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Dropping a column
df_dropped_column = df.drop('Age', axis=1)
print(df_dropped_column)
3. Indexing, Selection, and Filtering:
Indexing, selection, and filtering are fundamental operations in pandas for retrieving and manipulating
data in Series and DataFrames.
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Selecting a column
names = df['Name']
# Filtering based on a condition
young_people = df[df['Age'] < 30]
print("Selected Names:")
print(names)
print("\nYoung People:")
print(young_people)
4. Arithmetic Operations and Data Alignment:
In pandas, arithmetic operations can be performed on Series and DataFrames, allowing for element-wise
computations and mathematical operations.
import pandas as pd
# Creating two Series
series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c','d'])
series2 = pd.Series([5, 6, 7, 8], index=['b', 'c', 'd', 'a'])
# Adding the Series
result = series1 + series2
print("Result of Addition:")
print(result)
Data alignment is a fundamental concept in pandas that ensures meaningful and predictable results
when performing operations on Series and DataFrames with different indices. When you perform
operations (such as addition, subtraction, multiplication, etc.) on pandas objects with different indices,
pandas aligns the data based on the indices, aligning values with the same label.
import pandas as pd
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
result_addition = s1 + s2
In this example, when performing the addition (s1 + s2), pandas aligns the data based on the indices:
For index 'a', there is a value in s1 but not in s2, so the result for 'a' is NaN.
For indices 'b' and 'c', values from both Series are added.
For index 'd', there is a value in s2 but not in s1, so the result for 'd' is NaN.
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]}, index=['b', 'c', 'd'])
result = df1 + df2
In this example, when performing the addition (df1 + df2), pandas aligns the data based on both indices
and columns:
Column 'A' exists only in df1 and column 'C' only in df2, so those columns are NaN throughout
the result; likewise, index 'a' appears only in df1 and index 'd' only in df2, so those rows are
NaN. Only the overlap is actually added: column 'B' at indices 'b' and 'c'.
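The NaN results from non-overlapping labels can be avoided with the add() method's fill_value parameter, which substitutes the given value for an entry that is missing on one side (a brief sketch using the same df1 and df2):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]}, index=['b', 'c', 'd'])

# A missing entry on one side is treated as 0 before adding; entries
# missing from *both* operands still come out as NaN
result = df1.add(df2, fill_value=0)
```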
5. Function Application and Mapping:
For Series, you can apply a function to each element using the apply() method:
import pandas as pd
# Creating a Series
s = pd.Series([1, 2, 3, 4])
result = s.apply(lambda x: x * 2)
For DataFrames, use the applymap() method to apply a function to each element (in recent pandas
versions, DataFrame.map() is the preferred name for this):
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
result_df = df.applymap(lambda x: x * 2)
Function application in pandas is a powerful feature that allows you to efficiently manipulate and
transform your data. Depending on the specific use case, you can choose the appropriate method for
applying functions at the element-wise level, along columns, or within groups.
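The column-wise and group-wise cases mentioned above can be sketched as follows (illustrative data and column names):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'Group': ['x', 'x', 'y']})

# apply() along an axis: here, the range (max - min) of each numeric column
col_range = df[['A', 'B']].apply(lambda col: col.max() - col.min())

# Function application within groups via groupby()
group_sums = df.groupby('Group')['A'].sum()
```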
import pandas as pd
# Creating a DataFrame
data = {'Numbers': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Applying a square root function
result = df['Numbers'].apply(lambda x: x**0.5)
print("Result after Applying Square Root:")
print(result)
6. Sorting and Ranking:
Sorting and ranking are common operations in pandas for arranging data in a meaningful order. Here are
key aspects of sorting and ranking in pandas:
Sorting:
Sorting Series:
For Series, you can use the sort_values() method to sort the values:
import pandas as pd
# Creating a Series
s = pd.Series([3, 1, 4, 1, 5, 9, 2])
sorted_s = s.sort_values()
To sort based on index labels, use sort_index():
sorted_s_index = s.sort_index()
Sorting DataFrames:
For DataFrames, you can use sort_values() to sort by one or more columns:
# Creating a DataFrame
df = pd.DataFrame({'A': [3, 1, 2], 'B': [6, 5, 4]})
sorted_df_A = df.sort_values(by='A')
sorted_df_index = df.sort_index()
sorted_df_columns = df.sort_index(axis=1)
Ranking:
Ranking Series:
For Series, you can use the rank() method to assign ranks to elements:
# Creating a Series
s = pd.Series([3, 1, 4, 1, 5, 9, 2])
ranked_s = s.rank()
The method parameter in rank() specifies how to handle ties. Default is 'average', but other
methods include 'min', 'max', and 'dense'.
ranked_s_min = s.rank(method='min')
Ranking DataFrames:
For DataFrames, you can use rank() to assign ranks along rows or columns:
# Creating a DataFrame
df = pd.DataFrame({'A': [3, 1, 2], 'B': [6, 5, 4]})
ranked_df_rows = df.rank(axis=1)
ranked_df_columns = df.rank(axis=0)
These sorting and ranking operations are essential for organizing and analyzing data in pandas. Whether
you need to order your data or assign ranks based on certain criteria, these functions provide flexibility
and control over the arrangement of your data.
import pandas as pd
# Creating a DataFrame
data = {'Numbers': [5, 2, 8, 1, 3]}
df = pd.DataFrame(data)
# Sorting the DataFrame
df_sorted = df.sort_values(by='Numbers')
# Ranking the DataFrame
df_ranked = df.rank()
print("Sorted DataFrame:")
print(df_sorted)
print("\nRanked DataFrame:")
print(df_ranked)
1. Correlation and Covariance:
Pandas allows us to peek into the relationships between variables using correlation and covariance.
Correlation measures the strength and direction of a linear relationship on a normalized scale (-1 to 1),
while covariance measures how two variables vary together in their original units.
import pandas as pd
# Creating a DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [2, 4, 6, 8, 10]}
df = pd.DataFrame(data)
correlation = df['A'].corr(df['B'])
covariance = df['A'].cov(df['B'])
print("Correlation:", correlation)
print("Covariance:", covariance)
2. Unique Values:
When dealing with categorical data, it's often useful to know the unique values in a column. Pandas
provides a straightforward way to extract this information.
import pandas as pd
data = {'Category': ['apple', 'banana', 'apple', 'orange', 'banana']}
df = pd.DataFrame(data)
unique_values = df['Category'].unique()
print("Unique Values:")
print(unique_values)
3. Value Counts and Membership:
import pandas as pd
data = {'Category': ['apple', 'banana', 'apple', 'orange', 'banana']}
df = pd.DataFrame(data)
# Counting occurrences
value_counts = df['Category'].value_counts()
# Checking membership
is_fruit = df['Category'].isin(['apple', 'banana'])
print("Value Counts:")
print(value_counts)
print("\nMembership:")
print(is_fruit)
In a nutshell, pandas provides a set of tools to make data summary and descriptive statistics a walk in
the park. Whether you're exploring relationships between variables, extracting unique values, or
counting occurrences, pandas offers an intuitive and efficient way to unravel insights from your data. It's
like having a trustworthy assistant that simplifies the complexities of data analysis into easy-to-
understand Python code.
Consider a scenario where you have data stored in a text file, and you want to bring it into your Python
environment as a DataFrame – pandas makes this a piece of cake. The read_csv() function is your go-to
tool, even if your data isn't actually comma-separated.
import pandas as pd
df = pd.read_csv('my_data.txt', delimiter='\t')
print(df)
Here, the delimiter='\t' specifies that the data is tab-separated. Pandas is flexible enough to
accommodate various delimiters, ensuring your data is imported correctly.
Conversely, if you've performed your data magic using pandas and want to save the results in a text file,
the to_csv() function comes to the rescue. You can specify the delimiter and other parameters as
needed.
import pandas as pd
data_to_save = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
                'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data_to_save)
df.to_csv('my_output.txt', sep='\t', index=False)
In this example, sep='\t' specifies a tab as the delimiter when saving. The index=False ensures that the
DataFrame index is not included in the output.
Pandas isn't limited to just CSV or plain text. It can handle various text formats, including Excel files. The
read_excel() function extends its capabilities to reading data from Excel spreadsheets.
import pandas as pd
df_from_excel = pd.read_excel('my_data.xlsx')
print(df_from_excel)
Pandas' flexibility allows you to seamlessly read data from different text-based sources, making it a
versatile tool for various data formats.
In summary, pandas' functionality for reading and writing data in text format is like having a language
interpreter for your data. It understands different dialects (delimiters) and can translate your data
between Python and text files effortlessly. Whether you're importing data for analysis or saving your
results, pandas is your trusty companion in the journey of data science.
HDF5 Format:
One notable binary format pandas supports is HDF5 (Hierarchical Data Format version 5). It's like a
digital bookshelf, organizing your data hierarchically. With HDF5, you can store large datasets and
retrieve specific portions without loading the entire content.
import pandas as pd
# Assuming df is your DataFrame
data_to_save = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
                'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data_to_save)
# Saving DataFrame to HDF5 format
df.to_hdf('my_data.h5', key='data', mode='w')
This snippet saves your DataFrame to an HDF5 file. The key='data' provides a label for your data within
the HDF5 file.
import pandas as pd
# Reading data from an HDF5 file
df_from_hdf = pd.read_hdf('my_data.h5', key='data')
print(df_from_hdf)
Pandas effortlessly retrieves the data from the HDF5 file.
In essence, pandas' support for binary data formats is like having a superpower for handling large
datasets. Whether you're saving your work for later or loading massive datasets on-the-fly, these binary
formats provide efficiency and speed, making pandas a reliable companion in the world of data science.
Firstly, let's break down the jargon. A Web API (Application Programming Interface) is essentially
a set of rules that allows one piece of software to interact with another. In our case, we're
interested in web APIs, which are like data gatekeepers on the internet. Websites often expose
certain data through APIs, granting you access to specific information in a structured format.
To start our conversation with a web API, we need a way to send requests and receive
responses. Python's requests library is like our messenger, handling the communication.
import requests
# Making a simple GET request to an API
response = requests.get('https://api.example.com/data')
# Checking the status of our request
if response.status_code == 200:
    # The data is in JSON format, let's print it
    print(response.json())
else:
    print(f"Request failed with status code {response.status_code}")
This is a basic example of making a GET request to an API. If all goes well, the data is usually
returned in JSON format, a common language for data exchange on the web.
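When the returned JSON is nested, pd.json_normalize() can flatten it into columns (a sketch with an illustrative payload; the field names here are made up, not from any real API):

```python
import pandas as pd

# A nested payload of the kind an API might return (illustrative)
payload = [
    {'id': 1, 'user': {'name': 'Alice', 'city': 'New York'}},
    {'id': 2, 'user': {'name': 'Bob', 'city': 'San Francisco'}},
]

# Nested fields become dotted column names like 'user.name'
df = pd.json_normalize(payload)
```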
Now, let's bring in our data wrangler, pandas. Once we've fetched data from a web API, pandas
helps us organize it into a DataFrame – a structured table that's easy to work with.
import requests
import pandas as pd
# Making a GET request to an API
response = requests.get('https://api.example.com/data')
# Checking the status of our request
if response.status_code == 200:
    # Converting JSON data to a DataFrame
    df = pd.DataFrame(response.json())
    # Let's print our DataFrame
    print(df)
else:
    print(f"Request failed with status code {response.status_code}")
With pandas, our API data is now neatly arranged in a DataFrame, ready for analysis or further
manipulation.
Sometimes, APIs require additional information to fetch the exact data you need. We can
include parameters in our requests.
import requests
import pandas as pd
# Making a GET request with parameters
params = {'user_id': 123, 'start_date': '2022-01-01'}
response = requests.get('https://api.example.com/user_data', params=params)
# Checking the status of our request
if response.status_code == 200:
    # Converting JSON data to a DataFrame
    df = pd.DataFrame(response.json())
    # Let's print our DataFrame
    print(df)
else:
    print(f"Request failed with status code {response.status_code}")
By incorporating parameters, we tailor our request to fetch user-specific data or data within a
certain timeframe.
In essence, interacting with web APIs using Python and pandas is akin to having a conversation
with the vast knowledge stored on the internet. It's a dynamic process where Python's requests
library acts as our messenger, fetching data from APIs, and pandas organizes that data into a
structured format for seamless analysis. This interaction opens doors to a world of real-time
information and possibilities in the realm of data science.
Using sqlite3 for Relational Databases:
1. Import the Library:
import sqlite3
2. Connect to a Database (the file is created if it does not exist):
connection = sqlite3.connect('my_database.db')
3. Create a Cursor:
# Create a cursor object to execute SQL queries
cursor = connection.cursor()
4. Execute SQL Queries:
# Create a table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY,
        username TEXT,
        email TEXT
    )
''')
# Insert data into the table
cursor.execute("INSERT INTO users (username, email) VALUES (?, ?)",
               ('john_doe', 'john@example.com'))
# Commit the changes
connection.commit()
5. Fetch Data:
# Fetch data from the table
cursor.execute("SELECT * FROM users")
rows = cursor.fetchall()
for row in rows:
    print(row)
6. Close the Connection:
# Close the connection when done
connection.close()
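pandas can also read query results straight into a DataFrame via read_sql_query(), which accepts a sqlite3 connection (a brief self-contained sketch using an in-memory database):

```python
import sqlite3
import pandas as pd

# In-memory database so the sketch is self-contained
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, email TEXT)")
cursor.execute("INSERT INTO users (username, email) VALUES (?, ?)",
               ('john_doe', 'john@example.com'))
connection.commit()

# Query results land straight in a DataFrame
df = pd.read_sql_query("SELECT * FROM users", connection)
connection.close()
```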
Using SQLAlchemy for Relational Databases:
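A minimal sketch of the SQLAlchemy route (assumes SQLAlchemy is installed; the in-memory SQLite URL is used purely for illustration): SQLAlchemy provides an engine object that pandas' SQL functions accept, so the same code works across database backends.

```python
import pandas as pd
from sqlalchemy import create_engine

# An in-memory SQLite engine; a real database would use a URL like
# 'postgresql://user:password@host/dbname'
engine = create_engine('sqlite:///:memory:')

# to_sql() writes a DataFrame to a table; read_sql() reads it back
df = pd.DataFrame({'username': ['john_doe'], 'email': ['john@example.com']})
df.to_sql('users', engine, index=False)
result = pd.read_sql('SELECT * FROM users', engine)
```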