FDS Module 2 Notes

The document discusses key concepts and functionalities of the pandas library in Python. It covers pandas data structures like Series and DataFrames, and functions for data cleaning, manipulation and analysis like reindexing, merging, filtering and descriptive statistics.

Uploaded by

Nanda Kishore. E

Module 2: DATA EXPLORATION WITH PANDAS

Chapter 1: Getting Started with pandas

Process of exploring data using pandas:


pandas is a powerful and popular open-source data manipulation and analysis library for Python. It
provides easy-to-use data structures like DataFrame and Series, along with a variety of functions for
data cleaning, exploration, and analysis.

pandas is a major tool for this work: it contains data structures and data manipulation tools designed to
make data cleaning and analysis fast and convenient in Python. While pandas adopts many coding
idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or
heterogeneous data. NumPy, by contrast, is best suited for working with homogeneously typed
numerical array data.
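This contrast shows up directly in dtypes: a NumPy array forces a single dtype on everything it holds, while a DataFrame keeps one dtype per column. A minimal sketch (the names are illustrative):

```python
import numpy as np
import pandas as pd

# Mixing strings and ints in a NumPy array coerces everything to one dtype
arr = np.array([['Alice', 25], ['Bob', 30]])
print(arr.dtype)   # a single string dtype for the whole array

# A DataFrame keeps a separate dtype for each column
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
print(df.dtypes)   # Name is object, Age stays integer
```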

1. The DataFrame:

At the heart of pandas is the DataFrame, a versatile and intuitive data structure. Think of it as a table
with rows and columns, much like a spreadsheet. Creating a DataFrame is straightforward:

import pandas as pd

# Creating a simple DataFrame

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

print(df)

You get a nice, organized table with columns for names, ages, and cities.

2. Reading and Writing Data:

Pandas makes it easy to import and export data from various formats, such as CSV, Excel, and more. For
instance, reading a CSV file into a DataFrame:

import pandas as pd

# Reading data from a CSV file

df = pd.read_csv('my_data.csv')

print(df)
And saving your DataFrame to a CSV file:

import pandas as pd

# Saving DataFrame to a CSV file

df.to_csv('my_output.csv', index=False)

Pandas is like a data middleman, effortlessly shuttling data between your code and external files.

3. Data Selection and Filtering:

When you want to slice and dice your data, pandas offers intuitive methods. For example, selecting
specific columns:

import pandas as pd

# Selecting specific columns

names = df['Name']

print(names)

Or filtering rows based on a condition:

import pandas as pd

# Filtering rows based on age

young_people = df[df['Age'] < 30]

print(young_people)

Pandas gives you the power to extract exactly what you need from your dataset.

4. Descriptive Statistics:

Need a quick overview of your data? Pandas has you covered with descriptive statistics.

import pandas as pd

# Descriptive statistics

statistics = df.describe()

print(statistics)

You get a summary of key statistics like mean, standard deviation, and quartiles.
In a nutshell, pandas is like a trusted companion in the world of data science, offering an intuitive and
efficient way to handle tabular data. Whether you're wrangling messy datasets, conducting analyses, or
preparing data for machine learning, pandas is your go-to tool, simplifying complex data tasks into easy-
to-understand Python code.

Pandas data structures – Series, Data frame, Index objects


In the realm of data science essentials, let's dive into the world of pandas and its fundamental data
structures: Series, DataFrames, and Index objects. Think of these as the building blocks that pandas uses
to help you manage and analyze data in a way that's both powerful and user-friendly.

1. Series:

In pandas, a Series is a one-dimensional labeled array capable of holding any data type. It is essentially a
column in a DataFrame or a standalone data structure. Series provides a powerful and flexible way to
index and manipulate data.

Creating a Series:

You can create a Series using the pd.Series() constructor, passing a Python list or NumPy array.


import pandas as pd

# Creating a Series from a list

data = [10, 20, 30, 40]

s = pd.Series(data)

# Creating a Series with custom index

s_custom_index = pd.Series(data, index=['a', 'b', 'c', 'd'])

Accessing Elements:

You can access elements in a Series using the index, similar to how you access elements in a Python list.


# Accessing elements by position
s[0]  # returns the first element

# Accessing elements by index
s_custom_index['b']  # returns the element with index 'b'

Attributes and Methods:

 values: Returns the data as a NumPy array.

 index: Returns the index of the Series.

 dtype: Returns the data type of the Series.


s.values

s_custom_index.index

s.dtype

Operations and Vectorized Functions:

Series supports various operations and vectorized functions, making it convenient for element-wise
operations.

# Arithmetic operations
s * 2

# Mathematical functions

import numpy as np

np.sqrt(s)

Filtering and Boolean Indexing:

You can use boolean indexing to filter elements based on certain conditions.

# Boolean indexing
s[s > 20]


import pandas as pd
# Creating a Series
my_series = pd.Series([10, 20, 30, 40])
print(my_series)
You get a neat output of index-value pairs, where each value has its unique index.

2. DataFrames:

In pandas, a DataFrame is a two-dimensional labeled data structure with columns that can be of
different types. It is similar to a spreadsheet or a SQL table and provides a flexible and powerful tool for
data manipulation and analysis. Here are some key aspects of pandas DataFrames:

Creating a DataFrame:

You can create a DataFrame using the pd.DataFrame() constructor, passing a dictionary, a list of
dictionaries, a NumPy array, or another DataFrame.


import pandas as pd

# Creating a DataFrame from a dictionary

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

Viewing and Inspecting DataFrames:

 head(): Returns the first few rows of the DataFrame.

 tail(): Returns the last few rows of the DataFrame.

 info(): Provides a concise summary of the DataFrame, including data types and missing values.

 describe(): Generates descriptive statistics of the DataFrame.

df.head()
df.tail()
df.info()
df.describe()

Indexing and Selecting Data:

You can access and manipulate data in a DataFrame using various methods:


# Selecting a column

df['Name']

# Selecting multiple columns

df[['Name', 'Age']]

# Selecting rows based on conditions

df[df['Age'] > 30]

# Accessing a specific element: the row labeled 1 in the 'City' column
df.loc[1, 'City']
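Alongside label-based `loc`, pandas also provides position-based `iloc`, which selects strictly by integer position regardless of the index labels. A short comparison, rebuilding the same DataFrame so the snippet stands alone:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'San Francisco', 'Los Angeles']})

# loc selects by label; with the default RangeIndex, labels happen to be 0, 1, 2
by_label = df.loc[1, 'City']

# iloc selects by integer position: second row, third column
by_position = df.iloc[1, 2]

print(by_label, by_position)  # both are 'San Francisco'
```

The distinction matters once a DataFrame has a custom index, where labels and positions no longer coincide.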

Merging and Concatenating:

You can combine multiple DataFrames using methods like merge() and concat().


# Merging two DataFrames based on a common column

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 70000, 80000]})

merged_df = pd.merge(df1, df2, on='ID')
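While `merge()` joins on key columns, `concat()` stacks objects along an axis. A minimal sketch of stacking two DataFrames row-wise (the frames here are illustrative):

```python
import pandas as pd

df_top = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df_bottom = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'Dana']})

# Stack rows; ignore_index=True renumbers the combined index from 0
stacked = pd.concat([df_top, df_bottom], ignore_index=True)
print(stacked)
```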

3. Index Objects:

In pandas, an Index object is an immutable array or an ordered set used to uniquely identify the rows or
columns of a DataFrame or Series. Indexing allows for efficient data retrieval and manipulation. Here are
some key points about Index objects in pandas:

Immutable and Ordered:


 Index objects are immutable, meaning once they are created, their values cannot be changed.
This immutability is beneficial for maintaining data integrity.

 Index objects are ordered, which means they have a specific order that can be utilized in various
operations.
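Immutability can be demonstrated directly: assigning to an element of an Index raises a TypeError, leaving the original labels untouched.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Attempting to modify an Index element fails
try:
    idx[0] = 'z'
except TypeError as err:
    print("Index is immutable:", err)
```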

Used in DataFrames and Series:

 For DataFrames, there are both row (axis 0) and column (axis 1) indexes; both the row labels and
the column labels are stored as Index objects.

 For Series, the index is an Index object that labels the data.

For example:

import pandas as pd

# Creating a DataFrame with a custom index

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

custom_index = ['a', 'b', 'c']

df = pd.DataFrame(data, index=custom_index)

print(df)

Here, 'a', 'b', and 'c' are the custom indices for the DataFrame.

These pandas data structures work together seamlessly. You can extract a Series from a DataFrame,
manipulate data within each structure, and even perform operations between them.

import pandas as pd

# Creating a DataFrame

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

# Extracting a Series from the DataFrame

ages = df['Age']

print("DataFrame:")

print(df)
print("\nSeries (Ages):")

print(ages)

Pandas allows you to harness the power of these structures to make data manipulation and analysis
more intuitive and efficient. Whether you're handling a single column or managing an entire dataset,
Series, DataFrames, and Index objects are your trusty companions in the world of data science.

Essential functionality:
In the world of data science basics, let's unravel some essential functionalities of pandas – a powerful
tool for handling and manipulating data. These functions make it easier to clean, analyze, and extract
insights from your datasets.

1. Reindexing:

Reindexing in pandas refers to the process of creating a new object with the data conformed to a new
index. It involves rearranging the existing data to match a new set of labels. This operation is useful in
various scenarios, such as aligning data from different sources or reshaping data to fit a desired
structure. Here are the key aspects of reindexing in pandas:

Reindexing a Series:

 You can reindex a Series using the reindex() method. It creates a new Series with the specified
index.


import pandas as pd

# Original Series

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Reindexing with a new index

new_s = s.reindex(['a', 'b', 'c', 'd'])

 The new index can contain additional labels, and missing labels will be filled with NaN by
default.
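If NaN is not the placeholder you want, `reindex()` accepts a `fill_value`, or a fill `method` such as `'ffill'` that propagates the last valid value forward (the method form requires a monotonically ordered index):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Fill missing labels with 0 instead of NaN
filled = s.reindex(['a', 'b', 'c', 'd'], fill_value=0)

# Or carry the last valid value forward into the new label
ffilled = s.reindex(['a', 'b', 'c', 'd'], method='ffill')

print(filled)
print(ffilled)
```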

Reindexing a DataFrame:

 For DataFrames, reindexing can be applied to both rows and columns.



# Original DataFrame

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

# Reindexing rows with a new index

new_df_rows = df.reindex(['a', 'b', 'c', 'd'])

# Reindexing columns with a new index

new_df_columns = df.reindex(columns=['A', 'B', 'C'])

Example 2:

import pandas as pd

# Creating a DataFrame with a custom index
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
custom_index = ['a', 'b', 'c']
df = pd.DataFrame(data, index=custom_index)
print(df)

# Reindexing the DataFrame into a new row order
new_order = ['c', 'a', 'b']
df_reindexed = df.reindex(new_order)
print(df_reindexed)

2. Dropping Entries from an Axis:

In pandas, dropping entries from an axis typically refers to removing rows or columns from a DataFrame
or elements from a Series. The drop() method is commonly used for this purpose. Here are the key
aspects of dropping entries from an axis:

Dropping Rows from a DataFrame:

 You can drop rows from a DataFrame using the drop() method and specifying the index labels of
the rows to be removed.


import pandas as pd

# Creating a DataFrame

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])


# Dropping rows with index labels 'b' and 'c'

new_df = df.drop(['b', 'c'])

 The drop() method returns a new DataFrame with the specified rows removed. To modify the
original DataFrame in place, you can use the inplace=True parameter.


df.drop(['b', 'c'], inplace=True)

Dropping Columns from a DataFrame:

 You can drop columns from a DataFrame by specifying the column names and setting the axis
parameter to 1.


# Dropping columns 'B'

df_without_b = df.drop(['B'], axis=1)

Dropping Entries from a Series:

 For a Series, you can use the drop() method to remove specific elements by index label.


# Creating a Series

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Dropping the element with index label 'b'

new_s = s.drop(['b'])

Dropping Rows with Conditions:

 You can drop rows based on a certain condition using boolean indexing.


# Dropping rows where the value in column 'A' is less than 2

df_filtered = df[df['A'] >= 2]

Example 2:

import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Dropping a column
df_dropped_column = df.drop('Age', axis=1)
print(df_dropped_column)

3. Indexing, Selection, and Filtering:

Indexing, selection, and filtering are fundamental operations in pandas for retrieving and manipulating
data in Series and DataFrames.

import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Selecting a column
names = df['Name']
# Filtering based on a condition
young_people = df[df['Age'] < 30]
print("Selected Names:")
print(names)
print("\nYoung People:")
print(young_people)

4. Arithmetic and Data Alignment:

In pandas, arithmetic operations can be performed on Series and DataFrames, allowing for element-wise
computations and mathematical operations.

import pandas as pd
# Creating two Series
series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c','d'])
series2 = pd.Series([5, 6, 7, 8], index=['b', 'c', 'd', 'a'])
# Adding the Series
result = series1 + series2
print("Result of Addition:")
print(result)

Data alignment is a fundamental concept in pandas that ensures meaningful and predictable results
when performing operations on Series and DataFrames with different indices. When you perform
operations (such as addition, subtraction, multiplication, etc.) on pandas objects with different indices,
pandas aligns the data based on the indices, aligning values with the same label.

Data Alignment in Series:



import pandas as pd

# Creating two Series with different indices

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Performing addition on Series

result_addition = s1 + s2

In this example, when performing the addition (s1 + s2), pandas aligns the data based on the indices:

 For index 'a', there is a value in s1 but not in s2, so the result for 'a' is NaN.

 For indices 'b' and 'c', values from both Series are added.

 For index 'd', there is a value in s2 but not in s1, so the result for 'd' is NaN.
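When NaN for non-overlapping labels is unwanted, the arithmetic methods (`add()`, `sub()`, and so on) accept a `fill_value` that stands in for the missing side before the operation. Reusing `s1` and `s2` from above (redefined here so the snippet stands alone):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Treat a value missing on either side as 0 before adding
result = s1.add(s2, fill_value=0)
print(result)  # a: 1.0, b: 6.0, c: 8.0, d: 6.0
```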

Data Alignment in DataFrames:


# Creating two DataFrames with different indices and columns

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]}, index=['b', 'c', 'd'])

# Performing addition on DataFrames

result_addition_df = df1 + df2

In this example, when performing the addition (df1 + df2), pandas aligns the data based on both indices
and columns:

 For index 'a' and column 'A', there is a value in df1 but not in df2, so the result for 'a' and 'A' is
NaN.

 For indices 'b' and 'c' in the shared column 'B', values from both DataFrames are added.

 For index 'd' and column 'C', there is a value in df2 but not in df1, so the result for 'd' and 'C' is
NaN.
5. Function Application and Mapping:

Function application in pandas involves applying a function to elements, rows, or columns of a
DataFrame or Series. Pandas provides several methods for function application, allowing you to perform
operations on your data efficiently. Here are some key aspects of function application in pandas:

Applying Functions to Elements (Element-wise Operations):

 For Series, you can apply a function to each element using the apply() method:


import pandas as pd

# Creating a Series

s = pd.Series([1, 2, 3, 4])

# Applying a function to each element

result = s.apply(lambda x: x * 2)

 For DataFrames, use the applymap() method to apply a function to each element (in pandas 2.1
and later, DataFrame.map() is the preferred name for the same operation):


# Creating a DataFrame

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Applying a function to each element

result_df = df.applymap(lambda x: x * 2)

Applying Functions to Columns or Rows:


 Use the apply() method along with the axis parameter to apply a function to either columns or
rows of a DataFrame:


# Applying a function to each column

result_column = df.apply(lambda x: x.sum(), axis=0)

# Applying a function to each row

result_row = df.apply(lambda x: x.sum(), axis=1)

Function application in pandas is a powerful feature that allows you to efficiently manipulate and
transform your data. Depending on the specific use case, you can choose the appropriate method for
applying functions at the element-wise level, along columns, or within groups.

import pandas as pd
# Creating a DataFrame
data = {'Numbers': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Applying a square root function
result = df['Numbers'].apply(lambda x: x**0.5)
print("Result after Applying Square Root:")
print(result)

6. Sorting and Ranking:

Sorting and ranking are common operations in pandas for arranging data in a meaningful order. Here are
key aspects of sorting and ranking in pandas:

Sorting:

Sorting Series:

 For Series, you can use the sort_values() method to sort the values:


import pandas as pd

# Creating a Series

s = pd.Series([3, 1, 4, 1, 5, 9, 2])

# Sorting the Series

sorted_s = s.sort_values()
 To sort based on index labels, use sort_index():


# Sorting by index labels

sorted_s_index = s.sort_index()

Sorting DataFrames:

 For DataFrames, you can use sort_values() to sort by one or more columns:


# Creating a DataFrame

df = pd.DataFrame({'A': [3, 1, 4, 1], 'B': [5, 9, 2, 6]})

# Sorting by column 'A'

sorted_df_A = df.sort_values(by='A')

# Sorting by multiple columns

sorted_df_multiple = df.sort_values(by=['A', 'B'])

 To sort based on index labels or column names, use sort_index():


# Sorting by index labels

sorted_df_index = df.sort_index()

sorted_df_columns = df.sort_index(axis=1)

Ranking:

Ranking Series:

 For Series, you can use the rank() method to assign ranks to elements:


# Creating a Series

s = pd.Series([3, 1, 4, 1, 5, 9, 2])

# Ranking the Series
ranked_s = s.rank()

 The method parameter in rank() specifies how to handle ties. Default is 'average', but other
methods include 'min', 'max', and 'dense'.


# Ranking with the 'min' method for ties

ranked_s_min = s.rank(method='min')

Ranking DataFrames:

 For DataFrames, you can use rank() to assign ranks along rows or columns:


# Creating a DataFrame

df = pd.DataFrame({'A': [3, 1, 4, 1], 'B': [5, 9, 2, 6]})

# Ranking along rows

ranked_df_rows = df.rank(axis=1)

# Ranking along columns

ranked_df_columns = df.rank(axis=0)

 You can specify the ascending parameter to rank in descending order:


# Ranking in descending order along rows

ranked_df_descending = df.rank(axis=1, ascending=False)

These sorting and ranking operations are essential for organizing and analyzing data in pandas. Whether
you need to order your data or assign ranks based on certain criteria, these functions provide flexibility
and control over the arrangement of your data.

import pandas as pd
# Creating a DataFrame
data = {'Numbers': [5, 2, 8, 1, 3]}
df = pd.DataFrame(data)
# Sorting the DataFrame
df_sorted = df.sort_values(by='Numbers')
# Ranking the DataFrame
df_ranked = df.rank()
print("Sorted DataFrame:")
print(df_sorted)
print("\nRanked DataFrame:")
print(df_ranked)

Summarizing and computing descriptive statistics - Correlation and covariance, Unique values, Value counts and membership
In the vast terrain of data science essentials, let's explore how pandas, a nifty Python library, simplifies
the task of summarizing and computing descriptive statistics. This involves understanding the correlation
and covariance between variables, identifying unique values, and counting occurrences.

1. Correlation and Covariance:

Pandas allows us to peek into the relationships between variables using correlation and covariance.
Covariance measures whether two variables change together, but its magnitude depends on the units of
the data; correlation is the normalized version, always between -1 and 1, which makes it easier to
compare across variable pairs.
import pandas as pd

# Creating a DataFrame

data = {'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]}

df = pd.DataFrame(data)

# Calculating correlation and covariance

correlation = df['A'].corr(df['B'])

covariance = df['A'].cov(df['B'])

print("Correlation:", correlation)

print("Covariance:", covariance)

2. Unique Values:

When dealing with categorical data, it's often useful to know the unique values in a column. Pandas
provides a straightforward way to extract this information.

import pandas as pd

# Creating a DataFrame with a categorical column

data = {'Category': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana', 'Orange']}

df = pd.DataFrame(data)

# Extracting unique values

unique_values = df['Category'].unique()

print("Unique Values:")

print(unique_values)

3. Value Counts and Membership:


Counting occurrences of each unique value in a column and checking membership in a set are common
tasks. Pandas makes these operations a breeze.

import pandas as pd

# Creating a DataFrame with a categorical column

data = {'Category': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana', 'Orange']}

df = pd.DataFrame(data)

# Counting occurrences

value_counts = df['Category'].value_counts()

# Checking membership

is_apple_present = 'Apple' in df['Category'].values

print("Value Counts:")

print(value_counts)

print("\nIs 'Apple' Present:", is_apple_present)
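For vectorized membership tests (a yes/no per row rather than a single answer), `Series.isin()` is the idiomatic tool. A short sketch:

```python
import pandas as pd

categories = pd.Series(['Apple', 'Banana', 'Apple', 'Orange'])

# Boolean mask marking which rows fall in the given set
mask = categories.isin(['Apple', 'Orange'])
print(mask)

# The mask can then be used to filter the Series
print(categories[mask])
```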

In a nutshell, pandas provides a set of tools to make data summary and descriptive statistics a walk in
the park. Whether you're exploring relationships between variables, extracting unique values, or
counting occurrences, pandas offers an intuitive and efficient way to unravel insights from your data. It's
like having a trustworthy assistant that simplifies the complexities of data analysis into easy-to-
understand Python code.

Chapter 2: Data Loading, Storage, and File Formats


Reading and writing data in text format:
In the world of data science basics, pandas serves as a valuable ally when it comes to handling data in
text format. Whether you're importing information from external sources or exporting your processed
data, pandas simplifies the process with easy-to-use functions.

Reading Data from Text:

Consider a scenario where you have data stored in a text file, and you want to bring it into your Python
environment as a DataFrame – pandas makes this a piece of cake. The read_csv() function is your go-to
tool, even if your data isn't actually comma-separated.

import pandas as pd

# Reading data from a text file (assuming it's tab-separated)
df = pd.read_csv('my_data.txt', delimiter='\t')

print(df)

Here, the delimiter='\t' specifies that the data is tab-separated. Pandas is flexible enough to
accommodate various delimiters, ensuring your data is imported correctly.

Writing Data to Text:

Conversely, if you've performed your data magic using pandas and want to save the results in a text file,
the to_csv() function comes to the rescue. You can specify the delimiter and other parameters as
needed.

import pandas as pd

# Assuming df is your DataFrame

data_to_save = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
                'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data_to_save)

# Saving DataFrame to a text file (tab-separated)

df.to_csv('my_output.txt', sep='\t', index=False)

In this example, sep='\t' specifies a tab as the delimiter when saving. The index=False ensures that the
DataFrame index is not included in the output.

Reading Other Text Formats:

Pandas isn't limited to just CSV or plain text. It can handle various text formats, including Excel files. The
read_excel() function extends its capabilities to reading data from Excel spreadsheets.

import pandas as pd

# Reading data from an Excel file

df_from_excel = pd.read_excel('my_data.xlsx')
print(df_from_excel)

Pandas' flexibility allows you to seamlessly read data from different text-based sources, making it a
versatile tool for various data formats.

In summary, pandas' functionality for reading and writing data in text format is like having a language
interpreter for your data. It understands different dialects (delimiters) and can translate your data
between Python and text files effortlessly. Whether you're importing data for analysis or saving your
results, pandas is your trusty companion in the journey of data science.

Binary data formats:


In the field of data science basics, pandas steps up its game when dealing with binary data formats.
These formats are like the secret codes of data storage, allowing you to efficiently save and retrieve
information without the fuss of text-based files.

HDF5 Format:

One notable binary format pandas supports is HDF5 (Hierarchical Data Format version 5). It's like a
digital bookshelf, organizing your data hierarchically. With HDF5, you can store large datasets and
retrieve specific portions without loading the entire content.

import pandas as pd
# Assuming df is your DataFrame
data_to_save = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22],
                'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data_to_save)
# Saving DataFrame to HDF5 format
df.to_hdf('my_data.h5', key='data', mode='w')
This snippet saves your DataFrame to an HDF5 file. The key='data' provides a label for your data within
the HDF5 file.

To read from an HDF5 file:

import pandas as pd
# Reading data from an HDF5 file
df_from_hdf = pd.read_hdf('my_data.h5', key='data')
print(df_from_hdf)
Pandas effortlessly retrieves the data from the HDF5 file.
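The partial retrieval mentioned above relies on saving in the queryable 'table' format; a hedged sketch (this needs the optional PyTables package, and `data_columns=True` is what makes the 'Age' column queryable):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22]})

# format='table' is slower to write but supports where-clause queries on read
df.to_hdf('my_data.h5', key='data', mode='w', format='table', data_columns=True)

# Load only the rows matching the condition, not the whole file
subset = pd.read_hdf('my_data.h5', key='data', where='Age > 24')
print(subset)
```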

In essence, pandas' support for binary data formats is like having a superpower for handling large
datasets. Whether you're saving your work for later or loading massive datasets on-the-fly, these binary
formats provide efficiency and speed, making pandas a reliable companion in the world of data science.

Interacting with web APIs:


In the basics of data science, there's a powerful way to tap into the wealth of information on the web –
interacting with web APIs using Python and pandas. It's like having a conversation with a digital librarian
who fetches the exact data you need.

1. What is a Web API?

Firstly, let's break down the jargon. A Web API (Application Programming Interface) is essentially
a set of rules that allows one piece of software to interact with another. In our case, we're
interested in web APIs, which are like data gatekeepers on the internet. Websites often expose
certain data through APIs, granting you access to specific information in a structured format.

2. Using the requests Library:

To start our conversation with a web API, we need a way to send requests and receive
responses. Python's requests library is like our messenger, handling the communication.

import requests

# Making a simple GET request to an API
response = requests.get('https://api.example.com/data')

# Checking the status of our request
if response.status_code == 200:
    # The data is in JSON format, let's print it
    print(response.json())
else:
    print(f"Request failed with status code {response.status_code}")

This is a basic example of making a GET request to an API. If all goes well, the data is usually
returned in JSON format, a common language for data exchange on the web.

3. Using Pandas to Handle API Data:

Now, let's bring in our data wrangler, pandas. Once we've fetched data from a web API, pandas
helps us organize it into a DataFrame – a structured table that's easy to work with.

import requests
import pandas as pd

# Making a GET request to an API
response = requests.get('https://api.example.com/data')

# Checking the status of our request
if response.status_code == 200:
    # Converting JSON data to a DataFrame
    df = pd.DataFrame(response.json())
    # Let's print our DataFrame
    print(df)
else:
    print(f"Request failed with status code {response.status_code}")

With pandas, our API data is now neatly arranged in a DataFrame, ready for analysis or further
manipulation.

4. Handling Parameters in API Requests:

Sometimes, APIs require additional information to fetch the exact data you need. We can
include parameters in our requests.

import requests
import pandas as pd

# Making a GET request with parameters
params = {'user_id': 123, 'start_date': '2022-01-01'}
response = requests.get('https://api.example.com/user_data', params=params)

# Checking the status of our request
if response.status_code == 200:
    # Converting JSON data to a DataFrame
    df = pd.DataFrame(response.json())
    # Let's print our DataFrame
    print(df)
else:
    print(f"Request failed with status code {response.status_code}")

By incorporating parameters, we tailor our request to fetch user-specific data or data within a
certain timeframe.

In essence, interacting with web APIs using Python and pandas is akin to having a conversation
with the vast knowledge stored on the internet. It's a dynamic process where Python's requests
library acts as our messenger, fetching data from APIs, and pandas organizes that data into a
structured format for seamless analysis. This interaction opens doors to a world of real-time
information and possibilities in the realm of data science.

Interacting with databases:


Interacting with databases in Python is a common task, and the process can vary depending on
the type of database you're working with. Here we interact with relational databases using the
sqlite3 module (for SQLite databases) and the SQLAlchemy library (for various relational
databases).

Using sqlite3 for SQLite Databases:

1. Import the sqlite3 module:


import sqlite3

2. Connect to the Database:


# Connect to an SQLite database (creates a new database if it doesn't exist)
connection = sqlite3.connect('example.db')

3. Create a Cursor:

# Create a cursor object to execute SQL queries
cursor = connection.cursor()

4. Execute SQL Queries:

# Create a table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY,
        username TEXT,
        email TEXT
    )
''')

# Insert data into the table
cursor.execute("INSERT INTO users (username, email) VALUES (?, ?)",
               ('john_doe', 'john@example.com'))

# Commit the changes
connection.commit()

5. Fetch Data:

# Fetch data from the table
cursor.execute("SELECT * FROM users")
rows = cursor.fetchall()
for row in rows:
    print(row)

6. Close the Connection:

# Close the connection when done
connection.close()

Using SQLAlchemy for Relational Databases:

import pandas as pd
import sqlalchemy as sqla

# Connect to an SQLite database file via SQLAlchemy
db = sqla.create_engine("sqlite:///mydata.sqlite")

# Run a query and get the result back as a DataFrame
pd.read_sql("SELECT * FROM test", db)
Using SQLAlchemy provides a more abstract and database-agnostic approach, making it easier to switch
between different databases. The examples provided here are for SQLite databases, but you can adapt
them for other relational databases by changing the database URL in the create_engine() function.
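The same engine also supports writing: `DataFrame.to_sql()` pushes a frame into a table, and `read_sql()` pulls it back. A round trip looks roughly like this (using an in-memory SQLite database purely as a stand-in):

```python
import pandas as pd
import sqlalchemy as sqla

# In-memory SQLite engine for illustration; swap the URL for a real database
engine = sqla.create_engine('sqlite://')

df = pd.DataFrame({'username': ['john_doe', 'jane_doe'],
                   'email': ['john@example.com', 'jane@example.com']})

# Write the DataFrame to a table, replacing it if it already exists
df.to_sql('users', engine, index=False, if_exists='replace')

# Read it back with an ordinary SQL query
round_trip = pd.read_sql('SELECT * FROM users', engine)
print(round_trip)
```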
