Pandas basics

NumPy and pandas are both popular libraries in Python for data manipulation and analysis,

but they serve different purposes and have distinct features. Here’s a comparison of the two:

NumPy

Purpose:

• NumPy (Numerical Python) is primarily used for numerical computations. It provides
support for arrays, matrices, and a wide range of mathematical functions to operate on
these data structures.

Typical Use Cases:

• Numerical computations and mathematical operations.
• Scientific computing and engineering problems.
• Operations on arrays and matrices.

pandas

Purpose:

• pandas is designed for data manipulation and analysis. It provides data structures and
functions needed to work with structured data, such as tables and time series.
• NOTE: A time series is a sequence of data points typically measured at successive points in
time, usually at equally spaced intervals. Time series data is often used in various fields such
as finance, economics, weather forecasting, signal processing, and many others to analyze
patterns, trends, and cycles.

Typical Use Cases:

• Data manipulation and analysis.
• Working with structured data in tables or time series.
• Data cleaning and preparation for analysis.

Pandas is a powerful and widely-used data manipulation and analysis library in Python. It
provides data structures and functions needed to work with structured data seamlessly. Here's
an introduction to the core components of pandas:

1. Key Data Structures

• Series: A one-dimensional labeled array capable of holding any data type. It is similar
to a column in a spreadsheet.
• DataFrame: A two-dimensional labeled data structure with columns that can be of
different types, much like a table in a database or a data frame in R.
import pandas as pd

# Creating a Series

data = [1, 2, 3, 4, 5]

series = pd.Series(data)

# Creating a DataFrame

data = { 'Name': ['John', 'Anna', 'Peter', 'Linda'], 'Age': [28, 24, 35, 32] }

df = pd.DataFrame(data)

Series:

0    1
1    2
2    3
3    4
4    5
dtype: int64

DataFrame:

    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

The Series is a one-dimensional array with integer values and default index labels. The DataFrame
is a two-dimensional table with columns for 'Name' and 'Age', and it also has default integer index
labels.

What is a pandas Series?

A pandas Series is a one-dimensional labeled array capable of holding any data type
(integers, strings, floats, etc.). It is similar to a column in a table or a single column in an
Excel sheet.

Breaking down series = pd.Series(data)

1. pd: This is the common alias for the pandas library. Before you can use it, you need to
import pandas with import pandas as pd.
2. Series: This is a constructor for creating a Series object in pandas.
3. data: This is the input data that you want to convert into a Series. In this case, data is
a list [1, 2, 3, 4, 5].
Creating the Series

When you call pd.Series(data), pandas does the following:

1. Takes the input data: Here, it is a list [1, 2, 3, 4, 5].


2. Creates an index: By default, pandas will create an integer index starting from 0. So,
the indices for this Series will be [0, 1, 2, 3, 4].
3. Pairs the data with the index: Each element in the list is paired with an index value.

Customizing the Series

# Creating a Series
data = [1, 2, 3, 4, 5]

custom_index = ['a', 'b', 'c', 'd', 'e']

series = pd.Series(data, index=custom_index)

print(series)

Output with Custom Index:

a    1
b    2
c    3
d    4
e    5
dtype: int64

Creating a pandas Series from a dictionary

To create a Series from a dictionary, the keys of the dictionary become the indices of the Series, and
the values of the dictionary become the values of the Series.

import pandas as pd

# Creating a dictionary

data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

# Creating a Series from the dictionary

series = pd.Series(data)

print(series)

Explanation

1. Creating the dictionary: The dictionary data has keys 'a', 'b', 'c', 'd', 'e'
and corresponding values 1, 2, 3, 4, 5.
2. Creating the Series: When you pass this dictionary to pd.Series, pandas creates a
Series where the dictionary keys become the index, and the dictionary values become
the data of the Series.
Output

a    1
b    2
c    3
d    4
e    5
dtype: int64

Explanation of the Output

• Indices (a, b, c, d, e): These are the keys from the dictionary.
• Values (1, 2, 3, 4, 5): These are the values from the dictionary.
• dtype: int64: This indicates the data type of the values in the Series. In this case, it is
int64, which means 64-bit integers.

This approach is useful when you have labeled data that you want to convert into a pandas
Series for further manipulation or analysis.
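Continuing that example, the dictionary keys can then be used for label-based lookups; a small sketch:

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
series = pd.Series(data)

# Single-label access returns the value stored under that key
print(series['c'])            # 3

# A list of labels returns a smaller Series
subset = series[['a', 'e']]
print(subset)
```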

Creating a Series from scalar values

Creating a pandas Series from scalar values involves specifying the scalar value and an index. This
will generate a Series where each index label is associated with the same scalar value.

import pandas as pd

# Scalar value

scalar_value = 10

# Specifying the index

index = ['a', 'b', 'c', 'd', 'e']

# Creating the Series from the scalar value

series = pd.Series(scalar_value, index=index)

print(series)

output

a    10
b    10
c    10
d    10
e    10
dtype: int64
Explanation of the Output

• Indices (a, b, c, d, e): These are the index labels specified in the index list.
• Values (10): Each index label is associated with the scalar value 10.
• dtype: int64: This indicates the data type of the values in the Series. In this case, it is
int64, which means 64-bit integers.

This approach is useful when you want to create a Series with a constant value for each
index, such as initializing a Series or filling a Series with a specific value.

Creating a Series using a specified index

Creating a pandas Series using a specified index involves associating values with specific indices.

import pandas as pd

# List of values

values = [10, 20, 30, 40, 50]

# Specifying the index

index = ['a', 'b', 'c', 'd', 'e']

# Creating the Series with specified index

series = pd.Series(values, index=index)

print(series)

output:
a 10
b 20
c 30
d 40
e 50
dtype: int64

Creating an empty Series with a specified index

You can create an empty Series by passing only a specified index. This is useful for initializing a
Series to be filled later.

import pandas as pd

# Specifying the index

index = ['a', 'b', 'c', 'd', 'e']

# Creating an empty Series with the specified index
series = pd.Series(index=index, dtype='float64')

print(series)
output:
a NaN
b NaN
c NaN
d NaN
e NaN
dtype: float64

Explanation

• NaN: Stands for "Not a Number" and is used to denote missing or undefined values in
pandas.
• dtype: float64: Older pandas versions defaulted to float64 for an empty Series; newer
versions default to object, so it is safest to pass dtype='float64' explicitly.

By specifying the index, you can create Series that are tailored to your data structure needs,
whether you are initializing with specific values, using a scalar value, or creating an empty
Series for later use.
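As a quick illustration of filling an empty Series later (passing dtype explicitly so the result is float64 on all pandas versions):

```python
import pandas as pd

index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(index=index, dtype='float64')

# Assign by label; labels that are never assigned stay NaN
series['a'] = 1.5
series['d'] = 4.0

print(series)
print("Missing values:", series.isna().sum())
```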

Knowing the size, dimensions, shape, and index of a Series

import pandas as pd

# Creating a Series with specified index

data = [10, 20, 30, 40, 50]

index = ['a', 'b', 'c', 'd', 'e']

series = pd.Series(data, index=index)

# Printing the Series

print(series)

# Getting the size of the Series

print("\nSize of Series:", series.size)

# Getting the dimensions of the Series

print("Dimensions of Series:", series.ndim)

# Getting the shape of the Series

print("Shape of Series:", series.shape)

# Getting the index of the Series

print("Index of Series:", series.index)

output:
Series:
a 10
b 20
c 30
d 40
e 50
dtype: int64

Size of Series: 5
Dimensions of Series: 1
Shape of Series: (5,)
Index of Series: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Explanation

1. Size of the Series:


o series.size: Returns the number of elements in the Series.
o Output: 5
2. Dimensions of the Series:
o series.ndim: Returns the number of dimensions of the Series. For a Series, it
is always 1.
o Output: 1
3. Shape of the Series:
o series.shape: Returns a tuple representing the dimensionality of the Series.
For a one-dimensional Series, it returns a tuple with one element, which is the
number of elements in the Series.
o Output: (5,)
4. Index of the Series:
o series.index: Returns the index (labels) of the Series.
o Output: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Importance of dtype

• Efficiency: Knowing the dtype helps Pandas optimize storage and computation.

dtype='object' indicates that the index labels of the Series are stored as generic Python objects,
which is appropriate for string labels. If the index labels were numeric, the dtype would be int64 or
float64 instead.

These attributes provide essential information about the Series and help you understand its
structure and how to manipulate it.

Note: A tuple is an immutable, ordered collection of elements in Python. Tuples are similar to
lists but have some key differences:

Key Characteristics of Tuples:

1. Ordered: The elements in a tuple have a defined order, and this order will not
change.
2. Immutable: Once a tuple is created, you cannot modify, add, or remove elements
from it. This immutability makes tuples a good choice for read-only collections of
data.
3. Heterogeneous: Tuples can contain elements of different types, including other
tuples, lists, dictionaries, and more.
4. Indexable: You can access elements in a tuple using their index, starting from 0 for
the first element.

The comma in the shape (5,) signifies that the shape is a tuple with one element. This is
required by Python's syntax to differentiate single-element tuples from regular
parenthesized expressions.
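A short interactive-style check of these points (plain Python, no pandas needed):

```python
# Single-element tuple: the trailing comma is what makes it a tuple
t1 = (5,)
print(type(t1))      # <class 'tuple'>

# Without the comma, the parentheses are just grouping
n = (5)
print(type(n))       # <class 'int'>

# Tuples are ordered, indexable, and can hold mixed types
t2 = ('a', 1, [2, 3])
print(t2[0])         # a
```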

Creating an empty DataFrame

Creating an empty DataFrame can be useful in various scenarios when working with data in
pandas. Here are some common reasons and scenarios where you might need to create an
empty DataFrame:

#Empty DataFrame with no columns and no rows:


import pandas as pd
# Creating an empty DataFrame
df = pd.DataFrame()
print(df)

output:

Empty DataFrame
Columns: []
Index: []

1. Incremental Data Loading:


o You may start with an empty DataFrame and then incrementally add rows of
data to it as they are processed or received from another source.
2. Data Aggregation:
o When aggregating data from multiple sources or files, you can start with an
empty DataFrame and append the data as it is read from each source.
3. Dynamic DataFrame Creation:
o In cases where the structure of the DataFrame (i.e., columns) is known in
advance but the data is generated or fetched dynamically, you can initialize an
empty DataFrame and populate it later.
4. Initializing Data Structures:
o To create placeholders for future data manipulation, ensuring you have the
correct structure before data insertion.
5. Placeholders in Functions or Classes:
o When writing functions or classes that return DataFrames, you can start with
an empty DataFrame and populate it based on certain conditions or input
parameters.
6. Ensuring Data Consistency:
o When you need to ensure a certain structure (specific columns with specific
data types) from the start, even before data is available.
7. Error Handling:
o To avoid errors in code execution, initializing an empty DataFrame can help
manage cases where no data is available, ensuring that the code can handle
such scenarios gracefully.
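For scenario 1 (incremental loading), a minimal sketch might look like the following. The batches variable is hypothetical; note that DataFrame.append was removed in pandas 2.0, so pd.concat is the usual way to add rows:

```python
import pandas as pd

# Start with an empty DataFrame
df = pd.DataFrame()

# Hypothetical batches arriving from some source
batches = [
    pd.DataFrame({'Name': ['Alice'], 'Age': [25]}),
    pd.DataFrame({'Name': ['Bob'], 'Age': [30]}),
]

for batch in batches:
    # pd.concat replaces the removed DataFrame.append (pandas >= 2.0)
    df = pd.concat([df, batch], ignore_index=True)

print(df)
```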
Creating an empty DataFrame with specific columns

Creating an empty DataFrame with specific columns is a way to initialize a DataFrame with
predefined column names but without any data. This can be useful when you know the
structure of your DataFrame in advance but will populate it with data later.

# Creating an empty DataFrame with specific columns


df = pd.DataFrame(columns=['Column1', 'Column2', 'Column3'])
print(df)
output:
Empty DataFrame
Columns: [Column1, Column2, Column3]
Index: []

# Function to extract and print one column

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Function to extract and print one column
def extract_column(dataframe, column_name):
    if column_name in dataframe.columns:
        column_data = dataframe[column_name]
        print(column_data)
    else:
        print(f"Column '{column_name}' not found in DataFrame.")

# Example usage
extract_column(df, 'Age')

Explanation
Creating a DataFrame:

df = pd.DataFrame(data)

Here, data is a predefined variable containing the data to convert into a DataFrame. A DataFrame
is a two-dimensional labeled data structure with columns of potentially different types.

Defining the function:

def extract_column(dataframe, column_name):

This function, extract_column, is defined to take two arguments: a DataFrame (dataframe) and
a column name (column_name). The purpose of the function is to check if the specified column
exists in the DataFrame and print its data. If the column does not exist, it prints a message indicating
that the column was not found.

if column_name in dataframe.columns:

This line checks if the provided column_name exists in the DataFrame's columns. If it does, the
function proceeds to the next step; otherwise, it goes to the else block.
Extracting and printing the column:

column_data = dataframe[column_name]
print(column_data)

If the column exists, the function extracts the column data from the DataFrame and prints it.

Handling a non-existent column:

else:
print(f"Column '{column_name}' not found in DataFrame.")

If the column does not exist in the DataFrame, this block prints a message indicating that the column
was not found.

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Function to extract and return one row
def extract_row(dataframe, row):
    if row in dataframe.index:
        return dataframe.iloc[row]
    else:
        print(f"Row index '{row}' not found in DataFrame.")
        return None

# Example usage
row_2 = extract_row(df, 2)
print("Extracted row:")
print(row_2)

return dataframe.iloc[row]

If the row index exists, the function extracts the row data from the DataFrame using iloc,
which is an integer-location based indexing for selection by position, and returns the row
data.

The f in the string print(f"Row index '{row}' not found in DataFrame.") represents
an f-string, which is a feature introduced in Python 3.6. An f-string (formatted string literal)
allows you to embed expressions inside string literals, using curly braces {}.

Working

• f-string: The f before the opening quote marks indicates that the string is an f-string.
• Curly braces {}: Any expression inside the curly braces {} will be evaluated and its
result will be inserted into the string at that position.
row = 5
print(f"Row index '{row}' not found in DataFrame.")

• f"Row index '{row}' not found in DataFrame.": This is an f-string.


• Inside the curly braces {row}, the value of the variable row will be evaluated and
converted to a string.
• If row is 5, the output of the print statement will be:

Row index '5' not found in DataFrame.

Benefits

• Readability: f-strings make it easier to format strings and read the code.
• Performance: f-strings are generally faster than other methods of string formatting,
like using % or str.format().

Adding a new row using loc:

# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Adding a new row using loc


df.loc[len(df)] = ['David', 40, 'San Francisco']
print("DataFrame after adding a new row using loc:")
print(df)

df.loc[len(df)] = ['David', 40, 'San Francisco']

• len(df): This expression returns the number of rows in the DataFrame. If the
DataFrame initially has 3 rows, len(df) will return 3.
• df.loc[len(df)]: The loc accessor is used to access a group of rows and columns
by labels or a boolean array. In this case, it is used to add a new row at the index
position returned by len(df).
• ['David', 40, 'San Francisco']: This is the data for the new row being added to
the DataFrame. The new row will have 'David' as the Name, 40 as the Age, and 'San
Francisco' as the City.

Output

DataFrame after adding a new row using loc:


Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 San Francisco
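Because loc indexes by label, the label passed does not have to be len(df); any label not already in the index creates a new row. A small sketch (the label 10 is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice'], 'Age': [25], 'City': ['New York']})

# A label that does not exist yet creates a new row with that label
df.loc[10] = ['Eve', 29, 'Boston']

print(df)
```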
Data storage in pandas

In pandas, data storage refers to how data is organized, manipulated, and stored within the
DataFrame and Series objects. Understanding data storage in pandas is essential for efficient
data manipulation and analysis. Here's an overview of key aspects of data storage in pandas:

Data Types

Pandas uses NumPy for its underlying data storage, which means it leverages NumPy's
efficient array-based storage. Common data types in pandas include:

• int64
• float64
• bool
• datetime64[ns]
• object (for string or mixed types)

Example:

print(df.dtypes)

Output:

Name object
Age int64
City object
dtype: object

Indexing

Indexes provide fast lookups and are essential for aligning data:

• Default Index: Integer-based starting from 0.


• Custom Index: User-defined labels for rows.

Example:

df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)

Output:

      Name  Age         City
a    Alice   25     New York
b      Bob   30  Los Angeles
c  Charlie   35      Chicago

Storage Formats
In-Memory Storage

Data is typically stored in memory in the form of DataFrame and Series objects, allowing for
fast data manipulation and analysis.

File-Based Storage

Pandas supports various file formats for data storage, including:

• CSV
• Excel
• HDF5
• Parquet
• SQL databases
• JSON
• and more

Reading and Writing CSV:

# Writing to a CSV file


df.to_csv('data.csv', index=False)

EXPLANATION:
This line writes the contents of a pandas DataFrame to a CSV (Comma Separated
Values) file.

• df: This is the pandas DataFrame that you want to write to a CSV file. In your context, df
contains data with columns such as 'Name', 'Age', and 'City'.

• .to_csv(): This is a pandas DataFrame method that exports the DataFrame to a CSV file.
The to_csv method has several optional parameters that allow you to control the output
format, but here we're using two of them: the file path and index.

• 'data.csv': This is the path to the file where the DataFrame will be written. If the file
does not exist, it will be created. If it does exist, it will be overwritten.

• index=False: This parameter specifies whether to write row indices to the CSV file. By
default, index=True, meaning the row indices are included in the CSV file. Setting
index=False excludes the row indices from the output file.

# Reading from a CSV file


df = pd.read_csv('data.csv')
print(df)

DELIMITER

A delimiter in a CSV (Comma Separated Values) file is a character that separates individual
data values within each row of the file. The delimiter tells the software reading the CSV file
how to split the data into individual fields or columns.

Common Delimiters
1. Comma (,): This is the most common delimiter and is the default in CSV files. Each
value is separated by a comma. Example:

Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago

2. Semicolon (;): Sometimes used instead of a comma, especially in regions where the
comma is used as a decimal separator. Example:

Name;Age;City
Alice;25;New York
Bob;30;Los Angeles
Charlie;35;Chicago

3. Tab (\t): Used in TSV (Tab Separated Values) files. Example:

Name    Age     City
Alice   25      New York
Bob     30      Los Angeles
Charlie 35      Chicago

4. Pipe (|): Occasionally used to avoid conflicts with data that may contain commas or
semicolons. Example:

Name|Age|City
Alice|25|New York
Bob|30|Los Angeles
Charlie|35|Chicago

Specifying Delimiters in pandas

When reading or writing CSV files with pandas, you can specify the delimiter using the sep
parameter in the read_csv and to_csv methods.

Reading a CSV File with a Custom Delimiter

Example of reading a CSV file with a semicolon delimiter:

import pandas as pd

df = pd.read_csv('data.csv', sep=';')
Writing a CSV File with a Custom Delimiter

Example of writing a DataFrame to a CSV file with a semicolon delimiter:

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
df.to_csv('data.csv', sep=';', index=False)

Why Use Different Delimiters?

• Regional Preferences: In some countries, the comma is used as a decimal separator. To


avoid confusion, semicolons or other delimiters might be used in CSV files.
• Data Content: If the data itself contains commas, using a different delimiter like a tab or
pipe can help avoid conflicts.
• Compatibility: Some software might require a specific delimiter to correctly parse the CSV
file.

Summary

• Delimiter: A character that separates individual values in a CSV file.


• Common Delimiters: Comma, semicolon, tab, and pipe.
• Custom Delimiters in pandas: Use the sep parameter in read_csv and to_csv methods to
specify a delimiter other than the default comma.
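A self-contained way to see sep in action without creating files is to read from an in-memory string via io.StringIO (the pipe-delimited text below is just illustrative):

```python
import io
import pandas as pd

raw = "Name|Age|City\nAlice|25|New York\nBob|30|Los Angeles\n"

# read_csv accepts any file-like object, so StringIO works like a file
df = pd.read_csv(io.StringIO(raw), sep='|')
print(df)
```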

SLICING ROWS FROM A DATAFRAME

In Pandas, slicing rows from a DataFrame can be done in various ways depending on your
requirements.

Using iloc

iloc is used for integer-location based indexing for selection by position.

import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
print(df)

# Slicing rows from index 1 to 3 (excluding 3)
sliced_df = df.iloc[1:3]
print(sliced_df)

import pandas as pd

# Sample DataFrame
data = {
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
print(df)

# Slicing every other row


sliced_df = df.iloc[::2]
print(sliced_df)

output
A B C
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
4 5 50 500

A B C
0 1 10 100
2 3 30 300
4 5 50 500

Explanation of df.iloc[::2]

The expression df.iloc[::2] is using the iloc indexer to slice the DataFrame. Here's a
breakdown of the syntax and how it works:

• iloc: This is a Pandas method used for integer-location based indexing. It allows you
to select rows and columns by their integer positions.
• ::2: This is Python's slicing syntax. In general, slicing has the form
start:stop:step, where:
o start is the index to start the slice (inclusive).
o stop is the index to end the slice (exclusive).
o step is the step size or stride between each index in the slice.

When you use ::2, it means:


• start is omitted, so it starts from the beginning of the DataFrame.
• stop is omitted, so it goes until the end of the DataFrame.
• step is 2, so it takes every second row.

This effectively selects every other row from the DataFrame: only rows with indices 0, 2, and 4
are included in the sliced result, while rows with indices 1 and 3 are skipped.

This kind of slicing is useful when you want to downsample your data by taking every nth
row, in this case every 2nd row.

import pandas as pd

# Simulating a DataFrame for sensor data in a mechatronics system


data = {
'Timestamp': pd.date_range(start='2023-01-01', periods=10, freq='min'),
'Temperature': [20, 21, 19, 22, 20, 21, 19, 23, 20, 22],
'Vibration': [0.01, 0.02, 0.03, 0.02, 0.01, 0.04, 0.02, 0.03, 0.01, 0.02],
'Position': [5, 6, 5, 7, 5, 6, 5, 8, 5, 7]
}
df = pd.DataFrame(data)
print(df)
Creating the Data Dictionary:

This dictionary data contains the following keys and corresponding lists:

• Timestamp: A range of timestamps starting from '2023-01-01' with 10 periods at a
frequency of one minute ('min'; older pandas versions used the alias 'T').
• Temperature: A list of temperature readings.
• Vibration: A list of vibration readings.
• Position: A list of position readings.

Creating the DataFrame:


df = pd.DataFrame(data)
This line creates a Pandas DataFrame df using the dictionary data. The DataFrame will have
columns 'Timestamp', 'Temperature', 'Vibration', and 'Position', and each list in the dictionary
becomes a column in the DataFrame.

Output Explanation
Timestamp Temperature Vibration Position
0 2023-01-01 00:00:00 20 0.01 5
1 2023-01-01 00:01:00 21 0.02 6
2 2023-01-01 00:02:00 19 0.03 5
3 2023-01-01 00:03:00 22 0.02 7
4 2023-01-01 00:04:00 20 0.01 5
5 2023-01-01 00:05:00 21 0.04 6
6 2023-01-01 00:06:00 19 0.02 5
7 2023-01-01 00:07:00 23 0.03 8
8 2023-01-01 00:08:00 20 0.01 5
9 2023-01-01 00:09:00 22 0.02 7

The DataFrame contains 10 rows, corresponding to the 10 periods specified in the
pd.date_range function, each with associated readings for temperature, vibration, and
position.

This DataFrame could be used for further analysis, such as plotting the sensor data, detecting
anomalies, or performing statistical analysis on the readings.
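As a sketch of such analysis (the 0.03 vibration threshold is an arbitrary example value):

```python
import pandas as pd

data = {
    'Timestamp': pd.date_range(start='2023-01-01', periods=10, freq='min'),
    'Temperature': [20, 21, 19, 22, 20, 21, 19, 23, 20, 22],
    'Vibration': [0.01, 0.02, 0.03, 0.02, 0.01, 0.04, 0.02, 0.03, 0.01, 0.02],
    'Position': [5, 6, 5, 7, 5, 6, 5, 8, 5, 7]
}
df = pd.DataFrame(data)

# Average temperature over the ten-minute window
avg_temp = df['Temperature'].mean()
print("Average temperature:", avg_temp)

# Readings above an arbitrary vibration threshold (possible anomalies)
spikes = df[df['Vibration'] > 0.03]
print(spikes)
```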
Loading data into a pandas DataFrame:

Loading data into a pandas DataFrame is a fundamental step in data analysis and
manipulation. Pandas provides a variety of functions to read data from different file formats
and data sources. Here’s a guide on how to load data into pandas DataFrames from common
sources:

1. Reading CSV Files

Basic CSV Reading


import pandas as pd

# Reading a CSV file


df = pd.read_csv('data.csv')
print(df)
Custom Delimiter
# Reading a CSV file with a custom delimiter (e.g., semicolon)
df = pd.read_csv('data.csv', sep=';')
print(df)
Specifying Column Data Types
# Reading a CSV file and specifying data types for columns
df = pd.read_csv('data.csv', dtype={'Age': int, 'Name': str})
print(df)


Benefits of Specifying Data Types

• Accuracy: Ensures columns are interpreted correctly (e.g., 'Age' as integers, not
strings).
• Performance: Helps pandas optimize memory usage and improve performance by
using appropriate data types.
• Error Prevention: Prevents potential issues with data type mismatches during data
processing and analysis.

2. Reading Excel Files

Reading from Excel


# Reading an Excel file
df = pd.read_excel('data.xlsx')
print(df)
Specifying Sheet Name
# Reading a specific sheet from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df)
3. Reading from SQL Databases

Using SQLAlchemy
from sqlalchemy import create_engine
import pandas as pd

# Creating an engine to connect to the database


engine = create_engine('sqlite:///my_database.db')

# Reading data from a SQL table


df = pd.read_sql('SELECT * FROM my_table', engine)
print(df)

4. Reading JSON Files

Reading JSON
import pandas as pd

# Reading a JSON file


df = pd.read_json('data.json')
print(df)

5. Reading from Other Formats

Reading HTML Tables


# Reading tables from an HTML file
dfs = pd.read_html('https://example.com/table.html')
print(dfs[0]) # Print the first table found on the page
Reading from HDF5
# Reading from an HDF5 file
df = pd.read_hdf('data.h5', 'dataset_name')
print(df)

6. Loading Data from Python Dictionaries and Lists

From Dictionary
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
From List of Dictionaries
# Creating a DataFrame from a list of dictionaries
data = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}]
df = pd.DataFrame(data)
print(df)

7. Handling Missing Values

Specifying Missing Values


# Reading a CSV file and specifying missing values
df = pd.read_csv('data.csv', na_values=['NA', 'N/A', ''])
print(df)

8. Specifying Index Column


Setting an Index Column
# Reading a CSV file and setting a column as the index
df = pd.read_csv('data.csv', index_col='ID')
print(df)
