NumPy and Pandas for Data Analysis AI ML Training
NumPy Tutorial
Introduction
NumPy (Numerical Python) is a library for the Python programming language, adding support
for large, multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.
Installation
To install NumPy, use the following command:
pip install numpy
Basic Operations
Importing NumPy
import numpy as np
Creating Arrays
# Create a 1D array
array_1d = np.array([1, 2, 3, 4, 5])
print(array_1d)
# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d)
# Create an array with zeros
zeros_array = np.zeros((3, 4))
print(zeros_array)
# Create an array with ones
ones_array = np.ones((2, 3))
print(ones_array)
# Create an identity matrix
identity_matrix = np.eye(3)
print(identity_matrix)
# Create an array with a range of values
range_array = np.arange(10, 20, 2)
print(range_array)
# Create an array with evenly spaced values
linspace_array = np.linspace(0, 1, 5)
print(linspace_array)
Array Operations
# Arithmetic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
LinkedIn: www.linkedin.com/in/nidhi-grover-raheja-904211138
print(a + b) # Addition
print(a - b) # Subtraction
print(a * b) # Element-wise multiplication
print(a / b) # Element-wise division
# Matrix multiplication
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print(np.dot(matrix_a, matrix_b))
# Broadcasting
array_broadcast = np.array([1, 2, 3])
print(array_broadcast + 1) # Adds 1 to each element
# Statistical operations
print(np.mean(a)) # Mean
print(np.median(a)) # Median
print(np.std(a)) # Standard deviation (population, ddof=0 by default)
print(np.sum(a)) # Sum
print(np.min(a)) # Minimum
print(np.max(a)) # Maximum
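The broadcasting and matrix multiplication examples above extend naturally to 2D arrays. The following is a small sketch with made-up values; the names `matrix`, `row`, and `column` are illustrative only.

```python
import numpy as np

# A 2x3 matrix and a length-3 row vector (illustrative values)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])

# The row vector is broadcast across both rows of the matrix
broadcast_sum = matrix + row
print(broadcast_sum)

# A column of shape (2, 1) is broadcast across the three columns
column = np.array([[100], [200]])
column_sum = matrix + column
print(column_sum)

# For 2D arrays, the @ operator gives the same result as np.dot
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
product = matrix_a @ matrix_b
print(product)
```

Broadcasting only works when the trailing dimensions match or one of them is 1; otherwise NumPy raises a ValueError.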
Indexing and Slicing
array = np.array([1, 2, 3, 4, 5, 6])
# Indexing
print(array[0]) # First element
print(array[-1]) # Last element
# Slicing
print(array[1:4]) # Elements from index 1 to 3
print(array[:3]) # First three elements
print(array[3:]) # Elements from index 3 to end
print(array[::2]) # Every second element
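The same indexing and slicing ideas carry over to 2D arrays, where a comma separates the row and column selections. This is a brief sketch; the array `grid` is an illustrative example, not part of the original material.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Single element: row 1, column 2
element = grid[1, 2]
print(element)

# Entire first column
first_column = grid[:, 0]
print(first_column)

# Top-left 2x2 block
top_left = grid[:2, :2]
print(top_left)

# Boolean mask: keep only the even values (result is a 1D array)
evens = grid[grid % 2 == 0]
print(evens)
```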
Reshaping Arrays
array = np.arange(1, 10)
reshaped_array = array.reshape((3, 3))
print(reshaped_array)
# Flattening arrays
flattened_array = reshaped_array.flatten()
print(flattened_array)
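Two details about reshaping are worth a short sketch: passing -1 lets NumPy infer one dimension, and flatten() always returns a copy, whereas ravel() returns a view when the memory layout allows it. The variable names below are illustrative.

```python
import numpy as np

array = np.arange(1, 13)

# -1 asks NumPy to infer the missing dimension (12 elements / 3 rows = 4 columns)
reshaped = array.reshape((3, -1))
print(reshaped.shape)

# flatten() returns a copy: changing it leaves the original untouched
flat_copy = reshaped.flatten()
flat_copy[0] = 99
print(reshaped[0, 0])

# ravel() returns a view here (the data is contiguous),
# so changing it also changes the original array
flat_view = reshaped.ravel()
flat_view[0] = -1
print(reshaped[0, 0])
```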
Pandas Tutorial
Introduction
Pandas is a library providing high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.
Installation
To install Pandas, use the following command:
pip install pandas
Basic Operations
Importing Pandas
import pandas as pd
Creating DataFrames
# Create a DataFrame from a dictionary
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)
# Create a DataFrame from a CSV file
df_from_csv = pd.read_csv('path_to_csv_file.csv')
print(df_from_csv)
Viewing Data
# Display the first few rows
print(df.head())
# Display the last few rows
print(df.tail())
# Display the data types of columns
print(df.dtypes)
# Display the shape of the DataFrame
print(df.shape)
# Display summary statistics
print(df.describe())
Selecting Data
# Select a single column
print(df['Name'])
# Select multiple columns
print(df[['Name', 'City']])
# Select rows by index
print(df.iloc[0]) # First row
print(df.iloc[0:2]) # First two rows
# Select rows by label
print(df.loc[0]) # First row
print(df.loc[0:2]) # First three rows (inclusive)
# Conditional selection
print(df[df['Age'] > 30])
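Conditional selection can combine several conditions. The sketch below rebuilds the same example DataFrame so that it runs on its own; note that each condition needs its own parentheses and that `&` / `|` are used instead of `and` / `or`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
})

# Combine two conditions with & (and)
over_30_not_berlin = df[(df['Age'] > 30) & (df['City'] != 'Berlin')]
print(over_30_not_berlin)

# Filter rows and pick a column in a single .loc call
names_under_30 = df.loc[df['Age'] < 30, 'Name']
print(names_under_30)

# Match against a list of allowed values
in_cities = df[df['City'].isin(['Paris', 'London'])]
print(in_cities)
```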
Adding and Dropping Columns
# Add a new column
df['Country'] = ['USA', 'France', 'Germany', 'UK']
print(df)
# Drop a column
df = df.drop('Country', axis=1)
print(df)
Handling Missing Data
# Create a DataFrame with missing values
data_with_nan = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, None, 35, 32],
'City': ['New York', 'Paris', None, 'London']
}
df_nan = pd.DataFrame(data_with_nan)
print(df_nan)
# Drop rows with missing values
df_dropped_nan = df_nan.dropna()
print(df_dropped_nan)
# Fill missing values
df_filled_nan = df_nan.fillna({'Age': df_nan['Age'].mean(), 'City': 'Unknown'})
print(df_filled_nan)
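Before dropping or filling values, it often helps to see where the gaps are. A minimal sketch using the same example data follows; the variable names are illustrative.

```python
import pandas as pd

df_nan = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London']
})

# Count missing values in each column
missing_per_column = df_nan.isna().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_missing = df_nan[df_nan.isna().any(axis=1)]
print(rows_with_missing)

# Drop rows only when a specific column is missing
df_age_required = df_nan.dropna(subset=['Age'])
print(df_age_required)
```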
Grouping and Aggregating Data
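# Group by a column and calculate the mean of the numeric columns
# (numeric_only=True skips text columns such as 'Name', which raise an error in pandas 2.0+)
print(df.groupby('City').mean(numeric_only=True))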
# Group by multiple columns and calculate sum
print(df.groupby(['City', 'Name']).sum())
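In the example above every city appears only once, so each group holds a single row. The sketch below uses a hypothetical DataFrame `people` with repeated cities and named aggregations to show several statistics per group.

```python
import pandas as pd

# Hypothetical data with repeated cities so that groups contain several rows
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Marie'],
    'Age': [28, 24, 35, 32, 40],
    'City': ['Paris', 'Paris', 'Berlin', 'Berlin', 'Paris']
})

# Named aggregations: output_column=(input_column, function)
summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    oldest=('Age', 'max'),
    people=('Name', 'count'),
)
print(summary)
```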
Merging DataFrames
# Create two DataFrames
df1 = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
df2 = pd.DataFrame({'Name': ['Peter', 'Linda'], 'City': ['Berlin', 'London']})
# Concatenate DataFrames
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
# Merge DataFrames on 'Name' (df1 and df2 share no names, so this inner join returns an empty DataFrame)
df_merge = pd.merge(df1, df2, on='Name', how='inner')
print(df_merge)
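A merge is easier to follow when the two DataFrames share some keys. The following sketch uses hypothetical frames `ages` and `cities` with partially overlapping names; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical frames with partially overlapping names
ages = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 35]})
cities = pd.DataFrame({'Name': ['Anna', 'Peter', 'Linda'],
                       'City': ['Paris', 'Berlin', 'London']})

# Inner merge keeps only names present in both frames
matched = pd.merge(ages, cities, on='Name', how='inner')
print(matched)

# Outer merge keeps every name; _merge records the source of each row
all_rows = pd.merge(ages, cities, on='Name', how='outer', indicator=True)
print(all_rows)
```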
Exporting Data
# Export DataFrame to CSV
df.to_csv('output.csv', index=False)
# Export DataFrame to Excel (requires an Excel engine such as openpyxl to be installed)
df.to_excel('output.xlsx', index=False)
Advanced Pandas Tutorial
Handling Time Series Data
Pandas provides robust support for time series data. Here's how to work with it.
Creating Time Series Data
# Create a date range
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
print(date_range)
# Create a DataFrame with time series data
time_series_data = {
'Date': date_range,
'Value': np.random.randn(10)
}
df_time_series = pd.DataFrame(time_series_data)
df_time_series.set_index('Date', inplace=True)
print(df_time_series)
Resampling Time Series Data
# Resample to weekly frequency and calculate the mean
df_resampled = df_time_series.resample('W').mean()
print(df_resampled)
# Resample to monthly frequency and calculate the sum
# ('ME' means month end; pandas versions before 2.2 use 'M' instead)
df_resampled_monthly = df_time_series.resample('ME').sum()
print(df_resampled_monthly)
Working with Categorical Data
# Create a DataFrame with categorical data
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'City': ['New York', 'Paris', 'Berlin', 'London'],
'Gender': ['Male', 'Female', 'Male', 'Female']
}
df_categorical = pd.DataFrame(data)
# Convert a column to categorical type
df_categorical['Gender'] = df_categorical['Gender'].astype('category')
print(df_categorical)
# Get the categories and codes
print(df_categorical['Gender'].cat.categories)
print(df_categorical['Gender'].cat.codes)
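Categories can also carry an order, which affects sorting and comparisons. This is a small sketch with a hypothetical `sizes` Series, since the Gender column above has no natural order.

```python
import pandas as pd

# Hypothetical data with a natural order: small < medium < large
sizes = pd.Series(['medium', 'small', 'large', 'small'])
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = sizes.astype(size_type)

# Codes follow the declared category order, not alphabetical order
codes = sizes.cat.codes.tolist()
print(codes)

# Sorting and comparisons respect the declared order
sorted_sizes = sizes.sort_values()
print(sorted_sizes)
at_least_medium = sizes[sizes >= 'medium']
print(at_least_medium)
```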
Pivot Tables
# Create a DataFrame
data = {
'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
'Sales': [150, 200, 130, 210, 170, 220]
}
df_sales = pd.DataFrame(data)
# Create a pivot table
pivot_table = df_sales.pivot_table(values='Sales', index='Name',
                                   columns='Month', aggfunc='sum')
print(pivot_table)
Handling Large Datasets
# Read a large CSV file in chunks
chunk_size = 1000
chunks = pd.read_csv('large_dataset.csv', chunksize=chunk_size)
# Process each chunk
for chunk in chunks:
    # Perform operations on the chunk
    print(chunk.shape)
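A common pattern with chunks is to accumulate a result so that the full file never has to sit in memory. The sketch below uses a small in-memory CSV as a stand-in for a large file; the column names `id` and `amount` are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for a large file: a small in-memory CSV with hypothetical columns
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
row_count = 0

# Read four rows at a time and accumulate running totals
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['amount'].sum()
    row_count += len(chunk)

print(total, row_count)
```

With a real file, the path would replace `io.StringIO(csv_text)` and the chunk size would typically be much larger.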
Applying Functions
Using apply()
# Create a DataFrame
data = {
'A': [1, 2, 3],
'B': [4, 5, 6]
}
df = pd.DataFrame(data)
# Define a function
def add_one(x):
    return x + 1
# Apply the function to each element
# (DataFrame.map replaces the deprecated applymap; pandas versions before 2.1 use df.applymap)
print(df.map(add_one))
# Apply the function to each column
print(df.apply(lambda x: x + 1))
# Apply the function to each row
print(df.apply(lambda x: x + 1, axis=1))
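The lambdas above give the same result for either axis, so the difference between column-wise and row-wise application is easy to miss. A short sketch where the axis matters:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# axis=1: the function receives one row at a time
row_totals = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(row_totals)

# Default axis=0: the function receives one column at a time
column_ranges = df.apply(lambda col: col.max() - col.min())
print(column_ranges)
```

For simple arithmetic like this, the vectorized form `df['A'] + df['B']` is usually faster than apply().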
Joining DataFrames
# Create two DataFrames
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
'key': ['B', 'D', 'E', 'F'],
'value': [5, 6, 7, 8]
})
# Inner join
inner_joined = pd.merge(df1, df2, on='key', how='inner')
print(inner_joined)
# Left join
left_joined = pd.merge(df1, df2, on='key', how='left')
print(left_joined)
# Right join
right_joined = pd.merge(df1, df2, on='key', how='right')
print(right_joined)
# Outer join
outer_joined = pd.merge(df1, df2, on='key', how='outer')
print(outer_joined)
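Because both DataFrames have a column named value, pandas renames them to value_x and value_y in the joined result. The sketch below shows how the `suffixes` parameter can give them clearer names; the suffix strings are arbitrary choices.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# Custom suffixes replace the default _x / _y on overlapping column names
inner_named = pd.merge(df1, df2, on='key', how='inner',
                       suffixes=('_left', '_right'))
print(inner_named)
```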
Window Functions
# Create a DataFrame with time series data
data = {
'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
'Value': np.random.randn(10)
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)
# Calculate rolling mean
rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)
# Calculate expanding sum
expanding_sum = df['Value'].expanding().sum()
print(expanding_sum)
# Calculate exponentially weighted mean
ewm_mean = df['Value'].ewm(span=3).mean()
print(ewm_mean)
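With random values the window results are hard to check by eye. The sketch below repeats the rolling and expanding calculations on a small fixed series and shows `min_periods`, which controls how many observations a window needs before it produces a value.

```python
import pandas as pd

# Fixed values so the results can be verified by hand
values = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range(start='2023-01-01', periods=5, freq='D')
)

# Default: the first two windows are incomplete, so they yield NaN
rolling_default = values.rolling(window=3).mean()
print(rolling_default)

# min_periods=1: incomplete windows average whatever is available
rolling_partial = values.rolling(window=3, min_periods=1).mean()
print(rolling_partial)

# Expanding sum is a running total
running_total = values.expanding().sum()
print(running_total)
```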
Handling JSON Data
# Create a JSON string
json_str = '''
[
{"Name": "John", "Age": 28, "City": "New York"},
{"Name": "Anna", "Age": 24, "City": "Paris"},
{"Name": "Peter", "Age": 35, "City": "Berlin"}
]
'''
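# Read JSON string into DataFrame
# (wrapping the string in StringIO avoids the deprecation of passing literal JSON in pandas 2.1+)
from io import StringIO
df_json = pd.read_json(StringIO(json_str))
print(df_json)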
# Export DataFrame to JSON
df_json.to_json('output.json', orient='records', lines=True)
Advanced Indexing with MultiIndex
# Create a MultiIndex DataFrame
arrays = [
['A', 'A', 'B', 'B'],
['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)
print(df_multi)
# Accessing data in MultiIndex DataFrame
print(df_multi.loc['A'])
print(df_multi.loc[('A', 'one')])
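Selecting by the inner level, or turning the index levels back into columns, is also possible. The following sketch rebuilds the same MultiIndex DataFrame and shows three such operations.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']],
    names=('first', 'second')
)
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

# Cross-section: all rows where the second level equals 'one'
ones = df_multi.xs('one', level='second')
print(ones)

# Pivot the second level into columns
wide = df_multi['value'].unstack('second')
print(wide)

# Turn both index levels back into ordinary columns
flat = df_multi.reset_index()
print(flat)
```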
Combining DataFrames with concat and append
# Create DataFrames
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})
df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})
# Concatenate DataFrames
concatenated = pd.concat([df1, df2], ignore_index=True)
print(concatenated)
# Append DataFrames
# (DataFrame.append was removed in pandas 2.0; pd.concat gives the same result)
appended = pd.concat([df1, df2], ignore_index=True)
print(appended)
Performance Tips
# Use vectorized operations instead of loops
data = pd.DataFrame({
'A': range(1000000),
'B': range(1000000)
})
# Inefficient way: Using loops
data['C'] = [x + y for x, y in zip(data['A'], data['B'])]
# Efficient way: Using vectorized operations
data['C'] = data['A'] + data['B']
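To see the difference in practice, the two approaches can be timed side by side. This is a rough sketch on a smaller DataFrame; actual timings depend on the machine and library versions, so no specific numbers are claimed here.

```python
import time
import pandas as pd

data = pd.DataFrame({
    'A': range(100_000),
    'B': range(100_000)
})

# Time the loop-based version
start = time.perf_counter()
loop_result = [x + y for x, y in zip(data['A'], data['B'])]
loop_seconds = time.perf_counter() - start

# Time the vectorized version
start = time.perf_counter()
vectorized_result = data['A'] + data['B']
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")

# Both approaches produce the same values
print(vectorized_result.tolist() == loop_result)
```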