Python for Data Analysis - Complete Notes
1. Introduction to Python for Data Analysis
Python is a high-level, general-purpose programming language well suited to data analysis because of its
readable syntax and its mature ecosystem of libraries. It supports the full workflow: data cleaning,
transformation, statistical modeling, and visualization.
2. NumPy - Numerical Python
NumPy provides efficient array structures and mathematical functions.
Key Features:
- ndarray: Multidimensional array object
- Broadcasting: Arithmetic operations on arrays of different but compatible shapes
- Mathematical functions: mean, std, dot, etc.
Example:
import numpy as np
arr = np.array([[1, 2], [3, 4]])
print(np.mean(arr)) # Output: 2.5
print(arr.shape) # Output: (2, 2)
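Broadcasting is easier to see with shapes written out. The sketch below is illustrative only; the array names and values are made up for this example.

```python
import numpy as np

# A (2, 3) matrix plus a (3,) vector: the vector is applied to every row.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row_offsets = np.array([10, 20, 30])
shifted = matrix + row_offsets       # shape (2, 3)

# A (2, 1) column is stretched across the columns of the (2, 3) matrix.
col_scale = np.array([[1], [2]])
scaled = matrix * col_scale          # shape (2, 3)

# dot computes a matrix-vector product: (2, 3) with (3,) gives (2,).
weights = np.array([1, 0, 1])
projected = np.dot(matrix, weights)

print(shifted)    # [[11 22 33] [14 25 36]]
print(scaled)     # [[ 1  2  3] [ 8 10 12]]
print(projected)  # [ 4 10]
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or one of them is 1.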
3. Pandas - Data Manipulation and Analysis
Pandas introduces two main data structures:
- Series: 1D labeled array
- DataFrame: 2D labeled data structure
Key Operations:
- Reading data: pd.read_csv(), pd.read_excel()
- Inspecting data: df.head(), df.info()
- Filtering: df[df['Age'] > 25]
- Sorting: df.sort_values(by='Salary')
Example:
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [22, 28]})
print(df[df['Age'] > 25])
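The inspection, filtering, and sorting operations listed above can be combined as in the following sketch. The sample rows and the Salary values are hypothetical, chosen only to make the results easy to check.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the operations listed above.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [22, 28, 35, 31],
    'Salary': [40000, 52000, 61000, 48000],
})

print(df.head(2))   # first two rows
df.info()           # dtypes and non-null counts (prints directly)

# Filter rows, then sort the result by Salary, highest first.
over_25 = df[df['Age'] > 25]
ranked = over_25.sort_values(by='Salary', ascending=False)

# Combine conditions with & or |, wrapping each condition in parentheses.
mid_band = df[(df['Age'] > 25) & (df['Salary'] < 60000)]

print(ranked['Name'].tolist())    # ['C', 'B', 'D']
print(mid_band['Name'].tolist())  # ['B', 'D']
```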
4. Data Cleaning in Pandas
- Handling Missing Data:
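df.isnull().sum()  # number of missing values per column
df.dropna(), df.fillna(value)  # both return a new DataFrame; assign the result to keep it
- Renaming Columns:
df = df.rename(columns={'old': 'new'})  # rename() also returns a new DataFrame
- Changing Data Types:
df['col'] = df['col'].astype('int')  # raises an error if the column still contains NaN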
Example:
df['Age'] = df['Age'].fillna(df['Age'].mean())
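A minimal end-to-end cleaning sketch, assuming a small table with one numeric gap and one missing label. The column names 'emp_age' and 'dept' are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with gaps.
raw = pd.DataFrame({
    'emp_age': [22.0, np.nan, 30.0, 28.0],
    'dept': ['HR', 'IT', None, 'IT'],
})

missing_per_column = raw.isnull().sum()   # emp_age: 1, dept: 1

clean = raw.rename(columns={'emp_age': 'Age', 'dept': 'Department'})
clean['Age'] = clean['Age'].fillna(clean['Age'].mean())  # fill gap with column mean
clean['Age'] = clean['Age'].round().astype('int')        # safe: no NaN left in Age
clean = clean.dropna(subset=['Department'])              # drop rows with no department

print(missing_per_column)
print(clean)
```

The order matters: the integer cast only works after the missing ages have been filled.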
5. Grouping and Aggregation
- Grouping: df.groupby('Department')['Salary'].mean()
- Aggregation: df.agg({'Age': ['mean', 'max'], 'Salary': 'sum'})
- Pivot Tables:
df.pivot_table(index='Department', values='Salary', aggfunc='mean')
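The three operations above can be compared on one small table. This is a sketch with made-up data; the named-aggregation form (avg_age, total_salary) is an addition to the notes, shown because it gives readable output column names.

```python
import pandas as pd

# Hypothetical employee data.
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'IT'],
    'Age': [25, 32, 41, 29, 36],
    'Salary': [40000, 60000, 80000, 50000, 70000],
})

# One statistic per group.
mean_salary = df.groupby('Department')['Salary'].mean()   # HR 45000, IT 70000

# Whole-table aggregation: different functions per column.
summary = df.agg({'Age': ['mean', 'max'], 'Salary': 'sum'})

# Per-group aggregation with named output columns.
per_dept = df.groupby('Department').agg(
    avg_age=('Age', 'mean'),
    total_salary=('Salary', 'sum'),
)

# The pivot table yields the same per-department means as the groupby above.
pivot = df.pivot_table(index='Department', values='Salary', aggfunc='mean')

print(mean_salary)
print(summary)
print(per_dept)
print(pivot)
```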
6. Matplotlib - Basic Visualization
Matplotlib is used to create static, animated, and interactive plots.
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [10, 20, 30]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()
7. Seaborn - Statistical Visualization
Seaborn is built on top of Matplotlib and is used for statistical graphics.
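Example:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
tips = sns.load_dataset('tips')  # fetches the sample dataset; needs internet on first use
sns.barplot(x='day', y='total_bill', data=tips)  # bar height = mean total_bill per day
plt.show()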
8. Time Series Analysis with Pandas
Time series data is indexed by timestamps. Pandas supports time-based indexing, slicing, and resampling on a datetime index.
Example:
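df['date'] = pd.to_datetime(df['date'])  # parse strings into datetime values
df.set_index('date', inplace=True)  # resample() needs a datetime-like index
monthly_avg = df['sales'].resample('ME').mean()  # month-end bins; use 'M' on pandas older than 2.2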
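A self-contained sketch of the same idea on synthetic data. The daily sales figures are invented, and the sketch uses the month-start alias 'MS', which is assumed here because it works across pandas versions. The rolling average is an extra illustration that is not part of the notes above.

```python
import pandas as pd

# Hypothetical daily sales: 59 days covering January and February 2023.
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=59, freq='D').strftime('%Y-%m-%d'),
    'sales': [100] * 31 + [200] * 28,
})

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

january = df.loc['2023-01']                        # partial-string indexing: all of January
monthly_avg = df['sales'].resample('MS').mean()    # one value per month, labeled by month start
rolling_7d = df['sales'].rolling(window=7).mean()  # first 6 values are NaN

print(len(january))   # 31
print(monthly_avg)    # 100.0 for January, 200.0 for February
```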
9. Statistics with Pandas and NumPy
- Descriptive Stats: df.describe()
- Correlation: df.corr(numeric_only=True)  # skips non-numeric columns, which raise an error in pandas 2.0+
- Value Counts: df['Category'].value_counts()
- Standard Deviation: df['Salary'].std()  # sample standard deviation (ddof=1)
NumPy Examples:
np.mean(data), np.median(data), np.std(data)  # np.std defaults to ddof=0 (population)
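Pandas and NumPy use different defaults for standard deviation, which is a common source of mismatched results. The sketch below uses invented values and assumes pandas 1.5 or later for the numeric_only argument.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Bonus is exactly half of Salary, so their correlation is 1.
df = pd.DataFrame({
    'Category': ['x', 'y', 'x', 'x'],
    'Salary': [2.0, 4.0, 4.0, 6.0],
    'Bonus': [1.0, 2.0, 2.0, 3.0],
})

salary = df['Salary'].to_numpy()

sample_std = df['Salary'].std()       # pandas default: ddof=1, about 1.633
population_std = np.std(salary)       # NumPy default: ddof=0, about 1.414
matched_std = np.std(salary, ddof=1)  # matches the pandas result

corr = df.corr(numeric_only=True)     # ignores the text column 'Category'
counts = df['Category'].value_counts()

print(sample_std, population_std, matched_std)
print(corr)
print(counts)
```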
10. Plotly - Interactive Visualization
Plotly is a graphing library for interactive charts.
Example:
import plotly.express as px
df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop", color="continent")
fig.show()
11. Scikit-learn - Machine Learning Library
Scikit-learn provides simple tools for predictive data analysis.
Steps:
- Load dataset
- Split data: train_test_split()
- Train model: model.fit()
- Predict: model.predict()
Example:
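from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# X (feature matrix) and y (target vector) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)  # learn coefficients from the training split
preds = model.predict(X_test)  # predict on unseen data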
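A self-contained sketch of the four steps, using synthetic data in place of a real dataset. The data-generating formula and the r2_score check are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. Load dataset (synthetic here): y = 3*x1 - 2*x2 + 5, with no noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

# 2. Split data: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train model.
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Predict and score on the held-out rows.
preds = model.predict(X_test)
print(model.coef_, model.intercept_)   # close to [3, -2] and 5
print(r2_score(y_test, preds))         # close to 1.0 for noise-free data
```

Because the data is noise-free and linear, the model recovers the coefficients almost exactly; real datasets will not score this well.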
12. Summary & Tips for Interviews
- Master Pandas and NumPy first
- Practice on real datasets (Kaggle, UCI, etc.)
- Know how to visualize and clean data
- Understand ML workflow: EDA -> Preprocessing -> Model -> Evaluation
- Practice SQL + Python-based case studies