# Pandas Tutorial
### Pandas Overview
Pandas is a Python library for data manipulation and analysis. It
provides two flexible data structures, Series and DataFrame, for
working efficiently with structured data.
---
## 1. DataFrames and Series
### Series
A Series is a one-dimensional, array-like object that can hold data of
any type (integers, strings, floats, etc.), along with an associated
index of labels. It is similar to a single column in a spreadsheet, or
to a dictionary whose keys are the index labels.
Example:
```python
import pandas as pd
data = [10, 20, 30, 40]
index = ['A', 'B', 'C', 'D']
series = pd.Series(data, index=index)
print(series)
```
Output:
```
A    10
B    20
C    30
D    40
dtype: int64
```
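The dictionary comparison above can be made concrete with a short sketch. It rebuilds the same Series and reads values by label; the variable names `value_b`, `subset`, and `doubled` are chosen here for illustration only.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

# Label-based lookup, much like reading a dictionary key
value_b = series['B']

# Passing a list of labels returns a new, smaller Series
subset = series[['A', 'D']]

# Arithmetic is applied element-wise and the index is preserved
doubled = series * 2

print(value_b)
print(subset.to_dict())
print(doubled.to_dict())
```

Unlike a plain dictionary, the Series keeps its labels through vectorized operations, which is why `doubled` still has the index `A` through `D`.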
### DataFrame
A DataFrame is a two-dimensional, tabular data structure with labeled
rows and columns, akin to a spreadsheet. It is essentially a collection
of Series sharing the same index.
Example:
```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
```
Output:
```
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
```
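To illustrate the idea that a DataFrame is a collection of Series sharing one index, the following sketch pulls a column out of the same example data and inspects it. The names `age`, `is_series`, `same_index`, and `shape` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

# Selecting one column returns a Series
age = df['Age']
is_series = isinstance(age, pd.Series)

# Every column carries the same row index as the DataFrame
same_index = age.index.equals(df['Salary'].index)

# shape is (number of rows, number of columns)
shape = df.shape

print(is_series, same_index, shape)
```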
---
## 2. Reading Data
Pandas makes it easy to read and write data in various formats like
CSV, Excel, JSON, SQL, and more.
### Reading CSV Files
```python
df = pd.read_csv('data.csv') # Reads data from a CSV file
```
### Reading Excel Files
```python
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # Needs an Excel engine such as openpyxl installed
```
### Reading JSON Files
```python
df = pd.read_json('data.json')
```
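The file names above are placeholders, so here is a minimal sketch that runs without any file on disk. It assumes an in-memory CSV string (`csv_text`, invented for this example) wrapped in `io.StringIO`, reads it, and writes it back out with `to_csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as data.csv
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text))

# With no path given, to_csv returns the CSV text as a string;
# index=False leaves the row index out of the output
out = df.to_csv(index=False)

# Reading the written text back should give an identical DataFrame
roundtrip = pd.read_csv(io.StringIO(out))

print(df)
```

The same pattern applies to real files: pass a path instead of the `StringIO` object, for example `df.to_csv('output.csv', index=False)`.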
---
## 3. Data Cleaning
Data cleaning involves preparing raw data by handling
inconsistencies or errors.
### Dropping Rows/Columns
```python
df = df.drop(columns=['UnnecessaryColumn'])
df = df.dropna() # Drops rows with missing values
```
### Renaming Columns
```python
df = df.rename(columns={'OldName': 'NewName'})
```
### Replacing Values
```python
df['ColumnName'] = df['ColumnName'].replace({'OldValue': 'NewValue'})
```
### Changing Data Types
```python
df['Age'] = df['Age'].astype(int)  # Converts to integer type; fails if the column contains missing values
```
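The cleaning steps above can be chained on a small example. This is a sketch under stated assumptions: the `raw` DataFrame and its column names are invented for illustration, and `pd.to_numeric(..., errors='coerce')` is used as one possible way to handle non-numeric entries before calling `astype(int)`.

```python
import pandas as pd

# Hypothetical messy input: an empty column, an inconsistent header,
# coded values, and a non-numeric age
raw = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', 'unknown', '35'],
    'Status': ['Y', 'N', 'Y'],
    'Notes': ['', '', ''],
})

clean = raw.drop(columns=['Notes'])                  # remove an unneeded column
clean = clean.rename(columns={'name': 'Name'})       # fix the header
clean['Status'] = clean['Status'].replace({'Y': 'Active', 'N': 'Inactive'})

# Unparseable entries become NaN instead of raising an error
clean['Age'] = pd.to_numeric(clean['Age'], errors='coerce')

# Drop rows where Age could not be parsed, then convert safely to int
clean = clean.dropna(subset=['Age'])
clean['Age'] = clean['Age'].astype(int)

print(clean)
```

Dropping the unparseable row is only one choice; filling it with a default or a column statistic would be equally reasonable depending on the data.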
---
## 4. Data Manipulation
### Selecting Data
- By column name:
```python
df['ColumnName']
```
- By multiple columns:
```python
df[['Column1', 'Column2']]
```
- By condition:
```python
df[df['Age'] > 30]
```
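The selection forms above can also be combined. The sketch below, reusing the example data from section 1, filters on two conditions and picks specific columns in one step with `.loc`; the names `mask` and `selected` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

# Combine conditions with & (and) or | (or); each condition needs
# its own parentheses because of operator precedence
mask = (df['Age'] > 25) & (df['Salary'] < 70000)

# .loc takes a row selector and a column selector together
selected = df.loc[mask, ['Name', 'Age']]

print(selected)
```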
### Adding New Columns
```python
df['NewColumn'] = df['Column1'] + df['Column2']
```
### Sorting Data
```python
df = df.sort_values(by='Age', ascending=True)
```
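As a worked example of adding a column and then sorting on it, the following sketch uses invented `Base` and `Bonus` columns. It sorts in descending order and resets the index so that row labels follow the new order.

```python
import pandas as pd

# Hypothetical pay data for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Base': [50000, 60000, 55000],
    'Bonus': [5000, 2000, 8000],
})

# A new column computed element-wise from two existing columns
df['Total'] = df['Base'] + df['Bonus']

# Sort from highest to lowest total; reset_index(drop=True) renumbers
# the rows 0..n-1 instead of keeping the original labels
ranked = df.sort_values(by='Total', ascending=False).reset_index(drop=True)

print(ranked)
```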
---
## 5. Handling Missing Data
Pandas provides tools to detect and handle missing data effectively.
### Detecting Missing Data
```python
df.isnull() # Returns a DataFrame of True/False for missing values
df.isnull().sum() # Counts missing values for each column
```
### Filling Missing Data
- Fill with a specific value:
```python
df['ColumnName'] = df['ColumnName'].fillna(0)
```
- Fill with column mean/median/mode:
```python
df['ColumnName'] = df['ColumnName'].fillna(df['ColumnName'].mean())
```
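A short end-to-end sketch of detecting and filling missing values follows. The data is invented for illustration; the numeric column is filled with its mean and the text column with its most frequent value, since `mode()` returns a Series and its first entry has to be selected explicitly.

```python
import pandas as pd

# None is stored as a missing value (NaN) in both columns
df = pd.DataFrame({
    'Age': [25, None, 35],
    'City': ['Paris', None, 'Paris'],
})

# Count missing values per column before filling
missing_counts = df.isnull().sum()

# mean() skips missing values by default
df['Age'] = df['Age'].fillna(df['Age'].mean())

# mode() returns a Series of the most frequent values; take the first
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(missing_counts)
print(df)
```

Replacing `mean()` with `median()` works the same way for numeric columns.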
### Dropping Missing Data
```python
df = df.dropna() # Drops rows with missing values
```
---
## 6. Grouping Data
Grouping allows you to aggregate data based on one or more keys.
### Group By
```python
grouped = df.groupby('Category')
```
### Aggregate Functions
```python
grouped['ColumnName'].mean() # Computes the mean for each group
grouped['ColumnName'].sum() # Computes the sum for each group
```
### Multiple Aggregations
```python
df.groupby('Category').agg({'Column1': 'mean', 'Column2': 'sum'})
```
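To show what these aggregations return, here is a minimal sketch on invented data using the same column names as above. It also shows named aggregation, an alternative form of `agg` that lets you choose the output column names (`avg_col1` and `total_col2` are illustrative).

```python
import pandas as pd

# Hypothetical data with two groups
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B'],
    'Column1': [10, 20, 30, 40],
    'Column2': [1, 2, 3, 4],
})

# One row per group, indexed by the group key
summary = df.groupby('Category').agg({'Column1': 'mean', 'Column2': 'sum'})

# Named aggregation: output_name=(input_column, function)
named = df.groupby('Category').agg(
    avg_col1=('Column1', 'mean'),
    total_col2=('Column2', 'sum'),
)

print(summary)
print(named)
```

The result is indexed by `Category`; calling `.reset_index()` on it turns the group key back into a regular column.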
---
## 7. Merging Data
Pandas provides several methods to merge or join datasets.
### Merging DataFrames
```python
merged_df = pd.merge(df1, df2, on='common_column')
```
### Join Types
- Inner Join (default):
Keeps only rows whose key appears in both DataFrames.
- Outer Join:
Includes all rows from both DataFrames, filling unmatched cells with NaN.
```python
pd.merge(df1, df2, on='common_column', how='outer')
```
- Left Join:
Includes all rows from the left DataFrame, plus matching rows from the right.
```python
pd.merge(df1, df2, on='common_column', how='left')
```
- Right Join:
Includes all rows from the right DataFrame, plus matching rows from the left.
```python
pd.merge(df1, df2, on='common_column', how='right')
```
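The difference between the join types is easiest to see on two small DataFrames with partly overlapping keys. The data and the column names `left_val` and `right_val` below are invented for illustration.

```python
import pandas as pd

# Keys 2 and 3 appear in both; key 1 only on the left, key 4 only on the right
df1 = pd.DataFrame({'common_column': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'common_column': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

inner = pd.merge(df1, df2, on='common_column')               # keys 2, 3
outer = pd.merge(df1, df2, on='common_column', how='outer')  # keys 1, 2, 3, 4
left = pd.merge(df1, df2, on='common_column', how='left')    # keys 1, 2, 3
right = pd.merge(df1, df2, on='common_column', how='right')  # keys 2, 3, 4

for name, result in [('inner', inner), ('outer', outer),
                     ('left', left), ('right', right)]:
    print(name, sorted(result['common_column'].tolist()))
```

In the left join, key 1 has no partner in `df2`, so its `right_val` is NaN; the outer join has NaN on both sides for the unmatched keys.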
### Concatenating DataFrames
Combine rows or columns of DataFrames:
```python
pd.concat([df1, df2], axis=0) # Stacks rows
pd.concat([df1, df2], axis=1) # Combines columns
```
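One detail worth seeing in practice is what happens to the index when stacking rows. The sketch below uses two invented single-column DataFrames and shows the `ignore_index=True` option, which renumbers the result.

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

# Stacking rows keeps each frame's original index, so labels repeat
stacked = pd.concat([df1, df2], axis=0)

# ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

# Combining columns aligns rows on the index; the second frame is
# renamed here so the two columns have distinct names
side_by_side = pd.concat([df1, df2.rename(columns={'x': 'y'})], axis=1)

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side)
```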
---
### Summary
Pandas is a versatile tool for handling structured data efficiently.
Whether you're cleaning messy data, performing calculations, or
preparing data for visualization, Pandas is a go-to library in Python.
Together, the operations covered here (reading, cleaning, manipulating,
grouping, and merging) form the foundation of data analysis workflows.