PANDAS:
1- Create a Pandas Data frame from the given data and create a new column “Voter” based on
voter age, i.e., if age >18 then voter column should be “Yes” otherwise if age <18 then voter
column should be “No”
raw_Data = {'Voter_name': ['Geek1', 'Geek2', 'Geek3', 'Geek4',
'Geek5', 'Geek6', 'Geek7', 'Geek8'],
'Voter_age': [15, 23, 25, 9, 67, 54, 42, np.NaN]}
ills
Solution:
import pandas as pd
Sk
import numpy as np
raw_Data = {'Voter_name': ['Geek1', 'Geek2', 'Geek3', 'Geek4', 'Geek5', 'Geek6', 'Geek7',
'Geek8'], a
'Voter_age': [15, 23, 25, 9, 67, 54, 42, np.NaN]}
at
df = pd.DataFrame(raw_Data)
D
# Create a new column "Voter" based on voter age
df['Voter'] = np.where(df['Voter_age'] > 18, 'Yes', 'No')
w
print(df)
ro
2 – Create a Pandas Data frame from the given data and collapse First and Last column into
G
one column as Full Name, so the output contains Full Name and Age, then convert column age
to index
raw_Data = {'First': ['Manan ', 'Raghav ', 'Sunny '],
'Last': ['Goel', 'Sharma', 'Chawla'],
'Age' : [12, 24, 56]}
Solution:
raw_Data = {'First': ['Manan', 'Raghav', 'Sunny'],
'Last': ['Goel', 'Sharma', 'Chawla'],
'Age': [12, 24, 56]}
df = pd.DataFrame(raw_Data)
# Combine First and Last columns into Full Name
ills
df['Full Name'] = df['First'] + ' ' + df['Last']
# Set Age as index
Sk
df.set_index('Age', inplace=True)
print(df)
a
3- Create a Pandas Data frame from the given data -
at
raw_Data = {'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],
'Product':['Umbrella', 'Matress', 'Badminton','Shuttle'],
D
'Price':[1250, 1450, 1550, 400],
'Expense': [ 21525220.653, 31125840.875, 23135428.768, 56245263.942]}
a- Add Index as Item1, Item2, Item3, Item4
w
b- Find the index labels of all items whose ‘Price’ is greater than 1000.
c- Replace products using Map() with respective codes- Umbrella : ‘U’, Matress : 'M', Badminton
ro
: 'B', Shuttle: 'S'
d- Round off the Expense column values to two decimal places.
G
e- Create a new column called ‘Discounted_Price’ after applying a 10% discount on the existing
‘price’ column.(try using lambda function)
f- Convert the column type of “Date” to datetime format
g- Create a column rank which ranks the products based on the price (one with highest price will
be rank 1).
Solution:
raw_Data = {'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],
'Product': ['Umbrella', 'Matress', 'Badminton', 'Shuttle'],
'Price': [1250, 1450, 1550, 400],
'Expense': [21525220.653, 31125840.875, 23135428.768, 56245263.942]}
df = pd.DataFrame(raw_Data)
ills
# Task a: Add Index as Item1, Item2, Item3, Item4
df.index = ['Item1', 'Item2', 'Item3', 'Item4']
Sk
# Task b: Find index labels with Price > 1000
indexes_with_price_gt_1000 = df[df['Price'] > 1000].index.tolist()
a
# Task c: Replace products using Map()
product_map = {'Umbrella': 'U', 'Matress': 'M', 'Badminton': 'B', 'Shuttle': 'S'}
at
df['Product'] = df['Product'].map(product_map)
D
# Task d: Round off Expense column to two decimal places
df['Expense'] = df['Expense'].round(2)
w
# Task e: Create 'Discounted_Price' column with 10% discount
ro
df['Discounted_Price'] = df['Price'] * 0.9
G
# Task f: Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# Task g: Create 'Rank' column based on price
df['Rank'] = df['Price'].rank(ascending=False).astype(int)
print(df)
Assignment: Exploring NBA Player Data
Download the nba.csv file containing NBA player data Complete the following tasks using
Python, Pandas, and data visualization libraries:
1. Load Data:
● Load the nba.csv data into a Pandas DataFrame.
ills
● Display basic information about the DataFrame.
2. Data Cleaning:
● Handle missing values by either removing or imputing them.
Sk
● Remove duplicate rows.
3. Data Transformation:
● Create a new column 'BMI' (Body Mass Index) using the formula: BMI = (weight in
pounds / (height in inches)^2) * 703.(Assuming a fixed height value of 70 inches (5 feet
a
10 inches)
4. Exploratory Data Analysis (EDA):
at
● Display summary statistics of the 'age', 'weight', and 'salary' columns.
● Calculate the average age, weight, and salary of players in each 'position' category.
D
5. Data Visualization:
● Create a histogram of player ages.
w
● Create a box plot of player salaries for each 'position'.
● Plot a scatter plot of 'age' vs. 'salary' with a different color for each 'position'.
ro
6. Top Players:
● Display the top 10 players with the highest salaries.
G
7. College Analysis:
● Determine the top 5 colleges with the most represented players.
8. Position Distribution:
● Plot a pie chart to show the distribution of players across different 'positions'.
9. Team Analysis:
● Display the average salary of players for each 'team'.
● Plot a bar chart to visualize the average salary of players for each 'team'.
10. Extras
● Get the index at which minimum weight value is present.
● Sort values based on name in alphabetical order for the rows (the original Dataframe
sorting should not change)
● Create a series from given dataframe on “name” column and display top and last 10
Guidelines:
ills
1. Write Python code to complete each task.
2. Provide comments explaining your code.
3. Use meaningful variable names.
Sk
4. Include necessary library imports.
5. Present your findings in a clear and organized manner.
6. Feel free to use additional code cells for each task.
Solution:
a
at
1. Load Data:
import pandas as pd
D
# Load the data into a Pandas DataFrame
w
df = pd.read_csv('nba.csv')
ro
# Display basic information about the DataFrame
print(df.info())
G
print(df.head())
2. Data Cleaning:
# Handle missing values
df.dropna(inplace=True)
# Remove duplicate rows
df.drop_duplicates(inplace=True)
3. Data Transformation: Create 'BMI' column using a fixed height value
# Assuming a fixed height value of 70 inches (5 feet 10 inches)
fixed_height = 70
ills
# Create 'BMI' column
df['BMI'] = (df['Weight'] / (fixed_height ** 2)) * 703
Sk
4. Exploratory Data Analysis (EDA):
# Summary statistics
print(df[['Age', 'Weight', 'Salary']].describe())
a
# Average age, weight, and salary by position
at
avg_by_position = df.groupby('Position')[['Age', 'Weight', 'Salary']].mean()
print(avg_by_position)
D
5. Data Visualization:
w
import matplotlib.pyplot as plt
ro
# Histogram of player ages
plt.hist(df['Age'], bins=20)
G
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Player Ages')
plt.show()
# Box plot of player salaries by position
plt.figure(figsize=(10, 6))
df.boxplot(column='Salary', by='Position')
plt.ylabel('Salary')
plt.title('Box Plot of Player Salaries by Position')
plt.suptitle('')
plt.xticks(rotation=45)
plt.show()
ills
# Scatter plot of 'age' vs. 'salary' by position
plt.figure(figsize=(10, 6))
Sk
colors = {'PG': 'red', 'SG': 'blue', 'SF': 'green', 'PF': 'purple', 'C': 'orange'}
plt.scatter(df['Age'], df['Salary'], c=df['Position'].map(colors), alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Salary')
a
plt.title('Age vs. Salary by Position')
at
plt.legend(colors)
plt.show()
D
6. Top Players:
w
top_players = df.nlargest(10, 'Salary')
print(top_players)
ro
7. College Analysis:
G
top_colleges = df['College'].value_counts().nlargest(5)
print(top_colleges)
8. Position Distribution:
position_counts = df['Position'].value_counts()
plt.pie(position_counts, labels=position_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Position Distribution of Players')
plt.axis('equal')
plt.show()
9. Team Analysis:
ills
avg_salary_by_team = df.groupby('Team')['Salary'].mean()
print(avg_salary_by_team)
Sk
plt.figure(figsize=(10, 6))
avg_salary_by_team.plot(kind='bar')
plt.xlabel('Team')
plt.ylabel('Average Salary')
a
plt.title('Average Salary of Players by Team')
at
plt.xticks(rotation=45)
plt.show()
D
10.Extras:
w
min_weight_index = df['Weight'].idxmin()
print("Index with minimum weight value:", min_weight_index)
ro
df_sorted = df.sort_values(by='Name', ignore_index=True)
G
print(df_sorted)
name_series = df['Name']
print("Top 10 names:\n", name_series.head(10))
print("\nLast 10 names:\n", name_series.tail(10))
G
ro
w
D
at
a
Sk
ills