Pandas Interview Questions with Real-time Scenarios for Power BI and SQL
1. Scenario: Data Preprocessing before Power BI Import
Q: You're working with a CSV dataset that contains sales data. Some rows have missing values in the ProductName
and Revenue columns. Before importing into Power BI, how would you handle this using Pandas?
A:
import pandas as pd
df = pd.read_csv('sales.csv')
df = df[df['ProductName'].notna()]
df['Revenue'] = df['Revenue'].fillna(0)
2. Scenario: SQL to Pandas Conversion
Q: You have this SQL query:
SELECT Region, SUM(SalesAmount) FROM Sales GROUP BY Region HAVING SUM(SalesAmount) > 10000
How would you replicate it in Pandas?
A:
df_grouped = df.groupby('Region')['SalesAmount'].sum().reset_index()
df_filtered = df_grouped[df_grouped['SalesAmount'] > 10000]
3. Scenario: Merging Datasets like SQL Joins
Q: You have two dataframes: orders and customers. How would you perform a LEFT JOIN in Pandas?
A:
df_joined = pd.merge(orders, customers, how='left', on='CustomerID')
4. Scenario: Time Intelligence like DAX
Q: How would you calculate YoY Sales using Pandas?
A:
df['OrderDate'] = pd.to_datetime(df['OrderDate'])
df['Year'] = df['OrderDate'].dt.year
df['Month'] = df['OrderDate'].dt.month
monthly_sales = df.groupby(['Year', 'Month'])['Sales'].sum().reset_index()
monthly_sales['Sales_PY'] = monthly_sales['Sales'].shift(12)
monthly_sales['YoY_Growth'] = ((monthly_sales['Sales'] - monthly_sales['Sales_PY']) / monthly_sales['Sales_PY']) * 100
5. Scenario: Handling Large Datasets like Fact Tables
Q: You have a large dataset (10 million rows) in CSV format. What Pandas techniques would you use?
A:
chunks = pd.read_csv('large_sales.csv', chunksize=500000)
filtered_chunks = []
for chunk in chunks:
chunk = chunk[chunk['Sales'] > 1000]
filtered_chunks.append(chunk)
df_filtered = pd.concat(filtered_chunks)
6. Scenario: Replace Calculated Columns in Power BI with Pandas
Q: How would you calculate Profit Margin?
A:
df['ProfitMargin'] = (df['Revenue'] - df['Cost']) / df['Revenue']
7. Scenario: Automating Excel/CSV to Power BI Pipeline
Q: How would you clean and transform data from Excel sheets and push to SQL?
A:
all_sheets = pd.read_excel('sales_data.xlsx', sheet_name=None)
combined = pd.concat(all_sheets.values(), ignore_index=True)
combined['Revenue'] = combined['Revenue'].fillna(0)
combined['Date'] = pd.to_datetime(combined['Date'])
from sqlalchemy import create_engine
engine = create_engine('mssql+pyodbc://server/db?driver=SQL+Server')
combined.to_sql('SalesCleaned', con=engine, if_exists='replace')