Interview Questions For Data Analysis and Data Science

The document outlines a comprehensive overview of a Power BI project focused on analyzing ozone pollution levels, detailing the challenges faced, data cleaning methods, visualizations used, and interactivity features implemented. It also includes SQL and Python interview questions, behavioral questions, and advanced SQL concepts relevant to data analysis. Additionally, it discusses practical scenarios and mock interview questions tailored for data analyst roles.


✅ Power BI Project Review

1. Explain your Power BI project in under 2 minutes.

o My Power BI project focused on analyzing ozone pollution levels over time across various
cities. I used Python and Power Query for data cleaning and Power BI for visualization. The
final dashboard included city-wise pollution trends, yearly comparisons, and alerts for
critical levels.

2. What were the main challenges you faced while building the dashboards?

o Handling missing or inconsistent data, optimizing DAX measures for performance, and
choosing the right visualizations for clarity were key challenges.

3. How did you clean and transform the data before visualization?

o I used Power Query to remove nulls, standardize formats, split columns, and merge
datasets. I also created custom columns and date hierarchies.

4. What visuals did you choose, and why?

o Line charts for trends, bar charts for comparisons, KPIs for key metrics, and maps for geo-
distribution. These were intuitive and effective.

5. How did you implement interactivity in your report (e.g., slicers, filters)?

o I used slicers for date, city, and pollutant levels. Drill-through and tooltip pages provided
deeper analysis.

6. Did you use any DAX measures? Can you give an example?

o Yes. Example: Total Pollution = SUM(Data[Pollution])

7. How did you publish or share your Power BI report?

o I published it to the Power BI Service and shared it with stakeholders using workspaces.

8. What would you improve in your project if given more time?

o Add forecasting, anomaly detection, and integrate data from external APIs for real-time
updates.

9. Did you use bookmarks, drill-through, or tooltips in your project?

o Yes, bookmarks for toggling between views, drill-through for city-wise analysis, and tooltips
for hover info.

10. How would you make your dashboard mobile-friendly?

 I would adjust the layout using the mobile layout view and test on various screen sizes.

✅ SQL Interview Questions

1. Second highest salary:

SELECT MAX(salary) FROM Employee WHERE salary < (SELECT MAX(salary) FROM Employee);
2. Difference: RANK, DENSE_RANK, ROW_NUMBER

o RANK() assigns the same rank to ties and skips the following rank(s). DENSE_RANK() assigns
the same rank to ties without gaps. ROW_NUMBER() assigns a unique sequential number to
every row.
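Assuming pandas is available, the three ranking modes can be mimicked with `Series.rank` on a toy frame (the `df` and salary values below are purely illustrative) to see the difference side by side:

```python
import pandas as pd

# Toy data with a tie at 200 to expose the difference between the modes.
df = pd.DataFrame({"salary": [300, 200, 200, 100]})

# RANK(): ties share a rank, following ranks are skipped -> 1, 2, 2, 4
df["rank"] = df["salary"].rank(method="min", ascending=False).astype(int)

# DENSE_RANK(): ties share a rank, no gaps -> 1, 2, 2, 3
df["dense_rank"] = df["salary"].rank(method="dense", ascending=False).astype(int)

# ROW_NUMBER(): every row gets a unique sequential number -> 1, 2, 3, 4
df["row_number"] = df["salary"].rank(method="first", ascending=False).astype(int)

print(df)
```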

3. CTE Example:

WITH AvgSalary AS (
    SELECT dept_id, AVG(salary) AS avg_sal FROM Employee GROUP BY dept_id
)
SELECT * FROM AvgSalary;

4. Customers with >2 orders in a month:

SELECT customer_id, COUNT(*) FROM Orders

GROUP BY customer_id, YEAR(order_date), MONTH(order_date)

HAVING COUNT(*) > 2;

5. Moving average:

SELECT date, AVG(sales) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) FROM
Sales;
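The same 3-row moving average can be sketched in pandas with `rolling` (the `sales` values are illustrative):

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40])

# window=3 with min_periods=1 mirrors "ROWS BETWEEN 2 PRECEDING AND CURRENT ROW":
# the window is simply shorter for the first rows.
ma = sales.rolling(window=3, min_periods=1).mean()
print(ma.tolist())  # [10.0, 15.0, 20.0, 30.0]
```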

6. Window Function:

ROW_NUMBER() OVER(PARTITION BY department ORDER BY salary DESC)

7. Remove duplicates:

SELECT DISTINCT * FROM table_name;

8. WHERE vs HAVING:

o WHERE filters rows before aggregation, HAVING filters after.
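The before/after distinction can be reproduced with the standard library's sqlite3 (a minimal sketch; table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO Orders VALUES (?, ?)",
                 [(1, 50), (1, 70), (2, 20), (2, 20), (3, 200)])

# WHERE runs BEFORE aggregation: customer 2's small rows never reach SUM(),
# so customer 2 disappears entirely.
where_rows = conn.execute("""
    SELECT customer_id, SUM(amount) FROM Orders
    WHERE amount > 30
    GROUP BY customer_id ORDER BY customer_id
""").fetchall()

# HAVING runs AFTER aggregation: customer 2's rows are summed first (40 > 30),
# so customer 2 survives.
having_rows = conn.execute("""
    SELECT customer_id, SUM(amount) FROM Orders
    GROUP BY customer_id
    HAVING SUM(amount) > 30 ORDER BY customer_id
""").fetchall()

print(where_rows)   # [(1, 120.0), (3, 200.0)]
print(having_rows)  # [(1, 120.0), (2, 40.0), (3, 200.0)]
```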

9. Customers with no orders:

SELECT * FROM Customers WHERE customer_id NOT IN (SELECT customer_id FROM Orders);

10. LEFT vs FULL OUTER JOIN:

 LEFT returns all from left and matched from right; FULL OUTER returns all from both.
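The same contrast shows up in pandas with `merge` (frames and column names are illustrative):

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cal"]})
orders = pd.DataFrame({"id": [2, 3, 4], "total": [10, 20, 30]})

# LEFT JOIN: every customer kept; order id=4 (no matching customer) is dropped.
left = customers.merge(orders, on="id", how="left")

# FULL OUTER JOIN: unmatched rows from BOTH sides kept, gaps become NaN.
outer = customers.merge(orders, on="id", how="outer")

print(left["id"].tolist())           # [1, 2, 3]
print(sorted(outer["id"].tolist()))  # [1, 2, 3, 4]
```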

✅ Python for Data Analysis

1. Handle missing values:

df.dropna(), df.fillna(), df.isnull().sum()

2. loc vs iloc:

o loc[] selects by label; iloc[] selects by integer position.
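A two-line demonstration with a deliberately shuffled string index (illustrative data):

```python
import pandas as pd

# String labels deliberately out of alphabetical order.
df = pd.DataFrame({"sales": [100, 200, 300]}, index=["b", "c", "a"])

print(df.loc["a", "sales"])   # 300 -> selected by LABEL "a"
print(df.iloc[0]["sales"])    # 100 -> selected by POSITION 0 (label "b")
```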

3. Merge datasets:
pd.merge(df1, df2, on='id', how='inner')

4. Group by and average:

df.groupby('category')['sales'].mean()

5. Lambda function:

df['price_with_tax'] = df['price'].apply(lambda x: x*1.18)

6. Visualize data:

import seaborn as sns

sns.barplot(x='city', y='pollution', data=df)

7. apply vs map:

o Series.map() applies element-wise to a Series. apply() works on both Series and
DataFrames (row- or column-wise).

8. Filter rows:

df[df['value'] > 100]

9. df.describe():

o Returns count, mean, std, min, max, quartiles.

10. Remove outliers (IQR):

Q1 = df['value'].quantile(0.25)

Q3 = df['value'].quantile(0.75)

IQR = Q3 - Q1

df_filtered = df[(df['value'] >= Q1 - 1.5*IQR) & (df['value'] <= Q3 + 1.5*IQR)]

✅ Google Sheets / Excel

1. VLOOKUP vs XLOOKUP:

o VLOOKUP only searches right; XLOOKUP can search in both directions and is more flexible.

2. Remove duplicates:

o Use Data > Remove Duplicates

3. Extract year:

=YEAR(A1)

4. FILTER function:

=FILTER(A2:A100, B2:B100 > 500)

5. Pivot tables:

o Used to summarize data (sum, count, avg) quickly.


6. Conditional formatting:

o Format cells based on rules (e.g., > 100 turns red).

7. QUERY function:

=QUERY(A1:D100, "SELECT A, B WHERE C > 100")

8. Combine text:

=A1 & " " & B1

9. Absolute vs Relative:

o Absolute ($A$1) stays fixed; relative (A1) shifts when copied.

10. Top 5 sales:

=LARGE(A1:A100, {1,2,3,4,5})
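The same "top 5" idea in Python, for comparison, via `heapq.nlargest` (the sales figures are illustrative):

```python
import heapq

sales = [120, 870, 430, 990, 150, 760, 300, 610]

# Same idea as =LARGE(range, {1,2,3,4,5}): the five largest values, descending.
top5 = heapq.nlargest(5, sales)
print(top5)  # [990, 870, 760, 610, 430]
```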

✅ Behavioral / Interview Skills

1. Tell me about yourself.

o I am a data enthusiast pursuing B.Tech in CSE with hands-on experience in Power BI,
Python, and SQL. I’ve done data analysis projects, including ozone analysis and mess
management systems using Django.

2. Why data analyst?

o I love uncovering insights from data to support decision-making and storytelling through
visualizations.

3. Handling missing/inconsistent data:

o I analyze the pattern, fill with mean/median, or remove if too sparse. I ensure consistency
in format and units.

4. Incorrect dashboard before deadline:

o I’d cross-check formulas, consult documentation, verify source data, and communicate
promptly with stakeholders.

5. Prioritize tasks:

o I use priority matrices (urgent/important), break down tasks, and allocate time based on
deadlines.

6. Teamwork example:

o In my major project, I handled the backend, collaborated with frontend teammates, and
ensured data integrity.

7. Explaining technical concept to non-technical:

o I use analogies and visuals. For example, I explain joins as "bringing columns from two Excel
sheets using a common column."
8. Stay updated:

o I follow blogs, YouTube channels, and communities like Kaggle, DataCamp, and LinkedIn.

9. Mistake made:

o I once misused a join, inflating values. I learned the importance of verifying intermediate
outputs.

10. Where in 2 years?

 I see myself as a skilled data analyst, leading projects and contributing to meaningful business
decisions.

✅ Advanced & Tricky SQL Questions


1. What will happen if NULL is present in a NOT IN clause?

SELECT * FROM Customers WHERE customer_id NOT IN (SELECT customer_id FROM Orders);

o Answer: If the subquery returns even a single NULL, NOT IN returns no rows. Prefer using
LEFT JOIN ... IS NULL.
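This pitfall is easy to reproduce with the standard library's sqlite3 (a minimal, self-contained sketch; tables and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (customer_id INTEGER)")
conn.execute("CREATE TABLE Orders (customer_id INTEGER)")
conn.executemany("INSERT INTO Customers VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO Orders VALUES (?)", [(1,), (None,)])

# NOT IN against a list containing NULL is never TRUE, so nothing comes back,
# even though customers 2 and 3 really have no orders.
not_in = conn.execute("""
    SELECT customer_id FROM Customers
    WHERE customer_id NOT IN (SELECT customer_id FROM Orders)
""").fetchall()

# The NULL-safe rewrite with an anti-join finds them correctly.
anti_join = conn.execute("""
    SELECT c.customer_id FROM Customers c
    LEFT JOIN Orders o ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NULL
    ORDER BY c.customer_id
""").fetchall()

print(not_in)     # []
print(anti_join)  # [(2,), (3,)]
```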

2. Find duplicate records in a table.

SELECT name, COUNT(*) FROM Customers
GROUP BY name HAVING COUNT(*) > 1;

3. When should we use IN vs EXISTS vs JOIN?

o IN: Good for small lists.

o EXISTS: More efficient for checking presence, especially with correlated subqueries.

o JOIN: Used when you need columns from both tables.

4. What does ON 1=1 do in a SQL JOIN?

o It creates a CROSS JOIN — every row in table A combines with every row in table B.

5. Group by two columns – why and when?

SELECT city, product, COUNT(*) FROM Orders GROUP BY city, product;

o Used when aggregation needs to be done across combinations of dimensions, like sales by
city and product.

6. Identify customers who placed orders every month.

SELECT customer_id FROM Orders
GROUP BY customer_id
HAVING COUNT(DISTINCT MONTH(order_date)) = 12;

o Assumes the data covers a single year; otherwise include the year in the grouping.


✅ Python Data Analysis – Tricky & Practical

1. Detect and remove correlated features in a dataset.

corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df.drop(columns=to_drop, inplace=True)

2. Use of np.where() in data transformation.

df['category'] = np.where(df['sales'] > 500, 'High', 'Low')

3. Group by multiple columns and filter complex condition.

df.groupby(['city', 'month'])['sales'].sum().reset_index()

4. Difference between .apply() and vectorized operations?

o apply() is slower; vectorized operations use C-speed and are preferred for performance.
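A quick check that the two approaches agree (toy `df`, illustrative threshold):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [100, 600, 250, 900]})

# Row-by-row apply(): a Python-level loop under the hood.
slow = df["sales"].apply(lambda x: "High" if x > 500 else "Low")

# Vectorized np.where(): one C-level pass over the whole column.
fast = np.where(df["sales"] > 500, "High", "Low")

# Identical results; the vectorized form is dramatically faster on large frames.
print(list(slow), list(fast))
```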

✅ Power BI Interview Trick Questions

1. What happens when you use a slicer from one table on visuals from another unrelated table?

o If there's no relationship, slicer won’t filter the visuals.

2. Difference between calculated column vs measure?

o Calculated Column: Computed at row level and stored in model.

o Measure: Computed at query time and affected by filters.

3. How would you detect and fix performance issues in a large Power BI report?

o Remove unnecessary visuals, limit data load, use SUMMARIZECOLUMNS or VAR, reduce
cardinality, disable auto-date.

4. What happens if two tables have a circular relationship?

o Power BI won’t allow it; you'll need to redesign the model, possibly by de-normalizing or
creating a bridge table.

✅ Google Sheets / Excel Trick Questions

1. Return all rows where sales are in top 10%:

=FILTER(A2:B100, B2:B100 >= PERCENTILE(B2:B100, 0.9))

2. What does $A$1, A$1, $A1 mean?

o Absolute/relative referencing:

 $A$1: Fixed cell

 A$1: Fixed row

 $A1: Fixed column

3. QUERY function with dynamic range:

=QUERY(INDIRECT("Sheet1!A1:D" & COUNTA(A:A)), "SELECT A, B WHERE D > 100")

✅ Data Scenario-Based Questions

1. You noticed the report showing unusually high revenue. What steps will you take?

o Check DAX logic, data source accuracy, joins causing duplicates, or row context errors.

2. You’re told to generate insights from an unfamiliar dataset. What's your approach?

o Understand business goal → df.info() and df.describe() → Handle missing/outliers →
Group by → Correlation → Visualize

3. If you're asked to predict future sales using Excel only, how would you do it?

o Use FORECAST.ETS(), or build trendline regression in charts.

✅ More Behavioral / HR Questions (Data Role Focused)

1. Tell me a time you had a data conflict – how did you resolve it?

2. How do you explain a technical problem to a stakeholder who has no tech background?

3. What do you do if you have incomplete data but a deadline to deliver?

4. Describe a time you automated a task. What was the impact?

5. Which of your projects had the most business impact? How?

Here are 10 more mock interview questions with answers tailored to a Data Analyst role (especially for
freshers or early-career candidates), covering SQL, Power BI, Python, Excel/Sheets, and analytics
thinking. These can be added to your interview prep:

✅ Additional Mock Interview Q&A (Data Analyst Role)

1. What is a foreign key in SQL?

o A foreign key is a column that creates a relationship between two tables. It refers to the
primary key in another table, ensuring referential integrity.
2. When would you use a CROSS JOIN?

o A CROSS JOIN is used when you need all combinations of rows from two tables. It's often
used in generating scenarios or testing combinations, like pairing products and regions.

3. Explain how GROUP BY works with multiple columns.

o When you use GROUP BY col1, col2, the result is grouped by unique combinations of both
columns. It's useful for aggregated metrics across multiple dimensions (e.g., sales by region
and category).

4. What is the use of ON 1=1 in joins?

o ON 1=1 is a condition that is always true, used in CROSS JOIN or for creating a flexible JOIN
structure when filters are applied later using WHERE. It allows merging all rows with all
others (Cartesian product).

5. When to use IN vs NOT IN?

o Use IN when you want to match values from a list or subquery. Use NOT IN to filter out
such values. Be cautious: NOT IN returns no rows if the subquery returns a NULL. Prefer
LEFT JOIN ... WHERE col IS NULL for safety.

6. Difference between COUNT(*) and COUNT(column_name)?

o COUNT(*) counts all rows including NULLs. COUNT(column_name) counts only non-null
values in that column.
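The difference is one query away with sqlite3 (illustrative table and values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, bonus REAL)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [("a", 100), ("b", None), ("c", 50)])

# COUNT(*) counts every row; COUNT(bonus) skips the NULL bonus.
rows, bonuses = conn.execute(
    "SELECT COUNT(*), COUNT(bonus) FROM Employee").fetchone()
print(rows, bonuses)  # 3 2
```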

7. How would you explain correlation to a non-technical person?

o Correlation shows how strongly two things move together. For example, if ice cream sales
increase with temperature, they are positively correlated.

8. What is the difference between INNER JOIN and FULL OUTER JOIN?

o INNER JOIN returns only matched rows. FULL OUTER JOIN returns all records from both
tables, matched or not.

9. Explain how a filter context works in Power BI.

o Filter context determines what subset of data is evaluated for a calculation, based on filters
applied via visuals, slicers, or page/report level.

10. What is a moving average and where would you use it?

 A moving average smooths fluctuations by averaging over a window of values. Useful in trend
analysis (e.g., sales over time) to identify underlying patterns.

🔹 Advanced SQL & Database Design


Explain normalization and denormalization. When would you use each?

 Normalization organizes data to reduce redundancy and improve integrity by dividing tables into
smaller related tables.
 Denormalization combines tables to reduce JOINs, improving read performance.

 Use normalization for OLTP systems and denormalization for OLAP/reporting scenarios.

Design a schema for a ride-sharing app (like Uber). Explain your design decisions.
Tables:

 Users (user_id, name, role [rider/driver], rating, phone)

 Rides (ride_id, user_id, driver_id, start_time, end_time, status, fare)

 Locations (location_id, address, lat, long)

 Payments (payment_id, ride_id, user_id, amount, method, status)

 Separation ensures modularity, scalability, and efficient queries.

Common performance issues in SQL queries, and how to resolve them

 Missing indexes ➝ Add proper indexes

 Too many joins ➝ Consider denormalization or CTEs

 Select * ➝ Use specific columns

 Inefficient filters ➝ Rewrite WHERE clauses

 Large data scans ➝ Partitioning, indexing

Write a query to identify users whose monthly purchase value increased consistently for 3 months.

WITH monthly_spend AS (

SELECT user_id, DATE_TRUNC('month', purchase_date) AS month,

SUM(amount) AS total

FROM purchases

GROUP BY user_id, month

),

ranked AS (

SELECT *,

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY month) AS rn

FROM monthly_spend

),

joined AS (

SELECT a.user_id

FROM ranked a

JOIN ranked b ON a.user_id = b.user_id AND a.rn = b.rn - 1

JOIN ranked c ON a.user_id = c.user_id AND a.rn = c.rn - 2

WHERE a.total < b.total AND b.total < c.total

)

SELECT DISTINCT user_id FROM joined;
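The self-join logic can be sanity-checked in pandas on hypothetical monthly totals (the frame below stands in for the monthly_spend CTE's output; all values are illustrative):

```python
import pandas as pd

# Hypothetical pre-aggregated monthly totals per user.
ms = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "month":   ["2024-01", "2024-02", "2024-03"] * 2,
    "total":   [100, 150, 200, 300, 250, 400],
}).sort_values(["user_id", "month"])

g = ms.groupby("user_id")["total"]

# Same condition as the self-joins: current < next < next-after-next.
rising = ms[(g.shift(-1) > ms["total"]) & (g.shift(-2) > g.shift(-1))]
print(rising["user_id"].unique().tolist())  # [1]
```

Only user 1 spends strictly more each month; user 2 dips in the middle month and is excluded.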

Explain indexing. When does it hurt performance instead of helping?

 Indexes speed up read operations by creating lookup structures.

 Hurts performance when:

o Too many indexes increase insert/update overhead

o Indexes not used by queries

o Wrong type (e.g., full-text where B-Tree needed)

Handling Slowly Changing Dimensions in SQL

 Type 1: Overwrite

 Type 2: Add row with start/end dates

 Type 3: Add new column for previous value

 Depends on whether history needs to be preserved.

🔹 Power BI / Dashboarding & Reporting (Advanced)

Optimize large Power BI dashboards for performance

 Use star schema

 Reduce visuals and filters

 Use aggregations

 Use DirectQuery with care

 Disable auto date/time

Dynamic measures based on slicer input

 Use SWITCH(TRUE(), ...) with disconnected tables

 Create slicer table with measure names

Experience in Row-Level Security (RLS)

 Used RLS to restrict users to see only their region

 Created roles in Power BI Service

 Applied DAX filter like [Region] = USERNAME()

Publishing to Power BI Service & Workspace Access

 Published reports to shared workspace


 Managed roles and access levels (Viewer, Contributor)

Executive vs Operational Dashboards

 Executive: High-level KPIs, strategic, less detail, mobile-friendly

 Operational: Detailed, real-time or daily metrics, more filters/interactions

Challenges with Real-Time Dashboards

 Data latency

 API limits

 Solution: Push datasets, streaming dataflows, hybrid tables

🔹 Python for Data Analytics

Pandas & NumPy for performance tuning

 Used vectorized operations over loops

 Set data types to reduce memory

 Used .loc over .apply()

Used Dask/Polars for large datasets

 Dask for parallelized DataFrame processing

 Polars for lightning-fast query speed

End-to-end EDA process

 Data loading ➝ Cleaning ➝ Profiling ➝ Visualizing ➝ Feature engineering ➝ Automated report
using Pandas-Profiling or Jupyter

APIs for Data Fetching

 Used requests and json to call REST APIs

 Auth ➝ Fetch ➝ Parse ➝ Store ➝ Validate

Python pipelines + Power BI/Tableau

 Prepared dataset in Python ➝ Saved to SQL/CSV ➝ Used Power BI to connect and refresh

Real-life Feature Engineering

 Built time since last login, user segmentation, session duration

 Improved churn model accuracy by 12%

🔹 Excel / Google Sheets - Advanced Use

Working with Excel >1M rows

 Split files

 Use Power Query or SQL Server


 Use Excel Data Model (Power Pivot)

Excel for Financial Modeling

 Created forecast models using historicals + assumptions

 Used FORECAST.ETS, NPV, IRR

Custom VBA or Excel Script

 VBA script for automated report generation & email

 Example: Loop through sheets and send PDFs

Power Query + Power Pivot

 Used Power Query to clean/transform

 Power Pivot for DAX models, linked tables, KPIs

Most complex Excel dashboard

 Sales dashboard with slicers, pivot tables, KPIs, trendlines, interactive scenario modeling

🔹 Stakeholder Management / Soft Skills

Conflicting KPIs

 Arranged meeting ➝ Understood each team's goal ➝ Proposed unified KPI with subset views

Last-minute data requests

 Built reusable templates

 Prioritized based on impact

 Automated ETL where possible

Insights ➝ Business Change

 Identified top churn reasons ➝ Triggered retention campaigns ➝ Reduced churn by 18%

Uncertainty/Incomplete Data

 Used confidence intervals, annotated visuals

 Communicated assumptions

Requirement Gathering

 Conducted discovery meetings

 Asked business questions, not just technical

 Created mockups and shared for feedback

🔹 Business Acumen & Problem Solving

40% Drop in User Engagement


 Checked tracking/data accuracy ➝ Analyzed funnel ➝ Segment-wise trend ➝ Interviewed users ➝
Recommended fixes

Good KPI vs Vanity Metric

 KPI: Conversion Rate (actionable)

 Vanity: Page Views (non-actionable)

Unstructured Feedback Approach

 Clean ➝ Tokenize ➝ Sentiment Analysis ➝ Topic Modeling ➝ Visualization

Design A/B Test

 Split random users into control/test

 Define success metric

 Use T-test to validate significance

Biggest Data Challenge

 Consolidating 7 data sources with inconsistent schemas

 Created unified model ➝ Enabled centralized reporting ➝ Saved 25+ hours/month

Advanced SQL & Database Design


1. Explain normalization and denormalization. When would you use each?

 Normalization: Organizes data to minimize redundancy and maintain data integrity by splitting
tables into related entities (1NF, 2NF, 3NF, BCNF, etc.). Used mostly in OLTP systems where data
consistency and update efficiency are critical.

 Denormalization: Combines data into fewer tables to reduce the need for complex JOINs,
improving read performance. Used in OLAP, reporting, or read-heavy systems where query speed
matters more than write efficiency.

2. Design a schema for a ride-sharing app (like Uber). Explain your design decisions.

 Users: (user_id, name, phone, email, role [driver/rider], rating)

 Rides: (ride_id, rider_id, driver_id, start_location_id, end_location_id, start_time, end_time,
status, fare)

 Locations: (location_id, latitude, longitude, address)

 Payments: (payment_id, ride_id, amount, payment_method, payment_status, timestamp)


 Design reasons: clear separation of entities, scalability, indexing on user and ride IDs, flexible
location referencing.

3. What are common performance issues in SQL queries, and how do you resolve them?

 Missing or inefficient indexes → Add proper indexes on frequently queried columns.

 SELECT * fetching unnecessary columns → Select only needed columns.

 Large table scans → Use partitioning or filter with indexed columns.

 Excessive JOINs → Optimize joins, consider denormalization if needed.

 Non-sargable queries (functions on indexed columns) → Rewrite filters to be sargable.

4. Write a query to identify users whose monthly purchase value increased consistently for 3 months.

WITH monthly_spend AS (

SELECT user_id, DATE_TRUNC('month', purchase_date) AS month,

SUM(amount) AS total

FROM purchases

GROUP BY user_id, month

),

ranked AS (

SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY month) AS rn

FROM monthly_spend

),

joined AS (

SELECT a.user_id

FROM ranked a

JOIN ranked b ON a.user_id = b.user_id AND a.rn = b.rn - 1

JOIN ranked c ON a.user_id = c.user_id AND a.rn = c.rn - 2

WHERE a.total < b.total AND b.total < c.total

)

SELECT DISTINCT user_id FROM joined;

5. Explain indexing. When does it hurt performance instead of helping?

 Indexes speed up read queries by creating fast lookup structures.

 They hurt performance on writes (INSERT, UPDATE, DELETE) because indexes must be updated.

 Excessive or unused indexes consume storage and slow writes.


 Incorrect index types or on low-cardinality columns may be ineffective.

6. How would you handle slowly changing dimensions in SQL?

 Type 1: Overwrite existing data (no history).

 Type 2: Add new row with versioning or start/end dates (full history).

 Type 3: Add new columns to store previous values (limited history).

 Choose based on whether historical tracking is required.

Power BI / Dashboarding & Reporting (Advanced)

1. How do you optimize large Power BI dashboards for performance?

 Use star schema data models.

 Limit the number of visuals and slicers.

 Use aggregations and pre-calculated tables.

 Disable auto date/time.

 Prefer Import mode or optimize DirectQuery usage.

2. How do you handle dynamic measures based on slicer input (e.g., change KPI based on dropdown)?

 Create a disconnected slicer table with KPI names.

 Use DAX with SWITCH(TRUE()) or SELECTEDVALUE() to change measures dynamically.

3. Explain your experience in Row-Level Security and real-world use case.

 Implemented RLS to restrict regional sales data visibility.

 Created roles with DAX filters, e.g., [Region] = USERPRINCIPALNAME() to limit data by logged-in
user.

4. Have you published reports to Power BI Service and managed workspace access?

 Yes, published reports to shared workspaces.

 Managed roles: Viewer, Contributor, Admin.

 Controlled dataset refresh schedules and permissions.

5. How do you build executive-level dashboards vs operational dashboards? What changes?

 Executive dashboards: high-level KPIs, minimal detail, visually clean, mobile-friendly.

 Operational dashboards: detailed metrics, real-time or daily data, more slicers and interactions.

6. What are the challenges with real-time dashboards and how do you overcome them?

 Challenges: data latency, API rate limits, data volume.

 Solutions: Use streaming datasets, push dataflows, incremental refresh, optimize data sources.
Python for Data Analytics

1. Explain how you’ve used Pandas and NumPy for performance tuning.

 Vectorized operations over loops for speed.

 Downcast data types to reduce memory.

 Avoided apply/lambda when possible.

 Used .loc and .iloc for efficient slicing.

2. Have you used Dask, Polars, or other tools to work with large datasets? Why and how?

 Used Dask for parallelized processing on datasets larger than RAM.

 Used Polars for fast DataFrame queries with less memory.

 Integrated these tools when Pandas was insufficient.

3. Explain your end-to-end EDA process using Python, including automation of reports.

 Load and clean data → Handle missing values → Generate statistical summaries → Visualize
distributions and correlations → Create automated HTML reports using Pandas Profiling or
Sweetviz.

4. Have you worked with APIs to fetch data for analysis? Explain the workflow.

 Authenticated via tokens → Made HTTP GET requests using requests → Parsed JSON → Stored
data in DataFrames → Cleaned and analyzed.

5. How do you integrate Python data pipelines with Power BI or Tableau?

 Output data to SQL databases or CSV files.

 Connect Power BI/Tableau to these sources with scheduled refresh.

 Alternatively, embed Python scripts in Tableau.

6. Explain a real-life use case where you used feature engineering to improve analysis.

 Created features like user activity recency, session duration, and segmented users by behavior.

 Improved churn prediction accuracy by 12%.

Excel / Google Sheets - Advanced Use

1. How do you handle working with Excel files >1 million rows?

 Split data into multiple files or sheets.

 Use Power Query to process data externally.

 Use Power Pivot to create data models.

 Move to database solutions if needed.


2. Explain how you’ve used Excel for financial modeling or forecasting.

 Built models incorporating assumptions, scenarios, and historical data.

 Used functions like FORECAST.ETS, NPV, and IRR.

 Created sensitivity analyses with data tables.

3. Can you write custom VBA or Excel Script? Give an example.

 Wrote VBA macro to automate monthly report generation: loop through sheets, export PDFs, and
email them automatically.

4. How have you used Power Query and Power Pivot in Excel to build a model?

 Used Power Query to clean and transform raw data.

 Built relationships and DAX measures in Power Pivot for analysis and KPIs.

5. What’s the most complex Excel dashboard or model you’ve built?

 Sales performance dashboard with interactive slicers, pivot tables, dynamic charts, and scenario
modeling for forecasting.

Stakeholder Management / Soft Skills

1. Describe a time when stakeholders requested conflicting KPIs. How did you resolve it?

 Facilitated discussions to understand each stakeholder’s goal.

 Proposed consolidated KPIs that addressed multiple perspectives.

 Delivered separate views for different teams.

2. How do you handle last-minute data requests during a report delivery sprint?

 Prioritize requests by business impact.

 Use pre-built data templates or automated reports for quick turnaround.

 Communicate realistic timelines clearly.

3. Explain a project where your insights led to a business change. How did you measure impact?

 Analyzed customer churn drivers → Recommended retention campaign → Reduced churn by 18%
in 3 months (tracked via KPIs).

4. Describe how you communicate uncertainty or incomplete data to non-technical managers.

 Use visual cues (confidence intervals, error bars).

 Explain assumptions and data gaps in simple terms.

 Suggest cautious interpretation or further data collection.

5. How do you gather requirements for an analytics solution from a non-technical team?

 Conduct discovery meetings using open-ended questions.


 Translate business problems into measurable metrics.

 Use mockups and prototypes for feedback.

Business Acumen & Problem Solving

1. You notice a 40% drop in user engagement last month. What steps would you take?

 Verify data accuracy.

 Segment user groups and analyze funnel metrics.

 Look for recent changes or external factors.

 Conduct user surveys or interviews.

 Recommend targeted actions.

2. How do you define a good KPI vs a vanity metric? Give examples.

 Good KPI: actionable and aligned to business goals, e.g., Conversion Rate.

 Vanity metric: easy to measure but not actionable, e.g., total page views without context.

3. You are given unstructured text data from customer feedback. What’s your approach?

 Clean and tokenize text.

 Perform sentiment analysis.

 Use topic modeling (LDA) to identify themes.

 Visualize findings with word clouds or dashboards.

4. Design an experiment (A/B test) to determine the effectiveness of a new UI feature.

 Randomly assign users to control (old UI) or test (new UI) groups.

 Define success metrics (e.g., click-through rate).

 Collect data over sufficient time.

 Use statistical tests (t-test, chi-square) to analyze significance.
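The t-statistic can be computed by hand with the standard library (Welch's form, which does not assume equal variances; the click-through rates below are entirely hypothetical):

```python
import math
import statistics as st

# Hypothetical per-user click-through rates: control (old UI) vs test (new UI).
control = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11]
test = [0.14, 0.15, 0.13, 0.16, 0.14, 0.15]

m1, m2 = st.mean(control), st.mean(test)
v1, v2 = st.variance(control), st.variance(test)   # sample variances
n1, n2 = len(control), len(test)

# Welch's t statistic: difference in means over the combined standard error.
t = (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)
print(round(t, 2))  # a large positive t suggests a significant lift
```

In practice `scipy.stats.ttest_ind(test, control, equal_var=False)` would also return the p-value.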

5. What’s the biggest data challenge you've solved that had high business impact?

 Integrated 7 disparate data sources with inconsistent schemas into a unified data model, enabling
centralized reporting and saving 25+ hours monthly in manual consolidation.

Flashcards
Flashcard 1
Q: What is normalization in SQL, and when is it used?
A: Normalization is the process of organizing data to reduce redundancy and improve data integrity. It is
typically used in OLTP systems to ensure consistent updates.
Flashcard 2
Q: When is denormalization preferred over normalization?
A: Denormalization is preferred in read-heavy systems or OLAP scenarios where faster query performance
is needed, and some redundancy is acceptable.

Flashcard 3
Q: What is a star schema, and why is it preferred in Power BI?
A: A star schema is a database structure with a central fact table connected to dimension tables. It
improves performance and simplifies DAX calculations in Power BI.

Flashcard 4
Q: How do you optimize Power BI performance on large datasets?
A: Use Import mode, reduce visuals, apply aggregations, disable auto-date/time, and optimize data
models with star schema.

Flashcard 5
Q: What is Row-Level Security (RLS) in Power BI?
A: RLS restricts data access based on user roles. It is implemented using DAX filters and managed in Power
BI Service.

Flashcard 6
Q: How do you handle conflicting KPI requirements from stakeholders?
A: Understand each team's goals, propose a consolidated KPI structure, and provide custom views if
needed.

Flashcard 7
Q: What causes SQL performance issues and how can they be fixed?
A: Issues include missing indexes, large scans, and bad JOINs. Fix by indexing, filtering, and optimizing
schema design.

Flashcard 8
Q: What is a Type 2 Slowly Changing Dimension (SCD)?
A: Type 2 SCD stores historical data by adding a new row with versioning or date ranges to track changes.

Flashcard 9
Q: How can you use Python for EDA automation?
A: Load data with Pandas, clean it, generate profiling reports using Pandas Profiling or Sweetviz, and
export visuals.

Flashcard 10
Q: What is the benefit of using Dask or Polars with large datasets?
A: They enable out-of-core or parallel processing, handling data that doesn’t fit into memory and speeding
up analysis.
