Data Analytics Questions
Data Analytics Questions
1 What is the difference between Data Mining and Data Profiling? Data Mining Data Profiling Data
mining is the process of finding relevant information which has not been found before Data profiling is
usually done to assess a dataset for its uniqueness, consistency and logic It is the way in which raw
data is turned into valuable information It cannot identify incorrect or inaccurate data values
3. 2 Define the term Data Wrangling in data analytics? Data Wrangling is the process of cleaning,
structuring and enriching the raw data into a desired usable format for better decision making
ValidateDiscover Structure Clean Enrich Analyze
4. What’s in it for you? Data Analytics Interview Click here to watch the video
5. 3 What are the common problems that data analysts encounter during analysis? Collecting the
meaningful right data and the right time Handling duplicate and missing values Handling data purging
and storage problems Making data secure and dealing with compliance issues
6. 4 What are the various steps involved in any analytics project? Understand the problem Data
collection Data cleaning Data exploration and analysis Interpret the results 1 2 34 5
7. 5 Which technical tools have you used for analysis and presentation purposes? Being a data analyst,
you are expected to have knowledge of the below tools for analysis and presentation purposes
8. 6 What are the best practices for data cleaning? Make a data cleaning plan by understanding where
the common errors take place and keep communications open Identify and remove duplicates before
working with the data. This will lead to an effective data analysis process Focus on the accuracy of the
data. Maintain the value types of data, provide mandatory constraints and set cross-field validation
Standardise the data at the point of entry so that it is less chaotic and you will be able to ensure that all
information is standardised, leading to fewer errors on entry
9. 7 How can you handle missing values in a dataset? In listwise deletion method, an entire record is
excluded from analysis if any single value is missingListwise deletion Use the average value of the
responses from the other participants to fill in the missing valueAverage imputation You can use
multiple-regression analysis to estimate a missing valueRegression substitution It creates plausible
values based on the correlations for the missing data and then averages the simulated datasets by
incorporating random errors in your predictions Multiple imputation
10. 8 What do you understand by the term Normal Distribution? Normal Distribution is a type of
continuous probability distribution that is symmetric about the mean and in a graph, normal
distribution will appear as a bell curve 0 1 2 3-3 -2 -1 34% 34% 13.5% 13.5% 2.5% 2.5% 0.15%0.15%
• The mean, median and mode are equal • All of them are located at the centre of the distribution •
68% of the data lies within 1 standard deviation of the mean • 95% of the data falls within 2 standard
deviations of the mean • 99.7% of the data lies within 3 standard deviations of the mean
11. 9 What is Time Series analysis? Time Series analysis is a statistical method that deals with ordered
sequence of values of a variable at equally spaced time intervals Time series data on Covid19 cases
Time series graph
12. 10 How is joining different from blending in Tableau? Data Joining Data Blending Data joining
can only be done when the data comes from the same source Data Blending is used when the data is
from 2 or more different sources Eg: combining two Tables from the same databases, or two or more
worksheets from the same Excel file Eg: Combining the Oracle table with SQL Server, or two sheets
from Excel, or combining Excel sheet and Oracle table All the combined tables or sheets contains
common set of Dimensions and Measures In Data Blending, each data source contains its own set of
Dimensions and Measures A B
13. 11 How is overfitting different from underfitting? Overfitting Underfitting Model trains the data
too well using the training set The model neither trains the data well nor can generalize to new data
The performance drops significantly over the test set Performs poorly both on train and the test set
Happens when the model learns the noise and random fluctuations in the training dataset in detail
Happens when there is less data to build an accurate model and also when we try to build a linear
model with a non-linear data
14. 12 What is the correct syntax for reshape() function in NumPy? (a) array.reshape(shape) (b)
reshape(shape, array) (c) reshape(array, shape) (d) reshape(shape)
15. 12 What is the correct syntax for reshape() function in NumPy? (a) array.reshape(shape) (b)
reshape(shape, array) (c) reshape(array, shape) (d) reshape(shape) Example
16. 13 What is the difference between COUNT, COUNTA, COUNTBLANK and COUNTIF in Excel?
Sales Table COUNT function returns the count of numeric cells in a range Output
17. 13 What is the difference between COUNT, COUNTA, COUNTBLANK and COUNTIF in Excel?
Sales Table COUNTA function returns the count of non-blank cells in a range Output
18. 13 What is the difference between COUNT, COUNTA, COUNTBLANK and COUNTIF in Excel?
Sales Table COUNTBLANK function returns the count of blank cells in a range Output 3
19. 13 What is the difference between COUNT, COUNTA, COUNTBLANK and COUNTIF in Excel?
Sales Table COUNTIF function returns the count of values by checking a given condition Output
20. 14 Explain how VLOOKUP works in Excel? VLOOKUP is used when you need to find things in a
table or a range by row Syntax: VLOOKUP(lookup_value, table_array, col_index_num,
[range_lookup]) lookup_value - The value you want to look up table_array - The range where the
lookup value is located col_index_num - The column number in the range that contains the return
value range_lookup – Specify TRUE if you want an approximate match or FALSE if you want an
exact match of the return value
21. 14 Explain how VLOOKUP works in Excel? VLOOKUP is used when you need to find things in a
table or a range by row Find the age of Prince by looking up to his name Table from where you want
to fetch the value of certain columns Find the height of Angela by looking up to her name
22. To subset or filter data in SQL, we use WHERE and HAVING clause Title Director Year Duration
Race Stephen Hopkins 2016 134 Cars John Lasseter 2006 117 Toy Story John Lasseter 1995 81 The
Incredibles Brad Bird 2004 116 Brave Brenda Chapman 2012 102 Ratatouille Brad Bird 2007 115
Vertigo Alfred Hitchcock 1958 128 Movies Table Filter the table for movies that were directed by
Brad Bird select * from Movies where Director = ‘Brad Bird’; 15 How do you subset or filter data in
SQL? Title Director Year Duration The Incredibles Brad Bird 2004 116 Ratatouille Brad Bird 2007
115
23. 15 To subset or filter data in SQL, we use WHERE and HAVING clause Title Director Year
Duration Race Stephen Hopkins 2016 134 Cars John Lasseter 2006 117 Toy Story John Lasseter 1995
81 The Incredibles Brad Bird 2004 116 Brave Brenda Chapman 2012 102 Ratatouille Brad Bird 2007
115 Vertigo Alfred Hitchcock 1958 128 Movies Table select Director, sum(Duration) as
total_duration, avg(Duration) as avg_duration from Movies group by Director having
avg(Duration)>115 Filter the table for directors whose movies have an average duration greater than
115 minutes How do you subset or filter data in SQL?
24. 16 What is the difference between WHERE and HAVING clause in SQL? WHERE clause
HAVING clause WHERE clause works on row data HAVING clause works on aggregated data In
WHERE clause, the filter occurs before any groupings are made HAVING is used to filter values from
a group Aggregate functions cannot be used Aggregate functions can be used Syntax: SELECT
column1, column2, ... FROM table_name WHERE condition; Syntax: SELECT column_name(s)
FROM table_name WHERE condition GROUP BY column_name(s) HAVING condition ORDER
BY column_name(s);
25. 17 What is the correct syntax for reshape() function in NumPy? (a) array.reshape(shape) (b)
reshape(shape, array) (c) reshape(array, shape) (d) reshape(shape)
26. 17 What is the correct syntax for reshape() function in NumPy? (a) array.reshape(shape) (b)
reshape(shape, array) (c) reshape(array, shape) (d) reshape(shape) Example
27. 18 What are the different ways to create a dataframe in Pandas? By initializing a list1 By
initializing a dictionary2
28. 19 Write the Python code to create an employees dataframe from the “emp.csv” file and display
the head and summary of it. To create a DataFrame in Python, you need to import the Pandas library
and use the read_csv function to load the .csv file Example: Display the head of the dataset Summary
of the dataset
29. 20 How will you select the Department and Age columns from an Employees dataframe? To select
Department and Age from the dataframe
30. 21 What is the criteria to say whether a developed data model is good or not? A good model
should be intuitive, insightful and self-explanatory The model developed should be able to easily
consumed by the clients for actionable and profitable results A good model should easily adapt to
changes according to business requirements If the data gets updated, the model should be able to scale
according to the new data
31. 22 What is the significance of Exploratory data analysis? Exploratory data analysis is an important
step in any data analysis process • Exploratory data analysis (EDA) helps to understand the data better
• It helps you obtain confidence in your data to a point where you’re ready to engage a machine
learning algorithm • It allows you to refine your selection of feature variables that will be used later for
model building • You can discover hidden trends and insights from the data
32. 23 How do you treat outliers in a dataset? An outlier is a data point that is distant from other
similar points. They may be due to variability in the measurement or may indicate experimental errors
• Drop the outlier records • Cap your outliers data • Assign a new value • Try a new transformation
33. 24 Explain descriptive, predictive, and prescriptive analytics. Descriptive Predictive Prescriptive
Provides insights into the past to answer “what has happened” Understands the future to answer “what
could happen” Suggest various courses of action to answer “what should you do” Uses data
aggregation and data mining techniques Uses statistical models and forecasting techniques Uses
optimization and simulation algorithms to advise possible outcomes Example: An ice cream company
can analyze how much ice cream was sold, which flavours were sold, and whether more or less ice
cream was sold than the day before Example: Predicts the sale of ice creams during summer, spring
and rainy days Example: Lower prices to increases sale of ice creams, produce more/less quantities of
a certain flavour of ice cream
34. 25 What are the different types of sampling techniques used by data analysts? Sampling is a
statistical method to select a subset of data from an entire dataset (population) to estimate the
characteristics of the whole population 1 Simple random sampling 2 Systematic sampling 3 Cluster
sampling 4 Stratified sampling 5 Judgmental or purposive sampling
35. 26 What are the different types of Hypothesis testing? Hypothesis testing is the procedure used by
statisticians and scientists to accept or reject statistical hypotheses Null hypothesis Hypothesis Testing
Alternative hypothesis It states that there is no relation between the predictor and outcome variables in
the population. It is denoted by H0 Example: There is no association between patient’s BMI and
diabetes It states that there is some relation between the predictor and outcome variables in the
population. It is denoted by H1 Example: There could be an association between patient’s BMI and
diabetes
36. 27 Describe univariate, bivariate, and multivariate analysis. Univariate Analysis It is the simplest
form of data analysis where the data being analysed contains only one variable Example – Studying
the heights of players in NBA Height in cm 180.5 190.5 200.3 185.8 196.4 Univariate analysis can be
described using: • Central Tendency • Dispersion • Quartiles • Bar charts • Histograms • Pie charts •
Frequency distribution tables
37. 27 Describe univariate, bivariate, and multivariate analysis. Bivariate Analysis It involves analysis
of two variables to find causes, relationships and correlations between the variables Example –
Analysing sale of ice creams based on the temperature outside Bivariate analysis can be explained
using: • Correlation coefficients • Linear regression • Logistic regression • Scatter plots • Box plots
Temperature in Celsius Ice cream sales 30 1500 35.6 1900 27.2 1200 40.3 2100 45.5 2500
38. 27 Describe univariate, bivariate, and multivariate analysis. Multivariate Analysis It involves
analysis of three or more variables to understand the relationship of each variable with the other
variables Example – Analysing Revenue based on expenditure Multivariate analysis can be performed
using: • Multiple regression • Factor analysis • Classification & regression trees • Cluster analysis •
Principal component analysis • Clustering bar chart • Dual axis charts TV Ads Newspaper Ads Social
media Ads Revenue $1000 $70 $600 $6000 $2000 $130 $750 $8000 $2600 $300 $900 $9500 $3000
$250 $470 $8500 $4500 $100 $1200 $10000
39. 28 What function would you use the get current date and time in Excel? In Excel, you can use the
TODAY() and NOW() function to get the current date and time To get the current date To get the date
and time
40. 29 Using the SUMIFS function in Excel, find the total quantity sold by sales representatives whose
name start with A and cost of each item they have sold is greater than 10 Sales Table Output
41. 30 Is the below query correct? If not, how will you rectify it? The above query is incorrect as we
cannot use the alias name while filtering data using the WHERE clause SELECT custid,
YEAR(order_date) AS order_year FROM Order WHERE YEAR(order_date) >= 2016; solution
SELECT custid, YEAR(order_date) AS order_year FROM Order WHERE order_year >= 2016; SQL
Query
42. 31 How are Union, Intersect and Except used in SQL? The Union operator is used to combine the
results of 2 or more SELECT statements Syntax: SELECT column_name(s) FROM table1 UNION
SELECT column_name(s) FROM table2; Region 1 Region 2
43. 31 How are Union, Intersect and Except used in SQL? The Intersect operator returns the common
records that are the results of 2 or more SELECT statements Syntax: SELECT column_name(s)
FROM table1 INTERSECT SELECT column_name(s) FROM table2; Region 1 Region 2
44. 31 How are Union, Intersect and Except used in SQL? The Except operator returns the uncommon
records that are the results of 2 or more SELECT statements Syntax: SELECT column_name(s)
FROM table1 EXCEPT SELECT column_name(s) FROM table2; Region 1 Region 2
45. 32 Using the product_price table, write a SQL query to find the record with fourth highest market
price select top 4 * from product_price order by mkt_price desc select top 1 from the above result that
is in ascending order of mkt_price
46. 33 From the product_price table, find the total and average market price for each currency where
average market price is greater than 100 and currency is in INR or AUD Product Price Table Output
SQL Query
47. 34 This question will test your knowledge in Tableau, exploring the different features of Tableau
and creating a suitable graph to solve a business problem
48. 34 Using the Sample Superstore dataset, create a view to analyse the sales, profits and quantity
sold across different subcategories of items present under each category Load the Sample – Superstore
dataset Drag Category and Sub-category on Rows and Sales on to Columns It will result in a
horizontal bar chart
49. 34 Using the Sample Superstore dataset, create a view to analyse the sales, profits and quantity
sold across different subcategories of items present under each category Drag Profit on to Colour and
Quantity on to Label Sort the Sales axis in descending order of sum of sales within each sub-category
50. 34 Using the Sample Superstore dataset, create a view to analyse the sales, profits and quantity
sold across different subcategories of items present under each category Chairs under Furniture
category had the highest sales and profit, while Tables had the lowest profit. For Office Supplies, sub-
category Binders made the highest profit even though Storage had the highest sales. Under
Technology category, Copiers made the highest profit though it has the least amount of sales`
51. 35 Create a dual axis chart in Tableau to present Sales and Profits across different years using
Sample Superstore dataset Load the Orders sheet from the Sample Superstore dataset Drag the Order
Date field from Dimensions on to Columns and convert it into continuous Month
52. 35 Create a dual axis chart in Tableau to present Sales and Profits across different years using
Sample Superstore dataset Drag Sales on to Rows and Profits to the right corner of the view until you
see a light green rectangle Synchronise the right axis by right clicking on the profit axis
53. 35 Create a dual axis chart in Tableau to present Sales and Profits across different years using
Sample Superstore dataset Under the Marks card, change SUM(Sales) to Bar and SUM(Profit) to Line
and adjust the size
54. 35 Create a dual axis chart in Tableau to present Sales and Profits across different years using
Sample Superstore dataset
55. 36 Design a view in Tableau to show State wise Sales and Profits using the Sample Superstore
dataset Drag the Country field on to the view section and expand it to see the States Drag the Sales
field on to Size and Profit on to Colour
56. 36 Design a view in Tableau to show State wise Sales and Profits using the Sample Superstore
dataset Increase the size of the bubbles, add a border and halo color States like Washington, California
and New York have the highest sales and profits. While Texas, Pennsylvania and Ohio has good
amount of sales but least profits
57. 36 Design a view in Tableau to show State wise Sales and Profits using the Sample Superstore
dataset
58. 37 Suppose there is an array num = np.array([[1,2,3],[4,5,6],[7,8,9]]). Extract the value 8 using 2D
indexing. Since the value 8 is present in the 2nd row and 1st column, we use the same index positions
and pass it to the array Below is the given array
59. 38 Suppose there is an array that has values [0,1,2,3,4,5,6,7,8,9]. How will display the following
values from the array - [1,3,5,7,9]? To get the desired result, you can perform the modulus operation
and check if the result is equal to 1
60. 39 There are 2 arrays ‘a’ and ‘b’. Stack the arrays a and b horizontally. a =
np.arange(10).reshape(2,-1) b = np.repeat(1, 10. reshape(2, -1) Method 1: Using concatenate function
Method 2: Using hstack function
61. 40 How can you add a column to a Pandas DataFrame? Suppose there is an emp dataframe that has
information about few employees. Let’s add Address column to that dataframe
62. 41 Using the below given data, create a pivot table to find the total sales made by each sales
representative for each item. Display the sales as % of grand total Select the entire table range, click on
Insert tab and choose PivotTable
63. 41 Using the below given data, create a pivot table to find the total sales made by each sales
representative for each item. Display the sales as % of grand total Select the table range and the
worksheet where you want to place the pivot table
64. 41 Using the below given data, create a pivot table to find the total sales made by each sales
representative for each item. Display the sales as % of grand total It will return a pivot table where you
can analyze your data
65. 41 Using the below given data, create a pivot table to find the total sales made by each sales
representative for each item. Display the sales as % of grand total Drag Sale total on to Values, and
Sales Rep and Item on to Row Labels. It will give the sum of sales made by each representative for
each item they have sold
66. 41 Using the below given data, create a pivot table to find the total sales made by each sales
representative for each item. Display the sales as % of grand total Right click on “Sum of Sale Total’
and expand Show Values As to select % of Grand Total
67. 41 Using the below given data, create a pivot table to find the total sales made by each sales
representative for each item. Display the sales as % of grand total Final Pivot Table
68. 42 Using the product and sales order detail table, find the products that have total units sold greater
than 1.5 million Product Table
69. 42 Using the product and sales order detail table, find the products that have total units sold greater
than 1.5 million Sales Order Detail Table
70. 42 Using the product and sales order detail table, find the products that have total units sold greater
than 1.5 million SQL Query
71. 43 How do you write a stored procedure in SQL? Let’s create a stored procedure to find the sum
the squares of the first N natural numbers. Formula: n*(n+1)*(2n+1)/6 Output: Display the sum of the
square for first 4 natural numbers
72. 44 Write a stored procedure to find the total even number between two user given numbers. Stored
procedure to count the even numbers To execute the stored procedure
73. 45 What is the difference between Treemaps and Heatmaps in Tableau? • Treemaps are used to
display data in nested rectangles • You use dimensions to define the structure of the treemap, and
measures to define the size or color of the individual rectangles • Treemaps are a relatively simple data
visualization that can provide insight in a visually attractive format
74. 45 What is the difference between Treemaps and Heatmaps in Tableau? • Heat map helps to
visualize measures against dimensions with the help of colors and size to compare one or more
dimensions & up to two measures • The layout is similar to a text table with variations in values
encoded as colors • In heat map, you can quickly see a wide array of information
75. 46 Using the Sample Superstore dataset, display the top 5 and bottom 5 customers based on their
profit Drag Customer Name field on to Rows and Profit on to Columns Right click on the Customer
Name column to create a Set
76. 46 Using the Sample Superstore dataset, display the top 5 and bottom 5 customers based on their
profit Give a name to the set and select Top tab to choose top 5 customers by sum(profit) Similarly,
create a set for the bottom 5 customers by sum(profit)
77. 46 Using the Sample Superstore dataset, display the top 5 and bottom 5 customers based on their
profit Select both the sets, right click to create a combined set Give a name to the set and choose All
members in both sets
78. 46 Using the Sample Superstore dataset, display the top 5 and bottom 5 customers based on their
profit Drag Top and Bottom Customers set on to Filters and Profit field on to Colour to get the desired
result
79. 47 How will print 4 random integers between 1 and 15 using NumPy? To generate Random
numbers using NumPy, we use the random.randint() function
80. 48 From the below DataFrame, how will you find the unique values for each column and subset
the data for Age<35 and Height>6? DataFrame To find the unique values and number of unique
elements, use the unique() and nunique() function Subset the data for Age<35 and Height>6
81. 49 Plot a sine graph using NumPy and Matplotlib library in Python.
82. 49 Plot a sine graph using NumPy and Matplotlib library in Python. Resultant plot
83. 50 Using the below Pandas dataframe, find the company with the highest average sales. Derive the
summary statistics for the sales column and transpose the statistics Dataframe Group the company
column and use the mean function to find the average sales FB has the highest average sales
84. 50 Using the below Pandas dataframe, find the average sales for each company. Derive the
summary statistics for the sales column and transpose the statistics Use the describe() function to find
the summary statistics Apply the transpose() function over the describe() method to transpose the
statistic