DATA ANALYTICS Unit III & IV


DATA ANALYTICS BCS SY

Unit III
Working with time series data and regression analysis.

1) Introduction to time series data - Time series data is data that is
collected or recorded over time, usually at regular intervals. In time series analysis, the
order of observations is crucial, as they are taken at successive points in time. This
type of data is commonly used in many fields, including finance, economics,
signal processing, and environmental science.

2) Understanding time series data and its importance - Understanding time
series data is crucial in various fields due to its ability to reveal patterns, trends,
and dependencies that can inform decision-making, forecasting, and analysis.
a) Pattern identification –
Time series data allows for the identification of patterns and trends over time.
Recognizing these patterns is essential for understanding the behavior of a
system or process.
b) Forecasting - Time series analysis enables the prediction of future values
based on historical data patterns. Forecasting is valuable in areas such as
finance, where it is used to predict stock prices, or in meteorology, where it is
used to predict future weather conditions.
c) Resource planning - In business and manufacturing, understanding time
series data is crucial for resource planning. It helps in predicting demand,
managing inventory, and optimizing production schedules.
d) Healthcare research - In healthcare, time series data is used for patient
monitoring, disease progression analysis, and predicting outbreaks. It plays a
vital role in understanding and managing various health-related phenomena.

3) Working with time series data in Excel: date and time -


A) Data import - If your time series data is large or stored in a different
format, you can import it into Excel. Use the "Data" tab and
choose options such as "From Text" or "From Workbook" to import external
data.
B) Date formatting - Excel may not always recognize date formats
automatically. Make sure your date or time column is formatted correctly. You
can use the "Format Cells" option under the "Home" tab to format the dates as
required.
C) Trend analysis - You can use Excel's built-in functions for trend analysis. For
example, the "LINEST" function can be used for linear regression, and the
"GROWTH" function can be used for exponential growth.


Trend Analysis and Forecasting:


1) Identifying trend and pattern in time series data - Trend analysis in data
analytics involves examining data over time to identify patterns, tendencies, or trends. This
process helps analysts and decision-makers understand how a particular variable or set of
variables changes and evolves over a specified period. Trend analysis is widely used in
various fields, including finance, marketing, economics, and healthcare, among others.
Key aspects of trend analysis:
❖ Time series data.

❖ Pattern Identification.

❖ Visualization.

❖ Forecasting.

❖ Business Intelligence.
2) Introduction to time series forecasting - Time series forecasting is a specialized area
of predictive analytics that involves making predictions about future values based on
historical data points ordered chronologically. In a time series, each data point is associated
with a specific timestamp, and the goal is to use the patterns and trends within the
historical data to make accurate predictions for future time points.
Key aspects of time series forecasting:
❖ Trend

❖ Seasonality

❖ Forecasting methods, such as statistical methods and machine learning.
❖ Data preprocessing.

❖ Cross-validation, e.g., splitting data into training and test sets.


3) Forecasting techniques in Excel - linear and polynomial trendlines:


A) Linear regression - Linear regression in Excel is a statistical technique that is
used to find the relationship between two variables by fitting a linear equation to
observed data. In the context of forecasting or prediction, linear regression can be
used to estimate the values of one variable (the dependent variable) based on the
values of another variable (the independent variable).
Formula: Y = b0 + b1X, where b0 is the intercept and b1 is the slope.

B) Polynomial trendline - A polynomial trendline is a curved line that is used
when data fluctuates. It is useful, for example, for analyzing gains and losses
over a large data set. The order of the polynomial can be determined by the
number of fluctuations in the data or by how many bends (hills and valleys)
appear in the curve.
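As a rough sketch, LINEST can also fit a polynomial by raising the x-values to successive
powers. Assuming x-values in A2:A13 and y-values in B2:B13, a second-order (quadratic)
fit can be entered as an array formula:

=LINEST(B2:B13, A2:A13^{1,2}) // returns b2, b1 and a for y = b2*x^2 + b1*x + a

Alternatively, a polynomial trendline of a chosen order can be added directly from a chart
(right-click the data series, then Add Trendline and choose Polynomial).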

4. Smoothing Technique: Moving Averages

A) Introduction to moving average as a smoothing technique - In Excel, a


moving average is a statistical calculation that is used to analyze data over a
certain period of time by creating a series of averages of different subsets of the
full data set. It is often used to smooth out fluctuations in data and highlight
trends or patterns.

There are different types of moving averages, but the most common one is the
Simple Moving Average (SMA). The Simple Moving Average is calculated by
taking the average of a set of data points over a specified period and then
moving the average to the next set of data points. The formula for calculating the
Simple Moving Average for a given data set is :

SMA = Sum of data points in the specified period / Number of data points in the
specified period

Steps: 1) Prepare your data - Arrange your data in a column in Excel.

2) Choose a period - Decide on the period for your moving average. For example, if
you want a 3-period moving average, you would use the average of the first 3
data points, then the next 3, and so on.


3) Calculate the moving average - Place the formula in the cell where you want
the moving average to start.

If your data is in column A and you are calculating a 3-period moving average,
and your first data point is in cell A2, the formula in cell B4 would be:
=AVERAGE(A2:A4)

B) Calculating simple, weighted, and exponential moving averages -


● Weighted moving avg - The Weighted Moving Average (WMA) assigns different
weights to different data points. Assume you have weights in column B
corresponding to your data in column A. For a 3-period WMA ending at cell A4,
divide the weighted sum of the values by the sum of the weights, as in the
sketch below.
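A minimal sketch of that 3-period WMA, assuming values in A2:A4 and the matching
weights in B2:B4:

=SUMPRODUCT(A2:A4, B2:B4) / SUM(B2:B4) // weighted average of the last 3 data points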

● Exponential moving avg - An Exponential Moving Average (EMA) is a type
of moving average that places more emphasis on recent data points,
giving them higher weight in the calculation compared to older data
points. This makes the EMA more responsive to changes in the data
series, especially when compared to the Simple Moving Average (SMA),
where all data points are weighted equally.
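As a minimal sketch, assuming the actual values start in cell A2 and the smoothing
factor alpha (for example 0.3) is in cell E1:

B2: =A2 // seed the EMA with the first observation
B3: =$E$1*A3 + (1-$E$1)*B2 // EMA = alpha*current value + (1-alpha)*previous EMA; fill down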

Applications of Exponential Moving Average -

1) Financial forecasting - Exponential smoothing is applied in finance
to predict future trends in stock prices, currency exchange rates, or
other financial indicators. Investors and financial analysts use these
forecasts to make informed decisions about investment strategies.
2) Sales forecasting - Retailers and manufacturers utilize exponential
smoothing to predict sales volumes for products. This helps in
planning production schedules, managing supply chains, and
optimizing marketing strategies based on anticipated demand.
3) Call center volume prediction - Call centers use exponential
smoothing to forecast the volume of incoming calls. This
information is valuable for efficiently scheduling staff, ensuring
optimal customer service levels, and managing resources
effectively.
4) Weather forecasting - Meteorologists use exponential smoothing
to predict weather patterns based on historical climate data. This


aids in providing short-term forecasts for temperature, precipitation,


and other meteorological variables.

5. Simple Linear Regression.

Simple linear regression is a statistical method used to model the relationship


between two continuous variables. In simple linear regression, one variable
(often denoted as X) is considered the independent variable or predictor variable,
while the other variable (often denoted as Y) is considered the dependent
variable or response variable.

The relationship between the two variables is assumed to be approximately


linear, meaning that changes in the independent variable are associated with changes in
the dependent variable in a straight-line fashion. The simple linear regression model is
represented by the equation:

Y = B0 + B1X + E

★ Y is the dependent variable.


★ X is the independent variable.
★ B0 is the intercept, which represents the value of Y when X is zero.
★ B1 is the slope, which represents the change in Y for a one-unit change in X.
★ E is the error term (epsilon, ε), representing the difference between the observed
and predicted values of Y. It captures the variability in Y that is not explained by
the linear relationship with X.

The goal of simple linear regression is to estimate the values of the coefficients B0 and
B1 that minimize the sum of the squared differences between the observed values of Y
and the values predicted by the regression model. This is often done using the method
of least squares.


Once the coefficients are estimated, the regression model can be used to make
predictions about the dependent variable based on values of the independent variable.
Additionally, the fit of the model can be assessed using various metrics such as the
coefficient of determination (R2) and hypothesis tests for the significance of the
coefficients.
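As a minimal sketch of these calculations, assuming X values in A2:A21 and Y values
in B2:B21:

=SLOPE(B2:B21, A2:A21) // estimate of B1
=INTERCEPT(B2:B21, A2:A21) // estimate of B0
=RSQ(B2:B21, A2:A21) // coefficient of determination (R2)
=FORECAST.LINEAR(25, B2:B21, A2:A21) // predicted Y for a new X value of 25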

6. Multiple Linear Regression.

Performing multiple linear regression in Excel involves using the built-in


functions to analyze a dataset with multiple independent variables (predictors)
and one dependent variable (outcome). Here's a step-by-step guide on how to do
it:

​ Prepare Your Data:


● Organize your data so that each row represents an observation, and each
column represents a variable.
● Ensure that you have a column for the dependent variable (outcome) and
separate columns for each independent variable (predictor).
​ Open Excel and Load Your Data:
● Open Excel and load your dataset into a new or existing spreadsheet.
● Make sure your dataset is arranged in a tabular format with variable
names in the header row and data in subsequent rows.
​ Activate the Data Analysis Toolpak:
● If you haven't already done so, activate the Data Analysis Toolpak add-in in
Excel. You can do this by clicking on "File" > "Options" > "Add-Ins" > "Excel
Add-Ins" > "Analysis Toolpak" > "Go..." and then checking the "Analysis
Toolpak" option.
​ Access the Data Analysis Toolpak:
● Once the Data Analysis Toolpak is activated, you can find it in the "Data"
tab on the Excel ribbon.
● Click on "Data Analysis" in the "Analysis" group to open the Data Analysis
dialog box.
​ Select Regression Analysis:
● In the Data Analysis dialog box, scroll down and select "Regression" from
the list of analysis tools.
● Click "OK" to proceed.


​ Enter Input Range and Options:


● In the Regression dialog box, specify the input range for your independent
variables (predictors) and dependent variable (outcome).
● Check the box for "Labels" if your data includes variable names in the first
row.
● Choose where you want the output to be displayed (e.g., a new worksheet
or a specific range in the current worksheet).
● Optionally, you can specify additional options such as confidence level and
residuals.
​ Run the Regression Analysis:
● Click "OK" to run the regression analysis.
● Excel will calculate the coefficients for each independent variable, as well
as other statistics such as R-squared, adjusted R-squared, standard error,
F-statistic, and p-values.
​ Interpret the Results:
● Review the output to interpret the results of the regression analysis.
● Pay attention to the coefficients for each independent variable, as they
represent the magnitude and direction of the relationship with the
dependent variable.
● Evaluate the significance of each coefficient using the p-values. Lower
p-values indicate greater significance.
● Consider other statistics such as R-squared to assess the overall fit of the
model.
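Besides the Toolpak, the same coefficients can be obtained with a worksheet formula. As
a sketch, assuming two predictors in A2:B31 and the outcome in C2:C31, entered as an
array formula:

=LINEST(C2:C31, A2:B31, TRUE, TRUE) // coefficients (reported in reverse column order) plus R-squared, standard errors, and the F-statistic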

7. Model Diagnostics And Validation

Model diagnostics and validation in Excel typically involve assessing the performance
and accuracy of a model built within Excel, such as a financial model, forecasting
model, or regression analysis. Here's a general overview of steps you can take for
diagnostics and validation:

​ Data Preparation:
● Ensure your data is clean, organized, and appropriately formatted.
● Split your data into training and testing sets if applicable.


​ Model Building:
● Construct your model using Excel functions, formulas, or add-ins.
● Document your model's assumptions, methodology, and limitations.
​ Diagnostic Checks:
● Perform basic checks to ensure your model is functioning correctly, such
as:
● Verifying formulas and references.
● Checking for errors or inconsistencies.
● Assessing outliers or anomalies in the data.
​ Model Evaluation:
● Evaluate the performance of your model using appropriate metrics.
● For forecasting or regression models, consider metrics like Mean Absolute
Error (MAE), Root Mean Squared Error (RMSE), or R-squared (R²); a formula
sketch is given after this list.
● For financial models, assess metrics such as Net Present Value (NPV),
Internal Rate of Return (IRR), or Payback Period.
​ Validation:
● Validate your model against real-world data or known outcomes.
● Compare model predictions or outputs with observed results.
● Use techniques like cross-validation if applicable.
​ Sensitivity Analysis:
● Conduct sensitivity analysis to understand how changes in input variables
affect model outputs.
● Use Excel's built-in tools like Data Tables or Scenario Manager for
sensitivity analysis.
​ Visualizations:
● Create visualizations to present your model's outputs and insights
effectively.
● Excel offers various chart types and formatting options for visual
representation.
​ Documentation:
● Document your findings, assumptions, methodologies, and validation
results thoroughly.
● Include notes within your Excel file or create a separate documentation
file.
​ Peer Review:
● Have your model reviewed by colleagues or subject matter experts to
identify potential errors or areas for improvement.
​ Revision and Iteration:


● Based on feedback and validation results, revise and refine your model as
needed.
● Iteratively improve your model to enhance its accuracy and reliability.
​ Version Control:
● Maintain version control to track changes and ensure traceability of model
revisions.
​ Final Review and Approval:
● Conduct a final review of your model before deployment or presentation.
● Obtain necessary approvals or sign-offs from stakeholders.
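As the formula sketch promised under Model Evaluation, assuming 20 actual values in
A2:A21 and the model's predictions in B2:B21:

=SUMPRODUCT(ABS(A2:A21-B2:B21))/20 // Mean Absolute Error (MAE)
=SQRT(SUMPRODUCT((A2:A21-B2:B21)^2)/20) // Root Mean Squared Error (RMSE)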

A. Assessing the quality of a model: R-squared, adjusted R-squared, and standard error -

Assessing the quality of a model based on R-squared, adjusted R-squared, and standard
error involves determining how well the model fits the data and whether it provides
meaningful insights. Here's how you can interpret these metrics:

​ R-squared (R²):
● R-squared is a statistical measure that represents the proportion of the
variance in the dependent variable that is explained by the independent
variables in the model.
● It ranges from 0 to 1, where 1 indicates that the model explains all the
variability of the response data around its mean.
● Higher R-squared values generally indicate a better fit of the model to the
data.
● However, R-squared alone does not determine whether a model is good or
bad; it should be interpreted in conjunction with other metrics.
​ Interpretation:
● R-squared values closer to 1 imply that the model explains a large portion
of the variability in the data and is considered desirable.
● R-squared values closer to 0 suggest that the model does not explain
much of the variability in the data and may not be useful for prediction.
​ Adjusted R-squared:
● Adjusted R-squared is similar to R-squared but adjusts for the number of
predictors in the model.
● It penalizes excessive use of predictors and provides a more accurate
measure of model fit, especially when comparing models with different
numbers of predictors.


● Adjusted R-squared tends to be slightly lower than R-squared, especially
when additional predictors do not significantly improve the model.
​ Interpretation:
● Similar to R-squared, higher adjusted R-squared values indicate better
model fit.
● Comparing adjusted R-squared values across different models helps
determine which model provides the best balance between explanatory
power and complexity.
​ Standard Error:
● The standard error of the regression (also known as the standard error of
the estimate) measures the average deviation of the observed values from
the predicted values by the model.
● It provides an indication of the accuracy of the predictions made by the
model.
● A lower standard error indicates that the model's predictions are closer to
the actual observed values.
​ Interpretation:
● Lower standard error values suggest that the model's predictions are more
precise and accurate.
● Higher standard error values indicate greater variability in the predictions
and less precision.

In summary, when assessing the quality of a model based on R-squared, adjusted
R-squared, and standard error:

● Look for higher R-squared and adjusted R-squared values, indicating better model
fit.
● Compare adjusted R-squared values across models to assess the trade-off
between model complexity and explanatory power.
● Aim for lower standard error values, indicating more accurate predictions.
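As a sketch of how these metrics can be read off in Excel, assuming two predictors in
A2:B31, the outcome in C2:C31, and therefore n = 30 observations with k = 2 predictors:

=INDEX(LINEST(C2:C31, A2:B31, TRUE, TRUE), 3, 1) // R-squared
=INDEX(LINEST(C2:C31, A2:B31, TRUE, TRUE), 3, 2) // standard error of the estimate
=1 - (1-F1)*(30-1)/(30-2-1) // adjusted R-squared, assuming the R-squared value sits in cell F1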

B. Testing the assumptions: normality, linearity, multicollinearity, and homoscedasticity

Testing the assumptions of normality, linearity, multicollinearity, and homoscedasticity
is crucial when building regression models. Here's how you can test each assumption in
Excel:


​ Normality Assumption:
● Normality of residuals is essential for regression analysis. You can assess
this assumption by examining the distribution of residuals.
● After running your regression model, calculate the residuals (the
differences between the observed and predicted values).
● Use Excel to create a histogram or a Q-Q plot of the residuals to visually
inspect their distribution.
● Additionally, you can perform a formal test for normality, such as the
Shapiro-Wilk test, using Excel's statistical functions or add-ins like Real
Statistics Resource Pack.
​ Linearity Assumption:
● The relationship between the independent and dependent variables should
be linear. You can check this assumption by plotting the observed values
of the dependent variable against the predicted values from your
regression model.
● After running the regression, create a scatter plot in Excel with the
observed values on the y-axis and the predicted values on the x-axis.
● Ensure that the points on the scatter plot are randomly distributed around
a diagonal line, indicating linearity.
● You can also check for linearity by examining residual plots, where
residuals should be randomly distributed around zero for different values
of the independent variables.
​ Multicollinearity Assumption:
● Multicollinearity occurs when independent variables in a regression model
are highly correlated with each other.
● Calculate correlation coefficients between independent variables using
Excel's CORREL function.
● Alternatively, you can use Excel's Data Analysis Toolpak to perform a
correlation analysis.
● Look for high correlation coefficients (close to +1 or -1) between pairs of
independent variables, indicating potential multicollinearity issues.
● Consider using variance inflation factor (VIF) calculations to quantitatively
assess multicollinearity, which can be computed with Excel formulas
after estimating your regression model; a sketch follows this list.
​ Homoscedasticity Assumption:
● Homoscedasticity means that the variance of the residuals is constant
across all levels of the independent variables.
● After running the regression, plot the residuals against the predicted
values or against each independent variable.


● Ensure that there are no discernible patterns or trends in the residual plot,
indicating constant variance.
● You can also perform formal tests for homoscedasticity, such as the
Breusch-Pagan test or White's test, using Excel's statistical functions or
add-ins.
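As a sketch of the VIF calculation mentioned under the multicollinearity check, assuming
three predictors in columns A, B, and C (rows 2-51): regress one predictor on the others,
take that regression's R-squared, and compute VIF = 1/(1 - R-squared). For the predictor
in column A:

=1/(1 - INDEX(LINEST(A2:A51, B2:C51, TRUE, TRUE), 3, 1)) // VIF for predictor A; values above roughly 5-10 suggest multicollinearity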

C. Cross-validation and model selection techniques.

Cross-validation and model selection testing are both important techniques used in

machine learning to evaluate and select the best-performing model for a given dataset.

Here's a brief overview of each:

​ Cross-validation:
● Cross-validation is a resampling technique used to assess how well a
model generalizes to an independent dataset.
● The basic idea is to partition the dataset into multiple subsets or folds.
The model is trained on a portion of the data and validated on the
remaining portion.
● Common types of cross-validation include k-fold cross-validation,
stratified k-fold cross-validation, leave-one-out cross-validation (LOOCV),
etc.
● By repeating this process with different partitions of the data, we can
obtain multiple estimates of model performance. The final performance
metric is often computed as the average across all folds.
​ Model selection testing:
● Model selection refers to the process of choosing the best model or
algorithm from a set of candidate models.
● Model selection testing involves evaluating different models using a
performance metric and selecting the one that performs best on unseen
data.
● This process typically involves comparing the performance of models
using techniques such as cross-validation, holdout validation, or other
validation strategies.
● Performance metrics used for model selection testing depend on the
problem at hand but often include accuracy, precision, recall, F1 score,
ROC AUC, etc.


● Hyperparameter tuning is often a part of model selection testing, where


different combinations of hyperparameters are tested to find the optimal
configuration for a given model.

In practice, cross-validation is commonly used during model selection testing. Each

model candidate is trained and evaluated using cross-validation, and the model with the

best average performance across the folds is selected as the final model. Additionally,

cross-validation helps in estimating the generalization performance of the selected

model on unseen data.
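Excel has no built-in cross-validation tool, but a simple holdout split can be sketched with
formulas. Assuming periods in A2:A51 and values in B2:B51, fit a linear model on the first
40 rows and score it on the last 10:

F1: =SLOPE(B2:B41, A2:A41) // slope fitted on the training rows
F2: =INTERCEPT(B2:B41, A2:A41) // intercept fitted on the training rows
F3: =SQRT(SUMPRODUCT((B42:B51 - (F2 + F1*A42:A51))^2)/10) // RMSE on the held-out rows

Repeating this for each candidate model and comparing the held-out RMSE gives a basic
model selection test.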

8. Non-Linear Regression Models


Nonlinear regression models are used when the relationship between the independent

variables and the dependent variable is not linear. In these cases, the relationship may

be better described by a curve or some other nonlinear function. Nonlinear regression

models can capture more complex patterns and relationships in the data compared to

linear regression models.

Here's a general overview of nonlinear regression models:

​ Model Representation:

● Nonlinear regression models can take various forms, depending on the

specific problem and the nature of the data.

● A common form of nonlinear regression model is:

y = f(x, β) + ε


● Where:

● y is the dependent variable.

● x is the independent variable(s).

● β represents the parameters of the model.

● f() is a nonlinear function that relates the independent variable(s)

to the dependent variable.

● ε is the error term, representing the difference between the

observed and predicted values.

​ Model Fitting:

● Fitting a nonlinear regression model involves estimating the parameters β

that best fit the observed data.

● This is typically done using optimization techniques such as least squares

estimation, maximum likelihood estimation, or other numerical

optimization methods.

● The objective is to minimize the sum of squared differences between the

observed and predicted values (the residual sum of squares).

​ Types of Nonlinear Models:

● There are many types of nonlinear regression models, including

polynomial regression, exponential regression, logarithmic regression,

power regression, sigmoidal regression, etc.

● The choice of the specific nonlinear function depends on the underlying

relationship between the variables and the characteristics of the data.

​ Model Evaluation:

● Evaluation of nonlinear regression models involves assessing how well the

model fits the data and how well it generalizes to unseen data.


● Common evaluation metrics include R-squared (coefficient of

determination), root mean squared error (RMSE), mean absolute error

(MAE), etc.

● Cross-validation techniques can also be applied to assess the model's

performance and guard against overfitting.

​ Applications:

● Nonlinear regression models are widely used in various fields, including

economics, biology, physics, engineering, finance, etc., wherever the

relationships between variables are nonlinear.

In summary, nonlinear regression models provide a flexible framework for capturing

complex relationships in the data and are valuable tools for modeling real-world

phenomena. However, fitting and interpreting these models require careful

consideration of the underlying relationships and appropriate model selection and

evaluation techniques.

8.1 Implementing a non-linear regression model in Excel using the Solver add-in
Implementing a non-linear regression model in Excel using the Solver Add-In involves

fitting a curve to data points by minimizing the sum of squared differences between the

observed and predicted values. Here's a step-by-step guide on how to do this:

​ Organize your data: Have your independent variable (X) in one column and your

dependent variable (Y) in another column.


​ Choose a Model: Decide on the type of non-linear model you want to fit to your

data. Common models include exponential, logarithmic, polynomial, etc. For

example, let's consider fitting an exponential model: Y = A * exp(B * X), where A

and B are parameters to be determined.

​ Initial Guess: Provide initial guesses for the parameters A and B. You can either

estimate them from the data or start with reasonable values.

​ Set up the Model in Excel: In another column, calculate the predicted Y values

based on the current parameter guesses and the chosen model.

​ Calculate Residuals: In another column, calculate the differences between the

observed Y values and the predicted Y values.

​ Sum of Squared Residuals (SSR): Square each residual and sum them up. This is

the objective function you want to minimize.

​ Use Solver Add-In: Go to the "Data" tab, click on "Solver" (if you haven't installed it

yet, you may need to add it from Excel Add-Ins), and set up Solver to minimize the

SSR by changing the cell values of the parameters (A and B).

​ Run Solver: Click Solve, and Solver will try different values of A and B to minimize

the SSR.

​ Analyze Results: Once Solver converges, you'll get the optimal values of

parameters A and B.


Here's a simplified example:

Let's say your data is in columns A and B, with X values in column A and Y values in

column B.

● Initial guess for A = 1

● Initial guess for B = 0.1

In another column, calculate predicted Y values using the formula Y = A * EXP(B * X).

Then, calculate residuals (difference between observed Y and predicted Y).

Calculate SSR as the sum of squared residuals.

Set up Solver to minimize SSR by changing cells containing A and B values.

Run Solver, and it will find the optimal values for A and B.

This approach can be generalized to any non-linear model. Just replace the model
formula with the one you want to fit.
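As a worksheet sketch of this setup, assuming X in A2:A20, Y in B2:B20, and the
parameter guesses for A and B in cells E1 and E2:

C2: =$E$1*EXP($E$2*A2) // predicted Y from the current parameter guesses; fill down to C20
D2: =(B2-C2)^2 // squared residual; fill down to D20
E4: =SUM(D2:D20) // SSR - set this cell as the Solver objective (minimize, by changing E1:E2)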

9. Time Series Decomposition


Time series decomposition in Excel involves separating a time series into its constituent

components, typically trend, seasonal, and irregular components. While Excel doesn't

have a built-in function specifically for time series decomposition, you can use some of

its functionalities to achieve this.

One common method for time series decomposition is the classical decomposition

method, which involves:


● Trend Estimation: Estimating the trend component of the time series.


● Seasonal Adjustment: Adjusting for seasonal effects.
● Residual (Irregular) Calculation: Calculating the irregular component as the
remainder after removing trend and seasonal effects.

Here's a general guide on how to perform time series decomposition in Excel using

these steps:

​ Import Your Time Series Data: Input your time series data into Excel. Typically,
you'll have two columns: one for the dates (time) and another for the
corresponding values.

​ Estimate the Trend: You can use various methods to estimate the trend, such as
moving averages or linear regression. For instance, you could calculate a moving
average over a certain window of time to smooth out fluctuations and estimate
the trend.

​ Seasonal Adjustment: To adjust for seasonal effects, you'll need to calculate
seasonal indices. One simple method is to calculate the average value of the
time series for each season (e.g., each month or each quarter) and then calculate
seasonal indices by dividing each observed value by the corresponding seasonal
average. Subtracting these seasonal indices from the original values gives you
the seasonally adjusted series.

​ Residual Calculation: Once you have estimated the trend and adjusted for
seasonal effects, you can calculate the residuals (irregular component) as the
difference between the original values and the sum of the trend and seasonal
components.


​ Visualization and Analysis: Plot the original time series data along with the
estimated trend, seasonal, and irregular components to visualize the
decomposition. You can use Excel charting features for this.

While performing time series decomposition manually in Excel can be somewhat

laborious, it is possible with careful use of formulas and data manipulation techniques.

Alternatively, you may consider using more specialized software or programming

languages like Python or R that have built-in functions and libraries for time series

analysis and decomposition.

9.1 Understanding the components of time series data:


trend, seasonality, and noise.
Understanding the components of time series data—trend, seasonality, and

noise—is crucial for analyzing and modeling time-dependent phenomena

accurately. Here's an overview of each component:

​ Trend:
● Definition: The long-term movement or direction of the data over
time. It represents the underlying pattern in the data that persists
over a long period.
● Characteristics:
● Trends can be increasing, decreasing, or stable over time.
● They reflect changes due to underlying factors such as
population growth, economic cycles, technological
advancements, etc.
● Identification:
● Visual inspection of the time series plot.


● Statistical techniques like moving averages, exponential


smoothing, or linear regression to estimate and extract the
trend component.
● Example: A steady increase in sales over several years due to
population growth and increasing demand for the product.

​ Seasonality:
● Definition: Regular, periodic fluctuations in the data occurring at
fixed intervals of time (e.g., daily, weekly, monthly, quarterly, or
yearly).
● Characteristics:
● Seasonality represents patterns that repeat over a specific
period, often driven by calendar-related or natural factors.
● It can be additive (consistent amplitude throughout) or
multiplicative (amplitude scales with the level of the series).
● Identification:
● Visual inspection of the time series plot, looking for recurring
patterns at fixed intervals.
● Seasonal decomposition techniques to isolate and estimate
seasonal effects.
● Example: Higher sales of ice cream during summer months and
lower sales during winter, driven by weather conditions.

​ Noise (Irregular or Residual):
● Definition: Random fluctuations or irregular variations in the data
that cannot be attributed to the trend or seasonality. It represents the
randomness or unpredictability in the data.
● Characteristics:

● Noise is typically short-term and unpredictable.


● It may arise from various sources such as measurement
errors, sampling variability, or unpredictable external factors.
● Identification:
● Examining the residuals after removing the trend and seasonal
components.
● Statistical techniques like autocorrelation analysis or residual
diagnostics to assess randomness.
● Example: Random fluctuations in daily stock prices due to market
uncertainty and investor behavior.

Understanding and accurately modeling these components are essential for

forecasting, anomaly detection, and decision-making based on time series data.

Effective decomposition allows analysts to isolate each component's effects,

making it easier to interpret patterns and make informed predictions.

Decomposing time series data in Excel using moving averages and seasonal indices:

Decomposing time series data in Excel using moving averages and seasonal indices
involves estimating the trend and seasonal components separately. Here's how you can
do it step by step:

​ Import Your Time Series Data: Input your time series data into Excel. You should
have two columns: one for the dates (time) and another for the corresponding
values.
​ Calculate Moving Averages for Trend Estimation:


● Choose a window size for your moving average. The window size
determines how many consecutive data points are averaged.
● In a new column, calculate the moving average for each data point using
Excel's AVERAGE function combined with relative cell references. For
example, if your time series values are in column B starting from B2, and
you've chosen a window size of 5, in cell C3, you would input
=AVERAGE(B2:B6) and drag this formula down to calculate moving averages
for all data points.
​ Calculate Seasonal Indices:
● Determine the periodicity of your seasonal component (e.g., monthly,
quarterly).
● Calculate the average value for each season. For instance, if you have
monthly data, calculate the average value for each month across all years.
● Divide each observed value by the corresponding seasonal average to
obtain seasonal indices.
● In Excel, you can calculate these seasonal averages manually or use
functions like AVERAGEIFS or PivotTables.
● Once you have the seasonal indices, expand them to match the length of
your time series data.
​ Calculate Seasonally Adjusted Values:
● Divide the original time series values by the seasonal indices to obtain
seasonally adjusted values. You can do this in a new column.
● This step removes the seasonal component from the original data, leaving
the trend and irregular components.
​ Calculate Residuals (Irregular Component):
● Subtract the trend (moving averages) from the seasonally adjusted values
to obtain residuals.
● Residuals represent the irregular component of the time series data.
​ Visualize the Components:


● Plot the original time series data, moving averages (trend), seasonal
indices, and residuals to visualize how each component contributes to the
overall series.
● Excel's charting features can be used for this purpose.

By following these steps, you can decompose your time series data into its trend,

seasonal, and irregular components using moving averages and seasonal indices in

Excel. This decomposition facilitates a better understanding of the underlying patterns

in the data, aiding in forecasting and analysis.
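As a compact sketch of these steps, assuming monthly dates in A2:A37, values in
B2:B37, and multiplicative seasonality:

D2: =TEXT(A2,"mmm") // month label, e.g. "Jan"; fill down
E2: =AVERAGEIF($D$2:$D$37, D2, $B$2:$B$37)/AVERAGE($B$2:$B$37) // seasonal index for that month; fill down
F2: =B2/E2 // seasonally adjusted value; fill down
G4: =AVERAGE(F2:F6) // 5-period moving-average trend; fill down
H4: =F4-G4 // residual (irregular component)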

Fig: Seasonal indices.


10. Advanced time series forecasting techniques


Advanced time series forecasting techniques go beyond simple methods like moving

averages or exponential smoothing and often involve more sophisticated algorithms

and models. Here are some advanced techniques commonly used for time series

forecasting:

​ ARIMA (AutoRegressive Integrated Moving Average):


● ARIMA is a widely used model for time series forecasting that combines
autoregressive (AR) and moving average (MA) components.
● It can handle both trend and seasonality in the data and is suitable for
stationary or non-stationary time series.
● ARIMA models require tuning parameters such as the order of differencing
(d), the number of autoregressive terms (p), and the number of moving
average terms (q).
​ Seasonal ARIMA (SARIMA):
● SARIMA extends the ARIMA model to incorporate seasonal components
in addition to trend and irregular components.
● It includes seasonal parameters (P, D, Q) in addition to the non-seasonal
parameters of ARIMA.
● SARIMA models are effective for time series data with clear seasonal
patterns.
​ Exponential Smoothing State Space Models (ETS):
● ETS models are a class of state space models that include several
exponential smoothing methods such as simple exponential smoothing,
Holt's method, and Holt-Winters' method.
● These models provide a flexible framework for capturing trend,
seasonality, and irregular components in the data.
● ETS models are particularly useful when the data exhibit changing
patterns over time.


​ Seasonal Decomposition of Time Series (STL):


● STL decomposes a time series into trend, seasonal, and irregular
components using a robust iterative algorithm.
● It allows for more flexible handling of non-linear trends and irregular
patterns in the data compared to traditional decomposition methods.
● STL is useful for forecasting time series with complex seasonal patterns.
​ Machine Learning Models:
● Various machine learning algorithms such as neural networks, random
forests, support vector machines, and gradient boosting machines can be
applied to time series forecasting.
● These models can capture complex relationships and patterns in the data
and often outperform traditional statistical methods.
● Feature engineering, model selection, and hyperparameter tuning are
important considerations when using machine learning for time series
forecasting.
​ Deep Learning Models:
● Deep learning models, particularly recurrent neural networks (RNNs) and
variants like long short-term memory (LSTM) networks and gated
recurrent units (GRUs), have shown promise for time series forecasting.
● These models can capture long-term dependencies and non-linear
patterns in the data, making them suitable for a wide range of forecasting
tasks.
● Deep learning models may require large amounts of data and
computational resources for training.
​ Ensemble Methods:
● Ensemble methods combine forecasts from multiple models to improve
prediction accuracy and robustness.
● Techniques such as model averaging, weighted averaging, and stacking
can be used to combine forecasts from different models or model
configurations.


● Ensemble methods can mitigate the weaknesses of individual models and


provide more reliable forecasts.

When applying advanced time series forecasting techniques, it's essential to evaluate

model performance using appropriate metrics and consider factors such as data

quality, seasonality, trend patterns, and the forecasting horizon. Additionally, model

interpretation and the computational complexity of the chosen approach should be

taken into account when selecting the most suitable technique for a particular

forecasting task.

10.1 Autoregressive (AR) Model:

The Autoregressive (AR) model is a time series forecasting technique that
predicts future values based on a linear combination of past observations. In an
AR model of order p (denoted AR(p)), the current value Yt is modeled as a
function of the p most recent observations:

Yt = c + φ1·Y(t-1) + φ2·Y(t-2) + … + φp·Y(t-p) + εt

Where :

● c is the constant term.
● φ1, φ2, …, φp are the autoregressive coefficients.
● Y(t-1), Y(t-2), …, Y(t-p) are the lagged values of the time series.
● εt is the error term at time t.

The AR model captures the linear relationship between the current value of the
time series and its past values.


10.2 Moving Average Model :

The Moving Average (MA) model is another time series forecasting technique
that predicts future values based on a weighted sum of past prediction errors.
In an MA model of order q (denoted MA(q)), the current value Yt is
modeled as a function of the q most recent prediction errors:

Yt = μ + θ1·ε(t-1) + θ2·ε(t-2) + … + θq·ε(t-q) + εt

Where:

● μ (mu) is the mean of the time series.
● θ1, θ2, …, θq are the moving average coefficients.
● ε(t-1), ε(t-2), …, ε(t-q) are the lagged prediction errors.
● εt is the error term at time t.

The MA model captures the dependence between the current value of the time series
and the residual errors from previous predictions.

10.3 Implementing advanced forecasting techniques in Excel using custom formulas and
add-ins:
Implementing advanced time series forecasting techniques such as ARIMA in Excel

using custom formulas alone can be quite challenging due to the complexity of these

models. However, you can use Excel in conjunction with add-ins or external tools to

perform advanced forecasting. One such popular add-in for Excel is the "Solver" add-in,

which can be used to optimize parameters for simpler models like exponential

smoothing or simple linear regression.

Here's a general approach using Solver and an external tool like R or Python for ARIMA

forecasting:


​ Data Preparation in Excel:


● Organize your time series data in Excel, typically in two columns: one for
the time index (dates) and another for the corresponding values.
​ Export Data:
● Export the data from Excel into a format that can be read by R or Python.
Common formats include CSV (Comma-Separated Values) or Excel files.
​ Model Estimation in R or Python:
● Use R packages like "forecast" or Python libraries like "statsmodels" or
"pmdarima" to fit ARIMA models to your time series data.
● Write scripts or functions in R or Python to read the data, fit the ARIMA
model, and generate forecasts.
​ Optimization with Solver (Optional):
● If you want to optimize model parameters (e.g., for exponential smoothing
models), you can use Excel's Solver add-in.
● Define an objective function in Excel that measures the error between
actual and predicted values (e.g., Mean Squared Error).
● Use Solver to minimize this objective function by adjusting the model
parameters (a sketch follows this list).
​ Import Forecasts into Excel:
● Once you have generated forecasts using R or Python, import the
forecasted values back into Excel.
● You can create a new sheet or column to display the forecasted values
alongside the original data.
​ Visualization and Analysis in Excel:
● Use Excel's charting features to visualize the original time series data and
the forecasted values.
● Analyze the accuracy of the forecasts and make adjustments as needed.
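As the sketch promised for the optional Solver step, simple exponential smoothing can
be tuned directly in the sheet. Assuming actual values in A2:A40 and the smoothing
constant alpha in D1 (start with, say, 0.3):

B2: =A2 // seed the forecast with the first actual value
B3: =$D$1*A2 + (1-$D$1)*B2 // forecast for the next period; fill down to B40
E1: =SUMPRODUCT((A3:A40-B3:B40)^2) // sum of squared errors - the Solver objective

Set Solver to minimize E1 by changing D1, with the constraint 0 <= D1 <= 1.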

This approach leverages the strengths of both Excel and external tools like R or Python

to perform advanced time series forecasting. While Excel may not be suitable for


directly implementing complex forecasting models, it can still be a valuable tool for data

preparation, visualization, and analysis in conjunction with more powerful statistical

software.


Unit IV

Hypothesis Testing, Confidence Intervals, and Excel Add-ins for Data Analytics:

1. Hypothesis Testing Basics:

A) Introduction to hypothesis testing and its importance -


Hypothesis testing is a fundamental concept in statistics used to make inferences

about populations based on sample data. It involves evaluating two competing

hypotheses about a population parameter: the null hypothesis (H0) and the alternative

hypothesis (H1 or HA).

​ Null Hypothesis (H0): This hypothesis typically represents the status quo or the
belief that there is no effect or no difference. It is the hypothesis that researchers
aim to test against.
​ Alternative Hypothesis (H1 or HA): This hypothesis represents the opposite of
the null hypothesis, suggesting that there is an effect or a difference.

The importance of hypothesis testing lies in its ability to provide a systematic

framework for making decisions and drawing conclusions based on data. Here's why it's

important:

​ Guidance in Decision Making: Hypothesis testing provides a structured approach


to decision-making by allowing researchers to determine whether the evidence

from the sample data is strong enough to reject the null hypothesis in favor of
the alternative hypothesis.

​ Scientific Rigor : In scientific research, hypothesis testing helps ensure that
conclusions drawn from data are based on sound statistical principles rather
than intuition or anecdotal evidence.

​ Inference about Populations: By analyzing sample data and drawing conclusions
about population parameters, hypothesis testing allows researchers to make
generalizations and infer about broader populations from which the samples
were drawn.

​ Control of Type I and Type II Errors: Hypothesis testing helps control the risks of
making Type I errors (rejecting a true null hypothesis) and Type II errors (failing to
reject a false null hypothesis). By setting a significance level (alpha), researchers
can control the probability of making Type I errors.

​ Comparison of Models or Interventions: Hypothesis testing is often used to
compare different models, treatments, or interventions to determine which one is
more effective or whether there is a significant difference between them.

​ Validation of Theories: Hypothesis testing provides a means to validate or refute
theories by subjecting them to empirical scrutiny.

Overall, hypothesis testing serves as a cornerstone of statistical inference, providing a

systematic framework for making decisions and drawing conclusions based on data,

which is crucial in various fields including science, medicine, economics, psychology,

and more.


B) Understanding Null and Alternative Hypotheses -


Understanding the concepts of null and alternative hypotheses is crucial in hypothesis

testing, as they form the foundation for making statistical decisions based on sample

data. Let's delve deeper into each:

Null Hypothesis (H0):


The null hypothesis is a statement that suggests there is no significant difference or no

effect. It often represents the status quo or the default assumption. In hypothesis

testing, the null hypothesis is denoted by H0.

Example:

● H0: The mean exam scores of students who received tutoring are equal to the
mean exam scores of students who did not receive tutoring.

Alternative Hypothesis (H1 or HA):


The alternative hypothesis is a statement that contradicts the null hypothesis. It

suggests that there is a significant difference or an effect present in the population. In

hypothesis testing, the alternative hypothesis is denoted by H1 or HA.

Example:

● HA: The mean exam scores of students who received tutoring are not equal to
the mean exam scores of students who did not receive tutoring.

Relationship between Null and Alternative Hypotheses:


● The null hypothesis typically represents the position of skepticism or no effect.


● The alternative hypothesis contradicts the null hypothesis and suggests the
presence of an effect or a difference.
● In most cases, the objective of hypothesis testing is to assess the evidence
provided by sample data and decide whether there is sufficient evidence to reject
the null hypothesis in favor of the alternative hypothesis.

Types of Alternative Hypotheses:


​ Two-Tailed (Non-directional) Alternative Hypothesis:
● Implies that the parameter is different from the hypothesized value, but
does not specify the direction of the difference.
● Example: HA: The mean exam scores of students who received tutoring
are not equal to 75%.
​ One-Tailed (Directional) Alternative Hypothesis:
● Specifies the direction of the difference from the hypothesized value.
● Can be either a right-tailed (greater than) or left-tailed (less than)
alternative hypothesis.
● Example:
● Right-tailed: HA: The mean exam scores of students who received
tutoring are greater than the mean exam scores of students who
did not receive tutoring.
● Left-tailed: HA: The mean exam scores of students who received
tutoring are less than the mean exam scores of students who did
not receive tutoring.

Understanding and properly formulating null and alternative hypotheses is essential for

conducting hypothesis tests accurately and drawing meaningful conclusions based on

statistical evidence.


C) Types of hypothesis tests: one-tailed and two-tailed tests -

Hypothesis tests can be broadly categorized into one-tailed and two-tailed tests,
depending on the nature of the alternative hypothesis. Let's explore each type:

One-Tailed Test:
In a one-tailed test, the alternative hypothesis specifies the direction of the difference or effect

relative to the null hypothesis. It focuses on testing whether a parameter is either greater than or less

than a certain value.

Types of One-Tailed Tests:


​ Right-Tailed Test (Upper-Tailed Test):
● The alternative hypothesis suggests that the parameter is greater than the null
hypothesis value.
● This type of test is used when researchers are interested in determining if there is a
significant increase or improvement.
● Example: Testing whether a new drug leads to an increase in average test scores.
​ Left-Tailed Test (Lower-Tailed Test):
● The alternative hypothesis suggests that the parameter is less than the null
hypothesis value.
● This type of test is used when researchers are interested in determining if there is a
significant decrease or decline.
● Example: Testing whether a certain process leads to a reduction in defect rates.

Two-Tailed Test:
In a two-tailed test, the alternative hypothesis does not specify the direction of the difference or

effect relative to the null hypothesis. It focuses on testing whether a parameter is different (either

greater than or less than) from a certain value.

● The alternative hypothesis suggests that the parameter is simply different from the
null hypothesis value, without specifying the direction.
● This type of test is used when researchers are interested in determining if there is a
significant difference, regardless of whether it is an increase or decrease.
● Example: Testing whether there is a significant difference in average response times
between two customer service methods.

Key Differences:
● One-tailed tests are more sensitive to detecting effects in one specific direction, while
two-tailed tests are sensitive to detecting effects in either direction.
● The choice between one-tailed and two-tailed tests depends on the research question and
the specific hypothesis being tested.
● One-tailed tests may be appropriate when there is a strong theoretical basis for predicting
the direction of the effect, while two-tailed tests are more conservative and are used when
the direction is uncertain or when researchers want to detect any significant difference.

Understanding the distinction between one-tailed and two-tailed tests is crucial for selecting the

appropriate hypothesis test and interpreting the results accurately based on the research objectives

and hypotheses.

2. Confidence Intervals.
In Excel, you can calculate confidence intervals for population means using the
CONFIDENCE function. This function returns the margin of error for a population
mean, given a significance level, the standard deviation of the population, and the size
of the sample. Here's how you can use it:

CONFIDENCE(alpha, standard_dev, size)


● Alpha - The significance level, equal to 1 minus the confidence level. It ranges
from 0 to 1. For example, for a 95% confidence level, alpha would be 0.05.
● standard_dev - The standard deviation of the population.
● size - The size of the sample.

Let's say you have a sample of exam scores with a mean of 80, a standard
deviation of 10, and a sample size of 30. To calculate the 95% confidence interval
for the population mean:

=CONFIDENCE(0.05, 10, 30)

This will return the margin of error (half-width of the confidence interval). To get the
upper and lower bounds of the confidence interval, you would add and subtract this
margin of error from the sample mean.

For example, if the sample mean is 80:

=80 - CONFIDENCE(0.05, 10, 30) // Lower bound

=80 + CONFIDENCE(0.05, 10, 30) // Upper bound

You can also use the CONFIDENCE.NORM function (which, like CONFIDENCE, assumes a
normal distribution) or the CONFIDENCE.T function if you have a small sample (typically
fewer than 30 observations) and want to use a t-distribution instead of a normal
distribution.


Remember, these functions help you calculate the confidence interval for the population

mean. If you want to calculate confidence intervals for other parameters or perform

different types of statistical analysis, Excel offers various other functions and tools.

● Interpretation of confidence intervals


In Excel, you can calculate confidence levels using the CONFIDENCE, CONFIDENCE.NORM,
or CONFIDENCE.T functions. These functions are used to determine the range of values
within which the true population parameter is likely to fall, based on sample data and a
specified confidence level. Here's how to interpret confidence levels and their usage in
Excel:

1. CONFIDENCE Function:
The CONFIDENCE function in Excel is used to calculate the confidence interval for a
population mean, given a sample mean, standard deviation, and sample size. It
assumes a normal distribution.

CONFIDENCE(alpha, standard_dev, size)

● Alpha - The significance level, equal to 1 minus the confidence level. It ranges
from 0 to 1.
● standard_dev - The standard deviation of the population.
● size - The size of the sample.


Interpretation:
● Higher confidence levels correspond to wider confidence intervals. For example,
a 95% confidence level implies that if the sampling process were repeated many
times, approximately 95% of the calculated confidence intervals would contain
the true population parameter.
● A lower confidence level, such as 90%, would yield a narrower confidence interval
but with less certainty that it contains the true population parameter.

2. CONFIDENCE.NORM Function:
The CONFIDENCE.NORM function calculates the confidence interval for a population mean
when the population standard deviation is known.

CONFIDENCE.NORM(alpha, standard_dev, size)

3. CONFIDENCE.T Function:
The CONFIDENCE.T function calculates the confidence interval for a population mean
when the population standard deviation is unknown and the sample size is small
(typically less than 30).

CONFIDENCE.T(alpha, standard_dev, size)

Interpretation for CONFIDENCE.NORM and CONFIDENCE.T Functions:


● These functions are similar to CONFIDENCE but are used under specific conditions
(known or unknown population standard deviation) and assuming a normal or
t-distribution, respectively.
● The interpretation of confidence levels remains the same: higher confidence
levels imply wider intervals with greater certainty of containing the true
population parameter.

Example Interpretation:

Let's say you calculate a 95% confidence interval for the population mean using
CONFIDENCE. If the interval is (72, 88), it means that you are 95% confident that the true

population mean lies between 72 and 88 based on your sample data.

In conclusion, confidence levels in Excel provide a measure of the uncertainty


surrounding estimated population parameters based on sample data. They help
quantify the reliability of statistical estimates and are widely used in inferential statistics
and hypothesis testing.

Calculating confidence intervals for a mean and a proportion in Excel:
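Excel has no dedicated function for a proportion interval, but the normal-approximation
formula p ± z·SQRT(p·(1-p)/n) can be sketched with built-in functions. Assuming the
sample proportion is in B1 and the sample size in B2, for a 95% interval:

=NORM.S.INV(0.975)*SQRT(B1*(1-B1)/B2) // margin of error
=B1 - NORM.S.INV(0.975)*SQRT(B1*(1-B1)/B2) // lower bound
=B1 + NORM.S.INV(0.975)*SQRT(B1*(1-B1)/B2) // upper bound

For the mean, the CONFIDENCE, CONFIDENCE.NORM, or CONFIDENCE.T formulas shown
above apply.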

3. Z-Test and T-Test:

In Excel, you can perform both Z-tests and T-tests to compare sample means or
proportions to population parameters or to compare two sample means. Here's an
overview of each:

Z-Test:
A Z-test is used when the population standard deviation is known, or the sample size is
large (typically n > 30). It is commonly used to compare a sample mean to a known
population mean or to test the proportion of successes in a sample against a known
population proportion.


In Excel, you can perform a Z-test for a mean using the Z.TEST function.

Syntax for Z-Test for Means:

Z.TEST(array, x, [sigma])

● array: The sample data range.
● x: The hypothesized population mean.
● sigma: The population standard deviation (optional; if omitted, Excel uses the sample standard deviation).

Note that Z.TEST returns the one-tailed p-value. Excel has no built-in Z-test for proportions; the test statistic must be computed manually from the sample proportion, as sketched below.
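
A minimal sketch of a one-sample Z-test for a proportion, assuming the number of successes is in B1, the sample size in B2, and the hypothesized population proportion in B3 (illustrative cells):

=((B1/B2) - B3) / SQRT(B3*(1-B3)/B2)

This returns the z statistic. If that formula is placed in B4, the two-tailed p-value is:

=2*(1 - NORM.S.DIST(ABS(B4), TRUE))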

T-Test:
A T-test is used when the population standard deviation is unknown and the sample size
is small (typically n < 30). It is used to compare two sample means or to compare a
sample mean to a known population mean.

In Excel, you can perform different types of T-tests depending on whether the data is
from paired or unpaired samples and whether the variances of the samples are
assumed to be equal or not.

Syntax for T-Test (Assuming Equal Variances):

T.TEST(array1, array2, tails, type)

● array1: The data range for the first sample.


● array2: The data range for the second sample.
● tails: The number of tails for the test (1 for one-tailed, 2 for two-tailed).
● type: Specifies the type of T-test to perform (1 for paired samples, 2 for
two-sample equal variance, 3 for two-sample unequal variance).


Syntax for T-Test (Assuming Unequal Variances):

The same T.TEST function is used; there is no separate var_type argument. Unequal variances are selected through the type argument:

T.TEST(array1, array2, tails, 3)

● array1: The data range for the first sample.
● array2: The data range for the second sample.
● tails: The number of tails for the test (1 for one-tailed, 2 for two-tailed).
● type = 3 runs the two-sample test with unequal variances (Welch's t-test).

Example:
Let's say you have two samples, A1:A10 and B1:B10, and you want to perform a
two-sample T-test assuming equal variances:

=T.TEST(A1:A10, B1:B10, 2, 2)

This formula will return the probability associated with a two-tailed Student's t-test.

Remember to adjust the inputs based on your specific data and hypothesis testing
requirements.

3.1 performing one-sample, two-sample, and paired t-tests in excel using the data analysis toolpak :-

( https://www.geeksforgeeks.org/paired-sample-t-test-in-excel/
)
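
The linked walkthrough uses the ToolPak dialog; the same paired test can also be run with a worksheet formula. A sketch assuming before/after measurements in A2:A11 and B2:B11 (illustrative ranges):

=T.TEST(A2:A11, B2:B11, 2, 1)

Here tails = 2 and type = 1 (paired), and the formula returns the two-tailed p-value.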


4. Chi-Square Test and ANOVA -

● Chi Square Test -


The Chi-squared test is used to determine whether there is a significant association
between two categorical variables. In Excel, you can perform this test using the
CHISQ.TEST function.

Syntax : CHISQ.TEST(actual_range, expected_range)

● actual_range: The range of observed values.


● expected_range: The range of expected values.

Example:
Let's say you have observed and expected frequencies in cells A2:A5 and B2:B5,
respectively. To perform the Chi-squared test, you would use:

=CHISQ.TEST(A2:A5, B2:B5)

This function returns the p-value associated with the Chi-squared test, allowing you to
assess the significance of the association.
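
If only observed counts are available, the expected counts can be built from the marginal totals: for the cell in row i and column j, expected = (row i total × column j total) / grand total. A sketch for the top-left cell of a 2×2 table of observed counts in B2:C3 (illustrative layout), which can be copied across the expected-count range:

=SUM($B2:$C2)*SUM(B$2:B$3)/SUM($B$2:$C$3)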

● ANOVA (Analysis of Variance):

ANOVA is used to compare means across multiple groups to determine if there are statistically significant differences between them. Excel has no built-in ANOVA worksheet function; instead, ANOVA is run from the Data Analysis ToolPak (covered in detail in section 4.3).

In brief: organize the data so that each group occupies its own column, go to the "Data" tab, click "Data Analysis," and choose "Anova: Single Factor." The output includes the F-statistic and associated p-value, allowing you to assess the significance of differences between group means.

Notes:
● Ensure your data is properly organized in Excel before applying these functions.
● Interpretation of results involves comparing the p-value to a chosen significance
level (e.g., 0.05). If the p-value is less than the significance level, you reject the
null hypothesis and conclude there is a significant difference.

4.1 performing chi-square test in excel using data analysis toolpak or custom formula -
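
The Analysis ToolPak itself does not ship a chi-square tool, so in practice the test is run with CHISQ.TEST (as above) or with a custom formula. A minimal custom-formula sketch, assuming observed frequencies in A2:A5 and expected frequencies in B2:B5 (the same illustrative ranges as above):

=SUMPRODUCT((A2:A5-B2:B5)^2/B2:B5)

This returns the chi-square statistic. If it is placed in E1, the p-value for 4 categories (3 degrees of freedom) is:

=CHISQ.DIST.RT(E1, 3)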


4.2 introduction to analysis of variance (ANOVA) for comparing multiple means -

Analysis of Variance (ANOVA) is a statistical method used to compare means across multiple groups to determine if there are statistically significant differences between them. It's a powerful tool for analyzing the effects of categorical independent variables on a continuous dependent variable. ANOVA assesses whether the variation in the dependent variable is due to differences between groups or simply random variability. Here's an introduction to ANOVA:

Key Concepts:
​ Independent and Dependent Variables:
● ANOVA involves one dependent variable (continuous) and one or more
independent variables (categorical) with two or more levels or groups.
​ Null and Alternative Hypotheses:
● Null Hypothesis (H0): There is no significant difference in means between
groups.
● Alternative Hypothesis (HA): At least one group mean is significantly
different from others.
​ F-Statistic:
● ANOVA calculates the F-statistic, which represents the ratio of the
variance between groups to the variance within groups. If the ratio is
sufficiently large, it suggests that the group means are significantly
different.


​ Assumptions:
● Independence: Observations within each group are independent.
● Normality: The dependent variable follows a normal distribution within
each group.
● Homogeneity of Variance: The variance of the dependent variable is equal
across all groups.

Types of ANOVA:
​ One-Way ANOVA: Compares means across two or more independent groups or
levels of a single categorical variable.

​ Two-Way ANOVA: Examines the effects of two categorical independent variables
(factors) on a continuous dependent variable, allowing for the analysis of main
effects and interaction effects.

Steps in Conducting ANOVA:


​ Formulate Hypotheses: Define null and alternative hypotheses based on the
research question.

​ Check Assumptions: Verify that the assumptions of ANOVA are met, including
independence, normality, and homogeneity of variance.

​ Select ANOVA Model: Choose the appropriate ANOVA model based on the
research design and number of factors.

​ Calculate F-Statistic: Compute the F-statistic using the variation between groups
and within groups.

​ Determine Significance: Compare the calculated F-statistic to the critical value from the F-distribution or calculate the p-value.

​ Interpret Results: If the p-value is less than the chosen significance level, reject
the null hypothesis and conclude that there are significant differences between
groups.

Interpretation:
● A significant result in ANOVA indicates that at least one group mean is different
from the others, but it doesn't specify which groups are different.
● Post-hoc tests, such as Tukey's HSD or Bonferroni correction, can be used to
identify specific group differences after obtaining a significant ANOVA result.

ANOVA is widely used in various fields, including experimental research, social sciences, medicine, and business, to compare means across multiple groups and understand the effects of categorical variables on continuous outcomes. It provides valuable insights into group differences and helps make informed decisions based on statistical evidence.

4.3 conducting one-way and two-way ANOVA in excel using the data analysis toolpak -

ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or more groups to determine if there are statistically significant differences between them. It's particularly useful when you have categorical independent variables and continuous dependent variables. ANOVA assesses whether the means of different groups are equal or not by examining the variance within groups and between groups.

There are two main types of ANOVA:

​ One-Way ANOVA: Compares the means of two or more independent groups on one continuous dependent variable.
​ Two-Way ANOVA: Extends the one-way ANOVA by considering the effects of two categorical independent variables (factors) on one continuous dependent variable, allowing for the examination of interaction effects between the factors.

Conducting ANOVA in Excel using Data Analysis ToolPak:


To conduct one-way or two-way ANOVA in Excel, you can use the Data Analysis ToolPak, which is an add-in that provides various statistical analysis tools.

Steps to enable Data Analysis ToolPak:


1) Go to the "File" menu.
2) Click on "Options."
3) In the Excel Options dialog box, select "Add-Ins" on the left sidebar.
4) In the "Manage" dropdown at the bottom, select "Excel Add-ins" and click "Go..."
5) Check the "Analysis ToolPak" checkbox and click "OK."

Steps to perform ANOVA:


​ Organize your data:
● Ensure your data is organized in columns with each column representing a
different group or factor.
​ Click on the "Data" tab.
​ Locate the "Data Analysis" tool in the "Analysis" group (this appears after
enabling the Data Analysis ToolPak).


​ Choose "ANOVA: Single Factor" for one-way ANOVA or "ANOVA: Two-Factor with
Replication" for two-way ANOVA.
​ In the dialog box that appears:
● Select the input range for the data.
● Choose whether your data has labels in the first row.
● Specify where you want the output to be placed.
​ Click "OK" to run the analysis.
​ Review the output:
● The output will include various statistics, including the F-statistic, p-value,
and means for each group.
● Examine the p-value to determine if there are statistically significant
differences between groups. A p-value less than your chosen significance
level (e.g., 0.05) indicates significance.

Notes:
● Ensure your data meets the assumptions of ANOVA, including normality of
residuals and homogeneity of variances.
● Interpret the results carefully, considering both statistical significance and
practical significance.
● ANOVA does not identify which specific groups differ from each other. Post-hoc
tests (e.g., Tukey's HSD test) can be used for pairwise comparisons if ANOVA
results are significant.

By following these steps, you can conduct one-way or two-way ANOVA in Excel using the Data Analysis ToolPak, allowing you to compare multiple means and assess the significance of differences between groups.


5. Excel Analysis ToolPak

5.1 Introduction to excel analysis toolpak and its features -

The Analysis ToolPak is a Microsoft Excel add-in that provides powerful data analysis tools for users. It's particularly useful for tasks such as statistical analysis, financial modeling, and engineering calculations. Here's an introduction to some of its key features:

​ Installation: The Analysis ToolPak is not enabled by default in Excel. To use it,
you need to enable it first. You can do this by going to the "File" menu, selecting
"Options," then choosing "Add-Ins." From there, you can select "Analysis ToolPak"
and click "Go" to enable it.

​ Descriptive Statistics: The ToolPak includes a Descriptive Statistics tool that reports summary measures such as the mean, median, mode, standard deviation, and variance. It is handy for summarizing and understanding the characteristics of your data.

​ Histograms: You can use the ToolPak to create histograms, which are graphical
representations of the distribution of data. Histograms are useful for visualizing
the frequency distribution of a dataset.

​ Regression Analysis: The ToolPak provides a Regression tool for performing simple and multiple linear regression. It allows you to analyze the relationship between variables and make predictions based on the data; exponential trends can be fitted by transforming the data or with worksheet functions such as GROWTH.


​ Analysis of Variance (ANOVA): ANOVA is a statistical technique used to analyze the differences among group means in a sample. The ToolPak includes tools for performing one-way and two-way ANOVA, which are useful for comparing means across different groups.

​ Sampling: The ToolPak includes functions for generating random samples from
a population. This feature is handy for conducting simulation studies or for
estimating population parameters based on sample data.

​ Data Analysis Tools: In addition to the ToolPak's statistical tools, Excel itself provides complementary data tools such as sorting, filtering, and subtotaling. Together they make it easier to manipulate and analyze large datasets within Excel.

​ Solver: While technically part of Excel's add-ins rather than the Analysis ToolPak
itself, the Solver tool is often associated with the ToolPak. It allows you to find
optimal solutions to complex problems by adjusting input values according to
specified constraints.

Overall, the Analysis ToolPak is a valuable resource for users who need to perform advanced data analysis tasks within Excel. Whether you're a student, researcher, or business professional, the ToolPak can help you gain insights from your data and make informed decisions.


5.2 using the toolpak for statistical analysis : t-test, ANOVA, correlation, and regression -

Here's a brief overview of how you can use the Analysis ToolPak for each of these statistical analyses in Excel:

​ T-Test:
● To perform a T-Test using the Analysis ToolPak in Excel, first ensure that the ToolPak is enabled (as mentioned earlier).
● Once enabled, you can find the T-Test tools via the "Data Analysis" button on the "Data" tab.
● Click on "Data Analysis" > "t-Test: Two-Sample Assuming Equal Variances" or "t-Test: Two-Sample Assuming Unequal Variances," depending on the nature of your data.
● Input the ranges of your data for the two samples and other required parameters.
● Excel will then calculate the T-Test statistics and provide results including the t-value, degrees of freedom, and p-value.

​ ANOVA (Analysis of Variance):
● Similar to the T-Test, you can find the ANOVA tools under "Data Analysis" after enabling the Analysis ToolPak.
● Click on "Data Analysis" > "Anova: Single Factor" or "Anova: Two-Factor With Replication," depending on your experimental design.
● Input the ranges of your data and select the appropriate options.
● Excel will compute the ANOVA table, which includes the sum of squares, degrees of freedom, mean squares, F-value, and p-value.

​ Correlation:
● To perform a correlation analysis using the Analysis ToolPak, first arrange your data so that each variable is in a separate column.
● Go to the "Data" tab and click on "Data Analysis."
● Select "Correlation" from the list of analysis tools.
● Input the range of your data, select the appropriate options, and specify whether you want the correlation matrix to be output in a new worksheet or in a specified location.
● Excel will calculate the correlation coefficients between all pairs of variables in your dataset.

​ Regression:
● For regression analysis, ensure the Analysis ToolPak is enabled.
● Go to the "Data" tab, click on "Data Analysis," and select "Regression."
● Input the ranges of your independent and dependent variables.
● Choose whether you want Excel to output regression statistics, residuals, and other options.
● After running the regression analysis, Excel will provide coefficients, standard errors, t-values, and p-values for each predictor variable, as well as other relevant statistics like R-squared and adjusted R-squared.

These steps provide a basic outline of how to use the Analysis ToolPak in Excel for various statistical analyses. Depending on your specific dataset and research questions, you may need to adjust the settings and interpret the results accordingly.
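
As a quick cross-check on the ToolPak output, the core quantities can also be computed with ordinary worksheet functions. A sketch assuming x-values in A2:A20 and y-values in B2:B20 (illustrative ranges):

=CORREL(A2:A20, B2:B20)
=SLOPE(B2:B20, A2:A20)
=INTERCEPT(B2:B20, A2:A20)
=RSQ(B2:B20, A2:A20)

These return the correlation coefficient, the regression slope and intercept, and R-squared for a simple linear regression; the ToolPak's Regression tool should agree with them.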


6. Power Query for data transformation -

Power Query is a powerful data transformation and data connection tool that's built into Excel and available as a standalone add-in for Excel 2010 and 2013. It's also integrated into other Microsoft products such as Power BI and SQL Server Integration Services (SSIS). Here's an overview of how you can use Power Query for data transformation in Excel:

​ Getting Data: The first step in using Power Query is to import your data into Excel. You can do this by going to the "Data" tab and selecting "Get Data" or "From Table/Range." Power Query supports a wide range of data sources including Excel tables, CSV files, text files, databases, and online sources like SharePoint, Azure, and OData feeds.

​ Data Cleaning and Transformation: Once you've imported your data, you can use Power Query to clean and transform it as needed. This includes tasks such as:
● Removing duplicates
● Filtering rows based on criteria
● Renaming columns
● Removing or replacing values
● Splitting or merging columns
● Extracting data from text or HTML
● Pivoting and unpivoting data
● Adding custom columns with calculated values or transformations
● Applying conditional logic to data transformations


​ Query Editor: Power Query provides a user-friendly interface called the Query Editor for performing these data transformation tasks. The Query Editor allows you to see a preview of your data and apply transformation steps in a step-by-step manner. Each transformation step is recorded and can be easily modified or removed.

​ Merging and Appending Data: Power Query allows you to merge multiple tables or append data from different sources into a single table. This is useful for combining data from multiple sources or consolidating data from different files or worksheets.

​ Query Parameters and Functions: Power Query supports the use of query parameters and functions, which allow you to create reusable queries and automate data transformation tasks. Query parameters can be used to dynamically adjust filter criteria, file paths, or connection strings, while functions enable you to encapsulate complex transformations into reusable code.

​ Data Load Options: Once you've transformed your data in Power Query, you can choose how to load it into Excel. Options include loading the data to a new worksheet, existing worksheet, or data model. You can also choose to refresh the data automatically when the workbook is opened or manually refresh it at any time.

Overall, Power Query provides a flexible and intuitive toolset for cleaning, transforming, and reshaping data in Excel, making it easier to prepare data for analysis, reporting, and visualization. It's particularly useful for working with large datasets or data from multiple sources, as it allows you to automate and streamline the data preparation process.

6.1 Applications of power query -

Power Query has a wide range of applications across various industries and business functions. Here are some common applications of Power Query:

​ Data Cleaning and Preparation: Power Query is commonly used to clean and prepare raw data for analysis. This includes tasks such as removing duplicates, correcting errors, standardizing formats, and transforming data into a usable format.

​ Data Integration: Power Query can be used to integrate data from multiple sources into a single dataset. This is particularly useful in scenarios where data is stored in different formats or systems, such as combining data from Excel files, databases, text files, and online sources.

​ Data Transformation: Power Query provides powerful tools for transforming and reshaping data. Users can perform tasks such as splitting columns, merging tables, unpivoting data, and creating calculated columns using a simple and intuitive interface.

​ Data Enrichment: Power Query enables users to enrich their datasets by pulling in additional data from external sources. This could include adding demographic information, geographic data, financial data, or other relevant information to enhance analysis and reporting.


​ Data Aggregation and Summarization: Power Query can be used to aggregate and summarize data to create summary reports or dashboards. Users can group data, calculate totals, averages, or other summary statistics, and create aggregated views of their data for analysis.

​ Data Transformation for Reporting: Power Query is often used in conjunction with Excel's reporting features to create dynamic and interactive reports. Users can use Power Query to prepare and clean data before feeding it into pivot tables, charts, and other reporting tools for visualization and analysis.

​ Data Exploration and Analysis: Power Query can be used for ad-hoc data exploration and analysis. Users can quickly filter, sort, and manipulate data to identify patterns, trends, and insights that can inform decision-making.

​ Data Refresh and Automation: Power Query enables users to automate data refresh processes, ensuring that their reports and dashboards are always up-to-date with the latest data. Users can schedule data refreshes or set up triggers to update data automatically when certain conditions are met.

Overall, Power Query is a versatile tool that can streamline and automate various data-related tasks, from data cleaning and integration to analysis and reporting, making it an indispensable tool for Excel users across industries.


6.2 Importing, cleaning, and transforming data using power query -

Importing, cleaning, and transforming data using Power Query is a common task for data analysts and data scientists working with Microsoft Excel or Power BI. Power Query provides a user-friendly interface to perform these operations on your data. Here's a general outline of how you can do this:

​ Import Data:
● Open Excel or Power BI.
● Go to the "Data" tab and select "Get Data" or "Get & Transform Data" depending on your version.
● Choose the data source you want to import from (e.g., Excel workbook, CSV file, SQL database, etc.).
● Connect to your data source and load the data into the Power Query Editor.

​ Clean Data:
● Remove any unnecessary columns or rows.
● Rename columns if needed.
● Handle missing values (remove, replace, or fill).
● Filter out any irrelevant data.
● Correct data types (e.g., change text to numbers, dates, etc.).
● Standardize data formats.

​ Transform Data:
● Add new columns derived from existing ones (e.g., calculations, concatenations, etc.).
● Split columns if needed (e.g., splitting full names into first and last names).
● Merge tables if you have multiple data sources.
● Pivot or unpivot data to reshape it.
● Apply any specific transformations required for your analysis.

​ Advanced Data Cleaning and Transformation:
● Use formulas and functions available in the Power Query Editor to perform more advanced data manipulation.
● Utilize custom functions if needed.
● Handle errors or exceptions gracefully.

​ Data Load:
● Once you're satisfied with your data cleaning and transformations, you can load the data back into Excel or Power BI.
● Choose whether to load it to a table, a data model, or a connection-only query.

​ Data Refresh:
● If your data source is dynamic or changes frequently, set up automatic refresh options to keep your data up-to-date.

Power Query provides a wide range of functionalities to accomplish these tasks, including a user-friendly interface for most operations, a formula language (M), and advanced options for handling different types of data.

Remember, the exact steps might vary slightly depending on your version of Excel or Power BI, but the general process remains the same. Practice and exploration of Power Query features will help you become proficient in data import, cleaning, and transformation tasks.

6.3 Merging and Appending queries to combine data from multiple sources -

Merging and appending queries are essential operations in Power Query for combining data from multiple sources. Here's how you can perform these operations:

​ Merging Queries:
● Merging is used when you have two or more tables with related information and you want to combine them based on a common column (like a primary key in a database).
● To merge queries:
    ● Select the first query (table) in the Power Query Editor.
    ● Go to the "Home" tab, click on "Merge Queries" in the "Combine" group, and choose "Merge Queries" from the dropdown menu.
    ● Select the second query you want to merge with and choose the common column to merge on.
    ● Choose the type of join (e.g., inner, left outer, right outer, full outer) depending on your requirements.
    ● Click OK, and the merged data will be displayed in the Power Query Editor.


​ Appending Queries:
● Appending is used when you have similar data structures from different sources, and you want to stack them on top of each other to create a single table.
● To append queries:
    ● Select the first query in the Power Query Editor.
    ● Go to the "Home" tab, click on "Append Queries" in the "Combine" group, and choose "Append Queries" from the dropdown menu.
    ● Select the queries you want to append together.
    ● Click OK, and the data from the selected queries will be appended to the first query.

​ Advanced Options:
● You can also perform more advanced merges and appends using the "Merge Queries" and "Append Queries" options with additional options like matching multiple columns or using custom logic.
● For more complex scenarios, you might need to use the "Advanced Editor" to write custom M code to achieve the desired merge or append.

​ Data Validation:
● After merging or appending queries, it's important to validate the resulting data to ensure that it meets your expectations. Check for any unexpected duplicates, missing data, or incorrect matches.

​ Load or Close & Load:
● Once you've merged or appended your queries and verified the data, you can choose to load it back into Excel or Power BI, or you can close and load the query to the data model for further analysis.


By mastering merging and appending operations in Power Query, you can efficiently combine data from multiple sources to perform comprehensive analyses. Practice and experimentation will help you become proficient in handling various data integration scenarios.

7. Power Pivot for data modeling -

7.1 Introduction to Power Pivot for data modeling in excel -

Power Pivot is a powerful data modeling tool available in Microsoft Excel. It extends Excel's capabilities to handle larger volumes of data and perform more complex analyses than traditional spreadsheets. Here's an introduction to Power Pivot for data modeling in Excel:

What is Power Pivot?

● Data Modeling: Power Pivot enables you to create data models within Excel, allowing you to bring together data from multiple sources, relate them, and analyze them using PivotTables and PivotCharts.
● In-Memory Technology: Power Pivot uses an in-memory engine called xVelocity (VertiPaq) that allows for fast calculations and analysis of large datasets directly within Excel.
● Advanced Calculations: With Power Pivot, you can create advanced calculations using Data Analysis Expressions (DAX), a powerful formula language designed for business intelligence and data analysis.


● Integration with Power Query: Power Pivot works seamlessly with Power Query, allowing you to import, clean, and transform data from various sources before modeling it in Power Pivot.

Key Features of Power Pivot:

​ Data Integration:
● Import data from various sources such as Excel tables, relational databases, OLAP databases, text files, and more.
● Create relationships between different tables to establish connections between related data.

​ Data Modeling:
● Define calculated columns and measures using DAX formulas to enrich your data model with additional insights.
● Utilize hierarchies and perspectives to organize and navigate complex datasets effectively.

​ Data Analysis:
● Create PivotTables and PivotCharts based on your data model to perform multidimensional analysis.
● Slice and dice your data dynamically to explore different perspectives and uncover patterns and trends.

​ Performance Optimization:
● Optimize performance by managing relationships, hierarchies, and calculations efficiently.
● Utilize features like calculated columns vs. measures judiciously to minimize memory consumption and improve processing speed.

​ Data Visualization:


● Visualize your data using a variety of chart types and formatting options available in Excel.
● Design interactive dashboards and reports to communicate insights effectively.

Benefits of Using Power Pivot:

● Scalability: Handle large datasets with ease, beyond the limitations of traditional Excel worksheets.
● Flexibility: Model data according to your specific business requirements, creating custom calculations and aggregations.
● Integration: Seamlessly integrate with other Microsoft BI tools such as Power Query, Power BI, and SQL Server Analysis Services.
● Empowerment: Empower Excel users to perform sophisticated data analysis without the need for extensive IT support.

In summary, Power Pivot empowers Excel users to create robust data models, perform advanced analytics, and gain valuable insights from their data, all within the familiar Excel environment. With its intuitive interface and powerful features, Power Pivot is a valuable tool for data modeling and analysis in Excel.


7.2 Creating and managing data model using power pivot -

Creating and managing a data model using Power Pivot in Excel involves several steps, including importing data, defining relationships, creating calculated columns and measures, and optimizing performance. Here's a guide on how to do it:

1. Importing Data:
​ Open Power Pivot: In Excel, go to the "Power Pivot" tab and click on "Manage" to open the Power Pivot window.
​ Import Data: In the Power Pivot window, click on "From Other Sources" to import data from various sources such as Excel tables, SQL databases, text files, etc.
​ Transform Data (Optional): You can use Power Query within Power Pivot to clean and transform your data before loading it into the data model.
​ Load Data: After importing and transforming your data, click on "Close & Load" to load it into the Power Pivot data model.

2. Defining Relationships:
​ Create Relationships: In the Power Pivot window, go to the "Home" tab and click on "Diagram View." Here, you can visually define relationships between the tables by dragging and dropping fields from one table to another.
​ Manage Relationships: You can manage relationships by going to the "Design" tab and clicking on "Manage Relationships." Here, you can edit existing relationships, create new ones, or delete unnecessary ones.

3. Creating Calculated Columns and Measures:
​ Calculated Columns: In the Power Pivot window, select a table and click on "New Column" in the "Columns" tab. You can then define a calculated column using DAX formulas to derive new values based on existing data.
​ Measures: Click on "New Measure" in the "Measures" tab to create a new measure. Measures are dynamic calculations that aggregate data based on specified criteria. Use DAX formulas to define measures such as sums, averages, counts, etc.

4. Optimizing Performance:
​ Manage Calculated Columns and Measures: Avoid creating unnecessary calculated columns and measures that consume memory. Use measures instead of calculated columns whenever possible for aggregations.
​ Data Model Properties: In the Power Pivot window, go to the "Advanced" tab to access data model properties. Here, you can optimize memory settings, set storage options, and configure other advanced settings to improve performance.
​ Relationship Optimization: Ensure that relationships between tables are properly defined and optimized for performance. Use inactive relationships or bidirectional filtering sparingly to avoid unnecessary overhead.

5. Refreshing Data:
​ Manual Refresh: You can manually refresh the data model by clicking on "Refresh" in the Power Pivot window. This reloads data from the original data sources and updates the data model accordingly.
​ Automatic Refresh: Configure automatic data refresh options to keep your data model up-to-date with changes in the source data.

6. Testing and Validating:
​ Test Calculations and Relationships: Verify that calculated columns, measures, and relationships are functioning as expected by testing them with sample data.
​ Data Validation: Validate the accuracy and integrity of your data model by comparing results with known benchmarks or performing data validation checks.

7. Deploying and Sharing:
Once your data model is created and validated, you can deploy it for use by yourself or others in your organization. You can share Excel workbooks containing the Power Pivot data model or publish it to Power BI for broader consumption.

By following these steps, you can create and manage a data model using Power Pivot in Excel, enabling you to perform advanced data analysis and gain valuable insights from your data.


7.3 Working with calculated columns and measures using DAX (Data Analysis Expressions) -

Working with calculated columns and measures using Data Analysis Expressions (DAX) in Power Pivot allows you to perform advanced calculations and aggregations on your data. Here's a guide on how to create and use calculated columns and measures using DAX:

Calculated Columns:
​ Create Calculated Column:
● In Power Pivot, select the table to which you want to add a calculated column.
● Go to the "Power Pivot" tab, click on "Add Column," and then select "Calculated Column."
● A new column will appear in the table, and the formula bar will allow you to enter a DAX formula.

​ Write DAX Formula:
● Write your DAX formula in the formula bar. For example, to concatenate two columns row by row, you can use the & operator:

= Table1[Column1] & " " & Table1[Column2]

​ Use DAX Functions:
● Utilize a wide range of DAX functions to perform calculations and transformations, such as mathematical functions (SUM, AVERAGE), text functions (CONCATENATE, LEFT, RIGHT), date functions (YEAR, MONTH, DAY), and more.

​ Test and Validate:
● Verify that your calculated column is producing the desired results by examining the values in the column and comparing them to your expectations.

Measures:
​ Create Measure:
● In Power Pivot, go to the "Power Pivot" tab and click on "New Measure" in the "Measures" group.
● The formula bar will appear, allowing you to enter a DAX formula for your measure.

​ Write DAX Formula:
● Write your DAX formula in the formula bar. For example, to calculate the total sales amount, you can use the SUM function:

Total Sales = SUM(Sales[Amount])

​ Aggregate Functions:
● Use DAX aggregate functions like SUM, AVERAGE, COUNT, MIN, MAX, etc., to perform calculations across rows or filtered data.

​ Contextual Functions:
● Leverage DAX functions that work within the context of PivotTables and PivotCharts, such as CALCULATE, FILTER, ALL, RELATED, etc., to dynamically calculate results based on user selections or filters (see the sketch after this list).

​ Formatting:
● Format measures to display results in a user-friendly format using formatting options available in Power Pivot or Excel.

​ Test and Validate:
● Test your measures within PivotTables or PivotCharts to ensure they are producing accurate results based on the context of the analysis.
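
A minimal sketch of a context-aware measure, assuming a Sales table with Amount and Region columns (illustrative names):

West Sales = CALCULATE(SUM(Sales[Amount]), Sales[Region] = "West")

West Share = DIVIDE([West Sales], CALCULATE(SUM(Sales[Amount]), ALL(Sales[Region])))

CALCULATE re-evaluates the sum under a modified filter context, while ALL removes any Region filter so the denominator always covers every region.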

Best Practices:
● Performance Optimization: Use measures instead of calculated columns whenever possible to optimize memory usage and performance.
● Reuse and Modularization: Break down complex calculations into smaller, reusable measures for better maintainability and flexibility.
● Documentation: Document your DAX formulas and provide clear descriptions for calculated columns and measures to facilitate understanding and collaboration with other users.

By mastering calculated columns and measures using DAX in Power Pivot, you can perform sophisticated data analysis and derive valuable insights from your data models in Excel.


8. Power Map for Geospatial Data visualization -

Power Map, previously known as GeoFlow, is a powerful geospatial data visualization tool available in Microsoft Excel. It allows users to plot data on a 3D globe or custom map, providing interactive and visually compelling ways to analyze geographic and time-based data. Here's an introduction to Power Map for geospatial data visualization:

Key Features of Power Map:

​ Interactive 3D Visualization:
● Power Map enables users to create dynamic 3D visualizations of geographic data on a globe or custom map within Excel.
● Users can navigate, rotate, and zoom in/out of the map to explore data from different angles and perspectives.

​ Geographic and Time-Based Analysis:
● Power Map supports mapping of geographical data such as country, region, city, latitude/longitude coordinates, etc.
● Users can also visualize time-based data by animating data points on the map to show changes over a specific period.

​ Data Exploration and Insights:
● Power Map allows users to plot multiple data points on the map and analyze relationships, patterns, and trends spatially.
● Users can interactively filter, drill down, and slice data to gain deeper insights into their geographical datasets.

​ Customization and Styling:
● Users can customize the appearance of the map by adjusting colors, markers, themes, and map styles to better represent their data and enhance visual clarity.
● Power Map provides options to add annotations, labels, and tooltips to data points for additional context and information.

​ Integration with Excel:
● Power Map seamlessly integrates with Excel, allowing users to leverage existing Excel data models, PivotTables, and charts for geospatial visualization.
● Users can easily create, edit, and update maps directly within Excel without the need for additional software or complex configurations.

How to Use Power Map:

​ Enable Power Map:
● To use Power Map, ensure that you have the appropriate version of Excel (2013 or later) and that the Power Map add-in is enabled. You can enable the add-in from the Excel Options menu.

​ Prepare Data:
● Organize your data in Excel with geographic information such as country, region, city, or latitude/longitude coordinates.
● Power Map works best with tabular data containing columns for location and numeric values that you want to visualize.

​ Launch Power Map:
● In Excel, go to the "Insert" tab and click on "Map" (or "3D Map" in older versions) to launch Power Map.
● Import your data into Power Map by selecting the appropriate Excel table or range containing your data.

​ Map Configuration:
● Configure the map by selecting the geographic fields for location plotting and the numeric fields for data visualization.
● Customize the map appearance, including colors, markers, themes, and map styles to enhance visual clarity and appeal.

​ Analyze and Interact:
● Explore your data on the map by navigating, rotating, and zooming in/out to visualize patterns and trends spatially.
● Use interactive features such as filtering, drilling down, and animating data points to gain deeper insights into your data.

​ Create Storytelling Tours:
● Power Map allows users to create storytelling tours by capturing key perspectives, insights, and narratives from their data visualizations.
● Users can record and share tours to communicate findings, presentations, or reports effectively.

Use Cases for Power Map:

● Sales and Marketing: Visualize sales performance, customer locations, and market penetration geographically to identify target areas and opportunities.
● Supply Chain and Logistics: Analyze distribution networks, transportation routes, and inventory levels to optimize supply chain operations.
● Demographic Analysis: Map demographic data such as population density, income levels, and age distributions for sociodemographic insights.
● Healthcare and Epidemiology: Plot disease outbreaks, healthcare facilities, and patient demographics to track and respond to public health issues.

In summary, Power Map is a valuable tool for geospatial data visualization in Excel, offering interactive 3D mapping capabilities to analyze and communicate insights from geographical datasets effectively. With its intuitive interface and powerful features, Power Map enables users to explore, analyze, and present their data in visually compelling ways.

8.2 Creating interactive 3D geospatial visualization using power map -

Creating interactive 3D geospatial visualizations using Power Map in Excel allows you to explore and analyze geographic data in a dynamic and immersive way. Here's a step-by-step guide on how to create interactive 3D geospatial visualizations using Power Map:

1. Prepare Your Data:

​ Organize Data: Ensure that your data is organized in Excel with columns containing geographic information such as country, region, city, or latitude/longitude coordinates, along with numeric values you want to visualize.
​ Clean and Format Data: Clean up any inconsistencies or errors in your data and format it appropriately to ensure accurate visualization.

2. Enable Power Map:

​ Open Excel: Launch Microsoft Excel and open the workbook containing your data.
​ Enable Power Map Add-In: If you haven't already enabled the Power Map add-in, go to the "File" menu, select "Options," then choose "Add-Ins." From the "Manage" dropdown, select "COM Add-ins," then click "Go" and check the "Microsoft Power Map for Excel" box to enable it.

3. Launch Power Map:

​ Insert Power Map: With your data selected, go to the "Insert" tab in Excel and click on "3D Map" (or "Map" depending on your Excel version) in the "Tours" group.
​ Launch Power Map: Power Map will launch, and a new window will appear containing a blank map.

4. Import Data into Power Map:

​ Start a New Tour: In the Power Map window, click on "Home" and then click on "New Tour."
​ Select Data: Select the Excel table or range containing your data. Power Map will automatically detect the geographic and numeric fields in your data.

5. Customize Your Map:

​ Choose Geography: Select the geographic fields you want to visualize on the map, such as country, region, city, or latitude/longitude coordinates.
​ Choose Data: Select the numeric fields you want to represent visually on the map, such as sales revenue, population density, or temperature.
​ Customize Appearance: Customize the appearance of your map by adjusting colors, markers, themes, and map styles to enhance visual clarity and appeal.

6. Interact with Your Map:

​ Navigate: Use the navigation controls to pan, zoom, and rotate the map to explore different regions and perspectives.
​ Filter Data: Apply filters to focus on specific subsets of your data or to analyze data based on different criteria.
​ Analyze Trends: Use the timeline slider to visualize time-based data and analyze trends over time.

7. Create Storytelling Tours:

​ Record Tour: Click on the "Tours" tab and then click on "New Scene" to start recording a tour. Navigate around the map to capture different viewpoints and insights.
​ Add Annotations: Add annotations, titles, and descriptions to your scenes to provide context and narrative to your tour.
​ Play Tour: Click on "Play Tour" to play back your tour and showcase your interactive 3D geospatial visualization to others.

8. Share Your Visualization:

​ Save Tour: Save your Power Map tour within your Excel workbook for future reference or sharing.
​ Export Images or Videos: Export images or videos of your visualization to share with others or to include in presentations or reports.
​ Share Workbook: Share your Excel workbook containing the Power Map visualization with others, ensuring they have the Power Map add-in enabled to view and interact with the visualization.

By following these steps, you can create interactive 3D geospatial visualizations using Power Map in Excel, enabling you to explore and analyze geographic data in a dynamic and immersive way.

8.3 Customizing map layers, chart types, and visual elements :

In Excel, you can customize map layers, chart types, and visual elements to create more impactful and informative data presentations. Here's a step-by-step guide on how to do this:

Customizing Map Layers:

​ Insert a Map Chart:
● Go to the "Insert" tab on the Excel ribbon.
● Click on "Maps" in the Charts group.
● Choose the type of map chart you want to create.

​ Select Data:
● Once the map chart is inserted, you can select the data you want to visualize on the map.
● Click on the map chart to select it.
● Go to the "Design" tab that appears when the chart is selected.
● Click on "Select Data" in the Data group.
● Add or edit the data series as required.

​ Customize Map Layer:
● Click on the map chart to select it.
● Go to the "Design" tab.
● Click on "Map Chart Options" in the Type group.
● In the Format Map pane, you can customize various aspects such as map projection, map style, color scheme, and data labels.

Customizing Chart Types:

​ Insert a Chart:
● Select the data you want to visualize.
● Go to the "Insert" tab on the Excel ribbon.
● Choose the type of chart you want to create from the Charts group.

​ Select Data:
● Click on the chart to select it.
● Go to the "Design" tab.
● Click on "Select Data" in the Data group.
● Add or edit the data series as required.

​ Change Chart Type:
● With the chart selected, go to the "Design" tab.
● Click on "Change Chart Type" in the Type group.
● Select the desired chart type from the list of options.
● Customize further by adjusting specific chart options like axes, titles, legend, etc.

Customizing Visual Elements:

​ Modify Chart Elements:
● Click on the chart to select it.
● Use the Chart Elements button (plus sign icon) that appears next to the chart to add or remove chart elements like axes, titles, data labels, and gridlines.

​ Format Chart Elements:
● Right-click on any chart element you want to format.
● Choose the "Format [element]" option from the context menu.
● Customize the appearance of the selected element using the formatting options that appear in the Format pane or Format dialog box.

​ Apply Styles and Themes:
● Go to the "Design" tab.
● Explore various chart styles and themes available in the Chart Styles and Chart Themes groups to apply a predefined set of formatting options to your chart.

By following these steps, you can effectively customize map layers, chart types, and visual elements in Excel to create visually appealing and insightful data presentations.

9. Advanced Hypothesis Testing Techniques :

9.1) Introduction to Advanced Hypothesis Testing Techniques: F Test, Mann-Whitney U Test, and Kruskal-Wallis Test -

Hypothesis testing is a fundamental statistical technique used to make inferences about population parameters based on sample data. While traditional hypothesis tests like t-tests and ANOVA are widely used, there are advanced techniques available for situations where the assumptions of these tests are violated or when dealing with non-parametric data. Three such advanced techniques are the F test, Mann-Whitney U test, and Kruskal-Wallis test.

​ F Test:
● The F test is a statistical test used to compare the variances of two or more populations.
● It is commonly used in analysis of variance (ANOVA) to test the equality of means across multiple groups.
● The null hypothesis of the F test states that the variances of the populations are equal.
● The F test calculates the ratio of variances between groups to variances within groups. If this ratio is sufficiently large, it suggests that the variances between groups are significantly different.
● The F test is sensitive to departures from normality; when the assumptions of normality and homogeneity of variances are violated, non-parametric alternatives such as the tests below are generally preferred.
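
For the two-sample case, Excel offers a built-in worksheet function. A sketch assuming the two samples sit in A2:A20 and B2:B20 (illustrative ranges):

=F.TEST(A2:A20, B2:B20)

This returns the two-tailed p-value for the hypothesis that the two samples have equal variances.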

​ Mann-Whitney U Test:

● The Mann-Whitney U test, also known as the Wilcoxon rank-sum test, is a non-parametric test used to compare the distributions of two independent samples.
● It is used when the assumptions of parametric tests like the t-test cannot be met, such as when the data are not normally distributed or when dealing with ordinal or ranked data.
● The Mann-Whitney U test ranks all the observations from both samples together, then compares the sums of ranks for each group.
● The null hypothesis of the Mann-Whitney U test states that there is no difference between the distributions of the two groups.
● If the p-value calculated from the test is less than the chosen significance level, the null hypothesis is rejected, indicating a significant difference between the distributions of the two groups.

​ Kruskal-Wallis Test:
● The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA, used to compare the distributions of three or more independent samples.
● It is used when the assumptions of ANOVA, such as normality and homogeneity of variances, are violated.
● The Kruskal-Wallis test ranks all the observations from all groups together, then compares the average ranks across groups.
● The null hypothesis of the Kruskal-Wallis test states that there is no difference between the distributions of the groups.
● If the p-value calculated from the test is less than the chosen significance level, the null hypothesis is rejected, indicating a significant difference between at least two of the groups.

In summary, the F test, Mann-Whitney U test, and Kruskal-Wallis test are advanced hypothesis testing techniques that provide alternatives to traditional parametric tests in situations where the underlying assumptions cannot be met or when dealing with non-parametric data. These tests allow researchers to make valid statistical inferences and draw conclusions from a wide range of data types and study designs.

9.2) Implementing advanced hypothesis tests in excel using custom formulas or add-ins -

Implementing advanced hypothesis tests like the Mann-Whitney U test and Kruskal-Wallis test in Excel can be achieved using custom formulas or add-ins. Here's a general overview of how you can implement these tests using custom formulas and add-ins:

Mann-Whitney U Test:
The Mann-Whitney U test is a non-parametric test used to compare two independent groups.

Using Custom Formulas:

You can create custom Excel formulas to compute the Mann-Whitney U statistic and p-value. Here's a simplified outline of how you might approach it (with a formula sketch after the steps):

​ Rank the combined data from both groups.
​ Calculate the sum of ranks for each group.
​ Use the formula to compute the U statistic.
​ Use the U statistic to calculate the p-value using the appropriate distribution (normal approximation or exact calculation).
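
A minimal sketch under the normal approximation, assuming group 1 is in A2:A21 (n1 = 20), group 2 in B2:B21 (n2 = 20), and the combined data stacked in C2:C41 (all illustrative ranges):

=SUMPRODUCT(RANK.AVG(A2:A21, $C$2:$C$41, 1)) - 20*21/2

This returns U for group 1 (its rank sum minus n1(n1+1)/2); in older Excel versions the formula may need to be entered as an array formula (Ctrl+Shift+Enter). If U is placed in E1, the approximate two-tailed p-value is:

=2*(1 - NORM.S.DIST(ABS((E1 - 20*20/2) / SQRT(20*20*(20+20+1)/12)), TRUE))

Note that this approximation ignores tie corrections and is only reasonable for moderately large samples.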

Using Add-Ins:

There are Excel add-ins available that can perform the Mann-Whitney U test. You can search for add-ins specifically designed for statistical analysis in Excel. These add-ins often provide user-friendly interfaces to input data and interpret results.

Kruskal-Wallis Test:
The Kruskal-Wallis test is a non-parametric test used to compare more than two independent groups.

Using Custom Formulas:

Implementing the Kruskal-Wallis test using custom formulas in Excel can be more complex due to its multi-step computation. You would need to (see the sketch after the steps):

​ Rank the combined data from all groups.
​ Calculate the sum of ranks for each group.
​ Use the ranks to compute the Kruskal-Wallis H statistic.
​ Calculate the p-value using the chi-squared distribution.
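
With k groups, N total observations, and R_j the rank sum of group j, the H statistic is H = 12/(N(N+1)) × Σ(R_j²/n_j) − 3(N+1). A minimal sketch assuming three groups of 10 observations each (N = 30) with the rank sums already computed in E1:E3 (illustrative cells):

=12/(30*31)*SUMPRODUCT(E1:E3^2/10) - 3*31

If H is placed in F1, the p-value with k − 1 = 2 degrees of freedom is:

=CHISQ.DIST.RT(F1, 2)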

Using Add-Ins:
Similar to the Mann-Whitney U test, you can search for Excel add-ins that offer the Kruskal-Wallis test functionality. These add-ins can automate the computation and provide result interpretation.

Custom Formula Add-In:

To implement these tests efficiently in Excel, you might consider developing a custom formula add-in. This involves writing VBA (Visual Basic for Applications) code to create custom functions that perform the calculations for the Mann-Whitney U test, Kruskal-Wallis test, or other advanced hypothesis tests. Once created, these functions can be easily accessed and used like built-in Excel functions.

When developing a custom formula add-in, ensure it's properly tested and validated to ensure accurate results. Additionally, provide clear documentation and instructions for users on how to install and utilize the add-in effectively.

Keep in mind that while implementing these tests in Excel can be useful for quick analyses, it's essential to verify the accuracy of results and understand the assumptions and limitations of each test. For critical or complex analyses, consider using dedicated statistical software or programming languages like R or Python.
