Business Analytics Using Excel Notes
Business analytics is a field that leverages data analysis, statistical methods, and predictive
modelling to extract valuable insights from data and help organizations make informed
business decisions. It involves the use of various techniques and tools to interpret data,
identify trends, and support strategic planning.
Definition:
Business analytics involves the use of statistical analysis, predictive modelling, and
quantitative analysis to understand and interpret data. It aims to provide insights that can
drive business planning and strategy.
Common Tools and Technologies:
1. Statistical Software: Tools like R and Python with libraries such as Pandas and
NumPy are commonly used for statistical analysis.
2. Business Intelligence (BI) Tools: Platforms like Tableau, Power BI, and Qlik enable
visualisation and reporting of data.
3. Machine Learning Tools: Platforms like TensorFlow and scikit-learn are used for
building and deploying machine learning models.
The evolution and scope of business analytics have undergone significant changes over the
years, driven by advancements in technology, the increasing availability of data, and the
growing need for organizations to make data-driven decisions. Here's an overview of the
evolution and scope of business analytics:
1. Descriptive Analytics:
Scope: Understanding historical data to describe and summarize what has happened in the
past.
Applications: Reporting, data visualisation, key performance indicators (KPIs).
2. Diagnostic Analytics:
Scope: Identifying reasons for past outcomes by analyzing historical data.
Applications: Root cause analysis, identifying patterns and trends.
3. Predictive Analytics:
Scope: Forecasting future trends and outcomes based on historical data and statistical
algorithms.
Applications: Demand forecasting, sales prediction, risk assessment.
4. Prescriptive Analytics:
Scope: Recommending actions to optimize decision-making and achieve desired outcomes.
Applications: Decision optimization, scenario planning, resource allocation.
5. Advanced Analytics:
Scope: Leveraging sophisticated techniques, including machine learning and artificial
intelligence, for complex problem-solving.
Applications: Predictive modeling, clustering, anomaly detection.
6. Real-time Analytics:
Scope: Analyzing data as it is generated to make immediate, data-driven decisions.
Applications: Monitoring and responding to live events, dynamic pricing.
Types of Business Data
1. Transactional Data:
Definition: This type of data captures information about transactions that occur within a
business. It includes details about sales, purchases, orders, and other financial transactions.
Example: Sales transactions, purchase orders, invoices.
2. Customer Data:
Definition: Information about customers and their behavior is crucial for understanding
market trends, improving customer experiences, and making targeted marketing efforts.
Example: Customer demographics, purchase history, customer feedback.
3. Financial Data:
Definition: Financial data provides insights into a company's fiscal health. It includes income
statements, balance sheets, cash flow statements, and other financial reports.
Example: Profit and loss statements, balance sheets, cash flow statements.
4. Operational Data:
Definition: This type of data is related to the day-to-day operations of a business. It helps in
monitoring and improving efficiency.
Example: Production data, inventory levels, supply chain information.
5. Social Media Data:
Definition: Social media data offers valuable insights into customer sentiment, brand
perception, and trends in the market.
Example: Likes, shares, comments, mentions on platforms like Facebook, Twitter, Instagram.
6. Website Analytics:
Definition: Data from website analytics tools helps in understanding online user behavior,
website performance, and the effectiveness of online marketing efforts.
Example: Website traffic, click-through rates, conversion rates.
7. Employee Data:
Definition: Information about employees can be useful for workforce management,
performance evaluation, and strategic planning.
Example: Employee demographics, performance reviews, training records.
Decision models
Decision models are a fundamental part of the decision-making process in business analytics.
They are broadly categorized into three types: descriptive models, predictive models, and
prescriptive models. Each type serves a distinct purpose in helping organizations understand,
predict, and optimize their decision-making processes.
1. Descriptive Models:
Descriptive models are used to describe and summarize historical data. They help in
understanding what has happened in the past, identifying patterns, and gaining insights into
the current state of affairs.
Characteristics:
● Focuses on historical data analysis.
● Provides a snapshot of the current situation.
● Does not make predictions or recommendations.
Examples:
Reports and Dashboards: Presenting key performance indicators (KPIs) and historical trends.
Data Visualization: Charts, graphs, and other visual representations of data.
2. Predictive Models:
Predictive models are designed to forecast future outcomes based on historical data and
patterns. They use statistical algorithms and machine learning techniques to make predictions
or estimations.
Characteristics:
● Involves analyzing historical data to make future predictions.
● Requires algorithms and statistical methods.
● Provides insights into potential future scenarios.
Examples:
Regression Analysis: Predicting numerical values based on historical data.
Machine Learning Models: Predicting customer churn, sales forecasts, demand prediction.
3. Prescriptive Models:
Prescriptive models go beyond predicting outcomes; they recommend actions or strategies to
optimise decision-making. These models help organisations make better choices by
suggesting the most favourable course of action.
Characteristics:
● Involves recommending actions to achieve desired outcomes.
● Utilises optimization algorithms and decision analysis.
● Provides insights into the best course of action.
Examples:
Optimization Models: Determining the most cost-effective production plan.
Decision Trees: Mapping out decision paths based on potential outcomes.
Simulation Models: Analysing different scenarios to identify optimal strategies.
Spreadsheet
Spreadsheet software is a type of application software that allows users to organize, analyze,
and manipulate data in tabular form. Spreadsheets consist of cells arranged in rows and
columns, and each cell can contain text, numbers, formulas, or functions.
BASIC EXCEL
3. Basic Formulas:
Sum: =SUM(A1:A10) adds up the values in cells A1 to A10.
Average: =AVERAGE(B1:B5) calculates the average of values in cells B1 to B5.
Basic arithmetic operations: +, -, *, /.
4. AutoFill:
Drag the fill handle (a small square at the bottom-right corner of a cell) to copy data or
formulas to adjacent cells.
5. Formatting:
Format cells for currency, percentage, date, etc.
Change font style, size, and color.
Apply cell borders and background color.
6. Cell References:
Relative references (e.g., A1): Adjust when copied to other cells.
Absolute references (e.g., $A$1): Do not change when copied.
8. Charts:
Create basic charts (bar, line, pie) to visualize data.
Use the Chart Wizard or the Ribbon to customize charts.
9. Basic Functions:
=IF(condition, value_if_true, value_if_false): Conditional statements.
=VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]): Look up values in
a table.
11. PivotTables:
Summarize and analyze large datasets.
Easily rearrange and view data in different ways.
12. Find and Replace:
Search for specific data within the spreadsheet.
Replace values with new ones.
Creating a spreadsheet model involves using Excel functions and features to organize,
analyze, and present data in a structured manner: lay out the input data, build formulas that
link inputs to outputs, and format the results so they are easy to read and interpret.
Visual Revolution
"A Visual Revolution" can refer to a transformative change in the way visual information is
created, consumed, and understood across various fields and industries. This revolution
encompasses several key trends and developments:
3. Visual Storytelling: With the rise of social media and digital platforms, visual
storytelling has become increasingly popular as a way to capture and retain audience
attention. From videos and animations to interactive graphics and virtual reality
experiences, visual storytelling techniques are being employed to convey narratives,
evoke emotions, and drive engagement across various media channels.
5. User Interface (UI) and User Experience (UX) Design: In the realm of digital
design, there has been a growing focus on creating visually appealing and intuitive
user interfaces (UI) and user experiences (UX). Visual elements such as color,
typography, layout, and imagery are carefully crafted to enhance usability,
accessibility, and overall user satisfaction.
Science of Storytelling
The science of storytelling is an interdisciplinary field that examines the cognitive, emotional,
and social mechanisms underlying the creation, reception, and impact of narratives. It draws
from various fields such as psychology, neuroscience, linguistics, anthropology, and literary
theory to understand how stories shape human cognition, behavior, and culture. Here are
some key aspects:
3. Narrative Structure and Elements: Scholars analyze the structure and components
of narratives, such as plot, characters, setting, theme, and point of view, to understand
how they create meaning and evoke emotional responses in audiences.
5. Cultural and Social Influence: Cultural theorists examine how storytelling practices
vary across cultures and historical periods, as well as how narratives reflect and shape
societal norms, values, and identities. Storytelling can be a powerful tool for
promoting social change, fostering empathy, and building community.
6. Digital Storytelling: With the advent of digital media, researchers explore how
storytelling is evolving in the digital age, including the impact of multimedia formats,
interactive storytelling, and virtual reality on narrative experiences.
Common Chart Types
Bar Chart: A bar chart is a graphical representation of data in which rectangular bars of
varying lengths are used to represent different categories or values. Bar charts are commonly
used to compare discrete categories or show changes over time.
Line Graph: A line graph displays data points connected by straight lines, typically used to
illustrate trends or changes in data over a continuous interval, such as time.
Pie Chart: A pie chart divides a circle into sectors, with each sector representing a proportion
of a whole. Pie charts are useful for showing the relative sizes of different categories within a
dataset.
Scatter Plot: A scatter plot displays individual data points as dots on a two-dimensional
coordinate system, with one variable plotted on the x-axis and another variable plotted on the
y-axis. Scatter plots are used to visualize the relationship between two variables and identify
patterns or correlations.
Box Plot: A box plot, also known as a box-and-whisker plot, is a graphical summary of a
dataset's distribution, displaying the median, quartiles, and outliers. Box plots are useful for
comparing distributions and identifying potential outliers or anomalies.
4. Narrative Journalism: Journalists often use small personal data as the foundation for
narrative-driven storytelling, weaving together individual stories, quotes, and
observations to construct compelling narratives that shed light on larger social,
political, or cultural issues. By focusing on the human element, narrative journalism
brings statistics and data to life, engaging readers and prompting reflection.
6. Ethical Considerations: It's essential to approach the use of small personal data
ethically and responsibly, respecting individuals' privacy, consent, and autonomy.
Researchers, journalists, and organizations must obtain informed consent from
participants and ensure that their data is anonymized and protected to prevent harm or
exploitation.
CHAPTER 3
Getting Started with Tableau
Tableau is a powerful and widely-used data visualisation and business intelligence software
that allows users to connect, analyse, and visualise data in an interactive way.
It is developed by Tableau Software and is designed to help individuals and organisations
make sense of their data by turning complex datasets into actionable insights.
It empowers individuals and organizations to transform raw data into meaningful insights,
aiding decision-making and communication. Here's an introduction and overview of Tableau:
2. Visual Analytics:
Tableau allows users to create a variety of visualizations, such as charts, graphs, maps, and
dashboards. The drag-and-drop interface makes it easy to build visualizations without
requiring extensive coding skills.
3. Interactive Dashboards:
Users can combine multiple visualizations into interactive dashboards. Dashboards enable
users to explore data, apply filters, and see dynamic changes in real time.
4. Ad Hoc Analysis:
Tableau supports ad hoc analysis, allowing users to explore data freely and ask ad hoc
questions without predefined reports.
9. Data Storytelling:
Tableau helps users create compelling data stories by arranging visualizations in a sequence
that guides viewers through insights.
Tableau Products:
1. Tableau Desktop: The primary authoring and design tool used to create visualizations
and dashboards.
2. Tableau Server: A platform for sharing, collaborating, and managing Tableau content
within an organization.
4. Tableau Public: A free version of Tableau that allows users to create and share
visualizations publicly on the web.
5. Tableau Prep: A data preparation tool that helps users clean, shape, and combine data
before analysis.
Data preparation
Data preparation is a crucial step in the data analysis process, and Tableau provides a variety
of tools and features within its workspace to help you prepare and shape your data for
analysis.
1. Connect to Data:
Open Tableau Desktop and connect to your data source. Tableau supports various data
sources, including databases, spreadsheets, web data connectors, and more.
After connecting to your data source, Tableau loads the data into the Data Source tab. Here,
you can perform various data preparation tasks:
3. Data Cleaning: Identify and handle missing values, duplicates, and outliers.
4. Data Filtering: Filter rows based on specific criteria to exclude or include data.
5. Data Joining and Blending: Combine multiple data sources or tables using common
fields. Use the "Data" menu to create relationships between tables.
6. Data Aggregation: Create aggregated measures using functions like SUM, AVG,
COUNT, etc., to summarize data at different levels of granularity.
7. Pivoting and Unpivoting: Change the structure of your data by pivoting columns
into rows or un-pivoting rows into columns.
8. Data Splitting and Combining: Use string functions to split or combine text data as
needed.
9. Data Source Calculation: Create calculated fields at the data source level using SQL
or other data source-specific expressions.
10. Data Preparation in Data Source Tab: These transformations are applied in the Data
Source tab of Tableau. You can see the changes in the Data Source tab, and they will
be reflected in your worksheets and dashboards.
11. Extracts and Live Connections: Choose whether to create an extract (a snapshot of
the data) or use a live connection to your data source. Extracts can be used for
performance optimization and offline work.
Tableau Workspace
Tableau provides a versatile workspace that encompasses a wide range of capabilities and
tools for data analysis, visualization, and reporting. The scope of Tableau's workspace is
extensive and covers various aspects of data analytics and visualization. Here are the key
components and the scope of Tableau's workspace:
2. Data Preparation:
Tableau provides tools for cleaning, transforming, and reshaping data to prepare it for
analysis.
You can handle missing values, remove duplicates, create calculated fields, pivot, and
unpivot data, among other data preparation tasks.
8. Map Integration:
Tableau provides robust mapping capabilities, allowing you to create custom maps, plot
geographic data, and perform spatial analysis.
14. Extensibility:
Tableau's workspace can be extended with custom calculations, scripts, and external data
sources.
Measures:
Definition: Measures are quantitative, numeric data fields that can be counted or aggregated, such as sales, profit, or quantity.
Aggregation: Measures can be aggregated using functions like SUM, AVG (average),
COUNT, MIN (minimum), MAX (maximum), and more. Aggregation allows you to
summarize and analyze numeric data.
Usage: Measures are typically used for creating charts that involve calculations or
aggregations, such as bar charts, line charts, scatter plots, and maps.
Examples:
Creating a bar chart to show total sales by product category.
Calculating the average temperature for a specific region.
Summarizing the total profit across multiple regions.
Dimensions:
Definition: Dimensions are categorical data fields that represent qualitative or categorical
attributes. Examples include product categories, geographic regions, dates, or customer
names.
Granularity: Dimensions determine the level of detail in your data. They define the
categories or groups into which your data can be organized.
Usage: Dimensions are used to segment, categorize, or group data in your visualizations.
They are often used for creating filters, grouping data, and defining hierarchies.
Examples:
Grouping sales data by product category to create a bar chart.
Using date dimensions to create a time series line chart.
Creating a geographic map by using location dimensions.
Working with Measures and Dimensions in Tableau:
Continuous:
● Continuous fields contain data that can take any value within a range.
● They are represented with a green pill in Tableau.
● Continuous fields can be aggregated and support continuous axes in visualizations.
● Examples include Age, Temperature, and Time.
● When used in a view, continuous fields create an axis.
Discrete:
● Discrete fields contain data that take on distinct, separate values.
● They are represented with a blue pill in Tableau.
● Discrete fields segment the data into distinct groups or categories.
● Examples include Year, Product Category, and Country.
● When used in a view, discrete fields create headers.
NOTE
● Aggregate Measures: When you add a measure to a visualization, Tableau
automatically aggregates it. You can control the aggregation method by clicking on
the measure in the visualization and choosing an aggregation function.
● Grouping: Dimensions can be used to group data points together. Right-click on a
dimension and choose "Create Group" to create custom groupings.
● Hierarchies: You can create hierarchies by combining dimensions. For example, you
can create a date hierarchy with levels like year, quarter, month, and day.
● Filtering: Use dimensions to create filters that allow users to interactively filter data in
a visualization.
● Calculated Fields: You can create calculated fields that use both measures and
dimensions to perform custom calculations.
● Visualization Types: The type of visualization you choose depends on the
combination of measures and dimensions you use. Different chart types are suitable
for different data structures.
● Dashboard Interaction: You can create interactive dashboards by combining multiple
visualizations, allowing users to explore data by interacting with measures and
dimensions.
Tableau Workspace:
The Tableau workspace refers to the entire environment where you work on your data
analysis and visualization projects.
It includes the application interface, various panes, menus, and toolbars that you use to
connect to data, create visualizations, and build dashboards.
The workspace provides a canvas for you to interact with your data and design visualizations.
Tableau Workbook:
A Tableau workbook is a file that contains one or more worksheets, dashboards, and stories.
It's essentially a container for your entire data analysis project.
Workbooks have a ".twb" extension for Tableau workbooks, which are XML-based files, or a
".twbx" extension for packaged workbooks, which include the data source along with the
workbook.
Workbooks allow you to save and organize your analysis, making it easy to revisit and edit
your visualizations.
Tableau Worksheet:
A Tableau worksheet is a single canvas where you create and design visualizations. Each
worksheet is part of a workbook.
Worksheets are where you connect to data sources, drag and drop dimensions and measures,
and build charts, graphs, and tables.
You can create multiple worksheets within a single workbook, each focused on a different
aspect of your analysis.
Saving and Sharing Workbooks
Create or Modify the Workbook: Start by creating a new workbook or making changes to an
existing one in Tableau Desktop.
Save Incremental Changes: After the initial save, you can use the "Save" option in the File
menu or simply press Ctrl + S (Cmd + S on Mac) to save incremental changes to the
workbook.
If your data is not sensitive, you can publish your workbook to Tableau Public, a free
platform for sharing Tableau visualizations with the public. This option generates a URL that
you can share.
2. Connect to Data: On the start page, you'll see options to connect to various data
sources. If you don't see the start page, you can click on "File" > "New" to create a
new workbook and then select "Data Source" from the Data menu.
4. To a File: Choose this option to connect to local files like Excel, CSV, JSON, or text
files.
7. Select Data Source: Depending on your choice, you'll need to select the specific data
source or file. For example, if you choose "To a File," you'll navigate to the location
of the file on your computer.
8. Connect to the Data: After selecting the data source, click the "Connect" button.
Tableau will establish a connection to the data source.
Once you've connected to your data source, you may need to configure data connectors based
on the type of source:
File Connection: If you're connecting to a file-based data source, Tableau will automatically
detect the format of the file. You may need to specify details such as the worksheet (in the
case of Excel), the sheet name, or the delimiter used in a CSV file.
Web Data Connector: If you're connecting to a web data connector, you'll need to provide
the URL of the connector, and Tableau will guide you through the setup process.
Custom Data Connector: In some cases, you might need to use a custom data connector,
especially for niche or specialized data sources. Tableau provides options for developing and
using custom connectors.
Data Source Configuration: After setting up the connection, you can configure additional
options, such as data filtering, joins, or custom calculations, depending on your analysis
needs.
Data Preview: Tableau typically provides a data preview feature that allows you to see a
sample of the data from the source to verify that the connection is correctly configured.
Save Data Source: Once you've configured the data connector and previewed the data, you
can save the data source for future use in your Tableau workbook.
Build Visualizations: With the data source set up, you can now start building visualizations
and dashboards using your data.
Types of Data
1. String (Text):
String data types represent text or alphanumeric characters.
Examples: Names, addresses, product descriptions, and any data stored as plain text.
2. Integer:
Integer data types represent whole numbers, either positive or negative.
Examples: Counts, quantities, years, and unique identifiers.
5. Boolean (Logical):
Boolean data types represent binary values, usually denoting "true" or "false" or "yes" or
"no."
Examples: Yes/no responses, binary outcomes, and logical conditions.
6. Geographic (Spatial):
Geographic data types represent spatial information such as latitude and longitude, shapes, or
maps.
Examples: Geographic coordinates, polygons, maps, and spatial geometries.
Metadata
In Tableau, metadata refers to the information and properties associated with your data
sources, tables, fields (columns), and other elements of your analysis. Metadata plays a
critical role in understanding and managing your data effectively. Here's an overview of how
metadata is used in Tableau:
Table Metadata:
For each table or data table in your data source, Tableau stores metadata about the table. This
metadata includes details about the table's name, columns, data types, and other properties.
You can access and view this metadata in the "Data Source" tab, which provides an overview
of all the tables in your data source.
Field Metadata:
Field metadata contains information about individual data fields (columns) within your
tables. This includes the field name, data type, aggregation method, and, in some cases, a
description.
You can edit field metadata to provide descriptions and custom aliases to make your data
more understandable to others.
Custom Metadata:
Tableau allows you to add custom metadata to your data sources, tables, and fields. This is
particularly useful for providing context or descriptions for the data.
Custom metadata can include notes, comments, and explanations that help others understand
the data, its source, or its intended use.
Adding Hierarchies:
Hierarchies in Tableau allow you to organize related dimensions into a structured, drill-down
format, making it easier to explore and analyze data at different levels of detail. Common
examples of hierarchies include date hierarchies (e.g., year > quarter > month) and
geographical hierarchies (e.g., country > state > city).
To add a hierarchy, drag one dimension onto another related dimension in the Data pane (or right-click a dimension and choose Hierarchy > Create Hierarchy), then drag additional dimensions into the hierarchy and order them from the highest to the lowest level of detail.
Table Calculations:
Table calculations in Tableau allow you to perform calculations on the results of a
visualization, taking into account the context of your view, such as filters, groups, and
sorting. Table calculations are especially useful for creating running totals, percent of total,
and other complex calculations.
1. Create a Table Calculation: In a worksheet, select the field or measure for which
you want to create the table calculation. Right-click on the selected field, and choose
"Quick Table Calculation" or "Create Table Calculation."
2. Choose the Calculation Type: Tableau provides various built-in calculation types
like running total, percent of total, moving average, and many more. Select the one
that fits your analysis needs.
3. Configure the Calculation: Configure the table calculation by specifying partitioning
and addressing options. These options determine how the calculation is applied within
the visualization.
4. Apply the Calculation: After configuring the table calculation, you can apply it to
your visualization by dragging it to the "Columns," "Rows," or "Marks" shelf as
needed.
Module 4
Descriptive Analytics
3. Data Cleaning:
Clean your data by handling missing values, removing duplicates, and addressing any other
data quality issues.
7. Interactive Elements:
Excel allows some level of interactivity. For example, you can use data validation lists to
allow users to select specific categories for charts or use slicers to filter data in PivotTables.
8. Creating PivotTables:
PivotTables are powerful tools in Excel for summarizing and analyzing data. They can be
used to group, filter, and analyze data dynamically.
9. Conditional Formatting:
Utilize conditional formatting to highlight specific data points or trends. This can help in
identifying patterns quickly.
12. Documentation:
As you explore and visualize data, consider creating a separate sheet or document to
document your findings, insights, and any notable observations.
Descriptive Statistics
1. Mean (Average):
The mean is the sum of all values in a dataset divided by the number of values.
2. Median:
The median is the middle value of a dataset when it is ordered. If the dataset has an even
number of values, the median is the average of the two middle values.
3. Mode:
The mode is the value that occurs most frequently in a dataset.
4. Range:
The range is the difference between the maximum and minimum values in a dataset.
Range = Maximum value − Minimum value
6. Variance:
Variance measures the average squared deviation of each data point from the mean.
7. Standard Deviation:
The standard deviation is the square root of the variance. It provides a measure of the average
deviation of each data point from the mean.
8. Coefficient of Variation (CV):
CV is the ratio of the standard deviation to the mean, expressed as a percentage. It is used to
compare the relative variability of datasets with different means.
9. Skewness:
Skewness measures the asymmetry of the distribution. A positive skewness indicates a
right-skewed distribution, while negative skewness indicates a left-skewed distribution.
10. Kurtosis:
Kurtosis measures the "tailedness" of the distribution. It indicates whether the tails of the
distribution are heavier or lighter than those of a normal distribution.
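The same summary measures can be computed programmatically as well. Below is a minimal illustrative sketch in Python using pandas (one of the statistical tools mentioned earlier in these notes); the small sales series is made-up example data, not a dataset from this course.
import pandas as pd
# Made-up example data: monthly sales figures
sales = pd.Series([12, 15, 15, 18, 21, 25, 40])
print("Mean:", sales.mean())                        # average value
print("Median:", sales.median())                    # middle value when ordered
print("Mode:", sales.mode().tolist())               # most frequent value(s)
print("Range:", sales.max() - sales.min())          # maximum minus minimum
print("Variance:", sales.var())                     # average squared deviation (sample)
print("Std deviation:", sales.std())                # square root of the variance
print("CV (%):", 100 * sales.std() / sales.mean())  # coefficient of variation
print("Skewness:", sales.skew())                    # asymmetry of the distribution
print("Kurtosis:", sales.kurt())                    # "tailedness" relative to a normal distribution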
Data Modelling:
Data modelling in statistics involves creating a mathematical representation or structure that
describes the relationships and patterns within a dataset. The goal is to develop a model that
captures the underlying structure of the data, allowing for analysis, prediction, and inference.
Statistical data modelling is a fundamental aspect of statistical analysis and is widely used in
various fields, including economics, biology, finance, and social sciences.
Probability distributions
Probability distributions play a crucial role in data modelling by providing a mathematical
framework to describe the likelihood of different outcomes in a given set of data.
Understanding and choosing the appropriate probability distribution for a dataset is essential
for accurate modelling and analysis. Here are some key probability distributions commonly
used in data modelling:
2. Binomial Distribution:
The binomial distribution models the number of successes in a fixed number of independent
and identically distributed (i.i.d.) Bernoulli trials.
It is characterized by two parameters: the number of trials (n) and the probability of success
on each trial (p).
Binomial distribution is suitable for scenarios like coin tosses or success/failure experiments.
3. Poisson Distribution:
The Poisson distribution models the number of events occurring in a fixed interval of time or
space.
It is characterized by a single parameter λ (lambda), which represents the average rate of
occurrence of the events.
Poisson distribution is often used for rare events or count data.
4. Exponential Distribution:
The exponential distribution models the time between independent and identically distributed
events occurring at a constant rate.
It is characterized by the parameter λ, which is the rate parameter.
Exponential distribution is commonly used in survival analysis and reliability studies.
5. Uniform Distribution:
The uniform distribution models a random variable with equal probability of taking any value
within a specified range.
It is characterized by the minimum and maximum values of the range.
Uniform distribution is often used in scenarios where all outcomes are equally likely.
6. Log-Normal Distribution:
The log-normal distribution is used to model positively skewed data whose logarithm is
normally distributed.
It is characterized by the parameters μ and σ of the corresponding normal distribution.
Log-normal distribution is commonly applied in finance and biology.
7. Gamma Distribution:
The gamma distribution is a generalisation of the exponential distribution and is often used to
model the waiting time until a Poisson process reaches a certain number of events.
It is characterised by two parameters: shape (k) and scale (θ).
Gamma distribution is versatile and can resemble a range of shapes depending on its
parameters.
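As an illustrative sketch of how these distributions can be evaluated programmatically, the snippet below uses scipy.stats in Python; all parameter values are arbitrary examples chosen for demonstration, not values from the notes.
from scipy import stats
# Binomial: probability of exactly 3 successes in n = 10 trials with p = 0.5
print(stats.binom.pmf(3, n=10, p=0.5))
# Poisson: probability of observing 2 events when the average rate is lambda = 4
print(stats.poisson.pmf(2, mu=4))
# Exponential: P(waiting time <= 1) when the rate lambda = 2 (scale = 1/lambda)
print(stats.expon.cdf(1, scale=1 / 2))
# Uniform: P(X <= 7) for a uniform variable on the range [5, 10] (loc=5, scale=5)
print(stats.uniform.cdf(7, loc=5, scale=5))
# Log-normal: random draws; s is the sigma of the underlying normal distribution
print(stats.lognorm.rvs(s=0.5, size=3))
# Gamma: random draws with shape a = 2 and scale = 3
print(stats.gamma.rvs(a=2, scale=3, size=3))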
Inferential Statistics
Sampling techniques
Sampling techniques are methods used to select a subset of individuals or items from a larger
population in order to make inferences about the entire population. Different sampling
techniques have specific advantages and are chosen based on the characteristics of the
population and the goals of the study. Here are several common sampling techniques:
1. Random Sampling:
In random sampling, every individual or item in the population has an equal chance of being
selected. This is achieved through randomization techniques such as random number
generators or drawing names from a hat.
Advantages: Unbiased representation of the population, eliminates selection bias.
Disadvantages: May be logistically challenging, especially in large populations.
2. Stratified Sampling:
The population is divided into subgroups or strata based on certain characteristics, and then
samples are randomly selected from each stratum.
Advantages: Ensures representation from different strata, useful when certain subgroups are
of particular interest.
Disadvantages: Requires knowledge of population characteristics, and the process can be
complex.
3. Systematic Sampling:
Every nth individual or item is selected from a list after an initial random start. For example,
every 10th person on a list is chosen after randomly selecting a starting point between 1 and
10.
Advantages: Simple and easy to implement, suitable for large populations.
Disadvantages: Vulnerable to periodic patterns in the list, may lead to biased samples if
there's a periodicity in the population.
4. Cluster Sampling:
The population is divided into clusters, and entire clusters are randomly selected. Then, all
individuals within the selected clusters are included in the sample.
Advantages: Cost-effective, especially when clusters are naturally occurring.
Disadvantages: May introduce variability within clusters, not suitable for populations with a
homogeneous distribution.
5. Convenience Sampling:
Individuals are selected based on their availability and accessibility to the researcher. This
method is often used for practical reasons rather than representativeness.
Advantages: Quick and easy to implement.
Disadvantages: Prone to selection bias, results may not be generalizable to the entire
population.
6. Quota Sampling:
The researcher identifies specific quotas based on certain characteristics (e.g., age, gender,
occupation) and then fills these quotas with individuals who meet the criteria.
Advantages: Useful when certain characteristics are essential for the study.
Disadvantages: May not fully represent the population, and there is still potential for bias.
7. Purposive Sampling:
The researcher deliberately selects individuals or items based on specific criteria relevant to
the study's objectives.
Advantages: Useful for in-depth studies, especially when specific expertise is needed.
Disadvantages: Not suitable for making generalizations about the entire population.
8. Snowball Sampling:
Existing participants refer the researcher to other potential participants. This method is often
used in studies where the population is hard to reach.
Advantages: Useful when the population is not easily accessible.
Disadvantages: May lead to biased samples if the initial participants share similar
characteristics.
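To make the first three techniques concrete, here is a minimal Python/pandas sketch; the population DataFrame, its columns, and the sample sizes are all made up for illustration.
import pandas as pd
# Made-up population of 100 customers with a region attribute
population = pd.DataFrame({
    "customer_id": range(1, 101),
    "region": ["North", "South", "East", "West"] * 25,
})
# 1. Simple random sampling: every row has an equal chance of being selected
random_sample = population.sample(n=20, random_state=42)
# 2. Stratified sampling: draw the same fraction from each region (stratum)
stratified_sample = population.groupby("region", group_keys=False).apply(
    lambda stratum: stratum.sample(frac=0.2, random_state=42)
)
# 3. Systematic sampling: every 10th row after a random start between 0 and 9
start = 3
systematic_sample = population.iloc[start::10]
print(len(random_sample), len(stratified_sample), len(systematic_sample))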
Using the Excel Data Analysis add-in for estimation and hypothesis testing
b. Regression Analysis:
● Go to the "Data Analysis" group on the "Data" tab.
● Click on "Data Analysis" and select "Regression."
● Choose the input and output ranges, and select the desired options (e.g., labels,
residuals).
● Click "OK" to generate the regression analysis output.
c. t-Test (Two-Sample Assuming Equal Variances):
● Go to the "Data Analysis" group on the "Data" tab.
● Click on "Data Analysis" and select "t-Test: Two-Sample Assuming Equal Variances."
● Choose the input ranges for your two samples and specify where you want the output
to be placed.
● Set the significance level and choose options for output.
● Click "OK" to perform the t-test.
d. z-Test:
● Go to the "Data Analysis" group on the "Data" tab.
● Click on "Data Analysis" and select "z-Test: Two Sample for Means."
● Choose the input range for your data and specify where you want the output to be
placed.
● Set the significance level and choose options for output.
● Click "OK" to perform the Z-test.
Interpretation:
Review the output generated by Excel to interpret the results of the analysis.
Pay attention to p-values and confidence intervals for hypothesis testing and estimation.
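For reference, the same two-sample t-test (assuming equal variances) can be reproduced outside Excel. The sketch below uses scipy in Python with made-up sample values; it illustrates the statistic the add-in computes rather than the add-in itself.
from scipy import stats
# Made-up example data: two independent samples
sample_a = [23, 25, 28, 30, 27, 26]
sample_b = [31, 29, 35, 33, 30, 32]
# Two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=True)
print("t statistic:", t_stat)
print("p-value:", p_value)
# Compare the p-value with the chosen significance level (e.g., 0.05)
if p_value < 0.05:
    print("Reject the null hypothesis of equal means.")
else:
    print("Fail to reject the null hypothesis.")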
CHAPTER 5
Predictive Analytics
Predictive analytics is an advanced branch of data analysis that involves the use of statistical
algorithms, machine learning techniques, and modeling to identify the likelihood of future
outcomes based on historical data.
The goal is to make predictions and inform decision-making by uncovering patterns, trends,
and relationships within the data. Predictive analytics is widely applied across various
industries for purposes such as risk assessment, customer relationship management, financial
forecasting, and operational optimization.
Businesses can anticipate future trends, identify opportunities, and mitigate risks, ultimately
driving sustainable growth and success.
1. Hypothesis Testing: The most common approach involves testing hypotheses about
the regression coefficients. The null hypothesis typically states that the coefficient is
equal to zero, indicating no effect of the independent variable on the dependent
variable. The alternative hypothesis suggests that the coefficient is not equal to zero,
indicating a significant effect.
2. P-values: P-values indicate the probability of observing the estimated coefficient (or
one more extreme) under the null hypothesis of no effect. A small p-value (typically
less than 0.05) suggests that the coefficient is statistically significant, providing
evidence against the null hypothesis.
Multicollinearity
In regression, when two or more independent variables are strongly related (a situation known
as multicollinearity), the model gets confused about which one is actually affecting the
outcome. This confusion can lead to
unreliable estimates of how each variable affects the outcome (the regression coefficients),
making it hard to trust the results of the analysis.
Prevention:
Multicollinearity can often be avoided by carefully selecting independent variables for
inclusion in the regression model. Prior knowledge of the data and theoretical considerations
should guide variable selection to minimise the risk of multicollinearity.
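One common way to screen candidate variables for multicollinearity is the variance inflation factor (VIF). The sketch below is a minimal illustration using statsmodels in Python; the predictor data are made up, with x2 deliberately constructed to be almost a multiple of x1.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))  # add an intercept column
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # ignoring the intercept row, VIF values well above about 5-10 signal multicollinearity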
Include/Exclude Decisions
It refers to the process of deciding which independent variables should be included or
excluded from the regression model. This decision-making process is crucial because it
directly impacts the validity, interpretability, and predictive power of the regression model.
4. Model Complexity: Including too many independent variables can lead to overfitting,
where the model fits the noise in the data rather than the underlying relationships.
Overfitting can reduce the model's generalizability to new data. Therefore, include
only those variables that are essential for explaining the variation in the dependent
variable and avoid unnecessary complexity.
Stepwise regression
Stepwise regression is a method used to select the most relevant independent variables from a
larger set of potential predictors. It involves adding or removing variables from the regression
model based on their statistical significance.
Steps involved:
1. Define Variables:
Identify your dependent variable (the variable you want to predict) and independent variables
(predictor variables).
2. Data Preparation:
Ensure your data is cleaned and organized, with no missing values or outliers that could
affect the results.
Decide on the criteria for adding or removing variables from the model.
Common stepwise methods include:
a. Forward selection: Start with an empty model and add the most statistically
significant variable at each step until no additional variables meet the criteria
for inclusion.
7. Interpret Results:
● Interpret the coefficients of the final model to understand the relationships between
the independent variables and the dependent variable.
● Identify the most important predictors in the model and their impact on the dependent
variable.
8. Communicate Findings:
Present the results of the stepwise regression analysis in a clear and understandable manner,
highlighting the significant predictors and their implications for the problem at hand.
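A minimal forward-selection sketch in Python is shown below. It is purely illustrative: it uses ordinary least squares p-values from statsmodels as the entry criterion, and the DataFrame, column names, and threshold are all made up.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Made-up data: y depends mainly on x1 and x2
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df["x1"] - 1.5 * df["x2"] + rng.normal(size=200)
def forward_selection(data, target, alpha=0.05):
    remaining = [c for c in data.columns if c != target]
    selected = []
    while remaining:
        # p-value of each candidate variable when added to the current model
        pvals = {}
        for cand in remaining:
            X = sm.add_constant(data[selected + [cand]])
            pvals[cand] = sm.OLS(data[target], X).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] < alpha:        # add only statistically significant variables
            selected.append(best)
            remaining.remove(best)
        else:
            break
    return selected
print(forward_selection(df, "y"))  # expected to keep x1 and x2 for this made-up data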
Partial F-test:
● The Partial F-test is a statistical test used to determine whether adding or removing a
particular independent variable (or group of variables) from a regression model
significantly improves the model's overall fit.
● It is often used in multiple regression analysis to assess the significance of individual
predictors or sets of predictors.
● The test compares the fit of two nested models: one with the variables of interest
included and one without them.
● If adding the variables significantly improves the model's fit, the Partial F-test will
yield a significant result.
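A sketch of how a partial F-test can be carried out with statsmodels in Python is shown below; the data and variable names are made up, and the test compares a reduced model against a full model that adds x2 and x3.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
# Made-up data
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(150, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=150)
reduced = smf.ols("y ~ x1", data=df).fit()           # model without the variables of interest
full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()    # nested model with them included
# The partial F-test asks whether x2 and x3 jointly improve the fit
print(anova_lm(reduced, full))  # examine the F statistic and its p-value in the second row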
Outliers:
● Outliers are data points that deviate significantly from the rest of the data in a dataset.
They can be influential in regression analysis because they can disproportionately
affect the estimation of the regression coefficients and hence the overall model.
● Outliers can arise due to measurement errors, data entry mistakes, or genuine
variability in the data.
● It's important to identify and examine outliers to determine whether they should be
kept in the analysis, transformed, or removed. Techniques such as residual analysis,
leverage plots, and Cook's distance can help in identifying outliers.
Interpretation of R-squared:
● R-squared (R^2) is a measure of the proportion of variance in the dependent variable
that is explained by the independent variables in the regression model.
● It ranges from 0 to 1, where 0 indicates that the independent variables explain none of
the variability in the dependent variable, and 1 indicates that they explain all of it.
● For instance, an R-squared value of 0.70 means that 70% of the variance in the
dependent variable is accounted for by the independent variables in the model.
● R-squared is often interpreted as the goodness-of-fit of the regression model: the
higher the R-squared, the better the model fits the data. However, it does not indicate
the appropriateness of the model's functional form or the validity of the underlying
assumptions.
In summary, when interpreting multiple regression results, one should pay attention to the
coefficients, which describe the relationships between variables; the standard error of the
estimate, which gauges the precision of the regression model's predictions; and R-squared,
which assesses the overall goodness-of-fit of the model.
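To make this concrete, here is a small illustrative regression in Python using statsmodels; the data are made up (sales driven by advertising and price), and the summary output shows the coefficients, their standard errors and p-values, R-squared, and the overall F-test discussed below.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "advertising": rng.uniform(10, 100, size=120),
    "price": rng.uniform(5, 20, size=120),
})
df["sales"] = 50 + 2.0 * df["advertising"] - 3.0 * df["price"] + rng.normal(scale=10, size=120)
model = smf.ols("sales ~ advertising + price", data=df).fit()
print(model.summary())          # coefficients, standard errors, t statistics, p-values
print("R-squared:", model.rsquared)
print("Overall F-test p-value:", model.f_pvalue)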
Assumptions of Multiple Regression
Linearity: The relationship between the independent variables (predictors) and the dependent
variable (outcome) should be linear. This means that the change in the dependent variable is
proportional to the change in the independent variables, holding all other variables constant.
Linearity can be assessed by examining scatterplots of the variables or by using techniques
like residual plots.
Independence of Errors: The errors (residuals), which are the differences between the
observed values and the values predicted by the regression model, should be independent of
each other. This assumption ensures that there is no systematic pattern or correlation in the
residuals. Violations of this assumption can occur in time series data or when observations
are spatially correlated.
Constant Variance of Errors (Homoscedasticity): The variance of the errors should be
constant across all levels of the independent variables. In other words, the spread of the
residuals should be the same throughout the range of predictor values. Violations of
homoscedasticity result in heteroscedasticity, where the spread of residuals systematically
varies across levels of the predictors. This can lead to inefficient estimates and biased
inference. Residual plots or statistical tests like the Breusch-Pagan test can be used to assess
homoscedasticity.
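These assumptions can be checked with residual diagnostics. The sketch below assumes the fitted statsmodels result named model from the earlier regression example and uses the Breusch-Pagan test mentioned above; it is an illustration, not the only way to run these checks.
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan
# 'model' is the fitted OLS result from the previous sketch (assumed to exist)
# Residuals vs fitted values: look for curvature (non-linearity) or a funnel shape
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
# Breusch-Pagan test: a small p-value suggests heteroscedasticity (non-constant variance)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)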
The standard error measures the variability or uncertainty associated with the estimated
regression coefficient. It indicates the average amount by which the coefficient estimate is
likely to differ from the true population parameter, assuming that the regression model is
correct and the statistical assumptions are met.
Interpreting the ANOVA (Analysis of Variance) test in the context of a regression model
involves understanding whether the overall regression model is statistically significant.
Here's how to interpret the ANOVA test results:
2. Interpretation:
● If the p-value is less than your chosen significance level (e.g., 0.05), you reject the
null hypothesis. This suggests that at least one of the predictor variables in the model
has a statistically significant effect on the dependent variable.
● Conversely, if the p-value is greater than your chosen significance level, you fail to
reject the null hypothesis. This suggests that the regression model, as a whole, does
not provide a significant improvement in predicting the dependent variable compared
to a model with no predictors.
3. Conclusion:
● If the ANOVA test is statistically significant (i.e., the p-value is less than the
significance level), you can conclude that the regression model as a whole is
statistically significant, and at least one of the predictor variables has a significant
effect on the dependent variable.
● If the ANOVA test is not statistically significant (i.e., the p-value is greater than the
significance level), you may conclude that the regression model does not provide a
significant improvement in predicting the dependent variable compared to a null
model with no predictors.
● In summary, interpreting the ANOVA test in a regression model helps you understand
whether the model, as a whole, is useful for predicting the dependent variable and
whether any of the predictor variables have a significant impact on the outcome.
Binomial Logistic Regression Example: Suppose you want to predict whether a student passes an exam based on the number of study hours. In this case:
The outcome variable (dependent variable) is whether the student passes (1) or fails (0) the
exam.
The predictor variable (independent variable) is the number of study hours.
Binomial logistic regression helps us predict the probability of passing the exam based on the
number of study hours.
2. Example:
Suppose you want to predict a person's political affiliation based on their age, education level,
and income. In this case:
The outcome variable (dependent variable) is political affiliation, which could have multiple
categories like Democrat, Republican, Independent, etc.
The predictor variables (independent variables) are age, education level, and income.
Multinomial logistic regression helps us predict the probability of belonging to each political
affiliation category based on age, education, and income.
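An illustrative sketch of both cases using scikit-learn (listed earlier among machine learning tools) is shown below; every number and label in it is made up purely for demonstration.
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(4)
# Binomial case: pass (1) / fail (0) predicted from study hours
hours = rng.uniform(0, 10, size=(200, 1))
passed = (hours[:, 0] + rng.normal(scale=1.5, size=200) > 5).astype(int)
binary_model = LogisticRegression().fit(hours, passed)
print(binary_model.predict_proba([[6.0]]))  # probability of fail / pass for 6 study hours
# Multinomial case: three made-up party labels predicted from age, education, income
X = rng.normal(size=(300, 3))
party = rng.integers(0, 3, size=300)        # 0 = Democrat, 1 = Republican, 2 = Independent
multi_model = LogisticRegression(max_iter=1000).fit(X, party)  # multinomial by default for >2 classes
print(multi_model.predict_proba(X[:1]))     # probability of each affiliation for the first person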
CHAPTER 6
Time Series Analysis
1. Descriptive Analysis: Understand and visualize patterns and trends in the data.
2. Forecasting: Predict future values based on historical data.
3. Seasonal Decomposition: Separate the time series into seasonal, trend, and residual
components.
4. Anomaly Detection: Identify outliers or irregularities in the data.
5. Modeling: Develop statistical or machine learning models to capture and represent
the underlying patterns and relationships in the time series.
Applications: Time series analysis is applied wherever data are collected at regular, time-dependent intervals, including finance (stock market analysis), economics (GDP forecasting, demand analysis), weather forecasting, and the social sciences (predicting voting patterns).
Additive model
An additive model is a common approach in time series analysis used to decompose a time
series data set into its underlying components: trend, seasonality, and noise. The additive
model assumes that these components are independent and can be combined by simple
addition to reconstruct the original time series.
Time Series = Trend + Seasonality + Noise
1. Component Analysis:
The additive model helps in identifying and separating the different components of a time
series, such as trend, seasonality, and noise. This decomposition aids in understanding the
underlying patterns and behaviors in the data.
2. Forecasting:
Once the components are identified and estimated, they can be used to forecast future values
of the time series. Forecasting models can be built separately for each component and then
combined to make predictions for the overall time series.
3. Anomaly Detection:
By analyzing the noise component of the additive model, anomalies or unusual patterns in the
time series can be detected. These anomalies may represent significant events or changes in
the data that require further investigation.
4. Data Interpretation:
Understanding the individual components of a time series can provide valuable insights into
the factors driving the observed patterns and fluctuations. This information can be used for
decision-making, planning, and strategy development.
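A minimal decomposition sketch in Python with statsmodels is shown below; the monthly series is synthetic, made-up data built to contain a trend, yearly seasonality, and noise.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
rng = np.random.default_rng(5)
values = 100 + 0.5 * np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12) + rng.normal(scale=2, size=48)
series = pd.Series(values, index=idx)
# Additive decomposition: Time Series = Trend + Seasonality + Residual (noise)
result = seasonal_decompose(series, model="additive")
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())
# For a series with proportional seasonal swings, use model="multiplicative" instead.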
Multiplicative model
The multiplicative model is another commonly used approach in time series analysis,
alongside the additive model. While the additive model decomposes a time series into its
constituent components by simple addition, the multiplicative model decomposes a time
series by multiplication. The multiplicative model assumes that the components of the time
series interact with each other multiplicatively:
Time Series = Trend × Seasonality × Noise
The multiplicative model captures the relative interactions between the components and
provides a more nuanced understanding of the time series data. It is particularly useful when
the relative changes and interactions between the components are more relevant than the
absolute changes.
Applications: The additive model is suitable for data with stable seasonal patterns, while the
multiplicative model is suitable for data with changing (relative) seasonal patterns.
Stationarity
Stationarity is a crucial concept in time series analysis that plays a significant role in building
reliable and accurate forecasting models. A time series is said to be stationary if its statistical
properties, such as mean, variance, and autocorrelation, remain constant over time. The
stationarity of a time series is important because many time series forecasting methods and
models assume or require the data to be stationary.
Importance of Stationarity:
1. Modeling:
Stationary time series are easier to model and forecast because their statistical properties do
not change over time. Models built on stationary data tend to be more reliable and accurate.
2. Assumptions:
Many forecasting techniques, like ARIMA (AutoRegressive Integrated Moving Average),
require the data to be stationary or transformable to a stationary series.
3. Interpretability:
Stationary time series have a stable, consistent behavior over time, making it easier to
interpret the underlying patterns and trends.
Checking for Stationarity:
1. Visual Inspection:
Plotting the time series data and observing if there are any trends or patterns can provide
initial insights into stationarity.
2. Summary Statistics:
Computing the mean and variance over different time periods and checking if they are
constant can indicate stationarity.
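Beyond visual checks and summary statistics, a formal unit-root test is commonly used. The sketch below applies the Augmented Dickey-Fuller (ADF) test from statsmodels to the synthetic monthly series built in the decomposition sketch above (reusing it is an assumption made for illustration).
from statsmodels.tsa.stattools import adfuller
# 'series' is the synthetic monthly series from the earlier decomposition sketch
adf_stat, p_value, used_lags, n_obs, critical_values, icbest = adfuller(series)
print("ADF statistic:", adf_stat)
print("p-value:", p_value)               # small p-values (< 0.05) suggest stationarity
print("Critical values:", critical_values)
# Differencing is a common way to make a non-stationary series stationary
diff_series = series.diff().dropna()
print("p-value after differencing:", adfuller(diff_series)[1])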
Autocorrelation
Autocorrelation is a fundamental concept in time series analysis that measures the linear
relationship between lagged values of a time series. It quantifies the degree to which a time
series is correlated with its own past values at different lags. Understanding autocorrelation is
essential for identifying patterns, trends, and dependencies in time series data.
Importance of Autocorrelation:
1. Identifying Patterns:
Autocorrelation helps in identifying repeating patterns or cycles in the time series data. It can
reveal if there is a systematic relationship between the current observation and its past values.
2. Modeling:
Autocorrelation is crucial for selecting appropriate forecasting models. Time series models
like ARIMA (AutoRegressive Integrated Moving Average) and SARIMA (Seasonal ARIMA)
rely on autocorrelation to determine the order of autoregressive (AR) and moving average
(MA) components.
3. Diagnostics:
Examining autocorrelation plots and correlograms can serve as diagnostic tools to check the
adequacy of a time series model. Residuals from a fitted model should ideally not exhibit
significant autocorrelation.
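Autocorrelation can be inspected numerically or with a correlogram. The short sketch below again reuses the synthetic series assumed in the earlier time series sketches.
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf
# Autocorrelation of the synthetic monthly series at lags 0 to 12
print(acf(series, nlags=12))
# Correlogram: bars outside the shaded band indicate significant autocorrelation
plot_acf(series, lags=24)
plt.show()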
Handling Seasonality: Neither the AR model nor the MA model handles seasonality on its own; seasonal patterns require extensions such as SARIMA.
Autoregression
The AutoRegressive (AR) model is a fundamental concept in time series analysis and
forecasting. It is used to capture the linear relationship between an observation and its past
values. The term "autoregressive" indicates that the model regresses the variable on its own
past values, hence the name.
The AutoRegressive (AR) model is like looking back to predict the future. It uses past values
of a time series to forecast what comes next. Imagine you're trying to predict tomorrow's
temperature based on today's and the past few days' temperatures – that's essentially what an
AR model does.
Key Points:
1. Looking Back:
The AR model uses past values, or "lagged values," of a time series to make predictions
about future values.
2. Linear Relationship:
It assumes that there's a straight-line relationship between the current value and its past
values.
3. Order:
The "order" of the AR model tells us how many past values we should look at to make our
prediction.
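A minimal AR sketch with statsmodels is shown below, fitting an AR(3) model to the synthetic series assumed earlier and forecasting a few steps ahead; the order of 3 is an arbitrary example.
from statsmodels.tsa.ar_model import AutoReg
# Fit an autoregressive model that looks back 3 periods (order = 3)
ar_model = AutoReg(series, lags=3).fit()
print(ar_model.params)   # intercept and coefficients on the lagged values
# Forecast the next 6 observations beyond the end of the sample
print(ar_model.predict(start=len(series), end=len(series) + 5))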
Moving Average (MA) model
In time series analysis, the Moving Average (MA) model is a common method used to
understand and forecast data points by modelling the current value in terms of past forecast
errors (it is distinct from simple moving-average smoothing of past observations).
The MA model assumes that the current value of a time series is a combination of the series
mean, a white noise process (random error), and a linear combination of past forecast errors.
There are different orders of MA models, denoted as MA(q), where 'q' represents the number
of lagged forecast errors that are included in the model. For example, an MA(1) model uses
the previous error term, an MA(2) model uses the previous two error terms, and so on.
Random Error
It refers to the unpredictable component of a data series that cannot be explained by the
model being used. It represents the difference between the observed value and the value
predicted by the model.
Random errors are typically assumed to be independent and identically distributed (i.i.d.)
with a mean of zero. This means that on average, the random errors will cancel each other out
and not exhibit any systematic pattern over time.
ARIMA Model:
● ARIMA is a flexible and powerful model for analyzing and forecasting time series
data. It combines autoregressive (AR), differencing (I), and moving average (MA)
components to capture the temporal dependencies and patterns in the data.
● Autoregressive (AR) component captures the linear relationship between an
observation and a number of lagged observations.
● Integrated (I) component refers to differencing the time series to achieve stationarity,
which involves removing trends or seasonal patterns.
● Moving Average (MA) component models the dependency between an observation
and a residual error from a moving average model applied to lagged observations.
● ARIMA models are effective for modeling and forecasting stationary or
near-stationary time series data with a linear trend or seasonal component.
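As a closing illustration, the sketch below fits an ARIMA(1, 1, 1) model with statsmodels to the synthetic series assumed earlier; the order values are arbitrary examples, and in practice they would be chosen from autocorrelation diagnostics or information criteria.
from statsmodels.tsa.arima.model import ARIMA
# ARIMA(p=1, d=1, q=1): one AR lag, first differencing, one MA term
arima_model = ARIMA(series, order=(1, 1, 1)).fit()
print(arima_model.summary())
# Forecast the next 6 periods
print(arima_model.forecast(steps=6))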