
1 Introduction to Statistics for Data Analytics

1.1 What is the difference between Statistics and Data Analytics?

There was a time, not long ago, when all data analysis was called statistics. That has changed
because of rapid advances in our ability to capture, store, and process vast amounts of data.

In the 1960s, computers were “mainframes” filling large rooms or even buildings. Few
organizations could afford them, so “time-sharing” was a common practice. The Government
of New Brunswick “shared” the one computer at the University of New Brunswick. Data was
stored on magnetic tape, and programs were stored on punched cards or paper tape.

It was not until 1990 that it became cheaper to store information on computers rather than on paper!

The first major software package for statistical analysis was SPSS, released in 1968. Prior to that
time, most data analysis was done by hand, using mechanical adding machines or slide rules for
operations such as finding square roots.

Data analysis was hard work and data sets were small by today’s standards.

The statistical methods taught in almost every university program today are based upon
methodologies developed in the first half of the 20th century, when data sets were small and
analysis was done by hand. To make analysis practical, the approach was to start your research
by articulating a belief or theory and then collecting data to see whether it supported the belief.
Much of the debate in statistics centred on how to measure the “strength” of this
support. It was considered cheating if you looked at the data first.

If you took a statistics course in university, you were likely taught statistics as practiced before
you were born, not data analysis as it is done today.

Advances in computing speed and massive reductions in the cost of data storage have
transformed the landscape.

• All of your interactions with an organization can now be captured – every visit, transaction,
email, phone call, text, ….
• Online activity generates “metadata” that may not be traceable to you personally, but still
tells us about someone.
• Vast amounts of information are posted on social media and other public sites and
technology exists to search and process text efficiently.
• “Data” now includes images, video, sound, emails, social media, websites,… and we are
developing tools to capture and analyze this data.

With this abundance of low-cost, accessible data, we can explore the data to identify patterns
and then collect new data to confirm that the patterns are real and reproducible.

• By exploring this “Big Data”, we can discover new or hidden patterns and relationships.
Although we may not be able to explain the relationship, the insight may still allow for
better decisions.
• Data science is question-driven science. We use data to answer questions.
• With data analytics, we can discover your secrets! Big Brother is here!

What is Data Analytics? Over the last 20 years, there has been a shift in the way organizations
approach decision-making. The emphasis is now on evidence-based decision-making, not simply
making decisions based upon experience. In the last half of the 20th century, there was
increasing use of models to inform decision-making. We used statistics to help confirm
whether our models were valid, but frequently the models were simplistic and captured only a
portion of what was happening.

Through the latter part of the 20th century, and more so in the last 20 years, computing scientists
have developed many algorithms for machine learning (artificial intelligence). Much of this work
is based upon analyzing vast amounts of data to discover patterns. An early example was
optical character recognition (digitizing text): the first systems were rather clumsy, but today the
software is very accurate. Speech recognition software is amazingly good now (Amazon’s Echo
and Google Home are testaments to this).

Machine Learning algorithms are the basis of much of the field of data mining – the search for
valuable nuggets of information from large volumes of data. Some call this knowledge discovery.
The marriage of these many approaches to analyzing data to extract information in support of
decision-making is collectively called data science or data analytics.

In the context of decision-making in business, some may refer to business analytics. In this case
we are referring not simply to a set of skills or tools, but to a business process. It is not just an
integral part of the decision-making process, but an actual business function. The firm’s
business strategy includes processes for continuously capturing data on all aspects of the firm,
and then using that data to extract useful insights that improve the organization’s success.

Traditionally, firms did data analysis as part of a problem-solving process. You needed to
recognize that you had a problem, and to have defined it appropriately, before you
looked at data collection and analysis. In data analytics, we start with problem finding. We are
looking for opportunities, but may not know what we are looking for. It can be exploratory.

And by data we do not mean just the data in our organization’s information system (sales
records, accounting entries, employee records, …), but almost anything. Text data is a huge
source of information (comments, emails, tweets, news articles, Facebook likes, ….). Images and
sounds are challenging areas to explore, but we are rapidly getting better at it. Clearview.ai is
controversial. This firm has captured over 3 billion facial images from social media and other
online sources to create a database used by law enforcement. Did you know that the police
have your picture?

The analysis of large data sets goes by many names: data analytics, data science, data mining,
knowledge discovery in data,…. One of the earliest definitions that is still widely cited is from a
1996 paper by Fayyad et al. It was a definition of KDD (Knowledge Discovery in Data) but has
been used to describe data analytics.

The non-trivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data.1

All of the words in the definition are important. It must be emphasized that this is not a formula
or a tool, but a process. This process is non-trivial: it is not simple, and it is usually challenging. We
want the outcome to be valid and something new (novel) that we did not know before. We are
discovering knowledge. We would like it to be useful knowledge, but we may not know its
usefulness until we try to act upon it. Usually, to apply the new knowledge we must engage
others and convince them of its value, so it should be understandable: you can explain it. And
lastly, most knowledge we discover involves patterns and relationships. These
patterns do not necessarily come from any theory, but from the data itself.

1.2 Scope of this text


Data analytics is a diverse and growing field of study. Its scope and boundaries are continuously
changing. This text is limited to looking at classical statistical methods through the data analytics
lens. The teaching of statistics at most institutions and in most disciplines has not kept up with
the change in the data landscape and the applications of data analytics tools. Many established
research methods, such as hypothesis testing (tests of statistical significance), are being
challenged and their interpretation questioned.

Nonetheless, it is important to understand classical statistical methodology because of its
widespread use. It is still valuable and is the starting point for many strategies for data analysis.

From a theoretical perspective, an understanding of basic probability, and of how it describes random
events and the relationships among events, is core to being able to evaluate the “accuracy” of
data analytics tools.

Although many tools and methodologies will be introduced, the text is built around a framework
for conducting a data analysis project, from initial problem definition to solution
implementation. The first half of the text focuses on the sequence of steps in conducting the
project, followed by an examination of some standard statistical models for decision-making and
prediction.

Although data analytics professionals do most modeling using statistical packages and programming
tools such as SPSS, Stata, R, and Python, we will limit ourselves to using Microsoft Excel. Just as every
accountant still uses a hand calculator for quick calculations, a data analyst frequently uses Excel for
initial data exploration and simple modeling.

1.3 Example: How Companies Learn Your Secrets


In 2012, Charles Duhigg published a story in NYT Magazine2 that described how Target
department stores knew which of their customers were pregnant without the customers telling
Target. The story is often repeated in other publications, but it is worth reading the whole article
in NYT Magazine. This short video also retells the article.

What can we learn from this story?

The story describes the process of data analytics, from the posing of the question, “If we wanted
to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that?”, to
the business actions that resulted from this new knowledge. To appreciate the nature of the
process, consider the following questions.

1. What was the objective of the marketing staff in asking the question?
2. What was (were) the data analysis problem(s)?
3. What data was needed to answer the question(s)?
4. How accurate was the model?
5. How was the knowledge applied and what were the consequences?

The story is both entertaining and alarming. The marketing department knew that if they could
identify expectant mothers early, they could entice them to shop at Target for their pregnancy
and baby needs. At this unique transitional period in their lives, parents are more open to
influence.

Target had data on thousands of customers who were part of their loyalty program. In the past,
many expectant mothers had joined Target's baby registry, so Target was able to examine
shopping behaviour before, during and after pregnancy for these customers. By mining this
data, they found 25 types of products for which changes in shopping patterns were highly
predictive of pregnancy and the stage of the pregnancy. With a high level of accuracy, Target
could identify which customers were pregnant and what the approximate due date was.

The article tells the story of identifying a pregnant teenager before the parents knew she was
pregnant. Does use of this model present ethical issues? This article was published in 2012 and
is based upon events from several years before that. Given advances in data analytics and
technology in the last 10-20 years, imagine what Target knows about its customers today.

1.4 What do data analytics professionals look like?


There are many YouTube videos that describe the experiences of data analytics professionals.
Take a look at what data analytics looks like at Chevron, Mount Sinai Hospital, or, for a very
scary experience, Cambridge Analytica.

1.5 Common Data Mining Applications


When facing a new problem, it is often challenging to know where to start. If you can classify
your problem into a familiar category, then you can look at past experience with these types of
problems to help guide your work.

The most common business questions related to data are either “descriptive” or “predictive”.

Descriptive questions may be simple, such as, “What is the average time to process a claim?”
We may be trying to build a profile of cases where clients complain about or appeal our
insurance claim decisions. We may be investigating service levels among different operating
units to identify whether there are significant differences. Maybe we want to know whether
there is a relationship between insurance claim appeals and the type of claim.

These are classic statistical questions. They frequently inform subsequent decisions on how to
improve operations or how to identify problems or opportunities before they occur. Naturally,
this leads us to use descriptive models as the basis for making predictions.
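
The questions in this paragraph can be sketched in a few lines of Python with pandas, even though this text itself works in Excel. The file name and column names below (claims.csv, processing_days, claim_type, appealed) are hypothetical placeholders for whatever your claims data actually contains.

```python
# A minimal sketch of answering descriptive questions with pandas.
# File and column names are hypothetical placeholders, not from the text.
import pandas as pd

claims = pd.read_csv("claims.csv")

# "What is the average time to process a claim?"
print(claims["processing_days"].mean())

# "Is there a relationship between claim appeals and type of claim?"
# A cross-tabulation of appeal rates by claim type is a first look.
print(pd.crosstab(claims["claim_type"], claims["appealed"], normalize="index"))
```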

Many business problems can be broken down into sub-problems, each involving different data
analysis methodologies. Rarely is a major problem just one problem. Although many large
problems are unique, the sub-problems are not.

The most common types of data mining decisions are

• Value estimation
• Classification
• Clustering
• Co-occurrence grouping
• Link prediction
• Similarity matching
• Profiling
• Causal analysis
• Data reduction
• Feature selection

An important subject in computing science is artificial intelligence. Can machines think? Like
humans, machines must first be taught before they can make decisions. Teaching
machines implies that machines can learn. This is more than programming computers to carry
out tasks. Computing scientists speak of machine learning: computer algorithms search for
patterns in the data. The computers are discovering knowledge themselves.

When there is a specified outcome, the learning is said to be supervised. One of the key
differentiators among data mining applications is whether the case is one of supervised or
unsupervised learning. In supervised learning, we have a specific outcome we are trying to
predict (Is this transaction fraudulent? What will sales be in the fourth quarter?). In unsupervised
learning, we are simply trying to get some insight (Can we cluster customers into homogeneous
groups? What are the factors that affect customer churn? What are the characteristics of our
best customers?).

Methods used for both classification and value estimation are often called supervised learning
algorithms in the data mining literature. In both cases, we have historical data that includes data
on many characteristics of the customer, as well as the final outcome. For example, the
outcome might be "good customer risk" and the customer characteristics might include age,
gender, income, past borrowing history, homeownership,... We need a model that links the
characteristics to the outcome.

Clustering is an example of unsupervised learning. There is no specified target outcome. We
would like to cluster customers into groups such that those within a group are similar to each
other, while the groups themselves differ from each other.

Supervised Learning Models

Supervised learning models are “trained” on data for which we know the values of the target
outcome and also have other data about the client. In the case of Target, the model was based upon
training data where we knew which customers were pregnant and which were not, and we had
significant past data on their shopping behaviour. The training is “supervised” to make the
“best” predictions. Once trained, the model is evaluated on held-out test data, cases where the
outcome is known but was not used in training, to estimate the accuracy we can expect when
we apply the model to new data for which we do not know the value of the target variable.
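
As a rough illustration of this train-and-evaluate workflow, here is a minimal scikit-learn sketch. The feature matrix and outcome are synthetic stand-ins for "customer characteristics" and "final outcome"; the decision tree is just one of many possible supervised methods.

```python
# A minimal sketch of the supervised learning workflow described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))              # four synthetic customer characteristics
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # known target outcome (synthetic)

# Hold out test cases whose outcomes the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Estimate the accuracy we can expect when the model is put into use.
print(accuracy_score(y_test, model.predict(X_test)))
```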

Value estimation – what is the maximum credit limit that we can safely offer to this customer?
How many gigabytes of data is this smartphone customer likely to use on average? The most
widely used method for value estimation is regression. Regression estimates a formula that can
be used to predict the value of one variable based upon the values of a set of predictor variables.
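
A minimal sketch of value estimation with linear regression follows. The two predictor variables and the usage figures are made up purely for illustration.

```python
# A minimal sketch of value estimation with regression (scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))                        # two synthetic predictors
y = 1.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 1, 200)    # e.g. gigabytes used

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)          # the estimated formula
print(reg.predict([[5.0, 3.0]]))          # predicted value for a new customer
```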

Classification – is this credit card applicant an acceptable risk or not? Will this customer renew
their cell phone plan? Most classification problems involve splitting cases into two groups. There
are numerous methods for doing this. Splitting cases into more than two groups is more
challenging and usually requires very different tools. Some methods classify a new case
into the category that has the most previous cases that are similar, whereas other methods
estimate the probability that the new case belongs to a particular category.
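
A minimal sketch of a two-group classifier that estimates class probabilities, here logistic regression, one of many possible methods; the applicant data and the "acceptable risk" labels are synthetic.

```python
# A minimal sketch of two-group classification with probability estimates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))                    # synthetic applicant characteristics
y = (X[:, 0] - 0.5 * X[:, 2] > 0).astype(int)    # 1 = acceptable risk (synthetic)

clf = LogisticRegression().fit(X, y)

new_applicant = [[0.2, -1.0, 0.4]]
print(clf.predict(new_applicant))                # predicted class
print(clf.predict_proba(new_applicant))          # estimated class probabilities
```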

Link prediction – Facebook and LinkedIn will suggest new connections for you to consider. In
Facebook, the suggestions are usually based upon individuals who are friends with several of your
friends. In LinkedIn, the suggestions are based upon who is part of your network, where you went
to school, where you work, or other factors that together suggest that an individual may be a good
fit with your network. The model looks for new links that do not currently exist between two
objects but would be consistent with the links that have been observed.
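
A toy sketch of the "friends of your friends" idea behind these suggestions; the friendship graph below is invented, and real systems combine many more signals.

```python
# A minimal sketch of link prediction by counting mutual friends.
from collections import Counter

friends = {
    "ana": {"bob", "cam", "dee"},
    "bob": {"ana", "cam", "eli"},
    "cam": {"ana", "bob", "dee", "eli"},
    "dee": {"ana", "cam"},
    "eli": {"bob", "cam"},
}

def suggest(person):
    # Count how many of person's friends each non-friend is connected to.
    counts = Counter()
    for friend in friends[person]:
        for fof in friends[friend]:
            if fof != person and fof not in friends[person]:
                counts[fof] += 1
    return counts.most_common()

print(suggest("dee"))   # e.g. [('bob', 2), ('eli', 1)]
```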

Unsupervised Learning Models

In unsupervised learning models, we have no target variable we are trying to predict; instead, we
are trying to identify or summarize patterns in the data. Although these are descriptive models, we
may use them to inform decisions. The consequences of decisions are not known exactly when
the decisions are taken, so in a sense, decisions are predictions. Without clear definitions of
the outcomes we want to predict, and without data on those outcomes, we cannot “train”
descriptive models.

Clustering – how can I segment my customers into a manageable number of distinct
segments? Marketers may group customers geographically, but this implies that all customers in
a region can be treated the same way. Why use geography? We know many things about our
clients. The obvious ones are age, gender and home address, but we also know what they buy,
how much, when, cash/debit/credit, in-store/online,... Are there patterns in all this data that
will allow us to identify clusters based upon their similarity across many attributes?
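
A minimal sketch of clustering with k-means, one common method among many; the customer attributes are synthetic stand-ins for the kinds of variables described above.

```python
# A minimal sketch of clustering customers with k-means (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
customers = np.vstack([
    rng.normal([30, 200, 4], [5, 50, 1], size=(100, 3)),    # one synthetic segment
    rng.normal([55, 800, 12], [8, 150, 3], size=(100, 3)),  # another synthetic segment
])

# Standardize so that no single attribute dominates the distance calculation.
X = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignment for the first customers
print(kmeans.cluster_centers_)    # cluster centres in standardized units
```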

Co-occurrence grouping – if you have shopped on Amazon, you have likely seen “Frequently
bought together” or “Customers who viewed this item also viewed”. While clustering tries
to group objects together based upon the similarity of various attributes, co-occurrence
grouping looks at objects that are frequently seen together, such as items that are frequently
purchased together. Note that we are grouping items based on the transactions in which they
occur together, rather than on the features they share; grouping by shared features is
clustering. The difference is subtle but important.
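
A minimal sketch of co-occurrence grouping: count how often pairs of items appear in the same transaction. The baskets are made-up examples.

```python
# A minimal sketch of counting item pairs that occur in the same transaction.
from collections import Counter
from itertools import combinations

baskets = [
    {"diapers", "wipes", "beer"},
    {"diapers", "wipes", "lotion"},
    {"bread", "milk"},
    {"diapers", "lotion"},
    {"bread", "milk", "beer"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs most frequently bought together.
print(pair_counts.most_common(3))
```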

Similarity matching – can we find customers that look like you? From what we know about
them, can we make predictions about what you will do? Link prediction, clustering, and
co-occurrence grouping all involve examining similarities among clients or objects. Similarity
matching takes this further to finding your lookalike.
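
A minimal sketch of similarity matching with a nearest-neighbours search over customer attributes; the data is synthetic and the choice of distance measure is only illustrative.

```python
# A minimal sketch of finding the customers most similar to a given one.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
customers = rng.normal(size=(200, 5))          # five synthetic attributes per customer

X = StandardScaler().fit_transform(customers)
nn = NearestNeighbors(n_neighbors=4).fit(X)

# The closest matches to customer 0 (the first neighbour is the customer itself).
distances, indices = nn.kneighbors(X[[0]])
print(indices[0][1:], distances[0][1:])        # its three "lookalikes"
```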

Profiling – if we build a profile of a typical credit card customer’s behaviour, then behaviour that
departs significantly from the profile may suggest that fraud has occurred. This sounds like
similarity matching, but rather than looking for similar cases, we are trying to define what makes
objects similar. What are their defining characteristics? What is the profile of our best customers?
Can we use this profile to target new customers?
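
A minimal sketch of profiling for fraud screening: summarize a customer's typical spending, then flag amounts that depart strongly from that profile. The figures and the three-standard-deviation threshold are arbitrary choices for illustration.

```python
# A minimal sketch of flagging behaviour that departs from a profile.
import numpy as np

past_amounts = np.array([23.0, 41.5, 18.2, 35.0, 27.9, 44.1, 30.3, 25.6])

mean, std = past_amounts.mean(), past_amounts.std()

def looks_unusual(amount, threshold=3.0):
    # A transaction more than `threshold` standard deviations from the
    # customer's usual spending may warrant a fraud review.
    return abs(amount - mean) / std > threshold

print(looks_unusual(32.0))    # False: close to the profile
print(looks_unusual(950.0))   # True: far outside the profile
```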

Causal Modeling - Causal modeling may arise within another data mining application, such as
value estimation or classification. In making a prediction, we may also want to know why the
outcome is likely to occur. What are the factors that influence the outcome?

Data Reduction and Feature Selection - These applications both arise when we are trying to
understand our environment but are overwhelmed by too much data or too many variables. Can
we reduce our dataset without losing important information? Can I evaluate a student’s
academic ability by just looking at their GPA rather than looking at the whole transcript? We are
combining many variables into a single measure.
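
A minimal sketch of data reduction with principal component analysis (PCA), combining several correlated course grades into one summary score, much as a GPA summarizes a transcript; the grade data is synthetic.

```python
# A minimal sketch of reducing many correlated variables to a single measure.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
ability = rng.normal(70, 10, size=200)                        # underlying "ability"
grades = ability[:, None] + rng.normal(0, 5, size=(200, 8))   # eight course grades

pca = PCA(n_components=1)
summary = pca.fit_transform(grades)

# One component captures most of the variation in the eight grades.
print(pca.explained_variance_ratio_)
print(summary[:5].ravel())
```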

Further Reading
A very short history of Big Data, by Gil Press, Forbes, May 9, 2013
https://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/?sh=6c36e27e65a1
Accessed 6 June 2022.

Footnotes
1 Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. The KDD Process for Extracting Useful Knowledge
from Volumes of Data. Communications of the ACM, November 1996, Vol. 39, No. 11, pp. 27-34.
2 Duhigg, C. How Companies Learn Your Secrets. New York Times Magazine, 16 February 2012.

https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html
