D.a_introduction to Data Analytics
KATARIYA COLLEGE
Data science is the study of data that helps us derive useful insights for business decision making. Data science is all about using tools, techniques, and creativity to uncover insights hidden within data. It combines math, computer science, and domain expertise to tackle real-world challenges in a variety of fields.
Data science processes raw data to solve business problems and can even make predictions about future trends or requirements.
For example, from the huge raw data of a company, data science can help answer questions such as:
What do customers want?
How can we improve our services?
What will the upcoming trends in sales be?
How much stock do they need for the upcoming festival?
In short, data science empowers industries to make smarter, faster, and more informed decisions. To find patterns and achieve such insights, expertise in the relevant domain is required. With expertise in healthcare, a data scientist can predict patient risks and suggest personalized treatments.
3. Roles in data analytics
Data Analyst
Data Architect
Data Engineer
Data Scientist
Marketing Analyst
Business Analyst
4. Life cycle of data analytics
The lifecycle of data analytics provides a framework for the best performance of each phase, from the creation of a project until its completion.
The data analytics lifecycle is a process that consists of 6 stages/phases:
i. DATA DISCOVERY
The data science team learns and investigates
the problem.
Develop context and understanding.
Come to know about data sources needed and
available for the project.
The team formulates initial hypotheses that can later be tested with data.
v. COMMUNICATION RESULTS
After executing the model, the team needs to compare the outcomes of modelling to the criteria established for success and failure.
The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking warnings and assumptions into account.
The team should identify key findings, quantify business value, and develop a narrative to summarize and convey findings to stakeholders.
vi. OPERATIONALIZE
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
The team delivers final reports, briefings, and code.
Free or open-source tools: Octave, WEKA, SQL, MADlib.
ADVANTAGES:
I. Improving efficiency: Helps to analyse large amounts of data quickly and display it in a formulated manner to achieve goals. It encourages a culture of efficiency and teamwork.
By leveraging data analytics, decision-makers can access real-time information, predictive
models, and visualizations that support informed decision-making across all levels of the
organization. This leads to better strategic planning, resource allocation, risk management, and
overall business performance.
DISADVANTAGES:
I. Low quality of data: Lack of access to quality data. It is possible that organizations already have access to a lot of data, but the question is, do they have the right data they need?
II. Privacy concerns: As more data is collected and processed, the risk of data breaches increases. Collecting and analysing personal data raises privacy concerns.
III. Implementation costs: Implementing and maintaining the infrastructure for data analytics can be expensive. Deploying data analytics tools and systems can be complex and resource-intensive, requiring expertise and investment in technology.
IV. Over-reliance on data: Relying solely on data analytics for decision making can overlook qualitative factors and human judgment, potentially leading to misguided strategies or actions.
6. Data Analytics vs Data Analysis
Data Analytics includes several stages like the collection of data, after which the inspection of business data is done; in Data Analysis, raw data is first defined in a meaningful manner, then data cleaning and conversion are done to get meaningful information from the raw data.
Data Analytics supports decision making by analysing enterprise data; Data Analysis analyses the data by focusing on insights into business data.
Data Analytics uses various tools to process data, such as Tableau, Python, Excel, etc.; Data Analysis uses different tools to analyse data, such as RapidMiner, OpenRefine, NodeXL, KNIME, etc.
7.Types of data analytics:
i. Descriptive analytics:
Descriptive analytics looks at data and analyses past events for insight into how to approach future events.
It looks at past performance and understands it by mining historical data to understand the causes of success or failure in the past.
Almost all management reporting such as sales, marketing, operations, and finance uses this
type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups.
Unlike a predictive model that focuses on predicting the behaviour of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of Descriptive analytics are company reports that provide historic reviews
like:
Data Queries
Reports
Descriptive Statistics
Data dashboard
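A minimal sketch of descriptive analytics in Python, assuming a small, invented table of monthly sales figures (pandas is used here only for illustration; column names and numbers are made up):

```python
# Descriptive analytics sketch: summarizing historical sales data with pandas.
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "region":  ["North", "North", "South", "South"],
    "revenue": [12000, 15000, 9000, 11000],
})

# Descriptive statistics over past performance (mean, spread, quartiles).
print(sales["revenue"].describe())

# A simple historic review: total revenue per region, as in a management report.
print(sales.groupby("region")["revenue"].sum())
```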
iii. Predictive analytics:
Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modelling, machine learning, data mining, and game theory that analyse current and historical facts to make predictions about future events. Techniques that are used for predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic Cornerstones of Predictive Analytics
Predictive modelling
Decision Analysis and optimization
Transaction profiling
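A minimal sketch of predictive modelling: the example below fits a linear regression to invented monthly sales figures and extrapolates one month ahead (scikit-learn is assumed to be available; the numbers are purely illustrative).

```python
# Predictive analytics sketch: learn a trend from past months, predict the next.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])   # past months
sales = np.array([100, 110, 123, 131, 144, 152])     # observed sales (invented)

model = LinearRegression().fit(months, sales)         # fit the historical trend
print("Predicted sales for month 7:", model.predict([[7]])[0])
```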
8. Mechanistic data analytics:
Mechanistic data analysis is a scientific method used to identify the causal relationships
between two variables. This approach focuses on studying how changes in one variable affect
another variable in a deterministic way.
Mechanistic data analysis is widely used in engineering studies. Engineers use this method to
analyse how changes in a system’s components affect its overall performance and then develop
mathematical models based on their observations to predict the system’s performance under
different conditions. For example, engineers might measure how changes in engine design
parameters such as piston size, fuel injection rate, exhaust pressure, or number of cylinders
affect engine power output or fuel efficiency to help optimize engine designs for improved
performance and efficiency in vehicles.
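A minimal sketch of this idea in Python, assuming hypothetical measurements of piston size and power output (the figures are invented, not real engine data):

```python
# Mechanistic analysis sketch: relate one design variable to an outcome with a
# simple deterministic model fit to (hypothetical) measurements.
import numpy as np

piston_size = np.array([70, 75, 80, 85, 90])        # mm (assumed values)
power_output = np.array([85, 95, 104, 116, 125])    # kW (assumed values)

# Fit the mechanism: power = a * piston_size + b
a, b = np.polyfit(piston_size, power_output, 1)
print(f"power = {a:.2f} * piston_size + {b:.2f}")
print("Predicted power for a 95 mm piston:", round(a * 95 + b, 1))
```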
9. Mathematical model:
A mathematical model in data analytics is a representation of data relationships using
mathematical concepts and equations. It helps analysts understand and predict data, and
make decisions about future events.
i. Occam’s razor:
What is Occam’s razor?
Occam’s razor is a law of parsimony popularly stated (in William of Occam’s words) as “Plurality must never be posited without necessity.” Alternatively, as a heuristic, it can be viewed as: when there are multiple hypotheses to solve a problem, the simpler one is to be preferred. It is not clear to whom this principle can be conclusively attributed, but William of Occam’s (c. 1287–1347) preference for simplicity is well documented. Hence this principle goes by the name “Occam’s razor.” This often means cutting off or shaving away other possibilities or explanations, thus the “razor” appended to the name of the principle. It should be noted that these explanations or hypotheses should lead to the same result.
Relevance of Occam’s razor
There are many settings that favour a simpler approach, either as an inductive bias or as a constraint to begin with. Some of them are:
Studies have suggested that preschoolers are sensitive to simpler explanations during their initial years of learning and development.
A preference for simpler approaches and explanations to achieve the same goal is seen in various facets of the sciences; for instance, the parsimony principle applied to the understanding of evolution.
In theology, ontology, epistemology, etc., this view of parsimony is used to derive various conclusions.
Variants of Occam’s razor are used in knowledge discovery.
ii. Bias-variance trade-offs:
What is Bias?
Bias is the difference between the predictions of the machine learning model and the correct values. High bias gives a large error on training as well as testing data. It is recommended that an algorithm should always be low-biased to avoid the problem of underfitting. With high bias, the predicted data is in a straight-line format, thus not fitting accurately to the data in the data set. Such fitting is known as underfitting of the data. This happens when the hypothesis is too simple or linear in nature. Refer to the sketch below for an example of such a situation.
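A minimal sketch of underfitting on synthetic data: a straight-line (high-bias) model is fit to a clearly quadratic relationship, so its training error stays large (NumPy and scikit-learn are assumed).

```python
# High-bias / underfitting sketch: linear model on non-linear synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.5, 50)   # quadratic relationship with noise

linear = LinearRegression().fit(x, y)          # hypothesis too simple (a line)
print("Training error of the straight line:",
      mean_squared_error(y, linear.predict(x)))  # stays large = underfitting
```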
What is Variance?
Variance is the amount by which the model's predictions change when it is trained on a different training data set. A model with high variance fits the training data very closely, including its noise, and therefore gives a large error on testing data; such fitting is known as overfitting of the data.
10.Taxonomy model
Data taxonomy is a way of organizing and classifying data to create a structured hierarchy. It
helps businesses categorize their data to access and use it easily.
Information is grouped according to its characteristics, attributes, and relationships and placed
into categories and subcategories.
There are typically multiple levels or layers in a data taxonomy, each level representing a
specific category or class. Top-level categories are broader, while lower levels are more
granular. Organizations can custom-build their taxonomy structure based on their needs and
the nature of the data.
You can apply taxonomy to various data types, including structured data (databases and
spreadsheets) and unstructured data (documents and multimedia files).
To help understand how data taxonomy works, let’s consider an example for an
ecommerce company.
Top-level categories may contain:
Products
Customers
Orders
Marketing
Inventory
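A minimal sketch of such a taxonomy as a nested Python structure, following the ecommerce example above (the subcategory names are illustrative assumptions):

```python
# Data taxonomy sketch: broader categories at the top, more granular levels below.
taxonomy = {
    "Products": {"Electronics": ["Phones", "Laptops"], "Clothing": ["Men", "Women"]},
    "Customers": {"Segments": ["New", "Returning", "VIP"]},
    "Orders": {"Status": ["Placed", "Shipped", "Delivered", "Returned"]},
    "Marketing": {"Channels": ["Email", "Social", "Search"]},
    "Inventory": {"Warehouses": ["North", "South"]},
}

# Walk the top level and its immediate subcategories.
for top_level, sub in taxonomy.items():
    print(top_level, "->", list(sub))
```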
11.Baseline model:
A baseline model in data analytics is a simple model that serves as a reference point for
more complex models. It's used to evaluate the performance of advanced models and to
determine if they're better.
Why use a baseline model?
Baseline models help data scientists understand how complex models will perform.
They can identify issues with data quality, algorithms, features, or hyperparameters.
They can help determine if a more complex model is necessary.
Importance of Baseline Models
Help detect overfitting
They serve as a basis for further model creation
Simplify the model development process
Identify data quality issues
A benchmark for model efficiency
Here are some common approaches to creating a baseline model for classification tasks:
1. Uniform or random selection amongst labels
2. The most common label appearing in training data
3. The most accurate single-feature model
4. Majority class classifier
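A minimal sketch of approaches 2 and 4 (the majority-class baseline), using scikit-learn's DummyClassifier on an invented label set:

```python
# Baseline model sketch: always predict the most frequent class.
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((8, 1))                       # features are ignored by this baseline
y = np.array([1, 1, 1, 1, 1, 1, 0, 0])     # majority class is 1

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print("Baseline predictions:", baseline.predict(X))
print("Baseline accuracy:", baseline.score(X, y))   # reference point for complex models
```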
12.Model evaluation
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.
A confusion matrix is also known as an error matrix.
A confusion matrix is a technique for summarizing the performance of a classification
algorithm.
A confusion matrix is nothing but a table with two dimensions, viz. "Actual" and "Predicted"; the cells of the table are "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)", and "False Negatives (FN)".
True Positive:
Interpretation: You predicted positive and it is true.
You predicted that a woman is pregnant and she actually is.
True Negative:
Interpretation: You predicted negative and it’s true.
You predicted that a man is not pregnant and he actually
is not.
False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.
You predicted that a man is pregnant but he actually is
not.
False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.
You predicted that a woman is not pregnant but she actually is.
Just Remember, we describe predicted values as Positive and Negative and actual values as
True and False.
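A minimal sketch of computing these four counts with scikit-learn's confusion_matrix, using invented labels where 1 = pregnant (positive) and 0 = not pregnant (negative):

```python
# Confusion matrix sketch: count TP, TN, FP, FN from actual vs predicted labels.
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# ravel() on the 2x2 matrix returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```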
13.Class imbalance
Class imbalance (CI) in classification problems arises when the number of observations
belonging to one class is lower than the other. Ensemble learning combines multiple models
to obtain a robust model and has been prominently used with data augmentation methods to
address class imbalance problems.
Problem with Handling Imbalanced Data for Classification
Algorithms may get biased towards the majority class and thus tend to predict output
as the majority class.
Minority class observations look like noise to the model and are ignored by the model.
An imbalanced dataset gives a misleading accuracy score, as the sketch below illustrates.
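A minimal sketch of the misleading-accuracy problem on a synthetic 95/5 class split: a model that always predicts the majority class still reports 95% accuracy while completely missing the minority class.

```python
# Class imbalance sketch: high accuracy despite ignoring the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)      # 95 majority, 5 minority observations
y_pred = np.zeros(100, dtype=int)          # always predicts the majority class

print("Accuracy:", accuracy_score(y_true, y_pred))        # 0.95, looks good
print("Minority recall:", recall_score(y_true, y_pred))   # 0.0, minority ignored
```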
Let us first understand the meaning of the two terms ROC and AUC.
ROC: Receiver Operating Characteristics
AUC: Area Under Curve
AUC represents the probability with which our model can distinguish between the two classes present in our target.
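A minimal sketch of computing the ROC curve and AUC with scikit-learn, using invented true labels and predicted probabilities:

```python
# ROC / AUC sketch: score how well predicted probabilities separate the classes.
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 0, 1, 1, 0, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.8, 0.2, 0.9, 0.65]   # model's probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))        # area under that curve
```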
4. Specificity
Specificity measures the proportion of actual negative
instances that are correctly identified by the model as negative.
It represents the ability of the model to correctly identify
negative instances
Specificity = TN / (TN + FP) = 1 − FPR
And as said earlier ROC is nothing but the plot between TPR
and FPR across all possible thresholds and AUC is the entire
area beneath this ROC curve.
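A minimal sketch of computing specificity from a confusion matrix, using the same scikit-learn helper as above and invented labels:

```python
# Specificity sketch: true negative rate computed from confusion matrix counts.
from sklearn.metrics import confusion_matrix

actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
specificity = tn / (tn + fp)
print("Specificity:", specificity, "= 1 - FPR:", 1 - fp / (fp + tn))
```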