Data Analytics - Unit 2
Table of Contents
1) Data Analysis: Regression modeling
2) Multivariate analysis
3) Bayesian modeling, inference and Bayesian networks
4) Support vector and kernel methods
5) Analysis of time series: linear systems analysis & nonlinear dynamics
6) Neural networks: learning and generalization
7) Fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic
search methods.
Data Analysis: Regression Modeling
Types of Regression Models:
3. Polynomial Regression:
- Allows for non-linear relationships by including polynomial terms.
- Example: Predicting sales based on advertising spending with a quadratic term.
6. Logistic Regression:
- Used for binary classification problems.
- Outputs probabilities between 0 and 1 using the logistic function.
- Example: Predicting whether a customer will buy a product (yes/no).
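As a quick illustration of the logistic regression example above, the following is a minimal scikit-learn sketch; the feature columns and customer data are invented purely for demonstration.
# Logistic regression for a binary "will the customer buy?" problem.
# The features [advertising_spend, past_purchases] and the data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10, 0], [20, 1], [35, 2], [50, 4], [65, 5], [80, 7]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = no purchase, 1 = purchase

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns probabilities between 0 and 1 from the logistic function
print(model.predict_proba([[40, 3]]))  # probability of "no" and "yes"
print(model.predict([[40, 3]]))        # crisp 0/1 decision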
Steps in Building a Regression Model:
1. Data Collection:
- Gather relevant data for the dependent and independent variables.
2. Data Cleaning:
- Handle missing values, outliers, and any data inconsistencies.
4. Feature Selection:
- Identify and select relevant independent variables.
5. Train-Test Split:
- Divide the dataset into training and testing sets to evaluate model performance.
6. Model Building:
- Choose the appropriate regression model based on the problem at hand.
8. Model Evaluation:
- Assess the model's performance on the testing set using metrics like Mean Squared
Error (MSE), R-squared, or accuracy for classification problems.
9. Model Interpretation:
- Understand the impact of each independent variable on the dependent variable.
10. Prediction:
- Use the trained model to make predictions on new, unseen data.
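A minimal end-to-end sketch of the workflow above using scikit-learn; the file name and the "sales" target column are assumptions for illustration only.
# Sketch of the regression workflow: split, fit, evaluate, interpret, predict.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("sales.csv")          # steps 1-2: data already collected and cleaned
X = df.drop(columns=["sales"])         # step 4: selected independent variables
y = df["sales"]                        # dependent variable

# step 5: train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# step 6: model building
model = LinearRegression().fit(X_train, y_train)

# step 8: model evaluation
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R-squared:", r2_score(y_test, pred))

# step 9: interpretation - one coefficient per independent variable
print(dict(zip(X.columns, model.coef_)))

# step 10: prediction on new, unseen data with the same columns as X
# model.predict(new_data)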
Assumptions of Linear Regression:
1. Linearity:
- The relationship between variables should be linear.
2. Independence:
- Residuals (the differences between predicted and actual values) should be
independent.
3. Homoscedasticity:
- Residuals should have constant variance across all levels of the independent variable.
4. Normality of Residuals:
- Residuals should be approximately normally distributed.
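These assumptions can be checked informally from the residuals of a fitted model. A rough sketch, assuming a fitted model and held-out data such as `model`, `X_test`, `y_test` from the workflow sketch above:
# Quick residual diagnostics for linearity, homoscedasticity, and normality.
import matplotlib.pyplot as plt
from scipy import stats

residuals = y_test - model.predict(X_test)

# Linearity / homoscedasticity: residuals vs fitted values should show no pattern
plt.scatter(model.predict(X_test), residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normality of residuals: Shapiro-Wilk test (large p-value suggests approximate normality)
print(stats.shapiro(residuals))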
There are several variations of regression analysis, including simple linear, multiple linear, and
nonlinear regression. The most common models are simple linear and multiple linear regression.
Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The mathematical representation of simple linear
regression is:
Y = a + bX + \epsilon
Where:
• Y: Dependent variable
• X: Independent (explanatory) variable
• a: Intercept
• b: Slope
• \epsilon: Residual (error)
Multiple linear regression analysis is essentially similar to the simple linear model, with
the exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:
Y = a + bX_1 + cX_2 + dX_3 + \epsilon
Where:
• Y: Dependent variable
• X_1,X_2,X_3: Independent (explanatory) variables
• a: Intercept
• b, c, d: Slopes
• \epsilon: Residual (error)
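A brief sketch of fitting this equation with statsmodels, so that the intercept a and the slopes b, c, d can be read off directly; the data here is synthetic and the true coefficients are chosen arbitrarily for illustration.
# Fit Y = a + b*X1 + c*X2 + d*X3 + error by ordinary least squares.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # columns play the role of X1, X2, X3
Y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.5, size=100)

X_design = sm.add_constant(X)    # adds the column of ones for the intercept a
results = sm.OLS(Y, X_design).fit()
print(results.params)            # estimated [a, b, c, d]
print(results.summary())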
Tools and Techniques for Regression Analysis:
1. Programming Languages:
- Python (with libraries like NumPy, pandas, scikit-learn, statsmodels)
- R
2. Visualization Tools:
- Matplotlib, Seaborn for data visualization.
3. Model Evaluation:
- Scikit-learn provides functions for training, testing, and evaluating regression models.
4. Advanced Techniques:
- Cross-validation, regularization, and feature engineering for improved model
performance.
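For instance, cross-validation and regularization from item 4 can be combined in a few lines with scikit-learn; the dataset below is synthetic and the regularization strength is an arbitrary choice.
# 5-fold cross-validation of a ridge-regularized (L2) linear regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0)   # alpha controls the strength of the L2 penalty
scores = cross_val_score(ridge, X, y, cv=5, scoring="r2")
print("R-squared per fold:", scores)
print("Mean R-squared:", scores.mean())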
Multivariate Analysis
Multivariate analysis is a statistical method that involves the analysis of multiple variables
simultaneously. It is used to uncover patterns, relationships, and dependencies between
variables, and to identify the underlying factors that influence them.
- It is used when the researcher wants to understand how multiple variables interact with
each other.
Purpose:
- Make predictions about the value of one variable based on the values of others.
Examples:
For example, in healthcare, doctors can use multivariate analysis to identify risk
factors for certain diseases and develop personalized treatment plans for their patients.
Common Techniques in Multivariate Analysis:
1. Multiple Regression:
- Models a dependent variable as a function of several independent variables.
- Equation: Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon
2. Principal Component Analysis (PCA):
- Identifies the principal components (linear combinations of variables) that explain the
maximum variance.
3. Cluster Analysis:
- Groups observations so that members of the same cluster are more similar to one another
than to observations in other clusters.
4. Canonical Correlation Analysis:
- Identifies linear combinations of two sets of variables that have maximum correlation with
each other.
5. Discriminant Analysis:
- Finds linear combinations of variables that best separate two or more predefined groups.
6. Multivariate Analysis of Variance (MANOVA):
- Determines whether there are any statistically significant differences between group
means across several dependent variables.
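As a small illustration of principal component analysis from the list above, the following sketch reduces a standardized dataset to two components; the dataset and number of components are arbitrary choices.
# PCA: find the linear combinations of variables that explain the most variance.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to variable scale

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # coordinates on the first two components
print("Explained variance ratio:", pca.explained_variance_ratio_)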
Steps in a Multivariate Analysis:
1. Data Preparation:
2. Variable Selection:
3. Assumptions Check:
- Ensure that the assumptions of the chosen multivariate technique are met.
5. Validation:
Challenges in Multivariate Analysis:
1. Multicollinearity:
- High correlation between independent variables can affect the reliability of results.
2. Interpretability:
3. Sample Size:
- Multivariate analyses may require larger sample sizes compared to univariate analyses.
4. Assumption Violations:
Multivariate analysis is a powerful tool for extracting meaningful insights from complex
datasets. It helps researchers uncover relationships, patterns, and dependencies that may
not be apparent in univariate or bivariate analyses. Proper understanding of the chosen
technique, careful consideration of assumptions, and thoughtful interpretation of results
are essential for a successful multivariate analysis.
Bayesian Modeling, Inference and Bayesian Networks
Bayesian modeling is a statistical method that involves updating prior beliefs about a
parameter or hypothesis based on new data. It is based on Bayes' theorem, which states
that the probability of a hypothesis given the data is proportional to the product of the
prior probability of the hypothesis and the likelihood of the data given the hypothesis.
Inference, on the other hand, is the process of drawing conclusions from data. Bayesian
inference involves calculating the posterior probability distribution of a parameter or
hypothesis given the data. This distribution takes into account both the prior belief and the
likelihood of the data.
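A worked sketch of this prior-to-posterior updating, using the standard beta-binomial conjugate pair; the prior parameters and the observed counts below are made up for illustration.
# Bayesian updating: Beta(2, 2) prior on a success probability, then observe 7 successes in 10 trials.
# With a Beta prior and a binomial likelihood, the posterior is Beta(a + successes, b + failures).
from scipy import stats

a_prior, b_prior = 2, 2          # hypothetical prior beliefs
successes, failures = 7, 3       # hypothetical observed data

a_post, b_post = a_prior + successes, b_prior + failures
posterior = stats.beta(a_post, b_post)

print("Posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))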
Bayesian networks are graphical models that represent the probabilistic relationships
between variables. They are used to model complex systems and make predictions based
on uncertain data. Bayesian networks consist of nodes representing variables and edges
representing probabilistic dependencies between them.
Key Concepts in Bayesian Modeling:
- It is based on Bayes' theorem, which updates our beliefs (posterior) based on prior
knowledge and new evidence.
- Prior Distribution:
- Represents beliefs about the parameter or hypothesis before observing the data.
- Likelihood Function:
- Describes how probable the observed data is under each possible value of the parameter.
- Posterior Distribution:
- Combines the prior and likelihood to provide updated beliefs after observing the
data.
- Posterior Predictive Checks:
- Assess the model's goodness-of-fit by comparing simulated data from the posterior
distribution to the observed data.
- Model Comparison:
- Bayes factors and Deviance Information Criterion (DIC) are used for comparing
models.
- Parameter Estimation:
- Posterior means, medians, and credible intervals summarize plausible parameter values.
- Hypothesis Testing:
- Posterior probabilities and Bayes factors are used to compare competing hypotheses.
- Decision Analysis:
- Decisions are chosen to maximize expected utility computed under the posterior
distribution.
Bayesian Networks:
Bayesian networks (or belief networks) model probabilistic relationships among a set
of variables using a directed acyclic graph (DAG).
- Parameters:
- Each node is associated with a conditional probability table (CPT) specifying the
probability distribution given the parent nodes.
- Variable Elimination:
- An exact inference algorithm that answers probability queries by summing out
(eliminating) variables one at a time.
- Medical Diagnosis:
- Bayesian networks are used for diagnosing diseases based on symptoms and test
results.
- Risk Assessment:
- Used to model dependencies among risk factors, for example in finance and engineering.
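A hand-rolled sketch of the medical-diagnosis idea: a two-node network Disease -> Test with made-up conditional probability tables, queried with Bayes' theorem. All of the numbers are hypothetical.
# Tiny Bayesian network: Disease -> Test, using hypothetical CPTs.
# Known: P(Disease) and P(Test = positive | Disease); query: P(Disease | Test = positive).
p_disease = 0.01                                   # prior P(D = true)
p_pos_given_d = {True: 0.95, False: 0.05}          # CPT for the Test node

# Total probability of a positive test
p_pos = p_pos_given_d[True] * p_disease + p_pos_given_d[False] * (1 - p_disease)

# Bayes' theorem: P(D = true | Test = positive)
p_d_given_pos = p_pos_given_d[True] * p_disease / p_pos
print("P(disease | positive test) =", round(p_d_given_pos, 4))   # roughly 0.16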
Examples:
A classic example of Bayesian modeling is the Monty Hall problem, where a contestant is
asked to choose one of three doors behind which there may be a prize. After the
contestant chooses a door, the host opens one of the other two doors, revealing that there
is no prize behind it. The contestant is then given the option to switch their choice to the
remaining door. Bayesian modeling can be used to calculate the probability of winning if
the contestant switches their choice.
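The Monty Hall result, that switching wins about two thirds of the time, can be checked with a short simulation; a minimal sketch:
# Monte Carlo check of the Monty Hall problem.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # Host opens a door that is neither the contestant's choice nor the prize.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print("Stay:  ", play(switch=False))   # about 0.33
print("Switch:", play(switch=True))    # about 0.67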
Bayesian modeling and Bayesian networks are powerful tools in data analysis, providing
a principled way to incorporate prior knowledge, make inferences, and model complex
dependencies among variables. These techniques find applications in various fields,
including healthcare, finance, and artificial intelligence. Continuous advancements in
methodologies and computational tools continue to enhance the effectiveness and
applicability of Bayesian approaches in the realm of data analysis.
Analysis of Time Series: Linear Systems Analysis and Nonlinear Dynamics
Time series analysis is a statistical technique used to analyze and interpret data over time.
It involves studying patterns and trends in data collected at regular intervals, such as
hourly, daily, weekly, monthly, or yearly. The purpose of time series analysis is to
identify underlying patterns and relationships in the data and to use this information to
make predictions about future events.
Time series analysis is an important tool for data analytics because it allows us to uncover
patterns and trends in data that might not be apparent from a simple visual inspection. By
analyzing data over time, we can identify seasonal patterns, trends, and cycles that can
help us make predictions about future events.
- Time series analysis involves studying the patterns and trends within a dataset where
each data point is associated with a specific time.
Linear systems analysis is a mathematical approach used to study the behavior of linear
systems. A linear system is one in which the output is proportional to the input. Linear
systems analysis involves analyzing the response of a system to various inputs and
determining the transfer function that relates the input to the output.
Linear systems analysis is a mathematical technique that allows us to study the behavior
of linear systems. Linear systems are commonly used in engineering and physics, where
they are used to model a wide range of physical phenomena. By analyzing the response of
a linear system to various inputs, we can determine the transfer function that relates the
input to the output.
Autoregressive (AR) Models:
- Assume that the current value of a variable is a linear combination of its past values.
Moving Average (MA) Models:
- Assume that the current value is a linear combination of past white noise or error
terms.
Stationarity:
- A time series is stationary if its statistical properties, such as mean and variance, remain
constant over time.
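Stationarity is often checked with the augmented Dickey-Fuller test; a rough sketch with statsmodels, using a synthetic series chosen only for illustration:
# Augmented Dickey-Fuller test: a small p-value suggests the series is stationary.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = rng.normal(size=200).cumsum()        # a random walk: non-stationary

stat, p_value = adfuller(series)[:2]
print("ADF statistic:", stat, "p-value:", p_value)

# Differencing is a common way to make a non-stationary series stationary.
stat_d, p_value_d = adfuller(np.diff(series))[:2]
print("After differencing, p-value:", p_value_d)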
- ACF measures the correlation between a time series and its lagged values.
- PACF measures the correlation between a time series and its lagged values after
removing the effect of the intermediate lag values.
- Use model evaluation metrics such as AIC (Akaike Information Criterion) and BIC
(Bayesian Information Criterion) for model selection.
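A minimal sketch of fitting candidate ARIMA models and comparing them by AIC and BIC with statsmodels; the series and the candidate orders are arbitrary choices for illustration.
# Fit two candidate ARIMA models and prefer the one with the lower AIC/BIC.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))           # synthetic series, illustration only

for order in [(1, 1, 0), (2, 1, 1)]:
    fit = ARIMA(y, order=order).fit()
    print(order, "AIC:", round(fit.aic, 1), "BIC:", round(fit.bic, 1))

# The selected model would then be used for forecasting, e.g. fit.forecast(steps=10)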
Nonlinear dynamics is a branch of mathematics that deals with the behavior of nonlinear
systems. A nonlinear system is one in which the output is not proportional to the input.
Nonlinear dynamics involves analyzing the behavior of systems that exhibit complex and
unpredictable behavior, such as chaos and bifurcations.
Nonlinear dynamics is a more advanced technique that allows us to study the behavior of
nonlinear systems. Nonlinear systems are often more complex and unpredictable than
linear systems, and they can exhibit behaviors such as chaos and bifurcations. Nonlinear
dynamics allows us to study these complex behaviors and to make predictions about how
a system will behave under different conditions.
Nonlinear Dynamics:
- Nonlinear dynamics studies systems where the relationship between variables is not
linear.
Chaos Theory:
- Chaos theory deals with complex systems that appear to be random but are governed by
underlying deterministic laws.
- The butterfly effect: Small changes in initial conditions can lead to vastly different
outcomes.
- Represent the system as a set of equations defining its states and the relationship
between them.
Attractors:
- Attractors are states towards which a dynamic system evolves over time.
Lyapunov Exponents:
- Measure how quickly nearby trajectories diverge; a positive largest exponent is a
signature of chaos.
Recurrence Plots:
- Visualize the times at which a trajectory returns close to a previously visited state.
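The butterfly effect mentioned above can be seen in the logistic map x_{n+1} = r x_n (1 - x_n): in the chaotic regime, two almost identical starting points diverge quickly. A small sketch, with r and the initial values chosen arbitrarily:
# Sensitivity to initial conditions in the logistic map (r = 3.9, chaotic regime).
def logistic_trajectory(x0, r=3.9, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)   # initial condition differs by only 1e-6

for n in (0, 10, 25, 50):
    print(f"n={n:2d}  |difference| = {abs(a[n] - b[n]):.6f}")
# The tiny initial difference grows to order 1 within a few dozen iterations.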
Neural Networks: Learning and Generalization
Neural networks are a type of artificial intelligence (AI) that are modeled after the
structure and function of the human brain. They consist of interconnected nodes, called
neurons, that process and transmit information. Neural networks can be trained to
recognize patterns and make predictions based on input data.
Neural networks learn by adjusting the weights and biases of the connections between
neurons. This process is called backpropagation, and it involves comparing the output of
the network to the desired output and adjusting the weights and biases accordingly. Over
time, the network becomes better at recognizing patterns and making accurate predictions.
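A bare-bones sketch of this weight-adjustment loop: a single-hidden-layer network trained by backpropagation and gradient descent on the XOR problem. The layer sizes, learning rate, and iteration count are arbitrary choices, not a prescribed recipe.
# Minimal backpropagation on XOR: one hidden layer, sigmoid activations, squared error.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # input -> hidden weights and biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # hidden -> output weights and biases
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 1.0

for _ in range(5000):
    # Forward pass: compute the network output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: compare output to the desired output and propagate the error
    d_out = (out - y) * out * (1 - out)
    d_h = d_out @ W2.T * h * (1 - h)
    # Adjust weights and biases against the gradient
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(3))   # should end up close to [[0], [1], [1], [0]]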
Generalisation is the ability of a neural network to apply what it has learned to new,
unseen data. This is important because it allows the network to make accurate predictions
on data that it has not been specifically trained on. Generalisation can be improved by
using techniques such as regularization, which helps prevent overfitting, and cross-
validation, which tests the network's performance on different subsets of the data.
- Supervised Learning:
- Learning from labelled examples, where each input is paired with a desired output.
- Unsupervised Learning:
- Learning patterns and structure from unlabelled data (e.g., clustering).
- Reinforcement Learning:
- Learning through trial and error based on feedback from the environment.
- Input Layer:
- Receives the raw input features of each example.
- Hidden Layers:
- Transform the inputs through weighted sums followed by activation functions.
- Output Layer:
- Produces the network's prediction (e.g., a class probability or a numeric value).
- Activation Functions:
- Non-linear functions (e.g., sigmoid, ReLU) that allow the network to model non-linear
relationships.
- Loss Function:
- Measures the difference between the network's predictions and the desired outputs.
- Backpropagation:
- Iterative optimization process that adjusts weights based on the gradient of the loss
function.
- Optimizers:
- Algorithms (e.g., SGD, Adam) that guide the update of weights during training.
- Overfitting:
- Occurs when a model learns training data too well, including noise.
- Solutions include regularization techniques (e.g., dropout) and using more data.
- Underfitting:
- Occurs when a model is too simple to capture the underlying pattern in the data.
- Cross-Validation:
- Dividing the dataset into multiple subsets for training and testing to assess
generalization.
- Data Augmentation:
- Artificially enlarging the training set (e.g., by transforming images) to improve
generalization.
Hyperparameter Tuning:
- Adjusting parameters not learned during training (e.g., learning rate, batch size) to
optimize model performance; a short sketch combining this with cross-validation follows
after this list.
Transfer Learning:
- Reusing a network trained on one task as the starting point for a related task.
- Explainability:
- Understanding why a trained network makes a particular prediction remains difficult.
- Adversarial Attacks:
- Small, carefully crafted perturbations of the input can cause confident misclassifications.
- Continual Learning:
- Learning new tasks over time without forgetting previously learned ones.
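As a small illustration of cross-validation and hyperparameter tuning from the list above, the sketch below grid-searches a scikit-learn multilayer perceptron; the parameter grid and dataset are arbitrary choices made for the example.
# Grid search with 5-fold cross-validation over a few MLP hyperparameters.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

param_grid = {
    "hidden_layer_sizes": [(32,), (64,)],
    "alpha": [1e-4, 1e-2],   # L2 regularization strength, which helps against overfitting
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))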
Fuzzy Logic
Fuzzy logic is a mathematical approach that deals with uncertainty and imprecision. It is a
type of logic that allows for partial truths rather than just true or false values. Fuzzy logic
is used to model complex systems with incomplete or uncertain data by assigning degrees
of membership to different values.
1.1 Introduction:
Fuzzy models are mathematical representations that capture uncertainty and imprecision
in data. Extracting fuzzy models from data involves the identification and modeling of
fuzzy relationships.
1.2.1 Data Preprocessing:
Clean and preprocess the data to handle missing values and outliers. Fuzzy models are
sensitive to noise, so data quality is crucial.
1.2.2 Fuzzification:
Convert crisp input data into fuzzy sets. This step involves defining membership functions
that describe the degree to which an element belongs to a fuzzy set.
1.2.3 Rule Generation:
In this step, fuzzy rules are derived from the fuzzy sets. This can be done using methods
like clustering, rule induction, or expert knowledge.
1.2.4 Inference:
Apply the fuzzy rules to make predictions or decisions. This involves combining fuzzy
rules to obtain a fuzzy output.
1.2.5 Defuzzification:
Convert fuzzy output into a crisp value. This step is necessary when the final decision or
prediction needs to be in a non-fuzzy form.
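A hand-rolled sketch of steps 1.2.2-1.2.5 for a toy temperature-to-fan-speed controller; the membership functions, rules, and output levels are invented for illustration only.
# Toy fuzzy system: fuzzify a temperature, apply two rules, defuzzify to a fan speed.
def tri(x, a, b, c):
    """Triangular membership function with corners a, b, c."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

temp = 28.0  # crisp input (degrees Celsius)

# 1.2.2 Fuzzification: degrees of membership in "cool" and "hot"
mu_cool = tri(temp, 10, 20, 30)
mu_hot = tri(temp, 20, 30, 40)

# 1.2.3-1.2.4 Rules and inference (rule strength used as a weight):
#   IF temperature is cool THEN fan speed is low  (low  ~ 20%)
#   IF temperature is hot  THEN fan speed is high (high ~ 80%)
# 1.2.5 Defuzzification by weighted average (Sugeno-style) of the rule outputs
fan_speed = (mu_cool * 20 + mu_hot * 80) / (mu_cool + mu_hot)
print("memberships:", mu_cool, mu_hot, "-> fan speed:", round(fan_speed, 1), "%")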
ARCHITECTURE
Its architecture contains four parts:
• RULE BASE: It contains the set of rules and the IF-THEN conditions provided by
the experts to govern the decision-making system, on the basis of linguistic
information. Recent developments in fuzzy theory offer several effective methods
for the design and tuning of fuzzy controllers. Most of these developments reduce
the number of fuzzy rules.
• FUZZIFICATION: It is used to convert inputs i.e. crisp numbers into fuzzy sets.
Crisp inputs are basically the exact inputs measured by sensors and passed into the
control system for processing, such as temperature, pressure, rpm’s, etc.
• INFERENCE ENGINE: It determines the matching degree of the current fuzzy
input with respect to each rule and decides which rules are to be fired according to
the input field. Next, the fired rules are combined to form the control actions.
• DEFUZZIFICATION: It is used to convert the fuzzy sets obtained by the inference
engine into a crisp value. There are several defuzzification methods available and
the best-suited one is used with a specific expert system to reduce the error.
2.1 Overview:
Fuzzy decision trees extend traditional decision trees by incorporating fuzzy logic. Instead
of crisp decisions at each node, fuzzy decision trees allow for uncertainty and gradual
transitions between classes.
2.2.1 Fuzzy Splitting Criteria:
Define fuzzy criteria for splitting nodes. Membership functions play a key role in
determining the fuzzy partitions.
2.2.2 Node Evaluation:
Evaluate the impurity or entropy of fuzzy nodes. The goal is to find the best fuzzy
partition that minimizes uncertainty.
2.2.3 Rule Extraction:
Convert the fuzzy decision tree into a set of fuzzy rules. Each path from the root to a leaf
node represents a fuzzy rule.
2.2.4 Interpretability:
Because every path corresponds to a fuzzy rule, the resulting tree can be read and
explained in linguistic terms.
3.1 Introduction:
Stochastic search methods are optimization techniques that use randomness to explore the
solution space efficiently. They can be applied to tune fuzzy models.
Apply stochastic search to refine fuzzy rules. This is particularly useful in cases where
rule generation methods may produce suboptimal rules.
Include a robust evaluation framework to assess the performance of the fuzzy model after
stochastic search optimization.
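A minimal sketch of the idea: a random (stochastic) search over the peak of a triangular membership function, keeping whichever candidate gives the lowest error against hypothetical calibration data.
# Stochastic (random) search to tune one fuzzy-model parameter: the peak b of a
# triangular membership function, scored against hypothetical target memberships.
import random

def tri(x, a, b, c):
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Hypothetical calibration data: (input value, desired membership in "warm")
data = [(15, 0.0), (20, 0.5), (25, 1.0), (30, 0.5), (35, 0.0)]

def error(b):
    return sum((tri(x, 15, b, 35) - target) ** 2 for x, target in data)

random.seed(0)
best_b, best_err = 20.0, error(20.0)
for _ in range(1000):
    candidate = random.uniform(16, 34)   # stay strictly between the fixed corners 15 and 35
    if error(candidate) < best_err:
        best_b, best_err = candidate, error(candidate)

print("best peak b:", round(best_b, 2), "error:", round(best_err, 4))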
Advantages
• A fuzzy logic system can work with any type of input, whether it is imprecise, distorted,
or noisy.
• The construction of Fuzzy Logic Systems is easy and understandable.
• Fuzzy logic builds on the mathematical concepts of set theory, and the reasoning
involved is quite simple.
• It provides a very efficient solution to complex problems in all fields of life as it
resembles human reasoning and decision-making.
• The algorithms can be described with little data, so little memory is required.
Disadvantages
• Many researchers have proposed different ways to solve a given problem through fuzzy
logic, which leads to ambiguity; there is no single systematic approach to solving a
problem with fuzzy logic.
• Proof of its characteristics is difficult or impossible in most cases because a precise
mathematical description of the approach is often unavailable.
• Because fuzzy logic works on imprecise as well as precise data, accuracy is often
compromised.
Applications
• It is used in the aerospace field for altitude control of spacecraft and satellites.
• It has been used in the automotive system for speed control, traffic control.
• It is used for decision-making support systems and personal evaluation in the
large company business.
• It has application in the chemical industry for controlling the pH, drying, chemical
distillation process.
• Fuzzy logic is used in Natural language processing and various intensive
applications in Artificial Intelligence.
• Fuzzy logic is extensively used in modern control systems such as expert systems.
• Fuzzy logic is used with neural networks because it mimics how a person makes
decisions, only much faster. This is done by aggregating data and turning it into more
meaningful information in the form of partial truths (fuzzy sets).
Fuzzy logic, when applied to data modeling and decision-making, offers a powerful tool
for handling uncertainty. Extracting fuzzy models from data, building fuzzy decision
trees, and employing stochastic search methods contribute to the robustness and
effectiveness of fuzzy systems in various applications.