BAI 3303 Notes
- Logistic Regression works for binary classification: it focuses on scenarios where the outcome variable Y
takes on two values, typically Y = 0 or Y = 1, and aims to estimate the probabilities P(Y = 1) or P(Y = 0).
P.S. Standard linear regression models are not suitable for predicting probabilities, which are bounded
between 0 and 1.
- The Logit: a function used to relate the predictor variables to the outcome; it is a transformation of
the response variable Y in logistic regression. The logistic response function maps the logit back to a
probability, ensuring that the predicted probabilities lie between 0 and 1.
- The Odds: Defined as the probability of an event occurring divided by the probability of it not occurring.
In logistic regression, odds are used to relate predictors to the probability of the event.
- The logit can be converted back to a probability, which in turn can be used to make classifications.
- Classification Cutoff: Classifications are made by establishing a cutoff level for the estimated
probabilities. The most common choice is 0.5, but other considerations, such as maximizing accuracy or
sensitivity, or minimizing misclassification costs, can also guide the choice.
P.S. The logit is a linear function of the predictors x1, x2, ... and takes values from -infinity to +infinity.
The relationship between logit, odds, and probability: odds = p / (1 - p), logit = ln(odds), and therefore
p = 1 / (1 + e^(-logit)).
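As a minimal sketch of this logit-odds-probability chain in code (the coefficients b0, b1, b2 are
hypothetical, chosen only for illustration):

    import math

    # Hypothetical fitted model: logit = b0 + b1*x1 + b2*x2 (coefficients invented)
    b0, b1, b2 = -2.0, 0.8, 1.5

    def predict(x1, x2, cutoff=0.5):
        logit = b0 + b1 * x1 + b2 * x2      # linear in the predictors
        odds = math.exp(logit)              # odds = e^logit
        p = odds / (1 + odds)               # equivalently 1 / (1 + e^(-logit))
        return p, int(p >= cutoff)          # classify with the cutoff

    p, label = predict(x1=1.0, x2=1.2)
    print(f"P(Y=1) = {p:.3f} -> class {label}")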
Performance of the model is typically evaluated using confusion matrices, misclassification rates, and lift
(gains) charts.
Lift (gains) chart: shows how much better the model identifies the cases of interest than random
selection would, by ranking records from highest to lowest predicted probability. The goal when evaluating
a model with the confusion matrix is to maximize the counts of true positives (TP) and true negatives (TN)
and minimize false positives (FP) and false negatives (FN).
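A quick sketch with scikit-learn (the label vectors are made up for illustration):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (invented)
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classifications after applying the cutoff

    # Rows = actual, columns = predicted; for labels {0, 1}: [[TN, FP], [FN, TP]]
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    print("misclassification rate:", (cm[0, 1] + cm[1, 0]) / cm.sum())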
P.S. Multicollinearity is a statistical concept where several independent variables in a model are
correlated.
- Variable Selection: To avoid overfitting, variable selection techniques are employed to choose the most
relevant predictors.
P.S. P-values for Predictors: P-values are used to test the null hypothesis that a coefficient is equal to zero,
aiding in determining the significance of predictors.
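As a sketch with statsmodels (the data are synthetic, generated only so the example runs; the P>|z|
column in the summary is the p-value for each coefficient):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                       # two synthetic predictors
    logit = -1 + 2 * X[:, 0]                            # only x1 truly matters here
    y = (rng.random(200) < 1 / (1 + np.exp(-logit))).astype(int)

    model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(model.summary())    # P>|z| column tests H0: coefficient = 0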
- Multi-Class Classification: Logistic regression can be extended to handle more than two classes using
techniques such as creating dummy variables for each class or utilizing multinomial logistic regression
algorithms.
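A minimal scikit-learn sketch of multi-class logistic regression on the built-in Iris data
(scikit-learn's LogisticRegression fits a multinomial model for three classes by default):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)                  # three classes
    clf = LogisticRegression(max_iter=1000).fit(X, y)  # multinomial fit by default
    print(clf.predict(X[:3]))                          # predicted classes
    print(clf.predict_proba(X[:3]).round(3))           # one probability per class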
Chapter 11: Neural Networks
- Definition: Neural networks, a flexible data-driven method, can handle classification,
prediction, and feature extraction, forming the basis for deep learning, a powerful technique in
artificial intelligence applications like image and voice recognition.
P.S. Deep learning is the subset of machine learning methods based on artificial neural networks
with representation learning. The adjective "deep" refers to the use of multiple layers in the
network. Methods used can be either supervised, semi-supervised or unsupervised.
- Neural networks excel in capturing complex relationships between predictors and target
attributes, which is challenging for other models.
- Convolutional neural networks (CNNs) are prominent in image recognition tasks. They
aggregate pixel values using convolution operations, extracting features at various hierarchical
levels.
- NN Architecture:
Neural networks consist of layers: input, hidden, and output. Layers are interconnected, with
nodes in each layer receiving inputs from the previous layer and passing outputs to the next
layer.
- A feedforward network has a one-way flow without cycles.
- Node Functionality:
Nodes in hidden layers compute outputs using a weighted sum of inputs, adjusted by bias
values.
- Various activation functions can map inputs to outputs, such as the logistic/sigmoidal function
or rectified linear unit (ReLU).
P.S. The rectified linear unit (ReLU) or rectifier activation function introduces the property of
nonlinearity to a deep learning model and solves the vanishing gradients issue.
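A minimal sketch of one hidden node's computation (inputs, weights, and bias are hypothetical):

    import math

    def sigmoid(z):
        return 1 / (1 + math.exp(-z))   # logistic activation, output in (0, 1)

    def relu(z):
        return max(0.0, z)              # rectified linear unit

    inputs  = [0.5, 0.8, 0.2]           # scaled predictor values (invented)
    weights = [0.4, -0.3, 0.9]          # hypothetical learned weights
    bias    = 0.1

    z = bias + sum(w * x for w, x in zip(weights, inputs))  # weighted sum + bias
    print(sigmoid(z), relu(z))          # node output under each activation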
- Output Layer and Classification:
The output layer produces predictions, which are mapped to classifications. For binary
classification, a threshold value is applied to determine class membership. In multi-class
problems, the output node with the highest value represents the predicted class.
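Mapping output values to classes can be sketched as follows (the output values are made up):

    # Binary classification: apply a cutoff to the single output node's value
    output = 0.73                                   # invented output value
    print("class 1" if output >= 0.5 else "class 0")

    # Multi-class: the output node with the highest value wins
    outputs = [0.10, 0.65, 0.25]                    # one value per class (invented)
    print("predicted class:", outputs.index(max(outputs)))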
- NN Preprocessing Steps:
Neural networks perform best when predictors are on a consistent scale, typically [0, 1] or [-1,
1]. Scaling attributes is necessary before training to ensure uniformity and optimal performance.
P.S. Rescaling a predictor X to [0, 1] uses X_scaled = (X - a) / (b - a), where a and b are the minimum
and maximum of X. If a and b are unknown, we can estimate them from the minimal and maximal values
of X in the data. Even if new data exceed this range by a small amount (for example, scaled values
slightly larger than 1), this will not affect the results much.
For categorical variables: if the categories are equidistant, map them to equidistant points in the [0, 1]
range; otherwise, create dummy variables.
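A minimal sketch of this rescaling, estimating a and b from the data:

    def minmax_scale(values):
        a, b = min(values), max(values)      # estimate a and b from the data
        return [(v - a) / (b - a) for v in values]

    print(minmax_scale([10, 25, 40, 55, 70]))  # -> [0.0, 0.25, 0.5, 0.75, 1.0]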
The network then learns the complex relationships iteratively from the data: training adjusts the weights
based on prediction errors, using techniques such as backpropagation.
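As a toy illustration of one backpropagation-style update for a single sigmoid node with log-loss
(the learning rate, starting weights, and training example are all hypothetical; for log-loss, the error
signal at the output simplifies to p - y):

    import math

    w, b, lr = 0.5, 0.0, 0.1            # starting weight, bias, learning rate
    x, y = 1.0, 1.0                     # one training example (invented)

    p = 1 / (1 + math.exp(-(w * x + b)))   # forward pass through a sigmoid node
    error = p - y                          # for log-loss, dLoss/dlogit = p - y
    w -= lr * error * x                    # propagate the error back to the weight
    b -= lr * error                        # ... and to the bias
    print(w, b)                            # parameters nudged to reduce the error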
However, they are considered "black box" models with limited interpretability and require
substantial computational resources.
Cluster Analysis: Distance and Similarity Measures
- For categorical data: similarity metrics such as the matching coefficient or Jaccard's coefficient.
P.S. The Jaccard index (Jaccard similarity coefficient) is defined as the size of the intersection divided
by the size of the union of two sets; for example, it can compare the set of predicted labels for a
sample to the corresponding set of true labels.
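A minimal sketch of Jaccard's coefficient on two sets (the items are invented):

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)   # |intersection| / |union|

    # Two records share 2 of 4 distinct items -> 0.5
    print(jaccard({"milk", "bread", "eggs"}, {"milk", "eggs", "jam"}))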
Normalization: raw distance measures are influenced by scale, so we normalize (standardize) the data to
prevent features with large scales from dominating. Alternatively, rescale to the [0, 1] range: subtract
the minimum, then divide by the range.
Euclidean distance, while widely used, has several important features to consider (see the sketch after
this list):
1. Scale Dependence: Changing the units of one attribute can significantly influence results.
Normalizing the data is a common solution to mitigate this issue. However, if certain
measurements are more important, unequal weighting should be considered.
2. Ignorance of Relationship: Euclidean distance disregards the relationship between
measurements. If measurements are strongly correlated, other distances like statistical distance
may be more appropriate.
3. Sensitivity to Outliers: Euclidean distance is sensitive to outliers. If the data contains outliers
and their removal is not feasible, more robust distances like Manhattan distance are preferred.
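A minimal numpy sketch of points 1 and 3: normalize first so neither attribute dominates, then compare
Euclidean and Manhattan distance (the records are invented):

    import numpy as np

    X = np.array([[30.0, 50_000.0],   # age, income: very different scales (invented)
                  [45.0, 52_000.0],
                  [25.0, 90_000.0]])

    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score normalization

    d_euclid    = np.linalg.norm(Z[0] - Z[1])  # square root of sum of squared diffs
    d_manhattan = np.abs(Z[0] - Z[1]).sum()    # more robust to outliers
    print(d_euclid, d_manhattan)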
Other popular distance metrics include:
- Correlation-based similarity: Measures similarity rather than dissimilarity. Can be converted to
distance measures.
- Maximum coordinate distance: Looks at the measurement where records deviate most.
- Gower's similarity measure: Weighted average of distances computed for each attribute,
scaled to a [0, 1] range.
- Average distance: Average of all possible distances between records in one cluster and records
in the other.
- Centroid distance: Distance between the centroids of two clusters, where a centroid
represents the average measurements across all records in that cluster.
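As a sketch, the centroid distance between two toy clusters (the coordinates are invented):

    import numpy as np

    cluster_a = np.array([[1.0, 2.0], [2.0, 3.0]])   # invented records
    cluster_b = np.array([[8.0, 8.0], [9.0, 7.0]])

    centroid_a = cluster_a.mean(axis=0)   # average measurement per attribute
    centroid_b = cluster_b.mean(axis=0)
    print(np.linalg.norm(centroid_a - centroid_b))   # centroid distance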
Hierarchical Methods
- Agglomerative Methods: Start with n clusters, merge closest clusters until one remains.
- Divisive Methods: Start with one cluster, divide into smaller clusters.
Hierarchical Clustering
- Dendrogram illustrates the process.
- Steps: Start with n clusters, merge the closest pair, repeat until one cluster remains (see the sketch
below).
- Caveats: watch out for clusters that arise by random chance; different methods and distance measures
can produce different results.
- Desirable Cluster Features: Stability and separation.
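A minimal sketch of agglomerative clustering with SciPy (the five 2-D records are invented; linkage
merges the closest clusters step by step, and fcluster cuts the resulting dendrogram):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 2], [2, 2], [8, 9], [9, 8], [5, 5]], dtype=float)

    Z = linkage(X, method="average")   # agglomerative: merge closest clusters first
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram into 2 clusters
    print(labels)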
---------------Summary-----------------------------------------------------------------------------------------------
- Cluster analysis is exploratory; useful when it produces meaningful clusters.
Association Rules
- Candidates: All possible rules between items are examined in order to select those likely to indicate
true dependence.
- Support: The support of a rule measures the degree to which the data support its validity; it is
determined by the number (or fraction) of transactions that contain both the premise (antecedent) and
the conclusion (consequent) itemsets.
- Frequent Itemsets:
- Definition: Itemsets with support exceeding a selected minimum support threshold.
- Computation: Typically involves considering only combinations that occur with higher
frequency in the database to reduce computational time.
- Apriori Algorithm:
The Apriori algorithm is a classic method for generating frequent itemsets, crucial in association rule
mining. The key idea: it begins by generating frequent itemsets with one item (one-itemsets) and then
recursively generates frequent itemsets with increasing numbers of items until all sizes are covered.
- Generating One-Itemsets: Easily achieved by counting the occurrences of each item in the
database, with the transaction counts representing the supports for the one-itemsets.
- Generating K-Itemsets: Utilizes the frequent (k − 1)-itemsets generated in the previous step.
Each step involves a single pass through the database, making the algorithm efficient even for
databases with many unique items.
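A toy sketch of the first two passes (the transactions are invented; a full implementation would also
prune candidate k-itemsets whose (k - 1)-subsets are not frequent):

    from itertools import combinations
    from collections import Counter

    transactions = [{"milk", "bread"}, {"milk", "eggs"},
                    {"milk", "bread", "eggs"}, {"bread", "eggs"}]
    min_support = 2                     # minimum transaction count

    # Pass 1: count one-itemsets
    ones = Counter(item for t in transactions for item in t)
    frequent1 = {i for i, c in ones.items() if c >= min_support}

    # Pass 2: candidate two-itemsets built from frequent one-itemsets, then counted
    pairs = Counter()
    for t in transactions:
        for pair in combinations(sorted(frequent1 & t), 2):
            pairs[pair] += 1
    frequent2 = {p: c for p, c in pairs.items() if c >= min_support}
    print(frequent1, frequent2)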
FP-Growth Algorithm:
An alternative to the Apriori algorithm, FP-Growth, proposed by Han et al. (2000), offers improved
computational efficiency. Methodology: it constructs an FP-tree structure to compactly represent the
transactions in a transaction set, in two steps: ordering items by support and building the FP-tree.
- FP-Tree Construction: Begins with a null node and adds branches corresponding to items in
each transaction. Common items in transactions share overlapping paths, creating compact tree
structures.
- Advantages over Apriori : Requires only two passes through the transaction database, offers
better memory usage, and is computationally more efficient.
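As a sketch, assuming the third-party mlxtend package is installed (its fpgrowth function expects a
one-hot encoded DataFrame; the transactions are invented):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import fpgrowth

    transactions = [["milk", "bread"], ["milk", "eggs"],
                    ["milk", "bread", "eggs"], ["bread", "eggs"]]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)
    print(fpgrowth(onehot, min_support=0.5, use_colnames=True))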