BAI 3303 Notes
- Logistic Regression works for binary classification: it focuses on scenarios where the outcome variable Y
takes on two values, typically Y = 0 or Y = 1, and aims to estimate the probabilities P(Y = 1) or P(Y = 0).
P.S. Standard linear regression models are not suitable for predicting probabilities, which are bounded
between 0 and 1.
- The Logit: a function used to relate the predictor variables to the outcome; it is a transformation of
the response variable Y in logistic regression. The logistic response function maps the logit back to a
probability, ensuring that the predicted probabilities lie between 0 and 1.
- The Odds: Defined as the probability of an event occurring divided by the probability of it not occurring.
In logistic regression, odds are used to relate predictors to the probability of the event.
- The logit can be converted back to a probability, which in turn can be used to make classifications.
- Classification Cutoff: Classifications are made by establishing a cutoff level for the estimated
probabilities. The most common choice is 0.5, but other considerations, such as maximizing accuracy or
sensitivity, or minimizing misclassification costs, can also guide the choice.
P.S. The logit is a linear function of the predictors x1, x2, ... and takes values from -infinity to +infinity.
The relationship between logit, odds, and probability: odds = p / (1 - p), logit = ln(odds), and therefore
p = 1 / (1 + e^(-logit)).
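As a minimal sketch of this logit-odds-probability chain in code (the coefficients b0, b1, b2 are
hypothetical, chosen only for illustration):

    import math

    # Hypothetical fitted model: logit = b0 + b1*x1 + b2*x2 (coefficients invented)
    b0, b1, b2 = -2.0, 0.8, 1.5

    def predict(x1, x2, cutoff=0.5):
        logit = b0 + b1 * x1 + b2 * x2      # linear in the predictors
        odds = math.exp(logit)              # odds = e^logit
        p = odds / (1 + odds)               # equivalently 1 / (1 + e^(-logit))
        return p, int(p >= cutoff)          # classify with the cutoff

    p, label = predict(x1=1.0, x2=1.2)
    print(f"P(Y=1) = {p:.3f} -> class {label}")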
Performance of the model is typically evaluated using confusion matrices, misclassification rates, and lift
(gains) charts.
Lift (gains) chart: shows how much better the model identifies the cases of interest than random
selection would, by ranking records from highest to lowest predicted probability. The goal when evaluating
a model with the confusion matrix is to maximize the counts of true positives (TP) and true negatives (TN)
and minimize false positives (FP) and false negatives (FN).
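A quick sketch with scikit-learn (the label vectors are made up for illustration):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (invented)
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classifications after applying the cutoff

    # Rows = actual, columns = predicted; for labels {0, 1}: [[TN, FP], [FN, TP]]
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    print("misclassification rate:", (cm[0, 1] + cm[1, 0]) / cm.sum())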
P.S. Multicollinearity is a statistical concept where several independent variables in a model are
correlated.
- Variable Selection: To avoid overfitting, variable selection techniques are employed to choose the most
relevant predictors.
P.S. P-values for Predictors: P-values are used to test the null hypothesis that a coefficient is equal to zero,
aiding in determining the significance of predictors.
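As a sketch with statsmodels (the data are synthetic, generated only so the example runs; the P>|z|
column in the summary is the p-value for each coefficient):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                       # two synthetic predictors
    logit = -1 + 2 * X[:, 0]                            # only x1 truly matters here
    y = (rng.random(200) < 1 / (1 + np.exp(-logit))).astype(int)

    model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(model.summary())    # P>|z| column tests H0: coefficient = 0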
- Multi-Class Classification: Logistic regression can be extended to handle more than two classes using
techniques such as creating dummy variables for each class or utilizing multinomial logistic regression
algorithms.
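A minimal scikit-learn sketch of multi-class logistic regression on the built-in Iris data
(scikit-learn's LogisticRegression fits a multinomial model for three classes by default):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)                  # three classes
    clf = LogisticRegression(max_iter=1000).fit(X, y)  # multinomial fit by default
    print(clf.predict(X[:3]))                          # predicted classes
    print(clf.predict_proba(X[:3]).round(3))           # one probability per class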
Chapter 11: Neural Networks
- Definition: Neural networks, a flexible data-driven method, can handle classification,
prediction, and feature extraction, forming the basis for deep learning, a powerful technique in
artificial intelligence applications like image and voice recognition.
P.S. Deep learning is the subset of machine learning methods based on artificial neural networks
with representation learning. The adjective "deep" refers to the use of multiple layers in the
network. Methods used can be either supervised, semi-supervised or unsupervised.
- Neural networks excel in capturing complex relationships between predictors and target
attributes, which is challenging for other models.
- Convolutional neural networks (CNNs) are prominent in image recognition tasks. They
aggregate pixel values using convolution operations, extracting features at various hierarchical
levels.
- NN Architecture:
Neural networks consist of layers: input, hidden, and output. Layers are interconnected, with
nodes in each layer receiving inputs from the previous layer and passing outputs to the next
layer.
- A feedforward network has a one-way flow without cycles.
- Node Functionality:
Nodes in hidden layers compute outputs using a weighted sum of inputs, adjusted by bias
values.
- Various activation functions can map inputs to outputs, such as the logistic/sigmoidal function
or rectified linear unit (ReLU).
P.S. The rectified linear unit (ReLU) or rectifier activation function introduces the property of
nonlinearity to a deep learning model and solves the vanishing gradients issue.
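A minimal sketch of one hidden node's computation (inputs, weights, and bias are hypothetical):

    import math

    def sigmoid(z):
        return 1 / (1 + math.exp(-z))   # logistic activation, output in (0, 1)

    def relu(z):
        return max(0.0, z)              # rectified linear unit

    inputs  = [0.5, 0.8, 0.2]           # scaled predictor values (invented)
    weights = [0.4, -0.3, 0.9]          # hypothetical learned weights
    bias    = 0.1

    z = bias + sum(w * x for w, x in zip(weights, inputs))  # weighted sum + bias
    print(sigmoid(z), relu(z))          # node output under each activation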
- Output Layer and Classification:
The output layer produces predictions, which are mapped to classifications. For binary
classification, a threshold value is applied to determine class membership. In multi-class
problems, the output node with the highest value represents the predicted class.
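Mapping output values to classes can be sketched as follows (the output values are made up):

    # Binary classification: apply a cutoff to the single output node's value
    output = 0.73                                   # invented output value
    print("class 1" if output >= 0.5 else "class 0")

    # Multi-class: the output node with the highest value wins
    outputs = [0.10, 0.65, 0.25]                    # one value per class (invented)
    print("predicted class:", outputs.index(max(outputs)))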
- NN Preprocessing Steps:
Neural networks perform best when predictors are on a consistent scale, typically [0, 1] or [-1,
1]. Scaling attributes is necessary before training to ensure uniformity and optimal performance.
P.S. Rescaling a predictor X to [0, 1] uses X_scaled = (X - a) / (b - a), where a and b are the minimum
and maximum of X. If a and b are unknown, we can estimate them from the minimal and maximal values
of X in the data. Even if new data exceed this range by a small amount (for example, scaled values
slightly larger than 1), this will not affect the results much.
For categorical variables: if the categories are equidistant, map them to equidistant points in the [0, 1]
range; otherwise, create dummy variables.
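A minimal sketch of this rescaling, estimating a and b from the data:

    def minmax_scale(values):
        a, b = min(values), max(values)      # estimate a and b from the data
        return [(v - a) / (b - a) for v in values]

    print(minmax_scale([10, 25, 40, 55, 70]))  # -> [0.0, 0.25, 0.5, 0.75, 1.0]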
The network then learns the complex relationships iteratively from the data: training adjusts the weights
based on prediction errors, using techniques such as backpropagation.
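As a toy illustration of one backpropagation-style update for a single sigmoid node with log-loss
(the learning rate, starting weights, and training example are all hypothetical; for log-loss, the error
signal at the output simplifies to p - y):

    import math

    w, b, lr = 0.5, 0.0, 0.1            # starting weight, bias, learning rate
    x, y = 1.0, 1.0                     # one training example (invented)

    p = 1 / (1 + math.exp(-(w * x + b)))   # forward pass through a sigmoid node
    error = p - y                          # for log-loss, dLoss/dlogit = p - y
    w -= lr * error * x                    # propagate the error back to the weight
    b -= lr * error                        # ... and to the bias
    print(w, b)                            # parameters nudged to reduce the error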
However, they are considered "black box" models with limited interpretability and require
substantial computational resources.
Cluster Analysis: Distance and Similarity Measures
- For categorical data: similarity metrics such as the matching coefficient or Jaccard's coefficient.
P.S. The Jaccard index (Jaccard similarity coefficient) is defined as the size of the intersection divided
by the size of the union of two sets; for example, it can compare the set of predicted labels for a
sample to the corresponding set of true labels.
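A minimal sketch of Jaccard's coefficient on two sets (the items are invented):

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)   # |intersection| / |union|

    # Two records share 2 of 4 distinct items -> 0.5
    print(jaccard({"milk", "bread", "eggs"}, {"milk", "eggs", "jam"}))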
Normalization: raw distance measures are influenced by scale, so we normalize (standardize) the data to
prevent features with large scales from dominating. Alternatively, rescale to the [0, 1] range: subtract
the minimum, then divide by the range.
Euclidean distance, while widely used, has several important features to consider (see the sketch after
this list):
1. Scale Dependence: Changing the units of one attribute can significantly influence results.
Normalizing the data is a common solution to mitigate this issue. However, if certain
measurements are more important, unequal weighting should be considered.
2. Ignorance of Relationship: Euclidean distance disregards the relationship between
measurements. If measurements are strongly correlated, other distances like statistical distance
may be more appropriate.
3. Sensitivity to Outliers: Euclidean distance is sensitive to outliers. If the data contains outliers
and their removal is not feasible, more robust distances like Manhattan distance are preferred.
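A minimal numpy sketch of points 1 and 3: normalize first so neither attribute dominates, then compare
Euclidean and Manhattan distance (the records are invented):

    import numpy as np

    X = np.array([[30.0, 50_000.0],   # age, income: very different scales (invented)
                  [45.0, 52_000.0],
                  [25.0, 90_000.0]])

    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score normalization

    d_euclid    = np.linalg.norm(Z[0] - Z[1])  # square root of sum of squared diffs
    d_manhattan = np.abs(Z[0] - Z[1]).sum()    # more robust to outliers
    print(d_euclid, d_manhattan)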
Other popular distance metrics include:
- Correlation-based similarity: Measures similarity rather than dissimilarity. Can be converted to
distance measures.
- Maximum coordinate distance: Looks at the measurement where records deviate most.
- Gower's similarity measure: Weighted average of distances computed for each attribute,
scaled to a [0, 1] range.
- Average distance: Average of all possible distances between records in one cluster and records
in the other.
- Centroid distance: Distance between the centroids of two clusters, where a centroid
represents the average measurements across all records in that cluster.
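As a sketch, the centroid distance between two toy clusters (the coordinates are invented):

    import numpy as np

    cluster_a = np.array([[1.0, 2.0], [2.0, 3.0]])   # invented records
    cluster_b = np.array([[8.0, 8.0], [9.0, 7.0]])

    centroid_a = cluster_a.mean(axis=0)   # average measurement per attribute
    centroid_b = cluster_b.mean(axis=0)
    print(np.linalg.norm(centroid_a - centroid_b))   # centroid distance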
Hierarchical Methods
- Agglomerative Methods: Start with n clusters, merge closest clusters until one remains.
- Divisive Methods: Start with one cluster, divide into smaller clusters.
Hierarchical Clustering
- Dendrogram illustrates the process.
- Steps: Start with n clusters, merge the closest pair, repeat until one cluster remains (see the sketch
below).
- Caveats: watch out for clusters that arise by random chance; different methods and distance measures
can produce different results.
- Desirable Cluster Features: Stability and separation.
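A minimal sketch of agglomerative clustering with SciPy (the five 2-D records are invented; linkage
merges the closest clusters step by step, and fcluster cuts the resulting dendrogram):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 2], [2, 2], [8, 9], [9, 8], [5, 5]], dtype=float)

    Z = linkage(X, method="average")   # agglomerative: merge closest clusters first
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram into 2 clusters
    print(labels)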
---------------Summary-----------------------------------------------------------------------------------------------
- Cluster analysis is exploratory; useful when it produces meaningful clusters.
Association Rules
- Candidates: All possible rules between items are examined in order to select those likely to indicate
true dependence.
- Support: The support of a rule measures the degree to which the data support its validity; it is
determined by the number (or fraction) of transactions that contain both the premise (antecedent) and
the conclusion (consequent) itemsets.
- Frequent Itemsets:
- Definition: Itemsets with support exceeding a selected minimum support threshold.
- Computation: Typically involves considering only combinations that occur with higher
frequency in the database to reduce computational time.
- Apriori Algorithm:
The Apriori algorithm is a classic method for generating frequent itemsets, crucial in association rule
mining. The key idea: it begins by generating frequent itemsets with one item (one-itemsets) and then
recursively generates frequent itemsets with increasing numbers of items until all sizes are covered.
- Generating One-Itemsets: Easily achieved by counting the occurrences of each item in the
database, with the transaction counts representing the supports for the one-itemsets.
- Generating K-Itemsets: Utilizes the frequent (k − 1)-itemsets generated in the previous step.
Each step involves a single pass through the database, making the algorithm efficient even for
databases with many unique items.
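A toy sketch of the first two passes (the transactions are invented; a full implementation would also
prune candidate k-itemsets whose (k - 1)-subsets are not frequent):

    from itertools import combinations
    from collections import Counter

    transactions = [{"milk", "bread"}, {"milk", "eggs"},
                    {"milk", "bread", "eggs"}, {"bread", "eggs"}]
    min_support = 2                     # minimum transaction count

    # Pass 1: count one-itemsets
    ones = Counter(item for t in transactions for item in t)
    frequent1 = {i for i, c in ones.items() if c >= min_support}

    # Pass 2: candidate two-itemsets built from frequent one-itemsets, then counted
    pairs = Counter()
    for t in transactions:
        for pair in combinations(sorted(frequent1 & t), 2):
            pairs[pair] += 1
    frequent2 = {p: c for p, c in pairs.items() if c >= min_support}
    print(frequent1, frequent2)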
FP-Growth Algorithm:
An alternative to the Apriori algorithm, FP-Growth, proposed by Han et al. (2000), offers improved
computational efficiency. Methodology: it constructs an FP-tree structure to compactly represent the
transactions in a transaction set, in two steps: ordering items by support and building the FP-tree.
- FP-Tree Construction: Begins with a null node and adds branches corresponding to items in
each transaction. Common items in transactions share overlapping paths, creating compact tree
structures.
- Advantages over Apriori : Requires only two passes through the transaction database, offers
better memory usage, and is computationally more efficient.
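As a sketch, assuming the third-party mlxtend package is installed (its fpgrowth function expects a
one-hot encoded DataFrame; the transactions are invented):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import fpgrowth

    transactions = [["milk", "bread"], ["milk", "eggs"],
                    ["milk", "bread", "eggs"], ["bread", "eggs"]]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)
    print(fpgrowth(onehot, min_support=0.5, use_colnames=True))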