R PROGRAMMING FOR DATA SCIENCE
MRS NANCY A
ASSISTANT PROFESSOR
DEPARTMENT OF DATA SCIENCE
MODELING METHODS
Mapping Problems to Machine Learning Models
• Problem Modeling in R refers to the process of using
statistical and machine learning techniques to
identify patterns, make predictions, or gain insights
from data by building mathematical models using the
R programming language.
Types: Classification, Regression, Clustering
• Approach:
⚬ Supervised (labeled data)
⚬ Unsupervised (unlabeled data)
KEY POINTS & KEY CHARACTERISTICS
• Modeling includes regression, classification, clustering, and more.
• R provides built-in functions and packages (like lm(), glm(), caret, randomForest, nnet) to build and
validate models.
• It supports both supervised and unsupervised learning techniques.
• Models are trained on historical data and evaluated for accuracy and reliability.
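As a rough illustration of how the three problem types map onto functions that ship with R, the sketch below fits one model of each kind. The built-in datasets (mtcars, iris) and the chosen predictors are assumptions made for the example, not part of the original slides.

```r
# Regression (supervised): predict a numeric target from labeled data.
reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Classification (supervised): predict a binary label
# (am: 0 = automatic, 1 = manual) with logistic regression.
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Clustering (unsupervised): group unlabeled rows.
# The Species column is left out on purpose.
set.seed(42)  # k-means starts from random centroids
clu_fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

coef(reg_fit)                              # fitted regression coefficients
head(predict(clf_fit, type = "response"))  # predicted probabilities of am = 1
table(clu_fit$cluster)                     # number of rows per cluster
```

The two supervised fits need a target column on the left of the formula, while kmeans() receives only the feature columns.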
Statistical Foundation
• Strong base in statistical computing
• Supports linear, nonlinear, and probabilistic models
Wide Range of Models
• Regression (linear, logistic)
• Classification (Naïve Bayes, Decision Trees)
• Clustering (K-means, Hierarchical)
Extensive Package Ecosystem
• Libraries like caret, e1071, randomForest, nnet, mlr, and
tidymodels
Ease of Visualization
• Plotting tools (base plot(), the ggplot2 package) to analyze models
graphically
Customizability & Flexibility
• Allows tuning of hyperparameters and evaluation methods
• Script-based, so highly adaptable for custom workflows
Evaluation & Validation Tools
• Cross-validation, confusion matrices, ROC/AUC, residual plots
Open-source & Community-driven
• Free to use with a rich online community and resources
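One of the evaluation tools listed above, the confusion matrix, can be sketched in base R with table(). The dataset (mtcars), the single predictor, and the 0.5 cut-off are assumptions chosen for illustration.

```r
# Fit a simple classifier: transmission type (am) from car weight (wt).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Turn predicted probabilities into class labels with a 0.5 cut-off (assumed).
prob   <- predict(fit, type = "response")
pred   <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
actual <- factor(mtcars$am, levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are true labels.
conf_mat <- table(Predicted = pred, Actual = actual)
print(conf_mat)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Because the model is scored on the same rows it was trained on, this accuracy is likely to be optimistic. Scoring on held-out data gives a fairer estimate.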
EVALUATING CLUSTERING MODELS
DEFINITION
Evaluating clustering models refers to the process of
measuring how well the clustering algorithm has
grouped the data into meaningful clusters, despite the
absence of labeled outputs.
Key Points
• Since clustering is unsupervised, evaluation focuses on:
⚬ Intra-cluster similarity (points in the same cluster
should be similar)
⚬ Inter-cluster dissimilarity (clusters should be well
separated)
• Helps in choosing the optimal number of clusters
(e.g., using the Elbow method)
• Uses internal, external, or relative validation metrics
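The Elbow method mentioned above can be sketched as follows: run K-means for several values of K and plot the total within-cluster sum of squares, then look for the point where the curve flattens. The dataset (iris), the range of K, and the nstart and iter.max settings are assumptions for the example.

```r
set.seed(1)                 # k-means uses random starting centroids
x  <- scale(iris[, 1:4])    # standardize so no feature dominates the distance
ks <- 1:8                   # candidate numbers of clusters (assumed range)

# Total within-cluster sum of squares for each K (lower = tighter clusters).
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# The "elbow" is where adding clusters stops reducing wss by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

The within-cluster sum of squares is an internal metric: it uses only the data and the cluster assignments, not any true labels.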
VALIDATING MODELS
Model validation is the process of assessing how well a
trained machine learning model performs on unseen data. It
ensures that the model generalizes well and is not overfitting
or underfitting the training data.
Key Points:
• Ensures model reliability and robustness
• Helps detect overfitting or underfitting
• Involves testing the model on data not used during training
Common Validation Techniques:
• Train/Test Split
• K-Fold Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV)
• Stratified Sampling (for imbalanced datasets)
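K-fold cross-validation can be written by hand in base R, as in the minimal sketch below. The dataset (mtcars), the model formula, and K = 5 are assumptions for illustration. Packages such as caret automate the same loop.

```r
set.seed(123)
k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of (nearly) equal size.
folds <- sample(rep(1:k, length.out = n))

rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]   # k - 1 folds for training
  test  <- mtcars[folds == i, ]   # held-out fold for testing

  fit  <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = test)

  # Root mean squared error on data the model has not seen.
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

rmse        # one error estimate per fold
mean(rmse)  # cross-validated estimate of prediction error
```

Setting k equal to the number of rows turns the same loop into leave-one-out cross-validation.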
CLUSTER ANALYSIS
DEFINITION
Cluster Analysis is an unsupervised machine learning
technique used to group a set of data points into clusters,
such that data points within the same cluster are more
similar to each other than to those in other clusters.
Key Features:
• No predefined labels or categories
• Groups are formed based on similarity or distance measures (e.g., Euclidean
distance)
• Often used for data exploration, pattern discovery, and market segmentation
TYPES OF CLUSTER ANALYSIS
Partitioning Methods
• Example: K-Means, K-Medoids
• Divides data into non-overlapping subsets (clusters)
• Suitable for large datasets
• Requires predefined number of clusters (K)
Hierarchical Clustering
• Types:
⚬ Agglomerative (bottom-up)
⚬ Divisive (top-down)
• No need to specify the number of clusters in advance (the tree can be cut at any level)
• Visualized using a dendrogram
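Agglomerative hierarchical clustering is available in base R through dist() and hclust(). The sketch below assumes the built-in USArrests dataset, Euclidean distance, complete linkage, and a cut into four groups, all chosen only for illustration.

```r
# Standardize the features, then compute pairwise Euclidean distances.
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

# Agglomerative (bottom-up) clustering with complete linkage.
hc <- hclust(d, method = "complete")

# Dendrogram: the height of each merge shows how dissimilar the groups are.
plot(hc, cex = 0.6)

# Cut the tree afterwards to obtain a chosen number of clusters.
groups <- cutree(hc, k = 4)
table(groups)
```

The number of clusters is chosen only at the cutree() step, after the full tree has been built.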
Density-Based Methods
• Example: DBSCAN
• Forms clusters based on areas of high density
• Can detect arbitrary shapes and noise (outliers)
• Doesn’t require specifying number of clusters
Model-Based Clustering
• Example: Gaussian Mixture Models (GMM)
• Assumes data is generated from a mixture of probability
distributions
• Probabilistic and flexible
Grid-Based Clustering
• Divides the data space into a finite number of cells
• Clusters are formed from dense grid cells
• Example: STING (Statistical Information Grid)
K-MEANS ALGORITHMS
• K-Means is a popular unsupervised clustering algorithm that
partitions data into K distinct, non-overlapping clusters based on
similarity.
• It minimizes the within-cluster variance (inertia).
1. Initialize: Randomly choose K centroids (cluster centers)
2. Assign: Assign each data point to the nearest centroid (based on distance)
3. Update: Recalculate centroids as the mean of points in each cluster
4. Repeat: Repeat steps 2–3 until centroids no longer change (convergence)
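The four steps (initialize, assign, update, repeat) can be sketched directly in base R. The function name simple_kmeans, the convergence tolerance, and the iteration cap are hypothetical choices for this example. In practice the built-in kmeans() function would normally be used instead.

```r
# simple_kmeans: illustrative implementation, not a replacement for kmeans().
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)

  # Step 1 (Initialize): pick k distinct rows as the starting centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]

  for (iter in seq_len(max_iter)) {
    # Step 2 (Assign): squared Euclidean distance from every point to every
    # centroid, then choose the nearest centroid for each point.
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    cluster <- max.col(-dists, ties.method = "first")

    # Step 3 (Update): move each centroid to the mean of its assigned points.
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }

    # Step 4 (Repeat): stop once the centroids no longer change.
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }

  list(cluster = cluster, centers = centroids, iterations = iter)
}

set.seed(7)
res <- simple_kmeans(iris[, 1:4], k = 3)
table(res$cluster)
```

Because the starting centroids are random, different seeds can give different clusterings. The built-in kmeans() reduces this risk through its nstart argument.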
NAÏVE BAYES CLASSIFIER
Naïve Bayes is a supervised classification algorithm based on Bayes’
Theorem, assuming that all features are independent of each other
given the class (hence “naïve”).
Bayes’ Theorem
• Bayes’ Theorem is a mathematical formula used to determine the probability of a
hypothesis based on prior knowledge and new evidence
Machine Learning
• Forms the basis for the Naïve Bayes Classifier
• Applied in spam detection, medical diagnosis, and risk modeling
Independence Assumption
• Naïve Bayes assumes that all features are conditionally independent given the class
label.
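Under the independence assumption, the posterior for a class is proportional to its prior times the product of the per-feature likelihoods. The sketch below hand-rolls a Gaussian version in base R to make that explicit. The function names gaussian_nb_fit and gaussian_nb_predict, the normal likelihood, and the iris dataset are all assumptions for illustration. The e1071 package listed earlier offers a ready-made naiveBayes() function.

```r
# Fit: estimate a prior per class and a mean/sd per class and feature.
gaussian_nb_fit <- function(x, y) {
  y <- factor(y)
  list(
    classes = levels(y),
    prior   = as.numeric(table(y)) / length(y),
    mean    = apply(x, 2, function(col) tapply(col, y, mean)),  # classes x features
    sd      = apply(x, 2, function(col) tapply(col, y, sd))
  )
}

# Predict: log P(class) + sum over features of log P(feature | class).
gaussian_nb_predict <- function(model, newx) {
  newx <- as.matrix(newx)
  log_post <- matrix(0, nrow = nrow(newx), ncol = length(model$classes))
  for (k in seq_along(model$classes)) {
    log_post[, k] <- log(model$prior[k])
    for (j in seq_len(ncol(newx))) {
      # Features are treated as independent given the class, so their
      # log-likelihoods simply add up.
      log_post[, k] <- log_post[, k] +
        dnorm(newx[, j], model$mean[k, j], model$sd[k, j], log = TRUE)
    }
  }
  model$classes[max.col(log_post, ties.method = "first")]
}

model <- gaussian_nb_fit(iris[, 1:4], iris$Species)
pred  <- gaussian_nb_predict(model, iris[, 1:4])
mean(pred == iris$Species)  # training accuracy (optimistic)
```

Working with log probabilities avoids numerical underflow when many small likelihoods are multiplied together.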
LINEAR REGRESSION
Linear Regression is a supervised learning algorithm used to model the
relationship between a dependent variable (target) and one or more
independent variables (features) by fitting a straight line (a plane or
hyperplane when there are several features).
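A minimal sketch of simple linear regression with lm() follows. The dataset (mtcars) and the choice of weight as the single feature are assumptions made for the example.

```r
# Model fuel economy (mpg, the target) as a straight-line function of
# car weight (wt, the feature).
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # share of variance in mpg explained by wt

# Predict mpg for a hypothetical car weighing 3 (in units of 1000 lbs).
new_car <- data.frame(wt = 3)
predict(fit, newdata = new_car)

# Scatter plot with the fitted line overlaid.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
```

Adding more features to the right-hand side of the formula (for example mpg ~ wt + hp) gives multiple linear regression with the same function.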
UNSUPERVISED METHODS
Unsupervised learning methods are machine learning techniques that work
with unlabeled data to identify patterns, structures, or groupings without
predefined outcomes or targets.
Key Characteristics:
• No labeled output (unlike supervised learning)
• Learns intrinsic structure of data
• Used for exploration, pattern detection, and dimensionality reduction
Common Unsupervised Methods:
• Clustering
⚬ Groups data into clusters based on similarity
⚬ Algorithms: K-Means, Hierarchical Clustering, DBSCAN