
Unit 2 R Programming

The document outlines R programming for data science, focusing on modeling methods such as classification, regression, and clustering. It highlights the use of R's built-in functions and packages for building and validating models, as well as the importance of evaluating and validating clustering models. Additionally, it discusses various techniques, including K-Means clustering and Naïve Bayes classification, emphasizing the flexibility and community-driven nature of R.

R PROGRAMMING FOR DATA SCIENCE

MRS NANCY A
ASSISTANT PROFESSOR
DEPARTMENT OF DATA SCIENCE

MODELING METHODS
Mapping Problems to Machine Learning Models

• Problem modeling in R refers to the process of using statistical and machine learning techniques to identify patterns, make predictions, or gain insights from data by building mathematical models in the R programming language.
• Problem types: Classification, Regression, Clustering
• Approach:
⚬ Supervised (labeled data)
⚬ Unsupervised (unlabeled data)
KEY POINTS & KEY CHARACTERISTICS
• Modeling includes regression, classification, clustering, and more.
• R provides built-in functions and packages (such as lm(), glm(), caret, randomForest, nnet) to build and validate models.
• It supports both supervised and unsupervised learning techniques.
• Models are trained on historical data and evaluated for accuracy and reliability.
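As a minimal sketch of this build-and-evaluate workflow, using only base R's glm() and the bundled mtcars dataset:

```r
# Build: fit a logistic regression with the built-in glm() function,
# predicting transmission type (am: 0 = automatic, 1 = manual)
# from car weight (wt) in the bundled mtcars dataset.
model <- glm(am ~ wt, data = mtcars, family = binomial)

# Evaluate: compare predictions against the known labels.
# A predicted probability > 0.5 is treated as class 1 (manual).
pred <- as.numeric(predict(model, type = "response") > 0.5)
accuracy <- mean(pred == mtcars$am)
print(accuracy)
```

Here the model is scored on its own training data for brevity; the validation techniques discussed later in this unit use held-out data instead.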

Statistical Foundation
• Strong base in statistical computing
• Supports linear, nonlinear, and probabilistic models

Wide Range of Models
• Regression (linear, logistic)
• Classification (Naïve Bayes, Decision Trees)
• Clustering (K-means, Hierarchical)
KEY POINTS & KEY CHARACTERISTICS
Extensive Package Ecosystem
• Libraries like caret, e1071, randomForest, nnet, mlr, and tidymodels

Ease of Visualization
• Built-in plotting functions (plot(), ggplot2) to analyze models graphically

Customizability & Flexibility
• Allows tuning of hyperparameters and evaluation methods
• Script-based, so highly adaptable for custom workflows

Evaluation & Validation Tools
• Cross-validation, confusion matrices, ROC/AUC, residual plots

Open-source & Community-driven
• Free to use, with a rich online community and resources
EVALUATING CLUSTERING MODELS
Definition
Evaluating clustering models refers to the process of measuring how well the clustering algorithm has grouped the data into meaningful clusters, despite the absence of labeled outputs.

Key Points
• Since clustering is unsupervised, evaluation focuses on:
⚬ Intra-cluster similarity (points in the same cluster should be similar)
⚬ Inter-cluster dissimilarity (clusters should be well separated)
• Helps in choosing the optimal number of clusters (e.g., using the Elbow method)
• Uses internal, external, or relative validation metrics
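The Elbow method mentioned above can be sketched in base R (no extra packages assumed) by plotting the total within-cluster sum of squares against K, using the built-in iris data as an example:

```r
# Elbow method: compute total within-cluster sum of squares (WSS)
# for K = 1..6 on the iris measurements, using base R's kmeans().
data <- iris[, 1:4]
set.seed(42)
wss <- sapply(1:6, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)

# WSS always shrinks as K grows; the "elbow" is where the drop
# levels off, suggesting a good number of clusters.
plot(1:6, wss, type = "b", xlab = "Number of clusters K", ylab = "Total WSS")
print(wss)
```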
VALIDATING MODELS
Model validation is the process of assessing how well a trained machine learning model performs on unseen data. It ensures that the model generalizes well and is not overfitting or underfitting the training data.

Key Points:
• Ensures model reliability and robustness
• Helps detect overfitting or underfitting
• Involves testing the model on data not used during training

Common Validation Techniques:
• Train/Test Split
• K-Fold Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV)
• Stratified Sampling (for imbalanced datasets)
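The first of these techniques, a train/test split, can be sketched in base R (using the bundled mtcars dataset as example data):

```r
# Hold out ~30% of the rows as a test set the model never sees.
set.seed(123)
n <- nrow(mtcars)
test_idx <- sample(n, size = round(0.3 * n))
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Train on the training rows only, then measure error on the unseen rows.
model <- lm(mpg ~ wt + hp, data = train)
rmse <- sqrt(mean((predict(model, test) - test$mpg)^2))
print(rmse)  # root-mean-squared error on held-out data
```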
CLUSTER ANALYSIS
Definition
Cluster Analysis is an unsupervised machine learning technique used to group a set of data points into clusters, such that data points within the same cluster are more similar to each other than to those in other clusters.

Key Features:
• No predefined labels or categories
• Groups are formed based on similarity or distance measures (e.g., Euclidean distance)
• Often used for data exploration, pattern discovery, and market segmentation
TYPES OF CLUSTER ANALYSIS
Partitioning Methods
• Example: K-Means, K-Medoids
• Divides data into non-overlapping subsets (clusters)
• Suitable for large datasets
• Requires a predefined number of clusters (K)

Hierarchical Clustering
• Types:
⚬ Agglomerative (bottom-up)
⚬ Divisive (top-down)
• No need to specify the number of clusters
• Visualized using a dendrogram

Density-Based Methods
• Example: DBSCAN
• Forms clusters based on areas of high density
• Can detect arbitrary shapes and noise (outliers)
• Doesn’t require specifying the number of clusters
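Agglomerative hierarchical clustering and its dendrogram can be sketched with base R's hclust(), using the built-in iris data as an example:

```r
# Bottom-up (agglomerative) clustering: start from a Euclidean
# distance matrix, then merge the closest clusters step by step.
d  <- dist(iris[, 1:4])               # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # complete-linkage merging

plot(hc, labels = FALSE)       # dendrogram of the merge history
clusters <- cutree(hc, k = 3)  # the number of clusters is chosen afterwards
print(table(clusters))
```

Note that, unlike K-Means, the number of clusters is decided after the fact by cutting the dendrogram at the desired level.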
TYPES OF CLUSTER ANALYSIS
Model-Based Clustering
• Example: Gaussian Mixture Models (GMM)
• Assumes data is generated from a mixture of probability
distributions
• Probabilistic and flexible

Grid-Based Clustering
• Divides the data space into a finite number of cells
• Clusters are formed from dense grid cells
• Example: STING (Statistical Information Grid)
K-MEANS ALGORITHM
• K-Means is a popular unsupervised clustering algorithm that partitions data into K distinct, non-overlapping clusters based on similarity.
• It minimizes the within-cluster variance (inertia).

Steps:
1. Initialize: randomly choose K centroids (cluster centers)
2. Assign: assign each data point to the nearest centroid (based on distance)
3. Update: recalculate centroids as the mean of points in each cluster
4. Repeat: repeat steps 2–3 until centroids no longer change (convergence)
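The initialize/assign/update/repeat loop above is what base R's kmeans() runs internally; a minimal sketch on the iris measurements:

```r
# K-Means with base R's kmeans(); the assign/update loop runs
# until the centroids stop moving (convergence).
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

print(km$centers)       # final centroids, one row per cluster
print(km$size)          # number of points assigned to each cluster
print(km$tot.withinss)  # the within-cluster variance being minimized
```

The nstart argument reruns the random initialization several times and keeps the best result, since K-Means can converge to a poor local optimum from a bad starting point.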
NAÏVE BAYES CLASSIFIER
Naïve Bayes is a supervised classification algorithm based on Bayes’ Theorem, assuming that all features are independent of each other (hence “naïve”).

Bayes’ Theorem
• Bayes’ Theorem is a mathematical formula used to determine the probability of a hypothesis based on prior knowledge and new evidence.

Machine Learning
• Forms the basis for the Naïve Bayes Classifier
• Applied in spam detection, medical diagnosis, and risk modeling

Independence Assumption
• Naïve Bayes assumes that all features are conditionally independent given the class label.
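Bayes’ Theorem itself, P(A|B) = P(B|A) · P(A) / P(B), can be worked through in base R with made-up spam-filter numbers (illustrative only):

```r
# Hypothetical numbers for a toy spam filter (illustrative only):
p_spam            <- 0.2   # prior: P(spam)
p_word_given_spam <- 0.6   # likelihood: P(word | spam)
p_word_given_ham  <- 0.05  # likelihood: P(word | not spam)

# Law of total probability: P(word) across both classes
p_word <- p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: posterior P(spam | word)
p_spam_given_word <- p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.12 / 0.16 = 0.75
```

The Naïve Bayes classifier (e.g., naiveBayes() in the e1071 package listed earlier) applies this same calculation with one likelihood per feature, multiplying them together under the independence assumption.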
LINEAR REGRESSION
Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a straight line.
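A minimal sketch with base R's lm() and the bundled mtcars data, modeling mpg as a straight-line function of car weight:

```r
# Fit the straight line: mpg = intercept + slope * wt.
fit <- lm(mpg ~ wt, data = mtcars)

print(coef(fit))               # intercept and slope of the fitted line
print(summary(fit)$r.squared)  # fraction of variance the line explains

# Predict mpg for a car weighing 3,000 lbs (wt is in units of 1,000 lbs)
print(predict(fit, newdata = data.frame(wt = 3.0)))
```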
UNSUPERVISED METHODS
Unsupervised learning methods are machine learning techniques that work with unlabeled data to identify patterns, structures, or groupings without predefined outcomes or targets.

Key Characteristics:
• No labeled output (unlike supervised learning)
• Learns the intrinsic structure of the data
• Used for exploration, pattern detection, and dimensionality reduction

Common Unsupervised Methods:
• Clustering: groups data into clusters based on similarity
• Algorithms:
⚬ K-Means
⚬ Hierarchical Clustering
⚬ DBSCAN
THANK YOU
