
Unit 2 R Programming

The document outlines R programming for data science, focusing on modeling methods such as classification, regression, and clustering. It highlights the use of R's built-in functions and packages for building and validating models, as well as the importance of evaluating and validating clustering models. Additionally, it discusses various techniques, including K-Means clustering and Naïve Bayes classification, emphasizing the flexibility and community-driven nature of R.

R PROGRAMMING FOR DATA SCIENCE

MRS NANCY A
ASSISTANT PROFESSOR
DEPARTMENT OF DATA SCIENCE

MODELING METHODS
Mapping Problems to Machine Learning Models

• Problem modeling in R refers to the process of using statistical and machine learning techniques to identify patterns, make predictions, or gain insights from data by building mathematical models in the R programming language.
• Problem types: Classification, Regression, Clustering
• Approach:
⚬ Supervised (labeled data)
⚬ Unsupervised (unlabeled data)
KEY POINTS & KEY CHARACTERISTICS
• Modeling includes regression, classification, clustering, and more.
• R provides built-in functions and packages (such as lm(), glm(), caret, randomForest, nnet) to build and validate models.
• It supports both supervised and unsupervised learning techniques.
• Models are trained on historical data and evaluated for accuracy and reliability.
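As a minimal sketch of this build-and-evaluate workflow, using only base R's glm() and the bundled mtcars dataset:

```r
# Build: fit a logistic regression with the built-in glm() function,
# predicting transmission type (am: 0 = automatic, 1 = manual)
# from car weight (wt) in the bundled mtcars dataset.
model <- glm(am ~ wt, data = mtcars, family = binomial)

# Evaluate: compare predictions against the known labels.
# A predicted probability > 0.5 is treated as class 1 (manual).
pred <- as.numeric(predict(model, type = "response") > 0.5)
accuracy <- mean(pred == mtcars$am)
print(accuracy)
```

Here the model is scored on its own training data for brevity; the validation techniques discussed later in this unit use held-out data instead.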

Statistical Foundation
• Strong base in statistical computing
• Supports linear, nonlinear, and probabilistic models

Wide Range of Models
• Regression (linear, logistic)
• Classification (Naïve Bayes, Decision Trees)
• Clustering (K-means, Hierarchical)
KEY POINTS & KEY CHARACTERISTICS
Extensive Package Ecosystem
• Libraries like caret, e1071, randomForest, nnet, mlr, and tidymodels

Ease of Visualization
• Built-in plotting functions (plot(), ggplot2) to analyze models graphically

Customizability & Flexibility
• Allows tuning of hyperparameters and evaluation methods
• Script-based, so highly adaptable for custom workflows

Evaluation & Validation Tools
• Cross-validation, confusion matrices, ROC/AUC, residual plots

Open-source & Community-driven
• Free to use, with a rich online community and resources
EVALUATING CLUSTERING MODELS
Definition
Evaluating clustering models refers to the process of measuring how well the clustering algorithm has grouped the data into meaningful clusters, despite the absence of labeled outputs.

Key Points
• Since clustering is unsupervised, evaluation focuses on:
⚬ Intra-cluster similarity (points in the same cluster should be similar)
⚬ Inter-cluster dissimilarity (clusters should be well separated)
• Helps in choosing the optimal number of clusters (e.g., using the Elbow method)
• Uses internal, external, or relative validation metrics
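The Elbow method mentioned above can be sketched in base R (no extra packages assumed) by plotting the total within-cluster sum of squares against K, using the built-in iris data as an example:

```r
# Elbow method: compute total within-cluster sum of squares (WSS)
# for K = 1..6 on the iris measurements, using base R's kmeans().
data <- iris[, 1:4]
set.seed(42)
wss <- sapply(1:6, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)

# WSS always shrinks as K grows; the "elbow" is where the drop
# levels off, suggesting a good number of clusters.
plot(1:6, wss, type = "b", xlab = "Number of clusters K", ylab = "Total WSS")
print(wss)
```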
VALIDATING MODELS
Model validation is the process of assessing how well a trained machine learning model performs on unseen data. It ensures that the model generalizes well and is not overfitting or underfitting the training data.

Key Points:
• Ensures model reliability and robustness
• Helps detect overfitting or underfitting
• Involves testing the model on data not used during training

Common Validation Techniques:
• Train/Test Split
• K-Fold Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV)
• Stratified Sampling (for imbalanced datasets)
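The first of these techniques, a train/test split, can be sketched in base R (using the bundled mtcars dataset as example data):

```r
# Hold out ~30% of the rows as a test set the model never sees.
set.seed(123)
n <- nrow(mtcars)
test_idx <- sample(n, size = round(0.3 * n))
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Train on the training rows only, then measure error on the unseen rows.
model <- lm(mpg ~ wt + hp, data = train)
rmse <- sqrt(mean((predict(model, test) - test$mpg)^2))
print(rmse)  # root-mean-squared error on held-out data
```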
CLUSTER ANALYSIS
Definition
Cluster Analysis is an unsupervised machine learning technique used to group a set of data points into clusters, such that data points within the same cluster are more similar to each other than to those in other clusters.

Key Features:
• No predefined labels or categories
• Groups are formed based on similarity or distance measures (e.g., Euclidean distance)
• Often used for data exploration, pattern discovery, and market segmentation
TYPES OF CLUSTER ANALYSIS
Partitioning Methods
• Example: K-Means, K-Medoids
• Divides data into non-overlapping subsets (clusters)
• Suitable for large datasets
• Requires a predefined number of clusters (K)

Hierarchical Clustering
• Types:
⚬ Agglomerative (bottom-up)
⚬ Divisive (top-down)
• No need to specify the number of clusters
• Visualized using a dendrogram

Density-Based Methods
• Example: DBSCAN
• Forms clusters based on areas of high density
• Can detect arbitrary shapes and noise (outliers)
• Doesn’t require specifying the number of clusters
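Agglomerative hierarchical clustering and its dendrogram can be sketched with base R's hclust(), using the built-in iris data as an example:

```r
# Bottom-up (agglomerative) clustering: start from a Euclidean
# distance matrix, then merge the closest clusters step by step.
d  <- dist(iris[, 1:4])               # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # complete-linkage merging

plot(hc, labels = FALSE)       # dendrogram of the merge history
clusters <- cutree(hc, k = 3)  # the number of clusters is chosen afterwards
print(table(clusters))
```

Note that, unlike K-Means, the number of clusters is decided after the fact by cutting the dendrogram at the desired level.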
TYPES OF CLUSTER ANALYSIS
Model-Based Clustering
• Example: Gaussian Mixture Models (GMM)
• Assumes data is generated from a mixture of probability
distributions
• Probabilistic and flexible

Grid-Based Clustering
• Divides the data space into a finite number of cells
• Clusters are formed from dense grid cells
• Example: STING (Statistical Information Grid)
K-MEANS ALGORITHM
• K-Means is a popular unsupervised clustering algorithm that partitions data into K distinct, non-overlapping clusters based on similarity.
• It minimizes the within-cluster variance (inertia).

Steps:
1. Initialize: randomly choose K centroids (cluster centers)
2. Assign: assign each data point to the nearest centroid (based on distance)
3. Update: recalculate centroids as the mean of points in each cluster
4. Repeat: repeat steps 2–3 until centroids no longer change (convergence)
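The initialize/assign/update/repeat loop above is what base R's kmeans() runs internally; a minimal sketch on the iris measurements:

```r
# K-Means with base R's kmeans(); the assign/update loop runs
# until the centroids stop moving (convergence).
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

print(km$centers)       # final centroids, one row per cluster
print(km$size)          # number of points assigned to each cluster
print(km$tot.withinss)  # the within-cluster variance being minimized
```

The nstart argument reruns the random initialization several times and keeps the best result, since K-Means can converge to a poor local optimum from a bad starting point.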
NAÏVE BAYES CLASSIFIER
Naïve Bayes is a supervised classification algorithm based on Bayes’ Theorem, assuming that all features are independent of each other (hence “naïve”).

Bayes’ Theorem
• Bayes’ Theorem is a mathematical formula used to determine the probability of a hypothesis based on prior knowledge and new evidence.

Machine Learning
• Forms the basis for the Naïve Bayes Classifier
• Applied in spam detection, medical diagnosis, and risk modeling

Independence Assumption
• Naïve Bayes assumes that all features are conditionally independent given the class label.
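Bayes’ Theorem itself, P(A|B) = P(B|A) · P(A) / P(B), can be worked through in base R with made-up spam-filter numbers (illustrative only):

```r
# Hypothetical numbers for a toy spam filter (illustrative only):
p_spam            <- 0.2   # prior: P(spam)
p_word_given_spam <- 0.6   # likelihood: P(word | spam)
p_word_given_ham  <- 0.05  # likelihood: P(word | not spam)

# Law of total probability: P(word) across both classes
p_word <- p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: posterior P(spam | word)
p_spam_given_word <- p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.12 / 0.16 = 0.75
```

The Naïve Bayes classifier (e.g., naiveBayes() in the e1071 package listed earlier) applies this same calculation with one likelihood per feature, multiplying them together under the independence assumption.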
LINEAR REGRESSION
Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a straight line.
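A minimal sketch with base R's lm() and the bundled mtcars data, modeling mpg as a straight-line function of car weight:

```r
# Fit the straight line: mpg = intercept + slope * wt.
fit <- lm(mpg ~ wt, data = mtcars)

print(coef(fit))               # intercept and slope of the fitted line
print(summary(fit)$r.squared)  # fraction of variance the line explains

# Predict mpg for a car weighing 3,000 lbs (wt is in units of 1,000 lbs)
print(predict(fit, newdata = data.frame(wt = 3.0)))
```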
UNSUPERVISED METHODS
Unsupervised learning methods are machine learning techniques that work with unlabeled data to identify patterns, structures, or groupings without predefined outcomes or targets.

Key Characteristics:
• No labeled output (unlike supervised learning)
• Learns the intrinsic structure of the data
• Used for exploration, pattern detection, and dimensionality reduction

Common Unsupervised Methods:
• Clustering: groups data into clusters based on similarity
• Algorithms:
⚬ K-Means
⚬ Hierarchical Clustering
⚬ DBSCAN
THANK YOU
