CST322_Module2_Extra

Uploaded by

chinnuedwina

CST322 MODULE 2

Part 2

Data Pre-processing – Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning (or data cleansing) routines attempt to fill in missing values and smooth out noise.

1. Missing Values
• Many tuples have no recorded value for several attributes.
• The following methods are used to handle missing values:
A) Ignore the tuple
B) Fill in the missing value manually
C) Use a global constant to fill in the missing value
D) Use a measure of central tendency for the attribute (e.g., mean or median) to fill in the missing value
E) Use the attribute mean or median for all the samples belonging to the same class as the given tuple
F) Use the most probable value to fill in the missing value
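
A minimal sketch of three of the strategies listed above (global constant, overall mean, and class-wise mean). The `records` data and all function names are hypothetical and chosen only for illustration; `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [
    ("low", 20.0),
    ("low", None),
    ("low", 30.0),
    ("high", 90.0),
    ("high", None),
    ("high", 110.0),
]

def fill_with_constant(rows, constant):
    """Replace every missing value with one global constant."""
    return [(label, constant if value is None else value) for label, value in rows]

def fill_with_overall_mean(rows):
    """Replace every missing value with the mean of all known values."""
    overall = mean(value for _, value in rows if value is not None)
    return [(label, overall if value is None else value) for label, value in rows]

def fill_with_class_mean(rows):
    """Replace a missing value with the mean of the known values in the same class.

    Assumes every class has at least one known value.
    """
    known = {}
    for label, value in rows:
        if value is not None:
            known.setdefault(label, []).append(value)
    class_means = {label: mean(values) for label, values in known.items()}
    return [(label, class_means[label] if value is None else value)
            for label, value in rows]

if __name__ == "__main__":
    print(fill_with_constant(records, -1.0))
    print(fill_with_overall_mean(records))
    print(fill_with_class_mean(records))
```

The class-wise variant usually stays closer to the true value than the overall mean when the classes differ strongly, as they do in this toy data.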

2. Noisy Data
• Noise is a random error or variance in a measured variable
• The following methods are used to handle noisy data:
❖ Binning

❖ Regression

❖ Outlier analysis
• Noisy Data - Binning
• Binning methods smooth a sorted data value by consulting its “neighbourhood,” that is, the values
around it.

• The sorted values are distributed into a number of “buckets,” or bins

• Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median

• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value

• In the example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3

• Sorted data for price (in dollars):

• 4, 8, 15, 21, 21, 24, 25, 28, 34
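
The binning steps can be sketched on the sorted price data above. The function names are hypothetical, and sending a value that is equally close to both boundaries to the lower one is an assumption of this sketch.

```python
from statistics import median

# Sorted price data from the example (in dollars).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

def equal_frequency_bins(sorted_values, size):
    """Split already-sorted values into consecutive bins of `size` items each."""
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]

def smooth_by_medians(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value in a bin with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the min and max
        # Assumption: a tie between the two boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

if __name__ == "__main__":
    bins = equal_frequency_bins(prices, 3)
    print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
    print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```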

• Noisy Data – Regression
• Data smoothing can also be done by regression, a technique that conforms data
values to a function
• Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than
two attributes are involved and the data are fit to a multidimensional surface.
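
A minimal sketch of smoothing by simple linear regression, assuming ordinary least squares as the criterion for the "best" line. The sample data and the names `fit_line` and `smooth` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

if __name__ == "__main__":
    # Hypothetical noisy measurements that roughly follow y = 2x.
    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.0]
    print(fit_line(xs, ys))
    print(smooth(xs, ys))
```

The smoothed values all lie on the fitted line, so the random scatter in the original measurements is removed, at the cost of assuming the relationship really is linear.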

• Noisy Data – Outlier Analysis

• Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters”
• Outliers may be detected as values that fall outside of the cluster sets
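
As a rough illustration, the sketch below groups one-dimensional values with a simple gap-based clustering and reports the members of very small clusters as outliers. This is a stand-in for a real clustering algorithm; the `max_gap` and `min_size` parameters and the sample values are assumptions made for the example.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a new cluster starts when the gap to the previous value exceeds max_gap."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size):
    """Treat values in clusters with fewer than min_size members as outliers."""
    return [v for cluster in cluster_1d(values, max_gap)
            if len(cluster) < min_size
            for v in cluster]

if __name__ == "__main__":
    # Hypothetical values: two dense groups and one isolated value.
    data = [10, 11, 12, 50, 51, 52, 200]
    print(cluster_1d(data, max_gap=5))
    print(find_outliers(data, max_gap=5, min_size=2))
```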

Data Reduction – Dimensionality Reduction
• Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the
original data
• Working with the reduced data set should be more efficient yet produce the same
(or almost the same) analytical results

• Data reduction strategies include

1. Dimensionality reduction
2. Numerosity reduction
3. Data compression

Dimensionality reduction

• It is the process of reducing the number of random variables or attributes under consideration. Methods include:
1. Wavelet Transforms

2. Principal Components Analysis

3. Attribute Subset Selection

Wavelet Transforms

Principal Components Analysis

Attribute Subset Selection

Sampling

Data Transformation

Data Transformation by Normalization

Min-Max Normalization
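
A minimal sketch, assuming the usual textbook definition in which a value v of an attribute with observed minimum min and maximum max is mapped linearly into a new range: v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The function name and the default target range [0, 1] are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]

if __name__ == "__main__":
    print(min_max_normalize([10, 20, 30]))
    print(min_max_normalize([10, 20, 30], new_min=-1.0, new_max=1.0))
```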

Z-Score Normalization
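
A minimal sketch, assuming the usual definition v' = (v - mean) / std, where mean and std are the mean and standard deviation of the attribute. Using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption of this sketch, as is the function name.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Centre values on their mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]

if __name__ == "__main__":
    # Hypothetical values with mean 5 and population standard deviation 2.
    print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```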

Decimal Scaling Normalization
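
A minimal sketch, assuming the usual definition v' = v / 10^j, where j is the smallest integer such that the largest absolute normalized value is below 1. The function name and sample values are chosen for illustration.

```python
def decimal_scale(values):
    """Divide all values by the smallest power of ten that brings every |value| below 1.

    Returns the scaled values together with the exponent j that was used.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

if __name__ == "__main__":
    # With values ranging from -986 to 917, the largest magnitude is 986, so j = 3.
    scaled, j = decimal_scale([-986, 917])
    print(j, scaled)
```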
