Data Preprocessing Exercise
Student ID : 20200130005
Class : Master of Computer Science, Nusa Putra University
Course : Advanced Database
Lecturer : Dr. DINI OKTARINA DWI HANDAYANI, S.T, M.Sc.
2. Explain the major tasks in data preprocessing. Describe how to handle dirty data and give
an example for each task.
a. Data Cleaning
This task is basic and important for controlling the quality of the data. It fills in missing
values, smooths noisy data, identifies or removes outliers, and resolves inconsistencies.
Example :
I want to remove every record with an incomplete status, and fill in the missing values of the
remaining records with a method such as the mean.
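The cleaning example can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, the "status" flag, and the "rainfall_mm" field are hypothetical and do not come from a real dataset.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "status": "complete", "rainfall_mm": 120.0},
    {"id": 2, "status": "incomplete", "rainfall_mm": None},
    {"id": 3, "status": "complete", "rainfall_mm": None},
    {"id": 4, "status": "complete", "rainfall_mm": 80.0},
]

def clean(rows, field):
    """Drop rows flagged incomplete, then fill missing `field` values with the mean."""
    kept = [dict(r) for r in rows if r["status"] != "incomplete"]  # copies, input untouched
    known = [r[field] for r in kept if r[field] is not None]
    fill = mean(known)  # raises StatisticsError if no value is known
    for r in kept:
        if r[field] is None:
            r[field] = fill
    return kept

cleaned = clean(records, "rainfall_mm")
```

Record 2 is dropped because of its status, and record 3 receives the mean of the two known values. The mean is only one possible fill strategy; the median or a model-based estimate may suit skewed data better.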
b. Data Integration
This task combines multiple datasets, not only from one source but also from
several sources, such as data cubes and files.
Example :
I want to build a decision tree for the rainy season in West Java, and I have a data source for
each region : sukabumi.csv , bandung.csv and garut.csv
Each file contains several attributes, some missing data, and so on. With data integration I can
combine the attributes from every data source so that my data becomes more complete and accurate.
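A minimal sketch of this integration step, assuming each regional file is a CSV with a header row. The file contents and the column names ("date", "rainfall_mm", "humidity") are hypothetical placeholders for the real files.

```python
import csv
import io

# Hypothetical contents standing in for sukabumi.csv, bandung.csv and garut.csv.
SOURCES = {
    "sukabumi": "date,rainfall_mm,humidity\n2021-01-01,12.5,80\n",
    "bandung": "date,rainfall_mm\n2021-01-01,9.0\n",
    "garut": "date,humidity,rainfall_mm\n2021-01-01,75,15.0\n",
}

def integrate(sources):
    """Combine rows from every source into one list with a shared set of columns."""
    rows, columns = [], ["region"]
    for region, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["region"] = region  # remember where each row came from
            rows.append(row)
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Attributes a source lacks become None so they can be handled during cleaning.
    return [{c: r.get(c) for c in columns} for r in rows]

combined = integrate(SOURCES)
```

With real files, `io.StringIO(text)` would be replaced by an opened file handle. Values stay as strings here; converting them to numbers belongs to a later cleaning pass.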
c. Data Reduction
This task reduces dimensionality and numerosity, and compresses the data.
Example :
After we combine multiple datasets, we have many attributes and records. Some of them are
needed for processing, but others are not.
In this task we eliminate the data we do not need.
Example :
Regenerate the data structure after eliminating the unneeded data in the previous step.
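The two kinds of reduction can be illustrated as follows. This is a sketch under assumptions: the rows and the "station_code" attribute are hypothetical, and averaging per region is just one simple form of numerosity reduction.

```python
from statistics import mean

# Hypothetical integrated rows; "station_code" stands in for an attribute we do not need.
rows = [
    {"region": "sukabumi", "rainfall_mm": 12.0, "station_code": "S1"},
    {"region": "sukabumi", "rainfall_mm": 18.0, "station_code": "S2"},
    {"region": "garut", "rainfall_mm": 15.0, "station_code": "G1"},
]

def select_attributes(data, keep):
    """Dimensionality reduction: keep only the attributes named in `keep`."""
    return [{k: r[k] for k in keep} for r in data]

def aggregate_by(data, key, value):
    """Numerosity reduction: replace many rows with one mean per group."""
    groups = {}
    for r in data:
        groups.setdefault(r[key], []).append(r[value])
    return {g: mean(vals) for g, vals in groups.items()}

reduced = select_attributes(rows, ["region", "rainfall_mm"])
summary = aggregate_by(reduced, "region", "rainfall_mm")
```

`reduced` has fewer attributes per row, while `summary` has fewer rows overall.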
3. What is Correlation Analysis? Describe your understanding and give a real world example.
Correlation analysis is a way to measure whether a relation exists between two variables.
Statistics has four measurement scales : Interval, Ratio, Ordinal and
Nominal.
There are three correlation measures, each suited to particular scales :
Pearson correlation is used for Interval and Ratio data
Spearman rank / Kendall's tau is used for Ordinal data
Contingency coefficient is used for Nominal data
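To illustrate how a rank-based measure differs from Pearson, here is a small sketch of the Spearman rank correlation, computed as the Pearson correlation of the ranks. The ordinal scores are hypothetical, and the helper names are chosen only for this illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based position of the tied run
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical ordinal scores (1 = low, 5 = high).
pain_level = [1, 2, 3, 4, 5]
discomfort = [2, 1, 4, 3, 5]
rho = spearman(pain_level, discomfort)
```

Because only the ordering of the values matters, the measure is suitable for ordinal data where the distances between categories are not meaningful.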
Example :
At Sukabumi Hospital we have weight and blood pressure data for 10 patients, and we will
analyze whether there is a relation between the two. Here is what the data looks like :
α = 5% , α/2 = 0.025
df (degrees of freedom) = ( n − 2 ) = ( 10 − 2 ) = 8
4. Suppose two stocks A and B have the following values in one week: (7, 5), (3, 8), (5, 10),
(14, 11), (6, 14). If the stocks are affected by the same industry trends, will their prices rise
or fall together?
Because the data are on a numeric (interval/ratio) scale, we use the Pearson correlation to check
whether there is a relation between stock A and stock B.
The result is : r_xy ≈ 0.196, a weak positive value, which indicates that stocks A and B have no strong linear relation.
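The coefficient can be checked directly from the five pairs in the question. The sketch below uses the standard Pearson formula; the function and variable names are chosen only for this illustration. Recomputing this way gives a value of roughly 0.2, which is still a weak positive correlation.

```python
from math import sqrt

# The five (A, B) pairs given in the question.
pairs = [(7, 5), (3, 8), (5, 10), (14, 11), (6, 14)]
a = [p[0] for p in pairs]
b = [p[1] for p in pairs]

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

r_xy = pearson_r(a, b)
print(round(r_xy, 3))  # prints 0.196
```

A value this close to zero means the linear association between the two price series is weak.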
Next we look for the coefficient (r), N, t statistic, DF, and p-value.
Using Excel, we obtain the result.
With α = 0.025 we can conclude that there is no significant relation between stock A and
stock B.
This means the prices will not necessarily rise or fall together when the stocks are affected by industry trends.
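The significance check can be reproduced without Excel. This is a sketch under assumptions: it uses the usual t statistic for a correlation coefficient, t = r·sqrt(n − 2) / sqrt(1 − r²), and the critical value 3.182 is the standard t-table entry for 3 degrees of freedom with 0.025 in each tail, taken here as an assumption about how the test was set up.

```python
from math import sqrt

# The five (A, B) pairs given in the question.
pairs = [(7, 5), (3, 8), (5, 10), (14, 11), (6, 14)]
n = len(pairs)
mean_a = sum(p[0] for p in pairs) / n
mean_b = sum(p[1] for p in pairs) / n
sab = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
saa = sum((x - mean_a) ** 2 for x, _ in pairs)
sbb = sum((y - mean_b) ** 2 for _, y in pairs)
r = sab / sqrt(saa * sbb)  # Pearson correlation coefficient

df = n - 2                               # degrees of freedom
t = r * sqrt(df) / sqrt(1 - r * r)       # t statistic for testing r = 0

T_CRITICAL = 3.182  # assumed table value: df = 3, 0.025 in each tail
significant = abs(t) > T_CRITICAL

print(f"r = {r:.3f}, df = {df}, t = {t:.3f}, significant = {significant}")
```

The t statistic is far below the critical value, so the hypothesis of no correlation cannot be rejected. With only five observations the test has little power, so this result shows a lack of evidence for a relation rather than proof that none exists.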