Data Preprocessing Exercise

1. The document contains details about a student named Opie Sopyan studying Master of Computer Science at Nusa Putra University. 2. It discusses data preprocessing tasks like data cleaning, integration, reduction and transformation. Examples are given for handling missing/dirty data and combining data sources. 3. Correlation analysis is explained as a way to measure the relationship between two variables using Pearson, Spearman or contingency coefficients depending on the variable type. An example analyzes the relationship between patient weight and blood pressure using hospital data and Pearson correlation.

Uploaded by

Opie Sopyan

Name : Opie Sopyan

Student ID : 20200130005
Class : Master of Computer Science, Nusa Putra University
Course : Advanced Database
Lecturer : Dr. DINI OKTARINA DWI HANDAYANI, S.T, M.Sc.

1. Describe your understanding about preprocessing.


Data preprocessing is the work done on raw data before the data itself is used. It is
typically used to control the raw data (sort it, ignore parts of it, or delete parts of it)
so that it is in good condition for the next stage of processing (e.g. data mining or
data analysis).

2. Explain major tasks in data preprocessing. Describe how to handle dirty data and give
example for each task.
a. Data Cleaning
This is a basic but important task for finding and fixing problems in the data: fill in
missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

Example :
I remove every record that is incomplete, or fill in the missing values of a record
with a simple estimate such as the attribute mean.
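
The cleaning example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the column names (`rainfall`, `humidity`) and the `min_filled` threshold are hypothetical and do not come from the exercise.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"region": "Sukabumi", "rainfall": 210.0, "humidity": 80.0},
    {"region": "Bandung", "rainfall": None, "humidity": 75.0},
    {"region": None, "rainfall": None, "humidity": None},
    {"region": "Garut", "rainfall": 190.0, "humidity": None},
]

def drop_mostly_empty(rows, min_filled=2):
    """Remove rows that have fewer than `min_filled` non-missing values."""
    return [r for r in rows if sum(v is not None for v in r.values()) >= min_filled]

def fill_with_mean(rows, column):
    """Replace missing values in a numeric column with the mean of that column."""
    col_mean = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: col_mean if r[column] is None else r[column]} for r in rows]

# First drop the unusable record, then impute the remaining gaps.
cleaned = drop_mostly_empty(records)
for col in ("rainfall", "humidity"):
    cleaned = fill_with_mean(cleaned, col)
```

Dropping comes first so that a nearly empty record does not take part in the imputation step.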

b. Data Integration
Used to combine multiple datasets, which may come not only from one source but from
several sources, such as data cubes and files.

Example :
I want to build a decision tree for the rainy season in West Java, and I have one data
source per region : sukabumi.csv , bandung.csv and garut.csv
Each file contains several attributes, some missing data, and so on; with data integration
I can combine the attributes from every source into one dataset, which makes my data more
complete and accurate.
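
A minimal sketch of this integration step follows. The CSV contents and the column names (`date`, `rainfall_mm`, `humidity`) are assumptions made for illustration; the exercise only names the three files. In-memory strings stand in for the files so the sketch runs on its own.

```python
import csv
import io

# Hypothetical file contents; in practice these would be read from
# sukabumi.csv, bandung.csv and garut.csv with open(...).
sources = {
    "sukabumi": "date,rainfall_mm\n2021-01-01,12\n2021-01-02,30\n",
    "bandung": "date,rainfall_mm,humidity\n2021-01-01,8,81\n",
    "garut": "date,humidity\n2021-01-01,77\n",
}

def integrate(named_csv_texts):
    """Stack rows from all sources into one table with a shared set of columns."""
    rows, columns = [], ["region"]
    for region, text in named_csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["region"] = region  # remember where each row came from
            rows.append(row)
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Attributes that a source does not provide become None (missing).
    return columns, [{c: r.get(c) for c in columns} for r in rows]

columns, combined = integrate(sources)
```

The combined table still contains missing values wherever a source lacks an attribute, so integration is normally followed by another cleaning pass.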

c. Data Reduction
This task reduces the data through dimensionality reduction, numerosity reduction and
data compression.

Example :
After combining multiple datasets we have many attributes and records to process,
including not only the attributes we need but also data we don't need.
In this task we eliminate the data we don't need.
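
One simple form of dimensionality reduction is attribute subset selection, sketched below. The records and attribute names (`station_id`, `operator`, and so on) are hypothetical, and the choice of which attributes to keep is assumed to have been made by the analyst beforehand.

```python
# Hypothetical integrated records with more attributes than the analysis needs.
rows = [
    {"region": "sukabumi", "date": "2021-01-01", "rainfall_mm": 12,
     "station_id": "S1", "operator": "A"},
    {"region": "bandung", "date": "2021-01-01", "rainfall_mm": 8,
     "station_id": "B7", "operator": "B"},
]

def select_attributes(records, keep):
    """Attribute subset selection: keep only the named columns of each record."""
    return [{k: r[k] for k in keep} for r in records]

reduced = select_attributes(rows, keep=["region", "date", "rainfall_mm"])
```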

d. Data transformation and data discretization
In this task the data is converted into a form that a machine can process more easily,
using techniques such as normalization and concept hierarchy generation.

Example :
Regenerating the data structure after eliminating data in the previous task.
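
As a hedged illustration of this task, the sketch below applies min-max normalization and a simple discretization into labelled intervals. The rainfall values, cut points and labels are assumptions chosen for the example, not values from the exercise.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, cut_points, labels):
    """Return the label of the first interval whose upper bound exceeds the value."""
    for bound, label in zip(cut_points, labels):
        if value < bound:
            return label
    return labels[-1]  # at or above the last cut point

rainfall_mm = [0, 5, 20, 50]  # hypothetical values
scaled = min_max_normalize(rainfall_mm)
levels = [discretize(v, cut_points=[1, 20], labels=["dry", "light", "heavy"])
          for v in rainfall_mm]
```

Here `labels` has one more entry than `cut_points`, because two cut points split the number line into three intervals.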
3. What is Correlation Analysis? Describe your understanding and give a real world example.
Correlation analysis is a way to measure whether a relationship exists between two variables,
and how strong it is.
Statistics distinguishes four measurement scales : interval, ratio, ordinal and nominal.
There are three correlation measures, one for each kind of scale :
• Pearson correlation, used for interval and ratio data
• Spearman's rank correlation / Kendall's tau, used for ordinal data
• Contingency coefficient, used for nominal data
Example :

At Sukabumi Hospital we have weight and blood pressure records for 10 patients, and we
want to analyze whether the two are related. The data looks like this :

Name        Weight    Blood Pressure
Imas        45        130
Duloh       44        110
Amir        42        100
Umtiti      57        155
Guardiola   55        130
Acuy        50        130
Alex        45        120
Siska       58        172
Nemandja    61        180
Ratu        60        160

Because both variables are measured on an interval/ratio scale, we use the Pearson correlation.

After processing the data we obtain the correlation coefficient :
rxy = 0.91, which indicates a strong positive relationship between weight and blood pressure
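
The coefficient can be recomputed from the table with the usual sample formula, r = Sxy / sqrt(Sxx · Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means. The sketch below is one way to do that in plain Python; the function name `pearson_r` is chosen here for illustration. Applied to the ten rows exactly as listed above, it gives about 0.92, slightly above the reported 0.91; the small gap may come from rounding in the original calculation.

```python
from math import sqrt

# Values copied from the patient table above.
weight = [45, 44, 42, 57, 55, 50, 45, 58, 61, 60]
blood_pressure = [130, 110, 100, 155, 130, 130, 120, 172, 180, 160]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r_xy = pearson_r(weight, blood_pressure)
print(round(r_xy, 2))
```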

Now we run a hypothesis test. First we formulate the hypotheses :

• H0 : there is no relationship between weight and blood pressure
• H1 : there is a relationship between weight and blood pressure

Next, we determine the level of significance and the t table value

α = 5 % , α/2 = 0.025
df = ( n − 2 ) = ( 10 − 2 ) = 8

Next we compute the test statistic t0

We have :
t0 = 6.21
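
The value of t0 follows from the standard test statistic for a correlation coefficient, t0 = r · sqrt(n − 2) / sqrt(1 − r²). A short check, using the reported r = 0.91 and n = 10 (the function name `t_statistic` is illustrative):

```python
from math import sqrt

def t_statistic(r, n):
    """t statistic for testing H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

t0 = t_statistic(r=0.91, n=10)
print(round(t0, 2))  # 6.21
```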

Next we formulate the test criterion

t0 = 6.21 > t(0.025; 8) = 2.306 ( reject H0 )
In the end, we can conclude that weight and blood pressure are related

4. Suppose two stocks A and B have the following values in one week: (7, 5), (3, 8), (5, 10),
(14, 11), (6, 14). If the stocks are affected by the same industry trends, will their prices rise
or fall together?

From the data above we get this table :

Day    Stock A    Stock B
1      7          5
2      3          8
3      5          10
4      14         11
5      6          14

Because both variables are measured on an interval/ratio scale, we use the Pearson correlation
to check whether there is a relationship between Stock A and Stock B.

The result is : rxy ≈ 0.2, which indicates at most a very weak relationship between Stocks A & B

Next we look for the coefficient (r), N, the t statistic, DF and the p-value.
Using Excel we get this result :

Coefficient (r)    0.195557
N                  5
T statistic        0.345384
DF                 3
P value            0.752605
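
The Excel figures can be cross-checked in plain Python. This is a sketch rather than the method used in the exercise: the function names are illustrative, and the p-value helper relies on the closed-form CDF of Student's t distribution that holds only for exactly 3 degrees of freedom, which happens to be the case here (DF = 5 − 2).

```python
from math import atan, pi, sqrt

# Values copied from the stock table above.
stock_a = [7, 3, 5, 14, 6]
stock_b = [5, 8, 10, 11, 14]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def two_tailed_p_df3(t):
    """Two-tailed p-value for Student's t with exactly 3 degrees of freedom."""
    u = abs(t) / sqrt(3)
    return 1 - (2 / pi) * (u / (1 + u * u) + atan(u))

n = len(stock_a)
r = pearson_r(stock_a, stock_b)
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
p_value = two_tailed_p_df3(t_stat)
print(round(r, 6), round(t_stat, 6), n - 2, round(p_value, 4))
```

For other sample sizes the closed form no longer applies, and a statistics library or a t table would be needed for the p-value.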

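The p-value (0.75) is far above the significance level (α = 0.05, i.e. α/2 = 0.025 per tail),
so we cannot reject H0 : there is no significant relationship between Stocks A and Stocks B.

This means the data give no evidence that the two prices rise or fall together when the
stocks are affected by the same trends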
