4.1 - Data Preprocessing
Feature Selection
• This reduces the data by
removing irrelevant or
redundant features.
• The aim is to find a minimal set
of attributes that still serves
the purpose of the analysis.
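A minimal sketch of this idea, assuming a small table held as a dict of columns. The helper name `select_features` and the sample data are hypothetical. It only drops constant columns (irrelevant) and exact copies of an earlier column (redundant); practical feature selection methods use richer criteria.

```python
def select_features(columns):
    """Keep columns that vary and are not exact copies of an earlier column.

    `columns` maps a feature name to its list of values (assumed layout).
    """
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant column: carries no information
        if any(values == other for other in kept.values()):
            continue  # exact duplicate of a column we already kept
        kept[name] = values
    return kept


# Illustrative data only, not from a real dataset.
data = {
    "height_cm": [170, 182, 165, 174],
    "height_copy": [170, 182, 165, 174],  # redundant
    "country": ["PH", "PH", "PH", "PH"],  # constant
    "weight_kg": [65, 80, 58, 72],
}
selected = select_features(data)
print(list(selected))  # ['height_cm', 'weight_kg']
```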
Types of Data Preprocessing – Data Reduction
Instance Selection
• This chooses a subset of the
total available data that can still
achieve the original purpose of
the data mining task.
• It works in a similar manner to
statistical sampling methods.
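Since instance selection is described as similar to statistical sampling, simple random sampling can serve as a rough stand-in. The sketch below assumes the records fit in a Python list; `sample_instances`, the 10% fraction, and the seed are illustrative choices, not a dedicated instance selection algorithm.

```python
import random


def sample_instances(rows, fraction, seed=0):
    """Return a simple random sample holding roughly `fraction` of the rows."""
    k = max(1, round(len(rows) * fraction))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, k)  # sampling without replacement


rows = list(range(100))  # stand-in for 100 records
subset = sample_instances(rows, fraction=0.1)
print(len(subset))  # 10
```

Sampling without replacement keeps every selected record distinct, so the subset is a true subset of the original data.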
Discretization
• This transforms quantitative
(numerical) data into qualitative
(nominal) data.
• Each interval is then associated
with a discrete numerical
value.
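One common way to do this is equal-width binning, sketched below. The function name `equal_width_bins`, the three bins, and the labels are assumptions made for illustration; other schemes (for example equal-frequency bins) are equally valid.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to the index (0..n_bins-1) of its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # min(...) keeps the maximum value inside the last interval
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 35, 47, 51, 64, 78]  # illustrative values
labels = ["young", "middle", "senior"]  # hypothetical nominal labels
codes = equal_width_bins(ages, n_bins=3)
print(codes)  # [0, 0, 0, 1, 1, 2, 2]
print([labels[c] for c in codes])
```

Here each interval spans 20 years, and the integer code is the discrete value associated with that interval.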
Types of Data Preprocessing
• To summarize how these data preprocessing tools work:
• Data Cleaning – How do I clean up the data?
• Data Integration – How do I incorporate and adjust data?
• Data Transformation – How do I provide accurate data?
• Data Normalization – How do I unify and scale data?
• Data Reduction – How do I select the best features of my data?
Steps in Data Preprocessing
• These are the general steps to consider when doing data preprocessing:
• Assess your Data Quality
• Clean your Data
• Transform your Data
• Reduce your Data
• Further Process your Data
Assess your Data Quality
• Start by looking at your data to get an idea of its overall quality.
• This is where you look at your data collection results and determine
what issues your data may have.
• Once you have identified issues, you then need to determine which
data preprocessing techniques to use.
• These are common issues you might need to look at in your data:
• Mismatched Data Types
• Mixed Data Values
• Outliers
• Missing Data
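A quick check for three of these issues on a single numeric column might look like the sketch below. It assumes missing entries are stored as `None` and uses the 1.5 × IQR rule for outliers, which is one common convention rather than the only option. The helper name and sample values are hypothetical.

```python
import statistics


def assess_numeric_column(values):
    """Count missing entries, non-numeric entries, and IQR-based outliers."""
    missing = sum(v is None for v in values)
    numeric = [
        v for v in values
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    mismatched = len(values) - missing - len(numeric)
    q1, _, q3 = statistics.quantiles(numeric, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in numeric if v < low or v > high]
    return {
        "missing": missing,
        "mismatched_type": mismatched,
        "outliers": outliers,
    }


# Illustrative column: one missing value, one string, one extreme value.
ages = [23, 25, None, "thirty", 27, 24, 26, 22, 28, 250]
report = assess_numeric_column(ages)
print(report)
```

A report like this helps decide which preprocessing techniques to apply next, for example imputing or dropping the missing entry and investigating the extreme value.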
Clean your Data
• Generally, you want to clean your data as your first
preprocessing step.
• This is because cleaning removes useless, unrelated, corrupted, or
incorrect data, which can interfere with other steps.
• This can be done manually by deleting files or automated with code or
tools.
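Automated cleaning can be as simple as filtering out records that fail basic checks. The sketch below assumes records are dicts with an `age` field and that ages outside 0 to 120 are incorrect. Both the field name and the range are illustrative assumptions.

```python
def clean_rows(rows, min_age=0, max_age=120):
    """Keep rows whose 'age' is present, numeric, and within a plausible range."""
    cleaned = []
    for row in rows:
        age = row.get("age")
        if age is None:
            continue  # missing value
        if not isinstance(age, (int, float)) or isinstance(age, bool):
            continue  # corrupted value (wrong type)
        if not (min_age <= age <= max_age):
            continue  # incorrect value (outside the assumed range)
        cleaned.append(row)
    return cleaned


# Illustrative records only.
rows = [
    {"name": "Ana", "age": 29},
    {"name": "Ben", "age": None},
    {"name": "Cara", "age": "n/a"},
    {"name": "Dan", "age": -4},
    {"name": "Eve", "age": 41},
]
cleaned = clean_rows(rows)
print([r["name"] for r in cleaned])  # ['Ana', 'Eve']
```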
Transform your Data
• This is where your data is transformed into a format suitable for
your data analysis tools.
• How you transform your data will depend on what tool you are using
and what analysis you will perform.
• This involves steps such as normalization to further enhance the data.
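As one example of normalization, min-max scaling maps a numeric column onto the range 0 to 1. This is a minimal sketch with hypothetical values; whether min-max scaling or another method (such as z-score standardization) is appropriate depends on the analysis tool.

```python
def min_max_scale(values):
    """Rescale numeric values linearly so the minimum is 0.0 and the maximum is 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid division by zero on constant data
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20000, 35000, 50000, 80000]  # illustrative values
scaled = min_max_scale(incomes)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```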
Reduce your Data
• You will then want to reduce the size of your overall dataset as
needed to make analysis easier.
• This may not be needed for small datasets but becomes important for
larger datasets.
• This helps keep your data analysis from becoming too slow or
infeasible to run.
Further Process your Data
• You will need to determine if your current data preprocessing
steps are sufficient.
• This is typically done after data analysis to check if the data
preprocessing enhanced the results.
• You can add or remove preprocessing methods if you find that they are
not effective for your dataset.