Assignment 02
Weekly Assignment 2
S/15/809
1. Explain what the following terms are and give some examples for each case:
a. Structured Data
Structured data is data whose elements are addressable for effective analysis. It has
been organized into a formatted repository, typically a database. It covers all data
that can be stored in a SQL database in tables with rows and columns. Structured data
has relational keys and can easily be mapped into pre-designed fields. Today, it is the
most widely processed form of data and the simplest way to manage information.
Example: relational data.
b. Semi-Structured Data
Semi-structured data is information that does not reside in a relational database but
has some organizational properties that make it easier to analyze. With some processing,
it can be stored in a relational database (though this can be very hard for some kinds
of semi-structured data); semi-structured formats exist to save space. Example: XML
data.
c. Unstructured Data
Unstructured data is data that is not organized in a predefined manner or does not
have a predefined data model, so it is not a good fit for a mainstream relational
database. Alternative platforms exist for storing and managing unstructured data. It is
increasingly prevalent in IT systems and is used by organizations in a variety of
business intelligence and analytics applications. Examples: Word documents, PDF files,
text, media logs.
4. Explain the following data preprocessing steps.
a. Data Cleaning
Remove noise and correct inconsistencies in data.
b. Data Integration
Data with different representations are put together and conflicts within the data are
resolved.
c. Data Transformation
Data is normalized and generalized. In this context, normalization scales attribute
values so that they fall within a small, specified range (such as 0.0 to 1.0), so that
no attribute dominates the analysis simply because of its measurement scale.
d. Data Reduction
When the volume of data is huge, databases can become slow, costly to access, and
challenging to store properly. The data reduction step aims to produce a reduced
representation of the data in the data warehouse.
5. Clearly discuss all the available methods for dealing with missing data. Give suitable examples
to explain the concepts. Also discuss their advantages and disadvantages.
Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the
tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: In general, this approach is time-consuming and may
not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant, such as a label like “Unknown”.
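As a small illustration, here is a minimal R sketch of the first and third methods on a
made-up data frame (the column names are hypothetical); filling in values manually, by
its nature, has no code equivalent:

    # Hypothetical data frame with missing values
    df <- data.frame(age = c(25, NA, 47, 31),
                     occupation = c("clerk", "engineer", NA, "teacher"))

    # Ignore the tuple: drop every row that contains a missing value
    complete_rows <- na.omit(df)

    # Use a global constant: replace missing labels with "Unknown"
    df$occupation[is.na(df$occupation)] <- "Unknown"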
7. Explain why and when transformations are needed for raw data.
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following:
Smoothing
Aggregation
Generalization
Normalization
Attribute construction
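For illustration, a minimal R sketch of two of these operations (aggregation and
attribute construction) on hypothetical sales data; normalization is worked out under
question 8 below:

    # Hypothetical monthly sales records
    sales <- data.frame(year   = c(2020, 2020, 2021, 2021),
                        month  = c(1, 2, 1, 2),
                        amount = c(100, 150, 120, 180))

    # Aggregation: roll monthly amounts up to yearly totals
    yearly <- aggregate(amount ~ year, data = sales, FUN = sum)

    # Attribute construction: derive a new attribute from existing ones
    sales$quarter <- ceiling(sales$month / 3)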
8. Using suitable examples clearly explain how the following normalizations are worked out.
a. Decimal Scaling
It normalizes by moving the decimal point of the values of the data. To normalize the
data by this technique, each value is divided by a power of ten large enough that the
largest absolute value becomes less than 1.
The data value v(i) is normalized to v'(i) using the formula

    v'(i) = v(i) / 10^k

where k is the smallest integer such that max(|v'(i)|) < 1.
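A minimal R sketch on a made-up vector; here k is computed with a logarithm, which
works when the maximum absolute value is not an exact power of ten:

    v <- c(-991, 45, 320)
    k <- ceiling(log10(max(abs(v))))   # smallest k with max(|v|) / 10^k < 1
    v_scaled <- v / 10^k               # -0.991, 0.045, 0.320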
b. Min-Max Normalization
In this technique of data normalization, a linear transformation is performed on
the original data. The minimum and maximum values are taken from the data, and each
value is replaced according to the following formula, which maps the data into the
range [0, 1]:

    v'(i) = (v(i) - min(v)) / (max(v) - min(v))
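A minimal R sketch, using a made-up vector:

    v <- c(200, 300, 400, 600, 1000)
    v_norm <- (v - min(v)) / (max(v) - min(v))
    # 0.000 0.125 0.250 0.500 1.000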
11. Clearly explain how the following methods are used for outlier detection
a. Histogram
In a histogram, outliers appear as isolated bars (values falling in low-frequency bins)
lying far from the main body of the distribution.
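A minimal R sketch: the injected value 55 shows up as an isolated bar far to the right
of the main mass of the data:

    set.seed(1)
    v <- c(rnorm(100, mean = 20, sd = 2), 55)   # 55 is an artificial outlier
    hist(v, breaks = 20)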
b. Scatter Plot
If a regression line is fitted to a scatter plot, outliers can be identified as the
point or points that are farthest from the regression line. Note that outliers for a
scatter plot can be very different from outliers for a box plot: distance from the
fitted trend matters, rather than distance from the bulk of a single variable.
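A minimal R sketch, injecting one artificial outlier into otherwise linear data and
flagging the point with the largest residual from the fitted line:

    set.seed(1)
    x <- 1:20
    y <- 2 * x + rnorm(20)
    y[10] <- 60                      # artificial outlier
    fit <- lm(y ~ x)
    plot(x, y); abline(fit)
    which.max(abs(residuals(fit)))   # index of the point farthest from the line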
c. Box Plot
A box plot, also termed a box-and-whisker plot, is a graphical method based on
quartiles and the interquartile range that defines an upper limit and a lower limit;
any data lying beyond these limits is considered an outlier. The purpose of the diagram
is to identify outliers and discard them from the data series before making further
observations, so that the conclusions drawn from the study are more accurate and not
influenced by extreme or abnormal values.
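A minimal R sketch; boxplot.stats() reports the values lying beyond the whiskers
(1.5 times the interquartile range from the quartiles):

    v <- c(12, 15, 14, 16, 13, 15, 14, 80)   # 80 is an extreme value
    boxplot(v)
    boxplot.stats(v)$out                      # returns 80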
d. Tabulated Summary
A tabulated summary, such as a frequency table or a table of per-group minima and
maxima, can reveal outliers as values or categories with unusually low counts or with
extreme minimum or maximum entries.
12. Explain why Discretization is important in Data Mining. Also describe the difference
between unsupervised discretization and supervised discretization methods.
Some learning algorithms can process only symbolic or categorical data, but a very
large proportion of real data sets include continuous variables (interval and ratio
level). A solution is to partition numeric variables into a number of sub-ranges and
treat each sub-range as a category.
Discretization methods can be categorized into unsupervised discretization and
supervised discretization: unsupervised methods (such as equal-width or equal-frequency
binning and clustering) use only the distribution of the attribute itself, whereas
supervised methods (such as entropy-based discretization) also use the class labels to
choose the cut points.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided
into segments of equal size, and various methods are then applied to complete the task.
Each segment is handled separately: all values in a segment can be replaced by the
segment mean, or the segment boundary values can be used instead (see the combined
sketch after this list).
2. Regression:
Here, data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having several
independent variables).
3. Clustering:
This approach groups similar data values into clusters. Values that fall outside the
clusters can be treated as outliers and smoothed or removed.
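As referenced above, a minimal R sketch of all three smoothing approaches on made-up
vectors; the cluster sketch uses explicit initial centres for a stable illustration:

    # 1. Binning: equal-width bins, smoothing by bin means
    v <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
    bins <- cut(v, breaks = 3)
    smoothed <- ave(v, bins, FUN = mean)

    # 2. Regression: replace noisy values with fitted values
    set.seed(1)
    x <- 1:15
    y <- 3 * x + rnorm(15, sd = 4)
    y_smooth <- fitted(lm(y ~ x))

    # 3. Clustering: values far from their cluster centre are outlier candidates
    set.seed(2)
    w <- c(rnorm(20, mean = 10), rnorm(20, mean = 50), 200)
    km <- kmeans(w, centers = c(10, 50))
    dist_to_centre <- abs(w - km$centers[km$cluster])
    w[dist_to_centre > 3 * sd(dist_to_centre)]   # flags the isolated value 200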
17. Discuss the issues encountered in Data Integration and possible solutions for them.
There are a number of issues to consider during data integration:
1. Schema Integration:
Metadata from different sources must be integrated. Matching real-world entities from
multiple sources is referred to as the entity identification problem.
2. Redundancy:
An attribute may be redundant if it can be derived or obtained from another attribute
or set of attributes. Inconsistencies in attribute naming can also cause redundancies
in the resulting data set. Some redundancies can be detected by correlation analysis
(see the sketch after this list).
3. Detection and resolution of data value conflicts:
This is the third important issue in data integration. Attribute values from different
sources may differ for the same real-world entity, for example because of differences
in representation, scaling, or encoding. An attribute in one system may also be
recorded at a lower level of abstraction than the "same" attribute in another.
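As referenced above, a minimal R sketch of detecting a redundant attribute pair by
correlation analysis, using a hypothetical temperature table:

    # fahrenheit is exactly derivable from celsius, so the pair is redundant
    df <- data.frame(celsius    = c(0, 10, 20, 30),
                     fahrenheit = c(32, 50, 68, 86))
    cor(df$celsius, df$fahrenheit)   # 1: perfectly correlated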
18. Explain what data reduction is and the strategies available for data reduction.
A database or data warehouse may store terabytes of data, so it may take very long to
perform data analysis and mining on such huge amounts of data.
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume but still contains the critical information.
Data reduction strategies:
1. Data Cube Aggregation
Aggregation operations are applied to the data in the construction of a data cube.
2. Dimensionality Reduction
In dimensionality reduction, redundant attributes are detected and removed, which
reduces the data set size.
3. Data Compression
Encoding mechanisms are used to reduce the data set size.
4. Numerosity Reduction
In numerosity reduction, the data are replaced or estimated by alternative, smaller
representations, such as parametric models or samples.
5. Discretization and concept hierarchy generation
Raw data values for attributes are replaced by ranges or by higher conceptual levels
(see the sketch below).
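As an example of the last strategy, a minimal R sketch replacing raw ages with
higher-level concept ranges (the cut points are hypothetical):

    age <- c(23, 37, 45, 61, 70)
    age_group <- cut(age,
                     breaks = c(0, 30, 60, Inf),
                     labels = c("young", "middle-aged", "senior"))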
20. (Hands-on-Analysis) Use the churn data set for the following exercises. Also use R Statistical
Software for the analysis. You may use R-Notebook to prepare a PDF document containing
the answers for this question.
a. Briefly explain what Churn prediction is.
Churn rate, when applied to a customer base, refers to the proportion of
contractual customers or subscribers who leave a supplier during a given time
period.
Churn is directly related to the profitability of a company. The more a company can
learn about customer behavior, the more profit it can gain. This also helps in
identifying and improving areas where customer service is lacking.
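A minimal R sketch of computing the rate itself (the data frame and its churned column
are hypothetical stand-ins for the churn data set):

    set.seed(3)
    customers <- data.frame(id = 1:1000,
                            churned = sample(c(TRUE, FALSE), 1000, replace = TRUE,
                                             prob = c(0.14, 0.86)))
    mean(customers$churned)   # churn rate: proportion of customers who left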