Assignment 02


ST 402 - Statistical Data Mining

Weekly Assignment 2
S/15/809

1. Explain the following terms and give some examples for each case:
a. Structured Data
Structured data is data whose elements are addressable for effective analysis. It has
been organized into a formatted repository, typically a database. It covers all data
that can be stored in an SQL database in tables with rows and columns. Such data have
relational keys and can easily be mapped into pre-designed fields. Today, structured
data is the most widely processed form of data and the simplest to manage.
Example: Relational data.

b. Semi-Structured Data
Semi-structured data is information that does not reside in a relational database but
that has some organizational properties that make it easier to analyze. With some
processing, it can be stored in a relational database (although this can be very hard
for some kinds of semi-structured data), but the semi-structured form exists to save
space. Example: XML data.

c. Unstructured Data
Unstructured data is data that is not organized in a predefined manner or does not
have a predefined data model, so it is not a good fit for a mainstream relational
database. Alternative platforms exist for storing and managing unstructured data. It is
increasingly prevalent in IT systems and is used by organizations in a variety of
business intelligence and analytics applications. Example: Word, PDF, Text,
Media logs.

2. List the main Data Quality Indicators and briefly explain them.

• The data should be accurate.
  The analyst has to check that the name is spelled correctly, the code is
  in a given range, the value is complete, etc.

• The data should be stored according to data type.
  The analyst must ensure that a numeric value is not presented in
  character form, that integers are not in the form of real numbers, etc.

• The data should have integrity.
• The data should be consistent.
• The data set should be complete.

3. Explain why preprocessing is considered a very important activity in the Data Mining Process
model.
Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format.
Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and
likely to contain many errors. Therefore we have to preprocess the data before it is used.

4. Explain following data preprocessing steps. Explain all the steps mentioned here.
a. Data Cleaning
Remove noise and correct inconsistencies in data.

b. Data Integration
Data with different representations are put together and conflicts within the data are
resolved.

c. Data Transformation
Data is normalized and generalized. Normalization here means scaling attribute values so
that they fall within a small, specified range (for example 0 to 1), so that attributes
measured on different scales become comparable.

d. Data Reduction
When the volume of data is huge, databases can become slower, costly to access, and
challenging to store properly. The data reduction step aims to present a reduced
representation of the data in a data warehouse.

5. Clearly discuss all the available methods for dealing with missing data. Give suitable examples
to explain the concepts. Also discuss their advantages and disadvantages.

• Ignore the tuple: This is usually done when the class label is missing (assuming the
  mining task involves classification). This method is not very effective unless the
  tuple contains several attributes with missing values. It is especially poor when
  the percentage of missing values per attribute varies considerably.

• Fill in the missing value manually: In general, this approach is time-consuming and may
  not be feasible given a large data set with many missing values.

• Use a global constant to fill in the missing value: Replace all missing attribute values by
  the same constant, such as a label like “Unknown”.

• Use the attribute mean to fill in the missing value.

• Use the attribute mean for all samples belonging to the same class as the given tuple.

• Use the most probable value to fill in the missing value.
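
The filling strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute (an income figure) and the class labels are hypothetical, and `None` is used to mark a missing value.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [
    ("low", 20.0),
    ("low", None),
    ("low", 30.0),
    ("high", 80.0),
    ("high", None),
    ("high", 100.0),
]

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows without a missing value."""
    return [(label, v) for label, v in rows if v is not None]

def fill_with_constant(rows, constant="Unknown"):
    """Replace every missing value by the same global constant."""
    return [(label, constant if v is None else v) for label, v in rows]

def fill_with_mean(rows):
    """Replace every missing value by the mean of the observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if v is None else v) for label, v in rows]

def fill_with_class_mean(rows):
    """Replace a missing value by the mean of its own class."""
    class_means = {
        label: mean(v for l, v in rows if l == label and v is not None)
        for label in {label for label, _ in rows}
    }
    return [(label, class_means[label] if v is None else v) for label, v in rows]

print(fill_with_mean(records))
print(fill_with_class_mean(records))
```

In this toy data the overall mean pulls both missing incomes toward the middle, while the class mean keeps the "low" and "high" groups apart, which is the usual argument for the class-based variant.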
6. Clearly explain methods available for identifying misclassifications by giving suitable
examples.
• The classifications of categorical variables are validated to make sure that they are
  all valid and consistent.
• Frequency distribution tables are used for cross-checking the validity of categorical data.
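
As a small illustration of the frequency-table check, the sketch below counts the values of a categorical column and flags labels that are not in the expected set. The column values and the set of valid labels are hypothetical.

```python
from collections import Counter

# Hypothetical "country" column. "US" duplicates "USA", and "Europe" is a
# region rather than a country, so both are likely misclassifications.
countries = ["USA", "France", "US", "USA", "Europe", "France", "USA"]
valid_labels = {"USA", "France"}  # assumed list of allowed categories

freq = Counter(countries)
for value, count in freq.most_common():
    print(f"{value:<8}{count}")

# Labels that appear in the data but not in the allowed set.
suspicious = sorted(value for value in freq if value not in valid_labels)
print("check these labels:", suspicious)
```

Rare or unexpected labels in the frequency table are candidates for correction, for example by recoding "US" as "USA".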

7. Explain why and when transformations are needed for raw data.
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following:
• Smoothing
• Aggregation
• Generalization
• Normalization
• Attribute construction

8. Using suitable examples clearly explain how the following normalizations are worked out.
a. Decimal Scaling
• It normalizes by moving the decimal point of the values of the data.
• To normalize the data by this technique, we divide each value of the data by a
  power of 10 chosen from the maximum absolute value of the data.
• The data value v(i) is normalized to v'(i) by using the formula below:

$$ v'(i) = \frac{v(i)}{10^{k}} $$

where k is the smallest integer such that max(|v'(i)|) < 1.
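
A minimal sketch of decimal scaling follows. The input values are hypothetical, and the search for k starts at 0, which is an assumption (it means values already below 1 in absolute value are left unchanged).

```python
def decimal_scaling(values):
    """Return (k, scaled values) with every |scaled value| < 1."""
    max_abs = max(abs(v) for v in values)
    k = 0
    # Increase k until the largest absolute value drops below 1.
    while max_abs / 10 ** k >= 1:
        k += 1
    return k, [v / 10 ** k for v in values]

k, scaled = decimal_scaling([-986, 120, 917])
print(k, scaled)
```

Here the largest absolute value is 986, so k = 3 and every value is divided by 1000.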

b. Min-Max Normalization
• In this technique of data normalization, a linear transformation is performed on
  the original data.
• The minimum and maximum values of the data are found and each value is replaced
  according to the following formula:

$$ v'(i) = \frac{v(i) - \min(v)}{\max(v) - \min(v)} $$
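
The min-max formula can be sketched as follows, mapping the data onto the range 0 to 1. The sample values are hypothetical, and rejecting a constant column is a design choice made here because the denominator would be zero.

```python
def min_max(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max normalization is undefined for constant data")
    return [(v - low) / (high - low) for v in values]

print(min_max([10, 20, 30, 50]))
```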

c. Standard Deviation Normalization
In this technique, values are normalized based on the mean and standard deviation of the
data. The formula used is:

$$ v'(i) = \frac{v(i) - \operatorname{mean}(v)}{\sigma_v} $$
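
A short sketch of standard deviation normalization is given below. The data are hypothetical, and the population standard deviation is used for the denominator, which is an assumption: the formula does not say whether the population or sample version is meant.

```python
from statistics import mean, pstdev

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; data are constant")
    return [(v - mu) / sigma for v in values]

print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
```

For this sample the mean is 5 and the standard deviation is 2, so the value 9 becomes 2.0 and the value 2 becomes -1.5.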

9. Briefly discuss the difference between Normalization and Data smoothing.


10. Clearly explain the data smoothing operation.
• Data smoothing uses an algorithm to remove noise from a data set, allowing important patterns
  to stand out.
• It can be used to predict trends, such as those found in securities prices.
• Different data smoothing models include the random method, random walk, and the moving
  average.
• While data smoothing can help predict certain trends, it may lead to certain data points being
  ignored.
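
Of the smoothing models named above, the moving average is the simplest to illustrate. The sketch below uses a hypothetical price series and a window of 3, both chosen only for the example.

```python
def moving_average(series, window=3):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and the series length")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

prices = [10, 12, 11, 30, 12, 13, 11]  # hypothetical prices with one spike
smoothed = moving_average(prices, window=3)
print(smoothed)
```

The spike of 30 is spread over three smoothed values instead of standing out on its own, which shows both the benefit (less noise) and the cost (an individual data point is partly ignored).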

11. Clearly explain how the following methods are used for outlier detection
a. Histogram
b. Scatter Plot
If a regression line is drawn on a scatter plot, outliers can be identified as the point
or points that lie farthest from the regression line. A scatter plot may contain no
outliers, one, or several. Note that outliers on a scatter plot are judged against the
relationship between two variables, so they are very different from outliers on a boxplot.

c. Box Plot
A box plot, also called a box-and-whisker plot, is a graphical method based on the
quartiles and the interquartile range, which define the upper and lower limits
beyond which any data point is considered an outlier. One purpose of this
diagram is to identify outliers so that they can be examined, and discarded where
appropriate, before any further analysis, so that the conclusions drawn from the study
are not unduly influenced by extreme or abnormal values.
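
The upper and lower limits of a box plot can be computed directly. The sketch below assumes the conventional fences of 1.5 times the interquartile range beyond the first and third quartiles; that multiplier is an assumption, since no specific rule is stated here, and the data are hypothetical. Quartile values also depend on the interpolation method, so other tools may give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(data, factor=1.5):
    """Return (lower fence, upper fence, outliers) using the IQR rule."""
    q1, _median, q3 = quantiles(data, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return lower, upper, [v for v in data if v < lower or v > upper]

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```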

d. Tabulated Summary

12. Explain why Discretization is important in Data Mining. Also describe the difference
between Unsupervised Discretization and Supervised Discretization methods.
• Some learning algorithms are able to process symbolic or categorical data only.
• But a very large proportion of real data sets include continuous variables (interval and
  ratio level).
• A solution is to partition numeric variables into a number of sub-ranges and treat each
  sub-range as a category.
• Discretization methods can be categorized into unsupervised discretization and
  supervised discretization.

| Unsupervised Discretization | Supervised Discretization |
|---|---|
| Partitions a variable using only information about the distribution of values of that variable. | Supervised techniques normally attempt to maximize some measure of the relationship between the partitioned variable and the classification variable. |
| Considerably faster. | More accurate classification. |

13. Clearly explain following Unsupervised Discretization methods.
a. Equi-Interval Width Binning Method
• It divides the range into N intervals of equal size (a uniform grid).
• If A and B are the lowest and highest values of the attribute, the width of
  the intervals will be W = (B-A)/N.
• It is the most straightforward method.
• Shortcoming: outliers may dominate the presentation.
• Skewed data is not handled well.
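
The width formula W = (B-A)/N translates into a short sketch. The attribute values are hypothetical, and placing the maximum value in the last bin (rather than opening an extra one) is an implementation choice made here.

```python
def equal_width_bins(values, n_bins):
    """Return, for each value, the index (0..n_bins-1) of its equal-width bin."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; the bin width would be zero")
    width = (high - low) / n_bins  # W = (B - A) / N
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(ages, 3))

# One outlier stretches the range and crowds the other values into bin 0.
print(equal_width_bins([1, 2, 3, 4, 100], 4))
```

The second call illustrates the shortcoming noted above: a single extreme value determines the grid, leaving the middle bins empty.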
b. Equi-Frequency Width Binning Method
• It divides the range into N intervals, each containing approximately the same
  number of samples (quantile-based approach).
• Good data scaling.
• Managing categorical attributes can be tricky.
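
Equal-frequency binning can be sketched by sorting the values and cutting the sorted list into N nearly equal parts. The data are hypothetical, and this simple version may split tied values across two bins.

```python
def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (almost) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if b < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

ages = [24, 0, 16, 4, 28, 12, 18, 16, 26]
print(equal_frequency_bins(ages, 3))
```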
c. Clustering
• Partition the data set into clusters; one can then store the cluster representation only.
• Can be very effective if the data is clustered, but not if the data is “smeared”.
• Clustering can be hierarchical and can be stored in multidimensional index tree
  structures.
d. Concept Hierarchy
Reduce the data by collecting and replacing low-level concepts (such as numeric values
for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).
e. Intuitive Partitioning

14. Clearly explain following Supervised Discretization methods.


a. Entropy based (or Information Gain) discretization
• This method uses the class information present in the data.
• The entropy (or information content) is calculated on the basis of the class label.
• Entropy based binning is one such method.
• It finds the best split so that the majority of elements in a bin have the
  same class label.
• It is characterized by finding the split with the maximal information gain.
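
The search for the split with maximal information gain can be sketched as below for a single binary split. The values and class labels are hypothetical, and using the midpoint between neighbouring values as the candidate cut point is an assumption of this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (cut point, information gain) of the best binary split."""
    pairs = sorted(zip(values, labels))
    base = entropy([label for _, label in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cut, best_gain

cut, gain = best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(cut, gain)
```

In this toy data the cut at 6.5 separates the two classes perfectly, so both bins are pure and the gain equals the full entropy of 1 bit. Applying the same search recursively inside each bin would produce more than two intervals.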

b. Chi-Square Analysis based discretization


15. Clearly explain methods available for handling noisy data in a dataset.
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into
segments of equal size and then various methods are applied to complete the task.
Each segment is handled separately. One can replace all data in a segment by its
mean, or boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
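
The binning method described above can be sketched as follows, with smoothing by bin means and by bin boundaries. The price values and the bin size of 3 are hypothetical choices for the example.

```python
def smooth_by_means(values, bin_size):
    """Sort, cut into bins of bin_size, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        segment = ordered[i:i + bin_size]
        smoothed.extend([sum(segment) / len(segment)] * len(segment))
    return smoothed

def smooth_by_boundaries(values, bin_size):
    """Replace each value by the closer of its bin's minimum and maximum."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        segment = ordered[i:i + bin_size]
        low, high = segment[0], segment[-1]
        smoothed.extend(low if v - low <= high - v else high for v in segment)
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_means(prices, 3))
print(smooth_by_boundaries(prices, 3))
```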

16. Explain what Data Integration is.

• Data Integration is a data preprocessing technique that involves combining data from
  multiple heterogeneous data sources into a coherent data store and providing a unified
  view of the data. These sources may include multiple data cubes, databases or flat files.
• The data integration approach is formally defined as a triple <G, S, M> where,
  o G stands for the global schema,
  o S stands for the heterogeneous source schemas,
  o M stands for the mapping between queries over the source and global schemas.

17. Discuss the issues encountered in Data Integration and the possible solutions for them.
There are a number of issues to consider during data integration:
1. Schema Integration:
• Integrate metadata from different sources.
• Real-world entities from multiple sources must be matched; this is referred to as the
  entity identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another attribute
  or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world
  entity.
• An attribute in one system may be recorded at a lower level of abstraction than the
  “same” attribute in another.
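
The correlation analysis mentioned under Redundancy can be illustrated with the Pearson correlation coefficient. The attributes below are hypothetical: the same height recorded in centimetres by one source and in inches by another, plus an unrelated weight column.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

height_cm = [150, 160, 170, 180, 190]
height_in = [v / 2.54 for v in height_cm]  # same attribute, different unit
weight_kg = [55, 72, 60, 90, 70]

r_height = pearson(height_cm, height_in)
r_weight = pearson(height_cm, weight_kg)
print(r_height, r_weight)
```

A coefficient very close to 1 or -1, as for the two height columns, suggests that one attribute can be derived from the other and is a candidate for removal after integration.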
18. Explain what data reduction is and the strategies available for data reduction.
• A database or data warehouse may store terabytes of data, so it may take a very long time
  to perform data analysis and mining on such huge amounts of data.
• Data reduction techniques can be applied to obtain a reduced representation of the
  data set that is much smaller in volume but still contains the critical information.
Data reduction strategies:
1. Data Cube Aggregation
Aggregation operations are applied to the data in the construction of a data cube.
2. Dimensionality Reduction
In dimensionality reduction, redundant attributes are detected and removed, which reduces
the data set size.
3. Data Compression
Encoding mechanisms are used to reduce the data set size.
4. Numerosity Reduction
In numerosity reduction, the data are replaced or estimated by alternative, smaller
data representations.
5. Discretization and concept hierarchy generation
Raw data values for attributes are replaced by ranges or higher conceptual levels.

19. Discuss in detail the following data reduction strategies:


a. Data Cube aggregation
This technique is used to aggregate data into a simpler form. For example, imagine that
the information you gathered for your analysis covers the years 2012 to 2014 and
includes the revenue of your company every three months. If you are interested in the
annual sales rather than the quarterly figures, the data can be summarized in such a
way that the resulting data gives the total sales per year instead of per quarter.
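
The quarterly-to-annual roll-up in this example can be sketched as follows. The revenue figures are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical revenue per (year, quarter); the numbers are illustrative only.
quarterly_revenue = {
    (2012, 1): 100, (2012, 2): 120, (2012, 3): 130, (2012, 4): 150,
    (2013, 1): 110, (2013, 2): 130, (2013, 3): 140, (2013, 4): 170,
    (2014, 1): 120, (2014, 2): 140, (2014, 3): 160, (2014, 4): 180,
}

# Aggregate along the quarter dimension, keeping only the year.
annual_revenue = defaultdict(int)
for (year, _quarter), amount in quarterly_revenue.items():
    annual_revenue[year] += amount

print(dict(annual_revenue))
```

Twelve quarterly values are reduced to three annual totals, which is the reduction the data cube provides when the analysis only needs the yearly level.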

b. Attribute subset selection
When some attributes are only weakly relevant, we keep just the attributes required for
our analysis. This reduces data size as it eliminates outdated or redundant features.

Step-wise Forward Selection:
The selection begins with an empty set of attributes. At each step, the best of the
remaining original attributes is added to the set based on its relevance, which in
statistics is often judged by a p-value.

Step-wise Backward Selection:
This selection starts with the complete set of attributes in the original data and at each
step it eliminates the worst remaining attribute in the set.

Combination of Forward and Backward Selection:
At each step it selects the best attribute and removes the worst from among the remaining
attributes, saving time and making the process faster.

c. Dimensionality reduction methods (Wavelet Transform, Principal Component
Analysis)

d. Numerosity reduction methods (Regression and Log Linear Models, Histograms,
Clustering, Sampling)

20. (Hands-on-Analysis) Use the churn data set for the following exercises. Also use R Statistical
Software for the analysis. You may use R-Notebook to prepare pdf document containing
answers for this question.
a. Briefly explain what Churn prediction is.
• Churn rate, when applied to a customer base, refers to the proportion of
  contractual customers or subscribers who leave a supplier during a given time
  period.
• Churn is directly related to the profitability of a company. The more a company can
  learn about customer behavior, the more profit can be gained. This also helps in
  identifying and improving areas where customer service is lacking.
