Running Head: DATA MINING 1

Data Mining

Student’s Name:

Institutional Affiliation:

Part 1

1. What is K-means from a basic standpoint?

K-means is a clustering technique that originated in vector quantization. Its main aim is to

partition n observations into k clusters, assigning each observation to the nearest cluster mean.
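
The partitioning idea can be illustrated with a minimal sketch of the standard K-means iteration (often called Lloyd's algorithm): assign every point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until nothing changes. The function names, the toy points, and the random seed are assumptions made for illustration and are not taken from Tan et al. (2016).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) after running the basic K-means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster receives the label 0 depends on the random initialization.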

2. What are the various types of clusters, and why is the distinction important?

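There are different types of clustering, including partitioning, hierarchical, density-based,

and distribution model-based clustering. The distinction is important because each type makes

different assumptions about cluster shape, so the choice determines what structure can be found.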

Partitioning Clustering

The technique divides a data set into a predefined number of groups. The method is

also referred to as the centroid-based technique (Tan et al., 2016). In this method, each cluster

is represented by a centroid, and every data point is assigned so that its distance to the centroid

of its own cluster is smaller than its distance to any other centroid.

Hierarchical Clustering

The technique organizes a data set into a hierarchy of nested clusters, so the user does not need

to specify the number of clusters before training the model (Tan et al., 2016). The method is also

referred to as a connectivity-based technique.
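
A minimal sketch of the bottom-up (agglomerative) form of hierarchical clustering is shown below. It uses single linkage, where the distance between two clusters is the distance between their closest members, and it records every level of the hierarchy so that the number of clusters can be chosen afterwards. The one-dimensional toy values and the function names are assumptions for illustration only.

```python
def agglomerate(values):
    """Single-linkage agglomerative clustering of 1-D values.

    Returns a list of partitions: levels[0] has one cluster per value and
    each later level has one cluster fewer, down to a single cluster.
    """
    clusters = [[v] for v in values]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        levels.append([sorted(c) for c in clusters])
    return levels


def cut(levels, k):
    """Pick the partition with k clusters from the recorded hierarchy."""
    return levels[len(levels) - k]


# Toy data (hypothetical): three visibly separated groups on a line.
values = [1.0, 1.2, 5.0, 5.1, 9.0]
levels = agglomerate(values)
print(sorted(cut(levels, 3)))
```

The hierarchy is built once, and `cut` can then be called with any k. This is the sense in which the number of clusters does not have to be fixed before training.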

Density-Based Clustering

This technique forms clusters by identifying dense regions of data points and separating them

from regions of lower density.
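
One well-known density-based algorithm is DBSCAN. The sketch below is a simplified version of it, assuming Euclidean distance and two user-chosen parameters: `eps`, the neighborhood radius, and `min_pts`, the number of neighbors that makes a region dense. The toy points and parameter values are hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # not dense enough; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster_id  # border point reached from a dense point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is dense too: keep expanding
                queue.extend(j_neighbors)
    return labels


# Toy data (hypothetical): two dense squares and one isolated point between them.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point has no dense neighborhood, so it is reported as noise instead of being forced into a cluster.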

Distribution Model-Based Clustering

This technique groups data points according to the probability that they were generated by the

same distribution, most commonly a Gaussian distribution.
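
To make the probabilistic view concrete, the sketch below computes, for a single value, the probability that it belongs to each component of a Gaussian mixture. It assumes the component weights, means, and standard deviations are already known. Estimating them (for example with the EM algorithm) is outside this sketch, and all numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, components):
    """Probability that x came from each component.

    components is a list of (weight, mean, std) tuples whose weights sum to 1.
    Values extremely far from every component may underflow to zero density;
    this sketch does not guard against that case.
    """
    weighted = [w * gaussian_pdf(x, mean, std) for w, mean, std in components]
    total = sum(weighted)
    return [v / total for v in weighted]


# Hypothetical mixture: two equally weighted components centered at 0 and 10.
components = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
print(responsibilities(1.0, components))
print(responsibilities(5.0, components))
```

A value near one mean is assigned almost entirely to that component, while a value midway between the two means is shared equally.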

3. What are the strengths and weaknesses of K-means?

Strengths of k-means

First, the method is relatively simple to implement. Besides, it scales to large data sets. The

technique also guarantees convergence, although only to a local optimum, and easily adapts to

new examples. Lastly, it can be generalized to clusters of different sizes and shapes, for

instance, elliptical clusters.

Weaknesses

One of the weaknesses of K-means is that its results depend on the initial centroid values. In

addition, k must be chosen manually, for example by using a "Loss vs. Clusters" plot to determine

the optimal k (Tan et al., 2016). Lastly, the technique has trouble clustering data where

clusters are of different sizes and densities.
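
The loss in a "Loss vs. Clusters" plot is commonly the sum of squared errors (SSE), that is, the total squared distance from each point to the mean of its cluster. The sketch below computes it for three hand-picked partitions of the same hypothetical one-dimensional data. The partitions are written out by hand for illustration instead of being produced by K-means.

```python
def sse(clusters):
    """Sum of squared distances from each value to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((v - mean) ** 2 for v in cluster)
    return total


# Hypothetical data with two natural groups, partitioned for k = 1, 2, 3.
partitions = {
    1: [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]],
    2: [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]],
    3: [[1.0, 2.0], [3.0], [11.0, 12.0, 13.0]],
}
losses = {k: sse(clusters) for k, clusters in partitions.items()}
for k, loss in losses.items():
    print(k, loss)
```

The loss falls from 154.0 at k = 1 to 4.0 at k = 2 and only to 2.5 at k = 3. The sharp bend at k = 2 is the "elbow" that suggests two clusters for this data.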

4. What is a cluster evaluation?

Cluster evaluation, also called cluster validation, assesses the quality of a clustering result,

for example how compact and well separated the clusters are.
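
One widely used measure of clustering quality is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below is a minimal version for one-dimensional values. The data and the two labelings are hypothetical.

```python
def mean_silhouette(values, labels):
    """Mean silhouette coefficient for 1-D values; needs at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, v in enumerate(values):
        own = [abs(v - w) for j, w in enumerate(values)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(abs(v - w) for w, lab in zip(values, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


values = [1.0, 2.0, 10.0, 11.0]
good = mean_silhouette(values, [0, 0, 1, 1])  # groups nearby values together
bad = mean_silhouette(values, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

Scores range from -1 to 1. The sensible labeling scores close to 1, and the mixed labeling scores below 0.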

Select at least two types of cluster evaluation and discuss the concepts of each method.

Fuzzy Clustering

Typically, fuzzy clustering is a clustering approach in which every data point can

belong to more than one cluster (Tan et al., 2016). The approach entails assigning

data points to clusters so that items in the same cluster are as similar as possible,

while items that belong to different clusters are as dissimilar as possible.
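
The idea of partial membership can be sketched with the membership formula used in fuzzy c-means, where a point's membership in a cluster shrinks as its distance to that cluster's centroid grows relative to its distances to the other centroids. The sketch assumes one-dimensional data, fixed centroids, and the common fuzzifier value m = 2. All numbers are hypothetical.

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Degree of membership of a 1-D point in each cluster (values sum to 1)."""
    distances = [abs(point - c) for c in centroids]
    if 0.0 in distances:
        # The point sits exactly on one or more centroids: share full membership.
        zeros = distances.count(0.0)
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


centroids = [0.0, 10.0]
print(fuzzy_memberships(2.0, centroids))  # mostly the first cluster
print(fuzzy_memberships(5.0, centroids))  # shared equally
```

Unlike K-means, where each point receives a single label, every point here carries a weight for each cluster.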

Constraint-based (Semi-Supervised Clustering)

It belongs to the semi-supervised learning algorithms and combines must-link

constraints, cannot-link constraints, or both with a data clustering algorithm. Both

constraint types define a relationship between two data instances (Tan et al., 2016). A must-link

constraint specifies two instances that should be placed in the same cluster. In contrast, a

cannot-link constraint specifies two instances that should not be placed in the same cluster.
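
The way such constraints interact with a clustering algorithm can be sketched as a check that runs before an instance is assigned to a cluster, in the spirit of constrained K-means variants. The function name, the data layout, and the example constraints are assumptions for illustration.

```python
def violates_constraints(index, cluster, labels, must_link, cannot_link):
    """Return True if placing instance `index` in `cluster` breaks a constraint.

    labels maps already-assigned instance indices to their cluster ids.
    must_link and cannot_link are lists of (index_a, index_b) pairs.
    """
    for a, b in must_link:
        if index in (a, b):
            other = b if index == a else a
            # The partner is already elsewhere, so the pair would be split.
            if other in labels and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if index in (a, b):
            other = b if index == a else a
            # The partner is already in this cluster, so the pair would be joined.
            if other in labels and labels[other] == cluster:
                return True
    return False


# Hypothetical state: instance 0 is in cluster 0 and instance 1 is in cluster 1.
labels = {0: 0, 1: 1}
must_link = [(0, 2)]    # instances 0 and 2 belong together
cannot_link = [(1, 3)]  # instances 1 and 3 must be kept apart

print(violates_constraints(2, 1, labels, must_link, cannot_link))
print(violates_constraints(3, 0, labels, must_link, cannot_link))
```

An algorithm using this check would skip any candidate cluster for which the function returns True and try the next nearest one.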

Part 2

1. What is the definition of data mining that the author mentions? How is this

different from our current understanding of data mining?

Križanić (2020) provides a tentative definition of data mining in the case study.

According to the author, data mining involves integrating various efficient methods for analyzing

a large and complex collection of data (Križanić, 2020). Data mining also involves

extracting useful and unexpected data patterns.



By comparison, the current understanding of data mining involves the extraction of usable

data from a larger and more complex collection of raw data. In other words, data mining

involves the analysis of trends and data patterns in a large collection by using software

tools. Currently, data mining is highly applicable in data warehousing, data collection, and

computer processing. More importantly, the current understanding of data mining includes

applications such as spam email filtering, fraud detection, credit risk management, and

database marketing.

2. What is the premise of the use case and findings?

Križanić (2020) notes that data mining is highly applicable in the education setting for

higher education institutions in Croatia. Educational data mining has therefore been integrated

with big data to examine students' actions and behavior in e-courses (Križanić, 2020).

The premise of the use case is that educational data mining can be justified through the use of

the decision tree technique and cluster analysis as data mining approaches. The case also used

event logs downloaded from an e-learning environment to analyze student behavior (Križanić,

2020). Thus, data mining was used to analyze students' achievement in midterm exams based

on their behavior in the e-course, supporting the finding that students performed better in

midterm exams after accessing learning materials for the lectures.

3. What type of tools are used in the use case's data mining aspect, and how are they

used?

Data mining uses many tools, such as Teradata, Python, SPSS, SAS, and Oracle Data Mining.

These tools are based on data mining techniques such as classification analysis,

association rule learning, and anomaly detection. More importantly, they are based on clustering

analysis, decision trees, and regression analysis.



The case study by Križanić (2020) mainly uses cluster analysis and decision trees in the

data mining aspect. Notably, cluster analysis was executed by organizing collections of patterns

into groups based on the similarity of students' behavior in using course materials. In addition,

the decision tree was the key technique for generating a representation of

decision-making that enabled defining different classes of objects for

deeper evaluation of how students learned. The cluster analysis tool is mainly used in the

identification of similar patterns of behavior. Decision trees are easy to comprehend and are well

adapted to classification problems. However, they are sensitive to the data used in their

construction and are deemed a less natural model for regression. The key benefit associated with

decision trees is that there are many efficient algorithms, which makes it easy to find

approximately optimal tree structures.
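
The efficiency of decision tree algorithms comes largely from choosing each split greedily. The sketch below shows one such step, assuming Gini impurity as the split criterion: it scans candidate thresholds on a single numeric feature and keeps the one that leaves the purest groups. The feature (number of times learning materials were accessed) and the outcomes are invented toy values. They are not data from Križanić (2020), and this is not presented as the procedure used in that study.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best single split."""
    best_threshold, best_impurity = None, float("inf")
    candidates = sorted(set(values))
    # Try the midpoint between each pair of neighboring distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Invented toy data: material accesses per student and a pass/fail outcome.
accesses = [1, 2, 3, 8, 9, 10]
outcome = ["fail", "fail", "fail", "pass", "pass", "pass"]
print(best_split(accesses, outcome))
```

A full tree repeats this step on each resulting group. Because every split is chosen locally, the finished tree is only approximately optimal, which matches the trade-off described above.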

4. Were the tools used appropriately for the use case? Why or why not?

In my view, the tools used in the case were appropriate. This is because

the cluster analysis tool played a critical part by organizing different collections of patterns into

distinct groups based on the similarity of the students' behavior. On the other hand, the decision

tree helped generate a representation of decision-making, which enabled definitions of classes

of objects for deeper analysis.



References

Križanić, S. (2020). Educational data mining using cluster analysis and decision tree technique:

A case study. International Journal of Engineering Business Management, 12,

1847979020908675.

Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education

India.
