
S.no.

Topics for practice

1. Study and explore WEKA environment.

2. Create .arff file using WEKA.

3. Demonstration of pre-processing of .arff file.

4. Demonstrate performing association rule mining on data sets.

5. Demonstrate performing classification on data sets.

6. Demonstrate performing clustering on data sets.

7. Demonstrate performing Regression on data sets.

8. Demonstration of association rule mining.


9. Perform classification using Bayesian classification algorithm.

10. Perform the cluster analysis by k-means method.


Topic 1 : Study and explore WEKA environment
WEKA, an open-source software package, provides tools for data preprocessing, implementations of several machine
learning algorithms, and visualization tools so that you can develop machine learning techniques and apply
them to real-world data mining problems. What WEKA offers is summarized in the following diagram −

If you observe the flow of the diagram, you will understand that there are many stages in dealing
with Big Data to make it suitable for machine learning −
First, you start with the raw data collected from the field. This data may contain several null values and
irrelevant fields. You use the data preprocessing tools provided in WEKA to cleanse the data.
Then, you would save the preprocessed data in your local storage for applying ML algorithms.
Next, depending on the kind of ML model that you are trying to develop, you would select one of the options
such as Classify, Cluster, or Associate. The Attribute Selection option allows the automatic selection of features
to create a reduced dataset.
Note that under each category, WEKA provides the implementation of several algorithms. You would select an
algorithm of your choice, set the desired parameters and run it on the dataset.
Then, WEKA would give you the statistical output of the model processing. It provides you a visualization tool
to inspect the data.
The various models can be applied on the same dataset. You can then compare the outputs of different models
and select the best that meets your purpose.
Thus, the use of WEKA results in a quicker development of machine learning models on the whole.
This is the WEKA GUI Chooser.
The GUI Chooser application allows you to run five different types of applications as listed here −

 Explorer
 Experimenter
 KnowledgeFlow
 Workbench
 Simple CLI

When you click on the Explorer button in the Applications selector, it opens the following screen −


On the top, you will see several tabs as listed here −

 Preprocess
 Classify
 Cluster
 Associate
 Select Attributes
 Visualize
Under these tabs, there are several pre-implemented machine learning algorithms. Let us look into each of
them in detail now.

Preprocess Tab
Initially as you open the explorer, only the Preprocess tab is enabled. The first step in machine learning is to
preprocess the data. Thus, in the Preprocess option, you will select the data file, process it and make it fit for
applying the various machine learning algorithms.
Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your data. To list a
few, you may apply algorithms such as Linear Regression, Logistic Regression, Support Vector Machines,
Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is extensive and covers a wide
range of supervised machine learning algorithms.

Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided - such as SimpleKMeans,
FilteredClusterer, HierarchicalClusterer, and so on.

Associate Tab
Under the Associate tab, you would find Apriori, FilteredAssociator and FPGrowth.

Select Attributes Tab


Select Attributes allows you to perform feature selection based on several algorithms such as ClassifierSubsetEval,
PrincipalComponents, etc.

Visualize Tab
Lastly, the Visualize option allows you to visualize your processed data for analysis.
As you noticed, WEKA provides several ready-to-use algorithms for testing and building your machine learning
applications. To use WEKA effectively, you must have a sound knowledge of these algorithms, how they work,
which one to choose under what circumstances, what to look for in their processed output, and so on. In short,
you must have a solid foundation in machine learning to use WEKA effectively in building your apps.
In the upcoming chapters, you will study each tab in the explorer in depth.

Topic 2 : Create .arff files using Weka


WEKA supports a large number of file formats for the data. Here is the complete list −

 arff
 arff.gz
 bsi
 csv
 dat
 data
 json
 json.gz
 libsvm
 m
 names
 xrff
 xrff.gz
The types of files that it supports are listed in the drop-down list box at the bottom of the screen. This is shown
in the screenshot given below.

As you would notice it supports several formats including CSV and JSON. The default file type is Arff.

Arff Format
An Arff file contains two sections - header and data.

 The header describes the attribute types.


 The data section contains a comma separated list of data.
As an example for Arff format, the Weather data file loaded from the WEKA sample databases is shown below

From the screenshot, you can infer the following points −

 The @relation tag defines the name of the database.


 The @attribute tag defines the attributes.
 The @data tag starts the list of data rows each containing the comma separated fields.
 The attributes can take nominal values as in the case of outlook shown here −
@attribute outlook {sunny, overcast, rainy}
 The attributes can take real values as in this case −
@attribute temperature real
 You can also set a Target or a Class variable called play as shown here −
@attribute play {yes, no}
 The Target assumes two nominal values yes or no.

Other Formats
The Explorer can load the data in any of the earlier mentioned formats. As arff is the preferred format in WEKA,
you may load the data from any format and save it to arff format for later use. After preprocessing the data, just
save it to arff format for further analysis.
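As a rough illustration of this workflow outside the GUI, the sketch below uses WEKA's Java API to load a CSV file and save it in ARFF format. The file names data.csv and data.arff are placeholders, and weka.jar is assumed to be on the classpath.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load a CSV file (placeholder name) into a WEKA Instances object
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));
        Instances data = loader.getDataSet();

        // Save the same data in ARFF format for later use in the Explorer
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("data.arff"));
        saver.writeBatch();
    }
}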
Now that you have learned how to load data into WEKA, in the next chapter, you will learn how to preprocess
the data.
Topic 3 : Demonstration of Preprocessing the .arff file

Step 1: Loading the data. We can load a dataset into WEKA by clicking on the Open file button in the Preprocess
tab and selecting the appropriate file.

Step 2: Once the data is loaded, WEKA recognizes the attributes and, during the scan of the data, computes
some basic statistics on each attribute. The left panel in the figure shows the list of recognized attributes,
while the top panel indicates the names of the base relation (or table) and the current working relation
(which are the same initially).

Step 3: Clicking on an attribute in the left panel shows the basic statistics for that attribute. For
categorical attributes the frequency of each attribute value is shown, while for continuous attributes we
can obtain the minimum, maximum, mean, standard deviation, etc.

Step 4: The visualization panel at the bottom right shows a cross-tabulation across two attributes.
Note: we can select another attribute using the drop-down list.

Step 5: Selecting or filtering attributes. Removing an attribute: when we need to remove an attribute, we
can do this by using the attribute filters in WEKA. In the Filter panel, click on the Choose button. This will
show a popup window with a list of available filters. Scroll down the list and select the
"weka.filters.unsupervised.attribute.Remove" filter.

Step 6: a) Next, click the textbox immediately to the right of the Choose button. In the resulting dialog box,
enter the index of the attribute to be filtered out.

b) Make sure that the invertSelection option is set to false. Then click OK. In the filter box you will now see
"Remove -R 7".

c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new
working relation.

d) Save the new working relation as an ARFF file by clicking the Save button on the top (button) panel (student.arff).

Discretization

1) Sometimes association rule mining can only be performed on categorical data. This requires
performing discretization on numeric or continuous attributes. In the following example, let us discretize
the age attribute.

 Let us divide the values of the age attribute into three bins (intervals).
 First load the dataset into WEKA (student.arff).
 Select the age attribute.
 Activate the filter dialog box and select "weka.filters.unsupervised.attribute.Discretize" from the list.
 To change the defaults for the filter, click on the box immediately to the right of the Choose button.
 Enter the index of the attribute to be discretized. In this case the attribute is age, so we must enter '1', corresponding to the age attribute.
 Enter '3' as the number of bins. Leave the remaining field values as they are.
 Click the OK button.
 Click Apply in the filter panel. This will result in a new working relation with the selected attribute partitioned into 3 bins.
 Save the new working relation in a file called student-data-discretized.arff.
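The discretization steps above can also be scripted through WEKA's Java API. The following is a minimal sketch, assuming student.arff is in the working directory and that its age attribute is numeric (the Discretize filter only acts on numeric attributes):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeStudent {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff");

        // Discretize attribute 1 (age) into 3 equal-width bins, as in the GUI steps
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("1");
        discretize.setBins(3);
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);

        // Save the new working relation
        ArffSaver saver = new ArffSaver();
        saver.setInstances(discretized);
        saver.setFile(new File("student-data-discretized.arff"));
        saver.writeBatch();
    }
}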

Dataset student.arff

@relation student

@attribute age {<30, 30-40, >40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
>40, medium, yes, fair, yes
>40, medium, no, excellent, no

%
Topic 4 : Demonstrate performing association
rule mining on data sets.

Association Rule Mining is a method for identifying frequent patterns, correlations,


associations, or causal structures in data sets found in numerous databases such as relational
databases, transactional databases, and other types of data repositories.

Since most machine learning algorithms work with numerical datasets, they are mathematical in
nature. But, Association Rule Mining is appropriate for non-numeric, categorical data and
requires a little more than simple counting.

Given a set of transactions, the goal of association rule mining is to find the rules that allow us
to predict the occurrence of a specific item based on the occurrences of the other items in the
transaction.

An association rule consists of two parts:


 an antecedent (if) and
 a consequent (then)

An antecedent is something found in data, and a consequent is something located in


conjunction with the antecedent.

For a quick understanding, consider the following association rule:

“If a customer buys bread, he is 70% likely to also buy milk.”

Bread is the antecedent in the given association rule, and milk is the consequent.

Usage of Association Rule Mining

Association Rule Mining is described below in two parts:


 Association Rule Mining: Basic Definitions


 Association Rule Mining: Rule Evaluation Metrics

1) Association Rule Mining: Basic Definitions

Before defining the rules of Association Rule Mining, let us first have a look at the basic
definitions.

 Support Count(σ):  It accounts for the frequency of occurrence of an itemset.

Here σ({Milk, Bread, Diaper})=2 

 Frequent Itemset: It represents an itemset whose support is greater than or equal to the
minimum threshold.

 Association Rule: It represents an implication expression of the form X -> Y. Here X


and Y represent any 2 itemsets.
Example: {Milk, Diaper}->{Beer} 

2) Association Rule Mining: Rule Evaluation Metrics

The rule evaluation metrics used in Association Rule Mining are as follows:

 Support(s): It is the number of transactions that include items from both the {X} and {Y} parts
of the rule, as a percentage of the total number of transactions. It can be expressed as the
percentage of all transactions in which the group of items occurs together.

 Supp(X=>Y) = σ(X∪Y) ÷ (total number of transactions): It is the fraction of transactions that include both X and Y. 

 Confidence(c): This ratio compares the number of transactions that include all items in both
{X} and {Y} to the number of transactions that include all items in {X}.

 Conf(X=>Y) = Supp(X∪Y) ÷ Supp(X): It measures how often the items in Y
appear in transactions that also include the items in X.

 Lift(l): The lift of the rule X=>Y is the confidence of the rule divided by the expected
confidence. Here, it is assumed that the itemsets X and Y are independent of one
another, so the expected confidence is simply the frequency (support) of {Y}.

 Lift(X=>Y) = Conf(X=>Y) ÷ Supp(Y): Lift values near 1 indicate that X and Y almost


always appear together as expected. Lift values greater than 1 indicate that they appear
together more than expected, and lift values less than 1 indicate that they appear less
than expected. Greater lift values indicate a more powerful association.
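To make these metrics concrete, the short Java sketch below computes support, confidence and lift for the rule {Milk, Diaper} => {Beer} over a small hard-coded transaction list (chosen so that σ({Milk, Bread, Diaper}) = 2, as in the definition above); the transactions themselves are only illustrative.

import java.util.*;

public class RuleMetrics {
    public static void main(String[] args) {
        // Five illustrative transactions
        List<Set<String>> transactions = Arrays.asList(
            new HashSet<>(Arrays.asList("Bread", "Milk")),
            new HashSet<>(Arrays.asList("Bread", "Diaper", "Beer", "Eggs")),
            new HashSet<>(Arrays.asList("Milk", "Diaper", "Beer", "Coke")),
            new HashSet<>(Arrays.asList("Bread", "Milk", "Diaper", "Beer")),
            new HashSet<>(Arrays.asList("Bread", "Milk", "Diaper", "Coke")));

        Set<String> x = new HashSet<>(Arrays.asList("Milk", "Diaper"));
        Set<String> y = new HashSet<>(Arrays.asList("Beer"));
        Set<String> xy = new HashSet<>(x);
        xy.addAll(y);

        double n = transactions.size();
        double suppX  = count(transactions, x)  / n;   // Supp(X)
        double suppY  = count(transactions, y)  / n;   // Supp(Y)
        double suppXY = count(transactions, xy) / n;   // Supp(X ∪ Y) = support of the rule
        double confidence = suppXY / suppX;            // Conf(X => Y)
        double lift = confidence / suppY;              // Lift(X => Y)

        System.out.printf("support=%.2f confidence=%.2f lift=%.2f%n",
                          suppXY, confidence, lift);
    }

    // Number of transactions containing every item of the given itemset
    private static long count(List<Set<String>> transactions, Set<String> itemset) {
        return transactions.stream().filter(t -> t.containsAll(itemset)).count();
    }
}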

Applications of Association Rule Mining

Some of the applications of Association Rule Mining are as follows:

 Market-Basket Analysis
 Medical Diagnosis
 Census Data

1) Market-Basket Analysis

In most supermarkets, data is collected using barcode scanners. This database is called the
“market basket” database. It contains a large number of past transaction records. Every record
contains the name of all the items each customer purchases in one transaction. From this data,
the stores come to know the inclination and choices of items of the customers. And according to
this information, they decide the store layout and optimize the cataloging of different items.

A single record contains a list of all the items purchased by a customer in a single transaction.
Knowing which groups are inclined toward which set of items allows these stores to adjust the
store layout and catalog to place them optimally next to one another.
2) Medical Diagnosis

Association rules in medical diagnosis can help physicians diagnose and treat patients.
Diagnosis is a difficult process with many potential errors that can lead to unreliable results. You
can use relational association rule mining to determine the likelihood of illness based on various
factors and symptoms. This application can be further expanded using some learning
techniques on the basis of symptoms and their relationships in accordance with diseases.

3) Census Data

The concept of Association Rule Mining is also used in dealing with the massive amount of
census data. If properly aligned, this information can be used in planning efficient public
services and businesses. 


Algorithms of Association Rule Mining

Some of the algorithms which can be used to generate association rules are as follows:

 Apriori Algorithm
 Eclat Algorithm
 FP-Growth Algorithm

1) Apriori Algorithm

It delivers by characteristic the foremost frequent individual things within the information and
increasing them to larger and bigger item sets as long as those item sets seem ofttimes enough
within the information.
The common itemsets ensured by apriori also are accustomed make sure association rules that
highlight trends within the information. It counts the support of item sets employing a breadth-
first search strategy and a candidate generation perform that takes advantage of the downward
closure property of support.

2) Eclat Algorithm

Eclat denotes equivalence class transformation. The set intersection was supported by its
depth-first search formula. It’s applicable for each successive and parallel execution with spot-
magnifying properties. This can be the associate formula for frequent pattern mining supported
by the item set lattice’s depth-first search cross.

 It is a DFS cross of the prefix tree rather than a lattice.


 For stopping, the branch and a specific technique are used.

3) FP-growth Algorithm

This algorithm is also called a recurring pattern. The FP growth formula is used for locating
frequent item sets terribly dealings data but not for candidate generation.

This was primarily designed to compress the database that provides frequent sets and then
divides the compressed data into conditional database sets.

This conditional database is associated with a frequent set. Each database then undergoes the
process of data mining.

The data source is compressed using the FP-tree data structure.

This algorithm operates in two stages. These are as follows:

 FP-tree construction
 Extract frequently used item sets
Topic 5 : Demonstrate performing
classification on data sets
The 5 algorithms that we will review are:

1. Logistic Regression
2. Naive Bayes
3. Decision Tree
4. k-Nearest Neighbors
5. Support Vector Machines

These are 5 algorithms that you can try on your classification problem as a starting point.

A standard machine learning classification problem will be used to demonstrate each algorithm: the
Ionosphere binary classification problem. This is a good dataset for demonstrating classification algorithms because the
input variables are numeric and all have the same scale, and the problem has only two classes to discriminate.

Each instance describes the properties of radar returns from the atmosphere, and the task is to predict whether or not
there is structure in the ionosphere. There are 34 numerical input variables of generally the same scale. You
can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 98%
accuracy.

Start the Weka Explorer:

1. Open the Weka GUI Chooser.


2. Click the “Explorer” button to open the Weka Explorer.
3. Load the Ionosphere dataset from the data/ionosphere.arff file.
4. Click “Classify” to open the Classify tab.
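If you prefer to script the experiment, the sketch below is a minimal Java counterpart of these steps, assuming data/ionosphere.arff is available locally and weka.jar is on the classpath. It runs 10-fold cross-validation for the Logistic classifier described next; any of the other classifiers covered below can be swapped in.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IonosphereClassify {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute as the class
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of logistic regression
        Logistic classifier = new Logistic();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}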

Logistic Regression
Logistic regression is a binary classification algorithm.

It assumes the input variables are numeric and have a Gaussian (bell curve) distribution. This last point does not

have to be true, as logistic regression can still achieve good results if your data is not Gaussian. In the case of the

Ionosphere dataset, some input attributes have a Gaussian-like distribution, but many do not.

The algorithm learns a coefficient for each input value, which are linearly combined into a regression function and

transformed using a logistic (s-shaped) function. Logistic regression is a fast and simple technique, but can be very

effective on some problems.


The logistic regression only supports binary classification problems, although the Weka implementation has been

adapted to support multi-class classification problems.

Choose the logistic regression algorithm:

1. Click the “Choose” button and select “Logistic” under the “functions” group.
2. Click on the name of the algorithm to review the algorithm configuration.

Weka Configuration for the Logistic Regression Algorithm

The algorithm can run for a fixed number of iterations (maxIts), but by default will run until it is estimated that the

algorithm has converged.

The implementation uses a ridge estimator which is a type of regularization. This method seeks to simplify the model

during training by minimizing the coefficients learned by the model. The ridge parameter defines how much pressure

to put on the algorithm to reduce the size of the coefficients. Setting this to 0 will turn off this regularization.

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.
You can see that with the default configuration logistic regression achieves an accuracy of 88%.

Weka Classification Results for the Logistic Regression Algorithm

Naive Bayes
Naive Bayes is a classification algorithm. Traditionally it assumes that the input values are nominal, although
numerical inputs are supported by assuming a distribution.

Naive Bayes uses a simple implementation of Bayes' Theorem (hence "naive") where the prior probability for each
class is calculated from the training data and the input attributes are assumed to be independent of each other
(technically, conditionally independent given the class).

This is an unrealistic assumption because we expect the variables to interact and be dependent, although this

assumption makes the probabilities fast and easy to calculate. Even under this unrealistic assumption, Naive Bayes

has been shown to be a very effective classification algorithm.


Naive Bayes calculates the posterior probability for each class and makes a prediction for the class with the highest

probability. As such, it supports both binary classification and multi-class classification problems.

Choose the Naive Bayes algorithm:

1. Click the “Choose” button and select “NaiveBayes” under the “bayes” group.
2. Click on the name of the algorithm to review the algorithm configuration.

Weka Configuration for the Naive Bayes Algorithm

By default a Gaussian distribution is assumed for each numerical attribute.

You can change the algorithm to use a kernel estimator with the useKernelEstimator argument that may better match

the actual distribution of the attributes in your dataset. Alternately, you can automatically convert numerical attributes

to nominal attributes with the useSupervisedDiscretization parameter.
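A minimal sketch of setting these two options through the Java API is shown below; it reuses the same cross-validation pattern as the earlier Logistic sketch and assumes the same local copy of the Ionosphere data.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IonosphereNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.setUseKernelEstimator(true);            // kernel estimator instead of a single Gaussian
        // nb.setUseSupervisedDiscretization(true); // alternative: discretize numeric attributes

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.printf("Naive Bayes accuracy: %.2f%%%n", eval.pctCorrect());
    }
}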

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration Naive Bayes achieves an accuracy of 82%.
Weka Classification Results for the Naive Bayes Algorithm

There are a number of other flavors of naive bayes algorithms that you could work with.

Decision Tree
Decision trees can support classification and regression problems.

Decision trees are more recently referred to as Classification And Regression Trees (CART). They work by creating a
tree to evaluate an instance of data, starting at the root of the tree and moving down to the leaves until a
prediction can be made. The process of creating a decision tree works by greedily selecting the best split point in
order to make predictions and repeating the process until the tree reaches a fixed depth.

After the tree is constructed, it is pruned in order to improve the model’s ability to generalize to new data.

Choose the decision tree algorithm:


1. Click the “Choose” button and select “REPTree” under the “trees” group.
2. Click on the name of the algorithm to review the algorithm configuration.

Weka Configuration for the Decision Tree Algorithm

The depth of the tree is defined automatically, but a depth can be specified via the maxDepth attribute.

You can also choose to turn off pruning by setting the noPruning parameter to True, although this may result in worse
performance.
The minNum parameter defines the minimum number of instances supported by the tree in a leaf node when

constructing the tree from the training data.
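The same three parameters can be set programmatically. The minimal sketch below uses the same data path assumption as the earlier examples; the parameter values shown simply restate the defaults described above.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IonosphereTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        REPTree tree = new REPTree();
        tree.setMaxDepth(-1);      // -1 lets the depth be chosen automatically
        tree.setNoPruning(false);  // keep pruning enabled
        tree.setMinNum(2.0);       // minimum number of instances in a leaf

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.printf("REPTree accuracy: %.2f%%%n", eval.pctCorrect());
    }
}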

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration the decision tree algorithm achieves an accuracy of 89%.

Weka Classification Results for the Decision Tree Algorithm

Another more advanced decision tree algorithm that you can use is the C4.5 algorithm, called J48 in Weka.
You can review a visualization of a decision tree prepared on the entire training data set by right clicking on the

“Result list” and clicking “Visualize Tree”.

Weka Visualization of a Decision Tree

k-Nearest Neighbors
The k-nearest neighbors algorithm supports both classification and regression. It is also called kNN for short.

It works by storing the entire training dataset and querying it to locate the k most similar training patterns when

making a prediction. As such, there is no model other than the raw training dataset and the only computation

performed is the querying of the training dataset when a prediction is requested.

It is a simple algorithm, but one that does not assume very much about the problem other than that the distance

between data instances is meaningful in making predictions. As such, it often achieves very good performance.

When making predictions on classification problems, KNN will take the mode (most common class) of the k most

similar instances in the training dataset.

Choose the k-Nearest Neighbors algorithm:


1. Click the “Choose” button and select “IBk” under the “lazy” group.
2. Click on the name of the algorithm to review the algorithm configuration.

Weka Configuration for the k-Nearest Neighbors Algorithm

The size of the neighborhood is controlled by the k parameter.

For example, if k is set to 1, then predictions are made using the single most similar training instance to a given new

pattern for which a prediction is requested. Common values for k are 3, 7, 11 and 21, larger for larger dataset sizes.

Weka can automatically discover a good value for k using cross validation inside the algorithm by setting the

crossValidate parameter to True.

Another important parameter is the distance measure used. This is configured in the

nearestNeighbourSearchAlgorithm which controls the way in which the training data is stored and searched.
The default is a LinearNNSearch. Clicking the name of this search algorithm will provide another configuration

window where you can choose a distanceFunction parameter. By default, Euclidean distance is used to calculate the

distance between instances, which is good for numerical data with the same scale. Manhattan distance is good to

use if your attributes differ in measures or type.

Weka Configuration for the Search Algorithm in the k-Nearest Neighbors Algorithm

It is a good idea to try a suite of different k values and distance measures on your problem and see what works best.
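A minimal sketch of configuring IBk with a chosen k, internal cross-validation and a Manhattan distance function is shown below; the data path is the same assumption as before, and the specific k and distance function are only illustrative choices.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.ManhattanDistance;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class IonosphereKnn {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        knn.setKNN(7);                 // neighborhood size k
        knn.setCrossValidate(true);    // let Weka pick a good k up to the value above

        // Use Manhattan distance instead of the default Euclidean distance
        LinearNNSearch search = new LinearNNSearch();
        search.setDistanceFunction(new ManhattanDistance());
        knn.setNearestNeighbourSearchAlgorithm(search);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.printf("kNN accuracy: %.2f%%%n", eval.pctCorrect());
    }
}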

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration the kNN algorithm achieves an accuracy of 86%.
Weka Classification Results for k-Nearest Neighbors

Support Vector Machines


Support Vector Machines were developed for binary classification problems, although extensions to the technique

have been made to support multi-class classification and regression problems. The algorithm is often referred to as

SVM for short.

SVM was developed for numerical input variables, although it will automatically convert nominal values to numerical
values. Input data is also normalized before being used.

SVM works by finding a line that best separates the data into the two groups. This is done using an optimization
process that only considers those data instances in the training dataset that are closest to the line that best separates
the classes. These instances are called support vectors, hence the name of the technique.

In almost all problems of interest, a line cannot be drawn to neatly separate the classes, therefore a margin is added

around the line to relax the constraint, allowing some instances to be misclassified but allowing a better result overall.
Finally, few datasets can be separated with just a straight line. Sometimes a line with curves or even polygonal

regions need to be marked out. This is achieved with SVM by projecting the data into a higher dimensional space in

order to draw the lines and make predictions. Different kernels can be used to control the projection and the amount

of flexibility in separating the classes.

Choose the SVM algorithm:

1. Click the “Choose” button and select “SMO” under the “function” group.
2. Click on the name of the algorithm to review the algorithm configuration.

SMO refers to the specific efficient optimization algorithm used inside the SVM implementation, which stands for

Sequential Minimal Optimization.


Weka Configuration for the Support Vector Machines Algorithm

The C parameter, called the complexity parameter in Weka, controls how flexible the process for drawing the line to
separate the classes can be. A value of 0 allows no violations of the margin, whereas the default is 1.
A key parameter in SVM is the type of Kernel to use. The simplest kernel is a Linear kernel that separates data with a

straight line or hyperplane. The default in Weka is a Polynomial Kernel that will separate the classes using a curved

or wiggly line, the higher the polynomial, the more wiggly (the exponent value).

A popular and powerful kernel is the RBF Kernel or Radial Basis Function Kernel that is capable of learning closed

polygons and complex shapes to separate the classes.

It is a good idea to try a suite of different kernels and C (complexity) values on your problem and see what works

best.
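A minimal sketch of trying one such combination (an RBF kernel with a chosen C) through the Java API is shown below; the C and gamma values are illustrative only, and the data path is the same assumption as in the earlier sketches.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IonosphereSvm {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();
        svm.setC(1.0);                    // complexity parameter C
        RBFKernel kernel = new RBFKernel();
        kernel.setGamma(0.01);            // kernel width, chosen only for illustration
        svm.setKernel(kernel);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.printf("SVM accuracy: %.2f%%%n", eval.pctCorrect());
    }
}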

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration the SVM algorithm achieves an accuracy of 88%.

Weka Classification Results for the Support Vector Machine Algorithm


Topic 6 : Demonstrate performing clustering
on data sets.
We are going to use k-means. Broadly speaking, k-means clustering splits n observations into k clusters in
which each observation belongs to the cluster with the nearest mean. The mean is
used as a prototype of the cluster.

1. Clustering. Open the Weka Explorer environment and load the training file using the Preprocess mode. Try first
with weather.arff. Get to the Cluster mode (by clicking on the Cluster tab) and
select a clustering algorithm, for example SimpleKMeans. Then click on Start and you get the
clustering result in the output window. The actual clustering for this algorithm is shown as one
instance for each cluster, representing the cluster centroid.

2. Evaluation. The way Weka evaluates the clusterings depends on the cluster mode you select.
Four different cluster modes are available (as buttons in the Cluster mode panel):

a. Use training set (default). After generating the clustering, Weka classifies the training instances into
clusters according to the cluster representation and computes the percentage of instances
falling in each cluster. For example, the above clustering produced by k-means shows
64% (9 instances) in cluster 0 and 36% (5 instances) in cluster 1.

b. Supplied test set and Percentage split. Weka can evaluate clusterings on separate test data if the cluster
representation is probabilistic (e.g. for EM).

c. Classes to clusters evaluation. In this mode Weka first ignores the class attribute and generates the
clustering. Then during the test phase it assigns classes to the clusters, based on the
majority value of the class attribute within each cluster. Then it computes the classification error,
based on this assignment, and also shows the corresponding confusion matrix.

Visualize the cluster assignments. To do this, right-click on the clusterer in the Result list panel and select
Visualize cluster assignments. Plot Class against Cluster. All the data points will lie on top of each
other, so increase the Jitter slide bar to about half way to add random noise to each point. This
allows us to see more clearly where the bulk of the datapoints lies. In this scatter plot each row
represents a class and each column a cluster. You can save the clustering results by clicking the
Save button on the Visualization panel. The results are saved in a .arff file. You can use Weka
to open it and view the results.

We are familiar with the spam dataset by now. To load the spam dataset (available from Lab 4)
select the preprocess tab and then select Open file ....

2. Go to the Cluster tab: Select the SimpleKMeans clusterer, bring up its options window and
set numClusters to 2.

3. In the Cluster mode panel, select Classes to clusters evaluation and hit Start. This option
evaluates clusters with respect to a class. More specifically, in the mode Classes to clusters
evaluation Weka first ignores the class attribute and generates the clustering. Then during the
test phase it assigns classes to the clusters, based on the majority value of the class attribute
within each cluster. Then it computes the classification error, based on this assignment and also
shows the corresponding confusion matrix.

4. Ideally, we would hope to see all instances from a single class assigned to a single cluster,
and no instances from different classes assigned to the same cluster.

5. Look at the Classes to Clusters confusion matrix. Clearly, we don't have a perfect
correspondence between classes and clusters. a. How successful has the clustering been in
this regard? b. Looking at each class individually, can you spot the particular class that is well
identified by the clustering? Classes that are poorly identified? c. Which classes are mostly
confused with each other? d. Compare with the performance of the supervised classifiers we
used during the last lab session. Visualize the cluster assignments. To do this, right-click on the
clusterer in the Result list panel and select Visualize cluster assignments. Plot Class against
Cluster. All the data points will lie on top of each other, so increase the Jitter slide bar to about
half way to add random noise to each point. This allows us to see more clearly where the bulk of
the datapoints lies. In this scatter plot each row represents a class and each column a cluster.

6. Can you draw any conclusions based on this visualization?

Hierarchical clustering. HierarchicalClusterer implements agglomerative (bottom-up) generation of hierarchical
clusters. Several different link types, which are ways of measuring the distance between clusters, are
available as options.

2. Since the hierarchical clustering algorithm builds a tree for the whole dataset, let's practice this algorithm
on a smaller dataset due to the memory space limitation. Open the glass.arff dataset used in the
previous lab and first normalize all the numeric values in the dataset into [0,1]. Then choose the
HierarchicalClusterer clusterer. Since this dataset has 6 classes, we set numClusters to 6. To save
time, we set printNewick to False. Choose link_type=COMPLETE and run the algorithm.

3. Since the performance of hierarchical clustering is not good, we can run the k-means algorithm on
the same dataset and compare their performance. Do not forget to set the number of clusters to 6.

4. Save the clustering results as a glass_kmeans_result.arff file, i.e. right-click on the k-means
entry, choose Visualize cluster assignments, then save with the name above. Reopen this
dataset with Weka. Since the clustering results are saved as the last column, they are considered
as class labels for the dataset. The original class labels are considered as a feature of the
dataset. If you trust the clustering, you might want to use this newly created dataset for
supervised learning.
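As a rough programmatic counterpart to these steps, the sketch below removes the class attribute, runs SimpleKMeans with two clusters, and prints the centroids and cluster statistics; the file name spam.arff is a placeholder for whatever copy of the spam dataset you are using.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("spam.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Ignore the class attribute when building the clusters
        Remove remove = new Remove();
        remove.setAttributeIndices(String.valueOf(data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances train = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(train);
        System.out.println(kmeans.getClusterCentroids());

        // Percentage of instances falling in each cluster
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(train);
        System.out.println(eval.clusterResultsToString());
    }
}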
Topic 7 : Demonstrate performing Regression
on Dataset
Regression is a supervised learning technique which helps in finding the correlation between variables and
enables us to predict the continuous output variable based on the one or more predictor variables. It is mainly
used for prediction, forecasting, time series modeling, and determining the causal-effect relationship
between variables.

In Regression, we plot a graph between the variables which best fits the given datapoints, using this plot, the
machine learning model can make predictions about the data. In simple words, "Regression shows a line or
curve that passes through all the datapoints on target-predictor graph in such a way that the vertical
distance between the datapoints and the regression line is minimum." The distance between datapoints
and line tells whether a model has captured a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Types of Regression

There are various types of regressions which are used in data science and machine learning. Each type has its
own importance on different scenarios, but at the core, all the regression methods analyze the effect of the
independent variable on dependent variables. Here we are discussing some important types of regression
which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the relationship
between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear regression.
And if there is more than one input variable, then such linear regression is called multiple linear
regression.
o The relationship between variables in the linear regression model can be explained using the below
image. Here we are predicting the salary of an employee on the basis of the year of experience.

o Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variables (target variables),


X= Independent variables (predictor variables),
a and b are the linear coefficients
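In WEKA, a linear model of this form can be fitted with the LinearRegression classifier on any dataset whose class attribute is numeric. A minimal sketch is shown below; data/cpu.arff refers to one of the numeric sample datasets shipped with WEKA, and the path is an assumption to adjust for your installation.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/cpu.arff");  // any dataset with a numeric class
        data.setClassIndex(data.numAttributes() - 1);

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr);   // prints the learned coefficients of the linear model

        // Estimate predictive performance with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
        System.out.println("Correlation: " + eval.correlationCoefficient());
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}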

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the classification
problems. In classification problems, we have dependent variables in a binary or discrete format such
as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or
False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear regression algorithm in the
term how they are used.
o Logistic regression uses the sigmoid function (logistic function), which is a non-linear cost function. This
sigmoid function is used to model the data in logistic regression. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value.
o x = input to the function.
o e = base of the natural logarithm.

When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and
values below the threshold level are rounded down to 0.

There are three types of logistic regression:

o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear fashion, so
for such case, linear regression will not best fit to those datapoints. To cover such datapoints, we need
Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features of
given degree and then modeled using a linear model. Which means the datapoints are best fitted
using a polynomial line.

o The equation for polynomial regression is also derived from the linear regression equation: the linear
regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² +
b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.
o The model is still linear because the coefficients are still linear; only the input feature is raised to
quadratic and higher powers.

Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for regression as well as
classification problems. So if we use it for regression problems, then it is termed as Support Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous variables. Below are some
keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map lower-dimensional data into a higher-dimensional space.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line which
helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin for
datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and
opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum number of
datapoints are covered in that margin. The main goal of SVR is to consider the maximum datapoints within
the boundary lines and the hyperplane (best-fit line) must contain a maximum number of datapoints .
Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
Topic 8 : Demonstration of association rule
mining.
Aim: This experiment illustrates some of the basic elements of
association rule mining using WEKA. The sample dataset used for
this example is contactlenses.arff.

Step 1: Open the data file in Weka Explorer. It is presumed that the
required data fields have been discretized. In this example it is the age
attribute.

Step 2: Clicking on the Associate tab will bring up the interface for
the association rule algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (for example support,
confidence, etc.) we click on the text box immediately to the right of
the Choose button.
Dataset contactlenses.arff

The following screenshot shows the association rules that were generated when
the Apriori algorithm is applied on the given dataset.
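The same run can be reproduced through WEKA's Java API. A minimal sketch is given below, assuming contactlenses.arff is in the working directory; the support, confidence and rule-count settings shown mirror Apriori's usual defaults and can be changed just as in Step 4.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contactlenses.arff");  // path as used in this experiment

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);              // number of rules to report
        apriori.setLowerBoundMinSupport(0.1); // minimum support
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.buildAssociations(data);

        System.out.println(apriori);          // prints the discovered association rules
    }
}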
Topic 9 : Perform classification using
Bayesian classification algorithm.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of
probabilities −

 Posterior Probability [P(H/X)]


 Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H/X)= P(X/H)P(H) / P(X)

Bayesian Belief Network


Bayesian Belief Networks specify joint conditional probability distributions.
They are also known as Belief Networks, Bayesian Networks, or Probabilistic
Networks.
 A Belief Network allows class conditional independencies to be defined
between subsets of variables.
 It provides a graphical model of causal relationship on which learning
can be performed.
 We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −

 Directed acyclic graph


 A set of conditional probability tables

Directed Acyclic Graph


 Each node in a directed acyclic graph represents a random variable.
 These variable may be discrete or continuous valued.
 These variables may correspond to the actual attribute given in the
data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean
variables.

Conditional Probability Table


The conditional probability table for the values of the variable LungCancer
(LC) showing each possible combination of the values of its parent nodes,
FamilyHistory (FH), and Smoker (S) is as follows −

Demonstration of the classification process on the dataset employee.arff using the naïve Bayes
algorithm

Aim: This experiment illustrates the use of the naïve Bayes classifier in WEKA. The sample
data set used in this experiment is the “employee” data available in ARFF format. This
document assumes that appropriate data preprocessing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.

Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “NaiveBayes” classifier (under the “bayes” group).

Step 3: Now we specify the various parameters. These can be specified by clicking in
the text box to the right of the Choose button. In this example, we accept the default
values.

Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we don’t have a separate evaluation data set,
this is necessary to get a reasonable idea of the accuracy of the generated model.

Step 5: We now click “Start” to generate the model. The model summary as
well as the evaluation statistics will appear in the right panel when the model construction
is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting better parameters for
the classification).

Step 7: WEKA also lets us view a graphical version of the classification model (for tree-based
classifiers). This can be done by right-clicking the last result set and selecting “Visualize tree” from the
pop-up menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under “Test options”, click the “Supplied test set” radio button
and then click the “Set” button. This will show a pop-up window which will allow you
to open the file containing the test instances.
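A compact programmatic version of this experiment is sketched below; employee.arff is the same file loaded in Step 1, the class is assumed to be the last attribute, and 10-fold cross-validation matches the option chosen in Step 4.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EmployeeNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("employee.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class is assumed to be the last attribute

        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());     // confusion matrix
    }
}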
Topic. 10: Perform the cluster analysis by k-means
method.

K-Means Clustering-
 

 K-Means clustering is an unsupervised iterative clustering technique.


 It partitions the given data set into k predefined distinct clusters.
 A cluster is defined as a collection of data points exhibiting certain similarities.
 

it partitions the data set such that-

 Each data point belongs to a cluster with the nearest mean.


 Data points belonging to one cluster have high degree of similarity.
 Data points belonging to different clusters have high degree of dissimilarity.
 

K-Means Clustering Algorithm-


 
K-Means Clustering Algorithm involves the following steps-

Step-01:
 

 Choose the number of clusters K.


 

Step-02:
 Randomly select any K data points as cluster centers.

 Select cluster centers in such a way that they are as far apart from each other as possible.
 

Step-03:
 

 Calculate the distance between each data point and each cluster center.
 The distance may be calculated either by using given distance function or by using euclidean
distance formula.
 

Step-04:
 

 Assign each data point to some cluster.


 A data point is assigned to that cluster whose center is nearest to that data point.
 

Step-05:
 

 Re-compute the center of newly formed clusters.


 The center of a cluster is computed by taking mean of all the data points contained in that
cluster.
 

Step-06:
 

Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria
is met-

 Centers of newly formed clusters do not change


 Data points remain present in the same cluster
 Maximum number of iterations are reached
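The short Java sketch below is a teaching-level implementation of these steps using the Manhattan distance function defined in Problem-01 further down; the points and initial centers are taken from that problem, and running it reproduces the centers obtained after the second iteration there.

import java.util.Arrays;

public class KMeansManhattan {
    public static void main(String[] args) {
        double[][] points = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}}; // A1..A8
        double[][] centers = {{2,10},{5,8},{1,2}};   // initial centers A1, A4, A7
        int k = centers.length;

        for (int iter = 1; iter <= 2; iter++) {
            // Step-03/04: assign every point to its nearest center
            int[] assign = new int[points.length];
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[i], centers[c]) < dist(points[i], centers[best])) best = c;
                assign[i] = best;
            }
            // Step-05: re-compute each center as the mean of its assigned points
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0; int n = 0;
                for (int i = 0; i < points.length; i++)
                    if (assign[i] == c) { sx += points[i][0]; sy += points[i][1]; n++; }
                if (n > 0) centers[c] = new double[]{sx / n, sy / n};
            }
            System.out.println("After iteration " + iter + ": " + Arrays.deepToString(centers));
        }
    }

    // Manhattan distance |x2 - x1| + |y2 - y1|
    static double dist(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }
}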
Advantages-
 

K-Means Clustering Algorithm offers the following advantages-

Point-01:
 

It is relatively efficient with time complexity O(nkt) where-

 n = number of instances
 k = number of clusters
 t = number of iterations

Disadvantages-
 
K-Means Clustering Algorithm has the following disadvantages-
 It requires specifying the number of clusters (k) in advance.
 It can not handle noisy data and outliers.
 It is not suitable to identify clusters with non-convex shapes.
 

PRACTICE PROBLEMS BASED ON K-MEANS CLUSTERING ALGORITHM-


 

Problem-01:
 
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
 
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
 
Use K-Means Algorithm to find the three cluster centers after the second iteration.
 

Solution-
 
We follow the above discussed K-Means Clustering Algorithm-
 
Iteration-01:
 
 We calculate the distance of each point from each of the center of the three clusters.
 The distance is calculated by using the given distance function.
 
The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-
 
Calculating Distance Between A1(2, 10) and C1(2, 10)-
 
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
 
Calculating Distance Between A1(2, 10) and C2(5, 8)-
 
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
 
Calculating Distance Between A1(2, 10) and C3(1, 2)-
 
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
 
In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.
 
Next,
 We draw a table showing all the results.
 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.
 

Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2

From here, New clusters are-

Cluster-01:
 

First cluster contains points-

 A1(2, 10)
 

Cluster-02:
 

Second cluster contains points-

 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)
 A8(4, 9)
 

Cluster-03:
 

Third cluster contains points-

 A2(2, 5)
 A7(1, 2)
 

Now,
 We re-compute the centers of the new clusters.
 The new cluster center is computed by taking mean of all the points contained in that cluster.
 

For Cluster-01:
 

 We have only one point A1(2, 10) in Cluster-01.


 So, cluster center remains the same.
 

For Cluster-02:
 

Center of Cluster-02

= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)

= (6, 6)

For Cluster-03:
 

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This completes Iteration-01.

Iteration-02:
 

 We calculate the distance of each point from each of the center of the three clusters.
 The distance is calculated by using the given distance function.
 

The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-

Calculating Distance Between A1(2, 10) and C1(2, 10)-


 
Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0

Calculating Distance Between A1(2, 10) and C2(6, 6)-


 

Ρ(A1, C2)

= |x2 – x1| + |y2 – y1|

= |6 – 2| + |6 – 10|

=4+4

=8

Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-


 

Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1.5 – 2| + |3.5 – 10|

= 0.5 + 6.5

=7

In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.

Next,

 We draw a table showing all the results.


 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.
 
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (6, 6) of Cluster-02 | Distance from center (1.5, 3.5) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 8 | 7 | C1
A2(2, 5) | 5 | 5 | 2 | C3
A3(8, 4) | 12 | 4 | 7 | C2
A4(5, 8) | 5 | 3 | 8 | C2
A5(7, 5) | 10 | 2 | 7 | C2
A6(6, 4) | 10 | 2 | 5 | C2
A7(1, 2) | 9 | 9 | 2 | C3
A8(4, 9) | 3 | 5 | 8 | C1

From here, New clusters are-

Cluster-01:
 

First cluster contains points-

 A1(2, 10)
 A8(4, 9)
 

Cluster-02:
 
Second cluster contains points-

 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)
 

Cluster-03:
 

Third cluster contains points-

 A2(2, 5)
 A7(1, 2)
 

Now,

 We re-compute the centers of the new clusters.


 The new cluster center is computed by taking mean of all the points contained in that cluster.
 

For Cluster-01:
 

Center of Cluster-01

= ((2 + 4)/2, (10 + 9)/2)

= (3, 9.5)

For Cluster-02:
 

Center of Cluster-02

= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)

= (6.5, 5.25)

For Cluster-03:
 

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)


= (1.5, 3.5)

This completes Iteration-02.

After second iteration, the center of the three clusters are-

 C1(3, 9.5)
 C2(6.5, 5.25)
 C3(1.5, 3.5)
 

Problem-02:
 

Use the K-Means Algorithm to create two clusters from the following five points: A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5).

Solution-
 

We follow the above discussed K-Means Clustering Algorithm.

Assume A(2, 2) and C(1, 1) are centers of the two clusters.

Iteration-01:
 
 We calculate the distance of each point from each of the center of the two clusters.
 The distance is calculated by using the euclidean distance formula.
 

The following illustration shows the calculation of distance between point A(2, 2) and each of the
center of the two clusters-

Calculating Distance Between A(2, 2) and C1(2, 2)-


 

Ρ(A, C1)

= sqrt [ (x2 – x1)² + (y2 – y1)² ]

= sqrt [ (2 – 2)² + (2 – 2)² ]

= sqrt [ 0 + 0 ]

=0

Calculating Distance Between A(2, 2) and C2(1, 1)-


 

Ρ(A, C2)

= sqrt [ (x2 – x1)² + (y2 – y1)² ]

= sqrt [ (1 – 2)² + (1 – 2)² ]

= sqrt [ 1 + 1 ]

= sqrt [ 2 ]

= 1.41

In the similar manner, we calculate the distance of other points from each of the center of the
two clusters.

Next,

 We draw a table showing all the results.


 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.
 
Given Points | Distance from center (2, 2) of Cluster-01 | Distance from center (1, 1) of Cluster-02 | Point belongs to Cluster
A(2, 2) | 0 | 1.41 | C1
B(3, 2) | 1 | 2.24 | C1
C(1, 1) | 1.41 | 0 | C2
D(3, 1) | 1.41 | 2 | C1
E(1.5, 0.5) | 1.58 | 0.71 | C2

From here, New clusters are-

Cluster-01:
 

First cluster contains points-

 A(2, 2)
 B(3, 2)
 D(3, 1)
 

Cluster-02:
 

Second cluster contains points-

 C(1, 1)
 E(1.5, 0.5)
 

Now,

 We re-compute the centers of the new clusters.


 The new cluster center is computed by taking mean of all the points contained in that cluster.
 

For Cluster-01:
 

Center of Cluster-01

= ((2 + 3 + 3)/3, (2 + 2 + 1)/3)

= (2.67, 1.67)

For Cluster-02:
 

Center of Cluster-02

= ((1 + 1.5)/2, (1 + 0.5)/2)

= (1.25, 0.75)
