If you observe the flow in the image from the beginning, you will see that there are several stages in preparing Big Data to make it suitable for machine learning −
First, you start with the raw data collected from the field. This data may contain several null values and irrelevant fields. You use the data preprocessing tools provided in WEKA to cleanse the data.
Then, you save the preprocessed data in your local storage for applying ML algorithms.
Next, depending on the kind of ML model that you are trying to develop, you select one of the options such as Classify, Cluster, or Associate. The Attributes Selection option allows the automatic selection of features to create a reduced dataset.
Note that under each category, WEKA provides the implementation of several algorithms. You would select an
algorithm of your choice, set the desired parameters and run it on the dataset.
Then, WEKA gives you the statistical output of the model processing. It also provides a visualization tool to inspect the data.
Various models can be applied to the same dataset. You can then compare the outputs of the different models and select the one that best meets your purpose.
Thus, the use of WEKA results in a quicker development of machine learning models on the whole.
This is the WEKA GUI Chooser.
The GUI Chooser application allows you to run five different types of applications as listed here −
Explorer
Experimenter
KnowledgeFlow
Workbench
Simple CLI
The Explorer application provides the following tabs −
Preprocess
Classify
Cluster
Associate
Select Attributes
Visualize
Under these tabs, there are several pre-implemented machine learning algorithms. Let us look into each of
them in detail now.
Preprocess Tab
Initially as you open the explorer, only the Preprocess tab is enabled. The first step in machine learning is to
preprocess the data. Thus, in the Preprocess option, you will select the data file, process it and make it fit for
applying the various machine learning algorithms.
Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your data. To list a
few, you may apply algorithms such as Linear Regression, Logistic Regression, Support Vector Machines,
Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is very exhaustive and provides
both supervised and unsupervised machine learning algorithms.
Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided - such as SimpleKMeans,
FilteredClusterer, HierarchicalClusterer, and so on.
Associate Tab
Under the Associate tab, you would find Apriori, FilteredAssociator and FPGrowth.
Visualize Tab
Lastly, the Visualize option allows you to visualize your processed data for analysis.
As you noticed, WEKA provides several ready-to-use algorithms for testing and building your machine learning
applications. To use WEKA effectively, you must have a sound knowledge of these algorithms, how they work,
which one to choose under what circumstances, what to look for in their processed output, and so on. In short,
you must have a solid foundation in machine learning to use WEKA effectively in building your apps.
In the upcoming chapters, you will study each tab in the explorer in depth.
WEKA can load data files in the following formats −
arff
arff.gz
bsi
csv
dat
data
json
json.gz
libsvm
m
names
xrff
xrff.gz
The types of files that WEKA supports are listed in the drop-down list box at the bottom of the Open file screen; they are the formats shown in the list above.
As you would notice, it supports several formats, including CSV and JSON. The default file type is Arff.
Arff Format
An Arff file contains two sections - header and data.
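For illustration, a minimal ARFF file could look like this (the relation name, attributes and data rows below are made-up for the example, not a dataset shipped with WEKA):

% Lines starting with % are comments
@relation weather_example

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes

Everything up to and including the @attribute lines is the header section, which names the relation and declares each attribute and its type; everything after @data is the data section, with one comma-separated instance per line.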
Other Formats
The Explorer can load the data in any of the earlier mentioned formats. As arff is the preferred format in WEKA,
you may load the data from any format and save it to arff format for later use. After preprocessing the data, just
save it to arff format for further analysis.
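If you prefer to script this conversion instead of using the Explorer, the following Java sketch uses WEKA's converter classes to load a CSV file and save it as ARFF. The file names data.csv and data.arff are placeholders, not files referenced elsewhere in this manual.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // load the CSV file into a WEKA Instances object
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));
        Instances data = loader.getDataSet();

        // save the same instances in ARFF format for later use
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("data.arff"));
        saver.writeBatch();
    }
}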
Now that you have learned how to load data into WEKA, in the next chapter, you will learn how to preprocess
the data.
Topic 3 : Demonstration of Preprocessing the .arff file
Step 1: Loading the data. We can load the dataset into WEKA by clicking on the Open file button in the Preprocess interface and selecting the appropriate file.
Step 2: Once the data is loaded, WEKA recognizes the attributes and, during the scan of the data, computes some basic statistics on each attribute. The left panel shows the list of recognized attributes, while the top panel indicates the names of the base relation (table) and the current working relation (which are the same initially).
Step 3: Clicking on an attribute in the left panel shows the basic statistics for that attribute. For categorical attributes, the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc.
Step 4: The visualization panel at the bottom right shows the data as a cross-tabulation across two attributes.
Note: we can select another attribute using the drop-down list.
Step 5: To remove an attribute, click Choose in the Filter panel and select the Remove filter (under unsupervised > attribute).
Step 6: a) Next, click the text box immediately to the right of the Choose button. In the resulting dialog box, enter the index of the attribute to be filtered out.
b) Make sure that the invertSelection option is set to false. Then click OK. Now in the filter box you will see "Remove -R 7".
c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new working relation.
d) Save the new working relation as an ARFF file by clicking the Save button on the top (button) panel (student.arff). A code sketch of this attribute-removal step is shown below.
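The same attribute removal can be done with WEKA's Java API. The sketch below assumes the data has been saved as student.arff and that the attribute to drop is the 7th one, matching the Remove -R 7 setting above; the output file name student_removed.arff is a placeholder.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff");

        // configure the Remove filter: drop attribute number 7 (1-based index)
        Remove remove = new Remove();
        remove.setAttributeIndices("7");
        remove.setInvertSelection(false);
        remove.setInputFormat(data);

        // apply the filter to obtain the new working relation
        Instances newData = Filter.useFilter(data, remove);

        // save the reduced dataset as an ARFF file
        DataSink.write("student_removed.arff", newData);
    }
}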
Discretization
1) Sometimes association rule mining can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. In the following example, let us discretize the age attribute.
o Let us divide the values of the age attribute into three bins (intervals).
o First, load the dataset into WEKA (student.arff).
o Choose the Discretize filter (under unsupervised > attribute).
o To change the defaults for the filter, click on the box immediately to the right of the Choose button.
o We enter the index of the attribute to be discretized. In this case the attribute is age, so we must enter '1', corresponding to the age attribute.
o Enter '3' as the number of bins. Leave the remaining field values as they are.
o Click the OK button.
o Click Apply in the filter panel. This will result in a new working relation with the selected attribute partitioned into 3 bins (see the code sketch below).
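A rough programmatic equivalent of these steps, using WEKA's unsupervised Discretize filter on the first attribute with three bins (the output file name student_discretized.arff is a placeholder):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeAge {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff");

        // discretize attribute 1 (age) into 3 bins
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("1");
        discretize.setBins(3);
        discretize.setInputFormat(data);

        Instances discretized = Filter.useFilter(data, discretize);
        DataSink.write("student_discretized.arff", discretized);
    }
}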
The resulting student.arff file begins with the header line @relation student, followed by the @attribute declarations and the @data section containing the instances (comment lines start with %).
Topic 4 : Demonstrate performing association rule mining on datasets.
Most machine learning algorithms work with numeric datasets and are therefore mathematical in nature. Association Rule Mining, in contrast, is appropriate for non-numeric, categorical data and requires little more than simple counting.
Given a set of transactions, the goal of association rule mining is to find the rules that allow us
to predict the occurrence of a specific item based on the occurrences of the other items in the
transaction.
For example, in the rule {Bread} ⇒ {Milk}, bread is the antecedent and milk is the consequent.
Before defining the rules of Association Rule Mining, let us first have a look at the basic
definitions.
Frequent Itemset: It represents an itemset whose support is greater than or equal to the
minimum threshold.
The rule evaluation metrics used in Association Rule Mining are as follows:
Support(s): The fraction of transactions that contain all the items in both {X} and {Y}, expressed as a percentage of the total number of transactions. It shows how frequently the group of items occurs together.
Support = σ(X ∪ Y) ÷ (total number of transactions)
Confidence(c): The ratio of the number of transactions that contain all the items in {X} and {Y} to the number of transactions that contain the items in {X} alone.
Confidence = σ(X ∪ Y) ÷ σ(X)
Lift(l): The lift of the rule X ⇒ Y is the confidence of the rule divided by the expected confidence, where it is assumed that the itemsets X and Y are independent of one another. The expected confidence is simply the support (frequency) of {Y}, so Lift = Confidence ÷ Support(Y). A lift greater than 1 indicates that X and Y occur together more often than would be expected by chance.
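As a small illustration with made-up numbers: suppose there are 100 transactions in total, 20 contain bread, 25 contain milk, and 10 contain both bread and milk. For the rule {Bread} ⇒ {Milk}:
Support = 10 ÷ 100 = 0.10
Confidence = 10 ÷ 20 = 0.50
Lift = 0.50 ÷ (25 ÷ 100) = 2.0
A lift of 2.0 (greater than 1) means bread and milk occur together twice as often as would be expected if they were bought independently.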
Market-Basket Analysis
Medical Diagnosis
Census Data
1) Market-Basket Analysis
In most supermarkets, data is collected using barcode scanners. This database is called the
“market basket” database. It contains a large number of past transaction records. Every record
contains the name of all the items each customer purchases in one transaction. From this data,
the stores come to know the inclination and choices of items of the customers. And according to
this information, they decide the store layout and optimize the cataloging of different items.
A single record contains a list of all the items purchased by a customer in a single transaction.
Knowing which groups are inclined toward which set of items allows these stores to adjust the
store layout and catalog to place them optimally next to one another.
2) Medical Diagnosis
Association rules in medical diagnosis can help physicians diagnose and treat patients.
Diagnosis is a difficult process with many potential errors that can lead to unreliable results. You
can use relational association rule mining to determine the likelihood of illness based on various
factors and symptoms. This application can be further expanded using some learning
techniques on the basis of symptoms and their relationships in accordance with diseases.
3) Census Data
The concept of Association Rule Mining is also used in dealing with the massive amount of
census data. If properly aligned, this information can be used in planning efficient public
services and businesses.
Some of the algorithms which can be used to generate association rules are as follows:
Apriori Algorithm
Eclat Algorithm
FP-Growth Algorithm
1) Apriori Algorithm
It works by identifying the most frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database.
The frequent itemsets determined by Apriori are also used to derive association rules that highlight general trends in the database. It counts the support of itemsets using a breadth-first search strategy and a candidate generation function that exploits the downward closure property of support.
2) Eclat Algorithm
Eclat stands for Equivalence Class Transformation. It is a depth-first search algorithm based on set intersection, and it is suitable for both sequential and parallel execution, with locality-enhancing properties. It performs frequent pattern mining by means of a depth-first traversal of the itemset lattice.
3) FP-growth Algorithm
This algorithm is also known as the frequent pattern growth algorithm. FP-Growth finds frequent itemsets in a transaction database without candidate generation.
It was primarily designed to compress the database into a structure (the FP-tree) that preserves the frequent-itemset information, and then to divide the compressed data into conditional database sets.
Each conditional database is associated with one frequent item, and each of these databases is then mined separately. The algorithm therefore has two main phases:
FP-tree construction
Extraction of frequent itemsets
Topic 5 : Demonstrate performing
classifications on datasets
The 5 algorithms that we will review are:
1. Logistic Regression
2. Naive Bayes
3. Decision Tree
4. k-Nearest Neighbors
5. Support Vector Machines
These are 5 algorithms that you can try on your classification problem as a starting point.
A standard machine learning classification problem will be used to demonstrate each algorithm: specifically, the Ionosphere binary classification problem. This is a good dataset to demonstrate classification algorithms because the input variables are numeric and all have the same scale, and the problem has only two classes to discriminate.
Each instance describes the properties of radar returns from the atmosphere, and the task is to predict whether or not there is structure in the ionosphere. There are 34 numerical input variables of generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 98% accuracy.
Logistic Regression
Logistic regression is a binary classification algorithm.
It assumes the input variables are numeric and have a Gaussian (bell curve) distribution. This last point does not
have to be true, as logistic regression can still achieve good results if your data is not Gaussian. In the case of the
Ionosphere dataset, some input attributes have a Gaussian-like distribution, but many do not.
The algorithm learns a coefficient for each input value, which are linearly combined into a regression function and transformed using a logistic (s-shaped) function. Logistic regression is a fast and simple technique, but it can be very effective on some problems.
1. Click the “Choose” button and select “Logistic” under the “functions” group.
2. Click on the name of the algorithm to review the algorithm configuration.
The algorithm can run for a fixed number of iterations (maxIts), but by default will run until it is estimated that the algorithm has converged.
The implementation uses a ridge estimator which is a type of regularization. This method seeks to simplify the model
during training by minimizing the coefficients learned by the model. The ridge parameter defines how much pressure
to put on the algorithm to reduce the size of the coefficients. Setting this to 0 will turn off this regularization.
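As a sketch of how the same experiment could be run through WEKA's Java API rather than the Explorer (assuming the dataset has been saved as ionosphere.arff with the class as the last attribute), the following snippet builds a Logistic classifier and estimates its accuracy with 10-fold cross-validation:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        Logistic logistic = new Logistic();
        logistic.setRidge(1.0E-8);   // default ridge value; 0 turns the regularization off
        logistic.setMaxIts(-1);      // -1 means run until convergence

        // 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(logistic, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}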
Naive Bayes
Naive Bayes is a classification algorithm. Traditionally it assumes that the input values are nominal, although numerical inputs are supported by assuming a distribution.
Naive Bayes uses a simple implementation of Bayes Theorem (hence naive) where the prior probability for each
class is calculated from the training data and assumed to be independent of each other (technically called
conditionally independent).
This is an unrealistic assumption because we expect the variables to interact and be dependent, although this assumption makes the probabilities fast and easy to calculate. Even under this unrealistic assumption, Naive Bayes has been shown to be a very effective classification algorithm: it calculates the posterior probability for each class and predicts the class with the highest probability. As such, it supports both binary classification and multi-class classification problems.
1. Click the “Choose” button and select “NaiveBayes” under the “bayes” group.
2. Click on the name of the algorithm to review the algorithm configuration.
You can change the algorithm to use a kernel estimator with the useKernelEstimator argument, which may better match the actual distribution of the attributes in your dataset. Alternately, you can automatically convert numerical attributes to nominal attributes with the useSupervisedDiscretization parameter.
You can see that with the default configuration, Naive Bayes achieves an accuracy of 82%.
Weka Classification Results for the Naive Bayes Algorithm
There are a number of other flavors of naive bayes algorithms that you could work with.
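A corresponding Java sketch (again assuming ionosphere.arff with the class as the last attribute) that turns on the kernel estimator discussed above:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.setUseKernelEstimator(true);             // kernel density estimator for numeric attributes
        // nb.setUseSupervisedDiscretization(true); // alternative: convert numeric attributes to nominal

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}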
Decision Tree
Decision trees can support classification and regression problems.
Decision trees are more recently referred to as Classification And Regression Trees (CART). They work by creating a tree to evaluate an instance of data, starting at the root of the tree and moving down to the leaves until a prediction can be made. The process of creating a decision tree works by greedily selecting the best split point in order to make predictions, and repeating the process until the tree reaches a fixed depth.
After the tree is constructed, it is pruned in order to improve the model’s ability to generalize to new data.
The depth of the tree is defined automatically, but a depth can be specified via the maxDepth attribute.
You can also choose to turn off pruning by setting the noPruning parameter to True, although this may result in worse performance.
The minNum parameter defines the minimum number of instances supported by the tree in a leaf node when constructing the tree from the training data.
You can see that with the default configuration, the decision tree algorithm achieves an accuracy of 89%.
Another more advanced decision tree algorithm that you can use is the C4.5 algorithm, called J48 in Weka.
You can review a visualization of a decision tree prepared on the entire training dataset by right-clicking on the result list and selecting the option to visualize the tree. A code sketch for this learner is given below.
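The maxDepth, noPruning and minNum parameters described above match WEKA's REPTree learner; the sketch below (assuming ionosphere.arff again) configures it with its default values and cross-validates it. J48 (C4.5) could be substituted in exactly the same way.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        REPTree tree = new REPTree();
        tree.setMaxDepth(-1);     // -1: the depth of the tree is chosen automatically
        tree.setNoPruning(false); // keep pruning enabled to help generalization
        tree.setMinNum(2.0);      // minimum number of instances per leaf

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}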
k-Nearest Neighbors
The k-nearest neighbors algorithm supports both classification and regression. It is also called kNN for short.
It works by storing the entire training dataset and querying it to locate the k most similar training patterns when making a prediction. As such, there is no model other than the raw training dataset, and the only computation performed is the querying of the training dataset when a prediction is requested.
It is a simple algorithm, but one that does not assume very much about the problem other than that the distance
between data instances is meaningful in making predictions. As such, it often achieves very good performance.
When making predictions on classification problems, KNN will take the mode (most common class) of the k most similar instances in the training dataset.
For example, if k is set to 1, then predictions are made using the single most similar training instance to a given new pattern for which a prediction is requested. Common values for k are 3, 7, 11 and 21, larger for larger dataset sizes.
Weka can automatically discover a good value for k using cross-validation inside the algorithm by setting the crossValidate parameter to True.
Another important parameter is the distance measure used. This is configured in the nearestNeighbourSearchAlgorithm, which controls the way in which the training data is stored and searched.
The default is a LinearNNSearch. Clicking the name of this search algorithm will provide another configuration window where you can choose a distanceFunction parameter. By default, Euclidean distance is used to calculate the distance between instances, which is good for numerical data with the same scale. Manhattan distance is good to use if your attributes differ in measures or type.
Weka Configuration for the Search Algorithm in the k-Nearest Neighbors Algorithm
It is a good idea to try a suite of different k values and distance measures on your problem and see what works best.
You can see that with the default configuration, the kNN algorithm achieves an accuracy of 86%.
Weka Classification Results for k-Nearest Neighbors
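In WEKA's Java API, k-nearest neighbors is provided by the IBk classifier. The sketch below (assuming ionosphere.arff) lets IBk pick a good k by internal cross-validation and switches the distance function of the default LinearNNSearch to Manhattan distance:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.ManhattanDistance;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        knn.setKNN(21);             // upper bound on k
        knn.setCrossValidate(true); // let IBk select the best k up to the value above
        knn.getNearestNeighbourSearchAlgorithm().setDistanceFunction(new ManhattanDistance());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}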
Support Vector Machines
Support Vector Machines were developed for binary classification problems, although extensions to the technique have been made to support multi-class classification and regression problems. The algorithm is often referred to as SVM for short.
SVM was developed for numerical input variables, although it will automatically convert nominal values to numerical values.
SVM work by finding a line that best separates the data into the two groups. This is done using an optimization
process that only considers those data instances in the training dataset that are closest to the line that best separates
the classes. The instances are called support vectors, hence the name of the technique.
In almost all problems of interest, a line cannot be drawn to neatly separate the classes, therefore a margin is added
around the line to relax the constraint, allowing some instances to be misclassified but allowing a better result overall.
Finally, few datasets can be separated with just a straight line. Sometimes a line with curves or even polygonal regions needs to be marked out. This is achieved with SVM by projecting the data into a higher-dimensional space in order to draw the lines and make predictions. Different kernels can be used to control the projection and the amount of flexibility in separating the classes.
1. Click the “Choose” button and select “SMO” under the “function” group.
2. Click on the name of the algorithm to review the algorithm configuration.
SMO refers to the specific efficient optimization algorithm used inside the SVM implementation, which stands for Sequential Minimal Optimization.
The C parameter, called the complexity parameter in Weka, controls how flexible the process for drawing the line to separate the classes can be. A value of 0 allows no violations of the margin, whereas the default is 1.
A key parameter in SVM is the type of Kernel to use. The simplest kernel is a Linear kernel that separates data with a
straight line or hyperplane. The default in Weka is a Polynomial Kernel that will separate the classes using a curved
or wiggly line, the higher the polynomial, the more wiggly (the exponent value).
A popular and powerful kernel is the RBF Kernel or Radial Basis Function Kernel that is capable of learning closed polygons and complex shapes to separate the classes.
It is a good idea to try a suite of different kernels and C (complexity) values on your problem and see what works
best.
You can see that with the default configuration, the SVM algorithm achieves an accuracy of 88%.
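A Java sketch of the same experiment (assuming ionosphere.arff), using SMO with an RBF kernel; the gamma value chosen here is only an illustrative starting point:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO smo = new SMO();
        smo.setC(1.0);                 // complexity parameter C (default)
        RBFKernel kernel = new RBFKernel();
        kernel.setGamma(0.01);         // illustrative kernel width
        smo.setKernel(kernel);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}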
1. Clustering
Open the Weka Explorer environment and load the training file using the Preprocess mode. Try first with weather.arff. Get to the Cluster mode (by clicking on the Cluster tab) and select a clustering algorithm, for example SimpleKMeans. Then click on Start and you get the clustering result in the output window. The actual clustering for this algorithm is shown as one instance for each cluster, representing the cluster centroid.
2. Evaluation
The way Weka evaluates the clusterings depends on the cluster mode you select. Four different cluster modes are available (as buttons in the Cluster mode panel):
a. Use training set (default). After generating the clustering, Weka classifies the training instances into clusters according to the cluster representation and computes the percentage of instances falling in each cluster. For example, the above clustering produced by k-means shows 64% (9 instances) in cluster 0 and 36% (5 instances) in cluster 1.
b. In Supplied test set or Percentage split mode, Weka can evaluate the clustering on separate test data if the cluster representation is probabilistic (e.g. for EM).
c. Classes to clusters evaluation. In this mode Weka first ignores the class attribute and generates the clustering. Then during the test phase it assigns classes to the clusters, based on the majority value of the class attribute within each cluster. Then it computes the classification error based on this assignment and also shows the corresponding confusion matrix.
Visualize the cluster assignments. To do this, right-click on the clusterer in the Result list panel and select Visualize cluster assignments. Plot Class against Cluster. All the data points will lie on top of each other, so increase the Jitter slide bar to about half way to add random noise to each point. This allows us to see more clearly where the bulk of the datapoints lies. In this scatter plot each row represents a class and each column a cluster. You could save the clustering results by clicking the Save button on the Visualization panel. The results are saved in a .arff file. You could use Weka to open it and view the results. A code sketch of this classes-to-clusters evaluation is given below.
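As a rough sketch of the same classes-to-clusters evaluation in WEKA's Java API (using weather.arff and two clusters as assumptions), the class attribute is removed before building the clusterer and then supplied again for the evaluation:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class attribute: play

        // build the clusterer on a copy of the data with the class attribute removed
        Remove removeClass = new Remove();
        removeClass.setAttributeIndices("" + (data.classIndex() + 1));
        removeClass.setInputFormat(data);
        Instances train = Filter.useFilter(data, removeClass);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(train);

        // classes-to-clusters evaluation: the class attribute of 'data' is compared to the clusters
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}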
We are familiar with the spam dataset by now. To load the spam dataset (available from Lab 4)
select the preprocess tab and then select Open file ....
2. Go to the Cluster tab: Select the SimpleKMeans clusterer, bring up its options window and
set numClusters to 2.
3. In the Cluster mode panel, select Classes to clusters evaluation and hit Start. This option
evaluates clusters with respect to a class. More specifically, in the mode Classes to clusters
evaluation Weka first ignores the class attribute and generates the clustering. Then during the
test phase it assigns classes to the clusters, based on the majority value of the class attribute
within each cluster. Then it computes the classification error, based on this assignment and also
shows the corresponding confusion matrix.
4. Ideally, we would hope to see all instances from a single class assigned to a single cluster,
and no instances from different classes assigned to the same cluster.
5. Look at the Classes to Clusters confusion matrix. Clearly, we don't have a perfect
correspondence between classes and clusters. a. How successful has the clustering been in
this regard? b. Looking at each class individually, can you spot the particular class that is well
identified by the clustering? Classes that are poorly identified? c. Which classes are mostly
confused with each other? d. Compare with the performance of the supervised classifiers we
used during the last lab session. Visualize the cluster assignments. To do this, right-click on the
clusterer in the Result list panel and select Visualize cluster assignments. Plot Class against
Cluster. All the data points will lie on top of each other, so increase the Jitter slide bar to about
half way to add random noise to each point. This allows us to see more clearly where the bulk of
the datapoints lies. In this scatter plot each row represents a class and each column a cluster.
6. Can you draw any conclusion based on this visualization?
1. HierarchicalClusterer implements agglomerative (bottom-up) generation of hierarchical clusters. Several different link types, which are ways of measuring the distance between clusters, are available as options.
2. Since the Hierarchical Clustering algorithm builds a tree for the whole dataset, let's practice this algorithm on a smaller dataset due to the memory space limitation. Open the glass.arff dataset used in the previous lab and first normalize all the numeric values in the dataset into [0,1]. Then choose the HierarchicalClusterer clusterer. Since this dataset has 6 classes, we set numClusters to 6. To save time, we set printNewick to False. Choose link_type=COMPLETE and run the algorithm (see the code sketch at the end of this exercise).
3. Since the performance of HierarchicalClustering is not good, we could run the k-means algorithm on the same dataset and compare their performance. Do not forget to set the number of clusters to 6.
4. Save the clustering results as a glass_kmeans_result.arff file, i.e. right-click on the k-means entry, choose Visualize cluster assignments, then save with the name above. Reopen this dataset with Weka. Since the clustering results are saved as the last column, they are considered as the class labels for the dataset. The original class labels are considered as a feature of the dataset. If you trust the clustering, you might want to use this newly created dataset for supervised learning.
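A possible Java sketch of the hierarchical clustering step (glass.arff, 6 clusters, complete linkage). The -N and -L option strings follow WEKA's command-line option syntax for HierarchicalClusterer and should be checked against your WEKA version:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.HierarchicalClusterer;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.Remove;

public class HierarchicalExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // normalize all numeric attribute values into [0,1]
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, normalize);

        // remove the class attribute before clustering
        Remove removeClass = new Remove();
        removeClass.setAttributeIndices("" + (normalized.classIndex() + 1));
        removeClass.setInputFormat(normalized);
        Instances train = Filter.useFilter(normalized, removeClass);

        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setOptions(Utils.splitOptions("-N 6 -L COMPLETE"));  // 6 clusters, complete link type
        hc.setPrintNewick(false);                               // skip printing the Newick tree
        hc.buildClusterer(train);

        // classes-to-clusters evaluation against the original class labels
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(hc);
        eval.evaluateClusterer(normalized);
        System.out.println(eval.clusterResultsToString());
    }
}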
Topic 7 : Demonstrate performing Regression
on Dataset
Regression is a supervised learning technique which helps in finding the correlation between variables and
enables us to predict the continuous output variable based on one or more predictor variables. It is mainly
used for prediction, forecasting, time series modeling, and determining the causal-effect relationship
between variables.
In Regression, we plot a graph between the variables which best fits the given datapoints, using this plot, the
machine learning model can make predictions about the data. In simple words, "Regression shows a line or
curve that passes through all the datapoints on target-predictor graph in such a way that the vertical
distance between the datapoints and the regression line is minimum." The distance between datapoints
and line tells whether a model has captured a strong relationship or not.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each type has its
own importance on different scenarios, but at the core, all the regression methods analyze the effect of the
independent variable on dependent variables. Here we are discussing some important types of regression
which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the relationship
between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear regression.
And if there is more than one input variable, then such linear regression is called multiple linear
regression.
o The relationship between the variables in the linear regression model can be visualized with a scatter plot; for example, predicting the salary of an employee on the basis of years of experience.
o Mathematically, the relationship can be represented as:
Y = aX + b
where Y is the dependent (target) variable, X is the independent (predictor) variable, a is the slope of the regression line and b is the intercept.
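To demonstrate linear regression in WEKA itself, the following sketch fits a LinearRegression model to a dataset with a numeric class. The file cpu.arff is used as an assumed example; any dataset whose class attribute is numeric would work the same way.

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LinearRegressionExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);   // numeric attribute to predict

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);

        // print the learned linear model (coefficients and intercept)
        System.out.println(lr);

        // predict the value of the first instance as an example
        double prediction = lr.classifyInstance(data.instance(0));
        System.out.println("Predicted value for the first instance: " + prediction);
    }
}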
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the classification
problems. In classification problems, we have dependent variables in a binary or discrete format such
as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or
False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear regression algorithm in the
term how they are used.
o Logistic regression uses the sigmoid function or logistic function, which is a complex cost function. This sigmoid function is used to model the data in logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives an S-shaped curve.
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multi (cats, dogs, lions)
o Ordinal (low, medium, high)
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear fashion, so
for such case, linear regression will not best fit to those datapoints. To cover such datapoints, we need
Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features of
given degree and then modeled using a linear model. Which means the datapoints are best fitted
using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation; that is, the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output and b0, b1, ... bn are the regression coefficients; x is our independent/input variable.
o The model is still linear, because the coefficients are still linear; it is only the input feature that is raised to quadratic and higher powers.
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression as well as classification problems. If we use it for regression problems, it is termed Support Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous variables. Below are some
keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line which
helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin for
datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and
opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum number of
datapoints are covered in that margin. The main goal of SVR is to consider the maximum datapoints within
the boundary lines and the hyperplane (best-fit line) must contain a maximum number of datapoints .
Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
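In WEKA, support vector regression is available through the SMOreg classifier in the functions group. A minimal sketch, again using cpu.arff as an assumed dataset with a numeric class:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMOreg;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvrExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMOreg svr = new SMOreg();   // support vector regression, default kernel and parameters

        // report correlation coefficient and error measures from 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svr, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}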
Topic 8 : Demonstration of association rule
mining.
Aim: This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is contactlenses.arff.
Step 1: Open the data file in Weka Explorer. It is presumed that the required data fields have been discretized; in this example it is the age attribute.
Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.
Step 3: We will use the Apriori algorithm. This is the default algorithm.
Step 4: In order to change the parameters for the run (e.g. support and confidence), we click on the text box immediately to the right of the Choose button.
Dataset: contactlenses.arff
The association rules generated when the Apriori algorithm is applied on this dataset appear in the Associator output panel. A code sketch of the same run is given below.
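For completeness, a Java sketch of the same Apriori run. The minimum support, minimum confidence and number of rules set here mirror WEKA's defaults and are only illustrative:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriExample {
    public static void main(String[] args) throws Exception {
        // Apriori requires nominal (categorical) attributes
        Instances data = DataSource.read("contactlenses.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);               // report the 10 best rules
        apriori.setMinMetric(0.9);             // minimum confidence
        apriori.setLowerBoundMinSupport(0.1);  // minimum support

        apriori.buildAssociations(data);
        System.out.println(apriori);           // prints the generated association rules
    }
}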
Topic 9 : Perform classification using the Bayesian classification algorithm.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
Posterior probability, P(H|X)
Prior probability, P(H)
where X is a data tuple and H is some hypothesis.
Step 1: Load the dataset into WEKA using the Preprocess tab.
Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "id3" classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values. This default version does perform some pruning but does not perform error pruning.
Step 4: Under the "Test options" in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation dataset, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click "Start" to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that we may have more work to do (either in preprocessing or in selecting better parameters for the classification).
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.
Step 9: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will show a pop-up window which will allow you to open the file containing the test instances. A code sketch of evaluating on a supplied test set is given below.
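Because this topic is about Bayesian classification, the sketch below uses NaiveBayes (the lab steps above mention the id3 tree learner; either classifier can be substituted). The file names train.arff and test.arff are placeholders for the training file and the supplied test set:

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSetExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);              // build the model on the training set

        // evaluate the model on the supplied test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}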
Topic 10 : Perform cluster analysis by the k-means method.
K-Means Clustering-
Step-01:
Choose the number of clusters K.
Step-02:
Randomly select any K data points as cluster centers.
Select cluster centers in such a way that they are as far as possible from each other.
Step-03:
Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using a given distance function or by using the euclidean distance formula.
Step-04:
Assign each data point to the cluster whose center is nearest to it.
Step-05:
Re-compute the center of each cluster by taking the mean of all the data points assigned to that cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met −
The centers of the newly formed clusters do not change.
The data points remain in the same cluster.
The maximum number of iterations is reached.
Advantages-
Point-01:
The K-Means algorithm is relatively efficient, with time complexity O(nkt), where
n = number of instances
k = number of clusters
t = number of iterations
Disadvantages-
K-Means Clustering Algorithm has the following disadvantages-
It requires the number of clusters (k) to be specified in advance.
It can not handle noisy data and outliers.
It is not suitable to identify clusters with non-convex shapes.
Problem-01:
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-
We follow the above discussed K-Means Clustering Algorithm-
Iteration-01:
We calculate the distance of each point from each of the center of the three clusters.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.
Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
The given point belongs to that cluster whose center is nearest to it.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2
Cluster-01:
A1(2, 10)
Cluster-02:
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
A8(4, 9)
Cluster-03:
A2(2, 5)
A7(1, 2)
Now,
We re-compute the new cluster centers.
The new cluster center is computed by taking mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= (2, 10)
(Cluster-01 contains only the point A1, so its center remains unchanged.)
For Cluster-02:
Center of Cluster-02
= (6, 6)
For Cluster-03:
Center of Cluster-03
= (1.5, 3.5)
Iteration-02:
We calculate the distance of each point from each of the center of the three clusters.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |2 – 2| + |10 – 10|
= 0
Calculating Distance Between A1(2, 10) and C2(6, 6)-
Ρ(A1, C2)
= |6 – 2| + |6 – 10|
= 4 + 4
= 8
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
Ρ(A1, C3)
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
= 7
In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.
Next,
We draw a table showing all the results.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (6, 6) of Cluster-02 | Distance from center (1.5, 3.5) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 8 | 7 | C1
A2(2, 5) | 5 | 5 | 2 | C3
A3(8, 4) | 12 | 4 | 7 | C2
A4(5, 8) | 5 | 3 | 8 | C2
A5(7, 5) | 10 | 2 | 7 | C2
A6(6, 4) | 10 | 2 | 5 | C2
A7(1, 2) | 9 | 9 | 2 | C3
A8(4, 9) | 3 | 5 | 8 | C1
Cluster-01:
A1(2, 10)
A8(4, 9)
Cluster-02:
Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
Cluster-03:
A2(2, 5)
A7(1, 2)
Now,
We re-compute the new cluster centers.
For Cluster-01:
Center of Cluster-01
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= (1.5, 3.5)
After the second iteration, the three cluster centers are −
C1(3, 9.5)
C2(6.5, 5.25)
C3(1.5, 3.5)
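The manual computation above can be checked with a short, self-contained Java program that runs two iterations of k-means with the given Manhattan distance and initial centers (a from-scratch sketch for verification, not WEKA's SimpleKMeans):

public class KMeansManhattan {
    public static void main(String[] args) {
        double[][] pts = {{2, 10}, {2, 5}, {8, 4}, {5, 8}, {7, 5}, {6, 4}, {1, 2}, {4, 9}};
        double[][] centers = {{2, 10}, {5, 8}, {1, 2}};   // initial centers A1, A4, A7

        for (int iter = 1; iter <= 2; iter++) {
            // assignment step: nearest center by Manhattan distance
            int[] assign = new int[pts.length];
            for (int i = 0; i < pts.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centers.length; c++) {
                    double d = Math.abs(pts[i][0] - centers[c][0])
                             + Math.abs(pts[i][1] - centers[c][1]);
                    if (d < best) { best = d; assign[i] = c; }
                }
            }
            // update step: each center becomes the mean of its assigned points
            for (int c = 0; c < centers.length; c++) {
                double sx = 0, sy = 0; int n = 0;
                for (int i = 0; i < pts.length; i++) {
                    if (assign[i] == c) { sx += pts[i][0]; sy += pts[i][1]; n++; }
                }
                if (n > 0) { centers[c][0] = sx / n; centers[c][1] = sy / n; }
            }
            System.out.printf("After iteration %d: C1(%.2f, %.2f) C2(%.2f, %.2f) C3(%.2f, %.2f)%n",
                    iter, centers[0][0], centers[0][1], centers[1][0], centers[1][1],
                    centers[2][0], centers[2][1]);
        }
    }
}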
Problem-02:
Cluster the following five points (with (x, y) representing locations) into two clusters:
A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5)
Initial cluster centers are: A(2, 2) and C(1, 1).
Use the K-Means algorithm with the euclidean distance formula to find the two cluster centers after the first iteration.
Solution-
Iteration-01:
We calculate the distance of each point from each of the center of the two clusters.
The distance is calculated by using the euclidean distance formula.
The following illustration shows the calculation of distance between point A(2, 2) and each of the
center of the two clusters-
Calculating Distance Between A(2, 2) and C1(2, 2)-
Ρ(A, C1)
= sqrt [ (2 – 2)² + (2 – 2)² ]
= sqrt [ 0 + 0 ]
= 0
Calculating Distance Between A(2, 2) and C2(1, 1)-
Ρ(A, C2)
= sqrt [ (1 – 2)² + (1 – 2)² ]
= sqrt [ 1 + 1 ]
= sqrt [ 2 ]
= 1.41
In the similar manner, we calculate the distance of other points from each of the center of the
two clusters.
Next,
We draw a table showing all the results.
Given Points | Distance from center (2, 2) of Cluster-01 | Distance from center (1, 1) of Cluster-02 | Point belongs to Cluster
A(2, 2) | 0 | 1.41 | C1
B(3, 2) | 1 | 2.24 | C1
C(1, 1) | 1.41 | 0 | C2
D(3, 1) | 1.41 | 2 | C1
E(1.5, 0.5) | 1.58 | 0.71 | C2
Cluster-01:
A(2, 2)
B(3, 2)
D(3, 1)
Cluster-02:
C(1, 1)
E(1.5, 0.5)
Now,
We re-compute the new cluster centers.
For Cluster-01:
Center of Cluster-01
= (2.67, 1.67)
For Cluster-02:
Center of Cluster-02
= (1.25, 0.75)
After the first iteration, the two cluster centers are C1(2.67, 1.67) and C2(1.25, 0.75).