2021-22 DM Lab Manual
College of Engineering
Opp Gujarat University, Navrangpura, Ahmedabad - 380015
LAB MANUAL
Branch: Computer Engineering
Faculty Details:
1) Prof. (Dr.) V. B. Vaghela
2) Prof. H. A. Joshiyara
Data Mining (3160714)
Term: 2020-21
SIGN OF FACULTY
Enroll. No.:
Class: 6th CE
Pract. No. | CO No. | RB1 | RB2 | RB3 | RB4 | Total | Date | Faculty Sign
1 CO3
2 CO1
3 CO2
4 CO2
5 CO2
6 CO5
7 CO5
8 CO2
9 CO4
10 CO1
1. RATIONALE
o To teach the basic principles, concepts and applications of data warehousing and data mining in Business Intelligence.
o To introduce the task of data mining as an important phase of the knowledge discovery process.
o To familiarize students with the conceptual, logical, and physical design of data warehouses, OLAP applications and OLAP deployment.
o To characterize the kinds of patterns that can be discovered by association rule mining, classification and clustering.
o To develop skill in selecting appropriate data mining algorithms and tools for solving practical problems.
o To master data mining techniques in various application contexts such as social, scientific and environmental data.
2. COMPETENCY
The primary purpose of data mining in business intelligence is to find correlations or patterns
among dozens of fields in large databases. The course exposes students to topics involving
planning, designing, building, populating, and maintaining a successful data warehouse and
implementing various data mining techniques in business applications.
3. COURSE OUTCOMES
Practical – 1
AIM: Introduction to various Data Mining Tools - WEKA, DTREG, DB Miner. Compare them in terms of their
special features, functionality and limitations.
Objectives:
o Study various data mining tools
o Learn data mining features
o Learn data mining functionality
Theory:
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand.
Weka is free software available under the GNU General Public License. The Weka
workbench contains a collection of visualization tools and algorithms for data analysis and
predictive modeling, together with graphical user interfaces for easy access to this functionality.
Weka supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. Weka provides access to
SQL databases using Java Database Connectivity and can process the result returned by a database
query. It is not capable of multi-relational data mining, but there is separate software for
converting a collection of linked database tables into a single table that is suitable for processing
using Weka.
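Because Weka is itself a Java library (weka.jar), the functionality described above can also be exercised programmatically. The short sketch below is only an illustration, assuming weka.jar is on the classpath and that an ARFF file such as student.arff is available; it loads a dataset through the Weka Java API and prints a summary.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: load an ARFF file through Weka's Java API and print a summary.
public class WekaLoadDemo {
    public static void main(String[] args) throws Exception {
        // Assumed file name; replace with the path to your own ARFF dataset.
        DataSource source = new DataSource("student.arff");
        Instances data = source.getDataSet();

        // Conventionally the last attribute is treated as the class attribute.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println(data.toSummaryString());
    }
}
```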
Download and Installation Procedure:
Step 1: Download WEKA from: http://sourceforge.net/projects/weka/postdownload?source=dlp
Step 2: Run the downloaded installer and follow the on-screen steps to install WEKA.
DTREG is a robust application that is installed easily on any Windows system. DTREG
reads Comma Separated Value (CSV) data files that are easily created from almost any data
source. Once you create your data file, just feed it into DTREG, and let DTREG do all of the
work of creating a decision tree, Support Vector Machine, K-Means clustering, Linear
Discriminant Function, Linear Regression or Logistic Regression model. Even complex analyses
can be set up in minutes.
DTREG accepts a dataset consisting of a number of rows with a column for each
variable. One of the variables is the "target variable" whose value is to be
modeled and predicted as a function of the "predictor variables". DTREG
analyzes the data and generates a model showing how best to predict the values
of the target variable based on the values of the predictor variables.
DTREG can create classical, single-tree models and also TreeBoost and
Decision Tree Forest models consisting of ensembles of many trees. DTREG
also can generate Neural Networks, Support Vector Machine (SVM), Gene
Expression Programming/Symbolic Regression, K-Means clustering, GMDH
polynomial networks, Discriminant Analysis, Linear Regression, and Logistic
Regression models.
Installing DTREG
Signature of Faculty:
Practical – 2
AIM: Demonstration of preprocessing on dataset like student.arff and labor.arff using WEKA.
Theory:
o Data preprocessing is a data mining technique used to transform raw data into a useful
and efficient format. To ensure high-quality data, it is crucial to preprocess it. To
make the process easier, data preprocessing is divided into four stages: data cleaning,
data integration, data reduction, and data transformation.
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning handles these issues,
including the handling of missing data and noisy data.
2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves
techniques such as normalization, attribute selection, discretization, and concept hierarchy generation.
3. Data Reduction:
Data mining handles huge volumes of data, and analysis becomes harder as the volume grows.
Data reduction techniques address this: they aim to increase storage efficiency and reduce data
storage and analysis costs.
Procedure / Steps:
The sample dataset used for this example is the student data available in arff format.
Step1:
Load the data. We can load the dataset into Weka by clicking on the Open button in the
Preprocess interface and selecting the appropriate file.
Step2:
Once the data is loaded, Weka recognizes the attributes and, during the scan of the
data, computes some basic statistics on each attribute. The left panel shows the list of
recognized attributes, while the top panel indicates the names of the base relation
(table) and the current working relation (which are the same initially).
Step3:
Clicking on an attribute in the left panel shows the basic statistics for that attribute:
for categorical attributes the frequency of each attribute value is shown, while for
continuous attributes we can obtain the min, max, mean and standard deviation.
Step4:
The visualization panel at the bottom right shows the data in the form of a
cross-tabulation across two attributes.
c) Click the Apply button to apply the filter to this data. This will remove the attribute and
create a new working relation.
d) Save the new working relation as an ARFF file by clicking the Save button on the
top (button) panel (student.arff).
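As a programmatic counterpart to steps c) and d), the following hedged sketch (assuming weka.jar on the classpath; the attribute index "1" is only an illustrative choice) applies Weka's Remove filter and saves the new working relation as an ARFF file.

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Sketch: remove one attribute from student.arff and save the new working relation.
public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();

        // Configure the Remove filter; "1" is the (1-based) index of the attribute
        // to drop (an assumption for illustration; adjust to your dataset).
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInputFormat(data);
        Instances newData = Filter.useFilter(data, remove);

        // Save the filtered relation as a new ARFF file.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(newData);
        saver.setFile(new File("student_filtered.arff"));
        saver.writeBatch();

        System.out.println("Saved " + newData.numAttributes() + " attributes to student_filtered.arff");
    }
}
```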
Signature of Faculty:
Practical – 3
AIM: Demonstration of Association rule process on any data set using Apriori algorithm in WEKA.
Theory:
o Association rule mining finds interesting associations and relationships among large sets
of data items. These rules show how frequently an itemset occurs in a transaction. A
typical example is Market Basket Analysis.
o Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It allows retailers to identify relationships between the items
that people buy together frequently.
Background / Preparation:
o Given a set of transactions, we can find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Before we start defining the rule, let us first see the basic definitions.
Frequent Itemset – An itemset whose support is greater than or equal to minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any two
disjoint itemsets.
Support(s) –
The number of transactions that include all items in {X} and {Y} as a percentage of the
total number of transactions. It measures how frequently the collection of items occurs
together as a fraction of all transactions:
Supp(X=>Y) = (number of transactions containing X U Y) / (total number of transactions)
Confidence(c) –
The ratio of the number of transactions that include all items in both {X} and {Y} to the
number of transactions that include all items in {X}:
Conf(X=>Y) = Supp(X U Y) / Supp(X)
It measures how often items in Y appear in transactions that also contain X.
Lift(l) –
The lift of the rule X=>Y is the confidence of the rule divided by the expected
confidence, where the expected confidence is the support of {Y} (i.e. the confidence we
would get if X and Y were independent):
Lift(X=>Y) = Conf(X=>Y) / Supp(Y) = Supp(X U Y) / (Supp(X) * Supp(Y))
A lift greater than 1 indicates that X and Y occur together more often than expected under independence.
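For a quick worked example using the transaction table above, consider the rule {Milk, Diaper} => {Beer}:
Supp({Milk, Diaper} => {Beer}) = 2/5 = 0.40, since only transactions 3 and 4 contain Milk, Diaper and Beer together.
Conf({Milk, Diaper} => {Beer}) = Supp(Milk, Diaper, Beer) / Supp(Milk, Diaper) = (2/5) / (3/5) = 0.67.
Lift = Conf / Supp(Beer) = 0.67 / (3/5) = 1.11, slightly above 1, so Beer appears with {Milk, Diaper} a little more often than it would if they were independent.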
Procedure / Steps:
o Step1: Open the data file in Weka Explorer. It is presumed that the required data fields
have been discretized; in this example it is the age attribute.
o Step2: Clicking on the Associate tab will bring up the interface for association rule
algorithms.
o Step3: We will use the Apriori algorithm. This is the default algorithm.
o Step4: In order to change the parameters for the run (e.g. support, confidence) we
click on the text box immediately to the right of the Choose button.
Dataset contactlenses.arff
The following screenshot shows the association rules that were generated when the Apriori
algorithm is applied on the given dataset.
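The same rules can also be generated outside the GUI through Weka's Java API. The sketch below is an illustrative example only; weka.jar on the classpath, the contact-lenses.arff dataset, and the minimum support (0.2) and confidence (0.9) values are all assumptions.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: run the Apriori associator on a nominal dataset and print the rules found.
public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("contact-lenses.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.2);  // minimum support (illustrative value)
        apriori.setMinMetric(0.9);             // minimum confidence (illustrative value)
        apriori.setNumRules(10);               // report at most 10 rules
        apriori.buildAssociations(data);

        // toString() lists the best rules with their support and confidence.
        System.out.println(apriori);
    }
}
```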
Signature of Faculty:
Practical – 4
AIM: Demonstration of classification rule process on dataset using j48 algorithm using WEKA.
Theory:
Classification:
It is a data analysis task, i.e. the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to
which of a set of categories (subpopulations) a new observation belongs, on the basis
of a training set of data containing observations whose category membership is
known.
Before we discuss the various classification algorithms in data mining, let's first look
at the types of classification techniques available. Primarily, we can divide the
classification algorithms into two categories:
1. Generative
2. Discriminative
Generative
A generative approach models the distribution of the individual classes, i.e. it learns how
the data for each class is generated and uses that model to classify new observations.
Discriminative
It's a rudimentary classification algorithm that determines a class for a row of data. It
models by using the observed data and depends on the data quality rather than its
distributions.
Background / Preparation:
This experiment illustrates the use of the J48 classifier in Weka. The sample data set used in
this experiment is the "student" data available in ARFF format. This document assumes that
appropriate data preprocessing has been performed.
Procedure / Steps:
o Steps involved in this experiment:
o Step-1:
We begin the experiment by loading the data (student.arff) into Weka.
o Step2:
Next we select the "Classify" tab and click the "Choose" button to select the
"J48" classifier.
o Step3:
Now we specify the various parameters. These can be specified by clicking in the
text box to the right of the Choose button. In this example, we accept the default
values. The default version does perform some pruning but does not perform
reduced-error pruning.
o Step4:
Under the "Test options" in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we don't have a separate evaluation data set, this
is necessary to get a reasonable idea of the accuracy of the generated model.
o Step-5:
We now click "Start" to generate the model. The ASCII version of the tree as well
as evaluation statistics will appear in the right panel when the model construction
is complete.
o Step-6:
Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting different parameters
for the classification).
o Step-7:
Weka also lets us view a graphical version of the classification tree. This
can be done by right-clicking the last result set and selecting "Visualize tree" from
the pop-up menu.
o Step-8:
In the main panel under "Test options", click the "Supplied test set" radio button
and then click the "Set" button. This will pop up a window which allows you
to open the file containing test instances.
The following screenshot shows the classification rules that were generated when the J48
algorithm is applied on the given dataset.
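The same experiment can be reproduced programmatically. The following sketch is a hedged example (assuming weka.jar on the classpath and student.arff with the class as its last attribute) that builds a J48 tree and evaluates it with 10-fold cross-validation:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: build a J48 decision tree and estimate its accuracy with 10-fold cross-validation.
public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class = last attribute

        J48 tree = new J48();           // default parameters, as in the GUI example
        tree.buildClassifier(data);
        System.out.println(tree);       // ASCII version of the tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());      // confusion matrix
    }
}
```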
Signature of Faculty:
Practical – 5
AIM: Demonstration of classification rule process on dataset using id3 algorithm using WEKA.
Theory:
o The ID3 algorithm begins with the original set as the root node. On each iteration of the
algorithm, it iterates through every unused attribute of the set and calculates the entropy
or the information gain of that attribute. It then selects the attribute which has the smallest
entropy (or largest information gain) value. The set is then split or partitioned by the
selected attribute to produce subsets of the data. (For example, a node can be split into
child nodes based upon the subsets of the population whose ages are less than 50,
between 50 and 100, and greater than 100.) The algorithm continues to recurse on each
subset, considering only attributes never selected before.
o Recursion on a subset may stop in one of these cases:
o Every element in the subset belongs to the same class; in which case the node is turned
into a leaf node and labelled with the class of the examples.
o There are no more attributes to be selected, but the examples still do not belong to the
same class. In this case, the node is made a leaf node and labelled with the most common
class of the examples in the subset.
o There are no examples in the subset, which happens when no example in the parent set
was found to match a specific value of the selected attribute. An example could be the
absence of a person among the population with age over 100 years. Then a leaf node is
created and labelled with the most common class of the examples in the parent node's set.
o Throughout the algorithm, the decision tree is constructed with each non-terminal node
(internal node) representing the selected attribute on which the data was split, and
terminal nodes (leaf nodes) representing the class label of the final subset of this branch.
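The attribute-selection step above can be written out explicitly. Using the standard ID3 definitions (stated here in the same plain notation used elsewhere in this manual):
Entropy(S) = - Σ p_i * log2(p_i), where p_i is the proportion of examples in S that belong to class i.
Gain(S, A) = Entropy(S) - Σ_v ( |S_v| / |S| ) * Entropy(S_v), where S_v is the subset of S having value v for attribute A.
ID3 chooses the attribute A with the largest Gain(S, A), which is equivalent to choosing the split with the smallest weighted entropy of the resulting subsets.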
Background / Preparation:
The sample data set used in this experiment is the "employee" data available in ARFF format.
This document assumes that appropriate data preprocessing has been performed.
Procedure / Steps:
o Steps involved in this experiment:
o Step1:
We begin the experiment by loading the data (employee.arff) into Weka.
o Step 2:
Next we select the "Classify" tab and click the "Choose" button to select the
"Id3" classifier.
o Step 3:
Now we specify the various parameters. These can be specified by clicking in the
text box to the right of the Choose button. In this example, we accept the default
values (unlike J48, the ID3 implementation performs no pruning).
o Step 4:
Under the "Test options" in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we don't have a separate evaluation data set, this
is necessary to get a reasonable idea of the accuracy of the generated model.
o Step 5:
We now click "Start" to generate the model. The ASCII version of the tree as well
as evaluation statistics will appear in the right panel when the model construction
is complete.
o Step 6:
Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting different parameters
for the classification).
o Step 7:
Weka also lets us view a graphical version of the classification tree. This
can be done by right-clicking the last result set and selecting "Visualize tree" from
the pop-up menu.
o Step 8:
We will use our model to classify the new instances.
o Step 9:
In the main panel under "Test options", click the "Supplied test set" radio button
and then click the "Set" button. This will show a pop-up window which allows
you to open the file containing test instances.
The following screenshot shows the classification rules that were generated when the ID3 algorithm
is applied on the given dataset.
Signature of Faculty:
Practical – 6
AIM: Implementation of Bayesian classifier using JAVA and verify result with WEKA.
Theory:
Naive Bayes classifiers are a collection of classification algorithms based on
Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them
share a common principle, i.e. every pair of features being classified is independent of
each other.
Background / Preparation:
To start with, let us consider a dataset.
Consider a fictional dataset that describes the weather conditions for playing a
game of golf. Given the weather conditions, each tuple classifies the conditions as
fit ("Yes") or unfit ("No") for playing golf.
Here is a tabular representation of our dataset.
The dataset is divided into two parts, namely, the feature matrix and the response vector.
The feature matrix contains all the vectors (rows) of the dataset, in which each vector
consists of the values of the features. In the above dataset, the features are
'Outlook', 'Temperature', 'Humidity' and 'Windy'.
The response vector contains the value of the class variable (prediction or output) for
each row of the feature matrix. In the above dataset, the class variable name is 'Play
golf'.
Assumption:
The fundamental Naive Bayes assumption is that each feature makes an:
independent
equal
contribution to the outcome.
We assume that no pair of features is dependent. For example, the temperature
being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has
no effect on the winds. Hence, the features are assumed to be independent.
Secondly, each feature is given the same weight (or importance). For example,
knowing only the temperature and humidity alone cannot predict the outcome
accurately. No attribute is irrelevant, and each is assumed to contribute
equally to the outcome.
Note: The assumptions made by Naive Bayes are not generally correct in real-world
situations. In fact, the independence assumption is rarely correct, but it often works well in
practice.
Bayes' Theorem provides a way to calculate the probability of a piece of data
belonging to a given class, given our prior knowledge. Bayes' Theorem is stated as:
P(class | data) = ( P(data | class) * P(class) ) / P(data)
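To verify a hand-written Java implementation against Weka, the built-in NaiveBayes classifier can be run on the same data. The sketch below is an assumption-based example (weka.jar on the classpath; weather.nominal.arff, the golf/weather dataset that ships with Weka, is used as the input):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: train Weka's NaiveBayes on the golf/weather data and report its accuracy,
// so the numbers can be compared with your own Java implementation.
public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class attribute: "play"

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);    // prints the per-class conditional probability tables

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```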
Questions:
1. What is a Naïve Bayes Classifier?
2. What is Statistical Significance?
3. Why Naive Bayes is called Naive?
4. How would you use Naive Bayes classifier for categorical features? What if some
features are numerical?
Signature of Faculty:
Practical – 7
AIM: Implementation of K-mean clustering algorithm using JAVA and verify result with WEKA.
Theory:
K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning and data science. In this topic, we will learn what
the K-Means clustering algorithm is, how the algorithm works, and how it can be
implemented.
It allows us to cluster the data into different groups and provides a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.
Background / Preparation:
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until the cluster assignments no longer change. The value of k
should be predetermined in this algorithm. The algorithm:
Determines the best values for the K center points (centroids) by an iterative process.
Assigns each data point to its closest k-center. The data points which are nearest to
a particular k-center form a cluster.
Hence each cluster contains data points with some commonalities and is kept away from the other
clusters.
Figure: the working of the K-Means Clustering Algorithm.
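Since the AIM asks for a JAVA implementation, the following minimal sketch shows the two alternating steps described above (assignment and centroid update) on a small hard-coded 2-D dataset; the data and the choice k = 2 are assumptions made purely for illustration. The result can then be verified in Weka by clustering the same data with the SimpleKMeans clusterer (setNumClusters(2)) and comparing the centroids.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal from-scratch K-Means sketch on 2-D points (illustrative data; not Weka's implementation).
public class KMeansFromScratch {
    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {3, 4}, {5, 7}, {3.5, 5}, {4.5, 5}, {3.5, 4.5}};
        int k = 2, maxIter = 100;
        Random rnd = new Random(42);

        // Initialise centroids with k randomly chosen data points.
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) centroids[i] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = sqDist(points[p], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            // Update step: move each centroid to the mean of the points assigned to it.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) sum[d] += points[p][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
                }
            }
            if (!changed) break;   // converged: no assignment changed
        }
        System.out.println("Centroids:   " + Arrays.deepToString(centroids));
        System.out.println("Assignments: " + Arrays.toString(assignment));
    }

    // Squared Euclidean distance between two points.
    private static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```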
Signature of Faculty:
Practical – 8
AIM: Demonstration of Decision Tree classification using WEKA.
Theory:
Each node in the tree represents a question derived from the features present in your
dataset. Your dataset is split based on these questions until the maximum depth of the tree
is reached. The last node does not ask a question but represents which class the value
belongs to.
The topmost node in the Decision tree is called the Root node
The bottom-most node is called the Leaf node
A node divided into sub-nodes is called a Parent node. The sub-nodes are
called Child nodes
Background / Preparation:
o "Weka is a free open-source software with a range of built-in machine learning
algorithms that you can access through a graphical user interface!"
o WEKA stands for Waikato Environment for Knowledge Analysis and was developed
at the University of Waikato, New Zealand.
o Weka has multiple built-in functions for implementing a wide range of machine learning
algorithms from linear regression to neural network. This allows you to deploy the most
complex of algorithms on your dataset at just a click of a button! Not only this, Weka
gives support for accessing some of the most common machine learning library
algorithms of Python and R!
o With Weka you can preprocess the data, classify the data, cluster the data and even
visualize the data! This you can do on different formats of data files like ARFF, CSV,
C4.5, and JSON. Weka even allows you to add filters to your dataset through which you
can normalize your data, standardize it, and interchange features between nominal and
numeric values, and what not!
Procedure / Steps:
o I will take the Breast Cancer dataset from the UCI Machine Learning Repository. I
recommend you read about the problem before moving forward.
Let us first load the dataset in Weka. To do that, follow the below steps:
You can view all the features in your dataset on the left-hand side. Weka automatically
creates plots for your features which you will notice as you navigate through your features.
You can even view all the plots together if you click on the “Visualize All” button.
"Reduced Error Pruning Tree (RepTree) is a fast decision tree learner that builds
a decision/regression tree using information gain as the splitting criterion, and
prunes it using the reduced error pruning algorithm."
“Decision tree splits the nodes on all available variables and then selects the split which
results in the most homogeneous sub-nodes.”
You can select your target feature from the drop-down just above the "Start" button. If you
don't do that, WEKA automatically selects the last feature as the target for you.
The “Percentage split” specifies how much of your data you want to keep for training the
classifier. The rest of the data is used during the testing phase to calculate the accuracy of the
model.
With “Cross-validation Fold” you can create multiple samples (or folds) from the training
dataset. If you decide to create N folds, then the model is iteratively run N times. And each
time one of the folds is held back for validation while the remaining N-1 folds are used for
training the model. The result of all the folds is averaged to give the result of cross-
validation.
Using a larger number of cross-validation folds gives a more reliable estimate of the model's
performance, since each estimate is averaged over more train/validation splits drawn from
randomly selected data; the trade-off is a longer running time.
Finally, press the “Start” button for the classifier to do its magic!
Our classifier has got an accuracy of 92.4%. Weka even prints the Confusion matrix for
you which gives different metrics.
Decision trees have a lot of parameters. We can tune these to improve our model's overall
performance. This is where a working knowledge of decision trees really plays a crucial role.
You can access these parameters by clicking on your decision tree algorithm on top:
You can always experiment with different values for these parameters to get the best
accuracy on your dataset.
Weka even allows you to easily visualize the decision tree built on your dataset:
Interpreting these values can be a bit intimidating, but it's actually pretty easy once you get the
hang of it.
The values on the lines joining nodes represent the splitting criteria based on the values in
the parent node feature
In the leaf node:
o The value before the parenthesis denotes the classification value
o The first value in the first parenthesis is the total number of instances from the
training set in that leaf. The second value is the number of instances incorrectly
classified in that leaf
o The first value in the second parenthesis is the total number of instances from the
pruning set in that leaf. The second value is the number of instances incorrectly
classified in that leaf
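Putting the above options together in code form, the following hedged sketch (assuming weka.jar on the classpath and a local copy of the UCI breast-cancer data saved as breast-cancer.arff) trains the same REPTree learner with a 66% percentage split and prints the accuracy and confusion matrix:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: REPTree with a 66% train / 34% test percentage split, mirroring the GUI options.
public class REPTreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new java.util.Random(1));        // shuffle before splitting

        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        REPTree tree = new REPTree();                   // default parameters
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());     // accuracy and error rates
        System.out.println(eval.toMatrixString());      // confusion matrix
    }
}
```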
Questions:
1. What is the Decision Tree Algorithm?
2. List down the attribute selection measures used by the ID3 algorithm to construct a
Decision Tree.
3. List down the different types of nodes in Decision Trees.
Signature of Faculty:
Practical – 9
AIM: Case Study: Do a literature review of a research paper on Data Mining or Web Mining, which must
include any of the special techniques from your syllabus and an advanced topic/improved
technique.
Theory:
o "Research paper." What image comes into mind as you hear those words: working with
stacks of articles and books, hunting the "treasure" of others' thoughts? Whatever image
you create, it's a sure bet that you're envisioning sources of information--articles, books,
people, artworks. Yet a research paper is more than the sum of your sources, more than a
collection of different pieces of information about a topic, and more than a review of the
literature in a field. A research paper analyzes a perspective or argues a point. Regardless
of the type of research paper you are writing, your finished research paper should present
your own thinking backed up by others' ideas and information.
o Discuss with your supervisor: Talk your topic over with your supervisor and resolve your
doubts. If you can't clarify what exactly you require for your work, then ask your supervisor
to help you with an alternative. He or she might also provide you with a list of essential readings.
o Use of computer is recommended: As you are doing research in the field of computer
science then this point is quite obvious. Use right software: Always use good quality
software packages. If you are not capable of judging good software, then you can lose the
quality of your paper unknowingly. There are various programs available to help you
which you can get through the internet.
o Use the internet for help: An excellent start for your paper is using Google. It is a
wondrous search engine, where you can have your doubts resolved. You may also read
some answers for the frequent question of how to write your research paper or find a
model research paper. You can download books from the internet. If you have all the
required books, place importance on reading, selecting, and analyzing the specified
information. Then sketch out your research paper. Use big pictures: You may use
encyclopedias like Wikipedia to get pictures with the best resolution.
o Bookmarks are useful: When you read any book or magazine, you generally use
bookmarks, right? It is a good habit which helps to not lose your continuity. You should
always use bookmarks while searching on the internet also, which will make your search
easier.
o Revise what you wrote: When you write anything, always read it, summarize it, and
then finalize it.
o Make every effort: Make every effort to mention what you are going to write in your
paper. That means always have a good start. Try to mention everything in the
introduction—what is the need for a particular research paper. Polish your work with
good writing skills and always give an evaluator what he wants. Make backups: When
you are going to do any important thing like making a research paper, you should always
have backup copies of it either on your computer or on paper. This protects you from
losing any portion of your important data.
o Produce good diagrams of your own: Always try to include good charts or diagrams in
your paper to improve quality. Using several unnecessary diagrams will degrade the
quality of your paper by creating a hodgepodge. So always try to include diagrams which
were made by you to improve the readability of your paper. Use of direct quotes: When
you do research relevant to literature, history, or current affairs, then use of quotes
becomes essential, but if the study is relevant to science, use of quotes is not preferable.
o Use proper verb tense: Use proper verb tenses in your paper. Use past tense to present
those events that have happened. Use present tense to indicate events that are going on.
Use future tense to indicate events that will happen in the future. Use of wrong tenses
will confuse the evaluator. Avoid sentences that are incomplete.
o Pick a good study spot: Always try to pick a spot for your research which is quiet. Not
every spot is good for studying.
o Know what you know: Always try to know what you know by making objectives,
otherwise you will be confused and unable to achieve your target.
o Use good grammar: Always use good grammar and words that will have a positive
impact on the evaluator; use of good vocabulary does not mean using tough words which
the evaluator has to find in a dictionary. Do not fragment sentences. Eliminate one-word
sentences. Do not ever use a big word when a smaller one would suffice.
o Verbs have to be in agreement with their subjects. In a research paper, do not start
sentences with conjunctions or finish them with prepositions. When writing formally, it is
advisable to never split an infinitive because someone will (wrongly) complain. Avoid
clichés like a disease. Always shun irritating alliteration. Use language which is simple
and straightforward. Put together a neat summary.
o Arrangement of information: Each section of the main body should start with an
opening sentence, and there should be a changeover at the end of the section. Give only
valid and powerful arguments for your topic. You may also maintain your arguments
with records.
o Never start at the last minute: Always allow enough time for research work. Leaving
everything to the last minute will degrade your paper and spoil your work.
o Multitasking in research is not good: Doing several things at the same time is a bad
habit in the case of research activity. Research is an area where everything has a
particular time slot. Divide your research work into parts, and do a particular part in a
particular time slot.
o Never copy others' work: Never copy others' work and give it your name because if the
evaluator has seen it anywhere, you will be in trouble. Take proper rest and food: No
matter how many hours you spend on your research activity, if you are not taking care of
your health, then all your efforts will have been in vain. For quality research, take proper
rest and food.
o Go to seminars: Attend seminars if the topic is relevant to your research area. Utilize all
your resources.
o Refresh your mind after intervals: Try to give your mind a rest by listening to soft
music or sleeping in intervals. This will also improve your memory. Acquire colleagues:
Always try to acquire colleagues. No matter how sharp you are, if you acquire
colleagues, they can give you ideas which will be helpful to your research.
o Think technically: Always think technically. If anything happens, search for its reasons,
benefits, and demerits. Think and then print: When you go to print your paper, check that
tables are not split, headings are not detached from their descriptions, and page sequence
is maintained.
o Adding unnecessary information: Do not add unnecessary information like "I have
used MS Excel to draw graphs." Irrelevant and inappropriate material is superfluous.
Foreign terminology and phrases are not apropos. One should never take a broad view.
Analogy is like feathers on a snake. Use words properly, regardless of how others use
them. Remove quotations. Puns are for kids, not grunt readers. Never oversimplify: When
adding material to your research paper, never go for oversimplification; this will
definitely irritate the evaluator. Be specific. Never use rhythmic redundancies.
Contractions shouldn't be used in a research paper. Comparisons are as terrible as clichés.
Give up ampersands, abbreviations, and so on. Remove commas that are not necessary.
Parenthetical words should be between brackets or commas. Understatement is always
the best way to put forward earth-shaking thoughts. Give a detailed literary review.
o Report concluded results: Use concluded results. From raw data, filter the results, and
then conclude your studies based on measurements and observations taken. An
appropriate number of decimal places should be used. Parenthetical remarks are
prohibited here. Proofread carefully at the final stage. At the end, give an outline to your
arguments. Spot perspectives of further study of the subject. Justify your conclusion at
the bottom sufficiently, which will probably include examples.
o Upon conclusion: Once you have concluded your research, the next most important step
is to present your findings. Presentation is extremely important, as it is the definite
medium through which your research is going to be in print for the rest of the crowd. Care
should be taken to categorize your thoughts well and present them in a logical and neat
manner. A good quality research paper format is essential because it serves to highlight
your research paper and bring to light all necessary aspects of your research.
Signature of Faculty:
Practical – 10
AIM: To perform hands on experiments of data preprocessing with sample data sets on Rapid Miner.
Theory:
As we will see in the following, processes can be produced from a large number
of almost arbitrarily nestable operators and finally be represented by a so-called
process graph (flow design). The process structure is described internally by
XML and developed by means of a graphical user interface. In the background,
RapidMiner Studio constantly checks the process currently being developed for
syntax conformity and automatically makes suggestions in case of problems. This
is made possible by the so-called meta data transformation, which transforms the
underlying meta data at the design stage in such a way that the form of the result
can already be foreseen and solutions can be identified in case of unsuitable
operator combinations (quick fixes). In addition, RapidMiner Studio offers the
possibility of defining breakpoints and therefore of inspecting virtually every
intermediate result. Successful operator combinations can be pooled into building
blocks and are therefore available again in later processes.
RapidMiner Studio contains more than 1500 operations altogether for all tasks
of professional data analysis: from data partitioning, to market basket analysis,
to attribute generation, it includes all the tools you need to make your data work
for you. Methods of text mining, web mining, automatic sentiment analysis of
Internet discussion forums (sentiment analysis, opinion mining), as well as time
series analysis and prediction are also available. RapidMiner Studio
enables us to use strong visualisations like 3-D graphs, scatter matrices
and self-organizing maps. It allows you to turn your data into fully customizable,
exportable charts with support for zooming, panning, and rescaling for maximum
visual impact.
Background / Preparation:
Before we can work with RapidMiner Studio, you of course need to download and
install the software first. You will find it in the download area of the RapidMiner
website:
http://www.rapidminer.com
Download the appropriate installation package for your operating system and
install RapidMiner Studio according to the instructions on the website. All usual
Windows versions are supported as well as Macintosh, Linux or Unix systems.
Please note that an up-to-date Java Runtime (at least version 7) is needed for
the latter.
If you are starting RapidMiner Studio for the first time, you will be asked to
create a new repository (Fig. 10.1). We will limit ourselves to a local repository
on your computer first of all - later on you can then define repositories in the
network, which you can also share with others:
Figure 10.1: Create a local repository on your computer to begin with the first use
of RapidMiner Studio
For a local repository you just need to specify a name (alias) and define any
directory on your hard drive (Fig. 10.2). You can select the directory directly by
clicking on the folder icon on the right. It is advisable to create a new directory
in a convenient place within the file dialog that then appears and then use this
new directory as a basis for your local repository. This repository serves as a
central storage location for your data and analysis processes and will accompany
you in the near future.
Figure 10.2: Definition of a new local repository for storing your data and
analysis processes. It is advisable to create a new directory as a basis.
Perspectives and Views
After choosing the repository you will be welcomed into the Home Perspective
(Fig. 10.3). The right section shows current news about RapidMiner, if you are
connected to the Internet. The list in the centre shows the typical actions, which
you will perform frequently after starting RapidMiner Studio. Here are the details
of those:
1. New Process: Opens the design perspective and creates a new analysis process.
2. Open: Opens a repository browser, if you click on the button. You can choose and open
an existing process in the design perspective. If you click on the arrow button on the right
side, a list of recently opened processes appears. You can select one and it will be opened
in the design perspective. Either way, RapidMiner Studio will then automatically switch
to the Design Perspective.
3. Application Wizard: You can use the Application Wizard to solve typical data mining
problems with your data in three steps. The Direct Marketing Wizard allows you to find
marketing actions with the highest conversion rates. The Predictive Maintenance Wizard
predicts necessary maintenance activities. The Churn Analysis Wizard allows you to
identify which customers are most likely to churn and why. The Sentiment Analysis
Wizard analyses a social media stream and gives you an insight into customers' thinking.
4. Tutorials: Starts a tutorial window which shows several available tutorials from creating
the first analysis process to data transformation. Each tutorial can be used directly within
RapidMiner Studio and gives an introduction to some data mining concepts using a
selection of analysis processes.
Procedure / Steps:
o When you have completed the tutorials, you can use RapidMiner Studio's built-in
samples repository, with explanatory help text, for more practice exercises. The sample
data and processes are located in the Repository panel:
The data folder contains a dozen different data sets, which are used by the sample
exercises. They contain a variety of different data types.
The processes folder contains over 130 sample processes, organized by function, that
demonstrate preprocessing, visualization, clustering, and many other topics. You can open a
sample process in two ways:
o double-click it to display the individual operators with help text; this method is best
for learning.
o drag and drop it into the process panel to have the process immediately available for running.
Signature of Faculty: