Data Mining Lab Questions
Set-1
1. Convert .arff file into .csv file using Java-WEKA API
Procedure:
1) Open Eclipse.
2) File -> New -> Java Project, and name it.
3) In that project, create a new folder and name it lib.
4) Copy the weka.jar file and paste it into the lib folder.
5) Refresh Eclipse by pressing F5, expand the lib folder, right-click on weka.jar -> Build Path -> Add to Build Path.
6) Create a new class in the project and write the respective code.
7) Run it.
(PS: Give the file paths correctly.)
ArffToCsv.java
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.CSVSaver;
import java.io.File;

public class ArffToCsv {
    public static void main(String[] args) throws Exception {
        // load the ARFF file (adjust the path to your system)
        ArffLoader loader = new ArffLoader();
        loader.setSource(new File("C:\\Users\\vinay\\eclipse-workspace\\CsvToArff\\Desktop\\Test_arff.arff"));
        Instances data = loader.getDataSet();
        // save the instances as CSV; setFile also sets the destination
        CSVSaver saver = new CSVSaver();
        saver.setInstances(data);
        saver.setFile(new File("airline.csv"));
        saver.writeBatch();
        System.out.println("Success");
    }
}
output:
Success
LinearRegression.R
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # height in cm
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # weight in kg
relation <- lm(y ~ x)
print(relation)
print(summary(relation))
# predict the weight for a height of 170 cm
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
# plot the chart and save it as a PNG
png(file = "linearregression.png")
plot(x, y, col = "blue", main = "Regression",
     cex = 1.3, pch = 16, xlab = "Height in cm", ylab = "Weight in kg")
abline(relation)
dev.off()
output: model coefficients, summary, and the predicted value; the regression plot is saved as linearregression.png
Set-2
1. Convert .csv file into .arff file using Java-WEKA API
Procedure:
1) Open Eclipse.
2) File -> New -> Java Project, and name it.
3) In that project, create a new folder and name it lib.
4) Copy the weka.jar file and paste it into the lib folder.
5) Refresh Eclipse by pressing F5, expand the lib folder, right-click on weka.jar -> Build Path -> Add to Build Path.
6) Create a new class in the project and write the respective code.
7) Run it. (PS: Give the file paths correctly.)
CsvToArff.java
import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.core.converters.ArffSaver;
import java.io.File;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // load the CSV file (adjust the path to your system)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("C:\\Users\\vinay\\OneDrive\\Documents\\R\\win-library\\3.5\\Hmisc\\tests\\csv\\TEST.csv"));
        Instances data = loader.getDataSet();
        // save the instances as ARFF; setFile also sets the destination
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("Test_arff.arff"));
        saver.writeBatch();
        System.out.println("Success");
    }
}
output:
Success
2. K-NN in WEKA
Procedure:
1) Open WEKA and load any categorical .arff file.
2) Go to the Classify tab.
3) In the top-left corner, click Choose.
4) Expand lazy and select IBk.
5) Click Start.
NAME
weka.classifiers.lazy.IBk
SYNOPSIS
K-nearest neighbours classifier. Can select an appropriate value of K based
on cross-validation. Can also do distance weighting.
Set-3
1. Decision Tree using J48 in WEKA
NAME
weka.classifiers.trees.J48
SYNOPSIS
Class for generating a pruned or unpruned C4.5 decision tree.
Set-4
1. Perform K-NN classification of Time series data using R-Tool
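No script is given for this question; a minimal sketch of 1-NN classification with a DTW distance, assuming the synthetic_control data used later in this manual (one time series per row, six classes of 100 consecutive rows each), might look like:
library(dtw)     # loading dtw registers the "DTW" distance with the proxy package
library(proxy)
sc <- read.table("synthetic_control.txt", header = FALSE)  # one time series per row
labels <- rep(1:6, each = 100)                             # known class layout
set.seed(1)
idx <- sample(nrow(sc), 500)                               # random train/test split
d <- proxy::dist(sc[-idx, ], sc[idx, ], method = "DTW")    # test-by-train DTW distances
pred <- labels[idx][apply(d, 1, which.min)]                # 1-NN: label of the nearest training series
table(pred, labels[-idx])                                  # confusion table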
c. Discretisation
NAME
weka.filters.unsupervised.attribute.Discretize
SYNOPSIS
An instance filter that discretizes a range of numeric attributes in the
dataset into nominal attributes. Discretization is by simple binning.
Skips the class attribute if set.
d. Converting nominal attributes to binary attributes
NAME
weka.filters.unsupervised.attribute.NominalToBinary
SYNOPSIS
Converts all nominal attributes into binary numeric attributes. An
attribute with k values is transformed into k binary attributes if the
class is nominal (using the one-attribute-per-value approach). Binary
attributes are left binary if option '-A' is not given. If the class is
numeric, you might want to use the supervised version of this filter.
e. Standardisation
NAME
weka.filters.unsupervised.attribute.Standardize
SYNOPSIS
Standardizes all numeric attributes in the given dataset to have zero
mean and unit variance (apart from the class attribute, if set).
Set-5
1. Random Forest in R
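No script is given for this question; a minimal sketch using the randomForest package on the built-in iris data (package and data choice assumed) might be:
library(randomForest)
data(iris)
set.seed(42)
idx <- sample(nrow(iris), 0.7 * nrow(iris))   # 70/30 train/test split
train <- iris[idx, ]; test <- iris[-idx, ]
fit <- randomForest(Species ~ ., data = train, ntree = 500)
print(fit)                                    # OOB error estimate and confusion matrix
pred <- predict(fit, newdata = test)
table(pred, test$Species)                     # test-set confusion table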
2. Perform the following preprocessing operations in WEKA
a. Attribute selection
NAME
weka.filters.supervised.attribute.AttributeSelection
SYNOPSIS
A supervised attribute filter that can be used to select attributes. It is
very flexible and allows various search and evaluation methods to be
combined.
b. Normalisation
NAME
weka.filters.unsupervised.attribute.Normalize
SYNOPSIS
Normalizes all numeric values in the given dataset (apart from the class
attribute, if set). By default, the resulting values are in [0,1] for the data
used to compute the normalization intervals. But with the scale and
translation parameters one can change that, e.g., with scale = 2.0 and
translation = -1.0 you get values in the range [-1,+1].
c. Outlier detection
NAME
weka.filters.unsupervised.attribute.InterquartileRange
SYNOPSIS
A filter for detecting outliers and extreme values based on interquartile
ranges. The filter skips the class attribute.
d. Discretisation
NAME
weka.filters.unsupervised.attribute.Discretize
SYNOPSIS
An instance filter that discretizes a range of numeric attributes in the
dataset into nominal attributes. Discretization is by simple binning.
Skips the class attribute if set.
e. Handle missing values
NAME
weka.filters.unsupervised.attribute.ReplaceMissingValues
SYNOPSIS
Replaces all missing values for nominal and numeric attributes in a
dataset with the modes and means from the training data. The class
attribute is skipped by default.
Set-6
1. Analyse time series data using Dynamic Time Warping using R-Tool
2. Naive Bayes in WEKA
NAME
weka.classifiers.bayes.NaiveBayes
SYNOPSIS
Class for a Naive Bayes classifier using estimator classes. Numeric
estimator precision values are chosen based on analysis of the training
data. For this reason, the classifier is not an UpdateableClassifier (which
in typical usage are initialized with zero training instances) -- if you need
the UpdateableClassifier functionality, use the NaiveBayesUpdateable
classifier. The NaiveBayesUpdateable classifier will use a default
precision of 0.1 for numeric attributes when buildClassifier is called
with zero training instances.
OPTIONS
useKernelEstimator -- Use a kernel estimator for numeric attributes
rather than a normal distribution.
numDecimalPlaces -- The number of decimal places to be used for the
output of numbers in the model.
batchSize -- The preferred number of instances to process if batch
prediction is being performed. More or fewer instances may be
provided, but this gives implementations a chance to specify a
preferred batch size.
debug -- If set to true, classifier may output additional info to the
console.
displayModelInOldFormat -- Use old format for model output. The old
format is better when there are many class values. The new format is
better when there are fewer classes and many attributes.
doNotCheckCapabilities -- If set, classifier capabilities are not checked
before classifier is built (Use with caution to reduce runtime).
useSupervisedDiscretization -- Use supervised discretization to convert
numeric attributes to nominal ones
Set-7
1. Perform time series decomposition and forecasting in R
2. Adaboost in WEKA
NAME
weka.classifiers.meta.AdaBoostM1
SYNOPSIS
Class for boosting a nominal class classifier using the Adaboost M1
method. Only nominal class problems can be tackled. Often
dramatically improves performance, but sometimes overfits.
Set-8
1. Classify Time series data using R-tool
2. Bagging in WEKA
NAME
weka.classifiers.meta.Bagging
SYNOPSIS
Class for bagging a classifier to reduce variance. Can do classification
and regression depending on the base learner.
Set-9
1. Perform hierarchical clustering on time series data in R
SentimentAnalysis.R
# Authentication keys (placeholders; register a Twitter app to obtain real values)
consumer_key <- 'ABCDEFGHI1234567890'
consumer_secret <- 'ABCDEFGHI1234567890'
access_token <- 'ABCDEFGHI1234567890'
access_secret <- 'ABCDEFGHI1234567890'
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
access_secret)
tweets <- userTimeline("realDonaldTrump", n=200)
n.tweet <- length(tweets)
tweets.df <- twListToDF(tweets)
head(tweets.df)
#Load the file from hard disk
tweets.df=read.csv("D:\\two\\Sentiment.csv")
head(tweets.df$text)
dim(tweets.df)
head(tweets.df$text)
tweets.df2 <- gsub("http.*","",tweets.df$text)
tweets.df2 <- gsub("https.*","",tweets.df2)
tweets.df2 <- gsub("#.*","",tweets.df2)
#tweets.df2 <- gsub("#.*","",tweets.df$text)
tweets.df2 <- gsub("@.*","",tweets.df2)
#To match occurrences of any word preceded by the @ symbol
#gsub("@[A-Za-z]+","","hai @hello cbit")
#to match a single backslash in the source, the regex pattern needs four backslashes
tweets.df2 <- gsub("\\\\","",tweets.df2)
head(tweets.df2)
#Getting sentiment score for each tweet
word.df <- as.vector(tweets.df2)
emotion.df <- get_nrc_sentiment(word.df)
#cbind in R appends or combines vectors, matrices or data frames by columns.
emotion.df2 <- cbind(tweets.df2, emotion.df)
head(emotion.df2,100)
#get_sentiment function to extract the sentiment score for each of the tweets
sent.value <- get_sentiment(word.df)
most.positive <- word.df[sent.value == max(sent.value)]
most.positive
most.negative <- word.df[sent.value <= min(sent.value)]
most.negative
sent.value
#segregate positive and negative tweets based on the score assigned to each of the tweets
positive.tweets <- word.df[sent.value > 0]
head(positive.tweets)
#Negative Tweets
negative.tweets <- word.df[sent.value < 0]
head(negative.tweets)
#Neutral tweets
neutral.tweets <- word.df[sent.value == 0]
head(neutral.tweets)
# Alternate way to classify as Positive, Negative or Neutral tweets
category_senti <- ifelse(sent.value < 0, "Negative", ifelse(sent.value > 0,
"Positive", "Neutral"))
head(category_senti)
table(category_senti)
2. K-Means in WEKA
NAME
weka.clusterers.SimpleKMeans
SYNOPSIS
Cluster data using the k means algorithm. Can use either the Euclidean
distance (default) or the Manhattan distance. If the Manhattan distance
is used, then centroids are computed as the component-wise median
rather than mean.
Set-12
1. Write a program in R-tool for displaying word cloud
install.packages('Scale', dependencies=TRUE,
repos='http://cran.rstudio.com/')
install.packages('tm', dependencies=TRUE,
repos='http://cran.rstudio.com/')
options(repos='http://cran.rstudio.com/')
install.packages("SnowballC")
#Functionality to create pretty word clouds, visualize differences and
#similarity between documents, and avoid over-plotting in scatter plots with text.
install.packages("wordcloud")
#Provides color schemes for maps (and other graphics) designed by Cynthia Brewer
#http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
install.packages("RColorBrewer")
#The stringr package provides a cohesive set of functions
#designed to make working with strings as easy as possible
install.packages("stringr")
#Provides an interface to the Twitter web API.
install.packages("twitteR")
library(Scale)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(stringr)
library(twitteR)
load("rdmTweets-201306.RData")
#Data frame
data <- data.frame(word = names(v),freq=v)
head(data, 10)
#generate the wordcloud
#words : the words to be plotted
#freq : their frequencies
#min.freq : words with frequency below min.freq will not be plotted
#max.words : maximum number of words to be plotted
#random.order : plot words in random order. If false, they will be plotted in decreasing frequency
#rot.per : proportion words with 90 degree rotation (vertical text)
#colors : color words from least to most frequent. Use, for example, colors = "black" for a single color.
set.seed(1056)
wordcloud(words = data$word, freq = data$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
2. Generate association rules using Apriori and FP-Growth in WEKA
APRIORI
NAME
weka.associations.Apriori
SYNOPSIS
Class implementing an Apriori-type algorithm. Iteratively reduces the
minimum support until it finds the required number of rules with the
given minimum confidence. The algorithm has an option to mine class
association rules. It is adapted as explained in the second reference.
FP-GROWTH
NAME
weka.associations.FPGrowth
SYNOPSIS
Class implementing the FP-growth algorithm for finding large item sets
without candidate generation. Iteratively reduces the minimum support
until it finds the required number of rules with the given minimum
metric.
Set-13
1. Generate Association rules using Apriori in R-Tool
library(arules)
library(arulesViz)
library(datasets)
data("Groceries")
itemFrequencyPlot(Groceries,topN=20,type="absolute")
rules<-apriori(Groceries,parameter=list(supp=0.001,conf=0.6))
options(digits=2)
inspect(rules[1:5])
plot(rules,method="graph",interactive=TRUE,shading="confidence")
#sorting rules
rules<-sort(rules,by="confidence",decreasing=TRUE)
#using appearance
rules<-apriori(Groceries,parameter=list(supp=0.001,conf=0.6),
appearance=list(default="lhs",rhs="whole milk"),
control=list(verbose=F))
rules<-sort(rules,by="confidence",decreasing=TRUE)
options(digits=2)
inspect(rules[1:5])
#plotting
plot(rules,method="graph",interactive=TRUE,shading=NA)
2. Adaboost in R
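No script is given for this question; a minimal sketch using the adabag package (package and data choice assumed) might be:
library(adabag)
data(iris)
set.seed(42)
idx <- sample(nrow(iris), 0.7 * nrow(iris))              # 70/30 train/test split
train <- iris[idx, ]; test <- iris[-idx, ]
fit <- boosting(Species ~ ., data = train, mfinal = 50)  # AdaBoost.M1 with 50 trees
pred <- predict(fit, newdata = test)
pred$confusion                                           # confusion matrix on the test set
pred$error                                               # test error rate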
Set-14
1. Generate Association rules using FP-Growth in R-Tool
library(rCBA)
data("iris")
# build an FP-Growth-based classifier on a random sample of 20 rows
classifier <- rCBA::buildFPGrowth(iris[sample(nrow(iris),20),], "Species", parallel=FALSE)
model <- classifier$model
predictions <- rCBA::classification(iris, model)
table(predictions)
# accuracy: fraction of correctly classified instances
sum(as.character(iris$Species)==as.character(predictions), na.rm=TRUE)/length(predictions)
2. K-NN in R
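Note that the script below actually runs k-means on the wholesale-customers data; a minimal k-NN classification sketch using the class package (package and data choice assumed) would be:
library(class)
data(iris)
set.seed(42)
idx <- sample(nrow(iris), 0.7 * nrow(iris))        # 70/30 train/test split
train <- iris[idx, 1:4]; test <- iris[-idx, 1:4]   # numeric predictors only
cl <- iris$Species[idx]                            # training labels
pred <- knn(train, test, cl, k = 3)                # classify each test row by its 3 nearest neighbours
table(pred, iris$Species[-idx])                    # confusion table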
data <-read.csv("Wholesale customers data.csv",header=T)
summary(data)
top.n.custs <- function (data, cols, n=5) { #Requires a data frame and the top N to remove
  idx.to.remove <- integer(0) #Initialize a vector to hold customers being removed
  for (c in cols){ #For every column in the data we passed to this function
    col.order <- order(data[,c], decreasing=T) #Sort column "c" in descending order (bigger on top)
    #order returns the sorted index (e.g. row 15, 3, 7, 1, ...) rather than the actual values sorted.
    idx <- head(col.order, n) #Take the first n row ids of the sorted column c
    idx.to.remove <- union(idx.to.remove, idx) #Combine and de-duplicate the row ids to be removed
  }
  return(idx.to.remove) #Return the indexes of customers to be removed
}
top.custs <-top.n.custs(data,cols=3:8,n=5)
length(top.custs) #How Many Customers to be Removed?
data[top.custs,] #Examine the customers
data.rm.top <-data[-c(top.custs),] #Remove the Customers
set.seed(76964057) #Set the seed for reproducibility
k <- kmeans(data.rm.top[,-c(1,2)], centers=5) #Create 5 clusters, removing columns 1 and 2 (Channel, Region)
k$centers #Display cluster centers
table(k$cluster) #Give a count of data points in each cluster
rng<-2:20 #K from 2 to 20
tries<-100 #Run the K Means algorithm 100 times
avg.totw.ss <- integer(length(rng)) #Set up an empty vector to hold the average total within-cluster SS for each k
for(v in rng){ # For each value of the range variable
v.totw.ss<-integer(tries) #Set up an empty vector to hold the 100 tries
for(i in 1:tries){
k.temp<-kmeans(data.rm.top,centers=v) #Run kmeans
v.totw.ss[i]<-k.temp$tot.withinss#Store the total withinss
}
avg.totw.ss[v-1]<-mean(v.totw.ss) #Average the 100 total withinss
}
plot(rng, avg.totw.ss, type="b", main="Total Within SS by Various K",
     ylab="Average Total Within Sum of Squares",
     xlab="Value of K")
Set-15
1. Build a Decision Tree Classifier in R-Tool using the packages Party.
library(readr)
library(dplyr)
library(party)
library(rpart)
library(rpart.plot)
library(ROCR)
library(magrittr)
titanic3<-"https://goo.gl/At238b"%>%
read.csv%>%
select(survived,embarked,sex,sibsp,parch,fare)%>%
mutate(embarked=factor(embarked),sex=factor(sex))
#splitting data
.data<-c("training","test")%>%
sample(nrow(titanic3),replace=TRUE)%>%
split(titanic3,.)
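The listing above only loads and splits the data; a hedged continuation that actually fits a conditional-inference tree with party::ctree (assuming the .data split from above) might look like:
# fit a conditional-inference tree on the training split (assumed continuation)
train <- .data$training
train$survived <- factor(train$survived)          # treat survival as a class label
model <- ctree(survived ~ ., data = train)
plot(model)                                       # visualize the tree
pred <- predict(model, newdata = .data$test)
table(pred, .data$test$survived)                  # confusion table on the test split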
Diana.R
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
df <- USArrests
df
df <- na.omit(df)
df <- scale(df)
head(df)
# Dissimilarity matrix
d <- dist(df, method = "euclidean")
# compute divisive hierarchical clustering
hc4 <- diana(df)
# plot dendrogram
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")
Set-16
1. Build a Decision Tree Classifier in R-Tool using the packages caret.
infogain.R
library(caret)
library(rpart.plot)
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
download.file(url = data_url, destfile = "car.data")
car_df <- read.csv("car.data", sep = ',', header = FALSE)
str(car_df)
head(car_df)
set.seed(3033)
intrain <- createDataPartition(y = car_df$V7, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
#check dimensions of train & test set
dim(training); dim(testing);
anyNA(car_df)
summary(car_df)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10)
dtree_fit
gini.R
library(caret)
library(rpart.plot)
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
download.file(url = data_url, destfile = "car.data")
car_df <- read.csv("car.data", sep = ',', header = FALSE)
str(car_df)
head(car_df)
set.seed(3033)
intrain <- createDataPartition(y = car_df$V7, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
#check dimensions of train & test set
dim(training); dim(testing);
anyNA(car_df)
summary(car_df)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit_gini <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "gini"),
trControl=trctrl,
tuneLength = 10)
dtree_fit_gini
prp(dtree_fit_gini$finalModel, box.palette = "Reds", tweak = 1.2)
test_pred_gini <- predict(dtree_fit_gini, newdata = testing)
confusionMatrix(test_pred_gini, testing$V7 ) #check accuracy
2. Naive Bayes in R
#install.packages("caret")
library(caTools)
library(caret)
library(e1071)
## Load vcd package
library(vcd)
df<-data.frame(Arthritis)
nrow(Arthritis)
ind<-sample(nrow(Arthritis),floor(nrow(Arthritis)*0.3))
ind
train<-df[ind,]
train
train$Improved
test<-df[-ind,]
x_train<-train[,-5] #predictors only: drop the Improved target column
x_train
y_train<-train$Improved
y_train
x_test<-test[,-5] #predictors only: drop the Improved target column
x_test
y_test<-test$Improved
y_test
classifier=naiveBayes(x_train,y_train,laplace=0)
classifier
predictions<-predict(classifier,x_test)
conf<-confusionMatrix(predictions,y_test)
print(conf)
Set-17
1. Build a Decision Tree Classifier in R-Tool using the packages rpart.
library(readr)
library(dplyr)
library(party)
library(rpart)
library(rpart.plot)
library(ROCR)
library(magrittr)
titanic3<-"https://goo.gl/At238b"%>%
read.csv%>%
select(survived,embarked,sex,sibsp,parch,fare)%>%
mutate(embarked=factor(embarked),sex=factor(sex))
#splitting data
.data<-c("training","test")%>%
sample(nrow(titanic3),replace=TRUE)%>%
split(titanic3,.)
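As in Set-15, the listing stops after the split; a hedged continuation that fits and evaluates an rpart tree (assuming the .data split from above) might be:
train <- .data$training
train$survived <- factor(train$survived)                   # treat survival as a class label
fit <- rpart(survived ~ ., data = train, method = "class")
rpart.plot(fit)                                            # visualize the tree
pred <- predict(fit, newdata = .data$test, type = "class")
table(pred, .data$test$survived)                           # confusion table on the test split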
2. K-Means in R
## ----setup, include = FALSE----------------------------------------------
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
library(tidyr)
set.seed(27)
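The setup above only loads packages and sets a seed; a minimal k-means sketch on the iris measurements (assumed continuation) might be:
pts <- iris %>% select(Sepal.Length, Sepal.Width)   # two numeric columns, easy to plot
km <- kmeans(pts, centers = 3, nstart = 25)         # 3 clusters, 25 random restarts
ggplot(pts, aes(Sepal.Length, Sepal.Width, colour = factor(km$cluster))) +
  geom_point() +
  labs(colour = "cluster")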
data("Groceries")
itemFrequencyPlot(Groceries,topN=20,type="absolute")
itemsets<-eclat(Groceries,parameter=list(supp=0.001,maxlen=3))
rules<-ruleInduction(itemsets,Groceries,confidence=0.6)
inspect(rules[1:5])
plot(rules,method="graph",interactive=TRUE,shading=NA
2. Bagging in R
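No script is given; a minimal sketch using ipred::bagging (package and data choice assumed) might be:
library(ipred)
data(iris)
set.seed(42)
fit <- bagging(Species ~ ., data = iris, coob = TRUE)  # coob=TRUE: out-of-bag error estimate
print(fit)
pred <- predict(fit, newdata = iris)
table(pred, iris$Species)                              # resubstitution confusion table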
Set-19
1. Hierarchical Clustering in WEKA
2. Data visualisation and exploration in R
a. Read the dataset into R-Dataframe
b. Get the first 5 rows
c. Correlation of two dimensions
d. Histogram of an attribute
e. Cleveland Dot Charts
f. Bar Charts
g. Pie chart
h. Line charts for both numeric and categorical dimensions
(Note your observations, comment on the data distribution, try plotting commands for
different kinds of dimensions, and try different plotting function options: symbols,
size of plotting symbol, legends, x/y-axis labels, titles of graphs, etc. A sample
script covering these steps is sketched below.)
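A hedged sketch of steps a-h, assuming a CSV file data.csv with numeric columns num1 and num2 and a categorical column cat1 (all names hypothetical):
df <- read.csv("data.csv")                             # a. read the dataset into a data frame (hypothetical file)
head(df, 5)                                            # b. first 5 rows
cor(df$num1, df$num2)                                  # c. correlation of two numeric dimensions
hist(df$num1, main="Histogram of num1", xlab="num1")   # d. histogram of an attribute
dotchart(df$num1, main="Cleveland Dot Chart of num1")  # e. Cleveland dot chart
barplot(table(df$cat1), main="Bar Chart of cat1")      # f. bar chart
pie(table(df$cat1), main="Pie Chart of cat1")          # g. pie chart
plot(df$num1, type="l", main="Line Chart of num1")     # h. line chart (numeric)
plot(as.integer(factor(df$cat1)), type="l",
     main="Line Chart of cat1 codes")                  # h. line chart (categorical, via integer codes)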
Set-20
1. DBSCAN in WEKA
Set-21
1. Convert .csv file into .arff file using WEKA
Procedure:
1) Open the Weka GUI Chooser.
2) Click on Tools.
3) Click on ArffViewer.
4) Click File -> Open and open a .csv file (change the file-type filter to CSV).
5) Click File -> Save As and save it as an .arff file.
Timeseries.R
a <- ts(1:20, frequency = 12, start = c(2011, 3))
print(a)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2011           1   2   3   4   5   6   7   8   9  10
## 2012  11  12  13  14  15  16  17  18  19  20
str(a)
## Time-Series [1:20] from 2011 to 2013: 1 2 3 4 5 6 7 8 9 10...
attributes(a)
## $tsp
## [1] 2011 2013 12
##
## $class
## [1] "ts"
apts <- ts(AirPassengers, frequency = 12)
f <- decompose(apts)
plot(f$figure, type = "b")
plot(f)
# build an ARIMA model
fit <- arima(AirPassengers, order = c(1, 0, 0), list(order = c(2,
1, 0), period = 12))
fore <- predict(fit, n.ahead = 24)
# error bounds at 95% confidence level
U <- fore$pred + 2 * fore$se
L <- fore$pred - 2 * fore$se
ts.plot(AirPassengers, fore$pred, U, L,
col = c(1, 2, 4, 4), lty = c(1, 1, 2, 2))
legend("topleft", col = c(1, 2, 4), lty = c(1, 1, 2),
c("Actual", "Forecast", "Error Bounds (95% Confidence)"))
library(dtw)
idx <- seq(0, 2 * pi, len = 100)
a <- sin(idx) + runif(100)/10
b <- cos(idx)
align <- dtw(a, b, step = asymmetricP1, keep = T)
dtwPlotTwoWay(align)
# read data into R; sep = "" means the separator is white space, i.e. one
# or more spaces, tabs, newlines or carriage returns
sc <- read.table("synthetic_control.txt", header = F)
sc
# show one sample from each class
idx <- c(1, 101, 201, 301, 401, 501)
sample1 <- t(sc[idx, ])
plot.ts(sample1, main = "")
# sample n cases from every class
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100 + s, 200 + s, 300 + s, 400 + s, 500 + s)
sample2 <- sc[idx, ]
observedLabels <- rep(1:6, each = n)
# hierarchical clustering with Euclidean distance
hc <- hclust(dist(sample2), method = "ave")
plot(hc, labels = observedLabels, main = "")
# cut tree to get 8 clusters
memb <- cutree(hc, k = 8)
table(observedLabels, memb)
## memb
## observedLabels 1 2 3 4 5 6 7 8
## 1 10 0 0 0 0 0 0 0
## 2 0 3 1 1 3 2 0 0
## 3 0 0 0 0 0 0 10 0
## 4 0 0 0 0 0 0 0 10
## 5 0 0 0 0 0 0 10 0
## 6 0 0 0 0 0 0 0 10
myDist <- dist(sample2, method = "DTW")
hc <- hclust(myDist, method = "average")
plot(hc, labels = observedLabels, main = "")
# cut tree to get 8 clusters
memb <- cutree(hc, k = 8)
table(observedLabels, memb)