8 - Cia 3 Key
Part-A
1. Can relational algebra operations be performed using MapReduce? Justify the above statement.
Yes. Selection, projection, union, intersection, difference and natural join can each be expressed as MapReduce computations.
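As an illustration of the natural-join case, here is a minimal in-memory sketch of the map, shuffle and reduce steps in plain R. The relations R(A, B) and S(B, C), their contents and the helper name reduce_join are invented for the example; no Hadoop is involved.

```r
# Hypothetical toy relations R(A, B) and S(B, C); the data is made up.
R <- data.frame(A = c(1, 2, 3), B = c("x", "y", "x"), stringsAsFactors = FALSE)
S <- data.frame(B = c("x", "x", "z"), C = c(10, 20, 30), stringsAsFactors = FALSE)

# Map: emit (key = B, value = (relation name, other attribute)) for every tuple.
map_R <- lapply(seq_len(nrow(R)), function(i) list(key = R$B[i], rel = "R", val = R$A[i]))
map_S <- lapply(seq_len(nrow(S)), function(i) list(key = S$B[i], rel = "S", val = S$C[i]))
pairs <- c(map_R, map_S)

# Shuffle: group the emitted pairs by key.
groups <- split(pairs, sapply(pairs, function(p) p$key))

# Reduce: for one key, pair every value from R with every value from S.
reduce_join <- function(key, values) {
  a_vals <- unlist(lapply(values, function(v) if (v$rel == "R") v$val))
  c_vals <- unlist(lapply(values, function(v) if (v$rel == "S") v$val))
  if (length(a_vals) == 0 || length(c_vals) == 0) return(NULL)
  out <- expand.grid(A = a_vals, C = c_vals)
  data.frame(A = out$A, B = key, C = out$C, stringsAsFactors = FALSE)
}
joined <- do.call(rbind, unname(Map(reduce_join, names(groups), groups)))
print(joined)
```

Only the key "x" appears in both relations, so the keys "y" and "z" produce no output tuples.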
2. List three methods used to integrate R with MapReduce.
RHIPE, RHadoop, Hadoop Streaming.
3. Suppose we have x targets and y darts. The probability that none of the y darts hits a given
target is ((x-1)/x)^y.
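A quick numeric check of this formula in R; the values of x and y are illustrative and not part of the question. For large x the expression is close to exp(-y/x), which the sketch compares against.

```r
# Probability that a given target is missed by all y darts, with x targets.
p_none <- function(x, y) ((x - 1) / x)^y

x <- 1000; y <- 2000     # illustrative values (assumption)
exact  <- p_none(x, y)
approx <- exp(-y / x)    # large-x approximation of the same quantity
print(c(exact = exact, approx = approx))
```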
4. Compute the second-order moment for the stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b.
5^2 + 4^2 + 3^2 + 3^2 = 59.
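The same computation can be sketched in R: count the occurrences of each element and sum the squared counts.

```r
stream <- c("a","b","c","b","d","a","c","d","a","b","d","c","a","a","b")

# Frequency of each distinct element: a = 5, b = 4, c = 3, d = 3.
counts <- table(stream)

# Second-order moment: sum of the squared frequencies.
second_moment <- sum(counts^2)
print(second_moment)
```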
5. The random surfer's distribution approaches a limiting distribution provided two conditions are met:
1. The graph is strongly connected; that is, it is possible to get from any node to any other node.
2. There are no dead ends: nodes that have no arcs out.
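Assuming these are the conditions for the random-surfer (PageRank) iteration to converge, a minimal sketch in R on a small four-node graph that satisfies both of them. The graph (A to B, C, D; B to A, D; C to A; D to B, C) is an illustrative assumption, not taken from the question.

```r
# Column-stochastic transition matrix: column j holds the out-link
# probabilities of node j. Every column sums to 1, so there are no dead ends.
M <- matrix(c(0,   1/2, 1, 0,
              1/3, 0,   0, 1/2,
              1/3, 0,   0, 1/2,
              1/3, 1/2, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("A","B","C","D"), c("A","B","C","D")))
no_dead_ends <- all(abs(colSums(M) - 1) < 1e-12)

# Power iteration: start from the uniform distribution and apply M repeatedly.
v <- rep(1/4, 4)
for (i in 1:100) v <- as.vector(M %*% v)
names(v) <- rownames(M)
print(v)
```

For this graph the iteration settles at 3/9 for A and 2/9 for each of B, C and D.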
6. RMSE (root-mean-square error) is the error metric.
7. Infer the long tail phenomenon in recommender systems with an example.
The vertical axis represents popularity (the number of times an item is chosen). The items are
ordered on the horizontal axis according to their popularity. Physical institutions provide only
the most popular items to the left of the vertical line, while the corresponding on-line institutions
provide the entire range of items: the tail as well as the popular items.
8. Let Cut(S, T) be the number of edges that connect a node in S to a node in T. Then the
normalized cut value for S and T is Cut(S, T)/Vol(S) + Cut(S, T)/Vol(T).
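A small worked example in R, taking Vol(S) to be the number of edges with at least one end in S (stated here as an assumption, since the answer does not define Vol). The graph, two triangles joined by a single edge, is invented for illustration.

```r
# Hypothetical graph: triangles {1,2,3} and {4,5,6} joined by the edge 3-4.
edges <- data.frame(from = c(1, 1, 2, 4, 4, 5, 3),
                    to   = c(2, 3, 3, 5, 6, 6, 4))
S     <- c(1, 2, 3)
T_set <- c(4, 5, 6)

# Cut(S, T): edges with one end in S and the other in T.
cut_ST <- sum((edges$from %in% S & edges$to %in% T_set) |
              (edges$from %in% T_set & edges$to %in% S))

# Vol(X): edges with at least one end in X (assumed definition).
vol <- function(nodes) sum(edges$from %in% nodes | edges$to %in% nodes)

ncut <- cut_ST / vol(S) + cut_ST / vol(T_set)
print(ncut)
```

Here Cut(S, T) = 1 and Vol(S) = Vol(T) = 4, so the normalized cut is 1/4 + 1/4 = 0.5.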
9. Give the R command for the correlation coefficient of a dataset.
cor(dataframe$attribute1, dataframe$attribute2)
10. employee<-data.frame(Name=c("A","B","C","D","E"),employeeid=c(1,2,3,4,5),age=c(22,21,22,21,22),
gender=c("m","m","m","m","m"),designation=c("xre","sde","ca","sme","pl"))
Part-B
Count the number of 1s in the bitstream 100101011001011 10101010101011 1010101010111
1010101 110101 01011 0010. What happens when the next bits 1, 1, 0 and 1 arrive? Use the DGIM
algorithm.
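The DGIM bucket rule can be sketched in R as follows. This is a minimal illustration, not the worked answer for the bitstream above: the window length N, the short test stream and the function names are all assumptions made for the example.

```r
# Buckets are kept newest first; each stores the timestamp of its most
# recent 1 and its size (the number of 1s it represents).
new_dgim <- function() data.frame(time = numeric(0), size = numeric(0))

dgim_update <- function(buckets, t, bit, N) {
  # Drop buckets whose timestamp has left the window of the last N bits.
  buckets <- buckets[buckets$time > t - N, , drop = FALSE]
  if (bit == 1) {
    # A new 1 always starts a bucket of size 1.
    buckets <- rbind(data.frame(time = t, size = 1), buckets)
    size <- 1
    # Whenever three buckets share a size, merge the two oldest into one
    # bucket of twice the size, keeping the more recent timestamp.
    while (sum(buckets$size == size) == 3) {
      idx <- which(buckets$size == size)
      buckets$size[idx[2]] <- 2 * size
      buckets <- buckets[-idx[3], , drop = FALSE]
      size <- 2 * size
    }
  }
  buckets
}

# Estimate: all bucket sizes, counting only half of the oldest bucket.
dgim_estimate <- function(buckets) {
  if (nrow(buckets) == 0) return(0)
  sum(buckets$size) - buckets$size[nrow(buckets)] / 2
}

bits <- c(1, 0, 1, 1, 0, 1)   # hypothetical short stream
buckets <- new_dgim()
for (t in seq_along(bits)) buckets <- dgim_update(buckets, t, bits[t], N = 100)
print(buckets)
print(dgim_estimate(buckets))
```

Each arriving 1 creates a size-1 bucket and may trigger a chain of merges; an arriving 0 changes nothing except possibly expiring the oldest bucket.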
[Figure: graph diagram with nodes B, C, D, E, F, G, H, I, J and K carrying numeric node labels and edge labels (1 and ½); layout lost in extraction.]
13.
Part-C
14.a. Design a data frame “ROSTER” with columns “student_name”, “Math”, “Science” and
“English”. Enter 4 observations for the following attributes. (1)
Answer:
student_name<-c("John davis","Angela","David jones")
maths<-c(78,89,90)
science<-c(95,99,80)
English<-c(89,90,67)
df<-data.frame(student_name,maths,science,English)
i)Take the columns from 2 to 4. Calculate the mean of each row or observation. Make it as a
column with name “score”. Bind the newly formed column “score” to the data frame “roster”.
df$score <- rowMeans(df[,-1]) (1)
ii) Give the score column to the quantile function with (.8,.6,.4,.2) and assign grades accordingly: above
.8 assign ‘A’, between .8 and .6 assign ‘B’, between .6 and .4 assign ‘C’, between .4 and .2 assign ‘D’,
and below .2 assign ‘F’. Add the new column “Grade” to the data frame “roster”.
y<-quantile(df$score,c(.8,.6,.4,.2)) (1)
df$grade[df$score<y[4]]<-"F"
df$grade[df$score>=y[4] & df$score<y[3]]<-"D"
df$grade[df$score>=y[3] & df$score<y[2]]<-"C"
df$grade[df$score>=y[2] & df$score<y[1]]<-"B"
df$grade[df$score>=y[1]]<-"A"
iii) Split the first column “name” by first_name and last_name. Sort by last and first names. (1)
name<-strsplit(as.character(df$student_name)," ")
lastname<-sapply(name,"[",2)
firstname<-sapply(name,"[",1)
df<-cbind(firstname,lastname,df[,-1])
df<-df[order(df$lastname,df$firstname),]
14.b. i) The dataset “Iris” has the attributes, Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
and Species. Give the structure and summary of the dataset.
(1)
data(iris); str(iris); summary(iris)
ii) Find correlation between “Sepal.Length” and “Sepal.Width” by applying Least Squares Linear
Regression Model. (1)
Y<-iris[,"Sepal.Width"]
X<-iris[,"Sepal.Length"]
XYcorr<-cor(Y,X,method="pearson")
iii)plot the scatter plot between the above two attributes and add regression line to the scatter plot
to find the relationship between the above two attributes. (1)
plot(Y~X)
Model<-lm(Y~X)
abline(Model)
iv)Draw a ggplot with Petal.Width and Petal.Length with geom_point attribute. (1)
library(ggplot2)
ggplot(iris,aes(x=Petal.Length,y=Petal.Width))+geom_point()
v)Build the box plot with Sepal.Length as y-axis and Species as x-axis. (1)
boxplot(Sepal.Length~Species,data=iris)
vi) Sepal width has small variation across species. We want to know if the mean sepal width is
the same across 3 species. This is done through Analysis of Variance (ANOVA). Test the
difference with ANOVA. (1)
summary(aov(Sepal.Width~Species,data=iris))