Big Data Analytics Lab File
CS 1730 | [0 0 2 1]
Submitted to Manipal
University, Jaipur
In partial fulfilment of the requirements for the award of the degree of
BACHELOR OF
TECHNOLOGY
in Computer Science
and Engineering
2021-2022
By
Akshay Jain
189301060
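S. No Topic
01 Introduction to R
02 Basic Statistics, Visualization and Hypothesis Testing
03 K Means Clustering
04 Association Rules
05 Linear Regression
06 Logistic Regression
07 Naïve Bayes
08 Decision Tree
09 Time Series Analysis Using ARIMA
11 In-database Analytics
12 Assignment 01
13 Assignment 02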
Lab – 01
Introduction to R
Lab – 02
Basic Statistics, Visualization and Hypothesis Testing
Lab – 03
K means Clustering
Lab – 04
Association Rules
Lab – 05
Linear Regression
Lab – 06
Logistic Regression
Lab – 07
Naïve Bayes
Lab – 08
Decision Tree
Lab – 09
Time Series Analysis Using ARIMA
Lab – 10 & Lab – 11
3. Configurations
a. Edit 5 files
b. Configure Hadoop
4. Testing
a. Testing if installation was successful
CS1730 Big Data Analytics Lab
Assignment 01
The dataset has 1000 observations and one variable, gestation days, which is of continuous numeric
type.
pregnancy.df is a data frame. A data frame is a list of variables with the same number of rows and
unique row names, given class “data.frame”. If no variables are included, the row names determine
the number of rows. Our data frame is labelled, two-dimensional data with only one column and
multiple rows.
You can clearly see that we have one column and 1000 rows, so the data frame holds a single variable.
You can go further by checking the dimensions of your data frame.
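The structure check described above can be sketched as follows. The lab's data file is not reproduced here, so the values below are a synthetic stand-in generated with rnorm(); only the names pregnancy.df and GestationDays come from the text.

```r
# Synthetic stand-in for the lab data (assumption: the real file is not available here).
set.seed(42)
pregnancy.df <- data.frame(GestationDays = round(rnorm(1000, mean = 267, sd = 10)))

class(pregnancy.df)    # "data.frame"
str(pregnancy.df)      # compact view: number of observations and variables
dim(pregnancy.df)      # number of rows, then number of columns
nrow(pregnancy.df)     # rows only
ncol(pregnancy.df)     # columns only
```

dim() returns two numbers even for a single-column data frame, which is why the object is two-dimensional rather than a one-dimensional array.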
DESCRIPTIVE STATISTICS
From the summary statistics, you can read the minimum, first quartile (25th percentile), median,
mean, third quartile (75th percentile) and maximum. The lower and upper Tukey fences, which bound
the boxplot whiskers, can then be derived from the quartiles. You can get all this information by
summarizing the data.
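A minimal sketch of summarizing the column, again on a synthetic stand-in for the lab data (the numbers it prints will not match the figures quoted in this report):

```r
# Synthetic stand-in for the lab data (assumption).
set.seed(42)
pregnancy.df <- data.frame(GestationDays = round(rnorm(1000, mean = 267, sd = 10)))

# Six-number summary: minimum, quartiles, median, mean, maximum.
s <- summary(pregnancy.df$GestationDays)
print(s)

# The quartiles can also be requested explicitly.
q <- quantile(pregnancy.df$GestationDays, probs = c(0.25, 0.5, 0.75))
print(q)
```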
From the output, we see that the minimum gestation period is 240 days, or 8 months (taking a month
as 30 days), while the maximum is 349 days, or about 11.6 months. The range of gestation days is
therefore 109 days, or about 3.6 months, which may suggest a lot of variability. The 25th
percentile is 260 days, or about 8.67 months, and the 75th percentile is 272 days, or about 9.07
months. Finally, the median is 267 days and the mean (the “typical” gestation period) is also about
267 days, or roughly 9 months. Let’s check the histogram of gestation days.
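The day-to-month conversions can be checked with a few lines of arithmetic. The sketch assumes a nominal 30-day month, which is what the figure of 240 days for 8 months implies; the day values are the ones quoted from the summary output.

```r
# Values quoted in the text: minimum, maximum, first quartile, third quartile, median.
days <- c(min = 240, max = 349, q1 = 260, q3 = 272, median = 267)
days_per_month <- 30   # assumption: nominal 30-day month

# Convert each statistic to months, rounded to two decimals.
months <- round(days / days_per_month, 2)
print(months)

# Range of the data in days and in months.
range_days <- unname(days["max"] - days["min"])
print(range_days)
print(round(range_days / days_per_month, 2))
```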
The histogram peaks at about 272 gestation days. Most of the values are bunched up on the left side
of the histogram, with a few values spread along the right tail. This tells us that the data points
are not normally distributed. In addition, the plot reveals an outlier in the lower right corner of
the plot, well separated from the other observations. Let’s further explore this data set by
displaying line counts.
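A histogram like the one discussed above can be drawn with base R's hist(). This is a sketch on right-skewed synthetic data (a shifted gamma sample chosen only for illustration), so the bar heights will differ from the lab output; breaks = 20 is likewise an arbitrary choice.

```r
# Right-skewed synthetic stand-in for the lab data (assumption).
set.seed(42)
pregnancy.df <- data.frame(
  GestationDays = round(240 + rgamma(1000, shape = 4, scale = 7))
)

# Draw the histogram and keep the returned object to inspect the bins.
h <- hist(pregnancy.df$GestationDays,
          breaks = 20,
          main = "Histogram of gestation days",
          xlab = "Gestation days",
          col = "grey80")

h$counts                         # number of observations per bin
h$mids[which.max(h$counts)]      # midpoint of the tallest bar
```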
This output corroborates what we previously said about the shape of the histogram. The gestation
days are positively skewed, with observations stretching along the right tail. As before, data
points are clustered on the left side, confirming the asymmetric distribution of the variable.
Finally, the mean and the median occur at different points, though the difference is small, which
reinforces the asymmetry. A histogram is a good way to assess the shape and spread of the data and
to identify potential outliers. The boxplot is equally useful for evaluating the spread. Let’s go
back to the summary and get a better understanding of skewness.
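Positive skew can also be checked numerically. The sketch below defines a small moment-based skewness helper (a hypothetical function written for this illustration, not part of base R) and applies it to right-skewed synthetic data; a value above zero, together with a mean above the median, indicates a longer right tail.

```r
# Right-skewed synthetic stand-in for the lab data (assumption).
set.seed(42)
gestation <- 240 + rgamma(1000, shape = 4, scale = 7)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skewness(gestation)                 # positive for a right-skewed sample
mean(gestation) - median(gestation) # positive when the mean is pulled right
```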
INTERQUARTILE RANGE
In the summary results, the interquartile range is the third quartile minus the first quartile:
IQR = Q3 - Q1 = 272 - 260 = 12 days. Alternatively, you can call the built-in IQR() function on the
GestationDays column to calculate the IQR.
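As a small check of the two routes to the IQR, the sketch below uses a five-value illustrative vector built from the summary figures quoted in this report (minimum, Q1, median, Q3, maximum). It is not the lab data, only a vector whose quartiles happen to be 260 and 272.

```r
# Illustrative vector built from the quoted summary values (assumption: not the real data).
five <- c(240, 260, 267, 272, 349)

# Manual route: third quartile minus first quartile.
q <- quantile(five, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# Built-in route.
iqr_builtin <- IQR(five)

print(iqr_manual)
print(iqr_builtin)
```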
You can then compute the lower and upper Tukey fences, named after John Wilder Tukey: the lower
fence is Q1 - 1.5 × IQR and the upper fence is Q3 + 1.5 × IQR. Tukey was an American mathematician
best known for the development of the FFT algorithm and the box plot. The Tukey range test, the
Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller-Tukey lemma all bear
his name.
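The fences follow directly from the quartiles using the usual 1.5 × IQR rule. A short worked sketch using the quartiles quoted in this report (260 and 272); the variable names are chosen for illustration:

```r
# Quartiles quoted in the summary output.
q1 <- 260
q3 <- 272

# Interquartile range and the 1.5 * IQR Tukey fences.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
```

These values agree with the fences of 242 and 290 used in the outlier discussion.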
Using R’s which() function, it is easy to find the points that violate the fences, together with
their indices. The strength of which() is that it returns the indices of all data points that fall
outside the Tukey fences on either side.
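A minimal sketch of this use of which(), on a short made-up vector (not the lab data) and with the fences of 242 and 290 taken from this report:

```r
# Made-up gestation values for illustration only.
gestation <- c(238, 241, 255, 260, 267, 272, 280, 291, 300, 349)
lower_fence <- 242
upper_fence <- 290

# Indices of points above the upper fence and below the lower fence.
high_idx <- which(gestation > upper_fence)
low_idx  <- which(gestation < lower_fence)

# Both sides at once.
out_idx <- which(gestation < lower_fence | gestation > upper_fence)

gestation[high_idx]   # values above the upper fence
gestation[low_idx]    # values below the lower fence
```

which() returns positions, so indexing the vector with the result recovers the outlying values themselves.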
In the results, these are the points with values higher than the upper Tukey fence of 290. There
are 8 such outliers.
In the results, the output displays 2 values that fall below the lower Tukey fence of 242 days of
gestation.
BOXPLOT
Now that you have all the needed information, you can draw a boxplot using R’s boxplot() function.
The boxplot() function graphs the median, the first and third quartiles, whiskers that extend to the
most extreme points inside the Tukey fences, and any outliers. To draw it, pass the GestationDays
column to the boxplot() function.
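A sketch of the call on synthetic right-skewed data (a stand-in for the lab data, so the outlier counts will differ from those reported here). boxplot() also returns the statistics it draws, which makes it possible to inspect the whisker ends and the outliers directly.

```r
# Right-skewed synthetic stand-in for the lab data (assumption).
set.seed(42)
pregnancy.df <- data.frame(
  GestationDays = round(240 + rgamma(1000, shape = 4, scale = 7))
)

# Draw the boxplot and keep the returned statistics.
b <- boxplot(pregnancy.df$GestationDays,
             main = "Boxplot of gestation days",
             ylab = "Gestation days")

b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points drawn individually as outliers
```

Note that boxplot() uses hinges, which can differ slightly from the quartiles returned by quantile().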
If you set range = 0, the whiskers extend to the most extreme data points, so the upper whisker
reaches our greatest outlier point. The plot below shows a boxplot with range 0.
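The effect of range = 0 can be sketched on synthetic data (a stand-in for the lab data): the whiskers then run from the minimum to the maximum and no point is flagged as an outlier.

```r
# Right-skewed synthetic stand-in for the lab data (assumption).
set.seed(42)
gestation <- round(240 + rgamma(1000, shape = 4, scale = 7))

# range = 0 makes the whiskers extend to the data extremes.
b0 <- boxplot(gestation,
              range = 0,
              main = "Boxplot of gestation days (range = 0)",
              ylab = "Gestation days")

b0$stats   # whisker ends now coincide with min and max
b0$out     # empty: nothing is drawn as an outlier
```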
You can improve the boxplot by using ggplot2, which lets you draw the boxplot and colour the
outliers. In the output, the tiny red circles are outliers. You have 8 outliers above the box and 2
outliers below the box.
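One possible way to draw such a plot is sketched below, on synthetic data and guarded so that it only runs when the ggplot2 package is installed. The title, axis label and open-circle outlier shape are illustrative choices, not taken from the lab output.

```r
# Right-skewed synthetic stand-in for the lab data (assumption).
set.seed(42)
pregnancy.df <- data.frame(
  GestationDays = round(240 + rgamma(1000, shape = 4, scale = 7))
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Single boxplot with outliers drawn as red open circles.
  p <- ggplot(pregnancy.df, aes(x = "", y = GestationDays)) +
    geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
    labs(x = NULL, y = "Gestation days", title = "Boxplot of gestation days")

  print(p)
}
```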
CS1730 Big Data Analytics Lab
Assignment 02
Problem Statement
About Company
Dream Housing Finance company deals in all kinds of home loans. It has a presence across urban,
semi-urban and rural areas. A customer first applies for a home loan, after which the company
validates the customer’s eligibility for the loan. The company wants to automate the loan
eligibility process (in real time) based on the customer details provided while filling in the
online application form. These details are Gender, Marital Status, Education, Number of Dependents,
Income, Loan Amount, Credit History and others. To automate this process, the company has set the
problem of identifying the customer segments that are eligible for a loan, so that it can
specifically target these customers. I have already shared the dataset.
Problem
1. Perform EDA on the dataset.
2. Build a model using logistic regression.
3. Build a decision tree model.
Loan status is the target variable.
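The two modelling tasks can be sketched as follows. The shared dataset is not reproduced here, so the code simulates a small data set; the column names (Loan_Status, Credit_History, ApplicantIncome), the simulated approval probabilities and the 0.5 cut-off are all assumptions made for illustration. The decision tree part uses the rpart package and is skipped if it is not installed.

```r
# Simulated stand-in for the loan data (assumption: names and values are illustrative).
set.seed(1)
n <- 300
loans <- data.frame(
  Credit_History  = factor(rbinom(n, 1, 0.8)),
  ApplicantIncome = round(rlnorm(n, meanlog = 8.5, sdlog = 0.6))
)
# Approval is made more likely for applicants with a credit history.
p_approve <- ifelse(loans$Credit_History == "1", 0.8, 0.2)
loans$Loan_Status <- factor(ifelse(runif(n) < p_approve, "Y", "N"),
                            levels = c("N", "Y"))

# Logistic regression: models the probability that Loan_Status is "Y".
logit_fit <- glm(Loan_Status ~ Credit_History + log(ApplicantIncome),
                 data = loans, family = binomial)
print(summary(logit_fit)$coefficients)

# Predicted probabilities and a confusion table at an assumed 0.5 cut-off.
prob_y <- predict(logit_fit, type = "response")
pred   <- ifelse(prob_y > 0.5, "Y", "N")
print(table(predicted = pred, actual = loans$Loan_Status))

# Classification tree on the same predictors.
if (requireNamespace("rpart", quietly = TRUE)) {
  tree_fit <- rpart::rpart(Loan_Status ~ Credit_History + ApplicantIncome,
                           data = loans, method = "class")
  print(tree_fit)
}
```

With real data, the model would normally be fitted on the training set and evaluated on held-out rows rather than on the data used for fitting.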
Predictors:
1. Gender - factor with two levels. Has NAs in both the train and test sets.
2. Married - factor with two levels. Has NAs only in the test set.
I’ll worry about the three NAs later.
6. Applicant Income
7. CoApplicant Income. Both are numeric variables. No NAs.
There are many outliers. The distributions are right-skewed.
10. Credit History - integer. This should actually be a factor variable. Both sets have NAs.
11. Property Area - factor with three levels. No missing values.
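Counting missing values per column and converting Credit History to a factor can be sketched as below. The five rows and the column names are made up for illustration and are not taken from the shared dataset.

```r
# Made-up rows for illustration only (assumption).
loans <- data.frame(
  Gender          = factor(c("Male", "Female", NA, "Male", "Male")),
  Married         = factor(c("Yes", "No", "Yes", "Yes", "No")),
  ApplicantIncome = c(5849, 4583, 3000, 2583, 6000),
  Credit_History  = c(1L, 1L, NA, 0L, 1L)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(loans))
print(na_counts)

# Credit history is coded 0/1, so treat it as a two-level factor.
loans$Credit_History <- factor(loans$Credit_History, levels = c(0, 1))
str(loans$Credit_History)
```

Converting to a factor keeps the missing entries as NA, so they still have to be handled when tidying the data.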
Loan_Status by other variables
Tidying the data - filling in missing values