Big Data Analytics Lab File


BIG DATA ANALYTICS LAB

CS 1730 | [0 0 2 1]
Submitted to Manipal
University, Jaipur
Towards the partial fulfilment for the Award of the Degree of

BACHELOR OF
TECHNOLOGY
in Computer Science
and Engineering
2021-2022
By

Akshay Jain

189301060

Under the guidance of

Dr. Jayesh Gangrade


Department of Computer Science and
Engineering School of Computing and
Information Technology Manipal
University Jaipur
Jaipur, Rajasthan
Index

S. No Topic
01 Introduction to R

Basic Statistics, Visualization and Hypothesis Testing


02 a) Basic Statistics and Visualization using R
b) Graphics package plots and Hypothesis testing

03 K Means Clustering

04 Association Rules

05 Linear Regression

06 Logistic Regression

Naïve Bayes Classifier


07 a) Building Naïve Bayes Classifier
b) Naïve Bayes Classifier – Census Data

08 Decision Tree

09 Time Series Analysis Using ARIMA

10 Hadoop, HDFS, Map Reduce and Pig

11 In-database Analytics

12 Assignment 01

13 Assignment 02
Lab – 01
Introduction to R
Lab – 02
Basic Statistics, Visualization and Hypothesis Testing
Lab – 03
K means Clustering
Lab – 04
Association Rules
Lab – 05
Linear Regression
Lab – 06
Logistic Regression
Lab – 07
Naïve Bayes
Lab – 08
Decision Tree
Lab – 09
Time Series Analysis Using ARIMA
Lab – 10 & Lab – 11

Hadoop, HDFS, Map Reduce and Pig
and
In-database Analytics

Hadoop 3.3.1 Installation Steps:

1. Download and Installation


a. Java
b. Hadoop

2. Environment Variables and Path


a. Java Home
b. Hadoop Home

3. Configurations
a. Edit 5 files
b. Configure Hadoop

4. Testing
a. Testing if installation was successful
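The five files in step 3a are typically core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and the hadoop-env script (the exact set varies slightly by platform and Hadoop version). As one illustrative fragment, core-site.xml usually points Hadoop at the default filesystem for a single-node setup; the port below is the common convention, not a requirement:

```xml
<!-- core-site.xml: set the default filesystem to a local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```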
CS1730 Big Data Analytics Lab
Assignment 01

Lab Instructor: Dr. Jayesh Gangrade


Submitted by: Akshay Jain (189301060)

The dataset has 1000 observations and one variable, gestation days, which is numeric and
continuous.

pregnancy.df is a data frame. A data frame is a list of variables with the same number of rows and
unique row names, given the class “data.frame”. If no variables are included, the row names
determine the number of rows. Our data frame is one-dimensional labelled data with a single
column and multiple rows.

You can clearly see that we have one column and 1000 rows, which makes the data effectively a
one-dimensional array. You can go further by checking the dimensions of the data frame.
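A minimal sketch of these structure checks, using a small toy stand-in for pregnancy.df (the real data frame has 1000 rows):

```r
# Toy stand-in for the pregnancy data frame described in the text
pregnancy.df <- data.frame(GestationDays = c(240, 260, 267, 272, 349))

class(pregnancy.df)  # "data.frame"
dim(pregnancy.df)    # number of rows and columns
nrow(pregnancy.df)   # rows only
str(pregnancy.df)    # compact display of the structure
```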

DESCRIPTIVE STATISTICS
From the summary statistics, you can calculate the median, first quartile (25th percentile), third
quartile (75th percentile), and the upper and lower whiskers, also known as the inner and outer
Tukey fences. You can get all of this information by summarizing the data.
From the output, we see the minimum gestation period is 240 days, or 8 months, while the
maximum is 349 days, or about 11 months. The range is therefore 109 days, or about 3 months,
which may suggest a lot of variability in gestation days. The 25th percentile is 260 days, or about
8.66 months, and the 75th percentile is 272 days, or about 9.06 months. Finally, the median is
267 days and the “typical” average gestation is about 267 days, or roughly 9 months. Let’s check
the histogram of gestation days.
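The summary and histogram steps can be sketched as follows; the GestationDays values here are toy data standing in for the 1000-row dataset:

```r
# Toy gestation data; values chosen to span the range described in the text
pregnancy.df <- data.frame(
  GestationDays = c(240, 255, 260, 265, 267, 270, 272, 280, 290, 349)
)

# Five-number summary plus the mean
summary(pregnancy.df$GestationDays)

# Histogram of the distribution
hist(pregnancy.df$GestationDays,
     main = "Histogram of Gestation Days", xlab = "Days")
```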
The output has a peaked bar at about 272 gestation days. Most of the values are bunched up on
the left side of the histogram, with a few values spread along the right tail, which tells us that the
data points are not normally distributed. In addition, the plot reveals an outlier at the far right of
the plot, well separated from the other observations. Let’s explore this data set further by
displaying the bin counts.
This output corroborates what we previously said about the distribution shown in the histogram.
We can say the gestation days are positively skewed, with observations stretching along the right
tail. As before, the data points are clustered on the left side, confirming the asymmetric
distribution of the variable’s values. Finally, the mean and the median occur at different points,
though the difference is small, further reinforcing an asymmetric distribution. The histogram is a
good way to assess the shape and spread of the data and to identify potential outliers; similarly,
the boxplot is vital for evaluating the spread. Let’s go back to the summary and get a good
understanding of skewness.


INTERQUARTILE RANGE
From the summary results, the interquartile range is 272 minus 260, i.e. 12 days; IQR = third
quartile (Q3) minus first quartile (Q1). Alternatively, you can call the built-in IQR() function on
the GestationDays column to calculate the IQR.
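Both routes to the IQR can be sketched on toy gestation data:

```r
# Toy gestation data
gd <- c(240, 255, 260, 265, 267, 270, 272, 280, 290, 349)

# IQR by hand: third quartile minus first quartile
q <- quantile(gd, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# Same result via the built-in
iqr_builtin <- IQR(gd)
```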

You can then compute the lower and upper Tukey fences, named after John Wilder Tukey, an
American mathematician best known for developing the FFT algorithm and the box plot. The
Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the
Teichmüller-Tukey lemma all bear his name.

Using R’s which() function, it is easy to determine the points, and their indices, that violate the
fences. The power of which() is that it gives you all the data points that are out of bounds on
either side of the Tukey fences.
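The fences and the which() check can be sketched as follows, again on toy gestation data:

```r
# Toy gestation data
gd <- c(240, 255, 260, 265, 267, 270, 272, 280, 290, 349)

# Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1 <- unname(quantile(gd, 0.25))
q3 <- unname(quantile(gd, 0.75))
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Indices of points outside the fences, on either side
outlier_idx <- which(gd < lower_fence | gd > upper_fence)
gd[outlier_idx]  # the offending values themselves
```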

In the results, these points are the ones with values higher than the upper Tukey fence of 290. You
have 8 data points that are outliers.
In the results, the output displays 2 values that are below the lower Tukey fence. These are values
below 242 days of gestation.
BOXPLOT
Since you have all the needed information, you can plot a boxplot using R’s boxplot() function,
which graphs the median, first and third quartiles, Tukey fences, and any outliers. To draw it, you
pass the GestationDays column to boxplot().

If you set range = 0, the upper whisker will extend to the last data point, which is our greatest
outlier. The plot below shows a boxplot with range = 0.
You can improve the boxplot by using ggplot2, drawing it with colored outliers. In the output, the
tiny red circles are outliers: 8 above the box and 2 below it.
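The base-R version of this plot can be sketched as follows; boxplot() with plot = FALSE also returns the flagged outliers without drawing anything. (The ggplot2 variant described above would use geom_boxplot() with a red outlier colour, assuming ggplot2 is installed.)

```r
# Toy gestation data
gd <- c(240, 255, 260, 265, 267, 270, 272, 280, 290, 349)

# range = 0 extends the whiskers to the extremes (min and max)
boxplot(gd, range = 0)

# The default range = 1.5 stops whiskers at the Tukey fences;
# plot = FALSE returns the statistics instead of drawing
bp <- boxplot(gd, plot = FALSE)
bp$out  # points flagged as outliers
```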
CS1730 Big Data Analytics Lab
Assignment 02

Lab Instructor: Dr. Jayesh Gangrade


Submitted by: Akshay Jain (189301060)

Problem Statement

About Company
Dream Housing Finance company deals in all kinds of home loans and has a presence across
urban, semi-urban and rural areas. A customer first applies for a home loan, after which the
company validates the customer’s eligibility. The company wants to automate the loan eligibility
process (in real time) based on the customer details provided in the online application form:
Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History
and others. To automate this process, they have posed the problem of identifying the customer
segments that are eligible for a loan amount, so that they can specifically target these customers. I
have already shared the dataset.

Problem
1. Perform the EDA operations.
2. Build the Model using the logistic regression.
3. Build the decision tree model.
Loan status is the target variable.
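The two models requested above can be sketched in R as follows; the column names follow the assignment’s description and are assumptions about the real dataset, and the rows here are randomly generated toy data:

```r
# Toy frame shaped like the loan data (names assumed from the assignment text)
set.seed(42)
loans <- data.frame(
  Loan_Status     = factor(sample(c("Y", "N"), 200, replace = TRUE)),
  ApplicantIncome = round(rlnorm(200, meanlog = 8)),
  Credit_History  = factor(sample(0:1, 200, replace = TRUE))
)

# Logistic regression with Loan_Status as the target
fit <- glm(Loan_Status ~ ApplicantIncome + Credit_History,
           data = loans, family = binomial)
summary(fit)

# A decision tree would use the same formula with rpart::rpart
# (the rpart package is assumed to be installed).
```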

Predictors:
1. Gender - factor with two levels. Has NAs in both train and test sets.
   Should be able to impute these using an rpart prediction. (Later.)

2. Married - factor with two levels. Has NAs only in the test set.
   I’ll worry about the three NAs later.

3. Dependents - factor with 4 levels. Has NAs in both sets.


4. Education - factor with 2 levels. No missing values.
5. Self-Employed - factor with two levels. Has missing values in both sets.

6. Applicant Income
7. CoApplicant Income - both numeric variables. No NAs.
   There are many outliers, and the distributions are right-skewed.

8. Loan Amount - numeric. Has NAs in both sets. The distribution is right-skewed.


9. Loan Amount Term - numeric. Both sets have NAs.

10. Credit History - integer; this should actually be a factor variable. Both sets have NAs.
11. Property Area - factor with three levels. No missing values.
Loan_Status by other variables
Tidying the data - filling in missing values
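A simple baseline for this tidying step, before the rpart-based imputation mentioned earlier, is median imputation for numeric columns and mode imputation for categorical ones. The toy frame below only mimics the loan data’s shape; the column names are assumptions:

```r
# Toy frame with the kind of NAs described in the predictor list
loans <- data.frame(
  Gender     = c("Male", NA, "Female", "Male"),
  LoanAmount = c(120, NA, 100, 150),
  stringsAsFactors = FALSE
)

# Median imputation for a numeric column
loans$LoanAmount[is.na(loans$LoanAmount)] <-
  median(loans$LoanAmount, na.rm = TRUE)

# Mode (most frequent level) imputation for a categorical column
mode_val <- names(which.max(table(loans$Gender)))
loans$Gender[is.na(loans$Gender)] <- mode_val
```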
