DV Unit 2 Update

Unit 2 covers data manipulation and visualization using R, emphasizing the importance of organizing data for better interpretation. It introduces various packages like dplyr, ggplot2, and data.table for efficient data handling and visualization techniques such as scatter plots and bar charts. Additionally, it discusses Watson Studio's features for collaborative data analysis and machine learning model development.

Unit 2

Data Manipulation with R

• Data manipulation is the process of organizing or arranging data
in order to make it easier to interpret.
• It is also called 'data exploration'.
• It involves manipulating data using the available set of variables.
This is done to enhance the accuracy and precision of the data.
• The data collection process can have many loopholes. Various
uncontrollable factors lead to inaccuracy in data, such as the
mental state of respondents, personal biases, and differences or
errors in machine readings. Data manipulation is done to increase
the achievable accuracy of the data.

Different Ways to manipulate / Treat Data


1. Manipulating data using inbuilt base R functions
2. Use of packages for data manipulation
3. Use of ML (machine learning) algorithms for data manipulation

You can install a package using

install.packages("package_name")

List of Packages
1. dplyr
2. data.table
3. ggplot2
4. reshape2
5. readr
6. tidyr
7. lubridate

1. dplyr Package: This package is created and maintained by
Hadley Wickham.
• This package has everything needed to speed up data manipulation
work.
• It includes 5 major data manipulation commands:
1. filter(): Filters the rows based on a condition.
2. select(): Selects the columns of interest from a data set.
3. arrange(): Arranges data set values in ascending or
descending order.
4. mutate(): Creates a new variable from existing
variables.
5. summarise(): Performs analysis with commonly used
operations such as min, max, mean, etc.
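
The five commands above can be sketched on a small data frame. This is only an illustrative sketch: the `students` data and its column names are invented for the example, and `group_by()` and the `%>%` pipe are additional dplyr helpers that are not in the list above.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration
students <- data.frame(
  name  = c("Asha", "Ben", "Chen", "Dia", "Eli"),
  class = c("A", "A", "B", "B", "B"),
  marks = c(72, 45, 88, 60, 35)
)

# filter(): keep only the rows that satisfy a condition
passed <- filter(students, marks >= 50)

# select(): keep only the columns of interest
names_marks <- select(students, name, marks)

# arrange(): sort rows; desc() gives descending order
ranked <- arrange(students, desc(marks))

# mutate(): create a new variable from an existing one
with_fraction <- mutate(students, fraction = marks / 100)

# summarise(): reduce each group to summary values
by_class <- students %>%
  group_by(class) %>%
  summarise(mean_marks = mean(marks), max_marks = max(marks))

print(by_class)
```

Each command returns a new data frame and leaves `students` unchanged, which is why the results are assigned to new names.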

2. data.table Package:
• This package allows you to manipulate a data set faster.
• data.table helps reduce computing time compared to
data.frame.
• A data.table query has 3 parts, namely DT[i, j, by].
• Subset the rows using 'i', then calculate 'j', grouped by
'by'.
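
A minimal sketch of the DT[i, j, by] form, assuming the data.table package is installed. The `DT` table and its columns are hypothetical values made up for the example.

```r
library(data.table)

# Hypothetical sales data, invented for illustration
DT <- data.table(
  region = c("North", "North", "South", "South", "South"),
  year   = c(2022, 2023, 2022, 2023, 2023),
  sales  = c(10, 15, 7, 12, 9)
)

# i:  keep rows where year is 2023
# j:  compute the total sales
# by: one result row per region
totals_2023 <- DT[year == 2023, .(total_sales = sum(sales)), by = region]

print(totals_2023)
```

Any of the three parts can be left out, for example `DT[, mean(sales), by = region]` skips the row subset.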

3. ggplot2 Package: ggplot2 offers a wide choice of colors and
patterns for graphs.
• It provides functions to plot graphs such as scatter plots, bar
plots, histograms, etc.
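
As a rough illustration, a ggplot2 scatter plot can be built as below. The sketch uses the `mtcars` data set that ships with R; the choice of columns, colors and labels is only an example, not something taken from these notes.

```r
library(ggplot2)

# mtcars is a built-in R data set; wt = weight, mpg = miles per gallon
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +                    # scatter plot layer
  labs(title = "Weight vs mileage",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       colour = "Cylinders")

print(p)  # draws the plot on the active graphics device
```

Swapping `geom_point()` for `geom_bar()` or `geom_histogram()` (with suitable aesthetics) gives the other chart types mentioned above.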
4. reshape2 Package:
• This package is useful for reshaping data.
• It has two functions, namely melt and cast (provided as dcast()
and acast() in reshape2).
1. Melt: This function converts data from wide format to long
format.
2. Cast: This function converts data from long format to wide
format.
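
A small sketch of the round trip from wide to long and back, assuming reshape2 is installed. The `wide` data frame is hypothetical and exists only for the example.

```r
library(reshape2)

# Hypothetical wide data: one column per subject
wide <- data.frame(
  student = c("Asha", "Ben"),
  maths   = c(80, 65),
  science = c(75, 70)
)

# melt(): wide -> long, one row per student/subject pair
long <- melt(wide, id.vars = "student",
             variable.name = "subject", value.name = "marks")

# dcast(): long -> wide again, one column per subject
wide_again <- dcast(long, student ~ subject, value.var = "marks")

print(long)
print(wide_again)
```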
5. readr Package:
• 'readr' helps in reading various forms of data into R.
• This package can replace the traditional read.csv() and
read.table() base R functions.
• It reads:
1. Web log files with read_log()
2. Fixed-width files with read_fwf() and read_table()
3. Delimited files with read_csv() and read_csv2()
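
The following sketch shows read_csv() on a tiny file. The file contents and column names are invented, and the example writes the file to a temporary location first so that it is self-contained; the `show_col_types` argument assumes readr 2.0 or later.

```r
library(readr)

# Write a small hypothetical CSV so the example does not need an external file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Asha,72.5", "2,Ben,64"), csv_path)

# read_csv() guesses the column types and returns a tibble
scores <- read_csv(csv_path, show_col_types = FALSE)
print(scores)

# Column types can also be stated explicitly instead of being guessed
scores_typed <- read_csv(
  csv_path,
  col_types = cols(id = col_integer(),
                   name = col_character(),
                   score = col_double())
)
```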
6. tidyr Package:
• This package can make your data look tidy.
• It has four functions:
1. gather(): It 'gathers' multiple columns and converts
them into key:value pairs, from the wide form of data to the long form.
2. spread(): It is the reverse of gather(). It takes a key:value pair
and converts it into separate columns.
3. separate(): It splits a column into multiple columns.
4. unite(): It does the reverse of separate(). It unites multiple
columns into a single column.
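
A minimal sketch of the four functions, assuming tidyr is installed. Both data frames are hypothetical and only serve the example.

```r
library(tidyr)

# Hypothetical wide data: one column per subject
wide <- data.frame(
  student = c("Asha", "Ben"),
  maths   = c(80, 65),
  science = c(75, 70)
)

# gather(): collapse the subject columns into key/value pairs (wide -> long)
long <- gather(wide, key = "subject", value = "marks", maths, science)

# spread(): the reverse, one column per key (long -> wide)
wide_again <- spread(long, key = subject, value = marks)

# Hypothetical date strings
dates <- data.frame(date = c("2024-01-15", "2023-12-31"))

# separate(): split one column into several
parts <- separate(dates, date, into = c("year", "month", "day"), sep = "-")

# unite(): the reverse, paste several columns into one
joined <- unite(parts, "date", year, month, day, sep = "-")
```

Recent tidyr versions also offer pivot_longer() and pivot_wider() as the newer counterparts of gather() and spread().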

7. lubridate Package: It is used to work with date-time
variables in R.
• The inbuilt functions of this package give an easy way to parse
dates and times.
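
For instance, parsing and taking apart dates could look like the sketch below, assuming lubridate is installed. The date values are arbitrary examples.

```r
library(lubridate)

# Parse text into dates; the function name states the order of the parts
d  <- ymd("2024-03-15")                 # year-month-day
dt <- dmy_hms("15/03/2024 14:30:00")    # day-month-year hour:min:sec, UTC by default

# Extract components
year(d)    # 2024
month(d)   # 3
hour(dt)   # 14

# Date arithmetic
later <- d + days(30)
print(later)
```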

Data Visualization with R

• Data visualization is a way to explore, analyze, and present data in
a visual format.
• R provides a wide range of libraries and packages specifically
designed for creating various types of data visualization.
• Some commonly used packages for data visualization in R
include ggplot2, plotly, lattice, and base graphics.
1) Scatter plot:
• A scatter plot is a set of dotted points representing
individual pieces of data on the horizontal and vertical axes.
• In a graph in which the values of two variables are
plotted along the x-axis and y-axis, the pattern of the resulting
points reveals any correlation between them.
R-Scatter Plot
• In R, a scatter plot is created using the plot() function.
Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Parameters
1. x: This parameter sets the horizontal coordinates.
2. y: This parameter sets the vertical coordinates.
3. xlab: This parameter is the label for the horizontal axis.
4. ylab: This parameter is the label for the vertical axis.
5. main: This parameter is the title of the chart.
6. xlim: This parameter sets the limits of the x values used for plotting.
7. ylim: This parameter sets the limits of the y values used for plotting.
8. axes: This parameter indicates whether both axes
should be drawn on the plot.
Ex:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
plot(x, y, main = "scatter plot", xlab = "x-axis",
ylab = "y-axis")
[Figure: scatter plot diagram]
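
As a further sketch, the xlim and ylim parameters can be set explicitly, and cor() can put a number on the correlation that the scatter plot shows. The values below are hypothetical, chosen only to be roughly linear.

```r
# Hypothetical, roughly linear data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# xlim and ylim each take a vector of two values: c(lower, upper)
plot(x, y,
     main = "scatter plot with axis limits",
     xlab = "x-axis", ylab = "y-axis",
     xlim = c(0, 6), ylim = c(0, 12))

# Pearson correlation coefficient between the two variables
r <- cor(x, y)
print(r)
```

A value of `r` close to 1 matches what the plot shows: the points lie almost on a rising straight line.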

R-Bar chart
• A bar chart is a pictorial representation of data that
presents categorical data with rectangular bars whose
heights or lengths are proportional to the values that they
represent.
• R uses the function barplot() to create bar charts.
Syntax:
barplot(H, xlab, ylab, main, names.arg, col)

Parameters:
H: This parameter is a vector or matrix containing the numeric values
used in the bar chart.
xlab: This parameter is the label for the x-axis.
ylab: This parameter is the label for the y-axis.
main: This parameter is the title of the chart.
names.arg: This parameter is a vector of names appearing under each
bar in the bar chart.
col: This parameter is used to give colors to the bars in the graph.
Ex:
A <- c(17, 32, 8, 53, 1)
barplot(A, xlab = "x-axis", ylab = "y-axis", main = "Bar chart")
[Figure: bar chart diagram]

Horizontal Bar chart

Set horiz = TRUE to draw the bars horizontally.
Ex:
A <- c(17, 32, 8, 53, 1)
barplot(A, horiz = TRUE, xlab = "x-axis", ylab = "y-axis",
main = "barchart")
[Figure: horizontal bar chart diagram]

Ex:
A <- c(17, 2, 8, 13, 1, 22)
B <- c("jan", "feb", "mar", "apr", "may", "jun")
barplot(A, names.arg = B, xlab = "month", ylab = "Articles",
col = "green", main = "Article chart")
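
Since the first argument of barplot() may also be a matrix, a grouped bar chart can be sketched as follows. The counts are hypothetical, and `beside` and `legend.text` are further barplot() arguments not covered in the parameter list above.

```r
# Hypothetical counts: rows are years, columns are months
M <- matrix(c(17, 2, 8,
              12, 9, 5),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("2023", "2024"), c("jan", "feb", "mar")))

# beside = TRUE draws the rows side by side instead of stacked;
# the column names of M are used as the labels under each group
mids <- barplot(M, beside = TRUE,
                col = c("green", "blue"),
                legend.text = rownames(M),
                xlab = "month", ylab = "Articles",
                main = "Articles by year")
```

barplot() returns the bar midpoints (here stored in `mids`), which is handy when text or lines need to be added on top of the bars.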

R-Histograms
• In R, the hist() function is used to plot a histogram.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)

Parameters:
v: This parameter contains the numerical values used in the histogram.
main: This parameter is the title of the chart.
col: This parameter is used to set the color of the bars.
xlab: This parameter is the label for the x-axis.
border: This parameter is used to set the border color of each bar.
xlim: This parameter sets the range of values on the x-axis.
ylim: This parameter sets the range of values on the y-axis.
breaks: This parameter sets the break points between bars, and so
the width of each bar.
Ex:
v <- c(19, 23, 11, 6, 16, 21, 32, 14, 19, 27, 39)
hist(v, xlab = "no. of Articles", ylab = "frequency", col = "green",
border = "black")

[Figure: histogram diagram]
A histogram is used to show the frequency distribution of data.
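
To see how the breaks parameter decides the bars, the sketch below reuses the vector from the example and sets explicit break points every 10 units. The break points are an arbitrary choice made for illustration.

```r
v <- c(19, 23, 11, 6, 16, 21, 32, 14, 19, 27, 39)

# plot = FALSE returns the computed bins without drawing anything
h <- hist(v, breaks = seq(0, 40, by = 10), plot = FALSE)

h$breaks  # 0 10 20 30 40
h$counts  # 1 5 3 2  (values falling in each interval)

# The same breaks used for the actual drawing
hist(v, breaks = seq(0, 40, by = 10),
     main = "Articles histogram", xlab = "no. of Articles",
     col = "green", border = "black")
```

Each count is the frequency of values in one interval, which is exactly the frequency distribution that the histogram draws.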
Watson Studio
• It is an integrated development environment (IDE) provided by
IBM that allows data scientists and developers to work
collaboratively on data analysis and machine learning projects.
• It is a part of the IBM Watson suite of AI and machine learning
tools.

Key features of Watson Studio
1) Data preparation- Watson Studio allows users to access and
explore data from various sources, including databases, cloud
storage, and streaming data. It provides tools for data cleansing,
transformation, and enrichment.
2) Notebooks- Data scientists can create Jupyter notebooks within
Watson Studio to perform data analysis and data visualization and to
build and train machine learning models.
3) Model development- Watson Studio provides tools and libraries
for building and training machine learning models. Users
can use different frameworks such as TensorFlow, PyTorch,
scikit-learn, etc.
4) Model deployment- After creating and training a machine
learning model, Watson Studio allows users to deploy the model
as a web service or API, making it accessible for real-time
predictions and integration with other applications.
5) AutoAI- Watson Studio offers AutoAI, an automated machine
learning capability that helps users quickly build and deploy
machine learning models without the need for extensive manual
configuration.
6) Collaboration- Teams can collaborate on projects within Watson
Studio, share code, notebooks, and data, and manage access
controls to ensure secure collaboration.
7) Model monitoring- Watson Studio provides tools to monitor the
performance of deployed machine learning models and track
their usage over time.
Applications of Watson Studio-
1) Data analysis and exploration
2) Predictive analysis
3) Natural language processing (NLP): Watson Studio can be used
for sentiment analysis, language translation, and chatbot
development.
4) Image and video analysis
5) Social media analysis
6) Agriculture and environment monitoring
7) Education
8) Healthcare and life sciences
9) Customer segmentation and marketing

Data Refinery-
• Data refinery is a process that collects data (from
disparate sources), enriches it (by blending different data
sets), and creates an integrated, refined data repository that
can be used for analysis and for taking action.
Collect → Enrich → Refined data repository → Analyze → Act

Steps to visualize data in Watson Studio

1) On the assets tab of your project, click data asset in the list of
asset types and select a data asset.
2) Click the visualization tab.
3) Select the chart type from the options that are listed and input
your preferences in the graphical options pane.
Available chart types are ordered from most relevant to least
relevant, based on the selected columns.
If there are no columns in the data set with a data type that is
supported for a chart type, that chart will not be available. If a
column's data type is not supported for a chart, that column is
not available for selection for that chart. Dots next to the
chart names suggest the best charts for your data.
As you are building the chart, the canvas displays a preview of
the chart. The preview uses the actual variable labels and
measurement levels, so it is representative of your actual data.
4) To save your visualization, select actions > save visualization to
project.
Your saved visualization is listed as a visualization asset in your
project.
You can view or edit the visualization by clicking its name in
the visualization assets of your project.
