Programming in R. Ex5-wrangling

This document outlines a data wrangling exercise using R and the tidyverse package, focusing on manipulating a fabricated dataset of fruit bags. It includes steps for reading data, merging datasets, filtering, aggregating, and summarizing data, as well as using functions like left_join, mutate, and summarise. The exercise emphasizes practical coding skills for data analysis and encourages the use of AI tools like ChatGPT for assistance.

Wrangling Exercises

In this exercise you will do some data wrangling on a fabricated dataset. Use the .Rpres
slides to reuse some of the code.
In this exercise you will need tidyverse, since that is our primary wrangling package.
Load the tidyverse package (this also attaches dplyr and stringr, so separate library() calls for those are not needed):
library(tidyverse)

You will do a lot of the same operations as in the lecture - but with another dataset. Start
by reading in your three datasets.
library(readxl)

# Option so everyone ends up in the same place: the path of this script
file_path <- paste0(dirname(rstudioapi::getSourceEditorContext()$path), "/")

# read in your data: Bags of Apples (note: csv-file!)
BOA <- read.csv(paste0(file_path, "BagsOfApples.csv"), sep=";") # this one should work anyhow
BOA <- read.csv("BagsOfApples.csv", sep=";") # this one also works if the working directory is correct
# Also read in the Bag of oranges
BOO <- read_excel(paste0(file_path, "BagsOfOrangesNA.xlsx"))
# And read in Geo_dim
Geo <- read_excel(paste0(file_path, "Geo_dim.xlsx"))

You will need to bind the two datasets.


# Add a new column to each data frame called fruits with "Apples" and "Oranges" respectively.
BOA$fruits <- "Apples"
BOO$fruits <- "Oranges"

# Bind the two data frames using rbind() into a new dataset called BagOfFruits or BOF
BOF <- rbind(BOO, BOA)
Merging data (left_join()) ~ VLOOKUP in Excel
• Prepare data (rename California to prepare for the merge). Use str_replace_all() or
gsub()
• Merge
• Rename the prize column to price
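Before merging, it can help to see what left_join() does with a key that has no match: every row of the left table is kept, and unmatched keys get NA in the new columns. A minimal sketch with made-up data frames (these tiny frames are illustrative, not the exercise files):

```r
library(dplyr)

bags <- data.frame(origin = c("Denmark", "California"))
geo  <- data.frame(Country = c("Denmark", "United States"),
                   Region  = c("Europe", "North America"))

# "California" has no match in geo, so its Region becomes NA -
# which is why we rename it to "United States" before merging
merged <- left_join(bags, geo, by = c("origin" = "Country"))
merged
```

This is exactly the problem the renaming step below avoids.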
if (!requireNamespace("stringr", quietly = TRUE)) {
install.packages("stringr")
}
library(stringr)
library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

# Rename content
# (Remember to replace California with United States (check spelling in Geo_dim))
BOF$origin <- str_replace_all(BOF$origin, "California", "United States")

# Here you merge BOF with Geo_dim using left_join()


names(BOF)

## [1] "bagNo"     "weight"    "prize"     "origin"    "foodLabel" "fruits"

names(Geo)

## [1] "Country" "Region"

ncol(BOF)

## [1] 6
BOF <- left_join(BOF, Geo, by = c("origin" = "Country"))
ncol(BOF) # check - one column more?

## [1] 7

# Renaming column prize to price


BOF <- rename(BOF, price = prize)

Filtering and subsetting data with filter() and select()


Remove unnecessary columns
library(dplyr)

# Here you remove the first column, bagNo


BOF <- select(BOF, -bagNo)

Check for NAs


• Check for NAs
• Remove rows with NA in them.
# Here you check for NAs
anyNA(BOF)

## [1] TRUE

# Remove all rows with NA (if any)


BOF <- na.omit(BOF)

#BOFna <- BOF[!is.na(BOF$foodLabel), ] # for a single column


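Note the difference between the two approaches above: na.omit() drops a row if *any* column is NA, while filtering on !is.na() only looks at one column. A small sketch with made-up data:

```r
library(dplyr)

df <- data.frame(weight    = c(1.2, NA, 2.1),
                 foodLabel = c("Organic", "Conventional", NA))

nrow(na.omit(df))                    # drops rows 2 and 3 -> 1 row left
nrow(filter(df, !is.na(foodLabel)))  # drops only row 3   -> 2 rows left
```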

Make a new dataset BOF_Europe (or Asia, or Africa, or North America). This should only
consist of the region you decide.
# new data with only one Region

BOF_europe <- subset(BOF, Region=="Europe")


BOF_europe <- filter(BOF, Region=="Europe")

Adding new column, pricePerKilo using mutate() (and base-R)


# Add the code here to create the new price-per-kilo column, ppk
BOF <- mutate(BOF, ppk = price/weight)
#BOF$ppk <- BOF$price/BOF$weight

Sort data using arrange()


Order the data such that the highest price per kilo is at the top
# top ten most expensive bags of fruits
arrange(BOF,desc(ppk))

## # A tibble: 61 × 7
##    weight price origin        foodLabel    fruits  Region           ppk
##     <dbl> <dbl> <chr>         <chr>        <chr>   <chr>          <dbl>
##  1   2.03  4.91 United States Organic      Oranges North America   2.42
##  2   1.59  3.67 Netherlands   Organic      Apples  Europe          2.31
##  3   1.74  3.07 Spain         Organic      Apples  Europe          1.77
##  4   1.31  2.22 Germany       Conventional Apples  Europe          1.70
##  5   1.57  2.39 Denmark       Organic      Apples  Europe          1.52
##  6   1.59  2.25 Germany       Organic      Apples  Europe          1.42
##  7   1.30  1.76 Denmark       Conventional Apples  Europe          1.35
##  8   2.11  2.66 New Zealand   Conventional Apples  Asia & Pacific  1.26
##  9   2.25  2.7  China         Conventional Apples  Asia & Pacific  1.20
## 10   1.70  2.04 Spain         Organic      Oranges Europe          1.20
## # ℹ 51 more rows

arrange(BOF,ppk)

## # A tibble: 61 × 7
##    weight price origin        foodLabel    fruits  Region           ppk
##     <dbl> <dbl> <chr>         <chr>        <chr>   <chr>          <dbl>
##  1   2.70  1.3  India         Conventional Apples  Asia & Pacific 0.481
##  2   2.56  1.31 Poland        Conventional Apples  Europe         0.512
##  3   3.67  1.93 India         Conventional Apples  Asia & Pacific 0.526
##  4   2.78  1.59 China         Conventional Apples  Asia & Pacific 0.572
##  5   3.07  1.99 China         Conventional Apples  Asia & Pacific 0.648
##  6   2.66  2.09 South Africa  Conventional Apples  Africa         0.787
##  7   1.99  1.59 United States Conventional Apples  North America  0.799
##  8   1.93  1.57 South Africa  Organic      Apples  Africa         0.813
##  9   2.59  2.2  China         Conventional Apples  Asia & Pacific 0.849
## 10   1.99  1.74 South Africa  Conventional Apples  Africa         0.875
## # ℹ 51 more rows
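Note that arrange() returns *all* rows, just reordered. If you want the literal top ten, combine it with head() or slice_max(); a sketch on a small made-up frame:

```r
library(dplyr)

toy <- data.frame(ppk = c(0.5, 2.4, 1.2, 0.9, 1.8))

# sort descending, then keep only the first rows
top2 <- toy |> arrange(desc(ppk)) |> head(2)
# equivalently: slice_max(toy, ppk, n = 2)
top2$ppk
```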

Print your BOF file


We need it next time! Hint: Look in the file, CodeStructure.Rpres from lecture 4
# Put your write file-code here

write.table(BOF, file="bagsoffruits_price.txt", sep = "\t", row.names = FALSE)
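To check that the export worked, you can read the file straight back in with the same separator; a sketch on a small made-up frame (the file name toy_check.txt is just illustrative):

```r
# write a small frame, then read it back with matching settings
toy <- data.frame(fruits = c("Apples", "Oranges"), ppk = c(1.1, 0.9))
write.table(toy, file = "toy_check.txt", sep = "\t", row.names = FALSE)

toy_back <- read.table("toy_check.txt", sep = "\t", header = TRUE)
all.equal(toy, toy_back)  # should be TRUE
```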

Aggregating your data in different ways


Count your data using count()
Count how many bags there are per region - and per foodLabel
# put your counting code in here
count(BOF, foodLabel)
## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

# also combine count and arrange to have the highest count at the start.
arrange(count(BOF, foodLabel), desc(n))

## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

count(BOF, foodLabel) |>
  arrange(desc(n))

## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

BOF |>
count(foodLabel) |>
arrange(desc(n))

## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

Group by
Group by origin and foodLabel. Use summarise() to create a mean value, a standard-deviation value and a count value (use the functions mean(), sd() and n()).
# Insert script here
# Simple version - group by one column, making one value (e.g. mean)
BOFg <- group_by(BOF, foodLabel)
BOFgn <- summarise(BOFg, meanppk= mean(ppk))
BOFgn

## # A tibble: 2 × 2
## foodLabel meanppk
## <chr> <dbl>
## 1 Conventional 0.963
## 2 Organic 1.27

# Full version
BOFg <- group_by(BOF, origin, foodLabel)
BOFgn <- summarise(BOFg, meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.

BOFgn

## # A tibble: 23 × 5
## # Groups: origin [13]
## origin foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Brazil Conventional 0.899 0.0219 4
## 2 Brazil Organic 0.938 NA 1
## 3 Chile Conventional 0.968 0.0601 3
## 4 China Conventional 0.904 0.171 11
## 5 China Organic 1.05 0.0207 2
## 6 Denmark Conventional 1.35 NA 1
## 7 Denmark Organic 1.52 NA 1
## 8 Germany Conventional 1.70 NA 1
## 9 Germany Organic 1.42 NA 1
## 10 India Conventional 0.664 0.278 3
## # ℹ 13 more rows

BOFg <- group_by(BOF, fruits, foodLabel)


BOFgn <- summarise(BOFg, meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'fruits'. You can override using the
## `.groups` argument.

BOFgn

## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9

BOFgn <- BOF |>
  group_by(fruits, foodLabel) |>
  summarise(meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'fruits'. You can override using the
## `.groups` argument.

BOFgn

## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9

BOFgn <- BOF %>%
  group_by(fruits, foodLabel) %>%
  summarise(meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'fruits'. You can override using the
## `.groups` argument.

BOFgn
## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9
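The repeated `summarise()` message can be silenced by setting the `.groups` argument explicitly; `"drop"` also returns an ungrouped result, which is usually what you want for further work. A sketch with made-up data:

```r
library(dplyr)

toy <- data.frame(fruits = c("Apples", "Apples", "Oranges"),
                  ppk    = c(1.0, 1.4, 0.9))

res <- toy |>
  group_by(fruits) |>
  summarise(meanppk = mean(ppk), .groups = "drop")  # no message, no grouping left
res
```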

Using Copilot/ChatGPT
Create an account (if you don't have one already). As a CBS student you have Copilot (Microsoft) by default, but you can also use e.g. ChatGPT (OpenAI) or Gemini (Google) - or skolegpt.dk (where you don't need an account; it speaks English if you talk to it in English).
Try to solve the group by-exercise from before.
Did you have any problems during the class? Try to ask ChatGPT for help.
Are there any concepts you are having problems with? Try to get help from ChatGPT.
