Programming in R. Ex5-wrangling
Programming in R. Ex5-wrangling
In this exercise you will do some data wrangling on a fabricated dataset. Use the .press-
slides to reuse some of the code.
In this exercise you will need tidyverse, since that is our primary wrangling package.
Load the tidyverse package:
install.packages("stringr")
library(tidyverse)
library(stringr)
library(dplyr)
You will do a lot of the same operations as in the lecture - but with another dataset. Start
reading in your 3 data-set.
library(readxl)
# Bind the two data frames using rbind() into a new dataset called
BagOfFruits or BOF
BOF <- rbind(BOO, BOA)
Merging data (left_join()) ~ VLOOKUP in excel
• Prepare data (rename California to prepare for merge). Use str_replace_all() or
gsub()
• Merge
• Rename the Prize column to price
if (!requireNamespace("stringr", quietly = TRUE)) {
install.packages("stringr")
}
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
# Rename content
# (Remember to replace California with United States (check spelling
in Geo_dim))
BOF$origin <- str_replace_all(BOF$origin, "California", "United
States")
names(Geo)
ncol(BOF)
## [1] 6
BOF <- left_join(BOF, Geo, by = c("origin" = "Country"))
ncol(BOF) # check - one column more?
## [1] 7
## [1] TRUE
Make a new dataset BOF_Europe (or Asia, or Africa or North America). This should only
consist of the region you deside.
# new data with only one Region
## # A tibble: 61 × 7
## weight price origin foodLabel fruits Region
ppk
## <dbl> <dbl> <chr> <chr> <chr> <chr>
<dbl>
## 1 2.03 4.91 United States Organic Oranges North America
2.42
## 2 1.59 3.67 Netherlands Organic Apples Europe
2.31
## 3 1.74 3.07 Spain Organic Apples Europe
1.77
## 4 1.31 2.22 Germany Conventional Apples Europe
1.70
## 5 1.57 2.39 Denmark Organic Apples Europe
1.52
## 6 1.59 2.25 Germany Organic Apples Europe
1.42
## 7 1.30 1.76 Denmark Conventional Apples Europe
1.35
## 8 2.11 2.66 New Zealand Conventional Apples Asia & Pacific
1.26
## 9 2.25 2.7 China Conventional Apples Asia & Pacific
1.20
## 10 1.70 2.04 Spain Organic Oranges Europe
1.20
## # ℹ 51 more rows
arrange(BOF,ppk)
## # A tibble: 61 × 7
## weight price origin foodLabel fruits Region
ppk
## <dbl> <dbl> <chr> <chr> <chr> <chr>
<dbl>
## 1 2.70 1.3 India Conventional Apples Asia & Pacific
0.481
## 2 2.56 1.31 Poland Conventional Apples Europe
0.512
## 3 3.67 1.93 India Conventional Apples Asia & Pacific
0.526
## 4 2.78 1.59 China Conventional Apples Asia & Pacific
0.572
## 5 3.07 1.99 China Conventional Apples Asia & Pacific
0.648
## 6 2.66 2.09 South Africa Conventional Apples Africa
0.787
## 7 1.99 1.59 United States Conventional Apples North America
0.799
## 8 1.93 1.57 South Africa Organic Apples Africa
0.813
## 9 2.59 2.2 China Conventional Apples Asia & Pacific
0.849
## 10 1.99 1.74 South Africa Conventional Apples Africa
0.875
## # ℹ 51 more rows
# also combine count and arrange to have the highest count in the
start.
## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19
## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19
BOF |>
count(foodLabel) |>
arrange(desc(n))
## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19
Group by
Group by origin and foodLabel Use summarise() to create a mean-value, a standard-
deviation-value and a count-value (use the functions, mean(), sd() and n())
# Insert script here
# Simple version - group by one column, making one value (e.g. mean)
BOFg <- group_by(BOF, foodLabel)
BOFgn <- summarise(BOFg, meanppk= mean(ppk))
BOFgn
## # A tibble: 2 × 2
## foodLabel meanppk
## <chr> <dbl>
## 1 Conventional 0.963
## 2 Organic 1.27
# Full version
BOFg <- group_by(BOF, origin, foodLabel)
BOFgn <- summarise(BOFg, meanppk= mean(ppk), sdppk = sd(ppk), number =
n())
BOFgn
## # A tibble: 23 × 5
## # Groups: origin [13]
## origin foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Brazil Conventional 0.899 0.0219 4
## 2 Brazil Organic 0.938 NA 1
## 3 Chile Conventional 0.968 0.0601 3
## 4 China Conventional 0.904 0.171 11
## 5 China Organic 1.05 0.0207 2
## 6 Denmark Conventional 1.35 NA 1
## 7 Denmark Organic 1.52 NA 1
## 8 Germany Conventional 1.70 NA 1
## 9 Germany Organic 1.42 NA 1
## 10 India Conventional 0.664 0.278 3
## # ℹ 13 more rows
BOFgn
## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9
BOFgn
## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9
BOFgn
## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9
Using Copilot/ChatGPT
Create an account (if you don’t have it already. By default you have Copilot(Microsoft) as a
CBS-student but you can also use e.g. ChatGPT(OpenAI) and Gemini(Google) - or
skolegpt.dk (where you dont need an account. It does speak english if you talkt to it in
english)
Try to solve the group by-exercise from before
Did you have had any problems during the class? Try to ask ChatGPT for help
Are there any concepts you are having problems with? Try to get help from ChatGPT