Programming in R. Ex5-wrangling

This document outlines a data wrangling exercise using R and the tidyverse package, focusing on manipulating a fabricated dataset of fruit bags. It includes steps for reading data, merging datasets, filtering, aggregating, and summarizing data, as well as using functions like left_join, mutate, and summarise. The exercise emphasizes practical coding skills for data analysis and encourages the use of AI tools like ChatGPT for assistance.

Wrangling Exercises

In this exercise you will do some data wrangling on a fabricated dataset. Use the .Rpres
slides to reuse some of the code.
In this exercise you will need tidyverse, since that is our primary wrangling package.
Load the tidyverse package (this also attaches dplyr and stringr, so separate library() calls for those are not needed):
library(tidyverse)

You will do a lot of the same operations as in the lecture - but with another dataset. Start
by reading in your three datasets.
library(readxl)

# Option so everyone ends up in the same place: the path of this script
file_path <- paste0(dirname(rstudioapi::getSourceEditorContext()$path), "/")

# read in your data: Bags of Apples (note: csv-file!)
BOA <- read.csv(paste0(file_path, "BagsOfApples.csv"), sep=";") # this one should work anyhow
BOA <- read.csv("BagsOfApples.csv", sep=";") # this one also works if the working directory is correct
# Also read in the Bag of oranges
BOO <- read_excel(paste0(file_path, "BagsOfOrangesNA.xlsx"))
# And read in Geo_dim
Geo <- read_excel(paste0(file_path, "Geo_dim.xlsx"))

You will need to bind the two datasets.


# Add a new column to each data frame called fruits with "Apples" and "Oranges" respectively.
BOA$fruits <- "Apples"
BOO$fruits <- "Oranges"

# Bind the two data frames using rbind() into a new dataset called BagOfFruits or BOF
BOF <- rbind(BOO, BOA)
Merging data (left_join()) ~ VLOOKUP in Excel
• Prepare data (rename California to prepare for the merge). Use str_replace_all() or
gsub()
• Merge
• Rename the prize column to price
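Before merging, it can help to see what left_join() does with a key that has no match: every row of the left table is kept, and unmatched keys get NA in the new columns. A minimal sketch with made-up data frames (these tiny frames are illustrative, not the exercise files):

```r
library(dplyr)

bags <- data.frame(origin = c("Denmark", "California"))
geo  <- data.frame(Country = c("Denmark", "United States"),
                   Region  = c("Europe", "North America"))

# "California" has no match in geo, so its Region becomes NA -
# which is why we rename it to "United States" before merging
merged <- left_join(bags, geo, by = c("origin" = "Country"))
merged
```

This is exactly the problem the renaming step below avoids.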
if (!requireNamespace("stringr", quietly = TRUE)) {
install.packages("stringr")
}
library(stringr)
library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

# Rename content
# (Remember to replace California with United States (check spelling in Geo_dim))
BOF$origin <- str_replace_all(BOF$origin, "California", "United States")

# Here you merge BOF with Geo_dim using left_join()


names(BOF)

## [1] "bagNo"     "weight"    "prize"     "origin"    "foodLabel" "fruits"

names(Geo)

## [1] "Country" "Region"

ncol(BOF)

## [1] 6
BOF <- left_join(BOF, Geo, by = c("origin" = "Country"))
ncol(BOF) # check - one column more?

## [1] 7

# Renaming column prize to price


BOF <- rename(BOF, price = prize)

Filtering and subsetting data with filter() and select()


Remove unnecessary columns
library(dplyr)

# Here you remove the first column, bagNo


BOF <- select(BOF, -bagNo)

Check for NAs


• Check for NAs
• Remove rows with NA in them.
# Here you check for NAs
anyNA(BOF)

## [1] TRUE

# Remove all rows with NA (if any)


BOF <- na.omit(BOF)

#BOFna <- BOF[!is.na(BOF$foodLabel), ] # for a single column


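Note the difference between the two approaches above: na.omit() drops a row if *any* column is NA, while filtering on !is.na() only looks at one column. A small sketch with made-up data:

```r
library(dplyr)

df <- data.frame(weight    = c(1.2, NA, 2.1),
                 foodLabel = c("Organic", "Conventional", NA))

nrow(na.omit(df))                    # drops rows 2 and 3 -> 1 row left
nrow(filter(df, !is.na(foodLabel)))  # drops only row 3   -> 2 rows left
```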

Make a new dataset BOF_Europe (or Asia, or Africa, or North America). This should only
consist of the region you decide.
# new data with only one Region

BOF_europe <- subset(BOF, Region=="Europe")


BOF_europe <- filter(BOF, Region=="Europe")

Adding new column, pricePerKilo using mutate() (and base-R)


# Add the code here to create the new price-per-kilo column, ppk
BOF <- mutate(BOF, ppk = price/weight)
#BOF$ppk <- BOF$price/BOF$weight

Sort data using arrange()


Order the data such that the highest price per kilo is at the top
# top ten most expensive bags of fruits
arrange(BOF,desc(ppk))

## # A tibble: 61 × 7
##    weight price origin        foodLabel    fruits  Region           ppk
##     <dbl> <dbl> <chr>         <chr>        <chr>   <chr>          <dbl>
##  1   2.03  4.91 United States Organic      Oranges North America   2.42
##  2   1.59  3.67 Netherlands   Organic      Apples  Europe          2.31
##  3   1.74  3.07 Spain         Organic      Apples  Europe          1.77
##  4   1.31  2.22 Germany       Conventional Apples  Europe          1.70
##  5   1.57  2.39 Denmark       Organic      Apples  Europe          1.52
##  6   1.59  2.25 Germany       Organic      Apples  Europe          1.42
##  7   1.30  1.76 Denmark       Conventional Apples  Europe          1.35
##  8   2.11  2.66 New Zealand   Conventional Apples  Asia & Pacific  1.26
##  9   2.25  2.7  China         Conventional Apples  Asia & Pacific  1.20
## 10   1.70  2.04 Spain         Organic      Oranges Europe          1.20
## # ℹ 51 more rows

arrange(BOF,ppk)

## # A tibble: 61 × 7
##    weight price origin        foodLabel    fruits  Region           ppk
##     <dbl> <dbl> <chr>         <chr>        <chr>   <chr>          <dbl>
##  1   2.70  1.3  India         Conventional Apples  Asia & Pacific 0.481
##  2   2.56  1.31 Poland        Conventional Apples  Europe         0.512
##  3   3.67  1.93 India         Conventional Apples  Asia & Pacific 0.526
##  4   2.78  1.59 China         Conventional Apples  Asia & Pacific 0.572
##  5   3.07  1.99 China         Conventional Apples  Asia & Pacific 0.648
##  6   2.66  2.09 South Africa  Conventional Apples  Africa         0.787
##  7   1.99  1.59 United States Conventional Apples  North America  0.799
##  8   1.93  1.57 South Africa  Organic      Apples  Africa         0.813
##  9   2.59  2.2  China         Conventional Apples  Asia & Pacific 0.849
## 10   1.99  1.74 South Africa  Conventional Apples  Africa         0.875
## # ℹ 51 more rows
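Note that arrange() returns *all* rows, just reordered. If you want the literal top ten, combine it with head() or slice_max(); a sketch on a small made-up frame:

```r
library(dplyr)

toy <- data.frame(ppk = c(0.5, 2.4, 1.2, 0.9, 1.8))

# sort descending, then keep only the first rows
top2 <- toy |> arrange(desc(ppk)) |> head(2)
# equivalently: slice_max(toy, ppk, n = 2)
top2$ppk
```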

Print your BOF file


We need it next time! Hint: Look in the file, CodeStructure.Rpres from lecture 4
# Put your write file-code here

write.table(BOF, file="bagsoffruits_price.txt", sep = "\t", row.names = FALSE)
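To check that the export worked, you can read the file straight back in with the same separator; a sketch on a small made-up frame (the file name toy_check.txt is just illustrative):

```r
# write a small frame, then read it back with matching settings
toy <- data.frame(fruits = c("Apples", "Oranges"), ppk = c(1.1, 0.9))
write.table(toy, file = "toy_check.txt", sep = "\t", row.names = FALSE)

toy_back <- read.table("toy_check.txt", sep = "\t", header = TRUE)
all.equal(toy, toy_back)  # should be TRUE
```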

Aggregating your data in different ways


Count your data using count()
Count how many bags there are per region - and per foodLabel
# put your counting code in here
count(BOF, foodLabel)
## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

# also combine count and arrange to have the highest count at the start.
arrange(count(BOF, foodLabel), desc(n))

## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

count(BOF, foodLabel) |>
  arrange(desc(n))

## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

BOF |>
count(foodLabel) |>
arrange(desc(n))

## # A tibble: 2 × 2
## foodLabel n
## <chr> <int>
## 1 Conventional 42
## 2 Organic 19

Group by
Group by origin and foodLabel. Use summarise() to create a mean value, a standard-deviation value and a count value (use the functions mean(), sd() and n()).
# Insert script here
# Simple version - group by one column, making one value (e.g. mean)
BOFg <- group_by(BOF, foodLabel)
BOFgn <- summarise(BOFg, meanppk= mean(ppk))
BOFgn

## # A tibble: 2 × 2
## foodLabel meanppk
## <chr> <dbl>
## 1 Conventional 0.963
## 2 Organic 1.27

# Full version
BOFg <- group_by(BOF, origin, foodLabel)
BOFgn <- summarise(BOFg, meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.

BOFgn

## # A tibble: 23 × 5
## # Groups: origin [13]
## origin foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Brazil Conventional 0.899 0.0219 4
## 2 Brazil Organic 0.938 NA 1
## 3 Chile Conventional 0.968 0.0601 3
## 4 China Conventional 0.904 0.171 11
## 5 China Organic 1.05 0.0207 2
## 6 Denmark Conventional 1.35 NA 1
## 7 Denmark Organic 1.52 NA 1
## 8 Germany Conventional 1.70 NA 1
## 9 Germany Organic 1.42 NA 1
## 10 India Conventional 0.664 0.278 3
## # ℹ 13 more rows

BOFg <- group_by(BOF, fruits, foodLabel)


BOFgn <- summarise(BOFg, meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'fruits'. You can override using the
## `.groups` argument.

BOFgn

## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9

BOFgn <- BOF |>
  group_by(fruits, foodLabel) |>
  summarise(meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'fruits'. You can override using the
## `.groups` argument.

BOFgn

## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9

BOFgn <- BOF %>%
  group_by(fruits, foodLabel) %>%
  summarise(meanppk = mean(ppk), sdppk = sd(ppk), number = n())

## `summarise()` has grouped output by 'fruits'. You can override using the
## `.groups` argument.

BOFgn
## # A tibble: 4 × 5
## # Groups: fruits [2]
## fruits foodLabel meanppk sdppk number
## <chr> <chr> <dbl> <dbl> <int>
## 1 Apples Conventional 0.948 0.317 20
## 2 Apples Organic 1.31 0.458 10
## 3 Oranges Conventional 0.976 0.0648 22
## 4 Oranges Organic 1.23 0.452 9
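The repeated `summarise()` message can be silenced by setting the `.groups` argument explicitly; `"drop"` also returns an ungrouped result, which is usually what you want for further work. A sketch with made-up data:

```r
library(dplyr)

toy <- data.frame(fruits = c("Apples", "Apples", "Oranges"),
                  ppk    = c(1.0, 1.4, 0.9))

res <- toy |>
  group_by(fruits) |>
  summarise(meanppk = mean(ppk), .groups = "drop")  # no message, no grouping left
res
```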

Using Copilot/ChatGPT
Create an account (if you don't have one already). As a CBS student you have Copilot (Microsoft) by default, but you can also use e.g. ChatGPT (OpenAI) or Gemini (Google) - or skolegpt.dk (where you don't need an account; it speaks English if you talk to it in English).
Try to solve the group by-exercise from before.
Did you have any problems during the class? Try to ask ChatGPT for help.
Are there any concepts you are having problems with? Try to get help from ChatGPT.
