This repository contains the data and code for the paper "Developing a standardized categorization system for energy efficiency measures (1836-RP)" submitted to Science and Technology for the Built Environment. This study was conducted as part of ASHRAE Research Project 1836, Developing a standardized categorization system for energy efficiency measures.
The repository is divided into three directories:
/data/
: List(s) of EEMs to be categorized (or re-categorized) and related data/analysis/
: R script for EEM categorization/results/
: Output produced by R script
Energy Efficiency Measures (EEMs) play a central role in building energy modeling, energy auditing, and energy data collection and exchange. Despite their importance, there is currently no standardized way to describe or categorize EEMs in industry. This lack of standardization severely limits the ability to communicate the intent of an EEM clearly and consistently, and to perform “apples-to-apples” comparisons of measure savings and cost effectiveness. The goal of this study was to develop and test a standardized system for categorizing EEMs.
The standardized categorization system developed in 1836-RP consists of a three-level hierarchy of building elements based on UNIFORMAT, a standard classification system for building elements and related sitework. An individual EEM is categorized on the hierarchy using an element “tag” (i.e., keyword) present in the EEM name that links the EEM with a single UNFORMAT category. The EEM name may also contain one or more descriptor tags that are not used for categorization, but may provide useful additional information for sorting and analyzing a group of EEMs.
There are five datasets associated with this project: one set of categorization tags, two example lists of EEMs to be re-categorized, and two corresponding ground truth files that contain manually assigned re-categorizations for each EEM.
To categorize the list of EEMs, the R script uses a seed list of categorization tags that was developed during 1836-RP. The file categorization-tags.csv contains the element and descriptor type tags, along with their associated UNIFORMAT categories. The CSV file containing the categorization tags consists of six columns. The column keyword
contains the tags to be used for categorization. The column type
provides information regarding whether the tag is an element tag or a descriptor tag. The remaining columns uni_code
, uni_level_1
, uni_level_2
, and uni_level_3
contain the UNIFORMAT category associated with that tag.
Some descriptor tags like "cool roof" will always belong to a single UNIFORMAT category and are therefore assigned to that category (in this case B3010 Roof Coverings). Other descriptor tags like "insulation" could belong to multiple categories depending upon the building element affected and are therefore left unassigned and given the uni_code X0000. When possible, tags have been assigned to UNIFORMAT Level 3, however, some high-level element tags (e.g., building envelope) can only be assigned to UNIFORMAT Level 1 or 2, and lower levels are left unassigned. The first five tags in the dataset:
keyword | type | uni_code | uni_level_1 | uni_level_2 | uni_level_3 |
---|---|---|---|---|---|
Foundation wall | Element | A1010 | SUBSTRUCTURE | Foundations | Standard Foundations |
Slab | Element | A1030 | SUBSTRUCTURE | Foundations | Slab on Grade |
Basement wall | Element | A2020 | SUBSTRUCTURE | Basement Construction | Basement Walls |
Building envelope | Element | B | SHELL | Unassigned | Unassigned |
envelope | Element | B | SHELL | Unassigned | Unassigned |
Two example EEM lists are provided as Comma Separated Values (CSV) files along with the R script. The first example list is a random sample of 5% of the EEMs from the 1836-RP main list of EEMs from the repo ashrae-1836-rp-text-mining and is named sample-eems.csv.
The dataset contains five columns. The EEM IDs from the main list of 1836-RP EEMs are shown in the eem_id
column, the source document is shown in the document
column, the original categorization in the source document is shown in the cat_lev1
column, the sub-categorization in the original source document (if present) is shown in the cat_lev2
column, and the EEM name is given in the eem_name
column. The first five EEMs from the 5% random sample:
eem_id | document | cat_lev1 | cat_lev2 | eem_name |
---|---|---|---|---|
10 | 1651RP | Daylighting | Passive | High ceilings |
24 | 1651RP | Daylighting | Passive | Use of interzone luminous ceilings |
35 | 1651RP | Envelope | Fenestration | Heat absorbing blinds |
36 | 1651RP | Envelope | Fenestration | Manual Internal Window shades |
60 | 1651RP | Envelope | Infiltration | High Performance Air Barrier to Reduce Infiltration |
The second list contains all of the EEMs in BuildingSync and is named building-sync.csv. It follows the same five column format as the 5% random sample. The first five EEMs from the BuildingSync list:
eem_id | document | cat_lev1 | cat_lev2 | eem_name |
---|---|---|---|---|
1237 | BSYNC | Advanced Metering Systems | 0 | Install advanced metering systems |
1238 | BSYNC | Advanced Metering Systems | 0 | Clean and/or repair |
1239 | BSYNC | Advanced Metering Systems | 0 | Implement training and/or documentation |
1240 | BSYNC | Advanced Metering Systems | 0 | Upgrade operating protocols, calibration, and/or sequencing |
1241 | BSYNC | Advanced Metering Systems | 0 | Other |
Two ground truth files contain manually assigned re-categorizations for each EEM in the example EEM lists. These files are not required for EEM categorization (or re-categorization), and are only used to compute some of the performance metrics. The first ground truth file corresponds to the 5% sample of EEMs and is named sample-eems-ground-truth.csv. The file contains the same five headers as the list of EEMs, plus a sixth uni_code_manual
column that lists the manual UNIFORMAT assignment for each EEM. EEMs that were unable to be assigned manually are listed as "X0000". The first five EEMs from the 5% random sample ground truth:
eem_id | document | cat_lev1 | cat_lev2 | eem_name | uni_code_manual |
---|---|---|---|---|---|
10 | 1651RP | Daylighting | Passive | High ceilings | C3030 |
24 | 1651RP | Daylighting | Passive | Use of interzone luminous ceilings | C3030 |
35 | 1651RP | Envelope | Fenestration | Heat absorbing blinds | E2010 |
36 | 1651RP | Envelope | Fenestration | Manual Internal Window shades | E2010 |
60 | 1651RP | Envelope | Infiltration | High Performance Air Barrier to Reduce Infiltration | B |
The second ground truth file corresponds to the list of BuildingSync EEMs and is named building-sync-ground-truth.csv. It follows the same six column format as the 5% random sample. The first five EEMs from the BuildingSync ground truth:
eem_id | document | cat_lev1 | cat_lev2 | eem_name | uni_code_manual |
---|---|---|---|---|---|
1237 | BSYNC | Advanced Metering Systems | 0 | Install advanced metering systems | D5010 |
1238 | BSYNC | Advanced Metering Systems | 0 | Clean and/or repair | D5010 |
1239 | BSYNC | Advanced Metering Systems | 0 | Implement training and/or documentation | D5010 |
1240 | BSYNC | Advanced Metering Systems | 0 | Upgrade operating protocols, calibration, and/or sequencing | D5010 |
1241 | BSYNC | Advanced Metering Systems | 0 | Other | D5010 |
In order to use a different list of categorization tags, EEMs, or ground truth, add the new list to the /data/
folder and update the file name where necessary in in the R script. User-provided input data should follow the same header and column format shown above. For user-defined EEM lists, note that the input data needs to follow the required header structure regardless of whether the list of EEMs has a prior categorization system associated with it or not. The column eem_id
should contain a unique ID for each EEM. If your EEM list does not have any original categorization system associated with it, keep the dummy columns cat_lev1
and cat_lev2
empty.
The R script eem-recategorization.R categorizes an existing list of EEMs according to the standardized categorization system developed in 1836-RP.
First, load (or install if you do not already have them installed) the packages required for data handling and tokenization.
# Load required packages
library(tidyverse)
library(tidytext)
library(tokenizers) # for 1-n gram tokenization
Import the list of categorization tags, the list of EEMs to be categorized, and the corresponding ground truth (if available). The relative filepaths in this script follow the same directory structure as this Github repository, and it is recommended that you use this same structure. You might have to use setwd()
to set the working directory to the location of the R script.
# Load categorization tags
tag_list <- read_csv("../data/categorization-tags.csv")
# Load one of the two samples provided with the script or a custom EEM list:
sample_eems <- read_csv("../data/sample-eems.csv") # 5% random sample of EEMs from the main list
# Load the manually assigned categories for the EEM samples (For calculating metrics 3 and 4)
ground_truth <- read_csv("../data/sample-eems-ground-truth.csv") # Ground truth for the 5% random sample
Since the tags consist of n-grams where n>1 (e.g., words, bigrams, trigrams), the EEM names need to be tokenized as sequences of n-grams. To determine the number of n in n-grams to be used to tokenize the EEM names, we first tokenize the tags and count the range of number of tokens in the categorization tag list.
tokenized_tag_list <- tag_list %>%
select(keyword) %>%
unnest_tokens(word, keyword, token = "words", drop = FALSE)
tag_token_counts <- tokenized_tag_list %>%
count(keyword)
This produces a token count for each tag. For the first five tags:
keyword n
<chr> <int>
1 Absorption chiller 2
2 Advanced power strip 3
3 AHU 1
4 Air barrier 2
5 air cooled 2
For the ASHRAE 1836-RP categorization tags, the tags range from 1 to 5 words in length. In order to look for these tags within the EEM names, the EEM names need to be tokenized as n-grams where n ranges from 1 to 5.
## Tokenize EEM names
# create a list of tokens for each EEM name
token_1_5grams <- tokenize_ngrams(sample_eems$eem_name, lowercase = TRUE, n = 5, n_min = 1)
# Map the list as a dataframe
eem_tokens <- map_df(token_1_5grams, ~as.data.frame(.x), .id="id")
eem_tokens$id <- as.numeric(eem_tokens$id)
eem_tokens <- dplyr::rename(eem_tokens, tokens = .x)
This produces a dataframe with the tokens from each EEM as separate rows. For example, the EEM "High ceilings" is tokenized into three possible tokens: "high", "high ceilings", and "ceilings". For the first five tokens:
id tokens
1 1 high
2 1 high ceilings
3 1 ceilings
4 2 use
5 2 use of
The automatic tagger and categorizer code searches for the categorization tags within the tokenized EEM names. Every time it finds a match, it tags that EEM with all the information associated with that particular tag. This includes the tag, the tag type and the UNIFORMAT category associated with the tag. Tokens that do not have a matching tagged are left untagged with an .
# Tagging all tokens with relevant categorization tags
# and using them to re-categorize the EEMs according to UNIFORMAT
for (i in 1:nrow(tag_list)) {
for (j in 1:nrow(eem_tokens)) {
# Look for keywords (tag_list, column 1) in EEM name tokens (eem_tokens, column 2)
# Add corresponding tags and associated UNIFORMAT category to the EEM name tokens
if(grepl(paste0("^", tag_list[i,1]), eem_tokens[j,2], ignore.case=TRUE) == 1){
eem_tokens[j,3] <- tag_list[i,1] # keyword
eem_tokens[j,4] <- tag_list[i,2] # tag type
eem_tokens[j,5] <- tag_list[i,3] # UNI code
eem_tokens[j,6] <- tag_list[i,4] # UNI level 1 category
eem_tokens[j,7] <- tag_list[i,5] # UNI level 2 category
eem_tokens[j,8] <- tag_list[i,6] # UNI level 3 category
}
}
}
For the first five tokens, the dataframe is now:
id tokens keyword type uni_code uni_level_1 uni_level_2 uni_level_3
1 1 high <NA> <NA> <NA> <NA> <NA> <NA>
2 1 high ceilings <NA> <NA> <NA> <NA> <NA> <NA>
3 1 ceilings ceiling Element C3030 INTERIORS Interior Finishes Ceiling Finishes
4 2 use <NA> <NA> <NA> <NA> <NA> <NA>
5 2 use of <NA> <NA> <NA> <NA> <NA> <NA>
The script then exports six CSV files:
sheet-1-tagged-eems
: Tagged EEMs from the sample along with their original and new categorizationssheet-2-untagged-eems
: Untagged EEMs from the sample along with their original categorizationssheet-3-top-tags
: Most frequently occurring tagged terms from the sample along with their countssheet-4-untagged-words
: Most frequently occurring un-tagged terms from the sample along with their countssheet-5-untagged-bigrams
: Most frequently occurring un-tagged bigrams from the sample along with their countssheet-6-ground-truth-matching
: Comparison of automatically and manually assigned categorizations for each EEM in the sample
To illustrate the results, the first five EEMs from sheet-1-tagged-eems
are shown below. If multiple tags are present in the EEM, each tag gets its own row in the re-categorized list.
eem_id | document | cat_lev1 | cat_lev2 | eem_name | tags | type | uni_code | uni_level_1 | uni_level_2 | uni_level_3 |
---|---|---|---|---|---|---|---|---|---|---|
10 | 1651RP | Daylighting | Passive | High ceilings | ceiling | Element | C3030 | INTERIORS | Interior Finishes | Ceiling Finishes |
24 | 1651RP | Daylighting | Passive | Use of interzone luminous ceilings | ceiling | Element | C3030 | INTERIORS | Interior Finishes | Ceiling Finishes |
35 | 1651RP | Envelope | Fenestration | Heat absorbing blinds | blind | Element | E2010 | EQUIPMENT & FURNISHINGS | Furnishings | Fixed Furnishings |
36 | 1651RP | Envelope | Fenestration | Manual Internal Window shades | manual | Descriptor | X0000 | Unassigned | Unassigned | Unassigned |
36 | 1651RP | Envelope | Fenestration | Manual Internal Window shades | Wind | Descriptor | D3010 | SERVICES | HVAC | Energy Supply |
The script then computes four summary metrics to quantify the performance of the categorization system and R script:
- Metric 1: Percentage of EEMs that got tagged automatically by the script
- Metric 2: Percentage of EEMs that got categorized automatically by the script
- Metric 3: Percentage of EEMs that got categorized manually in the ground truth
- Metric 4: Percentage of EEMs that got categorized correctly
If the script produces an error, check for the following issues:
- Are you using a CSV file? The script will not work with Excel sheets.
- Do the column names in the file match those specified above?
- Are you following the GitHub repo folder structure?
- Are you using RStudio to open and run the files?
- Do you have the most recent version of R and RStudio installed?
- Are the libraries in R up to date?