Developing a standardized categorization system for energy efficiency measures (1836-RP)

This repository contains the data and code for the paper "Developing a standardized categorization system for energy efficiency measures (1836-RP)" submitted to Science and Technology for the Built Environment. This study was conducted as part of ASHRAE Research Project 1836, Developing a standardized categorization system for energy efficiency measures.

Repository Structure

The repository is divided into three directories:

/data/: List(s) of EEMs to be categorized (or re-categorized) and related data
/analysis/: R script for EEM categorization
/results/: Output produced by R script

Objective

Energy Efficiency Measures (EEMs) play a central role in building energy modeling, energy auditing, and energy data collection and exchange. Despite their importance, there is currently no standardized way to describe or categorize EEMs in industry. This lack of standardization severely limits the ability to communicate the intent of an EEM clearly and consistently, and to perform “apples-to-apples” comparisons of measure savings and cost effectiveness. The goal of this study was to develop and test a standardized system for categorizing EEMs.

The standardized categorization system developed in 1836-RP consists of a three-level hierarchy of building elements based on UNIFORMAT, a standard classification system for building elements and related sitework. An individual EEM is categorized on the hierarchy using an element “tag” (i.e., keyword) present in the EEM name that links the EEM with a single UNFORMAT category. The EEM name may also contain one or more descriptor tags that are not used for categorization, but may provide useful additional information for sorting and analyzing a group of EEMs.

Data

There are five datasets associated with this project: one set of categorization tags, two example lists of EEMs to be re-categorized, and two corresponding ground truth files that contain manually assigned re-categorizations for each EEM.

ASHRAE 1836-RP categorization tags

To categorize the list of EEMs, the R script uses a seed list of categorization tags that was developed during 1836-RP. The file categorization-tags.csv contains the element and descriptor type tags, along with their associated UNIFORMAT categories. The CSV file containing the categorization tags consists of six columns. The column keyword contains the tags to be used for categorization. The column type provides information regarding whether the tag is an element tag or a descriptor tag. The remaining columns uni_code, uni_level_1, uni_level_2, and uni_level_3 contain the UNIFORMAT category associated with that tag.

Some descriptor tags like "cool roof" will always belong to a single UNIFORMAT category and are therefore assigned to that category (in this case B3010 Roof Coverings). Other descriptor tags like "insulation" could belong to multiple categories depending upon the building element affected and are therefore left unassigned and given the uni_code X0000. When possible, tags have been assigned to UNIFORMAT Level 3, however, some high-level element tags (e.g., building envelope) can only be assigned to UNIFORMAT Level 1 or 2, and lower levels are left unassigned. The first five tags in the dataset:

keyword	type	uni_code	uni_level_1	uni_level_2	uni_level_3
Foundation wall	Element	A1010	SUBSTRUCTURE	Foundations	Standard Foundations
Slab	Element	A1030	SUBSTRUCTURE	Foundations	Slab on Grade
Basement wall	Element	A2020	SUBSTRUCTURE	Basement Construction	Basement Walls
Building envelope	Element	B	SHELL	Unassigned	Unassigned
envelope	Element	B	SHELL	Unassigned	Unassigned

ASHRAE 1836-RP 5% sample of EEMs

Two example EEM lists are provided as Comma Separated Values (CSV) files along with the R script. The first example list is a random sample of 5% of the EEMs from the 1836-RP main list of EEMs from the repo ashrae-1836-rp-text-mining and is named sample-eems.csv.

The dataset contains five columns. The EEM IDs from the main list of 1836-RP EEMs are shown in the eem_id column, the source document is shown in the document column, the original categorization in the source document is shown in the cat_lev1 column, the sub-categorization in the original source document (if present) is shown in the cat_lev2 column, and the EEM name is given in the eem_name column. The first five EEMs from the 5% random sample:

eem_id	document	cat_lev1	cat_lev2	eem_name
10	1651RP	Daylighting	Passive	High ceilings
24	1651RP	Daylighting	Passive	Use of interzone luminous ceilings
35	1651RP	Envelope	Fenestration	Heat absorbing blinds
36	1651RP	Envelope	Fenestration	Manual Internal Window shades
60	1651RP	Envelope	Infiltration	High Performance Air Barrier to Reduce Infiltration

BuildingSync list of EEMs

The second list contains all of the EEMs in BuildingSync and is named building-sync.csv. It follows the same five column format as the 5% random sample. The first five EEMs from the BuildingSync list:

eem_id	document	cat_lev1	eem_name
1237	BSYNC	Advanced Metering Systems	Install advanced metering systems
1238	BSYNC	Advanced Metering Systems	Clean and/or repair
1239	BSYNC	Advanced Metering Systems	Implement training and/or documentation
1240	BSYNC	Advanced Metering Systems	Upgrade operating protocols, calibration, and/or sequencing
1241	BSYNC	Advanced Metering Systems	Other

ASHRAE 1836-RP 5% sample ground truth

Two ground truth files contain manually assigned re-categorizations for each EEM in the example EEM lists. These files are not required for EEM categorization (or re-categorization), and are only used to compute some of the performance metrics. The first ground truth file corresponds to the 5% sample of EEMs and is named sample-eems-ground-truth.csv. The file contains the same five headers as the list of EEMs, plus a sixth uni_code_manual column that lists the manual UNIFORMAT assignment for each EEM. EEMs that were unable to be assigned manually are listed as "X0000". The first five EEMs from the 5% random sample ground truth:

eem_id	document	cat_lev1	cat_lev2	eem_name	uni_code_manual
10	1651RP	Daylighting	Passive	High ceilings	C3030
24	1651RP	Daylighting	Passive	Use of interzone luminous ceilings	C3030
35	1651RP	Envelope	Fenestration	Heat absorbing blinds	E2010
36	1651RP	Envelope	Fenestration	Manual Internal Window shades	E2010
60	1651RP	Envelope	Infiltration	High Performance Air Barrier to Reduce Infiltration	B

BuildingSync ground truth

The second ground truth file corresponds to the list of BuildingSync EEMs and is named building-sync-ground-truth.csv. It follows the same six column format as the 5% random sample. The first five EEMs from the BuildingSync ground truth:

eem_id	document	cat_lev1	eem_name	uni_code_manual
1237	BSYNC	Advanced Metering Systems	Install advanced metering systems	D5010
1238	BSYNC	Advanced Metering Systems	Clean and/or repair	D5010
1239	BSYNC	Advanced Metering Systems	Implement training and/or documentation	D5010
1240	BSYNC	Advanced Metering Systems	Upgrade operating protocols, calibration, and/or sequencing	D5010
1241	BSYNC	Advanced Metering Systems	Other	D5010

User-defined data

In order to use a different list of categorization tags, EEMs, or ground truth, add the new list to the /data/ folder and update the file name where necessary in in the R script. User-provided input data should follow the same header and column format shown above. For user-defined EEM lists, note that the input data needs to follow the required header structure regardless of whether the list of EEMs has a prior categorization system associated with it or not. The column eem_id should contain a unique ID for each EEM. If your EEM list does not have any original categorization system associated with it, keep the dummy columns cat_lev1 and cat_lev2 empty.

Analysis

The R script eem-recategorization.R categorizes an existing list of EEMs according to the standardized categorization system developed in 1836-RP.

Setup

Load packages

First, load (or install if you do not already have them installed) the packages required for data handling and tokenization.

# Load required packages
library(tidyverse)
library(tidytext)
library(tokenizers) # for 1-n gram tokenization

Import data

Import the list of categorization tags, the list of EEMs to be categorized, and the corresponding ground truth (if available). The relative filepaths in this script follow the same directory structure as this Github repository, and it is recommended that you use this same structure. You might have to use setwd() to set the working directory to the location of the R script.

# Load categorization tags
tag_list <- read_csv("../data/categorization-tags.csv")

# Load one of the two samples provided with the script or a custom EEM list:
sample_eems <- read_csv("../data/sample-eems.csv") # 5% random sample of EEMs from the main list

# Load the manually assigned categories for the EEM samples (For calculating metrics 3 and 4)
ground_truth <- read_csv("../data/sample-eems-ground-truth.csv") # Ground truth for the 5% random sample

Tokenize tags and EEMs

Since the tags consist of n-grams where n>1 (e.g., words, bigrams, trigrams), the EEM names need to be tokenized as sequences of n-grams. To determine the number of n in n-grams to be used to tokenize the EEM names, we first tokenize the tags and count the range of number of tokens in the categorization tag list.

tokenized_tag_list <- tag_list %>% 
  select(keyword) %>% 
  unnest_tokens(word, keyword, token = "words", drop = FALSE)

tag_token_counts <- tokenized_tag_list %>% 
  count(keyword)

This produces a token count for each tag. For the first five tags:

   keyword                     n
   <chr>                   <int>
 1 Absorption chiller          2
 2 Advanced power strip        3
 3 AHU                         1
 4 Air barrier                 2
 5 air cooled                  2

For the ASHRAE 1836-RP categorization tags, the tags range from 1 to 5 words in length. In order to look for these tags within the EEM names, the EEM names need to be tokenized as n-grams where n ranges from 1 to 5.

## Tokenize EEM names
# create a list of tokens for each EEM name
token_1_5grams <- tokenize_ngrams(sample_eems$eem_name, lowercase = TRUE, n = 5, n_min = 1)

# Map the list as a dataframe
eem_tokens <- map_df(token_1_5grams, ~as.data.frame(.x), .id="id")
eem_tokens$id <- as.numeric(eem_tokens$id)
eem_tokens <- dplyr::rename(eem_tokens, tokens = .x)

This produces a dataframe with the tokens from each EEM as separate rows. For example, the EEM "High ceilings" is tokenized into three possible tokens: "high", "high ceilings", and "ceilings". For the first five tokens:

  id        tokens
1  1          high
2  1 high ceilings
3  1      ceilings
4  2           use
5  2        use of

Search, tag, and categorize

The automatic tagger and categorizer code searches for the categorization tags within the tokenized EEM names. Every time it finds a match, it tags that EEM with all the information associated with that particular tag. This includes the tag, the tag type and the UNIFORMAT category associated with the tag. Tokens that do not have a matching tagged are left untagged with an .

# Tagging all tokens with relevant categorization tags
# and using them to re-categorize the EEMs according to UNIFORMAT

for (i in 1:nrow(tag_list)) {
  for (j in 1:nrow(eem_tokens)) {
    # Look for keywords (tag_list, column 1) in EEM name tokens (eem_tokens, column 2)
    # Add corresponding tags and associated UNIFORMAT category to the EEM name tokens
    if(grepl(paste0("^", tag_list[i,1]), eem_tokens[j,2], ignore.case=TRUE) == 1){
      eem_tokens[j,3] <- tag_list[i,1] # keyword
      eem_tokens[j,4] <- tag_list[i,2] # tag type
      eem_tokens[j,5] <- tag_list[i,3] # UNI code
      eem_tokens[j,6] <- tag_list[i,4] # UNI level 1 category
      eem_tokens[j,7] <- tag_list[i,5] # UNI level 2 category
      eem_tokens[j,8] <- tag_list[i,6] # UNI level 3 category
    }
  }
}

For the first five tokens, the dataframe is now:

  id        tokens keyword    type uni_code uni_level_1       uni_level_2      uni_level_3
1  1          high    <NA>    <NA>     <NA>        <NA>              <NA>             <NA>
2  1 high ceilings    <NA>    <NA>     <NA>        <NA>              <NA>             <NA>
3  1      ceilings ceiling Element    C3030   INTERIORS Interior Finishes Ceiling Finishes
4  2           use    <NA>    <NA>     <NA>        <NA>              <NA>             <NA>
5  2        use of    <NA>    <NA>     <NA>        <NA>              <NA>             <NA>

Results

List of tagged and untagged EEMs

The script then exports six CSV files:

sheet-1-tagged-eems: Tagged EEMs from the sample along with their original and new categorizations
sheet-2-untagged-eems: Untagged EEMs from the sample along with their original categorizations
sheet-3-top-tags: Most frequently occurring tagged terms from the sample along with their counts
sheet-4-untagged-words: Most frequently occurring un-tagged terms from the sample along with their counts
sheet-5-untagged-bigrams: Most frequently occurring un-tagged bigrams from the sample along with their counts
sheet-6-ground-truth-matching: Comparison of automatically and manually assigned categorizations for each EEM in the sample

To illustrate the results, the first five EEMs from sheet-1-tagged-eems are shown below. If multiple tags are present in the EEM, each tag gets its own row in the re-categorized list.

eem_id	document	cat_lev1	cat_lev2	eem_name	tags	type	uni_code	uni_level_1	uni_level_2	uni_level_3
10	1651RP	Daylighting	Passive	High ceilings	ceiling	Element	C3030	INTERIORS	Interior Finishes	Ceiling Finishes
24	1651RP	Daylighting	Passive	Use of interzone luminous ceilings	ceiling	Element	C3030	INTERIORS	Interior Finishes	Ceiling Finishes
35	1651RP	Envelope	Fenestration	Heat absorbing blinds	blind	Element	E2010	EQUIPMENT & FURNISHINGS	Furnishings	Fixed Furnishings
36	1651RP	Envelope	Fenestration	Manual Internal Window shades	manual	Descriptor	X0000	Unassigned	Unassigned	Unassigned
36	1651RP	Envelope	Fenestration	Manual Internal Window shades	Wind	Descriptor	D3010	SERVICES	HVAC	Energy Supply

Performance metrics

The script then computes four summary metrics to quantify the performance of the categorization system and R script:

Metric 1: Percentage of EEMs that got tagged automatically by the script
Metric 2: Percentage of EEMs that got categorized automatically by the script
Metric 3: Percentage of EEMs that got categorized manually in the ground truth
Metric 4: Percentage of EEMs that got categorized correctly

Troubleshooting

If the script produces an error, check for the following issues:

Are you using a CSV file? The script will not work with Excel sheets.
Do the column names in the file match those specified above?
Are you following the GitHub repo folder structure?
Are you using RStudio to open and run the files?
Do you have the most recent version of R and RStudio installed?
Are the libraries in R up to date?

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
analysis		analysis
data		data
results		results
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Developing a standardized categorization system for energy efficiency measures (1836-RP)

Contents

Repository Structure

Objective

Data

ASHRAE 1836-RP categorization tags

ASHRAE 1836-RP 5% sample of EEMs

BuildingSync list of EEMs

ASHRAE 1836-RP 5% sample ground truth

BuildingSync ground truth

User-defined data

Analysis

Setup

Load packages

Import data

Tokenize tags and EEMs

Search, tag, and categorize

Results

List of tagged and untagged EEMs

Performance metrics

Troubleshooting

About

Releases

Packages

Contributors 2

Languages

License

retrofit-lab/ashrae-1836-rp-categorization

Folders and files

Latest commit

History

Repository files navigation

Developing a standardized categorization system for energy efficiency measures (1836-RP)

Contents

Repository Structure

Objective

Data

ASHRAE 1836-RP categorization tags

ASHRAE 1836-RP 5% sample of EEMs

BuildingSync list of EEMs

ASHRAE 1836-RP 5% sample ground truth

BuildingSync ground truth

User-defined data

Analysis

Setup

Load packages

Import data

Tokenize tags and EEMs

Search, tag, and categorize

Results

List of tagged and untagged EEMs

Performance metrics

Troubleshooting

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages