Open Refine

Download as pdf or txt
Download as pdf or txt
You are on page 1of 41

Introduction to OpenRefine

2018 Bibliometrics and Research Assessment Symposium


Candace Norton, MLS
© Library Carpentry: "OpenRefine Lessons for Librarians."
Housekeeping

• This session will be recorded and distributed


after editing and captioning.
• Slides will be available following the Symposium.
• Questions are welcome throughout the
presentation; please use the microphone.
• Due to the set up of the breakout room, this is a
lecture style course and will not be hands on.
Course Objectives

• Explain what OpenRefine does and why to use


it.
• Recognize when using OpenRefine is
appropriate.
• Understand how to import, edit, transform, and
export data using OpenRefine.
Introduction to OpenRefine

• OpenRefine is a powerful tool for working with


messy data
• OpenRefine can clean data, transform data from
one format to another, and can extend data via
web services and external data
• OpenRefine is available for download online at
http://openrefine.org/download.html
Why use OpenRefine?

• Get an overview of a data set


• Resolve inconsistencies in a data set
– standardizing date formatting
• Helps split data up into more granular parts
– splitting up cells with multiple authors into separate
cells
• Match local data up to other data sets
– matching local subjects against the Library of
Congress Subject Headings
• Enhance a data set with data from other sources
Common Scenarios: Dates

Data as entered Desired data


1st January 2014 2014-01-01
01/01/2014 2014-01-01
Jan 1 2014 2014-01-01
2014-01-01 2014-01-01
Common Scenarios: Names

Data as entered Desired data


London London
London] London
London,] London
london London
Common Scenarios: Combined Data

Address in single Library


field Institution name Address 1 Town/City Country Postcode
University of Wales, University of Llyfrgell Llanbadarn Aberystwyth United SY23 3AS
Llyfrgell Thomas Parry Wales Thomas Fawr Kingdom
Library, Llanbadarn Parry
Fawr, ABERYSTWYTH, Library
Ceredigion, SY23 3AS,
United Kingdom

University of Aberdeen, University of Queen Meston Walk Aberdeen United AB24 3UE
Queen Mother Library, Abderdeen Mother Kingdom
Meston Walk, Library
ABERDEEN, AB24 3UE,
United Kingdom
Importing Data into OpenRefine

• Create a project by importing data:

• TSV • Excel (.xls and .xlsx)


• CSV • XML
• *SV • RDF as XML
• JSON • Google Data documents

• Support for other formats can be added with


OpenRefine extensions
• Typically a publication dataset from a
bibliographic database in CSV format
Create a Project

• Click “Create Project” in the left menu bar


• Select “Get data from this computer”
• Browse for the file
• Click “Next >>” to select import parameters
• Preview the import
• Name the project using the text field in the upper
right corner
• Click “Create Project” button
Layout of OpenRefine

• Displays data in a tabular format


• Each row will usually represent a ‘record’ in the
data
• Each column represents a type of information or
variable
• Only displays a limited number of rows of data at
one time
• Most options to work with data are accessed
from drop down menus at the top of the data
columns
Rows and Records

• OpenRefine has two modes of viewing data:


– Rows
– Records
• Default view is Rows mode
– Each row represents a single record in the data set
• Records mode
– OpenRefine can link together multiple rows as
belonging to the same Record
Working with Columns

• Re-order columns by clicking the drop-down


menu at the top of the first column labelled ‘All’
• Choose Edit Columns->Re-order/remove
columns
• Drag and drop column names to re-order the
columns, or remove columns completely if they
are not needed
Sorting Data

• Sort data by clicking on the drop-down menu for


the relevant column, and click ‘Sort’
• Once the data is sorted, a new ‘Sort’ drop-down
menu will display
• Sorts performed in OpenRefine are temporary
• Sort on multiple columns at the same time by
adding another sorted column
Splitting Cells
To split the multi-valued cells (like author names or
addresses) into their own cells, use a Split multi-
valued cells function:
•Click the dropdown menu at the top of the Author
column
•Choose Edit cells->Split multi-valued cells
•In the prompt type the ( | ) symbol and click OK
•Note that the rows are still numbered sequentially
•Click the Records option to change to Records
mode
•Note how the numbering has changed - indicating that
several rows are related to the same record
Joining Cells

•Click the dropdown menu at the top of the Author


column
•Choose Edit cells->Join multi-valued cells
•In the prompt type the ( | ) symbol
•Here we are specifying the delimiter character
for OpenRefine to use to join the values
together.
•Click OK to join the Authors cells back together
Joining Cells, continued

• A common workflow with multi-valued cells is


– split multi-valued cells into individual cells
– modify/refine/clean individual cells
– join multi-valued cells back together

• After joining cells together, Rows and Records


values will now be the same since there are no
more split columns
• Click both the Rows and Records options and
observe how the numbers are equal
Clustering

• The Cluster function groups together similar but


inconsistent values in a given column and
permits merging these inconsistent values into a
single chosen value
• This is very effective where there is data with
minor variations in data values, e.g. names of
people, organizations, places, classification
terms
Clustering, continued

• ‘Clusters’ are created automatically according to


an algorithm
– Link for more information on Clustering Algorithms
• For each cluster, there is the option of ‘merging’
the values together
• To use the ‘Cluster’ function, click on the Edit
Cells menu option of the relevant column and
choose Cluster and Edit
Clustering to clean author data

• Split out the author names into individual cells


using Edit cells -> Split multi-valued cells,
using the pipe ( | ) character as the separator
• Choose Edit cells -> Cluster and edit from the
‘author’ column
• Using the key collision method with the
fingerprint Keying Function, work through the
clusters of values, merging them to a single
value where appropriate
• Try changing the clustering method being used
Transformations

• Transformations are ways of manipulating data


in columns when basic sorting and faceting are
not enough
• Allows users to programmatically edit data
• Normally written in a special language called
GREL (General Refine Expression Language)
– Similar to Excel functions but focused on text
manipulation rather than numeric functions
– Full documentation for GREL is available online
Transformation Examples

• Splitting data that is in a single column into


multiple columns
– Splitting an address into multiple parts
• Standardizing the format of data in a column
without changing the values
– Removing punctuation or standardizing a date format
• Extracting a particular type of data from a longer
text string
– Finding ISBNs in a bibliographic citation
Writing Transformations

• Select the column to transform and choose ‘Edit


cells->Transform’ from the drop down menu
• A new screen will display with a place to write a
transformation (the ‘Expression’ box) along with
a preview window to see the effect of the
transformation on 10 rows of data
• The transformation typed into the ‘Expression’
box has to be a valid GREL expression
– The word ‘value’ by itself is the simplest expression
– It simply means display the value that is currently in
the column and make no change.
GREL Functions

• GREL functions are written by giving a value of


some kind (a text string, a date, a number, etc.)
to a GREL function
• Some GREL functions take additional
parameters or options which control how the
function works
• GREL supports two types of syntax:
– value.function(options)
– function(value, options)
Common Transformations

Common
Transformation Action GREL expression
To Uppercase Converts the current value to value.toUppercase()
uppercase
To Lowercase Converts the current value to value.toLowercase()
lowercase
To Titlecase Converts the current value to value.toTitlecase()
titlecase (i.e. each word starts with
an uppercase character and all other
characters are converted to
lowercase)
Trim leading and Removes any ‘whitespace’ value.trim()
trailing characters (e.g. spaces, tabs) from
whitespace the start or end of the current value
Sample Workflow: Author Affiliations

• Split multi-valued cells


• Remove author names
– value.replace(/\[\D+\]/,"")
– expression: \[\D+\]
• Remove additional affiliation information
– value.replace(/\,.+/,"")
– expression: \,.+
• Remove leading and trailing whitespace
• Create facet and continue as normal
Faceting

• A ‘Facet’ groups all the values that appear in a


column, allows for filtering the data by these
values, and then editing values across many
records at the same time
• OpenRefine limits the number of values allowed
in a single facet to ensure the software does not
perform slowly or run out of memory
• Why use faceting?
– Can help get an overview of the data in a project
– Can help bring more consistency to the data
Using ‘Text Facet’

• To create a text facet for a column, click on the


drop down menu at the top of the column, and
choose Facet -> Text Facet
– The facet will appear in the left side panel
– Will consist of a list of values used in that column of
data
– Can filter the data by clicking on one of the headings
– Can Include multiple values at one time or Invert the
filter to show all values that do not match the selected
values
– Can make minor edits to all selected values with Edit
More on Facets

As well as ‘Text facets’ OpenRefine also supports


a range of other types of facet. These include:
• Numeric facets
• Timeline facets (for dates)
• Scatterplot facets
• Custom facets
– Word facet
– Duplicates facet
– Text length facet
– Facet by blank
Filters

• As with Facets, Filter options appear in the left


hand panel in OpenRefine
• Any operations carried out while data is filtered
will only apply to the filtered data currently on
display
• Faceting and filtering look very similar:
– Faceting gives an overview description of all of the
data that is currently selected
– Filtering allows the selection of a subset of data for
analysis
Using ‘Text Filter’

• ‘Text Filter’ looks for a particular piece of text


appearing in a column
– Click the drop down menu at the top of the column to
filter then choose ‘text filter’
– Enter the text to filter by, and the ‘text filter’ will only
display rows that contain the specified text
– Regular expressions can also be used to filter data
Working with Arrays

• An ‘Array’ is a list of values, represented in Refine


by the use of square brackets containing a list of
values surrounded by quotation marks and
separated by commas.
– [“Monday”,”Tuesday”,”Wednesday”,”Thursday”,”Friday”,
”Saturday”,”Sunday”]
• Arrays can be sorted, de-duplicated, and
manipulated in other ways in GREL expressions,
but cannot appear directly in an OpenRefine cell.
• Arrays in OpenRefine are usually the result of a
transformation.
Arrays Example

• A cell has the value:


“Monday,Tuesday,Wednesday,Thursday,Friday,Sa
turday,Sunday”
• Transform into an array using the split function
– value.split(“,”)
• Creates an array:
[“Monday”,”Tuesday”,”Wednesday”,”Thursday”,”Fr
iday”,”Saturday”,”Sunday”]
Undo and Redo

• The Undo/Redo panel lists all the steps taken so


far
• To undo steps, click on the last step to preserve in
the list; this will automatically undo all the changes
made since that step
• The remaining steps will continue to show in the
list but greyed out, and can be reapplied by
clicking on the last step to apply
• If ‘undo’ is used on a set of steps and then new
transformations are performed, the greyed out
steps will disappear and the option to ‘redo’ these
steps is no longer available
Saving Steps

• To save a set of steps to be re-applied later or to


use in a different project, click the ‘Extract’
button
• ‘Extract’ gives the option to select the steps to
save, and copy the transformations included in
those steps in JSON format
• ‘Apply’ will apply the saved or copied steps
Exporting data from OpenRefine

• Export options are available through the Export


button at the top right of the OpenRefine
interface
• Export formats include HTML, Excel, CSV, TSV
• Custom export options are available, including
exporting specific fields, adding a header or
footer, and specifying the exact format
Demo OpenRefine
Course Development Resources

• Library Carpentry: "OpenRefine Lessons for


Librarians.“ June 2016, http://data-
lessons.github.io/library-openrefine/.
• Verborgh, R. Using OpenRefine. Packt
Publishing; 2013.
Additional Resources

• Openrefine.org
• Openrefine.github.com
• librarycarpentry.github.io/lc-open-refine/
• librarycarpentry.github.io/lc-open-refine/13-
looking-up-data/index.html
• github.com/OpenRefine/OpenRefine/wiki/Gener
al-Refine-Expression-Language
• github.com/OpenRefine/OpenRefine/wiki/Cluster
ing-In-Depth
Questions?

Email: Candace.Norton@nih.gov
Visit: www.nihlibrary.nih.gov/bibliometrics

You might also like