Open Refine Tutorial

What is Open Refine?
Open Refine is a free and open source tool that cleans, transforms, and reconciles data. In this digital age, more data is available online for collaboration and analysis. Unfortunately, it is often not in a form that easily facilitates computation. Rather than cleaning data manually in Excel or Open Office, Open Refine allows users to perform operations across any size data set with a few commands.  It can perform the simple deletion of replications to the more complicated action of reconciling data with online databases.

How does it work?
Open Refine is downloaded from the internet but works as a desktop application. It provides a browser interface in which to work but does not require data to be uploaded to a server. Once installed, the user can import local data sets in various formats, online data sets, or data sets provided by Google. Open Refine records every edit and transformation made to each cell. The user can export the final product in various formats, as well as download the revision history.

Data 
This tutorial will use a messy data set on eighteenth-century midwifery at George Washington’s Mount Vernon. It is not complete, but the data will demonstrate the importance of clean data in building one’s own data sets and in cleaning data sets made by others.

Installation and Data Importation
Download Open Refine. After installing, the tool will open in a browser. You can create, open, or import a project. Using the tool for the first time, you will create a new project. Click Create Project, then select the mid_practice.csv file from your local computer and import it.

Importing data into Open Refine
Importing data into Open Refine

The data should appear in the data preview mode. Below the data, you can select data viewing and format options. For this tutorial, every option should be correct. It is a data set with commas separating the values. You can name the project whatever you like then click Create Project.

Data Preview
Data Preview

Data Cleaning
First, you should know that Open Refine uses Google Refine Expression Language, or GREL. It offers a command line interface in addition to simple clicks of the mouse for data cleaning. This tutorial will examine the basics of both methods. It is helpful to explore the data and examine its variables: values, rows, cells, and records. After exploring the data, you can begin to clean it.

Extra Spaces- Open Refine allows you to easily find and clean cell values with extra spaces. You simply click the column name, then Edit cells, Transform cells, and Common Transforms. You can choose from a variety of commands, but for this example we will trim leading and tailing whitespaces for each column.

Editing one cell at a time- You can change text within one cell. For example, if you want to change some of the cells in the “Payment” column, simply click Edit in the individual cell and type in the new value. You can also apply this change to all matching cells. What happens if you have a column with a lot of different cell values and variations?

Faceting- By faceting, you can filter and count the number of differing cells. This function is similar to the filter function in R. These facets appear on the left side of the data with the count of each facet. You can change this value and the count will change appropriately. For example, if we apply a text facet to the “Midwife” column, Open Refine will put every value variation into its own group and provide the count of each. Click on the column name, then Facet, then Text Facet. If we know that “Mrs. D Parker” is the same as “Darcus Parker”, then we can easily standardize this value. Faceting also works with numerical values.

Faceting in Open Refine
Faceting in Open Refine

Clustering- The facet feature also provides clusters of the different facets. If two or more values are grouped together, you can merge them with the new value in one click. Click Cluster in the Facet box and a new window will appear. You can choose which method to cluster by, but the default is key collision and fingerprinting. Ngram also works well. You choose which groups to merge, define the new cell value, and click Merge Selected and Recluster.

Clustering
Clustering

Data Transformation
Another way to clean the data is through transformations. This key feature in Open Refine uses the command line interface. You can click in any column on the Edit cell button then Transform cell to open the command window.

Value.replace- This function transforms any cell value into another value. For example, some name variations did not appear as clusters when faceted. If we want to replace “Mrs. Brasenton” with the standardized “Mrs. Brasington” in the Midwife column, we can use:
value.replace(“Mrs. Brasington”, “Mrs. Brasenton”) 
The value you want to replace comes first, followed by the string you want to replace it with.

Replacing values using GREL
Replacing values using GREL

Value.toDate- Let’s say we want to reformat the “Date paid” column from “mm/dd/yyyy” to “yyyy-mm-dd”. The command would look like this:
value.toDate(“mm/dd/yyyy”).toString(“yyyy-mm-dd”) 
This function takes the date format entered and transforms it into the specified date string.

Date Function
Date Function

Value.split(” “).reverse().join(“, “)- This function incorporates three commands: split, reverse, and join. It splits strings, then puts them back together in the reverse. If we want to put the Midwife names in as “Last Name, First Name”, then we would use this function.
value.split(” “) Splits the string by its separator
value.reverse() Puts the last name first
join(“, “) Joins the reversed strings back into one column with a comma and a space

Split, Reverse, and Join Cells
Split, Reverse, and Join Cells

Split multi-valued cells- For each record, you can split column cells with multiple values without creating new columns. Open Refine adds rows underneath the original record. You can define what to split by, such as “|” or “,”. If we want to split Betty’s twins in 1778 into two rows, then we can choose Edit Cell then Split multi-valued cells for that column. The program prompts us to define what operator separates the values, “,” in this case. Open Refine then adds the different value as a new row in the Betty record number. Switching from Row view to Record view at the top of the data set shows the multiple rows for that one entry.

Splitting a column cell value into multiple rows.
Splitting a column cell value into multiple rows.
The new values appear as rows under the one record.
The new values appear as rows under the one record.

Data Reconciliation
This feature allows you to match text to database keys and add new columns, thereby adding to your data. OpenRefine is currently trying to develop a reconciliation service for Historical Newspapers. Data Reconciliation not only enhances your own data, but promotes digital collaboration and sharing of data across scholars and institutions.
Geocoding- Open Refine can geocode street addresses in order to produce maps from data. You can Add Column by Fetching URLs, which allows you to extract out the latitude and longitude coordinates. Open Refine uses the Google Geocoding API. This is also similar to the geocode function in R.
Application to History
OpenRefine is important to historians both using and building datasets on the web, because it easily explores, cleans, transforms, and reconciles data on any scale. Its design is conducive to collaboration on large datasets, just like DAT. Open Refine also speaks to those historians who want to make their research reproducible. The tool has an undo/redo section in which the user can track every change he or she has made to the data. You can then extract the dataset out at each step. The original data is not altered, and you can always go back to it. This feature makes your data cleaning reproducible on many levels. If you have a similar data set and need to perform the same commands to it as a previous set, then you simply apply those changes to the new data set. Open Refine also lets users export an entire project as a .tar.gz file. This means that another scholar can then import it into their own local Open Refine. The exported file contains both the cleaned data and the revision history. The person receiving the project can easily see the previous changes, undo them and make their own, and extract out a whole new data set.

Resources
Open Refine Site
Open Refine Github Documentation
Tutorial on The Programming Historian

Final Project So Far

For the final project, I have decided to explore hysteria in males. I recently found a lot of articles in nineteenth-century medical journals discussing the diagnosis of hysteria in male patients. I thought it would be fun to research this topic with these primary sources and compare the symptoms and cures to hysteria in females during the same time period. This is the home page for the website so far:

Image

I chose a background image of hypnotizing swirls from Subtle Patterns to highlight the mental illness. I faded the gray color of the circles in Photoshop and saved the image for the web as a jpeg. The red menu mimics the one in my image assignment to provide some cohesion to the entire digital portfolio. I like the main title, but I am not satisfied with the sub-title. The images are from an actual 1886 medical journal article discussing hysteria in males. I think these illustrations are wonderful and really look good on the web! I can’t believe I stumbled upon them. Do you like that they continue off the wrapper? It happened by accident in Dreamweaver, but I like the effect.

For the content, I plan to topic model the articles on men and women, comparing the symptoms, causes, and cures of hysteria. It could be fun and interactive to then present a couple of case studies and poll the user on whether the patients are male or female. I want to add an interactive element besides hyperlinks.

I would appreciate any constructive comments on my design and content presentation!

Comment on Anne and Amanda’s blog

Improving

This week I worked on improving my Type Assignment. This was the old version:Image

To improve, I moved the image to the body content and resized it. I played up the colors in the image more with the background and horizontal menu bar. I also embedded new fonts from Google. I wanted to add more images, but I cannot embed images of women suffering from the “baby blues” from the Vogue or Redbook magazines, because they do not exist in the public domain. So, I perused Google for images of Andrea Yates, but the ones I found did not contribute to the layout. (I’m not sure they’re in the public domain either)

Here is the updated version:

 

Image

Does anyone have any further suggestions to improve this page, or creative image ideas?

I’m not sure if my domain is updated, so I put this screenshot up. Here is the link, anyway: www.saracollini.org/typepro.html

This week I commented on Kasey and Paul’s blog