Advances in Data Science
Fall 2016
TUTORIAL
INTRODUCTION
FEATURES
INSTALLATION
DEMO
COMPARISON
WHAT IS …
??
• Formerly known as Google Refine
OpenRefine is a power tool for working with messy
data, primarily for
• detecting and fixing inconsistencies
• transforming data from one structure or format to
another
• extending it with web services and external data
• connecting names within your data to name
registries (databases)
Use OpenRefine when you need something ...
• more powerful than a spreadsheet
• more interactive and visual than scripting
• more provisional / exploratory / experimental /
playful than a database
• Import data in various formats (e.g., TSV, CSV, Excel (.xls, .xlsx), XML, RDF as XML, JSON)
• Explore datasets in a matter of seconds
• Apply basic and advanced cell transformations
• Deal with cells that contain multiple values
• Create instantaneous links between datasets
• Filter and partition your data easily with regular expressions
• Use named-entity extraction on full-text fields to automatically identify topics
• Perform advanced data operations with the General Refine Expression Language
IMPORTANT FEATURES:
The LendingClub data contains complete loan
data for all loans issued through the time
period stated, including the current loan
status (Current, Late, Fully Paid, etc.) and
latest payment information
LENDING CLUB
LOAN STATS DATA
Our aim is to perform exploratory analysis on the given financial data
• Getting the data
• Looking at the data
• Cleansing
• Transforming
• Creating visualizations
STEPS
1 – Getting started with OpenRefine
2 – Analyzing and Fixing Data
3 – Advanced Data Operations
4 – Linking Datasets
5 – Regular Expressions and GREL
TUTORIAL
• Requirements
• Java Runtime Environment (JRE) installed
• Download
• OpenRefine is a desktop application. Here’s the link: Google OpenRefine
• Unlike most other desktop applications, it runs as a small web server on
your own computer
• You point your web browser at that web server in order to use Refine. So,
think of Refine as a personal and private web application
HOW TO INSTALL
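Because OpenRefine runs as a small web server on your own computer, a script can check whether it is up before you point the browser at it. The following is a minimal Python sketch, assuming the default local address http://127.0.0.1:3333/. The helper names `refine_url` and `is_running` are illustrative and not part of OpenRefine; adjust the host and port if your installation is configured differently.

```python
import http.client
import urllib.error
import urllib.request


def refine_url(host="127.0.0.1", port=3333):
    """Build the address of the local OpenRefine web server (assumed defaults)."""
    return f"http://{host}:{port}/"


def is_running(url, timeout=2.0):
    """Return True if something answers HTTP requests successfully at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as not running.
        return False


if __name__ == "__main__":
    url = refine_url()
    print(url, "is up" if is_running(url) else "is not reachable")
```

If the check fails, start OpenRefine first and then open the same address in your browser.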
• Install:
• Once you have downloaded the .zip file, uncompress it into a folder wherever you want
(such as in C:\Google-Refine).
• Run:
• Run the .exe file in that folder. You should see the Command window in which OpenRefine
runs. By default, the Command window has a black background and monospace text.
• Shut down:
• When you need to shut down OpenRefine, switch to that Command window and press
Ctrl-C. Wait until there's a message that says the shutdown is complete. That window might
close automatically, or you can close it yourself. If you get asked, "Terminate all batch
processes? Y/N", just press Y.
INSTALLATION: WINDOWS
• Install:
• Once you have downloaded the .dmg file, open it, and drag the OpenRefine icon into
the Applications folder icon (just like you would normally install Mac applications).
• Run:
• To launch OpenRefine, go to the Applications folder and double click the OpenRefine
app. You'll see the OpenRefine app appear in your dock.
• Shut down:
• You can switch to the OpenRefine app (clicking on its icon in the dock) and invoke its
Quit command.
• If you use Yosemite, you will need to install Java for OS X 2014-001 first.
INSTALLATION: MAC
• Install / Run: Once you have downloaded the tar.gz file, open a shell and
type
• tar xzf google-refine.tar.gz
• cd google-refine
• ./refine
• This will start OpenRefine and open your browser to its starting page.
• Shut down: Press Ctrl-C in the shell.
INSTALLATION: LINUX
RUN OPENREFINE
• To increase memory on Windows: refine.bat /m 4096m
IMPORT DATA
EXPLORING DATA
MANIPULATING COLUMNS
USING THE PROJECT HISTORY
EXPORTING A PROJECT
ANALYZING AND FIXING DATA
WORKING ON THE DATA
• sorting data
• faceting data
• detecting duplicates
• applying a text filter
• using simple cell transformations
• removing matching rows
• splitting data across columns
• adding derived columns
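For readers who want to see what these operations compute, here is a rough Python equivalent of sorting, a text facet, duplicate detection, and a text filter. It is only a sketch: the sample rows and the column names `id`, `loan_status`, and `purpose` are made up for illustration and are not the real LendingClub file layout.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the column names are assumptions for illustration.
SAMPLE_CSV = """id,loan_status,purpose
1001,Current,credit_card
1002,Fully Paid,car
1003,Late,credit_card
1002,Fully Paid,car
1004,Current,small_business
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Text facet: count how often each distinct value occurs in a column.
status_facet = Counter(row["loan_status"] for row in rows)

# Duplicate detection: ids that appear more than once.
id_counts = Counter(row["id"] for row in rows)
duplicate_ids = [value for value, count in id_counts.items() if count > 1]

# Text filter: keep rows whose purpose contains a substring.
credit_rows = [row for row in rows if "credit" in row["purpose"]]

# Sorting: order rows by a column (ids compared as integers).
sorted_rows = sorted(rows, key=lambda row: int(row["id"]))

print(dict(status_facet))
print(duplicate_ids)
print(len(credit_rows))
```

In OpenRefine the same results come from the column menus (Facet, Text filter, Sort) without writing code; the sketch only makes the underlying logic explicit.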
SPECIAL FEATURE
• Regular Expressions and GREL
• Expressions can also be written in Python (Jython) or Clojure
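As an illustration of a regular-expression transformation written in Python, the sketch below turns text cells into numbers. The input formats (a term such as " 36 months" and a rate such as "10.65%") are assumptions about how loan data is often stored, and the function names are hypothetical.

```python
import re


def parse_term_months(value):
    """Extract the number of months from text such as ' 36 months'."""
    match = re.search(r"(\d+)\s*months?", value)
    return int(match.group(1)) if match else None


def parse_percent(value):
    """Turn text such as '10.65%' into the float 10.65."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*%\s*", value)
    return float(match.group(1)) if match else None


print(parse_term_months(" 36 months"))  # 36
print(parse_percent("10.65%"))          # 10.65
print(parse_percent("n/a"))             # None
```

Inside OpenRefine's Python (Jython) expression editor, the cell content is available as `value` and the expression ends with a `return` statement, so the body of either function can be adapted with small changes.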
ADDING A RECONCILIATION SERVICE AND
RECONCILING WITH LINKED DATA
ADVANCED DATA OPERATIONS
• handling multi-valued cells
• alternating between rows and records mode
• clustering similar cells
• transforming cell values
• adding derived columns
• transposing rows and columns
• installing extensions
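Clustering similar cells can be understood as grouping values that share a normalized key. The Python sketch below follows a key-collision approach similar in spirit to a fingerprint method; the exact normalization steps are an assumption made for illustration and not OpenRefine's own implementation, and the sample names are invented.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that near-identical spellings share one key."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then sort the unique whitespace-separated tokens.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint and keep groups with differing spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(set(members)) > 1]


names = ["Acme, Inc.", "ACME Inc", "Inc. Acme", "Globex", "globex ", "Initech"]
for members in cluster(names):
    print(members)
```

Each printed group is a candidate for merging into one canonical spelling, which is the decision OpenRefine asks you to confirm in its clustering dialog.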
• Documentation:
• https://github.com/OpenRefine/OpenRefine/wiki
• YouTube tutorial:
• https://www.youtube.com/playlist?list=PL737054C67FCC0741
REFERENCES:

OpenRefine Class Tutorial
