Almost Scraping: Web Scraping for Non-Programmers, by Michelle Minkoff (PBSNews.org) and Matt Wynn (Omaha World-Herald)
What is Web scraping? The *all-knowing* Wikipedia says: “Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites. … Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration.”
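For readers curious what "HTML into structured data" looks like under the hood, here is a minimal sketch using only Python's standard library. The sample HTML, the candidate names and the `links_to_csv` helper are made up for illustration; a real page would need its own tag choices.

```python
from html.parser import HTMLParser
import csv
import io

# Hypothetical page fragment standing in for a real Web page.
SAMPLE_HTML = """
<ul>
  <li><a href="/candidates/smith">Jane Smith</a></li>
  <li><a href="/candidates/jones">Bob Jones</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the link currently being read
        self._text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_to_csv(html):
    """Return the links found in `html` as spreadsheet-ready CSV text."""
    parser = LinkCollector()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "url"])
    writer.writerows(parser.rows)
    return out.getvalue()


print(links_to_csv(SAMPLE_HTML))
```

The tools in this talk do roughly this kind of work for you through a point-and-click interface.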
Why do I want to Web scrape? Journalists like to find stories. Editors like stories that are exclusive. Downloading a dataset is like going to a press conference: anyone can grab and use it. Web scraping is like an enterprise story: less likely to be picked up by everyone. It puts more control back into your hands.
What kind of data can I get? Laws (summary of same-sex marriage laws for each state, PDFs). Photos (pictures of all players on a team you’re highlighting, or of all mayoral candidates). Recipe ingredients (NYT story about peanut butter). Health care (see ProPublica’s Dollars for Docs project). Links, images, dates, names, categories, tags: anything with some sort of repeatable structure.
DownThemAll http://www.downthemall.net
Yahoo Pipes http://pipes.yahoo.com/pipes
Yahoo Pipes: Access and manipulate RSS feeds, which are often a flurry of information. Sort, filter and combine your information. Format that info to fit your needs (for example, with the date formatter).
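The sort, filter and format steps can also be sketched in a few lines of Python. This is not how Yahoo Pipes itself works internally; it is an illustration using the standard library, and the feed contents and the `filter_feed` helper are invented for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed; a real one would be downloaded from a site.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Council approves budget</title>
        <pubDate>Tue, 02 Mar 2010 09:00:00 GMT</pubDate></item>
  <item><title>School budget hearing set</title>
        <pubDate>Thu, 04 Mar 2010 15:30:00 GMT</pubDate></item>
  <item><title>High school sports roundup</title>
        <pubDate>Wed, 03 Mar 2010 12:00:00 GMT</pubDate></item>
</channel></rss>"""


def filter_feed(rss_text, keyword):
    """Keep items whose title contains `keyword`, newest first,
    with dates reformatted as YYYY-MM-DD."""
    root = ET.fromstring(rss_text)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if keyword.lower() in title.lower():   # filter
            matches.append((published, title))
    matches.sort(reverse=True)                 # sort, newest first
    return [(when.strftime("%Y-%m-%d"), title)  # format the date
            for when, title in matches]


for date, title in filter_feed(SAMPLE_RSS, "budget"):
    print(date, title)
```

Combining feeds would amount to running the same loop over several feeds and merging the results before sorting.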
Yahoo Pipes: Pair with Versionista, which can create an RSS feed of changes to a Web site, to keep tabs on what’s changing. ProPublica’s team did this to great effect in late 2009, especially Scott Klein and then-intern Brian Boyer, now at the Chicago Tribune.
ScraperWiki http://scraperwiki.com
Needlebase http://needlebase.com
Needlebase: For sites that follow a repetitive formula spanning multiple pages, like an index page and a detail page, maybe with a search results page in the middle. Like a good employee: train it once, then let it churn.
Needlebase: Query, select and filter your data in the Web app, then export in the format of your choice. You can check your data and keep your data set up to date. We will go more in depth on Needle in Saturday’s hands-on lab at 10 a.m.
InfoExtractor http://www.infoextractor.org
irobotsoft http://irobotsoft.com
iMacros https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/
iMacros: Record repetitive tasks that you do every day, and keep them as a data set. Think of it like a bookmark that can also include logging in or entering a search term. Useful for stats you check every day: scores for your local sports team, stocks if you’re a business reporter, etc. A more complex function lets you extract multiple data points from a page, such as from an HTML table.
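To show what pulling multiple data points out of an HTML table involves, here is a rough Python sketch. It is not iMacros code; the table, the team names and the `table_to_records` helper are assumptions made for the example, and it only handles a simple table with one header row.

```python
from html.parser import HTMLParser

# Hypothetical standings table; a real page would be fetched first.
SAMPLE_TABLE = """
<table>
  <tr><th>Team</th><th>Wins</th><th>Losses</th></tr>
  <tr><td>Hawks</td><td>12</td><td>3</td></tr>
  <tr><td>Owls</td><td>9</td><td>6</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read
        self._cell = None   # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Turn a simple table into one dictionary per data row,
    using the first row as the column names."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


for record in table_to_records(SAMPLE_TABLE):
    print(record)
```

Every value comes back as text, so numbers such as wins and losses would still need converting before any arithmetic.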
OutwitHub http://www.outwit.com/products/hub
OutwitHub: Versatile Firefox extension. You can use it for certain defaults (links, images).
OutwitHub: Dig through the HTML hierarchy tree. Structural elements (<h3>). Stylistic elements (<strong>). Download a list of attached files or the files themselves. More options if you buy the Pro version. We will discuss it in depth and use it in the hands-on lab on Saturday at 10 a.m.
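One way to picture the difference between structural and stylistic elements: headings such as <h3> mark where a section starts, while <strong> often marks the item you care about inside it. The Python sketch below groups bolded names under the nearest preceding heading. It is an illustration only, not OutwitHub's method, and the page content and `group_by_heading` helper are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: races as <h3> headings, candidate names in <strong>.
SAMPLE_PAGE = """
<h3>Mayor</h3>
<p><strong>Jane Smith</strong> is seeking a second term.</p>
<p><strong>Bob Jones</strong> is a first-time candidate.</p>
<h3>City Council</h3>
<p><strong>Ann Lee</strong> is running unopposed.</p>
"""


class SectionParser(HTMLParser):
    """Files each <strong> text under the most recent <h3> heading."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self._current = None   # text of the most recent <h3>
        self._capture = None   # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "strong"):
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buffer).strip()
        if tag == "h3":
            self._current = text
            self.sections[text] = []
        elif self._current is not None:
            self.sections[self._current].append(text)
        self._capture = None


def group_by_heading(html):
    parser = SectionParser()
    parser.feed(html)
    return parser.sections


print(group_by_heading(SAMPLE_PAGE))
```

Any bolded text that appears before the first heading is ignored in this sketch, since there is no section to file it under.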
Python
Wrap-Up: Non-programming scrapers can’t do everything, but they have the power to get you started. Some say “Program or be programmed,” but this is a compromise. Legal permissions still apply, so don’t use scraped info you don’t have the right to. Something to consider: How does this apply to what you do every day, and how could scraping contribute to your job? “The businesses that win will be those that understand how to build value from data from wherever it comes. Information isn’t power. The right information is.” (media consultant Neil Perkin, writing in Marketing Week)