Web Data Extractors
By
Extracting data from the World Wide Web (WWW) has become an important issue in the
last few years, as the number of web pages available on the visible Internet has grown to
billions, with trillions more available from the invisible web. Tools and protocols to
extract all this information are now in demand, as researchers and everyday web users
want to discover new knowledge at an ever-increasing rate. Because robots (bots) and
intelligent agents are at the heart of many extraction tools, I decided to create a
compilation of the latest sources and sites that extract information from the web.
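As a rough illustration of what such extraction tools do at their core, the sketch below pulls the link text and target out of every anchor in a page, using only Python's standard library. It is a minimal example under stated assumptions, not how any tool listed here is implemented: the class name LinkExtractor, the helper extract_links, and the SAMPLE_PAGE markup with its example.com addresses are all hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # Start collecting text only for anchors that carry an href.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (text, href) pairs found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page used only for this illustration.
SAMPLE_PAGE = """
<html><body>
  <p>Tools: <a href="https://example.com/tool-a">Tool A</a> and
  <a href="https://example.com/tool-b">Tool <b>B</b></a>.</p>
  <a name="top">no link here</a>
</body></html>
"""

for text, href in extract_links(SAMPLE_PAGE):
    print(f"{text}: {href}")
```

A real extractor would also fetch pages over the network, respect robots.txt, and cope with malformed markup, which is where dedicated tools such as those listed below come in.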
Agenty – Robotic Process Automation (RPA) Software on Cloud for Data Scraping
https://www.agenty.com/
Anthracite
http://freecode.com/projects/anthracite
Beautiful Soup
http://freecode.com/projects/beautifulsoup
Beautiful Soup – HTML/XML Parser for Quick Turnaround Screen Scraping and Web
Data Extraction
http://www.crummy.com/software/BeautifulSoup/
Browse.ai – Easiest Way to Extract and Monitor Data from Any Website
https://www.browse.ai/
Cogitum Co-Citer
http://www.cogitum.com/co-tracker-text/more.shtml
Common Crawl
http://www.commoncrawl.org/
CrawlMonster
http://www.crawlmonster.com/
Crawly
http://crawly.diffbot.com/
Data Miner – Powerful Web Scraping Tool for Professional Data Miners
https://data-miner.io/
DiscoverText – Import, Sort, Distribute and Analyze Electronic Content from eMail,
Document Repositories, and Social Media
http://discovertext.com/
Import.io – Turn the Web Into Data With Extractors, Crawlers and Connectors
https://import.io/
Open Datasets
http://www.DataPortals.org/
https://github.com/caesar0301/awesome-public-datasets
https://www.kaggle.com/datasets
https://www.data.gov/
https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
https://aws.amazon.com/public-datasets/
https://data.world/
http://data.worldbank.org/
http://www.OpenDataSets.info/
Outscraper – Solutions for Accessing Public Information from the Internet for Lead
Generation, Marketing, and Data Science
https://Outscraper.com/
OutWit Hub – Harvest the Web With Your Own Web Collection Engine
http://www.outwit.com/
Quick Code
https://quickcode.io/
REBOL Technologies
http://www.rebol.com/
ScrapeForge
http://freecode.com/projects/scrapeforge
Scraper
http://freecode.com/projects/scraper
Sensible Code
http://sensiblecode.io/
SPSS Modeler
http://developer.ibm.com/predictiveanalytics
TadaWeb – Clone and Amplify Human Intelligence for Web Data Collection and
Analysis
https://www.tadaweb.com/
TextConverter 4
https://www.simx.com/simx/TC-Overview.stp?
TextSniper – Extract Text from Images and Other Digital Documents in Seconds
https://textsniper.app/
Vaazo – Web Bot That Can Scrape Data and Automate Tasks and More
https://vaazo.com/
VietSpider
http://binhgiang.sourceforge.net/
Web Scraper
http://www.webscraper.io/
Website Downloader
https://websitedownloader.io/
Accessibility Resources
http://www.AccessibilityResources.info/
Agriculture Resources
http://www.AgricultureResources.info/
AnswerSpot
http://www.AnswerSpot.co/
Astronomy Resources
http://www.AstronomyResources.info/
Auction Resources
http://www.AuctionResources.info/
Biological Informatics
http://www.BiologicalInformatics.info/
Biotechnology Resources
http://www.BiotechnologyResources.info/
Bot Research
http://www.BotResearch.info/
Directory Resources
http://www.DirectoryResources.info/
eCommerce Resources
http://eCommerceResources.info/
Elder Resources
http://www.ElderResources.info/
Employment Resources
http://www.EmploymentResources.info/
Entrepreneurial Resources
http://www.EntrepreneurialResources.info/
Financial Sources
http://www.FinancialSources.info/
Finding People
http://www.FindingPeople.info/
Games Resources
http://www.GamesResources.info/
Genealogy Resources
http://www.GenealogyResources.info/
Green Files
http://www.GreenFiles.info/
Healthcare Resources
http://www.HealthcareResources.info/
Internet Alerts
http://www.InternetAlerts.info/
Internet Demographics
http://www.InternetDemographics.info/
Internet Experts
http://www.InternetExperts.info/
Internet Hoaxes
http://www.InternetHoaxes.info/
Intrapreneurial Resources
http://www.IntrapreneurialResources.info/
Journalism Resources
http://www.JournalismResources.info/
Knowledge Discovery
http://www.KnowledgeDiscovery.info/
Privacy Resources
http://www.PrivacyResources.info/
Reference Resources
http://www.ReferenceResources.info/
Research Resources
http://www.ResearchResources.info/
RestStress™
http://www.RestStress.com/
Script Resources
http://www.ScriptResources.info/
ShoppingBots
http://www.ShoppingBots.info/
Social Informatics
http://www.SocialInformatics.info/
Student Research
http://www.StudentResearch.info/
Theology Resources
http://www.TheologyResources.info/
Tutorial Resources
http://www.TutorialResources.info/
LinkSeries Publications
http://www.LinkSeries.com/
Links By Marcus™
http://www.LinksByMarcus.com/
Workshops By Marcus™
http://www.WorkshopsByMarcus.com/
Deep Web Research and Discovery Resources 2024 Online White Paper
http://DeepWeb.us/
Using the Internet As a Dynamic Resource Tool for Knowledge Discovery 2024
http://www.zillman.us/white-papers/using-the-internet-as-a-dynamic-resource-tool-for-knowledge-discovery/