Java Web Scraping
Sumant Kumar Raja
Agenda
- What is web scraping?
- Stages in web scraping
- Useful APIs for web scraping
- Limitations of the above APIs
- Defining I18n and L10n
- I18n and L10n checkpoints
- Before we end
- And finally {}
What is web scraping?
Web scraping describes any of various means of extracting content from a website over HTTP in order to transform that content into another format suitable for use in another context. It is also called harvesting.
Stages in web scraping
- Connect: connect to the remote site over HTTP or FTP.
- Extract and process: extract information from the website and process it into a useful data format.
- Save: save the data in the desired format.
The process stage consists of:
- Filter: filter the useful data out of the source.
- Format: convert the data to the format required by the user.
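The stages above can be sketched as a chain of small methods. This is a minimal illustration using only the JDK, not the deck's own code: the class name `ScrapePipeline`, the canned HTML returned by `connect()`, and the regular expression are all assumptions made for the example. A real scraper would download the page and would more likely use an HTML parser than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the connect -> extract/process -> save stages; all names are hypothetical. */
final class ScrapePipeline {

    // Matches one table cell; good enough for the canned sample, not for arbitrary HTML.
    private static final Pattern CELL = Pattern.compile("<td>(.*?)</td>");

    /** Connect: stands in for an HTTP or FTP download and returns canned HTML. */
    static String connect() {
        return "<table><tr><td>Widget</td><td> 19.99 </td></tr></table>";
    }

    /** Filter: keep only the text found inside table cells. */
    static List<String> filter(String html) {
        List<String> cells = new ArrayList<>();
        Matcher matcher = CELL.matcher(html);
        while (matcher.find()) {
            cells.add(matcher.group(1));
        }
        return cells;
    }

    /** Format: trim each value and join the values into one CSV line. */
    static String format(List<String> cells) {
        List<String> cleaned = new ArrayList<>();
        for (String cell : cells) {
            cleaned.add(cell.trim());
        }
        return String.join(",", cleaned);
    }

    /** Save: append the formatted line to the output (a file in a real run). */
    static void save(String csvLine, StringBuilder out) {
        out.append(csvLine).append('\n');
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        save(format(filter(connect())), out);
        System.out.print(out);
    }
}
```

Keeping each stage in its own method means the download, the parsing, and the output format can be swapped independently.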
Useful APIs for web scraping
Process              Type   API
Connection           HTTP   commons-httpclient-3.1.jar
Connection           FTP    commons-net-1.4.1.jar
Extract and process  PDF    PDFBox-0.7.3.jar, FontBox-0.1.0.jar
Extract and process  HTML   jericho-html-2.6.jar
Extract and process  CSV    javacsv.jar
Extract and process  Excel  poi-3.1-FINAL-20080629.jar, jxl.jar
Code quality         -      pmd-4.2.3.jar
Save                 CSV    javacsv.jar
Logging              NA     slf4j-api-1.5.2.jar, slf4j-log4j12-1.5.2.jar
Limitations of the above APIs
- Apache POI does not support extraction from older versions of Excel files; use JExcel in place of POI.
- PDFBox and FontBox failed to process PDFs with certain encodings.
Defining I18n and L10n
I18n stands for Internationalization: the process of converting locale-dependent data into locale-independent data.
Example: the date string 12-Mar-2008 can be saved as a date object, which is locale independent.
L10n stands for Localization: the process of converting data from one locale to another, or from a locale-independent format to a locale-dependent format.
Example: converting the currency amount $1000 in the US locale to the equivalent amount in pounds in the UK locale.
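The date example can be illustrated with the JDK's `java.time` classes. This is a sketch under assumptions rather than the deck's own code: the class name `DateLocaleSketch` is hypothetical, `LocalDate` stands in for the "date object", and the scraped text is assumed to use English month abbreviations.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

/** Hypothetical helper: scraped date text -> LocalDate -> localized text. */
final class DateLocaleSketch {

    // Pattern of the scraped text; Locale.ENGLISH is needed to read the month name "Mar".
    private static final DateTimeFormatter SOURCE =
            DateTimeFormatter.ofPattern("dd-MMM-uuuu", Locale.ENGLISH);

    /** I18n step: locale-dependent text becomes a locale-independent LocalDate. */
    static LocalDate toDate(String scraped) {
        return LocalDate.parse(scraped, SOURCE);
    }

    /** L10n step: the LocalDate is rendered in the conventions of the target locale. */
    static String toDisplay(LocalDate date, Locale target) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                .withLocale(target)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = toDate("12-Mar-2008");
        System.out.println(date);
        // The exact localized text depends on the locale data shipped with the JDK.
        System.out.println(toDisplay(date, Locale.FRANCE));
    }
}
```

Storing the parsed `LocalDate` rather than the original string keeps the saved data independent of the locale of the site it came from.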
I18n and L10n checkpoints
Take care of the following points while scraping data from various locales:
- Convert the number format. Example: in the Dutch locale, the number 100.000,95 is 100000.95.
- Convert the date format. Dates should be converted from one format to another; for example, a date in the French locale should be converted into a date object.
- Convert units of measure (length, area, weight, etc.).
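The number and unit checkpoints can be sketched with `java.text.NumberFormat` and `BigDecimal`. The class and method names below are hypothetical, and the inch-to-centimetre conversion is only one example of a unit conversion; the deck does not name specific units.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

/** Hypothetical helpers for the number-format and unit-of-measure checkpoints. */
final class LocaleCheckpoints {

    /** Parses a number written in the given locale into a locale-independent BigDecimal. */
    static BigDecimal parseNumber(String text, Locale locale) {
        // The cast works on standard JDKs, where locale number formats are DecimalFormat instances.
        DecimalFormat format = (DecimalFormat) NumberFormat.getNumberInstance(locale);
        format.setParseBigDecimal(true); // avoid binary floating-point rounding
        ParsePosition position = new ParsePosition(0);
        Number parsed = format.parse(text, position);
        // Reject failed parses, special values such as NaN, and partly consumed input.
        if (!(parsed instanceof BigDecimal) || position.getIndex() != text.length()) {
            throw new IllegalArgumentException("Not a number in " + locale + ": " + text);
        }
        return (BigDecimal) parsed;
    }

    /** Converts inches to centimetres (1 inch is defined as exactly 2.54 cm). */
    static BigDecimal inchesToCentimetres(BigDecimal inches) {
        return inches.multiply(new BigDecimal("2.54"));
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl-NL");
        System.out.println(parseNumber("100.000,95", dutch));
        System.out.println(inchesToCentimetres(new BigDecimal("10")));
    }
}
```

Checking the parse position matters because `NumberFormat` otherwise stops silently at the first character it cannot read, which would turn a malformed value into a plausible-looking number.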
Before we end
Economize on internet round trips:
- Fetching data over HTTP or FTP is costly, so make the minimum number of round trips to get the data.
- Write all the data at the same time, as writing to disk is costly.
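One way to apply both points is to cache each downloaded page and to write the collected rows through a single buffer. The sketch below is an assumption about how this could look, not code from the deck: `RoundTripSaver` is a hypothetical class, and the `fetcher` function stands in for whatever HTTP or FTP client does the real download.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper that avoids repeated downloads and scattered disk writes. */
final class RoundTripSaver {

    private final Function<String, String> fetcher; // stand-in for the real download call
    private final Map<String, String> cache = new HashMap<>();
    private int roundTrips = 0;

    RoundTripSaver(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the page body, downloading it only on the first request for a URL. */
    String fetch(String url) {
        String body = cache.get(url);
        if (body == null) {
            body = fetcher.apply(url);
            roundTrips++;
            cache.put(url, body);
        }
        return body;
    }

    /** Number of times the fetcher was actually called. */
    int roundTrips() {
        return roundTrips;
    }

    /** Writes all rows through one buffer and flushes once at the end. */
    static void writeAll(List<String> rows, Writer target) throws IOException {
        BufferedWriter out = new BufferedWriter(target);
        for (String row : rows) {
            out.write(row);
            out.newLine();
        }
        out.flush(); // the caller owns and closes the target writer
    }

    public static void main(String[] args) throws IOException {
        RoundTripSaver saver = new RoundTripSaver(url -> "body of " + url);
        saver.fetch("http://example.com/a");
        saver.fetch("http://example.com/a"); // served from the cache
        System.out.println("round trips: " + saver.roundTrips());

        StringWriter sink = new StringWriter(); // a FileWriter in a real run
        writeAll(List.of("Widget,19.99", "Gadget,5.00"), sink);
        System.out.print(sink);
    }
}
```

The in-memory map is the simplest possible cache; a long-running scraper would also need to decide when cached pages become stale.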
And finally {}
Some ready-to-use web scraping software: http://en.wikipedia.org/wiki/Web-scraping_software_comparison
Thank You