JSOUP
Overview
What is Jsoup
Parsing with URL
Parsing with File
Modify Data
Prevent cross-site scripting
JSOUP
jsoup is a Java library for working with real-world HTML. It provides a convenient API for
extracting and manipulating data. With it you can:
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
● clean user-submitted content against a safe white-list, to prevent XSS attacks
● output tidy HTML
Parse a document from a URL
The connect(String url) method creates a new Connection, and get() fetches and parses an HTML document. If
an error occurs while fetching the URL, it throws an IOException, which you should handle
appropriately.
Document document = Jsoup.connect("https://grails.org/").get();
String title = document.title();
Parse a document from a URL (continued)
The Connection interface is designed for method chaining to build specific requests:
Document doc = Jsoup.connect("http://example.com")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
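As a rough Java sketch of the two slides above: the request is built by chaining, and the IOException from get() is caught rather than propagated. The class name FetchExample, the helper names, and the choice to return null on failure are assumptions made for illustration; the user agent and timeout values are the ones shown on the slide.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchExample {
    // Builds the request without sending it. Values mirror the slide;
    // the timeout is in milliseconds.
    public static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .timeout(3000);
    }

    // Fetches the page and returns its title, or null if the fetch fails.
    // Returning null is an illustrative choice, not a jsoup convention.
    public static String fetchTitle(String url) {
        try {
            Document doc = buildRequest(url).get();
            return doc.title();
        } catch (IOException e) {
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("https://grails.org/"));
    }
}
```

Keeping the request builder separate from the call that executes it makes it possible to inspect or adjust the request before any network traffic happens.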
Parse a document from a string
You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make
sure it's well formed, or to modify it. The String may have come from user input, a file, or from the
web.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
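Once the string is parsed, the resulting Document can be queried like any other. The sketch below, with a hypothetical class name ParseStringExample, reads the title and the body text of the same HTML string used above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseStringExample {
    // Parses the HTML string and returns the <title> text.
    public static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Parses the HTML string and returns the visible text of <body>.
    public static String bodyTextOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
        System.out.println(bodyTextOf(html));
    }
}
```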
Load a document from a file
File file = new File("/home/shipra/Downloads/Jsoup.html");
Document document = Jsoup.parse(file, "UTF-8");
// getElementById returns a single Element (or null); the other two return Elements collections
Element content = document.getElementById("content");
Elements paragraphs = document.getElementsByTag("p");
Elements greenElements = document.getElementsByClass("green");
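To make the return types of these lookup methods concrete, here is a small self-contained sketch. It parses an inline string instead of a file so that it runs anywhere; the markup, the class name DomLookupExample, and the helper names are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class DomLookupExample {
    // Hypothetical markup standing in for the contents of the file on the slide.
    public static final String HTML =
            "<div id=\"content\"><p class=\"green\">One</p><p>Two</p></div>";

    // getElementById returns one Element, or null when the id is absent.
    public static String textById(String html, String id) {
        Element el = Jsoup.parse(html).getElementById(id);
        return el == null ? null : el.text();
    }

    // getElementsByTag returns an Elements collection (possibly empty).
    public static int countByTag(String html, String tag) {
        return Jsoup.parse(html).getElementsByTag(tag).size();
    }

    // getElementsByClass also returns Elements; text() combines the text of all matches.
    public static String textByClass(String html, String className) {
        return Jsoup.parse(html).getElementsByClass(className).text();
    }

    public static void main(String[] args) {
        System.out.println(textById(HTML, "content"));
        System.out.println(countByTag(HTML, "p"));
        System.out.println(textByClass(HTML, "green"));
    }
}
```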
Use DOM methods to navigate a document
You have an HTML document that you want to extract data from. The loop below uses Groovy closure syntax (elements.each { ... }).
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
Elements elements = document.select(".nav-sections li")
elements.each { element ->
String text = element.select("a").text()
String attr = element.select("a").attr("href")
}
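For readers working in plain Java rather than Groovy, the same traversal might look like the sketch below. The markup, the class name NavLinksExample, and the decision to collect the results into a map are assumptions for illustration; only the .nav-sections li selector comes from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class NavLinksExample {
    // Hypothetical markup standing in for the file used on the slide.
    public static final String HTML =
            "<ul class=\"nav-sections\">"
          + "<li><a href=\"/docs\">Docs</a></li>"
          + "<li><a href=\"/blog\">Blog</a></li>"
          + "<li>No link here</li>"
          + "</ul>";

    // Maps each link's text to its href, in document order.
    public static Map<String, String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>();
        for (Element li : doc.select(".nav-sections li")) {
            Element a = li.select("a").first(); // null when the item has no link
            if (a != null) {
                links.put(a.text(), a.attr("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        extractLinks(HTML).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```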
Modify Data
Use the attribute setter methods Element.attr(String key, String value), and Elements.attr(String key,
String value).
If you need to modify the class attribute of an element, use the Element.addClass(String className)
and Element.removeClass(String className) methods.
The Elements collection has bulk attribute and class methods. For example, to add a rel="nofollow"
attribute to every a element inside a div with the class comments:
doc.select("div.comments a").attr("rel", "nofollow");
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
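The two statements above can be tried end to end with a sketch like the following. The input markup and the class name ModifyExample are made up for the example; the selectors and setter calls are the ones from the slide.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ModifyExample {
    // Hypothetical page with a masthead and a comments block containing two links.
    public static final String HTML =
            "<div class=\"masthead\">Title</div>"
          + "<div class=\"comments\">"
          + "<a href=\"http://example.com/\">One</a> "
          + "<a href=\"http://example.org/\">Two</a>"
          + "</div>";

    // Applies the bulk attribute and class changes and returns the modified document.
    public static Document markUp(String html) {
        Document doc = Jsoup.parse(html);
        // Every matching link gets rel="nofollow" in one call.
        doc.select("div.comments a").attr("rel", "nofollow");
        // Setter methods return the collection, so calls can be chained.
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(markUp(HTML).body().html());
    }
}
```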
Setting the HTML content of an element
Element div = document.select("div").first();
div.html("<p>paragraph</p>");
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
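A runnable sketch of this sequence is shown below. Note that first() returns null when nothing matches, so the sketch checks for that case; the class name HtmlContentExample and the exception thrown are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HtmlContentExample {
    // Replaces the first div's content, then adds a paragraph before and after it.
    public static Element rebuild(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div").first();
        if (div == null) {
            throw new IllegalArgumentException("input contains no div element");
        }
        div.html("<p>paragraph</p>");   // replaces any existing children
        div.prepend("<p>First</p>");    // inserted before the existing children
        div.append("<p>Last</p>");      // inserted after the existing children
        return div;
    }

    public static void main(String[] args) {
        System.out.println(rebuild("<div>old content</div>").outerHtml());
    }
}
```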
Sanitize untrusted HTML (to prevent XSS)
A Whitelist defines which elements and attributes are allowed through the cleaner; everything else is discarded.
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
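The sketch below wraps the cleaning call in a helper. One assumption to flag loudly: it uses org.jsoup.safety.Safelist, the name under which this class is available from jsoup 1.14.2 onward; on the older jsoup versions this deck was written for, substitute Whitelist as shown on the slide. The class name SanitizeExample is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SanitizeExample {
    // Assumes jsoup 1.14.2 or later (Safelist). On older versions,
    // use org.jsoup.safety.Whitelist with the same basic() factory.
    public static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        // The onclick handler is not on the basic list, so it is dropped.
        System.out.println(sanitize(unsafe));
    }
}
```

The basic list permits simple text formatting and links; stricter or looser lists can be chosen depending on how much markup user input is allowed to carry.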
Tidy HTML
The parser will make every attempt to create a clean parse from the HTML you provide, regardless of
whether the HTML is well-formed or not. It handles:
● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
● reliably creating the document structure (html containing a head and body, and only
appropriate elements within the head)
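The unclosed-tag case and the document-structure guarantee can be checked with a short sketch like this one. The class name TidyExample and its helper names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TidyExample {
    // Returns the text of each <p> the parser produced from the input.
    public static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("p")) {
            texts.add(p.text());
        }
        return texts;
    }

    // True when the parsed document has both a head and a body element.
    public static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        String messy = "<p>Lorem <p>Ipsum";
        System.out.println(paragraphTexts(messy));   // two separate paragraphs
        System.out.println(hasHeadAndBody(messy));
        System.out.println(Jsoup.parse(messy).html());
    }
}
```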
Demo Reference
https://github.com/NexThoughts/JSOUP.git