Jsoup is an HTML parsing and data extraction library for Java, focused on flexibility and ease of use. It can be used to extract specific data from HTML pages, commonly known as "web scraping", to modify the content of HTML pages, and to "clean" untrusted HTML using a whitelist of allowed tags and attributes.
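As a brief illustration of the cleaning feature, here is a minimal sketch (the input string is made up; in Jsoup 1.8.x/1.9.x the sanitizer class is org.jsoup.safety.Whitelist):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String untrusted = "<p>Hello <script>alert('boo')</script><b>world</b>!</p>";
// Whitelist.basic() allows simple text formatting tags (p, b, i, a, ...) and
// strips everything else, including the script element and its contents.
String safe = Jsoup.clean(untrusted, Whitelist.basic());
System.out.println(safe); // something like: <p>Hello <b>world</b>!</p>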
Jsoup does not support JavaScript, so dynamically generated content, or content added to the page after it has loaded, cannot be extracted. If you need to extract content that is added to the page with JavaScript, there are a few alternatives:
Use a library that does support JavaScript, such as Selenium, which drives an actual web browser to load pages, or HtmlUnit.
Reverse engineer how the page loads its data. Pages that load data dynamically typically do so via AJAX, so you can look at the network tab of your browser's developer tools to see where the data comes from, and then request those URLs in your own code (a short sketch follows below). See how to scrape AJAX pages for more details.
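For example, if the developer tools show that the page pulls its data from a JSON endpoint, that endpoint can often be requested directly. A minimal sketch (the URL is hypothetical) using Jsoup's own connection, with ignoreContentType set because the response is not HTML:

import org.jsoup.Jsoup;

// Hypothetical endpoint discovered in the browser's network tab
String json = Jsoup.connect("http://example.com/api/items?page=1")
        .ignoreContentType(true) // allow non-HTML responses such as application/json
        .userAgent("Mozilla")
        .execute()
        .body();
// The returned string can then be handed to any JSON parser.
System.out.println(json);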
You can find various Jsoup-related resources at jsoup.org, including the Javadoc, usage examples in the Jsoup cookbook, and JAR downloads. See the GitHub repository for the source code, issues, and pull requests.
Jsoup is available on Maven Central as org.jsoup:jsoup. If you're using Gradle (e.g. with Android Studio), you can add it to your project by adding the following to your build.gradle dependencies section:
compile 'org.jsoup:jsoup:1.8.3'
If you're using Maven (e.g. with Eclipse), add the following to your POM's dependencies section:
<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.3</version>
</dependency>
Jsoup is also available as a downloadable JAR for other environments.
| Version | Release Date |
|---------|--------------|
| 1.9.2   | 2016-05-17   |
| 1.8.3   | 2015-08-02   |
Selecting only the attribute value of a link's href will return the relative URL.
String bodyFragment =
"<div><a href=\"/documentation\">Stack Overflow Documentation</a></div>";
Document doc = Jsoup.parseBodyFragment(bodyFragment);
String link = doc
.select("div > a")
.first()
.attr("href");
System.out.println(link);
Output
/documentation
By passing the base URI into the parseBodyFragment method and using the absUrl method instead of attr, we can extract the full URL.
Document doc = Jsoup.parseBodyFragment(bodyFragment, "http://stackoverflow.com");
String link = doc
.select("div > a")
.first()
.absUrl("href");
System.out.println(link);
Output
http://stackoverflow.com/documentation
Jsoup can also be used to manipulate or extract data from a local file that contains HTML. filePath is the path of the file on disk. ENCODING is the desired charset name, e.g. "Windows-31J"; it is optional.
// load file
File inputFile = new File(filePath);
// parse file as HTML document
Document doc = Jsoup.parse(inputFile, ENCODING);
// select all <a> elements
Elements elements = doc.select("a");
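If the encoding of the file is not known in advance, the charset argument can be set to null and Jsoup will try to detect it from the document's meta tags, falling back to UTF-8. A short sketch (the file path is hypothetical):

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

File inputFile = new File("/path/to/page.html"); // hypothetical path
// null charset: detect the encoding from the meta tags, fall back to UTF-8
Document doc = Jsoup.parse(inputFile, null);
// print the href and text of every link in the file
for (Element link : doc.select("a")) {
    System.out.println(link.attr("href") + " - " + link.text());
}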
Jsoup can be used to easily extract all links from a webpage. In this case, we can use Jsoup to extract only the specific links we want; here, the ones in an h3 header on the page. We can also get the text of the links.
Document doc = Jsoup.connect("http://stackoverflow.com").userAgent("Mozilla").get();
for (Element e : doc.select("a.question-hyperlink")) {
    System.out.println(e.attr("abs:href"));
    System.out.println(e.text());
    System.out.println();
}
This gives the following output:
http://stackoverflow.com/questions/12920296/past-5-week-calculation-in-webi-bo-4-0
Past 5 week calculation in WEBI (BO 4.0)?
http://stackoverflow.com/questions/36303701/how-to-get-information-about-the-visualized-elements-in-listview
How to get information about the visualized elements in listview?
[...]
What's happening here:
First, we get the HTML document from the specified URL. This code also sets the User Agent header of the request to "Mozilla", so that the website serves the page it would usually serve to browsers.
Then, we use select(...) and a for loop to get all the links to Stack Overflow questions, in this case links that have the class question-hyperlink.
Finally, we print out the text of each link with .text() and the href of the link with attr("abs:href"). Here, the abs: prefix gives us the absolute URL, i.e. with the protocol and domain included.
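If you only want the links that sit inside an h3 header, as mentioned above, the selector can be tightened accordingly. A brief sketch, assuming the question titles on the page are wrapped in h3 elements (the actual markup may differ):

// only links whose ancestor is an <h3> element (assumed page structure)
for (Element e : doc.select("h3 a.question-hyperlink")) {
    System.out.println(e.attr("abs:href") + " - " + e.text());
}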