Jsoup Extract the URLs and titles of links


Example

Jsoup can be used to extract all links from a webpage, or only the specific links we want. Here we extract the question links on the Stack Overflow home page, which sit inside h3 headers and carry the class question-hyperlink. For each link we print both its URL and its text.

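import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch and parse the page; get() throws IOException if the request fails.
Document doc = Jsoup.connect("http://stackoverflow.com").userAgent("Mozilla").get();
for (Element e : doc.select("a.question-hyperlink")) {
    System.out.println(e.attr("abs:href")); // absolute URL of the link
    System.out.println(e.text());           // visible text of the link
    System.out.println();
}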

This gives the following output:

http://stackoverflow.com/questions/12920296/past-5-week-calculation-in-webi-bo-4-0
Past 5 week calculation in WEBI (BO 4.0)?

http://stackoverflow.com/questions/36303701/how-to-get-information-about-the-visualized-elements-in-listview
How to get information about the visualized elements in listview?

[...]

What's happening here:

  • First, we get the HTML document from the specified URL. This code also sets the User-Agent header of the request to "Mozilla", so that the website serves the page it would usually serve to browsers.

  • Then, we use select(...) and a for loop to iterate over all the links to Stack Overflow questions, in this case a elements which have the class question-hyperlink.

  • Finally, we print the text of each link with .text() and the href of the link with attr("abs:href"). We use the abs: prefix to get the absolute URL, i.e. with the domain and protocol included, even if the href in the page is relative.
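
To make the last point concrete, here is a minimal sketch of what "absolute URL" means. It uses only java.net.URI from the standard library rather than Jsoup, so it illustrates the idea of resolving an href against the page's address and is not a description of how Jsoup implements abs:. The class name AbsoluteUrlSketch and the helper resolve are hypothetical names chosen for this illustration; the URLs are taken from the output above.

```java
import java.net.URI;

// Hypothetical illustration class; not part of Jsoup.
class AbsoluteUrlSketch {

    // Resolve an href (relative or absolute) against the URL of the page it was found on.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://stackoverflow.com/";

        // Relative href: the protocol and domain are filled in from the base URL.
        System.out.println(resolve(base,
                "/questions/12920296/past-5-week-calculation-in-webi-bo-4-0"));

        // Already absolute href: returned unchanged.
        System.out.println(resolve(base,
                "http://stackoverflow.com/questions/36303701/how-to-get-information-about-the-visualized-elements-in-listview"));
    }
}
```

Both calls print a full URL starting with http://stackoverflow.com/questions/, which is the form that attr("abs:href") gives us in the example above. A plain attr("href"), by contrast, returns the attribute exactly as it is written in the HTML, which may be a relative path.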