Jsoup Web crawling with Jsoup Extracting all the URLs from a website using JSoup (recursion)


Example

In this example we will extract all the web links from a website. I am using http://stackoverflow.com/ for illustration. Here recursion is used, where each obtained link's page is parsed for presence of an anchor tag and that link is again submitted to the same function.

The condition if(add && this_url.contains(my_site)) will limit results to your domain only.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class readAllLinks {
    
        public static Set<String> uniqueURL = new HashSet<String>();
        public static String my_site;
    
        public static void main(String[] args) {
    
            readAllLinks obj = new readAllLinks();
            my_site = "stackoverflow.com";
            obj.get_links("http://stackoverflow.com/");
        }
    
        private void get_links(String url) {
            try {
                Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
                Elements links = doc.select("a");

                if (links.isEmpty()) {
                   return;
                }

                links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                    boolean add = uniqueURL.add(this_url);
                    if (add && this_url.contains(my_site)) {
                        System.out.println(this_url);
                        get_links(this_url);
                    }
                });
    
            } catch (IOException ex) {
    
            }
    
        }
    }

The program will take much time to execute depending on your website. The above code can be extended to extract data (like titles of pages or text or images) from particular website. I would recommend you to go through company's terms of use before scarping it's website.

The example uses JSoup library to get the links, you can also get the links using your_url/sitemap.xml.