Jsoup can be used to extract links and email address from a webpage, thus "Web email address collector bot" First, this code uses a Regular expression to extract the email addresses, and then uses methods provided by Jsoup to extract the URLs of links on the page.
public class JSoupTest {
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("http://stackoverflow.com/questions/15893655/").userAgent("Mozilla").get();
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
emails.add(matcher.group());
}
Set<String> links = new HashSet<String>();
Elements elements = doc.select("a[href]");
for (Element e : elements) {
links.add(e.attr("href"));
}
System.out.println(emails);
System.out.println(links);
}
}
This code could also be easily extended to also recursively visit those URLs and extract data from linked pages. It could also easily be used with a different regex to extract other data.
(Please don't become a spammer!)