How to Build a Web Crawler Using Java

Here is the code for a basic web crawler in Java:

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebCrawler {
    private final Set<String> visitedUrls = new HashSet<>();
    private final Queue<String> urlsToVisit = new LinkedList<>();
    
    private int maxPages = 100;
    
    public void crawl(String startUrl) {
        urlsToVisit.add(startUrl);
        while (!urlsToVisit.isEmpty() && visitedUrls.size() < maxPages) {
            String currentUrl = urlsToVisit.poll();
            // The same URL can be queued more than once; skip it if it was already visited
            if (!visitedUrls.add(currentUrl)) {
                continue;
            }
            try {
                Document doc = Jsoup.connect(currentUrl).get();
                Elements links = doc.select("a[href]");
                for (Element link : links) {
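                    String linkUrl = link.attr("abs:href");
                    // abs:href is empty when the link cannot be resolved; also skip non-HTTP links such as mailto:
                    if (linkUrl.startsWith("http") && !visitedUrls.contains(linkUrl)) {
                        urlsToVisit.add(linkUrl);
                    }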
                }
                // process the document content here
                String pageTitle = doc.title();
                String pageContent = doc.text();
                System.out.println("Title: " + pageTitle);
                System.out.println("Content: " + pageContent);
            } catch (IOException e) {
                System.err.println("Error crawling URL: " + currentUrl + " (" + e.getMessage() + ")");
            }
        }
        System.out.println("Crawling finished, visited " + visitedUrls.size() + " pages.");
    }
    
    public static void main(String[] args) {
        WebCrawler crawler = new WebCrawler();
        crawler.crawl("https://www.example.com");
    }
}
This web crawler uses the Jsoup library to fetch and parse HTML documents and to extract their links. The 'crawl' method takes a starting URL and performs a breadth-first search over the pages reachable from it, stopping after a maximum number of pages ('maxPages', 100 here). For each page, the crawler extracts the links to other pages and adds them to the queue of pages to visit. It then processes the content of the current page. This step can be customized for the specific use case; in this example it simply prints the title and text content of the page to the console.

The 'visitedUrls' set keeps track of the pages that have already been visited, and the 'urlsToVisit' queue stores the pages that still need to be visited. Because the queue is first-in, first-out, pages closer to the starting URL are crawled first. Note that the crawler follows every link it finds, including links that lead to other websites.
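To keep a crawl on a single site, and to avoid queuing the same page twice under different '#fragment' suffixes, links can be filtered before they are added to 'urlsToVisit'. The following is only a minimal sketch under stated assumptions: the class name 'CrawlScope' and its methods 'normalize' and 'sameHost' are hypothetical helpers made up for illustration. They are not part of Jsoup and use only the standard 'java.net.URI' class.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of Jsoup or the crawler above.
public class CrawlScope {

    // Strips the #fragment and returns the URL, or null when the link
    // is not a parseable absolute http(s) URL (for example mailto: links).
    public static String normalize(String url) {
        int hash = url.indexOf('#');
        String withoutFragment = (hash >= 0 ? url.substring(0, hash) : url).trim();
        try {
            URI uri = new URI(withoutFragment);
            String scheme = uri.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            if (!isHttp || uri.getHost() == null) {
                return null;
            }
            return uri.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // True when both URLs have the same host name, ignoring case.
    public static boolean sameHost(String startUrl, String candidateUrl) {
        try {
            String startHost = new URI(startUrl).getHost();
            String candidateHost = new URI(candidateUrl).getHost();
            return startHost != null && startHost.equalsIgnoreCase(candidateHost);
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String start = "https://www.example.com";
        String[] links = {
            "https://www.example.com/about#team",
            "https://other.example.org/",
            "mailto:someone@example.com"
        };
        for (String link : links) {
            String normalized = normalize(link);
            boolean keep = normalized != null && sameHost(start, normalized);
            System.out.println(link + " -> " + (keep ? normalized : "skipped"));
        }
    }
}
```

In the 'crawl' loop, this would mean passing each 'abs:href' value through 'normalize' and queuing the result only when it is not null and 'sameHost' returns true for the starting URL. Comparing exact host names is a deliberate simplification: it treats 'example.com' and 'www.example.com' as different sites, which may or may not be what you want.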
I hope this helps! If you have any questions, feel free to ask.
