BFS, DFS, PageRank, AKA — How to run embeddings only on important parts of a website
Depth-First Search (DFS) and Breadth-First Search (BFS) are two common algorithms used in web scraping to traverse websites. Usually you would want to use BFS, in order [no pun intended] to reach the important pages first.
In short, DFS will explore as far as possible along each branch before backtracking, while BFS will explore neighbor nodes at the present depth before moving to nodes at the next depth level.
**BFS**: First we visit the Homepage, then the Products, Blog, and About pages, then the children of each of those pages, and so on.

**DFS**: First we visit the Homepage, then Products, then the children of Products, then Blog, and then we are "stuck" in all the blog posts [the children of the Blog page] instead of moving on to the About page after the Blog page.
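To make the difference concrete, here is a minimal sketch of both traversals over the toy site above. The page names and link structure are just the illustration from this post, not a real crawler:

```javascript
// Toy link structure from the example above (for illustration only)
const site = {
  'Homepage': ['Products', 'Blog', 'About'],
  'Products': ['Category 1', 'Category 2'],
  'Blog': ['Post 1', 'Post 2', 'Post 3'],
  'About': [],
  'Category 1': [], 'Category 2': [],
  'Post 1': [], 'Post 2': [], 'Post 3': []
};

// BFS: finish the current depth before going deeper (FIFO queue)
function bfsOrder(start) {
  const queue = [start];
  const visited = new Set([start]);
  const order = [];
  while (queue.length > 0) {
    const page = queue.shift();
    order.push(page);
    for (const child of site[page] || []) {
      if (!visited.has(child)) { visited.add(child); queue.push(child); }
    }
  }
  return order;
}

// DFS: follow each branch to the bottom before backtracking (recursion)
function dfsOrder(page, visited = new Set(), order = []) {
  if (visited.has(page)) return order;
  visited.add(page);
  order.push(page);
  for (const child of site[page] || []) dfsOrder(child, visited, order);
  return order;
}

console.log('BFS:', bfsOrder('Homepage').join(' -> '));
// Homepage -> Products -> Blog -> About -> Category 1 -> Category 2 -> Post 1 -> Post 2 -> Post 3
console.log('DFS:', dfsOrder('Homepage').join(' -> '));
// Homepage -> Products -> Category 1 -> Category 2 -> Blog -> Post 1 -> Post 2 -> Post 3 -> About
```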
Creating this in the pre-coding-LLMs era would have taken a lot more time!
Using the PageRank algorithm, we visit the Homepage, then Products, and then the Category 1 and Category 2 pages.
This is of course just for illustration: if the internal link structure implies to us [read: "implies to Google"] that these pages are more important, we would like to visit them first.

Let’s use Claude/ChatGPT/Cursor; let’s use PageRank
Using Cursor, it took me about 10 minutes to add a PageRank class. I created a class which:
- Accepts a page and its links
- Adds them to an internal graph structure [could use networkx, but for this case I used simple Maps]
- Calculates PageRank
- Suggests the next page to visit
const fs = require('fs');

class WebsitePagerank {
  constructor(initialUrl) {
    // Note - for numerical and other reasons, the invariant that the total
    // PageRank of all pages sums to 1 does not hold here.
    this.initialUrl = this.normalizeURL(initialUrl);
    this.pages = new Map();        // url -> current rank
    this.links = new Map();        // url -> Set of outgoing link urls
    this.crawledPages = new Set(); // urls we have already visited
    this.pages.set(this.initialUrl, 10); // seed the homepage with a high rank
  }

  addPage(url) {
    url = this.normalizeURL(url);
    if (!this.pages.has(url)) {
      this.pages.set(url, url === this.initialUrl ? 10 : 1);
    }
  }

  addLink(sourceUrl, destUrl) {
    sourceUrl = this.normalizeURL(sourceUrl);
    destUrl = this.normalizeURL(destUrl);
    this.addPage(sourceUrl);
    this.addPage(destUrl);
    if (!this.links.has(sourceUrl)) {
      this.links.set(sourceUrl, new Set());
    }
    this.links.get(sourceUrl).add(destUrl);
  }

  markAsCrawled(url) {
    url = this.normalizeURL(url);
    console.log('markAsCrawled', url);
    this.crawledPages.add(url);
  }

  // Power iteration of the PageRank update:
  // PR(p) = (1 - d) / N + d * sum over pages q linking to p of PR(q) / outDegree(q)
  calculatePageRank(iterations = 10, dampingFactor = 0.85) {
    const numPages = this.pages.size;
    for (let i = 0; i < iterations; i++) {
      const newRanks = new Map();
      for (const [page] of this.pages) {
        let sum = 0;
        for (const [sourcePage, destPages] of this.links) {
          if (destPages.has(page)) {
            sum += this.pages.get(sourcePage) / destPages.size;
          }
        }
        newRanks.set(page, (1 - dampingFactor) / numPages + dampingFactor * sum);
      }
      this.pages = newRanks;
    }
  }

  // Highest-ranked page we have not crawled yet
  getNextUrl() {
    return Array.from(this.pages.entries())
      .filter(([url]) => !this.crawledPages.has(url))
      .sort((a, b) => b[1] - a[1])[0]?.[0];
  }

  // Strip fragments and trailing slashes so the same page is only counted once
  normalizeURL(url) {
    url = url.split('#')[0];
    if (url.endsWith('/')) {
      url = url.slice(0, -1);
    }
    return url;
  }

  dumpPageRankInfo(filePath) {
    const data = Array.from(this.pages.entries()).map(([url, rank]) => ({
      url,
      rank,
      outgoingLinks: Array.from(this.links.get(url) || [])
    }));
    fs.writeFileSync(filePath, JSON.stringify(data, null, 2));
  }
}
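Here is a minimal sketch of how this class could drive a crawl. `fetchAndExtractLinks` is a placeholder for whatever fetching and HTML parsing you already use, and the loop and page budget are just assumptions for illustration:

```javascript
// Hypothetical crawl loop around the WebsitePagerank class above.
// fetchAndExtractLinks(url) is assumed to return the internal links found on a page.
async function crawl(startUrl, maxPages = 20) {
  const pr = new WebsitePagerank(startUrl);
  let url = startUrl;
  for (let i = 0; i < maxPages && url; i++) {
    const links = await fetchAndExtractLinks(url); // your own fetch + parsing
    for (const link of links) {
      pr.addLink(url, link);     // feed the link graph
    }
    pr.markAsCrawled(url);
    pr.calculatePageRank();      // re-rank with what we know so far
    url = pr.getNextUrl();       // highest-ranked page we haven't visited yet
  }
  pr.dumpPageRankInfo('pagerank.json');
  return pr;
}
```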

Conclusion
We saw toy examples of the BFS, DFS, and PageRank algorithms. When a website has 100+ pages, this becomes a real pain.
Q: If we need to browse the pages anyway to build the PageRank, we don't save anything, right?
A: Correct, in terms of bandwidth or the number of crawled pages. However, if we would like to embed the pages afterward, we surely want to embed only the most important ones.
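As a sketch of that last point: once the ranks are computed, picking the top of the list for embedding is straightforward. `fetchText` and `embedText` below are stand-ins for whatever page-text extraction and embedding API you already use, not part of the class:

```javascript
// Hypothetical: embed only the top-K pages by PageRank.
// fetchText(url) and embedText(text) are placeholders for your own extraction / embedding calls.
async function embedImportantPages(pr, fetchText, embedText, topK = 10) {
  const topPages = Array.from(pr.pages.entries())
    .sort((a, b) => b[1] - a[1])  // highest rank first
    .slice(0, topK)
    .map(([url]) => url);

  const embeddings = new Map();
  for (const url of topPages) {
    const text = await fetchText(url);
    embeddings.set(url, await embedText(text));
  }
  return embeddings; // e.g. ready to be indexed in a vector store
}
```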