As I was coding the web crawler for my search engine project, I encountered an unexpected issue. The way I was sorting URLs that needed to be crawled was resulting in my script only crawling the oldest pages on many websites.
the problem
The problem arose with sites with URL patterns like this:
https://example.com/docs/v1.0.0
https://example.com/docs/v1.0.1
https://example.com/docs/v1.0.2
https://example.com/docs/v1.1.0
https://example.com/docs/v2.0.0
Or this:
https://example.blog/posts/2004/10/15/changed-my-blog-platform-to-wordpress
https://example.blog/posts/2006/07/24/changed-my-blog-platform-to-hugo
https://example.blog/posts/2008/01/01/sorry-i-should-post-more
https://example.blog/posts/2021/07/13/changed-my-blog-platform-to-hand-coded-html
https://example.blog/posts/2023/02/18/this-time-i-really-mean-it-im-gonna-post-more
When sorted normally, my crawler would start with the oldest pages and crawl toward the newest. That's perfectly fine—except that by default I limit each domain to 100 crawled subpages. So I would crawl only the oldest 100 blog posts on a site, or only the documentation for outdated versions of a piece of software.
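The behavior is easy to reproduce. A plain lexicographic sort puts the oldest of these (hypothetical) blog URLs first, so a page cap keeps only the old content:

```shell
# Plain sort orders date-prefixed URLs oldest-first,
# so a crawl capped at N pages never reaches recent posts.
printf '%s\n' \
  'https://example.blog/posts/2023/02/18/this-time-i-really-mean-it-im-gonna-post-more' \
  'https://example.blog/posts/2004/10/15/changed-my-blog-platform-to-wordpress' \
  'https://example.blog/posts/2008/01/01/sorry-i-should-post-more' \
  | sort | head -n 1
# prints the 2004 URL — the oldest post gets crawled first
```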
the solution
Pretty simple. Instead of piping the URLs to sort -, I piped them to sort -rn -, which sorts numerically (so 10 sorts after 2, even though its first character is lower) and then reverses the order.
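With the reversed sort, the same kind of (hypothetical) versioned URLs come out newest-first, so a capped crawl starts at the freshest pages:

```shell
# With -r, sort's comparison is reversed, so the highest
# version path comes out first.
printf '%s\n' \
  'https://example.com/docs/v1.0.0' \
  'https://example.com/docs/v2.0.0' \
  'https://example.com/docs/v1.1.0' \
  | sort -rn | head -n 1
# prints https://example.com/docs/v2.0.0 — the newest docs first
```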
Now the problem is solved: if I hit my page-count limit for a site, I get the freshest articles and documentation instead of the oldest.
Sometimes problems are simple.