In a recent article, I wrote about how something as simple as reversing the sort order of a list of URLs could drastically improve my search index’s quality, thanks to my practice of limiting the crawler to 100 pages per domain.
Today I decided to prune those starting URLs further, with the help of grep.
the problem
The biggest thing hurting my search index was the presence of tag listings: pages that list every article matching a certain tag.
Due to my sort order solution, these pages (starting with t) were getting crawled before most blog posts and articles (usually starting with b or a), which is far from ideal. I have no desire to serve tag listings as search results, and they were crowding out the articles I do want to crawl, so something had to be done.
the solution
I have a function in my crawler that’s regularly run to remove duplicates in the queue of pages to be crawled. This felt like the ideal place to prune unwanted URLs.
So I passed its output through this grep command:
grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*'
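In context, the dedupe-then-filter step might look something like this shell sketch (the queue file name and sample URLs are invented for illustration; my actual deduplication happens inside the crawler itself):

```shell
# Build a small stand-in crawl queue with a duplicate and a tag listing.
printf '%s\n' \
  'https://example.com/tags/linux' \
  'https://example.com/blog/post' \
  'https://example.com/blog/post' > queue.txt

# Dedupe the queue, then drop tag/category listing URLs.
sort -u queue.txt |
  grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*' > queue.pruned.txt

cat queue.pruned.txt
```

Note that `-P` (Perl-compatible regular expressions) is a GNU grep feature, so this won’t work with every grep out of the box.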
Looks a little complex; let’s break it down.
- The `-v` flag inverts the match: URLs that match the pattern are excluded from the results, rather than the results being only URLs that match.
- The `-P` flag tells grep to interpret the pattern as a Perl-compatible regular expression, which gets me a couple of features I needed.
- `http(s)?://.*` will match any URL starting with “https://” or “http://”, which is every URL in my index. This isn’t where the magic happens.
- It’s `(tag(s)?|categor(ies|y))` where the magic happens; this’ll match any path that has a directory named `tag`, `tags`, `categories`, or `category`. There’s a slight possibility this’ll have some false positives, but it increases my index quality so much that I don’t mind.
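To make the matching behavior concrete, including that false-positive risk, here’s a quick check against a few made-up URLs (none of these are from my real queue):

```shell
# Invented URLs showing what the filter keeps and what it drops.
printf '%s\n' \
  'https://example.com/tags/linux/' \
  'http://example.org/category/recipes' \
  'https://example.com/blog/my-post/' \
  'https://example.net/articles/tagging-explained' |
  grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*'
# Only https://example.com/blog/my-post/ survives. Note that
# "tagging-explained" is dropped too, because /tag matches mid-word:
# that's exactly the kind of false positive mentioned above.
```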
This problem was a bit more complex than the last one, but still pretty simple in concept.