In a recent article, I wrote about how something as simple as reversing the sort order of a list of URLs could drastically improve my search index’s quality, thanks to my practice of limiting the crawler to 100 pages per domain.
Today I decided to prune those starting URLs further, with the help of grep.
the problem
The biggest thing hurting my search index was the presence of tag listings: pages that list every article matching a certain tag.
Due to my sort order solution, these pages (starting with t) were getting crawled before most blog posts and articles (usually starting with b or a), which is far from ideal. I have no desire to serve tag listings as search results, and they were crowding out the articles I do want to crawl, so something had to be done.
the solution
I have a function in my crawler that’s regularly run to remove duplicates in the queue of pages to be crawled. This felt like the ideal place to prune unwanted URLs.
So I passed its output through this grep command:
grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*'
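In context, the dedupe-then-filter step might look something like this shell sketch (the queue file name and sample URLs are invented for illustration; my actual deduplication happens inside the crawler itself):

```shell
# Build a small stand-in crawl queue with a duplicate and a tag listing.
printf '%s\n' \
  'https://example.com/tags/linux' \
  'https://example.com/blog/post' \
  'https://example.com/blog/post' > queue.txt

# Dedupe the queue, then drop tag/category listing URLs.
sort -u queue.txt |
  grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*' > queue.pruned.txt

cat queue.pruned.txt
```

Note that `-P` (Perl-compatible regular expressions) is a GNU grep feature, so this won’t work with every grep out of the box.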
Looks a little complex; let’s break it down.
- The `-v` flag inverts the match: URLs that match the pattern are excluded from the results, rather than the results being only URLs that match.
- The `-P` flag tells grep to interpret the pattern as a Perl-compatible regular expression, which gets me a couple of features I needed.
- `http(s)?://.*` will match any URL starting with “https://” or “http://”, which is every URL in my index. This isn’t where the magic happens.
- It’s `(tag(s)?|categor(ies|y))` where the magic happens; this’ll match any path that has a directory named `tag`, `tags`, `categories`, or `category`. There’s a slight possibility this’ll have some false positives, but it increases my index quality so much that I don’t mind.
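To make the matching behavior concrete, including that false-positive risk, here’s a quick check against a few made-up URLs (none of these are from my real queue):

```shell
# Invented URLs showing what the filter keeps and what it drops.
printf '%s\n' \
  'https://example.com/tags/linux/' \
  'http://example.org/category/recipes' \
  'https://example.com/blog/my-post/' \
  'https://example.net/articles/tagging-explained' |
  grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*'
# Only https://example.com/blog/my-post/ survives. Note that
# "tagging-explained" is dropped too, because /tag matches mid-word:
# that's exactly the kind of false positive mentioned above.
```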
This problem was a bit more complex than the last one, but still pretty simple in concept.