this should've been a database

I’ve mostly passed the web crawler stage in my search engine project’s prototyping, so my next step is to figure out how to index the vast amount of information I’ve scraped from random places on the internet.

The obvious answer is to use a database.

I… uh… am not using a database.

my solution

In my working directory, I’ve created a subdirectory titled _index. Inside it are subdirectories for each language (since I wanna separate keywords by language). When I run my ./index script, it goes through every page I’ve scraped in the specified language and creates a line something like this:

0.970873786407767	https://alpha.polymaths.social/

The first column is the weight, or the relative frequency of the keyword within a page. It’s roughly the keyword’s share of the page’s text as a percentage, with text in headings counting extra. The second column is the link to the page in question.
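A minimal sketch of how a weight like this could be computed, in Python. It is not the actual ./index script: the function name `keyword_weights`, the `HEADING_BONUS` value, and the assumption that the page has already been split into body words and heading words are all made up for illustration. The sample value above happens to equal 100/103, which would be consistent with one occurrence in a 103-word page expressed as a percentage, but that is only a guess.

```python
from collections import Counter

# Assumed value: how many times a word in a heading counts.
# The post only says headings count "extra", not by how much.
HEADING_BONUS = 2.0


def keyword_weights(body_words, heading_words, heading_bonus=HEADING_BONUS):
    """Return {keyword: weight}, with weight as a percentage of the page's words."""
    counts = Counter()
    for word in body_words:
        counts[word.lower()] += 1.0
    for word in heading_words:
        # Heading words are boosted by the (hypothetical) bonus factor.
        counts[word.lower()] += heading_bonus

    total = len(body_words) + len(heading_words)
    if total == 0:
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}
```

Because of the heading bonus, a weight computed this way is only roughly a percentage; a keyword that appears mostly in headings can score above its literal share of the text.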

Then this line gets put in a file named after the keyword. Rinse and repeat for every page and keyword in my corpus.
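The file-per-keyword step might look something like the following Python sketch. The function name `add_to_index`, its parameters, and the `weights` mapping of keyword to weight are hypothetical; the directory layout (an index directory, one subdirectory per language, one file per keyword) and the tab-separated line format come from the description above.

```python
from pathlib import Path


def add_to_index(index_root, language, url, weights):
    """Append one 'weight<TAB>url' line per keyword to that keyword's file."""
    lang_dir = Path(index_root) / language
    lang_dir.mkdir(parents=True, exist_ok=True)

    for keyword, weight in weights.items():
        # A keyword doubles as a file name, so skip anything that
        # could not be one (an assumption; the post does not say
        # how such keywords are handled).
        if not keyword or "/" in keyword or keyword in {".", ".."}:
            continue
        # Append mode: each page adds its own line to the keyword's file.
        with open(lang_dir / keyword, "a", encoding="utf-8") as f:
            f.write(f"{weight}\t{url}\n")
```

Calling it as `add_to_index("_index", "en", "https://alpha.polymaths.social/", {"search": 0.970873786407767})` would append the sample line shown earlier to `_index/en/search`.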

This takes a good amount of time and computing power, but once it’s done I can just read the files for the keywords people search for and get a shortlist of URLs worth further indexing.
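For the lookup side, here is a rough Python sketch of reading the keyword files back and building that shortlist. The name `lookup` and the choice to sum weights when a URL shows up under several keywords are assumptions for illustration, not something the post specifies.

```python
from pathlib import Path


def lookup(index_root, language, keywords):
    """Return (url, score) pairs for the given keywords, best score first."""
    scores = {}
    for keyword in keywords:
        keyword = keyword.lower()
        if not keyword or "/" in keyword or keyword in {".", ".."}:
            continue  # not a usable file name
        path = Path(index_root) / language / keyword
        if not path.is_file():
            continue  # nothing indexed under this keyword
        with open(path, encoding="utf-8") as f:
            for line in f:
                weight, _, url = line.rstrip("\n").partition("\t")
                if not url:
                    continue  # skip lines without a tab-separated URL
                # Assumed scoring: add up weights across matching keywords.
                scores[url] = scores.get(url, 0.0) + float(weight)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Each query only touches the files for the searched keywords, which is what makes the up-front indexing cost worthwhile.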

so… why not a database?

Because I’m silly.

No, seriously.

For some reason, I really, really don’t like using databases. So I’m seeing how far I can get before I give in and just set up PostgreSQL like a smart code monkey. I’ve been told numerous times I should be using a database, and I agree.

conclusion

Not entirely sure where I’m going with this. I know I’ll end up rewriting this all to be in a database, assuming I get far enough into this project and don’t drop it in a few days.

But hey, I’m having fun for now, and that’s what counts, right? As Eric, from Polymaths.social, replied to me on the subject:

“It’s more of a choice between wanting to build one versus wanting to use an established one. It’s much like the choice of building a search engine versus wanting to use an established one.”

Indeed, this whole search engine project is about wanting to see if I can build things myself. It makes sense, then, that I’d rather see how far I can get with my own indexing mechanism, however inefficient, than simply use a database.