clew reaches alpha

Has it really been almost two months since I announced the name of my search engine project, Clew?

So much has happened. Let me walk you through it.

the great refactor

I started this year with a series of posts about writing a crawler for this project in pure bash scripts. Well, I’ve changed course: I re-made the whole thing in Python. The quality and robustness of the crawler are much higher, so it’s a good change to have made. I’m not sorry I started with shell scripts, though; I learned so much about scripting through that experiment, and that knowledge is sure to help me in other projects.

The second big change is that I’ve finally gotten over my phobia of databases and am indexing all the data in PostgreSQL. In fact, relearning SQL has been a far more enjoyable experience than I remembered; this may be partly due to my finally learning database normalization and some key features like JOIN.
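
For anyone who hasn't met normalization or JOIN before, here is a minimal sketch of the idea. It uses Python's built-in sqlite3 module rather than PostgreSQL so that it runs without a server, and every table, column, and function name in it is made up for illustration; it is not Clew's actual schema. Each page and each term is stored exactly once, and a third table links them, so a lookup has to JOIN the three back together.

```python
import sqlite3
from collections import Counter


def build_index(conn):
    # Hypothetical normalized schema: pages and terms are each stored once,
    # and postings records how often a term occurs on a page.
    conn.executescript("""
        CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL);
        CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT UNIQUE NOT NULL);
        CREATE TABLE postings (
            page_id INTEGER NOT NULL REFERENCES pages(id),
            term_id INTEGER NOT NULL REFERENCES terms(id),
            occurrences INTEGER NOT NULL,
            PRIMARY KEY (page_id, term_id)
        );
    """)


def add_page(conn, url, text):
    # Insert the page, then one postings row per distinct term in its text.
    page_id = conn.execute(
        "INSERT INTO pages (url) VALUES (?)", (url,)
    ).lastrowid
    for term, occurrences in Counter(text.lower().split()).items():
        conn.execute("INSERT OR IGNORE INTO terms (term) VALUES (?)", (term,))
        term_id = conn.execute(
            "SELECT id FROM terms WHERE term = ?", (term,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO postings (page_id, term_id, occurrences) VALUES (?, ?, ?)",
            (page_id, term_id, occurrences),
        )


def pages_with_term(conn, term):
    # JOIN stitches the three tables back together for a single lookup.
    rows = conn.execute("""
        SELECT pages.url, postings.occurrences
        FROM postings
        JOIN pages ON pages.id = postings.page_id
        JOIN terms ON terms.id = postings.term_id
        WHERE terms.term = ?
        ORDER BY postings.occurrences DESC, pages.url
    """, (term,))
    return rows.fetchall()
```

To try it, open an in-memory database with `sqlite3.connect(":memory:")`, call `build_index`, add a couple of pages with `add_page`, and ask `pages_with_term` for a word they share.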

There are only perhaps ten lines of code from the original codebase I blogged about that are still in place. And almost all of that new code was written in the last week. Which… brings me to the problem I’ll be discussing in a minute. But let’s stay with the good news for a little longer.

α

My original approach for the project was to complete the crawler before starting work on the actual code to search the index. At some point during the refactor, though, I realized that I was already storing enough information to throw together a proof of concept.

The biggest benefit of actually using a database is that I could work out all the filtering and ranking of results in pure SQL. (I’m primarily using the Okapi BM25 model, specifically the BM25F variation with some weights of my own to match the project’s needs.)
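
To make the ranking idea a bit more concrete, here is a rough sketch of one common formulation of BM25F. It is written in Python rather than SQL purely for readability, and it is not Clew's actual code: the field names, the field weights, and the `K1` and `B` values are all placeholder assumptions. The point it illustrates is that BM25F adds up weighted, length-normalized term counts across fields (title, body, and so on) first, and only then applies the usual BM25 saturation.

```python
import math
from collections import Counter

# Hypothetical field weights and parameters, chosen only for illustration.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}
K1 = 1.2
B = 0.75


def tokenize(text):
    return text.lower().split()


def bm25f_scores(query, docs):
    """Return one BM25F score per document; each doc maps field -> text."""
    if not docs:
        return []
    n_docs = len(docs)
    tokens = [{f: tokenize(d.get(f, "")) for f in FIELD_WEIGHTS} for d in docs]
    # Average field length, used for per-field length normalization.
    avg_len = {}
    for f in FIELD_WEIGHTS:
        mean = sum(len(t[f]) for t in tokens) / n_docs
        avg_len[f] = mean if mean > 0 else 1.0

    query_terms = set(tokenize(query))
    scores = []
    for doc_tokens in tokens:
        counts = {f: Counter(doc_tokens[f]) for f in FIELD_WEIGHTS}
        score = 0.0
        for term in query_terms:
            # Document frequency: how many docs contain the term in any field.
            df = sum(
                1 for other in tokens
                if any(term in other[f] for f in FIELD_WEIGHTS)
            )
            if df == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Combine the fields into one weighted pseudo-frequency.
            weighted_tf = 0.0
            for f, weight in FIELD_WEIGHTS.items():
                norm = 1 - B + B * len(doc_tokens[f]) / avg_len[f]
                weighted_tf += weight * counts[f][term] / norm
            # Saturate once, after the fields have been combined.
            score += idf * weighted_tf / (K1 + weighted_tf)
        scores.append(score)
    return scores
```

With these placeholder weights, a page that has the query word in its title should outrank one that only mentions it in the body, which is the sort of behavior tuning the weights lets you control.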

Once I had a working proof of concept, I published it to my server. I’m not sharing the link publicly (to avoid overloading the server, since I haven’t done much performance optimization), but I’ve been posting numerous updates about it and have shared the link privately with anyone who asked for it. If you’re interested in giving it a spin, my email’s below this article; shoot me a message and I’ll give you the link.

The project means so much more now that there’s an actual, tangible result to try out and show off, so I’m glad I took this route.

the “bad” news

The problem with all this enormous progress is that I’ve been using this project as a means of procrastinating from my other goals and responsibilities. I need to take a break from working on Clew to be able to get my life back in balance.

At this point, I don’t think I’ll be on hiatus from the project for more than a week, but hey, sometimes life gets busy. We’ll see how it goes.

I’ve uploaded all of my work so far to the alpha site, so if you have access it’ll be there for you to play with while you wait for me to get back to work.

conclusion

Before I close out this article, I want to give a big thanks to the two anonymous donors who’ve contributed financially to this project so far. It’s been overwhelming: I’m currently receiving $6.26 per week from y’all, which is a huge deal for a project like this when I have effectively no income. I plan to use donations to Clew first to cover all server and other tech costs associated with the project, then to use any leftover to help support myself as I start to dip my toe into the world of being an independent creator of stories, projects, and dreams.

If you’re reading this and would like to contribute as well, you can do so at Liberapay. Thanks for even considering it; there’s no pressure to donate if you don’t feel like it or can’t afford it. Or hey, maybe you’d rather wait until I’ve got a complete first version; that makes sense too.

If you have thoughts about my search engine project (feature requests, concerns, links to blogs [including your own!] that you’d like to be sure are included, cookie recipes, and so on), I’d love to hear from you! My email address is below this article. I’m already looking forward to your message, obstinately challenging the arrow of time.