Wed, January 26, 2005
BlogMatcher back up
After several months of outage, BlogMatcher is back up. Apart from some minor internal cleanups and UI tweaks, everything's more or less the way it was before. The biggest difference is that the indexer now runs once an hour, but only fetches the last 5 minutes' worth of data from Weblogs.com's shortChanges list.
In case anyone's interested, here's a rundown of what happened.
Sometime last fall, BlogMatcher's search engine started to consume massive resources on startup, which meant the server was slowing to a crawl 4 times a day. The search engine has to be reloaded periodically because it updates by re-reading data and rebuilding the index, which avoids the need to handle realtime updates and the accompanying concurrency issues.
So, how it works is: the indexer fetches new data, processes it, then stores it in neat little (or not so little) cache files. The search engine then only has to read in the cached data and create the mappings in memory. But with 152,000+ known links and 30,000+ blogs (and 1.47 million link-blog mappings), this last step was consuming almost 100% of the server's resources for around 10 minutes at a go.
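To make that last step concrete, here's a minimal sketch of rebuilding the in-memory mappings from cached records. It is not BlogMatcher's actual code: the record layout (a blog URL plus the list of links it contains), the function name `build_mappings`, and the use of plain dicts to hand out IDs are all assumptions for illustration.

```python
from collections import defaultdict


def build_mappings(cached_records):
    """Rebuild ID tables and both directions of the blog-link mapping.

    cached_records is an iterable of (blog_url, [link_url, ...]) pairs;
    this layout is hypothetical, standing in for the indexer's cache files.
    """
    blog_ids = {}                       # blog URL -> blog ID
    link_ids = {}                       # link URL -> link ID
    blog_to_links = defaultdict(set)    # blog ID -> IDs of links it contains
    link_to_blogs = defaultdict(set)    # link ID -> IDs of blogs containing it

    for blog_url, link_urls in cached_records:
        # IDs are assigned in order of first appearance.
        blog_id = blog_ids.setdefault(blog_url, len(blog_ids))
        for link_url in link_urls:
            link_id = link_ids.setdefault(link_url, len(link_ids))
            blog_to_links[blog_id].add(link_id)
            link_to_blogs[link_id].add(blog_id)

    return blog_ids, link_ids, blog_to_links, link_to_blogs


records = [
    ("http://blog-a.example/", ["http://x.example/", "http://y.example/"]),
    ("http://blog-b.example/", ["http://y.example/"]),
]
blog_ids, link_ids, blog_to_links, link_to_blogs = build_mappings(records)
```

Because everything is rebuilt from scratch on each reload, nothing here has to cope with updates arriving while searches are running. The cost is that the whole loop runs over every link-blog pair each time.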
So, sometime in August, I finally got fed up and tweaked the cron jobs to restart the search engine only once a day. The problem was, this (or something else) somehow broke the search engine code, so although it was running, no useful results were being returned (really, a bug that must have been lurking in the btree code all along magically decided to switch itself on).
Fortunately, the combination of a new server and some optimizations resulted in startup times well below a minute (probably around 30 seconds). Okay, so at this point, you're probably wondering, "how'd it go from 10 minutes to 30 seconds?" Frankly, I am too. I think it's a combination of the following:
- Faster processor (1.3GHz AMD vs 3GHz P4)
- Faster/more memory (256MB/??? vs 1GB/DDR)
- Faster drive (7200rpm IDE vs 10k rpm SATA)
- GCC optimizations (no optimization vs -O3)
- Algorithm changes: The blog-link adjacency list stores blog IDs and link IDs (which are assigned by BlogMatcher's search engine). Link ID lookups are done by searching for a given link in a btree, which currently has about 60k nodes. Since a link ID lookup had to be performed for every link in every blog, this added up (1.47 million lookups). Since most links show up in multiple blogs, sometimes in thousands of blogs, the obvious thing to do was to cache link IDs. So in the current version, the link ID lookup function caches results in a hash table, which eliminates the need for about 500k btree lookups.
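The caching idea in the last item can be sketched roughly like this. It's an illustration only, not the search engine's real code: a sorted list searched with `bisect` stands in for the btree, a dict plays the hash table, and the class and function names (`SortedLinkIndex`, `CachedLinkIds`, `build_adjacency`) are made up.

```python
import bisect


class SortedLinkIndex:
    """Stand-in for the btree: a sorted list of links searched with bisect."""

    def __init__(self, links):
        self._links = sorted(set(links))
        self.lookups = 0  # number of ordered-index searches performed

    def find_id(self, link):
        self.lookups += 1
        pos = bisect.bisect_left(self._links, link)
        if pos < len(self._links) and self._links[pos] == link:
            return pos  # the position in the sorted list serves as the link ID
        return None


class CachedLinkIds:
    """Wraps the index so each distinct link is searched for only once."""

    def __init__(self, index):
        self._index = index
        self._cache = {}  # link -> link ID (the hash table)

    def find_id(self, link):
        if link in self._cache:
            return self._cache[link]
        link_id = self._index.find_id(link)
        self._cache[link] = link_id
        return link_id


def build_adjacency(blogs, lookup):
    """Map each blog ID to the list of link IDs it contains."""
    return {blog_id: [lookup(link) for link in links]
            for blog_id, links in blogs.items()}


blogs = {
    0: ["http://a.example/", "http://b.example/"],
    1: ["http://a.example/", "http://c.example/"],
    2: ["http://a.example/", "http://b.example/"],
}
index = SortedLinkIndex(link for links in blogs.values() for link in links)
cached = CachedLinkIds(index)
adjacency = build_adjacency(blogs, cached.find_id)
print(index.lookups)  # 3 index searches for 6 link occurrences
```

The more often a link repeats across blogs, the bigger the saving, since only the first occurrence pays for the ordered-index search and the rest are answered from the dict.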
Anyway, that's the long boring story behind BlogMatcher's outage.