|
Powered by
|
|
Section: All | News & Politics | Geek Stuff | Devel | Non-existent Life | Random | Food! | Life |
Archives: 2003 > 04 Mon, April 21, 2003
BlogMatcher optimization
BlogMatcher: I'm re-working the match-finding portion of the system right now. The PHP-based script worked remarkably well, but with 2600 blogs and counting, it was starting to show its limits. The database approach failed miserably (I'll spend more time on that later), so the new plan is to convert chunks of my PHP code to C/C++. I just finished the C version, which does the searches in roughly a third of the time (i.e. 1 second vs 3 seconds), but I realized that one of the slowest processes could be further optimized if I use C++/STL. While taking a dump, I also realized that maybe I could run multiple processes to do the search since it's fairly disk intensive and I'm sure the programs waste quite a few CPU cycles waiting for disk I/O.
Ultimately, my goal is to meet Google standards (show results in less than 1 second). On a semi-related note, I still want to work for Google.
I'll post more in a couple of hours...
Holy Sh*t!
Holy mother of llamas... the core search now takes well under a second (0.24 seconds for my blog)! I almost crapped in my pants. Well, I didn't, but my hands are shaking.
Here's how it works. Basically, the indexer generates a list of links for each blog and saves them into a file. What the search program has to do is go through each of the links files and see how many of the links in there match the links for the reference site (also stored in a file).
Here's what the PHP scripts did: The PHP search feature read the reference site's links into an array, then went through each of the indexed links files and ran an array_intersect().
Here's what the rev 1 C program did: I read in the reference site's links into an array, and went through each of the link files. When looking up each of the links, it did a linear search.
Here's what my rev 2 C program does: I read in the reference site's links into a B-Tree, and does everything else like before, except instead of an exhaustive linear search, it just has to search the B-Tree.
Does it get any better? Well, possibly... B-trees aren't the best if the data isn't inserted in random order (i.e. if it's already sorted, you get lop-sided trees). If I did an AVL tree it might be a little bit more efficient. Having said that, the tree will only contain at most 200 nodes, usually somewhere in the range of 30-60 so I doubt there's much room left for optimization there. At this point, I think the limiting factor is disk I/O. The only way to get around that would be to create a daemon that has all the links stored in RAM, and have a bunch of readily available (i.e. pre-spawned) processes ready. I'm not sure if I'm that desperate...
More goodness...
Okay, I think I've got it firgured out now. I added a counter (using hash tables, yay!) so now I can vastly improve the scoring algorithm. How it works is it actually counts how many times each URL shows up in your results. So, for example, if it finds a bunch of other sites that link to Google, it automatically gives it a lower score, whereas an uncommon link will receive a much higher score. The really amazing thing is, all this seems to have only added 0.03 seconds or so to the search time.
All done!
BlogMatcher: Okay, it's all done now. It searches in under 1 second (unless the server's under heavy load), and it's got pagination and everything!
http://blog.iloha.net/lab
I also wrote an FAQ.
drats
It's a little past 7am. I have class at 9am, which means I have to get up at around 8:20. Well, so much for that whole sleep thing. I hate it when this happens because I get confused about when to eat breakfast. I mean, it's been like 9 hours since my last meal, so I feel like I deserve another meal. But then, it's 7am for crying out loud. If I eat now, I'll be starving around breakfast time.
Am I making any sense? No? Oh... ok.
"Music To Your Ears"
Will Apple accnounce their music distribution software on the 28th? I guess we'll just have to wait and see...
Theorem
I was just sitting in the toilet reading about heaps (as in the data structure) and I came up with a theorem (I'm sure someone's thought of the same thing and named it centuries ago... let me know if you know what it is):
Any highly complex system can be broken down to optimized, specialized, and simplified elements that interact with each other.
Examples: The human brain consists of simple neurons. The web consists of simple textual data that link to each other. The universe is made up of simple particles governed by a few simple (albeit not fully understood) laws.
|
|
Ryo Chijiiwa
I'm a biologically Japanese, culturally American, Germany-raised, socially liberal, politically independent, gun-totin', code writin' dude. My life is currently sponsored by Google.
|