I've been working on a "BlogMatcher" script for the last day or so (started it last night). The script looks for blogs that have the same links (or words) as your blog. The basic idea is that a blog that links to the same things as your blog has some topical commonality and may be of interest to you.
http://blog.iloha.net/lab/j.php
It's really not very complicated, until you start thinking about performance. Right now, I have a script that fetches a list of recently updated blogs from weblogs.com, then downloads and indexes each of the blogs. Last I checked (it's indexing again right now) it had over 1000 blogs indexed, and commonality searches took around 10 seconds. I could probably get better performance if I rewrote chunks of it in C/C++, but I'm still debating whether databases would speed it up.
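For what it's worth, here is a rough sketch of how the commonality search could be done with an inverted index (link to blogs) instead of comparing every pair of blogs. This is not the actual BlogMatcher code; the function names `build_index` and `find_matches` and the `blog_links` dictionary are made up for illustration, and it assumes the links have already been extracted from each blog.

```python
from collections import Counter, defaultdict


def build_index(blog_links):
    """Map each link to the set of blogs that contain it (an inverted index).

    blog_links is a hypothetical dict: blog URL -> list of outbound links.
    """
    index = defaultdict(set)
    for blog, links in blog_links.items():
        for link in set(links):
            index[link].add(blog)
    return index


def find_matches(index, my_links, me=None, top=10):
    """Rank other blogs by how many links they share with my_links."""
    shared = Counter()
    for link in set(my_links):
        # Only blogs that actually contain this link are visited.
        for blog in index.get(link, ()):
            if blog != me:
                shared[blog] += 1
    return shared.most_common(top)
```

With this layout, a search only touches the blogs that share at least one link with yours, so the cost depends on how popular your links are rather than on the total number of indexed blogs.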
Another problem is that a whole bunch of blogs link to sites like Google or Blogger, and as it is, a shared link like that counts as a valid commonality match. I need to figure out a way to efficiently ignore the most common sites without sacrificing valid results. Kinda makes me appreciate Google more.
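One possible approach, sketched below, is to weight each link by how rare it is across all indexed blogs (an inverse-document-frequency style weight) and to drop links that appear on too many blogs. This is only an idea, not something the script does today; `idf_weights`, `weighted_score`, and the `max_share` cutoff are hypothetical names, and the 0.5 default is an arbitrary guess.

```python
import math
from collections import defaultdict


def idf_weights(blog_links, max_share=0.5):
    """Give each link a weight of log(N / df), skipping overly common links.

    blog_links is a hypothetical dict: blog URL -> list of outbound links.
    max_share is an assumed cutoff: links found on more than this fraction
    of blogs get no weight at all.
    """
    n = len(blog_links)
    df = defaultdict(int)  # number of blogs containing each link
    for links in blog_links.values():
        for link in set(links):
            df[link] += 1

    weights = {}
    for link, count in df.items():
        if count / n > max_share:
            continue  # too common to say anything about topic
        weights[link] = math.log(n / count)
    return weights


def weighted_score(links_a, links_b, weights):
    """Sum the weights of the links two blogs have in common."""
    common = set(links_a) & set(links_b)
    return sum(weights.get(link, 0.0) for link in common)
```

A link that nearly everyone has contributes nothing to the score, while a link shared by only a handful of blogs counts for a lot, so rare overlaps float to the top without a hand-maintained ignore list.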
On a semi-related note, I really want to work for Google...
Posted Sun, April 20, 2003 15:25 by Michael Fagan
Nice job. When I get around to updating http://www.faganfinder.com/misc/site.shtml , I will definitely add this in.
The research prototype search engine Yuntis does similarity work also. Check out http://yuntis-usb.ecsl.cs.sunysb.edu/help/queries/#SimilarLists to start with.
>On a semi-related note, I really want to work for Google...
I'm sure quite a few people do. They're hiring now, but you probably knew that.