ryochiji's blog
Brought to you fresh from the depths of Ryo Chijiiwa


 



Mon, April 28, 2003

More BlogMatcher...
I've got a prototype of the BMD (BlogMatcher Daemon) up and running, and I've also integrated it with a PHP front end. The BMD basically has all the link <-> blog relationships stored in RAM in a 350 million node graph, allowing it to do searches in well under 0.1 second. Right now, the slowest part seems to be the PHP front end, which still does most of the scoring calculations.

One of the benefits of having a graph in memory is that I can do all kinds of things. For example, it's just as easy to search for all blogs that have a particular link, and in theory, it's also possible to do all kinds of "shortest path" types of calculations (i.e. shortest path from one blog to another).
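
As a rough illustration of the idea (not the actual BMD code, which isn't shown here), a link <-> blog graph can be kept as two lookup tables, and a "shortest path" between two blogs is then a breadth-first search that alternates between blogs and links. The class name `LinkGraph` and its methods are made up for this sketch.

```python
from collections import defaultdict, deque


class LinkGraph:
    """Hypothetical in-memory graph: blogs on one side, link URLs on the other."""

    def __init__(self):
        self.links_of = defaultdict(set)    # blog -> links it contains
        self.blogs_with = defaultdict(set)  # link -> blogs that contain it

    def add(self, blog, link):
        self.links_of[blog].add(link)
        self.blogs_with[link].add(blog)

    def blogs_linking_to(self, link):
        """All blogs that have a particular link."""
        return set(self.blogs_with.get(link, ()))

    def shortest_path(self, start, goal):
        """Breadth-first search over blog -> link -> blog hops.

        Returns the alternating list [blog, link, blog, ...] or None.
        Nodes are tagged ("blog", name) / ("link", url) so names cannot collide.
        """
        if start == goal:
            return [start]
        prev = {("blog", start): None}
        queue = deque([("blog", start)])
        while queue:
            node = queue.popleft()
            kind, name = node
            if kind == "blog":
                neighbours = [("link", l) for l in self.links_of.get(name, ())]
            else:
                neighbours = [("blog", b) for b in self.blogs_with.get(name, ())]
            for nxt in neighbours:
                if nxt in prev:
                    continue
                prev[nxt] = node
                if nxt == ("blog", goal):
                    # Walk the predecessor chain back to the start.
                    path = []
                    while nxt is not None:
                        path.append(nxt[1])
                        nxt = prev[nxt]
                    return path[::-1]
                queue.append(nxt)
        return None
```

With both tables in memory, "which blogs have this link" is a single dictionary lookup, which is presumably why that kind of search is just as cheap as the reverse one.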

Having said that, I want to work on the link scoring algorithm now. With blogs that have a lot of links to various sites, the results are next to useless. I need to somehow figure out a way to determine which links are significant and which links aren't...

I just thought of something. Maybe one way to do this would be to track "link shares" over time. The basic idea is that a certain percentage of links will always be to certain sites, like Google or Slashdot, and that percentage will probably remain fairly constant. On the other hand, a hot new site or an interesting article is more likely to pop up suddenly and receive a lot of attention, then fade away again. Maybe I need to start logging that information...
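
One possible way to sketch the "link share" idea: take a snapshot of all links seen in each period, compute each link's share of the total, and then measure how much that share moves around. The function names and the choice of coefficient of variation as the "how steady is it" measure are assumptions of this sketch, not something BlogMatcher does.

```python
from collections import Counter
from statistics import mean, pstdev


def link_shares(snapshots):
    """snapshots: one list of link URLs per period (e.g. per day).

    Returns {link: [share of all links in period 0, period 1, ...]}.
    """
    counts = [Counter(s) for s in snapshots]
    all_links = set().union(*counts) if counts else set()
    shares = {}
    for link in all_links:
        shares[link] = [
            c[link] / sum(c.values()) if c else 0.0  # empty period -> share 0
            for c in counts
        ]
    return shares


def burstiness(series):
    """Standard deviation divided by mean: near 0 for a steady share,
    larger for a link that spikes and then fades."""
    m = mean(series)
    return pstdev(series) / m if m else 0.0
```

A link whose share stays flat across periods would come out with a burstiness near zero, while something that shows up in only one period would come out much higher.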



Knock on wood everybody (yes, you too)

I just sent my resume to Google... I sent it to their Great People address instead of the one the individual job listings show. I figured that if they think I'm a great person, it might be worth it... otherwise I might as well stay in school for a few years and try again.

I wonder if I'm supposed to post stuff like this in a blog... I mean, someone from Google might look at it and go "Ah hah!"...

Well, I don't know what they'd say after "Ah hah!" so I guess I'll just risk it.



"Downloads done right"
Apple unveiled its music store today. Summary:
  • Integrated with iTunes
  • 99cents per song
  • AAC format
  • Unrestricted CD burning
I actually might go buy a few songs, just to show support for the general idea. Since you can burn CDs then rip the CD, I don't think they're really trying to prevent piracy. I think the point is that they're:
  1. giving people (well, Mac users) the option of legally downloading music
  2. making sure the RIAA gets some money
At the end of the day, the RIAA doesn't give a hoot about piracy... they just want more money. Give 'em more money and they'll shut up.



Links...

Michael Fagan asks: "Maybe you should exclude the consistently popular links"

I thought this was worth talking about... Before I begin, let me present to you Exhibit A. The file contains a list with the number of occurrences on the left side and link URLs on the right side, sorted by frequency. To a certain degree, ignoring the most popular links sounds like it might work. The top few links go to sites like Blogger/Blogspot and MovableType, which certainly hold little actual meaning. You go down the list, and you see that the same applies for links like google.com or the W3C validator.

But then, at position number 13, like a truck in thick fog, a link to The Onion appears without warning. What should we do? Can we disregard that? Personally, I don't think so. Even though a link to blogdex (a few places further down) doesn't hold much significance, I think a link to The Onion does say a thing or two about the blog (and blog author(s)).

Anyway, that's the main reason why I'm not so sure weeding out the top N popular links would work. That doesn't mean I have a better solution either... Perhaps what I need is some sort of feedback loop, so people can democratically determine which links are "significant" and which ones aren't. If I could add in a feedback mechanism and feed it to an AI algorithm, that might work... Hmm.

Time to hit Crtich for his AI book.



Cracked!
Okay, I think I just cracked the link scoring algorithm:

S(l) = D(l) * log(M) / log(N(l))

Where:
S(l) = score of link
D(l) = depth of link
M = number of occurrences of most popular link
N(l) = number of occurrences of link l

So, if there's a link that you share with just a few other blogs, that link could get a score of perhaps 14, while the more popular links may only get a score of 2 or 3. I haven't implemented the actual scoring algorithm, but I did calculate log(N(l)) for a bunch of links and it looks very promising.
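
For the curious, the formula translates almost directly into code. This is only a sketch: the function name `link_score` is made up, "depth" is treated as a plain numeric multiplier because its exact meaning isn't spelled out here, and the value M = 10000 in the usage note is a hypothetical number, not a real count.

```python
import math


def link_score(depth, occurrences, max_occurrences):
    """S(l) = D(l) * log(M) / log(N(l)).

    depth           -- D(l), treated here as a plain multiplier (assumption)
    occurrences     -- N(l), how many times this link occurs
    max_occurrences -- M, occurrences of the most popular link
    """
    if occurrences < 2:
        # log(1) is 0, so the formula as written would divide by zero.
        # How a link seen only once should be scored is left open.
        raise ValueError("formula is undefined for fewer than 2 occurrences")
    # The base of the logarithm cancels out in the ratio.
    return depth * math.log(max_occurrences) / math.log(occurrences)
```

With a hypothetical M of 10000 and a depth of 1, a link that occurs twice scores about 13.3, while one that occurs 100 times scores about 2, which is the kind of spread described above. The most popular link itself always scores exactly D(l).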

For anyone who cares, the solution was bloody obvious. I graphed the occurrence rate of all the links, and the graph looked a lot like stuff I saw a couple of years ago in calculus. So I tried to remember all the different things we did to those poor numbers (some of which I dare not mention in public) and came up with log().

Maybe I'll do my master's thesis on this. What? I have to get my B.S. first? Drats... I hate it when that happens. I guess they call it "BS" for a reason.



BlogMatcher Link Scores

I've implemented the new link scoring algorithm, and it's currently running on the BlogMatcher system.

Does it work? I'm not sure. The results generated are more or less similar, but some results now show up much higher than they did before (and some lower). The new algorithm seems to be effective in "curving" the scores appropriately, and "rare" links are being scored much higher than they would've been in the old system.

The problem now (still) is that some popular links that also have significance are being scored lower than they should be. Also, I'm not entirely convinced that some other blog that happens to have the same "rare" link should score _that_ high. Logically, the combination of having links to Slashdot, Wired, and The Onion seems to be more indicative of common interest than one common link, as "rare" as that link may be.
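
To make the problem concrete, here is a sketch that assumes a match between two blogs is scored by simply adding up the scores of the links they share. That summing rule, the function names, and every occurrence count below are made up for illustration; they are not BlogMatcher's actual data or matching logic.

```python
import math


def link_score(depth, occurrences, max_occurrences):
    """S(l) = D(l) * log(M) / log(N(l)); needs occurrences >= 2."""
    return depth * math.log(max_occurrences) / math.log(occurrences)


def match_score(links_a, links_b, occurrences):
    """Sum the scores of the links two blogs have in common.

    occurrences maps link -> N(l); depth is fixed at 1 for simplicity.
    """
    m = max(occurrences.values())
    shared = set(links_a) & set(links_b)
    return sum(link_score(1, occurrences[link], m) for link in shared)


# Hypothetical occurrence counts, purely for illustration.
counts = {
    "blogger.com": 10000,
    "slashdot.org": 5000,
    "wired.com": 3000,
    "theonion.com": 2000,
    "example.org/rare-page": 2,
}

popular_trio = ["slashdot.org", "wired.com", "theonion.com"]
one_rare = ["example.org/rare-page"]

trio_match = match_score(popular_trio, popular_trio, counts)
rare_match = match_score(one_rare, one_rare, counts)
```

Under these made-up numbers, the single rare link ends up outweighing all three popular links combined, which is exactly the behavior that seems wrong.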

My conclusion, for now, is that the problem hasn't been cracked yet after all. I'm still going to have to figure out a way to weigh links based not just on popularity/rarity, but also on contextual/topical significance.



Ryo Chijiiwa

I'm a biologically Japanese, culturally American, Germany-raised, socially liberal, politically independent, gun-totin', code writin' dude. My life is currently sponsored by Google.