ryochiji's blog
Brought to you fresh from the depths of Ryo Chijiiwa


 


Thu, April 24, 2003

BlogMatcher... Stuck!
Okay, I'm stuck. I've been working on the word indexer, but I've hit a wall. Right now I have a program that can extract all the words from an HTML file, and I was hoping to write one that extracts just the significant words. My initial plan was to count the occurrences of each word and use the counts to weed out insignificant words. But after writing the code and looking at the results, I realized that was a poor approach. Some common (insignificant) words appeared only once or twice, and some significant words occurred frequently.
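For illustration, here is a minimal sketch of that first step (pull the words out of an HTML file and count them). It is not the actual indexer code: the names TextExtractor, WORD_RE, and count_words are made up for this example, and the rule for what counts as a "word" (lowercase letters, apostrophes, hyphens) is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element.
        if self._skip_depth == 0:
            self.chunks.append(data)


# Assumed tokenization: a letter followed by letters, apostrophes, or hyphens.
WORD_RE = re.compile(r"[a-z][a-z'-]*")


def count_words(html):
    """Return a Counter mapping each lowercase word in the page text to its count."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return Counter(WORD_RE.findall(text))
```

Raw counts like these are exactly what turned out to be a poor signal on their own: a count says how often a word appears in one blog, but not whether that is unusual.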

So I came up with Plan B. Plan B is to gather occurrence-rate information for all words in all blogs, then compare the occurrence rate of each word within a blog to the average across all blogs. So, for example, let's say there's a word that, on average (across all blogs), appears once every 1000 words. Then, let's say that in one particular blog, the word appears once every 200 words. Chances are that word has at least some significance, and this approach also gives me a number indicating the word's significance (in this example, the word is five times as frequent as average).
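Plan B could be sketched roughly like this. It assumes the per-blog and all-blogs word counts already exist as Counter objects; the function name significance and the choice of a simple ratio as the score are assumptions for illustration, not a settled design.

```python
from collections import Counter


def significance(blog_counts, global_counts):
    """Score each word by (rate in this blog) / (rate across all blogs).

    A score above 1 means the word is more frequent in this blog than average.
    Words missing from global_counts are skipped to avoid dividing by zero.
    """
    blog_total = sum(blog_counts.values())
    global_total = sum(global_counts.values())
    scores = {}
    for word, n in blog_counts.items():
        g = global_counts[word]
        if g > 0:
            # Same as (n / blog_total) / (g / global_total), rearranged
            # so there is only one division.
            scores[word] = (n * global_total) / (g * blog_total)
    return scores


# The example from the text: once per 1000 words overall, once per 200 here.
global_counts = Counter({"duck": 10, "the": 9990})  # 10 in 10,000 words
blog_counts = Counter({"duck": 5, "the": 995})      # 5 in 1,000 words
scores = significance(blog_counts, global_counts)
```

With these made-up counts, "duck" scores 5 while "the" scores just under 1, which is the kind of separation the raw counts couldn't provide.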

What's the problem? The problem is this: there's a lot of data. Right now, I have over 8000 blogs, or nearly 200MB worth, indexed, and I'm not sure how I could count occurrence rates for each individual word. I mean, I know how to do it... I'm just not sure how to do it without sacrificing either accuracy or server resources. The thing is, the occurrence rate for all the data has to be calculated at once, instead of incrementally. If I stat one chunk today, then a different chunk tomorrow, it's possible (likely) that the stuff I stat'd today would have changed. So, assuming that I have to do it all at once, the problem then becomes how to go through millions of words without killing the system.
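One way a single full pass might stay cheap is to stream through the blogs one at a time and keep only the running totals in memory. The sketch below assumes each blog has already been reduced to a plain-text file in one directory; that layout, the *.txt naming, and the function name count_corpus are all hypothetical.

```python
from collections import Counter
from pathlib import Path
import re

# Assumed tokenization: a letter followed by letters, apostrophes, or hyphens.
WORD_RE = re.compile(r"[a-z][a-z'-]*")


def count_corpus(blog_dir):
    """Count every word across all *.txt files in blog_dir in a single pass.

    Files are read line by line, so only the running totals (one entry per
    distinct word) are held in memory, never the full text of the corpus.
    """
    totals = Counter()
    for path in sorted(Path(blog_dir).glob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:
                totals.update(WORD_RE.findall(line.lower()))
    return totals
```

Memory here grows with the number of distinct words rather than the total number of words, so a corpus of millions of words doesn't necessarily need millions of entries.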

Maybe I'm just being paranoid... Maybe the server will be able to handle a few million words without any problems. Perhaps I should just give it a shot and go "oops" if something blows up. Hm... we do have that dual 1.25GHz G4 server at work. I wish I had some money so I could buy myself a couple of decent machines to do this kind of thing... I wish, I wish.



Awww...

I like cute and fuzzy things. I also like rubber duckies. With that in mind, take a look at these.



BlogMatcher discussions
I did a Google search for blogmatcher and got a few hits. Considering that a couple of fairly big bloggers are talking about it, it just might be that I'm on to something. Apparently the famous gnome-girl took it for a spin as well, according to comments posted on Tim Swanson's blog.



Fascinating...

I just ran a prototype of the word indexer (called wordex!) and got some pretty impressive results. It took at least a few minutes to go through all 8000+ blogs, and at some point, I started getting worried so I started up top to monitor its resource usage. The memory usage went up about a MB every few seconds, topping off just shy of 30MB. If you want to see the raw data, feel free to download here.

Some numbers:

  • 8376 blogs
  • 9,651,095 words
  • 424,725 distinct words
  • Single-occurrence words: 267,674
  • Top 10 Words are:
    1. the (526158)
    2. and (257368)
    3. that (128991)
    4. for (98236)
    5. this (72287)
    6. you (72279)
    7. with (65559)
    8. was (65303)
    9. have (58183)
    10. but (53217)

Some interesting words:
Current events:
  • war: 14132
  • iraq: 10296
  • oil: 2251
  • sars: 1802
People:
  • bush: 5828
  • saddam: 3891
  • jesus: 1740
  • aragorn: 60 (gandalf: 21, frodo: 22, legolas: 16, gimli: 5)
Browser wars:
  1. Mozilla: 284
  2. Safari: 227
  3. Explorer: 177
  4. Camino: 94
Operating Systems:
  1. windows: 1419
  2. mac: 820 (macos: 45, macosx: 15)
  3. linux: 559
  4. solaris: 51
  5. openbsd: 44
  6. freebsd: 30
  7. bsd: 14
Companies:
  • google: 1864
  • microsoft: 941
  • apple: 924
  • yahoo: 770
  • aol: 292
  • sony: 284
  • msn: 212
  • wal-mart: 169 (walmart: 71)
  • riaa: 118
  • redhat: 38
  • mpaa: 31
Bad words:
  • shit: 1792
  • ass: 1487
  • fuck: 1281
  • bitch: 636
  • asshole: 186
  • distinct tokens that contain "fuck": 271



BlogMatcher on MetaFilter!

BlogMatcher is on MetaFilter! Hairyeyeball calls it an "awesome new social-networking gewgaw".



Ryo Chijiiwa

I'm a biologically Japanese, culturally American, Germany-raised, socially liberal, politically independent, gun-totin', code writin' dude. My life is currently sponsored by Google.