Thu, April 24, 2003
BlogMatcher... Stuck!
Okay, I'm stuck. I've kinda been working on the word indexer, but I've hit a wall. You see, right now I have a program that can extract all words from an HTML file, and I was hoping to write a program that extracts just the significant words. My initial plan was to count the occurrences of each word to weed out insignificant words. But after writing the code and looking at the results, I realized that was a poor approach. There were common (insignificant) words that only appeared once or twice, and there were significant words that occurred frequently.
So I came up with Plan B. Plan B is to gather occurrence rate information for all words in all blogs, then compare the occurrence rate of each word within a blog to the statistical average. So, for example, let's say there's a word that, on average (across all blogs), appears once every 1000 words. Then, let's say that in one particular blog, the word appears once every 200 words. Chances are, that word has at least some significance, and this approach also lets me come up with a number indicating the word's significance.
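A minimal sketch of Plan B in Python, assuming the significance number is simply the ratio of a word's rate within one blog to its average rate across all blogs (the post does not say how the number would be computed, so the ratio is an assumption, and the names `occurrence_rates` and `significance` are made up for illustration):

```python
from collections import Counter


def occurrence_rates(words):
    """Return {word: share of all words} for one list of words."""
    total = len(words)
    counts = Counter(words)
    # An empty list yields an empty Counter, so no division by zero happens.
    return {word: count / total for word, count in counts.items()}


def significance(blog_words, global_rates):
    """Score each word in a blog against its average rate across all blogs.

    A score well above 1 means the word shows up more often in this blog
    than on average. Words missing from global_rates are skipped, since
    there is nothing to compare them against.
    """
    local_rates = occurrence_rates(blog_words)
    return {
        word: rate / global_rates[word]
        for word, rate in local_rates.items()
        if word in global_rates
    }
```

With the numbers from the example above (once every 1000 words on average, once every 200 words in one blog), the score comes out at roughly 5.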
What's the problem? The problem is this: there's a lot of data. Right now, I have over 8000 blogs, or nearly 200MB worth, indexed, and I'm not sure how I could count occurrence rates for each individual word. I mean, I know how to do it... I'm just not sure how to do it without sacrificing either accuracy or server resources. The thing is, the occurrence rates for all the data have to be calculated at once, instead of incrementally. If I stat one chunk today, then a different chunk tomorrow, it's possible (likely) that the stuff I stat'd today would have changed. So, assuming that I have to do it all at once, the problem then becomes how to go through millions of words without killing the system.
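One way the all-at-once pass could look, as a sketch only: stream through the blogs one at a time and keep just a running count per distinct word. It assumes each blog's extracted text is available as an iterable of lines (an open file, for instance); `iter_words` and `global_rates` are hypothetical names, and the word pattern is a crude stand-in for the real HTML word extractor.

```python
import re
from collections import Counter

# Crude word pattern for illustration; the real extractor may differ.
WORD_RE = re.compile(r"[a-z']+")


def iter_words(lines):
    """Yield lowercase words one at a time from an iterable of text lines."""
    for line in lines:
        yield from WORD_RE.findall(line.lower())


def global_rates(blogs):
    """Compute {word: share of all words} in a single pass over every blog.

    blogs is an iterable of line iterables, e.g. open file objects.
    Only the counter stays in memory: one entry per distinct word,
    not one per occurrence.
    """
    counts = Counter()
    total = 0
    for lines in blogs:
        for word in iter_words(lines):
            counts[word] += 1
            total += 1
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

The number of distinct words usually grows far more slowly than the total word count, so memory use should stay well below the size of the raw data, though that would need to be measured on the actual index.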
Maybe I'm just being paranoid... Maybe the server will be able to handle a few million words without any problems. Perhaps I should just give it a shot and go "oops" if something blows up. Hm... we do have that dual 1.25GHz G4 server at work. I wish I had some money so I could buy myself a couple of decent machines to do this kind of thing... I wish, I wish.
Ryo Chijiiwa
I'm a biologically Japanese, culturally American, Germany-raised, socially liberal, politically independent, gun-totin', code writin' dude. My life is currently sponsored by Google.
Posted Thu, April 24, 2003 19:17 by Michael Fagan http://www.faganfinder.com/
Why not just list the stop words manually? That's the way most people do it. Not perfect of course, but possibly not any worse than by algorithm and easier to do. I even went to the trouble of finding this list for you. Of course, there are a lot of non-English stop words too.
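For comparison, the manual stop-word approach suggested in this comment can be sketched in a few lines of Python. The `STOP_WORDS` set here is a tiny made-up sample, not the list the comment refers to, and `significant_words` is a hypothetical name:

```python
# Tiny illustrative sample; a real stop-word list would be much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}
)


def significant_words(words, stop_words=STOP_WORDS):
    """Drop every word that appears in the stop list (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]
```

For example, `significant_words(["The", "indexer", "is", "stuck"])` keeps only `"indexer"` and `"stuck"`.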