I just ran a prototype of the word indexer (called wordex!) and got some pretty impressive results. It took at least a few minutes to go through all 8000+ blogs, and at some point I got worried enough to start up top to monitor its resource usage. Memory usage climbed about 1 MB every few seconds, topping out just shy of 30 MB. If you want to see the raw data, feel free to download here.
Some numbers:
- 8376 blogs
- 9,651,095 words
- 424,725 distinct words
- Single-occurrence words: 267,674
- Top 10 Words are:
- the (526158)
- and (257368)
- that (128991)
- for (98236)
- this (72287)
- you (72279)
- with (65559)
- was (65303)
- have (58183)
- but (53217)
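For anyone curious how numbers like these can be produced, here is a minimal sketch in Python. It is not the wordex code, just one way to get the same kinds of counts. The names index_words, summarize, and WORD_RE are made up for illustration, and the tokenizer (lowercased runs of letters and apostrophes) is an assumption. The top 10 above contains no words shorter than three letters, so wordex may have applied a minimum length; the hypothetical min_len parameter lets you try that.

```python
import re
from collections import Counter

# Assumed tokenizer: a "word" is a run of letters or apostrophes, lowercased.
WORD_RE = re.compile(r"[a-z']+")


def index_words(texts, min_len=1):
    """Count word occurrences across an iterable of blog texts.

    min_len is a hypothetical knob: words shorter than this are skipped.
    """
    counts = Counter()
    for text in texts:
        words = WORD_RE.findall(text.lower())
        counts.update(w for w in words if len(w) >= min_len)
    return counts


def summarize(counts, top_n=10):
    """Derive the summary figures from a Counter of word frequencies."""
    return {
        "total_words": sum(counts.values()),
        "distinct_words": len(counts),
        "single_occurrence": sum(1 for n in counts.values() if n == 1),
        "top": counts.most_common(top_n),
    }


if __name__ == "__main__":
    # Tiny stand-in corpus; the real run covered thousands of blogs.
    sample = ["The cat and the hat.", "The dog saw the cat."]
    stats = summarize(index_words(sample), top_n=2)
    for key, value in stats.items():
        print(key, value)
```

A Counter keeps one entry per distinct word, so memory grows with the vocabulary rather than with the total word count, which fits the slow, steady climb in memory during the run.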
Some interesting words:
- Current events
- People
- Browser wars
- Operating Systems
- Companies
- Bad words
Posted Thu, April 24, 2003 11:37 by
Can it tell us how many degrees (links) of separation there are between a blog and a mention of Kevin Bacon?