ryochiji's blog
Brought to you fresh from the depths of Ryo Chijiiwa


 
Powered by
IlohaBlog

Section: All | News & Politics | Geek Stuff | Devel | Non-existent Life | Random | Food! | Life |

Archives: 2003 > 04

Fri, April 25, 2003

Gigantic Graphs

I've been thinking... BlogMatcher has nearly 10,000 blogs indexed, and the search is taking longer (still well under 1second in most cases). The problem is, the current search algorithm is O(n), which means search times grow in direct proportion to the number of blogs indexed. That's not good.

So, I need to figure out an algorithm that's O(log n). The hard part isn't the algorithm, people have already figured that one out. Basically, I just need to create a gigantic graph that contains all the data. The problem is, the graph is going to be gigantic, and it's going to take time to read in all the data from disk and put together the graph.

One idea is to create a server daemon that performs searches. The problem is, the daemon will need to be a multiprocess program, which means it needs to fork() for every new session, and with MBs worth of data in memory, it's going to take time to fork(). To get around this, one idea is to fork a bunch of extra processes on startup, so that there's little-or-no overhead for every new session (that's what Apache does). But then, the problem is, I don't know how to do that. Forking after accept() is easier because the parent doesn't have to do anything. But I don't know how to make the parent toss off a connection to an existing child.

Hm... But that's the only solution I can think of. How does Google do it?

UPDATE: On second thought, maybe it doesn't take _that_ long to fork a process with MBs worth of data. After all, I'm not Google, and I' not making Apache. I'm not aiming to do searches in <0.1 seconds, just <0.5.



Grrr...

I've implemented the core of what's to be the next search program, but it's turning out to be more difficult than I expected. In fact, I even had to write pseudo code before I got it right, and some of you know how often that happens (i.e, never).

As if that wasn't bad enough, I ran it and... what do you mean did it work? of course it did... its memory usage went peaked at 46MB. Here are the global variables I'm using:

vector <string> 		blog_vector;
map<string, int, ltstr> 	blog_map;
vector<string>			link_vector;
map<string, int, ltstr> 	link_map;

map<string, vector<int>, ltstr>  link_blog;
map<string, vector<int>, ltstr>  blog_link;

I don't have the exact numbers right now, but there are nearly 10,000 blogs, and roughly 1 million links. So, yeah, maybe it's not all that surprising.

What's Plan B, you ask? I modified wordex to extract all links that appear in at least two blogs, so I'm going go use those links and ignore the rest. That should cut down the number of links to <500k.



Ryo Chijiiwa

I'm a biologically Japanese, culturally American, Germany-raised, socially liberal, politically independent, gun-totin', code writin' dude. My life is currently sponsored by Google.
www.flickr.com
This is a Flickr badge showing public photos and videos from ryochiji. Make your own badge here.