ryochiji's blog
Brought to you fresh from the depths of Ryo Chijiiwa


 

Sat, June 19, 2004

Cool Summer Project #1

This is a little warmup project, so it's not one of my "cool summer projects". But it's still pretty cool.
Anyway, I realized that a lot of spam is sent from addresses that are obviously randomly generated. Obvious, that is, to the human eye. So, I figured I'd apply something I learned in Information Theory, and see if I could find a way to determine whether or not an email address was randomly generated.

First, I extracted email addresses from several thousand non-spam (ham) messages in my account. Then I extracted the user names (i.e. the portion before the @) and calculated second-order probabilities, that is, the probability of each two-character sequence occurring. For example, the probability of the sequence 'RY' is 0.069, which is the product of the probability of an 'R' occurring and the probability that the 'R' is followed by a 'Y'.
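As a rough sketch, the counting step might look something like this in Python. The function name `bigram_probabilities` and the lowercasing of user names are assumptions made for illustration; the post doesn't say how case, digits, or punctuation were handled.

```python
from collections import Counter

def bigram_probabilities(usernames):
    """Estimate P(a) * P(b | a) for every adjacent character pair (a, b).

    Hypothetical helper: user names are lowercased before counting,
    which is an assumption, not something the post specifies.
    """
    char_counts = Counter()   # how often each character occurs at all
    first_counts = Counter()  # how often each character is followed by another
    pair_counts = Counter()   # how often each adjacent pair occurs

    for name in usernames:
        name = name.lower()
        char_counts.update(name)
        for a, b in zip(name, name[1:]):
            first_counts[a] += 1
            pair_counts[a + b] += 1

    total_chars = sum(char_counts.values())
    probs = {}
    for pair, n in pair_counts.items():
        a = pair[0]
        p_a = char_counts[a] / total_chars   # probability of 'a' occurring
        p_b_given_a = n / first_counts[a]    # probability 'a' is followed by 'b'
        probs[pair] = p_a * p_b_given_a
    return probs

# Toy corpus; duplicates are kept on purpose, since repeats are what
# shift the probabilities toward the addresses a user actually sees.
toy_probs = bigram_probabilities(["ryo", "ryo", "max"])
```

Pairs that never occur in the corpus simply don't appear in the returned dictionary.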

Then, to calculate the randomness of a word, I summed up the second-order probabilities for each letter in the word and took the mean. The resulting randomness index is really the average probability that each letter was NOT randomly generated, so a lower score means the word is MORE likely to be random...
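A minimal sketch of the scoring step, under some assumptions the post leaves open: here the mean is taken over adjacent character pairs, and a pair that never appeared in the corpus counts as probability 0. The name `randomness_index` and the toy probability table are made up for illustration (only the 0.069 for 'ry' comes from the example above), so the result won't match the real scores.

```python
def randomness_index(username, pair_probs):
    """Mean second-order probability over the adjacent pairs in username.

    Assumptions: unseen pairs score 0.0, and a name with fewer than
    two characters (no pairs at all) also scores 0.0.
    """
    name = username.lower()
    pairs = [a + b for a, b in zip(name, name[1:])]
    if not pairs:
        return 0.0
    return sum(pair_probs.get(p, 0.0) for p in pairs) / len(pairs)

# Hypothetical probability table; the 'yo' value is invented.
toy_table = {"ry": 0.069, "yo": 0.03}

familiar = randomness_index("ryo", toy_table)    # both pairs are known
unfamiliar = randomness_index("qzx", toy_table)  # no known pairs
```

With this toy table, `familiar` comes out higher than `unfamiliar`, which matches the idea that a low index means "more likely random".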

Some examples:
ryo             0.046809373533349
james           0.036362542427477
alii            0.028019466941281
jesus           0.024827154390081
max             0.023151473175686
ryochiji        0.023043750343737
gwen            0.021219886375799
jonathan        0.020438100318045
palevsky        0.017368422240704
chijiiwa        0.016507969162423
matthew         0.011459321235289
mheaozvmffc     0.0054842668481834
nbavqwjjgqzna   0.0022731872639314
qnaibpzojwtpo   0.0018295307791367

I included multiple instances of the same email address in the original corpus because, essentially, that's how the "learning" happens. For most people, the odds of a 'y' following an 'r' are low, but since I send myself a lot of email, the odds improve, meaning my name looks less random. Conversely, since this account doesn't get much email to or from addresses containing "chijiiwa" or even "ryochiji", both of those scored significantly lower.

Conclusion
This very simplistic approach of analyzing email addresses and comparing them to a known corpus seems to work fairly well. In the examples above, all non-random user names scored 0.011 or higher, while the 3 random strings (taken from real spam sender addresses) all scored below 0.006. This suggests that the method can distinguish random from non-random email user names, at least on this small sample. Additionally, since the randomness index can be computed very quickly once the second-order probabilities have been calculated and stored, this approach may be useful in spam filters tailored specifically to individual users.
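To make the gap between the two groups concrete, here is a small sketch that applies a cutoff to the scores listed above. The cutoff of 0.008 is purely an assumption, picked because it falls between 0.006 and 0.011; the post doesn't propose a threshold, and a real filter would want one tuned on far more data. The name `looks_random` is likewise hypothetical.

```python
# Scores copied from the examples listed in the post.
SCORES = {
    "ryo": 0.046809373533349,
    "james": 0.036362542427477,
    "alii": 0.028019466941281,
    "jesus": 0.024827154390081,
    "max": 0.023151473175686,
    "ryochiji": 0.023043750343737,
    "gwen": 0.021219886375799,
    "jonathan": 0.020438100318045,
    "palevsky": 0.017368422240704,
    "chijiiwa": 0.016507969162423,
    "matthew": 0.011459321235289,
    "mheaozvmffc": 0.0054842668481834,
    "nbavqwjjgqzna": 0.0022731872639314,
    "qnaibpzojwtpo": 0.0018295307791367,
}

THRESHOLD = 0.008  # assumed cutoff, not from the post

def looks_random(score, threshold=THRESHOLD):
    """A low index means the user name is more likely random."""
    return score < threshold

flagged = sorted(name for name, score in SCORES.items() if looks_random(score))
print(flagged)
```

On these fourteen names, only the three strings taken from spam sender addresses fall below the assumed cutoff.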



Ryo Chijiiwa

I'm a biologically Japanese, culturally American, Germany-raised, socially liberal, politically independent, gun-totin', code writin' dude. My life is currently sponsored by Google.