This is a little warmup project, so it's not one of my "cool summer projects". But it's still pretty cool.
Anyway, I realized that a lot of spam is sent from addresses that are obviously randomly generated. Obvious, that is, to the human eye. So, I figured I'd apply something I learned in Information Theory, and see if I could find a way to determine whether or not an email address was randomly generated.
First, I extracted email addresses from several thousand non-spam (ham) messages in my account. Then I took the user names (i.e., the portion before the @) and calculated second-order probabilities, that is, the probability of each two-character sequence occurring. For example, the probability of the sequence 'RY' is 0.069, which is the product of the probability of an 'R' occurring and the probability that the 'R' would be followed by a 'Y'.
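As a rough illustration of that step, here is a minimal Python sketch. It assumes the product form described above, P(ab) = P(a) × P(b | a), estimated from simple counts. The function name `second_order_probabilities` and the tiny `corpus` list are made up for the example, and the exact counting details (lowercasing, which characters to keep) are guesses rather than something this post specifies.

```python
from collections import Counter

def second_order_probabilities(usernames):
    """Estimate P(ab) = P(a) * P(b | a) for each adjacent character pair."""
    char_counts = Counter()   # how often each character occurs at all
    first_counts = Counter()  # how often each character starts a pair
    pair_counts = Counter()   # how often each two-character sequence occurs
    for name in usernames:
        name = name.lower()   # assumption: case is ignored
        char_counts.update(name)
        # zip pairs each character with its successor: "ryo" -> ry, yo
        for a, b in zip(name, name[1:]):
            pair_counts[a + b] += 1
            first_counts[a] += 1
    total_chars = sum(char_counts.values())
    probs = {}
    for pair, count in pair_counts.items():
        a = pair[0]
        p_a = char_counts[a] / total_chars      # P(a)
        p_b_given_a = count / first_counts[a]   # P(b | a)
        probs[pair] = p_a * p_b_given_a
    return probs

# Hypothetical toy corpus; the real one was several thousand ham messages.
corpus = ["ryo", "ryo", "max"]
probs = second_order_probabilities(corpus)
```

Repeated names are deliberately counted every time they appear, so frequent correspondents pull their letter pairs upward.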
Then, to calculate the randomness of a word, I summed the second-order probabilities for each letter in the word and took the mean. The randomness index is actually the average probability that each letter was NOT randomly generated, so a low score means the word is MORE likely to be random...
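The scoring step might look something like the sketch below. It assumes the mean is taken over the adjacent character pairs in the word, and that a pair never seen in the corpus counts as probability 0; neither detail is spelled out here, so treat both as assumptions. The name `randomness_index` and the `toy_probs` table are hypothetical (only the 0.069 for "ry" comes from this post).

```python
def randomness_index(username, pair_probs):
    """Mean second-order probability over the adjacent pairs in username.

    A LOW value means the name looks MORE random.
    """
    name = username.lower()
    pairs = [a + b for a, b in zip(name, name[1:])]
    if not pairs:
        return 0.0  # assumption: names shorter than two characters score 0
    # Assumption: pairs missing from the table contribute probability 0.
    return sum(pair_probs.get(pair, 0.0) for pair in pairs) / len(pairs)

# Toy table: "ry" is the value quoted in the post, "yo" is invented.
toy_probs = {"ry": 0.069, "yo": 0.02}
score_known = randomness_index("ryo", toy_probs)
score_unknown = randomness_index("qzx", toy_probs)
```

With this toy table, "ryo" gets a positive score while "qzx" scores 0 because none of its pairs were ever seen.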
Some examples (user name, then randomness index, highest score first):
| ryo | 0.046809373533349 |
| james | 0.036362542427477 |
| alii | 0.028019466941281 |
| jesus | 0.024827154390081 |
| max | 0.023151473175686 |
| ryochiji | 0.023043750343737 |
| gwen | 0.021219886375799 |
| jonathan | 0.020438100318045 |
| palevsky | 0.017368422240704 |
| chijiiwa | 0.016507969162423 |
| matthew | 0.011459321235289 |
| mheaozvmffc | 0.0054842668481834 |
| nbavqwjjgqzna | 0.0022731872639314 |
| qnaibpzojwtpo | 0.0018295307791367 |
I included multiple instances of the same email address in the original corpus because, essentially, that's how the "learning" happens. For most people, the odds of a 'y' following an 'r' are low, but since I send myself a lot of email, the odds improve, meaning my name looks less random. Conversely, since this account doesn't get very much email to or from addresses containing "chijiiwa" or even "ryochiji", both of them scored significantly lower.
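To see the effect of keeping duplicates, here is a small sketch that compares a pair's share of all pairs with and without repeated addresses. It uses plain relative frequency instead of the full second-order product, purely to keep the example short, and the `mailbox` list is invented.

```python
from collections import Counter

def pair_frequency(usernames, pair):
    """Share of all adjacent character pairs in the corpus that equal `pair`."""
    counts = Counter()
    for name in usernames:
        counts.update(a + b for a, b in zip(name, name[1:]))
    total = sum(counts.values())
    return counts[pair] / total if total else 0.0

# Hypothetical mailbox where one sender shows up three times.
mailbox = ["ryo", "ryo", "ryo", "james", "max"]
with_repeats = pair_frequency(mailbox, "ry")
without_repeats = pair_frequency(sorted(set(mailbox)), "ry")
```

Here "ry" makes up a quarter of all pairs when repeats are kept but only an eighth once the list is deduplicated, which is the "learning" at work.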
Conclusion
This very simplistic approach of analyzing email addresses and comparing them to a known corpus seems to work fairly well. In the examples above, all non-random user names scored 0.011 or higher, while the 3 random strings (taken from real spam sender addresses) all scored below 0.006. This suggests that the method can distinguish random from non-random email user names, at least on this small sample. Additionally, since randomness can be calculated very rapidly once the second-order probabilities are calculated and stored, this approach may be useful in spam filters tailored specifically to individual users.
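In a filter, the final check could be as simple as a cutoff. The sketch below reuses a few scores from the examples above; the threshold of 0.008 is a hypothetical value picked from the gap between 0.006 and 0.011, not something tuned or tested, and `looks_random` is an illustrative name.

```python
# Scores copied from the examples earlier in this post.
scores = {
    "ryo": 0.046809373533349,
    "matthew": 0.011459321235289,
    "mheaozvmffc": 0.0054842668481834,
    "nbavqwjjgqzna": 0.0022731872639314,
    "qnaibpzojwtpo": 0.0018295307791367,
}

# Hypothetical cutoff sitting in the gap between 0.006 and 0.011.
THRESHOLD = 0.008

def looks_random(score, threshold=THRESHOLD):
    """Flag a user name as probably random when its score is below the cutoff."""
    return score < threshold

flagged = sorted(name for name, score in scores.items() if looks_random(score))
```

A real filter would probably treat this as one signal among several, since a per-user corpus will score unfamiliar but legitimate names low as well.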