Just for the fun of it, I applied the purely statistical (well, quasi-statistical) string analysis to mail subjects, to see if it could produce some kind of ham-ness vs spam-ness index. I rewrote the pre-calculation code so that it spits out a gigantic array containing the second-order probabilities to an include file. I then ran this against a fairly large (1000+) corpus of spam and ham subject headers. Once the second-order probabilities for spam and ham subject lines had been compiled, I wrote code to go through an entire folder and calculate the spam-ness vs ham-ness of every subject line. Depending on the ratio between spam-ness and ham-ness (and the thresholds set at both ends), each message was categorized as "spam", "ham", or "unknown" (e.g. spam if spam-ness/ham-ness > 1.5, ham if < 0.67, unknown if somewhere in between).
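For the curious, here is a minimal Python sketch of the idea, not my actual code. It assumes "second-order probability" means the probability of a character given the one before it, and it compares the per-pair geometric mean under the spam and ham models. The function names, the add-one smoothing, and the `alphabet_size` value are all made up for illustration.

```python
import math
from collections import Counter

def bigram_counts(subjects):
    """Count every adjacent character pair across a list of subject lines."""
    counts = Counter()
    for s in subjects:
        s = s.lower()
        for a, b in zip(s, s[1:]):
            counts[a + b] += 1
    return counts

def make_scorer(subjects, alphabet_size=256):
    """Build a scorer returning the mean log P(next char | previous char)."""
    pair_counts = bigram_counts(subjects)
    first_counts = Counter()
    for pair, n in pair_counts.items():
        first_counts[pair[0]] += n

    def score(subject):
        s = subject.lower()
        pairs = list(zip(s, s[1:]))
        if not pairs:
            return 0.0  # nothing to score; treated as neutral
        total = 0.0
        for a, b in pairs:
            # Add-one smoothing (an assumption) so unseen pairs never give log(0).
            p = (pair_counts[a + b] + 1) / (first_counts[a] + alphabet_size)
            total += math.log(p)
        return total / len(pairs)

    return score

def classify(subject, spam_score, ham_score, upper=1.5, lower=0.67):
    """Label a subject from the spam-ness/ham-ness ratio and two thresholds."""
    ratio = math.exp(spam_score(subject) - ham_score(subject))
    if ratio > upper:
        return "spam"
    if ratio < lower:
        return "ham"
    return "unknown"
```

Tightening `upper` and `lower` toward 1 shrinks the "unknown" band at the cost of more misclassifications, which is the trade-off the two threshold settings play with.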
The results weren't too conclusive, as might have been expected. With thresholds of 1.5 and 0.67, it decided that one of my spam folders contained 14 spam, 0 ham, and 158 indeterminates. For my INBOX at the same thresholds, it found 14 ham, 6 spam, and 355 indeterminates. With thresholds of 1.25 and 0.8, my INBOX has 122 ham, 16 spam, and 237 indeterminates, while my spambox contains 5 ham, 53 spam, and 114 indeterminates. So, it basically doesn't really work.
Then I implemented a Bayesian approach (à la Paul Graham's idea), but that appears to have failed miserably. I think my next step is to do Bayesian filtering with actual words as tokens, not two-character sequences.
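As a rough sketch of what that next step might look like (untested on real mail, and only loosely following Graham's scheme): each word gets a spam probability from how often it shows up in spam vs ham subjects, and the most "interesting" words are combined into one score. The function names and the tokenizer regex are my own placeholders. The 0.4 default for unseen words, the 0.01/0.99 clamps, and the 15-token cutoff are borrowed from Graham's essay, while his doubling of ham counts and minimum-occurrence rule are left out for brevity.

```python
import re
from collections import Counter

def tokenize(subject):
    """Split a subject into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9'$-]+", subject.lower()))

def train(spam_subjects, ham_subjects):
    """Estimate a per-token spam probability from two lists of subjects."""
    spam_counts, ham_counts = Counter(), Counter()
    for s in spam_subjects:
        spam_counts.update(tokenize(s))
    for s in ham_subjects:
        ham_counts.update(tokenize(s))
    n_spam = max(len(spam_subjects), 1)
    n_ham = max(len(ham_subjects), 1)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        b = spam_counts[tok] / n_spam  # fraction of spam subjects with tok
        g = ham_counts[tok] / n_ham    # fraction of ham subjects with tok
        # Clamp so that no single token is ever treated as absolute proof.
        probs[tok] = min(0.99, max(0.01, b / (b + g)))
    return probs

def spam_probability(subject, probs, unknown=0.4, top_n=15):
    """Combine the top_n token probabilities farthest from 0.5."""
    ps = [probs.get(t, unknown) for t in tokenize(subject)]
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_p = prod_q = 1.0
    for p in ps[:top_n]:
        prod_p *= p
        prod_q *= 1.0 - p
    return prod_p / (prod_p + prod_q)
```

A subject with no tokens at all comes out at exactly 0.5, so it would land in the "unknown" band under any symmetric pair of thresholds.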