Tuesday, June 3, 2008

Curiosity is a funny thing

For various reasons, many to do with my instinctive distrust of university IT departments (what can I say, I've been around academia for too long), I find myself responsible for our department's web server.
The web-server hosts various things, such as the course pages, a wiki for the South African OR society and a few other things.

Unsurprisingly, perusing the logs reveals a great deal of referrer spam, and, equally unsurprisingly, I have a set of rewrite rules to deny access to obviously bad referrers, along with various automated steps to keep the blacklist updated and some manual processes, since not everything can be automated. This does have the side effect that I have an automatically updated record of referrer spam in our logs. Which is where curiosity kicks in and asks "I wonder if there are any interesting stats to be gathered from this?"

Perhaps fortunately, the data isn't well organised for analysis - I can't readily extract temporal information or frequencies of repeated patterns without correlating stuff against the logs, which is more effort than it's worth. Some simple keyword grepping is enough to at least roughly break things down into categories, though, which is already interesting.
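
A rough sketch of that kind of keyword breakdown, in Python. The keyword lists and the names `CATEGORIES`, `categorise` and `tally` are assumptions made up for illustration, not the actual patterns behind the numbers below:

```python
from collections import Counter

# Hypothetical keyword lists, checked in order; the first match wins.
CATEGORIES = [
    ("gambling", ["poker", "blackjack"]),
    ("insurance", ["insurance"]),
    ("loans and debt", ["loan", "debt", "bills"]),
    ("pills", ["viagra"]),
    ("porn", ["porn"]),
    ("tobacco", ["cigar"]),  # also matches "cigarette"
]


def categorise(referrer):
    """Return the first category whose keyword occurs in the referrer."""
    text = referrer.lower()
    for name, keywords in CATEGORIES:
        if any(word in text for word in keywords):
            return name
    return "other"


def tally(referrers):
    """Map each category to its percentage share of the referrers."""
    counts = Counter(categorise(r) for r in referrers)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Plain substring matching like this is about as crude as grep: it will miss creative spellings (say, "black-jack") unless they are added to the lists, which is fine for a rough breakdown and not much more.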

So, based pretty much on the junk seen so far this year, dip's stats are:

Leading the way, at a little under 35%, are gambling related terms, with poker the big winner, followed by various spellings of blackjack.

Insurance surprisingly comes next - accounting for just over 20%. Medical and health insurance are the major keywords, with the vehicle insurance terms running a fairly distant third.

Next, at just under 10%, are various loan and debt management terms. Bills and debt are the most popular keywords here, with loans close behind.

Viagra and such account for around 7.5%.

Porn does surprisingly badly, only accounting for around 5% of the cases - with a huge variety of terms used.

The rest is a mish-mash of music sites, link sites and various other junk, with nothing really worth singling out into a single category, although 1% is taken up by cigar and cigarette related links, which I find somewhat bizarre.

Well, at least I'm no longer that curious - in due course, no doubt, the bug will bite again, and I'll do the correlation against the logs, in which case, dear innocent reader, you shall be confronted by graphs (and even possibly pie-charts).
