Saturday, June 21, 2008

Python Sprint Day Take 2

In terms of overall productivity, this was less successful than the first. Partly this was because the break for the talks ate up a lot of time, which made it much harder to build momentum, and partly because, with the recent beta release, the number of clear issues was much reduced. The checkin that broke floatobject for some of us didn't help either, as people got sidetracked into getting working Python 2.6 builds.

Still, we prodded a couple of issues from the last bug day, and a couple of new bugs were filed about odd bits in the code (mainly by Simon), so there was some positive progress.

The CTPUG meeting part of the day was a definite success. The localisation talk took some time to get going, and it would have been nice if Dwayne had been able to stay longer so that we could have explored some of the discussion avenues a bit further. Simon's PyObject talk was interesting, especially touching on some of the differences between py3k and the Python 2.x series.

Tuesday, June 3, 2008

Curiosity is a funny thing

For various reasons, many to do with my instinctive distrust of university IT departments (what can I say, I've been around academia for too long), I find myself responsible for our department's web-server.
The web-server hosts various things, such as the course pages, a wiki for the South African OR society and a few other things.

Unsurprisingly, perusing the logs reveals a great deal of referrer spam, so I have a set of rewrite rules to deny access to obviously bad referrers, various automated steps in place to keep the blacklist updated, and some manual processes, since not everything can be automated. This does have the side effect that I have an automatically updated record of referrer spam in our logs. Which is where curiosity kicks in and asks "I wonder if there are any interesting stats to be gathered from this?"

Perhaps fortunately, the data isn't well organised for analysis - I can't readily extract temporal information or frequencies of repeated patterns without correlating stuff against the logs, which is more effort than it's worth, but some simple keyword grepping is enough to at least roughly break things down into categories, which is already interesting.
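For the curious, the sort of keyword grepping I mean can be sketched roughly as below. This is only an illustration, not the actual scripts: the keyword lists, the category names and the functions categorise and breakdown are all made up for the example, and plain substring matching like this would miss the more creative spellings.

```python
from collections import Counter

# Hypothetical keyword lists, loosely based on terms mentioned in this post.
# The real blacklist is not shown here, so treat these as placeholders.
CATEGORIES = {
    "gambling": ["poker", "blackjack"],
    "insurance": ["insurance"],
    "loans": ["loan", "debt", "bills"],
    "pharma": ["viagra"],
}


def categorise(referrer, categories=CATEGORIES):
    """Return the first category with a keyword in the referrer, else 'other'."""
    text = referrer.lower()
    for name, keywords in categories.items():
        if any(keyword in text for keyword in keywords):
            return name
    return "other"


def breakdown(referrers, categories=CATEGORIES):
    """Map each category to its percentage share of the given referrers."""
    counts = Counter(categorise(r, categories) for r in referrers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Feeding breakdown a list of blacklisted referrer strings gives a rough percentage per category. A referrer that matches more than one list is counted only under the first matching category, which is good enough for a rough split.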

So, based pretty much on the junk seen so far this year, dip's stats are:

Leading the way, at a little under 35%, are gambling related terms, with poker the big winner, followed by various spellings of blackjack.

Insurance surprisingly comes next - accounting for just over 20%. Medical and health insurance are the major keywords, with the vehicle insurance terms running a fairly distant third.

Next, at just under 10%, are various loan and debt management terms. Bills and debt are the most popular keywords here, with loans close behind.

Viagra and such account for around 7.5%.

Porn does surprisingly badly, only accounting for around 5% of the cases - with a huge variety of terms used.

The rest is a mish-mash of music sites, link sites and various other junk, with nothing really worth singling out into a single category, although 1% is taken up by cigar and cigarette related links, which I find somewhat bizarre.

Well, at least I'm no longer that curious - in due course, no doubt, the bug will bite again, and I'll do the correlation against the logs, in which case, dear innocent reader, you shall be confronted by graphs (and even possibly pie-charts).