Archive for April, 2006

Beating Traffic

Monday, April 24th, 2006

From Brandon Hansen’s blog: how he analyzed his personal commute times in order to make the best use of his time on the road.

The idea is to minimize time in the car without changing his standard work schedule too much. It is clear that Brandon put a substantial amount of effort into this analysis, and his results are presented in a straightforward manner. I also found the U.S. Census report about commute times interesting, along with the other reference materials Brandon uncovered during his initial data gathering.

Keep Things Exceptional

Thursday, April 20th, 2006

A recent story about a driver circumventing the traffic lights on his way to work reminds us that security does not work without monitoring. Exceptional events should be rare, which means it should be reasonable to keep track of when they occur.

In this instance the problem was solved when people noted that the same car seemed to be around when the traffic lights were behaving abnormally. There is no reason that this could not have been noticed sooner by an automated feedback system. With a proper automated system, it should be simple to note the rise in occurrence of these exceptional events (he was going to work every day!) and notify a person who can decide whether it is worth further investigation. Even if this person were to decide it is not, the system could notify them again later when it becomes apparent that the events are occurring at regular intervals. And even if that is dismissed, the logging of events at least gives a person somewhere to start an investigation should they note the abnormal behavior independently of the system, as was the case in this situation.
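The kind of feedback described above can be sketched in a few lines of Ruby. This is only an illustration, not a description of the actual traffic system: the ExceptionMonitor class, its threshold, its window, and its tolerance are all hypothetical names and values picked for the example, and events are assumed to arrive in chronological order.

```ruby
# Hypothetical sketch: log exceptional events and flag suspicious patterns.
class ExceptionMonitor
  attr_reader :events

  # threshold: events per window before a person is notified (assumed value)
  # window:    length of the look-back window, in seconds (assumed: one week)
  # tolerance: allowed drift, in seconds, when testing for regular intervals
  def initialize(threshold: 3, window: 7 * 86_400, tolerance: 3_600)
    @threshold = threshold
    @window = window
    @tolerance = tolerance
    @events = [] # every event is kept so a person can investigate later
  end

  # Record one exceptional event (time in seconds since the epoch)
  # and return the list of alerts that it triggers.
  def record(time)
    @events << time
    alerts = []
    alerts << :frequent if recent_count(time) >= @threshold
    alerts << :regular if regular?
    alerts
  end

  private

  # Number of logged events inside the window that ends at `time`.
  def recent_count(time)
    @events.count { |t| t > time - @window && t <= time }
  end

  # True when the gaps between the last four events are nearly equal.
  def regular?
    return false if @events.size < 4

    gaps = @events.last(4).each_cons(2).map { |a, b| b - a }
    gaps.max - gaps.min <= @tolerance
  end
end
```

A caller might invoke monitor.record(Time.now.to_i) each time the lights are overridden and route any returned alerts to a person. Even when every alert is dismissed, the events list remains as a log to investigate later.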

Hopefully such feedback is part of the planned upgrade they mentioned. More security does not really seem necessary, just better feedback.

An Engaging Presentation Style

Friday, April 14th, 2006

How do you give a 15-minute presentation on a technical subject and keep the audience engaged and interested? I came across this presentation by Dick Hardt, the CEO of Sxip, a software security company headquartered in Vancouver. Sxip stands for “Simple, eXtensible Identity Protocol”, and is pronounced “Skip” in case you were wondering.

What is really interesting to see is that Dick uses hundreds of slides in a 15-minute presentation, leaving each slide on the screen for no more than a couple of seconds. The slides don’t contain flashy diagrams or reams of 10pt bulleted lists; instead, with a refined simplicity, each contains only a few words or a simple picture. Investigating further, I learned that this presentation style originated with Stanford law professor Lawrence Lessig, and is known fondly as the “Lessig Method”.

While this unique approach may not be appropriate for all situations, it certainly gives us a sense of how PowerPoint can be used to effectively complement a talk, rather than replacing the talk with words that are read off the screen.

Why not give it a try, even if just for a part of your presentation next time? See if you can grab the audience the way Dick was able to!

Sun’s Open Sourced Modeling Tools

Thursday, April 13th, 2006

A good bit of news out of Silicon Valley today: Sun is releasing open source UML software in a bid to compete with the IBM Eclipse development project. Having two large tech companies making significant open source contributions is an important step towards expanding the reach of analytics software.

Don’t Summarize Away Everything

Tuesday, April 11th, 2006

When presenting results of analysis, it is very important to make sure statistics are presented along with their constraints. Leaving details out may make for an easier read, but it could very well leave the reader misinformed. Such questionable presentation of statistical information might also prejudice a critical reader against the writer. An article about legislation to close a road in Golden Gate Park a second day each week provides an example:

The academy sees 10 percent fewer visits on Sundays than it does on Saturdays, the closed roads making the difference, Kilduff said.

Now it could be that the road closure does account for the whole difference in attendance at the museum. But complex problems like museum attendance usually involve more than one variable, so readers with some scientific background will find it hard to believe that the number presented is accurate. Readers who do not pick up on the simplification of a complex problem might now be under the impression that attendance at the museum on Sunday will go up at least 10% if the road is opened back up on that day.

If we assume that the statement is correct and a result of an unbiased study, only poorly stated, all that need be done to clarify it is a slight rewording:

The academy has attributed a 10 percent drop in attendance on Sundays in comparison to Saturdays to the road closures alone.

The Horrors of Poor Visualization

Thursday, April 6th, 2006

Data visualization expert Howard A. Spielman wrote a recent article in BI Review Magazine that accurately describes one of the biggest problems that arises from feature-rich “business intelligence” tools that are incredibly weak at helping users communicate: poor graph design can confuse and distort your message to the point of misinformation. Simple is always better. Don’t let the tool get in the way of your message! If you take a step back and decide what you are trying to say before you begin to create your chart or graph, I promise you will have better results.

Quick ETL

Wednesday, April 5th, 2006

If you have a large number of log files that you need to process, you might consider writing a parsing script in Ruby.

A lot of server logs come in CSV files that are gzipped. Rather than going through a process involving unzipping and parsing files one at a time or configuring an ETL tool to do it for you, it may be easier and faster to just write a script that does everything you need all at once. Consider the following Ruby snippet:

require 'zlib'
require 'csv'

# Stream rows straight out of the gzipped file; no separate unzip step.
Zlib::GzipReader.open('data.csv.gz') do |gz|
  CSV.new(gz).each do |row|
    # do something interesting!
  end
end

Two lines of code and it is already time to add your business logic!

Given that most ETL tools are complicated and/or expensive, writing a simple script often seems the path of least resistance, especially when you just need to load the data to do some analysis and not set up an ongoing processing system.
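As an illustration of what that business logic might look like, the sketch below tallies the values in one column of a gzipped CSV log. The helper name count_by_column and the column layout used in the usage note (timestamp, path, status) are assumptions made for the example, not a format any particular server produces.

```ruby
require 'zlib'
require 'csv'

# Hypothetical helper: tally the values found in one column of a gzipped CSV.
def count_by_column(path, index)
  counts = Hash.new(0)
  Zlib::GzipReader.open(path) do |gz|
    CSV.new(gz).each do |row|
      value = row[index]
      counts[value] += 1 unless value.nil? # skip blank or short rows
    end
  end
  counts
end
```

With the assumed layout, count_by_column('data.csv.gz', 2) would return a hash mapping each status code to the number of rows that carry it.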

But where are they coming from?

Monday, April 3rd, 2006

When analyzing web data, it can be useful to map IP addresses back to a geographic location. Fortunately for those of us who can’t afford to buy commercial databases with this information, public projects like Host IP are picking up the slack.

One thing that makes Host IP particularly interesting is that you can download the whole IP database for your own use.

Certainly accuracy is a problem in a data set like this. But, so long as you account for that, you can still get a general idea of geographic distribution of requests. Perhaps the most useful information you can determine with certainty is which country traffic is coming from, which can help you with planning for site development.
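A country-level lookup of this kind can be sketched as follows. The sketch assumes the downloaded data has already been reduced to a list of non-overlapping address ranges with a country code each. That layout is an assumption and not necessarily how Host IP ships its database. The addresses below are reserved documentation ranges, the codes are placeholders, and the names RANGES, country_for, and country_counts are hypothetical.

```ruby
require 'ipaddr'

# Hypothetical range table: [first address, last address, country code].
# Real rows would come from the downloaded database; these are placeholders.
RANGES = [
  ['192.0.2.0',    '192.0.2.255',    'AA'],
  ['198.51.100.0', '198.51.100.255', 'BB'],
  ['203.0.113.0',  '203.0.113.255',  'CC']
].map { |first, last, code| [IPAddr.new(first).to_i, IPAddr.new(last).to_i, code] }
 .sort_by(&:first)

# Return the country code for an IPv4 address, or nil when no range matches.
def country_for(ip)
  n = IPAddr.new(ip).to_i
  # Binary search for the first range whose last address is >= n.
  match = RANGES.bsearch { |_first, last, _code| last >= n }
  match && match[0] <= n ? match[2] : nil
end

# Tally requests per country; unmatched addresses are grouped under nil,
# which keeps the gaps in the data visible instead of hiding them.
def country_counts(ips)
  ips.each_with_object(Hash.new(0)) { |ip, counts| counts[country_for(ip)] += 1 }
end
```

Counting the unmatched addresses separately is one simple way to account for the accuracy problem: the size of the nil bucket shows how much of the traffic the table could not place.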

Play Ball

Sunday, April 2nd, 2006

One of the fun things about baseball is how prevalent analysis is in the fan community. Analysis isn’t really possible without data, so it is great to know there are publicly available sources of player performance data.

That fans of baseball enjoy poring over the numbers is no secret. So it is disappointing that MLB itself does not make more data readily available to its fans. The league is lucky there are independent sites out there like the Baseball Archive filling that gap.

Opening day is Monday, play ball!

Census Time

Sunday, April 2nd, 2006

Say what you will about the government, but when they put their minds to it they can produce some very useful data. The folks at the Census Bureau in particular seem to be good at sharing their data in a way that is useful to people outside their organization. Perhaps that comes from their having been in this business for a while…

The metadata they provide along with the data is the key. It gives you insight into the process that produced the data, which is always important information to have when analyzing any data set.