Sun’s Open Sourced Modeling Tools

Thursday, April 13th, 2006 | Justin Bugajski | No Comments

A good bit of news out of Silicon Valley today: Sun is releasing open source UML software in a bid to compete with the IBM Eclipse development project. Having two large tech companies make significant open source contributions is an important step toward expanding the reach of analytics software.

Don’t Summarize Away Everything

Tuesday, April 11th, 2006 | Nick Bugajski | No Comments

When presenting the results of an analysis, it is very important to present statistics along with their constraints. Leaving details out may make for an easier read, but it could very well leave the reader misinformed. Such questionable presentation of statistical information might also prejudice a critical reader against the writer. An article about legislation to close a road in Golden Gate Park a second day each week provides an example:

The academy sees 10 percent fewer visits on Sundays than it does on Saturdays, the closed roads making the difference, Kilduff said.

Now it could be that the road closure does account for the entire difference in museum attendance. But complex problems like museum attendance usually involve more than one variable, so readers with some scientific background will find it hard to believe that the number presented is accurate. Readers who do not pick up on the simplification might come away with the impression that Sunday attendance at the museum will go up at least 10% if the road is reopened on that day.

If we assume that the statement is correct and a result of an unbiased study, only poorly stated, all that need be done to clarify it is a slight rewording:

The academy has attributed a 10 percent drop in attendance on Sundays in comparison to Saturdays to the road closures alone.

The Horrors of Poor Visualization

Thursday, April 6th, 2006 | Justin Bugajski | No Comments

Data visualization expert Howard A. Spielman recently wrote an article in BI Review Magazine that accurately describes one of the biggest problems with feature-rich “business intelligence” tools that are incredibly weak at helping users communicate: poor graph design can confuse and distort your message to the point of misinformation. Simple is always better; don’t let the tool get in the way of your message! If you take a step back and decide what you are trying to say before you begin to create your chart or graph, I promise you will have better results.

Quick ETL

Wednesday, April 5th, 2006 | Nick Bugajski | No Comments

If you have a large number of log files that you need to process, you might consider writing a parsing script in Ruby.

A lot of server logs come in CSV files that are gzipped. Rather than going through a process involving unzipping and parsing files one at a time or configuring an ETL tool to do it for you, it may be easier and faster to just write a script that does everything you need all at once. Consider the following Ruby snippet:

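require 'zlib'
require 'csv'

# Stream the gzipped file and parse each line as one CSV row.
Zlib::GzipReader.open('data.csv.gz') do |gz|
  gz.each_line do |line|
    row = CSV.parse_line(line) # assumes no newlines inside quoted fields
    # do something interesting!
  end
end

A few lines of code and it is already time to add your business logic!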

Given that most ETL tools are complicated and/or expensive, writing a simple script is often the path of least resistance, especially when you just need to load the data for some analysis rather than set up an ongoing processing system.
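As a rough illustration of what that "business logic" might look like, here is a minimal sketch that counts how often each value appears in one column across a set of gzipped CSV logs. The method name `tally_column` and the choice of column are assumptions made for the example, not part of any particular log format.

```ruby
require 'zlib'
require 'csv'

# Count how often each value appears in one column across several
# gzipped CSV files. The method name and column index are illustrative.
def tally_column(paths, column_index)
  counts = Hash.new(0)
  paths.each do |path|
    Zlib::GzipReader.open(path) do |gz|
      gz.each_line do |line|
        row = CSV.parse_line(line) # assumes no newlines inside quoted fields
        next if row.nil? || row[column_index].nil?
        counts[row[column_index]] += 1
      end
    end
  end
  counts
end
```

Something like `tally_column(Dir.glob('logs/*.csv.gz'), 2)` would then process a whole directory in one pass; the directory name and column index here are hypothetical.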

But where are they coming from?

Monday, April 3rd, 2006 | Nick Bugajski | No Comments

When analyzing web data, it can be useful to map IP addresses back to a geographic location. Fortunately for those of us who can’t afford to buy commercial databases with this information, public projects like Host IP are picking up the slack.

One thing that makes Host IP particularly interesting is that you can download the whole IP database for your own use.

Certainly, accuracy is a problem in a data set like this. But so long as you account for that, you can still get a general idea of the geographic distribution of requests. Perhaps the most useful information you can determine reliably is which country traffic is coming from, which can help with planning for site development.
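To make the country-level idea concrete, here is a minimal sketch of tallying requests by country from a table of IP ranges. Everything about the table is an assumption: the layout (first address, last address, country code), the addresses (reserved documentation blocks), and the country codes are invented for illustration, and the actual format of the downloadable database is not shown here.

```ruby
require 'ipaddr'

# Hypothetical lookup table: [first address, last address, country code],
# sorted by first address. The address blocks are documentation-only and
# the country codes are made up for this example.
RANGES = [
  ['192.0.2.0',    '192.0.2.255',    'US'],
  ['198.51.100.0', '198.51.100.255', 'DE'],
  ['203.0.113.0',  '203.0.113.255',  'JP']
].map { |first, last, country| [IPAddr.new(first).to_i, IPAddr.new(last).to_i, country] }

# Binary search for the range containing the address; nil when unmapped.
def country_for(ip, ranges = RANGES)
  n = IPAddr.new(ip).to_i
  # Index of the first range starting after n; the candidate is the one before it.
  after = ranges.bsearch_index { |first, _last, _country| first > n } || ranges.length
  return nil if after.zero?

  _first, last, country = ranges[after - 1]
  n <= last ? country : nil
end

# Tally requests per country, keeping unmapped addresses visible.
def requests_by_country(ips)
  ips.each_with_object(Hash.new(0)) do |ip, counts|
    counts[country_for(ip) || 'unknown'] += 1
  end
end
```

Counting unmapped addresses under `'unknown'` rather than dropping them is one way to account for the gaps in a data set like this, since it shows how much of the traffic the table could not place.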

Play Ball

Sunday, April 2nd, 2006 | Nick Bugajski | No Comments

One of the fun things about baseball is how prevalent analysis is in the fan community. Analysis isn’t really possible without data, so it is great to know there are publicly available sources of player performance data.

That fans of baseball enjoy poring over the numbers is no secret. So it is disappointing that MLB itself does not make more data readily available to its fans. The league is lucky there are independent sites out there like the Baseball Archive filling that gap.

Opening day is Monday. Play ball!

Census Time

Sunday, April 2nd, 2006 | Nick Bugajski | No Comments

Say what you will about the government, but when they put their minds to it they can produce some very useful data. The folks at the Census Bureau in particular seem to be good at sharing their data in a way that is useful to people outside their organization. Perhaps that comes from their having been in this business for a while…

The metadata they provide along with the data is the key. It gives you insight into the process that produced the data, which is always important to have when analyzing any data set.

Random Data

Friday, March 31st, 2006 | Nick Bugajski | No Comments

New sources of quasi-random data are interesting. But if one must go to such great lengths to gather that data, its value in a one-time encryption process has to be questioned. This is especially true given the fragility of the data source.

Too Much Information

Friday, March 31st, 2006 | Nick Bugajski | No Comments

When dealing with sensitive information, it can’t be stressed enough how important it is to clean up after you are done with it. Stories about machines unintentionally keeping too much data remind us to stay vigilant and keep only what we absolutely need.
