Sun’s Open Sourced Modeling Tools
Thursday, April 13th, 2006 | Justin Bugajski | No Comments
A good bit of news out of Silicon Valley today: Sun is releasing open source UML software in a bid to compete with the IBM Eclipse development project. Having two large tech companies make significant open source contributions is an important step towards expanding the reach of analytics software.
Don’t Summarize Away Everything
Tuesday, April 11th, 2006 | Nick Bugajski | No Comments
When presenting the results of an analysis, it is very important to present statistics along with their constraints. Leaving details out may make for an easier read, but it could very well leave the reader misinformed. Such questionable presentation of statistical information might also prejudice a critical reader against the writer. An article about legislation to close a road in Golden Gate Park a second day each week provides an example:
The academy sees 10 percent fewer visits on Sundays than it does on Saturdays, the closed roads making the difference, Kilduff said.
Now, it could be that the road closure does indeed account for the entire difference in attendance at the museum. But complex problems like museum attendance usually involve more than one variable, which makes it hard for those with some scientific background to believe that the number presented is accurate. Readers who do not pick up on the simplification might now be under the impression that Sunday attendance at the museum will go up by at least 10% if the road is opened back up on that day.
If we assume that the statement is correct and a result of an unbiased study, only poorly stated, all that need be done to clarify it is a slight rewording:
The academy has attributed a 10 percent drop in attendance on Sundays in comparison to Saturdays to the road closures alone.
The Horrors of Poor Visualization
Thursday, April 6th, 2006 | Justin Bugajski | No Comments
Data visualization expert Howard A. Spielman recently wrote an article in BI Review Magazine that accurately describes one of the biggest problems with feature-rich “business intelligence” tools that are incredibly weak at helping users communicate: poor graph design can confuse and distort your message to the point of misinformation. Simple is always better. Don’t let the tool get in the way of your message! If you take a step back and decide what you are trying to say before you begin to create your chart or graph, I promise you will have better results.
Quick ETL
Wednesday, April 5th, 2006 | Nick Bugajski | No Comments
If you have a large number of log files that you need to process, you might consider writing a parsing script in Ruby.
A lot of server logs come in CSV files that are gzipped. Rather than going through a process involving unzipping and parsing files one at a time or configuring an ETL tool to do it for you, it may be easier and faster to just write a script that does everything you need all at once. Consider the following Ruby snippet:
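require 'zlib'
require 'csv'

Zlib::GzipReader.open('data.csv.gz') do |gz|
  # Stream the decompressed data through the CSV parser, one row at a time
  CSV.new(gz).each do |row|
    # do something interesting!
  end
end
Two requires, two lines of code, and it is already time to add your business logic!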
Given that most ETL tools are complicated and/or expensive, writing a simple script often seems the path of least resistance, especially when you just need to load the data for some analysis rather than set up an ongoing processing system.
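As a rough sketch of what that business logic might look like, the snippet below counts rows by the value of one column across several gzipped CSV files. The method name `count_by_column`, the file paths, and the column name are all hypothetical, and it assumes each file starts with a header row.

```ruby
require 'zlib'
require 'csv'

# Count rows per distinct value of one column across gzipped CSV files.
# `paths` and `column` are hypothetical inputs chosen for illustration.
def count_by_column(paths, column)
  counts = Hash.new(0)
  paths.each do |path|
    Zlib::GzipReader.open(path) do |gz|
      # headers: true treats the first line of each file as column names
      CSV.new(gz, headers: true).each do |row|
        counts[row[column]] += 1
      end
    end
  end
  counts
end
```

Something like `count_by_column(Dir.glob('logs/*.csv.gz'), 'status')` would then give a quick tally without unzipping anything to disk first.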
But where are they coming from?
Monday, April 3rd, 2006 | Nick Bugajski | No Comments
When analyzing web data, it can be useful to map IP addresses back to a geographic location. Fortunately for those of us who can’t afford to buy commercial databases with this information, there are public projects like Host IP picking up the slack.
One thing that makes Host IP particularly interesting is that you can download the whole IP database for your own use.
Certainly accuracy is a problem in a data set like this. But so long as you account for that, you can still get a general idea of the geographic distribution of requests. Perhaps the most useful thing you can determine with reasonable confidence is which country traffic is coming from, which can help you with planning for site development.
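A minimal sketch of the country-level tally, assuming the downloaded data has already been loaded into a sorted table of address ranges. The table below is made up: it uses reserved documentation address blocks with arbitrary country codes, and it does not reflect the actual Host IP file format.

```ruby
require 'ipaddr'

# Hypothetical range table: [first address, last address, country code].
# These are documentation-only address blocks, not real Host IP data.
RANGES = [
  ['192.0.2.0',    '192.0.2.255',    'US'],
  ['198.51.100.0', '198.51.100.255', 'DE'],
  ['203.0.113.0',  '203.0.113.255',  'JP']
].map { |first, last, cc| [IPAddr.new(first).to_i, IPAddr.new(last).to_i, cc] }.sort_by(&:first)

# Binary search for the range containing the address; nil when unmapped.
def country_for(ip)
  n = IPAddr.new(ip).to_i
  hit = RANGES.bsearch { |first, last, _| n < first ? -1 : (n > last ? 1 : 0) }
  hit && hit[2]
end

# Tally requests per country, keeping unmapped addresses visible.
def tally_countries(ips)
  counts = Hash.new(0)
  ips.each { |ip| counts[country_for(ip) || 'unknown'] += 1 }
  counts
end
```

Counting unmapped addresses under their own 'unknown' key, rather than dropping them, is one way to account for the gaps in the data when reading the results.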
Play Ball
Sunday, April 2nd, 2006 | Nick Bugajski | No Comments
One of the fun things about baseball is how prevalent analysis is in the fan community. Analysis isn’t really possible without data, so it is great to know there are publicly available sources of player performance data.
That fans of baseball enjoy poring over the numbers is no secret. So it is disappointing that MLB itself does not make more data readily available to its fans. The league is lucky there are independent sites out there like the Baseball Archive filling that gap.
Opening day is Monday, play ball!
Census Time
Sunday, April 2nd, 2006 | Nick Bugajski | No Comments
Say what you will about the government, but when they put their minds to it they can produce some very useful data. The folks at the Census Bureau in particular seem to be good at sharing their data in a way that is useful to people outside their organization. Perhaps that comes from their having been in this business for a while…
The meta-data they provide along with the data is the key. This gives you insight into the process that produced the data, which is always important information to have when analyzing any data set.
Random Data
Friday, March 31st, 2006 | Nick Bugajski | No Comments
New sources of quasi-random data are interesting. But if one must go to such great lengths to gather that data, its value in a one-time encryption process has to be questioned. This is especially true given the fragility of the data source.
Too Much Information
Friday, March 31st, 2006 | Nick Bugajski | No Comments
When dealing with sensitive information, it can’t be stressed enough how important it is to clean up once you are done with it. Stories about machines unintentionally keeping too much data remind us to be vigilant about cleaning up and to keep only what we absolutely need.





