Author Archive

The State of Things

Sunday, October 5th, 2008

I know I’ve made a mistake in the design of a REST style application if it can’t handle the same user logging on from two places at the same time. It is easy to assume that there is no good reason for a user to do such a thing and therefore dismiss the hard work of fixing any related problems. But that scenario is just a good way to test how your application behaves over an unreliable network (hereafter called “The Web”). Just as easy as ignoring that mean user with his two simultaneous sessions is ignoring the fact that The Web is unreliable. Any good web application design should account for that. The problems that arise here mostly have to do with the separation of application state from resource state.

As an example consider a blog where users must be logged in to comment. In this system there is a create comment page. This page adds the comment, when submitted, to the article that the user was viewing before he went to the create comment page.

If application state were managed on the server, there might be a variable stored in the user account that identifies the post that the user last viewed. This variable would be referenced when the user submits a comment, with the comment being added to whichever article that variable identifies.

Now, how can that go wrong? Say a user, let’s call him Steve, logs in to the blog. Steve reads an article that he would like to comment on, so Steve goes to the create comment page. While composing his comment, Steve remembers another post that was related that he would like to quote. So Steve opens a new tab and navigates to that post. Steve then goes back to the tab with the comment form, finishes it and clicks the submit button.

In this scenario, the comment, since it relied on the server side variable to determine which article the comment was for, would end up on the second article, the one Steve was viewing in the second tab, because that was the last viewed article, not the one he was actually trying to comment on.

In a better design, the article that a comment was meant for would be managed on the client side, since it is application state, possibly as an item in the comment form.
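The difference between the two designs can be sketched in a few lines of Ruby. This is only an illustration: `LAST_VIEWED`, `COMMENTS`, `view_article` and the two `submit_comment_*` methods are hypothetical in-memory stand-ins, not part of any real blog engine.

```ruby
# Hypothetical in-memory stand-ins for a blog's storage; none of these
# names come from a real framework.
LAST_VIEWED = {}                            # user => id of the article last viewed
COMMENTS    = Hash.new { |h, k| h[k] = [] } # article id => list of comments

def view_article(user, article_id)
  LAST_VIEWED[user] = article_id
end

# Fragile: the target article is application state held on the server.
def submit_comment_server_state(user, text)
  COMMENTS[LAST_VIEWED[user]] << text
end

# Better: the target article travels with the request, for example as a
# hidden field in the comment form.
def submit_comment_client_state(params)
  COMMENTS[params.fetch(:article_id)] << params.fetch(:text)
end

# Steve opens the comment form for article 1, then views article 2 in a new tab.
view_article("steve", 1)
form = { article_id: 1, text: "Nice post" } # form rendered while viewing article 1
view_article("steve", 2)

submit_comment_server_state("steve", form[:text]) # ends up on article 2
submit_comment_client_state(form)                 # ends up on article 1
```

The second version never asks the server what the client was doing, so it does not matter how many tabs or sessions are open.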

Obviously this is a simple example, but generally the ways in which this problem arises can be reduced to such a simple scenario. Real cases generally involve many more steps and much more complicated interactions, but they all involve requiring the server to know the current state of the client.

See also: State in Web application design

The Hypertext Constraint

Friday, September 5th, 2008

I have spent quite a bit of time recently designing a system in the REST style. Two main observations from that process:

  1. Web search is still pretty terrible once you need something more than a company’s homepage or a Wikipedia entry. There is a vast wealth of information dispensed by experts every day in venues that don’t attract large numbers of links. Finding that information is way more painful than it needs to be.
  2. 99% of people writing about the REST style seem to have totally missed the most important constraint: Hypermedia as the engine of application state (HATEOAS). If that concept were better understood by more people we might finally make some progress on things other than how to make pretty URLs. It took a significant amount of time to discover just how important that constraint is for the REST style (very, by the way, like, it’s everything, no HATEOAS, no REST, not even close), and that really slowed down the design process for me.
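For readers who haven’t run into the constraint, here is a rough Ruby sketch of the idea, under the assumption that a representation carries its own links. The link relations, URLs and the `href_for` helper are all made up for illustration.

```ruby
# A hypothetical representation of an article, as a client might receive it.
# The link relations and URLs are invented for this example.
article = {
  title: "The State of Things",
  links: [
    { rel: "self",     href: "/articles/42" },
    { rel: "comments", href: "/articles/42/comments" },
    { rel: "next",     href: "/articles/43" }
  ]
}

# The client chooses its next request by link relation instead of building
# the URL from out-of-band knowledge of the server's URL scheme.
def href_for(representation, rel)
  link = representation[:links].find { |l| l[:rel] == rel }
  link && link[:href]
end

puts href_for(article, "comments")
```

The point is that the server can change its URLs, or leave a link out when an action isn’t available, without the client needing to be rewritten.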

Here is some HATEOAS link love so as to do my part to improve the state of knowledge on REST in those link counting search engines

Don’t Forget About The Bots!

Tuesday, May 15th, 2007

It is occasionally necessary for us to take down one of our customer web sites in order to perform maintenance tasks. Most of the time this doesn’t last more than a few minutes, but if things go wrong, it could take much longer than that. These days we use Capistrano for deployment, which has built-in functionality to help easily disable a web site (disable_web). That provides reasonable feedback to users coming to the site, so they know what is happening. What it doesn’t cover are the machines accessing the site. Chances are those programs can’t tell from the maintenance page that the site is down.

It turns out that this problem was easy to fix: all that was needed was to get the web server to return the correct HTTP status code. As it was, the server was returning code 200: “OK”, when the more appropriate code would be 503: “Service Temporarily Unavailable.”

All our customer sites run with an Apache server sitting in front of the Rails sites. This allowed us to make the maintenance page a script rather than just a static HTML page.

  • To start, we implemented the custom maintenance pages described by Mike Clark.
  • Next we replaced the HTML version of the maintenance page with a PHP page.
  • And last was to add in a couple of custom headers into the PHP page so that it returned the correct information.

<?php
header("HTTP/1.1 503 Service Temporarily Unavailable");
header("Retry-After: 300");
?>

We included a Retry-After header of 5 minutes, assuming that the site will probably be back up soon.

And that’s it, a couple of changes and now our sites speak proper HTTP even when disabled, great!
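To show what this buys us on the other side, here is a rough Ruby sketch of how a crawler or monitoring script might interpret such a response. The `retry_delay` helper and the `DEFAULT_DELAY` fallback are assumptions for illustration, not taken from any particular bot.

```ruby
# Hypothetical helper for a crawler or monitoring script: given a response's
# status code and headers, return how many seconds to wait before retrying,
# or nil if the response is not a 503.
DEFAULT_DELAY = 60 # assumed fallback when Retry-After is missing or unusable

def retry_delay(status, headers)
  return nil unless status == 503

  value = headers["Retry-After"].to_s
  # Only the delay-in-seconds form is handled here. Retry-After may also
  # carry an HTTP date, which this sketch ignores.
  value.match?(/\A\d+\z/) ? value.to_i : DEFAULT_DELAY
end

p retry_delay(503, { "Retry-After" => "300" }) # our maintenance page
p retry_delay(200, {})                         # a normal response
```

With a 200 response, a program like this has no way to tell the maintenance page apart from real content.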

Swivel

Wednesday, December 6th, 2006

Swivel, a new data analysis website, has launched today. The founders like to refer to the site as YouTube for Data. The aim of the site is to get people to upload and analyze data on the site and then share and distribute the results of that analysis in the form of linked graphs, kind of like the embedded video players of the aforementioned YouTube.

I worry that this site is trying too hard to be everything to everyone. Covering all types of data and trying to do something intelligent with them seems to me to make it unlikely that the site will be an authority on any data set. Also of concern is the highly likely possibility of users producing all kinds of bad graphs that mean nothing, but are treated as evidence of something or other. Those who enjoy the kind of nonsense graphics that populate the likes of Time magazine will probably love this site for that very reason.

What is exciting to me is the possibility of this site becoming a central repository of quality metadata: the kinds of datasets that are most useful as additions to other datasets, such as lists of holidays or stock market closing days. If these types of datasets find a home on Swivel, perhaps they can get the ongoing updates, corrections and verifications that would make them very useful to the community.

Convert NetFlix Prize Data to CSV

Tuesday, October 3rd, 2006

Here is a simple Ruby script to convert the NetFlix Prize training data files into a single denormalized CSV file.

require "csv"

# make a movie lookup table, indexed by movie id: [year, title]
movies = []
File.foreach("movie_titles.txt") do |line|
  row = line.chomp.split(",", 3)
  movies[row.shift.to_i] = row
end

# read all the ratings files and denormalize into one csv file
CSV.open("ratings.txt", "w") do |out|
  Dir["training_set/mv_*.txt"].each do |file|
    File.open(file, "r") do |f|
      # first line is the movie id
      movie_id = f.gets.to_i
      # columns: customer id, rating, date, movie id, year, title
      rating = ["", "", "", movie_id, movies[movie_id]].flatten
      printf "%5d - %s\n", rating[3], rating[5]
      f.each_line do |line|
        rating[0..2] = line.chomp.split(",")
        out << rating
      end
    end
  end
end

Keep Things Exceptional

Thursday, April 20th, 2006

A recent story about a driver circumventing the traffic lights on his way to work reminds us that security does not work without monitoring. Exceptional events should be rare, which means it should be reasonable to keep track of when they occur.

In this instance the problem was solved when people noted that the same car seemed to be around when the traffic lights were behaving abnormally. There is no reason that this could not have been noticed sooner by an automated feedback system. With a proper automated system, it should be simple to note the rise in occurrence of these exceptional events (he was going to work every day!) and notify a person who can decide if it is meaningful or worth further investigation. Even if this person were to decide it is not, the system could notify them again later when it becomes apparent that the events are occurring at regular intervals. And even if that is dismissed, the logging of events at least gives a person somewhere to go for investigation should they note the abnormal behavior independently of the system, as was the case in this situation.
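As a rough sketch of the simplest form of such feedback, the Ruby below flags a rise in exceptional events inside a sliding window. The window, the threshold and the `suspicious?` name are arbitrary assumptions, and nothing here reflects how the actual traffic system works.

```ruby
# Hypothetical monitor: each exceptional event is recorded as a Time, and a
# person is notified when too many occur inside a sliding window.
# Both constants are arbitrary assumptions.
WINDOW_SECONDS = 7 * 24 * 60 * 60 # one week
THRESHOLD      = 3                # more events than this is worth a look

def suspicious?(event_times, now)
  recent = event_times.count { |t| t <= now && now - t <= WINDOW_SECONDS }
  recent > THRESHOLD
end

# Five weekday-morning events, such as a daily commute would produce.
day    = 24 * 60 * 60
monday = Time.utc(2006, 4, 17, 8, 0, 0)
events = (0..4).map { |i| monday + i * day }

puts suspicious?(events, monday + 5 * day)
```

A fuller version would also look for regular intervals between events, but even a plain count like this would have caught a driver doing the same thing every working day.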

Hopefully such feedback is part of the upgrade they mentioned will be implemented. More security does not really seem necessary, just better feedback.

Don’t Summarize Away Everything

Tuesday, April 11th, 2006

When presenting results of analysis, it is very important to make sure statistics are presented along with their constraints. Leaving details out may make for an easier read, but it could very well leave the reader misinformed. Such questionable presentation of statistical information might also prejudice a critical reader against the writer. An article about legislation to close a road in Golden Gate Park a second day each week provides an example:

The academy sees 10 percent fewer visits on Sundays than it does on Saturdays, the closed roads making the difference, Kilduff said.

Now it could be that the difference in attendance at the museum is indeed explained entirely by the road closure. But complex problems like museum attendance usually have more than one variable, making it hard for those with some scientific background to believe that the number presented is accurate. Those readers unable to pick up on the simplification of a complex problem might now be under the impression that attendance at the museum on Sunday will go up at least 10% if the road is opened back up on that day.

If we assume that the statement is correct and a result of an unbiased study, only poorly stated, all that need be done to clarify it is a slight rewording:

The academy has attributed a 10 percent drop in attendance on Sundays in comparison to Saturdays to the road closures alone.

Quick ETL

Wednesday, April 5th, 2006

If you have a large number of data logs that you need to process, you might consider writing a parsing script in Ruby.

A lot of server logs come in CSV files that are gzipped. Rather than going through a process involving unzipping and parsing files one at a time or configuring an ETL tool to do it for you, it may be easier and faster to just write a script that does everything you need all at once. Consider the following Ruby snippet:

require "csv"
require "zlib"

Zlib::GzipReader.open('data.csv.gz') do |gz|
  # CSV reads from the gzip stream row by row
  CSV.new(gz).each do |row|
    # do something interesting!
  end
end

Two lines of code and it is already time to add your business logic!
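As one possible example of that business logic, here is a small sketch that counts rows by the value in one column. The `tally_column` helper and the sample columns (date, path, status) are made up, and the sample file is written on the fly only so the snippet runs on its own.

```ruby
require "csv"
require "zlib"
require "tmpdir"

# Count rows by the value found in one column of a gzipped CSV file.
def tally_column(path, index)
  counts = Hash.new(0)
  Zlib::GzipReader.open(path) do |gz|
    CSV.new(gz).each { |row| counts[row[index]] += 1 }
  end
  counts
end

# Build a tiny sample log; the columns (date, path, status) are invented.
Dir.mktmpdir do |dir|
  path = File.join(dir, "data.csv.gz")
  Zlib::GzipWriter.open(path) do |gz|
    gz.write("2006-04-05,/index.html,200\n")
    gz.write("2006-04-05,/missing,404\n")
    gz.write("2006-04-05,/about,200\n")
  end
  p tally_column(path, 2)
end
```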

Given that most ETL tools are complicated and/or expensive, writing a simple script often seems the path of least resistance. Especially when you just need to load the data to do some analysis and not set up an ongoing processing system.

But where are they coming from?

Monday, April 3rd, 2006

When analyzing web data, it can be useful to map IP addresses back to a geographic location. Fortunately for those of us who can’t afford to buy commercial databases with this information, there are public projects like Host IP picking up the slack.

One thing that makes Host IP particularly interesting is that you can download the whole IP database for your own use.

Certainly accuracy is a problem in a data set like this. But, so long as you account for that, you can still get a general idea of geographic distribution of requests. Perhaps the most useful information you can determine with certainty is which country traffic is coming from, which can help you with planning for site development.
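Once the data is loaded, a country lookup can be quite small. The Ruby sketch below assumes the data has been reduced to rows of first address, last address and country code; that layout is an assumption and not the actual Host IP schema, and the sample ranges are documentation-only address blocks with made-up countries.

```ruby
require "ipaddr"

# Hypothetical rows: [first address, last address, country code]. The real
# Host IP download has its own schema; these are reserved documentation
# ranges paired with made-up countries.
RANGES = [
  ["192.0.2.0",    "192.0.2.255",    "US"],
  ["198.51.100.0", "198.51.100.255", "DE"],
  ["203.0.113.0",  "203.0.113.255",  "JP"]
].map { |first, last, country| [IPAddr.new(first).to_i, IPAddr.new(last).to_i, country] }
 .sort_by(&:first)

# Binary search for the first range that ends at or after the address, then
# confirm the address really falls inside it. Assumes non-overlapping ranges.
def country_for(ip)
  n = IPAddr.new(ip).to_i
  range = RANGES.bsearch { |_first, last, _country| last >= n }
  range && range[0] <= n ? range[2] : nil
end

p country_for("198.51.100.7")
p country_for("10.0.0.1")
```

Returning nil for addresses outside every known range keeps the gaps in the data visible instead of guessing.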

Play Ball

Sunday, April 2nd, 2006

One of the fun things about baseball is how prevalent analysis is in the fan community. Analysis isn’t really possible without data, so it is great to know there are publicly available sources of player performance data.

That fans of baseball enjoy poring over the numbers is no secret. So it is disappointing that the MLB itself does not make more data readily available to its fans. They are lucky there are independent sites out there like the Baseball Archive filling that gap for them.

Opening day is Monday, play ball!