20
Jun

Minister of Information

New York Magazine has a good profile of Dr. Edward Tufte. If you are not familiar with his work, you should be. Dr. Tufte is an expert and pioneer in the field of visual communication of information and this is a nice introduction to read before you buy your copies of his fantastic books.

He keeps going on the road, selling steadily, a few gigs a month, year after year. That may be why there are 1.4 million copies of his titles in print—a staggering figure for self-publishing. (The top seller, The Visual Display of Quantitative Information, has been a reliable mover since 1983.) And at these six-and-a-half-hour presentations, the audience starts cheering when he hits the floor, clamors for their books to be signed, buys posters at the table out front. As soon as the applause stops, Tufte bolts backstage, enthusiastically draining a Corona.

15
May

Don’t Forget About The Bots!

It is occasionally necessary for us to take down one of our customer web sites in order to perform maintenance tasks. Most of the time this doesn’t last more than a few minutes, but if things go wrong, it could take much longer. These days we use Capistrano for deployment, which has built-in functionality to help easily disable a web site (disable_web). That provides reasonable feedback to people coming to the site, so they know what is happening. What it doesn’t cover are the machines accessing the site. Chances are those programs can’t tell from the maintenance page that the site is down.

It turns out that this problem was easy to fix: all that was needed was to get the web server to return the correct HTTP status code. As it was, the server was returning code 200: “OK”, when the more appropriate code would be 503: “Service Temporarily Unavailable.”

All our customer sites run with an Apache server sitting in front of the Rails sites. This allowed us to make the maintenance page a script rather than just a static HTML page.

  • To start, we implemented the custom maintenance pages described by Mike Clark.
  • Next we replaced the HTML version of the maintenance page with a PHP page.
  • Finally, we added a couple of custom headers to the PHP page so that it returned the correct status information.

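<?php
// Headers must be sent before any page output.
header("HTTP/1.1 503 Service Temporarily Unavailable");
// Ask clients to come back in 300 seconds (5 minutes).
header("Retry-After: 300");
?>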

We included a Retry-After header of 5 minutes, assuming that the site will probably be back up soon.

And that’s it, a couple of changes and now our sites speak proper HTTP even when disabled, great!
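For the curious, here is a rough sketch of how a bot on the other end might interpret that response. It is only an illustration: retry_delay and DEFAULT_DELAY are hypothetical names, not part of any library, and the 60-second fallback is an arbitrary assumption. The sketch accepts both forms that HTTP allows for Retry-After, a number of seconds or an HTTP date.

```ruby
require "time"

# Assumed fallback (in seconds) when a 503 carries no usable Retry-After.
DEFAULT_DELAY = 60

# Hypothetical helper: returns how many seconds a client should wait
# before retrying, or nil when the response is not a 503.
def retry_delay(status, headers, now = Time.now)
  return nil unless status == 503

  value = headers["Retry-After"]
  return DEFAULT_DELAY if value.nil?

  if value.match?(/\A\d+\z/)
    # Delta-seconds form, e.g. "300"
    value.to_i
  else
    # HTTP-date form; never return a negative wait
    [(Time.httpdate(value) - now).ceil, 0].max
  end
rescue ArgumentError
  # The date could not be parsed, so fall back to the default
  DEFAULT_DELAY
end

puts retry_delay(503, { "Retry-After" => "300" })
```

With the headers from our maintenance page, a client built along these lines would wait five minutes and try again rather than treating the maintenance notice as the real content of the site.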

20
Apr

Agriculture Department Exposes SSNs

Came across an article in the New York Times describing the latest occurrence in the growing trend of private consumer information being inadvertently or purposely exposed on the internet. Now, due to obvious concerns about identity theft, millions of government dollars will have to be spent to monitor all these folks’ credit reports. Even worse is how many copies of this database now sit in places that are completely outside of the agency’s control.

The Agriculture Department said that its review of the database shows that between 100,000 and 150,000 people could be at risk.

Privacy advocates say the actions by the agencies may not be enough. The database is more than two decades old, and is used by many federal and state agencies, by researchers, by journalists and by other private citizens to track government spending. Thousands of copies of the database exist.

14
Apr

Information Rich Web Design

Dr. Tufte has posted on his blog a letter he wrote to the Executive Editor of the Washington Post, following their site’s recent redesign. In short, he delivers the Editor the following excellent instructions to be handed off to their web designer:

Make our webpage straightforward, and if possible elegant–and, no matter what, increase the amount of news available within the immediate eyespan of the viewer on the homepage. We want more of what we do well immediately visible. People come to our website for the news, not for the interface.

Edward Tufte
March 29 2007

Sage advice any site designer should heed. Click over to Dr. Tufte’s site to join in the discussion about the Post’s redesign.

11
Apr

Web Analytic Solution Comparison

Manoj Jasra recently posted a very useful web analytics solution comparison on his blog. If you are using, or are considering using, any kind of web analytics package on your site, his collection of links is definitely worth browsing through.

10
Apr

It’s Official: PowerPoint Bad for Brains

The Register UK reports on new research coming out of Australia which recommends doing away with PowerPoint presentations as a means to communicate information.

Anyone who’s been a victim of “death by PowerPoint” - that glazed and distant feeling that overwhelms you when some sales droid starts their presentation - will be reassured by Aussie researchers who’ve discovered biological reasons for the feeling.

Humans just don’t like absorbing information verbally and visually at the same time - one or the other is fine but not both simultaneously.

Researchers at the University of New South Wales in Australia found the brain is limited in the amount of information it can absorb - and presenting the same information in visual and verbal form - like reading from a typical PowerPoint slide - overloads this part of memory and makes absorbing information more difficult.

Professor Sweller said: “The use of the PowerPoint presentation has been a disaster. It should be ditched.

“It is effective to speak to a diagram, because it presents information in a different form. But it is not effective to speak the same words that are written, because it is putting too much load on the mind and decreases your ability to understand what is being presented.”

The theory of “cognitive load theory” suggest the memory can deal with two or three tasks for a period of a few seconds - any more than that and information starts to get lost.

Read the abstract of Professor Sweller’s work.

06
Dec

Swivel

Swivel, a new data analysis website, launched today. The founders like to refer to the site as YouTube for Data. The aim is to get people to upload and analyze data on the site and then share and distribute the results of that analysis in the form of linked graphs, kind of like the embedded video players of the aforementioned YouTube.

I worry that this site is trying too hard to be everything to everyone. Covering all types of data and trying to do something intelligent with them seems to me to make it unlikely that the site will be an authority on any data set. Also of concern is the strong likelihood of users producing all kinds of bad graphs that mean nothing, but are treated as evidence of something or other. Those who enjoy the kind of nonsense graphics that populate the likes of Time magazine will probably love this site for that very reason.

What is exciting to me is the possibility of this site becoming a central repository of quality metadata: the kinds of datasets that are most useful as additions to other datasets, like lists of holidays or stock market closing days. If these types of datasets find a home on Swivel, perhaps they can get the ongoing updates, corrections and verifications that would make them very useful to the community.

03
Oct

Convert Netflix Prize Data to CSV

Here is a simple Ruby script to convert the Netflix Prize training data files into a single denormalized CSV file.

require "csv"

# Make a movie lookup table: movies[id] = [year, title]
# Assumption: the titles file is Latin-1 encoded; adjust if yours differs.
movies = []
File.foreach('movie_titles.txt', encoding: 'ISO-8859-1:UTF-8') do |line|
  row = line.chomp.split(',', 3)
  movies[row.shift.to_i] = row
end

# Read all the ratings files and denormalize into one CSV file
CSV.open('ratings.txt', 'w') do |out|
  Dir["training_set/mv_*.txt"].each do |file|
    File.open(file, 'r') do |f|
      # The first line is the movie id
      movie_id = f.gets.to_i
      # Columns: customer id, rating, date, movie id, year, title
      rating = ["", "", "", movie_id, movies[movie_id]].flatten
      printf "%5d - %s\n", rating[3], rating[5]
      f.each_line do |line|
        rating[0..2] = line.chomp.split(',')
        out << rating
      end
    end
  end
end
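To see what the denormalized rows look like without downloading the whole data set, here is a small sketch that runs the same idea on in-memory strings. The sample titles and ratings are made up for illustration and only mimic the layout of the real files (movie id, year, title in the titles file; a movie id line followed by customer id, rating, date lines in each ratings file).

```ruby
require "csv"

# Made-up sample data in the same layout as the real files.
titles  = "1,2003,Example Movie\n2,2004,Another Title, Part 2\n"
ratings = "2:\n101,4,2005-01-15\n102,5,2005-02-20\n"

# Lookup table: movies[id] = [year, title]. The limit of 3 keeps
# commas inside a title from splitting it into extra fields.
movies = []
titles.each_line do |line|
  id, year, title = line.chomp.split(",", 3)
  movies[id.to_i] = [year, title]
end

# The first line of a ratings file names the movie; the rest are ratings.
lines = ratings.lines.map(&:chomp)
movie_id = lines.shift.to_i

# Append the movie columns to every rating row.
csv = CSV.generate do |out|
  lines.each do |line|
    out << line.split(",") + [movie_id, *movies[movie_id]]
  end
end

puts csv
```

Each output row carries the customer id, rating, and date followed by the movie id, year, and title, and the CSV library quotes any title that contains a comma.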

31
Aug

AOL Search Data Reveals a Great Deal

As I’m sure you’ve already heard, a research team over at AOL made a little mistake when they decided to release a three-month sample of their search log data to the academic community. Of course the dataset was pulled from their servers within a matter of days, but by that point there were mirrors of the data everywhere and it was too late.

During the week of August 6, some people in AOL’s research division decided to release to the public a little database they had. It contained a list of about 658,000 users and the Web searches each made from March to May. If you were one of those lucky, randomly selected souls, every search term you entered was opened to the world.

AOL didn’t tell its users it could do this, nor that it was going to, and it didn’t offer anyone the opportunity to opt out. It did take a small step back from the abyss by substituting a number for the users’ screen names.

“So what?” you might say. “As long as no one knows it was me searching for “dwarf prostitutes in south dakota” what difference does it make?”

The problem is that searches aren’t anonymous, even if the screen names were withheld to protect the innocent. The New York Times proved this when it tracked down user 4417749, one Thelma Arnold of Lilburn, Ga., from her searches.

And you don’t need the resources of the Times. Even a part-time technology columnist of average intelligence can glean plenty from the database.

Feel free to check out a few of the websites that have been built around this data set in the past few weeks:

31
Aug

Data Mining Used to Find New Materials

An interesting combination of data mining and quantum mechanics at MIT seems to have created a new approach for predicting crystalline structures. They use the same data mining techniques that are employed in consumer applications like e-commerce shopping recommendation engines and market basket analysis.

The MIT team preloaded the entire body of historical knowledge of crystal structures into a computer algorithm, or program, which they had designed to make correlations among the data based on the underlying rules of physics.

Harnessing this knowledge, the program then delivers a list of possible crystal structures for any mixture of elements whose structure is unknown. The team can then run that list of possibilities through a second algorithm that uses quantum mechanics to calculate precisely which structure is the most stable energetically - a standard technique in the computer modeling of materials.

The latest research work has been published by Nature Materials under the title “Predicting crystal structure by merging data mining with quantum mechanics” (Volume 5, Number 8, Pages 641-646, August 2006).