Web Analytic Solution Comparison

Wednesday, April 11th, 2007 | Justin Bugajski | No Comments

Manoj Jasra posted a very useful web analytic solution comparison on his blog recently. If you are using, or are considering using, any kind of web analytic package on your site, his collection of links is definitely worth browsing through.

It’s Official: PowerPoint Bad for Brains

Tuesday, April 10th, 2007 | Justin Bugajski | 1 Comment

The Register UK reports on new research coming out of Australia which recommends doing away with PowerPoint presentations as a means to communicate information.

Anyone who’s been a victim of “death by PowerPoint” – that glazed and distant feeling that overwhelms you when some sales droid starts their presentation – will be reassured by Aussie researchers who’ve discovered biological reasons for the feeling.

Humans just don’t like absorbing information verbally and visually at the same time – one or the other is fine but not both simultaneously.

Researchers at the University of New South Wales in Australia found the brain is limited in the amount of information it can absorb – and presenting the same information in visual and verbal form – like reading from a typical PowerPoint slide – overloads this part of memory and makes absorbing information more difficult.

Professor Sweller said: “The use of the PowerPoint presentation has been a disaster. It should be ditched.

“It is effective to speak to a diagram, because it presents information in a different form. But it is not effective to speak the same words that are written, because it is putting too much load on the mind and decreases your ability to understand what is being presented.”

The theory of “cognitive load theory” suggests the memory can deal with two or three tasks for a period of a few seconds – any more than that and information starts to get lost.

Read the abstract of Professor Sweller’s work.

Swivel

Wednesday, December 6th, 2006 | Nick Bugajski | No Comments

Swivel, a new data analysis website, launched today. The founders like to refer to the site as YouTube for Data. The aim is to get people to upload and analyze data on the site and then share and distribute the results of that analysis in the form of linked graphs, much like YouTube's embedded video players.

I worry that this site is trying too hard to be everything to everyone. Covering all types of data and trying to do something intelligent with them makes it unlikely, it seems to me, that the site will be an authority on any data set. Also of concern is the high likelihood of users producing all kinds of bad graphs that mean nothing but are treated as evidence of something or other. Those who enjoy the kind of nonsense graphics that populate the likes of Time magazine will probably love this site for that very reason.

What is exciting to me is the possibility of this site becoming a central repository of quality metadata: the kinds of datasets that are most useful as additions to other datasets, such as lists of holidays or stock market closing days. If these types of datasets find a home on Swivel, perhaps they can get the ongoing updates, corrections and verifications that would make them very useful to the community.

Convert NetFlix Prize Data to CSV

Tuesday, October 3rd, 2006 | Nick Bugajski | No Comments

Here is a simple Ruby script to convert the NetFlix Prize training data files into a single denormalized CSV file.

require "csv"

# Build a movie lookup table: movie id => [year, title].
# Assumption: the titles file is Latin-1 encoded; adjust the encoding if yours differs.
movies = []
File.foreach('movie_titles.txt', encoding: 'ISO-8859-1:UTF-8') do |line|
  # Limit the split to 3 fields because titles can contain commas.
  id, year, title = line.chomp.split(',', 3)
  movies[id.to_i] = [year, title]
end

# Read every ratings file and denormalize into one CSV file.
# The block form of CSV.open closes (and flushes) the output when done.
CSV.open('ratings.txt', 'w') do |out|
  Dir['training_set/mv_*.txt'].each do |path|
    File.open(path, 'r') do |f|
      # The first line is the movie id, e.g. "1:".
      movie_id = f.gets.to_i
      year, title = movies[movie_id]
      printf "%5d - %s\n", movie_id, title
      f.each_line do |line|
        # Each remaining line is customer_id,rating,date.
        out << line.chomp.split(',') + [movie_id, year, title]
      end
    end
  end
end
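
Once the denormalized file exists, simple aggregates become one-pass jobs. The following is a minimal sketch, not part of the original script, that computes an average rating per title. It assumes the column order the script writes (customer_id, rating, date, movie_id, year, title) and uses made-up sample rows so it can run on its own.

```ruby
require "csv"

# Made-up sample rows in the script's output order:
# customer_id, rating, date, movie_id, year, title
sample = <<~ROWS
  101,3,2005-01-02,1,2001,Example Movie A
  102,5,2005-01-03,1,2001,Example Movie A
  103,2,2005-01-04,2,2002,Example Movie B
ROWS

# Running sum and count of ratings per title.
totals = Hash.new { |hash, key| hash[key] = { sum: 0, count: 0 } }

CSV.parse(sample) do |_customer, rating, _date, _movie_id, _year, title|
  totals[title][:sum] += rating.to_i
  totals[title][:count] += 1
end

averages = totals.transform_values { |t| t[:sum].to_f / t[:count] }
averages.each { |title, avg| printf "%-16s %.2f\n", title, avg }
```

For the real file, swapping CSV.parse(sample) for CSV.foreach('ratings.txt') should give the same result while streaming one row at a time instead of loading everything into memory.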

AOL Search Data Reveals a Great Deal

Thursday, August 31st, 2006 | Justin Bugajski | No Comments

As I’m sure you’ve already heard, a research team over at AOL made a little mistake when they decided to release a three-month sample of their search log data to the academic community. The dataset was pulled from their servers within a matter of days, but by that point mirrors of the data were everywhere and it was too late.

During the week of August 6, some people in AOL’s research division decided to release to the public a little database they had. It contained a list of about 658,000 users and the Web searches each made from March to May. If you were one of those lucky, randomly selected souls, every search term you entered was opened to the world.

AOL didn’t tell its users it could do this, nor that it was going to, and it didn’t offer anyone the opportunity to opt out. It did take a small step back from the abyss by substituting a number for the users’ screen names.

“So what?” you might say. “As long as no one knows it was me searching for “dwarf prostitutes in south dakota” what difference does it make?”

The problem is that searches aren’t anonymous, even if the screen names were withheld to protect the innocent. The New York Times proved this when it tracked down user 4417749, one Thelma Arnold of Lilburn, Ga., from her searches.

And you don’t need the resources of the Times. Even a part-time technology columnist of average intelligence can glean plenty from the database.

Feel free to check out a few of the websites that have been built around this data set in the past few weeks:

Data Mining Used to Find New Materials

Thursday, August 31st, 2006 | Justin Bugajski | No Comments

An interesting combination of data mining and quantum mechanics at MIT seems to have created a new approach for predicting crystalline structures. They use the same data mining techniques that are employed in consumer applications like e-commerce shopping recommendation engines and market basket analysis.

The MIT team preloaded the entire body of historical knowledge of crystal structures into a computer algorithm, or program, which they had designed to make correlations among the data based on the underlying rules of physics.

Harnessing this knowledge, the program then delivers a list of possible crystal structures for any mixture of elements whose structure is unknown. The team can then run that list of possibilities through a second algorithm that uses quantum mechanics to calculate precisely which structure is the most stable energetically – a standard technique in the computer modeling of materials.

The latest research work has been published by Nature Materials under the title “Predicting crystal structure by merging data mining with quantum mechanics” (Volume 5, Number 8, Pages 641-646, August 2006).

Feds Sharpen Secret Tools for Data Mining

Wednesday, July 26th, 2006 | Justin Bugajski | No Comments

Big brother may be trying to watch you, but it’s unclear how skilled he is at dealing with the petabytes of information being collected.

Data-mining systems used by intelligence agencies include:

• Hardware and software from NCR subsidiary Teradata that is capable of storing and searching databases as large as 4 million gigabytes, or twice as much information as is held in all research libraries in the USA. Teradata executive Bill Cooper won’t say what’s in the Teradata systems that intelligence agencies use, but he says their applications include searching financial transactions for signs of money laundering.

• A program designed to identify members of terrorist networks and determine the most important members of those networks. Cogito Inc., of Draper, Utah, sold the program to the National Security Agency and other intelligence agencies, company executive William Donahoo says.

• Software from Verity Inc. used by the Defense Intelligence Agency and the Department of Homeland Security. A 2004 congressional report says DIA’s Verity system includes personally identifiable information about Americans from other agencies and commercial sources.

The five data-mining programs developed under Total Information Awareness are among at least eight TIA projects that have continued since Congress killed TIA in 2003. They include four efforts to create software that searches through mountains of data for evidence of terrorists and three projects that allow intelligence analysts from many different agencies to collaborate on computer networks. A contract to pull all of the new software together into a working system also remained active until at least last year, government records show.

AI Set to Exceed Human Brain Power

Tuesday, July 25th, 2006 | Justin Bugajski | No Comments

While the pace of advancement in machine intelligence has been slower than most had hoped, progress is being made. New approaches are needed to assimilate and understand the petabytes of information being generated by our 21st-century society. Existing methods of computing and analysis need to evolve significantly to keep up with the rising data tide; otherwise it will be all we can do just to process and store the information being created, let alone glean useful knowledge from it.

Nick Bostrom, Director of the Future of Humanity Institute at the UK’s Oxford University, said that AI-inspired systems were already integral to many everyday technologies such as internet search engines, bank software for processing transactions and in medical diagnosis. “A lot of cutting edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common enough it’s not labelled AI anymore.”

But Bostrom said that traditional “top-down” approaches to AI, in which programmers coded machines to cope with specific situations, were being supplemented by “bottom-up” systems inspired by enhanced understanding of the neural networks of the brain, leading to more subtle forms of AI.

“The more we discover how the human brain achieves intelligence the more we’ll be able to use the same computational architecture and algorithms in computers,” said Bostrom.

Analysis is Not Evil

Wednesday, June 28th, 2006 | Justin Bugajski | No Comments

This article raises an important point about the negative connotation the term “data mining” often has for people. It stems from users’ prior experience with data mining tools that were ineffectual, difficult to use, and produced results that were more abstract than actionable.

Linda Koontz, information management issues director at the Government Accountability Office, said some agencies she interviewed about programs that mine data refuse to identify their programs as such.

“Different people sometimes mean different things by the term data mining,” she said. “There isn’t one definition that everyone agrees with. A lot of people feel aversion to using the word ‘data mining’ because they think that casts a negative pall over what they are doing.”

GAO defines data mining as the application of database technology and techniques to uncover hidden patterns and subtle relationships in data and infer rules that allow for the prediction of future results. Koontz said she doesn’t understand why data mining has a negative connotation. “Analysis is not evil,” she said.

Read more about the CDC’s BioSense initiative

Data Mining Helps Uncover Fraud in Disaster Relief

Wednesday, June 28th, 2006 | Justin Bugajski | No Comments

This recent article about the GAO’s investigation into fraudulent use of government assistance following Hurricanes Katrina and Rita illustrates why it is becoming increasingly important to develop data analysis techniques that are proactive and real-time instead of reactive and after the fact. What could have been done differently to help the government avoid paying out on bogus claims? Certainly there is a balance to be maintained between the need to distribute aid quickly and the need to check a person’s identity thoroughly, but surely there must be a better way to manage the situation than the approach the government took after these two disasters.

A government watchdog relied on data mining to uncover an estimated $1 billion of improper or fraudulent payments for assistance in the aftermath of hurricanes Katrina and Rita last year.

The Government Accountability Office reported its findings on the fraud to the House Homeland Security Investigations Subcommittee on Wednesday. GAO found that the lack of “upfront controls” and inadequate data checks at the Federal Emergency Management Agency led to the improper disbursement of anywhere from $600 million to $1.4 billion to alleged hurricane victims who registered for federal assistance.

In one case, an individual using 13 different Social Security numbers, one of which belonged to the person, received 26 payments totaling $139,000. By searching public records, GAO found that of the 13 addresses that person claimed as damaged property, eight were bogus addresses or were publicly owned.

Matching information from FEMA registrations to a database of federal and state prison inmates, furthermore, GAO found more than 1,000 registrants used names and Social Security numbers belonging to prisoners who were not displaced by the storms. In one case, a Louisiana inmate received more than $20,000 for registering a post-office box as damaged property.
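
The two checks described above (matching registrations against an inmate list, and spotting one person registering under several Social Security numbers) are simple enough to run at registration time. Here is a minimal sketch of the idea. The record layout, field names and sample values are all hypothetical and are not taken from FEMA's or GAO's actual systems.

```ruby
require "set"

# Hypothetical registration record; the fields are illustrative only.
Registration = Struct.new(:id, :name, :ssn, :address)

# Made-up sample data.
registrations = [
  Registration.new(1, "A. Smith", "111-11-1111", "12 Oak St"),
  Registration.new(2, "B. Jones", "222-22-2222", "PO Box 9"),
  Registration.new(3, "A. Smith", "333-33-3333", "40 Elm St"),
  Registration.new(4, "C. Brown", "444-44-4444", "7 Pine Rd"),
]

# Hypothetical reference list of SSNs belonging to current inmates.
inmate_ssns = Set["222-22-2222"]

# Check 1: registrations whose SSN appears in the inmate list.
inmate_matches = registrations.select { |r| inmate_ssns.include?(r.ssn) }

# Check 2: names that registered under more than one SSN.
multi_ssn_names = registrations
  .group_by(&:name)
  .select { |_name, regs| regs.map(&:ssn).uniq.size > 1 }
  .keys

puts "Inmate SSN matches: #{inmate_matches.map(&:id).inspect}"
puts "Names with multiple SSNs: #{multi_ssn_names.inspect}"
```

A real system would need fuzzier matching on names and addresses, and a flagged record should trigger a review rather than an automatic denial, but even exact-match checks like these are the kind of upfront control the GAO found missing.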
