AOL Search Data Reveals a Great Deal

August 31st, 2006

As I’m sure you’ve already heard, a research team over at AOL made a little mistake when it decided to release a three-month sample of its search log data to the academic community. The dataset was pulled from AOL’s servers within a matter of days, but by that point mirrors of the data were everywhere and it was too late.

During the week of August 6, some people in AOL’s research division decided to release to the public a little database they had. It contained a list of about 658,000 users and the Web searches each made from March to May. If you were one of those lucky, randomly selected souls, every search term you entered was opened to the world.

AOL didn’t tell its users it could do this, nor that it was going to, and it didn’t offer anyone the opportunity to opt out. It did take a small step back from the abyss by substituting a number for the users’ screen names.

“So what?” you might say. “As long as no one knows it was me searching for ‘dwarf prostitutes in south dakota,’ what difference does it make?”

The problem is that searches aren’t anonymous, even if the screen names were withheld to protect the innocent. The New York Times proved this when it tracked down user 4417749, one Thelma Arnold of Lilburn, Ga., from her searches.

And you don’t need the resources of the Times. Even a part-time technology columnist of average intelligence can glean plenty from the database.
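To see why swapping a number for a screen name does so little, here is a minimal sketch of the first step anyone would take with such a log: collect every query under the same numeric ID. The rows, the IDs, and the function name `queries_by_user` are all invented for illustration; the real file layout may differ.

```python
from collections import defaultdict

# Invented rows for illustration only: (anonymized user id, query).
SAMPLE_LOG = [
    (1001, "plumbers in springfield"),
    (1001, "smith family reunion 2006"),
    (2002, "cheap flights to denver"),
    (1001, "homes for sale elm street springfield"),
]

def queries_by_user(rows):
    """Group every query under the numeric ID that replaced the screen name."""
    grouped = defaultdict(list)
    for user_id, query in rows:
        grouped[user_id].append(query)
    return dict(grouped)

profiles = queries_by_user(SAMPLE_LOG)

# Read together, one user's queries hint at a town, a surname, and a street.
for query in profiles[1001]:
    print(query)
```

No single query identifies anyone. The trouble is that the ID stays constant, so three months of searches pile up into a profile.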

Feel free to check out a few of the websites that have been built around this data set in the past few weeks:

Data Mining Used to Find New Materials

August 31st, 2006

An interesting combination of data mining and quantum mechanics at MIT seems to have created a new approach for predicting crystalline structures. They use the same data mining techniques that are employed in consumer applications like e-commerce shopping recommendation engines and market basket analysis.

The MIT team preloaded the entire body of historical knowledge of crystal structures into a computer algorithm, or program, which they had designed to make correlations among the data based on the underlying rules of physics.

Harnessing this knowledge, the program then delivers a list of possible crystal structures for any mixture of elements whose structure is unknown. The team can then run that list of possibilities through a second algorithm that uses quantum mechanics to calculate precisely which structure is the most stable energetically - a standard technique in the computer modeling of materials.

The latest research work has been published by Nature Materials under the title “Predicting crystal structure by merging data mining with quantum mechanics” (Volume 5, Number 8, Pages 641-646, August 2006).

Feds Sharpen Secret Tools for Data Mining

July 26th, 2006

Big brother may be trying to watch you, but it’s unclear how skilled he is at dealing with the petabytes of information being collected.

Data-mining systems used by intelligence agencies include:

• Hardware and software from NCR subsidiary Teradata that is capable of storing and searching databases as large as 4 million gigabytes, or twice as much information as is held in all research libraries in the USA. Teradata executive Bill Cooper won’t say what’s in the Teradata systems that intelligence agencies use, but he says their applications include searching financial transactions for signs of money laundering.

• A program designed to identify members of terrorist networks and determine the most important members of those networks. Cogito Inc., of Draper, Utah, sold the program to the National Security Agency and other intelligence agencies, company executive William Donahoo says.

• Software from Verity Inc. used by the Defense Intelligence Agency and the Department of Homeland Security. A 2004 congressional report says DIA’s Verity system includes personally identifiable information about Americans from other agencies and commercial sources.

The five data-mining programs developed under Total Information Awareness are among at least eight TIA projects that have continued since Congress killed TIA in 2003. They include four efforts to create software that searches through mountains of data for evidence of terrorists and three projects that allow intelligence analysts from many different agencies to collaborate on computer networks. A contract to pull all of the new software together into a working system also remained active until at least last year, government records show.

AI Set to Exceed Human Brain Power

July 25th, 2006

While the pace of advancement in machine intelligence has been slower than most have hoped, progress is being made. New approaches are needed to assimilate and understand the petabytes of information being generated by our 21st-century society. Existing methods of computing and analysis need to evolve significantly to keep up with the rising data tide; otherwise it will be all we can do just to process and store the information being created, let alone glean useful knowledge from it.

Nick Bostrom, Director of the Future of Humanity Institute at the UK’s Oxford University, said that AI-inspired systems were already integral to many everyday technologies such as internet search engines, bank software for processing transactions and in medical diagnosis. “A lot of cutting edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common enough it’s not labelled AI anymore.”

But Bostrom said that traditional “top-down” approaches to AI, in which programmers coded machines to cope with specific situations, were being supplemented by “bottom-up” systems inspired by enhanced understanding of the neural networks of the brain, leading to more subtle forms of AI.

“The more we discover how the human brain achieves intelligence the more we’ll be able to use the same computational architecture and algorithms in computers,” said Bostrom.

Analysis is Not Evil

June 28th, 2006

This article raises an important point about the negative connotation the term “data mining” often has for people. This stems from users’ prior experience with data mining tools that were ineffectual, difficult to use, and provided results that were more abstract than actionable.

Linda Koontz, information management issues director at the Government Accountability Office, said some agencies she interviewed about programs that mine data refuse to identify their programs as such.

“Different people sometimes mean different things by the term data mining,” she said. “There isn’t one definition that everyone agrees with. A lot of people feel aversion to using the word ‘data mining’ because they think that casts a negative pall over what they are doing.”

GAO defines data mining as the application of database technology and techniques to uncover hidden patterns and subtle relationships in data and infer rules that allow for the prediction of future results. Koontz said she doesn’t understand why data mining has a negative connotation. “Analysis is not evil,” she said.

Read more about the CDC’s BioSense initiative

Data Mining Helps Uncover Fraud in Disaster Relief

June 28th, 2006

This recent article about the GAO’s investigation into fraudulent use of government assistance following Hurricanes Katrina and Rita illustrates why it is becoming increasingly important to develop data analysis techniques that work proactively, in real time, rather than reactively, after the fact. What could have been done differently in this situation to help the government avoid paying out on bogus claims? While there is certainly a balance to be maintained between the need to distribute aid quickly and the need to check a person’s identity thoroughly, surely there must be a better way to manage the situation than what the government did after these two disasters.

A government watchdog relied on data mining to uncover an estimated $1 billion of improper or fraudulent payments for assistance in the aftermath of hurricanes Katrina and Rita last year.

The Government Accountability Office reported its findings on the fraud to the House Homeland Security Investigations Subcommittee on Wednesday. GAO found that the lack of “upfront controls” and inadequate data checks at the Federal Emergency Management Agency led to the improper disbursement of anywhere from $600 million to $1.4 billion to alleged hurricane victims who registered for federal assistance.

In one case, an individual using 13 different Social Security numbers, one of which belonged to the person, received 26 payments totaling $139,000. By searching public records, GAO found that of the 13 addresses that person claimed as damaged property, eight were bogus addresses or were publicly owned.

Matching information from FEMA registrations to a database of federal and state prison inmates, furthermore, GAO found more than 1,000 registrants used names and Social Security numbers belonging to prisoners who were not displaced by the storms. In one case, a Louisiana inmate received more than $20,000 for registering a post-office box as damaged property.
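The inmate check described above comes down to matching one list against another on a shared identifier. Here is a minimal sketch of that idea, not GAO’s actual method. The records, field names, and the helper `flag_inmate_matches` are all hypothetical, and the Social Security numbers are deliberately invalid placeholders.

```python
# Invented records for illustration; the field names are assumptions.
registrations = [
    {"reg_id": "R1", "name": "A. Example", "ssn": "000-00-0001"},
    {"reg_id": "R2", "name": "B. Sample", "ssn": "000-00-0002"},
    {"reg_id": "R3", "name": "C. Placeholder", "ssn": "000-00-0003"},
]

# Hypothetical set of SSNs drawn from a prison-inmate database.
inmate_ssns = {"000-00-0002", "000-00-0009"}

def flag_inmate_matches(regs, inmates):
    """Return the registrations whose SSN also appears in the inmate set."""
    return [r for r in regs if r["ssn"] in inmates]

flagged = flag_inmate_matches(registrations, inmate_ssns)
print([r["reg_id"] for r in flagged])
```

Because the lookup is a simple set membership test, a check like this could in principle run when a registration is filed rather than months after the money has gone out. That is the kind of “upfront control” the report found lacking.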

Government Increasingly Turns to Data Mining

June 27th, 2006

An interesting article in the Washington Post today.

Industry executives, analysts and watchdog groups say the federal government has significantly increased what it spends to buy personal data from the private sector, along with the software to make sense of it, since the Sept. 11, 2001, attacks. They expect the sums to keep rising far into the future.

The hope is that the technology can help to discern and thwart threats just as businesses have used it for years to predict consumer behavior on buying cosmetics or repaying mortgages, for example.

Companies keep an increasing amount of data about everyone — tracking their buying, travel, bank transactions and bill-paying habits. Data mining uses mathematical formulas to look for patterns in those behaviors. The results could enable the grocery store to send out targeted coupons, or, in theory, help the government decide how likely it may be that someone is linked to terrorist groups.
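The “mathematical formulas” in question can be surprisingly simple. As a rough sketch of the grocery-store case, the snippet below counts how often pairs of items show up in the same basket, which is the starting point of market basket analysis. The baskets and the function name `pair_counts` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def pair_counts(transactions):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # Sort so ("bread", "milk") and ("milk", "bread") are the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
print(counts.most_common(1))
```

A store might send bread coupons to milk buyers on the strength of counts like these. Whether the same kind of co-occurrence says anything reliable about links to terrorist groups is a much harder question, which is why the article says “in theory.”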

NSA Looking at Social-Networking Spaces

June 27th, 2006

Another interesting article coming off the UPI wire describes the NSA going after MySpace- and Facebook-type sites to discover new patterns of behavior that, when combined with other data sources, may point to illicit activities.

This information, if collected and filtered correctly, can be combined with other harvested data to reveal information as to banking, retail and property records and eventually help fill in the picture of a potential terror suspect’s activities. Such an aid may prove extremely helpful to the intelligence community in its hunt for both terror suspects and criminals.

Are they barking up the wrong tree here? Or is it simply a matter of having access to as many data sources as possible, leaving options open for different analytical paths to follow?

Suffering From Presentation Fatigue?

June 21st, 2006

Business Objects has released the results of a survey suggesting that most business executives suffer from presentation fatigue.

Most business executives suffer from “presentation fatigue,” a direct result of the vast amounts of time they spend preparing for and giving presentations each month, according to a survey from Business Objects. The survey indicated that while executives have improved access to data, they need solutions that can help them quickly and easily present that information in a visually appealing format.

A recent online poll of 382 executives found that over half of the respondents give one or more presentations per month. Of those respondents, 36 percent found presenting data to the board or senior management “tedious,” with a further 24 percent stating they “dread it each time it comes around.”

The poll results reveal that preparing for presentations is time-consuming and stressful for most executives. Survey participants admitted (45 percent) that it often takes them one or more hours to convert data in Excel spreadsheets into a format for a meeting or presentation. An additional 34 percent of the executives surveyed have pulled an all-nighter to prepare for a big presentation.

As more and more is demanded of executives at board meetings, customer events, and financial reviews, they need to find ways to make the data presentation process simpler and more compelling.

“As we all know, presentations can either lull an audience to sleep, or they can energize them to act,” said Donald MacCormick, vice president of product marketing for Business Objects.

Information Theory Used to Understand Whale Song

May 24th, 2006

A good article was recently published in the Journal of the Acoustical Society of America that describes using Information Theory to understand the signals that whales use to communicate with each other.

The computer analysis and the human observers all found that whale songs are not only hierarchical, they convey around one bit of information per second. By comparison, humans generate 10 bits of information, or variance, for every word that is spoken.
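For anyone wondering what a “bit of information” means here, the basic quantity in information theory is Shannon entropy: the average number of bits needed per symbol given how often each symbol occurs. The sketch below computes it for a sequence of labeled units. The toy sequences are hypothetical, and this simple per-symbol measure is only an assumption about the flavor of the analysis, not the researchers’ actual method.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Average bits per symbol, based on each symbol's observed frequency."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical sequences of song units, labeled by letter.
two_unit_song = list("AABBAABB")   # two units, equally frequent
monotone_song = list("AAAAAAAA")   # one unit repeated

print(shannon_entropy(two_unit_song))
print(shannon_entropy(monotone_song))
```

Two equally likely units work out to one bit per unit, and a single repeated unit carries none. Turning bits per unit into bits per second would also require knowing how many units are sung per second, which the excerpt above does not give.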

This is an intriguing mix of biology and mathematics, and the approach is sure to see wider use as scientists seek to better understand the complex patterns that shape the world around us.