My new colleague Steven Forth, CTO of eMonitor (the content technology arm of Monitor Group), referred me last night to Many Eyes (http://services.alphaworks.ibm.com/manyeyes/home), a social data visualization and interpretation service developed by the Collaborative User Experience (CUE) Research Group at IBM's Watson Research Center. Since the intersection of social software and content analysis is currently a high-priority professional interest of mine, I decided to try it out.
Among other visualization approaches to structured data sets, Many Eyes generates tag clouds from free text files. Steven noted that in particular, the two-word view seems like a very powerful 80-20 cut at inferring predominant meaning in a body of text.
I experimented by exporting the contents of this blog as a text file, progressively scrubbing out the useless Typepad artifact words and HTML tags that appear frequently (like "title", "breaks", "comments", and my name), and then publishing the file on Many Eyes. To do the scrubbing I simply ran "edit/replace/'word', ''" in Windows Notepad, once per word. Here's the result (click on the image to manipulate the cloud on Many Eyes):
The two-word view does a pretty decent job of communicating the themes I write about, I think. An unintended side benefit: it highlights recurring cliches and verbal tics I need to purge from my writing, like "drive higher" (argh).
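For anyone curious what a two-word view boils down to, here is a minimal sketch in Python. It is not how Many Eyes does it (I don't know their tokenization or weighting); the function name `two_word_counts`, the regular expression, and the sample text are all my own assumptions for illustration.

```python
import re
from collections import Counter

def two_word_counts(text, scrub_words=()):
    """Count adjacent word pairs after dropping unwanted words.

    Hypothetical helper: a rough stand-in for a two-word tag cloud,
    not the Many Eyes implementation.
    """
    scrub = {w.lower() for w in scrub_words}
    # Lowercase the text and keep only runs of letters and apostrophes.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in scrub]
    # Pair each word with the one that follows it.
    return Counter(zip(words, words[1:]))

sample = "title drive higher returns comments drive higher margins"
counts = two_word_counts(sample, scrub_words=["title", "comments"])
print(counts.most_common(1))
```

One caveat with this approach: deleting a scrubbed word makes its neighbors adjacent, so a few pairs in the count (here "returns drive") never actually appeared together in the original text.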
This whole effort took about 30 minutes, from registration to pasting the syndication HTML into this post. Two-thirds of that time was spent scrubbing the data iteratively. This could have gone faster in one of two ways. First, Many Eyes could provide a custom scrubbing interface where I could register multiple words to be removed from, or replaced in, a text file in one pass. Second, and better, they could allow users to share not only comments, but also scrubbing filters that would apply to data sets coming from common sources with common problems, such as Typepad exports or government information.
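To make the shared-filter idea concrete, here is a rough sketch of what one might look like as plain data plus a small function that applies it. Everything here is hypothetical: Many Eyes offers no such format, and the field names ("source", "remove", "replace") and the `apply_filter` function are my own inventions.

```python
import json
import re

# Hypothetical shareable filter; nothing like this exists on Many Eyes.
TYPEPAD_FILTER = json.loads("""
{
  "source": "typepad-export",
  "remove": ["title", "breaks", "comments"],
  "replace": {"&amp;": "and"}
}
""")

def apply_filter(text, scrub_filter):
    """Apply every replacement, then strip every listed word, in one pass each."""
    for old, new in scrub_filter.get("replace", {}).items():
        text = text.replace(old, new)
    remove = scrub_filter.get("remove", [])
    if remove:
        # Match whole words only, ignoring case.
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in remove) + r")\b"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by removed words.
    return " ".join(text.split())

cleaned = apply_filter("Title pricing &amp; value breaks comments", TYPEPAD_FILTER)
print(cleaned)
```

Because the filter is just data, one person could publish it once and anyone else with a Typepad export could reuse it instead of repeating the same find-and-replace rounds.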
Beyond this, I can imagine a thematic matching capability -- "based on two-word 'keyphrase' frequencies, this data set seems to have lots in common with these other ones..." Such a capability could be further enhanced by after-the-fact user ratings, so people could confirm whether any given algorithmically suggested match was actually good, a la "was this useful to you?" This, like the "Graphic Friendships" idea I wrote about a while back, could help to make the web browsing experience more productive.
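One simple way such matching could work, sketched below, is to compare the two-word frequency counts of two texts with cosine similarity. This is only my assumption about a plausible approach, not something Many Eyes provides; the function names and the sample texts are made up for illustration.

```python
import math
import re
from collections import Counter

def keyphrases(text):
    """Count adjacent word pairs in a text (hypothetical helper)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

def cosine_similarity(a, b):
    """Return 0.0 for no shared keyphrases, up to 1.0 for identical profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

blog = keyphrases("social software and content analysis drive higher value")
similar = keyphrases("content analysis and social software in practice")
unrelated = keyphrases("quarterly rainfall totals by county")

print(cosine_similarity(blog, similar) > cosine_similarity(blog, unrelated))
```

The "was this useful to you?" ratings would sit on top of a score like this, letting people confirm or reject the matches it suggests.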