| Muck and Mystery Loitering With Intent |
blog - at - crumbtrail.org |
Any collection of information is useful only to the extent that you can access the information readily. Though hardly the first to grapple with that truth, American librarian Melvil Dewey was inspired to create the Dewey Decimal System of Classification to bring some order to vast collections of books. But there is a fundamental defect in the idea of ordering information by indexing it under some type of abstraction: it conceals more than it reveals. Anything less than complete information loses information.
The problem isn't that information needs order; it is that it needs search tools that can provide views of the whole collection of data filtered by selected criteria. But there's a problem here too, in that a full search can take a long time and find so many matches that there is still too much for a human to digest. Finding something takes skill and talent even with good tools. Scholars and researchers require full information and are willing to develop the required skills. Ordinary consumers are less interested in complete and accurate results than in immediate satisfaction. Scholars search, consumers sort.
Many of the consumer-oriented products for the web - the most vast collection of information - seek to make that collection useful to amateurs and consumers who haven't the time, interest or skill to do searches. Early web directories such as Yahoo, and later DMOZ, attempted to order and classify pages to allow more direct access. The problem is that once more than a trivial amount of information is classified, the directory itself becomes unwieldy, compounding the loss of accuracy and completeness.
An old joke among software developers is that software users always want a DWIM module - Do What I Mean - to relieve them of the need to use the software effectively. Web searchers want a FWIW module - Find What I Want, for what it's worth. Static approaches to the problem, from Dewey Decimal to Metalanguages, have failed to satisfy. A current fad is "tagging", a variant of the DMOZ idea of user directory creation that has even less hierarchical guidance and even more participants. Wade Roush takes a look.
It's called tagging, and it's going on at a handful of free websites--Delicious, Flickr, Furl, and Rojo, among others--where members are voluntarily classifying and categorizing thousands of pieces of content each day. The phenomenon is growing fast: Delicious alone had 90,000 total users by April 2005, up from 30,000 the previous December. . .
As noted above, "tagging" is the tradition in the search industry rather than the new approach. It is searching that is new since it only became possible with the advent of cheap and vast digital storage, and speedy processors driven by clever search algorithms. The oldest forms of computerized data management systems used "keys", aka tags, to make it faster to find information and limit browsing to only the likely matches. Many still think of data collections as filing cabinets full of folders. These "views" of the data collections might have embedded keys, keys that actually exist in the information, or external keys that describe the information even if the words don't occur in the text. The views can either be static or dynamic. A static view is a "cached" query of the data base and may include the result as well as the question to avoid having to repeat the scan of the data base. A dynamic view is constructed for immediate use. The rub is that a poorly constructed query may not yield the required data or may yield too much to be useful. It takes skill and talent to do well.
Tagging is already attracting the attention of the traditional search industry. In March, Yahoo acquired Flickr, a photo-sharing site where users can tag their own pictures and others'. Technorati, a site that tracks the most-discussed subjects on Web logs (or "blogs"), makes extensive use of tags from Delicious, Flickr, and Furl. Last September, Looksmart purchased Furl, and it is now adding the ability to create instant Furl bookmarks to its family of specialized search sites.
"It makes a lot of sense to have a capability that allows for sharing of essential Web pages," says Debby Richman, Looksmart's senior vice president of consumer product development.
There's no uniformity to the way people tag Web pages, so the same tag might wind up being applied to very different kinds of content. But to most developers, that's actually a strength of the technology, not a weakness. "The information you get [through tags] is always going to be somewhat imperfect and fuzzy," says Joshua Schachter, the creator of Delicious. "But a bunch of people doing 'okay' tagging may actually have a higher net value than an authoritative organization telling you how information should be organized."
The part that fascinates Roush is that users in effect sort themselves as well as the data by choosing to participate in tagging communities and rely on their products. Increasing the number of unconstrained taggers may yield results superior to those of groups of trained experts. This can be seen as an application of the ideas noted previously about the benefits of diversity for group problem solving, closer to James Surowiecki's ideas than to those of Scott Page and Lu Hong. But there's a problem in that there is no right answer to discover. The technique is doomed to a meandering, unending drift over time and across cultures. In the extreme limit case all pages will be indexed by all tags and each tag will return the whole database. The more users there are and the older the tag, the more closely the limit is approached.
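The limit case can be shown with a toy tag index. This is only a sketch in Python: the pages, the tags, and the `coverage` measure (the fraction of the collection a tag returns) are assumptions made up for the example, not measurements from any tagging site.

```python
from collections import defaultdict


def build_index(taggings):
    """Map each tag to the set of pages it has been applied to."""
    index = defaultdict(set)
    for page, tag in taggings:
        index[tag].add(page)
    return index


def coverage(index, tag, total_pages):
    """Fraction of the whole collection that a search on `tag` returns."""
    return len(index.get(tag, ())) / total_pages


PAGES = ["p1", "p2", "p3", "p4"]

# A few early taggers use "music" narrowly.
early = [("p1", "music"), ("p2", "photos")]

# Later taggers, with no shared vocabulary, stretch "music" over more pages.
later = early + [("p2", "music"), ("p3", "music"), ("p4", "music"), ("p1", "music")]

print(coverage(build_index(early), "music", len(PAGES)))  # 0.25
print(coverage(build_index(later), "music", len(PAGES)))  # 1.0
```

Once coverage reaches 1.0 the tag filters nothing: it returns the whole collection, which is the limit described above.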
A more useful understanding of tags and tagging systems may be as fashion rather than substance. Tagging is in effect another type of reputation system that harks back to the citation indexes of the academic community. And, coming full circle, this is the basis of Google's ranking algorithm for full searches. All the data is returned, but it is presented in order of significance as determined by reputation. In other words, there isn't any real difference between such searches and tagging schemes, except when the group of taggers is small and homogeneous. And so they will proliferate. The nice thing about metalanguages is that there are so many of them. Standards are like that.
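As a rough illustration of "all the data is returned, but in order of reputation", the Python sketch below orders matches by a simple count of inbound citations. This is a deliberately crude stand-in and not Google's actual algorithm; the link graph and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it cites.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["c", "b"],
}


def rank_by_citations(matches, links):
    """Order matching pages by how many other pages cite them (most cited first)."""
    cited = Counter(
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # ignore self-citation
    )
    # Ties are broken alphabetically so the ordering is stable.
    return sorted(matches, key=lambda page: (-cited[page], page))


print(rank_by_citations(["a", "b", "c", "d"], LINKS))  # ['c', 'b', 'a', 'd']
```

Nothing is filtered out here; every match comes back, and reputation only decides what the reader sees first.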
However, non-text information is still unwieldy without tags. We haven't yet developed good image and audio search systems. Robots still don't see very well. They have no trouble detecting images; they just don't know what they mean. That is changing. There have been some recent developments in CAD software that allow designers to find previous 2D and 3D drawings by specifying an example drawing. The matches are made on content, not description, just as a patient and knowledgeable human would make them. It is reasonable to expect that image and audio searches won't be too far behind. I can imagine using my search engine to find a tune by humming a few bars, or sketching a crude image to search for a photo. Of course, the better my humming and sketching, the better the results will be. We still need DWIM and FWIW modules. We used to call them secretaries, then research librarians, and in future they may still be called that, but they will be software agents that get to know us well enough to turn our nebulous yearnings into well-formed queries. These personal assistants may consult one another and in effect have their own social groups, but humans will be best served by software agents that have gotten to know them. In a sense these agents will be digitized and augmented versions of their owners' identities: cyber clones.
Update
See Tags & Folksonomies - What are they, and why should you care? for more about tags and links to even more. The earlier post All The Way took issue with an extreme view of the significance of metalanguages that saw the basis of a "global mind".
If there is a metalanguage there will be metalanguages. Anything that happens once will happen repeatedly. Anything that happens repeatedly will happen variably. The model we should be using is the ecosphere not a single human mind. The fundamental rule of ecology is diversity - if there is more than one way to make a living there will be many ways livings are made. Anything that works will change. Nothing is constant, the field and the players are in flux, there is no equilibrium, no stasis, no durability. There is resilience.
There is also deceit. Anything that can be built can be hacked. Anything that has value will be counterfeited. Every trust will be broken, every identity will be spoofed. Every entity will have predators. Everything will get its fair share of abuse.