Taxonomies…thesauri…classification systems…synonym rings. We’ve heard all of these terms in the context of the Web. As Web sites expand, the task of organizing them has become increasingly problematic and complex.
All of the terms mentioned above are controlled vocabularies. That means that they are organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.
Unfortunately, a great deal of disagreement exists as to the individual definitions of each of the terms I mentioned; we spend too much of our valuable time misunderstanding each other. The terminology is in flux; hopefully, at some point, specific definitions will be standardized.
Until then, we need to focus on the possibilities of various types and levels of controlled vocabulary. Frankly, the specific terminology isn’t nearly as important as knowing what each example can accomplish, and be used most effectively.
I have three purposes in this introduction to the world of controlled vocabularies:
Where Do Controlled Vocabularies Fit Into the Information Architecture of a Web site?
There are two basic places in information architecture where the form of labels and search terms can be controlled and then often organized in some way, usually hierarchically:
There are several options for obtaining labels or terms; which ones you use depend both on what you want to do and what resources you have at your disposal. Basically, your options fall into three categories — buy/borrow, revise, or build from scratch.
There are literally hundreds, if not thousands, of controlled vocabularies floating around, some in electronic format and others still only in print. The vast majority of the vocabularies in existence are probably in-house lists. Unless you a) have some insider knowledge, and b) can convince their owners to share them with you, you probably won’t be able to get your hands on them.
At the end of this column you’ll find a list of useful links, including several to Web sites that collect links to vocabularies that are available in electronic format. While some vocabularies are in the public domain, many are only available as development resources if you either license or buy them.
Although it seems attractive to obtain an existing resource for fee or for free, my experience has been that many of these are more suited to libraries and classical document collections (i.e., books, journal articles, etc.). They are often fairly broad in scope and the terms are therefore not very deep or specific in extremely specialized areas. Of course, there are always exceptions; it pays to look at these to see if there is something that you can use.
In some cases, this type of resource can be better utilized if it is either combined with another, and/or revised. You might take an existing vocabulary and use only the part that interests you. For example, say that you are a PC manufacturer and want to develop a controlled vocabulary for your Web site. You might choose an existing computer science oriented vocabulary, and take only the part that contains the terms relevant to PC engineering. Since these will be more general than you want, you could then add to this base to create a tailored vocabulary for your site. Be sure to consider any copyright issues that might apply to borrowing terms from other sources.
The last option, build your own, is the one I encounter most frequently. Frankly, after you go through all the trouble of extracting just the terms you want from an existing vocabulary and then adding to that structure, it is probably just as easy to start from scratch. In this case, you use all resources at your disposal to create a stock of unstructured terms that you then organize as you wish.
There are lots of sources for word stock, including internal vocabularies and lists of terms, labels and terms from other Web sites related in subject matter to yours, and the previously mentioned electronic controlled vocabularies. The copyright caveat applies here as well.
Sometimes, the only option if you are in a new field and there are no other resources available, is to use your existing content items from which to choose your terms. This is the most labor-intensive option, and in my view, often the last resort.
I Have My Terms, Now What Do I Do?
It depends on your goal. You are either developing a navigation scheme with labels or a search vocabulary, or sometimes both. The key is to decide how much “control” you want, and then to take the steps to achieve that. There are basically three levels of control. From simplest to most complex, they are:
In developing a navigation scheme, the first option, synonym control, means that you will choose the best, most consistently clear and unambiguous labels available for the content to which the user will navigate. There are many guidelines you can use, but basically you should use labels that say exactly what you want in as few words as possible. The labels that you determine will then need to be applied consistently across pages in the site.
If you are developing a set of equivalent terms for searching, you are often creating what is also called synonym rings. This involves going through your word stock and deciding what terms should be considered interchangeable when searching. For example, the terms “zucchini” and “courgettes” can be used interchangeably in a database for searching recipes. If you have the software capability, you can store these two terms as a unit. A search using either term will then retrieve all documents tagged as designated.
Synonym rings are often used to create some degree of control over a set of content items tagged using their “natural language.” By natural language, also called “keyword searching,” we refer to the situation where you use the uncontrolled language of documents for indexing, and then search using your own unrestricted vocabulary. Control is achieved by deciding which terms in the text of the content should be made equivalent and achieving this through the search interface, rather than by tagging content with a search vocabulary.
Synonym control is also the first step in creating a controlled vocabulary for tagging content items. In this case, the entry in the vocabulary would be:
Courgettes Use Zucchini
This is the first step in building a thesaurus, the most complete and complex of the vocabularies discussed here. However, the important point is that you can stop here if your goal is just to achieve some control over synonymous terms. Controlling synonyms can be very effective in minimizing the variants introduced by natural language.
The next level of organization of your terms is to arrange them in some way. Usually, a hierarchy is used. This level of organization is generally what people are referring to when they talk use the term taxonomy.
Arranging labels in a navigation scheme results in a navigation scheme that is hierarchical, although other ways of navigating across hierarchies can and should be introduced.
When you are building a search thesaurus, you need to create a list of search terms that are now arranged in what is called a BT-NT, or “broader term/narrower term” relationship. The advantage here is that when users can see this arrangement, they can select terms that are as specific as they want.
For example, to continue with the example above:
VEGETABLES SQUASHES ZUCCHINI
This path could be followed either as the navigation hierarchy from top to bottom, or stored as a “tree” that users can view and choose from while searching.
Hierarchies are probably the most ubiquitous organization structures on the Web, and users find them very intuitive. They also serve the important retrieval function of showing the relationships among content items according to their specificity or generality.
What you are really doing here is classifying your terms. Of course, this will probably not be as rigorous as a true classification scheme or taxonomy; it doesn’t need to be. But arranging terms according to specificity from top to bottom is quite useful in many contexts
Finally, it is also possible to relate terms across hierarchies. This is called the associative or related term relationship; it represents the most complex level of control. For example, VEGETABLES can be considered the top term in one hierarchy, while another hierarchy in our database of recipes might be MAIN COURSES. In this hierarchy, you might have the following:
MAIN COURSES VEGETARIAN DISHES ZUCCHINI PARMESAN
In a navigation scheme, it is often a good idea to relate these kinds of labels by placing links on the pages to guide the user to the related content.
In a thesaurus, this information is stored as another explicit relationship, the RT, or “related term” relationship. When a searcher inputs a particular term, the display would show all the terms related hierarchically and associatively to that term, as in the following:
Search Term: Zucchini parmesan
The content displayed would be that tagged with the search term. The display would also show the UF, or “used for” term, and the BT (broader term) for the user’s consideration in modifying the search:
ZUCCHINI PARMESAN UF Parmesan of courgettes BT VEGETARIAN DISHES RT ZUCCHINI
Summary & General Advice
One of my main goals in this discussion has been to take the emphasis off the actual terms used and to discuss as clearly as possible the options for various levels of control. Basically, the level of control you choose depends both on what you want to do and what resources you have.
In general, I give the following advice:
Thesaurus Bibliographies, Lists of Online Thesauri, and Directions to Thesaurus Management Software
Brown University. “Thesauri/Glossaries.”
Queensland University of Technology. “Controlled Vocabularies.”
Queensland University of Technology. “Taxonomy, Classification and Metadata Resources.”
University of Massachusetts — Amherst Libraries. “Classification, Indexing, Metadata and Thesauri.”
Willpower. “Software for Building & Editing Thesauri.”
Amy J. Warner, Ph.D.