Table of Contents
Introduction
Irrelevant Information and Infoglut
Context
Ambiguity
Browsing vs. Searching
Report Scope
Evolution of Taxonomy Technologies
Is “Taxonomy” a Misnomer?
Using Categories in the Search Process
Browsing Process
Taxonomy and Search
Benefits of Taxonomy
Taxonomy Software Integration
Market Survey
Survey Results
“Infoglut” and Knowledge Worker Efficiency
Current Software and Manual Systems Not Adequate
Enterprise Organizations Demonstrate Interest
Definition of Terms
Taxonomy Market Landscape
Stages of Taxonomy
Vendor Assessment Reports:
Autonomy
Convera
Entopia
Mohomine
Quiver
Semio
Stratify
Textology
TopicalNet
Verity
Wherewithal
Controversies & Pitfalls
Manual vs. Automatic
Maintenance and Dynamic Information
Librarians
Directory Building vs. Hierarchical Categories
Granularity of the Taxonomy Structure
Users Needs and Personalized Taxonomies
Speed, Accuracy, Robustness and Scalability
Future trends
End Note
Vendor Contact Information

Do You Know What You Know?

Introduction

Cataloguing unstructured information is a chronic problem that, if not adequately addressed, can be terminal for your organization. Today we have many easy-to-use and accessible tools to create and publish information electronically. Examples are the ubiquitous suite of Microsoft Office products like Word and PowerPoint, Adobe Portable Document Format (PDF) files, Web pages (HTML files), e-mail, news feeds, and the like.

Lack of information is no longer the problem, but lack of time to correlate, categorize, analyze and act on the information is a crucial problem. The information is there, hidden in reports and e-mails and published on the corporate Web site. We are placed in the position of being unable to find applicable and pertinent information to make timely business decisions. This comes at a time when the agility to make fast, informed decisions is increasingly critical to survival and prosperity. As the volume of opportunities increases, the time available to act on each opportunity decreases. The information-based economy is in danger of drowning in a sea of irrelevant, unstructured data.

A new segment of software has emerged to help with the task of combating “infoglut.” For example, there is software that enhances the performance of search engines, text mining, natural language search
Delphi Group Research
“Invisible” refers to dynamically generated Web content, such as catalog pages and the Wall Street Journal archives.
Invisible Web, Chris Sherman
UC Berkeley
UC Berkeley SIMS, How Much Information?
Browsing is an interactive process. As you navigate a well-designed interface to information, you will automatically be directed to other relevant topics. If you search and browse through information about categorization software, for example, you will find reviews, analysis, white papers and commentaries with information about other technologies, companies or related topics that may be worth investigating.

Browsing is an iterative process. Repeating the process refines your focus while broadening your knowledge. Accessing relevant information and interrelated ideas and concepts supports a fundamental change in your activity: from simply searching, to finding and discovering.

Taxonomy and Search

One purpose of taxonomy is to aid in the retrieval of relevant information. An intrinsic benefit of the hierarchical structure of categorization is that links and summaries of information are rendered in the context of their unique “parent-child” relationships. Relevant information is more likely to be found when specific content filters are employed. For example, if we had had a general category like “computers” in our search for “chips,” we would not have wasted any time with false returns from the “recipes” category.

Benefits of Taxonomy

Finding relevant information more quickly is the key benefit, especially when it provides immediate access to the right information that allows the user to take effective actions. Equipping enterprise knowledge workers with the tools to make faster and better-informed decisions is a strategic imperative in today’s economy. Jakob Nielsen, the guru of usability, estimates that poor classification costs a 10,000-user organization $10M annually.

To paraphrase a quote, “to search via a computer without a taxonomy system is like trying to find your way around an unfamiliar country without a map.” Taxonomy helps delineate the conceptual relationships that exist within and between the various topics contained in the multitude of unstructured data within enterprise documents.

The benefits are:
• Discovering information you didn’t know you had
• Avoiding duplicate efforts within large organizations where independent groups “reinvent the wheel” over and over again
• Not repeating the same mistake
• Preparing better reports, since the author really expects to be read.5
• Providing an overview as well as details about a subject
• Demonstrating relationships
• Reducing complexity

Taxonomy Software Integration With Other Applications

Taxonomy can impact many aspects of your organization. As organizations implement various software solutions to manage their knowledge assets, taxonomy can dramatically increase the effectiveness of such solutions. All the software features in the world won’t matter if they don’t facilitate “just-in-time” knowledge retrieval. Software applications such as portals, content management systems, knowledge management systems, search and retrieval software, personalization software, data extraction, and data mining can all benefit from taxonomy. Many taxonomy solutions are sold with Application Programming Interfaces to integrate into these existing applications.
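The “chips” example under Taxonomy and Search can be made concrete with a small sketch. It is only an illustration of restricting a keyword search to one branch of a category hierarchy; the `Document` class, the sample documents and the `search` function are hypothetical and do not come from any vendor product discussed in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    category: tuple  # path from the taxonomy root, e.g. ("computers", "hardware")
    text: str


# Hypothetical sample corpus, invented for this illustration.
DOCS = [
    Document("Choosing memory chips", ("computers", "hardware"),
             "DRAM chips and processor chips compared"),
    Document("Baked tortilla chips", ("recipes", "snacks"),
             "Crisp chips with salsa"),
    Document("Chip fabrication trends", ("computers", "manufacturing"),
             "How silicon chips are made"),
]


def search(docs, keyword, under=None):
    """Return titles of documents containing `keyword`.

    If `under` is given, only documents whose category path starts with
    that prefix (a parent category) are considered.
    """
    keyword = keyword.lower()
    hits = []
    for doc in docs:
        if keyword not in doc.text.lower():
            continue
        if under is not None and doc.category[:len(under)] != tuple(under):
            continue
        hits.append(doc.title)
    return hits


if __name__ == "__main__":
    print(search(DOCS, "chips"))                          # unfiltered search
    print(search(DOCS, "chips", under=("computers",)))    # filtered by parent category
```

Without the filter, the recipe is returned alongside the hardware documents; with the “computers” parent category supplied, only documents in that branch remain.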
There is a product marketing manager who places a peach cobbler recipe in his market research reports. He places the recipe in a relatively arcane section of the report. At the end of the recipe, he offers a bottle of Dom Pérignon champagne to whoever finds the recipe. In his career, producing many reports, he has never had to buy the champagne.
Market Survey: Results and Analysis

During the first week of February 2002, Delphi conducted an extensive survey of approximately 450 end user organizations on the subject of categorization and taxonomy management. Several dozen questions were asked regarding the business issues surrounding the evaluation, planning and implementation of taxonomy technology.

The objectives of the survey include:

• Validate the extent of the unstructured data problem faced by knowledge workers in today’s organization
• Determine the relative importance of the business issues surrounding the retrieval of information from unstructured data sources
• Understand the scope of the problem and the perceived impediments associated with job performance and unstructured data
• Confirm the characteristics of unstructured information sources in terms of size, volatility, and language
• Verify if there are classification processes and policies in place today
• Ascertain if there are taxonomy software projects underway or pending, their relative importance, and the proposed budgets for implementation and maintenance of taxonomy
• Find out who will be responsible for defining and then maintaining the taxonomy software
• Clarify how the taxonomy software should be configured and deployed
• Discover to what extent the market recognizes the leading providers of taxonomy software

Survey Summary
Enterprises know they have a serious and growing problem with unstructured data, and that the problem is dramatically impacting their ability to make rapid and effective business decisions. Current systems are not adequate. Organizations planning on developing a taxonomy strategy remain unsure about how to do so. This presents a significant opportunity for the technology providers seeking to fill this need.

Profile of Respondents
The survey of 450-plus respondents represents a fair sampling of enterprise organizations, with over half the organizations having revenues of over $100 million. 73% are located in North America. The respondents were either executives, IT, or LOB or had project management responsibilities. Respondents’ role in determining taxonomy software was primarily as a sponsor or project lead, or involved in defining need or specifications.

Survey Methodology
Individuals identified by Delphi’s analyst team were contacted directly and asked to answer a series of structured survey questions. The survey format was primarily multiple-choice, with either single or multiple answers possible depending on the question. Respondents were also provided the opportunity to volunteer more detailed and otherwise restricted answers (i.e., “fill-in-the-blank”), which are quoted at various points throughout this report. The results from the survey are tabulated and graphed in the sections to follow. Listed below are the profiles of the respondents, the questions, the results and, finally, Delphi’s analysis of this market survey.

Survey Limitations and Risks
The survey resulted in over 450 respondents. While this number is sufficient to develop quantitative and qualitative trends, variations may be found within individual deployments or taxonomy initiatives. While respondents represent a valid cross-section of enterprise-class organizations, the population’s preestablished interest in this or similar technology may distinguish this group as more knowledgeable and aware than a similar group of randomly selected enterprise
processing technique based on the way biological nervous systems, such as the brain, process information. Composed of a large number of highly interconnected processing elements, a neural network system uses the human-like technique of learning by example to resolve problems. The neural network is configured for a specific application, such as data classification or pattern recognition, through a learning process called “training.”

Spiders
Spiders are automated processes used to feed pages to data extraction and parsing engines. It’s called a spider because it “crawls” over the data. Another term for these programs is crawler.

There are a number of algorithms that technology vendors customize, optimize, combine and patent in order to categorize digital documents. For a variety of reasons each vendor has chosen a particular algorithm method or combination of methods. This is a list of each of the methods that are discussed in more detail in the following sections:

• Rules-based
• Bayesian
• Linguistic and Semantic
• Support Vector Machine
• Pattern Matching and Other Statistical
included in a particular category. Thus rules are a powerful and flexible means for automatically classifying content based on not just content itself but the metadata that describes the content’s business context.7 The downside of a rule-based system is that expensive human domain experts have to write and maintain the rules. Other examples of rules are source of document, age, size and document type.

a) Statistical Text Analysis and Clustering

This technology observes and measures co-occurrences of words. For example, “Java” used in connection with Starbucks probably relates to a document about coffee instead of a programming language. Relative placement of words is also important. Words in the first lines of a document are likely more important than information contained in the copyright section. Statistical analysis and clustering also look for word frequency, placement and grouping, as well as the distance between words in a document. Pattern analysis improves precision by resolving ambiguous or multiple meanings.

b) Bayesian Probability

The Bayesian approach attempts to learn the probabilities of words for a given category. An example of Bayesian probability applied is that if a given document contains the words “apples” and “oranges” it is more than likely this document is about fruit, which leads to the assumption that other fruit nouns such as “grapes” or “tangerines” will occur.

Applying a Bayesian algorithm sorts documents by examining the terms, words and phrases contained therein. Bayesian probability uses statistical models from words in training sets, and uses pattern analysis to assign the probability of correlation. This is one of the more common methods applied to building categories and taxonomy structures.

c) Semantic and Linguistic Clustering

Semantic analysis depends on a particular language and dialect. Documents are clustered or grouped depending on the meaning of words using thesauri, custom dictionaries (e.g. a dictionary of abbreviations), parts-of-speech analyzers, rule-based and probabilistic grammar, recognition of idioms, verb chain recognition, and noun phrase identifiers (e.g. “business unit manager”). Linguistic software also analyzes the structure of the sentences, identifying the subject, verbs and objects, like you did when you first studied grammar in grade school. Then sentence structure analysis is applied to extract the meaning. Stemming, or reducing a word to its root, helps linguistic or semantic clustering.

d) Support Vector Machine

Support Vector Machine (SVM) is a refinement of taxonomy-by-example. These algorithms are derived from statistical learning theory. SVMs calculate the maximum “separation,” in multiple dimensions, of one document from another. Each document (essentially a collection of words and phrases that together have meaning) can be represented as a vector. The direction of the vector is determined by the words (dimensions) it spans. The magnitude of the vector is determined by how many times each word occurs in the document (distance traveled in each dimension).8 As this iterative method continuously analyzes documents, it separates them into either the “relevant” side or the “irrelevant” space. By repeating the process it categorizes those documents that are “relevant” into like categories, but more importantly learns how they are different.

Combining Methodologies

Of course, no single taxonomy methodology, algorithm, or technology is superior to another for every possible application. The trend among more and more taxonomy software companies is to combine multiple methods to categorize the corpus of documents to increase the accuracy and the relevancy of grouping similar documents.
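As a rough illustration of the Bayesian approach described in section b), the sketch below trains a small multinomial naive Bayes classifier on word counts and then assigns a new document to the most probable category. It assumes add-one (Laplace) smoothing and whitespace tokenization; the training sentences and the function names `train` and `classify` are invented for this example and do not represent any vendor's implementation.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category from (category, text) pairs."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    doc_counts = Counter()               # category -> number of training documents
    vocab = set()
    for category, text in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the category with the highest log-probability for `text`."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        # Prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (word, category) pairs from zeroing the score.
            likelihood = (word_counts[category][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical training set, invented for this illustration.
TRAINING = [
    ("fruit", "apples oranges grapes tangerines orchard harvest"),
    ("fruit", "fresh apples and ripe oranges at the market"),
    ("computers", "java programming language compiler software"),
    ("computers", "memory chips processor software hardware"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(classify(model, "a crate of apples and oranges"))
    print(classify(model, "java compiler software"))
```

A production categorizer would use far larger training sets and better tokenization, but the principle is the same: word probabilities learned per category determine where a new document is filed.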
Mohomine White Papers
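The vector representation described for Support Vector Machines can be sketched in a few lines. The example below covers only the representation, with each distinct word as a dimension and its count as the distance along that dimension, and it compares documents by cosine similarity. It does not train an SVM, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as word -> occurrence count (one dimension per word)."""
    return Counter(text.lower().split())


def magnitude(vec):
    """Euclidean length of the vector, driven by how often each word occurs."""
    return math.sqrt(sum(count * count for count in vec.values()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 when they share no words)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = magnitude(a) * magnitude(b)
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    coffee = term_vector("coffee coffee beans")
    roast = term_vector("coffee beans roast")
    code = term_vector("java compiler")
    print(cosine_similarity(coffee, roast))  # documents sharing words point in similar directions
    print(cosine_similarity(coffee, code))   # documents with no shared words are orthogonal
```

An SVM works on vectors like these, searching for the boundary that best separates the “relevant” examples from the “irrelevant” ones.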
API or Standalone Application

As with many new technologies, a subset of the vendors have developed standalone applications that come complete with end-user browser-based clients and MS Windows-based clients. Users and/or administrators can point the software at the body of documents to be classified on hard disks, servers, intranet sites, portals and Web sites. The taxonomy engines residing on servers perform the categorization process and usually populate a database of metadata.

Taxonomy software can be an enabling technology. It aids and enhances other applications that the user interacts with through a GUI. Most of the vendors support this concept and offer an API to integrate into other applications. Since many enterprise applications are custom-built, this is an important consideration.

A number of vendors view this technology as eventually being an even more fundamental component of the information infrastructure. Just as relational databases are a fundamental infrastructure component of applications such as accounting, CRM, and other enterprise applications, taxonomy software will be the infrastructure component that correlates unstructured data. This design philosophy positions taxonomy software as a core module in a suite of products that work on the unstructured data within an organization.

Another series of vendors mentioned in this report have added and tightly integrated the functionality of taxonomy into the search and retrieve application. Here are various ways we can group them:

The companies reviewed here run the gamut from Mohomine, which markets just a development framework for the OEM market, to Quiver, which markets an application suite as a complete end-to-end solution for the enterprise. Most companies offer both approaches and, depending on their orientation, emphasize either a strongly developed GUI as an application or a more extensive API for integration.

Company and Product Profiles
The next section details the companies that participated in this project with Delphi Group. The company sections are organized in the following way:

• Introduction – short description of the company and its origins
• Technology approach – a basic description of the underlying technology of the software
• Product – a description of the functions and features of the taxonomy product
• Vision or customer study – a statement by the executives of the technology company relative to where they see the market heading or an example of a taxonomy implementation
• Assessment – a short evaluation of the technology and product approach by the Delphi Group

The companies listed in this section are arranged alphabetically based on the name of the organization. This arrangement does not imply a ranking or endorsement by Delphi Group.
Convera’s categorization solution provides a modular platform for secure organization and search of rapidly changing information across repositories, across languages and across media/file types.

Convera’s modular architectural approach gives IT departments efficiency in deploying and managing a categorization project, while their cross-lingual, concept-based categorization solution allows knowledge workers to find the information they need, as this information is dynamically organized by how they conceptualize the knowledge.

Convera has recognized the amount of confusion in the marketplace that surrounds the topic of information organization and has responded by bringing to market a solution that includes not only search and categorization engines but also a categorization personalization feature, domain-specific taxonomies, and related domain-specific semantic networks for most languages. Convera’s objective is to reduce the customer’s confusion and risk by providing a single stop for both information organization and search solutions.

Convera’s basic technology includes the following items:
• A security model that supports secure access to those disparate repositories
• Multilingual support through a plug-in architecture for categorization across languages
• A component architecture
• Software development kits (SDKs) and application programming interfaces (APIs) that allow customization and integration of categorization and search

Concept-Based Categorization

Concept-based categorization accounts for the fact that there may be several different words that can be used to express essentially the same concept. It is often the case that industry-specific jargon is developed to describe more precisely objects, concepts or processes that might otherwise be expressed through the use of simple layman’s terms. Concept-based categorization enables the controlled expansion of terms within a category rule set; e.g., a search for “international commerce” will find documents that contain terms such as foreign trade, import, export, global mercantilism and free trade. By using a concept-based rules engine, the administrator can take advantage of subject matter expertise to support a more accurate rule set that is automatically updated and improved with subsequent versions of the semantic network. In other words, a standard taxonomy can be imported and then modified by subject matter experts to match your organization’s needs.

Concept search also does what we naturally do in conversations with each other: it clarifies the meaning of query words through analysis of surrounding words (e.g., the word tank when surrounded by words such as military and vehicle is more likely to be a fighting vehicle and less likely to be a container for holding fuel).

The core support behind concept-based categorization is the standard RetrievalWare Semantic Network, a collection of approximately 500,000 English words that expands to over 1.6 million semantic relationships and idioms that are organized by concept. Although a program can use surrounding words and linguistic analysis to guess at the meaning of a term in a document or query, it is important to provide a category administrator the choice to select the meaning of a term. RetrievalWare’s PowerSearch feature allows users to control word expansion of query terms in the Semantic Network, which means they choose which, and how many, related terms to include in the category rules. Users can also choose specific meanings for their query terms. For example, when using the term for banks in a category’s rule set, an administrator can easily instruct RetrievalWare to use only those other meanings associated with financial institutions and not those dealing with the bank of a river or a turning aircraft. In addition it is possible to select individual terms within a meaning to sharpen the rules even more.

Support of Languages and Subject Domains

RetrievalWare includes modular support for more than 25 languages and many domain-specific semantic networks through a plug-in architecture. This feature acts like a multilingual subject expert that not only understands the concepts and terms in a subject area, but also knows how they are used across languages.

Concept-based categorization that takes advantage of cross-lingual search and domain-specific semantic networks reduces the need for authors or other people to tag the documents because this feature automatically provides these relationships. For example, if a category rule uses the term “central” meaning in or near a middle position, not only could documents containing middle, medial, midway and other relevant English terms be included, but it could also include documents in other languages using terms such as middelpunt (Dutch); milieu (French); Mittelpunkt (German); medio (Spanish) and mezzo (Italian). If the medical domain-specific term bronchus was used in a category rule, other English terms such as windpipe, respiratory tract and bronchial tubes could automatically be included in the rules, along with terms from other languages such as bovenlip (Dutch); arrière-gorge (French); Atemwege (German); bocado de Adán (Spanish) and bocca
©2002 Delphi Group • Ten Post Office Square, Boston, MA 02109-4603 • v(617) 247-1511 •
Reproduction authorization required.
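The controlled term expansion described under Concept-Based Categorization can be illustrated with a toy example. This is a sketch under stated assumptions, not RetrievalWare’s API. The miniature `SEMANTIC_NETWORK` dictionary, its sense labels and related terms, and the functions `expand_term` and `matches` are all hypothetical. The matching is a naive substring test, so "bank" would also match "riverbank".

```python
# Hypothetical miniature semantic network: term -> sense label -> related terms.
SEMANTIC_NETWORK = {
    "bank": {
        "financial institution": ["lender", "savings bank", "credit union"],
        "river edge": ["riverbank", "shore", "embankment"],
        "aircraft turn": ["roll", "tilt"],
    },
    "international commerce": {
        "trade": ["foreign trade", "import", "export", "free trade"],
    },
}


def expand_term(term, senses=None, limit=None):
    """Expand a rule term with related terms.

    `senses` restricts expansion to the chosen meanings (None means all);
    `limit` caps how many related terms are taken per meaning (None means all).
    """
    expanded = [term]
    for sense, related in SEMANTIC_NETWORK.get(term, {}).items():
        if senses is not None and sense not in senses:
            continue
        expanded.extend(related[:limit])
    return expanded


def matches(rule_terms, text):
    """True if any rule term occurs in the text (naive substring match)."""
    text = text.lower()
    return any(term in text for term in rule_terms)


if __name__ == "__main__":
    finance_rule = expand_term("bank", senses={"financial institution"})
    print(finance_rule)
    print(matches(finance_rule, "The credit union raised rates"))
    print(matches(finance_rule, "We walked along the shore"))
```

Restricting the expansion to one sense is what keeps river and aviation documents out of a financial category, while `limit` corresponds to choosing how many related terms to include.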
Vision
To efficiently provide users with knowledge appropriate for their needs and roles, based on all available information, regardless of location, language or form, and through a system that interacts naturally with the user at his, her or its convenience, anticipating the user’s needs when appropriate and desired.

• Provision of domain-specific taxonomies and semantic networks
• Support for multiple languages in categorization and in search
• Robust security provisions that recognize and respect access restrictions associated with particular repositories and/or documents
• Support for Multimedia information such as image, audio and video files

Assessment
Convera has long been a pioneer in this market and their products reflect the upcoming requirements of the enterprise as the next generation of search and categorization software emerges.

The latest release of their product incorporates many of the future trends Delphi Group sees emerging in this market segment such as:

• Provision of domain-specific taxonomies and semantic networks
• Support for multiple languages in categorization and in search
• Security provisions that recognize and respect access restrictions associated with particular repositories and/or documents.
• Support for Multimedia information such as image, audio and video files
As contributors add information, the Smart Classification feature suggests “best fit” destinations, based upon the first-stage semantic analysis. The visualization of these personalized taxonomies will be made possible through a Virtual Classification Structure. It allows a single user to personalize his virtual overlay (based on the invisible taxonomy) upon a corpus of documents which may reside in the different

Products
Entopia’s Quantum is a software suite designed to help enterprise organizations improve individual productivity, manage information and facilitate the transfer of knowledge among individuals and workgroups. Entopia characterizes its software by features that fall into three modules: Collect, Collaborate, and Capitalize.

Collect
The Collect module is the front end of the suite, accessed through a browser-based “thin client” or Windows desktop client. Entopia Quantum allows users to gather and save information, from any digital source, into the visible taxonomy. These documents can be MS Office files, PDF files, Web pages, e-mails, and many more file types. Through collection, metadata is automatically generated for indexing and retrieving information and added to the invisible taxonomy.

The result is a Quantum File, or Q-File, that contains:
• The content
• Enrichments (highlights, annotations) and comments
• The semantic profile of the content and its enrichments
• Contextual metadata

Collaborate
This module is a server infrastructure design that enables users within the same organization to share information and collaborate with each other through the workgroup zone in the visible taxonomy. A host of collaboration features are offered, including voice and text anchored comments and threaded discussion, e-mail notification, document check-in / check-out, and HTML publishing. In the process of collaboration, metadata is also added to the enterprise knowledge base. This collaboration module encourages the synthesis between concepts and the relationships between people that uncover tacit and explicit knowledge within the

Capitalize
The third module comprises tools to capitalize on the accumulated knowledge base initially created during the Quantum Collection and Collaboration processes. These tools include Smart Classification, the Virtual Classification System (described above), auto-summarization, and a next-generation search engine, the Knowledge Locator.

The Knowledge Locator is a search solution that allows users to find not only documents but also experts and sources of knowledge within the enterprise. Aware that different goals require different search tools, the Knowledge Locator utilizes both a familiar full-text keyword or phrase search, and a more advanced “semantic search.” To track down documents containing specific terms, only a full-text search is adequate. For “fuzzier,” more conceptual searches, one needs semantic search that yields results “about” or “around” the query. This also reduces the noise by listing only very relevant documents.
differentiator for Quiver is their modular design, which has three components:

1. Categorization engine discussed in the previous section
2. Directory Management Toolset (DMT™)
3. Output and Display Interface

From the Directory Management Toolset module, via an intuitive single panel:

• Document locations, which contain the documents to be categorized, are chosen.
• Training sets are developed and run, teaching the classifier how documents are related.
• Domain experts can then be assigned to review documents based on qualification criteria.
• Administrators can assign rights and permissions and set limits and hours for classification.

This tool is very flexible, allowing multiple individuals to approve the classification of the document with optional administrative control for managing the people and the process. Following the MS Outlook GUI metaphor, it presents a hierarchical structure of folders, sub-folders and files on the left, which can be expanded and contracted with simple mouse clicks. The documents for review are listed in a window to the right, with metadata information managed in the panel directly below the main window. With administrator rights, users can assign documents for review by your domain experts. These are the people in your organization who understand their area of expertise – be it budgets, marketing information or technical support procedures. The toolset also allows administrators to filter out documents on the basis of age, size, source and document type.

Targeting enterprise applications such as Search, CRM, content management and ERP systems, Quiver APIs deliver standard XML output to portals, intranets, extranets and other enterprise applications. This output can be either an XML output, standalone or integrated directory view.

Drawing on a deep understanding of information classification, ranking and retrieval products, Quiver’s professional services group works closely with each client to identify project objectives, advise and set an appropriate timeline and goals accordingly. They also assist in deployment and integration and ensure long-term, flexible and reliable customized solutions that fit your specific business needs. Quiver services range from taxonomy consulting to UI customization, custom reporting, training and systems integration.

Vision

Quiver’s unique approach augments advanced auto-categorization technology with an intuitive directory and workflow management toolset to provide clear visibility into classification decisions and maximum control of the end-user experience. Quiver’s vision is to be the provider of information management solutions that maximize enterprise leverage of corporate information assets. Quiver’s view is that technology alone cannot replace the accuracy of human expertise and contextual analysis. Quiver combines the efficiency of technology with the accuracy of human oversight to deliver optimal knowledge solutions for intuitive and comprehensive access to enterprise information. Whether Quiver is deployed standalone or as a part of an employee portal, a customer extranet, or a public website, an organized view of content is crucial to successful information sharing and productivity improvement.

Assessment
To make taxonomy useful for your constituents it must apply their view of the world. A highly interactive application like Quiver assures those who want control over the creation and maintenance of a taxonomy system that it can be easily customized for your users. This approach assumes that the accuracy and relevancy of the categorization is very important to your organization. If close is good enough, then Quiver’s approach may be overkill.

Quiver products have a solid methodology for developing a useful taxonomy with as much or as little control as your users and management want. If what you are looking for is an application to develop and maintain your taxonomy with a distributed group of domain experts within your organization, then Quiver is a choice to consider.
• SemioSkyline™ - a viewer designed to add browsing and searching capabilities as well as information about the document. This can be customized to reflect your organization’s look and feel.

• SemioTagger™ - the categorization engine that supports over 200 file formats including Lotus Notes and Documentum and runs on Windows 2000 SP1 server with output to either XML, Microsoft SQL or SemioSkyline. This component contains the Knowledge Engineering workbench, which is a module for editing the rules that govern the linguistic categorization process. Other modules are crawlers, converters, categorizers, exporters and an administration tool.

A key differentiator for Semio is a “Starter File” in the KE Workbench module in the SemioTagger engine. This “Starter File” can contain 2,000 to 10,000 predefined categories based on your organization’s requirements. This “Starter File” answers the basic question of where and how to start the taxonomy process.

The project started in January 2001 and the pilot was initiated in March of 2001. This was an integration with Plumtree to provide a portal for 2500 users distributed across 11 Highways Agency sites. Use of Semio indexing and categorization has stripped away and exposed bad data, highlighting what needs to be done and where. It has uncovered data previously hidden or unavailable for analysis. Semio’s software has reduced the time and cost associated with knowledge delivery and has improved decision support.

Assessment

With over 100 implementations, Semio understands what enterprise customers need and want to turn information into knowledge. Semio realizes that extracting knowledge from information overload is a chronic problem that could be terminal in today’s knowledge-based economy.

Semio has a suite of proven products that have been refined by working with enterprise customers to resolve real business issues involving categorizing unstructured information. Addressing the evolving needs of customers, Semio will be expanding their product’s capabilities to address significant business issues such as security and personalization.
Vendor Assessment Report:
Stratify

Introduction

Stratify is highly attuned to the context and associative meanings of words. The company recently changed its name from PurpleYogi (memorable but not relevant to their business) to Stratify. They did this to reflect the company’s focus on creating order (stratification) out of the chaos of unstructured data.

An aptly named product, the Stratify Discovery System helps customers discover knowledge hidden within unstructured data scattered across the enterprise. Consisting of HTML pages, reports, proposals and e-mails, these text documents are located on file systems, Web servers, databases and content repositories. Stratify sees its products as helping executives make better informed decisions by finding the “whys” (supporting e-mails, reports, news items) and correlating them to the “whats” derived from applications drawing on relational databases.

Stratify believes that a taxonomy of important business topics should be an integral part of every major enterprise application. By creating logical topic hierarchies and accurately classifying enterprise documents into them, Stratify provides business critical information to a variety of enterprise applications. Examples of these types of applications are: Enterprise Search Solutions, Customer Relationship Management, Content and Document Management, Corporate Portals, News and Information Aggregation and ultimately Business Intelligence applications. Access to information in an organized and contextually relevant manner will dramatically enhance the value of those applications for today’s businesses.

Technology Approach

Stratify automatically creates a hierarchy of concepts important to a given business. By applying pattern-matching algorithms to a sample set of documents, Stratify aggregates individual documents into clusters and arranges these clusters into a topic hierarchy. The software continuously adjusts the composition of the clusters, identifying outlying documents and duplicates as it proceeds. Stratify can also import an existing taxonomy or provide a pre-built taxonomy containing more than 15,000 topics to jump-start an implementation.

Any hierarchy can be enhanced using additional rules and keywords to define topics. Software tools allow nontechnical people to edit the taxonomy to meet their specific needs. Users can add documents to training sets and test the results of these changes in real time. The tools help the user diagnose and resolve problems in classification. By applying business rules and different filters, the user can specify which sources of information are more important to them. Users can add, delete and link concepts as they wish since this information is stored in a metadata repository.

Taxonomies evolve with time as topics and relationships change. Unstructured content has implicit meanings that people interpret in different ways depending on the current context and their individual interests. Stratify allows the easy manipulation and re-linking of concepts via the metadata contained in the hierarchy, instead of having to reclassify an entire corpus of documents. For example, as references to Clinton change from president to ex-president, the same documents can now be classified together and yet be found under another category as well. This is a good balance of hands-on approach and automatic functionality.
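The re-linking idea in the Clinton example can be illustrated with a small sketch. It is not Stratify’s implementation: the `ConceptIndex` class, its method names and the category paths are all hypothetical, and the only point shown is that a concept can gain a new category link in metadata while the documents themselves are never reclassified.

```python
from collections import defaultdict


class ConceptIndex:
    """Toy metadata repository (hypothetical): documents -> concepts -> categories."""

    def __init__(self):
        self.doc_concepts = defaultdict(set)        # document id -> concepts
        self.concept_categories = defaultdict(set)  # concept -> category paths

    def tag(self, doc_id, concept):
        """Record that a document mentions a concept (done once, at classification time)."""
        self.doc_concepts[doc_id].add(concept)

    def link(self, concept, category):
        """Attach a concept to a category; no document is touched."""
        self.concept_categories[concept].add(category)

    def documents_in(self, category):
        """Resolve a category to its documents through the concept links."""
        return sorted(
            doc_id
            for doc_id, concepts in self.doc_concepts.items()
            if any(category in self.concept_categories.get(c, ()) for c in concepts)
        )


index = ConceptIndex()
index.tag("speech-1998", "Clinton")
index.tag("memoir-review", "Clinton")
index.link("Clinton", "Politics/Presidents")

# The world changes: add one link in the metadata instead of reclassifying documents.
index.link("Clinton", "Politics/Former Presidents")

print(index.documents_in("Politics/Former Presidents"))
```

Both documents remain reachable under the original category and also appear under the new one, which mirrors the "classified together and yet found under another category" behavior described in the text.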
Consistent with Stratify’s belief that different
categorization algorithms and methodologies
have different strengths and limitations, Stratify
uses a variety of classifier engines to analyze and
categorize documents into appropriate topics.
The Stratify product is designed so that multiple
classification methodologies run in parallel. The
classification analyses are compared and chosen
by using the results of the individual processes
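As a rough illustration of running several classification methods side by side and choosing among their results, the sketch below uses three toy classifiers and a simple majority vote. The source does not say how Stratify compares or selects results, so the voting rule, the classifier functions and the topic names are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


# Three deliberately simplistic, hypothetical classifiers standing in for
# different classification methodologies.
def keyword_classifier(text):
    return "Finance" if "revenue" in text.lower() else "General"


def length_classifier(text):
    return "Finance" if len(text.split()) >= 5 else "General"


def phrase_classifier(text):
    return "Finance" if "quarterly" in text.lower() else "General"


CLASSIFIERS = [keyword_classifier, length_classifier, phrase_classifier]


def classify(text, classifiers=CLASSIFIERS):
    """Run every classifier concurrently, then pick the topic with the most votes."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        votes = list(pool.map(lambda clf: clf(text), classifiers))
    topic, count = Counter(votes).most_common(1)[0]
    return topic, count / len(votes)  # winning topic and its share of the votes


print(classify("Quarterly revenue rose sharply this year"))
print(classify("Revenue up"))
```

Majority voting is only one way to combine results; weighting each classifier by its measured accuracy on a test set would be a natural refinement.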
1. Textology’s approach improves the granularity of taxonomy structures by basing classification on identifying key concepts and the context in which they are used. This provides greater precision than subjects based on statistical aggregations of key words, and allows difficult categories like “explanations of market conditions” to be accurately handled.

Textology was founded as a joint venture between Elbit, Ltd and Assa-or Systems. Assa-or has produced products for the American and Israeli defense and intelligence communities. Elbit makes focused investments in the e-business and m-commerce markets with infrastructure technologies, cellular applications, and value-added services.
• Precise text summaries with unique versions created for each concept the user is interested in

• Dynamically individualized summaries to reduce reading time

• User selectable summary views (list of key sentences, paragraphs, or simply highlighting sentences within document view)

• Granular taxonomy and easy cross-categorization to other taxonomies

• Comprehensive meta-tag information

• Customizable relevant document headlines or titles

• Detailed data extracts that allow data to populate other applications’ data requirements.

This market focus allows performance optimization for targeted applications.

Java-based architecture ensures platform independence and embeddable components for easy integration into OEM products. The technical requirements are:

Windows NT or Solaris.

Typical NT server is a 4 CPU Pentium 4 with 512 MB RAM.

Textology provides connectors to standard text and Web document formats. Textology output of meta-tag data supports standard databases and applications. In addition to training and support, Textology provides professional services to install and set up Textology products, and to integrate them into customer application environments.
Vision

Textology is targeting its solutions to customers where there is a high volume of dynamic, volatile text information and high value associated with extracting detailed concepts from this unstructured text. Delivering innovative business solutions for enhancing productivity and creating business value in text critical environments is Textology’s mission. By focusing on a select group of customers, Textology can leverage their expertise and value to address critical business issues.

Example Application

Most financial newswire services use a dedicated staff to review hundreds of Web pages, press releases, internal documents, and other potential news sources daily. For each identified news item they must read it, classify it for relevancy to one or more channel topics, construct a short text summary and/or news headline, create a list of key words associated with the article, and identify a list of related articles. They then enter this information along with the URL or location identity of the source document into a company specified meta-tag format. Finding related articles is a particularly time-consuming process due to the imprecise process of sifting through the large number of hits returned by entering the identified key words into an internal Web site search engine.

Assessment

How much detail is too much and how much is not enough is a sticky problem to solve. Textology’s approach is to provide a very detailed classification and extraction of concepts and ideas contained within large volumes of unstructured text. Solving the problem of information overload becomes more critical when the volume and volatility of the information is very large. Textology’s solution is for those organizations that need a very detailed, granular taxonomy and the ability to extract information and summarize it to make critical business decisions. Textology solutions require serious consideration.
Vendor Assessment Report:
TopicalNet

Introduction

“Quick start” and automatic taxonomy generation are two descriptors that summarize TopicalNet’s approach. The quick start comes from a pre-built taxonomy of over 1 million categories that is an integral value of the TopicalNet solution. This extensive pre-built taxonomy eliminates the need for developing a set of training documents for the software. For some organizations, trying to find the “right” training documents becomes a circular problem. You can’t find out what you don’t know because you need to use known documents for training the software. Simple installation requirements mean this solution is typically up and running in one day.

Quick start is an essential element to TopicalNet’s ROI argument, but so are features to allow the use of existing taxonomies within an organization. Through its mapping technology, TopicalNet is able to leverage the depth and accuracy of its underlying million plus categories while expressing the classifications using the names and organization of a company’s own taxonomy.

Automatic categorization is based on TopicalNet’s extensive pre-built taxonomy. TopicalNet has over a million categories with an interconnected matrix of topic relationships. The software crawls through your unstructured data, parses the concepts and matches and places them into the pre-built categories or the company’s categories.

Designed to work with multiple types of environments that range from a single desktop machine, to file servers, to an enterprise information system, TopicalNet is quick to deploy and quarterly updates keep it “topical.”

Technological Approach

TopicalNet’s technology uses semantic and syntactic knowledge to classify documents. The system understands that “biology,” “biological,” “bio” and “biologists” are all related. It knows that “textbooks” and “text books” are lexical variants. It also knows that “Linux operating system” and “Linux” are semantic variations.

Each quarter, TopicalNet fetches over 60 million Web pages. Specialized software analyzes these pages to extract knowledge about the relationships between words, phrases and classifications. One of the outputs from this analysis is a taxonomy of about 1.5 million richly-connected categories. Another output is a set of over 1 million semantic relationships between individual words. TopicalNet also finds and analyzes over 2.3 billion distinct word phrases in the document set. This automatically generated base of knowledge allows TopicalNet’s software to classify new documents “out-of-the-box.”

Products

TopicalNet Classifier is the core product. This is an out-of-the-box solution. A major differentiator is that little training for either the users or the system is required. This means quick deployment and quick ROI. The system is very scalable from one extreme of 60 million pages down to the files on a single PC. Currently, the TopicalNet product offers both an API aimed at custom integration into enterprise applications and a packaged user interface for smaller organizations looking for a quick return on investment.

The product consists of several modules:

• Data acquisition to harvest data from many file types and formats such as HTML, Text, MS Office products, presentations and PDF files.

• The Classifier that develops the hierarchical and interrelated categorizations and scores the documents’ relevancy.

• The presentation module that shows a browsable directory structure.
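The idea of classifying against a large pre-built taxonomy while reporting results under a company’s own category names can be sketched as a lookup with ancestor fallback. The source does not describe how TopicalNet’s mapping technology works; the path format, the `map_category` helper and the sample mapping below are assumptions for illustration only.

```python
# Hypothetical mapping from pre-built category paths to a company's own labels.
PREBUILT_TO_COMPANY = {
    "Computers/Operating Systems": "IT/Platforms",
    "Science/Biology": "R&D/Life Sciences",
}


def map_category(prebuilt_path, mapping=PREBUILT_TO_COMPANY):
    """Return the company label for a pre-built category.

    Walks up the pre-built path until a mapped ancestor is found, so a deep
    category inherits the label of its nearest mapped parent.
    Returns None when nothing on the path is mapped.
    """
    parts = prebuilt_path.split("/")
    while parts:
        key = "/".join(parts)
        if key in mapping:
            return mapping[key]
        parts.pop()  # drop the deepest level and try the parent
    return None


print(map_category("Computers/Operating Systems/Linux"))
print(map_category("Sports/Tennis"))
```

The fallback lets a small company taxonomy sit on top of a much deeper pre-built one without requiring a mapping entry for every fine-grained category.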
TopicalNet’s vision is that classification adds a
dimension of clarity, fidelity, and actionability to
the analysis of most types of unstructured
business data. The software has the capability of
defining the thematic attributes to answer the
question, “What is this data about?” The software
facilitates collaboration between employees on
disparate information by organizing the
information into recognizable categories.
TopicalNet’s mission is to develop and market
software applications with the lowest possible
cost and deployment effort to be applied to a
broad range of customer problems associated
with unstructured data.
Getting started quickly and scaling up to your
enterprise's capacity are usually two
diametrically opposed goals. TopicalNet has
been able to address both of them in one
solution. Using the pre-built taxonomy eliminates
the need for developing training sets, and the
quarterly experience with very large corpora
(over 60 million Web pages analyzed) combines
scalability with being current and topical.
TopicalNet software has the ability to tune and/or
train the system to do a better job for specific
clients' content and taxonomy. TopicalNet has
shifted the burden of responsibility for a
successful implementation from the client to
this impressive product that does the bulk of
the hard work.
Vendor Assessment Report:
Verity

Technological Approach

Verity breaks content organization down into four steps:
Vendor Assessment Report:
Wherewithal

Introduction

There are various approaches to addressing the issues of constructing and maintaining an enterprise taxonomy. The continuum ranges from a fully automatic approach with minimal human intervention to Wherewithal’s approach, which is a collaborative effort that makes every member of your team, or your entire company, a contributor who leverages their own unique knowledge to create a taxonomy for your intranet. Wherewithal uses human judgment to create and maintain the taxonomy. This approach ensures the taxonomy is relevant, personal and up-to-date.

Wherewithal’s approach allows category owners to “opt in.” What this means is that the domain experts can contribute to the maintenance and construction of their category as they have time and need. This encourages a sense of internal “pride of ownership” and an external community pressure to maintain the relevancy of the category. A check and balance approach to taxonomy.

The advantages are that the maintenance and construction of the taxonomy becomes self-fulfilling. The most popular categories will get the most searches and the most positive or negative feedback based on how well the searchers believe the category is maintained. The other advantage is that this approach alleviates the IT bottleneck. The contributors to the taxonomy structure are the ones that have a vested interest in the category, not someone in the overburdened IT department.

Technical Approach

Wherewithal has developed a multi-threaded software engine that holds the structure of the taxonomy in a proprietary format designed for fast response times under very heavy load and extremely large numbers of categories and items. This is called the Collaborative Taxonomy Engine™ (CTE) and has proven its capability by being the core technology of the, an Internet search site.

A patent pending search algorithm, called the Hierarchical Search Algorithm™, uses the hierarchical structure itself to resolve the search. This algorithm is designed to both create the most relevant search results for users, and also to create a system by which many contributors (viz. “tens of thousands” or more) would be able to collaborate on keywords without the system getting bogged down with duplicated effort. This structure is based on the human placement of the document, page and information pointers using the Category Owner Toolbox™ that is used to create and maintain the taxonomy category.

The result of a “search” on this taxonomy structure gives multiple hits based on the keywords placed in the structure by human contributors and the relationship of those keywords to others used in the taxonomy. To ensure results are relevant, and to enable the scalable system of collaboration, results are scored using multiple criteria. These criteria include:

• Lexical “closeness” of the terms used by the searcher to the keyword found in the category item (i.e. word pattern matching: “computers” vs. “computing,” etc.)

• How many times the keyword is found within the taxonomy hierarchy. If the result is found several times within the hierarchy, it is thought of as more relevant to that search.

• Keywords found at higher levels within the hierarchy are given more weight in the scoring process. This means that contributors at “higher levels” of the taxonomy have greater control over search results than those at lower levels.
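The three scoring criteria listed above (lexical closeness, number of occurrences in the hierarchy, and extra weight for higher levels) can be combined in a toy scoring function. This is not the patent-pending Hierarchical Search Algorithm, whose details the source does not give. The use of Python’s `difflib.SequenceMatcher` for closeness, the `1 / (1 + depth)` level weight and the `score` function itself are assumptions chosen only to make the criteria concrete.

```python
from difflib import SequenceMatcher


def score(query, placements):
    """Score one search result.

    placements is a list of (keyword, depth) pairs: every place in the
    hierarchy where a contributor attached a keyword to this result.
    Depth 0 is the top level of the taxonomy.
    """
    total = 0.0
    for keyword, depth in placements:
        # Criterion 1: lexical closeness of the query to the stored keyword (0.0 to 1.0).
        closeness = SequenceMatcher(None, query.lower(), keyword.lower()).ratio()
        # Criterion 3: keywords placed higher in the hierarchy count for more.
        level_weight = 1.0 / (1 + depth)
        # Criterion 2: summing over placements rewards results found several times.
        total += closeness * level_weight
    return total


top_level = score("computers", [("computers", 0)])
deep_level = score("computers", [("computers", 2)])
found_twice = score("computers", [("computers", 2), ("computing", 2)])
print(top_level, deep_level, found_twice)
```

In this sketch an exact match at the top level outranks the same match two levels down, and a result found twice at a given level outranks one found once.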
• A scalable system that allows for reuse of other people’s work and reduces or eliminates duplication of effort

Wherewithal has packaged the software in a suite of tools and applications for the enterprise market in a package called Wherewithal Enterprise Web Directory.

This suite consists of:
• The Collaborative Taxonomy Engine (described above)
• The Category Owner Toolbox for maintaining and creating categories within the taxonomy
• Custom Directory™ - a way to customize the look and feel of directory content
• MultiSearch™ programmable meta-search engine - a way to integrate existing enterprise search functionality into a single set of search results
• The Taxonomy Service API, an XML-based interface into the CTE

This product is designed to develop categories to make searching on intranets and portals more efficient.

Executive Vision

Wherewithal envisions a world in which the classification and organization of items for use by others is an integral part of everybody’s daily job. The company’s technology focus is specifically on the issues of bringing together large groups of people to collaborate on a single knowledge base. Wherewithal’s executive team says, “Leverage the Internet’s power to gather knowledge from individuals to create its index.” “Someday every person on earth will be a part time or full time infomediary, a knowledge worker dedicated to classifying parts of the internet or a corporate Intranet.”

Wherewithal’s perspective on taxonomy can be summed up:

• Knowledge as a “non-fixed taxonomy”
– Also called a “hierarchical matrix”
– Also called a “hierarchy with infinite contextual variant branches”
In other words, that things are classified according to a certain context, and that things should be able to be classified in a number of ways
• Indexing cannot be done “automatically”
– It must be done by real people to have real relevance
– Given enough people (1% of users in any given base), people will make crawlers irrelevant

Wherewithal’s value proposition is:

• A more complete and relevant corporate taxonomy means a better-organized and easier to search intranet, leading to faster searching and browsing

• The collaborative approach allows IT and management to maintain control but saves time in meetings deciding on taxonomy and Intranet structure

• Business decisions are reflected instantly to the Intranet when users decide on new structure, and the Intranet changes as fast as the business does

• Search results are under complete control of the business, vs. a random crawler

• Programmers can use a single repository for structural information versus separate databases and static HTML Web pages
Delphi Group sees Wherewithal’s products as
overcoming the common obstacles of making a
taxonomy timely and relevant to the users by
using the collaborative model for constructing
and maintaining the taxonomy. This community
effort will ensure that the documents and
information that most people care about will be
placed in the categories where most people will
want to find them. An example will be that Bill
Clinton will be moved from the category of
“president” to the category of “former president.”
documents are released, and out-of-date documents are removed from circulation. Changing strategies, evolving products, and advancing technologies all drive changes. Taxonomies may be industry- or even department-specific. Information today is by nature dynamic; consequently, categorization systems must be dynamic as well.

Directory Building vs. Hierarchical Categories

Another argument in this field emphasizes the fact that directories are about stored things as opposed to related concepts. Directories are virtual bins and do not necessarily reflect a hierarchical relationship. Proponents of this view would simply ask you to look at the directory structure on your personal computer. Carrying this argument forward supports the idea of customizing directories to help reflect your unique view of the world. The alternative approach is to implement a strict hierarchy with topics and subtopics organized in a strict grandparent-parent-child structure. Delphi believes that each approach has advantages and disadvantages. The choice is one of working style more than substance.

Granularity of the Taxonomy Structure

Although there are two distinct sides to this debate, your decision will place you somewhere along a continuum of alternatives. For the purpose of this discussion, taxonomies can be arbitrarily divided into three sizes by the number of nodes or headings and subheadings:
• Small - 1,000 or less
• Medium - 1,001 to 20,000
• Large - more than 20,000

There are many very large taxonomies. The proponents of large taxonomies say that more is better. Since the organization of information is hierarchical, the users can drill down to as much detail as they wish. Levels of hierarchy greater than 10 are not uncommon in implementations on this scale.

At the other end of the spectrum, proponents of small taxonomies argue that more than five levels once again confront users with a kind of Infoglut, receiving too many hits on a search, too much irrelevant information, etc.

The third interpretation here is that individuals or work groups can develop their own relevant taxonomies as a subset of large taxonomies.

Librarians

Librarians are, first and foremost, people who help you find information. Corporate librarians are experts on how various categorization schema are designed. They know how to find information, that “needle in a haystack.” The idea of an automatic taxonomy system may at first seem threatening to them. The reality is that, because of the dynamic nature of information, these experts will become more and more valuable as the amount of information expands exponentially.

User Needs and Personalized Taxonomies

The needs of individual users represent another major aspect to examine. Will one comprehensive enterprise taxonomy address everyone’s needs, or will you need departmental taxonomies as well? Or will individual workers require their own unique taxonomies? Or will your environment require a blend of all of the above? As you investigate different taxonomy products, be sure to investigate how flexible the products are for generating multiple taxonomies.

Speed, Accuracy, Robustness and Scalability

There is no universally accepted standard for evaluating the various algorithms or software configurations in regard to speed, accuracy, and scalability. When your organization is in the final stages of evaluation and has developed its short list of vendors, Delphi Group recommends testing the different solutions against a significant portion of your unstructured data, letting your users verify that the documents are categorized quickly and accurately and on a scale that meets your needs.
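The size bands and depth figures from the Granularity discussion can be checked against an existing taxonomy with a few lines of code. The sketch below assumes the taxonomy is held as nested dictionaries (a representation chosen here for convenience, not one the report prescribes) and uses the report’s own thresholds of 1,000 and 20,000 nodes.

```python
def count_nodes(tree):
    """Count every heading and subheading in a nested-dict taxonomy."""
    return sum(1 + count_nodes(children) for children in tree.values())


def depth(tree):
    """Number of hierarchy levels (0 for an empty taxonomy)."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


def size_band(node_count):
    """Classify a taxonomy using the report's three arbitrary size bands."""
    if node_count <= 1000:
        return "small"
    if node_count <= 20000:
        return "medium"
    return "large"


# A tiny example taxonomy; the category names are made up.
taxonomy = {
    "Products": {"Hardware": {"Servers": {}}, "Software": {}},
    "Customers": {},
}

nodes = count_nodes(taxonomy)
print(nodes, depth(taxonomy), size_band(nodes))
```

Comparing the measured depth with the five-level and ten-level figures mentioned in the debate gives a quick sense of which camp an existing taxonomy falls into.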
retrieval software. We are already seeing this trend as companies such as Semio and Mohomine develop aggressive OEM programs.

Expect to see taxonomy software increasingly integrated with applications like these:

• Search & Retrieval
• Internet & Intranet
• Portals
• Content & Document Management
• Supply Chain
• CRM & Business Intelligence

Security

Security issues are among the key topics on the minds of information managers. Most taxonomy applications follow the standard security model for the server they run on within the enterprise system. This approach can be summarized by this simple rule: if you have access to the area where the document is stored, you have security clearance to see the document. Delphi Group believes this type of security implementation is the minimum level allowable. Security issues relative to unstructured data in the future will involve at least three different aspects:

1. Security attributes of each individual document. When the document is created, various properties will be assigned as to which individuals, which set of clearances, which departments, etc., will be permitted to view or edit the documents.

2. Security attributes of the individual. This type of security will be based on clearance levels, membership in particular departments or assigned role(s) in the organization.

3. Security issues regarding the overall operating environment. Examples of these considerations will be factors such as what time of day access is allowed, access from inside or outside the firewall, number of documents accessed, etc.

For instance, you probably do not want all your employees to see the list of personnel files that are assigned to specific categories of diseases. If you look up oncology you certainly don’t want to see John Doe’s personnel file linked to that topic. Even if you don’t have access to John Doe’s personnel file, the fact of its potential association with sensitive topics such as an employee’s medical condition is of obvious concern. A number of technology providers (including Autonomy, Convera, Semio, Stratify and Verity) are expanding their security functionality.

Ontologies

An Enterprise Ontology is a collection of terms and definitions relevant to business enterprises.10 An ontology is more than a taxonomy or classification of terms. Although taxonomy contributes to the semantics of a term in a vocabulary, ontologies include richer relationships between terms. It is these rich relationships that enable the expression of domain-specific knowledge, without the need to include domain-specific terms.11

An ontology is more than an agreed-upon vocabulary, however. The terms in an ontology are selected with great care, ensuring that the most basic (abstract) foundational concepts and distinctions are defined and specified. The terms chosen form a complete set, whose relationship one to another is defined using formal techniques. It is these formally defined relationships that provide the semantic basis for the terminology chosen. In the context of knowledge sharing, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.

An ontology often takes the form of an extremely large database of words and phrases, their meanings and their conceptual relationships.

Examples of conceptual relationships are: a “commissioner” is a member of a “commission”; “good” is an antonym to “bad”; and “lumber” has substance, i.e. “wood.”
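The three conceptual relationships given as examples can be stored as typed subject-predicate-object records, which is one simple way to see how an ontology carries more than a parent-child taxonomy. The `Relation` tuple, the predicate names and the `related` helper below are hypothetical and are not drawn from any particular ontology product.

```python
from collections import namedtuple

# One typed relationship between two terms (names are illustrative).
Relation = namedtuple("Relation", "subject predicate object")

ONTOLOGY = [
    Relation("commissioner", "member_of", "commission"),
    Relation("good", "antonym_of", "bad"),
    Relation("lumber", "has_substance", "wood"),
]

# Relationships that hold in both directions.
SYMMETRIC = {"antonym_of"}


def related(term, predicate, relations=ONTOLOGY):
    """Return the terms linked to `term` by the given relationship type."""
    found = []
    for rel in relations:
        if rel.predicate != predicate:
            continue
        if rel.subject == term:
            found.append(rel.object)
        elif predicate in SYMMETRIC and rel.object == term:
            found.append(rel.subject)
    return found


print(related("commissioner", "member_of"))
print(related("bad", "antonym_of"))
```

Because each relationship has a type, a query can distinguish "is a member of" from "is the opposite of", which a plain hierarchy of headings cannot express.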
