Visual Text Analytics Users Guide
SAS® Documentation
June 26, 2018
The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS® Visual Text Analytics 8.2: User’s Guide. Cary, NC:
SAS Institute Inc.
SAS® Visual Text Analytics 8.2: User’s Guide
Copyright © 2017, SAS Institute Inc., Cary, NC, USA
Audience
This book is designed for users of SAS Visual Text Analytics on Viya. It describes the
terminology used in SAS Visual Text Analytics on Viya and provides instructions for
tasks. Where appropriate, it guides users to information about Viya.
Accessibility
For information about the accessibility of this product, see Model Studio: Accessibility
Features.
Chapter 1
Figure 1.1 provides an overview of the SAS Visual Text Analytics processes.
SAS Visual Text Analytics on Viya combines the visual programming flow of SAS Text
Miner with the rules-based linguistic methods of categorization and extraction in SAS
Contextual Analysis. These capabilities, along with document-level scoring for each
component, are combined in a single user interface.
Using SAS Visual Text Analytics on Viya, you can identify key textual data in your
document collections, categorize those data, build concept models, and remove
meaningless textual data.
By default, words that provide little or no informational value (stop words) are excluded
from topic analysis. Examples of these words include the articles a, an, and the and
conjunctions such as and, or, and but. Other terms that are specific to your document
collection but provide little or no value are also identified and excluded.
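As a rough illustration of the idea (not the product's implementation), stop-word exclusion can be sketched in Python. The word list and the function name remove_stop_words are hypothetical; the actual default stop lists are supplied by the software.

```python
# Hypothetical stop list built only from the examples in the text above;
# the real default stop lists ship with the software and are larger.
STOP_WORDS = {"a", "an", "the", "and", "or", "but"}


def remove_stop_words(text: str) -> list[str]:
    """Return the lowercase tokens of `text` that are not stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("The tent and the stove"))  # ['tent', 'stove']
```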
SAS Visual Text Analytics on Viya uses a graphical user interface that is useful for all
users, regardless of whether they have programming experience.
The Topics analysis node groups similar documents in a collection into related themes,
or topics. The documents in each topic often contain similar subject matter, such as
motorcycle accidents, computer graphics, or weather patterns. Automatic topic
identification enables you to easily categorize each document in your collection.
The Category analysis node labels documents based on their content. You can create
categories using these methods:
• specify category (target) variables in your training documents
• create new categories that correspond to your organization’s interests
• promote discovered topics to categories
Preliminary rules are generated when you promote a topic to a category or when you
specify category variables in your training documents. These rules can be edited and
refined using simple Boolean and proximity operators.
The Sentiment analysis node determines whether documents express positive, neutral, or
negative attitudes. Analysis performed after the Sentiment Analysis node will display a
sentiment indicator for each document.
Finally, each of the analysis nodes (except parsing) provides score code that enables you
to deploy your models. Use deployed models to automate the process of labeling a set of
input documents into their respective concepts, categories, topics, and sentiment.
4 Chapter 1 • Introduction to SAS Visual Text Analytics on Viya
Supported Languages
Table 1.1 shows a full list of project languages that are supported. See your SAS sales
representative for information about licensing additional languages.
Croatian Czech
Danish Dutch
English Farsi
Finnish French
German Greek
Hebrew Hindi
Indonesian Italian
Japanese Korean
Portuguese Russian
Slovak Slovene
Spanish Swedish
Tagalog Thai
Turkish Vietnamese
Introduction
When you run a pipeline, the following analyses are performed in their respective nodes
(if data are present):
• Concepts node — concept extraction
Visual Text Analytics Basics 5
Concepts
A concept is a property such as a book title, last name, city, gender, and so on. Concepts
are useful for analyzing information in context and for extracting useful information.
You can write rules for recognizing concepts that are important to you, thereby creating
custom concepts. For example, you can specify that the concept kitchen is identified
when the terms refrigerator, sink, and countertop are encountered in text.
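The kitchen example can be pictured with a small Python sketch. This is not LITI syntax and not how the software matches concepts; the mapping CONCEPT_TERMS and the function find_concept_matches are hypothetical, and the sketch assumes that any one of the listed terms is enough to produce a match.

```python
import re

# Hypothetical mapping from a custom concept name to its trigger terms.
CONCEPT_TERMS = {"kitchen": ["refrigerator", "sink", "countertop"]}


def find_concept_matches(text: str, concept: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for each trigger term found in `text`."""
    matches = []
    for term in CONCEPT_TERMS[concept]:
        pattern = rf"\b{re.escape(term)}\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group()))
    return sorted(matches)  # order matches by their position in the text


sample = "The sink is next to the refrigerator."
for start, end, matched in find_concept_matches(sample, "kitchen"):
    print(start, end, matched)
```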
SAS Visual Text Analytics provides predefined concepts, which are concepts whose
rules are already written. Predefined concepts save time by providing you with
commonly used concepts and their definitions, such as an organization name or a date.
You cannot rename predefined concepts, nor can you view or edit their base definitions.
You can provide additional rules in the Edit a Concept pane to modify or extend their behavior.
For custom concepts, you can prioritize which matches are returned when overlapping
matches occur (for example, a concept node that matches New York and another concept
node that matches New York City). You do this by setting a priority value. When setting
priority values, it is helpful to know the preset values of predefined concepts so that you
can set a custom concept’s priority at a higher value. For more information about setting
priorities, see “Which Rule Type Should I Use?” on page 45.
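The effect of priority values on overlapping matches can be sketched as follows. This section does not describe the actual resolution procedure, so the sketch simply assumes that the match with the higher priority value is returned; the Match type, the concept names, and the priority numbers are all hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    concept: str   # name of the concept that produced the match
    start: int     # offset of the first matched character
    end: int       # offset one past the last matched character
    priority: int  # hypothetical priority value; higher is assumed to win


def overlaps(a: Match, b: Match) -> bool:
    """Two matches overlap when their character spans intersect."""
    return a.start < b.end and b.start < a.end


def resolve_overlaps(matches: list[Match]) -> list[Match]:
    """Keep the highest-priority match from each group of overlapping matches."""
    kept: list[Match] = []
    for match in sorted(matches, key=lambda m: -m.priority):
        if not any(overlaps(match, other) for other in kept):
            kept.append(match)
    return sorted(kept, key=lambda m: m.start)


# "I love New York City": one concept matches "New York", another "New York City".
candidates = [Match("NyState", 7, 15, 10), Match("NyCity", 7, 20, 20)]
print(resolve_overlaps(candidates))
```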
Table 1.2 on page 5 shows a list of the predefined concepts for English that are
included with SAS Visual Text Analytics, along with their preset priority values. For
predefined concepts and priority values for other languages, see Appendix 2, “Pre-
Defined Concept Priorities (for Languages Other Than English),” on page 103.
Note: Some languages use a subset of the predefined concepts listed here.
A custom concept is a concept whose rules you must write.
For more information about writing concept rules, see “Writing Concept Rules: Basic
LITI Syntax” on page 43. For information about writing category rules, see Writing
Category Rules on page 62.
Topics
Topics are derived from natural groupings of important terms that occur in your
documents. In SAS Visual Text Analytics, topics are automatically generated and
assigned to documents. A single document can contain more than one topic.
The interactive window for the Topics node displays all the topics that SAS Visual Text
Analytics identified. The default name of a topic is the top five terms that appear
frequently in the topic. These terms are sorted in descending order based on their weight.
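The naming convention described above can be approximated in a few lines of Python. The function default_topic_name, the sample weights, and the comma separator are assumptions made for illustration; the exact format of the generated name is not specified here.

```python
def default_topic_name(term_weights: dict[str, float], top_n: int = 5) -> str:
    """Join the `top_n` terms with the largest weights, in descending order."""
    ranked = sorted(term_weights.items(), key=lambda item: item[1], reverse=True)
    return ", ".join(term for term, _weight in ranked[:top_n])


# Hypothetical term weights for a single topic.
weights = {
    "flight": 0.9, "delay": 0.8, "airport": 0.7,
    "gate": 0.6, "crew": 0.5, "luggage": 0.4,
}
print(default_topic_name(weights))  # flight, delay, airport, gate, crew
```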
Sentiment Scoring
Sentiment analysis is the process of identifying the author’s tone or attitude (positive,
negative, or neutral) expressed in a document. SAS Visual Text Analytics uses a set of
proprietary rules that identify and analyze terms, phrases, and character strings that
imply sentiment. A sentiment score is then assigned, based on that analysis. Using these
rules, the software is able to provide repeatable, high quality results.
The assignment of sentiment to a document is based on the attitude that is associated
with the document as a whole. For example, the following document would have a
positive sentiment: Had an awesome time yesterday. Glad I brought my
tent from Store XYZ.
Because documents can be associated with multiple words or terms that imply sentiment,
SAS Visual Text Analytics uses a scoring system to assign a final sentiment score. The
following list provides basic information about how sentiment scoring works. (The
information has been simplified to illustrate key concepts.)
• Each positive term or phrase is worth a single (positive) point.
• Each negative term or phrase is worth a negative point.
• If there are more positive terms or phrases than negative, the final sentiment score is
positive.
• If there are more negative terms or phrases, the final sentiment score is negative.
• If there are an equal number of positive and negative terms or phrases, the sentiment
score is neutral.
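The simplified scoring rules in the list above can be written out as a short Python sketch. The term sets POSITIVE_TERMS and NEGATIVE_TERMS are hypothetical stand-ins for the proprietary rules, and the tokenizer is deliberately naive.

```python
import re

# Hypothetical sentiment lexicons; the software uses proprietary rules instead.
POSITIVE_TERMS = {"awesome", "glad"}
NEGATIVE_TERMS = {"awful", "broken"}


def sentiment_label(text: str) -> str:
    """Score +1 per positive term and -1 per negative term, then label the sum."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE_TERMS for t in tokens) - sum(
        t in NEGATIVE_TERMS for t in tokens
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


doc = "Had an awesome time yesterday. Glad I brought my tent from Store XYZ."
print(sentiment_label(doc))  # positive
```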
Categories
A category identifies a group of documents that share a common characteristic.
For example, you could use categories to identify the following:
• areas of complaints for hotel stays
• themes in abstracts of published articles
• recurring problems in a warranty call center
You create categories by promoting a topic to a category, specifying a category variable
while creating a new project, or creating a new category in the Categories node. You can
edit the rules that are automatically generated for category variables and for topics that
are promoted to categories.
Note: The category rules are in the format that SAS Contextual Analysis uses (MCAT),
rather than in LITI format. You can refer to LITI concepts from within categories.
For more information about writing concept rules, see “Writing Concept Rules: Basic
LITI Syntax” on page 43. For information about writing category rules, see Writing
Category Rules on page 62.
Using Taxonomies
In SAS Visual Text Analytics, you can create category and concept rule sets, which are
organized into a taxonomic structure. Each taxonomy consists of tree nodes (not to be
confused with analysis nodes). Each tree node is a container for one or more rules. The
taxonomy is used to organize rules and reflect the overall model design and to make
testing, refinement, and maintenance of rules easier. Rules may explicitly reference other
tree nodes, but there are no implied dependencies within the tree that impact results (like
dependencies of inheritance).
Concept and category taxonomy trees can be organized in any way that is useful for your
objectives. However, using a careful and principled design process is recommended for
larger projects. For example, commonly referenced rules should be placed in a location
where they are easy to find and their shared status is apparent. Naming concept or
category tree nodes should enable easy navigation among nodes. See guidelines for
naming nodes for more information.
Each category node in the tree is a container for a rule. By contrast, under a concept
node, there can exist multiple rules. Figure 1.2 on page 9 demonstrates how category
and concept taxonomies differ.
Chapter 2
Working in Projects
Getting Started (page 11)
Preparing the Document Collection (page 11)
Creating a Project (page 11)
Assigning Variables in the Data Tab (page 12)
Customizing Views in the Data Tab (page 13)
Using SAS Sentiment Analysis Models in SAS Visual Text Analytics (page 14)
Project Sharing (page 15)
Getting Started
Creating a Project
To create a project in Model Studio, click New Project in the upper right corner of the
Projects page. A New Project window appears. Within the window, the user can: assign
a name to the project; select the type of project they want to create (“Text Analytics” is
the default); choose a data source; and select the language to be used for analyzing the
text in the document collection. For a list of supported languages, see Table 1.1 on page
4. Once all fields are populated, click Save in the lower right corner of the New Project
window.
Select “Assign variable roles” to access the Assign Variable Roles window. To assign a
text variable, select Text variable from the Required Roles list on the left side of the
window. The text variable identifies the text data to be analyzed. Select the variable to be
used from the list provided.
If you are not going to assign a category variable, click OK to submit your changes. To
assign a category variable or variables, select Category variables under Optional Roles
on the left side of the window. Select the variable or variables of choice from the list
using the icon; if selecting multiple variables, you can only add one at a time. Once
you are done assigning roles, click OK in the bottom right corner of the window to
submit your changes.
After assigning variable roles, you can select variables to be used as display variables.
Display variables become columns in the Documents tab of all pipeline nodes with the
exception of the Data node and Sentiment node. To change the display status of
variables, click the check box to the left of each variable that you want to modify. Once
variables have been selected, use the drop-down menu in the top right corner of the Data
tab and select Change display status. The display status of the selected variable or
variables changes instantly.
Note: Your Text variable will always have a display status of “Yes”; however, you can
choose whether to display Key and Category variables.
To switch from the Variables table to the View table, click the icon in the top left corner
of the Data tab, next to the search bar. The View table shows greater detail and has a
column for each of the variables in the data set.
To customize your view in either the Variables table or the View table, you can
right-click column headers to sort or freeze a column.
Using the icons between the two lists, you can move variables from the Displayed
columns list to the Hidden columns list, and from the Hidden columns list to the
Displayed columns list.
Note: By default, the View table displays all variables as columns and therefore does
not display any variables in the Hidden columns list.
Project Sharing
It is possible to share projects with other users. In Model Studio, check the selection box
in the project that you want to share. Then select the menu and select Share.
Note that in shared Read-only mode, all operations that involve changing of existing
data, such as running a pipeline, adding or editing concepts or categories, and so on, are
disabled. You can still perform actions such as viewing document matches, viewing term
maps, and testing rules against text. It’s important to note that when the project is in
Read-only mode, even the project owner cannot make changes to the data.
Chapter 3
Working In Pipelines
Each of these nodes is designed to solve a specific problem related to text analytics.
These nodes and their associated properties are explained in detail in the following
sections. When a new project is created, a default pipeline associated with the project is
pre-populated. This default pipeline represents a typical workflow of a text analytics
project. It looks like this:
For detailed information about each analytic task performed by the nodes, see “Visual
Text Analytics Basics” on page 4.
Where applicable, the output of a given node is used within (flows into) its successors.
Here are some examples:
• When a Text Parsing node runs, it uses the concepts from all its predecessor nodes
during text parsing and extracts relevant terms.
• When a Text Parsing node precedes a Concepts or Categories node, all the kept terms
from the Text Parsing node are included in the concepts and categories interactive
view as textual elements. These textual elements can be used to develop rules for
concept extraction or categorization.
• From the Topics interactive window, you can select one or more topics and promote
them as categories. These categories and the associated category rules are
automatically created when any of the succeeding Category nodes run.
• Within the rules in a Categories interactive window, you can refer to concepts
defined in the immediately preceding Concepts node. For more information about
referring to concepts in categorization rules, see “Introduction to Category Rules” on
page 62.
• Within the interactive views that follow a Sentiment node, the document-level
sentiment information is shown alongside the document text.
Note: The Data and Sentiment nodes do not have interactive windows.
saved. This action can also change the node status to “Out of Date” if it was previously
marked as “Completed”.
Concepts
The only option you can specify for the Concepts node is whether or not to include
predefined concepts in your analysis. You can also adjust the minimum number of
documents to view by using the slider. The default, which is set automatically, is 4.
Predefined concepts identify items in context such as a person, name, or an address.
They save time by providing you with commonly used concepts and their definitions.
(Predefined concept availability depends on the project data language.) For more
information about concepts and predefined concepts, see “Concepts” on page 5.
Text Parsing
The options for the Text Parsing node include adjusting the minimum number of
documents that a term must appear in to be kept in the analysis; specifying a custom start
or stop list; and specifying a custom synonym list. If the number of matching documents
for a term is less than the minimum number, the term is dropped when the Text Parsing
node is run.
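The minimum-documents rule can be sketched as a simple filter. The function split_terms_by_document_count and the sample counts are hypothetical; only the comparison (a term is dropped when its document count is less than the minimum) comes from the text above.

```python
def split_terms_by_document_count(
    document_counts: dict[str, int], min_documents: int
) -> tuple[list[str], list[str]]:
    """Return (kept, dropped) terms based on how many documents contain each term."""
    kept = [t for t, n in document_counts.items() if n >= min_documents]
    dropped = [t for t, n in document_counts.items() if n < min_documents]
    return kept, dropped


# Hypothetical counts: number of documents in which each term appears.
counts = {"flight": 12, "delay": 4, "turbulence": 2}
kept, dropped = split_terms_by_document_count(counts, min_documents=4)
print(kept)     # ['flight', 'delay']
print(dropped)  # ['turbulence']
```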
Start lists and stop lists enable you to control which terms are or are not used,
respectively, in the text parsing and terms analysis. You can use a start list or a stop list,
but not both. A start list is a data set that contains a list of terms to include in the analysis
results. If you use a start list, then only terms that are included in that list appear in the
results. No start list is applied by default. To select a start list, check the check box and
select the table that represents the start list data set. A stop list is a data set that lists
terms to exclude from the analysis results, such as terms that contain little information or
that are outside the realm of your analysis. A stop list is provided and automatically
applied by default for the following languages: Croatian, Czech, Danish, Dutch, English,
Finnish, French, German, Greek, Hebrew, Italian, Norwegian, Polish, Portuguese,
Russian, Slovak, Slovene, Spanish, Swedish, and Turkish. To override the default stop
list with a custom stop list, check the check box in the options pane and select the table
that represents the stop list data set.
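The difference between a start list and a stop list, and the rule that only one of them can be used, can be illustrated with a minimal Python sketch. The function apply_term_list is hypothetical, and plain Python sets stand in for the data sets that the software expects.

```python
from typing import Optional


def apply_term_list(
    terms: list[str],
    start_list: Optional[set[str]] = None,
    stop_list: Optional[set[str]] = None,
) -> list[str]:
    """Filter `terms` with either a start list or a stop list, never both."""
    if start_list is not None and stop_list is not None:
        raise ValueError("use a start list or a stop list, but not both")
    if start_list is not None:
        # Start list: only terms that appear in the list are kept.
        return [t for t in terms if t in start_list]
    if stop_list is not None:
        # Stop list: terms that appear in the list are excluded.
        return [t for t in terms if t not in stop_list]
    return list(terms)


terms = ["the", "flight", "was", "late"]
print(apply_term_list(terms, start_list={"flight", "late"}))  # ['flight', 'late']
print(apply_term_list(terms, stop_list={"the", "was"}))       # ['flight', 'late']
```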
A synonym list is a SAS data set that identifies pairs of words that should be treated as a
single term for generating topics and textual elements. The data set can include both a
term and different forms of that term, including misspellings or abbreviations. For
example, you can specify that the words advert and advertising are to be treated as the
term advertisement. For more information, see “Text Parsing—Terms and Synonyms” on
page 6.
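The effect of a synonym list can be sketched as a lookup that replaces each variant with its parent term. The dictionary below is a hypothetical stand-in for the SAS data set, whose column layout is not described in this section.

```python
# Hypothetical synonym mapping based on the example above: variant -> parent term.
SYNONYMS = {"advert": "advertisement", "advertising": "advertisement"}


def normalize_terms(tokens: list[str]) -> list[str]:
    """Replace each token that has a synonym entry with its parent term."""
    return [SYNONYMS.get(token, token) for token in tokens]


print(normalize_terms(["advert", "budget", "advertising"]))
# ['advertisement', 'budget', 'advertisement']
```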
Setting Options for the Analysis Nodes 21
Sentiment
You can specify and apply a sentiment model if you want document-level sentiment to
appear in your analysis within the application. (Score code can be generated for feature-
level sentiment.) If you do not specify a sentiment model, a default model is used (not
available for all languages).
Add analysis nodes after the sentiment node in order to see document-level sentiment.
There is no interactive window for the Sentiment analysis node.
Topics
• You can choose for the software to generate topics, or you can designate a maximum
(or exact) number of topics that you want generated for the analysis. This setting
determines the number of documents that are displayed in the Topics node interactive
window.
• Term density determines the term cutoff value for each topic. Terms that have an
absolute value of weight that is above this value are considered to be included in the
topic. Terms that have values below the cutoff are not included in the topic.
Term density is defined by an integer between 0 and 10 (the default value is 1).
When term density is closer to 0, term topic cutoff will be lower and therefore topics
will be more densely populated by terms. When term density is closer to 10, topics
are less densely populated by terms. Use this setting in conjunction with document
density.
• Document density affects the cutoff for each topic, which in turn determines the
number of documents that belong to a topic. Only documents with a value higher
than the cutoff are assigned to the topic. Use this setting in conjunction with term
density.
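The two cutoffs described in the list above can be sketched as follows. How the density settings are converted into cutoff values is not described here, so the sketch takes the cutoffs directly as arguments; the function names and sample values are hypothetical.

```python
def terms_in_topic(term_weights: dict[str, float], term_cutoff: float) -> list[str]:
    """Keep terms whose absolute weight is above the term cutoff."""
    return sorted(t for t, w in term_weights.items() if abs(w) > term_cutoff)


def documents_in_topic(doc_values: dict[str, float], doc_cutoff: float) -> list[str]:
    """Keep documents whose value is higher than the document cutoff."""
    return sorted(d for d, v in doc_values.items() if v > doc_cutoff)


# Hypothetical weights and values for one topic.
term_weights = {"flight": 0.62, "delay": -0.41, "menu": 0.05}
doc_values = {"doc1": 0.80, "doc2": 0.10}

print(terms_in_topic(term_weights, term_cutoff=0.30))    # ['delay', 'flight']
print(documents_in_topic(doc_values, doc_cutoff=0.50))   # ['doc1']
```

A lower cutoff admits more terms or documents into the topic, which matches the description of lower density settings producing more densely populated topics.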
You must rerun the Topics analysis node to see the results of your changes to these
settings.
Categories
You can choose to have the application generate category rules and also rules for
category variables. (You specify category variables in the Data tab.) Category rules are
also generated when topics are promoted to categories.
TIP You must run the Category analysis node to see any of the generated categories
or rules.
When you download score code from a node, the resulting ZIP file contains two entries:
• SAS score code for the node - This code can be used to score an external CAS table
within a SAS Viya environment (for example, in SAS Studio).
• A copy of the model created within the node - The model can be used to score
external SAS data sets within a SAS 9.4 environment. The models created in SAS
Visual Text Analytics 8.2 are compatible with SAS 9.4M5 or higher. The score code
in this case can be obtained from SAS Contextual Analysis (for concepts, categories,
and sentiment) or SAS Text Miner (for Topics).
Scoring an External Data Set 25
Chapter 4
TIP Use the and icons in the Documents tab to switch between Document
View (shows one document at a time) and Tabular View (shows multiple documents
at once). You can also select the icon to view only the documents that match a
predefined concept or a custom concept.
Expand Predefined Concepts and Custom Concepts to see what is included in your
analysis. To expand the list, click the arrow to the left of Predefined Concepts or
Custom Concepts.
28 Chapter 4 • Using the Interactive Windows for the Nodes
Note: If you choose to exclude predefined concepts during project creation, you cannot
access predefined concepts in the interactive window for the Concepts node.
Here are other important actions that you can execute in the interactive window for the
Concepts node:
• Add a custom concept
Select the icon to add a custom concept for which you create your own rules.
Note: No more than 400 concepts (including child concepts) can be present.
In the Edit a Concept pane, enter the LITI rules for a selected concept. (For more
information about writing LITI rules, see “Writing Concept Rules: Basic LITI Syntax”
on page 43.) Validating the rules before running the Concepts node enables you to see
and correct errors more easily.
To validate concept rules, select the icon in the toolbar in the Edit a Concept pane.
Otherwise, a warning message will appear at the bottom of the Edit a Concept pane.
Matches within the documents are highlighted, as shown in the following sample
screen:
You can also test for matches on a single document or string of text. To test a
document, right-click the document in the Documents tab and select Paste to Test
Sample Text. You must then click the icon in the Test Sample Text tab to test the
document against the selected concept. To test a string of text, simply enter the
desired text into the text box in the Test Sample Text tab and click the icon. If a
Matched item, Matched fact, or Overlapping match is discovered, the match is
indicated by certain visual cues. Select the icon to see the legend for each type of
cue.
Note: Sample text can be tested on newly created custom concepts without running
the Concepts node. However, you must run the Concepts node to see updated
document matches in the Documents tab.
Note: The Matched Documents tab and the Test Sample Text tab offer different
scopes for concept rule matching. The Matched Documents pane displays
relevant matches for all concepts in the project being applied, including those
concept types with global impact (for example, REMOVE_ITEM). The Test
Sample Text tab shows matches for rules in the selected concept only, plus the
concepts that are explicitly referenced in rules of the highlighted concept.
Note: When using the Test Sample Text feature, global rule types not defined in the
specific concept being tested will not affect results. Global rule types include
NO_BREAK and REMOVE_ITEM.
• Guidelines for Naming Concepts
When you create a custom concept node, follow these naming guidelines:
• Use valid characters – numbers, letters, and underscores (_). (See the Note below
regarding the use of underscores).
• Concept names are case-sensitive.
• Create names that are not regular words; using mixed case is recommended to
help with readability. For example, MyConcept and myConcept are good names.
Do not use names for custom concepts that are also words (for example,
Problem or Mechanics) that could be matched in your text. Instead, use
names that cannot be interpreted as words, such as MyNewConcept.
If underscores (_) are used in concept names, follow these guidelines to ensure that
your concept rules will work as expected:
• If you use underscores at either end of the concept name, be sure there is a
matched pair at both ends. For example, _Domestic_ is permitted, but
Domestic_ is not permitted.
• Do not include _Q, a character combination reserved by the application,
anywhere in a concept name.
• If a concept name begins with an underscore, the next character must be a letter.
For example, the concept name _25anniv_ is not permitted.
TIP Use mixed case to enhance the readability of concept names. For example,
truckMechanicalIssues is easier to read than
truckmechanicalissues.
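The mechanical parts of these naming guidelines can be checked with a short Python sketch. The function is_valid_concept_name is hypothetical and is not part of the product; it assumes that "letters" means ASCII letters, and it does not attempt to check whether a name is also a regular word.

```python
import re


def is_valid_concept_name(name: str) -> bool:
    """Check a concept name against the character and underscore guidelines."""
    # Only numbers, letters, and underscores are allowed (ASCII assumed).
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        return False
    # The character combination _Q is reserved by the application.
    if "_Q" in name:
        return False
    # An underscore at one end requires an underscore at the other end.
    if name.startswith("_") != name.endswith("_"):
        return False
    # After a leading underscore, the next character must be a letter.
    if name.startswith("_") and (len(name) < 2 or not name[1].isalpha()):
        return False
    return True


for candidate in ["_Domestic_", "Domestic_", "_25anniv_", "MyNewConcept"]:
    print(candidate, is_valid_concept_name(candidate))
```

Because concept names are case-sensitive, a check like this treats MyConcept and myConcept as two distinct, valid names.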
Here are other important tasks that you can complete in Terms Management:
• View Terms
In the interactive window for the Text Parsing node, view terms in the following
contexts:
• The Kept Terms pane displays all of the terms in the document collection that
were kept
• The Role column displays the part of speech from which each term is derived
Note: In some languages, the roles displayed might not be the same as the ones
used for rule writing in the Concepts node.
The Interactive Window for the Text Parsing Node 31
• The Documents column displays the number of training documents that contain
the selected term
• The Frequency column displays the number of times that each term is used
• To view the surface forms that were assigned to a term, click the triangle that
appears next to that term
Note: If you chose to exclude predefined concepts in the Concepts node, you can
still see terms with the role nlpNounGroup in the interactive window for the
Text Parsing node.
Note: The above options appear for the Kept Terms pane, and might differ from
those in the Dropped Terms and Documents pane.
• Select and drop terms from one tab to another
By default, the lists of terms are sorted in descending order of the number of
documents in which each term appears. You can select parent terms from the Kept
tab and move them to the Dropped tab by using the icon, and back again using
the icon.
Note: If you make changes to the terms and you want to see the effects of your
changes in the matched documents table, you must click the icon on the
Pipelines tab to rerun the pipeline.
CAUTION:
If concept rules are out-of-date when you rerun any nodes (all out-of-date
nodes or topics only), any changes that you made to terms are overwritten
with the original terms list.
Note: Sentiment values are displayed only if a Sentiment node precedes the Text
Parsing node.
• View a term map
To view a Term Map for a term, select that term in the Kept Terms pane and click the
icon.
The Term Map window displays a term map for the selected term. In the preceding
sample screen, the selected term is flight, and it is represented by the largest circle in
the map. For more information about reading the map, click above the term map.
Note: Term maps for more frequently occurring terms will take longer to produce.
The time it takes to produce a term map can vary dramatically and is dependent
upon these factors: the number of documents in your collection, the number of
terms being searched, and the number of documents that include the center term.
Note: You must run the topics to see the results of your changes.
• Customize your view
Use the icon to select which column types will appear in each pane. In the Topics
pane, there are two options: Topics added as category and Documents. The Terms
pane offers more options, such as Relevancy, Similarity, Role, Documents, and
Frequency. The Documents tab offers two columns, Sentiment and Relevancy. You
can also resize columns by using the splitter bars between panes, and change the sort
order of each column by right-clicking on the column headings to access sorting
options.
The Documents pane offers two tabs, All and Matched. To see which documents
match a particular category, select a category from the Categories pane, and then
select Matched in the Documents tab. The highlighted terms are the terms that
determined the document’s membership in the category.
Note: In the case that emoji characters are present in the data source, they are
rendered as a diamond character with a “?” in it within Model Studio.
Note: The sentiment for each document is displayed only if you preceded the
Categories node with a Sentiment node.
Note: If a rule is edited or if a category is renamed, the Categories node must be
rerun in order to display the document matches. A dotted red line underneath a
term indicates an out-of-date match.
Note: Once a category has been modified, a warning icon appears in the top right
corner of the Documents pane. Upon clicking the icon, you are given the option
to rerun Categories in order to update matches.
To begin editing, select a category. In the Edit Rules pane, the rule for that category
will appear. Use the tree view icon or the rule view icon to switch between
views. The option to rename a category is also available. Select the category to be
renamed and select the icon in the Categories pane. The category name can then
be modified.
To edit a rule, you can use either the rule view or the tree view in the Edit Category
pane.
To edit a rule in the default rule view, select the rule to be edited. The rule will then
appear in the Edit Category pane.
A rule has two main components: arguments and operators. In the rule view,
arguments are displayed in purple text, and the operators are displayed in blue text.
To edit the selected rule, simply click inside the rule and modify the arguments and
the operators as desired.
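The relationship between operators and arguments can be pictured with a small Python sketch that evaluates a nested Boolean rule against the terms in a document. The tuple representation and the function rule_matches are hypothetical and are not the category rule syntax used by the software; only Boolean AND and OR are covered, and proximity operators are omitted.

```python
def rule_matches(rule, tokens: set[str]) -> bool:
    """Evaluate a nested (operator, argument, ...) rule against a set of tokens."""
    if isinstance(rule, str):
        # A plain string is an argument: it matches when the term is present.
        return rule in tokens
    operator, *arguments = rule
    results = [rule_matches(argument, tokens) for argument in arguments]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical rule: the document mentions "hotel" and either "dirty" or "noisy".
rule = ("AND", "hotel", ("OR", "dirty", "noisy"))
print(rule_matches(rule, {"hotel", "noisy", "room"}))  # True
print(rule_matches(rule, {"hotel", "clean"}))          # False
```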
TIP By default, the typing assistance tool is active unless manually disabled.
When modifying an operator, this feature will provide a list of available operators
as well as an explanation of what each operator does.
TIP Press Shift+F6 to exit from the code editor.
Note: If typing assistance is not desired, open the drop-down menu in the Edit a
Category toolbar and click Typing assistance to turn it off.
The Interactive Window for the Categories Node 39
4. Once an operator is in place, you must create an argument. Select the operator
that you added and select the icon.
5. Select Add argument. A text box appears underneath the operator in the Edit a
Concept pane.
6. Enter the desired argument in the text box and press Enter on the keyboard when
you are finished.
Simply type (or copy and paste) the test text into the Test Category tab for the rule
that you have selected, or select a document from the Documents tab and click the
icon to paste that document into the Test Category tab. Click the icon to test the
text.
Once the testing is complete, any matched items and overlapping matches are
highlighted.
Note: When using the Test Sample Text feature, global rule types not defined in the
specific concept being tested will not affect results. Global rule types include
NO_BREAK and REMOVE_ITEM.
Clear the highlighting by clicking the icon, or clear the sample text entirely by
selecting the icon.
• Using Textual Elements
In the Textual Elements pane, the terms that were kept from the Text Parsing node
appear. Use the Textual Elements pane to create a rule for an existing category, or to
create a rule for a new category. To create a rule in the Textual Elements pane, a
category must first be selected. Once a category is selected, select the terms from the
Textual Elements pane that will be used to create the new rule. Select the icon to
view and edit the new rule before it is applied to the category selected.
Note: The new rule created will replace any previous rule associated with the
selected category.
Note: There can be no more than 400 categories (including sub-categories) present.
It might also be useful to know which terms are “similar” to -- that is, likely to
appear in the same context as -- a selected term in your documents. Select a kept
term and click the icon to generate similarity scores. The higher the score, the
more the term is likely to appear in the same context as the selected term. A score of
1.0 is an exact match (in other words, the term itself). To turn off similarity scores,
click the icon on the right side of the pane, and the terms table will return to its
original format.
TIP Use similarity scores to create rules from the most similar terms to capture
more documents relating to the selected term.
• Customize your view
In the Textual Elements and Documents panes, use the icon to select which
columns to display or to hide. In Textual Elements, there are five options:
Documents, Frequency, Similarity, String, and Role. The Documents pane offers
three options: Relevancy, Sentiment, and text.
Chapter 5
Writing Rules
For information about editing rules by using the interface and by using properties
settings, see “The Interactive Window for the Concepts Node” on page 27. For a list of
rule types, see “Which Rule Type Should I Use?” on page 45.
The following list provides basic guidelines for using LITI syntax to write concept rules.
The syntax is flexible, and therefore the syntax elements can be combined in numerous
ways.
• A rule consists of a rule type (which is written in uppercase letters), followed by a
colon, then by arguments. For example, in the rule CLASSIFIER:LGA,
CLASSIFIER is the rule type, LGA is the argument, and they are separated by a
colon. Rule modifiers can be used to further refine the set of matches. The rule
syntax varies greatly depending on the rule type; the basic syntax is included in the
description of each rule in Table 5.1 on page 45 and Table 5.2 on page 47. For a
list of rule modifiers, see “Adding Rule Modifiers” on page 48.
• Use descriptive concept rule names that cannot be used as single words (for example,
BASEBALLSCORE). You can also include information about how you will use the
concept in other rules by using a prefix (for example, Helper_BaseballScore).
• A single concept rule can reference one or more other concept nodes. You can also
write rules that recognize key words or elements within a specific context. For
example, you can extract documents that contain the string LGA only if it appears
before the word Airport.
• Use part-of-speech tags in rules to identify linguistic structures. For more
information, see “Using Part-of-Speech and Other Tags” on page 55.
• Use Boolean and proximity operators to enhance the precision of your rules. For
more information, see “Using Boolean Operators for Extracting Concept Rules and
Facts” on page 50.
• Use morphological expansion operators to return inflected forms of a word.
• Use coreference operators to resolve pronouns. For example, if the pronoun he were
used to refer to Walt Disney, you can write a rule that specifies the canonical
form (full form) and returns it in the concept. For more information, see “Using the
Coreference Modifier” on page 53.
• You can use a sequence rule (SEQUENCE) when the order of the items in the fact is
important. A sequence rule can detect a structure so that each term in the fact
matches in the order that you specify with no intervening items.
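The basic shape described in the guidelines above (an uppercase rule type, a colon, then the arguments) can be sketched in Python. This is only an illustration for checking rule text outside the product; the function name split_rule is hypothetical and is not part of SAS Visual Text Analytics.

```python
def split_rule(line):
    """Split a LITI-style rule line into (rule_type, arguments).

    Hypothetical helper: it checks only the basic shape described in the
    guidelines (uppercase rule type, a colon, then the arguments).
    """
    rule_type, sep, arguments = line.partition(":")
    if not sep:
        raise ValueError("expected a colon between rule type and arguments")
    rule_type = rule_type.strip()
    if not rule_type or rule_type != rule_type.upper():
        raise ValueError("rule type should be written in uppercase letters")
    return rule_type, arguments.strip()


print(split_rule("CLASSIFIER:LGA"))
```

Only the first colon is treated as the separator, so colons that appear later in the arguments (for example, in part-of-speech tags) stay inside the argument text.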
CLASSIFIER Identifies single terms or strings that you want matched in context. For
example, in a concept definition, you can create CLASSIFIER rules that
contain specific airport codes. The portions of text that contain the
airport codes are considered matches to the CLASSIFIER rules.
CLASSIFIER:string <, information>
C_CONCEPT Returns matches that occur in the specified context only. For example, to
extract matches that include names of university professors, you could
create a C_CONCEPT rule that identifies matches on a concept
(previously defined) that identifies last names only when the matched
names are preceded by the word Professor.
Note: This rule type requires the _c{} modifier.
C_CONCEPT:<argument> _c{argument}<argument>
where argument can be a concept name, rule modifier, or string.
CONCEPT_RULE Uses Boolean and proximity operators to determine matches. For a list
of operators, see “Boolean and Proximity Operators for Category Rules”
on page 63.
Note: This rule type requires the _c{} modifier. Quotation marks (")
must surround the strings that you want to match. The _c{} can
surround only one argument, which is highlighted when matches are
returned. The other arguments that appear in quotation marks provide
context for the match and must be present for a match to occur.
CONCEPT_RULE:(<Boolean-rule-1>…<Boolean-rule-n>
where Boolean-rule can be nested n times and is written as:
Boolean-operator, “_c{argument-1}”, <“argument-2”>…<“argument-n”>)
NO_BREAK Prevents partial matches by ensuring that a match occurs only if the
entire string is located. For example, suppose you want to capture text
that includes the item National Gallery of Art. You can
create a rule that ensures that the entire string National Gallery
of Art is matched and not Gallery and Art as separate items.
Note: This rule type requires the _c{} modifier.
Note: NO_BREAK applies across the entire taxonomy regardless of
where the rule appears or whether the rule is enabled or disabled.
Note: Do not insert NO_BREAK rules just anywhere. It is helpful to
insert them all in one concept. That is, create a concept that contains
globally implemented rules only (NO_BREAK or REMOVE_ITEM).
Having such rules all in one place aids in troubleshooting the matching
behavior across your taxonomy.
NO_BREAK: _c{argument}
where argument can be a concept name (not recommended) or string.
matches any five digit number to help find ZIP codes in the USA.
REGEX:regular-expression
Writing Concept Rules: Basic LITI Syntax 47
REMOVE_ITEM Ensures that a correct match is made when one word is a unique
identifier for more than one concept. For example, you can write a rule
that distinguishes between the Arizona Cardinals football team and
the St. Louis Cardinals baseball team. The context of each match is
used to eliminate incorrect matches.
Note: This rule type requires the _c{} modifier and the ALIGNED
operator. Quotation marks (") must surround each of the two arguments
of ALIGNED.
Note: The REMOVE_ITEM rule type is a global rule type that can
influence matches outside of the concept node in which it is used.
REMOVE_ITEM:(ALIGNED, “_c{concept name}”, “argument”)
where argument can be a concept name, rule modifier, or string.
Table 5.2 lists the rules used for extracting facts. Included is a brief description of how
each rule type is used, along with basic syntax.
PREDICATE_RULE Helps you define facts that you want identified in text. For information
about facts, see “Concepts versus Facts” on page 44.
PREDICATE_RULE:(argument-name-1…
<argument-name-n>): (Boolean-rule-1…<Boolean-rule-n>)
where argument-name refers to a name you specify for fact matching,
and where Boolean-rule can be nested n times and is written as:
(Boolean-operator, “_argument-name{argument}”, … “<_argument-name>{<argument>}”)
The PREDICATE_RULE rule type is more flexible than the
SEQUENCE rule type because it does not always specify order.
SEQUENCE Identifies facts in documents if the facts appear in the order specified
with no intervening elements. For information about facts, see
“Concepts versus Facts” on page 44.
SEQUENCE:(argument-name-1…
<argument-name-n>):_argument-name-1{argument}
<_argument-name-n {argument}>
where argument-name refers to a name you specify for fact matching,
and where argument can be a concept name, rule modifier, or string.
Note: This syntax is written in its simplest form. Additional modifiers
and arguments for concept rule matching can be inserted.
The SEQUENCE rule type requires that the number of argument-names
specified match the number of _argument-names applied.
Using Punctuation
Use punctuation to qualify the matches for all rule types except CLASSIFIER and
CONCEPT.
Colon :
Separates rule types and tags. When to use a colon:
48 Chapter 5 • Writing Rules
Comments X X X X
Word (_w) X X X
Multiple matches symbol (>) X X
Morphological expansion symbols (@, @A, @N, and @V) X X X
Boolean and proximity operators X
Part-of-speech tags X X X
Export feature X
Coreference symbols (_ref{}, _P, and _F) X X X
Regular expressions (Regex)
Predefined concepts X X X
Table 5.4 Concept Rule Modifiers and Associated Rule Types, Continued
Comments X X X X
Context (_c{}) X (Required) X (Required)
Word (_w) X X X X
> symbol
Morphological expansion symbols (@, @A, @N, and @V) X X X X
Boolean and proximity operators X
Part-of-speech tags X X X X
Export feature
Coreference symbols (_ref{}, _P, and _F)
Regular expressions (Regex) X (Required)
Predefined concepts X X X X
Table 5.5 Boolean Operators for Extracting Concept Rules and Facts
Operator Description
ALIGNED Takes two arguments. Returns a match when both arguments are matched in
the same span of text in a document. Used with the REMOVE_ITEM rule
type only. For example, the following rule specifies that if a match on rules
in the LOC concept node also matches rules in the PERSON concept node,
then the match on LOC should be removed:
REMOVE_ITEM:(ALIGNED, "_c{LOC}", "PERSON")
AND Takes one or more arguments. Matches if all arguments occur in the
document, in any order. For example, the following rule returns a match on
King Louis XIV if it occurs in the document with France:
CONCEPT_RULE:(AND, "_c{King Louis XIV}", "France")
DIST_n (Distance) Takes a value for n and two or more arguments. Matches if all
arguments occur within n (or fewer) tokens of each other, regardless of their
order. For example, the following rule returns a match in the phrase the
picture with the best lighting:
CONCEPT_RULE:(DIST_5, "best", "_c{picture}")
Note: For calculation purposes, the distance between tokens is not inclusive.
For example, the distance between best and show in the phrase best
in show is two tokens. Tokens that include hyphens are counted as one
(for example, merry-go-round is one token).
NOT Takes one argument. Matches if the argument does not occur in the
document. Must be used with the AND operator. For example, the following
rule returns a match if cinema, theater, or theatre occurs in the
document, but Broadway does not:
CONCEPT_RULE: (AND, (OR, "_c{cinema}", "_c{theater}",
"_c{theatre}"), (NOT, "Broadway"))
Note: The NOT operator applies across the entire document. All operators
must have their own parentheses around themselves and their associated
arguments.
OR Takes one or more arguments. Matches if at least one argument occurs in the
document. For example, the following rule returns a match if one or more of
the items U.S., US, or United States appear in the document:
CONCEPT_RULE:(OR, "_c{U.S.}", "_c{US}", "_c{United States}")
Note: Rules that are generated by SAS Visual Text Analytics nest the OR
operator within the AND operator. However, the OR operator can stand
alone.
ORD (Order) Takes one or more arguments. Matches if all of the arguments occur
in the order specified in the rule. For example, the following rule returns a
match in the sentence The warranty claim for the washing
machine was denied.:
CONCEPT_RULE:(ORD, "warranty", "claim", "denied")
ORDDIST_n (Order and distance) Takes a value for n and two or more arguments.
Matches if all arguments occur in the same order that is specified in the rule
and if all arguments are within n tokens of each other. For example, the
following rule returns a match in the phrase the teacher
introduced elementary statistics because the arguments
appear in the correct order and within five words of each other:
CONCEPT_RULE:(ORDDIST_5, "elementary", "_c{statistics}")
Note: For calculation purposes, the distance between tokens is not inclusive.
For example, the distance between best and show in the phrase best
in show is two tokens. Tokens that include hyphens are counted as one
(for example, merry-go-round is one token).
PARA (Paragraph) Matches if all the arguments occur in a single paragraph, in any
order. For example, the following rule returns a match if the paragraph
contains the term Manhattan and also includes the token apartment.
(Only Manhattan is highlighted.)
CONCEPT_RULE:(PARA, "_c{Manhattan}", "apartment")
Note: PARA rules work properly only when they are applied to data sets that
contain paragraph delimiters \n\n (newline), \t\t (tab), or <P> (paragraph).
PARA cannot be applied on the Test Sample Text tab. PARA also cannot be
applied to data that is contained in folders.
SENT (Sentence) Takes two or more arguments. Matches if all the arguments occur
in the same sentence, in any order. For example, the following rule returns a
match when Amazon and river occur within the same sentence:
CONCEPT_RULE:(SENT, "_c{Amazon}", "river")
Delimiters are used for sentence tokenization, which is a process that breaks
up sentences into words, phrases, symbols, or other meaningful elements
(tokens). Note that a period ( . ) does not necessarily indicate an end of
sentence (for example, Mr. Quackenbush or Boston, Mass. could
occur in the middle of a sentence). Here is a list of sentence delimiters:
\r\n\r\n Two consecutive carriage returns and new lines (for
documents created in Windows)
\r\n \r\n Two consecutive carriage returns and new lines, separated
by a space
.<SPACE> Period (.) followed by an ASCII space
.\n Period (.) followed by a new line
.\r Period (.) followed by a carriage return
! Exclamation point
!\n Exclamation point followed by a new line
!\r Exclamation point followed by a carriage return
? Question mark
?\n Question mark followed by a newline
?\r Question mark followed by a carriage return
.) Period followed by a closing parenthesis
!) Exclamation point followed by a closing parenthesis
?) Question mark followed by a closing parenthesis
.” Period followed by double quotation marks.
SENT_n (Multiple sentences) Takes a value for n and two or more arguments. Returns
matches within n sentences. For example, the following rule returns a match
for the concept node GENDER and the term he within two sentences.
Suppose the GENDER concept node contains the following rule:
CLASSIFIER:male
SENTEND_n (End of sentence) Takes a value for n and one or more arguments. Returns
matches within n tokens of the end of the sentence. For example, suppose the
GENDER concept node contains the following rule:
CLASSIFIER:female
Then the following rule returns a match for the concept node GENDER and
the term she within five tokens from the end of a sentence:
CONCEPT_RULE:(SENTEND_5, "_c{GENDER}", "she")
SENTSTART_n (Start of sentence) Takes a value for n and one or more arguments. Returns
matches within n tokens of the beginning of the sentence. For example, the
following rule locates matches for the sentence The patient
experienced breathing difficulty.:
CONCEPT_RULE:(SENTSTART_5, "_c{patient}", "breathing", "difficulty")
UNLESS Takes two arguments, the second of which is one of the following operators
(with its arguments): AND, SENT, DIST, ORD, or ORDDIST. Restricts
certain matches by specifying a relationship between two arguments and
allowing a match only if a third argument does not intervene. Used in rule
types PREDICATE_RULE and CONCEPT_RULE only.
For example, the following rule does not include the token river in its
matches; in addition, the rule returns matches for Mississippi the state
and not Mississippi the river:
CONCEPT_RULE:(UNLESS, "river", (SENT, "_c{Mississippi}", "United States"))
The rule ensures that river does not appear between Mississippi and
United States in the matches.
Note: When you specify a concept governed directly by the UNLESS
operator, specify concepts that contain only CLASSIFIER or REGEX rules.
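The distance notes for DIST_n and ORDDIST_n in the table above can be illustrated with a small Python sketch. It assumes whitespace tokenization (so a hyphenated word such as merry-go-round stays one token) and looks only at the first occurrence of each term; the product's own tokenizer and matching logic are not documented here, and all function names are hypothetical.

```python
def tokenize(text):
    # Assumption: split on whitespace and trim trailing punctuation.
    # Hyphenated words therefore remain a single token.
    return [word.strip(".,;:!?") for word in text.split()]


def distance(tokens, a, b):
    """Non-inclusive distance: the difference between the two positions."""
    if a not in tokens or b not in tokens:
        return None
    return abs(tokens.index(a) - tokens.index(b))


def dist_n(text, n, a, b):
    """True if a and b occur within n tokens of each other, in any order."""
    d = distance(tokenize(text), a, b)
    return d is not None and d <= n


def orddist_n(text, n, first, second):
    """True if first precedes second and they are within n tokens."""
    tokens = tokenize(text)
    if first not in tokens or second not in tokens:
        return False
    gap = tokens.index(second) - tokens.index(first)
    return 0 < gap <= n


print(distance(tokenize("best in show"), "best", "show"))
print(dist_n("the picture with the best lighting", 5, "best", "picture"))
print(orddist_n("the teacher introduced elementary statistics", 5,
                "elementary", "statistics"))
```

Under these assumptions, best and show are two tokens apart, which agrees with the note for DIST_n, and reversing the arguments of orddist_n makes the ordered check fail.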
You can create a concept node THEY_SAID that enables they to reference its canonical
form, Congressional leaders. Both forms are matched in the document.
C_CONCEPT:_c{LEADERS} said _ref{they}
You can use the following symbols with the coreference modifier (_ref{}). Place the
symbol after the _ref{concept} modifier.
_F (Forward)
Returns only matches that occur from the coreference rule match onward. Sample
syntax:
C_CONCEPT:_c{PERSON} as _ref{TITLE}_F
_P (Preceding)
Returns only matches that occur up to and including the coreference rule match.
Sample syntax:
C_CONCEPT:_c{MILITARY BRANCH} as _ref{HONOR}_P
The rule first matches the term Sokolov. If that match is found, the rule checks the
documents for any occurrences of the term accounts receivable and assigns any
matches to the concept AR. In the list of matches for ACCOUNT_HOLDER, the term
Sokolov would be highlighted. In the list of matches for AR, the term accounts
receivable would be highlighted. Note that in order for the rule to work, the primary
term (in the example, Sokolov) needs to be present anywhere in the document before
accounts receivable can be returned as a match for concept node AR.
Concepts that you are exporting to (such as AR in the example) must exist in the list of
concepts and can contain additional rules (or be empty).
The following example illustrates how to export two sets of terms to the same concept.
CLASSIFIER: [export=text2]:text1
If text1 and text2 appear in a document, return text1 and text2 as separate
matches for the concept where this line is located. For example, suppose you have
written the following rule:
CLASSIFIER:[export=SAS]:institute
The string SAS institute returns SAS and institute as matches to the concept
where this line is located. The string institute (occurring alone) is a match, but not
SAS occurring alone.
The rule assumes that there is a concept SENATE_TITLE that contains words such as
majority leader, senator, and senators, and a concept STATE that includes
names of states. The :Prep tag indicates a preposition (for example, from or of). A
match on the C_CONCEPT rule would occur on the text Senator Phineas
Craymoor from North Carolina took the floor. However, the following
text would not produce a match because the word and is not a preposition: Senators
Phineas Craymoor and Garrett Garcia from North Carolina pushed
the bill through.
Table 5.6 lists the part-of-speech tags in English. For tags in other languages, see
Appendix 1, “Part-of-Speech Tags (for Languages Other Than English),” on page 71.
Note that in some languages, the tags documented in these sections might be different
from the tags displayed in the Role column of the Text Parsing node.
:Vpt Past tense be, do, or have auxiliary was, were, did, have
Past tense verb dashed, factored, went
If you add a plus sign (+) as follows, the rule matches one or more of the characters
specified in any combination, such as rash, cash, ash, and crass (but not
crashpad or crashdummy):
REGEX:[crash]+
• Characters are matched within a string in sequence when represented without square
brackets ([ ] ). For example, the following rule matches only the word any
(anyone or anything would not be matched):
REGEX:any
To match words that contain any, you can modify the rule to use asterisks (*) to
match other character occurrences (or none) surrounding any. For example, the
following rule matches any, anyone, anything, and Many:
REGEX:[A-Za-z]*any[A-Za-z]*
• You can specify a range of characters to be matched. For example, the following rule
matches lowercase characters between a and f, inclusively:
REGEX:[a-f]
• You can specify characters that should not be matched (negated characters) by
inserting a caret (^) before a set of characters. For example, the following rule
matches all characters, numbers, and symbols in text except a, e, i, o, and u:
REGEX:[^aeiou]
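The character-class patterns above can be tried out with Python's re module. This is a sketch only: Python's regular expression engine is not the one used by the product, and matching each pattern against a whole token (re.fullmatch) is an assumption made to mirror the behavior described above. The helper name matches_token is hypothetical.

```python
import re


def matches_token(pattern, token):
    # Assumption: a pattern must account for the whole token to match.
    return re.fullmatch(pattern, token) is not None


examples = {
    r"[crash]+": ["rash", "cash", "ash", "crass", "crashpad"],
    r"any": ["any", "anyone"],
    r"[A-Za-z]*any[A-Za-z]*": ["any", "anyone", "anything", "Many"],
    r"[a-f]": ["c", "g"],
    r"[^aeiou]": ["x", "7", "a"],
}

for pattern, tokens in examples.items():
    for token in tokens:
        print(pattern, token, matches_token(pattern, token))
```

For example, crashpad fails against [crash]+ because p and d are not in the bracketed set, while Many succeeds against the asterisk pattern because any is surrounded by zero or more letters.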
The special characters used for matching in Regex syntax can be used in combination
and are shown in Table 5.7 on page 58.
% Matches %
? Matches 0 or 1 occurrences
{} Indicates repetition:
{n} matches exactly n occurrences
{n,} matches at least n occurrences
{n,m} matches at least n occurrences but no more than m occurrences
\a Alarm (beep)
\n New line
\r Carriage return
\t Tab
\f Form feed
\e Escape
If you add a plus sign (+) to match multiple occurrences (or one occurrence) as
follows, the rule matches any combination of the characters that are specified, such
as xzx, yz, and zyzy:
REGEX:[xyz]+
breathes and breathing, use the following syntax for the argument:
“breathe@V”.
Symbol Description
@ Expands the concept rule to match all inflectional forms of the word in the
argument. For example, the argument “wonder@” returns the matches
wonder, wonders, wondered, wondering, and so on.
Note: If you apply @ to a word that SAS Visual Text Analytics does not
recognize, no expansion occurs. Only the exact string specified before the
@ is matched. For example, “grath” would not expand. Only the string
grath would return a match in the rule.
@N Expands the concept rule to match all inflected noun forms of the word in
the argument. For example, the argument “quality@N” returns the
matches quality and qualities.
Note: If you apply @N to a word that is not a noun, no expansion occurs.
@V Expands the concept rule to match all inflected verb forms of the word in
the argument. For example, the argument “transfer@V” returns the
matches transfer, transfers, transferred, and
transferring.
Note: If you apply @V to a word that is not a verb, no expansion occurs.
Adding Comments
You can insert comments into rule definitions that have separate rules appearing on
successive lines, such as CLASSIFIER rules. The comment continues until the end of
the line. Comments are written as
# comment text
Note: The pound character (#) denotes a comment. If you want to match # in a rule
definition, you must use a backward slash (\) as an escape character before the #.
(Example: The expression 99\# attempts to match the string 99#.)
TIP You can comment out a rule by inserting a pound character (#) at the beginning
of a line that contains a rule.
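The comment convention above (a pound character starts a comment unless it is escaped with a backward slash) can be sketched as a small Python helper for cleaning rule text outside the product. The function name strip_comment is hypothetical, and the sketch assumes that a backslash is significant only when it directly precedes a pound character.

```python
def strip_comment(line):
    """Return the rule text of a line with any trailing # comment removed.

    An escaped pound character (a backslash followed by #) is kept as part
    of the rule, as described in the note above.
    """
    kept = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] == "#":
            kept.append("\\#")  # escaped pound: part of the rule text
            i += 2
            continue
        if ch == "#":
            break  # unescaped pound: the rest of the line is a comment
        kept.append(ch)
        i += 1
    return "".join(kept).rstrip()


print(strip_comment("CLASSIFIER:BUF # Buffalo"))
print(strip_comment("# CLASSIFIER:BUR"))
print(strip_comment(r"REGEX:99\# # matches 99#"))
```

A line that begins with a pound character reduces to an empty string, which corresponds to a rule that has been commented out.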
CLASSIFIER
Example: To extract documents that contain US airport codes, you can create a
concept node named US_AIRPORTS that includes these CLASSIFIER rules:
CLASSIFIER:BUF
CLASSIFIER:BUR
CLASSIFIER:BVK
So, documents that include a match on one or more of the airport codes BUF, BUR, or
BVK, return a match for US_AIRPORTS.
CONCEPT
Example: To extract documents that contain flight arrival information, create a
concept node ON_TIME_ARRIVALS. The rule definition for ON_TIME_ARRIVALS
contains the CONCEPT rule type. The CONCEPT rule type can reference the
concept node US_AIRPORTS, which enables airport codes to be detected. The rule
definition for the concept node ON_TIME_ARRIVALS is as follows: CONCEPT:at
US_AIRPORTS on time (where US_AIRPORTS includes CLASSIFIER rules that
identify US airport codes).
C_CONCEPT
Example: To extract documents that include names of university professors, create a
C_CONCEPT rule named PROFESSORS whose definition includes this rule:
C_CONCEPT:Professor _c{FIRSTNAME LASTNAME}. The rule indicates that
matches are returned when FIRSTNAME and LASTNAME (previously defined) are
found, but only when they are preceded by the word Professor. Provide the
context for the match by using the modifier _c and enclosing the argument that you
want to match in the braces ({}).
The rule modifier _c{} indicates that the match occurs within the context of the
specified concept nodes.
NO_BREAK
Example: Suppose you want to extract National Gallery of Art. You defined
a concept node US_ART_GALLERIES that includes the CLASSIFIER rule
National Gallery of Art. There also exists a concept node called
CLASS_TYPES that includes the CLASSIFIER rule Art. You can create the
following rule that prevents a partial match on CLASS_TYPES and ensures that the
entire string National Gallery of Art is matched:
NO_BREAK:_c{US_ART_GALLERIES}
The rule modifier _c indicates that the match occurs within the context of another
concept node.
REMOVE_ITEM
Example: Suppose you want to extract the baseball team St. Louis Cardinals, but
not the football team Arizona Cardinals. You have a concept node named
FOOTBALL that includes the rule CLASSIFIER:Cardinals. You have another
concept node named BASEBALL that includes the rule CLASSIFIER:Cardinals.
The following rule returns matches for the baseball team only:
REMOVE_ITEM:(ALIGNED, "_c{FOOTBALL}", "BASEBALL")
Note: The REMOVE_ITEM rule type could influence matches outside of the
concept node in which it is used. In this case, the rule could influence matches in
the FOOTBALL rule because the rule specifies that items be removed.
REGEX
Example: To extract whole numbers in text (such as 1, 23, 456, and so on), use the
rule
REGEX:[0-9]+
This rule requires that one or more consecutive digits occur and are without
decimals.
Example: To extract a number that uses decimal notation, such as 392.55, 45.25,
and 0,987654321, use the following rule:
REGEX:[0-9]+[,\.][0-9]+
This rule returns a match on one or more digits, followed by a comma or a
period, followed by one or more digits.
For more information about writing Regex rules, see “Using Regular Expressions
(Regex)” on page 57.
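The two number patterns above can be checked with Python's re module. This is a sketch under the assumption that Python's engine treats these simple patterns the same way; the sample sentence and the names WHOLE and DECIMAL are made up for illustration.

```python
import re

WHOLE = re.compile(r"[0-9]+")              # one or more consecutive digits
DECIMAL = re.compile(r"[0-9]+[,\.][0-9]+")  # digits, a comma or period, digits

# Hypothetical sample text for illustration only.
text = "Totals were 392.55, 45.25 and 0,987654321 across 23 stores."

print(DECIMAL.findall(text))
print(WHOLE.findall("1, 23, 456"))
```

Note that in Python the whole-number pattern would also find the digit runs on either side of a decimal separator (392 and 55 in 392.55); how the product resolves such overlaps is not covered by this sketch.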
CONCEPT_RULE
Example: Suppose you want to extract Amazon the company, not Amazon the river.
You could use this rule, which would return a company name within three words of
company, but not if there were nature-related words in the document.
CONCEPT_RULE:(AND, (DIST_3, "_c{COMPANY}", "company"), (NOT, "NATURE"))
SEQUENCE
Example: Suppose you want to extract first and last names only from a list of first,
middle, and last names. You can use a SEQUENCE rule to define the arguments
first and last. By using these arguments, matches are made on the concept nodes
FIRST_NAME, MIDDLE_NAME, and LAST_NAME, but matches are returned on only
FIRST_NAME and LAST_NAME.
SEQUENCE:(first, last): _first{FIRST_NAME} MIDDLE_NAME _last{LAST_NAME}
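The requirement that the number of argument names declared in a SEQUENCE rule match the number of _argument-names applied can be checked with a short Python sketch. It handles only the simple, single-line form shown above and would also count other brace modifiers (such as _c{}) as applied names; the function names are hypothetical.

```python
import re


def sequence_names(rule):
    """Return (declared, applied) argument names for a simple SEQUENCE rule."""
    match = re.fullmatch(r"SEQUENCE:\(([^)]*)\):(.*)", rule.strip())
    if match is None:
        raise ValueError("not a SEQUENCE rule in the simple form shown above")
    declared = [n.strip() for n in match.group(1).split(",") if n.strip()]
    # An applied name is an underscore-prefixed word directly followed by "{"
    # and not preceded by another word character.
    applied = re.findall(r"(?<!\w)_(\w+)\{", match.group(2))
    return declared, applied


def names_agree(rule):
    declared, applied = sequence_names(rule)
    return len(declared) == len(applied)


rule = "SEQUENCE:(first, last): _first{FIRST_NAME} MIDDLE_NAME _last{LAST_NAME}"
print(sequence_names(rule))
print(names_agree(rule))
```

In the example rule, MIDDLE_NAME takes part in matching but has no argument name applied to it, so it is not counted and is not returned as part of the fact.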
PREDICATE_RULE
Example: Suppose you want to match a company to its products. You could use the
following PREDICATE_RULE, which assumes that the concept node COMPANY
includes CLASSIFIER rules that list company names and the concept node
PRODUCTS contains CLASSIFIER rules that list products. Items must appear in the
same sentence.
PREDICATE_RULE:(company, product):(SENT, "_company{COMPANY}",
"produces", "_product{PRODUCTS}")
• Boolean and proximity operators and their arguments are enclosed in parentheses
and separated with commas. The arguments are included in quotation marks (“ ”).
Example: (AND, “my_w holiday”, “_cap”)
• Rules can be nested. Example: (AND, (OR, “courage”, “courageous”), (OR, “brave”,
“bravery”))
• Reference a category from another category by using special syntax called tmac
syntax (_tmac). For more information, see “Using _tmac for Referencing Categories”
on page 69.
• Concept node names can be referenced in category rules. If you reference a concept
node name, the concept matches are used to contribute to the true/false match of the
category rule. Concept node names must be enclosed in square brackets ( [] ). For example,
to reference the concept node GAME_SHOWS in a category rule, you could write the
rule (OR, “[GAME_SHOWS]”).
Note: Concept nodes that are named in categories might return more matches than
concepts that are run outside of categories. In categories, matches on concepts
are based on an “all matches” method, which returns all matches found in the
text. The best match method detects when text that matches one concept overlaps
text that matches another concept (for example, a concept that matches New
York and another concept that matches New York City). When concept
matches overlap and the best match method is used, only the concept that is
assigned the highest number for the priority is returned (1 is the lowest). When
two or more concepts have the same priority assigned, SAS Visual Text
Analytics selects a match.
• The enabled or disabled status of concepts that are named in categories is ignored
during category matching. As a result, the concepts are processed as if they were all
enabled, regardless of whether they were previously disabled.
• Special symbols can be used to modify the rules to include wildcards, case
sensitivity, and so on. For a list of symbols, see Table 5.10 on page 67.
Note: XPath expressions are not supported.
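The way nested Boolean operators combine in a category rule can be sketched in Python. The sketch represents a rule as nested tuples rather than parsing rule text, treats a document as a set of lowercased whitespace tokens, and supports only AND, OR, and NOT. Case-insensitive matching and the function names are assumptions made for illustration, not a description of the product's matching engine.

```python
def evaluate(rule, tokens):
    """Evaluate a nested (operator, argument, ...) tuple against a token set."""
    if isinstance(rule, str):
        return rule.lower() in tokens  # assumption: case-insensitive terms
    operator, *arguments = rule
    results = [evaluate(argument, tokens) for argument in arguments]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        return not results[0]
    raise ValueError(f"unsupported operator: {operator}")


def matches(rule, document):
    tokens = {word.strip(".,;:!?").lower() for word in document.split()}
    return evaluate(rule, tokens)


rule = ("AND", ("OR", "courage", "courageous"), ("OR", "brave", "bravery"))
print(matches(rule, "It took courage to be so brave."))
print(matches(rule, "It took courage."))
```

Because each nested tuple is evaluated before its parent, the inner OR groups are resolved first and the outer AND then requires both groups to be satisfied.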
Operator Description
AND Takes one or more arguments. Matches if all arguments occur in the
document, in any order. For example, the rule (AND, “King”, “Louis”,
“XIV”) returns a match if King, Louis, and XIV all occur in the
document.
DIST_n (Distance) Takes a value for n and two or more arguments. Matches if all
arguments occur within n (or fewer) tokens of each other, regardless of
their order. For example, the rule (DIST_5, “best”, “picture”) returns a
match in the phrase the picture with the best lighting.
Note: For calculation purposes, the distance between tokens is not
inclusive. For example, the distance between the tokens best and show
in the phrase best in show is two tokens. Words that include
hyphens are counted as one token (for example, merry-go-round is
one token).
END_n (From the end of the document) Takes a value for n and one or more
arguments. Matches if the argument occurs within n tokens from the end
of the document. For example, the rule (END_35, “conclusion”) returns a
match if conclusion is found within 35 tokens from the last token in
the document.
Note: Words that include hyphens are counted as one word (for example,
merry-go-round is one word).
MIN_n (Minimum) Takes a value for n and one or more arguments. Matches if the
document contains at least n of the arguments specified (in any order). For
example, the rule (MIN_2, “Hollywood”, “tinseltown”, “movies”) returns
a match if Hollywood and movies occur in the document. However,
there is no match if Hollywood occurs twice and no other arguments
occur.
MINOC_n (Minimum occurrence) Takes a value for n and one or more arguments.
Matches if the document contains at least n occurrences of the arguments
specified (in any order or combination). For example, the rule (MINOC_2,
“Hollywood”, “tinseltown”, “movies”) returns a match if Hollywood
and movies occur in the document. There is also a match if
Hollywood occurs twice and no other arguments occur.
MAXOC_n (Maximum occurrence) Takes a value for n and one or more arguments.
Matches if the document contains n or fewer occurrences of the arguments
(in any order or combination). For example, the rule (MAXOC_8,
“savings”, “offer”, “best”) returns a match if savings occurs in the
document six times. There is also a match if offer occurs in the
document six times and best occurs twice.
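The difference between MIN_n, MINOC_n, and MAXOC_n comes down to what is counted: distinct arguments or total occurrences. The following Python sketch illustrates that difference. The helper names and the tokenizer are hypothetical, matching is assumed to be case-insensitive, and only single-token arguments are handled. The text does not say whether MAXOC_n matches a document with zero occurrences, so the sketch checks only the upper bound.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def occurrence_counts(text, args):
    """Count how many times each argument occurs in the document."""
    tokens = tokenize(text)
    return {arg: tokens.count(arg.lower()) for arg in args}

def min_match(text, n, args):
    # MIN_n: at least n *different* arguments are present.
    counts = occurrence_counts(text, args)
    return sum(1 for c in counts.values() if c > 0) >= n

def minoc_match(text, n, args):
    # MINOC_n: at least n occurrences in total, in any combination.
    return sum(occurrence_counts(text, args).values()) >= n

def maxoc_match(text, n, args):
    # MAXOC_n: n or fewer occurrences in total (upper bound only; see lead-in).
    return sum(occurrence_counts(text, args).values()) <= n
```

For a document in which Hollywood occurs twice and no other argument occurs, min_match with n=2 is False and minoc_match with n=2 is True, which mirrors the examples in the table.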
MAXPAR_n (Maximum paragraph) Takes a value for n and one or more arguments.
Matches if all arguments occur within the first n (or fewer) paragraphs of
the document, in any order. For example, the rule (MAXPAR_4,
“seasonal”, “herbs”, “plants”) returns a match if seasonal occurs in
paragraph 4, herbs occurs in paragraph 2, and plants occurs in
paragraph 2.
Note: MAXPAR rules work properly only when applied to data sets that
contain paragraph delimiters (\n\n). MAXPAR cannot be applied on the
Test Sample Text tab. MAXPAR also cannot be applied in the Categories
node to data that is contained in folders.
MAXSENT_n (Maximum sentence) Takes a value for n and one or more arguments.
Matches if all arguments occur within the first n sentences of the
document, in any order. For example, the rule (MAXSENT_4, “weight
loss”, “plan”) returns a match if weight loss and plan occur in
sentence 3 of the document. For a list of sentence delimiters, see the SENT
operator.
NOT Takes one argument. Matches if the argument does not occur in the
document. Must be used with the AND operator. For example, the rule
(AND, (OR, “cinema”, “theater”, “theatre”), (NOT, “Broadway”)) returns
a match if cinema, theater, or theatre occurs in the document and
Broadway does not.
Note: The NOT operator applies across the entire document.
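To show how nested AND, OR, and NOT operators combine over the whole document, here is a minimal Python sketch that evaluates a rule written as nested tuples. The tuple representation, the evaluate function, and the tokenizer are assumptions made for illustration. The sketch does not enforce the requirement that NOT be used with AND.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def evaluate(rule, tokens):
    """Evaluate a nested (operator, argument, ...) tuple against a set of tokens."""
    if isinstance(rule, str):
        return rule.lower() in tokens
    operator, *arguments = rule
    if operator == "AND":
        return all(evaluate(arg, tokens) for arg in arguments)
    if operator == "OR":
        return any(evaluate(arg, tokens) for arg in arguments)
    if operator == "NOT":
        # NOT applies across the entire document, so one token set is enough.
        return not evaluate(arguments[0], tokens)
    raise ValueError(f"unsupported operator: {operator}")

# Hypothetical tuple form of the rule shown in the table.
RULE = ("AND", ("OR", "cinema", "theater", "theatre"), ("NOT", "Broadway"))

def rule_matches(text):
    return evaluate(RULE, set(tokenize(text)))
```

With this sketch, a document that mentions a theater but not Broadway matches, and a document that mentions both does not.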
NOTIN (Not in) Takes two arguments and matches if the first argument does not
appear within the second argument. For example, the rule (NOTIN,
“butter”, “peanut butter”) identifies butter when it does not appear
within the noun phrase peanut butter. This sentence returns a
match: Early American colonists churned their own
butter.
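One way to picture NOTIN is to mark every token position covered by the second argument and then look for an occurrence of the first argument that is not fully inside a marked span. The Python sketch below follows that idea. The function names and the tokenizer are hypothetical, and matching is assumed to be case-insensitive.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def phrase_starts(tokens, phrase):
    """Return every index at which the token sequence `phrase` begins."""
    size = len(phrase)
    return [i for i in range(len(tokens) - size + 1) if tokens[i:i + size] == phrase]

def notin_match(text, inner, outer):
    """True if `inner` occurs at least once outside every occurrence of `outer`."""
    tokens = tokenize(text)
    inner_tokens, outer_tokens = tokenize(inner), tokenize(outer)
    covered = set()
    for start in phrase_starts(tokens, outer_tokens):
        covered.update(range(start, start + len(outer_tokens)))
    return any(
        not set(range(start, start + len(inner_tokens))) <= covered
        for start in phrase_starts(tokens, inner_tokens)
    )
```

Under these assumptions, the sentence about colonists churning butter matches, while a sentence in which butter appears only inside peanut butter does not.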
NOTINDIST_n (Not in distance) Takes a value for n and two arguments. Matches if the
arguments do not occur within n tokens of each other, or if the first
argument listed in the rule occurs in the document and the second
argument does not. For example, the rule (NOTINDIST_3, “orange”,
“green”) returns a match if orange and green do not occur within
three tokens of each other, or if only orange appears in the document.
The following sentence returns a match because the tokens that are
specified in the rule are more than three words apart: How green is
my valley, how orange is the sunset?
Note: For calculation purposes, the distance between tokens is not
inclusive. For example, the distance between the tokens best and show
in the phrase best in show is two tokens. Tokens that include
hyphens are counted as one token (for example, merry-go-round is
one token).
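The NOTINDIST_n behavior can be approximated with the same index-difference distance. This Python sketch is an illustration under stated assumptions: the names are hypothetical, punctuation is not counted as a token, and a match is reported when at least one occurrence of the first argument has no occurrence of the second argument within n tokens. How the product treats documents with several occurrences is not described in the text.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def notindist_match(text, n, first, second):
    """Mimic NOTINDIST_n for two single-token arguments."""
    tokens = tokenize(text)
    first_at = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_at = [i for i, t in enumerate(tokens) if t == second.lower()]
    # If `second` never occurs, all() over an empty list is True, so the rule
    # matches whenever `first` is present. If `first` is absent, any() is False.
    return any(all(abs(i - j) > n for j in second_at) for i in first_at)
```

In the example sentence, orange and green are five token positions apart under this tokenizer, so the sketch reports a match for n=3.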
NOTINPAR (Not in paragraph) Takes two or more arguments and matches if all
arguments occur within the document but appear in separate paragraphs.
For example, the rule (NOTINPAR, “China”, “export”) returns a match if
China and export occur in separate paragraphs (without the other
argument present).
Note: NOTINPAR rules work properly only when applied to data sets that
contain paragraph delimiters (\n\n). NOTINPAR cannot be applied on the
Test Sample Text tab. NOTINPAR also cannot be applied in the
Categories node to data that is contained in folders.
NOTINSENT (Not in sentence) Takes two or more arguments and matches when the first
argument is present and the second argument does not occur in the same
sentence. For example, the rule (NOTINSENT, “trade”, “China”) returns a
match on trade if China does not occur in the same sentence. For a list of
sentence delimiters, see the SENT operator.
ORD (Order) Takes one or more arguments. Matches if all of the arguments
occur in the order that is specified in the rule. It cannot be used with SENT
(or any other operator that limits the scope of matches). For example, the
rule (ORD, “warranty”, “claim”, “denied”) returns a match in the sentence
The warranty claim for the washing machine was
denied.
ORDDIST_n (Order and distance) Takes a value for n and two or more arguments.
Matches if all arguments occur in the order that is specified in the
rule and if all arguments are within n tokens of each other. For example,
the rule (ORDDIST_5, “elementary”, “statistics”) returns a match in the
phrase the teacher introduced elementary
statistics.
Note: For calculation purposes, the distance between tokens is not
inclusive. For example, the distance between the tokens best and show
in the phrase best in show is two tokens. Words that include
hyphens are counted as one token (for example, merry-go-round is
one token).
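The ordering checks behind ORD and ORDDIST_n can be sketched as follows. This is a simplified Python illustration, not the product's algorithm: the function names and tokenizer are assumptions, arguments are single tokens, and the ORDDIST sketch handles exactly two arguments.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def ord_match(text, args):
    """True if the arguments occur in the given order (not necessarily adjacent)."""
    remaining = iter(tokenize(text))
    # Each membership test consumes the iterator up to the match, so a later
    # argument can only be found after the earlier ones.
    return all(arg.lower() in remaining for arg in args)

def orddist_match(text, n, first, second):
    """True if `second` follows `first` within n tokens."""
    tokens = tokenize(text)
    first_at = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_at = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(0 < j - i <= n for i in first_at for j in second_at)
```

Both example sentences from the table match under these assumptions, and reversing the order of the arguments in the text prevents a match.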
PAR (Paragraph) Takes one or more arguments. Matches if all the arguments
occur in a single paragraph, in any order. For example, the rule (PAR,
“director”, “budget”) returns a match if the paragraph includes both
director and budget.
Note: PAR rules work properly only when applied to data sets that contain
paragraph delimiters (\n\n). PAR cannot be applied on the Test Sample
Text tab. PAR also cannot be applied in the Categories node to data that is
contained in folders.
PARPOS_n (Paragraph position) Takes a value for n and one or more arguments.
Matches if all arguments occur within the nth paragraph, in any order. For
example, the rule (PARPOS_2, “journalists”, “detained”, “overseas”)
returns a match if journalists, detained, and overseas occur
within paragraph 2 of the document.
Note: PARPOS rules work properly only when applied to data sets that
contain paragraph delimiters (\n\n). PARPOS cannot be applied on the
Test Sample Text tab. PARPOS also cannot be applied in the Categories
node to data that is contained in folders.
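Because the paragraph operators depend on the \n\n delimiter, their behavior can be illustrated by splitting the document on that delimiter. The Python sketch below covers PAR, MAXPAR_n, and PARPOS_n. All names are hypothetical, paragraphs are numbered from 1, and arguments are assumed to be single tokens matched case-insensitively.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def paragraph_tokens(text):
    """Split on the paragraph delimiter and return one token set per paragraph."""
    return [set(tokenize(paragraph)) for paragraph in text.split("\n\n")]

def contains_all(tokens, args):
    return all(arg.lower() in tokens for arg in args)

def par_match(text, args):
    # PAR: all arguments occur together in a single paragraph.
    return any(contains_all(tokens, args) for tokens in paragraph_tokens(text))

def maxpar_match(text, n, args):
    # MAXPAR_n: all arguments occur somewhere within the first n paragraphs.
    first_n = set().union(*paragraph_tokens(text)[:n])
    return contains_all(first_n, args)

def parpos_match(text, n, args):
    # PARPOS_n: all arguments occur in the nth paragraph (1-based).
    paragraphs = paragraph_tokens(text)
    return 1 <= n <= len(paragraphs) and contains_all(paragraphs[n - 1], args)
```

A document without \n\n delimiters is treated as one paragraph by this sketch, which is one way to see why these operators need delimited data.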
SENT (Sentence) Takes two or more arguments. Matches if all the arguments
occur in the same sentence, in any order. For example, the rule (SENT,
“growth”, “hormone”) returns a match in the sentence Patients who
take a growth hormone might experience side
effects. Sentence delimiters are as follows:
\r\n\r\n Two consecutive carriage returns and new lines (for
documents created in Windows)
\r\n \r\n Two consecutive carriage returns and new lines,
separated by a space
.<SPACE> Period (.) followed by an ASCII space
.\n Period (.) followed by a new line
.\r Period (.) followed by a carriage return
! Exclamation point
!\n Exclamation point followed by a new line
!\r Exclamation point followed by a carriage return
? Question mark
?\n Question mark followed by a new line
?\r Question mark followed by a carriage return
.) Period followed by a closing parenthesis
!) Exclamation point followed by a closing parenthesis
?) Question mark followed by a closing parenthesis
.” Period followed by double quotation marks
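The delimiter list above can be turned into a rough sentence splitter to illustrate how SENT scopes its match. This Python sketch is an approximation under stated assumptions: the names are hypothetical, the delimiters are copied from the list (with the closing quotation mark taken to be the curly character shown), and longer delimiters are tried before shorter ones.

```python
import re

# Delimiters as listed for the SENT operator.
SENTENCE_DELIMITERS = [
    "\r\n\r\n", "\r\n \r\n", ". ", ".\n", ".\r",
    "!", "!\n", "!\r", "?", "?\n", "?\r",
    ".)", "!)", "?)", ".\u201d",
]

# Try longer delimiters first so that "!\n" is preferred over "!".
_DELIMITER_PATTERN = re.compile(
    "|".join(re.escape(d) for d in sorted(SENTENCE_DELIMITERS, key=len, reverse=True))
)

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def split_sentences(text):
    """Split text at the listed delimiters and drop empty pieces."""
    return [piece.strip() for piece in _DELIMITER_PATTERN.split(text) if piece.strip()]

def sent_match(text, args):
    """True if all arguments occur together in at least one sentence."""
    return any(
        all(arg.lower() in set(tokenize(sentence)) for arg in args)
        for sentence in split_sentences(text)
    )
```

Note that a period at the very end of the text is not followed by a space or line break, so under this list it does not end a sentence; the sketch simply leaves it attached to the last piece.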
START_n (From the start of the document) Takes a value for n and one or more
arguments. Matches if the argument occurs within n words from the start
of the document. For example, the rule (START_22, “infection”) returns a
match if infection occurs within 22 words of the first word in the
document.
Note: Words that include hyphens are counted as one word (for example,
merry-go-round is one word).
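START_n and END_n can be pictured as checks against a window of tokens at the beginning or end of the document. In the Python sketch below, the function names and tokenizer are hypothetical, and the window is assumed to be exactly the first or last n tokens. The text does not spell out the exact boundary, so treat that choice as an assumption.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or digits; hyphenated words stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def start_match(text, n, arg):
    """True if `arg` is among the first n tokens (assumed boundary)."""
    return arg.lower() in tokenize(text)[:n]

def end_match(text, n, arg):
    """True if `arg` is among the last n tokens (assumed boundary)."""
    # Guard against n == 0, because tokens[-0:] would return the whole list.
    return n > 0 and arg.lower() in tokenize(text)[-n:]
```

Because hyphenated words are kept whole by the tokenizer, a word such as merry-go-round takes up one position in either window.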
Symbol Description
_L (Literal matching) Matches a literal string. Useful when you want to match
a string that includes symbols. For example, the argument “$USD_L”
returns the match $USD.
Note: Tokens (words, phrases, symbols, or other meaningful elements)
need to be specified by the user to be considered for matching.
To enforce a first name followed by last name (FIRST LAST), you could add this rule in
a category called COMPLETE_NAME:
(ORD,_tmac:"@Top/NAME/FIRST",_tmac:"@Top/NAME/LAST")
Appendix 1
(such as punctuation). All tags are case-sensitive and are preceded by a colon (:) in
concept rules. For more information, including English tags, see “Using Part-of-Speech
and Other Tags” on page 55.
Arabic
Chinese
:C Conjunction 或, 与, 雖然
:E Interjection 咦, 呸, 哦喲
:G Other morpheme 馨, 慚
:H Other prefix 亚, 非
:K Other suffix 们, 者, 們
:P Preposition 依照, 对于
:Q Classifier 个, 斤, 艘, 加侖
:R Pronoun 我, 他們, 这
:U Particle 的, 了, 着
:W Punctuation or symbols !, 。, $, ¥
:Y Interjectional particle 吧, 吗, 麽
Croatian
Czech
:PPOS Preposition v, z
Danish
Dutch
:digit Number 21
English
Farsi
:A Adjective خوشحال,خوشگل
Finnish
French
German
:A Adjective zuverlässig
:digit Number 21
Greek
:digit Number 1, 20
Hebrew
:A Adjective אדיר,יפה
Hindi
:digit Number 0, 3
Indonesian
Italian
:digit Number 21
Japanese
:NN Numeral 千, 零, 6
To use Japanese POS tags in LITI rules, add the form type after the POS tag. For the
POS tags of nominals, add ‘|ROOT’ after the tag (for example, ‘NC|ROOT’,
‘DN|ROOT’, ‘CN|ROOT’). For the POS tags of predicates, add one of the conjugation
forms listed in the table below (for example, ‘AJ|CONJ’, ‘V1|COND’).
Korean
:IJ Interjection 아, 네, 그래
Norwegian
:A Adjective leket
Polish
Portuguese
:digit Number 21
Russian
:INTJ Interjection ах
Slovak
Slovene
:PPOS Preposition v, za
Spanish
:digit Number 21
Swedish
:A Adjective fört
Tagalog
:PTCL Particle ay
:digit Number 1, 20
Thai
Turkish
Vietnamese
Appendix 2
Pre-Defined Concept Priorities (for Languages Other Than English)
“Which Rule Type Should I Use?” on page 45. For priority values in English, see
“Concepts” on page 5.
Note: Use the highest priority value per language to ensure that there are no conflicts
with custom concepts during document processing. The highest priority value for
each language is marked in the tables in the following section with a footnote.
Arabic
nlpDate 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization 20
nlpPercent 18
nlpPerson 20
nlpPlace* 25*
nlpTime 18
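As a small illustration of reading these tables, the following Python sketch looks up the highest priority value for one language. The values are copied from the Arabic table above; the dictionary and function names are hypothetical and are not part of the product.

```python
# Priority values copied from the Arabic table.
ARABIC_PRIORITIES = {
    "nlpDate": 18,
    "nlpMoney": 18,
    "nlpNounGroup": 15,
    "nlpOrganization": 20,
    "nlpPercent": 18,
    "nlpPerson": 20,
    "nlpPlace": 25,
    "nlpTime": 18,
}

def highest_priority(priorities):
    """Return the highest value and the concepts that carry it."""
    top = max(priorities.values())
    return top, sorted(name for name, value in priorities.items() if value == top)

print(highest_priority(ARABIC_PRIORITIES))
```

For the Arabic values, the result is 25 for nlpPlace, which is the entry marked with a footnote in the table.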
Chinese
The default value of 10 is used for all of the predefined concepts listed below.
Predefined Concept
nlpDate
nlpMoney
nlpOrganization
nlpPercent
nlpPerson
nlpPlace
nlpTime
Croatian
nlpDate 10
nlpMeasure 10
nlpMoney 10
nlpNounGroup 10
nlpOrganization 10
nlpPercent 10
nlpPerson 11
nlpPlace* 12*
nlpTime 10
Czech
nlpDate* 10*
nlpMoney* 10*
nlpNounGroup 9
nlpOrganization* 10*
nlpPercent* 10*
nlpPerson* 10*
nlpPlace* 10*
nlpTime* 10*
Danish
nlpNounGroup 15
nlpOrganization* 20*
nlpPerson* 20*
nlpPlace* 20*
Dutch
nlpDate 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization* 20*
nlpPercent 18
nlpPerson* 20*
nlpPlace* 20*
nlpTime 18
Farsi
For Farsi, there are no specific priority values for predefined concepts. The default value
of 10 is used for all of the predefined concepts listed below.
Predefined Concept
nlpDate
nlpMoney
nlpOrganization
nlpPercent
nlpPerson*
PERSON
ORGANIZATION
Finnish
nlpDate 10
nlpMoney 10
nlpNounGroup 15
nlpOrganization* 25*
nlpPerson 20
nlpPlace* 25*
nlpTime 10
French
nlpDate 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization* 20*
nlpPercent 18
nlpPerson* 20*
nlpPlace* 20*
nlpTime 18
German
nlpDate 18
nlpMoney 25
nlpNounGroup 15
nlpOrganization 25
nlpPercent 18
nlpPerson* 60*
nlpPlace 40
nlpTime 18
Greek
nlpDate 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization 20
nlpPercent 18
nlpPerson* 20
nlpPlace 25*
nlpTime 18
Hebrew
For Hebrew, there are no specific priority values for predefined concepts. The default
value of 10 is used for all of the predefined concepts listed below.
Predefined Concept
nlpDate
nlpMoney
nlpNounGroup
nlpOrganization
nlpPercent
nlpPerson
nlpPlace
nlpTime
Hindi
nlpDate 10
nlpMoney 10
nlpNounGroup 10
nlpOrganization 10
nlpPercent 10
nlpPerson 10
nlpPlace* 40*
nlpTime 10
Indonesian
nlpDate* 20*
nlpMoney* 20*
nlpNounGroup 10
nlpOrganization* 20*
nlpPercent* 20*
nlpPerson* 20*
nlpPlace* 20*
nlpTime* 20*
Italian
For Italian, there are no specific priority values for predefined concepts. The default
value of 10 is used.
Predefined Concept
nlpDate
nlpMoney
nlpNounGroup
nlpOrganization
nlpPercent
nlpPerson*
nlpPlace
nlpTime
Japanese
For Japanese, there are no specific priority values for predefined concepts. The default
value of 50 is used for all of the predefined concepts listed below.
Predefined Concept
nlpDate
nlpMoney
nlpOrganization
nlpPercent
nlpPerson*
nlpPlace
nlpTime
Korean
For Korean, there are no specific priority values for predefined concepts. The default
value of 50 is used.
Predefined Concept
nlpDate
nlpMoney
nlpOrganization
nlpPercent
nlpPerson*
nlpPlace
nlpTime
Norwegian
For Norwegian, there are no specific priority values for predefined concepts. The default
value of 10 is used for all of the predefined concepts listed below.
nlpNounGroup 10
Polish
nlpDate 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization* 21*
nlpPercent 18
nlpPerson* 20
nlpPlace 20
nlpTime 18
Portuguese
nlpDate 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization 25*
nlpPercent 18
nlpPerson* 20
nlpPlace 25*
nlpTime 18
Russian
nlpDate* 10*
nlpMoney 9
nlpNounGroup* 10*
nlpOrganization* 10*
nlpPercent* 10*
nlpPerson* 10*
nlpPlace* 10*
nlpTime* 10*
Slovak
nlpDate* 10*
nlpMoney* 10*
nlpNounGroup* 10*
nlpOrganization* 10*
nlpPercent* 10*
nlpPerson* 7
nlpPlace 8
nlpTime* 10*
Slovene
For Slovene, there are no specific priority values for predefined concepts. The default
value of 10 is used.
Predefined Concept
nlpDate
ORGANIZATION
nlpMoney
nlpNounGroup
nlpOrganization
nlpPercent
nlpPerson*
VEHICLE
NOUN_GROUP
Spanish
nlpDate 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization 25*
nlpPercent 18
nlpPerson* 20
nlpPlace 25*
nlpTime 18
Swedish
nlpDate 18
nlpMeasure 18
nlpMoney 18
nlpNounGroup 15
nlpOrganization 20*
nlpPercent 18
nlpPerson* 20*
nlpPlace 20*
nlpTime 18
Tagalog
For Tagalog, there are no specific priority values for predefined concepts. The default
value of 10 is used.
nlpDate
nlpMoney
nlpNounGroup
nlpOrganization
nlpPercent
nlpPerson
nlpPlace
nlpTime
Thai
For Thai, there are no specific priority values for predefined concepts. The default value
of 10 is used.
Predefined Concept
nlpDate
nlpMoney
nlpOrganization
nlpPercent
nlpPerson
nlpPlace
nlpTime
Turkish
nlpDate 10
nlpMoney 10
nlpNounGroup 10
nlpOrganization 11*
nlpPercent 10
nlpPerson 10
nlpPlace 10
nlpTime 10
Vietnamese
For Vietnamese, there are no specific priority values for predefined concepts. The
default value of 10 is used.
Predefined Concept
nlpDate
nlpMoney
nlpOrganization
nlpPercent
nlpPerson
nlpPlace
nlpTime
Recommended Reading
Glossary
category
a classification for documents that is based on a common characteristic. Category
membership is indicated as a binary property. In order to determine when a
document is likely to be a member of a category, one or more Boolean rules
comprising the category text definition must be satisfied.
concept
an abstract class of meanings. In order to determine when a concept is likely to be
referenced in a subset of text, the rules comprising the concept text definition must
be satisfied.
model scoring
the process of applying a model to new data in order to compute outputs.
parse
to analyze text, such as a SAS statement, for the purpose of separating it into its
constituent words, phrases, punctuation marks, values, or other types of information.
The information can then be analyzed according to a definition or set of rules.
relevancy score
a score that indicates how well a document satisfies a rule or model. The best match
has a score of 1 and reflects a perfect (100%) match.
scoring
See model scoring.
sentiment
an attitude that is expressed about an item that is being analyzed, which can be a
segment of text, a grouping of text segments, or a specific subject of interest.
sentiment analysis
the use of natural language processing, computational linguistics, and text analytics
to determine the attitude of a speaker or writer with respect to a topic, document, or
other item of analysis. Sentiment analysis results in a positive, negative, or neutral
score on the target of analysis.
stemming
the process of finding and returning the root form of a word. For example, the root
form of grind, grinds, grinding, and ground is grind.
stop list
a SAS data set that contains a simple collection of low-information or extraneous
words that you want to remove from text mining analysis.
string
See text string.
subset of text
the matched text for a concept text definition; this consists of one or more strings that
are contained in a document.
surface form
a variant of a term that is contained in a matched subset of text in one or more
documents. These forms include stems, synonyms, misspellings, and alternate ways
of referring to the same entity.
taxonomy
a hierarchical relationship of parent and child category nodes. In a true taxonomy,
whenever a category is detected, it is implied that all parents are also represented.
For example, if something is identified as human, it must also be a primate, mammal,
animal, and so on.
term
a representation of a single concept in one or more textual forms, as defined by rules
or algorithms.
term map
a node-arc graph that centers around an "object of interest," which could be a
category, concept, topic, or term. Corresponding nodes in the graph indicate rules
that are predictive of the object of interest. Better rules are shown as larger nodes.
The arcs represent the addition or exclusion of terms that are used to build up the
rules.
term role
a function that is performed by a term in a particular context. A term can function as
a part of speech, entity type, or other purpose that is user-defined.
term table
a list of every term in a collection of documents including the representative text
form for each term, its role, and all of its surface forms that appear within that
collection.
text string
a subset of text that consists of adjacent characters of any type. Depending on the
specified options, strings can be either case-sensitive or case-insensitive.
token
in the SAS programming language, a collection of characters that communicates a
meaning to SAS and that cannot be divided into smaller functional units. A token
such as a variable name might look like an English word, but can also be a
mathematical operator, or even an individual character such as a semicolon. A token
can contain a maximum of 32,767 characters.
topic
a machine-generated category, the purpose of which is to indicate what documents
are about. A topic identifies groupings of important terms in a document collection.
A single document can contain one or more topics, or no topics.
weight
a numeric indicator that is assigned to an item and that indicates the relative
importance of the item in a frequency distribution or population.