1152CS191 Data Visualization - Unit IV


School of Computing
Department of Computer Science & Engineering

1152CS191 - Data Visualization
Category: Program Elective
UNIT-IV

Course Handling Faculty: Dr. M. Kavitha

12/10/2024 1
Course Outcomes

CO4: Identify the visualization techniques for Text and documents.
Level of learning domain: K2 (based on revised Bloom’s taxonomy)

Related Program Outcomes: Engineering Knowledge, Problem Analysis,
Design / Development of Solutions, Conduct Investigations of Complex
Problems, Modern Tool Usage, The Engineer and Society, Environment &
Sustainability, Ethics, Individual & Team Work, Communication,
Project Management & Finance, Life Long Learning, Mathematical
Concepts, Software Development, Transferring Skills.

Correlation of COs with Student Outcomes (ABET EAC and CAC)

EAC: CO  SO1 SO2 SO3 SO4 SO5 SO6 SO7
     CO4  3   2   2   -   2   2   3

CAC: CO  SO1 SO2 SO3 SO4 SO5 SO6
     CO4  3   2   2   -   2   -
Course Content

UNIT IV: Text and Document Visualization (9 hours)

• Levels of Text Representations


• The Vector Space Model
• Single Document Visualizations
• Document Collection Visualizations
• Extended Text Visualizations
• Designing Effective Visualizations
• Steps in Designing Visualizations
• Problems
• Comparing and Evaluating Visualization Techniques.

Text and Documents
• The most obvious tasks in text and documents are searching for a word, phrase, or topic.

• For partially structured data, relationships between words, phrases, topics, or documents are searched.

• For structured text or document collections, the key task is most often searching for patterns and outliers within the text or documents.

• A collection of documents is a corpus (plural: corpora). Text analysis deals with objects within corpora.

• The objects can be words, sentences, paragraphs, documents, or even collections of documents. Images and videos are also considered.

Text and Documents
• The objects are considered atomic with respect to the task, analysis, and visualization.

• Text and documents are often minimally structured and may be rich with attributes and metadata, especially when focused on a specific application domain.

• For example, documents have a format and often include metadata about the document (i.e., author, date of creation, date of modification, comments, size).

• Information retrieval systems are used to query corpora, which requires computing the relevance of a document with respect to a query. This requires document preprocessing and interpretation of the semantics of text.

• Computing statistics about documents is also required.
A Little Experiment

A Brief History

Text

Text as Visualization

Visualization for “Raw” Text
Visualizing text (features)

Requires a transformation step:
discretization, aggregation, normalization, ...

Structured Text Features
Typical Steps of Processing to Derive Text Features

• Large collections require pre-processing of text to extract information and align text.
  Typical steps are:
• Cleaning (regular expressions)
• Sentence splitting
• Conversion to lower case
• Stopword removal (the most frequent words in a language)
• Stemming (demo: Porter stemmer)
• POS (part-of-speech) tagging (demo)
• Noun chunking
• NER (named entity recognition) (demo: OpenCalais)
• Deep parsing: trying to “understand” the text.
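The steps above can be sketched in Python. This is a minimal illustration, not from the slides: the stop list is a tiny hand-picked set, and naive_stem is a crude suffix stripper standing in for a real Porter stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # tiny illustrative list

def naive_stem(word):
    # Crude suffix stripping: a stand-in for a real Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Cleaning: keep only letters, sentence punctuation, and whitespace.
    text = re.sub(r"[^A-Za-z.!?\s]", " ", text)
    # Sentence splitting on terminal punctuation.
    sentences = re.split(r"[.!?]+", text)
    result = []
    for sentence in sentences:
        tokens = sentence.lower().split()                     # lower case + tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]   # stopword removal
        tokens = [naive_stem(t) for t in tokens]              # stemming
        if tokens:
            result.append(tokens)
    return result

print(preprocess("The cats are sleeping. Dogs barked!"))
# [['cat', 'sleep'], ['dog', 'bark']]
```

Deeper steps (POS tagging, chunking, NER, parsing) need trained models and are typically delegated to an NLP library.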
Text features are complicated

Be aware! Text understanding can be hard:


• Toilet out of order. Please use floor below.

• “One morning I shot an elephant in my pajamas.


How he got in my pajamas, I don't know.”

• Did you ever hear the story about the blind


carpenter who picked up his hammer and saw?
Text Units Hierarchy

Levels of Text Representations
To convert unstructured text to some form of structured data:

Lexical Level
• Transforms a string of characters into a sequence of atomic entities, called tokens.
• Processes the sequence of characters with a given set of rules into a new sequence of tokens.
• Tokens can include characters, character n-grams, words, word stems, lexemes, phrases, or word n-grams, all with associated attributes.
• Finite state machines defined by regular expressions are used to extract tokens.

Syntactic Level
• Identifies and tags (annotates) each token’s function.
• Tokens have attributes such as singular or plural, or their proximity to other tokens.
• Richer tags include date, money, place, person, organization, and time.
• The process of extracting these annotations is called named entity recognition (NER).
• The richness and wide variety of language models and grammars yield a wide variety of approaches.

Semantic Level
• Extraction of meaning and relationships between pieces of knowledge derived from the structures identified in the syntactic level.
• The goal of this level is to define an analytic interpretation of the full text within a specific context, or even independent of context.
Vector Space Model

• Computing term vectors is an essential step for many document and corpus visualization and analysis techniques.

• In the vector space model, a term vector for an object of interest (paragraph, document, or document collection) is a vector in which each dimension represents the weight of a given word in that document.

• Typically, to clean up noise, stop words (such as “the” or “a”) are removed (filtering), and words that share a word stem are aggregated together (stemming).

• https://wordcounter.net/
Vector Space Model
• The pseudocode below counts occurrences of unique tokens, excluding stop words.
• The input is assumed to be a stream of tokens generated by a lexical analyzer for a single document.
• The terms variable contains a hashtable that maps unique terms to their counts in the document.

Count-Terms(tokenStream)
1 terms ← ∅                ▹ Initialize terms to an empty hashtable.
2 for each token t in tokenStream
3   do if t is not a stop word
4        then increment (or initialize to 1) terms[t]
5 return terms

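A direct Python translation of the Count-Terms pseudocode, using a dict as the hashtable; the small stop list is illustrative only.

```python
STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative stop list

def count_terms(token_stream):
    """Count occurrences of unique non-stopword tokens (Count-Terms above)."""
    terms = {}
    for token in token_stream:
        if token not in STOP_WORDS:
            terms[token] = terms.get(token, 0) + 1
    return terms

print(count_terms(["the", "cat", "and", "the", "cat", "sat"]))
# {'cat': 2, 'sat': 1}
```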
Statistical Models
• A document is typically represented by a bag of words
(unordered words with frequencies).

• Bag = set that allows multiple occurrences of the same


element.

• A user specifies a set of desired terms with optional weights:
   • Weighted query terms:
     Q = < database 0.5; text 0.8; information 0.2 >
   • Unweighted query terms:
     Q = < database; text; information >
   • No Boolean conditions are specified in the query.
Statistical Retrieval

• Retrieval based on similarity between query and


documents.
• Output documents are ranked according to similarity to
query.
• Similarity based on occurrence frequencies of keywords in
query and document.
• Automatic relevance feedback can be supported:
• Relevant documents “added” to query.
• Irrelevant documents “subtracted” from query.
Issues for Vector Space Model

• How to determine important words in a document?
   Word sense?
   Word n-grams (and phrases, idioms, …) as terms?
• How to determine the degree of importance of a term within a document and within the entire collection?
• How to determine the degree of similarity between a document and the query?
• In the case of the web, what is a collection, and what are the effects of links, formatting information, etc.?
Vector Space Model
• Assume t distinct terms remain after preprocessing; call
them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
 Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-
valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
   dj = (w1j, w2j, …, wtj)
   q = (w1q, w2q, …, wtq)

Graphic Representation

Document Collection
Term Weights: Term Frequency

• More frequent terms in a document are more important, i.e., more indicative of the topic:

   fij = frequency of term i in document j

• We may want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

   tfij = fij / maxi{fij}

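A small Python sketch of this normalization, assuming the raw term counts are already available as a dict (e.g., from Count-Terms):

```python
def normalized_tf(term_counts):
    """tf_ij = f_ij / max_i{f_ij}: divide each raw count by the largest count."""
    max_f = max(term_counts.values())
    return {term: f / max_f for term, f in term_counts.items()}

print(normalized_tf({"cat": 4, "sat": 2, "mat": 1}))
# {'cat': 1.0, 'sat': 0.5, 'mat': 0.25}
```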
Term Weights: Inverse Document Frequency

TF-IDF Weighting

Computing TF-IDF -- An Example

Similarity Measure

Similarity Measure - Inner Product

Properties of Inner Product

Inner Product - Example

Example – Similarity Measure

Cosine Similarity Measure

Naïve Implementation

Vector Space Model

Vector Space Model - Issues
Vector Space Model - Exercise

Dr. M. Kavitha, Department of Computer Science & Engineering, Data Visualization
Computing TF-IDF (Documents)

Compute-TfIdf(documents)
 1 termFrequencies ← ∅        ▹ Looks up term count tables by document name.
 2 documentFrequencies ← ∅    ▹ Counts the documents in which a term occurs.
 3 uniqueTerms ← ∅            ▹ The list of all unique terms.
 4 for each document d in documents
 5   do docName ← Name(d)                  ▹ Extract the name of the document.
 6      tokenStream ← Tokenize(d)          ▹ Generate the document token stream.
 7      terms ← Count-Terms(tokenStream)   ▹ Count the term frequencies.
 8      termFrequencies[docName] ← terms   ▹ Store the term frequencies.
 9      for each term t in Keys(terms)
10        do increment (or initialize to 1) documentFrequencies[t]
11           uniqueTerms ← uniqueTerms ∪ {t}
Computing TF-IDF (Documents)

13 tfIdfVectorTable ← ∅       ▹ Looks up tf-idf vectors by document name.
14 n ← Length(documents)
15 for each document name docName in Keys(termFrequencies)
16   do tfIdfVector ← create zeroed array of length Length(uniqueTerms)
17      terms ← termFrequencies[docName]
18      for each term t in Keys(terms)
19        do tf ← terms[t]
20           df ← documentFrequencies[t]
21           tfIdf ← tf ∗ log(n/df)
22           tfIdfVector[index of t in uniqueTerms] ← tfIdf
23      tfIdfVectorTable[docName] ← tfIdfVector
24 return tfIdfVectorTable
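The same algorithm can be sketched in Python. This sketch (an illustration, not the book's code) returns a dict of term weights per document rather than a dense vector over uniqueTerms, but it computes the same tf * log(n/df) weights:

```python
import math

def compute_tfidf(documents):
    """documents: dict mapping document name -> list of tokens.
    Returns a dict mapping document name -> {term: tf-idf weight}."""
    term_frequencies = {}
    document_frequencies = {}
    for name, tokens in documents.items():
        terms = {}
        for t in tokens:                      # count term frequencies
            terms[t] = terms.get(t, 0) + 1
        term_frequencies[name] = terms
        for t in terms:                       # count document frequencies
            document_frequencies[t] = document_frequencies.get(t, 0) + 1
    n = len(documents)
    tfidf = {}
    for name, terms in term_frequencies.items():
        tfidf[name] = {t: tf * math.log(n / document_frequencies[t])
                       for t, tf in terms.items()}
    return tfidf

docs = {"d1": ["cat", "cat", "dog"], "d2": ["dog", "fish"]}
weights = compute_tfidf(docs)
print(weights["d1"]["cat"])  # 2 * log(2/1) ≈ 1.386
print(weights["d1"]["dog"])  # 1 * log(2/2) = 0.0
```

Note that a term occurring in every document (here “dog”) gets weight 0: it carries no discriminating power.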
Zipf’s Law
 The economist Vilfredo Pareto observed that a company’s revenue is inversely proportional to its rank: a classic power law, resulting in the famous 80-20 rule, in which 20% of the population holds 80% of the wealth.

 Zipf described the distribution of words in natural language corpora using a discrete power law distribution called a Zipfian distribution.

 Zipf’s Law states that in a typical natural language document, the frequency of any word is inversely proportional to its rank in the frequency table.

 Plotting the Zipf curve on a log-log scale yields a straight line with a slope of -1.
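The slope can be checked numerically. The sketch below (illustrative, not from the slides) fits a least-squares line to log(frequency) versus log(rank); for an ideal Zipfian frequency table the slope comes out as -1:

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank).
    For Zipfian data the slope should be close to -1."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# An ideal Zipfian frequency table: f(r) proportional to 1/r.
ideal = [1000 / r for r in range(1, 101)]
print(round(loglog_slope(ideal), 3))  # -1.0
```

Running the same function on the term counts of a real corpus gives a slope near -1, which is the empirical content of Zipf’s Law.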
A document view in which named entities are highlighted, color-coded by entity type.
Zipf’s Law

The distribution of terms in Wikipedia, an example of Zipf’s Law in action. Term frequency is on the y-axis, and frequency rank is on the x-axis.
Tasks Using the Vector Space Model

• The vector space model, when accompanied by some distance metric, allows one to perform many useful tasks.

• tf-idf and the vector space model are used to identify documents of particular interest.

• For example, the vector space model, combined with a distance metric, allows us to answer questions such as: which documents are similar to a specific one, which documents are relevant to a given collection of documents, or which documents are most relevant to a given search query. All of these are answered by finding the documents whose term vectors are most similar to the given document’s vector, the average vector over a document collection, or the vector of a search query.

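A minimal sketch of such a similarity query, assuming documents and the query are already represented as sparse term-weight dicts (e.g., tf-idf vectors) and using cosine similarity as the distance metric:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight dicts (sparse vectors)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs):
    """Rank document names by decreasing cosine similarity to the query."""
    return sorted(doc_vecs,
                  key=lambda name: cosine(query_vec, doc_vecs[name]),
                  reverse=True)

docs = {"d1": {"cat": 1.0, "dog": 1.0},
        "d2": {"fish": 2.0},
        "d3": {"cat": 3.0}}
print(rank({"cat": 1.0}, docs))  # ['d3', 'd1', 'd2']
```

The same rank function answers all three questions above; only the query vector changes (a document's vector, a collection's average vector, or a search query's vector).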
Single Document Visualizations

A tag cloud visualization generated by the free service tagCrowd.com. The font size and darkness are proportional to the frequency of the word in the document.
Word Clouds
• Word clouds, also known as text clouds or tag clouds, are layouts of raw tokens, colored and sized by their frequency within a single document.

• Text clouds and their variations, such as a Wordle, are examples of visualizations that use only term frequency vectors and some layout algorithm to create the visualization.

A Wordle visualization generated by the free service wordle.net. The size of the text corresponds to the frequency of the word in the document.
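The sizing rule can be sketched as a linear mapping from word frequency to font size. The point range and the linear mapping are illustrative assumptions, not the algorithm used by tagCrowd or Wordle:

```python
def font_sizes(counts, min_pt=10, max_pt=48):
    """Map word frequencies to font sizes, linearly between min_pt and max_pt."""
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {w: round(min_pt + (c - lo) * (max_pt - min_pt) / span)
            for w, c in counts.items()}

print(font_sizes({"data": 9, "chart": 5, "axis": 1}))
# {'data': 48, 'chart': 29, 'axis': 10}
```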
WordTree
The WordTree visualization is a visual representation of both term frequencies and their context.
Size is used to represent the term or phrase frequency. The root of the tree is a user-specified word or phrase of interest, and the branches represent the various contexts in which the word or phrase is used in the document.

A WordTree visualization generated by the free service ManyEyes. The branches of the tree represent the various contexts following a root word or phrase in the document.
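A WordTree's branches can be collected by gathering the phrases that follow each occurrence of the root word, with counts for sizing. A minimal sketch (the depth parameter and sample data are illustrative):

```python
from collections import defaultdict

def word_tree_branches(tokens, root, depth=2):
    """Collect the phrases of up to `depth` words that follow each occurrence
    of `root`, with counts: the branches of a WordTree rooted at `root`."""
    branches = defaultdict(int)
    for i, tok in enumerate(tokens):
        if tok == root:
            context = tuple(tokens[i + 1 : i + 1 + depth])
            if context:
                branches[context] += 1
    return dict(branches)

text = "the cat sat on the mat near the cat door".split()
print(word_tree_branches(text, "cat"))
# {('sat', 'on'): 1, ('door',): 1}
```

A full WordTree nests these contexts into a shared-prefix tree; this sketch stops at flat counted phrases.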
TextArc
TextArc is a visual representation of how terms relate to the lines of text in which they appear.

Every word of the text is drawn in order around an ellipse as small lines with a slight offset at its start.

As in a text cloud, more frequently occurring words are drawn larger and brighter.

Words with higher frequencies are drawn within the ellipse, pulled by their occurrences on the circle (similar to RadViz).

The user is able to highlight the underlying text with probing and animate “reading” the text by visualizing the flow of the text through relevant connected terms.

http://textarc.org/Stills.html
Arc Diagram
Arc diagrams are a visualization focused on displaying repetition in text or any sequence.

Repeated subsequences are identified and connected by semicircular arcs.

The thickness of the arcs represents the length of the subsequence, and the height of the arcs represents the distance between the subsequences.

http://mbostock.github.io/protovis/ex/arc.html is the website for this diagram through Protovis/D3. The input data is in http://mbostock.github.io/protovis/ex/miserables.js
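The arc data behind such a diagram can be computed by locating repeated subsequences. A simplified sketch (it links each repeat only to the previous occurrence of the same subsequence, unlike the full matching algorithm):

```python
def repeated_arcs(seq, length):
    """Find repeated subsequences of a given length and return arcs
    (start_a, start_b, distance) connecting consecutive repeats."""
    positions = {}
    arcs = []
    for i in range(len(seq) - length + 1):
        sub = tuple(seq[i : i + length])
        if sub in positions:
            j = positions[sub]
            arcs.append((j, i, i - j))  # arc height ~ distance between repeats
        positions[sub] = i
    return arcs

melody = ["C", "D", "E", "F", "C", "D", "E", "G"]
print(repeated_arcs(melody, 3))  # [(0, 4, 4)]
```

Here the arc connects the two occurrences of C-D-E; its thickness would encode the subsequence length (3) and its height the distance (4).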
Arc Diagram

The figure displays Bach’s Minuet in G Major, visualizing the classic pattern of a minuet. It contains two parts, each consisting of a long passage played twice. The parts are loosely related, as shown by the bundle of thin arcs connecting the two main parts. The overlap of the two main arcs shows that the end of the first passage is the same as the beginning of the second.
Literature Fingerprinting
 Literature fingerprinting is a method of visualizing features used to characterize text.

 Instead of calculating just one feature value or vector for the whole text (as is usually done), a sequence of feature values per text is calculated and presented to the user as a characteristic fingerprint of the document.

 This allows the user to “look inside” the document and analyze the development of the values across the text. Moreover, the structural information of the document is used to visualize the document on different levels of resolution.

 Literature fingerprinting was applied to an authorship attribution problem to show the discrimination power of the standard measures that are assumed to capture the writing style of an author.

The literature fingerprinting technique, used here to analyze the ability of several text measures to discriminate between authors. Each pixel represents a text block, and the pixels are grouped into books. Color is mapped to the feature value, in this case the average sentence length. If a measure is able to discriminate between the two authors, the books in the first row (written by London) are visually set apart from the remaining ones. (Image from [222], © 2007 IEEE.)
Document Collection Visualizations

• In document collection visualizations, the goal is to place similar documents close to each other and dissimilar ones far apart.

• This is a minimax problem and typically O(n²): we compute the similarity between all pairs of documents and determine a layout.

• Common approaches are graph spring layouts, multidimensional scaling, clustering (k-means, hierarchical, EM, support vector), and self-organizing maps.

• Several document collection visualizations are self-organizing maps, clustermaps, and themescapes.
Self-Organizing Maps

• A self-organizing map (SOM) is an unsupervised learning algorithm using a collection of typically 2D nodes, where documents will be located.
• Each node has an associated vector of the same dimensionality as the input vectors (the document vectors) used to train the map.
• We initialize the SOM nodes, typically with random weights.
• We choose a random vector from the input vectors and calculate its distance from each node.
• We adjust the weights of the closest nodes (within a particular radius), making each closer to the input vector, with the largest adjustment applied to the closest selected node.
• As we iterate through the input vectors, the radius gets smaller.
• An example of using SOMs for text data is shown in the figure, which shows a million documents collected from 83 newsgroups.

A self-organizing map (SOM) layout of Finnish news bulletins. The labels show the topical areas, and color represents the number of documents, with light areas containing more.
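The training loop described above can be sketched for a 1-D map over 2-D inputs. This is a toy illustration, not the implementation behind the figure (real SOMs use 2-D node grids and distance-weighted neighborhood updates):

```python
import random

def train_som(data, n_nodes, iters=2000, seed=0):
    """Minimal 1-D self-organizing map over 2-D inputs: pick a random input
    vector, find the closest node (the best-matching unit), and pull it and
    its neighbors, within a shrinking radius, toward the input."""
    rng = random.Random(seed)
    nodes = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    for step in range(iters):
        x = rng.choice(data)
        # Best-matching unit: the node with the smallest squared distance to x.
        bmu = min(range(n_nodes),
                  key=lambda i: (nodes[i][0] - x[0]) ** 2
                              + (nodes[i][1] - x[1]) ** 2)
        radius = max(1, round(n_nodes * (1 - step / iters)))  # radius shrinks
        rate = 0.5 * (1 - step / iters)                       # rate decays
        for i in range(max(0, bmu - radius), min(n_nodes, bmu + radius + 1)):
            nodes[i][0] += rate * (x[0] - nodes[i][0])
            nodes[i][1] += rate * (x[1] - nodes[i][1])
    return nodes

# Two clusters of 2-D "document vectors"; after training, nearby map nodes
# end up representing similar inputs.
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
nodes = train_som(data, n_nodes=6)
```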
Themescapes
• Themescapes are summaries of corpora using abstract 3D landscapes in which height and color are used to represent the density of similar documents.
• The example shown in the figure, from Pacific Northwest National Labs, represents news articles visualized as a themescape.
• The taller mountains represent frequent themes in the document corpus.

A themescape from PNNL that uses height to represent the frequency of themes in news articles. (Image reprinted with permission of Springer Science and Business Media.)
Document Cards

The Document Card pipeline. Each step is further explained in the sections indicated by the number in the top right corner of each box.
Extended Text Visualization
Software Visualization
• Eick et al. developed a visualization tool called SeeSoft that visualizes statistics for each line of code (e.g., age, number of modifications, programmer, dates).
Dimensions of Software Visualization
 Tasks: why is the visualization needed?
 Audience: who will use the visualization?
 Target: what is the data source to represent?
 Representation: how to represent it?
 Medium: where to represent the visualization?
Software Visualization

SeeSoft

SeeSoft - Uses

SeeSoft - Applications
SeeSoft

New Application Areas
 Display of large amounts of text.
 Visualization of directories and files.

Limitations
 Only 50,000 lines of code can be displayed.
 Difficult to use with monochrome devices.
Search Result Visualization

 Marti Hearst developed a simple query result visualization, foundationally similar to Keim’s pixel displays, called TileBars, which displays a number of term-related statistics, including frequency and distribution of terms, length of document, term-based ranking, and strength of ranking.

 Each document of the result set is represented by a rectangle, where width indicates the relative length of the document and stacked squares correspond to text segments.

 Each row of the stack represents a set of query terms, and the darkness of a square indicates the frequency of those terms in the corresponding text segment.

The TileBars query result visualization. Each large rectangle indicates a document, and each square within the document represents a text segment. The darker the tile, the more frequent the query term set. (Image © 1995 Addison-Wesley.)
Temporal Document Collection Visualization

ThemeRiver, also called a stream graph, is a visualization of thematic changes in a document collection over time.
This visualization assumes that the input data progresses over time. Themes are visually represented as colored horizontal bands whose vertical thickness at a given horizontal location represents their frequency at a particular point in time.

A stream graph (ThemeRiver), depicting the election night speeches of several different candidates for a Canadian election. (Image © 2002 IEEE.)
Temporal Document Collection Visualization

Jigsaw is a tool for visualizing and exploring text corpora [155]. Jigsaw’s calendar view positions document objects on a calendar based on date entities identified within the text. When the user highlights a document, the entities that occur within that document are displayed.

Wanner et al. developed a visual analytics tool for conducting semiautomatic sentiment analysis of large news feeds.

News articles are presented with the Jigsaw calendar view, based on the extracted date entities.
Control Panel View

Document View

List View

The Jigsaw list view, displaying the connections between people (left), places (center), and organizations (right).
Representing Relationships
• Jigsaw also includes an entity graph view, in which the user can navigate a graph of related entities and documents.

• In Jigsaw, entities are connected to the documents in which they appear.

• The Jigsaw graph view does not show the entire document collection, but it allows the user to incrementally expand the graph by selecting documents and entities of interest.

A sentiment analysis visualization. News items are plotted along the time axis. Shape and color show to which category an item belongs, and the vertical position depends on the automatically determined sentiment score of an item. The visual objects representing news items are painted semi-transparent in order to make overlapping items more easily distinguishable.
Cluster Graph View

The Jigsaw graph view, representing connections between named entities and documents.

Document Cluster View

Scatterplot View

Circular Graph View

Representing Relationships

A clustered graph view in Jigsaw that filters for documents having specific entities. Mousing over an entity identifies data about the document. Colors represent token values.
References

http://vallandingham.me/textvis-talk/#70
http://jcsites.juniata.edu/faculty/rhodes/ida/textDocViz.html
https://www.analyticsvidhya.com/blog/2020/08/information-retrieval-using-word2vec-based-vector-space-model/
https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
http://mlwiki.org/index.php/Vector_Space_Models
