1152cs191 Data Visualization Unit IV
12/10/2024 1
Course Outcome: CO4
Level of learning domain (based on revised Bloom’s taxonomy): K2
Correlation of COs with Student Outcomes (ABET EAC and CAC)
Outcome columns: Engineering Knowledge, Problem Analysis, Design / Development of solutions, Ethics, Communication, Mathematical Concepts, Software Development, Transferring Skills
CO4: 3 2 2 - 2 2 3
CO4: 3 2 2 - 2 -
• For structured text or document collections, the key task is most often
searching for patterns and outliers within the text or documents.
• Text and documents are often minimally structured and may be rich with
attributes and metadata, especially when focused in a specific application
domain.
• For example, documents have a format and often include metadata about the
document (e.g., author, date of creation, date of modification, comments,
size).
Visualization for “Raw” Text
• Large collections require pre-processing of text to extract information and align text.
Typical steps are:
• Cleaning (regular expressions)
• Sentence splitting
• Change to lower case
• Stopword removal (most frequent words in a language)
• Stemming (demo: Porter stemmer)
• POS (part-of-speech) tagging (demo)
• Noun chunking
• NER (named entity recognition) (demo: OpenCalais)
• Deep parsing: try to “understand” the text.
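The first few steps listed above (cleaning, sentence splitting, lower-casing, stop word removal) can be sketched with the Python standard library. This is a minimal illustration, not a production pipeline: the function name `preprocess`, the tiny `STOP_WORDS` set, and the naive sentence-splitting rule are assumptions made for the example. Stemming, POS tagging, and NER need dedicated tools and are left out.

```python
import re

# Illustrative stop list only (an assumption); real stop lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Return a list of sentences, each a list of cleaned, lower-cased tokens."""
    # Cleaning: collapse runs of whitespace with a regular expression.
    text = re.sub(r"\s+", " ", text).strip()
    # Sentence splitting: naive rule, split after '.', '!' or '?' plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    result = []
    for sentence in sentences:
        # Lower-casing, then keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Stop word removal.
        tokens = [t for t in tokens if t not in STOP_WORDS]
        if tokens:
            result.append(tokens)
    return result

print(preprocess("The cat sat.  Is the   dog in the garden?"))
```

The naive splitting rule will break on abbreviations such as "Dr."; a real pipeline would use a trained sentence splitter.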
Department of Computer Science & Engineering, Data Visualization
Text features are complicated
Levels of Text Representations
The goal is to convert unstructured text into some form of structured data.
• https://wordcounter.net/
Vector Space Model
• The pseudocode below counts occurrences of unique tokens,
excluding stop words.
• The input is assumed to be a stream of tokens generated by a
lexical analyzer for a single document.
• The terms variable contains a hashtable that maps unique terms
to their counts in the document.
Count-Terms(tokenStream)
  terms ← ∅                              ▷ Initialize terms to an empty hashtable.
  for each token t in tokenStream
      do if t is not a stop word
            then increment (or initialize to 1) terms[t]
  return terms
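The Count-Terms pseudocode translates almost line for line into Python. The sketch below assumes the token stream is any iterable of strings and uses a small illustrative stop list; the name `count_terms` and the contents of `STOP_WORDS` are assumptions for the example.

```python
from collections import Counter

# Illustrative stop list (an assumption); substitute a real one as needed.
STOP_WORDS = {"the", "a", "of", "and"}

def count_terms(token_stream, stop_words=STOP_WORDS):
    """Count occurrences of each unique token, excluding stop words."""
    terms = Counter()  # plays the role of the hashtable in the pseudocode
    for token in token_stream:
        if token not in stop_words:
            terms[token] += 1  # missing keys start at 0 in a Counter
    return terms

print(count_terms(["the", "cat", "and", "the", "cat", "hat"]))
```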
Statistical Models
• A document is typically represented by a bag of words
(unordered words with frequencies).
Statistical Retrieval
Graphic Representation
Document Collection
Term Weights: Term Frequency
Term Weights: Inverse Document Frequency
TF-IDF Weighting
Computing TF-IDF -- An Example
Similarity Measure
Similarity Measure - Inner Product
Properties of Inner Product
Inner Product - Example
Example – Similarity Measure
Cosine Similarity Measure
Naïve Implementation
Vector Space Model
Vector Space Model - Issues
Vector Space Model - Exercise
Compute-TfIdf(documents)
  termFrequencies ← ∅          ▷ Looks up term count tables for document names.
  documentFrequencies ← ∅      ▷ Counts the documents in which a term occurs.
  uniqueTerms ← ∅              ▷ The list of all unique terms.
  for each document d in documents
      do docName ← Name(d)                      ▷ Extract the name of the document.
         tokenStream ← Tokenize(d)              ▷ Generate document token stream.
         terms ← Count-Terms(tokenStream)       ▷ Count the term frequencies.
         termFrequencies[docName] ← terms       ▷ Store the term frequencies.
         for each term t in Keys(terms)
             do increment (or initialize to 1) documentFrequencies[t]
                uniqueTerms ← uniqueTerms ∪ {t}
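The pseudocode as shown stops after collecting term and document frequencies. The Python sketch below follows the same bookkeeping and then adds a weighting step. The weighting used here, tf × log(N / df), is one common TF-IDF variant and is an assumption, since the remainder of the pseudocode is not shown. The function name `compute_tf_idf` and the input format (a dict of document name to token list) are also assumptions.

```python
import math
from collections import Counter

def compute_tf_idf(documents, stop_words=frozenset()):
    """documents: dict mapping a document name to its list of tokens."""
    term_frequencies = {}             # document name -> term count table
    document_frequencies = Counter()  # term -> number of documents containing it
    for doc_name, tokens in documents.items():
        terms = Counter(t for t in tokens if t not in stop_words)
        term_frequencies[doc_name] = terms
        document_frequencies.update(terms.keys())  # each term counted once per doc
    unique_terms = sorted(document_frequencies)
    n_docs = len(documents)
    # Assumed weighting: tf * log(N / df). Other variants exist.
    vectors = {}
    for doc_name, terms in term_frequencies.items():
        vectors[doc_name] = [
            terms[t] * math.log(n_docs / document_frequencies[t])
            for t in unique_terms
        ]
    return unique_terms, vectors

terms, vectors = compute_tf_idf({"d1": ["cat", "cat", "dog"], "d2": ["dog", "bird"]})
print(terms)
print(vectors)
```

With this variant, a term that occurs in every document (here "dog") receives weight 0, because log(N / df) is 0.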
Computing TF-IDF(Documents)
Zipf’s Law
Plotting the Zipf curve (word frequency against frequency rank) on a log-log
scale yields an approximately straight line with a slope close to -1.
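One way to check the Zipf slope on a given token list is to fit a least-squares line to log(frequency) against log(rank). The sketch below is a minimal illustration using only the standard library; the function name `zipf_slope` is an assumption, and the input must contain at least two distinct terms.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Requires at least two distinct terms in `tokens`.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Synthetic data whose frequencies are exactly 120 / rank.
tokens = [f"w{rank}" for rank in range(1, 7) for _ in range(120 // rank)]
print(zipf_slope(tokens))
```

On real text the fitted slope is only approximately -1 and depends on corpus size and tokenization.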
• For example, the vector space model, with the use of some distance
metric, will allow us to answer questions such as which documents
are similar to a specific one, which documents are relevant to a given
collection of documents, or which documents are most relevant to a
given search query—all by finding the documents whose term
vectors are most similar to the given document, the average vector
over a document collection, or the vector of a search query.
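As a concrete instance of "finding the documents whose term vectors are most similar", the sketch below ranks documents by cosine similarity to a query vector. Cosine similarity is one possible choice of measure; the function names and the toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def rank_documents(query_vector, doc_vectors):
    """Return document names ordered from most to least similar to the query."""
    scores = {name: cosine_similarity(query_vector, vec)
              for name, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

doc_vectors = {"d1": [1, 0, 1], "d2": [0, 1, 0], "d3": [1, 1, 1]}
print(rank_documents([1, 0, 1], doc_vectors))
```

The same `rank_documents` call answers all three questions in the text: pass the vector of a specific document, the average vector of a collection, or the vector of a search query.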
Single Document Visualizations
A tag cloud visualization generated by the free service tagCrowd.com. The font
size and darkness are proportional to the frequency of the word in the document.
Word Clouds
• Word clouds, also known as text clouds or tag clouds, are layouts of raw tokens,
colored and sized by their frequency within a single document.
• Text clouds and their variations, such as a Wordle, are examples of visualizations
that use only term frequency vectors and some layout algorithm to create the
visualization.
A Wordle visualization generated by the free service wordle.net. The size of the
text corresponds to the frequency of the word in the document.
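The sizing half of a word cloud needs only the term frequency vector. The sketch below maps each term to a font size by linear scaling between an assumed minimum and maximum size. The function name `font_sizes` and the size range are assumptions, and the layout algorithm that positions the words is not covered. The input is assumed to be non-empty.

```python
from collections import Counter

def font_sizes(tokens, min_size=10, max_size=48):
    """Map each term to a font size that grows linearly with its frequency."""
    counts = Counter(tokens)
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for term, count in counts.items():
        # When every term has the same count, give all of them the largest size.
        scale = (count - low) / span if span else 1.0
        sizes[term] = round(min_size + scale * (max_size - min_size))
    return sizes

print(font_sizes(["data"] * 5 + ["text"] * 3 + ["cloud"]))
```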
WordTree
The WordTree visualization is a visual representation of both term frequencies and
their context.
Size is used to represent the term or phrase frequency. The root of the tree is a
user-specified word or phrase of interest, and the branches represent the various
contexts in which the word or phrase is used in the document.
A WordTree visualization generated by the free service ManyEyes. The branches of the
tree represent the various contexts following a root word or phrase in the document.
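The data structure behind a word tree can be sketched as a nested dictionary: every occurrence of the root word contributes the words that follow it, and each node keeps a count that a renderer could map to text size. This is an illustrative sketch, not the ManyEyes implementation; the function name `build_word_tree`, the single-word root, and the `depth` cutoff are assumptions.

```python
def build_word_tree(tokens, root, depth=3):
    """Nested dict of the contexts following `root`; each node stores a count."""
    tree = {"count": 0, "children": {}}
    for i, token in enumerate(tokens):
        if token != root:
            continue
        tree["count"] += 1
        node = tree
        # Walk down the tree along the words that follow this occurrence.
        for following in tokens[i + 1 : i + 1 + depth]:
            child = node["children"].setdefault(
                following, {"count": 0, "children": {}}
            )
            child["count"] += 1
            node = child
    return tree

tokens = "i have a dream i have a plan i like tea".split()
tree = build_word_tree(tokens, "i", depth=2)
print(tree["count"], sorted(tree["children"]))
```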
TextArc
TextArc is a visual representation of how terms relate to the lines of text
in which they appear.
Every word of the text is drawn in order around an ellipse as small lines
with a slight offset at its start.
As in a text cloud, more frequently occurring words are drawn larger and
brighter.
Words with higher frequencies are drawn within the ellipse, pulled by their
occurrences on the circle (similar to RadViz).
The user is able to highlight the underlying text with probing and animate
“reading” the text by visualizing the flow of the text through relevant
connected terms.
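The placement idea described above can be approximated as follows: put every token occurrence at an evenly spaced position around an ellipse, then place each distinct word at the centroid of its occurrences. Modeling the "pull" as a plain centroid is an assumption of this sketch, as are the function name `textarc_layout` and the ellipse radii. It is not the actual TextArc algorithm.

```python
import math
from collections import defaultdict

def textarc_layout(tokens, rx=1.0, ry=0.6):
    """Return {word: (x, y, frequency)} with each word at the centroid of
    its occurrence positions on an ellipse with radii rx and ry."""
    n = len(tokens)
    occurrences = defaultdict(list)
    for i, token in enumerate(tokens):
        angle = 2 * math.pi * i / n  # tokens are placed in reading order
        occurrences[token].append((rx * math.cos(angle), ry * math.sin(angle)))
    layout = {}
    for token, points in occurrences.items():
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
        layout[token] = (cx, cy, len(points))
    return layout

print(textarc_layout(["a", "b", "a", "c"]))
```

A word used evenly throughout the text ends up near the center, while a word that occurs in only one passage stays close to that passage on the ellipse.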
http://textarc.org/Stills.html
Arc Diagram
Arc diagrams are a visualization focused on displaying repetition in
text or any sequence.
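To make "displaying repetition" concrete, the sketch below finds repeated fixed-length subsequences and returns the pairs of start positions that an arc would connect. It is a deliberate simplification: fixed-length matching and linking only consecutive occurrences are assumptions of this example, and the function name `repetition_arcs` is hypothetical.

```python
from collections import defaultdict

def repetition_arcs(sequence, length=2):
    """Return sorted (start_a, start_b) pairs linking consecutive occurrences
    of every repeated subsequence of the given length."""
    positions = defaultdict(list)
    for i in range(len(sequence) - length + 1):
        positions[tuple(sequence[i : i + length])].append(i)
    arcs = []
    for starts in positions.values():
        # Link each occurrence to the next one; unique subsequences add nothing.
        for a, b in zip(starts, starts[1:]):
            arcs.append((a, b))
    return sorted(arcs)

print(repetition_arcs("abcabcab", length=3))
```

The function works on any indexable sequence, so a list of words or musical notes can be passed in place of a string.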
Literature Fingerprinting
Literature fingerprinting is a method of visualizing features used to
characterize text.
Instead of calculating just one feature value or vector for the whole
text (as is usually done), a sequence of feature values per text
is calculated and presented to the user as a characteristic fingerprint
of the document.
This allows the user to “look inside” the document and analyze the
development of the values across the text. Moreover, the structural
information of the document is used to visualize the document on
different levels of resolution.
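The core computation can be sketched as one feature value per block of text rather than one value for the whole document. In the sketch below, average word length is used as the feature and fixed-size word blocks as the unit; both are assumptions chosen for simplicity, as is the function name `fingerprint`. Using chapters or pages as blocks would give the different levels of resolution mentioned above.

```python
import re

def fingerprint(text, block_size=50):
    """Return one average-word-length value per block of `block_size` words."""
    words = re.findall(r"[A-Za-z']+", text)
    values = []
    for start in range(0, len(words), block_size):
        block = words[start : start + block_size]
        values.append(sum(len(word) for word in block) / len(block))
    return values

sample = "aa bbbb " * 2 + "cccccc " * 4
print(fingerprint(sample, block_size=4))
```

Each value in the returned list could then be drawn as one colored cell, giving the sequence its fingerprint-like appearance.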
Document Collection Visualizations
Self-Organizing Maps
Themescapes
• Themescapes are summaries of corpora using abstract 3D landscapes in
which height and color are used to represent density of similar documents.
• The example shown in the figure, from Pacific Northwest National Labs,
represents news articles visualized as a themescape.
• The taller mountains represent frequent themes in the document corpus.
Document Cards
The Document Card pipeline. Each step is further explained in the sections
indicated by the number in the top right corner of each box.
Extended Text Visualization
Software Visualization
Eick et al. developed a visualization tool called SeeSoft, which
visualizes statistics for each line of code (e.g., age, number of
modifications, programmer, dates).
Dimensions of Software Visualization
Tasks – why is the visualization needed?
Audience – who will use the visualization?
Target – what is the data source to represent?
Representation – how to represent it?
Medium – where to represent the visualization?
SeeSoft
SeeSoft - Uses
SeeSoft - Applications
SeeSoft - Limitations
Search Result Visualization
Temporal Document Collection Visualization
Control Panel View
Document View
List View
The Jigsaw list view, displaying the connections between people (left), places
(center), and organizations (right)
Representing Relationships
Jigsaw also includes an entity graph view, in which the user can navigate
a graph of related entities and documents.
A sentiment analysis visualization. News items are plotted along the time axis. Shape and color show to which category
an item belongs, and the vertical position depends on the automatically determined sentiment score of an item. The visual
objects representing news items are painted semi-transparent in order to make overlapping items more easily
distinguishable.
Cluster Graph View
The Jigsaw graph view, representing connections between named entities and
documents.
Document Cluster View
Scatterplot View
Circular Graph View
Representing Relationships
A clustered graph view in Jigsaw that filters for documents having specific entities.
Mousing over an entity identifies data about the document. Colors represent token
values.