1152CS191 Data Visualization - Unit IV


School of Computing
Department of Computer Science & Engineering

1152CS191 - Data Visualization
Category: Program Elective
UNIT-IV

Course Handling Faculty: Dr. M. Kavitha

12/10/2024 1
Course Outcomes

CO4: Identify the visualization techniques for Text and documents.
Level of learning domain: K2 (based on revised Bloom’s taxonomy)

Related Program Outcomes: Engineering Knowledge, Problem Analysis,
Design / Development of Solutions, Conduct Investigations of Complex
Problems, Modern Tool Usage, The Engineer and Society, Environment &
Sustainability, Ethics, Individual & Team Work, Communication,
Project Management & Finance, Life Long Learning, Mathematical
Concepts, Software Development, Transferring Skills.

Correlation of COs with Student Outcomes (ABET EAC and CAC)

EAC: CO  SO1 SO2 SO3 SO4 SO5 SO6 SO7
     CO4  3   2   2   -   2   2   3

CAC: CO  SO1 SO2 SO3 SO4 SO5 SO6
     CO4  3   2   2   -   2   -
Course Content

UNIT IV: Text and Document Visualization (9 hours)

• Levels of Text Representations


• The Vector Space Model
• Single Document Visualizations
• Document Collection Visualizations
• Extended Text Visualizations
• Designing Effective Visualizations
• Steps in Designing Visualizations
• Problems
• Comparing and Evaluating Visualization Techniques.

Text and Documents
• The most obvious tasks in text and documents are searching for a word, phrase, or topic.

• For partially structured data, relationships between words, phrases, topics, or documents are searched.

• For structured text or document collections, the key task is most often searching for patterns and outliers within the text or documents.

• A collection of documents is a corpus (plural: corpora). Text analysis deals with objects within corpora.

• The objects can be words, sentences, paragraphs, documents, or even collections of documents. Images and videos are also considered.

Text and Documents
• The objects are considered atomic with respect to the task, analysis, and visualization.

• Text and documents are often minimally structured and may be rich with attributes and metadata, especially when focused on a specific application domain.

• For example, documents have a format and often include metadata about the document (i.e., author, date of creation, date of modification, comments, size).

• Information retrieval systems are used to query corpora, which requires computing the relevance of a document with respect to a query. This requires document preprocessing and interpretation of the semantics of text.

• Computing statistics about documents is also required.
A Little Experiment

A Brief History

Text

Text as Visualization

Visualization for “Raw” Text
Visualizing text (features)

Requires a transformation step:
discretization, aggregation, normalization, ...

Structured Text Features
Typical Steps of Processing to Derive Text Features

• Large collections require pre-processing of text to extract information and align text.
  Typical steps are:
• Cleaning (regular expressions)
• Sentence splitting
• Conversion to lower case
• Stopword removal (the most frequent words in a language)
• Stemming (demo: Porter stemmer)
• POS (part-of-speech) tagging (demo)
• Noun chunking
• NER (named entity recognition) (demo: OpenCalais)
• Deep parsing: trying to “understand” the text.
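The steps above can be sketched in Python. This is a minimal illustration, not from the slides: the stop list is a tiny hand-picked set, and naive_stem is a crude suffix stripper standing in for a real Porter stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # tiny illustrative list

def naive_stem(word):
    # Crude suffix stripping: a stand-in for a real Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Cleaning: keep only letters, sentence punctuation, and whitespace.
    text = re.sub(r"[^A-Za-z.!?\s]", " ", text)
    # Sentence splitting on terminal punctuation.
    sentences = re.split(r"[.!?]+", text)
    result = []
    for sentence in sentences:
        tokens = sentence.lower().split()                     # lower case + tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]   # stopword removal
        tokens = [naive_stem(t) for t in tokens]              # stemming
        if tokens:
            result.append(tokens)
    return result

print(preprocess("The cats are sleeping. Dogs barked!"))
# [['cat', 'sleep'], ['dog', 'bark']]
```

Deeper steps (POS tagging, chunking, NER, parsing) need trained models and are typically delegated to an NLP library.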
Text features are complicated

Be aware! Text understanding can be hard:


• Toilet out of order. Please use floor below.

• “One morning I shot an elephant in my pajamas.


How he got in my pajamas, I don't know.”

• Did you ever hear the story about the blind


carpenter who picked up his hammer and saw?
Text Units Hierarchy

Levels of Text Representations
To convert unstructured text to some form of structured data:

Lexical Level
• Transforms a string of characters into a sequence of atomic entities, called tokens.
• Processes the sequence of characters with a given set of rules into a new sequence of tokens.
• Tokens can include characters, character n-grams, words, word stems, lexemes, phrases, or word n-grams, all with associated attributes.
• Finite state machines defined by regular expressions are used to extract tokens.

Syntactic Level
• Identifies and tags (annotates) each token’s function.
• Tokens have attributes such as singular or plural, or their proximity to other tokens.
• Richer tags include date, money, place, person, organization, and time.
• The process of extracting these annotations is called named entity recognition (NER).
• The richness and wide variety of language models and grammars yield a wide variety of approaches.

Semantic Level
• Extraction of meaning and relationships between pieces of knowledge derived from the structures identified in the syntactic level.
• The goal of this level is to define an analytic interpretation of the full text within a specific context, or even independent of context.
Vector Space Model

• Computing term vectors is an essential step for many document and corpus visualization and analysis techniques.

• In the vector space model, a term vector for an object of interest (paragraph, document, or document collection) is a vector in which each dimension represents the weight of a given word in that document.

• Typically, to clean up noise, stop words (such as “the” or “a”) are removed (filtering), and words that share a word stem are aggregated together (stemming).

• https://wordcounter.net/
Vector Space Model
• The pseudocode below counts occurrences of unique tokens, excluding stop words.
• The input is assumed to be a stream of tokens generated by a lexical analyzer for a single document.
• The terms variable contains a hashtable that maps unique terms to their counts in the document.

Count-Terms(tokenStream)
1 terms ← ∅                ▹ Initialize terms to an empty hashtable.
2 for each token t in tokenStream
3   do if t is not a stop word
4        then increment (or initialize to 1) terms[t]
5 return terms

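A direct Python translation of the Count-Terms pseudocode, using a dict as the hashtable; the small stop list is illustrative only.

```python
STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative stop list

def count_terms(token_stream):
    """Count occurrences of unique non-stopword tokens (Count-Terms above)."""
    terms = {}
    for token in token_stream:
        if token not in STOP_WORDS:
            terms[token] = terms.get(token, 0) + 1
    return terms

print(count_terms(["the", "cat", "and", "the", "cat", "sat"]))
# {'cat': 2, 'sat': 1}
```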
Statistical Models
• A document is typically represented by a bag of words
(unordered words with frequencies).

• Bag = set that allows multiple occurrences of the same


element.

• A user specifies a set of desired terms with optional weights:
   • Weighted query terms:
     Q = < database 0.5; text 0.8; information 0.2 >
   • Unweighted query terms:
     Q = < database; text; information >
   • No Boolean conditions are specified in the query.
Statistical Retrieval

• Retrieval based on similarity between query and


documents.
• Output documents are ranked according to similarity to
query.
• Similarity based on occurrence frequencies of keywords in
query and document.
• Automatic relevance feedback can be supported:
• Relevant documents “added” to query.
• Irrelevant documents “subtracted” from query.
Issues for Vector Space Model

• How to determine important words in a document?
   Word sense?
   Word n-grams (and phrases, idioms, …) as terms?
• How to determine the degree of importance of a term within a document and within the entire collection?
• How to determine the degree of similarity between a document and the query?
• In the case of the web, what is a collection, and what are the effects of links, formatting information, etc.?
Vector Space Model
• Assume t distinct terms remain after preprocessing; call
them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
 Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-
valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
   dj = (w1j, w2j, …, wtj)
   q = (w1q, w2q, …, wtq)

Graphic Representation

Document Collection
Term Weights: Term Frequency

• More frequent terms in a document are more important, i.e., more indicative of the topic:

   fij = frequency of term i in document j

• We may want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

   tfij = fij / maxi{fij}

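A small Python sketch of this normalization, assuming the raw term counts are already available as a dict (e.g., from Count-Terms):

```python
def normalized_tf(term_counts):
    """tf_ij = f_ij / max_i{f_ij}: divide each raw count by the largest count."""
    max_f = max(term_counts.values())
    return {term: f / max_f for term, f in term_counts.items()}

print(normalized_tf({"cat": 4, "sat": 2, "mat": 1}))
# {'cat': 1.0, 'sat': 0.5, 'mat': 0.25}
```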
Term Weights: Inverse Document Frequency

TF-IDF Weighting

Computing TF-IDF -- An Example

Similarity Measure

Similarity Measure - Inner Product

Properties of Inner Product

Inner Product - Example

Example – Similarity Measure

Cosine Similarity Measure

Naïve Implementation

Vector Space Model

Vector Space Model - Issues
Vector Space Model - Exercise

Dr. M. Kavitha, Department of Computer Science & Engineering, Data Visualization
Computing TF-IDF (Documents)

Compute-TfIdf(documents)
 1 termFrequencies ← ∅        ▹ Looks up term count tables by document name.
 2 documentFrequencies ← ∅    ▹ Counts the documents in which a term occurs.
 3 uniqueTerms ← ∅            ▹ The list of all unique terms.
 4 for each document d in documents
 5   do docName ← Name(d)                  ▹ Extract the name of the document.
 6      tokenStream ← Tokenize(d)          ▹ Generate the document token stream.
 7      terms ← Count-Terms(tokenStream)   ▹ Count the term frequencies.
 8      termFrequencies[docName] ← terms   ▹ Store the term frequencies.
 9      for each term t in Keys(terms)
10        do increment (or initialize to 1) documentFrequencies[t]
11           uniqueTerms ← uniqueTerms ∪ {t}
Computing TF-IDF (Documents)

13 tfIdfVectorTable ← ∅       ▹ Looks up tf-idf vectors by document name.
14 n ← Length(documents)
15 for each document name docName in Keys(termFrequencies)
16   do tfIdfVector ← create zeroed array of length Length(uniqueTerms)
17      terms ← termFrequencies[docName]
18      for each term t in Keys(terms)
19        do tf ← terms[t]
20           df ← documentFrequencies[t]
21           tfIdf ← tf ∗ log(n/df)
22           tfIdfVector[index of t in uniqueTerms] ← tfIdf
23      tfIdfVectorTable[docName] ← tfIdfVector
24 return tfIdfVectorTable
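The same algorithm can be sketched in Python. This sketch (an illustration, not the book's code) returns a dict of term weights per document rather than a dense vector over uniqueTerms, but it computes the same tf * log(n/df) weights:

```python
import math

def compute_tfidf(documents):
    """documents: dict mapping document name -> list of tokens.
    Returns a dict mapping document name -> {term: tf-idf weight}."""
    term_frequencies = {}
    document_frequencies = {}
    for name, tokens in documents.items():
        terms = {}
        for t in tokens:                      # count term frequencies
            terms[t] = terms.get(t, 0) + 1
        term_frequencies[name] = terms
        for t in terms:                       # count document frequencies
            document_frequencies[t] = document_frequencies.get(t, 0) + 1
    n = len(documents)
    tfidf = {}
    for name, terms in term_frequencies.items():
        tfidf[name] = {t: tf * math.log(n / document_frequencies[t])
                       for t, tf in terms.items()}
    return tfidf

docs = {"d1": ["cat", "cat", "dog"], "d2": ["dog", "fish"]}
weights = compute_tfidf(docs)
print(weights["d1"]["cat"])  # 2 * log(2/1) ≈ 1.386
print(weights["d1"]["dog"])  # 1 * log(2/2) = 0.0
```

Note that a term occurring in every document (here “dog”) gets weight 0: it carries no discriminating power.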
Zipf’s Law
 The economist Vilfredo Pareto observed that a company’s revenue is inversely proportional to its rank: a classic power law, resulting in the famous 80-20 rule, in which 20% of the population holds 80% of the wealth.

 Zipf described the distribution of words in natural language corpora using a discrete power law distribution called a Zipfian distribution.

 Zipf’s Law states that in a typical natural language document, the frequency of any word is inversely proportional to its rank in the frequency table.

 Plotting the Zipf curve on a log-log scale yields a straight line with a slope of -1.
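The slope can be checked numerically. The sketch below (illustrative, not from the slides) fits a least-squares line to log(frequency) versus log(rank); for an ideal Zipfian frequency table the slope comes out as -1:

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank).
    For Zipfian data the slope should be close to -1."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# An ideal Zipfian frequency table: f(r) proportional to 1/r.
ideal = [1000 / r for r in range(1, 101)]
print(round(loglog_slope(ideal), 3))  # -1.0
```

Running the same function on the term counts of a real corpus gives a slope near -1, which is the empirical content of Zipf’s Law.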
A document view in which named entities are highlighted, color-coded by entity type.
Zipf’s Law

The distribution of terms in Wikipedia, an example of Zipf’s Law in action. Term frequency is on the y-axis, and frequency rank is on the x-axis.
Tasks Using the Vector Space Model

• The vector space model, when accompanied by some distance metric, allows one to perform many useful tasks.

• tf-idf and the vector space model are used to identify documents of particular interest.

• For example, the vector space model, combined with a distance metric, allows us to answer questions such as: which documents are similar to a specific one, which documents are relevant to a given collection of documents, or which documents are most relevant to a given search query. All of these are answered by finding the documents whose term vectors are most similar to the given document’s vector, the average vector over a document collection, or the vector of a search query.

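A minimal sketch of such a similarity query, assuming documents and the query are already represented as sparse term-weight dicts (e.g., tf-idf vectors) and using cosine similarity as the distance metric:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight dicts (sparse vectors)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs):
    """Rank document names by decreasing cosine similarity to the query."""
    return sorted(doc_vecs,
                  key=lambda name: cosine(query_vec, doc_vecs[name]),
                  reverse=True)

docs = {"d1": {"cat": 1.0, "dog": 1.0},
        "d2": {"fish": 2.0},
        "d3": {"cat": 3.0}}
print(rank({"cat": 1.0}, docs))  # ['d3', 'd1', 'd2']
```

The same rank function answers all three questions above; only the query vector changes (a document's vector, a collection's average vector, or a search query's vector).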
Single Document Visualizations

A tag cloud visualization generated by the free service tagCrowd.com. The font size and darkness are proportional to the frequency of the word in the document.
Word Clouds
• Word clouds, also known as text clouds or tag clouds, are layouts of raw tokens, colored and sized by their frequency within a single document.

• Text clouds and their variations, such as a Wordle, are examples of visualizations that use only term frequency vectors and some layout algorithm to create the visualization.

A Wordle visualization generated by the free service wordle.net. The size of the text corresponds to the frequency of the word in the document.
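The sizing rule can be sketched as a linear mapping from word frequency to font size. The point range and the linear mapping are illustrative assumptions, not the algorithm used by tagCrowd or Wordle:

```python
def font_sizes(counts, min_pt=10, max_pt=48):
    """Map word frequencies to font sizes, linearly between min_pt and max_pt."""
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {w: round(min_pt + (c - lo) * (max_pt - min_pt) / span)
            for w, c in counts.items()}

print(font_sizes({"data": 9, "chart": 5, "axis": 1}))
# {'data': 48, 'chart': 29, 'axis': 10}
```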
WordTree
The WordTree visualization is a visual representation of both term frequencies and their context.
Size is used to represent the term or phrase frequency. The root of the tree is a user-specified word or phrase of interest, and the branches represent the various contexts in which the word or phrase is used in the document.

A WordTree visualization generated by the free service ManyEyes. The branches of the tree represent the various contexts following a root word or phrase in the document.
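A WordTree's branches can be collected by gathering the phrases that follow each occurrence of the root word, with counts for sizing. A minimal sketch (the depth parameter and sample data are illustrative):

```python
from collections import defaultdict

def word_tree_branches(tokens, root, depth=2):
    """Collect the phrases of up to `depth` words that follow each occurrence
    of `root`, with counts: the branches of a WordTree rooted at `root`."""
    branches = defaultdict(int)
    for i, tok in enumerate(tokens):
        if tok == root:
            context = tuple(tokens[i + 1 : i + 1 + depth])
            if context:
                branches[context] += 1
    return dict(branches)

text = "the cat sat on the mat near the cat door".split()
print(word_tree_branches(text, "cat"))
# {('sat', 'on'): 1, ('door',): 1}
```

A full WordTree nests these contexts into a shared-prefix tree; this sketch stops at flat counted phrases.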
TextArc
TextArc is a visual representation of how terms relate to the lines of text in which they appear.

Every word of the text is drawn in order around an ellipse as small lines with a slight offset at its start.

As in a text cloud, more frequently occurring words are drawn larger and brighter.

Words with higher frequencies are drawn within the ellipse, pulled by their occurrences on the circle (similar to RadViz).

The user is able to highlight the underlying text with probing and animate “reading” the text by visualizing the flow of the text through relevant connected terms.

http://textarc.org/Stills.html
Arc Diagram
Arc diagrams are a visualization focused on displaying repetition in text or any sequence.

Repeated subsequences are identified and connected by semicircular arcs.

The thickness of the arcs represents the length of the subsequence, and the height of the arcs represents the distance between the subsequences.

http://mbostock.github.io/protovis/ex/arc.html is the website for this diagram through Protovis/D3. The input data is in http://mbostock.github.io/protovis/ex/miserables.js
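The arc data behind such a diagram can be computed by locating repeated subsequences. A simplified sketch (it links each repeat only to the previous occurrence of the same subsequence, unlike the full matching algorithm):

```python
def repeated_arcs(seq, length):
    """Find repeated subsequences of a given length and return arcs
    (start_a, start_b, distance) connecting consecutive repeats."""
    positions = {}
    arcs = []
    for i in range(len(seq) - length + 1):
        sub = tuple(seq[i : i + length])
        if sub in positions:
            j = positions[sub]
            arcs.append((j, i, i - j))  # arc height ~ distance between repeats
        positions[sub] = i
    return arcs

melody = ["C", "D", "E", "F", "C", "D", "E", "G"]
print(repeated_arcs(melody, 3))  # [(0, 4, 4)]
```

Here the arc connects the two occurrences of C-D-E; its thickness would encode the subsequence length (3) and its height the distance (4).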
Arc Diagram

The figure displays Bach’s Minuet in G Major, visualizing the classic pattern of a minuet. It contains two parts, each consisting of a long passage played twice. The parts are loosely related, as shown by the bundle of thin arcs connecting the two main parts. The overlap of the two main arcs shows that the end of the first passage is the same as the beginning of the second.
Literature Fingerprinting
 Literature fingerprinting is a method of visualizing features used to characterize text.

 Instead of calculating just one feature value or vector for the whole text (as is usually done), a sequence of feature values per text is calculated and presented to the user as a characteristic fingerprint of the document.

 This allows the user to “look inside” the document and analyze the development of the values across the text. Moreover, the structural information of the document is used to visualize the document on different levels of resolution.

 Literature fingerprinting was applied to an authorship attribution problem to show the discrimination power of the standard measures that are assumed to capture the writing style of an author.

The literature fingerprinting technique, used here to analyze the ability of several text measures to discriminate between authors. Each pixel represents a text block, and the pixels are grouped into books. Color is mapped to the feature value, in this case the average sentence length. If a measure is able to discriminate between the two authors, the books in the first row (written by London) are visually set apart from the remaining ones. (Image from [222], © 2007 IEEE.)
Document Collection Visualizations

• In document collection visualizations, the goal is to place similar documents close to each other and dissimilar ones far apart.

• This is a minimax problem and typically O(n²): we compute the similarity between all pairs of documents and determine a layout.

• Common approaches are graph spring layouts, multidimensional scaling, clustering (k-means, hierarchical, EM, support vector), and self-organizing maps.

• Several document collection visualizations are self-organizing maps, clustermaps, and themescapes.
Self-Organizing Maps

• A self-organizing map (SOM) is an unsupervised learning algorithm using a collection of typically 2D nodes, where documents will be located.
• Each node has an associated vector of the same dimensionality as the input vectors (the document vectors) used to train the map.
• We initialize the SOM nodes, typically with random weights.
• We choose a random vector from the input vectors and calculate its distance from each node.
• We adjust the weights of the closest nodes (within a particular radius), making each closer to the input vector, with the largest adjustment applied to the closest selected node.
• As we iterate through the input vectors, the radius gets smaller.
• An example of using SOMs for text data is shown in the figure, which shows a million documents collected from 83 newsgroups.

A self-organizing map (SOM) layout of Finnish news bulletins. The labels show the topical areas, and color represents the number of documents, with light areas containing more.
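The training loop described above can be sketched for a 1-D map over 2-D inputs. This is a toy illustration, not the implementation behind the figure (real SOMs use 2-D node grids and distance-weighted neighborhood updates):

```python
import random

def train_som(data, n_nodes, iters=2000, seed=0):
    """Minimal 1-D self-organizing map over 2-D inputs: pick a random input
    vector, find the closest node (the best-matching unit), and pull it and
    its neighbors, within a shrinking radius, toward the input."""
    rng = random.Random(seed)
    nodes = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    for step in range(iters):
        x = rng.choice(data)
        # Best-matching unit: the node with the smallest squared distance to x.
        bmu = min(range(n_nodes),
                  key=lambda i: (nodes[i][0] - x[0]) ** 2
                              + (nodes[i][1] - x[1]) ** 2)
        radius = max(1, round(n_nodes * (1 - step / iters)))  # radius shrinks
        rate = 0.5 * (1 - step / iters)                       # rate decays
        for i in range(max(0, bmu - radius), min(n_nodes, bmu + radius + 1)):
            nodes[i][0] += rate * (x[0] - nodes[i][0])
            nodes[i][1] += rate * (x[1] - nodes[i][1])
    return nodes

# Two clusters of 2-D "document vectors"; after training, nearby map nodes
# end up representing similar inputs.
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
nodes = train_som(data, n_nodes=6)
```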
Themescapes
• Themescapes are summaries of corpora using abstract 3D landscapes in which height and color are used to represent the density of similar documents.
• The example shown in the figure, from Pacific Northwest National Labs, represents news articles visualized as a themescape.
• The taller mountains represent frequent themes in the document corpus.

A themescape from PNNL that uses height to represent the frequency of themes in news articles. (Image reprinted with permission of Springer Science and Business Media.)
Document Cards

The Document Card pipeline. Each step is further explained in the sections indicated by the number in the top right corner of each box.
Extended Text Visualization
Software Visualization
• Eick et al. developed a visualization tool called SeeSoft that visualizes statistics for each line of code (e.g., age, number of modifications, programmer, dates).
Dimensions of Software Visualization
 Tasks: why is the visualization needed?
 Audience: who will use the visualization?
 Target: what is the data source to represent?
 Representation: how to represent it?
 Medium: where to represent the visualization?
Software Visualization

SeeSoft

SeeSoft - Uses

SeeSoft - Applications
SeeSoft

New Application Areas
 Display of large amounts of text.
 Visualization of directories and files.

Limitations
 Only 50,000 lines of code can be displayed.
 Difficult to use with monochrome devices.
Search Result Visualization

 Marti Hearst developed a simple query result visualization, foundationally similar to Keim’s pixel displays, called TileBars, which displays a number of term-related statistics, including frequency and distribution of terms, length of document, term-based ranking, and strength of ranking.

 Each document of the result set is represented by a rectangle, where width indicates the relative length of the document and stacked squares correspond to text segments.

 Each row of the stack represents a set of query terms, and the darkness of a square indicates the frequency of those terms in the corresponding text segment.

The TileBars query result visualization. Each large rectangle indicates a document, and each square within the document represents a text segment. The darker the tile, the more frequent the query term set. (Image © 1995 Addison-Wesley.)
Temporal Document Collection Visualization

ThemeRiver, also called a stream graph, is a visualization of thematic changes in a document collection over time.
This visualization assumes that the input data progresses over time. Themes are visually represented as colored horizontal bands whose vertical thickness at a given horizontal location represents their frequency at a particular point in time.

A stream graph (ThemeRiver), depicting the election night speeches of several different candidates for a Canadian election. (Image © 2002 IEEE.)
Temporal Document Collection Visualization

Jigsaw is a tool for visualizing and exploring text corpora [155]. Jigsaw’s calendar view positions document objects on a calendar based on date entities identified within the text. When the user highlights a document, the entities that occur within that document are displayed.

Wanner et al. developed a visual analytics tool for conducting semiautomatic sentiment analysis of large news feeds.

News articles are presented with the Jigsaw calendar view, based on the extracted date entities.
Control Panel View

Document View

List View

The Jigsaw list view, displaying the connections between people (left), places (center), and organizations (right).
Representing Relationships
• Jigsaw also includes an entity graph view, in which the user can navigate a graph of related entities and documents.

• In Jigsaw, entities are connected to the documents in which they appear.

• The Jigsaw graph view does not show the entire document collection, but it allows the user to incrementally expand the graph by selecting documents and entities of interest.

A sentiment analysis visualization. News items are plotted along the time axis. Shape and color show to which category an item belongs, and the vertical position depends on the automatically determined sentiment score of an item. The visual objects representing news items are painted semi-transparent in order to make overlapping items more easily distinguishable.
Cluster Graph View

The Jigsaw graph view, representing connections between named entities and documents.

Document Cluster View

Scatterplot View

Circular Graph View

Representing Relationships

A clustered graph view in Jigsaw that filters for documents having specific entities. Mousing over an entity identifies data about the document. Colors represent token values.
References

http://vallandingham.me/textvis-talk/#70
http://jcsites.juniata.edu/faculty/rhodes/ida/textDocViz.html
https://www.analyticsvidhya.com/blog/2020/08/information-retrieval-using-word2vec-based-vector-space-model/
https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
http://mlwiki.org/index.php/Vector_Space_Models
