UNIT IV
INTRODUCTION - LEVELS OF TEXT REPRESENTATIONS
Computers are brilliant when dealing with numbers. They are faster than humans at calculations & at decoding patterns, by many orders of magnitude. But what if the data is not numerical? What if it's language? What happens when the data is in characters, words & sentences? How do we make computers process our language? How do Alexa, Google Home & many other smart assistants understand & reply to our speech?
TOPICS
• One-Hot encoding
• Bag-of-words representation (BOW)
• Basic BOW — CountVectorizer
• Advanced BOW — TF-IDF
ONE-HOT ENCODING
The intuition behind one-hot encoding is that each bit represents a possible category, & if a particular variable cannot fall into multiple categories, then a single bit is enough to represent it.
For example, a four-word sentence over a four-word vocabulary can be encoded as:
sentence = [ [1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1] ]
As you may have grasped, the length of each word's vector depends on the vocabulary size. This is not scalable for a very large corpus, which could contain 100,000 unique words or even more.
SNIPPET USING PYTHON
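The original snippet was an image that did not survive extraction; here is a minimal sketch of one-hot encoding in plain Python, assuming a hypothetical four-word sentence:

# One-hot encode a toy sentence (hypothetical example data).
sentence = ["i", "love", "nlp", "deeply"]
vocab = {word: index for index, word in enumerate(sentence)}

def one_hot(word):
    vec = [0] * len(vocab)   # one slot per vocabulary word
    vec[vocab[word]] = 1     # set the single bit for this word's category
    return vec

encoded = [one_hot(word) for word in sentence]
print(encoded)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]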
ADVANTAGES AND DISADVANTAGES OF ONE-HOT ENCODING
Advantages of one-hot encoding:
• It is intuitive & very easy to implement and understand.
Disadvantages of one-hot encoding:
• The vectors are sparse & their length grows with the vocabulary size, so the representation does not scale to large corpora.
• It cannot capture any relationship or similarity between words.
BAG-OF-WORDS REPRESENTATION
The intuition behind the BOW representation is that documents containing similar words are similar, irrespective of where those words occur.
Basic BOW — CountVectorizer
The CountVectorizer computes the frequency of occurrence of each word in a document. It converts a corpus of multiple sentences (say, product reviews) into a matrix of reviews & words, & fills it with the frequency of each word in each sentence.
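The slides' CountVectorizer snippet was likewise lost in extraction; the following is a minimal sketch using scikit-learn, assuming a made-up sentence in which "nlp" occurs twice & sorts to index 3 of the vocabulary, so it matches the observations below:

# Count word frequencies with scikit-learn's CountVectorizer (toy corpus).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["nlp is awesome, nlp is fun"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.vocabulary_)  # {'nlp': 3, 'is': 2, 'awesome': 0, 'fun': 1}
print(counts.toarray())        # [[1 1 2 2]]: frequency of each vocabulary word
print(counts.toarray()[0][3])  # 2: "nlp" occurs twice & sits at index 3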
As you can see, the word "nlp" occurs twice in the sentence & falls at index 3 of the vocabulary, which we can see in the output of the final print statement.
There are various parameters that can be tweaked as part of the CountVectorizer to get the desired results, including text preprocessing parameters like lowercase, strip_accents & preprocessor.
COUNTVECTORIZER
Disadvantages of CountVectorizer:
• This method ignores the position of words in the sentence, so it is not possible to grasp the meaning of a word from this representation.
• The intuition that high-frequency words are more important or give more information about the sentence fails for stop words like "is, the, an, I", & also when the corpus is context-specific. For example, in a corpus about COVID-19, the word "coronavirus" may not add a lot of value.
DISTRIBUTED OR
CONTINUOUS TEXT
REPRESENTATIONS
Distributed or continuous text representations, also known as word
embeddings, are a popular technique in data science for representing text
data in a numerical form that can be used for machine learning tasks.
Traditionally, text data has been represented using one-hot encoding,
where each word in a vocabulary is assigned a unique index and
represented as a vector with a 1 in the index corresponding to the word
and 0s elsewhere.
However, one-hot encoding has several limitations, such as the inability to
capture relationships between words and the high dimensionality of the
resulting representation.
Word embeddings, on the other hand, are dense, low-dimensional vectors that
represent words in a continuous vector space.
They are learned by training a neural network on a large corpus of text data, with the
goal of predicting the context in which each word appears.
The resulting word embeddings capture both syntactic and semantic relationships
between words, such that similar words are represented by similar vectors in the
vector space.
Word embeddings have numerous applications in data science, including natural
language processing (NLP), text classification, sentiment analysis, and machine
translation.
They have been shown to improve the performance of many NLP tasks compared to
traditional representations such as bag-of-words or one-hot encoding.
Some popular algorithms for generating word embeddings include Word2Vec, GloVe,
and FastText.
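As a rough illustration of how such embeddings are trained, here is a minimal sketch using the gensim library's Word2Vec (a hypothetical toy corpus; the parameters are illustrative, not prescriptive):

# Train tiny word embeddings with gensim's Word2Vec (toy data).
from gensim.models import Word2Vec

sentences = [
    ["word", "embeddings", "capture", "meaning"],
    ["similar", "words", "get", "similar", "vectors"],
    ["embeddings", "represent", "words", "as", "dense", "vectors"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["embeddings"][:5])               # first 5 dimensions of one word vector
print(model.wv.similarity("words", "vectors"))  # cosine similarity between two words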
Dimensionality here generally refers to the number of features you have for each sample in the problem you are trying to solve. For example, the famous Iris flower dataset only includes 4 features (sepal length, sepal width, petal width, petal length), and would be considered a low-dimensional dataset.
VECTOR SPACE MODEL
The Vector Space Model (VSM) is a mathematical model used in information
retrieval to represent text documents as vectors of features. The model
represents documents as high-dimensional vectors, where each dimension
corresponds to a specific term or feature
The VSM is based on the idea that documents with similar content will have
similar vector representations.
It is used to calculate the similarity between documents based on their
vector representations.
The most common measure of similarity used in the VSM is the cosine
similarity, which measures the angle between two vectors.
In the VSM, documents are represented as a set of terms, and each term is
assigned a weight that reflects its importance in the document.
The most commonly used weighting scheme is the Term Frequency-Inverse
Document Frequency (TF-IDF) scheme.
This scheme assigns a higher weight to terms that are frequent in the
document but rare in the corpus.
The VSM has many applications in information retrieval, including text
classification, document clustering, and recommender systems. It is also
used in natural language processing tasks such as text summarization and
information extraction.
VECTOR SPACE MODEL
EXAMPLE
Suppose we have a corpus of three documents:
Document 1: "The quick brown fox jumps over the lazy dog“
To apply the VSM, we first preprocess the text by removing stopwords (e.g., "the", "is",
"in") and stemming the remaining words (e.g., "jumps" and "jumping" become "jump").
Next, we create a term-document matrix that represents the frequency of each term in
each document. Here's what the term-document matrix looks like for our example:
        quick  brown  fox  jump  dog  lazy  sleep  sun
Doc 1     1      1     1    1     1     1     0     0
Doc 2     1      1     1    1     1     0     0     0
Doc 3     0      0     0    0     1     1     1     1
Each row in the matrix represents a document, and each column
represents a term. The numbers in the matrix represent the frequency of
each term in each document.
Next, we use the TF-IDF weighting scheme to calculate the weight of each
term in each document. The TF-IDF weight of a term is a product of its Term
Frequency (TF) in the document and its Inverse Document Frequency (IDF)
across the corpus. Here's what the weighted term-document matrix looks
like for our example:
        quick  brown  fox   jump  dog   lazy  sleep  sun
Doc 1   0.29   0.29   0.29  0.29  0.29  0.29   0      0
Doc 2   0.29   0.29   0.29  0.29  0.29   0     0      0
Doc 3    0      0      0     0    0.29  0.29  0.45   0.45
Finally, we can represent each document as a vector in a high-dimensional space, where each dimension corresponds to a term. Each row of the weighted term-document matrix is the vector representation of the corresponding document. For example, the vector representation of Document 1 is:
[0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0, 0]
We can then use the cosine similarity measure to compare the similarity
between any two documents.
For example, the cosine similarity between Document 1 and Document 2 works out to about 0.91 with the vectors above, which indicates that they are highly similar.
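Here is a sketch of the same pipeline with scikit-learn (illustrative only: TfidfVectorizer uses smoothed IDF & L2 normalization and does no stemming, so its exact weights differ from the hand-worked numbers above):

# Build TF-IDF vectors & compare documents with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the dog",
    "The lazy dog sleeps in the sun",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # weighted term-document matrix

print(vectorizer.get_feature_names_out())     # the terms (matrix columns)
print(cosine_similarity(tfidf[0], tfidf[1]))  # similarity of Document 1 & Document 2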
CALCULATING COSINE
SIMILARITY
Cosine similarity is a measure of similarity between two non-zero vectors of
an inner product space. It measures the cosine of the angle between two
vectors, which ranges from -1 (opposite directions) to 1 (same direction). A
value of 0 indicates the vectors are orthogonal (perpendicular) to each
other.
Here's an example of how to calculate cosine similarity between two
vectors:
Suppose we have two vectors, A and B:
A = [1, 2, 3, 4, 5]
B = [2, 4, 6, 8, 10]
To calculate the cosine similarity between A and B, we first need to compute the dot
product of A and B. The dot product is the sum of the element-wise products of the
two vectors:
A . B = (1 * 2) + (2 * 4) + (3 * 6) + (4 * 8) + (5 * 10) = 2 + 8 + 18 + 32 + 50 = 110
Next, we need to compute the magnitude (length) of each vector. The magnitude of
a vector is the square root of the sum of the squares of its elements:
||A|| = sqrt((1^2) + (2^2) + (3^2) + (4^2) + (5^2)) = sqrt(55) = 7.42
||B|| = sqrt((2^2) + (4^2) + (6^2) + (8^2) + (10^2)) = sqrt(220) = 14.83
Finally, we can compute the cosine similarity between A and B using the dot product and magnitudes:
similarity = A . B / (||A|| * ||B||) = 110 / (7.42 * 14.83) = 1.0
The cosine similarity between A and B is 1.0, which makes sense: B = 2A, so the two vectors point in exactly the same direction in the vector space.
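A quick numeric check of this worked example (a sketch using NumPy):

# Verify the cosine similarity of A & B numerically.
import numpy as np

A = np.array([1, 2, 3, 4, 5])
B = np.array([2, 4, 6, 8, 10])

similarity = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(similarity)  # ~1.0, since B is a scalar multiple of A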
TF-IDF WEIGHTING SCHEME
TF-IDF stands for term frequency-inverse document frequency and is a
commonly used weighting scheme in information retrieval and text mining.
It is used to assign weights to each term in a document based on how
important it is in the context of the document and the collection of
documents.
Here's an example of how to calculate TF-IDF for a term in a document:
Suppose we have a collection of documents and we want to calculate the
TF-IDF weight for the term "data" in one of the documents, called
Document A.
Document A: "The data science team analyzed data and found some
interesting insights."
First, we need to calculate the term frequency (TF) of the term "data" in
Document A. The term frequency is simply the number of times the term
appears in the document divided by the total number of terms in the
document.
TF = (number of times "data" appears in Document A) / (total number of
terms in Document A)
In this case, the term "data" appears twice in Document A and the total
number of terms in the document is 11, so:
TF = 2 / 11 = 0.18
Next, we need to calculate the inverse document frequency (IDF) of the term "data"
across the entire collection of documents. The IDF measures how rare or common the
term is in the collection of documents and is calculated as the logarithm of the total
number of documents in the collection divided by the number of documents that
contain the term.
IDF = log (total number of documents in the collection / number of documents
containing the term "data")
Suppose our collection contains 100 documents and the term "data" appears in 50 of
them:
IDF = log (100 / 50) = log (2) = 0.30
Finally, we can calculate the TF-IDF weight for the term "data" in Document A by
multiplying the TF and IDF values:
TF-IDF = TF * IDF = 0.18 * 0.30 = 0.054
So the TF-IDF weight of the term "data" in Document A is 0.054. This is a fairly low weight: although "data" is frequent within Document A, it appears in half the documents of the collection, so it does little to distinguish Document A from the rest of the corpus.
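The same calculation as a short Python sketch (illustrative only; log base 10, as in the worked example):

# Reproduce the TF-IDF hand calculation for the term "data".
import math

doc_a = "The data science team analyzed data and found some interesting insights".lower().split()

tf = doc_a.count("data") / len(doc_a)  # 2 / 11 = 0.18
idf = math.log10(100 / 50)             # log(2) = 0.30
print(tf * idf)                        # 0.055 (the slides round 0.18 * 0.30 to 0.054)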
SINGLE DOCUMENT VISUALIZATION
https://jcsites.juniata.edu/faculty/rhodes/ida/textDocViz.html#tagcloud
Single document visualization is a type of data visualization that focuses on
the content of a single document.
It is used to help users quickly understand and absorb the core content of a
document, as well as its key features and relationships.
Single document visualization techniques can be used to represent a wide
variety of document content, including text, images, tables, and graphs.
There are many different single document visualization techniques available, each with
its own strengths and weaknesses. Some of the most common techniques include:
• Word clouds: Word clouds are a simple but effective way to visualize the frequency of
words in a document. They can be used to identify the most important topics in a
document, as well as to identify relationships between different words.
• Tag clouds: Tag clouds are similar to word clouds, but they use tags instead of words.
Tags are short, descriptive words or phrases that are used to categorize documents. Tag
clouds can be used to help users find documents that are relevant to their interests.
• Document maps: Document maps are a more complex type of visualization that shows
the structure of a document. They can be used to help users understand the flow of
information in a document, as well as to identify the relationships between different
sections of the document.
• Timelines: Timelines are a useful way to visualize the chronological order of events in a
document. They can be used to help users understand the development of a story, as
well as to identify key events in a document.
DOCUMENT COLLECTION
VISUALIZATION
Document collection visualization is a type of data visualization that focuses on the
content of a collection of documents
There are many different document collection visualization techniques available, each
with its own strengths and weaknesses. Some of the most common techniques include:
• Document maps: Document maps are a type of visualization that shows the structure of
a document collection. They can be used to help users understand the relationships
between different documents in a collection.
• Term maps: Term maps are a type of visualization that shows the frequency of terms in
a document collection. They can be used to help users identify the most important
topics in a document collection, as well as to identify relationships between different
topics.
• Co-occurrence maps: Co-occurrence maps are a type of visualization that shows the
relationships between different terms in a document collection. They can be used to
help users identify relationships between different concepts in a document collection.
• Topic models: Topic models are a type of statistical model that can be used to identify
the topics in a document collection. They can be used to help users identify the most
important topics in a document collection, as well as to identify relationships between
different topics.
EXTENDED DOCUMENT
VISUALIZATION
Extended document visualization is a type of data visualization that focuses on the content of a long document, such as a book or a research paper.
It is used to help users quickly understand and absorb the core content of
a long document, as well as its key features and relationships.
There are many different extended text visualization techniques available, each with
its own strengths and weaknesses. Some of the most common techniques include:
• Document maps: Document maps are a type of visualization that shows the
structure of a long document. They can be used to help users understand the
relationships between different sections of a document.
• Term maps: Term maps are a type of visualization that shows the frequency of terms
in a long document. They can be used to help users identify the most important topics
in a long document, as well as to identify relationships between different topics.
• Co-occurrence maps: Co-occurrence maps are a type of visualization that shows the
relationships between different terms in a long document. They can be used to help
users identify relationships between different concepts in a long document.
• Topic models: Topic models are a type of statistical model that can be used to
identify the topics in a long document. They can be used to help users identify the
most important topics in a long document, as well as to identify relationships between
different topics.
• Sentiment analysis: Sentiment analysis is a technique that can be used to identify
the sentiment of a document, such as whether it is positive, negative, or neutral. This
can be helpful for tasks such as identifying the emotional impact of a document.
INTERACTION OPERATIONS
Interaction operations let users repeatedly perform actions on the data shown in a visualization. They are a powerful tool for data visualization, as they can be used to create dynamic and interactive visualizations.
There are many different types of interaction operations, each with its own strengths and weaknesses. Some of the most common types of interaction operations include:
• Filtering: Filtering is used to remove data that does not meet certain criteria. This can be useful for cleaning up data or for focusing on a specific subset of data.
• Aggregation: Aggregation is used to combine data into a single value. This can
be useful for summarizing data or for creating new features.
• Transformation: Transformation is used to change the format of data. This can
be useful for making data more understandable or for preparing data for
further analysis.
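As a toy sketch of these three operation types (hypothetical data; in a real tool these operations would drive an interactive view), using pandas:

# Filtering, aggregation and transformation on a small table.
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2021, 2022],
                   "sales": [100, 120, 80, 150]})

filtered = df[df["year"] >= 2021]                       # filtering: keep only a subset
aggregated = df.groupby("year")["sales"].mean()         # aggregation: combine into single values
transformed = df.assign(log_sales=np.log(df["sales"]))  # transformation: change the data's format

print(filtered)
print(aggregated)
print(transformed)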
Interaction operations are a powerful tool for data visualization. By using interaction operations, you can create dynamic and interactive visualizations that can help you to better understand your data.
Here are some examples of how interaction operations can be used in data visualization:
• To create a dynamic visualization that updates as the data changes. For example, you
could use a filtering operation to create a visualization that only shows data for a specific
time period. As the data changes, the visualization would update to reflect the new data.
• To create an interactive visualization that allows users to explore the data. For example,
you could use an aggregation operation to create a visualization that shows the average
value of a data set over time. Users could then interact with the visualization to explore
the data by different time periods.
• To create a visualization that is easier to understand and interpret. For example, you could use a transformation operation to show the data in a different format, one that makes its patterns easier to read.
INTERACTION OPERANDS
AND SPACES
In data visualization, interaction operands and spaces are the two key
components that allow users to interact with and explore data.
• Interaction operands are the objects or entities that users can interact with.
They can be individual data points, groups of data points, or even the entire
visualization itself.
• Interaction spaces are the areas of the visualization where users can interact
with the operands. They can be the entire screen, a specific region of the
screen, or even a specific element of the visualization.
The combination of interaction operands and spaces allows users to explore
data in a variety of ways. For example, users can select individual data points
to see more information about them, or they can zoom in on a specific region
of the visualization to get a closer look.
Here are some examples of interaction operands and spaces:
• Interaction operands:
• Individual data points
• Groups of data points
• The entire visualization
• Interaction spaces:
• The entire screen
• A specific region of the screen
• A specific element of the visualization
BENEFITS OF INTERACTION
OPERANDS AND SPACES
• Improved understanding of data: they allow users to explore data in a more interactive way, which can help them better understand the data.
• Enhanced data exploration: they allow users to explore data in a more flexible way, helping them find patterns and trends they might not otherwise have found.
• Improved decision-making: they help users make better decisions by giving them a better understanding of the data.
• Enhanced problem-solving: they help users solve problems by giving them a better understanding of the data.
A UNIFIED FRAMEWORK
A unified framework for data visualization is a set of principles and
guidelines that can be used to create effective and informative
visualizations.
The framework should be based on a solid understanding of the cognitive
and perceptual processes involved in human visual perception.
It should also take into account the different types of data that can be
visualized, the different audiences that will be viewing the visualizations,
and the different purposes for which the visualizations will be used.
A unified framework for data visualization can help to ensure that
visualizations are:
• Effective: They communicate the intended message to the intended
audience in a clear and concise way.
• Informative: They provide insights into the data that would not be possible
to see from the data alone.
• Perceptually accurate: They are easy to understand and interpret.
• Elegant: They are visually appealing and engaging.
There are a number of different unified frameworks for data visualization that
have been proposed. Some of the most well-known frameworks include:
• The Grammar of Graphics: This framework was developed by Leland Wilkinson and is based on the idea that all visualizations can be decomposed into a small number of basic elements, such as marks, scales, and legends.
• The Information Visualization Design Framework: This framework was developed by Ben Shneiderman and is based on the idea that visualizations can be designed around a set of seven tasks, such as overview first, zoom and filter, and details-on-demand.
• The Data-Ink Ratio: This framework was developed by Edward Tufte and is based on the idea that the most effective visualizations use as little ink as possible to communicate the intended message.
SCREEN SPACE
In data visualization, screen space is the area of the screen that is used to display the
visualization.
It is important to consider screen space when designing a visualization, as you want
to make sure that the visualization is easy to read and understand.
There are a few things to keep in mind when considering screen space:
• The size of the screen: The size of the screen will obviously affect the amount of
screen space that is available. A larger screen will provide more space for the
visualization, while a smaller screen will provide less space.
• The resolution of the screen: The resolution of the screen will also affect the amount
of screen space that is available. A higher resolution screen will provide more pixels
per inch, which will make it easier to read small text and details.
• The type of visualization: The type of visualization will also affect the amount of
screen space that is needed. Some visualizations, such as bar charts and pie charts,
can be relatively compact, while others, such as heat maps and scatterplots, can take
up more space.
When designing a visualization, it is important to consider the amount of
screen space that is available and to make sure that the visualization is easy to
read and understand. Here are a few tips for using screen space effectively:
• Use whitespace: Whitespace is the empty space around the visualization. It can
be used to make the visualization easier to read and understand.
• Use clear and concise labels: The labels should be clear and concise, and they
should be easy to read.
• Use a consistent style: The visualization should use a consistent style, which
will make it look more professional and polished.
OBJECT SPACE
In data visualization, object space is the space needed to display the data objects themselves. There are a few things to keep in mind when considering object space:
• The size of the data objects: The size of the data objects will obviously affect the amount of object space that is required. Larger data objects will require more space, while smaller data objects will require less space.
• The density of the data objects: The density of the data objects will also affect
the amount of object space that is required. A high-density of data objects will
require more space, while a low-density of data objects will require less space.
• The type of visualization: The type of visualization will also affect the amount of
object space that is needed. Some visualizations, such as bar charts and pie
charts, can be relatively compact, while others, such as heat maps and
scatterplots, can take up more space.
DATA SPACE
In data visualization, data space is the area of the visualization that is used to display
the data. Data space can be divided into two main types: physical space and
information space.
Physical space is the actual space on the screen or page that is used to display the
data. Information space is the conceptual space that is used to represent the data.
There are a number of factors that can affect the relationship between the physical
space and information space. These factors include:
• The type of data: The type of data will affect the way that it is represented in the
physical space. For example, numerical data is often represented as points, lines, or
bars, while categorical data is often represented as shapes or colors.
• The amount of data: The amount of data will affect the size of the physical space that
is needed. More data will require more space.
• The complexity of the data: The complexity of the data will affect the way that it is
represented in the physical space. More complex data will require more space and
more complex representations.
ATTRIBUTE SPACE
Attribute space is a term used in data visualization to describe the space in
which data attributes are represented.
Attribute space can be thought of as a multidimensional space, with each
dimension representing a different attribute.
For example, a data set with three attributes, such as age, gender, and
income, would have three dimensions in attribute space.
Attribute space can also be used to identify patterns and trends in data.
For example, a heatmap can be used to visualize the distribution of data
values within a multidimensional space.
By assigning different colors to different values, a heatmap can show how
data values are distributed across the different dimensions of attribute space.
Here are some examples of how attribute space can be used in data
visualization:
• Scatterplots: Scatterplots can be used to visualize the relationship between
two variables. For example, a scatterplot could be used to visualize the
relationship between age and income.
• Heatmaps: Heatmaps can be used to visualize the distribution of data values
within a multidimensional space. For example, a heatmap could be used to
visualize the distribution of population density within a city.
• Treemaps: Treemaps can be used to visualize hierarchical data. For example, a
treemap could be used to visualize the organization of a company.
• Bubble charts: Bubble charts can be used to visualize the relationship between
three variables. For example, a bubble chart could be used to visualize the
relationship between age, income, and education level.
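As a minimal sketch, two dimensions of attribute space (age and income, with made-up values) can be plotted as a scatterplot using matplotlib:

# Scatterplot of two attributes: age vs. income (hypothetical values).
import matplotlib.pyplot as plt

age = [23, 35, 45, 52, 61, 30]
income = [30000, 48000, 62000, 71000, 65000, 41000]

plt.scatter(age, income)
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Two dimensions of attribute space")
plt.show()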
VISUALIZATION STRUCTURE
A visualization structure is a way of organizing data so that it can be easily visualized.
There are many different types of visualization structures, each with its own advantages
and disadvantages.
Some common visualization structures include:
• Trees: Trees are a good way to visualize hierarchical data, such as the organization of a
company.
• Graphs: Graphs are a good way to visualize relationships between entities, such as the
connections between people on social media.
• Matrices: Matrices are a good way to visualize data that can be represented as a table,
such as the results of a survey.
• Charts: Charts are a good way to visualize data that can be represented as a series of
points, such as the stock market.
• Maps: Maps are a good way to visualize data that is related to a geographic location,
such as the distribution of population density.
Here are some examples of how visualization structures can be used: