ISB - BA - W9 - Video Transcripts


Week 9: Elementary Text Analytics

Video 1: Elementary Text Analytics: Overview


Hello, everybody. Namaste, salaam, sat sri akal. Welcome to today's session on Text
Analytics: the analysis of unstructured text.
Now, this is familiar to you. This is our analytics course map. As you can see, the left-hand
side shows the supervised learning approaches covered by my esteemed colleague, Professor
Manish Gangwar. On the right-hand side are the unsupervised learning topics, most of which
I am covering. The second from last on the right-hand side is today's topic, Text Analysis.
Without any further ado, let's get there.

Video 2: Text Analytics: Motivating Example


All right. Let me motivate the why of text analytics using two small, simple use cases. The
first use case goes this way. The firm is Amazon. The category is Bluetooth sound speakers,
speakers that work via Bluetooth. The task is to develop specs for an own brand, which, from
Amazon's point of view, sits in the 150-dollar-odd price range. This is a US example.
This is what the speakers look like; this one, for instance, is a JBL. Now, an analyst team at
Amazon gets to work. They scrape all Amazon reviews for the top five brands in the category.
The data that comes back includes the customer rating for each brand, the price of each brand,
and customer reviews, which contain unstructured text.
Now, this is a snapshot of what the data look like. Who is the manufacturer? What is the
average rating? What is the price? This is easy to see; these are attributes within the review
itself. However, have a look at this: high-rated features, low-rated features. These are fished
out of text. The unstructured text mentions features in different ways somewhere inside it.
How did these features get captured? Further, it says high rated and low rated. How do we
know which one is high rated and which one is low rated? Some sort of sentiment associated
with the mention of each feature would be required.
The approach would be to use customer ratings to find which product features influence scores
the most. So, in some sense, there is a bit of supervised learning also built into this particular
kind of problem.
One important insight that emerged from this exercise was that you have to look at the
low-rated features too, not just the high-rated ones, if you are going to launch in that
category. You have to know not just what to do but also what not to do, in some sense. Well,
where are things today? Last heard, Amazon is considering product design and development of
its own private-label Bluetooth speaker series, which it might launch at some point.
Here is another use case. This one comes from Fitbit, and we are all familiar with fitness
trackers; that is what they make, and they are everywhere. Now, the task is to get product
feedback: what are consumers who use Fitbit trackers saying about them? It turns out that
Fitbit has a Fitbit Support Twitter handle where people write to Fitbit saying,
okay, I have this problem, this is not working, how do I solve this? So I mined some tweets
from there.
Now, an analyst team guided by a manager got to work. The idea was to filter the feedback by
product type and see what customers were saying about the different fitness trackers in the
Fitbit range. This is what the results typically look like, and this is coming directly from
the Twitter API.
Now, it turns out that different products have different grievances, different issues
associated with them. For the Fitbit Charge HR, the strap was an issue: it tended to break,
bubbles would develop, the rubber strap would peel off, and so on. The Fitbit Blaze had a
different set of issues; the operating system itself was an issue. A lot of people had trouble
getting beyond the logo screen on this kind of tracker.
What is going on here? One, in some sense these are secondary-data-based insights: the data
is already out there, and we are using text analysis to find the insights. Two, there are
third-party opportunities. Who is this kind of data useful for? It is not just Fitbit that
should be aware of these things. Consider a third-party supplier: if I make straps compatible
with Fitbit trackers, then I would also monitor Fitbit Support and pitch my straps to those
customers. Effectively, targeted marketing in some sense, but that is what is going on.
Let me quickly conclude these two use cases. What are we trying to motivate? More generally:
one, there is informational value locked inside unstructured data. The data looks
unstructured; two people could be saying the same thing, but what are the odds they will use
exactly the same words? There is a lot of informational value locked into what is called
unstructured text, and this value is for brands, for platforms, for third parties, you name it.
Then there is the way we impose preliminary structure on text. You cannot analyse something
that is purely unstructured; the only way to analyse it is to impose structure on that
content. So this is what we will do with text as well. We will find out how to impose
structure, what structure to impose and, as a result, what we will get out of it. The
constructs of interest can be found inside the data, provided we know where to look and how
to look, which is where we are headed today.
If you want to automate text analytics, and that is fairly commonplace nowadays, you need a
lot of what is called training data. In the absence of such training data, we will use manual
judgement instead wherever required, as we will see today. However, trained models are
available, and we are going to invoke some of those as well.

Video 3: Text Analytics: The Why and How


Alright, so let's now head into text analytics: the why and the how of text-an, so to say.
The why part is straightforward; for the how, we will see two basic approaches as well as an
introduction to some terminology.
So, let me start with the why of text-an. Consider this scenario: a manager wants to know the
broad contours of what customers are saying when they call into the company's call centre.
Calls may be recorded for training and quality purposes partly because the firm wants to know
what customers are saying, what broad buckets are emerging from customer calls into the call
centre. Two, a firm might want to know if there is a persistent pattern in the content of
customer feedback records: customer complaints (there is typically a grievance cell), service
calls, emails, whatever the customer is saying. All of this involves text.
The first one involves voice converted to text; the second involves text directly. An
investor might want to know what major topics surround press coverage of a company he or she
is interested in. In all these cases, we have unstructured text, and our task is the analysis
of such unstructured, open-ended text data: data which may or may not be amenable to
measurement and scaling by traditional methods. In fact, we will impose a preliminary
structure by which such measurement becomes possible. We will see shortly the primary data
scales, in some sense, and we will put them to work.
All right, the how of basic text-an. The why part is done, so let me get to the how. Broadly
speaking, there are two basic ways to handle text analytics. One is called NLP, the Natural
Language Processing approach, and the other is the text-mining bag-of-words approach. The two
are related in some sense.
What is the big difference between these two approaches? It is the assumption each makes.
Natural Language Processing assumes that the order in which words come in text has meaning,
that order is meaningful: change the order and you change the meaning. That is basically the
assumption NLP makes. Bag-of-words assumes that all words are exchangeable: it does not
matter in which order the words come; shuffle them around and the meaning remains the same.
Now, this is a strong assumption; NLP clearly makes the more realistic one. However, with
bag-of-words, even if you were to change the order of the words in a given passage of text,
whatever meaning is retained at the end can still be mined for insight. Bag-of-words happens
to be much more practical and applicable to much larger datasets. Let's have a look at that.
Today we will focus a lot on bag-of-words. We will also borrow from NLP where we have to.
Let's head there. Bag-of-words. Now, this is a passage from Shakespeare's "As You Like It".
"All the world's a stage, and all the men and women merely players: they have their exits and
their entrances; and one man in his time plays many parts..." Now, if you're an English major,
you might find deep meaning and insight in that. But, if you're a data scientist, this is what you
would see.
Effectively, a frequency table. "World" occurs once; in "world's", the apostrophe and the s
are cut, the suffix is taken out, and only the root remains. "Play" occurs twice. Where? In
"players" and "plays". As you can see, again the suffixes are cut and only the word root
remains. Why do we do this? If we didn't, we would have a proliferation of tokens: "play"
would be one token, "plays" another, "players" yet another, and so on. So we, in some sense,
take that part out. Now, this is the bag-of-words approach.
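To make the bag-of-words idea concrete, here is a minimal Python sketch of the same counting (my own illustration; the course app is a Shiny tool whose code is not shown, and the crude suffix stripping below only stands in for a proper stemmer):

```python
# Minimal bag-of-words sketch: lowercase, tokenise, drop stopwords, and
# crudely strip suffixes so "plays" and "players" collapse toward "play".
import re
from collections import Counter

text = ("All the world's a stage, and all the men and women merely players: "
        "they have their exits and their entrances; and one man in his time "
        "plays many parts")

stopwords = {"the", "a", "and", "all", "they", "their", "one", "in", "his",
             "have", "s"}
tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]

def stem(word):
    # Illustrative only; a real pipeline would use a proper stemmer.
    for suffix in ("ers", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(Counter(stem(t) for t in tokens).most_common())
# "play" shows up with count 2, from "players" and "plays"
```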
Have a look at this. Here are two different documents, and I can construct a frequency table
for them similarly; I am going to get into how this is constructed and what happens. Now,
bag-of-words is very powerful as an approach. It is a logical, mechanical way to input text.
It also remains, really, the only way in which a machine can handle text: it reads text in as
characters, characters form words, words form sentences, sentences form paragraphs, and
different units of analysis can be taken.
Well, let's look at a real dataset and where we're heading. This is the dataset: the
transcript of a quarterly earnings call. What are quarterly earnings calls? Why are they
done? Typically, these are teleconferences held every quarter by the top leadership of public
companies. On the other side of the telecons are equity analysts who ask questions about the
company's earnings guidance, performance, and so on. Quite clearly, these are important:
quarterly earnings calls tend to have a significant impact on equity prices, companies' share
prices, and so on.
What I am going to use today as my example dataset is the call transcript of IBM's quarter
three 2015 earnings call. Now, IBM has been having a tough decade. The Watson gamble, in some
sense, was a big one, but the numbers have not been as great as in the decade prior. There it
is; this is what the dataset looks like. By default, each line in this text file is what we
call a document: each line is a document, separated by a line break.
Now, documents can be short or long. In this case, they are fragments of a sentence. The
entire set of documents, the entire stack of documents, is what we call a corpus. Corpus, in
this case, stands for a body of text, and the etymological root is, in some sense, "corp",
which in Latin means body. So, a corpse is a dead body, a corpus here is a body of text, and
incorporation is the process of giving a business a body in law: most of it coming from the
same Latin root. What kind of insights could we hope to see in this text? This is a quarterly
earnings call text. What kind of insights?
So, next I am going to walk you through opening the basic text analysis app. We are going to
read this file in and analyse it using bag-of-words as well as a little bit of NLP; we have a
mixture of both approaches in where we go.
Now, this is what the app will look like. We will examine the interface, the input fields,
the output data, what goes in, and then go through each bit of output that comes from it, and
see how the questions we have in mind are answered by combining the outputs this tool gives.
Basically, that's the idea.
Before I proceed into the tool itself, let me go through some more terminology. One,
tokenisation. Tokenisation is the process of breaking up the cleaned corpus into individual
terms, words basically. Let me give you an example: "business school" is a two-word token. It
is a single token called a bigram; it has two words in it. "Business" and "school" separately
would be one-word tokens, unigrams. "Big data", for instance: "big" and "data" are two
different unigrams; put together as a phrase, they become one bigram, and so on, as the small
sketch below shows.
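Here is a tiny sketch of the difference (a hypothetical phrase, just to show the token forms):

```python
# Unigrams vs bigrams over the same phrase.
words = "big data at business school".split()
unigrams = words
bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
print(unigrams)  # ['big', 'data', 'at', 'business', 'school']
print(bigrams)   # ['big data', 'data at', 'at business', 'business school']
```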
Once we have tokenised a corpus into tokens, a token could be a word, a two-word phrase, a
sentence, even a paragraph. Typically, we will go with one-word tokens; that will be our
default definition. We can then build a simple frequency table. What does this frequency
table do? Pretty much what we saw in the Shakespeare example: it takes each token and tells
you how many times it occurred in the corpus. If I have 10 documents, then I will have a
frequency table with 10 rows. Each document is a row, the columns are my tokens, and the
table tells me how many times each token occurred in each row. That's basically where this is
going.
This is what the output will look like. It's called a document token matrix: documents are
the rows, tokens are the columns, and the cells are populated by frequencies. We will see
this as we get into the app; I just wanted to put this out here. So here, for instance,
"business" occurs once in document one, "year" doesn't occur at all so its frequency is zero,
and so on and so forth. What we are seeing here is very important, folks: the DTM.
Just realise what we've done. We've taken unstructured raw text, and what we are getting out
is a matrix. The frequency table is also a matrix, which implies that the whole battery of
operations applicable to matrices, all the statistical machinery that applies to matrices, is
now in play. A simple trick, bag-of-words, and look at what we get out of it: very powerful.
Let's get there next.
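As a sketch of how such a DTM is built programmatically, here is one way using Python's scikit-learn (an assumption on my part; the course app's own code is not shown):

```python
# Build a small document-token matrix: rows are documents, columns are
# tokens, and cells hold counts. stop_words="english" drops common stopwords.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["business grew this year and last year",
        "cloud revenue grew again this quarter",
        "business is moving to the cloud"]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # the token columns
print(dtm.toarray())                # e.g. "year" occurs twice in document 1
# Setting ngram_range=(2, 2) would tokenise into bigrams instead.
```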



Video 4: Basic Text Analytics App Demo: Earnings Call
Welcome, folks, and let us now explore the basic text-an app. If you open the app, this is
what it should look like. These are the UI, or user interface, elements, and these are your
various output tabs; we will go through each of them. Let us first start with reading in the
dataset. This is a slightly large app, so this button here is important. Let's upload the
data first.
Click on "Upload File" and read in "IBM Quarter 3 2015 Results - Earnings Call Transcript",
it's a dot txt file, just to read it in. If there is no document ID, the app will create one and this
will come. If there are multiples, if you have a CSV file with multiple columns then these two
will help us identify which column we want to enter. Okay, now this is where we are. Please
click on "Apply Changes". Let's get to "DTM".
So, this is the first tab. DTM stands for Document Token Matrix; in this case it's a token
document matrix, a term document matrix. There are 257 documents in the corpus and about 638
unique tokens which are not stopwords. This is the word cloud of the tokens. Let's increase
this and make it 100 words; just do that, and there you go, you have 100 words. "Business"
has the largest font. "Business" also comes up here, followed by "year" and "quarter". So,
these are the top three words by frequency in the corpus, the top three tokens by frequency,
and here you have their actual counts in the corpus: "business", "year", "quarter", "cloud",
and so on.
Okay. Now, we can go here and have a look at "TF-IDF". This is another way to weigh the
document token matrix; I will go into what it means down the line. You will get a slightly
different word cloud, "business" still on top, but with a slightly different composition and
different weights. We'll get into what this is later.
Go to the third tab, "Term Co-occurrence", and you should see something like this. Bring your
mouse here, scroll up and down, and you can see it move; left-click and drag it around and it
will move. So, "cloud": notice the green-coloured nodes and the pink-coloured nodes. The
green-coloured nodes are what we call central nodes and the pink-coloured nodes are what we
call peripheral nodes. In this case, I have four green nodes; I can increase their number.
The top four most frequent tokens become your central nodes and everything else becomes
peripheral; that's the idea. You can likewise increase the number of connections of
peripheral nodes per central node. And so, this is what it looks like for now.
The same applies to TF-IDF; it will also give you what is called a COG, a co-occurrence
graph. Alright, next we go to "Bigram". So far, so good. We checked DTM: it gives you the DTM
size, it shows you a 10x10 view, a small version with the most frequently occurring terms and
the first few documents, the word cloud and the weight distribution. Then you have the same
thing for TF-IDF, which I will come to, and then we saw the "Term Co-occurrence" graph. Let
me go back to the default; it's just easier to see. There you go, and so on. What these are
and how to interpret them, I will talk about later.
Let's now go to "Bigram". What is a bigram? A bigram is two words that occur consecutively,
one before the other, in the corpus, multiple times. So, patricia_murphy, a proper noun, and
martin_schroeter occurred 15 times each. This is the word cloud for the bigrams. Notice the
underscore here: what the underscore does is take two different tokens and stitch them into
one token, to be used as a single token by the machine. And strong_growth, fourth_quarter,
cash_flow, cloud_based, revenue_grew, services_business, cloud_solutions: these kind of make
sense, and so on. And you can go on and see what it is saying: unstructured_data,
mainframe_cycle, and so on.
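As a rough sketch of the counting behind this tab (my own illustration, not the app's code):

```python
# Count adjacent word pairs; frequent pairs get stitched into single
# underscore-joined tokens, the way the app displays them.
from collections import Counter

tokens = ("strong growth in cash flow and strong growth in cloud revenue "
          "this quarter").split()

bigram_counts = Counter(zip(tokens, tokens[1:]))  # n tokens give n - 1 pairs
for (w1, w2), count in bigram_counts.most_common(3):
    print(f"{w1}_{w2}: {count}")                  # strong_growth: 2, ...
```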

So, bigrams can be quite useful in checking out what's going on, what people are talking
about, and so on. Then let's come to "Concordance". This is what it looks like, and this is
the place that says "Enter word in which you want to find concordance". What does concordance
do? The concordance gives you a context window. A concordance window of five means that it
takes the five words before a focal word, here "good", and the five words after it, and puts
them together to show to us. So, let me replace this with, say, watson, and click on "Apply
Changes", and you will get "of merge healthcare to give watson the ability to see million",
"on two of for our platforms watson and bluemix".
Now suppose I want to see more of this context window, more words before and after watson:
let's drag this around. I'm putting this at 12, so the half-window size before and after the
focal word is now 12 words. You can extend it all the way to 100 words, and then it would
become a very large window, but that's what you would get.
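The window itself is easy to reproduce; here is a minimal sketch (assuming plain whitespace tokens, unlike whatever cleaning the app applies internally):

```python
# Print a half-window of words on either side of each mention of a focal
# token, mirroring the app's concordance slider.
def concordance(text, focal, half_window=5):
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip(".,;") == focal.lower():
            start = max(0, i - half_window)
            print(" ".join(words[start : i + half_window + 1]))

corpus = ("the acquisition of merge healthcare to give watson the ability "
          "to see millions of medical images")
concordance(corpus, "watson")  # raise half_window for a wider context
```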
Finally, go to "Downloads". You can just click on this and it will download the 257x638 DTM;
this particular matrix, you can download it. You can likewise do the same for the TF-IDF
weighing of the DTM; I will come to what this means. And you can also download the bigram
corpus. What does that mean? It means that in the original corpus, all of these bigrams will
be replaced by versions with the underscore between them. So, you can download it and see,
and then you can read it in again.
One other thing that I forgot to show you is the power of this thing called stopwords. Now,
what is the most frequently occurring word? "business", just as an example. Suppose I type
business here as a stopword; notice all of these are lower case, we deliberately do this,
yeah? Then click on "Apply Changes". Voila! "Business" is gone: I have stopped the word
"business" from entering the analysis. That's what a stopword means. So, "business" is gone,
there it is.
Suppose I also want to get rid of, let's say, "question". Type "question" here, so both
"business" and "question" are stopwords in this case, and click "Apply Changes". Voila,
"question" is gone. It is gone from the analysis completely; it won't show up here, or
anywhere, now. Alright, okay. Please practise this a couple of times, and with that we can
return to the slides.
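For reference, here is a sketch of what custom stopwords amount to programmatically, again assuming the scikit-learn vectoriser from the earlier sketch (the app's stopwords box does the equivalent):

```python
# Extend a standard English stopword list with our own lowercase entries;
# "business" and "question" then never enter the DTM.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

custom = list(ENGLISH_STOP_WORDS) + ["business", "question"]
vec = CountVectorizer(stop_words=custom)
vec.fit(["the business grew", "a question arose", "cloud revenue grew"])
print(vec.get_feature_names_out())  # no 'business', no 'question'
```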

Video 5: Basic Text-An Example: Earnings Call Transcript


Now that we've seen the Basic Text-An app and how it works, what I will do next is walk you
through a few summary highlights of what we saw as they pertain to the earnings call
transcript dataset, the IBM earnings call dataset.
Recall that we saw the DTM tab. We also saw how to use stopwords to drop irrelevant tokens.
This is important: stopwords can be a powerful way to filter a lot of noise out of a dataset.
We saw how to use the slider to set the number of words in a wordcloud, and how to interpret
a wordcloud.
We also saw a similar tab called TF-IDF, which I will explain shortly; we hadn't gotten
through that in the walkthrough of the app. We asked questions such as: What are the most
frequently occurring words in the corpus? What happens when we use custom stopwords to remove
the top one or two words? How does the ranking change for the
other tokens, and so on? We then moved beyond wordclouds to COGs, co-occurrence graphs,
because wordclouds can only do so much.
Wordclouds are corpus-level things; if I want document-level analysis, I need a sharper tool.
Hence, we examined co-occurrence graphs, which basically tell us which token pairs tend to
co-occur somewhere in a document, across documents in the corpus. For instance, in the IBM
transcript example, we saw that cloud tended to co-occur with data (data in the cloud), and
that cloud tended to co-occur with platform (the cloud platform), cloud solutions, cloud
services, and so on, all of which is eminently sensible in IBM's case. We also saw we could
control the display of the COG by controlling the number of central nodes in the COG as well
as the maximum connections to peripheral nodes. We asked questions such as which tokens
tended most to co-occur in the same documents, and so on.
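The counting behind a COG can be sketched quickly (my illustration: binarise the DTM and cross it with itself, assuming document-level co-occurrence):

```python
# cooc[i, j] counts how many documents contain both token i and token j;
# the most frequent tokens become central nodes, the rest peripheral.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data in the cloud", "cloud platform services",
        "cloud solutions and cloud services", "business grew this year"]

vec = CountVectorizer(stop_words="english")
X = (vec.fit_transform(docs).toarray() > 0).astype(int)  # binary DTM

cooc = X.T @ X                     # token-by-token co-occurrence counts
np.fill_diagonal(cooc, 0)          # ignore self co-occurrence
tokens = vec.get_feature_names_out()
i, j = np.unravel_index(np.argmax(cooc), cooc.shape)
print(tokens[i], "+", tokens[j], "co-occur in", cooc[i, j], "documents")
```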
There is a similar COG for the TF-IDF scheme, and I will come to that next. We saw bigrams;
we checked the Bigrams tab. Bigrams, remember, are two tokens which tend to follow one
another consecutively, and bigrams to that extent are borrowed from NLP, because order
matters in this case.
We saw that any sentence with n words can have at most n - 1 bigrams. We saw that when you
look at bigrams, you get a whole different set of aspects of the corpus than when you look at
unigrams alone. We also saw, in the Concordance tab, a context window of customisable size
around key tokens of interest.
For instance, with Watson we saw every mention of Watson with a customisable context window
around it: so many tokens before and so many tokens after every single mention. And why just
Watson? You could use other words to get a feel for the corpus. This too is an NLP borrow, in
some sense, because it gives me the exact order in which words tend to occur around any
mention of our focal token.

Video 6: Text Analysis: DTM and TF-IDF


Alright. Now we are going to look at the DTM and what is called TF-IDF, which is a way of
weighing the document term matrix. Let's have a look at this.
The DTMs, as you recall, are the matrix objects output by the basic text-an app. Now, there
are two ways to weigh a DTM. There is TF, token frequency or term frequency, the simple one
we've already seen. And there is another, called TF-IDF: TF is again term frequency, and the
IDF part is inverse document frequency.
What does this new method of weighing the DTM imply? What does it achieve? What are its
applications? Let's have a quick look. Let me walk you through an example to demonstrate what
it is that IDF does.
Consider a hundred-document corpus of Nokia Lumia reviews on Amazon. The Nokia Lumia was, I
believe, Nokia's last smartphone, from the 2013-14 timeframe, and interestingly it ran
neither iOS nor Android; it was a Windows operating system phone. Consider a DTM of Nokia
Lumia reviews: I run the reviews through basic text-an and I get a DTM, the document token
matrix. Suppose there are 100 documents, 100 reviews, and suppose the tokens are bigrams:
screen size, battery life, windows OS, app ecosystem, and so on. That's basically what you're
seeing.
Now, look at document number 50, reviewer number 50. This person mentions screen size four
times and battery life zero times. Now, if I make the assumption that the more
times you mention a feature, the more you care about it (not an unreasonable assumption),
then would it be unfair to say that reviewer number 50 cares more about screen size than
about battery life? No; it's eminently reasonable.
Compare that with reviewer number 70. This person mentions screen size once and battery life
twice, and so appears to care more about battery life than screen size. Fair enough, right?
Fair enough. Now, what I am going to ask you is a bit of a googly: who cares more about what?
Can I say that number 50 cares more about screen size than number 70 cares about battery
life? You might say, how can you compare apples and pineapples; can we even make a comparison
of this sort? Hold on to that. That's where we are headed with IDF.
Look at the very last row in that table; it says document frequency. What does it mean? It is
the average number of times each token appears per document in the corpus. So, screen size
occurs 200 times in a corpus of 100 documents. What is its document frequency? Two. What does
that mean? On average, screen size tends to occur about twice per review; it tends to get
mentioned twice. Battery life, on the other hand, occurs 50 times in the entire corpus of 100
documents, so battery life has a document frequency of 0.5. What does that mean?
Every other reviewer tends to mention battery life. That's basically it; that is document
frequency.
All right, now let's get the game going. What does IDF, inverse document frequency, imply? If
screen size has a document frequency of two, its IDF, inverse document frequency, will be one
over two. Similarly, for battery life, 0.5 becomes one over 0.5, that is, two.
I now multiply each token frequency number by the IDF number. What am I doing? I'm reweighing
each of these token frequencies with the inverse document frequency. What does it imply? It
says that it matters how often a token occurs in the corpus overall. If there is a token that
tends to get mentioned all the time, is it really all that important? If there is a token
that gets mentioned rarely overall but has been mentioned a lot in one particular review,
does that not make it more important? Those are the kinds of questions an IDF approach tries
to answer.
Now, let's see what happened here. When I take the inverse of document frequency and multiply
it with TF, the token frequency, these are the numbers I get. Look at reviewer number 50
again: the token frequency score was four, and the TF-IDF score is two. Number 70: the token
frequency score was two, and the TF-IDF score is four. What has happened? The two have
flipped. Under TF, I would say number 50 cares more about screen size. Under TF-IDF, I would
say number 70 cares more about battery life than 50 does about screen size.
Why does it matter whether they care more about something? Because they would be willing to
pay more for it. The willingness-to-pay argument in marketing is a very important one; it
matters what a customer cares about. And here we have two different approaches to weighing
the DTM giving me two different results. Which one should we trust? Which one would you
trust? I'll come back to this question by the end of this slide. What I showed was
illustrative; there are different methods of computing the IDF weights. Typically, we use
logarithms in there, because token frequencies tend to follow power laws.
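Here is the lecture's Lumia arithmetic as a quick sketch (using the simple 1/df weighing from the slide; as just noted, real implementations typically use a logarithmic IDF such as log(N/df)):

```python
# Reweigh raw token frequencies by inverse document frequency and watch
# the doc-50 versus doc-70 comparison flip, as in the example above.
n_docs = 100
total_mentions = {"screen size": 200, "battery life": 50}

df = {t: m / n_docs for t, m in total_mentions.items()}  # 2.0 and 0.5
idf = {t: 1 / f for t, f in df.items()}                  # 0.5 and 2.0

reviews = {"doc 50": {"screen size": 4, "battery life": 0},
           "doc 70": {"screen size": 1, "battery life": 2}}

for doc, tf in reviews.items():
    print(doc, {t: tf[t] * idf[t] for t in tf})
# doc 50 {'screen size': 2.0, 'battery life': 0.0}
# doc 70 {'screen size': 0.5, 'battery life': 4.0}   <- the flip
```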
Alright, now to the TF-IDF tab. You will see it is very similar to the DTM tab, but the
numbers are different now: the size of the matrix is the same, but the weighing is different.
What does that mean? Let's have a look.
If you look at the wordcloud that comes out of TF-IDF, the words are different now. The top
words are different; the rankings have changed. IDF gives those tokens that didn't have a
chance to shine before a chance to shine now, because their document frequency
was low, so their inverse document frequency becomes high. Tokens that are mentioned only
occasionally overall but tend to get mentioned in concentrated places basically get bumped
up.
All right, in this particular case, as you can see, it is people's names, actually proper
nouns, that come up, whereas under TF in the previous one it was object names that came up as
most important. Which of the two should you prefer? There are only two weighing schemes, TF
and TF-IDF. Which should you prefer? Try both. It is very hard to say a priori which will
work better for a given corpus, TF or TF-IDF. In general, you should try both and see;
whichever makes more sense is the one to go with.
Lastly, let me quickly recap what we did with basic text-an; after this, we will proceed to
sentiment analysis. Basic text analysis: here is the question, folks. What have we been able
to accomplish with just elementary text analysis, elementary being the key word?
One, we were able to rapidly, scalably, and cheaply crunch through raw text input. This was a
small file; you could input a big file, it would take a few seconds more, but it would give
you the answer just the same. We were able to transform unstructured text data into token
counts, from token counts to frequency tables, and thereby to a finite-dimensional
mathematical object, the matrix object, the DTM. The moment the DTM comes into the picture,
folks, a lot of operations, all matrix operations, are in play: a huge step forward. We also
saw some display options: wordclouds and co-occurrence graphs. And we saw some ways of
borrowing from NLP: bigrams, and concordances that give us context windows. So, just with
elementary text analysis, we were able to achieve quite a bit.

Video 7: Sentiment Analysis


All right, having seen basic text analysis, let's now come to sentiment analysis. The next
step is sentiment mining. What we are going to do is elementary sentiment mining, the key
word being elementary. Let's have a look at this.
I'm going to show you two review excerpts of a popular TV series, and I want you to take a
guess at their sentiment content. This is the first one: "This is one of the best TV series I
have seen in a long time. I am yet to read the books, but if the TV series is anything to go
by, the books will be outstanding." Folks, is this review positive in general, is it
negative, is it neutral, is it more positive than negative? What do you think? Why do you
think so? Hold on to that.
Take a look at the second review, a little more complicated: "After seeing today's Game of
Thrones, I realised the author of the serial as well as HBO may need professional medical
assistance. Since the beginning, hard to find anything that would leave anyone feeling good."
Folks, this second review: is it positive, is it negative, is it neutral, is it more one way
than the other, is it a mix of both? What is it? If you have formed your opinions, then may I
ask you: what tokens in particular led you to think one way or the other?
Now, in the first review: "one of the best TV series", best; "the books will be outstanding",
outstanding. These, best and outstanding, are adjectives, and it's adjectives we focus on
when we do sentiment mining. Both of them give it away: this is a positive review, a hugely
positive review.
Now look at the second one. The second one is tricky. Why? Because there are no clear
adjectives per se. "HBO may need professional medical assistance", really? I mean, you won't
find "professional medical assistance" in an adjective list, right? This is very contextual,
and hard to get at.
So, sentiment mining can trip up sometimes, because what it does is focus on matching
sentiment-laden words, adjectives. In fact, a machine might actually go wrong here. It would
look at "good" in "hard to find anything that would leave anyone feeling good", pick the
"good" up and say, hey, this is positive, possibly. It would pick up "hard" and say, okay,
maybe this is negative. Positive and negative cancel out, so the verdict is "mixed". That's
what it might say, yeah. Hold on to that; we're headed there. But more generally, the
question arises: it's not just about direction, positive or negative; can I also measure
magnitude? How positive, how negative? It turns out we can. That is where we're headed.
Some conceptual, definitional preliminaries first. One, what is sentiment mining? Sentiment
mining is a process, basically an attempt to detect, extract and measure value judgments and
opinions: measuring right and wrong in some sense, measuring good and bad. It is emotional
content that we are trying to measure in text data. Why do we care about sentiment? Why
should businesses care about sentiment?
As marketers would know, it is more important to figure out what people feel about your brand
than what they think about your brand, because people's thinking may change much more easily
than their feelings do. Important, in some sense. How is sentiment measured? Well, the
technical term for it is valence.
Valence is a continuum: positive, neutral, negative; a text sits somewhere on that continuum.
Valence can be measured and scored, and it makes sense to build our own context-specific,
domain-specific, vertical-specific valence scoring scheme. You can rely on a general
sentiment dictionary, but really, what are the odds it will serve your particular domain?
What I'm going to do next is walk you through an illustrative exercise.
Which movie is this? Two major parts, a classic? What I am going to do, in some sense, is
walk you through the review corpus of Bahubali. Please open Bahubali reviews.txt. This is
what it looks like, and these are the questions we are going to ask after we feed it into the
sentiment mining app.
One, what experiences and associations emerge from the corpus? Two, what sentiments
predominate around which attributes of the experience? All of these things are important.
Three, how do people feel about the movie? What do they compare it with? People compared it
to Sholay, which is one of the all-time classics out there. In some sense, what is the
consideration set, what is the comparison set? These are questions that will come up, and by
combining the basic text app with the sentiment mining app, we can start to answer such
questions for reviews running into the hundreds and thousands. That's where we are headed.
Let us open the sentiment analysis app and walk through it.

Video 8: Sentiment App Demo: User Reviews


Alright folks, let us now explore and navigate the sentiment analysis Shiny app. Have a look
at this; this is what the app looks like. These are your UI elements, and these are your
output tabs. Precious few UI elements. You have something called the sentiment dictionary;
we'll go with AFINN by default. You have a user-defined one in case you want it, and this one
is for finance-specific tokens. NRC I have removed because there are some permission issues;
everything else is there. I'll come to what Document Index does.
First and foremost, let us read the data in. You have three tabs: sentiment plot, stats, and
document analysis; let's have a look at what they do. First off, let's click this and read in
the data, "Bahubali reviews". There it is; I have read it in. Please go to Sentiment Plot.
What does it show you? Well, these are your document numbers. There are about 360-odd
documents, and the document numbers are not consecutive; there are jumps and gaps in there.
Now, hover on a point: if you put your cursor here, for instance, it will give you the
coordinates of that point. Each of these lines represents a document. The zero level
represents neutral sentiment; anything above the zero level is net positive sentiment, and
anything below the zero line is net negative sentiment. Basically, the reviews of Bahubali
seem to be more positive than negative; you have way more lines above the zero than below it.
Now you might wonder: okay, but I want to know what a specific review is saying. Let's go to
the tallest one of the lot, I think, and see.
So, this is review number 405, and its net positive score is 76; that's basically plus 76.
Now, one might wonder, why is review number 405 so positive? Type the document index, press
Enter, and then go to Document Level Analysis. What it will do is break the document down
into individual sentences, and it will also list the tokens that carry sentiment. Now, the
AFINN sentiment scoring scheme goes from plus five to minus five. So, these words in black
are plus five, these words colour-coded in grey are plus four, these ones are, I guess, plus
three, plus two, plus one, and so on.
So, if you go back to the Document Level Analysis, these were the words given a score of
minus two, these were the words in the review given a score of minus one, and so on. And you
can continue through each sentence in the review: it is breaking the review down, parsing it
into sentences, and then giving you the net sentiment per sentence.
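The mechanics of that per-sentence scoring are simple to sketch (a toy lexicon below; the real AFINN list scores a couple of thousand words on an integer valence scale):

```python
# Toy AFINN-style scorer: sum the valence of matched words per sentence.
toy_afinn = {"best": 3, "outstanding": 5, "excellent": 3, "pleasant": 3,
             "good": 3, "hard": -1, "worst": -3, "bad": -3}

def score(sentence):
    return sum(toy_afinn.get(w.strip(".,!?"), 0)
               for w in sentence.lower().split())

review = ("One of the best TV series I have seen. "
          "Excellent performances and pleasant music.")
for sentence in review.split(". "):
    print(score(sentence), "|", sentence)  # 3 and 6: net positive review
```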
Take a look at this: "There is not even a single scene where you will get a chance to blink."
This is high praise, and we know that from reading it, but the sentiment score given here is
only plus two, whereas another sentence, for instance, gets plus six.
Look at this: Ramya Krishnan and Nasser do their respective roles extremely well; excellent
performances, pleasant music, and so on. You basically get all the keyword matches, which is
why this review gets a high score. Okay, you can check out the other dictionaries also. Bing,
for instance, is a simpler one: just positive and negative. These are the negative words and
these are the positive words. If you choose Bing, the output comes down to a much coarser
kind of classification; that's basically what you will get. And the plot will change
accordingly: it will show you the reviews with high negative and high positive matches.
Well, please go through this video again, upload any other dataset and explore these options.
Play around with it a little bit till you are comfortable.

Video 9: Elementary Text Analytics: Summary


So, let me now put everything together. I'm going to summarise what we just did in sentiment
analysis and then recap the entire session. Let's have a look at the key points of sentiment
mining.
Question: What factors does good sentiment analysis depend on? We figured this out when we
went through the app: we saw that there are three or four different sentiment lexicons,
sentiment dictionaries, in there. Sentiment analysis quality depends crucially on the quality
of the wordlists of the sentiment dictionaries that we use as input.
Two: Is sentiment mining context-sensitive? Yes, big time; it has to be. The same words can
say one thing in one context and something else in another. Context in business settings, in
particular, becomes very domain- and vertical-specific. Hence, it is imperative that we
evolve our own wordlists for valence and polarity scoring, our own scoring schemes. We can
and we should do that, and the app provides a very basic way for us to do it: the
user-defined sentiment dictionary, which we can customise to our needs.
Three: Can we make running plots of sentiment or buzz over time? This is a little bit of an
introduction to the dynamics of sentiment. Can we do this? Yes, we can. Why would we want to?
It tells us where sentiment peaked, where it crashed, where it stabilised, what is happening,
in any event where chronology matters.
Can you think of any major sources of sentiment information out there for your firm, your
product, your brand, your vertical, your industry? There are plenty of sources out there,
and, armed with the tools we have now seen, there is something we can do about it: we can
pick that data up, feed it through the app, interpret the output, and generate actionable
insights.
With that, let me now combine the two big apps of today, basic text analysis and sentiment
analysis, and walk through what we did in this module. We started with motivating examples of
product reviews on e-commerce sites: remember the Amazon Bluetooth speakers and Fitbit
Support. We went through the why of text analysis and, as part of that, the how of basic
text-an. We introduced the basic text-an app and ran the IBM analyst calls dataset through it
as our basic introductory, real-world dataset. And there was a graded exercise on the Amazon
product reviews corpus for the OnePlus 8.
We also did elementary sentiment mining, and in the process we did a hands-on exercise, our
introductory app exercise, with the Bahubali movie reviews, and a second graded exercise for
the module with hotel reviews. With that, we have seen these two big apps. We have used each
of them in turn, but we can actually combine them and use them together as well.
