Hello, everybody. Namaste, salaam, sat sri akal. Welcome to today's session on Text Analytics, the analysis of unstructured text. Now, this is familiar to you. This is our analytics course map. As you can see, the left-hand side shows the supervised learning approaches covered by my esteemed colleague, Professor Manish Gangwar. On the right-hand side are the unsupervised learning topics, most of which I am covering. The second from the last on the right-hand side is today's topic, Text Analysis. Without any further ado, let's get there.
Video 2: Text Analytics: Motivating Example
All right. So let me motivate the why of text analytics using two small, simple use cases. The first use case goes this way. The firm is Amazon. The category is Bluetooth speakers, speakers that work via Bluetooth. The task is to develop specs for an own brand, from Amazon's point of view, in roughly the 150-dollar price range. So this is a US example. This is what the speakers look like; this one, for instance, is JBL. Now, an analyst team at Amazon gets to work. They scrape all Amazon reviews for the top five brands in the Bluetooth speaker category. The data that comes in includes the customer rating for each brand, the price of each brand, and customer reviews, which contain unstructured text. This is a snapshot of what the data look like. Who is the manufacturer? What is the average rating? What is the price? These are easy to see; they are attributes within the review itself. However, have a look at this: high rated features, low rated features. These are fished out of the text. The unstructured text contained features mentioned in different ways somewhere inside it. How did these features get captured? Further, it says high rated and low rated. How do we know which one is high rated and which is low rated? Some sort of sentiment associated with the mention of each feature would be required. The approach would be to use customer ratings to find which product features influence scores the most. So, in some sense, there is also a bit of supervised learning built into this particular kind of problem. One of the important insights that emerged from this exercise was that you have to look at the low rated features, not just the high rated ones, if you are going to launch in that category. You have to know not just what to do but also what not to do, in some sense. Well, where are things today?
Last heard, Amazon is considering product design and development of its own private-label Bluetooth speaker series, which it might launch at some point. Here is another use case. The firm is Fitbit, and we are all familiar with fitness trackers; that's what they make, and they are everywhere. The task is to get product feedback: what are consumers who use Fitbit trackers saying about them? It turns out that Fitbit has a Fitbit Support Twitter handle where people write to Fitbit saying,
Applied Business Analytics Page 1 of 12
okay, I have this problem, this is not working, how do I solve this? So I mined some tweets from there. An analyst team guided by a manager got to work; the idea was to filter the feedback by product type and see what customers are saying about the different fitness trackers in the Fitbit range. This is what the results typically look like, and this is coming directly from the Twitter API. It turns out that different products have different grievances, different issues associated with them. For the Fitbit Charge HR, the strap was an issue: it tended to break, bubbles would develop, the rubber strap would peel off, and so on. The Fitbit Blaze had a different set of issues. The operating system itself was a problem; a lot of people had trouble getting beyond the logo screen on these trackers. What is going on here? One, these are secondary-data-based insights: the data is already there, and we are using text analysis to find the insights. Two, there are third-party opportunities. Who is this kind of data useful for? It is not just Fitbit that should be aware of these things. Consider a third-party supplier: if I make straps compatible with Fitbit trackers, then I would also monitor Fitbit Support and pitch my straps to those customers, effectively targeted marketing in some sense. Let me quickly conclude these two use cases. What are we trying to motivate? More generally: one, there is informational value locked inside unstructured data. The data looks unstructured; two people could be saying the same thing, but what are the odds they will use exactly the same words? There is a lot of informational value locked inside what is called unstructured text. And this value is for brands, for platforms, for third parties, you name it. Two, the way to unlock it is to impose preliminary structure on the text; you cannot analyse something that is purely unstructured.
The only way to analyse it is to impose structure on that content. So this is what we will do with text as well. We will find out how to impose structure, what structure to impose and, as a result, what we will get out of it. The constructs of interest can be found inside the data provided we know where to look and how to look, which is where we are headed today. If you want to automate text analytics, and that is fairly commonplace nowadays, you need a lot of what is called training data. In the absence of such training data, we will, wherever required, use manual judgement instead. However, trained models are available, and we are going to invoke some of those as well.
Video 3: Text Analytics: The Why and How
Alright, so let's now head into text analytics: the "why" and the "how" of text-an, so to say. For the why part, we will consider a few scenarios; for the how part, we will see two basic approaches as well as an introduction to some terminology. So, let me start with the why of text-an. Consider this scenario. A manager wants to know the broad contours of what customers are saying when they call into the company's call centre. Calls may be recorded for training and quality purposes partly because the firm wants to know what customers are saying, what broad buckets are emerging from calls into the call centre. Two, a firm might want to know if there is a persistent pattern in the content of customer feedback records: customer complaints to a grievance cell, service calls, emails. Whatever the customer is saying, all of this involves text.
The first one involves voice converted to text. The second one involves text directly. An investor might want to know the major topics in press coverage of a company he or she is interested in. In all these cases, we have unstructured text, and we want to analyse such unstructured, open-ended text data, data which may or may not be amenable to measurement and scaling by traditional methods. In fact, we will impose preliminary structure whereby such measurement becomes possible. We will see the primary data scales shortly and put them to work. All right, the how of basic text-an. The why part is done, so let me get to the how. Broadly speaking, there are two basic ways to handle text analytics. One is the NLP or Natural Language Processing approach, and the other is the text-mining bag-of-words approach. Both are related in some sense. What is the big difference between the two? It is an assumption made in bag-of-words. Natural Language Processing assumes that the order in which words come in text is meaningful: change the order, and you change the meaning. That is basically the assumption NLP makes. Bag-of-words assumes that all words are exchangeable: it doesn't matter in which order the words come; you can shuffle them around and the meaning remains the same. Now, this is a strong assumption, and NLP clearly makes the more realistic one. However, even under bag-of-words, whatever meaning survives a reshuffling of the words in a passage can still be mined for insight. And bag-of-words happens to be much more practical and applicable to much larger datasets. Today we will focus a lot on bag-of-words. We will also borrow from NLP where we have to. Let's head there. Bag-of-words.
Now, this is a passage from Shakespeare's "As You Like It". "All the world's a stage, and all the men and women merely players: they have their exits and their entrances; and one man in his time plays many parts..." If you're an English major, you might find deep meaning and insight in that. But if you're a data scientist, this is what you would see: effectively, one frequency table. "World" occurs once; in "world's", the apostrophe and the s are cut, the suffix is taken out and only the root remains. "Play" occurs twice. Where? In "players" and "plays". As you can see, again the suffixes are cut and only the word root remains. Why do we do this? If we didn't, we would have a proliferation of tokens: "play" would be one token, "plays" another, "players" yet another, and so on. So, in some sense, we take that part out. This is the bag-of-words approach. Have a look at this: these are two different documents, so I can construct a frequency table for each in the same way. I'm going to get into how this is constructed and what happens. Bag-of-words is very powerful as an approach. It is a logical, mechanical way to input text, and it remains really the only way a machine can handle text. The machine reads text in as characters; characters form words, words form sentences, sentences form paragraphs, and different units of analysis can be taken. Now, let's look at a real dataset to see where we're heading. This is the dataset: the transcript of a quarterly earnings call. What are quarterly earnings calls? Why are they done? Typically, these are teleconferences held every quarter by the top leadership of public companies. On the other side of the telecons are equity analysts who ask questions about the company's earnings guidance and performance, and so on. Quite clearly, these are important.
Quarterly earnings calls tend to have a significant impact on equity prices, the company's share prices, and so on.
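Before we load the real dataset, the frequency-table idea from the Shakespeare passage can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the crude suffix-stripping rule here is a toy stand-in for a real stemmer (such as the Porter stemmer), used only so that "plays" and "players" collapse to the root "play".

```python
from collections import Counter
import re

text = ("All the world's a stage, and all the men and women merely players: "
        "they have their exits and their entrances; and one man in his time "
        "plays many parts")

def crude_stem(word):
    # Toy suffix stripping, for illustration only.
    for suffix in ("ers", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lowercase, pull out alphabetic tokens, drop single letters, stem.
tokens = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 1]
freq = Counter(tokens)
print(freq.most_common(5))
```

Note how "players" and "plays" both contribute to the count of the single token "play", exactly as in the lecture's frequency table.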
What I am going to use today as my example dataset is the call transcript of IBM's Q3 2015 earnings call. Now, IBM has been having a tough decade. The Watson gamble was, in some sense, a big one, but the numbers have not been as great as in the decade prior. There it is. This is what the dataset looks like. By default, each line in this text file is what we call a document; each line is a document separated by a line break. Documents can be short or long; in this case, they're fragments of a sentence. The entire stack of documents is what we call a corpus. Corpus here stands for body of text, and the etymological root is the Latin "corpus", meaning body. So a corpse is a dead body, a corpus here is a body of text, and incorporation is the process of giving a business a body in law; most of it comes from the same Latin root. What kind of insights could we hope to find in this text? This is a quarterly earnings call transcript. What kind of insights? I'm going to walk you through the basic text analysis app next. We will read this file in and analyse it using bag-of-words as well as a little bit of NLP; we have a mixture of both approaches in what follows. This is what the app will look like. We will examine the interface, the input fields and the output tabs, what goes in, and then go through each bit of output that comes from it, and see how the questions we have in mind get answered by combining the outputs this tool gives. Before I proceed into the tool itself, let me go through some more terminology. One, tokenisation. Tokenisation is the process of breaking the cleaned corpus up into individual terms, words basically. Let me give you an example. "Business school" is a two-word token; it's a single token, called a bigram, with two words in it.
"Business" and "school" separately would be one-word tokens, unigrams. "Big data", for instance: "big" and "data" are two different unigrams; put together as a phrase, they become one bigram, and so on. Once we have tokenised a corpus into tokens (a token could be a word, a two-word phrase, a sentence, even a paragraph; typically we go with one-word tokens, which will be our default), we can build a simple frequency table. What does this frequency table do? Pretty much what we saw in the Shakespeare example: it takes each token and tells you how many times it occurred. If I have 10 documents, I will have a frequency table with 10 rows: each document is a row, the columns are my tokens, and the table tells me how many times each token occurred in each document. This is what the output looks like. It's called a document token matrix: documents are the rows, tokens are the columns, and the cells are populated by frequencies. We will see this as we get into the app; I just wanted to put this out here. So here, for instance, "business" occurs once in document one, "year" doesn't occur at all so its frequency is zero, and so on. What we are seeing here is very important, folks: the DTM. Just realise what we've done. We've taken unstructured raw text, and what we are getting out is a matrix. The frequency table is also a matrix, which implies that the whole battery of operations applicable to matrices, all the statistical machinery that applies to matrices, is now in play. A simple trick, bag-of-words, and what we get out of it is very powerful. Let's get there next.
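The DTM construction described above can be sketched directly. This is a hand-rolled illustration with three made-up one-line documents; in practice a library routine (for example scikit-learn's CountVectorizer) does the same thing at scale.

```python
from collections import Counter

# Rows are documents, columns are tokens, cells are counts.
docs = [
    "business grew this year",
    "cloud revenue grew this quarter",
    "business and cloud services",
]

tokenised = [d.lower().split() for d in docs]
vocab = sorted({tok for doc in tokenised for tok in doc})

# The document token matrix: one row per document, one column per token.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenised]

for row in dtm:
    print(row)

# Column sums give corpus-level token frequencies (the word-cloud counts).
corpus_freq = {term: sum(row[j] for row in dtm) for j, term in enumerate(vocab)}
```

Once the text is in this matrix form, any matrix operation, column sums, row sums, decompositions, distances, applies, which is exactly the point made above.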
Video 4: Basic Text Analytics App Demo: Earnings Call
Welcome, folks, and let us now explore the basic text-an app. If you open the app, this is what it should look like. These are the UI or User Interface elements, and these are the various output tabs; we will go through each of them. Let us first start with reading in the dataset. This is a slightly large app, so this button is important. Let's upload the data first. Click on "Upload File" and read in "IBM Quarter 3 2015 Results - Earnings Call Transcript"; it's a .txt file. If there is no document ID, the app will create one. If you have a CSV file with multiple columns, these two fields help us identify which column we want to use. Okay, now this is where we are. Please click on "Apply Changes". Let's get to "DTM". This is the first tab. DTM stands for Document Token Matrix; the app may display it transposed, as a Token Document or Term Document Matrix. There are 257 documents in the corpus and about 638 unique tokens which are not stopwords. This is the word cloud of the tokens. Let's increase this and make it 100 words. There you go, you have 100 words. "Business" has the largest font. "Business" also comes up here, followed by "year" and "quarter". So, these are the top three tokens by frequency in the corpus, and here you have their actual counts: "business", "year", "quarter", "cloud", and so on. Now, we can go here and have a look at "TF-IDF". This is another way to weigh the document token matrix; I will go into what it means down the line. You will get a slightly different word cloud, "business" still on top, but with different weights. We'll get into what this is later. Go to the third tab, "Term Co-occurrence", and you should see something like this.
Now, bring your mouse here and scroll up and down and you can see this move; left-click and drag it around and it will move. So, "cloud" is a green-coloured node, and there are pink-coloured nodes. The green-coloured nodes are what we call central nodes and the pink-coloured ones are peripheral nodes. In this case, I have four green nodes; I can increase their number. The top few most frequent tokens become your central nodes and everything else becomes peripheral; that's the idea. You can likewise increase the number of connections of peripheral nodes per central node. So, this is what it looks like for now. The same applies to TF-IDF: it will also give you what is called a COG, a Co-occurrence Graph. Alright. Next, we go to "Bigram". So far, so good. We checked DTM: it gives you the DTM size, shows a small 10x10 version of the most frequently occurring terms in the first few documents, the word cloud and the weight distribution. Then you have the same thing for TF-IDF, which I will come to, and then we saw the "Term Co-occurrence" graph. Let me go back to the defaults; it's just easier to see. There you go. What these are and how to interpret them, I will talk about later. Let's now go to "Bigram". What is a bigram? A bigram is two words that occur consecutively, one right after the other, multiple times in the corpus. So, patricia_murphy, a proper noun, and martin_schoeter occurred 15 times each. This is the word cloud for the bigrams. Notice the underscore here: what it does is take two different tokens and stitch them into one token, which the machine will then treat as a single token. And strong_growth, fourth_quarter, cash_flow, cloud_based, revenue_grew, services_business, cloud_solutions, all of which kind of makes sense, and so on. And you can go out and see what it is saying: unstructured_data, mainframe_cycle, and so on.
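The underscore stitching just described is easy to sketch. This is an illustrative fragment with a made-up sentence, not the app's actual code; it also demonstrates a useful property of bigrams, namely that a sentence with n tokens yields exactly n - 1 of them.

```python
from collections import Counter

# Consecutive token pairs, stitched with an underscore as in the app.
def bigrams(tokens):
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

tokens = "cloud revenue grew and cloud services revenue grew".split()
counts = Counter(bigrams(tokens))
print(counts.most_common(3))

# A sentence with n tokens yields exactly n - 1 bigrams.
assert len(bigrams(tokens)) == len(tokens) - 1
```

Each stitched pair like "revenue_grew" is then treated as a single token in the downstream frequency tables and word clouds.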
So, bigrams can be quite useful in checking out what's going on, what people are talking about, and so on. Then let's come to "Concordance". This is what it looks like, and this is the place that says "Enter word in which you want to find concordance". What does concordance do? It gives you a context window. A concordance window of five means that it takes the five words before each occurrence of the focal word, here "good", and the five words after it, and puts them together to show to us. So, let me replace this with, say, watson, and click on "Apply Changes", and you will get "of merge healthcare to give watson the ability to see million", "on two of for our platforms watson and bluemix". Now suppose I want a larger context window, more words before and after watson: let's drag this around. I'm putting this at 12, so the half window size, the number of words before and after the focal word, is now 12. You can extend it all the way to 100 words, and then it would become a large window, but that's what you would get. Finally, go to "Downloads", where you can click and download the 257x638 DTM. You can likewise download the TF-IDF weighting of the DTM; I will come to what this means. And you can also download the bigram corpus. What does that mean? It means that in the original corpus, all of these bigrams will be replaced with the underscored versions. You can download it and see, and then read it back in. One other thing that I forgot to show you is the power of this thing called stopwords. Now, what is the most frequently occurring word? "Business", just as an example.
Suppose I type business here as a stopword. Notice all of these are lower case; we do this deliberately. Then click on "Apply Changes". Voila! "Business" is gone. I have stopped the word "business" from entering the analysis; that's what a stopword means. Suppose I also want to get rid of, say, "question". Type "question" here, so we have both "business" and "question" in this case, and click "Apply Changes". And voila, "question" is gone from the analysis completely. It won't show up here, or anywhere else, now. Alright. Please practice this a couple of times, and with that we can return to the slides.
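The stopword filtering just demonstrated amounts to dropping tokens from a blocklist before counting. Here is a minimal sketch with made-up documents; the key point is that a stopped token vanishes from every downstream display, word cloud, DTM and co-occurrence graph alike, and the comparison against the stopword set is done on lowercased tokens, which is why the app expects lowercase entries.

```python
from collections import Counter

# Custom stopwords, lowercase to match the lowercased tokens.
custom_stopwords = {"business", "question"}

def count_tokens(docs, stopwords):
    counts = Counter()
    for doc in docs:
        counts.update(t for t in doc.lower().split() if t not in stopwords)
    return counts

docs = ["Business grew strongly", "a question on business revenue"]
print(count_tokens(docs, custom_stopwords))
```

Running this, "business" and "question" never enter the counts at all, exactly like the app's behaviour after "Apply Changes".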
Video 5: Basic Text-An Example: Earnings Call Transcript
Now that we've seen the Basic Text-An app and how it works, what I will do next is walk you through a few summary highlights of what we saw as they pertain to the IBM Earnings Call Transcript dataset. Recall that we saw the DTM tab. We also saw how to use stopwords to drop irrelevant tokens. This is important: stopwords can be a powerful way to filter a lot of noise out of a dataset. We saw how to use the slider to set the number of words in a wordcloud, and how to interpret a wordcloud. We also saw a similar tab called TF-IDF, which I will explain shortly; we didn't get through that in the app walkthrough. We asked questions such as: What are the most frequently occurring words in the corpus? What happens when we use custom stopwords to remove the top one or two words? How does the ranking change for the
other tokens, and so on? We moved beyond wordclouds to COGs, Co-occurrence Graphs. We saw that wordclouds can only do so much; they are corpus-level objects. If I want document-level analysis, I need a sharper tool. Hence, we examined Co-occurrence Graphs, which tell us which token pairs tend to co-occur in the same document across the corpus. For instance, in the IBM transcript example, we saw that cloud tended to co-occur with data (data in the cloud), with platform (the cloud platform), cloud solutions, cloud services, and so on, all of which is eminently sensible in IBM's case. We also saw that we could control the display of the COG by setting the number of central nodes as well as the maximum connections to peripheral nodes. We asked which tokens tended to co-occur most in the same documents, and so on. There is a similar COG for the TF-IDF scheme, and I will come to that next. We checked the Bigrams tab. Bigrams, remember, are two tokens which tend to follow one another consecutively, and to that extent bigrams are borrowed from NLP, because order matters here. We saw that any sentence with n words can have at most n - 1 bigrams. We saw that when you look at bigrams you get a whole different set of aspects of the corpus than when you look at unigrams alone. We also saw, in the Concordance tab, a context window of customisable size around key tokens of interest. For instance, for Watson, we saw every mention of Watson with a customisable context window around it: so many tokens before and so many tokens after every single mention. And not just Watson; you could use other words to get a feel for the corpus.
This, too, is an NLP borrow in some sense, because it gives me the exact order in which words occur around any mention of the focal token.
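The context-window idea recapped here can be sketched in a short function. This is an illustrative fragment, not the app's code; the sample sentence below is a paraphrase of the transcript fragments seen in the demo, and `half_window` plays the role of the app's half window size slider.

```python
import re

# For each occurrence of a focal token, show up to `half_window`
# tokens on either side, with the focal token highlighted.
def concordance(text, focal, half_window=5):
    tokens = re.findall(r"\w+", text.lower())
    windows = []
    for i, tok in enumerate(tokens):
        if tok == focal:
            left = tokens[max(0, i - half_window): i]
            right = tokens[i + 1: i + 1 + half_window]
            windows.append(" ".join(left + [tok.upper()] + right))
    return windows

sample = ("We acquired Merge Healthcare to give Watson the ability to see. "
          "Developers build on two of our platforms, Watson and Bluemix.")
for line in concordance(sample, "watson", half_window=4):
    print(line)
```

Each printed line is one mention of the focal word with its surrounding context, which is exactly what the Concordance tab displays.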
Video 6: Text Analysis: DTM and TF-IDF
Alright. Now we are going to look at the DTM and what is called TF-IDF, a way of weighing the document term matrix. The DTMs, as you recall, are the matrix objects output by the basic text-an app. Now, there are two ways to weigh a DTM. There is TF, term frequency, the simple one we've already seen. The other is called TF-IDF: TF is again term frequency, and IDF is inverse document frequency. What does this new way of weighing the DTM imply? What does it achieve? What are its applications? Let me walk you through an example to demonstrate what IDF does. Consider a hundred-document corpus of Nokia Lumia reviews on Amazon. The Nokia Lumia was, I believe, Nokia's last smartphone, from the 2013-14 timeframe, and interestingly it ran neither iOS nor Android but a Windows operating system. I run the reviews through basic text-an and get a DTM, the document token matrix. Suppose there are 100 documents, 100 reviews, and suppose the tokens are bigrams: screen size, battery life, windows OS, app ecosystem, and so on. That's basically what you're seeing. Now, look at document number 50, reviewer number 50. This person mentions screen size four times and battery life zero times. Now, suppose I make the assumption that the more
times you mention a feature, the more you care about it, not an unreasonable assumption. If that is the case, would it be unfair to say that reviewer number 50 cares more about screen size than about battery life? It's eminently reasonable. Compare that with reviewer number 70. This person mentions screen size once and battery life twice, and so appears to care more about battery life than screen size. Fair enough. What I am going to ask you now is a bit of a googly: who cares more about what? Can I say that number 50 cares more about screen size than number 70 cares about battery life? You might say, how can you compare apples and pineapples; can we even make a comparison of this sort? Hold on to that; that's where we are headed with IDF. Look at the very last row in that table; it says document frequency. What does it mean? The average number of times each token appears per document in the corpus. So, screen size occurs 200 times in a corpus of 100 documents; its document frequency is two. On average, screen size gets mentioned about twice per review. Battery life, on the other hand, occurs 50 times in the entire corpus of 100 documents, so battery life has a document frequency of 0.5: every other reviewer tends to mention battery life. That's document frequency. All right, now let's get the game going. What does IDF, inverse document frequency, imply? If screen size has a document frequency of two, its IDF is one over two. Similarly, battery life's 0.5 becomes one over 0.5, that is, two. I now multiply each token frequency by the IDF number. What am I doing? I'm reweighing each token frequency by the inverse document frequency. What does that imply? It says that it matters how often a token occurs across the corpus.
If a token gets mentioned all the time, is it really all that important? If a token is mentioned rarely overall but a lot in one particular review, does that not make it more important there? Those are the kinds of questions an IDF approach tries to answer. Now, let's see what happened here. When I take the inverse of the document frequency and multiply it by TF, the token frequency, these are the numbers I get. Look at reviewer number 50 again: the token frequency score was four, the TF-IDF score is two. Number 70: the token frequency score was two, the TF-IDF score is four. What has happened? The two have flipped. Under TF, I would say number 50 cares more about screen size. Under TF-IDF, I would say number 70 cares more about battery life than 50 does about screen size. Why does it matter what they care more about? Because they would be willing to pay more for it. The willingness-to-pay argument in marketing is a very important one; it matters what a customer cares about. So here we have two different approaches to weighing the DTM, giving me two different results. Which one should we trust? Which one would you trust? I'll come back to this question by the end of this slide. What I showed was illustrative; there are different methods of computing IDF, and typically logarithms are used, because token frequencies tend to follow power laws. Now, if you go to the TF-IDF tab, you will see it is very similar to the DTM tab, but the numbers are different. The size of the matrix is the same, but the weighting is different. What does that mean? If you look at the wordcloud that comes out of TF-IDF, the top words are different now. The rankings have changed.
TF-IDF gives those tokens that didn't have a chance to shine before a chance to shine now, because their document frequency
was low, so their inverse document frequency is high. Tokens that are mentioned only occasionally overall, but in concentrated places, get bumped up. In this particular case, as you can see, it is people's names, proper nouns, that are coming up, whereas under TF it was object names that came up as most important. Which of the two weighting schemes should you prefer, TF or TF-IDF? Try both. It is very hard to say a priori which will work better for a given corpus. In general, try both and go with whichever makes more sense. Lastly, let me quickly recap what we did with basic text-an; after this, we will proceed to sentiment analysis. Here is the question, folks: what have we been able to accomplish with just elementary text analysis? Elementary being the key word. One, we were able to crunch through raw text input rapidly, scalably and cheaply. This was a small file; you could input a big one, it would take a few seconds more, but it would give you the answer just the same. We were able to transform unstructured text data into token counts, from token counts to frequency tables, and thereby to a finite-dimensional mathematical object, the matrix, the DTM. The moment the DTM enters the picture, folks, all matrix operations are in play; a huge step forward. We also saw some display tools: wordclouds, co-occurrence graphs, and some borrowings from NLP, bigrams and concordances that give us context windows. So, with just elementary text analysis, we were able to achieve quite a bit.
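The Nokia Lumia numbers worked through above can be recomputed in a few lines. This sketch follows the lecture's illustrative definition, document frequency as total corpus count divided by the number of documents, and IDF as its reciprocal; as noted, real implementations usually use log-scaled variants instead.

```python
# The lecture's illustrative TF-IDF example, recomputed.
n_docs = 100
corpus_counts = {"screen_size": 200, "battery_life": 50}

# Document frequency: average mentions per document (2.0 and 0.5).
doc_freq = {t: c / n_docs for t, c in corpus_counts.items()}
# Inverse document frequency: reciprocal of document frequency (0.5 and 2.0).
idf = {t: 1 / df for t, df in doc_freq.items()}

# Raw term frequencies for the two reviewers discussed above.
tf = {
    "reviewer_50": {"screen_size": 4, "battery_life": 0},
    "reviewer_70": {"screen_size": 1, "battery_life": 2},
}

# Reweigh each term frequency by the IDF of that token.
tf_idf = {doc: {t: f * idf[t] for t, f in row.items()} for doc, row in tf.items()}
print(tf_idf)
```

The output reproduces the flip: reviewer 50's screen-size score drops from 4 to 2, while reviewer 70's battery-life score rises from 2 to 4.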
Video 7: Sentiment Analysis
All right, let's now come to Sentiment Analysis, having seen basic text analysis. The next step is sentiment mining; what we are going to do is elementary sentiment mining, the key word being elementary. I'm going to show you two review excerpts of a popular TV series, and I want you to take a guess at their sentiment content. This is the first one. "This is one of the best TV series I have seen in a long time. I am yet to read the books, but if the TV series is anything to go by, the books will be outstanding." Folks, is this review positive in general, is it negative, is it neutral, is it more positive than negative? What do you think, and why? Hold on to that. Take a look at the second review, a little more complicated. "After seeing today's Game of Thrones, I realised the author of the serial as well as HBO may need professional medical assistance. Since the beginning, hard to find anything that would leave anyone feeling good." Is this second review positive, negative, neutral, more one way than the other, a mix of both? What is it? Once you have formed your opinions, may I ask: what tokens in particular led you to think one way or the other? In the first review: "one of the best TV series", best; "the books will be outstanding", outstanding. These, best and outstanding, are adjectives, and it is mostly adjectives we focus on when we do sentiment mining. Both of them give it away: this is a positive review, a hugely positive one. Now look at the second one. The second one is tricky. Why? Because there are no clear adjectives per se. "HBO may need professional medical assistance", really? You won't
Applied Business Analytics Page 9 of 12
find "professional medical assistance" in an adjective list, right? This is very contextual. This is hard to get at. So, sentiment mining can trip up sometimes, because what it does is focus on matching sentiment-laden words, adjectives. In fact, a machine might actually go wrong here. It would look at "good" in "would leave anyone feeling good", pick it up, and say, "Hey, this is positive." It would pick up "hard" from "hard to find anything" and say, "Okay, maybe this is negative." Positive and negative cancel out, so it's "mixed". That's what it might say, yeah. Hold on to that; we're headed there. But more generally, the question will arise: it's not just about direction, positive or negative; can I also measure magnitude? How positive, how negative? Turns out we can. That is where we're headed. Some conceptual, definitional preliminaries first. One, what is sentiment mining? Sentiment mining is a process, basically an attempt to detect, extract, and measure value judgments, opinions: measuring right and wrong in some sense, measuring good and bad. It's emotional content that we are trying to measure in text data. Two, why do we care about sentiment? Why should businesses care? Marketers would know: it is more important to figure out what people feel about your brand than what they think about your brand, because people's thinking may change much more easily than their feelings do. Three, how is sentiment measured? Well, the technical term for it is valence. Valence is a continuum: positive, neutral, negative. So, a sentiment sits somewhere on that continuum. Valence can be measured and scored, and it makes sense to build our own context-specific, domain-specific, vertical-specific valence scoring scheme.
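The machine's confusion described above can be made concrete with a minimal valence scorer. The tiny lexicon and its scores below are invented for illustration, not taken from AFINN or any real dictionary:

```python
# A tiny, invented valence lexicon -- NOT a real dictionary like AFINN.
valence = {"best": 3, "outstanding": 4, "good": 2, "hard": -2}

def score(text: str) -> int:
    """Sum the valence of every lexicon token found in the text."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(valence.get(tok, 0) for tok in tokens)

review_1 = "One of the best TV series. The books will be outstanding."
review_2 = "Hard to find anything that would leave anyone feeling good."

s1 = score(review_1)  # best (+3) + outstanding (+4) = +7, clearly positive
s2 = score(review_2)  # "hard" (-2) and "good" (+2) cancel to 0: a false "mixed"
```

The first review scores strongly positive, as a human would agree; the second nets out to zero because the matched tokens cancel, exactly the context-blindness the lecture warns about.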
You can rely on a general sentiment dictionary, but really, what are the odds it will serve your particular domain? What I'm going to do next is walk you through an illustrative exercise. Which movie is this? Two major parts, a classic movie? I am going to walk you through the review corpus of Bahubali. Please open "Bahubali reviews.txt". This is what it looks like, and the questions we are going to ask after we feed it into the sentiment mining app are these. One, what experiences and associations emerge from the corpus? Two, what sentiments predominate, and for which attributes of the experience? Three, how do people feel about the movie, and what do they compare it with? People compared it to Sholay, which is one of the all-time classics out there. In some sense, what is the consideration set, the comparison set? These are questions that will come up, and by combining the basic text app with the sentiment mining app, we can start to answer them. For reviews running into the hundreds and thousands, that's where we are headed. Let us open the sentiment analysis app and walk through it.
Video 8: Sentiment App Demo: User Reviews
Alright folks, let us now explore and navigate the sentiment analysis Shiny app. Have a look at this. This is what the app looks like. These are your UI elements; these are your output tabs. Precious few UI elements. You have something called a sentiment dictionary; we'll go with AFINN by default. You have a user-defined one in case you want it, and this one is for finance-specific tokens. NRC I have removed because there are some permission issues; everything else is there. I'll come to what the document index does.
First and foremost, let us read the data in. You have three tabs: sentiment plot, stats, and document analysis. Let's have a look at what they do. First off, click this and let's read in the data, "Bahubali reviews". There it is; I have read it in. Please go to the Sentiment Plot. What does it show you? Well, these are your document numbers. There are about 360-odd documents, and the document numbers are not consecutive; there are jumps and gaps in there. Now, hover on a point. If you take your cursor and put it here, for instance, it will give you the coordinates of that point. Each of these lines represents a document. The zero level represents neutral sentiment. Anything above the zero level is net positive sentiment; anything below the zero line is net negative sentiment. Basically, the reviews of Bahubali seem to be more positive than negative; you have way more lines above the zero than below it. Now you might wonder: okay, but I want to know what a specific review is saying. Let's go to, I think, the tallest of the lot and see. So, this is review number 405, and its net positive score is 76, basically plus 76. Now one might wonder, why is review number 405 so positive? Type the number into "Document Index", press "Enter", and then go to Document Level Analysis. What it will do is break the document down into individual sentences, and it will also list the tokens that carry sentiment. Now, the AFINN sentiment scoring scheme goes from minus five to plus five. So, these words colour-coded in black are plus five, these words in grey are plus four, these ones are, I guess, plus three, plus two, plus one, and so on. And if you go back to the Document Level Analysis, these were the words that were given a score of minus two, these were the words in the review given a score of minus one, and so on.
And you can continue with each sentence in the review: it's breaking down the review, parsing it into sentences, and then giving you the net sentiment per sentence. Take a look at this: "There is not even a single scene where you will get a chance to blink." I mean, this is high praise; we know that, reading it. But the sentiment score given here is only plus two. Others score plus six, for instance. Look at this: Ramya Krishnan and Nasser play their respective roles extremely well, "excellent performances", "pleasant music", and so on. So, you get all the keyword matches pretty well, which is why you are getting a high score here. Okay, you can check out the other dictionaries also. Bing, for instance, is a simpler one: it's just positive and negative. These are the negative words and these are the positive words. If you choose Bing, this will come down to a much coarser kind of classification; that's basically what you will get. And this will change too: it will give you the reviews with high negative and high positive matches. Well, please go through this video again, upload any other dataset, and explore these options. Play around with it a little until you are comfortable.
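The Document Level Analysis view demoed above can be sketched roughly as follows. The AFINN-style scores here are an illustrative assumption (real AFINN assigns integers from minus five to plus five), and the sample review is paraphrased, not the verbatim corpus text:

```python
# Illustrative AFINN-style subset; real AFINN maps words to integers in -5..+5.
afinn_subset = {"excellent": 3, "pleasant": 3}

def sentence_scores(review: str):
    """Split a review into sentences; return (sentence, matched tokens, net score)."""
    sentences = [s.strip() for s in review.split(".") if s.strip()]
    results = []
    for s in sentences:
        hits = [tok for tok in s.lower().split() if tok in afinn_subset]
        results.append((s, hits, sum(afinn_subset[tok] for tok in hits)))
    return results

# A paraphrased sample review (invented for this sketch).
review = "Ramya Krishnan and Nasser give excellent performances. Pleasant music throughout."
breakdown = sentence_scores(review)  # one (sentence, hits, score) tuple per sentence
```

Each sentence gets its own net score from the matched tokens, mirroring the app's per-sentence breakdown; a fuller version would handle richer sentence splitting and the full lexicon.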
Video 9: Elementary Text Analytics: Summary
So, let me now put everything together. I'm going to summarise what we just did in sentiment analysis and then recap the entire session. Let's have a look at the key points of sentiment mining. One: What factors does good sentiment analysis depend on? We figured this out when we went through the app; we saw that there are three or four different sentiment lexicons, sentiment dictionaries, in there. Sentiment analysis quality will
depend crucially on the quality of the wordlists of the sentiment dictionaries that we use as input. Two: Is sentiment mining context-sensitive? Yes, big time; it has to be. The same words can say one thing in one context and something else in another. Context in business settings becomes very domain- and vertical-specific. Hence, it is imperative that we evolve our own wordlists for valence polarity scoring, our own scoring schemes. We can and we should do that. The app provides a very basic way for us to do it: a user-defined sentiment dictionary which we can customise to our needs. Three: Can we make running plots of sentiment or buzz over time? This is a little bit of an introduction to the dynamics of sentiment. Can we do this? Yes, we can. Why would we want to? It tells us where sentiment peaked, where it crashed, where it stabilised, what is happening, in any event where chronology matters. Four: Can you think of any major sources of sentiment information out there for your firm, your product, your brand, your vertical, your industry? There are plenty of sources out there, and armed with the tools we have now seen, there is something we can do about it: we can pick that data up, feed it through the app, interpret the output, and generate actionable insights. With that, let me now combine the two big apps of today, basic text analysis and sentiment analysis, and walk through what we did in this module. We started with a motivating example of product reviews on e-commerce sites; remember the Amazon Bluetooth speakers and the Fitbit example. We went through the 'Why' of text analysis and part of the 'How' of basic text analysis. We introduced the basic text-an app. We ran the IBM analyst calls dataset, our introductory real-world dataset, through the app. And there was a graded exercise on the Amazon product reviews corpus for the OnePlus 8.
We also did elementary sentiment mining, and in the process, we did a hands-on introductory app exercise with the Bahubali movie reviews, and a second graded exercise for the module with hotel reviews. With that, we have seen these two big apps. We have used each of them in turn, but we can actually combine them and use them together as well.
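As a closing sketch, the "running plots of sentiment over time" idea mentioned in the summary can be illustrated with a simple moving average over per-period net sentiment scores. The scores below are invented; in practice they would come from the sentiment app's document scores ordered by time:

```python
# Invented per-period net sentiment scores (e.g. one value per day or week).
period_scores = [2, 3, -1, -4, -2, 1, 4, 5]

def rolling_mean(xs, window=3):
    """Simple moving average: smooths noise so peaks and crashes stand out."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

smooth = rolling_mean(period_scores)
# The minimum of `smooth` marks where sentiment "crashed";
# the rising tail shows it recovering and stabilising.
```

Plotting `smooth` against time gives exactly the kind of running sentiment curve the summary describes: peaks, crashes, and stabilisation become visible at a glance.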