   text
1  Can you recommend anyone for this #job? Delivery Driver - https://t.co/DRHtAzXQYO #Transportation #Greenville, NC #Hiring #CareerArc
2  @marcelllamariee @ashlynmariieee shoulda went to Greenville I would have went and we would have fucked shit up
3  I guess I'll go to Greenville today <ed><U+00A0><U+00BD><ed><U+00B8><U+0098>
4  @Uber_Support We need one of these in Greenville SC!!! Where are you guys?\nWhen are you opening a Greenlight station here? <ed><U+00A0><U+00BD><ed><U+00B8><U+008A>
...
19 What nail place in Greenville is the best bc every one I go to effs up my nails
20 I'm at @Walmart Supercenter in Greenville, TX https://t.co/uYSIZPA4VE
21 Goodbye Greenville, take off in 20, New York in an hour! <ed><U+00A0><U+00BD><ed><U+00BB><U+00A9> #TakeOff #NewYorkCity #BigAppleChristmas https://t.co/upN8kaGONK
{% endhighlight %}
The thing to notice here is that there are several different Greenvilles, which makes analysis of the local area pretty hard. Many of these tweets could be about Greenville, NC or Greenville, SC. In this particular dataset, there was even a Greenville Road in California (where there was a car fire). Rather than play a filtering game, it may be better to apply some knowledge specific to the area. For instance, local tweets are often tagged with `#yeahThatgreenville`, so we will search again for the `#yeahthatgreenville` hashtag (and add a few more tweets as well). This time, we'll keep retweets:
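A minimal sketch of that second search, assuming a `twitteR` session authorized as before (the result name `gvl_tweets` and `n = 1000` are placeholders, not necessarily the values used in the post):

{% highlight r %}
library(twitteR)
library(dplyr)

# Search on the local hashtag; this time we do not strip retweets
gvl_tweets <- searchTwitter("#yeahthatgreenville", n = 1000) %>%
  twListToDF()
{% endhighlight %}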
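The output below shows the head of the tokenized dataset. A sketch of the kind of code that produces it, assuming `tidytext` and the `gvl_tweets` frame sketched above:

{% highlight r %}
library(tidytext)

# One row per word per tweet; keep only the id and text fields
tweet_words <- gvl_tweets %>%
  select(id, text) %>%
  unnest_tokens(word, text)

head(tweet_words)
{% endhighlight %}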
{% highlight text %}
                    id      word
1   812352595159355392       the
1.1 812352595159355392      wall
1.2 812352595159355392      gods
1.3 812352595159355392 mountains
1.4 812352595159355392       amp
1.5 812352595159355392    cities
{% endhighlight %}
I used the `select` function from `dplyr` to keep only the `id` and `text` fields. The `unnest_tokens()` function creates a long dataset with a single word per row replacing the text; all the other fields remain unchanged. We can now easily create a bar chart of the words used the most:
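A sketch of one way to make that chart, assuming standard `dplyr` and `ggplot2` idioms:

{% highlight r %}
library(ggplot2)

# Count each word, keep the 20 most frequent, and plot
tweet_words %>%
  count(word, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()
{% endhighlight %}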
The `anti_join` function is probably not familiar to most data scientists or statisticians. It is, in a sense, the opposite of a merge: the command (sketched below) matches the `tweet_words` and `my_stop_words` data frames and then _removes_ the rows that matched, leaving only the rows of `tweet_words` (the `id` and `word`) that do not match anything in `my_stop_words`. This is desirable because `my_stop_words` contains words we _do not_ want to analyze.
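A sketch of that stop-word removal, assuming `my_stop_words` is a data frame with a `word` column (for example, `tidytext`'s `stop_words` plus Twitter-specific terms like "rt" and "https"):

{% highlight r %}
# Keep only the rows of tweet_words whose word has no match in my_stop_words
tweet_words <- tweet_words %>%
  anti_join(my_stop_words, by = "word")
{% endhighlight %}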
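Next we need a sentiment lexicon. The output below is the head of the Bing lexicon; a sketch of how `bing_lex` might be loaded, assuming it comes straight from `tidytext`:

{% highlight r %}
# The Bing lexicon ships with tidytext
bing_lex <- get_sentiments("bing")
head(bing_lex)
{% endhighlight %}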
{% highlight text %}
# A tibble: 6 × 2
        word sentiment
       <chr>     <chr>
1    2-faced  negative
2    2-faces  negative
3         a+  positive
4   abnormal  negative
5    abolish  negative
6 abominable  negative
{% endhighlight %}
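The lexicon then gets attached to each word. A sketch of the join that would produce the `gvl_sentiment` output below, assuming a left join so that words missing from the lexicon are kept with an `NA` sentiment:

{% highlight r %}
# Left join keeps every word; unmatched words get sentiment = NA
gvl_sentiment <- tweet_words %>%
  left_join(bing_lex, by = "word")

head(gvl_sentiment)
{% endhighlight %}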
{% highlight text %}
                  id      word sentiment
1 812352595159355392      wall      <NA>
2 812352159916367873      wall      <NA>
3 812351889681633280      wall      <NA>
4 812352595159355392      gods      <NA>
5 812352159916367873      gods      <NA>
6 812351889681633280      gods      <NA>
{% endhighlight %}
Once you get to this point, sentiment analysis can start fairly easily:
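A sketch of the simplest cut, assuming a straight count of the sentiment labels:

{% highlight r %}
# Overall counts of positive and negative words across all tweets
gvl_sentiment %>%
  count(sentiment)
{% endhighlight %}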
There are many more positive words than negative words, so the mood tilts positive in our crude analysis. We can also group by tweet and see whether there are more positive or negative tweets:
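A sketch of a calculation that would yield the `gvl_sent_anly2` output below, assuming it averages the per-tweet word counts within each sentiment class:

{% highlight r %}
# Words per tweet in each sentiment class, averaged over tweets
gvl_sent_anly2 <- gvl_sentiment %>%
  count(id, sentiment) %>%
  group_by(sentiment) %>%
  summarize(n = mean(n))
gvl_sent_anly2
{% endhighlight %}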
{% highlight text %}
# A tibble: 3 × 2
  sentiment        n
      <chr>    <dbl>
1  negative 1.055556
2  positive 1.338710
3      <NA> 6.040404
{% endhighlight %}
On average, there are 1.3387097 positive words per tweet and 1.0555556 negative words per tweet, if you accept the assumptions of the above analysis.
There is, of course, a lot more that can be done, but this will get you started. For some more sophisticated ideas you can check [Julia Silge's analysis of Reddit data](http://juliasilge.com/blog/Reddit-Responds/), for instance. Another kind of analysis looking at sentiment and emotional content can be found [here](https://mran.microsoft.com/posts/twitter.html) (with the caveat that it uses the predecessor to `dplyr` and thus runs somewhat less efficiently). Finally, it would probably be useful to supplement the above sentiment data frames with situation-specific sentiment analysis, such as making `goallllllll` in the above a positive word.