   text
1  Can you recommend anyone for this #job? Delivery Driver - https://t.co/DRHtAzXQYO #Transportation #Greenville, NC #Hiring #CareerArc
2  @marcelllamariee @ashlynmariieee shoulda went to Greenville I would have went and we would have fucked shit up
3  I guess I'll go to Greenville today <ed><U+00A0><U+00BD><ed><U+00B8><U+0098>
4  @Uber_Support We need one of these in Greenville SC!!! Where are you guys?\nWhen are you opening a Greenlight station here? <ed><U+00A0><U+00BD><ed><U+00B8><U+008A>
...
19 What nail place in Greenville is the best bc every one I go to effs up my nails
20 I'm at @Walmart Supercenter in Greenville, TX https://t.co/uYSIZPA4VE
21 Goodbye Greenville, take off in 20, New York in an hour! <ed><U+00A0><U+00BD><ed><U+00BB><U+00A9> #TakeOff #NewYorkCity #BigAppleChristmas https://t.co/upN8kaGONK
{% endhighlight %}
The thing to notice here is that there are several different Greenvilles, which makes analysis of the local area pretty hard. Many of these tweets could be about Greenville, NC or Greenville, SC. In this particular dataset, there was even a Greenville Road in California (where there was a car fire). Rather than play a filtering game, it may be better to apply some knowledge specific to the area. For instance, local tweets are often tagged with `#yeahThatgreenville`, so we will search again for the `#yeahthatgreenville` hashtag (and add a few more tweets as well). This time, we'll keep retweets:
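A minimal sketch of that second search, assuming a `twitteR` session authorized as before (the result name `gvl_tweets` and `n = 1000` are placeholders, not necessarily the values used in the post):

{% highlight r %}
library(twitteR)
library(dplyr)

# Search on the local hashtag; this time we do not strip retweets
gvl_tweets <- searchTwitter("#yeahthatgreenville", n = 1000) %>%
  twListToDF()
{% endhighlight %}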
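The output below shows the head of the tokenized dataset. A sketch of the kind of code that produces it, assuming `tidytext` and the `gvl_tweets` frame sketched above:

{% highlight r %}
library(tidytext)

# One row per word per tweet; keep only the id and text fields
tweet_words <- gvl_tweets %>%
  select(id, text) %>%
  unnest_tokens(word, text)

head(tweet_words)
{% endhighlight %}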
{% highlight text %}
                    id      word
1   812352595159355392       the
1.1 812352595159355392      wall
1.2 812352595159355392      gods
1.3 812352595159355392 mountains
1.4 812352595159355392       amp
1.5 812352595159355392    cities
{% endhighlight %}
I used the `select` function from `dplyr` to keep only the `id` and `text` fields. The `unnest_tokens()` function creates a long dataset with a single word per row replacing the text; all the other fields remain unchanged. We can now easily create a bar chart of the words used the most:
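A sketch of one way to make that chart, assuming standard `dplyr` and `ggplot2` idioms:

{% highlight r %}
library(ggplot2)

# Count each word, keep the 20 most frequent, and plot
tweet_words %>%
  count(word, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()
{% endhighlight %}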
The `anti_join` function is probably not familiar to most data scientists or statisticians. It is, in a sense, the opposite of a merge: the command (sketched below) matches the `tweet_words` and `my_stop_words` data frames and then _removes_ the rows that matched, leaving only the rows of `tweet_words` (the `id` and `word`) that do not match anything in `my_stop_words`. This is desirable because `my_stop_words` contains words we _do not_ want to analyze.
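A sketch of that stop-word removal, assuming `my_stop_words` is a data frame with a `word` column (for example, `tidytext`'s `stop_words` plus Twitter-specific terms like "rt" and "https"):

{% highlight r %}
# Keep only the rows of tweet_words whose word has no match in my_stop_words
tweet_words <- tweet_words %>%
  anti_join(my_stop_words, by = "word")
{% endhighlight %}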
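Next we need a sentiment lexicon. The output below is the head of the Bing lexicon; a sketch of how `bing_lex` might be loaded, assuming it comes straight from `tidytext`:

{% highlight r %}
# The Bing lexicon ships with tidytext
bing_lex <- get_sentiments("bing")
head(bing_lex)
{% endhighlight %}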
{% highlight text %}
# A tibble: 6 × 2
        word sentiment
       <chr>     <chr>
1    2-faced  negative
2    2-faces  negative
3         a+  positive
4   abnormal  negative
5    abolish  negative
6 abominable  negative
{% endhighlight %}
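The lexicon then gets attached to each word. A sketch of the join that would produce the `gvl_sentiment` output below, assuming a left join so that words missing from the lexicon are kept with an `NA` sentiment:

{% highlight r %}
# Left join keeps every word; unmatched words get sentiment = NA
gvl_sentiment <- tweet_words %>%
  left_join(bing_lex, by = "word")

head(gvl_sentiment)
{% endhighlight %}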
{% highlight text %}
                  id      word sentiment
1 812352595159355392      wall      <NA>
2 812352159916367873      wall      <NA>
3 812351889681633280      wall      <NA>
4 812352595159355392      gods      <NA>
5 812352159916367873      gods      <NA>
6 812351889681633280      gods      <NA>
{% endhighlight %}
Once you get to this point, sentiment analysis can start fairly easily:
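A sketch of the simplest cut, assuming a straight count of the sentiment labels:

{% highlight r %}
# Overall counts of positive and negative words across all tweets
gvl_sentiment %>%
  count(sentiment)
{% endhighlight %}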
There are many more positive words than negative words, so the mood tilts positive in our crude analysis. We can also group by tweet and see whether there are more positive or negative tweets:
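A sketch of a calculation that would yield the `gvl_sent_anly2` output below, assuming it averages the per-tweet word counts within each sentiment class:

{% highlight r %}
# Words per tweet in each sentiment class, averaged over tweets
gvl_sent_anly2 <- gvl_sentiment %>%
  count(id, sentiment) %>%
  group_by(sentiment) %>%
  summarize(n = mean(n))
gvl_sent_anly2
{% endhighlight %}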
{% highlight text %}
# A tibble: 3 × 2
  sentiment        n
      <chr>    <dbl>
1  negative 1.055556
2  positive 1.338710
3      <NA> 6.040404
{% endhighlight %}
On average, there are 1.3387097 positive words per tweet and 1.0555556 negative words per tweet, if you accept the assumptions of the above analysis.
There is, of course, a lot more that can be done, but this will get you started. For some more sophisticated ideas you can check [Julia Silge's analysis of Reddit data](http://juliasilge.com/blog/Reddit-Responds/), for instance. Another kind of analysis looking at sentiment and emotional content can be found [here](https://mran.microsoft.com/posts/twitter.html) (with the caveat that it uses the predecessor to `dplyr` and thus runs somewhat less efficiently). Finally, it would probably be useful to supplement the above sentiment data frames with situation-specific sentiment analysis, such as making `goallllllll` in the above a positive word.