GOING DOWN TO SOUTH PARK

to do some tidytext analysis

PATRIK DRHLÍK
freelance data scientist

Web scraping and R packages

Glimpse at the data

## Observations: 312,767
## Variables: 16
## $ season                <fct> Season Thirteen, Season One, Season Sixt...
## $ season_number         <int> 13, 1, 16, 5, 15, 3, 11, 9, 13, 3, 21, 8...
## $ season_episode_number <dbl> 7, 10, 14, 10, 10, 4, 1, 2, 5, 15, 6, 1,...
## $ episode               <fct> Fatbeard, Damien, Obama Wins!, How to Ea...
## $ episode_number        <int> 188, 10, 237, 75, 219, 35, 154, 127, 186...
## $ character             <chr> "cartman", "stan", "kyle", "jonesy", "mr...
## $ year                  <int> 2009, 1997, 2012, 2001, 2011, 1999, 2007...
## $ line_number           <int> 63817, 4528, 76451, 31001, 72011, 16346,...
## $ word                  <chr> "hey", "bubye", "program", "wrong", "gen...
## $ word_stem             <chr> "hei", "buby", "program", "wrong", "gene...
## $ swear_word            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ episode_name          <chr> "Fatbeard", "Damien", "Obama Wins!", "Ho...
## $ air_date              <date> 2009-04-22, 1998-02-04, 2012-11-07, 200...
## $ user_rating           <dbl> 8.2, 8.1, 7.5, 8.1, 7.6, 6.7, 8.8, 8.8, ...
## $ user_votes            <dbl> 1578, 1703, 1156, 1488, 1229, 1546, 2356...
## $ score                 <int> NA, NA, NA, -2, 2, NA, NA, NA, NA, NA, N...
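
A minimal sketch of how a one-word-per-row table like the one above could be built with tidytext. The input tibble `lines`, its `text` column, and the `swear_list` vector are hypothetical stand-ins; the deck does not show the actual pipeline or say which swear-word list was used. The stems in the output ("hei", "buby") look like Porter stems, which is what `SnowballC::wordStem()` produces by default, but that is an inference.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Toy stand-in for the scraped script lines (hypothetical data).
lines <- tibble(
  line_number = 1:2,
  character   = c("cartman", "kyle"),
  text        = c("Screw you guys, I am going home", "This is the wrong program")
)

# Hypothetical swear-word list, for illustration only.
swear_list <- c("screw")

words <- lines %>%
  unnest_tokens(word, text) %>%            # one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # drop stopwords (a, the, this, ...)
  mutate(
    word_stem  = wordStem(word),           # Porter stem of each word
    swear_word = word %in% swear_list      # flag words found in the list
  )

glimpse(words)
```

The `score` column (values such as -2 and 2, with many `NA`s) is consistent with a left join against a word-level sentiment lexicon such as AFINN, though the slides do not name the lexicon.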

Basic statistics about the show

Figure     Description
21         Number of seasons
287        Number of episodes
914 475    Number of words
312 767    Number of words after removing stopwords (a, the, this, ...)
6 170      Number of swear words
1.97 %     Share of swear words among non-stopwords
34.2 %     Share of all words used for analysis
4 403      Number of characters
8.14       Mean IMDb rating
9.6        Highest-rated episode: Scott Tenorman Must Die (S05E04)
6.3        Lowest-rated episode: A Million Little Fibers (S10E05)
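
The two percentages follow directly from the word counts: the swear-word share is taken over the words left after stopword removal (the same 312,767 rows reported by the glimpse), and the share used for analysis is that count over all words. A quick check in base R, with variable names chosen here for illustration:

```r
n_words   <- 914475  # all words in the scripts
n_no_stop <- 312767  # words left after removing stopwords
n_swear   <- 6170    # swear words

pct_swear    <- 100 * n_swear / n_no_stop
pct_analysed <- 100 * n_no_stop / n_words

round(pct_swear, 2)     # 1.97
round(pct_analysed, 1)  # 34.2
```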

Overall sentiment analysis

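
One way a per-episode sentiment could be derived from the `score` column, sketched on toy data. Averaging the non-missing scores is an assumption; the slides do not say how the plotted sentiment was aggregated.

```r
library(dplyr)

# Toy data: words without a lexicon match have score NA.
words <- tibble(
  episode_number = c(1, 1, 1, 2, 2),
  score          = c(NA, -2, 2, -3, NA)
)

episode_sentiment <- words %>%
  group_by(episode_number) %>%
  summarise(
    sentiment = mean(score, na.rm = TRUE),  # mean score of matched words
    n_scored  = sum(!is.na(score)),         # how many words had a score
    .groups   = "drop"
  )

episode_sentiment
```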

Episode popularity


Are naughty episodes more popular?

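
A hedged sketch of how the question could be approached: compute the share of swear words per episode and compare it with the episode's `user_rating`. The data below are made up, and a plain Pearson correlation is only one possible measure; the deck does not state which was used.

```r
library(dplyr)

# Toy data: four words per episode, rating repeated on every row.
words <- tibble(
  episode_number = rep(1:3, each = 4),
  swear_word     = c(TRUE, FALSE, FALSE, FALSE,
                     TRUE, TRUE, FALSE, FALSE,
                     FALSE, FALSE, FALSE, FALSE),
  user_rating    = rep(c(8.1, 8.8, 7.5), each = 4)
)

episodes <- words %>%
  group_by(episode_number) %>%
  summarise(
    swear_share = mean(swear_word),    # fraction of words that are swear words
    user_rating = first(user_rating),  # rating is constant within an episode
    .groups     = "drop"
  )

# Positive values mean naughtier episodes tend to be rated higher.
cor(episodes$swear_share, episodes$user_rating)
```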

So who's the naughtiest character?

It's Kenny!

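
A ranking like this could be produced by comparing each character's share of swear words. The sketch below uses invented data, and the minimum word count of 3 is a hypothetical threshold chosen only to keep characters with very few words from topping the list.

```r
library(dplyr)

# Toy data, one row per spoken word.
words <- tibble(
  character  = c("kenny", "kenny", "kenny", "stan", "stan", "stan", "stan", "kyle"),
  swear_word = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

naughtiest <- words %>%
  group_by(character) %>%
  summarise(
    n_words     = n(),               # words spoken by the character
    swear_share = mean(swear_word),  # fraction of them that are swear words
    .groups     = "drop"
  ) %>%
  filter(n_words >= 3) %>%           # hypothetical minimum sample size
  arrange(desc(swear_share))

naughtiest
```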

Contact