Graham Norton's Guests

The Guests of the Graham Norton Show

I was sick, lying on the sofa and trying to pass an evening, in desperate need of something easy to watch while I waited for the headache, sore throat and fever to leave me. All of the shows I actually wanted to watch are ones I watch with my girlfriend. Even sick, I don’t think I could get away with that level of betrayal!

So I put on Graham Norton. As my girlfriend describes it, “It’s okay if you miss a bit”. Perfect for when you’re half dozing.

Nicole Kidman was the first guest on this week’s show. At the start, she says “I think I’ve been on here the most of anyone”. That was like drugs to a sniffer dog, even to my tired mind. Now that I’m feeling a little better, let’s see if we can check that claim.

Wikipedia has a page detailing every episode and its guests. We can scrape that page to get our data for analysis.

library(rvest)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(lubridate)

page <- read_html("https://en.wikipedia.org/wiki/List_of_The_Graham_Norton_Show_episodes")

# List all of the CSS selectors
selector_guests <- ".summary"
selector_episode_no <- ".vevent td:nth-child(2)"
selector_overall_episode_no <- ".vevent th"
selector_air_date <- ".vevent td:nth-child(4)"

selectors <- c(air_date = selector_air_date, 
               guests = selector_guests, 
               episode_no = selector_episode_no, 
               overall_episode_no = selector_overall_episode_no)

# Create general function to extract text from page
extract_graham_norton <- function(site, selector) {
  site %>% html_nodes(selector) %>% html_text()
}

lists_mania <- map_df(selectors, ~extract_graham_norton(page, .x))

df <- bind_cols(lists_mania)

head(df)
## # A tibble: 6 x 4
##                        air_date
##                           <chr>
## 1                             –
## 2 22 February 2007 (2007-02-22)
## 3     1 March 2007 (2007-03-01)
## 4     8 March 2007 (2007-03-08)
## 5    15 March 2007 (2007-03-15)
## 6    22 March 2007 (2007-03-22)
## # ... with 3 more variables: guests <chr>, episode_no <chr>,
## #   overall_episode_no <chr>
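
To see how the CSS selectors above map onto table columns, here is a minimal sketch on an inline HTML fragment. The two-row table and the names "Guest A" to "Guest E" are made up for illustration; the real Wikipedia markup is richer and may change, so treat the structure as an assumption.

```r
library(rvest)

# Hypothetical fragment mimicking an episode table:
# <th> overall number, then episode number, guests, air date
toy_page <- read_html('<table>
  <tr class="vevent">
    <th>1</th><td>1</td>
    <td class="summary">Guest A, Guest B and Guest C</td>
    <td>22 February 2007</td>
  </tr>
  <tr class="vevent">
    <th>2</th><td>2</td>
    <td class="summary">Guest D and Guest E</td>
    <td>1 March 2007</td>
  </tr>
</table>')

# Same shape as extract_graham_norton(): select nodes, keep their text
extract_text <- function(site, selector) {
  site %>% html_nodes(selector) %>% html_text()
}

toy_guests   <- extract_text(toy_page, ".summary")
toy_air_date <- extract_text(toy_page, ".vevent td:nth-child(4)")
toy_overall  <- extract_text(toy_page, ".vevent th")
```

Note that `nth-child` counts every child of the row, including the `<th>`, which is why the air date is child 4. Each selector has to return one element per episode row for the columns to line up; if the table layout changes, the vectors could end up with different lengths.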

We have the data in its rough form, but we’ll need to extract the names into something we can work with. Currently they are in the form of "guest1, guest2, ..., guest(n-1) and guest(n)". We’ll need to split the string on "," and "and", then unnest our new dataframe to create a record for every guest appearance on the show.
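
Before running this on the scraped data, here is a small sketch of the split-and-unnest idea on toy input. The episode numbers and guest pairings are invented for illustration, and the final `filter()` is my own addition to drop the empty strings that an Oxford comma (", and") would leave behind.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical toy data: the pairings below are made up
toy <- tibble(
  episode = c(1, 2),
  guests  = c("Nicole Kidman, James McAvoy and Dawn French",
              "Ricky Gervais and Miranda Hart")
)

toy_tidy <- toy %>%
  # Split on a comma or on the whole word "and" (list column of pieces)
  mutate(guests = str_split(guests, ",|\\band\\b")) %>%
  # One row per piece
  unnest(guests) %>%
  # Normalise case and whitespace, then drop any empty leftovers
  mutate(guests = str_to_lower(guests) %>% str_trim()) %>%
  filter(guests != "")
```

One caveat of splitting on the word "and": an act whose name contains it (a double act billed as "X and Y", say) would be counted as two separate guests.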

tidy_df <- df %>%
  
  # Remove "Compilation show" episodes; these have no guests
  filter(guests != "Compilation show") %>%
  
  # Extract the air date and parse it. Wikipedia formats these as "22 February 2007 (2007-02-22)"
  mutate(air_date_tidy = str_extract(air_date, "(.{1,2}\\W[A-Za-z]+\\W\\d{4})") %>% dmy(),
         
         # Split the guest listing on "," or "and"
         guests = str_split(guests, ",|\\band\\b")) %>%
  
  # Create one row per guest appearance
  tidyr::unnest(guests) %>%
  
  # Tidy up the names so that they are in a uniform format
  mutate(guests = str_to_lower(guests) %>% str_replace_all("pilot|\\(|\\)", "") %>% str_trim())

#filename <- paste("output-data/", "graham-norton-guests_", Sys.Date(), ".csv", sep = "")
#write.csv(tidy_df, file = filename)

head(tidy_df)
## # A tibble: 6 x 5
##                        air_date episode_no overall_episode_no
##                           <chr>      <chr>              <chr>
## 1                             –          –                  –
## 2                             –          –                  –
## 3 22 February 2007 (2007-02-22)          1                  1
## 4 22 February 2007 (2007-02-22)          1                  1
## 5 22 February 2007 (2007-02-22)          1                  1
## 6 22 February 2007 (2007-02-22)          1                  1
## # ... with 2 more variables: air_date_tidy <date>, guests <chr>
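
The date handling is the fiddliest step, so here is a small sketch of it in isolation. The first approach uses a slightly tightened version of the pattern in the pipeline; the second, which reads the ISO date in parentheses, is an alternative I am suggesting rather than what the pipeline does.

```r
library(stringr)
library(lubridate)

raw_date <- "22 February 2007 (2007-02-22)"

# Keep the leading "day month year" text and parse it with dmy()
from_text <- dmy(str_extract(raw_date, "\\d{1,2}\\W[A-Za-z]+\\W\\d{4}"))

# Alternative (an assumption, not the post's method): use the ISO date in parentheses
from_iso <- ymd(str_extract(raw_date, "\\d{4}-\\d{2}-\\d{2}"))

# A placeholder such as the pilot's "–" contains no date, so it ends up as NA
no_date <- ymd(str_extract("–", "\\d{4}-\\d{2}-\\d{2}"))
```

The ISO variant does not depend on month names, which may make it a little more robust if the session locale is not English.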

Our data is now fit for purpose. Is Nicole Kidman the most common guest on the show? Let’s check the top 5. Note that top_n() keeps ties, so we get eight rows back rather than five.

tidy_df %>%
  count(guests) %>%
  arrange(desc(n)) %>%
  top_n(5, n)
## # A tibble: 8 x 2
##             guests     n
##              <chr> <int>
## 1    ricky gervais    10
## 2      dawn french     9
## 3     miranda hart     9
## 4  dame judi dench     8
## 5 daniel radcliffe     8
## 6   jack whitehall     8
## 7     james mcavoy     8
## 8      john bishop     8
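
The eight rows come from ties at the cut-off. As a sketch of how to control that, assuming dplyr 1.0.0 or later is installed, `slice_max()` exposes the behaviour through `with_ties`. The counts below are copied from the output above.

```r
library(dplyr)

# Counts copied from the top_n() output above
counts <- tibble(
  guests = c("ricky gervais", "dawn french", "miranda hart", "dame judi dench",
             "daniel radcliffe", "jack whitehall", "james mcavoy", "john bishop"),
  n = c(10, 9, 9, 8, 8, 8, 8, 8)
)

# Default: every row tied with the fifth-highest count is kept
with_ties <- counts %>% slice_max(n, n = 5)

# Exactly five rows; which of the tied guests survive is arbitrary
without_ties <- counts %>% slice_max(n, n = 5, with_ties = FALSE)
```

For a question like "who has appeared most often", keeping the ties is arguably the more honest choice.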

She’s nowhere near! How many times has she been on the show?

tidy_df %>%
  filter(guests == "nicole kidman") %>%
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1     5

Bonus

Histogram of Guest frequency

tidy_df %>%
  count(guests) %>%
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(x = "Occurrences",
       y = "Frequency",
       title = "A bar chart to show the number of times each guest has appeared on the show",
       subtitle = "Most guests only appear once")

tidy_df %>%
  count(guests) %>%
  group_by(n) %>%
  summarise(count = n())
## # A tibble: 10 x 2
##        n count
##    <int> <int>
##  1     1   477
##  2     2   159
##  3     3    75
##  4     4    34
##  5     5    15
##  6     6    19
##  7     7     2
##  8     8     5
##  9     9     2
## 10    10     1

Number of distinct guests

How many guests has Graham Norton had?

Graham has had 1427 guest appearances. How many distinct guests have made an appearance?

# Total Guest appearances
nrow(tidy_df)
## [1] 1427
# Distinct number of guests
tidy_df %>%
  distinct(guests) %>%
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   789
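
As a quick sanity check, the frequency table from the previous section should agree with these two totals. The sketch below simply retypes that table by hand and redoes the arithmetic in base R.

```r
# Frequency table copied from the earlier output:
# n = number of appearances, count = number of guests with that many
freq <- data.frame(
  n     = 1:10,
  count = c(477, 159, 75, 34, 15, 19, 2, 5, 2, 1)
)

# Every distinct guest falls in exactly one row
distinct_guests <- sum(freq$count)

# Each guest contributes n appearances
total_appearances <- sum(freq$n * freq$count)

# Share of guests who have appeared only once
share_once <- freq$count[1] / distinct_guests
```

Both totals match the figures above (789 distinct guests and 1427 appearances), and about 60% of guests have appeared only once.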

Number of Guests Per Episode

tidy_df %>%
  count(overall_episode_no) %>%
  arrange(desc(n)) %>%
  top_n(10, wt = n)
## # A tibble: 27 x 2
##    overall_episode_no     n
##                 <chr> <int>
##  1                206    14
##  2                248     9
##  3                136     8
##  4                171     8
##  5                313     8
##  6                240     7
##  7                298     7
##  8                345     7
##  9                103     6
## 10                189     6
## # ... with 17 more rows

What’s so special about episode 206? It appears to be the 2013 New Year’s Eve special.

tidy_df %>%
  filter(overall_episode_no == "206")
## # A tibble: 14 x 5
##                         air_date episode_no overall_episode_no
##                            <chr>      <chr>              <chr>
##  1 31 December 2013 (2013-12-31)         11                206
##  2 31 December 2013 (2013-12-31)         11                206
##  3 31 December 2013 (2013-12-31)         11                206
##  4 31 December 2013 (2013-12-31)         11                206
##  5 31 December 2013 (2013-12-31)         11                206
##  6 31 December 2013 (2013-12-31)         11                206
##  7 31 December 2013 (2013-12-31)         11                206
##  8 31 December 2013 (2013-12-31)         11                206
##  9 31 December 2013 (2013-12-31)         11                206
## 10 31 December 2013 (2013-12-31)         11                206
## 11 31 December 2013 (2013-12-31)         11                206
## 12 31 December 2013 (2013-12-31)         11                206
## 13 31 December 2013 (2013-12-31)         11                206
## 14 31 December 2013 (2013-12-31)         11                206
## # ... with 2 more variables: air_date_tidy <date>, guests <chr>

Finally, some summary statistics for the number of guests per episode:

tidy_df %>%
  group_by(overall_episode_no) %>%
  count() %>% 
  ungroup() %>%
  summarise(mean_no_of_guests = mean(n),
            median_no_of_guests = median(n),
            max_no_of_guests = max(n),
            min_no_of_guests = min(n))
## # A tibble: 1 x 4
##   mean_no_of_guests median_no_of_guests max_no_of_guests min_no_of_guests
##               <dbl>               <dbl>            <dbl>            <dbl>
## 1          4.148256                   4               14                1