What's the most popular car in the UK?

2017-10-10

The Office of National Statistics is a brilliant source of interesting data sets. This page lists recently released datasets. Check regularly and you’ll soon notice these data sets feed into news articles by the Economist, BBC news and more.

One I recently thought to explore detailed the count of car makes and models in the UK. I wonder what’s the most popular car make and model in the UK?

library(tidyverse)
library(readxl)
library(stringr)

path <- "~/Learning/R/Datasets/vehicle-statistics-make-model.xlsx"
car_data <- read_excel(path = path, skip = 8, col_names = FALSE, sheet =  1)

Great, we’ve got our data. But what does it look like?

X__1	X__2	X__3	X__4	X__5	X__6	X__7	X__8	X__9	X__10	X__11	X__12	X__13	X__14	X__15	X__16	X__17	X__18	X__19	X__20	X__21	X__22	X__23	X__24	X__25	X__26	X__27	X__28	X__29	X__30	X__31	X__32	X__33	X__34	X__35	X__36	X__37	X__38	X__39	X__40	X__41	X__42	X__43	X__44	X__45	X__46	X__47	X__48	X__49	X__50	X__51	X__52	X__53	X__54	X__55	X__56	X__57	X__58	X__59	X__60	X__61	X__62	X__63
ABARTH	124 SPIDER MULTIAIR AUTO	72	71	59	58	36	35	26	25	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
ABARTH	500	4456	4382	4447	4370	4490	4411	4525	4448	4568	4487	4550	4469	4580	4500	4574	4494	4548	4467	4462	4387	4421	4347	4263	4094	3977	3834	3782	3622	3535	3387	3296	3148	2982	2752	2630	2417	2293	2055	1909	1677	1358	1039	706	456	232	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
ABARTH	500 C	608	596	608	598	603	592	614	602	615	604	614	603	631	619	627	615	638	624	640	626	640	626	640	648	642	628	604	587	560	520	499	455	401	342	316	239	190	127	74	2	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
ABARTH	500 CUSTOM	367	362	363	358	374	370	382	378	378	373	380	376	380	376	373	369	329	326	196	196	16	16	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
ABARTH	500 CUSTOM S-A	112	110	112	110	114	112	113	111	111	109	89	87	57	55	26	26	18	18	15	15	2	2	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
ABARTH	500 S-A	90	88	87	85	92	90	92	90	94	92	96	94	95	93	95	93	90	88	82	80	78	76	71	62	55	48	43	35	24	15	8	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
oh.

oh no.

It’s not as we’d like it, but we can certainly handle it.

We’ll need to tidy it up a little before the real fun begins.

names <- read_excel(path = path, skip = 6, col_names = F, n_max = 1)
names[2] <- "Model"

names(car_data) <- names

tidy_cars <- car_data %>%
  gather(key = Period, value = Count, -Make, -Model) %>%
  extract(Period, c("Year", "Quarter", "Region"), "(\\d{4}) Q(1|2|3|4) (UK|GB)") %>%
  mutate(Year = as.integer(Year),
         Quarter = as.integer(Quarter),
         Make = str_trim(Make),
         Model = str_trim(Model),
         YQ = lubridate::yq(paste(Year,"/",Quarter))) %>%
  filter(stringr::str_length(Make) < 20,
         !is.na(Count))

Make	Model	Year	Quarter	Region	Count	YQ
ABARTH	124 SPIDER MULTIAIR AUTO	2017	2	UK	72	2017-04-01
ABARTH	500	2017	2	UK	4456	2017-04-01
ABARTH	500 C	2017	2	UK	608	2017-04-01
ABARTH	500 CUSTOM	2017	2	UK	367	2017-04-01
ABARTH	500 CUSTOM S-A	2017	2	UK	112	2017-04-01
ABARTH	500 S-A	2017	2	UK	90	2017-04-01

Much better!

Now that we have our tidy dataset, lets explore the latest counts in the UK, from Q2 2017.

tidy_cars %>% 
  filter(Year == 2017, Quarter == 2, Region == "UK") %>%
  count(Make, wt = Count) %>%
  arrange(desc(n)) %>%
  top_n(n, n = 30) %>%
  ggplot(aes(reorder(Make, n), n)) + 
  geom_bar(stat = "identity") + 
  labs(x = "Make", y = "Count", 
       title = "The 30 most popular car makes in the UK as of Q2 2017", 
       subtitle = "Ford, Vauxhal and Volkswagen are significantly more popular than other makes") +
  coord_flip() + scale_y_continuous(labels = scales::comma)

Most popular car models (by Manufacturer)

Before we can find which car models are most popular, we need to do a bit of data transformation first. To illustrate the problem, look out the output below.

tidy_cars %>% 
  select(Make, Model) %>%
  filter(Make == "FORD", grepl("B-MAX", Model)) %>%
  distinct() %>% 
  knitr::kable()

Make	Model
FORD	B-MAX STUDIO
FORD	B-MAX TITANIUM AUTO
FORD	B-MAX TITANIUM NAVIGATOR
FORD	B-MAX TITANIUM NAVIGATOR AUTO
FORD	B-MAX TITANIUM NAVIGATOR TDCI
FORD	B-MAX TITANIUM TDCI
FORD	B-MAX TITANIUM TURBO
FORD	B-MAX TITANIUM X
FORD	B-MAX TITANIUM X AUTO
FORD	B-MAX TITANIUM X NAV TDCI
FORD	B-MAX TITANIUM X NAVIGATOR
FORD	B-MAX TITANIUM X NAVIGATOR A
FORD	B-MAX TITANIUM X TDCI
FORD	B-MAX ZETEC
FORD	B-MAX ZETEC AUTO
FORD	B-MAX ZETEC NAVIGATOR
FORD	B-MAX ZETEC NAVIGATOR AUTO
FORD	B-MAX ZETEC NAVIGATOR TDCI
FORD	B-MAX ZETEC RED EDITION
FORD	B-MAX ZETEC RED EDITION AUTO
FORD	B-MAX ZETEC SILVER EDITION
FORD	B-MAX ZETEC SILVER EDITION A
FORD	B-MAX ZETEC TDCI
FORD	B-MAX ZETEC TURBO
FORD	B-MAX ZETEC WHITE EDITION
FORD	B-MAX ZETEC WHITE EDITION AUTO

When we consider Car models, should we group similar models together, or consider them separate? If we don’t group them, a popular car model with many different sub-models may appear less popular that it is overall.

For our purposes, we will group them together the best we can. This should gives us a more representative picture of car model popularity. The downside is that our grouping is arbitary. The strategy I’ve taken is to split the strings on the first space, and take the first chunk. This works for the majority of cases, but still misses some.

# helper function to split Model on the first " " and extract the first chunk
simplify_model <- function(model) {
  split_list <- stringr::str_split(model, " ") %>% `[[`(1)
  
  split_list[[1]]
}

tidy_cars %>% 
  filter(Year == 2017, Quarter == 2, Region == "UK") %>%
  mutate(simple_model = map_chr(Model, ~ simplify_model(.x))) %>%
  group_by(Make, simple_model) %>%
  summarise(Count = sum(Count)) %>% 
  ungroup() %>%
  top_n(25, Count) %>%
  ggplot(aes(x = (reorder(simple_model, Count)), y = Count, fill = Make)) + 
    geom_bar(stat = "identity") + 
    scale_y_continuous(labels = scales::comma) +
    labs(x = "Model",
         y = "Count", 
         title = "The most popular car makes as of Q2 2017", 
         subtitle = "Ford, Vauxhall and Volkswagen are the most popular makes") +
    coord_flip()