Happy Little Accidents
This morning, I just found out about #tidytuesday and I figured it would be a fun thing to play with.
For my first foray into tidytuesday, we have data on Bob Ross’s paintings during his show. The data were compiled by fivethirtyeight and reported here.
The data are available here. On the info page for the data, they show how to load the data and give an example of some basic tidying. I’ll do that below:
library(dplyr) library(tidyr) library(stringr) bob_ross <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-06/bob-ross.csv") # to clean up the episode information bob_ross <- bob_ross %>% janitor::clean_names() %>% separate(episode, into = c("season", "episode"), sep = "E") %>% mutate(season = str_extract(season, "[:digit:]+")) %>% mutate_at(vars(season, episode), as.integer) head(bob_ross)
## # A tibble: 6 x 70 ## season episode title apple_frame aurora_borealis barn beach boat bridge ## <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 1 "\"A… 0 0 0 0 0 0 ## 2 1 2 "\"M… 0 0 0 0 0 0 ## 3 1 3 "\"E… 0 0 0 0 0 0 ## 4 1 4 "\"W… 0 0 0 0 0 0 ## 5 1 5 "\"Q… 0 0 0 0 0 0 ## 6 1 6 "\"W… 0 0 0 0 0 0 ## # … with 61 more variables: building <dbl>, bushes <dbl>, cabin <dbl>, ## # cactus <dbl>, circle_frame <dbl>, cirrus <dbl>, cliff <dbl>, ## # clouds <dbl>, conifer <dbl>, cumulus <dbl>, deciduous <dbl>, ## # diane_andre <dbl>, dock <dbl>, double_oval_frame <dbl>, farm <dbl>, ## # fence <dbl>, fire <dbl>, florida_frame <dbl>, flowers <dbl>, ## # fog <dbl>, framed <dbl>, grass <dbl>, guest <dbl>, ## # half_circle_frame <dbl>, half_oval_frame <dbl>, hills <dbl>, ## # lake <dbl>, lakes <dbl>, lighthouse <dbl>, mill <dbl>, moon <dbl>, ## # mountain <dbl>, mountains <dbl>, night <dbl>, ocean <dbl>, ## # oval_frame <dbl>, palm_trees <dbl>, path <dbl>, person <dbl>, ## # portrait <dbl>, rectangle_3d_frame <dbl>, rectangular_frame <dbl>, ## # river <dbl>, rocks <dbl>, seashell_frame <dbl>, snow <dbl>, ## # snowy_mountain <dbl>, split_frame <dbl>, steve_ross <dbl>, ## # structure <dbl>, sun <dbl>, tomb_frame <dbl>, tree <dbl>, trees <dbl>, ## # triple_frame <dbl>, waterfall <dbl>, waves <dbl>, windmill <dbl>, ## # window_frame <dbl>, winter <dbl>, wood_framed <dbl>
There are a couple of paintings that are named the same thing.
bob_ross <- bob_ross %>% group_by(title) %>% mutate(title_count = 1:group_size(.)) %>% ungroup() %>% mutate(title = if_else(title_count > 1, paste(title, title_count), title)) %>% select(-title_count)
There are some columns that are relate to the frame the painting got put into and some columns that relate to elements inside each painting. Like the fivethirtyeight crew, I’m more interested in the elements inside the paintings as opposed to the frames, so i’ll go ahead and drop those columns
painting_data <- bob_ross %>% select(-contains("frame"), -steve_ross, -guest, -diane_andre)
In my professional work, I perform social network analysis, so let’s go ahead and look at networks of elements in Bob Ross’s paintings!
Networks of Bob Ross Paintings
To get us looking at social networks, we first need to take the data from this wide format and turn it into an edge list. The edge list will connect each painting to every element that is inside it. From there, we can get a picture of what the network of paintings looks like!
Organizing the Edge List
library(igraph) library(ggnetwork) titles <- painting_data[["title"]] incidence_mat <- painting_data %>% select(-season, -episode, -title) incidence_mat <- as.matrix(incidence_mat) rownames(incidence_mat) <- titles incidence_graph <- graph_from_incidence_matrix(incidence_mat, ) ggplot(incidence_graph, aes(x = x, y = y, xend = xend, yend = yend)) + geom_edges() + # type is TRUE if a node is an episode and FALSE if it's an element geom_nodes(aes(color = type)) + theme_blank()
That’s real busy! It looks like there are a few episodes that only have a few elements and some episodes that have many elements in it. There’s also that one episode that shares three elements with another episode and no others.
Let’s see if we can clean this up a bit! First, i’ll connect episodes by how many elements they share.
episode_x_episode <- incidence_mat %*% t(incidence_mat) ep_x_ep_graph <- graph_from_adjacency_matrix(episode_x_episode, # only look at the upper part of the matrix since it is symetrical mode = "upper", # the connections are weighted weighted = TRUE, # don't count self-loops diag = FALSE) ggnetwork(ep_x_ep_graph, weight = "weight") %>% ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_edges() + geom_nodes() + theme_blank()
Not much to look at. Or better yet, Ross’s paintings tend to share something in common with other paintings.
Instead of looking at all the features at once, why don’t we look at groups of features. I’ve gone ahead and grouped each feature into different categories. I’ll load that up and then split up the painting df into different categories.
feature_categories <- readr::read_csv("../../data/Bob Ross/ross_painting_features.csv") painting_categories <- painting_data %>% gather(feature, value, -season, -episode, -title) %>% left_join(feature_categories, by = "feature") %>% filter(value > 0) %>% count(season, episode, title, category) %>% spread(category, n) %>% mutate_at(vars(-season, -episode, -title), ~if_else(is.na(.), 0, 1)) titles <- painting_categories[["title"]] incidence_mat <- painting_categories %>% select(-season, -episode, -title) %>% as.matrix() rownames(incidence_mat) <- titles episode_x_episode <- incidence_mat %*% t(incidence_mat) ep_x_ep_graph <- graph_from_adjacency_matrix( episode_x_episode, # only look at the upper part of the matrix since it is symetrical mode = "upper", # the connections are weighted weighted = TRUE, # don't count self-loops diag = FALSE) # add in season as an attribute of each episode vertex_attr(ep_x_ep_graph, "season") <- painting_categories[["season"]]
Now to plot! I’m going to build these plots with a cutpoint though, because otherwise they become very unweildy.
ep_x_ep_graph %>% delete_edges(which(edge_attr(ep_x_ep_graph)$weight <= 2)) %>% delete_vertices(., which(igraph::degree(.) == 0)) %>% ggnetwork(weight = "weight") %>% ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_edges(color = "gray") + geom_nodes(aes(color = season)) + theme_blank() + Friedman::scale_color_drexel(discrete = FALSE) + labs(title = "Paintings that share 3, 4, or 5 feature categories")
Above, I’ve colored nodes by season and only shown connections between episodes if those episodes share more than two classes of feature (e.g. two episodes have a sky feature, tree and plant feature, and a man-made feature).
What happens if we filter to only show edges if two espodes share 4 features? 5?
Looking at the paintings that share more than 3 features really brings about that there is a central group of paintings that all share a lot in common and then a few different groups of paintings that all have different things in common.
Now looking just at paintings that share 5 feature categories, it can really be seen that there is a central set of themes that is very common accross seasons. What are they?
episodes_of_interest <- ep_x_ep_graph %>% delete_edges(which(edge_attr(ep_x_ep_graph)$weight <= 4)) %>% delete_vertices(., which(igraph::degree(.) == 0)) %>% vertex_attr("name") painting_categories %>% filter(title %in% episodes_of_interest) %>% mutate(feature_sum = aquatic + clouds + `general nature` + `man made` + nature + sky + `trees and plants`) %>% filter(feature_sum > 4) %>% summarize_at(vars(-season, -episode, -title, -feature_sum), sum)
## # A tibble: 1 x 7 ## aquatic clouds `general nature` `man made` nature sky `trees and plant… ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 27 23 27 19 3 11 26
All (or near all) of these paintings have an aquatic element, a general nature element, and trees and plants. So the real defining factor between the three groups in the plot above probably has to do with the other four categories. Next time I play with these data, I’ll look at that and do a deeper dive on components of the network of Bob Ross paintings.