A few months ago, I was talking with a friend of mine about the idea for this blog and how I wanted to use data science to explore beer. He suggested that I use the blog as well as beer to learn something new about where I live. So I ask, what can beer teach me about Philadelphia?
The first thing I need? Data!
Oddly enough, it’s actually pretty challenging to get access to high-quality, current beer data.
I chose to use RateBeer’s data, mostly because they have an easily accessible API that meets my needs better than the alternatives. They also disclose how they calculate their average beer rating, allowing me to see what’s under the hood. In the footnotes, I briefly explain some alternatives.
Collecting Data
I want to look at breweries in the area. Sadly, the RateBeer API doesn’t have a feature to search for breweries in a given area. There is, however, a way to query which beers a brewery makes. To get that list, though, I need to know the brewery’s unique ID. That’s easy enough to find: the URL for a brewery on RateBeer is of the form:
https://www.ratebeer.com/brewers/<BREWERY_NAME_HERE>/<BREWER_ID_HERE>
So, as an example, 166 is the ID for Yards Brewing Company. The URL is:
https://www.ratebeer.com/brewers/yards-brewing-company/166/
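If you would rather pull the ID out of a URL than read it by eye, a small helper can do it. This is a minimal sketch that assumes the URL pattern shown above holds for every brewery page; `brewer_id_from_url()` is a name I made up for illustration, not something RateBeer or any package provides.

```r
# Hypothetical helper: extract the numeric brewer ID from a RateBeer brewery URL.
# Assumes the URL ends in /brewers/<BREWERY_NAME>/<BREWER_ID>/ (trailing slash optional).
brewer_id_from_url <- function(url) {
  # Keep only the run of digits that follows the brewery name
  id <- sub("^.*/brewers/[^/]+/([0-9]+)/?$", "\\1", url)
  # sub() returns its input unchanged when the pattern does not match
  if (identical(id, url)) {
    stop("URL does not look like a RateBeer brewery page: ", url)
  }
  as.integer(id)
}

brewer_id_from_url("https://www.ratebeer.com/brewers/yards-brewing-company/166/")
## [1] 166
```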
RateBeer’s API uses GraphQL. It’s beyond the scope of this post to dive into GraphQL, so instead, I’ll explain how it works for the queries that I make.
GraphQL queries are written in a syntax that looks a lot like JSON, and the response comes back as JSON. Basically, I specify that I am making a query. Nested in that call is the type of query I want to make (along with any arguments to it), and nested inside that are the fields I want back in the response.
So, a query for the name of the beer with the ID number 4934 looks like this:
query {
  beer(id: 4934) {
    name
  }
}
Similarly, my query for the beers made by Yards will look like this:
query {
  beersByBrewer(brewerId: 166) { # 166 is Yards
    totalCount # This gives the total number of beers in the beer list
    items { # For each beer, I want these items...
      name # the name of the beer
      abv # the beer's ABV
      averageRating # the average rating of the beer
      ratingCount # the number of ratings the beer has
      isRetired # is the beer retired?
      style {
        # Style needs to be jumped into one level because when I query style, I
        # can also ask for a description of the style, and can even jump into
        # recommended glassware. That said, all I want is the style name.
        name
      }
    }
  }
}
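Since the only thing that changes from brewery to brewery is the ID, the query text can be generated rather than retyped. The sketch below is one way to do that in R; `build_brewer_query()` is a hypothetical helper of my own, and it simply fills the brewer ID into the same query shown above.

```r
# Hypothetical helper: build the beersByBrewer query text for any brewer ID.
build_brewer_query <- function(brewer_id) {
  # %d only accepts integers, so the ID is coerced first and stray text
  # cannot end up inside the query
  sprintf(
    "query {
  beersByBrewer(brewerId: %d) {
    totalCount
    items {
      name
      abv
      averageRating
      ratingCount
      isRetired
      style { name }
    }
  }
}",
    as.integer(brewer_id)
  )
}

# Print the query for Yards (ID 166) to check it by eye
cat(build_brewer_query(166))
```

The result is an ordinary character string, so it can be passed anywhere a hand-written query would go.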
Now that I know what format to make my request in, it’s time to actually get my first bit of data off the API!
Hello, httr
httr is a package developed by Hadley Wickham of RStudio to make it easy to make HTTP requests. I won’t go into the package or the nuances of HTTP here, but some good resources for httr include the quickstart vignette and Bradley Boehmke’s post on using httr.
The httr syntax is quite simple. The main functions are named after HTTP verbs, such as GET() and POST() (httr is a wrapper around the curl package). Each function takes the URL as its first argument, followed by arguments that modify the URL or are sent along with the request.
RateBeer’s API is at https://api.ratebeer.com/v1/api/graphql. In the header of the request, a content type and an accepted response type are required, along with an API key. The query itself goes in the body of the request. So, my call to the RateBeer API to get the beers made by Yards looks like this:
library(httr)

# Read the API key from an environment variable so it stays out of the script
API_key <- Sys.getenv("rateBeer_API_key")
URL <- "https://api.ratebeer.com/v1/api/graphql"

beers_by_yards <- POST(
  URL,
  body = list(
    query =
      "query{
        beersByBrewer(brewerId: 166) {
          totalCount
          items{
            name
            abv
            averageRating
            ratingCount
            isRetired
            style{
              name
            }
          }
        }
      }",
    variables = "{}",
    operationName = NULL
  ),
  encode = "json", # tells httr to encode the body of the request as json
  add_headers(
    "content-type" = "application/json",
    "Accept" = "application/json",
    "x-api-key" = API_key
  )
)
Parsing the Response
content() is httr’s function for extracting content from a response. Using the as = argument, we can have the function give us the body of the response as text, which here is valid JSON. Then we can use the jsonlite package to make the data easier to work with.
library(jsonlite)
# as = "text" returns the raw JSON string instead of letting httr parse it
json <- content(beers_by_yards, as = "text", encoding = "UTF-8")
parsed_json <- fromJSON(json, flatten = TRUE)
Recall that the response comes back as JSON. This means that all of our items are nested in the same way as we requested them. So, to get the number of beers that Yards makes, as well as a data frame with those beers, we work through those levels.
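To see how the nesting carries over without calling the API, here is a toy response parsed the same way. The JSON below is a hand-written stand-in that I am assuming has the same shape as the real reply; it only includes a few fields, and the values are copied from the first two rows of the results table further down.

```r
library(jsonlite)

# Hand-written stand-in for an API response (an assumption about its shape,
# not a captured reply)
sample_json <- '{
  "data": {
    "beersByBrewer": {
      "totalCount": 126,
      "items": [
        {"name": "Yards IPA", "abv": 7.0, "style": {"name": "IPA"}},
        {"name": "Yards Philadelphia Pale Ale", "abv": 4.6,
         "style": {"name": "Pale Ale - American / APA"}}
      ]
    }
  }
}'

sample <- fromJSON(sample_json, flatten = TRUE)

# Each level of the query becomes a level of the parsed list
sample$data$beersByBrewer$totalCount

# flatten = TRUE turns the nested style object into a style.name column
names(sample$data$beersByBrewer$items)
```

The real response is walked the same way: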
beer_count <- parsed_json$data$beersByBrewer$totalCount
beer_df <- parsed_json$data$beersByBrewer$items
beer_count
## [1] 126
library(knitr)      # provides kable()
library(kableExtra) # provides kable_styling()
kable(beer_df, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
name | abv | averageRating | ratingCount | isRetired | style.name |
---|---|---|---|---|---|
Yards IPA | 7.0 | 3.487598 | 452 | FALSE | IPA |
Yards Philadelphia Pale Ale | 4.6 | 3.087262 | 402 | FALSE | Pale Ale - American / APA |
Yards General Washington’s Tavern Porter | 7.0 | 3.414651 | 373 | FALSE | Porter |
Yards Thomas Jefferson’s Tavern Ale | 8.0 | 3.384788 | 350 | FALSE | Strong Ale - American |
Yards Extra Special Ale (ESA) | 6.3 | 3.312626 | 344 | FALSE | Bitter - Premium / Strong / ESB |
Yards Brawler | 4.2 | 3.123125 | 322 | FALSE | Mild Ale |
Yards Love Stout | 5.5 | 3.373304 | 284 | FALSE | Stout |
Yards Poor Richard’s Tavern Spruce | 5.0 | 3.281131 | 230 | FALSE | Flavored - Other |
Yards Saison (pre-2009) | 4.7 | 3.009691 | 203 | TRUE | Saison / Farmhouse / Grisette |
Yards Saison (2009-) | 6.5 | 3.326367 | 159 | FALSE | Saison / Farmhouse / Grisette |
So, where are all 126 beers? The API gives us 10 beers at a time. When we make the request to the API, we can tell it where to start that list of 10 beers, alongside our request to look at beers from a certain brewery.
In my next post, I’ll show how we can get all 126 of those beers, and beers from other breweries, programmatically.
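A quick bit of arithmetic shows how many requests that will take. This sketch assumes the page size stays fixed at 10 and uses the `totalCount` of 126 printed earlier; `pages_needed()` is my own name for illustration, not part of the API.

```r
# Hypothetical helper: number of requests needed to page through a full list
pages_needed <- function(total_count, page_size = 10) {
  # Round up, because a partly filled last page still costs one request
  ceiling(total_count / page_size)
}

pages_needed(126)
## [1] 13
```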
- Beer Advocate expressly forbids scraping and does not have an official API.
- Untappd has an API, but they don’t give out API keys to people who are just interested in data. If I build an app, maybe my decision will change, but in the meantime, no using their API. It looks like they may not expressly forbid scraping or crawling on the site, but scraping has its own challenges. I may cover it in the future, but in the meantime, I want to just use an API.
- BeerDB looks like an awesome idea - beer data for developers! Yet the API doesn’t show ratings, and you can only get ABV if you are a premium user. I can get all the information that I am looking for from other APIs, so no need to use and pay for this one.
- Open Beer Database hasn’t updated the database since 2011. That’s a solid no.
- The Beer Spot looks like it could be a fun community, but considering that (A) they may not have very many users and (B) no one has reviewed Yuengling (beer geek or not, it’s a Philadelphia-area staple), I’m not going to use their API.