A few months ago, I was talking with a friend of mine about the idea for this blog and how I wanted to use data science to explore beer. He suggested that I use the blog as well as beer to learn something new about where I live. So I ask, what can beer teach me about Philadelphia?

The first thing I need? Data!

Oddly enough, it’s actually pretty challenging to get access to high-quality, current beer data.

I chose to use RateBeer’s data, mostly because they have an easily accessible API that meets my needs better than anyone else’s. They also disclose how they calculate their average beer rating, allowing me to see what’s under the hood. In the footnotes, I briefly explain some alternatives.

Collecting Data

I want to look at breweries in the area. Sadly, the RateBeer API doesn’t have a feature to search for breweries in the area. There is, however, a way to query what beers a brewery makes. To get that list, though, I need to know the brewery’s unique ID. That’s easy enough to find: the URL for a brewery on RateBeer is of the form:

https://www.ratebeer.com/brewers/<BREWERY_NAME_HERE>/<BREWER_ID_HERE>

So, as an example, 166 is the ID for Yards Brewing Company. The URL is:

https://www.ratebeer.com/brewers/yards-brewing-company/166/
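Since the ID is the last segment of the path, it can be pulled out of the URL with base R. This is only a minimal sketch that assumes the URL pattern above holds; `extract_brewer_id()` is a hypothetical helper name, not something from RateBeer or any package.

```r
# Hypothetical helper: pull the numeric brewer ID off the end of a
# RateBeer brewer URL. Assumes the ID is always the last path segment.
extract_brewer_id <- function(brewer_url) {
  trimmed <- sub("/+$", "", brewer_url)  # drop any trailing slash
  as.integer(basename(trimmed))          # last path segment, as an integer
}

yards_id <- extract_brewer_id(
  "https://www.ratebeer.com/brewers/yards-brewing-company/166/"
)
```

Here `yards_id` holds the ID that gets passed to the brewery query.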

RateBeer’s API uses GraphQL. It’s beyond the scope of this post to dive into GraphQL, so instead, I’ll explain just enough of it to cover the queries that I make.

GraphQL queries are written in a syntax that looks a lot like JSON without the values. Basically, I specify that I am making a query. Nested in that is the type of query I want to make (along with any arguments to it), and nested inside that are the fields I want back in the response.

So, a query for the name of the beer with the ID number 4934 looks like this:

query {
  beer(id: 4934) {
    name
  }
}

Similarly, my query for the beers made by Yards will look like this:

query{
  beersByBrewer(brewerId: 166) { # 166 is Yards
    totalCount # This gives the total number of beers in the beer list
    items{ # For each beer, I want these items...
      name # the name of the beer
      abv # the beer's ABV
      averageRating # the average rating of the beer
      ratingCount # the number of ratings the beer has
      isRetired # is the beer retired?
      style{ 
        # Style needs to be jumped into one level because when I query style, I
        # can also ask for a description of the style, and can even jump into 
        # recommended glassware. That said, all I want is the style name. 
        name
      }
    }
  }
}
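Because only the brewer ID changes from one brewery to the next, the query text can be generated rather than typed out each time. The sketch below uses base R's `sprintf()`; `build_beers_query()` is a hypothetical helper name, and the fields are simply the ones from the query above.

```r
# Hypothetical helper: build the beersByBrewer query for any brewer ID.
# The %d placeholder is filled in with the integer ID.
build_beers_query <- function(brewer_id) {
  sprintf(
    "query {
  beersByBrewer(brewerId: %d) {
    totalCount
    items {
      name
      abv
      averageRating
      ratingCount
      isRetired
      style {
        name
      }
    }
  }
}",
    as.integer(brewer_id)
  )
}

yards_query <- build_beers_query(166)
cat(yards_query)
```

The result is a single character string that can be dropped in wherever the hand-written query would go.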

Now that I know what format to make my request in, it’s time to actually get my first bit of data off the API!

Hello, httr

httr is a package developed by Hadley Wickham of RStudio to make it easy to make HTTP requests. I won’t go into the package or the nuances of HTTP here, but some good resources for httr include the quickstart vignette and Bradley Boehmke’s post on using httr. The httr syntax is quite simple. The main functions are named after HTTP verbs such as GET() and POST() (httr is a wrapper around the curl package), and each call starts with the URL, followed by arguments that modify the URL or specify what to send along with the request.

RateBeer’s API is at https://api.ratebeer.com/v1/api/graphql. In the header of the request, content type and response type are required along with an API key. The query itself goes in the body of the request. So, my call to the RateBeer API to get the beers made by Yards looks like this:

library(httr)

API_key <- Sys.getenv("rateBeer_API_key") 

URL <- "https://api.ratebeer.com/v1/api/graphql"

beers_by_yards <- POST(URL,
         body = list(
           query = 
"query{
  beersByBrewer(brewerId: 166) {
    totalCount
    items{
      name
      abv
      averageRating
      ratingCount
      isRetired
      style{
        name
      }
    }
  }
}",
           variables = "{}", 
           operationName = NULL),
         encode = "json", # tells httr to encode the body of the request as json
         add_headers("content-type" = "application/json", 
                     "Accept" = "application/json", 
                     "x-api-key" = API_key))
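To make the shape of the request more concrete, here is roughly the JSON body that a call like the one above sends. This is a sketch, and it assumes that httr serializes the list with jsonlite and unboxes length-one vectors. `request_body` and `payload` are illustrative names, the query is shortened, and nothing here contacts the API.

```r
library(jsonlite)

# The same kind of body as in the POST() call, with a shortened query
# and without operationName (which was NULL anyway).
request_body <- list(
  query = "query { beersByBrewer(brewerId: 166) { totalCount } }",
  variables = "{}"
)

# auto_unbox = TRUE writes length-one vectors as JSON scalars rather
# than as one-element arrays.
payload <- toJSON(request_body, auto_unbox = TRUE)
cat(payload)
```

The point to notice is that the GraphQL query travels as a plain string inside a JSON object in the POST body.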

Parsing the Response

content() is httr’s function for extracting content from a response. Using the as = argument, we can have the function give us the body of the response as text, which in this case is a JSON string. Then we can use the jsonlite package to make the data easier to work with.

library(jsonlite)

# as = "text" returns the raw JSON string; setting the encoding avoids a guess
json <- content(beers_by_yards, as = "text", encoding = "UTF-8")

parsed_json <- fromJSON(json, flatten = TRUE)

Recall that the response is JSON. This means that all of our items are nested in the same way as we requested them. So, to get the number of beers that Yards makes, as well as a data frame with those beers, we work through those levels.

beer_count <- parsed_json$data$beersByBrewer$totalCount

beer_df <- parsed_json$data$beersByBrewer$items
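The same walk through the levels can be tried offline on a small, made-up response. The JSON below is invented purely to mimic the shape of the real response (the beers are fake), and the `sample_*` names are illustrative.

```r
library(jsonlite)

# Made-up response with the same nesting as the real one (two fake beers).
sample_json <- '{
  "data": {
    "beersByBrewer": {
      "totalCount": 2,
      "items": [
        {"name": "Example Stout", "abv": 5.5, "style": {"name": "Stout"}},
        {"name": "Example IPA",   "abv": 7.0, "style": {"name": "IPA"}}
      ]
    }
  }
}'

sample_parsed <- fromJSON(sample_json, flatten = TRUE)

# Walk down the same levels as the query: data -> beersByBrewer -> ...
sample_count <- sample_parsed$data$beersByBrewer$totalCount
sample_df    <- sample_parsed$data$beersByBrewer$items

# flatten = TRUE turns the nested style object into a style.name column.
names(sample_df)
```

This is also where the `style.name` column in the table comes from: it is the nested `style { name }` field after flattening.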

beer_count
## [1] 126

library(knitr)      # provides kable()
library(kableExtra) # provides kable_styling()

kable(beer_df, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| name | abv | averageRating | ratingCount | isRetired | style.name |
|---|---|---|---|---|---|
| Yards IPA | 7.0 | 3.487598 | 452 | FALSE | IPA |
| Yards Philadelphia Pale Ale | 4.6 | 3.087262 | 402 | FALSE | Pale Ale - American / APA |
| Yards General Washington’s Tavern Porter | 7.0 | 3.414651 | 373 | FALSE | Porter |
| Yards Thomas Jefferson’s Tavern Ale | 8.0 | 3.384788 | 350 | FALSE | Strong Ale - American |
| Yards Extra Special Ale (ESA) | 6.3 | 3.312626 | 344 | FALSE | Bitter - Premium / Strong / ESB |
| Yards Brawler | 4.2 | 3.123125 | 322 | FALSE | Mild Ale |
| Yards Love Stout | 5.5 | 3.373304 | 284 | FALSE | Stout |
| Yards Poor Richard’s Tavern Spruce | 5.0 | 3.281131 | 230 | FALSE | Flavored - Other |
| Yards Saison (pre-2009) | 4.7 | 3.009691 | 203 | TRUE | Saison / Farmhouse / Grisette |
| Yards Saison (2009-) | 6.5 | 3.326367 | 159 | FALSE | Saison / Farmhouse / Grisette |

So, where are all 126 beers? The API gives us 10 beers at a time. When we make the request to the API, we can tell it where to start that list of 10 beers, alongside our request to look at beers from a certain brewery.

In my next post, I’ll show how we can get all 126 of those beers, and beers from other breweries, programmatically.
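The paging idea comes down to a little arithmetic: given the total count and a page size of 10, work out where each page would start. The sketch below is only that arithmetic. `page_starts()` is a hypothetical helper, and the zero-based offsets are an assumption, since how the RateBeer API actually expects the starting position to be passed is not covered here.

```r
# Hypothetical helper: starting positions for each page of results.
# Assumes zero-based offsets and a fixed page size, neither of which
# is confirmed for the RateBeer API.
page_starts <- function(total_count, page_size = 10) {
  if (total_count <= 0) return(integer(0))
  seq(0, total_count - 1, by = page_size)
}

# One entry per request that would be needed for 25 items.
page_starts(25)
```

The length of the returned vector is the number of requests needed to cover the whole list.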


    • Beer Advocate expressly forbids scraping and does not have an official API.
    • Untappd has an API, but they don’t give out API keys to people who are just interested in data. If I build an app, maybe my decision will change, but in the meantime, I won’t be using their API. It looks like they may not expressly forbid scraping or crawling on the site, but scraping has its own challenges. I may cover it in the future, but for now, I just want to use an API.
    • BeerDB looks like an awesome idea - beer data for developers! Yet, the API doesn’t show ratings and you can only get ABV if you are a premium user. I can get all the information that I am looking for from other APIs, so no need to use and pay for this one.
    • Open Beer Database hasn’t updated the database since 2011. That’s a solid no.
    • The Beer Spot looks like it could be a fun community, but considering that (A) they may not have very many users and (B) no one has reviewed Yuengling (beer geek or not, it’s a Philadelphia-area staple), I’m not going to use their API.