For my second project at Metis, I scraped a beer review website and used supervised learning to predict beer ratings, ending up with a low-complexity, high R-squared random forest regressor.
I scraped Beer Advocate’s website for the following attributes: average beer rating, beer name, style, alcohol by volume (ABV), brewery, state, and the year the beer was released. A typical beer page is shown below.
I discovered that there is a web page for each state listing its top 25 to 100 beers, from which I could grab links to the individual beer pages. I then discovered a web page that listed all the different states, from which I could scrape the state page URLs.
So with only a few lines of code, I was able to compile a list of 3,824 beer pages and then scrape the requisite attributes.
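The two-level crawl described above (index page to state pages, state pages to beer pages) can be sketched roughly as follows. This is not the original scraping code: the `/top/` and `/beer/` path prefixes, the `example.com` base URL, and the sample markup are all hypothetical, and the sketch parses in-memory HTML strings with the standard library instead of fetching real pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags whose path starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only matching links, skipping duplicates but preserving order.
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)


def extract_links(html, prefix, base_url):
    """Return absolute, de-duplicated URLs for links matching the prefix."""
    parser = LinkCollector(prefix)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


# Hypothetical markup and paths, for illustration only.
BASE = "https://example.com"
index_html = (
    '<a href="/top/ca/">California</a> '
    '<a href="/top/or/">Oregon</a> '
    '<a href="/about/">About</a>'
)
state_html = (
    '<a href="/beer/1/">Beer A</a> '
    '<a href="/beer/2/">Beer B</a> '
    '<a href="/beer/1/">Beer A</a>'
)

state_urls = extract_links(index_html, "/top/", BASE)  # step 1: state pages
beer_urls = extract_links(state_html, "/beer/", BASE)  # step 2: beer pages
```

In a real crawl, each entry in `state_urls` would be downloaded and passed through the second step, with a polite delay between requests.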
Feature engineering: It’s all about the brewery
Despite having acquired a lot of data, most of the variables I had were strings and therefore couldn’t be used in a linear regression without some transformation. This turned out to be a blessing, not a curse, as it challenged me to be creative with my features and dive headfirst into feature engineering.
One of the main things I was curious about was how important the brewery is for beer ratings. Did users tend to care whether this was a stout with notes of chocolate? Or just that it was from Russian River?
To start, I created a variable to proxy brewery size by counting up the number of beers for each brewery in the dataset. This variable turned out to be helpful but not critical in predictions.
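A minimal sketch of the size proxy, assuming each scraped beer is held as a dictionary. The field names (`brewery`, `brewery_size`) and the sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical rows standing in for the scraped dataset.
beers = [
    {"name": "Beer A", "brewery": "Brewery X"},
    {"name": "Beer B", "brewery": "Brewery X"},
    {"name": "Beer C", "brewery": "Brewery Y"},
]

# Count how many beers each brewery has in the dataset.
beer_counts = Counter(b["brewery"] for b in beers)

# Attach the count to every beer as a brewery-size feature.
for b in beers:
    b["brewery_size"] = beer_counts[b["brewery"]]
```

Note that this measures how many of a brewery's beers made it into the scraped lists, not the brewery's actual output, which is why it is only a proxy.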
I also created a variable to proxy how good beers from the brewery tended to be. Rather than taking the simple average of all the beers at the brewery, for each beer I averaged the ratings of all other beers from that brewery. This was critical to avoid data leakage (having target data leak into the features).
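The leave-one-out average can be sketched as below. The function name, field names, and sample ratings are assumptions for illustration. What to do with a brewery that has only one beer in the data is a modeling choice the post does not spell out, so this sketch simply leaves the value as `None`.

```python
from collections import defaultdict


def add_leave_one_out_rating(beers):
    """For each beer, average the ratings of the brewery's OTHER beers."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for b in beers:
        totals[b["brewery"]] += b["rating"]
        counts[b["brewery"]] += 1

    for b in beers:
        n_others = counts[b["brewery"]] - 1
        if n_others > 0:
            # Subtract this beer's own rating so the target never
            # appears inside its own feature.
            b["brewery_rating"] = (totals[b["brewery"]] - b["rating"]) / n_others
        else:
            # Only beer from this brewery: no other ratings to average.
            b["brewery_rating"] = None
    return beers


# Hypothetical ratings, for illustration only.
sample = add_leave_one_out_rating([
    {"brewery": "Brewery X", "rating": 4.0},
    {"brewery": "Brewery X", "rating": 4.5},
    {"brewery": "Brewery X", "rating": 3.5},
    {"brewery": "Brewery Y", "rating": 4.2},
])
```

With a plain brewery average, every beer's own rating would be baked into its feature, which would inflate the model's apparent accuracy, most of all for breweries with few beers.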
Models: The simpler the better
The best model proved to be a random forest, though linear regression did surprisingly well. The fact that linear regression did so well (R-squared of 0.69) makes sense given that the data show people tend to like beers with more alcohol (higher ABV) and give higher ratings to more recent beers. More importantly, the average rating of other beers at the brewery seems to be linearly related to the rating of a new beer by that same brewery (see scatterplot below). In short, the brewery is key.
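For readers less familiar with the metric, R-squared is the share of variance in the ratings that the model's predictions account for: one minus the ratio of squared prediction errors to squared deviations from the mean rating. A small from-scratch sketch follows. The function name and numbers are illustrative only and are not taken from the project's data.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always guessing the mean rating.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up ratings, for illustration only.
actual = [3.0, 4.0, 5.0]
score = r_squared(actual, [3.5, 4.0, 4.5])
```

A score of 0 means the model does no better than always predicting the mean rating, and a score of 1 means perfect predictions.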
I experimented with two different style taxonomy classifications but neither seemed to matter. The inclusion of 50 state dummies didn’t improve predictions either.
A surprisingly simple model did quite well. With only five variables, my random forest model had an R-squared around 0.73.
- Brewery rating
- Zip code
- Year released
- Brewery size proxy
Location, location, location
While state dummies did not add predictive power to the model, I scraped zip code data to see if more local location data mattered. Maybe it didn’t help to know a beer was from a brewery in California, but maybe it would help to know if it was a San Francisco brewery versus a San Diego one.
The map below shows the average rating of beers at breweries across the United States. Coastal breweries seem to have higher-rated beers. See, for example, the dark blue dots in the Northeast, California, and Oregon, among other places.
It would be interesting to add in city population data by zip code to see if breweries in larger cities tend to produce higher-rated beer. If this relationship holds, it may be the case that there’s more competition among breweries in larger cities, which tends to lead to higher-quality beers. Or top brewers may choose to be near each other to maximize learning. It’s important to note that beer is often shipped regionally or nationwide, so the people rating a San Francisco brewery need not live in San Francisco (i.e., higher city ratings are unlikely to be a result of selection bias).
In the future, I look forward to creating an interactive map with D3 to further explore patterns.