December 30, 2015
Best of BP 2015
We Know How Happy You Are
With the year winding to a close, Baseball Prospectus is revisiting some of our favorite articles of the year. This was originally published on February 18, 2015.
People love to talk about the mood of a franchise, or the collective feeling of its fanbase. Are they dispirited, optimistic? Ecstatic following a World Series win, or broken after an agonizing walkoff loss? For the most part, we leave it to the beat writers to gauge mood (which is not necessarily a bad thing), without any kind of backing for their proclamations (which might be a bad thing).
Hypothetically, fans are a reservoir of great wisdom (collectively, although perhaps not individually). So tapping into the mood of a fanbase could be more than interesting; it could be useful. But, beyond inquiring with potentially biased observers, there was little we could do to objectively or quantitatively measure a fanbase’s mood.
In this article, I’m going to present one way to gauge the happiness of a fanbase, using a text analysis of the website Reddit. Reddit is an aggregation engine, to which individual users can submit links to other websites or original content, which is then upvoted, downvoted, and commented upon. Importantly, Reddit self-organizes into communities of like-minded individuals, one category of which is fans of a sports team. As a result, there is one team-specific subreddit (community) for each MLB team’s fans, along with a huge body of text from that team’s fans.
I used a freely available program[1] to harvest Reddit comments and posts en masse, over a month-long time period (roughly Jan. 5-Feb. 5). The program spits out a list of words, along with the number of times each word occurs. So, for example, the Yankees subreddit used the word “money” 25 times over that month. The small-market Rays subreddit, on the other hand, used the same word merely five times.
To figure out how happy each team’s fanbase is, I did what’s called ‘sentiment analysis’ on each list of words. The idea is this: Some words tend to be used in positive situations, and indicate that the writer is happier, while others are more negative in connotation, and suggestive of despair. For example, ‘excellence’ is a very positive word, and ‘deception’ an unpleasant one. If a team’s comments are filled with words like excellence, and bereft of words like deception, they are probably happy, and vice versa.
To do the sentiment analysis, I used a list of words (called AFINN-111[2]) which had been manually assigned levels of positivity from -5 to 5. To give you an idea of how it works, the word ‘excellence’ is rated a +3 on this list, while ‘deception’ is rated -3. Then I matched up words from the Reddit analysis with the sentiment list, multiplied each word’s rating by the number of times it was used in each subreddit, and added up the results. The higher the total score, which I called the total affect rating, the happier the fanbase[3]. Here’s what I found, for all 30 teams, sorted by total affect rating, our proxy for fanbase happiness.
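For readers who want to see the scoring step concretely, here is a minimal Python sketch of it. It is not the script used for the article. The two lexicon entries are the ratings quoted above; the word counts, the function names, and the exact form of the affect ratio (positive score divided by the size of the negative score, following the description in footnote [3]) are assumptions made for illustration.

```python
# Tiny excerpt of an AFINN-style lexicon (ratings run from -5 to +5).
# These two ratings are the ones quoted in the article.
LEXICON = {"excellence": 3, "deception": -3}

# Hypothetical word counts for one subreddit (made up for this example).
word_counts = {"excellence": 4, "deception": 1, "pitcher": 10}


def total_affect(counts, lexicon):
    """Sum of (rating x occurrences) over the words found in the lexicon."""
    return sum(lexicon[w] * n for w, n in counts.items() if w in lexicon)


def affect_ratio(counts, lexicon):
    """Positive affect divided by the magnitude of negative affect.

    Assumed definition; returns infinity if no negative words occur.
    """
    pos = sum(lexicon[w] * n for w, n in counts.items() if lexicon.get(w, 0) > 0)
    neg = sum(-lexicon[w] * n for w, n in counts.items() if lexicon.get(w, 0) < 0)
    return pos / neg if neg else float("inf")


print(total_affect(word_counts, LEXICON))  # 4*3 + 1*(-3) = 9
print(affect_ratio(word_counts, LEXICON))  # 12 / 3 = 4.0
```

Words that do not appear in the lexicon, like "pitcher" here, contribute nothing to either number, which is why a chattier subreddit does not automatically get a higher ratio.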
It’s Always Sunny in {Insert City Here}
But perhaps these fanbases aren’t any happier than the rest of the internet. To check that, I looked at a few other subreddits and calculated their levels of positive affect. For example, I scrutinized a collection of texts from city-based subreddits (/r/Chicago, /r/Miami, etc.). No city subreddit I looked at had an affect ratio higher than the lowest ratio among the team-specific subreddits. All in all, this makes a lot of sense: baseball is an optional hobby, so if someone doesn’t like participating in it, they probably won’t.
The Causes of Fan Happiness
Another possibility is that the fanbase is less concerned with past performance and more with the future. It’s possible that fans are already over the results of last season and have moved on to thinking about next season. We can check this by going to PECOTA, which objectively projects the performance of every team for the next year. PECOTA stands in here for the conventional wisdom, reflecting what we think we know about next year’s likely performance.
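Footnote [4] notes that the correlations in this section are Spearman (rank-order) coefficients. As a rough illustration of what that statistic does, the sketch below ranks two lists and computes the Pearson correlation of the ranks. The function names and the five data points are invented for the example; they are not PECOTA projections or measured affect ratios.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical projected wins and affect ratios for five made-up teams.
projected_wins = [91, 78, 85, 70, 88]
affect_ratios = [2.1, 2.4, 2.0, 1.8, 2.6]

print(round(spearman(projected_wins, affect_ratios), 3))  # 0.5
```

Because only the ordering of the values matters, the coefficient picks up any monotonic relationship, which is the reason the footnote gives for preferring it when the relationship does not look linear.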
Individually, past performance and future projections contribute relatively little to explaining a fanbase’s mood. But perhaps together, there are some synergistic effects that can explain more of the variation. I put both predictors into a combined regression, and checked to see how well I could predict the resulting affect ratio.
Irrational Exuberance
There’s no surprise in number one. The Giants’ total happiness is off the charts, which I think must be the result of winning the World Series (again and again and again, in all even-numbered years since 2010). The magnitude of the effect is kind of incredible: The Giants fans have a total affect number about 50 percent higher than the next happiest fanbase.
The other teams are a bit more surprising. The Seattle Mariners were significant to the playoff picture last year for the first time in a few seasons, and they project to be above average this year as well. Maybe this excess happiness is the side effect of that return to relevancy. A similar argument could be made for the White Sox, whose shrewd offseason has seen their postseason odds increase substantially.
The Braves confuse me, both at the organizational and fanbase levels. The team is not projected to be competitive, nor were they last year, and yet their hopes spring eternally enough to invest $44 million in the dubious defense of Nick Markakis. On top of that, the team is undergoing a gruesome publicly funded stadium controversy, with allegations of political corruption. How the fans remain so optimistic is anybody’s guess.
And the reverse, the fanbases that are most groundlessly unhappy:
Three of the top five are in the AL East, and that might be more than coincidence. It must be frustrating to see your team regularly compete with great teams outside of the division, only to contend for division titles and wild cards with two of the richest teams in baseball, along with three less wealthy but exceedingly well-run teams (one of whom possesses occult powers).
Beyond them, we have the Angels, who are as puzzling as the Braves above. They are good, young, and projected to win 91 games after pacing all of baseball with 98 wins last year. Their continuing despair is mysterious.
There could be a variety of reasons that explain deviations from their expected behavior, some of which I’ve explained above. I have a faint and probably baseless hope that some of the deviations in expected happiness are the result of the fanbases being able to weigh and take into account factors beyond PECOTA’s considerable purview, like changes in coaching staff (the Rays and the Cubs) or other positive or negative indications from their organization. If that’s the case, then maybe the teams with exceptionally happy or sad redditors (relative to expectations) might be able to tell us something about the accuracy of the projections.
To that end, as the season goes on, I’m hoping to continue tracking the mood of the redditors, checking back in a few times during the year to see how their sentiment scores have changed. It would be fun to see when each fanbase gives up on a team, or if they simply don’t until the very last gasp; or how they react to winning or losing streaks, injuries to their core players, and so on. On top of that, although it’s a very long shot, maybe the mood of the fans will be able to tell us something PECOTA doesn’t know.
[1] Thanks to GitHub user rhiever for making this script.
[2] Check out this paper for some details about the word sentiment list.
[3] Fan bases also differed in terms of their levels of Reddit participation, so in addition to the total affect rating, I calculated the ratio of positive to negative affect scores, which I term the affect ratio. The latter statistic corrects for the variation in participation, and could be used as another measure of fanbase ‘happiness’. Surprisingly, however, affect ratio was not correlated with the total number of words in a subreddit, indicating that participation and happiness are somewhat decoupled. The other results also mostly hold if I look at affect ratio instead of total, although some of the surprisingly happy/unhappy teams change.
[4] For these correlations, I am using the Spearman, i.e. rank-order, correlation coefficient, because the relationships don’t look linear to me.
[5] Along with the total number of words on each subreddit, to account for the level of participation.
[6] To guard against overfitting, I built a support-vector machine model with 2-fold cross-validation, because that’s all this small sample of data could bear. However, there still exists the possibility of overfitting, with so few datapoints. I would like to have more data than just the 30 teams, but unfortunately I am not yet able to harvest subreddit information from earlier than a year ago.
Robert Arthur is an author of Baseball Prospectus. Follow @No_Little_Plans
For the city of Chicago, your study may confuse fan happiness with fan optimism. Cubs fans are happy even when not optimistic. In my 30 years of direct observation, White Sox fans are happy only when optimistic. I have not found White Sox fans to be irrationally optimistic away from Reddit. I think you may be facing self-selection bias in those that use the Reddit.
Orioles fans are simply desperate, and that desperation takes over for good sense. Orioles ownership is desperately greedy. I am willing to lead a torches and pitchfork charge to change the ownership to one that will at least properly invest in scouting, drafting and signing minor league free agents.