Gauging Debate is a data analytics application that tracks Twitter sentiment about presidential candidates in the 2016 US electoral cycle.
Sentiment can be tracked in realtime, or for historical analysis. On the app's main page, click "Live Tracking" for realtime analysis. Click "Previous Debate" to analyze past events. Currently, data from two past debates are available for review - the Sept 16 GOP debate and the Oct 13 Democratic debate. See the Analysis section below for more on how sentiment is measured.
This page provides a full report on what we did, why we did it, and where to go from here.
The app was developed as part of an assignment for Harvard University's CS205, "Computing Foundations for Computational Science". Its authors are Daniel Rajchwald and Andrew Reece. The code for this app is entirely open-source, and can be acquired at its GitHub repo.
The goal of the assignment was to demonstrate an advantage of distributed computing. Speed, scalability, and compatibility are three common motivations to build a software solution with parallel processing methods. Gauging Debate focuses on improving scalability and compatibility. (It needs to be pretty fast, too, in order to provide data in near realtime, but the code is not specifically optimized for speed.)
Twitter is a good example of a data source characterized by the "Three Vs" of big data: Velocity, Volume, and Variety. Running at full volume, the Twitter stream outputs around 7,000 tweets per second - high Velocity. Each tweet is accompanied by 26 fields of semi-structured metadata, such as GPS coordinates, user statistics, images, and URLs - lots of Variety. Including metadata, this all amounts to about a gigabyte of data every 5 minutes (at roughly 0.5 KB per tweet plus metadata) - big Volume. Modern analytics solutions need to be able to handle data that comes with these characteristics - that was part of our challenge.
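A quick back-of-the-envelope check of those figures (assuming 7,000 tweets per second at roughly 0.5 KB each, as above):

```python
# Back-of-the-envelope check of the Twitter firehose volume figures.
# Inputs are the approximate numbers from the text above.
tweets_per_second = 7_000
bytes_per_tweet = 500          # ~0.5 KB per tweet including metadata
seconds = 5 * 60               # five minutes

total_bytes = tweets_per_second * bytes_per_tweet * seconds
total_gb = total_bytes / 1e9
print(f"{total_gb:.2f} GB every 5 minutes")  # ~1.05 GB
```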
It's also important to take into account the varying rate of Twitter data. Consider this chart of tweets about Hillary Clinton just before and during the October 13, 2015 Democratic debate:
Each bar represents a 30-second period, and the y-axis shows how many new, non-retweet posts were created about Hillary Clinton in each period. The debate actually started at 8:30pm EST (around 20:32 on the x-axis). You can see how, at different points during the debate, there's more than an order of magnitude difference in the frequency of tweets (from about 50 to almost 800 per 30-second interval). The Gauging Debate app needs to be able to scale from 50 to 800 tweets per interval (and higher), and having multiple processes computing the analyses in parallel allows for this flexibility.
Gauging Debate is also an exercise in integration and compatibility. We wanted to build an application that fit snugly into the "Big Data Ecosystem". The "Big Data" part may be a debatable buzzword, but there is definitely a software ecosystem that has emerged around working with large, fast analytics solutions, and a lot of it is designed to run operations in parallel. Much of this ecosystem's development is driven by Apache and Amazon Web Services, and you can see in the Data Pipeline section below how Gauging Debate makes good use of both these organizations' products.
We used a suite of open-source tools to construct an end-to-end data analytics pipeline. That sounds like a mouthful of buzzwords, so here's a picture of buzzwords to illustrate. Descriptions follow.
We access the Twitter stream through Twitter's developer API, which provides free access to a small portion of the entire stream of tweets. Since we were only attempting to acquire a small portion of all tweets anyway (i.e., only candidate- or debate-related tweets), the app collects most (but not all) of its target tweets with this free access tier.
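The candidate filtering can be sketched as simple keyword matching on tweet text. The keyword lists and the `match_candidates` helper below are illustrative, not the app's actual track terms:

```python
# Sketch: tag incoming tweet text with the candidate(s) it mentions.
# These keyword lists are invented for illustration.
CANDIDATE_KEYWORDS = {
    "clinton": ["hillary", "clinton", "#hillary2016"],
    "trump":   ["donald", "trump", "#trump2016"],
    "sanders": ["bernie", "sanders", "#feelthebern"],
}

def match_candidates(text):
    """Return the set of candidates a tweet mentions (case-insensitive)."""
    lowered = text.lower()
    return {cand for cand, words in CANDIDATE_KEYWORDS.items()
            if any(w in lowered for w in words)}

print(match_candidates("Bernie Sanders debates Hillary Clinton tonight"))
```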
Apache Kafka is a distributed publish-subscribe messaging system that is built to work with the rest of the Apache ecosystem. It serves as a broker for incoming streams of data, and for outgoing requests. Spark Streaming has a native Kafka connector. (Actually, Spark's Java and Scala versions have native connectors for the Twitter stream, but we developed in PySpark, which does not yet have this feature.)
Spark Streaming works more-or-less like normal Spark. The main abstraction is the "DStream", but with a few I/O exceptions you can basically treat these like regular RDDs. A start / awaitTermination / stop sequence tells the stream when to open and close, and otherwise it's Spark as usual.
Spark SQL offers Pandas-y data frames and Hive query functionality on RDDs. This is nice for conducting groupby operations when groupByKey() is not feasible. In our case, it came in handy for grouping data by both timestamp and candidates. It's worth noting that aggregate functions for groupby objects are still quite primitive, and (at least in PySpark) don't yet offer the degree of customization for aggregating functions that you might expect from, say, Pandas. In fact, we used Pandas for analyzing the historical data, as it was much easier to get the data in the shape we needed.
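In the app this aggregation runs on Spark SQL DataFrames (and on Pandas for the historical data); here is a minimal Pandas sketch with invented records, grouping by both time bucket and candidate - the same shape of aggregation the app runs on each batch:

```python
import pandas as pd

# Illustrative records: (30-second bucket, candidate, sentiment score).
# Column names and values are made up for this sketch.
df = pd.DataFrame({
    "bucket":    ["20:30:00", "20:30:00", "20:30:00", "20:30:30"],
    "candidate": ["clinton",  "clinton",  "sanders",  "clinton"],
    "score":     [6.2,        6.8,        5.9,        7.0],
})

# Group by both time bucket and candidate, then average the scores.
avg = df.groupby(["bucket", "candidate"])["score"].mean()
print(avg)
```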
Even though the goal was to stream live analytics, we still wanted the ability to (a) keep a buffer of recent past analysis, and (b) make it easy to access both live and historical data. This, on top of the fact that we're not Node.js experts, led us to a storage-based solution, wherein stream data is written to a database, and the front end then queries the most recent records for display. Single-file databases like SQLite don't handle parallelized writes well, and we didn't want to administer our own database server, so those options were out. Amazon offers a number of database options, and we chose SimpleDB, a schema-less key-value storage system - mainly because it had a low learning overhead, and we didn't need very sophisticated querying. SimpleDB can handle concurrent reads and writes, and we can interface with it in Python through Boto, which is a truly excellent module.
Flask is a Python framework for serving web content. We used it to do all the heavy lifting between the backend and the web interface. It works in conjunction with Jinja2 templating. Just about anything that isn't boilerplate on the website (with the exception of this report) is served through Flask in one way or another.
The main feature of the front-end is a chart of average tweet sentiment, per candidate, updated once every 30 seconds. (More on that time interval below in Analysis.) The chart has a lot of built-in customization, including panning and zooming on both axes, error bars, and the ability to add or remove candidates. You can also save any chart view as an image file for download. (We can't take credit for these great features, that's all Plotly.)
The web interface allows users to choose either streaming or historical analysis. In order for streaming data to appear, there needs to be either an AWS cluster running which is processing realtime data, or a Spark instance on someone's local computer which is doing the same.
There is also an administrator dashboard which allows admins to start up Spark clusters for streaming functionality. The address of this dashboard is not public - if you're on the CS205 staff you should have received this address in an email.
This isn't part of the data pipeline, per se, but we ended up storing almost all our configurations, settings, credentials, and scripts on S3. This made it easy for us not to worry about file paths when switching between local and cluster instances, and it interfaces well with the AWS ecosystem. Most, if not all, of the configuration files we use are not hosted here on GitHub, but are on S3 instead.
We originally set out to provide live analysis of both topical content and tweet sentiment. In other words, we wanted to answer the questions: "What are people talking about?" and "How happy are they talking about it?" We got the sentiment scoring working, but ran into some challenges with analyzing topics in real-time. Here's more detail on each method:
This software analyzes tweets related to the 2016 US Presidential Debates. It gauges the sentiment (i.e., level of happiness) towards each candidate, and towards the election in general. Sentiment is analyzed with unigram (word-by-word) averaging, using the LabMT sentiment dictionary. (A big thanks to Andy Reagan for his accompanying Python module.) The LabMT scale ranges from 1 (very unhappy) to 9 (very happy). We ignored all words scoring between 4 and 6 on this scale, as high-frequency neutral words like "the" and "and" would otherwise drown out the more meaningful words we really care about.
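A minimal sketch of the unigram approach, using a tiny invented dictionary in place of the full LabMT word list (the scores below are made up for illustration):

```python
# Toy stand-in for the LabMT dictionary (scale: 1 = very unhappy, 9 = very happy).
SENTIMENT = {"love": 8.4, "win": 8.0, "hate": 2.3, "the": 4.98, "and": 5.2}

def tweet_sentiment(text, low=4.0, high=6.0):
    """Average per-word sentiment, skipping the neutral 4-6 band."""
    scores = [SENTIMENT[w] for w in text.lower().split()
              if w in SENTIMENT and not (low <= SENTIMENT[w] <= high)]
    return sum(scores) / len(scores) if scores else None

print(tweet_sentiment("love the win"))  # "the" is filtered out as neutral
```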
Unigram analysis is a relatively crude take on analyzing sentiment, as it is largely context-ignorant. As such, you need to have a decent chunk of words (at least 1,000) before you can start to be confident that their averaged value is giving a reliable signal. Based on analysis of past debates, we found that each candidate gets enough tweets to meet this threshold every 30-60 seconds. (The less popular candidates take longer.) Fast and frequent updates were also a priority here, as the whole point of a streaming analytics engine is that it continuously delivers content.
Taking all this into account, we decided to offer updated analysis in 30-second intervals. That means every 30 seconds, the chart on GaugingDebate.com adds new sentiment scores.
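Assigning each tweet to its 30-second interval amounts to flooring its timestamp; a minimal sketch (the `bucket_30s` helper is hypothetical):

```python
def bucket_30s(unix_ts):
    """Floor a Unix timestamp to the start of its 30-second interval."""
    return unix_ts - (unix_ts % 30)

# Tweets arriving at arbitrary times collapse into shared charting intervals:
timestamps = [1444782601, 1444782615, 1444782629, 1444782630]
print([bucket_30s(t) for t in timestamps])
```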
So how reliable is the political sentiment we are gauging? It's hard to know exactly, but our approach definitely has its limitations. For example, there's a lot of Twitter political commentary like this: "I'd love to see Trump try and win the White House with his fascist craziness". This is clearly expressing a negative sentiment towards Trump. But our sentiment analyzer is going to have a hard time correctly scoring this for three big reasons.
1. Sarcasm: the tweet uses positive words ("love", "win") to express a negative opinion, and word-by-word scoring can't detect the irony.
2. Dropping neutral words (those scoring between 4 and 6) also strips out context that would help interpret the remaining words.
3. Uncommon words: the sentiment dictionary only covers the most common English words, so pointed terms like "fascist" aren't scored at all.

With these limitations, "fascist" and even "craziness" won't make the "most common word" cut. Without factoring in sarcasm (and after dropping minor neutral words), what's left are words like "love" and "win" - which are scored as very positive, happy terms!
It's true that all unigram sentiment analysis is prone to these weaknesses. The solution to these problems is based on the idea that, in the long run, the sentiment of language is actually well-characterized by the individual words in someone's speech. So, if you can collect enough words, then you should be able to trust their average per-word sentiment as a real signal of how happy that chunk of language is. But especially with the prevalence of issues 1. (sarcasm) and 3. (uncommon words) in political discourse, the genre of tweets we've chosen to analyze may actually require a much higher quantity of unigrams before their average score presents us with a true signal. That's at least our best guess for why sentiment seemed so generally positive across the board (as well as for why there's such a huge variance to most candidates, most of the time - try turning on the error bars in the chart to see what we mean).
A future feature to add to the main page is a side panel that actually shows the tweets that went into a given score for a candidate at a particular point in time. This feature seems like a good way to at least give users a first-hand sense for what we're talking about. We do store all the tweets we analyze in the aggregate, so this feature will apply retroactively, as well. Stay tuned!
Topical content was determined using a parallelized adaptation of Latent Dirichlet Allocation. LDA is, if anything, more demanding than unigram sentiment analysis in terms of the amount of content it needs to provide stable results. Over the course of our work we discovered that the algorithm doesn't translate well to a streaming context, as it requires both (a) many words per document and (b) many separate documents. We tried clumping all tweets per candidate together as documents (many words per document, few documents), as well as treating each tweet as a document (few words per document, many documents). Neither one gave very satisfactory results, so, in the end, we generated working code, but we removed it from the app itself. See Daniel's LDA report for details on implementation and results.
There are two main interfaces on the Gauging Debate app: (1) The Startup screen and (2) the Chart Interface.
On loading the Gauging Debate startup screen, the app automatically determines when the next debate is, and reports which party will be debating, what time the debate begins, and gives a countdown clock for time remaining. (The debate schedule is scraped and updated automatically from a page on the Washington Post.) If you sign on when a debate is already ongoing, it will inform you of this and invite you to start monitoring sentiment!
The first step asks you to choose either Live Streaming or Previous Debate.
Live Streaming is only functional when a background process is collecting data in realtime - you will see an error message if you try to access a live stream when there is none available. (We have included a 'Try Tracking Anyway' button in such an event, as if we are running a local streaming instance you may still be able to load realtime data.)
As of Dec 10 2015, there are two debate events from the 2015 primaries stored for analysis - the Sept 16 GOP and Oct 13 Democratic debates. We plan to include more debates soon. Selecting a historical debate loads the entire archive of tweets pertaining to that debate. (A big thanks to our partners at the UVM Computational Story Lab for collaborating and sharing their archive of tweets!)
If you're loading live data, the chart will refresh every 30 seconds. If you load historical data, the chart will quickly load the first 20 minutes, and then shortly after, once the entire archive has downloaded, the chart will refresh and display all the data. (This is where Oboe.js really came in handy!)
The chart interface (developed by Plotly) has a lot of nice features for customization. They're a bit hard to explain in words, so here's a short screencast to give you an idea of all it can do:
"Coming Soon" is too often followed by "Yeah, about that..." Nevertheless, hope springs eternal, and we have big plans to add the following features:
During this project we logged progress in our process book. This is the digital record of all the considerable effort that went into creating this project over a period of 30 days. It is extremely detailed, but untidy and not made for easy consumption. If you're interested in building similar software, it is a goldmine of links, dos and don'ts, and hard-won experience - it may save you time and frustration if you're starting fresh with all of these tools, like we did.
A brief introduction to the authors and project credits.
Andrew is a data scientist and PhD candidate at Harvard University.
He works on machine learning applications in cyber security and social psychology.
Daniel is a second year student in the ME Computational Science and Engineering program. His academic interests include statistics, machine learning, and network science. He has professional experience in systems engineering and computational advertising. He likes exploring different fields such as climate science and urban ethnography.
Andrew is responsible for:
Daniel is responsible for: