TweetDVR: Fun with Big Data and Game of Thrones
I never watch television live. My favorite shows are all set to record on my DVR (a computer running Windows Media Center). When I want to watch TV, I just browse through my recorded shows and pick whatever I’m in the mood for. One disadvantage of this approach, however, is that you miss the fun of Twitter during a favorite show. Twitter adds a fun social component to TV-watching – discussing the show with the internet as it happens. This is especially true for reality TV shows (which I don’t watch, but lots of people seem to have strong opinions about who gets voted off), sports (with fans commenting on exciting plays), and shows with cult followings (see below).
Now, last December at a hackathon, I worked with a team to develop Pipe Dreams. On Tues 4/14/2015, my team held another hackathon called hack.central. On this date, the Game of Thrones season premiere had just aired, but several of us hadn’t seen it yet. So the serious issue facing our world that we wanted to tackle was: how can we see the Twitter stream that was happening during the show as if we were watching it live?
We wanted to use Azure’s big data capabilities. A service called HDInsight offers the Hadoop ecosystem running on Azure. Apache Storm, part of that ecosystem, lets you process streams of data in real time. HBase is an open-source NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data.
Below is an architecture diagram for our TweetDVR project. The data is pulled from Twitter using TweetInvi (available as a NuGet package). We created two distinct data pulls: an “archive job” that went back and pulled a batch of old Twitter data from the time that Game of Thrones originally aired, and a realtime pull using Storm on Azure so we could run continuously and keep saving data. Both data pulls put the data into HBase on Azure. A web front end then displayed the results at https://datadvr.com.
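To give a feel for the Twitter side of the pipeline, here is a minimal sketch of a realtime pull using the TweetInvi filtered-stream API of that era. The credential strings are placeholders for your own Twitter API keys, and the Console.WriteLine stands in for the HBase write we did in the real project – this is an illustration, not the code from our repo.

```csharp
using System;
using Tweetinvi;

class TweetPull
{
    static void Main()
    {
        // Placeholder credentials -- substitute your own Twitter API keys.
        Auth.SetUserCredentials("CONSUMER_KEY", "CONSUMER_SECRET",
                                "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET");

        // Track tweets about the show as they happen.
        var stream = Stream.CreateFilteredStream();
        stream.AddTrack("#GameofThrones");

        stream.MatchingTweetReceived += (sender, args) =>
        {
            // In TweetDVR, each matching tweet would be written to HBase here.
            Console.WriteLine(args.Tweet.CreatedAt + ": " + args.Tweet.Text);
        };

        // Blocks and invokes the handler above for each matching tweet.
        stream.StartStreamMatchingAllConditions();
    }
}
```

The archive job is conceptually the same, except it uses the Twitter search API to fetch a batch of historical tweets instead of an open-ended stream.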
Here is a screenshot of the web front end in action. As you can tell from their content, the tweets scrolling through are the tweets that were happening at the time of the Game of Thrones season premiere. (It says 4pm due to time zone differences and a bug when we were importing the data.)
This was a fabulous team effort! From my team, Tim Benroeck, David Giard, and Yung Chou all participated. Some members of the HDInsight team (Matt Winkler, Asad Khan, Maxim Lukiyanov, and Hiren Patel) came out and hacked with us.
I wrote the realtime streaming component using Storm. I’ll do an additional post on Storm in the near future; subscribe to my blog via RSS or email (on this page, under “Options” in the right sidebar), and you’ll get it automatically.
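Until that post arrives, here is a rough skeleton of what a C# Storm spout looks like with SCP.NET (the .NET library the HDInsight Storm templates use). The interface and schema-declaration shape follow the SCP.NET samples of the time, but treat it as a sketch: GetNextTweet is a hypothetical helper standing in for the Twitter pull, and it is not code from our repo.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.SCP;

public class TweetSpout : ISCPSpout
{
    private Context ctx;

    public TweetSpout(Context ctx)
    {
        this.ctx = ctx;
        // Declare one output field of type string on the default stream.
        var outputSchema = new Dictionary<string, List<Type>>();
        outputSchema.Add(Constants.DEFAULT_STREAM_ID,
                         new List<Type>() { typeof(string) });
        this.ctx.DeclareComponentSchema(
            new ComponentStreamSchema(null, outputSchema));
    }

    // Factory method referenced by the topology definition.
    public static TweetSpout Get(Context ctx, Dictionary<string, Object> parms)
    {
        return new TweetSpout(ctx);
    }

    public void NextTuple(Dictionary<string, Object> parms)
    {
        // GetNextTweet() is a hypothetical helper wrapping the Twitter pull.
        string tweet = GetNextTweet();
        if (tweet != null)
        {
            ctx.Emit(new Values(tweet));
        }
    }

    // No-ops here; a production spout would handle replay on failure.
    public void Ack(long seqId, Dictionary<string, Object> parms) { }
    public void Fail(long seqId, Dictionary<string, Object> parms) { }

    private string GetNextTweet() { /* placeholder */ return null; }
}
```

Storm calls NextTuple repeatedly, and each emitted tuple flows downstream to whatever bolt (e.g., an HBase writer) the topology wires it to.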
Getting Started
If you don’t have an Azure account yet, you can sign up for a Microsoft Azure Free Trial. If you’re just getting started with big data on Azure, I highly recommend the learning map for HDInsight. It’s an organized flowchart walking you through the basics. In terms of tools, you should download the Azure SDK for your version of Visual Studio (which includes the HDInsight Tools for Visual Studio).
Source Code
The code is available at https://github.com/jennifermarsman/TweetDVR.
Tips & Tricks
I wrote the Storm data pull, so here are some tidbits I learned when working with Storm on Azure.
First, to create a new Storm application in Visual Studio, look for the project templates under the HDInsight node. (My instinct was to look under my language and then Cloud, but that’s not where they live.)
Secondly, if Visual Studio can’t find any Storm clusters to connect to, expand the HDInsight node under Azure in Server Explorer to force it to refresh the cluster list.
Here is a little more background on that. To push your code up to Azure, right-click on your Storm application project and select “Submit to Storm on HDInsight”. This opens the “Submit Topology” dialog box.
Click the dropdown menu to choose a Storm cluster. (I had already created one in the Azure portal.) This can take some time to populate. You should see a task to “Get storm clusters list” appear in the HDInsight Task List at the bottom of Visual Studio.
I occasionally had issues with the dropdown menu not populating (we were having sporadic internet access problems). Another way to get the Storm clusters list and force the dropdown to populate is to open Server Explorer and expand the HDInsight node under Azure. It will show progress as it refreshes and fetches all of your clusters.
When this is complete, you should also see your Storm clusters listed in the dropdown menu in the “Submit Topology” dialog box. Then you can select the Storm cluster to deploy to and submit.
Finally, this Storm article has a nice “Troubleshooting” section near the end which shows you how to test a topology locally. There is also an article on deploying and monitoring your topologies.
Now that I have taught you about Storm, you may address me as JENNIFER STORMBORN, Khaleesi, Mother of Dragons, Breaker of Chains, and true Queen of the Seven Kingdoms.
Technorati Tags: Storm,Big Data,Azure,Windows Azure,Microsoft Azure,Hackathons,DataDVR,TweetDVR,Data DVR,Tweet DVR,HDInsight