UC Berkeley Course Lectures: Analyzing Big Data With Twitter

Thank you all for a wonderful semester. Here is a summary, in chronological order, of our recorded lectures. You can also view the entire playlist on YouTube.

Course Introduction

Marti Hearst, the course instructor at UC Berkeley, introduces the main concepts for the course, and Gilad Mishne (@gilad) of Twitter describes his goals for the course and provides an introduction to Twitter. (slides for lecture1a) (slides for lecture1b).

Twitter Philosophy and Software Architecture

Othman Laraki (@othman), Twitter’s Vice President for Growth, International and Revenue, on Growing a Human-Scale Service, and Raffi Krikorian (@raffi), the Director of Twitter’s Platform Services group, on the Twitter Software Ecosystem. View the slides for Othman’s and Raffi’s talks.

Introduction to Hadoop

Bill Graham (@billgraham), who is active in the Hadoop community and a Pig contributor, gave a very clear and detailed intro to Hadoop and outlined how it is used at Twitter. His slides can be found here.

Introduction to Apache Pig

Jon Coveney (@jco) gives an in-depth tutorial on Apache Pig, including how it interacts with Hadoop. The log analysis group at Twitter uses Pig extensively. Jon’s slides can be found here: (pdf)

Coding to the Twitter API

Rion Snow (@rion) gave an introduction to the Twitter API, including the RESTful API and the streaming API for both Java and Python. See all the slides (no video).

Slide on Sampling the Streaming API: Twitter4J

Detecting Twitter Trends

If you’d like to know how Twitter computes its Trending Topics, Kostas Tsioutsiouliklis (@kostas) shared some of the secrets with the class. He also talked about MinHash algorithms. See his lecture notes.
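For readers who have not met MinHash before, here is a minimal sketch of the general technique in Python. It is not taken from the lecture and is not Twitter's implementation; the function names, the choice of BLAKE2b as the hash family, and the sample word sets are all assumptions made for illustration. The idea is that the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import hashlib


def minhash_signature(tokens, num_hashes=128):
    """Build a MinHash signature for a non-empty collection of string tokens.

    Each of the num_hashes salted hash functions contributes the smallest
    hash value it produces over the tokens. (Hypothetical helper, not from
    the lecture.)
    """
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed gives a different hash function.
        salt = seed.to_bytes(8, "big")
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Two made-up sets of words; their true Jaccard similarity is 3 / 5 = 0.6.
tweet_a = {"big", "data", "with", "twitter"}
tweet_b = {"big", "data", "with", "hadoop"}
estimate = estimate_jaccard(minhash_signature(tweet_a), minhash_signature(tweet_b))
```

The estimate gets closer to the true similarity as num_hashes grows, at the cost of longer signatures.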

Real-Time Twitter Search

Brian Larson (@larsonite), the tech lead for search and relevance at Twitter, gives a detailed technical talk about how real-time search works at Twitter.

Splunk’s Software Architecture and GUI for Analyzing Twitter Data

Stephen Sorkin of Splunk described an alternative software architecture for processing large data. Splunk also has a sophisticated GUI for analyzing Twitter and other data sources in real time; be sure to watch the last 15 minutes of the video to see the demo. Stephen’s slides: pdf

Twitter’s Social Network

Learn about weak ties, triadic closures, and personalized PageRank, and how they all relate to the Twitter social graph from Aneesh Sharma (@aneeshs) in this lecture. Slides here.
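As a rough illustration of PageRank personalized to a single user, here is a small power-iteration sketch in Python. It is a textbook-style version under stated assumptions, not the method from the lecture or anything Twitter runs in production: the graph is a plain dictionary mapping each account to the accounts it follows, every followed account must also appear as a key, and the names, the restart probability alpha, and the toy graph are all made up for the example.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=50):
    """Approximate PageRank personalized to `source` by power iteration.

    graph maps each node to the list of nodes it follows. At every step the
    walker restarts at `source` with probability alpha; otherwise it moves
    to a followed account chosen uniformly at random.
    """
    rank = {node: 0.0 for node in graph}
    rank[source] = 1.0
    for _ in range(iterations):
        next_rank = {node: 0.0 for node in graph}
        for node, score in rank.items():
            followees = graph[node]
            if followees:
                share = (1 - alpha) * score / len(followees)
                for followee in followees:
                    next_rank[followee] += share
            else:
                # A node that follows nobody sends its score back to the source.
                next_rank[source] += (1 - alpha) * score
        # Restart step: an alpha fraction of the total score returns to the source.
        next_rank[source] += alpha * sum(rank.values())
        rank = next_rank
    return rank


# Toy follow graph (hypothetical accounts).
follows = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["alice"],
}
scores = personalized_pagerank(follows, "alice")
```

In this toy graph nobody reachable from alice follows dave, so dave's score stays at zero, while carol outranks bob because she is followed by both alice and bob.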

Big Learning with Graphs

Joey Gonzalez, a recent PhD from CMU and a postdoc at UC Berkeley, is working on GraphLab, the hot technology for processing huge graphs quickly. There is a new version called GraphChi (for chihuahua) that you can run on your personal computer, so you don’t even need access to EC2 to run it going forward. Slides here.

Twitter Recommendations

Alpa Jain (@alpa), who works on monetization algorithms at Twitter, described SVD and other recommendation algorithms used at Twitter. Alpa’s slides are here: pdf

Security at Twitter and Elsewhere

Kurt Thomas is a former Twitter engineer and a current PhD student at UC Berkeley who studies how the criminal underground conspires to make money via unintended uses of computer systems. He talked about fraud detection for Twitter and other online systems. See his lecture notes.

Information Diffusion on Twitter

Stan Nikolov (@snikolov) of the Twitter Search and Relevance team walked through one particular theoretical model of information diffusion, which tries to predict, based on a network’s structure, the conditions under which an idea stops spreading. The slides in his lecture notes let you see the Pig scripts in detail, and you can see the video simulations that Stan created on his blog.

Introduction to Scalding

On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin (@posco) presented a lecture that he and Argyris Zymnis (@argyris) put together. See the lecture notes for more details.

Spark: Making Big Data Analytics Interactive and Real-Time

Spark is the hot next thing for Hadoop / MapReduce, and Matei Zaharia (@matei_zaharia), a PhD student in UC Berkeley’s AMP Lab, described how it works and what’s coming next. The key idea is to make analysis of big data interactive and able to respond in real time. Next up on the research agenda are streaming data and blending real-time and batch processing. Matei also gave a live demo. (slides here)