Posted by: Mark Fontecchio
Just in time. The National Football League is holding its conference championship games this weekend, and Scott M. Sawyer has a lot of obscure data about how one team might beat another.
During the day, Sawyer is on staff at MIT’s computing and analytics group. His research areas include big data and parallel algorithms. But in his spare time lately, he’s been building a web app to parse NFL play-by-play data from 2002 to the present. The result? Some interesting findings:
- Since 2002, running the ball on 4th-and-1 works 71% of the time. It only works 66% of the time if you include pass plays. Conclusion? Run it on 4th-and-1.
- Since 2002, the New England Patriots have scored on 40% of drives when down one score with 5 minutes or less left in the game. League average: 34%.
- The Baltimore Ravens, in their first match-up against the Denver Broncos, actually had more success against the pass than against the run, an interesting statistic considering that Peyton Manning quarterbacks the Broncos.
What Sawyer did is, conceptually, fairly simple. He took NFL play-by-play data from 2002 to 2012, which had been compiled into comma-separated values (CSV) files by Brian Burke at Advanced NFL Stats. The files add up to hundreds of thousands of spreadsheet rows. Each row represents a single NFL play from that season and includes which team was on offense, which was on defense, the quarter, the time left in the quarter, and the field position. Then there is a cell for the play itself, like this: “(13:13) (Shotgun) 12-T.Brady pass deep left to 34-S.Vereen for 33 yards, TOUCHDOWN.”
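A description cell like that one can be picked apart with a regular expression. Here is a minimal sketch; the pattern and group names are mine, not Sawyer's, and real play-by-play text has far more variants than this handles:

```python
import re

# The example play description quoted above.
desc = ("(13:13) (Shotgun) 12-T.Brady pass deep left to "
        "34-S.Vereen for 33 yards, TOUCHDOWN.")

# Pull the passer, receiver, and yardage out of a pass play.
# Illustrative only: punts, sacks, laterals, etc. need their own patterns.
pass_re = re.compile(
    r"(?P<passer>\d+-[A-Z]\.\w+) pass .*? to "
    r"(?P<receiver>\d+-[A-Z]\.\w+) for (?P<yards>-?\d+) yards"
)

m = pass_re.search(desc)
if m:
    print(m.group("passer"), m.group("receiver"), m.group("yards"))
    print("touchdown:", "TOUCHDOWN" in desc)
```

On the sample above this prints `12-T.Brady 34-S.Vereen 33` and `touchdown: True`.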
Here is what Sawyer did with all those CSV files: he spent a few hours writing Python code that parsed each row. He filtered out non-offensive plays such as kicks and penalties, determined whether each play was a pass or a run, and noted the yardage gained or lost. He then rated each play as a success or a failure: a play was successful if it resulted in a first down or a touchdown, or gained at least four yards on 1st or 2nd down.
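That success rule maps directly to a few lines of code. A sketch, with function and argument names of my own choosing:

```python
def is_success(down, yards_to_go, yards_gained, touchdown=False):
    """Rate a play per the criteria in the article: a first down,
    a touchdown, or 4+ yards gained on 1st or 2nd down."""
    if touchdown:
        return True
    if yards_gained >= yards_to_go:   # converted the first down
        return True
    return down in (1, 2) and yards_gained >= 4

# 3rd-and-2 run for 3 yards: moved the chains, success.
print(is_success(down=3, yards_to_go=2, yards_gained=3))   # True
# 2nd-and-10 pass for 5 yards: 4+ on 2nd down, success.
print(is_success(down=2, yards_to_go=10, yards_gained=5))  # True
# 3rd-and-8 run for 4 yards: short of the sticks, failure.
print(is_success(down=3, yards_to_go=8, yards_gained=4))   # False
```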
Sawyer then put all of that data into a MySQL database.
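The article doesn't show Sawyer's schema, but a plays table for this kind of data might look something like the following. The sketch uses Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere; the column names are my own guesses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plays (
        season      INTEGER,
        offense     TEXT,
        defense     TEXT,
        quarter     INTEGER,
        down        INTEGER,
        yards_to_go INTEGER,
        play_type   TEXT,     -- 'run' or 'pass'
        yards       INTEGER,
        success     INTEGER   -- 1 if the play met the success criteria
    )
""")

# A few hand-made toy rows, not real play-by-play data.
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (2012, "NE",  "BAL", 2, 4, 1,  "run",  2,  1),  # converted
        (2012, "BAL", "DEN", 4, 4, 1,  "run",  0,  0),  # stuffed
        (2012, "NE",  "BAL", 1, 1, 10, "pass", 33, 1),
    ],
)

# A stat like the 4th-and-1 finding is then a simple aggregate query.
rate = conn.execute("""
    SELECT AVG(success) FROM plays
    WHERE down = 4 AND yards_to_go = 1 AND play_type = 'run'
""").fetchone()[0]
print(rate)  # 0.5 on this three-row toy sample
```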
“I don’t expect to make money on this project, but I don’t want it to cost a lot either,” he wrote to me in an email. “I use inexpensive shared hosting, and MySQL is the best choice for delivering stats to a lot of visitors with minimal CPU cycles.”
Sawyer added that MySQL is already installed and configured on his web host, and he is using extensive indexing and query caching to reduce the load on the web server.
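The indexing matters because situational stat queries filter on the same few columns over and over. A sketch of the idea, again using sqlite3 in place of MySQL (the CREATE INDEX syntax is nearly identical in both; MySQL's query cache, by contrast, is a server-side configuration setting, not SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays ("
             "down INTEGER, yards_to_go INTEGER, "
             "play_type TEXT, success INTEGER)")

# A composite index over the common filter columns lets the database
# answer situational queries without scanning every row.
conn.execute("CREATE INDEX idx_situation "
             "ON plays (down, yards_to_go, play_type)")

# EXPLAIN QUERY PLAN (SQLite's answer to MySQL's EXPLAIN) confirms
# the planner searches the index rather than the whole table.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT AVG(success) FROM plays
    WHERE down = 4 AND yards_to_go = 1 AND play_type = 'run'
""").fetchall()
print(plan)
```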
Though the data arrived as CSV files, some might consider it big data, in part because of the unstructured text description of each play. Sawyer doesn’t believe it’s there yet. He said the total “text corpus” of play descriptions is about 64MB, which is tiny. Parsing the descriptions reduced that to about 30MB, though it grew back to roughly 100MB once the data was ingested into MySQL and indexed.
That said, Sawyer is not done. He expects the data to grow as he brings in more sources – the weather at each game, for example, or the results of his own analytics.
“If you really want to predict winners, you’re going to need a lot more information,” he wrote. “But we’re not talking terabytes and petabytes anytime soon.”