A Little College Sports Analysis, but First a Little Data Wrangling
I’m a big college sports fan, especially active in debates about which D1 conference is best. Five years ago, I came across the Learfield Directors’ Cup, an annual evaluation/ranking program of college sports performance based on hard numbers. In separate rankings, Division I,II, and III schools are assigned points based on their standings in 20-odd collegiate sports across gender and season. The sports are equal-weighted in the final calculations, so that the winner of women’s rowing is awarded the same number of points as the winner of men’s football. And it’s far from all or nothing, as points are allotted to many top competitors in each sport. The point differential between first and fifth is not that dramatic. Athletic Directors see their schools’ aggregate DC “points” and “ranks” as important grades on their athletic report cards. Indeed many AD’s are bonused in part based on their Directors’ Cup performance.
Historical Directors’ Cup data are readily available, albeit in not easily accessible pdf files. When I first found the data, I immediately set out to do a five year “study” of Directors’ Cup rankings by school and conference. (Actually, there hasn’t been much competition for the top spot: the Stanford Cardinal has finished atop the pack the last 20+ years. There’s still conference bragging rights up for grabs, however.). My plan was to look at how the FBS (top football) conferences compared in end-of-school-year rankings, as well as in the fall, winter, and spring “seasons”. I thought that assembling the five years of data would be the simple task, and that I’d expend most of my energy attempting to distinguish the top six to ten conferences. I was wrong. Munging the data from multiple years ended up being the more daunting task. Duh.
The challenges working with DC data mirror those faced by data scientists in their corporate jobs, making DC analyses ideal for DS training. Indeed, I’ve recommended exercises similar to what undertook with the Directors’ Cup for MS in analytics/data science capstone projects.
The first obstacle with the data is obviously the pdf format. In the beginning, I simply physically scraped data from the pdf’s, although I later found some reasonable packages/libraries that could automate some of the work. There are now some pretty nice tools for handling pdf tables. A second challenge is that the table format almost always changes from year to year, necessitating separate code for each report. So if you decide to aggregate three years, that’ll mean three separate data wrangles. Third, only schools that scored points are included in a given year’s report, so “lesser” conferences might not have all schools represented each year — and the number of records does not equal the number of competitors. A school might be represented one year and not the next. Fourth, the reports have changed school and conference names over time, so this must be accounted for in multiple year aggregations. For example, the Southeastern Conference has been represented as both “Southeastern” and “SEC” over time, while “UCLA” was once denoted “California, Los Angeles”. And finally, there’s the problem of slowly-changing dimensions, as schools switch divisions and conferences over time. Up until a few years ago, Maryland was part of the ACC conference; now it’s a member of the Big Ten. The last two “opportunities” are well understood by data warehouse professionals.
This time around I got started after discovering the web app tabula and R versions of the tabulizer java library, I decided to test-drive them on the 2018 DC data, fully expecting an easy munge from internet pdf download to R data.table. Not so. Again, what was I thinking? Fortunately, my task was made easier by limiting attention to just eight attributes per record, including school name, conference, division, total points, final rank, fall points, winter points, and spring points.
Below is the R code for downloading the final 2018 DC pdf, ultimately producing an R data.table of 8 attributes. The technology used is Microsoft Open R 3.4.4 running JupyterLab Beta and tabulizer 0.2.2.
Read the full blog here.
Link: A Little College Sports Analysis, but First a Little Data Wrangling