I’ve been trying to learn data science, and inspired by a contest at Kaggle that is waaaay beyond my level of understanding, I’m doing some number-crunching to fill out a bracket.
Here’s what I did — you can see the full data set at the end.
First: In Google Sheets, I imported several data sets from ESPN, starting with their basic rankings pages listing BPI (their own numbers), strength of schedule (their computation), “strength of record” (which I don’t fully understand) and the official NCAA RPI. I also imported the seven-day ranking change and each team’s seed.
All that importing put a strain on Google Sheets (only 25 teams per page, and I went through eight pages to get 200 teams’ data, plus a couple more), so I eventually just copied the values within Google Sheets and canceled the importing.
Then I took those numbers and computed the following:
- A simple average of each team’s BPI, RPI and strength of record, giving me an average of computer rankings.
- A rough difference between each team’s computer ranking and seed. Sort of. Generally, the higher the number, the worse the team is for its seed. Only Villanova (1.0 in computer rankings, and I multiplied the seed number because there are multiple #1 seeds) and SMU (14.7 computer rankings, #6 seed) came out with negative numbers (which, in this case, means “good”).
- I saved the 7-day change in BPI for a future calculation.
This much was quite easy and didn’t take much time.
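For anyone who would rather script this than spreadsheet it, here is a rough Python sketch of those first two calculations. The function names are made up for illustration, and the seed multiplier of 4 is only an assumption (four teams share each seed line); the exact scaling behind the "sort of" difference isn't spelled out above, so treat this as one plausible version, not the real formula.

```python
from statistics import mean

# Assumption: each seed line holds four teams, so the seed is scaled by 4
# before being compared with a computer ranking. The actual multiplier
# used in the spreadsheet is not stated.
SEED_MULTIPLIER = 4


def computer_average(bpi_rank, rpi_rank, sor_rank):
    """Simple average of a team's BPI, RPI and strength-of-record ranks."""
    return mean([bpi_rank, rpi_rank, sor_rank])


def seed_gap(computer_avg, seed, multiplier=SEED_MULTIPLIER):
    """Rough gap between computer ranking and seed.

    Higher means the team looks worse than its seed; negative means
    the team looks better than its seed.
    """
    return computer_avg - seed * multiplier


# A 1.0 computer average (Villanova) means all three ranks are 1.
villanova_avg = computer_average(1, 1, 1)
print(seed_gap(villanova_avg, seed=1))  # negative: better than the seed
print(seed_gap(14.7, seed=6))           # SMU: also negative
```

With this assumed multiplier, both Villanova and SMU come out negative, which at least matches the direction described above.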
Second: This took quite a bit more time. If I knew how to do R programming with all the datasets in Kaggle, maybe it would’ve been easier.
I wanted to see each team’s quality wins. I’ll spare the details, but we’ll say I did a lot of checking of each team’s schedule to come up with what I’m pretty sure are each team’s six best wins, ranking “best” by the beaten opponent’s computer ranking average.
Some teams, all from major conferences, had a few more quality wins that I listed to the side but did not factor into the equation. Some teams didn’t even beat six teams in the top 200, so I added a dummy “xx” team with a figure of 250.
Then the easy part: I took the average of their six best wins.
You can see here that this is why people are high on Duke, even though those of us who suffered through each injury and each Grayson Allen meltdown think they’ve already overachieved by winning the ACC Tournament. Duke has more quality wins than I can possibly list.
I figure this computation helps me emphasize what a team is capable of doing. Yeah, maybe they lost to the 220th-ranked team, but they also beat three in the top 20, so they have the capacity to make a run. The losses are already figured into the computer ranking.
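The quality-win average is simple enough to sketch in Python. This is just an illustration of the steps described above: the function name is hypothetical, and the opponent numbers in the example are invented, not taken from the spreadsheet. The six-win cutoff and the 250 placeholder do come from the description.

```python
from statistics import mean

WINS_COUNTED = 6        # only the six best wins count
PLACEHOLDER_RANK = 250  # the dummy "xx" value for a missing quality win


def quality_win_index(beaten_ranks, n=WINS_COUNTED, filler=PLACEHOLDER_RANK):
    """Average computer ranking of a team's n best wins (lower is better).

    beaten_ranks: computer ranking averages of the top-200 opponents
    the team has beaten, in any order.
    """
    best = sorted(beaten_ranks)[:n]          # lowest ranking = best win
    best += [filler] * (n - len(best))       # pad if fewer than n wins
    return mean(best)


# Invented numbers: a team with seven quality wins and one with only four.
print(quality_win_index([1, 2, 3, 4, 5, 6, 7]))  # seventh win is ignored
print(quality_win_index([3, 10, 20, 5]))         # two slots padded with 250
```

Extra wins beyond the sixth are simply dropped here, which mirrors listing them "to the side" without factoring them in.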
I moved the spreadsheet into Excel at this point to try to make a cool chart. I wasn’t able to make said cool chart. But this gives you an idea of the top 20 teams’ computer ranking and quality win index. (Again, lower is better.)
Third: The Dure Power Index takes the computer average, the quality wins and the seven-day change and puts them in a super-secret formula to spit out … this.
So now that I’ve done all that, I’m going to fill out a bracket using this info. But I still have to do some hunches here and there, mostly to give mid-major schools that haven’t had many opportunities for quality wins a chance to pull some upsets. If I see a 12 seed that looks a little underrated and a 5 seed that looks a little overrated, I’ll go with that.
And no, I’m not picking Duke.
The raw-ish data (I’ve spared you a couple of sheets) is online.