I’ve been trying to learn data science, and inspired by a contest at Kaggle that is waaaay beyond my level of understanding, I’m doing some number-crunching to fill out a bracket.
Here’s what I did — you can see the full data set at the end.
First: In Google Sheets, I imported several data sets from ESPN, starting with their basic rankings pages listing BPI (their own numbers), strength of schedule (their computation), “strength of record” (which I don’t fully understand) and the official NCAA RPI. I also imported the seven-day ranking change and each team’s seed.
All that importing put a strain on Google Sheets (only 25 teams per page, and I went through eight pages to get 200 teams’ data, plus a couple more), so I eventually just copied the values within Google Sheets and canceled the importing.
Then I took those numbers and computed the following:
- A simple average of each team’s BPI, RPI and strength of record, giving me an average of computer rankings.
- A rough difference between each team’s computer ranking and seed. Sort of. Generally, the higher the number, the worse the team is for its seed. Only Villanova (1.0 in computer rankings, and I multiplied the seed number because there are multiple #1 seeds) and SMU (14.7 computer rankings, #6 seed) came out with negative numbers (which, in this case, means “good”).
- I saved the 7-day change in BPI for a future calculation.
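For anyone who would rather script this than spreadsheet it, here is a rough Python sketch of the first two calculations. The function names are mine, and the seed multiplier is a guess: I have not written down the exact factor here, so it is left as a parameter, and the 4 in the example is only a placeholder. The three input rankings in the first example are made up.

```python
def computer_average(bpi_rank, rpi_rank, sor_rank):
    """Simple average of a team's BPI, RPI and strength-of-record rankings."""
    return (bpi_rank + rpi_rank + sor_rank) / 3


def seed_gap(computer_avg, seed, seed_multiplier):
    """Rough gap between computer ranking and seed.

    Assumption: the gap is the computer average minus a scaled seed.
    A higher number means the team looks worse than its seed;
    a negative number means it looks better.
    """
    return computer_avg - seed * seed_multiplier


# Made-up rankings for a hypothetical team, for illustration only.
example_avg = computer_average(10, 14, 12)

# SMU's figures from this post (14.7 computer average, #6 seed),
# with a placeholder multiplier of 4 that is NOT the real factor.
smu_gap = seed_gap(14.7, 6, 4)
print(example_avg, smu_gap < 0)
```

With any multiplier above about 2.5, SMU comes out negative, which at least matches the “good for its seed” reading above.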
This much was quite easy and didn’t take much time.
Second: This took quite a bit more time. If I knew how to do R programming with all the datasets in Kaggle, maybe it would’ve been easier.
I wanted to see each team’s quality wins. I’ll spare the details, but we’ll say I did a lot of checking of each team’s schedule to come up with what I’m pretty sure are each team’s six best wins, ranking “best” in order of their computer ranking average.
Some teams, all from major conferences, had a few more quality wins that I listed to the side but did not factor into the equation. Some teams didn’t even beat six teams in the top 200, so I added a dummy “xx” team with a figure of 250.
Then the easy part: I took the average of their six best wins.
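The quality-win step is simple enough to sketch in Python. This is only an approximation of what I did by hand: the function name is invented, and I am assuming that a win only counts if the beaten team sits inside the top 200 of the computer average, with every missing slot filled by the 250 dummy figure.

```python
DUMMY_RANK = 250  # the "xx" placeholder figure
N_BEST = 6        # number of best wins averaged per team


def quality_win_index(beaten_ranks, n_best=N_BEST, dummy=DUMMY_RANK, cutoff=200):
    """Average computer ranking of a team's best wins (lower is better).

    beaten_ranks: computer-ranking averages of the teams this team beat.
    Assumption: only wins over teams ranked inside `cutoff` count.
    """
    eligible = sorted(r for r in beaten_ranks if r <= cutoff)
    best = eligible[:n_best]
    # Pad with the dummy figure when a team has fewer than n_best such wins.
    best += [dummy] * (n_best - len(best))
    return sum(best) / n_best


# Hypothetical teams, not real data.
deep_resume = quality_win_index([5, 12, 30, 8, 45, 60, 90])
thin_resume = quality_win_index([10, 20])
print(round(deep_resume, 1), round(thin_resume, 1))
```

The padding is what punishes a thin schedule: two great wins and nothing else still averages out far worse than six merely good ones.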
You can see here that this is why people are high on Duke, even though those of us who suffered through each injury and each Grayson Allen meltdown think they’ve already overachieved by winning the ACC Tournament. Duke has more quality wins than I can possibly list.
I figure this computation helps me emphasize what a team is capable of doing. Yeah, maybe they lost to the 220th-ranked team, but they also beat three in the top 20, so they have the capacity to make a run. The losses are already figured into the computer ranking.
I moved the spreadsheet into Excel at this point to try to make a cool chart. I wasn’t able to make said cool chart. But this gives you an idea of the top 20 teams’ computer ranking and quality win index. (Again, lower is better.)
Third: The Dure Power Index takes the computer average, the quality wins and the seven-day change and puts them in a super-secret formula to spit out … this.
So now that I’ve done all that, I’m going to fill out a bracket using this info. But I still have to do some hunches here and there, mostly to give mid-major schools that haven’t had many opportunities for quality wins a chance to pull some upsets. If I see a 12 seed that looks a little underrated and a 5 seed that looks a little overrated, I’ll go with that.
And no, I’m not picking Duke.
The raw-ish data (I’ve spared you a couple of sheets) is online.
So what did this mean for my bracket?
Upsets picked:
#9 Vanderbilt over #8 Northwestern: The tiebreaker is SAT scores. No, actually, Vandy has some puzzling losses but some huge wins as well. They’re 4-4 vs. Top 25; Northwestern is 1-5. Vandy’s 8-4 in their last 12; Northwestern is 5-7. All due credit to Chris Collins for getting this team to the tournament at last, but this is a tough draw.
#9 Seton Hall over #8 Arkansas: Arkansas has beaten no one in the Top 25, and they’ve had chances. Seton Hall has beaten Butler and Creighton, and they gave Villanova a tough game.
#10 Marquette over #7 South Carolina: Another former Duke guard (Wojo) has a better draw. Marquette is actually higher in my rankings.
#10 Wichita State over #7 Dayton: Wichita State’s favored by six, thanks in part to a late-season surge and a whopping 21-place difference in the BPI. Neither team has great wins.
#11 Rhode Island over #6 Creighton: Tough one, but the injury to Creighton’s point guard swayed me. They’re also pretty close in the computer rankings. (And yes, Rhode Island gave Duke a very good game back when Duke was playing well early in the season.)
#11 Xavier over #6 Maryland: Look at the numbers, and you’d swear the seeds are reversed. Xavier’s higher in my rankings. There’s a reason the line on this game is only two points.
#12 Middle Tennessee State over #5 Minnesota: Have to pick one 12-over-5, right? And the others aren’t fair. I like Princeton, but against Notre Dame? UNC Wilmington has 29 wins, but over Virginia? And Nevada gives me no reason to pick them over Iowa State. Minnesota is by far the weakest No. 5 in the DPI, and MTSU is the top No. 12. They’ve beaten UNCW and routed Vanderbilt. They’ve been here before. If you must pick one 12-over-5, this is it.
Upsets not picked:
#10 Oklahoma State vs. #7 Michigan: Oklahoma State is 1-9 vs. Top 25. Michigan’s 4-2.
#10 VCU vs. #7 St. Mary’s: Both teams are in the RPI Top 20, even though they haven’t beaten anybody. St. Mary’s is 0-3 vs. the Top 25 (all to Gonzaga); VCU is 0-1. Odd fact: Each team’s best win is against Dayton. It’s really a coin flip, but I’ll figure East Coast bias is making us overlook St. Mary’s a bit.
Later-round upsets:
#5 Virginia over #4 Florida: UVA is fourth in the BPI and has beaten North Carolina, Louisville (twice) and Notre Dame. Florida’s lone big win is against a good but not dominant Kentucky team.
#5 Iowa State over #4 Purdue: Slightly better numbers, with big wins over Kansas and Baylor.
#5 Notre Dame over #4 West Virginia: But it’s close. West Virginia’s my top No. 4.
Final Four:
Villanova over Duke
Louisville over Kansas
Florida State over Gonzaga (great draw for the Seminoles — I also have them knocking off No. 2 Arizona)
North Carolina over UCLA (fun fact: In the regular season, Kentucky beat UNC, but UCLA beat Kentucky)
Final: North Carolina over Villanova. Revenge.
So I correctly picked four “upsets,” though at least one of them was hardly an upset:
- #10 Wichita State over #7 Dayton (65.3% of ESPN pickers picked them)
- #11 Rhode Island over #6 Creighton (39.6%)
- #11 Xavier over #6 Maryland (43.5%)
- #12 Middle Tennessee State over #5 Minnesota (42.9%)
The only 10/11/12 seed I picked that did NOT win was Marquette, which spoiled 55.9% of ESPN brackets. (And they weren’t even close.)
The only upset I didn’t pick was #11 USC over #6 SMU. Only 18.3% got that one.
The other three of the five games I missed were all 8 vs. 9. Those were toss-ups. Northwestern-Vanderbilt was literally 50.0-50.0 at ESPN. Seton Hall, which I think got robbed by the refs, drew 50.5%. Michigan State was 51.4% over Miami. The only big split was Wisconsin (67.8%) over Virginia Tech, and that’s the only one I got right.
I’m in the 95th percentile at ESPN. The best news: I still have all my Sweet 16.
Now watch it all fall apart today.