basketball basketball basketball Did you get suckered into Kedro #resources


Deepyaman Datta

03/19/2025, 3:00 PM

🏀 🏀 🏀 Did you get suckered into joining the office March Madness pool but don't know anything about sports? Have you still not submitted your bracket? (Deadline's tomorrow!) 🏀 🏀 🏀

Take control now and only have yourself (and model convergence issues) to blame! To get you started, here's the Kedro project I threw together to make my predictions: https://github.com/deepyaman/dagster-dunks

Additional notes:
• To my chagrin, I'm not much of a data scientist; I basically ported https://www.kaggle.com/code/rustyb/paris-madness-2023 (hopefully without too many bugs).
• This uses data from https://www.kaggle.com/competitions/march-machine-learning-mania-2025
• There are several things I want to improve about the Kedro project (e.g. I ended up calling a bunch of feature engineering nodes in model inference rather than reusing the pipeline; also, I gave up at the modeling step, as I was running late, and reverted to pandas). This is basically a first-pass attempt at getting something working, but suggestions and PRs are most welcome!

And just to prove I didn't end up with an insane bracket: 😉

🏀 5

Chris Schopp

03/19/2025, 4:13 PM

Everyone knows params:best_team is going to win this year.

🥳 1

Ian Whalen

03/19/2025, 5:04 PM

I did this topic for grad school in 2017! We spent a lot of time getting Glicko2 to work as an Elo-like metric; it turned out to be a top-1 or top-2 feature in all our models. Also, a lot went into comparing when to start calculating certain features (e.g. win rate since 2020). Fun stuff!

💡 2
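
For anyone following along who hasn't worked with rating features: below is a minimal sketch of a plain Elo update, not Glicko2 (which additionally tracks a rating deviation and volatility per team). The starting rating of 1500, the K-factor of 20, and the function names are illustrative assumptions, not values from the thread or the linked project.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 20.0) -> Tuple[float, float]:
    """Return the new (winner, loser) ratings after a single game."""
    # The winner gains more when the win was unexpected; the loser drops by the same amount.
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain


def rate_games(
    games: Iterable[Tuple[str, str]],
    start: float = 1500.0,  # assumed starting rating for unseen teams
    k: float = 20.0,        # assumed K-factor
) -> Dict[str, float]:
    """Process (winner_id, loser_id) pairs in chronological order."""
    ratings: Dict[str, float] = {}
    for winner_id, loser_id in games:
        w = ratings.get(winner_id, start)
        l = ratings.get(loser_id, start)
        ratings[winner_id], ratings[loser_id] = update_elo(w, l, k)
    return ratings
```

The resulting per-team rating (or the rating difference between two teams) could then be used as a model feature. Where the game list starts is the same kind of "when to start calculating" choice mentioned above.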

Deepyaman Datta

03/19/2025, 5:39 PM

We spent a lot of time getting Glicko2 to work as an ELO-like metric - turned out to be top 1/2 feature in all our models

I did see a number of Elo-based approaches, too! I ended up going with this, since it seems to perform near the top consistently, and seemed like slightly less work to do haha.

Also a lot went into comparing when to start calculating certain features (e.g. win rate since 2020)

I think one of the biggest benefits of converting it to a Kedro pipeline is that you can experiment with these things more (rather than having to find places in the notebook to update them), but I obviously don't need to tell you that! (I also still have a number of things in this pipeline that I didn't parametrize, like

repeat_cv = 3  # recommend 10

showing up in 3 places 🙈 🙈 🙈)
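
A rough sketch of what parametrizing that value might look like: the function takes `repeat_cv` as an argument instead of hard-coding it. In a Kedro project the value would typically live in the parameters file and reach the node through a `params:repeat_cv` input. The function names and the stand-in scorer here are hypothetical and not taken from the linked repository.

```python
from statistics import mean
from typing import Callable, Dict

# Stand-in for an entry in the project's parameters file (assumed layout).
PARAMETERS: Dict[str, int] = {"repeat_cv": 3}  # recommend 10


def repeated_cv_score(score_once: Callable[[int], float], repeat_cv: int) -> float:
    """Average a cross-validation score over `repeat_cv` differently seeded runs."""
    return mean(score_once(seed) for seed in range(repeat_cv))


def fake_cv(seed: int) -> float:
    """Hypothetical scorer; a real one would fit and evaluate a model for this seed."""
    return 0.5 + 0.01 * seed


# The single source of truth is the parameters mapping, not a literal in three places.
score = repeated_cv_score(fake_cv, PARAMETERS["repeat_cv"])
```

With this shape, switching from 3 to 10 repeats means changing one parameter value rather than hunting down each copy.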

Ian Whalen

03/19/2025, 5:41 PM

Preaching to the choir! I looked back at my code and boy is it a nightmare. It's just a bunch of loose files haha
