# resources
d
šŸ€ šŸ€ šŸ€ Did you get suckered into joining the office March Madness pool but don't know anything about sports? Have you still not submitted your bracket? (deadline's tomorrow!) šŸ€ šŸ€ šŸ€ Take control now and only have yourself (and model convergence issues) to blame! To get you started, here's the Kedro project I threw together to make my predictions: https://github.com/deepyaman/dagster-dunks Additional notes: • To my chagrin, I'm not much of a data scientist; I basically ported https://www.kaggle.com/code/rustyb/paris-madness-2023 (hopefully without too many bugs) • This uses data from https://www.kaggle.com/competitions/march-machine-learning-mania-2025 • There are several things I want to improve about the Kedro project (e.g. I ended up calling a bunch of feature engineering nodes in model inference rather than reusing the pipeline; also, I gave up at the modeling step/as I was running late and reverted to pandas). This is basically a first-pass attempt at getting something working, but suggestions and PRs are most welcome! And just to prove I didn't end up with an insane bracket: šŸ˜‰
šŸ€ 5
c
Everyone knows
params:best_team
is going to win this year.
🥳 1
i
I did this topic for grad school in 2017! We spent a lot of time getting Glicko2 to work as an Elo-like metric; it turned out to be a top 1/2 feature in all our models. A lot also went into comparing when to start calculating certain features (e.g. win rate since 2020). Fun stuff!
💡 2
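For anyone following along who hasn't used rating-based features: below is a minimal sketch of the classic Elo update, which is the simpler relative of Glicko2 (Glicko2 additionally tracks a rating deviation and a volatility per team, which this sketch leaves out). The function names, the K-factor of 32, and the starting rating of 1500 are illustrative assumptions, not anything taken from the projects mentioned in this thread.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(winner: float, loser: float, k: float = 32.0):
    """Return the (winner, loser) ratings after one game.

    k is an assumed K-factor; real systems tune it.
    """
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two evenly rated teams: the winner gains exactly what the loser drops.
new_winner, new_loser = update_elo(1500.0, 1500.0)
print(new_winner, new_loser)  # 1516.0 1484.0
```

The pre-game rating difference between two teams is the kind of quantity that would then be fed to a model as a feature.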
d
We spent a lot of time getting Glicko2 to work as an Elo-like metric - turned out to be top 1/2 feature in all our models
I did see a number of Elo-based approaches, too! I ended up going with this, since it seems to perform near the top consistently, and seemed like slightly less work to do haha.
Also a lot went into comparing when to start calculating certain features (e.g. win rate since 2020)
I think one of the biggest benefits of converting it to a Kedro pipeline is that you can experiment with these things more (rather than having to find places in the notebook to update them), but I obviously don't need to tell you that! (I also still have a number of things in this pipeline that I didn't parametrize, like
repeat_cv = 3  # recommend 10
showing up in 3 places 🙈 🙈 🙈)
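A rough sketch of what parametrizing that value could look like: the repeat count becomes a function argument that is read from one place, instead of a literal repeated in three. In a Kedro project the value would typically live in `parameters.yml` and be requested by a node as the input `"params:repeat_cv"`; here a plain dict stands in for that so the snippet runs without Kedro. The function name, `n_splits`, and the splitting scheme are hypothetical and not taken from the linked repository.

```python
import random
from typing import Iterator


def repeated_kfold(
    n_samples: int, n_splits: int, repeat_cv: int, seed: int = 0
) -> Iterator[tuple]:
    """Yield (repeat, fold, test_indices) for repeated k-fold CV."""
    for repeat in range(repeat_cv):
        # Reshuffle with a different seed on every repeat.
        rng = random.Random(seed + repeat)
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_splits):
            # Every n_splits-th shuffled index goes into this fold.
            yield repeat, fold, sorted(indices[fold::n_splits])


# Hypothetical stand-in for values loaded from parameters.yml.
params = {"repeat_cv": 3, "n_splits": 5}

folds = list(repeated_kfold(20, params["n_splits"], params["repeat_cv"]))
print(len(folds))  # 15
```

Bumping `repeat_cv` to the recommended 10 is then a one-line config change.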
i
Preaching to the choir! I looked back at my code and boy is it a nightmare. It's just a bunch of loose files haha