#questions

Mark Einhorn

03/12/2024, 1:41 PM
Hi Kedro community! QQ, wondering if anyone has been able to successfully build a Kedro pipeline which integrates with Amazon SageMaker to train an XGBoost model on a Spark dataframe? It looks like it should be possible? Cheers!

Mark Einhorn

03/12/2024, 1:49 PM
thanks Data Joely, have taken a look at this. Is it as straightforward as just referencing a Spark dataframe from your catalog and initialising a Spark session? Or is it a bit more involved?

datajoely

03/12/2024, 1:49 PM
I’m not entirely sure about SageMaker specifics. @marrrcin, do you know by any chance?

Nok Lam Chan

03/12/2024, 2:01 PM
Assuming you are already able to run Spark/XGBoost/SageMaker together, `kedro-sagemaker` helps you translate the Kedro pipeline into a SageMaker-native one, which lets you use the SageMaker UI.

Mark Einhorn

03/12/2024, 2:59 PM
thanks @Nok Lam Chan. Yup, have managed to run Spark/XGBoost together in a Kedro pipeline which then runs on Databricks, but training times are excessive, so was wondering if using SageMaker might speed things up a bit?

datajoely

03/12/2024, 3:10 PM
I’m not sure if that’s the case, do you get a sense of where the pipeline is slow?

Mark Einhorn

03/12/2024, 3:46 PM
It's def the model training, but can't see what is driving it exactly. Not sure if it's a combination of the modelling approach and size of the training data.

datajoely

03/12/2024, 3:59 PM
so if you’re training on massive data there are obvious questions about how effective sampling will be
but then also (and this isn’t my area of expertise) how much you can leverage the distributed part of the library

William Caicedo

03/12/2024, 8:04 PM
XGBoost can benefit from a GPU: https://xgboost.readthedocs.io/en/stable/gpu/index.html So if you have access to SageMaker you can select a GPU enabled instance to run your pipeline
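A minimal sketch of what the GPU suggestion could look like in code, not taken from the thread. It assumes XGBoost 2.0 or later, where the GPU is selected with `device="cuda"`; older releases used `tree_method="gpu_hist"` instead. The helper names `xgb_params` and `train`, and the objective and depth values, are hypothetical placeholders.

```python
from typing import Any, Dict


def xgb_params(use_gpu: bool, legacy_api: bool = False) -> Dict[str, Any]:
    """Build XGBoost training parameters for CPU or GPU training."""
    # Placeholder hyperparameters, chosen only for illustration.
    params: Dict[str, Any] = {"objective": "binary:logistic", "max_depth": 6}
    if not use_gpu:
        params["tree_method"] = "hist"
    elif legacy_api:
        # XGBoost < 2.0 selected the GPU through the tree method.
        params["tree_method"] = "gpu_hist"
    else:
        # XGBoost >= 2.0 selects the GPU through the `device` parameter.
        params.update(tree_method="hist", device="cuda")
    return params


def train(features, labels, use_gpu: bool = True, rounds: int = 100):
    """Train a booster. xgboost is imported lazily, so `xgb_params`
    can be used (and inspected) without the library installed."""
    import xgboost as xgb

    dtrain = xgb.DMatrix(features, label=labels)
    return xgb.train(xgb_params(use_gpu), dtrain, num_boost_round=rounds)
```

In a Kedro project, `train` could be wrapped in a node that receives a pandas or NumPy dataset. A Spark dataframe would first have to be collected or sampled to fit on one machine, which ties back to the sampling question raised earlier in the thread.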

marrrcin

03/13/2024, 7:01 AM
Yeah, I agree with @William Caicedo - GPU would probably be the easiest "speed up" option. Or switching to LightGBM 😛 As for Spark & SageMaker - I'm not sure that you will get anything from it, unless you actually do distributed computing, which is not supported by the plugin. You could also try `from sagemaker.xgboost.estimator import XGBoost` (the XGBoost implementation made by SageMaker), but it has its own quirks.
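For reference, a rough sketch of how the SageMaker `XGBoost` estimator mentioned above is typically configured. This is not from the thread: the role ARN, the `train.py` entry point, the instance type, the framework version and the hyperparameters are all placeholder assumptions, and `estimator_kwargs` / `build_estimator` are hypothetical helper names.

```python
from typing import Any, Dict


def estimator_kwargs(
    role_arn: str,
    entry_point: str = "train.py",        # hypothetical training script
    instance_type: str = "ml.m5.xlarge",  # swap for a GPU instance if desired
) -> Dict[str, Any]:
    """Collect the arguments for the SageMaker XGBoost estimator."""
    return {
        "entry_point": entry_point,
        "framework_version": "1.7-1",  # assumed container version
        "role": role_arn,
        "instance_count": 1,
        "instance_type": instance_type,
        # Passed to the entry point script as command-line arguments.
        "hyperparameters": {"max_depth": 6, "num_round": 100},
    }


def build_estimator(role_arn: str):
    """Create the estimator. The SageMaker SDK is imported lazily, so
    the configuration above can be checked without AWS access."""
    from sagemaker.xgboost.estimator import XGBoost

    return XGBoost(**estimator_kwargs(role_arn))
```

Training would then be started with something like `build_estimator(role).fit({"train": "s3://<your-bucket>/train"})`, where the S3 location is a placeholder for your own data.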

Nok Lam Chan

03/13/2024, 9:12 AM
Is XGBoost still slower than LightGBM? Just curious 🤔 given the enormous engineering effort put into XGBoost.

Mark Einhorn

03/13/2024, 10:58 AM
thanks all! Lots to look into here!

marrrcin

03/13/2024, 2:30 PM
@Nok Lam Chan it's usually slower in my experience