# questions
m
Hi Kedro community! QQ, wondering if anyone has been able to successfully build a Kedro pipeline which integrates with Amazon SageMaker to train an XGBoost model on a Spark dataframe? It looks like it should be possible? Cheers!
m
Thanks Data Joely, I have taken a look at this. Is it as straightforward as just referencing a Spark dataframe from your catalog and initialising a Spark session? Or is it a bit more involved?
d
I’m not entirely sure about the SageMaker specifics. @marrrcin, do you know by any chance?
n
Assuming you are already able to run Spark/XGBoost/SageMaker together, kedro-sagemaker helps you translate the Kedro pipeline into a SageMaker-native one, which lets you use the SageMaker UI.
m
Thanks @Nok Lam Chan. Yup, I have managed to run Spark/XGBoost together in a Kedro pipeline which then runs on Databricks, but training times are excessive, so I was wondering if using SageMaker might speed things up a bit?
d
I’m not sure that’s the case. Do you have a sense of where the pipeline is slow?
m
It's def the model training, but can't see what is driving it exactly. Not sure if it's a combination of the modelling approach and size of the training data.
d
So if you’re training on massive data, there are obvious questions about how effective sampling will be,
but then also (and this isn’t my area of expertise) how much you can leverage the distributed part of the library.
w
XGBoost can benefit from a GPU: https://xgboost.readthedocs.io/en/stable/gpu/index.html So if you have access to SageMaker you can select a GPU enabled instance to run your pipeline
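To make the GPU suggestion above concrete, here is a minimal sketch of how the device switch might look in a training node. It assumes XGBoost 2.0 or later, where the `device` parameter selects the GPU (older releases used `tree_method="gpu_hist"` instead). The helper names `build_xgb_params` and `train`, the objective, and the hyperparameter values are all hypothetical and not taken from this thread.

```python
# Sketch only: parameter names assume XGBoost >= 2.0.
# Older releases selected the GPU with tree_method="gpu_hist".


def build_xgb_params(use_gpu: bool) -> dict:
    """Return training parameters, switching only the compute device."""
    return {
        "objective": "binary:logistic",  # hypothetical task; adjust to yours
        "tree_method": "hist",           # histogram algorithm, CPU or GPU
        "device": "cuda" if use_gpu else "cpu",
        "max_depth": 6,                  # illustrative value
    }


def train(X, y, use_gpu: bool = False, num_rounds: int = 100):
    """Train a booster on in-memory data, e.g. a pandas sample of the Spark data."""
    # Imported lazily so build_xgb_params has no third-party dependency.
    import xgboost as xgb

    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(build_xgb_params(use_gpu), dtrain, num_boost_round=num_rounds)
```

Keeping the device choice in one small function means the same Kedro node could run on a CPU instance locally and on a GPU-enabled SageMaker instance by changing a single parameter.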
m
Yeah, I agree with @William Caicedo. A GPU would probably be the easiest "speed up" option. Or switching to LightGBM 😛 As for Spark & SageMaker, I'm not sure that you will get anything from it unless you actually do distributed computing, which is not supported by the plugin. You could also try
from sagemaker.xgboost.estimator import XGBoost
(the XGBoost implementation made by SageMaker), but it has its own quirks.
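As a rough illustration of the estimator mentioned above, the sketch below shows how it might be configured with the SageMaker Python SDK (v2). Everything specific here is an assumption: the `train.py` entry point, the `1.7-1` framework version, the instance types, and the hyperparameters are placeholders, and whether the GPU is actually used depends on the training script and container.

```python
def build_estimator_kwargs(role_arn: str, use_gpu: bool = True) -> dict:
    """Collect constructor arguments for the SageMaker XGBoost estimator."""
    return {
        "entry_point": "train.py",     # hypothetical training script
        "framework_version": "1.7-1",  # assumed container version
        "role": role_arn,              # IAM role the training job runs under
        "instance_count": 1,           # single instance, no distributed training
        # Assumed instance types; pick whatever your account offers.
        "instance_type": "ml.g4dn.xlarge" if use_gpu else "ml.m5.xlarge",
        # In script mode these are passed to the entry point as arguments.
        "hyperparameters": {"max_depth": 6, "num_round": 100},
    }


def make_estimator(role_arn: str, use_gpu: bool = True):
    """Build the estimator; needs the sagemaker package and AWS credentials."""
    # Imported lazily so the kwargs helper can be checked without the SDK.
    from sagemaker.xgboost.estimator import XGBoost

    return XGBoost(**build_estimator_kwargs(role_arn, use_gpu))
```

Training would then be started with something like `make_estimator(role).fit({"train": "s3://<your-bucket>/train"})`, where the S3 location is a placeholder for your own data.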
n
Is XGBoost still slower than LightGBM? Just curious 🤔 given the enormous engineering effort that was put into XGBoost.
m
thanks all! Lots to look into here!
m
@Nok Lam Chan it's usually slower, in my experience.