# questions
**Benjamin:**
Hello everyone! I created a pipeline with several nodes, using PySpark, and everything works fine. However, I have now moved to a much bigger PySpark DataFrame and I don't have the computational resources to run it. My idea would be to split the DataFrame and call the pipeline n times on different subsets of the dataset. How would you do it? Is it a good idea? Would you do it another way? Thanks for your help!
**Dmitry:**
Hi Benjamin, could you please provide more details about where you are running your Kedro project? Are you using a Spark cluster? Also, has one of your nodes failed?
**Benjamin:**
Hi Dmitry, thanks for your help! I am running it on Databricks, on a cluster with Apache Spark (Python and Scala). It crashes on one of the first nodes, while doing calculations on the huge DataFrame, with a memory error. I guess I could update the node to run on subsets of this DataFrame, but it would take time to change all the nodes to avoid memory errors. So I was wondering if there was a way to run a pipeline several times on different subsets.
**Dmitry:**
I recommend starting by optimising your cluster configuration, since Spark is designed to handle large datasets efficiently. Consider adding more resources, or run the failing node's code in a Databricks notebook without Kedro to pinpoint the cause of the failure. If these adjustments don't resolve the issue, you can process your pipeline in segments: split your large dataset into chunks, perhaps keyed by a `partition_id`; parameterise your datasets in `catalog.yml` to point at those partitions and save the results; and finally run the Kedro pipeline once per chunk with different parameters. You might find it useful to create a script to automate this process. Rough sketches of each step follow.
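Not a definitive recipe, but a minimal sketch of the splitting step, assuming the raw data lives in Parquet and has some key column (here a hypothetical `id`) that can be hashed into a stable chunk id; the paths and chunk count are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

N_CHUNKS = 10  # illustrative: pick what your cluster can handle per run

# Tag each row with a stable chunk id derived from a key column
# ("id" is a hypothetical column name).
df = spark.read.parquet("data/01_raw/big_dataset")  # illustrative path
df = df.withColumn("partition_id", F.abs(F.hash("id")) % N_CHUNKS)

# Write one folder per chunk: .../partition_id=0, .../partition_id=1, ...
df.write.partitionBy("partition_id").mode("overwrite").parquet(
    "data/01_raw/big_dataset_chunks"
)
```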
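The `catalog.yml` entries could then be templated on a `partition_id` runtime parameter. This assumes a recent Kedro with `OmegaConfigLoader` and its `runtime_params` resolver (older projects achieved the same with `TemplatedConfigLoader` and globals), and that `kedro-datasets` provides `spark.SparkDataset` (spelled `SparkDataSet` in older releases):

```yaml
# conf/base/catalog.yml -- illustrative entries
big_dataset_chunk:
  type: spark.SparkDataset
  filepath: data/01_raw/big_dataset_chunks/partition_id=${runtime_params:partition_id}
  file_format: parquet

results_chunk:
  type: spark.SparkDataset
  filepath: data/02_intermediate/results/partition_id=${runtime_params:partition_id}
  file_format: parquet
  save_args:
    mode: overwrite
```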
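And a small driver script could loop over the chunks, passing a different `partition_id` to each run; `KedroSession.create` accepts `extra_params` for exactly this. Each iteration is equivalent to `kedro run --params partition_id=<n>` on the command line:

```python
# run_chunks.py -- illustrative driver script, run from the project root
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

N_CHUNKS = 10  # must match the number used when splitting

bootstrap_project(Path.cwd())

for partition_id in range(N_CHUNKS):
    # Each run resolves the templated catalog paths for one chunk.
    with KedroSession.create(extra_params={"partition_id": partition_id}) as session:
        session.run()
```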
**Benjamin:**
Thank you for your help!