#questions

Debanjan Banerjee

11/17/2022, 1:32 PM
Hi Team, can we make dataset loads to a Kedro node parallel? I have 19 datasets that I need to read in a single function, but when I look at the logs it is taking 2 minutes per dataset, i.e. ~40 minutes of read time. Is there any way we can make the reads parallel? I don't think they conflict in any way, so I don't see why they need to be sequential.

datajoely

11/17/2022, 1:36 PM
please don’t tag us directly - someone will answer you. What kind of dataset?

Debanjan Banerjee

11/17/2022, 1:38 PM
This is a custom dataset we have created
pulling in heavy datasets

datajoely

11/17/2022, 1:39 PM
well then Kedro isn’t mandating sequential reads?
there are two types of parallelism: you can do multi-processing in the dataset itself if you want to do some sort of chunking,
or you can use the ParallelRunner or ThreadRunner to run non-dependent tasks at the same time
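
To illustrate the first option (parallelism inside the dataset itself), here is a minimal sketch using only the Python standard library. `read_chunk` and `ChunkedDataset` are hypothetical names invented for this example and are not part of Kedro; a real custom dataset would subclass Kedro's dataset base class and do this work in its load method. The sketch uses threads rather than processes, on the assumption that the reads are I/O-bound.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_chunk(chunk_id):
    """Hypothetical stand-in for fetching one slice of a heavy dataset."""
    time.sleep(0.05)  # simulate I/O latency
    return [chunk_id * 10 + i for i in range(3)]


class ChunkedDataset:
    """Hypothetical dataset that reads its chunks concurrently (not a Kedro class)."""

    def __init__(self, n_chunks, max_workers=4):
        self.n_chunks = n_chunks
        self.max_workers = max_workers

    def load(self):
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            # pool.map returns results in chunk order, even if reads finish out of order
            chunks = pool.map(read_chunk, range(self.n_chunks))
            return [row for chunk in chunks for row in chunk]


data = ChunkedDataset(n_chunks=4).load()
```

Whether threads or processes suit a real dataset better depends on whether the load is dominated by waiting on I/O or by CPU-heavy parsing.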

Debanjan Banerjee

11/17/2022, 1:40 PM
This is a dataset-agnostic question. I have 19 catalog entries; how can I make sure that they are loaded at the same time? I'm assuming that by default they are loaded one after another?

datajoely

11/17/2022, 1:42 PM
oh as in 19 dataset loads to the same node

Debanjan Banerjee

11/17/2022, 1:42 PM
indeed

datajoely

11/17/2022, 1:42 PM
that is a simple loop, and you can implement your own runner if you want to

Debanjan Banerjee

11/17/2022, 1:42 PM
it loads like this:
00:02 INFO Loading data from xxxxx
00:04 INFO Loading data from yyyyy

datajoely

11/17/2022, 1:43 PM
oh sorry
we have this
kedro run --async
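
As a rough illustration of why loading a node's inputs concurrently helps, the sketch below compares sequential and thread-based loading of several slow inputs. It uses only the standard library and shows the general idea, not Kedro's actual implementation of `--async`; `load_dataset` and the dataset names are hypothetical stand-ins for slow catalog loads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_dataset(name):
    """Hypothetical slow load; sleeps to simulate waiting on I/O."""
    time.sleep(0.2)
    return f"data for {name}"


names = [f"dataset_{i}" for i in range(5)]

# Sequential: each load waits for the previous one to finish.
start = time.perf_counter()
sequential_result = {name: load_dataset(name) for name in names}
sequential_seconds = time.perf_counter() - start

# Concurrent: all loads are submitted at once and overlap their waiting time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    futures = {name: pool.submit(load_dataset, name) for name in names}
    parallel_result = {name: future.result() for name, future in futures.items()}
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, concurrent: {parallel_seconds:.2f}s")
```

The gain comes from overlapping time spent waiting on I/O, so loads that are CPU-bound would likely benefit much less from threads.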

Debanjan Banerjee

11/17/2022, 1:43 PM
aaah lemme check this
OMG it reduced the load time to 10 minutes hahahahaha, incredible , thanks!!!

Jo Stichbury

11/17/2022, 2:20 PM
Nice ! I was going to ask if this is documented @datajoely but it is, and I'd completely forgotten about it 😆 I think a blog post about optimising Kedro usage, with this kind of tip, could be a good one :)

Debanjan Banerjee

11/17/2022, 2:21 PM
Fully agree! I would encourage separate documentation for this too; this is much broader than just "running a pipeline"

datajoely

11/17/2022, 3:02 PM
A question for @Merel: is there an argument to make this the default behaviour?

Merel

11/17/2022, 3:12 PM
Not that I’m aware of!

datajoely

11/17/2022, 3:12 PM
as in, it's been ruled out?

Merel

11/17/2022, 3:13 PM
Oh haha I thought you meant argument as in argument to the CLI 😂
No I guess we just haven’t discussed it

datajoely

11/17/2022, 3:14 PM
what’s the best way to do so? An issue?

Debanjan Banerjee

11/17/2022, 3:14 PM
calls for a user feature vote ? 😛

Merel

11/17/2022, 3:14 PM
Issue sounds good! You can mark it as “technical design” so we’ll discuss with the team

Jo Stichbury

11/17/2022, 3:22 PM
LOL, if you add it as default I won't have the makings of an infomercial blog post any more 😆

Debanjan Banerjee

11/17/2022, 3:23 PM
@datajoely do you want me to open the issue about this? I have screenshots etc. and can put it up

datajoely

11/17/2022, 3:27 PM
I’m doing it
as in 90% written

Debanjan Banerjee

11/17/2022, 3:30 PM
Awesome, the only pushback I can expect is with the ThreadRunner, as async calls don't work with it, but let's see

datajoely

11/17/2022, 3:30 PM
I’m proposing a fallback
but I’m not sure if it’s possible