# questions
j
I don't know if I'm missing something simple, but how do you specify data dependencies in Kedro? I have some data which I sometimes generate from data earlier in my pipeline, but often I want to just cache it so I don't have to re-compute it when I run the pipeline again. How do I tell Kedro that if I regenerate earlier data in my pipeline, all the data downstream of it needs to be regenerated too?
d
can you show me an example of what your node and catalog look like today?
if i'm reading things correctly, we just need to specify catalog entries in order to persist the outputs of certain nodes?
j
I haven't fully integrated Kedro yet. The project I'm working on has needed a lot of tightening up of our data definitions before we could meaningfully create a data catalog. But yes, each node will persist outputs to pass to the next node. We're also running some backup code so that we have historical revisions of all our input, intermediate, and output data for auditing purposes.
I'm sure there's a better way to do it than manually with `rsync`, but it seems to be working OK for now.
d
so any node outputs you define without a catalog entry are automatically treated as ephemeral `MemoryDataset`s
this is actually equivalent to defining your catalog as:
```yaml
my_output:
  type: MemoryDataset
```
Now if you wanted to persist this, all you would need to do is swap the entry to a persistent dataset type, preserving the same output identifier:
```yaml
my_output:
  type: pandas.ParquetDataset
  filepath: s3://bucket/directory/...
```
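to your earlier question about dependencies: Kedro works out the execution order from matching dataset names, so any node that reads `my_output` as an input runs after the node that produces it. here's a minimal sketch of that wiring (the function names and the `raw_data`/`model` dataset names are made up for illustration):

```python
from kedro.pipeline import Pipeline, node

# hypothetical node functions, just to show the wiring
def make_features(raw_data):
    # stand-in for whatever computation produces my_output
    return raw_data

def train_model(features):
    # stand-in for a downstream step that consumes my_output
    return {"trained": True}

# Kedro builds the dependency graph from dataset names:
# train_model depends on make_features because it reads "my_output"
pipeline = Pipeline([
    node(make_features, inputs="raw_data", outputs="my_output"),
    node(train_model, inputs="my_output", outputs="model"),
])
```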
does that make sense?
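also, since you mentioned the `rsync` backups: if i remember right, catalog entries also accept `versioned: true`, which tells Kedro to save a timestamped copy of the dataset on each run. that might cover the audit-trail requirement without the manual copying.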