# questions
j
I don't know if I'm missing something simple, but how do you specify data dependencies in Kedro? I have some data which I sometimes generate from data earlier in my pipeline, but often I want to just cache it so I don't have to re-compute it when I run the pipeline again. How do I tell Kedro that if I regenerate earlier data in my pipeline, all the data downstream of it needs to be regenerated too?
d
can you show me an example of what your node and catalog look like today?
if i'm reading things correctly, we just need to specify catalog entries in order to persist the outputs of certain nodes?
j
I haven't fully integrated Kedro yet. The project I'm working on has needed a lot of tightening up of our data definitions before we could meaningfully create a data catalog. But yes, each node will persist outputs to pass to the next node. We're also running some backup code so that we have historical revisions of all our input, intermediate, and output data for auditing purposes.
I'm sure there's a better way to do it than manually with `rsync`, but it seems to be working OK for now.
d
so any node outputs you define without a catalog entry are automatically treated as ephemeral `MemoryDataset`s
this is actually equivalent to defining your catalog as:
```yaml
my_output:
  type: MemoryDataset
```
Now if you wanted to persist this, all you would need to do is swap the entry to a persistent dataset type, preserving the same output identifier:
```yaml
my_output:
  type: pandas.ParquetDataset
  filepath: s3://bucket/directory/...
```
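to your earlier question about dependencies: Kedro works out the execution order from matching dataset names, so any node that reads `my_output` as an input runs after the node that produces it. here's a minimal sketch of that wiring (the function names and the `raw_data`/`model` dataset names are made up for illustration):

```python
from kedro.pipeline import Pipeline, node

# hypothetical node functions, just to show the wiring
def make_features(raw_data):
    # stand-in for whatever computation produces my_output
    return raw_data

def train_model(features):
    # stand-in for a downstream step that consumes my_output
    return {"trained": True}

# Kedro builds the dependency graph from dataset names:
# train_model depends on make_features because it reads "my_output"
pipeline = Pipeline([
    node(make_features, inputs="raw_data", outputs="my_output"),
    node(train_model, inputs="my_output", outputs="model"),
])
```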
does that make sense?
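also, since you mentioned the `rsync` backups: if i remember right, catalog entries also accept `versioned: true`, which tells Kedro to save a timestamped copy of the dataset on each run. that might cover the audit-trail requirement without the manual copying.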