Kedro #questions

Hi kedroids. In my pipeline, I have this logic for...

Sebastian Cardona Lozano

06/16/2023, 2:14 PM

Hi kedroids. In my pipeline, I have this logic for 2 nodes:
node 1: reads a table and runs a data process only on the new items that are not yet in the table.
node 2: runs transformations on those new items and appends them to the same table.
I'm getting this error:

CircularDependencyError: Circular dependencies exist among these items: [node1 ...., node2]

Yes, the output of node 2 is an input for node 1. My goal is to not process all the items every time I run the pipeline, but only the new items not in that table. How can I do this? Thanks!! 🙂

datajoely

06/16/2023, 2:23 PM

so you’re in a funny place where Kedro’s design decisions will fight this - we’ve built things to guarantee reproducibility and this sort of ‘out of Kedro DAG’ operation is hard to do

datajoely

06/16/2023, 2:24 PM

If you really want to do this, Kedro hooks may let you set things up in before_pipeline_run
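One way the hook idea could look, as a rough sketch rather than a recommended pattern: before the run starts, copy the current table to a snapshot file, so node 1 can read the snapshot dataset while node 2 writes the table dataset and the DAG no longer has a cycle. The class name `SnapshotTableHooks` and both paths are hypothetical; `hook_impl` and the `before_pipeline_run(run_params, pipeline, catalog)` hook are Kedro's own, and the fallback decorator is only there so the sketch runs without Kedro installed.

```python
import shutil
from pathlib import Path

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch also runs without Kedro installed
    def hook_impl(func):
        return func


class SnapshotTableHooks:
    """Hypothetical hook: snapshot the table before each pipeline run."""

    def __init__(self, table_path, snapshot_path):
        self.table_path = Path(table_path)
        self.snapshot_path = Path(snapshot_path)

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Copy the table as it exists now; node 1 would read this copy,
        # node 2 would write the real table, so Kedro sees no cycle.
        if self.table_path.exists():
            self.snapshot_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(self.table_path, self.snapshot_path)
```

The hook would be registered in the project's settings.py, for example `HOOKS = (SnapshotTableHooks("<table path>", "<snapshot path>"),)`, with the snapshot and the table declared as two separate catalog entries.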

Nok Lam Chan

06/16/2023, 2:35 PM

Any chance you can use IncrementalDataSet for this?
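For context, the general idea behind an incremental dataset is a checkpoint: only partitions newer than the last confirmed one are loaded, and the checkpoint moves forward once they have been processed. The sketch below illustrates that idea with the standard library only. It is not Kedro's implementation or API; the function names, the `CHECKPOINT` file name, and the assumption that partition file names sort chronologically are all made up for illustration.

```python
from pathlib import Path

CHECKPOINT_NAME = "CHECKPOINT"  # hypothetical checkpoint file inside the folder


def load_new_partitions(folder: Path) -> dict[str, str]:
    """Return only the partitions whose name sorts after the checkpoint."""
    checkpoint_file = folder / CHECKPOINT_NAME
    checkpoint = checkpoint_file.read_text().strip() if checkpoint_file.exists() else ""
    return {
        path.name: path.read_text()
        for path in sorted(folder.glob("*.csv"))
        if path.name > checkpoint
    }


def confirm(folder: Path, processed: dict[str, str]) -> None:
    """Move the checkpoint to the latest processed partition."""
    if processed:
        (folder / CHECKPOINT_NAME).write_text(max(processed))
```

With this shape, new items arrive as new partition files instead of being appended to the table that node 1 reads, so the output of node 2 never has to feed back into node 1.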

Jose Nuñez

06/16/2023, 2:53 PM

~~I don't suggest doing this, but I'm going to say it anyway; it may work as a workaround for now. Instead of doing: All_DATA ---> \ ---> / Current Data --->~~

Jose Nuñez

06/16/2023, 3:14 PM

Try this instead (ONLY as a workaround):

# catalog.yaml:
actual_data:
  ...
  filepath: .../01_raw/actual_data.csv

actual_data_updated:
  ...
  filepath: .../02_intermediate/actual_data_updated.csv

# nodes.py
Node1:
1. Reads <all_data> and compares it to <actual_data>
2. Calculates <new_records>
3. Does the transformations on <new_records>
4. * Does some magic to avoid the circular dependency *
5. Returns the new <actual_data_updated>

The trick in step 4 is to manually save (with the DataFrame's to_csv method) <actual_data_updated> to the same path as <actual_data> (filepath: .../01_raw/actual_data.csv), without declaring this in pipeline.py, so that Kedro doesn't see the dependency problem.

You can do this as a workaround, but it is a very, very bad practice. If you are in a hurry, go ahead; if not, it is best to plan your pipeline better.
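A minimal sketch of what that single node might look like, assuming pandas DataFrames and a hypothetical `id` key column. The function name, the `key` argument, the placeholder transformation, and passing the file path in as an argument are all assumptions made for illustration; the side-effect write is exactly the part the thread calls bad practice, because Kedro cannot see it.

```python
import pandas as pd


def process_new_records(
    all_data: pd.DataFrame,
    actual_data: pd.DataFrame,
    actual_data_path: str,  # hypothetical: same file the actual_data entry points to
    key: str = "id",  # hypothetical key column
) -> pd.DataFrame:
    # Steps 1-2: keep only the rows whose key is not already in the table.
    new_records = all_data[~all_data[key].isin(actual_data[key])].copy()

    # Step 3: placeholder transformation, stands in for the real processing.
    new_records["processed"] = True

    # Step 5: the updated table is the old rows plus the new ones.
    actual_data_updated = pd.concat([actual_data, new_records], ignore_index=True)

    # Step 4: side-effect write that is NOT declared in pipeline.py,
    # so Kedro does not see the circular dependency.
    actual_data_updated.to_csv(actual_data_path, index=False)

    return actual_data_updated
```

On the next run, <actual_data> already contains the rows written here, so only records with unseen keys go through the transformation.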

Sebastian Cardona Lozano

06/17/2023, 1:29 AM

Thanks all for your ideas 🙂
