Hi kedroids. In my pipeline, I have this logic for...
# questions
s
Hi kedroids. In my pipeline, I have this logic for 2 nodes:
node 1: reads a table and runs a data process only on new items that are not yet in the table.
node 2: applies transformations to those new items and appends them to the same table.
I'm getting this error:
CircularDependencyError: Circular dependencies exist among these items: [node1 ...., node2]
Yes, the output of node 2 is an input for node 1. My goal is to not process all the items every time I run the pipeline, but only the new items not in that table. How can I do this? Thanks!! 🙂
d
So you’re in a funny place where Kedro’s design decisions will fight this. We’ve built things to guarantee reproducibility, and this sort of ‘out of Kedro DAG’ operation is hard to do.
If you really want to do this, Kedro hooks could potentially let you set things up in before_pipeline_run.
n
Any chance you can use IncrementalDataSet for this?
j
I don't recommend doing this, but I'm going to say it anyway, since it may work as a workaround for now. Instead of doing: All_DATA ---> \ ---> /Current Data --->
Try this instead (ONLY as a workaround):

# catalog.yaml
actual_data:
  ...
  filepath: .../01_raw/actual_data.csv

actual_data_updated:
  ...
  filepath: .../02_intermediate/actual_data_updated.csv

# nodes.py
Node1:
1. Reads <all_data> and compares it to <actual_data>
2. Calculates <new_records>
3. Does the transformations on <new_records>
4. * Does some magic to avoid the circular dependency *
5. Returns the new <actual_data_updated>

The trick in step 4 is to manually save ( pd.to_csv ) the <actual_data_updated> to the same path as <actual_data> (filepath: .../01_raw/actual_data.csv) without declaring this in pipeline.py, so that Kedro doesn't see the dependency problem.

You can do this as a workaround, but it is a very, very bad practice. If you are in a hurry, go ahead; if not, it is best to plan your pipeline better.
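For anyone reading later, here is a rough sketch of what Node1 from the steps above might look like. It is only an illustration under several assumptions: the rows are plain dicts rather than a DataFrame (so it runs with just the standard library), the `id` key column, the `transform` function, and the `actual_data_path` argument are all hypothetical names, and the hidden write stands in for the manual `pd.to_csv` call mentioned above.

```python
import csv
from pathlib import Path


def select_new_records(all_data, actual_data, key="id"):
    """Steps 1-2: keep only rows of all_data whose key is not in actual_data."""
    seen = {row[key] for row in actual_data}
    return [row for row in all_data if row[key] not in seen]


def transform(rows):
    """Step 3: placeholder transformation (hypothetical), upper-cases 'name'."""
    return [{**row, "name": row["name"].upper()} for row in rows]


def process_new_records(all_data, actual_data, actual_data_path, key="id"):
    """Node1 sketch: process only unseen rows, then overwrite the source table.

    all_data and actual_data are lists of dicts (assumed shape).
    actual_data_path is the file behind the <actual_data> catalog entry; it is
    passed in as a plain argument, so Kedro never sees it as a node output.
    """
    new_records = transform(select_new_records(all_data, actual_data, key))
    updated = actual_data + new_records

    # Step 4: the side-effect write that Kedro does not know about.
    if updated:
        with Path(actual_data_path).open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(updated[0].keys()))
            writer.writeheader()
            writer.writerows(updated)

    # Step 5: this return value is what Kedro saves as <actual_data_updated>.
    return updated
```

Since the write in step 4 is invisible to the DAG, Kedro cannot order or reproduce it, which is exactly why this is a workaround and not a pattern to rely on.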
s
Thanks all for your ideas 🙂