Is it possible to reload data within a function/...
# questions
a
Is it possible to reload data within a function/node? For example:
```python
node(func=regenerate, inputs="mydata_sql", outputs="mydata_excel")

def regenerate(mydata):
    # run the SQL stored procedure that updates the table behind mydata_sql in the database
    # reload mydata, because the stored procedure will have changed it
    return mydata  # convert it to an Excel file
```
Unfortunately, recreating the data transformation of the stored procedure in Python may not be straightforward. That's why I depend on the stored procedure to transform/update my data.
d
Hi Afiq, I think that reloading data within a node isn't a best practice in Kedro. However, have you considered achieving this by dividing the process into multiple nodes step by step?
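To illustrate the "multiple nodes" idea, here is a minimal standalone sketch. It uses `sqlite3` as a stand-in for the real database, and the function names (`run_stored_proc`, `load_updated_table`) and the boolean flag dataset are assumptions, not anything from the original thread. In a Kedro pipeline, each function would become a `node(...)`, with the flag as a "dummy" dataset that forces the second node to run only after the stored procedure has executed:

```python
import sqlite3

def run_stored_proc(conn, sp_params):
    # Stand-in for executing the real stored procedure; here, a plain UPDATE.
    # Returning a flag lets a downstream node declare a dependency on this step.
    conn.execute("UPDATE mydata SET value = value * ?", (sp_params["factor"],))
    conn.commit()
    return True

def load_updated_table(sp_done, conn):
    # Runs only after run_stored_proc, because it consumes the sp_done flag.
    assert sp_done
    return conn.execute("SELECT id, value FROM mydata ORDER BY id").fetchall()

# Demo with an in-memory database standing in for the SQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (id INTEGER, value INTEGER)")
conn.executemany("INSERT INTO mydata VALUES (?, ?)", [(1, 10), (2, 20)])
flag = run_stored_proc(conn, {"factor": 2})
rows = load_updated_table(flag, conn)
# rows == [(1, 20), (2, 40)]
```

The flag carries no data; its only job is to make the execution order explicit in the pipeline graph.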
a
@Dmitry Sorokin Basically, what I want to do with Kedro is the ability to update my SQL tables by executing some stored procedures. So the expected output from these stored procedures would be the updated SQL tables. Once the SQL tables are updated, I want to output them as Excel files. I think the main challenge I have is to create a pipeline with the right inputs and outputs so that when I execute the pipeline, it will always start with executing the stored procedures, return the SQL table, and then output the SQL table as Excel.
The node that takes in a SQL table and outputs an Excel file is fine. It's pretty straightforward. But I can't seem to properly create a node that takes in a set of parameters (related to stored procedures) and returns a SQL table.
n
So if I understand correctly - you want to load `mydata_sql`, but you need to make sure the stored procedure gets executed first?
Do you have processing logic inside `regenerate`, and does it take any `parameters`? If not, I think `before_dataset_loaded` is a good candidate: https://docs.kedro.org/en/stable/kedro.framework.hooks.specs.DatasetSpecs.html#kedro.framework.hooks.specs.DatasetSpecs.before_dataset_loaded If yes, it's a bit tricky, because it's not pure I/O but it's also not processing logic: the real compute happens in the database, and your code only triggers the SP and loads the data (which is more the responsibility of a dataset). In that case, you will most likely need some "dummy input/output" to make sure the dependency order is correct.
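A rough sketch of the `before_dataset_loaded` hook approach. To keep the sketch runnable without a Kedro project, the `@hook_impl` decorator is shown commented out, and `execute_sp` is an assumed callable that triggers the stored procedure; the method signature follows the `DatasetSpecs` page linked above. In a real project you would uncomment the decorator and register an instance of the class in `HOOKS` in `settings.py`:

```python
# from kedro.framework.hooks import hook_impl  # uncomment inside a Kedro project

class RunStoredProcHook:
    """Triggers the stored procedure whenever mydata_sql is about to be loaded."""

    def __init__(self, execute_sp):
        # execute_sp: any callable that runs the stored procedure (assumption)
        self.execute_sp = execute_sp

    # @hook_impl  # uncomment inside a Kedro project
    def before_dataset_loaded(self, dataset_name, node=None):
        # Only fire for the dataset that depends on the stored procedure
        if dataset_name == "mydata_sql":
            self.execute_sp()

# Quick check of the dispatch logic with a recording stub
calls = []
hook = RunStoredProcHook(lambda: calls.append("sp"))
hook.before_dataset_loaded("mydata_sql")
hook.before_dataset_loaded("other_dataset")
# calls == ["sp"]  -> the SP is triggered only for mydata_sql
```

With this in place, any node that declares `mydata_sql` as an input gets fresh data, and `regenerate` itself stays a pure transform.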
a
@Nok Lam Chan the `regenerate` node takes in `parameters`, but these `parameters` are actually for the stored procedures (SP). At the moment there's no plan to migrate the SP to Python, hence why we rely on the SP. The latest code iteration I have is to reload the data within the node. Added this to `nodes.py`:
nodes.py
```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

# Build a second catalog manually so the node can reload data on demand
config_loader = OmegaConfigLoader(conf_source="conf")
credentials = config_loader["credentials"]
catalog_config = config_loader["catalog"]
thedata = DataCatalog.from_config(catalog_config, credentials)
```
```python
def regenerate(mydata):
    # run SP
    # reload the SQL data after SP execution
    mydata = thedata.load("mydata_sql")
    return mydata
```
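Nok's remark that triggering the SP is "more the responsibility of a dataset" suggests another option: a custom dataset whose load runs the stored procedure first, so nodes never reload anything themselves. Below is a standalone sketch of that pattern; the class name is an assumption, `sqlite3` stands in for the real database, and in a real project the class would subclass `kedro.io.AbstractDataset` and implement `_load`/`_save`:

```python
import sqlite3

class StoredProcSQLDataset:
    """Sketch: a dataset whose load() runs the stored procedure before reading.
    In a Kedro project this would subclass kedro.io.AbstractDataset."""

    def __init__(self, conn, table, sp_sql):
        self.conn = conn
        self.table = table
        self.sp_sql = sp_sql  # stand-in for something like "EXEC my_stored_proc"

    def load(self):
        # Run the stored procedure, then read the freshly updated table
        self.conn.execute(self.sp_sql)
        self.conn.commit()
        return self.conn.execute(
            f"SELECT id, value FROM {self.table} ORDER BY id"
        ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (id INTEGER, value INTEGER)")
conn.execute("INSERT INTO mydata VALUES (1, 10)")
ds = StoredProcSQLDataset(conn, "mydata", "UPDATE mydata SET value = value + 1")
print(ds.load())  # [(1, 11)] -- the table is refreshed on every load
```

Registered in the catalog, such a dataset would let the pipeline stay a plain `mydata_sql -> mydata_excel` node with no manual reloading inside `regenerate`.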