# questions
Melvin Kok
Hi team, is there a known issue around using dataset factories with ThreadRunner? I keep facing
DatasetAlreadyExistsError: Dataset '<dataset name>' has already been registered
πŸ‘€ 2
My hypothesis is that because I have two nodes that both take this dataset as input, when they are executed in parallel on Spark, the input dataset gets added to the catalog twice in a thread-unsafe manner?
I encountered this issue on both 0.18.14 and 0.19.3... If this isn't a known issue I can open an issue on GitHub πŸ™‚
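Roughly, the failing shape looks like this (dataset and function names are made up for illustration; the real input comes from a `{name}_data`-style factory pattern in the catalog, not an explicit entry):
```python
from kedro.pipeline import Pipeline, node
from kedro.runner import ThreadRunner


def count_rows(df):
    return len(df)


def preview(df):
    return df.head()


# Two nodes share the same factory-resolved input, e.g. "companies_data"
# matching a "{name}_data" pattern rather than an explicit catalog entry.
pipeline = Pipeline(
    [
        node(count_rows, inputs="companies_data", outputs="row_count"),
        node(preview, inputs="companies_data", outputs="company_head"),
    ]
)

# With ThreadRunner both nodes start at the same time, and each thread tries
# to resolve and register "companies_data" in the catalog:
# ThreadRunner().run(pipeline, catalog)  # intermittently raises DatasetAlreadyExistsError
```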
I created a hook as a temporary fix in case anyone is searching for the same issue here:
```python
from kedro.framework.hooks import hook_impl


class ResolveDatasetsHooks:
    @hook_impl
    def before_pipeline_run(self, pipeline, catalog):
        # Collect every dataset name that appears as a node input or output.
        data_sets = set()
        for node in pipeline.nodes:
            data_sets.update(node.outputs)
            data_sets.update(node.inputs)

        # Resolve each one up front so factory datasets are registered in the
        # catalog before the runner's threads race to do it concurrently.
        for ds in data_sets:
            catalog._get_dataset(ds)
```
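For completeness, the hook still has to be registered in the project's `settings.py`; adjust the import path to wherever the class lives (`my_project.hooks` below is just a placeholder):
```python
# settings.py
from my_project.hooks import ResolveDatasetsHooks  # placeholder module path

HOOKS = (ResolveDatasetsHooks(),)
```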
Nok Lam Chan
@Melvin Kok hey, thanks for reporting this to us. This is the first time I've seen this. Would you mind opening an issue and, if it's not too much, including an example we can reproduce on our end?
Melvin Kok
@Nok Lam Chan I've opened an issue: https://github.com/kedro-org/kedro/issues/3739 πŸ™‚
thankyou 1
πŸ‘ 1
Nok Lam Chan
Hey @Melvin Kok, love that you provided a clean script instead of a scaffolded project, it's very easy for me to run, appreciate your effort a lot ✨! I suspect this is related to https://github.com/kedro-org/kedro/issues/3720. Can you try changing `{name}` to `{abc}`? I tried changing the runner to `SequentialRunner`, which still fails, so maybe there is something wrong in the script. After I changed `{name}` I got a different error message, and that may solve your problem already.
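To be concrete, I just mean renaming the placeholder in the factory entry, something like this if the catalog is built inside the script (illustrative entry only; your dataset type and filepath will differ):
```python
from kedro.io import DataCatalog

catalog = DataCatalog.from_config(
    {
        # was "{name}_data" with filepath "data/01_raw/{name}.csv"
        "{abc}_data": {
            "type": "pandas.CSVDataset",
            "filepath": "data/01_raw/{abc}.csv",
        },
    }
)
```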
I left a new comment on the issue; I think there are some problems in the script. I managed to run it, but I'm not sure if that was your original intention.