# questions
Melvin Kok
Hi team, is there a known issue around using dataset factories with ThreadRunner? I keep facing
DatasetAlreadyExistsError: Dataset '<dataset name>' has already been registered
πŸ‘€ 2
My hypothesis is that because I have two nodes that both take this dataset as input, when they are executed in parallel on Spark, the input dataset gets added to the catalog twice in a thread-unsafe manner?
I encountered this issue on both 0.18.14 and 0.19.3... If this isn't a known issue I can open an issue on GitHub πŸ™‚
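Roughly, the failing shape looks like this (dataset and function names are made up for illustration; the real input comes from a `{name}_data`-style factory pattern in the catalog, not an explicit entry):
```python
from kedro.pipeline import Pipeline, node
from kedro.runner import ThreadRunner


def count_rows(df):
    return len(df)


def preview(df):
    return df.head()


# Two nodes share the same factory-resolved input, e.g. "companies_data"
# matching a "{name}_data" pattern rather than an explicit catalog entry.
pipeline = Pipeline(
    [
        node(count_rows, inputs="companies_data", outputs="row_count"),
        node(preview, inputs="companies_data", outputs="company_head"),
    ]
)

# With ThreadRunner both nodes start at the same time, and each thread tries
# to resolve and register "companies_data" in the catalog:
# ThreadRunner().run(pipeline, catalog)  # intermittently raises DatasetAlreadyExistsError
```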
I created a hook as a temporary fix in case anyone is searching for the same issue here:
```python
from kedro.framework.hooks import hook_impl


class ResolveDatasetsHooks:
    @hook_impl
    def before_pipeline_run(self, pipeline, catalog):
        # Collect every dataset name that appears as a node input or output.
        data_sets = set()
        for node in pipeline.nodes:
            data_sets.update(node.outputs)
            data_sets.update(node.inputs)

        # Resolve each one up front so factory datasets are registered in the
        # catalog before the runner's threads race to do it concurrently.
        for ds in data_sets:
            catalog._get_dataset(ds)
```
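For completeness, the hook still has to be registered in the project's `settings.py`; adjust the import path to wherever the class lives (`my_project.hooks` below is just a placeholder):
```python
# settings.py
from my_project.hooks import ResolveDatasetsHooks  # placeholder module path

HOOKS = (ResolveDatasetsHooks(),)
```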
Nok Lam Chan
@Melvin Kok hey, thanks for reporting this to us. This is the first time I've seen this. Would you mind opening an issue and, if it's not too much, including an example we can reproduce on our end?
Melvin Kok
@Nok Lam Chan I've opened an issue: https://github.com/kedro-org/kedro/issues/3739 πŸ™‚
thankyou 1
πŸ‘ 1
Nok Lam Chan
Hey @Melvin Kok, love that you provided a clean script instead of a scaffolded project, it's very easy for me to run, appreciate your effort a lot ✨! I suspect this is related to https://github.com/kedro-org/kedro/issues/3720. Can you try changing `{name}` to `{abc}`? I tried changing the runner to `SequentialRunner`, which still fails, so maybe there is something wrong in the script. After I changed `{name}` I got a different error message, and that may solve your problem already.
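To be concrete, I just mean renaming the placeholder in the factory entry, something like this if the catalog is built inside the script (illustrative entry only; your dataset type and filepath will differ):
```python
from kedro.io import DataCatalog

catalog = DataCatalog.from_config(
    {
        # was "{name}_data" with filepath "data/01_raw/{name}.csv"
        "{abc}_data": {
            "type": "pandas.CSVDataset",
            "filepath": "data/01_raw/{abc}.csv",
        },
    }
)
```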
I left a new comment on the issue; I think there are some problems in the script. I managed to run it, but I'm not sure if that was your original intention.