# questions
f
Hi everyone. I'm having an issue with Kedro 0.18.4 and transcoded datasets that I don't quite understand:
```
ValueError: The following datasets are used with transcoding, but were referenced without the separator: typed_invoices
Please specify a transcoding option or rename the datasets.
```
Details within thread.
I have the following datasets:
```yaml
typed_invoices:
  type: prod.datasets.dask.DaskAvroParquetDataSet
  ...

typed_invoices@focus_revenue:
  type: prod.datasets.dask.FilterableParquetDataSet
  ...

typed_invoices@with_partition_filters:
  type: prod.datasets.dask.FilterableParquetDataSet
  ...
```
The datasets are custom ones, but they just override `_load` and `_save`. These datasets are used in the following places:

1. As an output of a node:
```python
node(
    func=repartition,
    inputs=dict(
        data='...',
        partition_size='...',
    ),
    outputs='typed_invoices',
),
```
2. As input to two other nodes:

```python
node(
    func=outlier_detection,
    inputs=dict(data='typed_invoices@focus_revenue'),
    outputs='...',
    name='write_outlier_score_v1',
),

...,

node(
    func=merge,
    inputs=dict(
        left_data='typed_invoices@with_partition_filters',
        right_data='...',
        on='...',
    ),
    outputs='invoices_with_outlier_score',
),
```
When I run my pipeline, I get the following error:
```
ValueError: The following datasets are used with transcoding, but were referenced without the separator: typed_invoices
Please specify a transcoding option or rename the datasets.
```
I'm not sure I understand this error. If I use transcoding, am I not allowed to have a 'non-transcoded' version of the dataset anymore?
d
> If I use transcoding, am I not allowed to have a 'non-transcoded' version of the dataset anymore?
Correct
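To make the error concrete, here is a simplified sketch of the name resolution that trips over this, assuming only that Kedro splits dataset names on `@` and treats everything sharing a base name as one logical dataset. This is an illustration of the behaviour described in the thread, not Kedro's actual implementation:

```python
# Simplified sketch (NOT Kedro's real code): once any reference to a base
# name uses the '@' separator, a bare reference to the same base name is
# ambiguous, which is what the ValueError in this thread reports.
TRANSCODING_SEPARATOR = "@"

def split_transcoding(name: str) -> tuple[str, str]:
    """Split 'typed_invoices@focus_revenue' into ('typed_invoices', 'focus_revenue')."""
    base, _sep, suffix = name.partition(TRANSCODING_SEPARATOR)
    return base, suffix

def conflicting_bases(dataset_names: list[str]) -> list[str]:
    """Return base names referenced both with and without a suffix."""
    with_suffix = {
        split_transcoding(n)[0]
        for n in dataset_names
        if TRANSCODING_SEPARATOR in n
    }
    plain = {n for n in dataset_names if TRANSCODING_SEPARATOR not in n}
    return sorted(plain & with_suffix)
```

With the names from this thread, `conflicting_bases(["typed_invoices", "typed_invoices@focus_revenue"])` flags `typed_invoices`, while a catalog where every reference carries a suffix produces no conflict.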
f
I see. I suppose the solution would then be to just rename my non-transcoded dataset. Could you by chance explain the reasoning behind this? I'm thinking that I might be abusing transcoding a little bit. What I'm trying to do:

1. We're using Kedro to run daily pipelines for specific dates (i.e. yesterday).
2. `typed_invoices` is written to, containing all data for that date.
3. `typed_invoices@focus_revenue` contains a subset of all columns (`load_args.columns: [...]`), containing all data for that date.
4. `typed_invoices@with_partition_filters` contains all columns, containing all data for that date.

Since the last two datasets use the same `type`, I'm guessing this isn't quite the use case transcoding was meant for, but I wish to make it explicit that I'm accessing the same data, just differently. Is there a better/recommended way for me to get what I want? Should I just prefer a non-transcoded naming scheme that reflects my usage, e.g. `typed_invoices_output`, `typed_invoices_input_focus_revenue`, etc.?
Ahh, I think what I'm looking for is actually the following: https://docs.kedro.org/en/stable/data/data_catalog.html#load-multiple-datasets-with-similar-configuration It was right above the transcoding docs 🙂
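The linked docs section describes sharing configuration between similar catalog entries via YAML anchors and merge keys. A hypothetical sketch of what that could look like for the datasets above; the `filepath` and the column names are placeholders, not values from the actual project:

```yaml
# YAML-anchor sketch (hypothetical names/paths). Kedro skips top-level
# catalog keys starting with an underscore, so "_filterable" is only a
# template, never instantiated as a dataset.
_filterable: &filterable
  type: prod.datasets.dask.FilterableParquetDataSet
  filepath: data/typed_invoices  # placeholder path

typed_invoices_focus_revenue:
  <<: *filterable
  load_args:
    columns: [invoice_id, revenue]  # placeholder column subset

typed_invoices_with_partition_filters:
  <<: *filterable
```

This removes the duplication between entries, though on its own it gives up the "these are the same data" link that transcoding provides.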
d
Sorry, I fell asleep. But I think you actually are looking for transcoding here? Transcoding will make sure the data is represented as one entity (e.g. in Viz); you just need to make sure you're also transcoding on write. I would need to test to make sure you can have more than two datasets in a transcoding group, but I don't see why not.
f
No problem. Yes, you're correct: since I wanted to combine my pipeline outputs/inputs, transcoding was indeed what I needed. I ended up going with a mix of transcoding and YAML anchors, and I'm quite pleased with the configuration cleanup that resulted. Thanks!
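For readers landing on this thread, a sketch of what the transcoding-plus-anchors mix could look like: every reference to the base name now carries a suffix (including the write side), and the shared bits live in an anchor. The suffix name `avro`, the `filepath`, and the column list are assumptions for illustration, not the actual project config:

```yaml
# Hypothetical catalog mixing transcoding with a YAML anchor. All three
# entries share one base name, so Kedro/Viz treats them as one dataset,
# and no bare "typed_invoices" reference remains to trigger the ValueError.
_invoices: &invoices
  filepath: data/typed_invoices  # placeholder path

typed_invoices@avro:          # the write side, used as a node output
  <<: *invoices
  type: prod.datasets.dask.DaskAvroParquetDataSet

typed_invoices@focus_revenue:
  <<: *invoices
  type: prod.datasets.dask.FilterableParquetDataSet
  load_args:
    columns: [invoice_id, revenue]  # placeholder column subset

typed_invoices@with_partition_filters:
  <<: *invoices
  type: prod.datasets.dask.FilterableParquetDataSet
```

The node that produced `typed_invoices` would then write to `typed_invoices@avro` instead, so that the write is also transcoded.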
d
Great it worked out!
> I'm quite pleased with the configuration cleanup that resulted.
Always nice when that happens. :D