# questions
f
Hi everyone. I'm having an issue with Kedro 0.18.4 and transcoded datasets that I don't quite understand:
```
ValueError: The following datasets are used with transcoding, but were referenced without the separator: typed_invoices
Please specify a transcoding option or rename the datasets.
```
Details within thread.
I have the following datasets:
```yaml
typed_invoices:
  type: prod.datasets.dask.DaskAvroParquetDataSet
  ...

typed_invoices@focus_revenue:
  type: prod.datasets.dask.FilterableParquetDataSet
  ...

typed_invoices@with_partition_filters:
  type: prod.datasets.dask.FilterableParquetDataSet
  ...
```
The datasets are custom ones, but they just override `_load` and `_save`. These datasets are used in the following places:

1. As an output of a node:
```python
node(
    func=repartition,
    inputs=dict(
        data='...',
        partition_size='...',
    ),
    outputs='typed_invoices',
),
```
2. As input to two other nodes:

```python
node(
    func=outlier_detection,
    inputs=dict(data='typed_invoices@focus_revenue'),
    outputs='...',
    name='write_outlier_score_v1',
),

...,

node(
    func=merge,
    inputs=dict(
        left_data='typed_invoices@with_partition_filters',
        right_data='...',
        on='...',
    ),
    outputs='invoices_with_outlier_score',
),
```
When I run my pipeline, I get the following error:
```
ValueError: The following datasets are used with transcoding, but were referenced without the separator: typed_invoices
Please specify a transcoding option or rename the datasets.
```
I'm not sure I understand this error. If I use transcoding, am I not allowed to have a 'non-transcoded' version of the dataset anymore?
d
> If I use transcoding, am I not allowed to have a 'non-transcoded' version of the dataset anymore?
Correct
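To make the error concrete, here is a simplified sketch of the name resolution that trips over this, assuming only that Kedro splits dataset names on `@` and treats everything sharing a base name as one logical dataset. This is an illustration of the behaviour described in the thread, not Kedro's actual implementation:

```python
# Simplified sketch (NOT Kedro's real code): once any reference to a base
# name uses the '@' separator, a bare reference to the same base name is
# ambiguous, which is what the ValueError in this thread reports.
TRANSCODING_SEPARATOR = "@"

def split_transcoding(name: str) -> tuple[str, str]:
    """Split 'typed_invoices@focus_revenue' into ('typed_invoices', 'focus_revenue')."""
    base, _sep, suffix = name.partition(TRANSCODING_SEPARATOR)
    return base, suffix

def conflicting_bases(dataset_names: list[str]) -> list[str]:
    """Return base names referenced both with and without a suffix."""
    with_suffix = {
        split_transcoding(n)[0]
        for n in dataset_names
        if TRANSCODING_SEPARATOR in n
    }
    plain = {n for n in dataset_names if TRANSCODING_SEPARATOR not in n}
    return sorted(plain & with_suffix)
```

With the names from this thread, `conflicting_bases(["typed_invoices", "typed_invoices@focus_revenue"])` flags `typed_invoices`, while a catalog where every reference carries a suffix produces no conflict.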
f
I see. I suppose the solution would then be to just rename my non-transcoded dataset. Could you by chance explain the reasoning behind this? I'm thinking that I might be abusing transcoding a little bit. What I'm trying to do:

1. We're using Kedro to run daily pipelines for specific dates (i.e. yesterday).
2. `typed_invoices` is written to, containing all data for that date.
3. `typed_invoices@focus_revenue` contains a subset of all columns (`load_args.columns: [...]`), containing all data for that date.
4. `typed_invoices@with_partition_filters` contains all columns, containing all data for that date.

Since the last two datasets use the same `type`, I'm guessing this isn't quite the use case transcoding was meant for, but I wish to make it explicit that I'm accessing the same data, just differently. Is there a better/recommended way for me to get what I want? Should I just prefer a non-transcoded naming scheme that reflects my usage, e.g. `typed_invoices_output`, `typed_invoices_input_focus_revenue`, etc.?
Ahh, I think what I'm looking for is actually the following: https://docs.kedro.org/en/stable/data/data_catalog.html#load-multiple-datasets-with-similar-configuration It was right above the transcoding docs 🙂
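The linked docs section describes sharing configuration between similar catalog entries via YAML anchors and merge keys. A hypothetical sketch of what that could look like for the datasets above; the `filepath` and the column names are placeholders, not values from the actual project:

```yaml
# YAML-anchor sketch (hypothetical names/paths). Kedro skips top-level
# catalog keys starting with an underscore, so "_filterable" is only a
# template, never instantiated as a dataset.
_filterable: &filterable
  type: prod.datasets.dask.FilterableParquetDataSet
  filepath: data/typed_invoices  # placeholder path

typed_invoices_focus_revenue:
  <<: *filterable
  load_args:
    columns: [invoice_id, revenue]  # placeholder column subset

typed_invoices_with_partition_filters:
  <<: *filterable
```

This removes the duplication between entries, though on its own it gives up the "these are the same data" link that transcoding provides.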
d
Sorry, I fell asleep. But I think you actually are looking for transcoding here? Transcoding will make sure the data is represented as one entity (e.g. in Viz); you just need to make sure you're also transcoding on write. I would need to test to make sure you can have more than two datasets in a transcoding group, but I don't see why not.
f
No problem. Yes, you're correct: since I wanted to combine my pipeline outputs/inputs, transcoding was indeed what I needed. I ended up going with a mix of transcoding and YAML anchors, and I'm quite pleased with the configuration cleanup that resulted. Thanks!
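For readers landing on this thread, a sketch of what the transcoding-plus-anchors mix could look like: every reference to the base name now carries a suffix (including the write side), and the shared bits live in an anchor. The suffix name `avro`, the `filepath`, and the column list are assumptions for illustration, not the actual project config:

```yaml
# Hypothetical catalog mixing transcoding with a YAML anchor. All three
# entries share one base name, so Kedro/Viz treats them as one dataset,
# and no bare "typed_invoices" reference remains to trigger the ValueError.
_invoices: &invoices
  filepath: data/typed_invoices  # placeholder path

typed_invoices@avro:          # the write side, used as a node output
  <<: *invoices
  type: prod.datasets.dask.DaskAvroParquetDataSet

typed_invoices@focus_revenue:
  <<: *invoices
  type: prod.datasets.dask.FilterableParquetDataSet
  load_args:
    columns: [invoice_id, revenue]  # placeholder column subset

typed_invoices@with_partition_filters:
  <<: *invoices
  type: prod.datasets.dask.FilterableParquetDataSet
```

The node that produced `typed_invoices` would then write to `typed_invoices@avro` instead, so that the write is also transcoded.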
d
Great it worked out!
> I'm quite pleased with the configuration cleanup that resulted.
Always nice when that happens. :D