Filip Panovski
03/29/2023, 8:46 AM
ValueError: The following datasets are used with transcoding, but were referenced without the separator: typed_invoices
Please specify a transcoding option or rename the datasets.
Details within thread.
typed_invoices:
  type: prod.datasets.dask.DaskAvroParquetDataSet
  ...
typed_invoices@focus_revenue:
  type: prod.datasets.dask.FilterableParquetDataSet
  ...
typed_invoices@with_partition_filters:
  type: prod.datasets.dask.FilterableParquetDataSet
  ...
The datasets are custom ones, but they just override _load and _save.
These datasets are used in the following places:
1. As an output of a node:
node(
    func=repartition,
    inputs=dict(
        data='...',
        partition_size='...',
    ),
    outputs='typed_invoices',
),
And as input to two other nodes:
node(
    func=outlier_detection,
    inputs=dict(data='typed_invoices@focus_revenue'),
    outputs='...',
    name='write_outlier_score_v1',
),
...,
node(
    func=merge,
    inputs=dict(
        left_data='typed_invoices@with_partition_filters',
        right_data='...',
        on='...',
    ),
    outputs='invoices_with_outlier_score',
),
When I run my pipeline, I get the following error:
ValueError: The following datasets are used with transcoding, but were referenced without the separator: typed_invoices
Please specify a transcoding option or rename the datasets.
I'm not sure I understand this error. If I use transcoding, am I not allowed to have a 'non-transcoded' version of the dataset anymore?
Deepyaman Datta
03/29/2023, 9:02 AM
> If I use transcoding, am I not allowed to have a 'non-transcoded' version of the dataset anymore?
Correct
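To make the rule behind that answer concrete, here is a minimal sketch of the kind of check that produces this error. It is an illustrative re-implementation in plain Python, not Kedro's actual code: the helper names strip_transcoding and find_mixed_references are hypothetical, and the only Kedro fact it assumes is that "@" is the transcoding separator. The "@output" suffix in the second list is also a made-up example of giving the writer its own transcoding option.

```python
SEPARATOR = "@"  # the transcoding separator used in dataset names


def strip_transcoding(name: str) -> str:
    """Return the base dataset name, e.g. 'a@b' -> 'a' (simplified)."""
    return name.split(SEPARATOR)[0]


def find_mixed_references(names):
    """Return base names that are referenced both with and without '@'."""
    names = set(names)
    transcoded_bases = {strip_transcoding(n) for n in names if SEPARATOR in n}
    # A base name is a problem only if it also appears as a bare reference.
    return sorted(transcoded_bases & names)


# Dataset references from the pipeline in this thread.
original = [
    "typed_invoices",  # output of the repartition node
    "typed_invoices@focus_revenue",
    "typed_invoices@with_partition_filters",
]

# Hypothetical fix: the writer gets a transcoding suffix too.
all_transcoded = [
    "typed_invoices@output",  # assumed name, not from the thread
    "typed_invoices@focus_revenue",
    "typed_invoices@with_partition_filters",
]

print(find_mixed_references(original))        # ['typed_invoices']
print(find_mixed_references(all_transcoded))  # []
```

With this kind of fix, the catalog entry for the writer would presumably need the same suffixed name, so that every reference to the base name carries a separator.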
Filip Panovski
03/29/2023, 9:15 AM
yesterday)
2. typed_invoices is written to, containing all data for that date
3. typed_invoices@focus_revenue contains a subset of all columns (load_args.columns: [...]), containing all data for that date
4. typed_invoices@with_partition_filters contains all columns, containing all data for that date
Since the last 2 datasets are using the same type, I'm guessing this isn't quite the use case transcoding was meant for, but I wish to make it explicit that I'm accessing the same data, just differently.
Is there a better/recommended way for me to get what I want? Should I just prefer a non-transcoded naming schema that reflects my usage, e.g. typed_invoices_output, typed_invoices_input_focus_revenue, etc?
Deepyaman Datta
03/29/2023, 12:53 PM
Filip Panovski
03/31/2023, 10:19 AM
Deepyaman Datta
03/31/2023, 12:11 PM
> I'm quite pleased with the configuration cleanup that resulted in.
Always nice when that happens. :D