Pradeep Ramayanam
06/27/2025, 5:34 PM
Ravi Kumar Pilla
06/27/2025, 5:39 PM
Pradeep Ramayanam
06/27/2025, 5:42 PM
Ravi Kumar Pilla
06/27/2025, 5:45 PM
Name: kedro-datasets
Version: 4.1.0
?
Pradeep Ramayanam
06/27/2025, 5:46 PM
Ravi Kumar Pilla
06/27/2025, 5:55 PM
s instead of S in Dataset, like https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0.post1/api/kedro_datasets.pandas.CSVDataset.html
and for partitioned, https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0.post1/api/kedro_datasets.partitions.PartitionedDataset.html
The spelling changed in kedro-datasets 2.0.0. In the meantime, I will try to test your setup locally. Thank you
Pradeep Ramayanam
06/27/2025, 6:01 PM
RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/*/Staff/DataFile/
  dataset:
    type: pandas.CSVDataSet
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
updated it and still got the same error
Pradeep Ramayanam
06/27/2025, 6:01 PM
Ravi Kumar Pilla
06/27/2025, 6:01 PM
Pradeep Ramayanam
06/27/2025, 6:01 PM
Ravi Kumar Pilla
06/27/2025, 6:01 PM
Pradeep Ramayanam
06/27/2025, 6:04 PM
RAW_NCES_CCD_STAFF:
  type: PartitionedDataSet
  path: data/01_raw/nces_ccd/*/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv

RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/*/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv

both giving
DatasetError:
cannot import name 'AbstractVersionedDataset' from 'kedro.io.core' (/Users/Pradeep_Ramayanam/anaconda3/envs/kedro-environment/lib/python3.10/site-packages/kedro/io/core.py).
Failed to instantiate dataset 'RAW_NCES_CCD_STAFF' of type 'kedro.io.partitioned_dataset.PartitionedDataset'.
Ravi Kumar Pilla
06/27/2025, 6:26 PM
"data/01_raw/nces_ccd" will show you partitions like this. Partitions found: ['2020/Staff/DataFile/file_0', '2020/Staff/DataFile/file_1', '2021/Staff/DataFile/file_0', '2021/Staff/DataFile/file_1'].
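A rough sketch of why the root path yields partition IDs of that shape. It assumes that PartitionedDataset lists files under path recursively and names each partition by its path relative to the root, with filename_suffix stripped. It uses only pathlib on a throwaway directory, not Kedro itself, and list_partition_ids is a hypothetical helper written for this illustration.

```python
import tempfile
from pathlib import Path


def list_partition_ids(root: Path, filename_suffix: str = ".csv") -> list[str]:
    """Approximate partition IDs: file paths relative to root, suffix removed.

    Hypothetical helper; an assumption about how the IDs are formed,
    not Kedro's actual implementation.
    """
    ids = []
    for file_path in root.rglob("*"):
        if not file_path.is_file() or not file_path.name.endswith(filename_suffix):
            continue
        relative = file_path.relative_to(root).as_posix()
        if filename_suffix:
            relative = relative[: -len(filename_suffix)]
        ids.append(relative)
    return sorted(ids)


# Build a temporary tree that mirrors the layout discussed in this thread.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data" / "01_raw" / "nces_ccd"
    for year in ("2020", "2021"):
        data_dir = root / year / "Staff" / "DataFile"
        data_dir.mkdir(parents=True)
        for i in range(2):
            (data_dir / f"file_{i}.csv").write_text("a,b\n1,2\n")

    partition_ids = list_partition_ids(root)
    print("Partitions found:", partition_ids)
```

With the root set to nces_ccd, the year and the Staff/DataFile folders end up inside each partition ID, so any selection by year or folder has to happen on the keys.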
But if you have CSV files in multiple locations under nces_ccd, you might need further filtering. Let me play with this for some time
Pradeep Ramayanam
06/27/2025, 6:35 PM
RAW_NCES_STAFF = pd.DataFrame()
for partition_key, partition_load_func in sorted(partitioned_RAW_NCES_STAFF.items()):
    match = re.search(r'20\d{2}-20\d{2}', partition_key)
    if match:
        start_year = int(match.group(0).split('-')[0])
        if start_year >= 2017:
            print(f"File found: {partition_key}")
            partition_data = partition_load_func()  # load the actual partition data
            # concat with existing result
            RAW_NCES_STAFF = pd.concat([RAW_NCES_STAFF, partition_data], ignore_index=True, sort=True)
and then I filter to Staff similar to above, but a lot of data will already be in memory
Ravi Kumar Pilla
06/27/2025, 6:39 PM
Pradeep Ramayanam
06/27/2025, 6:39 PM
Ravi Kumar Pilla
06/27/2025, 7:12 PM
filename_suffix: __stf_dat_.csv
and get what you need by making the path arg a concrete path. Let me know if this helps. If not, I will let my team suggest an alternative here. Thank you
Pradeep Ramayanam
06/27/2025, 8:24 PM
Ravi Kumar Pilla
06/27/2025, 9:40 PM
nces_ccd should be the root, and the partitions are 2015-2016, 2017-2018 etc. If you think the partition directory should extend down to DataFile, that means you are trying to load multiple partitioned datasets using a wildcard character, so you need multiple PartitionedDataset entries for it in your catalog.yml. You can make use of Kedro dataset factories - https://docs.kedro.org/en/0.19.14/data/kedro_dataset_factories.html (introduced in kedro 0.18.12) for this use case. I have posted this in our internal channel anyway for alternatives. Thank you
Pradeep Ramayanam
06/30/2025, 3:35 PM