# questions
**Pradeep:**
Hi all, hope everyone is doing well! I have a weird file structure (attached) and would love to hear if anyone has solved it before. I tried the setup attached below, but I am getting this error:

```
DatasetError: No partitions found in '/data/01_raw/nces_ccd/*/Staff/DataFile'
```

Any help would be much appreciated, thanks in advance!
**Ravi:**
Hi @Pradeep Ramayanam, could you tell us which kedro-datasets version you are on?
**Pradeep:**
Hi Ravi, I am using Kedro 0.18.11
**Ravi:**
Kedro and kedro-datasets are versioned separately. What does `pip show kedro-datasets` report? Something like:

```
Name: kedro-datasets
Version: 4.1.0
```
**Pradeep:**

```
Name: kedro-datasets
Version: 2.1.0
```
**Ravi:**
Since kedro-datasets 2.0.0 the spelling changed from `DataSet` to `Dataset` (lowercase `s`). Can you use the new spelling in your dataset entry? See https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0.post1/api/kedro_datasets.pandas.CSVDataset.html and, for partitioned data, https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0.post1/api/kedro_datasets.partitions.PartitionedDataset.html. In the meantime, I will try to test your setup locally. Thank you
**Pradeep:**

```yaml
RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/*/Staff/DataFile/
  dataset:
    type: pandas.CSVDataSet
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

Updated it and still got the same error.
And thanks for the help Ravi, appreciate it!
**Ravi:**
`CSVDataSet` → `CSVDataset` (the nested dataset type still has the old spelling).
**Pradeep:**
Oh, sorry
**Ravi:**
Also remove the trailing slash after `DataFile`.
**Pradeep:**
I tried both:

```yaml
RAW_NCES_CCD_STAFF:
  type: PartitionedDataSet
  path: data/01_raw/nces_ccd/*/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

```yaml
RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/*/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

Both give:

```
DatasetError: cannot import name 'AbstractVersionedDataset' from 'kedro.io.core' (/Users/Pradeep_Ramayanam/anaconda3/envs/kedro-environment/lib/python3.10/site-packages/kedro/io/core.py). Failed to instantiate dataset 'RAW_NCES_CCD_STAFF' of type 'kedro.io.partitioned_dataset.PartitionedDataset'.
```
**Ravi:**
I think kedro-datasets 2.1.0 is not compatible with kedro==0.18.11 (kedro-datasets 2.1.0 requires kedro>=0.19). Either bump Kedro to 0.19 or pin kedro-datasets==1.8.0.

Also, I checked and wildcard paths do not work; `path` needs to be a concrete directory. Using

```
data/01_raw/nces_ccd
```

will show you partitions like `['2020/Staff/DataFile/file_0', '2020/Staff/DataFile/file_1', '2021/Staff/DataFile/file_0', '2021/Staff/DataFile/file_1']`. But if you have CSV files in multiple locations under nces_ccd, you might need further filtering. Let me play with this for some time.
**Pradeep:**
Thanks for the reply, Ravi! But under 2020 there are multiple folders, so if I put the entire `data/01_raw/nces_ccd`, it will load all CSVs from the different folders, not just Staff, which will pull a lot of unnecessary data into memory:
```python
import re

import pandas as pd

# partitioned_RAW_NCES_STAFF: dict of partition_key -> load function,
# as passed in by Kedro's PartitionedDataset
RAW_NCES_STAFF = pd.DataFrame()
for partition_key, partition_load_func in sorted(partitioned_RAW_NCES_STAFF.items()):
    match = re.search(r'20\d{2}-20\d{2}', partition_key)
    if match:
        start_year = int(match.group(0).split('-')[0])
        if start_year >= 2017:
            print(f"File found: {partition_key}")
            partition_data = partition_load_func()  # load the actual partition data
            # concat with the existing result
            RAW_NCES_STAFF = pd.concat([RAW_NCES_STAFF, partition_data], ignore_index=True, sort=True)
```

and then I filter to Staff similarly, but a lot of data will already be in memory.
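One point worth noting here: PartitionedDataset hands the node lazy load functions, so nothing is read from disk until a load function is actually called. A sketch (not from the thread; the function name and `min_start_year` parameter are assumptions) that filters the partition *keys* for `Staff` before invoking any load function, so non-matching files never enter memory:

```python
import re

import pandas as pd


def load_staff_partitions(partitions, min_start_year=2017):
    """Concatenate only the 'Staff' partitions.

    `partitions` maps partition keys to lazy load functions, the shape
    PartitionedDataset injects into a node.
    """
    frames = []
    for key, load_func in sorted(partitions.items()):
        if "Staff" not in key:
            continue  # skip other folders without ever loading them
        match = re.search(r"20\d{2}-20\d{2}", key)
        if match and int(match.group(0).split("-")[0]) >= min_start_year:
            frames.append(load_func())  # only matching files are read
    return pd.concat(frames, ignore_index=True, sort=True) if frames else pd.DataFrame()
```

Because the skip happens on the key alone, the memory cost is limited to the Staff files that pass the year filter.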
**Ravi:**
Yes, I get it... let me get back to you. Thanks for your patience.
**Pradeep:**
Sure, please take your time, just wanted to flag it. Appreciate all the help!
**Ravi:**
Hi Pradeep, based on your dir structure, I could not find a way to pass a glob to PartitionedDataset, as the `path` arg is a plain string. However, one workaround would be to filter the files based on `filename_suffix`. If it is possible to save the CSV files like "(unknown)_stf_dat.csv" (or some suffix of your choice), you should be able to set

```yaml
filename_suffix: __stf_dat_.csv
```

and get what you need while keeping the `path` arg a concrete path. Let me know if this helps. If not, I will ask my team to suggest alternatives here. Thank you
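Put together, a catalog entry for this workaround might look like the following sketch (the suffix `_stf_dat.csv` is a hypothetical choice, and it assumes the Staff files can be saved with that suffix at ingestion time):

```yaml
# Hypothetical: concrete root path; only files ending in the chosen
# suffix are picked up as partitions, wherever they sit under the root.
RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: _stf_dat.csv
```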
**Pradeep:**
Thanks Ravi! Apologies for the late reply. These files are ingested via an API, and preprocessing them would be challenging, especially since we ingest data from 1980 onward with multiple folders inside each year. Could you please ask the team whether they have a different solution? Thank you so much!
**Ravi:**
Hi Pradeep, in your case the partition directory contains sub-directories: `nces_ccd` should be the root, and the partitions are 2015-2016, 2017-2018, etc. If you think the partition directory should go down to `DataFile`, that means you are trying to load multiple partitioned datasets via a wildcard, so you need multiple PartitionedDataset entries in your catalog.yml. You can make use of Kedro dataset factories (introduced in kedro 0.18.12) for this use case: https://docs.kedro.org/en/0.19.14/data/kedro_dataset_factories.html. I have posted this in our internal channel anyway for alternatives. Thank you
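For this directory layout, a dataset factory entry might look like the following sketch (the `{year}` placeholder and the entry name are assumptions for illustration, not from the thread):

```yaml
# Hypothetical factory pattern: one catalog entry covers every year folder.
# Requesting e.g. "2017-2018_staff" in a pipeline resolves {year} to "2017-2018".
"{year}_staff":
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/{year}/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

Each resolved dataset then points at one concrete `Staff/DataFile` directory, which sidesteps both the wildcard limitation and the over-loading concern.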
**Pradeep:**
Thank you Ravi!