# questions
**Pradeep:**
Hi all, hope everyone is doing well! I have a weird file structure (attached) and would love to hear if anyone has solved it before. I tried the setup attached below, but I am getting this error:

```
DatasetError: No partitions found in '/data/01_raw/nces_ccd/*/Staff/DataFile'
```

Any help would be much appreciated, thanks in advance!
**Ravi:**
Hi @Pradeep Ramayanam, could you tell us which kedro-datasets version you are on?
**Pradeep:**
Hi Ravi, I am using Kedro 0.18.11
**Ravi:**
Kedro and kedro-datasets are versioned separately. What does `pip show kedro-datasets` report? Something like:

```
Name: kedro-datasets
Version: 4.1.0
```
**Pradeep:**

```
Name: kedro-datasets
Version: 2.1.0
```
**Ravi:**
Since kedro-datasets 2.0.0 the spelling changed from `DataSet` to `Dataset` (lowercase `s`). Can you use the new spelling in your dataset entry? See https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0.post1/api/kedro_datasets.pandas.CSVDataset.html and, for partitioned data, https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0.post1/api/kedro_datasets.partitions.PartitionedDataset.html. In the meantime, I will try to test your setup locally. Thank you
**Pradeep:**

```yaml
RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/*/Staff/DataFile/
  dataset:
    type: pandas.CSVDataSet
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

Updated it and still got the same error.
And thanks for the help Ravi, appreciate it!
**Ravi:**
`CSVDataSet` → `CSVDataset` (the nested dataset type still has the old spelling).
**Pradeep:**
Oh, sorry
**Ravi:**
Also remove the trailing slash after `DataFile`.
**Pradeep:**
I tried both:

```yaml
RAW_NCES_CCD_STAFF:
  type: PartitionedDataSet
  path: data/01_raw/nces_ccd/*/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

```yaml
RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/*/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

Both give:

```
DatasetError: cannot import name 'AbstractVersionedDataset' from 'kedro.io.core' (/Users/Pradeep_Ramayanam/anaconda3/envs/kedro-environment/lib/python3.10/site-packages/kedro/io/core.py). Failed to instantiate dataset 'RAW_NCES_CCD_STAFF' of type 'kedro.io.partitioned_dataset.PartitionedDataset'.
```
**Ravi:**
I think kedro-datasets 2.1.0 is not compatible with kedro==0.18.11 (kedro-datasets 2.1.0 requires kedro>=0.19). Either bump Kedro to 0.19 or pin kedro-datasets==1.8.0.

Also, I checked and wildcard paths do not work; `path` needs to be a concrete directory. Using

```
data/01_raw/nces_ccd
```

will show you partitions like `['2020/Staff/DataFile/file_0', '2020/Staff/DataFile/file_1', '2021/Staff/DataFile/file_0', '2021/Staff/DataFile/file_1']`. But if you have CSV files in multiple locations under nces_ccd, you might need further filtering. Let me play with this for some time.
**Pradeep:**
Thanks for the reply, Ravi! But under 2020 there are multiple folders, so if I put the entire `data/01_raw/nces_ccd`, it will load all CSVs from the different folders, not just Staff, which will pull a lot of unnecessary data into memory:
```python
import re

import pandas as pd

# partitioned_RAW_NCES_STAFF: dict of partition_key -> load function,
# as passed in by Kedro's PartitionedDataset
RAW_NCES_STAFF = pd.DataFrame()
for partition_key, partition_load_func in sorted(partitioned_RAW_NCES_STAFF.items()):
    match = re.search(r'20\d{2}-20\d{2}', partition_key)
    if match:
        start_year = int(match.group(0).split('-')[0])
        if start_year >= 2017:
            print(f"File found: {partition_key}")
            partition_data = partition_load_func()  # load the actual partition data
            # concat with the existing result
            RAW_NCES_STAFF = pd.concat([RAW_NCES_STAFF, partition_data], ignore_index=True, sort=True)
```

and then I filter to Staff similarly, but a lot of data will already be in memory.
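One point worth noting here: PartitionedDataset hands the node lazy load functions, so nothing is read from disk until a load function is actually called. A sketch (not from the thread; the function name and `min_start_year` parameter are assumptions) that filters the partition *keys* for `Staff` before invoking any load function, so non-matching files never enter memory:

```python
import re

import pandas as pd


def load_staff_partitions(partitions, min_start_year=2017):
    """Concatenate only the 'Staff' partitions.

    `partitions` maps partition keys to lazy load functions, the shape
    PartitionedDataset injects into a node.
    """
    frames = []
    for key, load_func in sorted(partitions.items()):
        if "Staff" not in key:
            continue  # skip other folders without ever loading them
        match = re.search(r"20\d{2}-20\d{2}", key)
        if match and int(match.group(0).split("-")[0]) >= min_start_year:
            frames.append(load_func())  # only matching files are read
    return pd.concat(frames, ignore_index=True, sort=True) if frames else pd.DataFrame()
```

Because the skip happens on the key alone, the memory cost is limited to the Staff files that pass the year filter.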
**Ravi:**
Yes, I get it... let me get back to you. Thanks for your patience.
**Pradeep:**
Sure, please take your time, just wanted to flag it. Appreciate all the help!
**Ravi:**
Hi Pradeep, based on your dir structure, I could not find a way to pass a glob to PartitionedDataset, as the `path` arg is a plain string. However, one workaround would be to filter the files based on `filename_suffix`. If it is possible to save the CSV files like "(unknown)_stf_dat.csv" (or some suffix of your choice), you should be able to set

```yaml
filename_suffix: __stf_dat_.csv
```

and get what you need while keeping the `path` arg a concrete path. Let me know if this helps. If not, I will ask my team to suggest alternatives here. Thank you
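Put together, a catalog entry for this workaround might look like the following sketch (the suffix `_stf_dat.csv` is a hypothetical choice, and it assumes the Staff files can be saved with that suffix at ingestion time):

```yaml
# Hypothetical: concrete root path; only files ending in the chosen
# suffix are picked up as partitions, wherever they sit under the root.
RAW_NCES_CCD_STAFF:
  type: PartitionedDataset
  path: data/01_raw/nces_ccd
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: _stf_dat.csv
```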
**Pradeep:**
Thanks Ravi! Apologies for the late reply. These files are ingested via an API, and preprocessing them would be challenging, especially since we ingest data from 1980 onward with multiple folders inside each year. Could you please ask the team whether they have a different solution? Thank you so much!
**Ravi:**
Hi Pradeep, in your case the partition directory contains sub-directories: `nces_ccd` should be the root, and the partitions are 2015-2016, 2017-2018, etc. If you think the partition directory should go down to `DataFile`, that means you are trying to load multiple partitioned datasets via a wildcard, so you need multiple PartitionedDataset entries in your catalog.yml. You can make use of Kedro dataset factories (introduced in kedro 0.18.12) for this use case: https://docs.kedro.org/en/0.19.14/data/kedro_dataset_factories.html. I have posted this in our internal channel anyway for alternatives. Thank you
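For this directory layout, a dataset factory entry might look like the following sketch (the `{year}` placeholder and the entry name are assumptions for illustration, not from the thread):

```yaml
# Hypothetical factory pattern: one catalog entry covers every year folder.
# Requesting e.g. "2017-2018_staff" in a pipeline resolves {year} to "2017-2018".
"{year}_staff":
  type: PartitionedDataset
  path: data/01_raw/nces_ccd/{year}/Staff/DataFile
  dataset:
    type: pandas.CSVDataset
    load_args:
      encoding: 'cp1252'
  filename_suffix: .csv
```

Each resolved dataset then points at one concrete `Staff/DataFile` directory, which sidesteps both the wildcard limitation and the over-loading concern.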
**Pradeep:**
Thank you Ravi!