# questions
t
hey guys, I'm having some issues with partitions.PartitionedDataset. I manage to create multiple files, but accessing them in a .ipynb to check each partition is where my problem is. I'd like to make sure they are OK so I can open them one by one by iterating over them in the next pipeline. Can someone help me with that?
```yaml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate  # path to the location of partitions
  dataset: pandas.CSVDataset
```
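For context, the node that creates the partitions returns a dict keyed by partition name, roughly like this (simplified sketch, the column name is made up):
```python
import pandas as pd

def split_into_partitions(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # Kedro saves each dict entry as its own CSV under data/02_intermediate,
    # using the key as the file name.
    return {f"part_{key}": group for key, group in df.groupby("some_column")}
```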
h
Someone will reply to you shortly. In the meantime, this might help:
r
Hi @Thiago José Moser Poletto, is it possible for you to share the actual issue? Thank you
t
I mean, when I load it using catalog.load(), I tried to access it like any dict, but it doesn't work. So what would be the correct way to access each partition?
r
Are you facing this issue only in the notebook? Did you try loading the partition in a local dev environment in an IDE?
I hope you already went through the docs; if not, can you have a look at the Python API example mentioned here
t
I did, it's just a bit confusing. I'm trying to iterate over the catalog entry the same way after loading it, but that is not working
No, I'm using Vertex AI Workbench to code, and I do load it to try it out in a Jupyter notebook (.ipynb)
```python
%load_ext kedro.ipython
%reload_kedro ../

catalog.list()
[
    'companies',
    'historical_product_demand',
    'my_partitioned_dataset',
    'reviews',
    'shuttles_excel',
    'shuttles@csv',
    'shuttles@spark',
    'preprocessed_companies',
    'preprocessed_shuttles',
    'preprocessed_reviews',
    'model_input_table@spark',
    'model_input_table@pandas',
    'regressor',
    'metrics',
    'companies_columns',
    'shuttle_passenger_capacity_plot_exp',
    'shuttle_passenger_capacity_plot_go',
    'dummy_confusion_matrix',
    'parameters',
    'params:model_options',
    'params:model_options.test_size',
    'params:model_options.random_state',
    'params:model_options.features'
]

my_partitioned_dataset = catalog.load('my_partitioned_dataset')
```
r
Thanks for the information. The problem might be due to some missing partitions or access permission issues. I will check with my team for more help. Thanks for your patience
a
Once you’ve loaded the partitioned dataset with `catalog.load()`, it’ll be a `dict` mapping each partition name to its corresponding load function. You can iterate over it to load the individual partitions:
```python
my_partitioned_dataset = catalog.load('my_partitioned_dataset')

for partition_name, load_func in my_partitioned_dataset.items():
    # each value is a callable; calling it reads that partition from disk
    data = load_func()
```
t
I did that and it didn't work, but it was due to something that got created and I don't know why it happened: a .gitkeep partition.
```python
'.gitkeep': <bound method CSVDataset._load of kedro_datasets.pandas.csv_dataset.CSVDataset(filepath=PurePosixPath('/home/jupyter/demand-forecast-gcp-kedro/pdi-demand-forecast/data/02_intermediate/.gitkeep'), protocol='file', load_args={}, save_args={'index': False})>,
```
If I skip that entry, it works
a
Oh, it’s reading the .gitkeep file as one of the data partitions as well. You can just delete that file
t
yeah, I just didn't understand how that happened, like, is there any way to avoid it? Because every time that node runs it will do the same. I know that with a simple "if" I can avoid it, but I would like to understand how it was created.
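e.g. skipping it with something like:
```python
my_partitioned_dataset = catalog.load('my_partitioned_dataset')

for partition_name, load_func in my_partitioned_dataset.items():
    # skip the placeholder file that ends up in the partition dict
    if partition_name == '.gitkeep':
        continue
    data = load_func()
```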
a
The Kedro template comes with the `.gitkeep` file in the data folders so they can be uploaded to GitHub, as GitHub doesn’t track empty folders. You can delete these files once the folders actually contain something. I’d also recommend creating a folder within `02_intermediate` for the actual data
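For example, something like this should work (untested, and the subfolder name is just an example). With `filename_suffix` set, anything that doesn’t end in `.csv`, like `.gitkeep`, is ignored:
```yaml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate/my_partitions  # dedicated subfolder for the partitions
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"  # only files ending in .csv are treated as partitions
```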
t
ohhh I see, I thought it was something generated when the node was executed to create the partitions. I got it now, thanks