# questions
t
hey guys, I'm having some issues with partitions.PartitionedDataset. I manage to create multiple files, but accessing them in a .ipynb to check each partition is where my problem is. I'd like to make sure they are OK so I can open them one by one by iterating over them in the next pipeline. Can someone help me with that?
```yaml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate  # path to the location of partitions
  dataset: pandas.CSVDataset
```
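For context, the node that creates the partitions returns a dict keyed by partition name, roughly like this (simplified sketch, the column name is made up):
```python
import pandas as pd

def split_into_partitions(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # Kedro saves each dict entry as its own CSV under data/02_intermediate,
    # using the key as the file name.
    return {f"part_{key}": group for key, group in df.groupby("some_column")}
```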
h
Someone will reply to you shortly. In the meantime, this might help:
r
Hi @Thiago José Moser Poletto, is it possible for you to share the actual issue? Thank you
t
I mean, when I load it using catalog.load(), I tried to access it like any dict, but it doesn't work. So what would be the correct way to access each partition?
r
Are you facing this issue only in the notebook? Did you try loading the partition in a local dev environment in an IDE?
I hope you already went through the docs; if not, can you have a look at the Python API example mentioned here
t
I did, it's just a bit confusing. I'm trying to iterate over the catalog entry the same way after loading it, but that is not working
No, I'm using Vertex AI Workbench to code, and I do load it to try it out in a Jupyter notebook (.ipynb)
```python
%load_ext kedro.ipython
%reload_kedro ../

catalog.list()
[
    'companies',
    'historical_product_demand',
    'my_partitioned_dataset',
    'reviews',
    'shuttles_excel',
    'shuttles@csv',
    'shuttles@spark',
    'preprocessed_companies',
    'preprocessed_shuttles',
    'preprocessed_reviews',
    'model_input_table@spark',
    'model_input_table@pandas',
    'regressor',
    'metrics',
    'companies_columns',
    'shuttle_passenger_capacity_plot_exp',
    'shuttle_passenger_capacity_plot_go',
    'dummy_confusion_matrix',
    'parameters',
    'params:model_options',
    'params:model_options.test_size',
    'params:model_options.random_state',
    'params:model_options.features'
]

my_partitioned_dataset = catalog.load('my_partitioned_dataset')
```
r
Thanks for the information. The problem might be due to some missing partitions or access permission issues. I will check with my team for more help. Thanks for your patience
a
Once you’ve loaded the partitioned dataset with `catalog.load()`, it’ll be a `dict` mapping each partition name to its corresponding load function. You can iterate over it to load the individual partitions:
```python
my_partitioned_dataset = catalog.load('my_partitioned_dataset')

for partition_name, load_func in my_partitioned_dataset.items():
    # each value is a callable; calling it reads that partition from disk
    data = load_func()
```
t
I did that and it didn't work, but it was due to something that got created and I don't know why it happened: a .gitkeep partition.
```python
'.gitkeep': <bound method CSVDataset._load of kedro_datasets.pandas.csv_dataset.CSVDataset(filepath=PurePosixPath('/home/jupyter/demand-forecast-gcp-kedro/pdi-demand-forecast/data/02_intermediate/.gitkeep'), protocol='file', load_args={}, save_args={'index': False})>,
```
If I skip that entry, it works
a
Oh, it’s reading the .gitkeep file as one of the data partitions as well. You can just delete that file
t
yeah, I just didn't understand how that happened, like, is there any way to avoid it? Because every time that node runs it will do the same. I know that with a simple "if" I can avoid it, but I would like to understand how it was created.
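e.g. skipping it with something like:
```python
my_partitioned_dataset = catalog.load('my_partitioned_dataset')

for partition_name, load_func in my_partitioned_dataset.items():
    # skip the placeholder file that ends up in the partition dict
    if partition_name == '.gitkeep':
        continue
    data = load_func()
```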
a
The Kedro template comes with the `.gitkeep` file in the data folders so they can be uploaded to GitHub, as GitHub doesn’t track empty folders. You can delete these files once the folders actually contain something. I’d also recommend creating a folder within `02_intermediate` for the actual data
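For example, something like this should work (untested, and the subfolder name is just an example). With `filename_suffix` set, anything that doesn’t end in `.csv`, like `.gitkeep`, is ignored:
```yaml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate/my_partitions  # dedicated subfolder for the partitions
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"  # only files ending in .csv are treated as partitions
```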
t
ohhh I see, I thought it was something generated when the node was executed to create the partitions. I got it now, thanks