# questions
e
I was working with GPT-4 to brainstorm how to connect to an Azure Blob Storage container that stores 1-to-many JSON files. The suggestion it provided was not what I expected and I wonder if someone can comment? I want to create a partitioned dataset and the underlying files are JSON. GPT-4 suggested the following, which references a `kedro.contrib.io.azure.JSONBlobDataSet` that I can't find in the documentation under 0.18.12, but can find under 0.15.6. Did something change in the way Kedro organizes contrib.io? GPT-4 also said that the built-in Kedro JSON dataset doesn't work on Azure. Any guidance is appreciated. Thanks kindly,
```yaml
my_partitioned_dataset:
  type: kedro.io.PartitionedDataSet
  path: <your_blob_folder_path>
  credentials: azure_blob_storage
  dataset:
    type: kedro.contrib.io.azure.JSONBlobDataSet  # <- is this valid?
    container_name: <your_container_name>
    credentials: azure_blob_storage
```
d
`JSONBlobDataSet` was removed in Kedro 0.16, along with a lot of storage-specific datasets. I don't know why `pandas.JSONDataSet` shouldn't work; not sure I would trust GPT-4. See https://stackoverflow.com/a/69941391/1093967 for example; `fsspec` should be able to handle Azure blob storage the same way as other storage backends.
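For reference, the modern equivalent on 0.18.x would look something like this (untested sketch; the container, folder, and credentials names are placeholders for your own setup):
```yaml
# conf/base/catalog.yml -- sketch only, replace the placeholders
my_partitioned_dataset:
  type: PartitionedDataSet
  path: abfs://<your_container_name>/<your_blob_folder_path>  # abfs:// routes through fsspec/adlfs
  credentials: azure_blob_storage
  dataset:
    type: pandas.JSONDataSet  # each partition loads as a DataFrame

# conf/local/credentials.yml -- sketch only
azure_blob_storage:
  account_name: <your_account_name>
  account_key: <your_account_key>
```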
e
Thank you for clarifying. One further follow-up: why did you recommend the `pandas.JSONDataSet` instead of the `json.JSONDataSet`? Is it because it automatically returns a dataframe of the data? The JSONs I'm extracting are emails and I wasn't sure if there was a reason to use a dataframe in this context. It's saving me a step if I was going to use a dataframe, is that right?
d
Yeah, I suggested `pandas.JSONDataSet` since that was similar to the behavior of the old `kedro.contrib.io.azure.JSONBlobDataSet`, which also produces a dataframe.
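Concretely, a node receiving the partitioned dataset gets a dict of load callables, and with `pandas.JSONDataSet` each callable returns a `DataFrame` (hypothetical node, just to illustrate the difference):
```python
import pandas as pd

def combine_emails(partitions: dict) -> pd.DataFrame:
    # PartitionedDataSet passes {partition_id: load_callable} into the node.
    # With pandas.JSONDataSet each load() returns a DataFrame; with
    # json.JSONDataSet it would return plain dicts/lists instead.
    frames = [load() for load in partitions.values()]
    return pd.concat(frames, ignore_index=True)
```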
e
@Deepyaman Datta I got it working mostly, thank you. As far as I can tell, I'm writing valid JSON to the files on the blob container. However, when I use a PartitionedDataSet with `pandas.JSONDataSet` and call the callable to load a file, some of the files throw an error. I've verified they are valid JSON, and as far as I can tell there is no discernible difference between files that load successfully and those that don't. Is there a way of loosening how the JSON is loaded so that if there are issues something is returned instead of nothing? I don't know how to solve this issue. The only thing I can think of is that the text may contain emojis or special characters.
```
Error loading cleaned-emails-20230806003837.json: Failed while loading data from data set JSONDataSet(filepath=cleaned-emails/cleaned-emails-20230806003837.json, protocol=abfs).
Expected object or value
```
There is a JSON object in the underlying file... any ideas greatly appreciated!
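For now I'm considering wrapping the callables so a bad file gets logged and skipped instead of killing the run (rough sketch, assuming the standard dict of load callables a PartitionedDataSet passes into a node):
```python
import logging

logger = logging.getLogger(__name__)

def load_partitions_leniently(partitions: dict) -> dict:
    # partitions maps partition_id -> load_callable; keep whatever loads
    # cleanly and log the rest rather than failing the whole node.
    loaded = {}
    for name, load in partitions.items():
        try:
            loaded[name] = load()
        except Exception:
            logger.exception("Could not load partition %s", name)
    return loaded
```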
I'm not sure if I found a bug or a behavior. When I use a PartitionedDataSet + `pandas.JSONDataSet` to load JSON files from a folder, some of the files cause the callable to fail. When I use a PartitionedDataSet + `json.JSONDataSet` to load the same set of JSON files, there are no errors. I don't know how to debug the failed callables, so I'm not sure what the problem is. I think it's related to non-UTF-8 characters in one of the keys, but I'm not sure how to tinker with the callable or how to extract what caused the error. If you want more example files (some that work and some that don't work) let me know.
n
It would be great if you could create a reproducible example and a GitHub issue. Maybe a gist or a demo repository that we can clone and run.
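Even a minimal script that pulls the raw bytes of one failing blob and runs them through both parsers would help pin down whether it's an encoding issue (untested sketch; needs `adlfs` installed, and the account details are placeholders):
```python
import io
import json

import fsspec
import pandas as pd

# Placeholders -- fill in your own storage account details.
fs = fsspec.filesystem("abfs", account_name="<account_name>", account_key="<account_key>")
raw = fs.cat("cleaned-emails/cleaned-emails-20230806003837.json")  # returns bytes

print(json.loads(raw.decode("utf-8")))  # roughly what json.JSONDataSet does
print(pd.read_json(io.BytesIO(raw)))    # roughly what pandas.JSONDataSet does
```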
e
Hi @Nok Lam Chan, sorry, I'm not familiar with GitHub issues. Is it easy to make a gist to show you?
n
Maybe a repository? It would be easier if it's something we can clone and run.