# questions
e
I was working with GPT-4 to brainstorm how to connect to an Azure Blob Storage container that stores 1-to-many JSON files. The suggestion it provided was not what I expected and I wonder if someone can comment? I want to create a partitioned dataset and the underlying files are JSON. GPT-4 suggested the following, which references a `kedro.contrib.io.azure.JSONBlobDataSet` that I can't find in the documentation under 0.18.12, but can find under 0.15.6. Did something change in the way Kedro organizes contrib.io? GPT-4 also said that the built-in Kedro JSON dataset doesn't work on Azure. Any guidance is appreciated. Thanks kindly,
```yaml
my_partitioned_dataset:
  type: kedro.io.PartitionedDataSet
  path: <your_blob_folder_path>
  credentials: azure_blob_storage
  dataset:
    type: kedro.contrib.io.azure.JSONBlobDataSet  # <- is this valid?
    container_name: <your_container_name>
    credentials: azure_blob_storage
```
d
`JSONBlobDataSet` was removed in Kedro 0.16, along with a lot of storage-specific datasets. I don't know why `pandas.JSONDataSet` shouldn't work; not sure I would trust GPT-4. See https://stackoverflow.com/a/69941391/1093967 for example; `fsspec` should be able to handle Azure blob storage the same way as other storage backends.
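For reference, the modern equivalent on 0.18.x would look something like this (untested sketch; the container, folder, and credentials names are placeholders for your own setup):
```yaml
# conf/base/catalog.yml -- sketch only, replace the placeholders
my_partitioned_dataset:
  type: PartitionedDataSet
  path: abfs://<your_container_name>/<your_blob_folder_path>  # abfs:// routes through fsspec/adlfs
  credentials: azure_blob_storage
  dataset:
    type: pandas.JSONDataSet  # each partition loads as a DataFrame

# conf/local/credentials.yml -- sketch only
azure_blob_storage:
  account_name: <your_account_name>
  account_key: <your_account_key>
```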
e
Thank you for clarifying. One further follow-up: why did you recommend the `pandas.JSONDataSet` instead of the `json.JSONDataSet`? Is it because it automatically returns a dataframe of the data? The JSONs I'm extracting are emails and I wasn't sure if there was a reason to use a dataframe in this context. It's saving me a step if I was going to use a dataframe, is that right?
d
Yeah, I suggested `pandas.JSONDataSet` since that was similar to the behavior of the old `kedro.contrib.io.azure.JSONBlobDataSet`, which also produces a dataframe.
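Concretely, a node receiving the partitioned dataset gets a dict of load callables, and with `pandas.JSONDataSet` each callable returns a `DataFrame` (hypothetical node, just to illustrate the difference):
```python
import pandas as pd

def combine_emails(partitions: dict) -> pd.DataFrame:
    # PartitionedDataSet passes {partition_id: load_callable} into the node.
    # With pandas.JSONDataSet each load() returns a DataFrame; with
    # json.JSONDataSet it would return plain dicts/lists instead.
    frames = [load() for load in partitions.values()]
    return pd.concat(frames, ignore_index=True)
```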
e
@Deepyaman Datta I got it working mostly, thank you. As far as I can tell, I'm writing valid JSON to the files on the blob container. However, when I use a PartitionedDataSet with `pandas.JSONDataSet` and call the callable to load a file, some of the files throw an error. I've verified they are valid JSON, and as far as I can tell there is no discernible difference between files that load successfully and those that don't. Is there a way of loosening how the JSON is loaded so that if there are issues something is returned instead of nothing? I don't know how to solve this issue. The only thing I can think of is that the text may contain emojis or special characters.
```
Error loading cleaned-emails-20230806003837.json: Failed while loading data from data set JSONDataSet(filepath=cleaned-emails/cleaned-emails-20230806003837.json, protocol=abfs).
Expected object or value
```
There is a JSON object in the underlying file... any ideas greatly appreciated!
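For now I'm considering wrapping the callables so a bad file gets logged and skipped instead of killing the run (rough sketch, assuming the standard dict of load callables a PartitionedDataSet passes into a node):
```python
import logging

logger = logging.getLogger(__name__)

def load_partitions_leniently(partitions: dict) -> dict:
    # partitions maps partition_id -> load_callable; keep whatever loads
    # cleanly and log the rest rather than failing the whole node.
    loaded = {}
    for name, load in partitions.items():
        try:
            loaded[name] = load()
        except Exception:
            logger.exception("Could not load partition %s", name)
    return loaded
```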
I'm not sure if I found a bug or a behavior. When I use a PartitionedDataSet + `pandas.JSONDataSet` to load JSON files from a folder, some of the files cause the callable to fail. When I use a PartitionedDataSet + `json.JSONDataSet` to load the same set of JSON files, there are no errors. I don't know how to debug the failed callables, so I'm not sure what the problem is. I think it's related to non-UTF-8 characters in one of the keys, but I'm not sure how to tinker with the callable or how to extract what caused the error. If you want more example files (some that work and some that don't work) let me know.
n
It would be great if you could create a reproducible example and a GitHub issue. Maybe a gist or a demo repository that we can clone and run.
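Even a minimal script that pulls the raw bytes of one failing blob and runs them through both parsers would help pin down whether it's an encoding issue (untested sketch; needs `adlfs` installed, and the account details are placeholders):
```python
import io
import json

import fsspec
import pandas as pd

# Placeholders -- fill in your own storage account details.
fs = fsspec.filesystem("abfs", account_name="<account_name>", account_key="<account_key>")
raw = fs.cat("cleaned-emails/cleaned-emails-20230806003837.json")  # returns bytes

print(json.loads(raw.decode("utf-8")))  # roughly what json.JSONDataSet does
print(pd.read_json(io.BytesIO(raw)))    # roughly what pandas.JSONDataSet does
```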
e
Hi @Nok Lam Chan, sorry, I'm not familiar with GitHub issues. Is it easy to make a gist to show you?
n
Maybe a repository? It would be easier if it's something we can clone and run.