# questions
r
Hello team, we're facing a huge blocker: saving a pandas.ParquetDataset to /dbfs/mnt/ mount points fails on Databricks, whereas the same save works on /dbfs/. We have a pandas dataset and the team is blocked while saving it to /dbfs/mnt/<storage-container>. However, it works on plain "/dbfs/", so it looks like an issue between pandas DataFrames and /dbfs/mnt/<storage-container>. Error from Kedro:
```
Failed while saving data to data set ParquetDataset(filepath=/dbfs/mnt/container, load_args={}, protocol=file, save_args={}).
[Errno 5] Input/output error: '/dbfs/mnt/container'
```
We're also open to any alternative solutions. Please treat this with high priority, as our entire pipeline run depends on it.
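for reference, this is roughly what we're doing, reduced to the Python API (a sketch with dummy data; the dataset class name is spelled as it appears in this thread, though some kedro-datasets releases spell it ParquetDataSet):
```python
import pandas as pd
from kedro_datasets.pandas import ParquetDataset  # spelled ParquetDataSet in some releases

df = pd.DataFrame({"a": [1, 2, 3]})  # dummy data for illustration

# Saving to the mounted container raises the [Errno 5] above:
ParquetDataset(filepath="/dbfs/mnt/container/test.pq").save(df)

# Saving to plain /dbfs works:
ParquetDataset(filepath="/dbfs/test.pq").save(df)
```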
j
hello @Raghunath Nair, sorry you're having a rough experience. can you write other files to `/dbfs/mnt/container`? or does it only fail with Kedro?
also, please tell us what Python, Kedro, and `kedro-datasets` versions you're using
r
@Juan Luis yes, `spark.SparkDataset` works fine; with Spark we are able to save. With `pandas.ParquetDataset` we cannot save, and get:
```
Error from kedro: Failed while saving data to data set ParquetDataset(filepath=/dbfs/mnt/container, load_args={}, protocol=file, save_args={}).
[Errno 5] Input/output error: '/dbfs/mnt/container'
```
using Kedro `0.18.14`, Python `3.10.11`, on Databricks Runtime `13.3 LTS ML` (includes Apache Spark 3.4.1, Scala 2.12)
a
I remember @Shubham Agrawal was facing a similar issue, so tagging him in case he has a solution already. I'll investigate in the meantime.
r
thanks @Ankita Katiyar, much appreciated, please let me know!
s
I still have the issue. What helped me was just saving to a different location on DBFS. check if that works for you? Is the <container> an Azure data storage account?
r
@Shubham Agrawal yes, it is an Azure storage account. As I mentioned in my issue, saving to /dbfs works but does not work on /dbfs/mnt/<storage>.
s
interesting.. it was the same for me too
r
@Shubham Agrawal to add to the above: DBFS itself is not meant for storing data, so we want to push into a mounted storage account, and for security reasons we can't use the storage account without mounting.
s
however, there were a few containers mounted on Databricks by my company's IT team, and I was able to save the pandas dataset there. My hypothesis is it's something to do with how the container is mounted, maybe? worth a try
r
@Shubham Agrawal `spark.SparkDataset` works fine, but `pandas.ParquetDataset` doesn't work on any of them, so I suspect the issue is with Databricks' support for pandas and mounts
s
Yep, I had the same issue. I tried different versions of pandas, Python, and Databricks clusters, but this is what eventually worked for me
r
@Shubham Agrawal you mean switching from `dbfs/mnt` to `dbfs`, right?
j
@Raghunath Nair another thing you can try is launching a script that does
```python
import pandas as pd

df = pd.DataFrame(...)  # fill it with some data

df.to_parquet("/dbfs/mnt/container/test.pq")
```
if it fails, it will hopefully tell you more information (and also rule out that it's a Kedro problem). if it works, we'd need to keep investigating.
n
Worth noting that Spark has native integration on Databricks, so the authentication mechanism and how it interacts with the filesystem/DBFS could be different from other libraries
r
```python
import pandas as pd
df = pd.DataFrame(...)  # fill it with some data
df.to_parquet("/dbfs/mnt/container/test.pq")
```
fails with `[Errno 5] Input/output error: '/dbfs/mnt/teamdata/test.pq'`
@Nok Lam Chan @Juan Luis it's only an issue with pandas
j
thanks @Raghunath Nair. that's an indication that this is a pandas problem, not a Kedro problem. there's nothing we can do about it - my recommendation would be to find a workaround. given that you do want to use that location, are you able to use Spark to write Parquet? or even Polars?
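for instance, something along these lines (a rough sketch, using the `spark` session that Databricks notebooks provide and a made-up frame; path as in your example):
```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # made-up data for the sketch

# Route the pandas frame through Spark, which writes via its native
# Databricks cloud-storage integration instead of the /dbfs FUSE mount:
spark.createDataFrame(df).write.mode("overwrite").parquet("dbfs:/mnt/container/test_pq")
```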
n
Before we can conclude this is a pandas problem, can you try with something that is not pandas? maybe just try to save a file with the built-in `open`. Right now you're saying it's not working with pandas; I tend to think it's the opposite, i.e. it's only working with Spark and nothing else. https://forums.linuxmint.com/viewtopic.php?t=396045 The error itself suggests it's a common problem with mounted drives.
I also did a quick search on the pandas GitHub page and nothing showed up, so I don't think it's a pandas-specific issue; otherwise it would be more common
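for example, something as simple as this (same hypothetical mount path as above):
```python
# If plain open() also raises [Errno 5], the problem is the mount
# itself, not pandas or any particular library.
with open("/dbfs/mnt/container/test.txt", "w") as f:
    f.write("hello")
```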
r
@Nok Lam Chan `open` also gave the same error. and we're using Kedro's `pandas.ParquetDataset`. FYI @Juan Luis, `spark.SparkDataset` works perfectly fine on mounts, and every pandas save fails with the same error:
```
Error from kedro: Failed while saving data to data set ParquetDataset(filepath=/dbfs/mnt/container, load_args={}, protocol=file, save_args={}).
[Errno 5] Input/output error: '/dbfs/mnt/container'
```
and you can see we're getting the error directly from Kedro's `ParquetDataset` itself. I hope it's failing due to the same cause?
j
if
```python
import pandas as pd
df = pd.DataFrame(...)  # fill it with some data
df.to_parquet("/dbfs/mnt/container/test.pq")
```
fails and `spark.SparkDataset` works, then again it's not a problem of `kedro_datasets.pandas.ParquetDataset`, but `pandas` itself. am I missing something?
r
@Nok Lam Chan mentioned it's related to mounts; my question then is why it worked with Spark. so can we confirm it's a pandas issue in that case?
n
> Raghunath Nair: @Nok Lam Chan `open` also gave the same error

I think this is sufficient to say that it's a mount drive issue rather than an issue with a specific library. as I mentioned above, Databricks has a very specific Spark integration. you can check out the link above
r
@Nok Lam Chan are you referring to the "Work with files in cloud object storage" section for pandas, in here?
n
I can't point to a specific section, but I think https://docs.databricks.com/en/files/index.html#do-i-need-to-provide-a-uri-scheme-to-access-data explains a lot. there used to be a nice page explaining how Spark treats paths differently, but I cannot find it anymore; unfortunately they may have moved or deleted it.
But the idea is that Databricks provides you an abstraction of a filesystem. it is, however, only an abstraction: under the hood it is blob storage. it gives you a familiar interface, like accessing things with POSIX-style paths, but there are subtle differences, and they apply in a Databricks-specific way
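roughly, the distinction looks like this (illustrative paths only):
```python
# The same DBFS location, addressed two different ways on Databricks:
spark_path = "dbfs:/mnt/container/data.pq"   # URI scheme; Spark's native connector
posix_path = "/dbfs/mnt/container/data.pq"   # FUSE mount seen by local processes

# Spark reads/writes through its own cloud-storage integration:
#   spark.read.parquet(spark_path)
# pandas and plain open() go through the /dbfs FUSE layer, which is
# where the [Errno 5] surfaces for the mounted container:
#   pd.read_parquet(posix_path)
```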
r
so, since Databricks has native Spark integration, mounting works fine with Spark but possibly not with pandas? I would like to raise this issue with Databricks then
n
It'd be great if you can get help from Databricks; I am also curious what's the right way to handle these situations. just reading from the diagram though, fundamentally `dbfs/` and `dbfs/mnt` are two different things, which may explain why you can save in `dbfs/` but not in `dbfs/mnt`.

> Azure Databricks enables users to mount cloud object storage to the Databricks File System (DBFS) to simplify data access patterns for users that are unfamiliar with cloud concepts. Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and managing data governance with Unity Catalog.

https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts. you should also be aware that Databricks advises migrating away from mounted storage.
r
yes, I think it's one of the requirements for Unity Catalog to function