# questions
r
Hello team, we're facing a huge blocker: saving a pandas.ParquetDataset to /dbfs/mnt/ mount points fails on Databricks, whereas the same save works on /dbfs/. We have a pandas dataset and the team is blocked while saving it to /dbfs/mnt/<storage-container>. However, it works on plain "/dbfs/", so it looks like an issue between pandas DataFrames and /dbfs/mnt/<storage-container>. Error from Kedro:
```
Failed while saving data to data set ParquetDataset(filepath=/dbfs/mnt/container, load_args={}, protocol=file, save_args={}).
[Errno 5] Input/output error: '/dbfs/mnt/container'
```
We're also open to any alternative solutions. Please treat this with high priority, as our entire pipeline run depends on it.
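for reference, this is roughly what we're doing, reduced to the Python API (a sketch with dummy data; the dataset class name is spelled as it appears in this thread, though some kedro-datasets releases spell it ParquetDataSet):
```python
import pandas as pd
from kedro_datasets.pandas import ParquetDataset  # spelled ParquetDataSet in some releases

df = pd.DataFrame({"a": [1, 2, 3]})  # dummy data for illustration

# Saving to the mounted container raises the [Errno 5] above:
ParquetDataset(filepath="/dbfs/mnt/container/test.pq").save(df)

# Saving to plain /dbfs works:
ParquetDataset(filepath="/dbfs/test.pq").save(df)
```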
j
hello @Raghunath Nair, sorry you're having a rough experience. can you write other files to `/dbfs/mnt/container`? or does it only fail with Kedro?
also, please tell us what Python, Kedro, and `kedro-datasets` versions you're using
r
@Juan Luis yes, `spark.SparkDataset` works fine; with Spark we are able to save. With `pandas.ParquetDataset` we cannot save, and get:
```
Error from kedro: Failed while saving data to data set ParquetDataset(filepath=/dbfs/mnt/container, load_args={}, protocol=file, save_args={}).
[Errno 5] Input/output error: '/dbfs/mnt/container'
```
using Kedro `0.18.14`, Python `3.10.11`, on Databricks Runtime `13.3 LTS ML` (includes Apache Spark 3.4.1, Scala 2.12)
a
I remember @Shubham Agrawal was facing a similar issue, so tagging him in case he has a solution already. I'll investigate in the meantime.
r
thanks @Ankita Katiyar, much appreciated, please let me know!
s
I still have the issue. What helped me was just saving to a different location on DBFS. check if that works for you? Is the <container> an Azure data storage account?
r
@Shubham Agrawal yes, it is an Azure storage account. As I mentioned in my issue, saving to /dbfs works but does not work on /dbfs/mnt/<storage>.
s
interesting.. it was the same for me too
r
@Shubham Agrawal to add to the above: DBFS itself is not meant for storing data, so we want to push into a mounted storage account, and for security reasons we can't use the storage account without mounting.
s
however, there were a few containers mounted on Databricks by my company's IT team, and I was able to save the pandas dataset there. My hypothesis is it's something to do with how the container is mounted, maybe? worth a try
r
@Shubham Agrawal `spark.SparkDataset` works fine, but `pandas.ParquetDataset` doesn't work on any of them, so I suspect the issue is with Databricks' support for pandas and mounts
s
Yep, I had the same issue. I tried different versions of pandas, Python, and Databricks clusters, but this is what eventually worked for me
r
@Shubham Agrawal you mean switching from `dbfs/mnt` to `dbfs`, right?
j
@Raghunath Nair another thing you can try is launching a script that does
```python
import pandas as pd

df = pd.DataFrame(...)  # fill it with some data

df.to_parquet("/dbfs/mnt/container/test.pq")
```
if it fails, it will hopefully tell you more information (and also rule out that it's a Kedro problem). if it works, we'd need to keep investigating.
n
Worth noting that Spark has native integration on Databricks, so the authentication mechanism and how it interacts with the filesystem/DBFS could be different from other libraries
r
```python
import pandas as pd
df = pd.DataFrame(...)  # fill it with some data
df.to_parquet("/dbfs/mnt/container/test.pq")
```
fails with `[Errno 5] Input/output error: '/dbfs/mnt/teamdata/test.pq'`
@Nok Lam Chan @Juan Luis it's only an issue with pandas
j
thanks @Raghunath Nair. that's an indication that this is a pandas problem, not a Kedro problem. there's nothing we can do about it - my recommendation would be to find a workaround. given that you do want to use that location, are you able to use Spark to write Parquet? or even Polars?
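for instance, something along these lines (a rough sketch, using the `spark` session that Databricks notebooks provide and a made-up frame; path as in your example):
```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # made-up data for the sketch

# Route the pandas frame through Spark, which writes via its native
# Databricks cloud-storage integration instead of the /dbfs FUSE mount:
spark.createDataFrame(df).write.mode("overwrite").parquet("dbfs:/mnt/container/test_pq")
```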
n
Before we can conclude this is a pandas problem, can you try with something that is not pandas? maybe just try to save a file with the built-in `open`. Right now you're saying it's not working with pandas; I tend to think it's the opposite, i.e. it's only working with Spark and nothing else. https://forums.linuxmint.com/viewtopic.php?t=396045 The error itself suggests it's a common problem with mounted drives.
I also did a quick search on the pandas GitHub page and nothing showed up, so I don't think it's a pandas-specific issue; otherwise it would be more common
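for example, something as simple as this (same hypothetical mount path as above):
```python
# If plain open() also raises [Errno 5], the problem is the mount
# itself, not pandas or any particular library.
with open("/dbfs/mnt/container/test.txt", "w") as f:
    f.write("hello")
```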
r
@Nok Lam Chan `open` also gave the same error. and we're using Kedro's `pandas.ParquetDataset`. FYI @Juan Luis, `spark.SparkDataset` works perfectly fine on mounts, and every pandas save fails with the same error:
```
Error from kedro: Failed while saving data to data set ParquetDataset(filepath=/dbfs/mnt/container, load_args={}, protocol=file, save_args={}).
[Errno 5] Input/output error: '/dbfs/mnt/container'
```
and you can see we're getting the error directly from Kedro's `ParquetDataset` itself. I hope it's failing due to the same cause?
j
if
```python
import pandas as pd
df = pd.DataFrame(...)  # fill it with some data
df.to_parquet("/dbfs/mnt/container/test.pq")
```
fails and `spark.SparkDataset` works, then again it's not a problem of `kedro_datasets.pandas.ParquetDataset`, but `pandas` itself. am I missing something?
r
@Nok Lam Chan mentioned it's related to mounts; my question then is why it worked with Spark. so can we confirm it's a pandas issue in that case?
n
> Raghunath Nair: @Nok Lam Chan `open` also gave the same error

I think this is sufficient to say that it's a mount drive issue rather than an issue with a specific library. as I mentioned above, Databricks has a very specific Spark integration. you can check out the link above
r
@Nok Lam Chan are you referring to the "Work with files in cloud object storage" section for pandas, in here?
n
I can't point to a specific section, but I think https://docs.databricks.com/en/files/index.html#do-i-need-to-provide-a-uri-scheme-to-access-data explains a lot. there used to be a nice page explaining how Spark treats paths differently, but I cannot find it anymore; unfortunately they may have moved or deleted it.
But the idea is that Databricks provides you an abstraction of a filesystem. it is, however, only an abstraction: under the hood it is blob storage. it gives you a familiar interface, like accessing things with POSIX-style paths, but there are subtle differences, and they apply in a Databricks-specific way
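roughly, the distinction looks like this (illustrative paths only):
```python
# The same DBFS location, addressed two different ways on Databricks:
spark_path = "dbfs:/mnt/container/data.pq"   # URI scheme; Spark's native connector
posix_path = "/dbfs/mnt/container/data.pq"   # FUSE mount seen by local processes

# Spark reads/writes through its own cloud-storage integration:
#   spark.read.parquet(spark_path)
# pandas and plain open() go through the /dbfs FUSE layer, which is
# where the [Errno 5] surfaces for the mounted container:
#   pd.read_parquet(posix_path)
```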
r
so, since Databricks has native Spark integration, mounting works fine with Spark but possibly not with pandas? I would like to raise this issue with Databricks then
n
It'd be great if you can get help from Databricks; I am also curious what's the right way to handle these situations. just reading from the diagram though, fundamentally `dbfs/` and `dbfs/mnt` are two different things, which may explain why you can save in `dbfs/` but not in `dbfs/mnt`.

> Azure Databricks enables users to mount cloud object storage to the Databricks File System (DBFS) to simplify data access patterns for users that are unfamiliar with cloud concepts. Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and managing data governance with Unity Catalog.

https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts. you should also be aware that Databricks advises migrating away from mounted storage.
r
yes, I think it's one of the requirements for Unity Catalog to function