# questions
f
Hi fellows, I am cleaning up dependencies in our kedro code and, upon scrutiny, I am a bit confused by the dependencies for `databricks.ManagedTableDataset`. The `pyproject.toml` (https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml) states:
```toml
hdfs-base = ["hdfs>=2.5.8, <3.0"]
s3fs-base = ["s3fs>=2021.4"]
...
databricks-managedtabledataset = ["kedro-datasets[hdfs-base,s3fs-base]"]
databricks = ["kedro-datasets[databricks-managedtabledataset]"]
```
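To make the extras chain concrete, here is a toy resolver over just the lines quoted above (not the full `pyproject.toml`): it flattens the self-referencing `kedro-datasets[...]` entries and shows which concrete packages the `databricks` extra actually pulls in.

```python
# Toy resolver for the extras quoted above. Entries of the form
# "kedro-datasets[...]" recurse into other extras of the same package.
extras = {
    "hdfs-base": ["hdfs>=2.5.8, <3.0"],
    "s3fs-base": ["s3fs>=2021.4"],
    "databricks-managedtabledataset": ["kedro-datasets[hdfs-base,s3fs-base]"],
    "databricks": ["kedro-datasets[databricks-managedtabledataset]"],
}

def resolve(extra):
    """Flatten an extra into the concrete requirements it ultimately installs."""
    reqs = set()
    for dep in extras[extra]:
        if dep.startswith("kedro-datasets["):
            # Strip the "kedro-datasets[" prefix and the trailing "]",
            # then recurse into each referenced extra.
            inner = dep[len("kedro-datasets["):-1]
            for sub in inner.split(","):
                reqs |= resolve(sub.strip())
        else:
            reqs.add(dep)
    return reqs

print(sorted(resolve("databricks")))
# → ['hdfs>=2.5.8, <3.0', 's3fs>=2021.4'] — pyspark never appears.
```

So, per the quoted snippet, installing `kedro-datasets[databricks]` brings in only `hdfs` and `s3fs`.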
But in the implementation, I don't see any reference to those two packages, while the dataset requires `pyspark`, which is not stated as a dependency if I am not mistaken. Could you tell me if my interpretation is incorrect?
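One quick way to check a claim like this is to parse the dataset module with `ast` and list its top-level imports. The `source` string below is a hypothetical stand-in for the `ManagedTableDataset` implementation, not the real file:

```python
import ast

# Hypothetical stand-in for the ManagedTableDataset source; in practice
# you would read the real module file from kedro-datasets instead.
source = """
from pyspark.sql import DataFrame, SparkSession
import pandas as pd
"""

tree = ast.parse(source)
imported = set()
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        # "import pandas as pd" -> top-level package "pandas"
        imported.update(alias.name.split(".")[0] for alias in node.names)
    elif isinstance(node, ast.ImportFrom) and node.module:
        # "from pyspark.sql import ..." -> top-level package "pyspark"
        imported.add(node.module.split(".")[0])

print(sorted(imported))
# → ['pandas', 'pyspark'] — neither hdfs nor s3fs is imported here.
```

Running the same scan over the real module would confirm whether the declared extras match the actual imports.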
d
This is possibly true, and no one has ever reported the issue until now — most likely because you're so likely to have PySpark in a Databricks environment that it's never come up! Please submit a PR or a GitHub issue; it would be much appreciated.
👍 1
f
I will do that. 👍
🙏 2
j
there are several issues related to this, see https://github.com/kedro-org/kedro-plugins/issues/135 and linked issues
f
@Juan Luis, while I see the connection between the two groups of datasets, they are nevertheless independent if I am not mistaken. By mentioning those issues, do you mean that we shouldn't fix the Databricks part without also fixing the more generic Spark ones?