Jorit Studer (11/10/2023, 7:02 AM):
kedro[azure]

.marrrcin (11/10/2023, 7:05 AM):

Jorit Studer (11/10/2023, 7:07 AM):
botocore, which seems to be AWS specific.

Nok Lam Chan (11/10/2023, 7:41 AM):
botocore
https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/setup.py
botocore most likely comes with s3fs; you can look at the specific dataset that you need and add its dependencies to your project.
This means that you cannot do pip install kedro-datasets[spark.SparkDataSet] without also pulling in s3fs (and, through it, botocore).

Jorit Studer (11/10/2023, 7:49 AM):

Nok Lam Chan (11/10/2023, 7:52 AM):
spark_require = {
    "spark.SparkDataSet": [SPARK, HDFS, S3FS],
    ...
}
It’s not really just Spark but also HDFS and S3FS (who still uses HDFS these days?).
We could potentially separate out the storage. In the past, most of our users were using s3; I guess that’s why it’s bundled. From the dependencies point of view it is better to separate it, but it does make the installation a bit longer and is a breaking change: pip install kedro-datasets[spark.SparkDataSet] may become pip install kedro-datasets[spark.SparkDataSet, s3].
Cc @Juan Luis: azure extra dependencies?

Jorit Studer (11/10/2023, 12:46 PM):