# plugins-integrations
Benjamin Cheung
Hi all, I've been playing around with Kedro and I would love to migrate my current project to it. I am, however, having a lot of trouble with the following:
• Connecting to the Databricks Delta Tables without having Spark (trying to connect from my local IDE).
• Understanding how I am supposed to run the same pipeline on a bunch of different documents.
  ◦ Example: I have a folder in Azure Blob Storage with a bunch of documents that need to be run through the pipeline. Can I point it at the folder instead of the individual documents?
y
@Richard Purvis maybe you encountered question 1? On question 2, likely what you need is a PartitionedDataset (docs). In short, how it works is that you define it like this:
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder  # path to the location of partitions
  dataset: pandas.CSVDataset  # shorthand notation for the dataset which will handle individual partitions
And that means:
1. Go to the folder specified in path.
2. Read all items as individual datasets (in this case pandas.CSVDataset).
3. On load, it returns a dict[str, object] where str is a filename and object is whatever your dataset would read - in the example above it would be a pd.DataFrame.
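A minimal sketch of a node that consumes such a partitioned dataset, assuming a recent Kedro version where PartitionedDataset returns load callables as the dict values rather than already-loaded data; the function and column names below are hypothetical:

from typing import Any, Callable

import pandas as pd


def concatenate_partitions(partitions: dict[str, Callable[[], Any]]) -> pd.DataFrame:
    """Combine every partition of a PartitionedDataset into one DataFrame.

    Each dict value is a load function; calling it reads that partition
    (a pd.DataFrame when the underlying dataset is pandas.CSVDataset).
    """
    frames = []
    for partition_id, load_partition in sorted(partitions.items()):
        df = load_partition()  # actually read this partition from storage
        df["source_file"] = partition_id  # hypothetical column recording which document it came from
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

The node is then wired up like any other, e.g. node(concatenate_partitions, inputs="my_partitioned_dataset", outputs="all_documents").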
Richard Purvis
Hi @Benjamin Cheung, there is a pandas Delta table dataset, see if that works. You can probably use transcoding to access it from either pandas or Spark.
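For question 1, a minimal sketch of reading a Delta table into pandas from a local IDE without Spark, assuming the pandas.DeltaTableDataset from kedro-datasets accepts a filepath argument like other file-based Kedro datasets (the storage path below is a placeholder, a private container would also need credentials, and the exact constructor arguments should be checked against the installed kedro-datasets version):

from kedro_datasets.pandas import DeltaTableDataset

# Placeholder path - point this at your Delta table (a local path or an abfss:// URI).
dataset = DeltaTableDataset(
    filepath="abfss://my-container@my-account.dfs.core.windows.net/path/to/delta_table",
)

df = dataset.load()  # returns a pandas DataFrame; no Spark session required
print(df.head())

Transcoding would then let the same data be read as pandas locally and as Spark on Databricks, for example by declaring my_table@pandas and my_table@spark catalog entries pointing at the same location.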