# plugins-integrations
Benjamin Cheung
Hi all, I've been playing around with Kedro and I would love to migrate my current project to it. I am, however, having a lot of trouble with the following:
• Connecting to the Databricks Delta Tables without having Spark (trying to connect from my local IDE).
• Understanding how I am supposed to run the same pipeline on a bunch of different documents.
  ◦ Example: I have a folder in Azure Blob Storage with a bunch of documents that need to be run through the pipeline. Can I point it at the folder instead of the individual documents?
y
@Richard Purvis maybe you encountered question 1? On question 2, likely what you need is a PartitionedDataset (docs). In short, how it works is that you define it like this:
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder  # path to the location of partitions
  dataset: pandas.CSVDataset  # shorthand notation for the dataset which will handle individual partitions
And that means:
1. Go to the folder specified in path.
2. Read all items as individual datasets (in this case pandas.CSVDataset).
3. On load, it returns a dict[str, object] where str is a filename and object is whatever your dataset would read - in the example above it would be a pd.DataFrame.
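A minimal sketch of a node that consumes such a partitioned dataset, assuming a recent Kedro version where PartitionedDataset returns load callables as the dict values rather than already-loaded data; the function and column names below are hypothetical:

from typing import Any, Callable

import pandas as pd


def concatenate_partitions(partitions: dict[str, Callable[[], Any]]) -> pd.DataFrame:
    """Combine every partition of a PartitionedDataset into one DataFrame.

    Each dict value is a load function; calling it reads that partition
    (a pd.DataFrame when the underlying dataset is pandas.CSVDataset).
    """
    frames = []
    for partition_id, load_partition in sorted(partitions.items()):
        df = load_partition()  # actually read this partition from storage
        df["source_file"] = partition_id  # hypothetical column recording which document it came from
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

The node is then wired up like any other, e.g. node(concatenate_partitions, inputs="my_partitioned_dataset", outputs="all_documents").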
Richard Purvis
Hi @Benjamin Cheung, there is a pandas Delta table dataset, see if that works. You can probably use transcoding to access it from either pandas or Spark.
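For question 1, a minimal sketch of reading a Delta table into pandas from a local IDE without Spark, assuming the pandas.DeltaTableDataset from kedro-datasets accepts a filepath argument like other file-based Kedro datasets (the storage path below is a placeholder, a private container would also need credentials, and the exact constructor arguments should be checked against the installed kedro-datasets version):

from kedro_datasets.pandas import DeltaTableDataset

# Placeholder path - point this at your Delta table (a local path or an abfss:// URI).
dataset = DeltaTableDataset(
    filepath="abfss://my-container@my-account.dfs.core.windows.net/path/to/delta_table",
)

df = dataset.load()  # returns a pandas DataFrame; no Spark session required
print(df.head())

Transcoding would then let the same data be read as pandas locally and as Spark on Databricks, for example by declaring my_table@pandas and my_table@spark catalog entries pointing at the same location.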