# questions
a
Hi! I'm using Kedro on Databricks, and using ManagedTableDataSet in the catalog to load the data I need. So far it's working well, but for my use case I would like to load using a SQL query. So instead of loading tableA, I could load something like *select * from tableA where date <= parameter*. Any suggestions on how to achieve this? https://docs.kedro.org/en/stable/kedro_datasets.databricks.ManagedTableDataset.html#kedro_datasets.databricks.ManagedTableDataset
j
hi @Ana Paula Rojas, I'm looking into this and I can't find how to run such a
SELECT * FROM tableA WHERE ...
from the PySpark API (without Kedro). Do you have an example of what that would look like? All the examples I find read the whole table:
df = spark.read.format("delta").load("/tmp/delta-table")
a
Hi! This is an example of how I can run this in a Databricks notebook (outside Kedro): df = spark.sql("SELECT * FROM tableA WHERE ...").toPandas()
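As a minimal sketch of that pattern, the query string can be built from a parameter before being handed to spark.sql. The helper name, table name, and date column below are illustrative, not part of any Kedro or PySpark API:

```python
def build_query(table: str, max_date: str) -> str:
    """Build the parameterised SELECT discussed in this thread.

    `table` and `max_date` are placeholder names used for illustration.
    """
    return f"SELECT * FROM {table} WHERE date <= '{max_date}'"


# In a Databricks notebook (where a SparkSession `spark` exists) you would run:
# df = spark.sql(build_query("tableA", "2023-01-01")).toPandas()
```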
j
where TableA is a Delta table, correct?
I don't think we have a Spark SQL dataset just yet, but it would definitely be a good addition
d
You would have to do the filter in the node - but remember Spark/Delta is lazy, so there is no performance loss in 'loading the entire table': it's just a pointer until you do something like .toPandas()
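A minimal sketch of that node-side filtering, assuming the input is a Spark DataFrame with a `date` column (the function and column names are made up for illustration):

```python
def latest_rows(df, cutoff):
    """Kedro-style node: filter a Spark DataFrame before materialising it.

    `.filter(...)` is lazy -- Spark only builds a query plan here; the
    actual read happens when `.toPandas()` collects the result. So the
    'whole table' is never pulled into memory, only the filtered rows.
    """
    return df.filter(df.date <= cutoff).toPandas()
```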
a
I was using ManagedTableSet with the option _*dataframe_type='pandas'*_, which I understand, in the node, it's already loaded the whole data. In any case, the SQL Dataset would allow me not to only do filters but maybe also something like "select * from A left join B ..."
d
yes, you can declare all of that in the catalog if you'd like
but I think we'd encourage you to do complex logic in nodes within Kedro, because it's easier to unit test and document
pandas.SQLDataSet works if you have the sqlalchemy-databricks plug-in installed too
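A catalog entry along those lines might look like the sketch below. This is a guess at the shape, assuming the sqlalchemy-databricks plug-in supplies the connection; the entry name, query, and credentials key are all made up:

```yaml
# conf/base/catalog.yml (hypothetical entry)
tableA_filtered:
  type: pandas.SQLQueryDataSet
  sql: SELECT * FROM tableA WHERE date <= '2023-01-01'
  credentials: databricks_credentials  # must resolve to a `con` connection string
```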
But our ManagedTableDataSet implementation is mostly designed for users who want to work with a Spark DataFrame
a
Super clear, thanks for your comments 🙂