# questions
a
Hi! I'm using Kedro on Databricks, and using ManagedTableDataSet in the catalog to load the data I need. So far it's working well, but for my use case I would like to load using a SQL query. So instead of loading tableA, I could load something like *select * from tableA where date <= parameter*. Any suggestions on how to achieve this? https://docs.kedro.org/en/stable/kedro_datasets.databricks.ManagedTableDataset.html#kedro_datasets.databricks.ManagedTableDataset
j
hi @Ana Paula Rojas, I'm looking into this and I can't find how to run such a
SELECT * FROM tableA WHERE ...
from the PySpark API (without Kedro). Do you have an example of what that would look like? All the examples I find read the whole table:
df = spark.read.format("delta").load("/tmp/delta-table")
a
Hi! This is an example of how I can run this in a Databricks notebook (outside Kedro): df = spark.sql("SELECT * FROM tableA WHERE ...").toPandas()
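As a minimal sketch of that pattern, the query string can be built from a parameter before being handed to spark.sql. The helper name, table name, and date column below are illustrative, not part of any Kedro or PySpark API:

```python
def build_query(table: str, max_date: str) -> str:
    """Build the parameterised SELECT discussed in this thread.

    `table` and `max_date` are placeholder names used for illustration.
    """
    return f"SELECT * FROM {table} WHERE date <= '{max_date}'"


# In a Databricks notebook (where a SparkSession `spark` exists) you would run:
# df = spark.sql(build_query("tableA", "2023-01-01")).toPandas()
```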
j
where TableA is a Delta table, correct?
I don't think we have a Spark SQL dataset just yet, but it would definitely be a good addition
d
You would have to do the filter in the node - but remember Spark/Delta is lazy, so there is no performance loss in 'loading the entire table': it's just a pointer until you do something like .toPandas()
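A minimal sketch of that node-side filtering, assuming the input is a Spark DataFrame with a `date` column (the function and column names are made up for illustration):

```python
def latest_rows(df, cutoff):
    """Kedro-style node: filter a Spark DataFrame before materialising it.

    `.filter(...)` is lazy -- Spark only builds a query plan here; the
    actual read happens when `.toPandas()` collects the result. So the
    'whole table' is never pulled into memory, only the filtered rows.
    """
    return df.filter(df.date <= cutoff).toPandas()
```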
a
I was using ManagedTableSet with the option _*dataframe_type='pandas'*_, which I understand, in the node, it's already loaded the whole data. In any case, the SQL Dataset would allow me not to only do filters but maybe also something like "select * from A left join B ..."
d
yes, you can declare all of that in the catalog if you'd like
but I think we'd encourage you to do complex logic in nodes within Kedro, because it's easier to unit test and document
pandas.SQLDataSet works if you have the sqlalchemy-databricks plug-in installed too
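A catalog entry along those lines might look like the sketch below. This is a guess at the shape, assuming the sqlalchemy-databricks plug-in supplies the connection; the entry name, query, and credentials key are all made up:

```yaml
# conf/base/catalog.yml (hypothetical entry)
tableA_filtered:
  type: pandas.SQLQueryDataSet
  sql: SELECT * FROM tableA WHERE date <= '2023-01-01'
  credentials: databricks_credentials  # must resolve to a `con` connection string
```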
But our ManagedTableDataSet implementation is mostly designed for users who want to work with a Spark DataFrame
a
Super clear, thanks for your comments 🙂