# questions
f
Subject: Create a data catalog for files in Databricks Unity Catalog Hello everyone. When I create a data catalog for a Kedro project, I manually specify the absolute paths for the input and output files that reside in DBFS, as required by Kedro. But since I started using Unity Catalog, these paths have become relative, and the absolute paths now contain hash keys. It is easy to fetch the absolute paths for input files, but how can we construct such paths for output files?
h
Someone will reply to you shortly. In the meantime, this might help:
j
hi @FIRAS ALAMEDDINE,
But since I started using Unity Catalog, these paths become relative, and the absolute paths start having hash keys in them.
could you clarify this a bit more?
f
I saved files on DBFS at first. When defining a data catalog, we used this format:

from kedro.io import DataCatalog
from kedro_datasets.spark import SparkDataSet

# Absolute DBFS paths for each input table
ingest_dict = {
    "df1": "dbfs:/Filestore/tables/path/to/df1",
    "df2": "dbfs:/Filestore/tables/path/to/df2",
    # ... more tables
}

dict_ingest_catalog = {}
for table in ingest_dict:
    a_df = SparkDataSet(
        filepath=ingest_dict[table],
        file_format="parquet",
        load_args={"header": True, "inferSchema": True, "nullValue": "NA"},
        save_args={"sep": ",", "header": True, "mode": "overwrite"},
    )
    dict_ingest_catalog[table] = a_df

full_catalog = DataCatalog(dict_ingest_catalog)
Now I want to save files on UC. Instead of using the initial file paths on DBFS, I tried using f"{catalog}.{schema}.{tableName}" as the filepath and it failed. Then I replaced the filepaths with the new "locations" of the files on UC. We can get them with this piece of code:

# Ask Unity Catalog where the managed table is physically stored
details = spark.sql(f"DESCRIBE DETAIL {catalog}.{schema}.{tableName}").collect()
location = details[0]['location']
I can get locations for a pipeline's input files. However, since its output files don't exist yet, I cannot get their locations, and I cannot predefine a format similar to location. Therefore, my data catalog is not complete.
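For the input side, something along these lines works (a rough sketch only, reusing the spark session, the catalog/schema variables and the SparkDataSet import from above; the table names are placeholders):

# Sketch: resolve each existing UC table's physical location and reuse it
# as the SparkDataSet filepath. Only works for tables that already exist.
input_tables = ["df1", "df2"]

dict_ingest_catalog = {}
for tableName in input_tables:
    details = spark.sql(f"DESCRIBE DETAIL {catalog}.{schema}.{tableName}").collect()
    location = details[0]["location"]  # e.g. a cloud path containing a hash-like table id
    dict_ingest_catalog[tableName] = SparkDataSet(
        filepath=location,
        file_format="delta",  # UC managed tables are stored as Delta
    )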
j
can you maybe try the ManagedTableDataset instead of SparkDataset?
from kedro.io import DataCatalog

# Each catalog entry points at a Unity Catalog managed table rather than a file path
DataCatalog.from_config(
    {
        "nyctaxi_trips": {
            "type": "databricks.ManagedTableDataset",
            "catalog": "samples",
            "database": "nyctaxi",
            "table": "trips",
        }
    }
)
(from https://github.com/astrojuanlu/kedro-databricks-demo/blob/main/First%20Steps%20with%20Kedro%20on%20Databricks.ipynb)
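Adapted to your loop it could look roughly like this (just a sketch, assuming the same catalog/schema variables and placeholder table names as in your snippet). Since ManagedTableDataset addresses the table by name rather than by path, output tables that don't exist yet are not a problem:

from kedro.io import DataCatalog
from kedro_datasets.databricks import ManagedTableDataset

# Placeholder table names; catalog and schema are assumed to be defined already
tables = ["df1", "df2"]

dict_catalog = {
    table: ManagedTableDataset(
        catalog=catalog,
        database=schema,
        table=table,
        write_mode="overwrite",  # allows the pipeline to create/overwrite output tables
    )
    for table in tables
}

full_catalog = DataCatalog(dict_catalog)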
f
I'll give it a shot. Thank you!
A question: I ingest my input tables into UC by reading raw tables that reside in an external location, using something like
df = spark.read.table(f"{external_catalog}.{external_schema}.{raw_table}")
then modified a bit, and finally written using
df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{tableName}")
In order to use a config similar to what you wrote, is it mandatory to write these input files in a different way? Maybe something like:
from kedro_datasets.databricks import ManagedTableDataset

# Write the DataFrame as a Unity Catalog managed table through Kedro
dataset = ManagedTableDataset(table=tableName, catalog=catalog, database=schema, write_mode="overwrite")
dataset.save(df)
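And I'm guessing the same dataset object can then be read back downstream without any physical path, something like:

# Reload the managed table through the same dataset definition (no file path needed)
df_again = dataset.load()
df_again.show(5)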
Apparently yes, my data engineering pipeline is working again. Thanks a lot Juan!
j
amazing, happy to help!