# questions
a
Hello Team, is there any way to read from the catalog based on multiple, dynamic paths? For instance, if we want to read parquet files from Google Cloud Storage as a SparkDataset, but the path in the catalog will be determined at run time:
```
filepath: gcs://bucket/table-name/year={}/month={}/day={}
```

The year, month, and day templates will be replaced with either `*`, a specific value (e.g. `year=2024`), or a list of values (e.g. `year={2023, 2024} day={1, 2, 3, 4}`):

```
gcs://bucket/table-name/year=2024/month=1/day=*      # all the days in the month=1 folder of the year=2024 folder

gcs://bucket/table-name/year=2023/month={12,11}/day=*  # all parquet files in the month=12 and month=11 folders of year=2023, across all day folders (day=1 ... day=31)
```
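One way to handle the list-valued case is to expand the `{a,b}` alternations into concrete paths yourself before handing them to the reader. A minimal sketch (the `expand_braces` helper is illustrative, not a library function); `*` wildcards are left in place for the storage layer's glob matching:

```python
import itertools
import re

def expand_braces(pattern):
    """Expand {a,b,c} alternations in a path into all concrete combinations.

    '*' wildcards are left as-is for the storage layer's glob to resolve.
    """
    # Split the pattern, keeping {...} groups as their own segments
    parts = re.split(r"(\{[^}]*\})", pattern)
    choices = [
        [alt.strip() for alt in part[1:-1].split(",")] if part.startswith("{") else [part]
        for part in parts
    ]
    # Cartesian product over every alternation group
    return ["".join(combo) for combo in itertools.product(*choices)]

print(expand_braces("gcs://bucket/table-name/year=2023/month={12,11}/day=*"))
```

This turns the second example above into two concrete glob paths, one per month.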
d
Hi Abdullah, are you looking to run a single pipeline multiple times, each with different inputs, or would you prefer to process all the data in one go, handling multiple datasets within the same pipeline run?
a
Hi Dmitry, it's one pipeline that would use a dataset. This dataset will point to Cloud Storage and read parquet files from multiple paths, and those paths will be determined at run time: calculate the date range, replace the path template with actual dates, and read the data from all of those paths.
```python
list_of_dates = ['2024/2/19', '2024/2/18', '2024/2/17', '2024/2/16', '2024/2/15']

path_template = 'gcs://bucket/table-name/year={}/month={}/day={}'

def format_date_template(date_str):
    """Converts a date string (YYYY/MM/DD) into the desired path template format.

    Args:
        date_str (str): The date string in YYYY/MM/DD format.

    Returns:
        str: The formatted path string, or None if the input is invalid.
    """
    try:
        # Split into year, month, and day components and convert to integers
        year, month, day = (int(part) for part in date_str.split('/'))

        # Fill the template with the extracted values
        return path_template.format(year, month, day)

    except ValueError:
        print(f"Invalid date format for '{date_str}': please use YYYY/MM/DD format.")
        return None  # Indicate error for invalid date strings

# Map the function over the dates to get the concrete paths
formatted_paths = list(map(format_date_template, list_of_dates))

# Print the resulting list of formatted paths
print(formatted_paths)
```
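Rather than hard-coding `list_of_dates`, the date range itself can be computed at run time. A small sketch (the `last_n_days` helper is illustrative) that reproduces the list above from an end date and a window size:

```python
from datetime import date, timedelta

def last_n_days(end, n):
    """Return 'YYYY/M/D' strings for the n days ending at `end`, most recent first."""
    return [
        f"{d.year}/{d.month}/{d.day}"
        for d in (end - timedelta(days=i) for i in range(n))
    ]

print(last_n_days(date(2024, 2, 19), 5))
```

In practice `end` would likely be `date.today()`, so each run picks up the most recent window.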
Now I will read the data stored in `formatted_paths`:
```python
df = spark.read.parquet(*formatted_paths)
```
I wonder if there is something like this built in with the catalog.
d
I think Kedro doesn't have a built-in feature for dynamically setting dataset paths in the catalog, but you can use the `before_pipeline_run` hook to modify the catalog at run time. Inside this hook, you can use the `catalog.add` method to add datasets: `catalog.add("dynamic_ds", SparkDataset(filepath=formatted_path))`.
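Put together, the hook could look roughly like this. A minimal sketch, assuming Kedro >= 0.18 with the `kedro-datasets` package; the class name, dataset names, and path template are illustrative, and the hook would be registered in the project's `settings.py` via `HOOKS`:

```python
# The try/except lets this sketch run even without Kedro installed;
# in a real project you would import hook_impl directly.
try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    def hook_impl(func):  # no-op stand-in so the sketch stays self-contained
        return func

PATH_TEMPLATE = "gcs://bucket/table-name/year={y}/month={m}/day={d}"

def paths_for_dates(dates):
    """Expand 'YYYY/M/D' strings into concrete GCS paths."""
    paths = []
    for date_str in dates:
        y, m, d = (int(part) for part in date_str.split("/"))
        paths.append(PATH_TEMPLATE.format(y=y, m=m, d=d))
    return paths

class DynamicCatalogHook:
    """Adds one SparkDataset per date to the catalog before the pipeline runs."""

    def __init__(self, dates):
        self.dates = dates

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Imported here so kedro-datasets is only needed when the hook fires
        from kedro_datasets.spark import SparkDataset

        for i, path in enumerate(paths_for_dates(self.dates)):
            catalog.add(f"dynamic_ds_{i}", SparkDataset(filepath=path))
```

Registration would then be something like `HOOKS = (DynamicCatalogHook(last_five_days),)` in `settings.py`, with the date list computed at import time or injected however suits the project.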
a
Sounds good. Thank you, Dmitry!