# questions
a
Hello Team, is there any way to read from the catalog based on multiple, dynamic paths? For instance, if we want to read parquet files from Google Cloud Storage as a SparkDataset, but the path in the catalog will be determined at run time:
```
filepath: gcs://bucket/table-name/year={}/month={}/day={}
```

The year, month, and day templates will be replaced with either `*`, a specific value (e.g. `year=2024`), or a list of values (e.g. `year={2023, 2024} day={1, 2, 3, 4}`):

```
gcs://bucket/table-name/year=2024/month=1/day=*      # all the days in the month=1 folder of the year=2024 folder

gcs://bucket/table-name/year=2023/month={12,11}/day=*  # all parquet files in the month=12 and month=11 folders of year=2023, across all day folders (day=1 ... day=31)
```
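One way to handle the list-valued case is to expand the `{a,b}` alternations into concrete paths yourself before handing them to the reader. A minimal sketch (the `expand_braces` helper is illustrative, not a library function); `*` wildcards are left in place for the storage layer's glob matching:

```python
import itertools
import re

def expand_braces(pattern):
    """Expand {a,b,c} alternations in a path into all concrete combinations.

    '*' wildcards are left as-is for the storage layer's glob to resolve.
    """
    # Split the pattern, keeping {...} groups as their own segments
    parts = re.split(r"(\{[^}]*\})", pattern)
    choices = [
        [alt.strip() for alt in part[1:-1].split(",")] if part.startswith("{") else [part]
        for part in parts
    ]
    # Cartesian product over every alternation group
    return ["".join(combo) for combo in itertools.product(*choices)]

print(expand_braces("gcs://bucket/table-name/year=2023/month={12,11}/day=*"))
```

This turns the second example above into two concrete glob paths, one per month.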
d
Hi Abdullah, are you looking to run a single pipeline multiple times, each with different inputs, or would you prefer to process all the data in one go, handling multiple datasets within the same pipeline run?
a
Hi Dmitry, it's one pipeline that would use a dataset. This dataset will point to Cloud Storage and read parquet files from multiple paths, and those paths will be determined at run time: calculate the date range, replace the path template with actual dates, and read the data from all of those paths.
```python
list_of_dates = ['2024/2/19', '2024/2/18', '2024/2/17', '2024/2/16', '2024/2/15']

path_template = 'gcs://bucket/table-name/year={}/month={}/day={}'

def format_date_template(date_str):
    """Converts a date string (YYYY/MM/DD) into the desired path template format.

    Args:
        date_str (str): The date string in YYYY/MM/DD format.

    Returns:
        str: The formatted path string, or None if the input is invalid.
    """
    try:
        # Split into year, month, and day components and convert to integers
        year, month, day = (int(part) for part in date_str.split('/'))

        # Fill the template with the extracted values
        return path_template.format(year, month, day)

    except ValueError:
        print(f"Invalid date format for '{date_str}': please use YYYY/MM/DD format.")
        return None  # Indicate error for invalid date strings

# Map the function over the dates to get the concrete paths
formatted_paths = list(map(format_date_template, list_of_dates))

# Print the resulting list of formatted paths
print(formatted_paths)
```
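Rather than hard-coding `list_of_dates`, the date range itself can be computed at run time. A small sketch (the `last_n_days` helper is illustrative) that reproduces the list above from an end date and a window size:

```python
from datetime import date, timedelta

def last_n_days(end, n):
    """Return 'YYYY/M/D' strings for the n days ending at `end`, most recent first."""
    return [
        f"{d.year}/{d.month}/{d.day}"
        for d in (end - timedelta(days=i) for i in range(n))
    ]

print(last_n_days(date(2024, 2, 19), 5))
```

In practice `end` would likely be `date.today()`, so each run picks up the most recent window.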
Now I will read the data stored in `formatted_paths`:
```python
df = spark.read.parquet(*formatted_paths)
```
I wonder if there is something like this built in with the catalog.
d
I think Kedro doesn't have a built-in feature for dynamically setting dataset paths in the catalog, but you can use the `before_pipeline_run` hook to modify the catalog at run time. Inside this hook, you can use the `catalog.add` method to add datasets: `catalog.add("dynamic_ds", SparkDataset(filepath=formatted_path))`.
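Put together, the hook could look roughly like this. A minimal sketch, assuming Kedro >= 0.18 with the `kedro-datasets` package; the class name, dataset names, and path template are illustrative, and the hook would be registered in the project's `settings.py` via `HOOKS`:

```python
# The try/except lets this sketch run even without Kedro installed;
# in a real project you would import hook_impl directly.
try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    def hook_impl(func):  # no-op stand-in so the sketch stays self-contained
        return func

PATH_TEMPLATE = "gcs://bucket/table-name/year={y}/month={m}/day={d}"

def paths_for_dates(dates):
    """Expand 'YYYY/M/D' strings into concrete GCS paths."""
    paths = []
    for date_str in dates:
        y, m, d = (int(part) for part in date_str.split("/"))
        paths.append(PATH_TEMPLATE.format(y=y, m=m, d=d))
    return paths

class DynamicCatalogHook:
    """Adds one SparkDataset per date to the catalog before the pipeline runs."""

    def __init__(self, dates):
        self.dates = dates

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Imported here so kedro-datasets is only needed when the hook fires
        from kedro_datasets.spark import SparkDataset

        for i, path in enumerate(paths_for_dates(self.dates)):
            catalog.add(f"dynamic_ds_{i}", SparkDataset(filepath=path))
```

Registration would then be something like `HOOKS = (DynamicCatalogHook(last_five_days),)` in `settings.py`, with the date list computed at import time or injected however suits the project.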
a
Sounds good. Thank you, Dmitry!