Thomas d'Hooghe
08/28/2024, 4:47 PM

Deepyaman Datta
08/28/2024, 4:50 PM
kedro run (or kedro run --pipeline whatever_i_want), which doesn't include the node for loading. Yes, kedro run will fail if the load pipeline hasn't been run yet, but I think it's fine to be explicit about this expensive operation.

Thomas d'Hooghe
08/28/2024, 4:51 PM

Yury Fedotov
08/28/2024, 5:11 PM
I have an ingestion pipeline whose sole responsibility is to load all raw files into typed formats like Parquet or Pickle. I allow myself to do only these things within this pipeline:
• Combine the same data from different Excel sheets or file systems into a single data object.
• Standardize column names right after reading from raw inputs.
• Ensure relevant types (e.g., Order ID should probably be a str, not an int).
• Drop completely empty rows and columns from raw data.
• Drop rows that are immediately useless regardless of what analytics will be performed downstream.
• Align names of common categories.
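A minimal sketch of what one such ingestion node could look like, assuming tabular data in a pandas DataFrame; the `order_id` column and the specific cleaning rules are illustrative, not from the thread:

```python
import pandas as pd

def ingest(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative ingestion-only cleaning: no analytics, just typing and tidying."""
    df = raw.copy()
    # Standardize column names right after reading from raw inputs.
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    # Drop completely empty rows and columns.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # Ensure relevant types (e.g. an order ID should be a str, not an int).
    if "order_id" in df.columns:
        df["order_id"] = df["order_id"].astype(str)
    return df

# Example: "Order ID" is renamed to "order_id" and cast to str;
# the all-empty "Notes" column is dropped.
clean = ingest(pd.DataFrame({"Order ID": [101, 102], "Notes": [None, None]}))
```

Keeping the function free of any analytics keeps the expensive ingestion run reusable by every downstream pipeline.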
Then I run kedro run --pipeline ingestion once, and downstream pipelines, e.g. kedro run --pipeline everything_else, often.
Example from my last project: ingestion takes 3.5 minutes, everything_else takes 7 seconds. So you get a massive speedup by separating all the loading operations into a single pipeline and running it once.

Thomas d'Hooghe
08/28/2024, 5:11 PM
A node cannot have the same inputs and outputs even if they are transcoded: {'raw_data'}
My data catalog looks like:

raw_data@xlsx:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

raw_data@csv:
  type: pandas.CSVDataset
  filepath: xxx

raw_data@parquet:
  type: pandas.ParquetDataset
  filepath: xxx
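This error typically means a node both reads and writes some `raw_data@<format>` entry: Kedro strips the `@format` suffix when building the DAG, so all three catalog entries above resolve to the single dataset `raw_data`, and a node that, say, reads `raw_data@xlsx` and writes `raw_data@parquet` ends up with identical inputs and outputs. One common fix, sketched below with illustrative names, is to give the typed output its own base name:

```yaml
# The raw input keeps its name; the typed Parquet copy gets a new one,
# so the ingestion node's input and output no longer collide.
raw_data:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

typed_data:
  type: pandas.ParquetDataset
  filepath: xxx
```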
Thomas d'Hooghe
08/28/2024, 5:53 PM

raw_data.xlsx:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

raw_data.csv:
  type: pandas.CSVDataset
  filepath: xxx

raw_data.parquet:
  type: pandas.ParquetDataset
  filepath: xxx
Elena Khaustova
08/28/2024, 6:24 PM