# questions
t
Hi all, I was wondering if there is some way to skip a node in a pipeline based on the existence of a dataset in the catalog? Concrete problem: I have an Excel file that is very slow to load, so I want to convert it to a pandas Parquet dataset. This would be a one-off procedure for a given version of the file, and once I have performed this one-off, I would like Kedro to use the Parquet version rather than converting it again. Would love to hear your feedback šŸ™‚ Cheers, Thomas
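(Editor's note: outside of Kedro, the ask above is essentially a one-off cache — do the slow load once, persist a fast copy, and read that copy on later runs. A minimal stdlib sketch of that pattern; the function and parameter names here are hypothetical, and with pandas the callables would typically be `pd.read_excel`, `pd.read_parquet`, and `DataFrame.to_parquet`.)

```python
from pathlib import Path

def load_with_cache(slow_load, read_cache, write_cache, cache_path):
    """One-off conversion pattern: if the fast cache file exists, read it;
    otherwise run the slow load once and persist the cache for next time."""
    path = Path(cache_path)
    if path.exists():
        # Fast path: the one-off conversion already happened.
        return read_cache(path)
    # Slow path: runs only on the very first call.
    data = slow_load()
    write_cache(data, path)
    return data
```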
d
Does it need to be dynamic? What I usually do in this situation is define a separate pipeline for the "load" step. One time, I will do `kedro run --pipeline load`; after that, I will just do `kedro run` (or `kedro run --pipeline whatever_i_want`), which doesn't include the node for loading. Yes, `kedro run` will fail if the `load` pipeline hasn't been run yet, but I think it's fine to be explicit about this expensive operation.
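(Editor's note: in catalog terms, the separate `load` pipeline writes the fast Parquet copy that every later `kedro run` reads. A sketch with hypothetical paths, using the class names from recent `kedro-datasets` releases.)

```yaml
# Read by the one-off `load` pipeline only (slow).
raw_excel:
  type: pandas.ExcelDataset
  filepath: data/01_raw/input.xlsx

# Written once by `load`, read by all downstream pipelines (fast).
raw_parquet:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/input.parquet
```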
šŸ‘ 2
t
hm makes sense! haven't thought about it that way, thank you šŸ™‚
y
Hey @Thomas d'Hooghe I've been approaching it similarly to what @Deepyaman Datta proposed, and let me share a few details. It is very common that your analytics pipeline starts with slowly loading large files, but once that's done, all the rest runs fast. I approach it exactly as @Deepyaman Datta mentions, and create an `ingestion` pipeline whose sole responsibility is to load all raw files into typed formats like Parquet or Pickle. I allow myself to do only these things within this pipeline:
• Combine the same data from different Excel sheets or file systems into a single data object.
• Standardize column names right after reading from raw inputs.
• Ensure relevant types (e.g., an order ID should probably be a `str`, not an `int`).
• Drop completely empty rows and columns from raw data.
• Drop rows which are immediately useless regardless of what analytics will be performed downstream.
• Align common category names.
Then I run `kedro run --pipeline ingestion` once, and downstream pipelines, e.g. `kedro run --pipeline everything_else`, often. Example from my last project: `ingestion` takes 3.5 minutes, `everything_else` takes 7 seconds. So you get a massive speed-up by separating all those loading operations into a single pipeline and running it once.
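(Editor's note: the cleanup steps listed above can be sketched as a single ingestion-node function. A hedged example assuming pandas; the column names are hypothetical.)

```python
import pandas as pd

def clean_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Typical ingestion-pipeline cleanup, following the list above."""
    out = df.copy()
    # Standardize column names right after reading from raw inputs.
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    # Drop completely empty rows and columns from the raw data.
    out = out.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # Ensure relevant types (an order ID should usually be a str, not an int).
    if "order_id" in out.columns:
        out["order_id"] = out["order_id"].astype(str)
    return out
```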
šŸ™Œ 2
thankyou 2
t
getting:
```
A node cannot have the same inputs and outputs even if they are transcoded: {'raw_data'}
```
data catalog looks like:
```yaml
raw_data@xlsx:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

raw_data@csv:
  type: pandas.CSVDataset
  filepath: xxx

raw_data@parquet:
  type: pandas.ParquetDataset
  filepath: xxx
```
Solved by using distinct dataset names instead of transcoding: with `@`, `raw_data@xlsx` and `raw_data@parquet` both refer to the single dataset `raw_data` in the pipeline graph, so a node converting one into the other has identical inputs and outputs. With dots they are simply separate datasets:
```yaml
raw_data.xlsx:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

raw_data.csv:
  type: pandas.CSVDataset
  filepath: xxx

raw_data.parquet:
  type: pandas.ParquetDataset
  filepath: xxx
```
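(Editor's note: a toy illustration of why the error fires. With `@` transcoding, both names resolve to one dataset identity in the pipeline graph, while dotted names stay distinct. This is a simplified sketch of the naming idea, not Kedro's actual implementation.)

```python
def transcoding_base(name: str) -> str:
    """Resolve a transcoded dataset name like 'raw_data@xlsx' to its base
    name, mimicking how transcoded entries share one graph identity."""
    return name.split("@", 1)[0]

# 'raw_data@xlsx' and 'raw_data@parquet' collapse to the same node-graph
# name, so a node mapping one to the other has inputs == outputs.
# 'raw_data.xlsx' and 'raw_data.parquet' contain no '@' and stay distinct.
```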