Thomas d'Hooghe
08/28/2024, 4:47 PM

Deepyaman Datta
08/28/2024, 4:50 PM
kedro run (or kedro run --pipeline whatever_i_want), which doesn't include the node for loading. Yes, kedro run will fail if the load pipeline hasn't been run yet, but I think it's fine to be explicit about this expensive operation.

Thomas d'Hooghe
08/28/2024, 4:51 PM

Yury Fedotov
08/28/2024, 5:11 PM
I have an ingestion pipeline whose sole responsibility is to load all raw files into typed formats like Parquet or Pickle. I allow myself to do only these things within this pipeline:
• Combine the same data from different Excel sheets or file systems into a single data object.
• Standardize column names right after reading from raw inputs.
• Ensure relevant types (e.g., Order ID should probably be a str, not an int).
• Drop completely empty rows and columns from raw data.
• Drop rows that are immediately useless regardless of what analytics will be performed downstream.
• Align names of common categories.
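A minimal sketch of what one such ingestion node could look like, assuming tabular data in a pandas DataFrame; the `order_id` column and the specific cleaning rules are illustrative, not from the thread:

```python
import pandas as pd

def ingest(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative ingestion-only cleaning: no analytics, just typing and tidying."""
    df = raw.copy()
    # Standardize column names right after reading from raw inputs.
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    # Drop completely empty rows and columns.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # Ensure relevant types (e.g. an order ID should be a str, not an int).
    if "order_id" in df.columns:
        df["order_id"] = df["order_id"].astype(str)
    return df

# Example: "Order ID" is renamed to "order_id" and cast to str;
# the all-empty "Notes" column is dropped.
clean = ingest(pd.DataFrame({"Order ID": [101, 102], "Notes": [None, None]}))
```

Keeping the function free of any analytics keeps the expensive ingestion run reusable by every downstream pipeline.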
Then I run kedro run --pipeline ingestion once, and downstream pipelines, e.g. kedro run --pipeline everything_else, often.
Example from my last project: ingestion takes 3.5 minutes, everything_else takes 7 seconds. So you get a massive speedup by separating all the loading operations into a single pipeline and running it once.

Thomas d'Hooghe
08/28/2024, 5:11 PM
A node cannot have the same inputs and outputs even if they are transcoded: {'raw_data'}
My data catalog looks like:

raw_data@xlsx:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

raw_data@csv:
  type: pandas.CSVDataset
  filepath: xxx

raw_data@parquet:
  type: pandas.ParquetDataset
  filepath: xxx
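This error typically means a node both reads and writes some `raw_data@<format>` entry: Kedro strips the `@format` suffix when building the DAG, so all three catalog entries above resolve to the single dataset `raw_data`, and a node that, say, reads `raw_data@xlsx` and writes `raw_data@parquet` ends up with identical inputs and outputs. One common fix, sketched below with illustrative names, is to give the typed output its own base name:

```yaml
# The raw input keeps its name; the typed Parquet copy gets a new one,
# so the ingestion node's input and output no longer collide.
raw_data:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

typed_data:
  type: pandas.ParquetDataset
  filepath: xxx
```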
Thomas d'Hooghe
08/28/2024, 5:53 PM

raw_data.xlsx:
  type: pandas.ExcelDataset
  filepath: xxx
  load_args:
    sheet_name: "Data overview"

raw_data.csv:
  type: pandas.CSVDataset
  filepath: xxx

raw_data.parquet:
  type: pandas.ParquetDataset
  filepath: xxx
Elena Khaustova
08/28/2024, 6:24 PM