# questions
Hi team. Does anyone have experience using Kedro for ETL pipelines (not ML) that incrementally load unstructured documents? E.g. extract, parse, and process PDF, Word, etc. I'm not sure if the whole idea of Kedro datasets is made for this use case, since we'll be working with a bunch of files that need to be loaded one by one, versus relational-like data read from Parquet/CSV etc. Of course, we could do extraction, then parsing, and afterwards combine the data from all files into a single dataframe [document name; document text; ...] to be processed in the usual Kedro-dataset fashion. But I'm not sure if that sounds like trying to fit a tool into a problem it was not designed for. Any experiences or best practices are much appreciated.
@Filipe Oliveira
Hi Vladimir, any reason why you want to do it one by one rather than in parallel?
You can do it with the existing datasets; each file will then be a separate dataset.
But you can also create a custom dataset that wraps a collection of your files and processes them in the way you want.
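For context, Kedro's `PartitionedDataset` already implements the "bunch of files loaded one by one" pattern: it hands a node a dict mapping each partition id to a lazy load callable, and a downstream node can materialise them into rows. Below is a minimal stdlib-only sketch of that contract (plain `.txt` files stand in for parsed PDFs; all names here are illustrative, not Kedro APIs):

```python
from pathlib import Path
from typing import Callable, Dict, List


def load_partitions(folder: str, suffix: str = ".txt") -> Dict[str, Callable[[], str]]:
    """Mimic PartitionedDataset's contract: map each file name to a lazy
    loader, so files are only read one by one when a node asks for them."""
    def make_loader(path: Path) -> Callable[[], str]:
        return lambda: path.read_text(encoding="utf-8")

    return {p.stem: make_loader(p) for p in sorted(Path(folder).glob(f"*{suffix}"))}


def combine(partitions: Dict[str, Callable[[], str]]) -> List[dict]:
    """A downstream node: materialise every document into rows shaped like
    [document name; document text], ready to become a single dataframe."""
    return [{"document_name": name, "document_text": load()}
            for name, load in partitions.items()]
```

This keeps extraction per-file (and parallelisable, since each partition is independent) while still ending in the single combined table you described.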
Also, could you detail a bit more what makes you doubt Kedro's applicability here? Kedro datasets for unstructured data exist: Kedro can load emails, BioPython sequences, images, and videos. (I don't think there's a PDFDataset, but you can create your own.)
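As a sketch of that custom-dataset route: in Kedro you subclass `kedro.io.AbstractDataset` and implement `_load`, `_save`, and `_describe`. The class below follows that three-method contract but skips the Kedro import so it stays self-contained; `parser` is a placeholder hook where a real PDF parser (e.g. pypdf) would plug in, and the default text-file fallback is just for illustration:

```python
from pathlib import Path
from typing import Any, Callable, Dict, Optional


class TextExtractionDataset:
    """Sketch of a custom Kedro dataset for unstructured documents.
    In real code, subclass kedro.io.AbstractDataset and keep these same
    three methods; Kedro then handles catalog wiring for you."""

    def __init__(self, filepath: str,
                 parser: Optional[Callable[[Path], str]] = None):
        self._filepath = Path(filepath)
        # Plug a real parser in here (e.g. pypdf's PdfReader -> extract_text);
        # the default just reads plain text so the sketch stays dependency-free.
        self._parser = parser or (lambda p: p.read_text(encoding="utf-8"))

    def _load(self) -> str:
        # Called by Kedro when a node declares this dataset as an input.
        return self._parser(self._filepath)

    def _save(self, data: str) -> None:
        # Called by Kedro when a node produces this dataset as an output.
        self._filepath.write_text(data, encoding="utf-8")

    def _describe(self) -> Dict[str, Any]:
        # Used by Kedro for logging/repr of the dataset.
        return {"filepath": str(self._filepath)}
```

One such dataset per source file (or one wrapped in a partitioned collection) keeps the per-document extract/parse steps inside the catalog rather than hidden in node code.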