# questions
Hi team. Does anyone have experience using Kedro for ETL pipelines (not ML) that incrementally load unstructured documents? E.g. extract, parse, and process PDF, Word, etc. I'm not sure if the whole idea of Kedro datasets is made for this use case, since we'll be working with a bunch of files that need to be loaded one by one, versus relational-like data read from Parquet/CSV etc. Of course, we could do extraction, then parsing, and afterwards combine the data from all files into a single dataframe [document name; document text; ...] to be processed in the usual Kedro-dataset fashion. But I'm not sure if that sounds like trying to fit a tool into a problem it was not designed for. Any experiences or best practices are much appreciated.
@Filipe Oliveira
Hi Vladimir, any reason why you want to do it one by one rather than in parallel?
You can do it with the existing datasets; each file will then be a separate dataset.
But you can also create a custom dataset that wraps a collection of your files and processes them in the way you want.
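For context, Kedro's `PartitionedDataset` already implements the "bunch of files loaded one by one" pattern: it hands a node a dict mapping each partition id to a lazy load callable, and a downstream node can materialise them into rows. Below is a minimal stdlib-only sketch of that contract (plain `.txt` files stand in for parsed PDFs; all names here are illustrative, not Kedro APIs):

```python
from pathlib import Path
from typing import Callable, Dict, List


def load_partitions(folder: str, suffix: str = ".txt") -> Dict[str, Callable[[], str]]:
    """Mimic PartitionedDataset's contract: map each file name to a lazy
    loader, so files are only read one by one when a node asks for them."""
    def make_loader(path: Path) -> Callable[[], str]:
        return lambda: path.read_text(encoding="utf-8")

    return {p.stem: make_loader(p) for p in sorted(Path(folder).glob(f"*{suffix}"))}


def combine(partitions: Dict[str, Callable[[], str]]) -> List[dict]:
    """A downstream node: materialise every document into rows shaped like
    [document name; document text], ready to become a single dataframe."""
    return [{"document_name": name, "document_text": load()}
            for name, load in partitions.items()]
```

This keeps extraction per-file (and parallelisable, since each partition is independent) while still ending in the single combined table you described.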
Also, could you detail a bit more what makes you doubt Kedro's applicability here? Kedro datasets for unstructured data exist: Kedro can load emails, BioPython sequences, images, and videos. (I don't think there's a PDFDataset, but you can create your own.)
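As a sketch of that custom-dataset route: in Kedro you subclass `kedro.io.AbstractDataset` and implement `_load`, `_save`, and `_describe`. The class below follows that three-method contract but skips the Kedro import so it stays self-contained; `parser` is a placeholder hook where a real PDF parser (e.g. pypdf) would plug in, and the default text-file fallback is just for illustration:

```python
from pathlib import Path
from typing import Any, Callable, Dict, Optional


class TextExtractionDataset:
    """Sketch of a custom Kedro dataset for unstructured documents.
    In real code, subclass kedro.io.AbstractDataset and keep these same
    three methods; Kedro then handles catalog wiring for you."""

    def __init__(self, filepath: str,
                 parser: Optional[Callable[[Path], str]] = None):
        self._filepath = Path(filepath)
        # Plug a real parser in here (e.g. pypdf's PdfReader -> extract_text);
        # the default just reads plain text so the sketch stays dependency-free.
        self._parser = parser or (lambda p: p.read_text(encoding="utf-8"))

    def _load(self) -> str:
        # Called by Kedro when a node declares this dataset as an input.
        return self._parser(self._filepath)

    def _save(self, data: str) -> None:
        # Called by Kedro when a node produces this dataset as an output.
        self._filepath.write_text(data, encoding="utf-8")

    def _describe(self) -> Dict[str, Any]:
        # Used by Kedro for logging/repr of the dataset.
        return {"filepath": str(self._filepath)}
```

One such dataset per source file (or one wrapped in a partitioned collection) keeps the per-document extract/parse steps inside the catalog rather than hidden in node code.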