Hey there Has anyone ever tried training a yolo with Kedro I Kedro #questions

Hey there! Has anyone ever tried training a yolo ...

Alexandre Ouellet

03/28/2023, 2:09 PM

Hey there! Has anyone ever tried training a yolo with Kedro? I struggle a bit with it as yolo requires a path to its dataset folder, as it handles all of the opening of files through a pytorch dataloader. Is there a way in Kedro to handle "folder" as a dataset, and leave it as a folder?

Alexandre Ouellet

03/28/2023, 2:09 PM

I've seen this article : https://getindata.com/blog/deep-learning-with-azure-pytorch-distributed-training-done-right-kedro/

Alexandre Ouellet

03/28/2023, 2:11 PM

but even in this example, the "train_model_pytorch" starts with some dataframes, which I won't have, only a folder (on AzureML's blob storage)

marrrcin

03/28/2023, 4:01 PM

This example uses images https://kedro-org.slack.com/archives/C03QPDQTLBG/p1678891373264199

Alexandre Ouellet

03/28/2023, 4:12 PM

I think it gets close to what I'm trying to accomplish (using kedro with yolov8), but not quite there, or I'm missing something. Is the code for this available somewhere? I can't seem to find a link for it in the video or a repo

Alexandre Ouellet

03/28/2023, 4:38 PM

Maybe with a bit of context, my question will make more sense : I'm trying to use kedro for my ML pipeline to train a yolo model. I have the following steps : 1. I preprocess my images to generate the bounding boxes used by yolo. The input is a PartitionedDataset of RawData (I have a custom dataset for reading the raw data), and a metadata.csv files (for bouding boxes). The output should be a folder with Images, a folder with Bouding boxes AND the classes. 2. I want to train yolo. The input should be the previously generated folders path AND the classes. I will then generate the YAML that yolo expects (passing it the classes AND the path of where the images are supposed to be), write it on disk somewhere, and pass it to yolo's "train" function

Alexandre Ouellet

03/28/2023, 4:42 PM

Everything is already written and working outside of kedro, but if I want to use kedro, I'm not sure how to write the custom dataset that will transfer the files/folder in-between nodes

Alexandre Ouellet

03/28/2023, 4:44 PM

For instance, where should I write the images/labels? Should I write to a temporary directory, output that directory's name, and have a custom dataset whose sole purpose is to keep that path in memory to transfer to the other node?

Alexandre Ouellet

03/28/2023, 4:50 PM

Right now, my best guess as to how to do it with kedro is to have some kind of DirectoryDataset. When used as an output of a node (the _save() function), it will look into that directory, and transfer its content (using fsspec) to whatever is configured in the catalog. When used as an input, it will take the content of whatever is configured in the catalog, copy it into a temporary directory (using the _load() function), and return a path to that directory

marrrcin

03/30/2023, 7:30 AM

I recommend you to first understand how Kedro data catalog works. It’s perfectly suited for your use case. The YT video I’ve linked actually shows similar case. From your first node (the one that prepares the data), you will output your images and bounding boxes - preferably into two outputs = two data catalog entries, which can again be `PartitionedDataset`s. This https://stackoverflow.com/a/74066977/1955346 can also come in handy. The training node will take outputs from previous node as inputs and will be able to read them.

Alexandre Ouellet

03/30/2023, 3:45 PM

Hmm..The first node is not the issue, it's the training of yolo that I seem to be having difficulties.

Alexandre Ouellet

03/30/2023, 3:49 PM

yolov8 has a way of being customized (https://docs.ultralytics.com/usage/engine/). It seems that the way they expect us to customize it is through a custom Trainer. From that CustomTrainer, we can then override a fuction called get_dataloader() (see https://docs.ultralytics.com/reference/base_trainer/#ultralytics.yolo.engine.trainer.BaseTrainer.get_dataloader)

Alexandre Ouellet

03/30/2023, 3:53 PM

The dataloader is where we would pass the custom Dataset/AbstractDataset shown in the video

Alexandre Ouellet

03/30/2023, 3:54 PM

This custom dataloader still requires a path to existing files on disks (well, only the Dataset part of it).

Alexandre Ouellet

03/30/2023, 3:56 PM

and as far as I know, there is no way to pass custom Dataset to that part.

Alexandre Ouellet

03/30/2023, 3:59 PM

So, if I were to use a custom Dataset (as suggested in the video), I would need to create a new Trainer, a new Dataloader, AND a new Dataset (that part is expected)

Alexandre Ouellet

03/30/2023, 4:03 PM

the new Trainer and dataloader is not that bad in itself, but the new Dataset is my main issue. yolo uses its own dataset for a reason : it has tons of code already there (transformations, optimizations, caching, resizing, etc...) that I wouldn't want to lose

Alexandre Ouellet

03/30/2023, 4:05 PM

I could copy some of its optimization into the custom kedro Dataset, but since the code is GPL, that's a no go my current project

Alexandre Ouellet

03/30/2023, 4:05 PM

Am I missing something?

238 Views

Open in Slack

Previous Next