# questions
Hey there! Has anyone ever tried training a YOLO model with Kedro? I'm struggling a bit with it, because YOLO requires a path to its dataset folder and handles all of the file opening itself through a PyTorch dataloader. Is there a way in Kedro to treat a "folder" as a dataset, and leave it as a folder?
but even in this example, the "train_model_pytorch" starts with some dataframes, which I won't have, only a folder (on AzureML's blob storage)
I think it gets close to what I'm trying to accomplish (using kedro with yolov8), but not quite there, or I'm missing something. Is the code for this available somewhere? I can't seem to find a link for it in the video or a repo
Maybe with a bit of context, my question will make more sense: I'm trying to use kedro for my ML pipeline to train a yolo model. I have the following steps:
1. I preprocess my images to generate the bounding boxes used by yolo. The input is a PartitionedDataset of RawData (I have a custom dataset for reading the raw data) and a metadata.csv file (for bounding boxes). The output should be a folder with images, a folder with bounding boxes, AND the classes.
2. I want to train yolo. The input should be the paths of the previously generated folders AND the classes. I will then generate the YAML that yolo expects (passing it the classes AND the path where the images are supposed to be), write it to disk somewhere, and pass it to yolo's "train" function.
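To make step 2 concrete, here is a minimal sketch of the YAML-generation part. It assumes the Ultralytics dataset YAML layout (`path`, `train`, `val`, and a `names` mapping); the function name `write_yolo_data_yaml` and the `images/train` / `images/val` subfolder defaults are made up for illustration and would need to match whatever the preprocessing node actually writes.

```python
import json
from pathlib import Path


def write_yolo_data_yaml(dataset_root, class_names, yaml_path,
                         train_dir="images/train", val_dir="images/val"):
    """Write a minimal Ultralytics-style dataset YAML and return its path.

    Hypothetical helper: the subfolder defaults are assumptions.
    """
    # JSON strings are valid YAML double-quoted scalars, so json.dumps
    # gives safe quoting without needing a YAML library.
    lines = [
        f"path: {json.dumps(str(dataset_root))}",
        f"train: {json.dumps(train_dir)}",
        f"val: {json.dumps(val_dir)}",
        "names:",
    ]
    # One "index: name" entry per class, in the order given.
    lines += [f"  {i}: {json.dumps(name)}" for i, name in enumerate(class_names)]

    yaml_path = Path(yaml_path)
    yaml_path.parent.mkdir(parents=True, exist_ok=True)
    yaml_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return yaml_path


# In the training node, the returned path would then be handed to yolo,
# along the lines of (not executed here, requires ultralytics):
#   from ultralytics import YOLO
#   YOLO("yolov8n.pt").train(data=str(yaml_path))
```

The function only needs the folder path and the class list, which are exactly the two things the training node is supposed to receive from the previous step.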
Everything is already written and working outside of kedro, but if I want to use kedro, I'm not sure how to write the custom dataset that will transfer the files/folder in-between nodes
For instance, where should I write the images/labels? Should I write to a temporary directory, output that directory's name, and have a custom dataset whose sole purpose is to keep that path in memory to transfer to the other node?
Right now, my best guess as to how to do it with kedro is to have some kind of DirectoryDataset. When used as an output of a node (the _save() function), it will look into that directory, and transfer its content (using fsspec) to whatever is configured in the catalog. When used as an input, it will take the content of whatever is configured in the catalog, copy it into a temporary directory (using the _load() function), and return a path to that directory
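Roughly, this is what I have in mind. It's a stdlib-only sketch, not a working Kedro dataset: `DirectoryDataset` is a hypothetical name, the real thing would subclass Kedro's `AbstractDataset` (with these methods as `_save()` / `_load()`, plus a `_describe()`), and `shutil` stands in for fsspec, so it only handles local paths.

```python
import shutil
import tempfile
from pathlib import Path


class DirectoryDataset:
    """Hypothetical dataset that moves a whole folder in and out of storage.

    Assumption: local paths only. A remote target such as blob storage
    would need fsspec in place of the shutil calls.
    """

    def __init__(self, path):
        self._path = Path(path)  # location configured in the catalog

    def save(self, local_dir):
        """Node output: copy the folder the node wrote into the catalog location."""
        local_dir = Path(local_dir)
        if local_dir.resolve() == self._path.resolve():
            return  # already in place, nothing to transfer
        if self._path.exists():
            shutil.rmtree(self._path)  # overwrite any previous run
        shutil.copytree(local_dir, self._path)

    def load(self):
        """Node input: copy the stored folder to a temp dir and return that path."""
        tmp_root = Path(tempfile.mkdtemp(prefix="directory_dataset_"))
        target = tmp_root / self._path.name
        shutil.copytree(self._path, target)
        return target
```

The preprocessing node would return the directory it wrote to, and the training node would receive a plain `Path` it can hand to yolo untouched.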
I recommend first understanding how the Kedro data catalog works. It's perfectly suited to your use case. The YT video I've linked actually shows a similar case. From your first node (the one that prepares the data), you will output your images and bounding boxes, preferably as two outputs = two data catalog entries, which can again be `PartitionedDataset`s. This https://stackoverflow.com/a/74066977/1955346 can also come in handy. The training node will take the outputs of the previous node as inputs and will be able to read them.
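As a rough illustration of the two-output idea, the nodes could look like the sketch below. It relies on how `PartitionedDataset` behaves: on load, a node gets a dict mapping partition id to a load function, and on save, it returns a dict mapping partition id to data. The function names and the `"image"` / `"boxes"` keys of the raw records are hypothetical.

```python
def prepare_data(raw_partitions):
    """First node: split raw records into two partitioned outputs.

    raw_partitions maps partition id -> zero-argument load function.
    The record layout ("image", "boxes") is an assumption.
    """
    images, labels = {}, {}
    for partition_id, load in raw_partitions.items():
        record = load()  # partitions are loaded lazily, one at a time
        images[partition_id] = record["image"]
        labels[partition_id] = record["boxes"]
    # Two return values = two catalog entries, each saved partition by partition.
    return images, labels


def matched_partitions(image_partitions, label_partitions):
    """Training-side helper: ids that have both an image and a label."""
    return sorted(set(image_partitions) & set(label_partitions))
```

A training node receiving both partitioned inputs could start by calling `matched_partitions` to check that every image has its bounding boxes before doing anything else.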
Hmm... The first node is not the issue; it's the yolo training that I'm having difficulties with.
yolov8 has a way of being customized (https://docs.ultralytics.com/usage/engine/). It seems that the way they expect us to customize it is through a custom Trainer. In that CustomTrainer, we can then override a function called get_dataloader() (see https://docs.ultralytics.com/reference/base_trainer/#ultralytics.yolo.engine.trainer.BaseTrainer.get_dataloader)
The dataloader is where we would pass the custom Dataset/AbstractDataset shown in the video
This custom dataloader still requires a path to existing files on disk (well, only the Dataset part of it does),
and as far as I know, there is no way to pass a custom Dataset to that part.
So, if I were to use a custom Dataset (as suggested in the video), I would need to create a new Trainer, a new Dataloader, AND a new Dataset (that part is expected)
The new Trainer and dataloader are not that bad in themselves, but the new Dataset is my main issue. yolo uses its own dataset for a reason: it has tons of code already there (transformations, optimizations, caching, resizing, etc.) that I wouldn't want to lose.
I could copy some of its optimizations into the custom kedro Dataset, but since the code is GPL, that's a no-go for my current project.
Am I missing something?