# questions
s
Hi all, I’m developing a generative AI pipeline focused on poetry using Kedro. I’m encountering an issue where the output from one node (`generate_poem_image_node`) isn’t being passed correctly to another (`generate_poem_video_node`) despite the configuration seeming correct. Specifically, the `image_path` output isn’t accessible from my session object, although other outputs are fine. When attempting to access `image_path` in my session with:

```python
result = session.run("run_generation_pipeline")
...
"image_path": Path(result.get("image_path", ""))
```
I notice `image_path` is missing from the `result` dictionary, unlike the other paths:

```
{'image_prompt_path': 'generated_image_prompts/user_/2024-04-10/prompt_image_14-49-49-442418.txt', 'poem_path': 'generated_poetry/user_/2024-04-10/poetry_14-49-47-295838.txt', 'audio_path': 'generated_audio/user_/2024-04-10/audio_14-50-02-478601.mp3', 'video_path': 'generated_videos/user_/2024-04-10/video_14-51-25-185057.mp4'}
```
Here’s the relevant snippet from `pipeline.py` showing the node configurations:

```python
from kedro.pipeline import Pipeline, node, pipeline
...

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        ...
        node(func=generate_poem_image, outputs="image_path", ...),
        node(func=generate_poem_video, inputs=["image_path", ...], ...),
    )
```
n
Is it possible to trim down this example? Also, can you provide the relevant datasets in your `catalog.yml`?
s
By trim down, do you mean take out some code from the `pipeline.py` above? Also, there are no datasets in my `catalog.yml`; I’m making API calls to get each generated file. My approach is inspired by a simpler example where they only call OpenAI models.
n
By trim down I mean https://stackoverflow.com/help/minimal-reproducible-example
> also there are no datasets in my catalog.yml, I’m making API calls to get each generated file
Does that mean your nodes are taking a `path` directly as input, and the node outputs are just a dictionary of paths?
Alright, I think I know what the question is. So the pipeline is fine; it's just that the `result` object didn't have what you expected?
s
Yes, that’s correct, I’m passing paths to the generated files between the nodes. And yup, one path seems to be missing from the result object even though I can see it’s getting saved as a MemoryDataset just like the rest of the paths.
n
Can you try `kedro catalog create <pipeline>`? This will create entries in your catalog.yml as `MemoryDataset`.
The reason this happens is that your setup deviates slightly from what Kedro expects: there is a concept of a "free dataset", which means we throw away datasets that are no longer needed during a pipeline run. We only return what is still in memory at the end. https://github.com/kedro-org/kedro/pull/3475
So in your case, probably only the last output is in the `result` object.
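To illustrate the idea, here is a simplified, hypothetical sketch (pure Python, not Kedro's actual runner code): outputs that some downstream node consumes are released during the run, so only "free" outputs survive into the returned dict. Node and function names mirror the pipeline above but are stand-ins.

```python
# Simplified sketch of "free outputs" behaviour (NOT Kedro's real runner).
# Each node is (func, input_names, output_names); outputs consumed by a
# downstream node are released and never make it into the returned dict.

def run_pipeline(nodes, initial=None):
    """Run nodes in order; return only outputs that no other node consumes."""
    consumed = {name for _, inputs, _ in nodes for name in inputs}
    memory = dict(initial or {})
    for func, inputs, outputs in nodes:
        results = func(*(memory[name] for name in inputs))
        if len(outputs) == 1:
            results = (results,)
        memory.update(zip(outputs, results))
    # Only "free" outputs (not used as any node's input) are returned.
    return {name: value for name, value in memory.items()
            if name not in consumed}

nodes = [
    # stand-in for generate_poem_image -> "image_path"
    (lambda: "img.png", [], ["image_path"]),
    # stand-in for generate_poem_video, which consumes "image_path"
    (lambda p: f"video_from_{p}", ["image_path"], ["video_path"]),
]
result = run_pipeline(nodes)
# "image_path" is consumed by the video node, so it is absent from result;
# only "video_path" (a free output) remains.
```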
s
I see, in the result object I see all the other paths, some of which occur before image_path is created, e.g. poem_path:

```
{'image_prompt_path': 'generated_image_prompts/user_/2024-04-10/prompt_image_14-49-49-442418.txt', 'poem_path': 'generated_poetry/user_/2024-04-10/poetry_14-49-47-295838.txt', 'audio_path': 'generated_audio/user_/2024-04-10/audio_14-50-02-478601.mp3', 'video_path': 'generated_videos/user_/2024-04-10/video_14-51-25-185057.mp4'}
```

So is it that there’s a limited number of datasets that result can store at any one time, and having hit the limit, it’s dropping image_path?
Also, thanks for the trim reference. Will do so and see if I can isolate the issue. Given that I’m passing paths in MemoryDatasets, is there a better way you would recommend I approach this?
n
Can you try `kedro catalog create <pipeline>`? Have you tried this already?
> so is it that there’s a limited number of datasets that result can store at any one time, and having hit the limit, it’s dropping image_path?
We don't do anything special to limit the number of datasets.
s
Just ran `kedro catalog create <pipeline>` and got this in the yaml file:

```yaml
audio_path:
  type: MemoryDataset
image_path:
  type: MemoryDataset
image_prompt:
  type: MemoryDataset
image_prompt_path:
  type: MemoryDataset
poem:
  type: MemoryDataset
poem_path:
  type: MemoryDataset
video_path:
  type: MemoryDataset
```
n
This looks right. Do you still get the missing entry in the `result`?
s
Yes, it’s still missing from result; I’ve run the kedro pipeline to check.

```python
result = {
    'image_prompt_path': 'xyz.txt',
    'audio_path': 'abc.mp3',
    'video_path': 'abc.mp4',
    'poem_path': 'def.txt',
}
```

This is the kedro viz of the pipeline. Looking at it, I think the issue is that only node outputs which are not passed along to other nodes are kept.
n
Thanks for confirming.
I don't have an optimal solution. What will work for sure is changing all these memory datasets to `pickle.PickleDataset` and, instead of getting them from `result`, using `catalog.load("dataset_name")`. I have a custom implementation of Runner which was designed for debugging purposes; that would work too, but you need to use a custom runner. https://github.com/kedro-org/kedro/issues/1802#issuecomment-1270096651
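For example, the generated catalog entries could be switched over like this (a sketch only: the dataset names come from the generated catalog above, but the filepaths are illustrative placeholders):

```yaml
# Sketch: persist intermediate outputs instead of keeping them in memory.
# Filepaths below are illustrative placeholders.
image_path:
  type: pickle.PickleDataset
  filepath: data/08_reporting/image_path.pkl
video_path:
  type: pickle.PickleDataset
  filepath: data/08_reporting/video_path.pkl
```

After the run, each persisted value can then be read back with `catalog.load("image_path")` rather than looked up in `result`.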
s
Thanks for helping me get to the root cause. Your recommendation sounds workable; I’ll look into it. I’m fairly new to Kedro, so I just want to check whether the catalog will work for files generated on the fly. The files I’m generating via API each have a custom name based on the time of generation, hence why I was passing the paths around. Will using a custom runner help with this?
n
This is fine; it's just not the "Kedro" way of doing things, because you'd normally find these `path` attributes as a dataset attribute instead of a node input. For reference: `APIDataset`. The more Kedro way of doing things would be creating an OpenAPIDataset, but I don't have time to look into the details of the pipeline to see if it's possible.
You may also want to look at dataset factories, `PartitionedDataset`, etc.
s
Amazing, thank you so much Nok. You’ve been extremely helpful. Now that I’ve seen that my proof of concept can work in Kedro, I’ll check out APIDataset to see if I can rewrite my API classes to allow for a more Kedro-like flow.
n
Amazing, keep us posted! And if you end up working on an OpenAPI dataset, maybe even try to submit a PR to kedro-plugins.
s
Will keep you posted! I’m cooking up a storm 🙂 Okay, will let you know if so and submit a PR.