# questions
s
Hi all, I’m developing a generative AI pipeline focused on poetry using Kedro. I’m encountering an issue where the output from one node (`generate_poem_image_node`) isn’t being passed correctly to another (`generate_poem_video_node`) despite the configuration seeming correct. Specifically, the `image_path` output isn’t accessible from my session object, although other outputs are fine. When attempting to access `image_path` in my session with:

```python
result = session.run("run_generation_pipeline")
...
"image_path": Path(result.get("image_path", ""))
```
I notice `image_path` is missing from the `result` dictionary, unlike the other paths:

```
{'image_prompt_path': 'generated_image_prompts/user_/2024-04-10/prompt_image_14-49-49-442418.txt', 'poem_path': 'generated_poetry/user_/2024-04-10/poetry_14-49-47-295838.txt', 'audio_path': 'generated_audio/user_/2024-04-10/audio_14-50-02-478601.mp3', 'video_path': 'generated_videos/user_/2024-04-10/video_14-51-25-185057.mp4'}
```
Here’s the relevant snippet from `pipeline.py` showing the node configurations:

```python
from kedro.pipeline import Pipeline, node, pipeline
...

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        ...
        node(func=generate_poem_image, outputs="image_path", ...),
        node(func=generate_poem_video, inputs=["image_path", ...], ...),
    )
```
n
Is it possible to trim down this example? Also, can you provide the relevant datasets in your `catalog.yml`?
s
By trim down, do you mean take out some code from the `pipeline.py` above? Also, there are no datasets in my `catalog.yml`; I’m making API calls to get each generated file. My approach is inspired by a simpler example where they only call OpenAI models.
n
By trim down I mean https://stackoverflow.com/help/minimal-reproducible-example
> also there are no datasets in my catalog.yml, I’m making API calls to get each generated file
Does that mean your nodes are taking a `path` directly as input, and the node outputs are just a dictionary of paths?
Alright, I think I know what the question is. So the pipeline is fine; it's just that the `result` object didn't have what you expected?
s
Yes, that’s correct, I’m passing paths to the generated files between the nodes. And yup, one path seems to be missing from the result object even though I can see it’s getting saved as a MemoryDataset just like the rest of the paths.
n
Can you try `kedro catalog create <pipeline>`? This will create entries in your catalog.yml as `MemoryDataset`.
The reason this happens is that your setup deviates slightly from what Kedro expects: there is a concept of a "free dataset", which means we throw away datasets that are no longer needed during a pipeline run. We only return what is still in memory at the end. https://github.com/kedro-org/kedro/pull/3475
So in your case, probably only the last output is in the `result` object.
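To illustrate the idea, here is a simplified, hypothetical sketch (pure Python, not Kedro's actual runner code): outputs that some downstream node consumes are released during the run, so only "free" outputs survive into the returned dict. Node and function names mirror the pipeline above but are stand-ins.

```python
# Simplified sketch of "free outputs" behaviour (NOT Kedro's real runner).
# Each node is (func, input_names, output_names); outputs consumed by a
# downstream node are released and never make it into the returned dict.

def run_pipeline(nodes, initial=None):
    """Run nodes in order; return only outputs that no other node consumes."""
    consumed = {name for _, inputs, _ in nodes for name in inputs}
    memory = dict(initial or {})
    for func, inputs, outputs in nodes:
        results = func(*(memory[name] for name in inputs))
        if len(outputs) == 1:
            results = (results,)
        memory.update(zip(outputs, results))
    # Only "free" outputs (not used as any node's input) are returned.
    return {name: value for name, value in memory.items()
            if name not in consumed}

nodes = [
    # stand-in for generate_poem_image -> "image_path"
    (lambda: "img.png", [], ["image_path"]),
    # stand-in for generate_poem_video, which consumes "image_path"
    (lambda p: f"video_from_{p}", ["image_path"], ["video_path"]),
]
result = run_pipeline(nodes)
# "image_path" is consumed by the video node, so it is absent from result;
# only "video_path" (a free output) remains.
```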
s
I see, in the result object I see all the other paths, some of which occur before image_path is created, e.g. poem_path:

```
{'image_prompt_path': 'generated_image_prompts/user_/2024-04-10/prompt_image_14-49-49-442418.txt', 'poem_path': 'generated_poetry/user_/2024-04-10/poetry_14-49-47-295838.txt', 'audio_path': 'generated_audio/user_/2024-04-10/audio_14-50-02-478601.mp3', 'video_path': 'generated_videos/user_/2024-04-10/video_14-51-25-185057.mp4'}
```

So is it that there’s a limited number of datasets that result can store at any one time, and having hit the limit, it’s dropping image_path?
Also, thanks for the trim reference. Will do so and see if I can isolate the issue. Given that I’m passing paths in MemoryDatasets, is there a better way you would recommend I approach this?
n
Can you try `kedro catalog create <pipeline>`? Have you tried this already?
> so is it that there’s a limited number of datasets that result can store at any one time, and having hit the limit, it’s dropping image_path?
We don't do anything special to limit the number of datasets.
s
Just ran `kedro catalog create <pipeline>` and got this in the yaml file:

```yaml
audio_path:
  type: MemoryDataset
image_path:
  type: MemoryDataset
image_prompt:
  type: MemoryDataset
image_prompt_path:
  type: MemoryDataset
poem:
  type: MemoryDataset
poem_path:
  type: MemoryDataset
video_path:
  type: MemoryDataset
```
n
This looks right. Do you still get the missing entry in the `result`?
s
Yes, it’s still missing from result; I’ve run the kedro pipeline to check.

```python
result = {
    'image_prompt_path': 'xyz.txt',
    'audio_path': 'abc.mp3',
    'video_path': 'abc.mp4',
    'poem_path': 'def.txt',
}
```

This is the kedro viz of the pipeline. Looking at it, I think the issue is that only node outputs which are not passed along to other nodes are kept.
n
Thanks for confirming.
I don't have an optimal solution. What will work for sure is changing all these memory datasets to `pickle.PickleDataset` and, instead of getting them from `result`, using `catalog.load("dataset_name")`. I have a custom implementation of Runner which was designed for debugging purposes; that would work too, but you need to use a custom runner. https://github.com/kedro-org/kedro/issues/1802#issuecomment-1270096651
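For example, the generated catalog entries could be switched over like this (a sketch only: the dataset names come from the generated catalog above, but the filepaths are illustrative placeholders):

```yaml
# Sketch: persist intermediate outputs instead of keeping them in memory.
# Filepaths below are illustrative placeholders.
image_path:
  type: pickle.PickleDataset
  filepath: data/08_reporting/image_path.pkl
video_path:
  type: pickle.PickleDataset
  filepath: data/08_reporting/video_path.pkl
```

After the run, each persisted value can then be read back with `catalog.load("image_path")` rather than looked up in `result`.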
s
Thanks for helping me get to the root cause. Your recommendation sounds workable; I’ll look into it. I’m fairly new to Kedro, so I just want to check whether the catalog will work for files generated on the fly. The files I’m generating via API each have a custom name based on the time of generation, hence why I was passing the paths around. Will using a custom runner help with this?
n
This is fine; it's just not the "Kedro" way of doing things, because you'd normally find these `path` attributes as a dataset attribute instead of a node input. For reference: `APIDataset`. The more Kedro way of doing things would be creating an OpenAPIDataset, but I don't have time to look into the details of the pipeline to see if it's possible.
You may also want to look at dataset factories, `PartitionedDataset`, etc.
s
Amazing, thank you so much Nok. You’ve been extremely helpful. Now that I’ve seen that my proof of concept can work in Kedro, I’ll check out APIDataset to see if I can rewrite my API classes to allow for a more Kedro-like flow.
n
Amazing, keep us posted! And if you end up working on an OpenAPI dataset, maybe even try to submit a PR to kedro-plugins.
s
Will keep you posted! I’m cooking up a storm 🙂 Okay, will let you know if so and submit a PR.