# questions
l
Hi everyone! I am new here and trying to get myself onboard with Kedro. I am following the Spaceflights tutorial and, in particular, I am at the part about adding plots to Kedro viz. Unfortunately, I am failing miserably 😅. When I add the node and pipeline code to the data_science code, I get an error message saying that the data_science pipeline does not exist (??), and when I create a separate pipeline just for the viz, there is no error but no graph is shown in kedro viz either. The tutorial says the code should be pasted into nodes.py and pipeline.py respectively, but not which pipeline's files. Should it be added to the data_science files, or should we make a new pipeline? The pipeline code notably also defines a create_pipeline() function, so it seemed wrong to just paste it into the data_science/pipeline.py file. I tried to add it as a new pipeline inside create_pipeline, and return something like pipe=ds_pipeline_1 + ds_pipeline_2 + plotly_pipeline at the end. No luck. Does anybody have experience adding visual outputs such as a Plotly graph?
y
Thank you so much for flagging this @Lucie Gattepaille! We're aware that our tutorial was incomplete and @Jo Stichbury is actually working on the fix for this in an open PR. I suggest actually checking the documentation that is in progress, you can see a build of it here which has much clearer instructions: https://stichbury.github.io/visualisation/visualise_charts_with_plotly.html
j
I had problems with this section too. I've not revised that one yet @Yetunde as I've removed it from the main tutorial, and the visualisation text updates will be in a later PR. Let me just take another look at my notes though and help you out @Lucie Gattepaille
l
OK, I have managed to create a plot, although I still have something strange going on, probably owing to the fact that I don't fully understand how namespaces should be used. Because the plot of average shuttle capacity relates to the preprocessed_shuttle intermediate table, it looked somewhat logical to me to put the plot inside the data_processing pipeline. Here is what I did to the code: In data_processing/nodes.py, I pasted the additional function
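def compare_passenger_capacity(preprocessed_shuttles: pd.DataFrame) -> pd.DataFrame:
    # Average the numeric columns per shuttle type; numeric_only=True skips
    # text columns, which recent pandas versions refuse to average.
    return preprocessed_shuttles.groupby(["shuttle_type"]).mean(numeric_only=True).reset_index()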
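To make it clearer what this node hands over to the plot, here is a minimal sketch that runs the same aggregation on a tiny, made-up table. The toy rows and the `engine_type` column values are assumptions for illustration only, not the real Spaceflights data, and `numeric_only=True` is included so that text columns do not break the mean on recent pandas versions.

```python
import pandas as pd


def compare_passenger_capacity(preprocessed_shuttles: pd.DataFrame) -> pd.DataFrame:
    # Average every numeric column per shuttle type and turn the group key
    # back into a regular column so it can be used as the x axis of a plot.
    return (
        preprocessed_shuttles.groupby(["shuttle_type"])
        .mean(numeric_only=True)
        .reset_index()
    )


# Hypothetical toy data standing in for the preprocessed_shuttles table.
toy_shuttles = pd.DataFrame(
    {
        "shuttle_type": ["Type V5", "Type V5", "Type F5"],
        "passenger_capacity": [2, 4, 5],
        "engine_type": ["Quantum", "Quantum", "Plasma"],
    }
)

averages = compare_passenger_capacity(toy_shuttles)
print(averages)
```

The result has one row per shuttle type, with `shuttle_type` and `passenger_capacity` as columns, which is the shape the `x` and `y` settings in the catalog entry rely on.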
In data_processing/pipeline.py, I replaced the contents with the following code:
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

from .nodes import (
    compare_passenger_capacity,
    create_model_input_table,
    preprocess_companies,
    preprocess_shuttles,
)

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs=["preprocessed_companies", "companies_columns"],
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
            node(
                func=compare_passenger_capacity,
                inputs="preprocessed_shuttles",
                outputs="shuttle_passenger_capacity_plot",
                #name="shuttle_passenger_capacity_plot_node"
            ),
        ],
        namespace="data_processing",
        inputs=["companies", "shuttles", "reviews"],
        outputs=["model_input_table", "shuttle_passenger_capacity_plot"],
    )
Finally, in catalog.yml, I added:
shuttle_passenger_capacity_plot:
  type: plotly.PlotlyDataSet
  filepath: data/08_reporting/shuttle_passenger_capacity_plot.json
  plotly_args:
    type: bar
    fig:
      x: shuttle_type
      y: passenger_capacity
      orientation: v
    layout:
      xaxis_title: Shuttles
      yaxis_title: Average passenger capacity
      title: Shuttle Passenger capacity
This created the plot in kedro viz (note that the orientation was wrong in the tutorial; it needs to be v to make sense). BUT, as you will see if I manage to share the picture, I also end up with a dataset called data_processing.shuttle_passenger_capacity_plot (probably some namespace misunderstanding on my part).
In any case, I am wondering whether this is the appropriate place to display visualisations. I find notebooks easier for these kinds of exploratory data analyses. What does everyone else think?
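On the orientation point: a rough sketch of the kind of Plotly figure JSON that the catalog entry above corresponds to may help. This is hand-written for illustration; the values are made up, and the file Kedro actually writes to data/08_reporting will likely contain more fields than shown here. With the categories on `x` and the averages on `y`, a vertical orientation (`v`) is the consistent choice, since a horizontal bar trace (`h`) expects the numeric values on `x` instead.

```python
import json

# Hypothetical averaged values standing in for the node's output.
shuttle_types = ["Type F5", "Type V5"]
average_capacity = [5.0, 3.0]

figure = {
    "data": [
        {
            "type": "bar",
            "x": shuttle_types,      # categories along the horizontal axis
            "y": average_capacity,   # bar heights
            "orientation": "v",      # vertical bars: values are read from y
        }
    ],
    "layout": {
        # Shorthand keys such as xaxis_title expand to nested layout entries.
        "title": {"text": "Shuttle Passenger capacity"},
        "xaxis": {"title": {"text": "Shuttles"}},
        "yaxis": {"title": {"text": "Average passenger capacity"}},
    },
}

figure_json = json.dumps(figure, indent=2)
print(figure_json)
```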
j
👀 Taking a look now. I put my code in data_processing too, although you could equally well put it in the data_science pipeline. When I update the text, I'll add a third pipeline for reporting.
l
Turns out it was some other piece of code that was creating that double "data" (I had created a separate pipeline in an attempt to make the plots, and had forgotten to remove it). Now, with the code above, I have a working pipeline with a plot node that displays what I believe is the right viz. Hopefully this can help in updating the documentation for the tutorial.
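For anyone else puzzled by prefixed names such as data_processing.shuttle_passenger_capacity_plot, here is a simplified model of how a namespace renames datasets. It is only a sketch and not Kedro's actual implementation: the helper `namespaced_name` is hypothetical, and parameters are ignored. The idea is that datasets listed in the `inputs` and `outputs` of the `pipeline(...)` call keep their plain names, while everything else gets the namespace as a prefix.

```python
from typing import Iterable


def namespaced_name(dataset: str, namespace: str, exposed: Iterable[str]) -> str:
    """Hypothetical helper: exposed datasets keep their name, others are prefixed."""
    if dataset in set(exposed):
        return dataset
    return f"{namespace}.{dataset}"


# Names passed as inputs=... and outputs=... in the pipeline code above.
exposed = [
    "companies",
    "shuttles",
    "reviews",
    "model_input_table",
    "shuttle_passenger_capacity_plot",
]

# An intermediate dataset that is not exposed gets the namespace prefix.
print(namespaced_name("preprocessed_shuttles", "data_processing", exposed))

# The plot output is exposed, so it matches the plain catalog entry.
print(namespaced_name("shuttle_passenger_capacity_plot", "data_processing", exposed))

# If the plot were left out of outputs=..., it would be prefixed instead.
without_plot = [name for name in exposed if name != "shuttle_passenger_capacity_plot"]
print(namespaced_name("shuttle_passenger_capacity_plot", "data_processing", without_plot))
```

Under this model, a prefixed plot dataset would not match the unprefixed entry in catalog.yml, which is one way a second, differently named dataset can show up in kedro viz.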
j
That's awesome, thanks @Lucie Gattepaille for coming back with that information. I'm just about to start writing up a ticket to improve the visualisation sections of the documentation (you'll see from https://stichbury.github.io/ that I've already moved them out of the tutorial to avoid complicating it quite so much)