Afaque Ahmad
06/12/2023, 8:14 AM
Regarding kedro-plugins: how should I set up my local development environment? I cannot find a requirements.txt file.
Abhishek Bhatia
06/12/2023, 1:06 PM
I am returning MemoryDataSet outputs from nodes. By default, kedro deep-copies the memory dataset, which leads to loss of information, so I created a catalog entry with copy_mode set to assign. This solves our basic problem of objects being retained as-is, but it messes up the DAG order displayed in kedro viz. Any solutions?
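For reference, the catalog entry I mean looks roughly like this (dataset name is made up; copy_mode is an option of MemoryDataSet):

my_in_memory_output:
  type: MemoryDataSet
  copy_mode: assign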
Jose Nuñez
06/12/2023, 3:32 PM
DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet, load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}). Duplicate column names found: ['timestamp', 'lims_BFIL CO3', 'lims_BFIL Ca %', ...]
# It's basically showing all the columns inside the dataframe (here I'm showing only 3 of them)
.
My catalog entry looks like this:
data_sql:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/data_sql.parquet
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
  layer: III
.
I'm using:
kedro==0.18.8
pandas==2.0.1
pyarrow==12.0.0
.
The problem is quite similar to this issue from 2022: https://github.com/kedro-org/kedro/discussions/1286 but in my case removing the load and save args, as the OP mentions, won't solve my problem.
.
This is quite puzzling, since I just did a df.to_clipboard() inside the node before returning my output, opened it in a jupyter notebook, and I see no problems with the dataframe; I can even save it to parquet without any issues. So that makes me think the problem comes from kedro (?)
.
Anyway, as a workaround I'm saving the dataframe as CSV and it's working just fine. But I'd like to find a way to make the parquet work again, since this is a huge file.
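In case it helps anyone reproduce, a quick check one could run inside the node before returning (the function name is mine; df is the dataframe being saved):

import pandas as pd

def report_duplicate_columns(df: pd.DataFrame) -> list:
    # columns that appear more than once, which the pyarrow engine refuses to write
    return df.columns[df.columns.duplicated()].tolist()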
Thanks in advance 🦜!
Trevor
06/12/2023, 4:52 PM
I'd like to use the MemoryDataSet objects returned by nodes as pieces of larger scripts. Up until now, I've only needed to import main and that has worked for our purposes so far.
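To make the question concrete, the kind of programmatic access I'm after is roughly this (a sketch; the pipeline name is a placeholder, and my understanding is that session.run() returns the outputs that are not persisted in the catalog):

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    # dict of free (in-memory) outputs keyed by dataset name
    outputs = session.run(pipeline_name="__default__")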
Jared T
06/12/2023, 4:54 PM
ValueError: Duplicate keys found in <project repo>/conf/base/parameters/prepare.yml and:
- <project repo>/conf/base/parameters/ingest.yml: train_pipeline
I have the train_pipeline namespace in both the ingest and prepare modular sub-pipelines; here are the respective yamls:
# The following is a list of parameters for ingest pipeline for each namespace (train, inference)
# Parameters for train namespace
train_pipeline:
  ingestion_options:
    # Portfolio to use
    portfolio_name: has_meds_portfolio.HasMedsPortfolio
    # Feature store sub-pipes, only one for now.
    feature_store_subpipe_name: BasicFeaturePipeline
    # Expected output columns
    expected_columns:
      datetime: datetime64[ns]
      patient_id: int64
      age_days: int64
      Male: int64
      binary_smoking_status: object
      overall_censorship_time: datetime64[ns]
      months_until_overall_censorship: int64
      death_date: datetime64[ns]

# Parameters for inference namespace
# currently same as train but this will change,
# first updated to Nightly Portfolio, then to
# an api call to the valuation queue.
inference_pipeline:
  ingestion_options:
    # Portfolio to use
    portfolio_name: has_meds_portfolio.HasMedsPortfolio
    # Feature store sub-pipes, only one for now.
    feature_store_subpipe_name: BasicFeaturePipeline
    # Expected output columns
    expected_columns:
      datetime: datetime64[ns]
      patient_id: int64
      age_days: int64
      Male: int64
      binary_smoking_status: object
      overall_censorship_time: datetime64[ns]
      months_until_overall_censorship: int64
      death_date: datetime64[ns]

# all parameters for prepare pipeline are in train_pipeline namespace
train_pipeline:
  preparation_options:
    # target params
    target_death_buffer_months: 2
    # split params
    splitter: TimeSeriesSplit
    holdout_size: 0.3
Am I not allowed to use the same namespace in multiple modular pipelines?
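For what it's worth, my reading of the error is that the config loader rejects duplicate top-level keys across parameter files in the same environment, so a workaround sketch (the exact file layout here is just an assumption) would be to give the train_pipeline key a single owning file and merge the blocks there:

# conf/base/parameters/ingest.yml (sketch: one file owns the train_pipeline key)
train_pipeline:
  ingestion_options:
    portfolio_name: has_meds_portfolio.HasMedsPortfolio
    feature_store_subpipe_name: BasicFeaturePipeline
  preparation_options:
    target_death_buffer_months: 2
    splitter: TimeSeriesSplit
    holdout_size: 0.3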
CHIRAG WADHWA
06/13/2023, 4:34 AM
kedro-datasets 1.4.0 does not provide the extra 'pickle.pickledataset'
Does kedro-datasets not support pickle datasets?
Context: I'm removing kedro.extras.datasets from our asset codebase and using kedro-datasets.
Abhishek Bhatia
06/13/2023, 10:21 AM
I have a question about PartitionedDataSet. In the below pipeline, I have a node which returns a dictionary with pandas dataframes as values, so I define a PartitionedDataSet catalog entry for it. If I run the pipeline only up to this node, the files do get saved in the correct location, but the output is an empty dictionary. If I add an identity node, then the correct key-value pairs are returned. Is this the desired behaviour?
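To make the setup concrete, this is roughly the pattern I mean (function, column and dataset names are made up):

import pandas as pd

def split_by_group(df: pd.DataFrame) -> dict:
    # each key becomes a partition file under the dataset's path
    return {str(group): part for group, part in df.groupby("group")}

with a catalog entry like:

my_partitioned_output:
  type: PartitionedDataSet
  path: data/07_model_output/my_partitioned_output
  dataset: pandas.CSVDataSet
  filename_suffix: ".csv"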
Jose Nuñez
06/13/2023, 1:39 PM
Jeremi DeBlois-Beaucage
06/13/2023, 4:32 PM
Andreas_Kokolantonakis
06/14/2023, 12:19 PM
I am running kedro run --env=dev from docker and I am getting: ValueError: Failed to format pattern '${s3_root_path}': no config value found, no default provided
What's the best way to fix it? Thank you in advance!
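For context, my understanding is that the ${...} pattern is filled in from the globals file, so the missing piece would be something like this (file location and value are placeholders; the file also has to actually be present in the docker image for the environment you run with):

# conf/base/globals.yml (or conf/dev/globals.yml)
s3_root_path: s3://my-bucket/some/prefix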
Rafał Nowak
06/14/2023, 4:49 PM
I am using gto, which depends on semver >= 3. Unfortunately I cannot install kedro-viz, since kedro-viz 6.3.0 depends on semver < 3.
Is there any reason why kedro-viz is limited to semver < 3? The current semver is 3.0.1.
Could anyone from the kedro-viz team relax this dependency limitation?
Alexandre Ouellet
06/14/2023, 7:07 PM
Khangjrakpam Arjun
06/15/2023, 12:08 PM
type: kedro.extras.datasets.pandas.HTMLDataSet
On using the above class I am getting this error:
kedro.io.core.DataSetError: An exception occurred when parsing config for DataSet 'boxplot_figures_cfa':
Class 'kedro.extras.datasets.pandas.HTMLDataSet' not found or one of its dependencies has not been installed.
Does this class even exist?
Javier del Villar
06/15/2023, 6:51 PM
Georgi Iliev
06/16/2023, 7:56 AM
We're looking for a way of saving ONNX files and uploading them to S3 automatically using "only" the catalog definition.
Broadly speaking, the main flow of what we're trying to build is the following:
1. There is a process that trains and creates some files (PCA, scaler, some K-Means models, etc.) and saves them as Pickle to use them between different nodes.
2. Once the main pipeline is done, we're ready to distribute the model to our services.
3. We're using ONNX because our services are not built in Python and the ONNX libraries we use are a bit faster.
4. So taking this into account, we have a publish pipeline now that takes these Pickle files, converts them to ONNX using convert_sklearn, and then uploads to S3.
So, my main question here is: is there a way to implement this so that the transformation and the S3 upload are done automatically?
• I know that we can specify an S3 path in the catalog, but I didn't see how to set the .onnx file type.
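To show the direction we were imagining (purely a sketch on my side: the class and module names are made up, and convert_sklearn needs the right initial_types for your models), a small custom dataset could do the conversion at save time and write straight to S3 via fsspec:

from typing import Any

import fsspec
from kedro.io import AbstractDataSet
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType


class ONNXSklearnDataSet(AbstractDataSet):
    """Saves a fitted scikit-learn object as an ONNX file (local or S3)."""

    def __init__(self, filepath: str, n_features: int):
        self._filepath = filepath
        self._n_features = n_features
        protocol, _ = fsspec.core.split_protocol(filepath)
        self._fs = fsspec.filesystem(protocol or "file")

    def _save(self, model: Any) -> None:
        # convert the sklearn object and write the serialized ONNX bytes
        onnx_model = convert_sklearn(
            model,
            initial_types=[("input", FloatTensorType([None, self._n_features]))],
        )
        with self._fs.open(self._filepath, mode="wb") as f:
            f.write(onnx_model.SerializeToString())

    def _load(self) -> bytes:
        with self._fs.open(self._filepath, mode="rb") as f:
            return f.read()

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "n_features": self._n_features}

and then the catalog entry would just be something like (paths are placeholders):

pca_onnx:
  type: my_project.datasets.ONNXSklearnDataSet
  filepath: s3://my-bucket/models/pca.onnx
  n_features: 10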
Khangjrakpam Arjun
06/16/2023, 8:23 AM
I was using the kedro.extras.datasets.matplotlib.MatplotlibWriter class to save a figure object as a .png file in the kedro catalog and I got the below error:
'Figure' object has no attribute 'save'
Is there a way to use the savefig method instead of the save method to save a figure object in the kedro catalog?
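For reference, a minimal catalog sketch (the dataset name is made up; as far as I understand, MatplotlibWriter forwards save_args to the figure's savefig call):

boxplot_figure:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/boxplot_figure.png
  save_args:
    format: png
    dpi: 300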
Camilo López
06/16/2023, 12:18 PM
Guilherme Parreira
06/16/2023, 12:28 PM
In order to use auto-sklearn I will need to downgrade my Kedro project to Python 3.9.
I already installed python 3.9.16 with pyenv.
What would be my next steps?
(Do I need to change the python version in Pipfile to 3.9, and individually change the kernel of the notebook?)
If I change the kernel version of my notebook manually, it does not get recognised as being part of the project (second photo attached).
Thanks in advance!
Vici
06/16/2023, 1:08 PM
I have a question about viewing bulk-saved plots in kedro viz. I saved the plots as follows:
plots:
  type: PartitionedDataSet
  path: data/08_reporting/plots
  dataset:
    type: plotly.JSONDataSet
  filename_suffix: '.json'
Saving all the plots worked just fine (and I was able to load and show individual JSONs via fig = plotly.io.read_json(file); fig.show()). But it turns out that when you save plots in bulk like this, they cannot be displayed in kedro viz. Is there a way to allow accessing bulk-saved plots from kedro-viz (e.g., clicking the partitioned dataset in kedro viz, then having the option to select a specific plot), without forcing me to literally have a hundred JSONDataSets cluttering kedro viz? Thank you so much 😊
Edit: I'm also open to other (non-kedronic) ideas regarding the exploration of a large bulk of plotly plots.
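One non-kedronic direction I've been considering (a sketch; the directory follows the catalog entry above and the output file name is made up) is to stitch all the saved JSONs into a single HTML page for browsing:

from pathlib import Path

import plotly.io as pio

plot_dir = Path("data/08_reporting/plots")
with open("all_plots.html", "w") as out:
    for json_file in sorted(plot_dir.glob("*.json")):
        fig = pio.read_json(str(json_file))
        out.write(f"<h2>{json_file.stem}</h2>")
        # embed each figure; plotly.js is pulled from the CDN
        out.write(pio.to_html(fig, full_html=False, include_plotlyjs="cdn"))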
Sebastian Cardona Lozano
06/16/2023, 2:14 PM
I'm getting CircularDependencyError: Circular dependencies exist among these items: [node1 ...., node2]
Yes, the output of node 2 is an input for node 1. My goal is to not process all the items every time I run the pipeline, but only the new items not in that table.
How can I do this?
Thanks!! 🙂
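One pattern I've seen suggested for this kind of incremental processing (a sketch only; the names, type and path are made up) is to register two catalog entries that point at the same file, so node 1 reads the previous run's table and node 2 writes the updated one, without a literal cycle in the DAG:

processed_items:            # input to node 1: state from the previous run
  type: pandas.ParquetDataSet
  filepath: data/03_primary/processed_items.parquet

processed_items_updated:    # output of node 2: same file, updated this run
  type: pandas.ParquetDataSet
  filepath: data/03_primary/processed_items.parquet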
Nok Lam Chan
06/17/2023, 10:33 AM
Abhishek Bhatia
06/17/2023, 1:15 PM
Can we have nested or non-string keys in a PartitionedDataSet?
It seems kedro assumes the keys to be flat strings, so neither a tuple-as-keys nor a nested-dictionary specification works.
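A workaround sketch (the helper name is mine): flatten nested dictionaries into slash-separated string keys, which PartitionedDataSet treats as sub-folders:

import pandas as pd

def flatten_partitions(nested: dict) -> dict:
    # {"iter_1": {"run_1": df, ...}, ...} -> {"iter_1/run_1": df, ...}
    return {
        f"{outer}/{inner}": df
        for outer, inner_dict in nested.items()
        for inner, df in inner_dict.items()
    }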
Abhishek Bhatia
06/19/2023, 7:46 AM
I have a PartitionedDataSet laid out like this:
scenario_x/
├── iter_1/
│   ├── run_1.csv
│   ├── run_2.csv
│   └── run_3.csv
└── iter_2/
    ├── run_1.csv
    ├── run_2.csv
    └── run_3.csv
scenario_y/
├── iter_1/
│   ├── run_1.csv
│   ├── run_2.csv
│   └── run_3.csv
└── iter_2/
    ├── run_1.csv
    ├── run_2.csv
    └── run_3.csv
The catalog entry is like this:
_partitioned_csvs: &_partitioned_csvs
  type: PartitionedDataSet
  dataset:
    type: pandas.CSVDataSet
    load_args:
      index_col: 0
    save_args:
      index: true
  overwrite: true
  filename_suffix: ".csv"

_partitioned_jsons: &_partitioned_jsons
  type: PartitionedDataSet
  dataset:
    type: json.JSONDataSet
  filename_suffix: ".json"

my_csv_part_ds:
  path: data/07_model_output/my_csv_part_ds
  <<: *_partitioned_csvs

my_json_part_ds:
  path: data/07_model_output/my_json_part_ds
  <<: *_partitioned_jsons
When I run the pipeline, the csv partitioned dataset gets deleted first and then the new one gets written, but the json partitioned dataset remains and new partitions get added.
I need a sort of custom behaviour wherein the 2nd level of the partition gets overwritten, but not the first-level partition, i.e. in the node which produces the partitioned csv, the return value is like this:
def node_that_generates_part_ds(scenario, **kwargs):
    res = {'scenario_x/iter_1/run_1': df1, 'scenario_x/iter_1/run_2': df2, ...}  # and so on
    return res
So when the returned res keys contain scenario_x, scenario_y should NOT get deleted.
Can anyone guide me on how I can achieve this?
Thanks! 🙂
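One idea I'm considering (a sketch only: build_partitions is a hypothetical helper, the path is hard-coded to match the catalog entry above, and overwrite: true would have to be dropped from the catalog so Kedro doesn't wipe the whole dataset first) is to clear just the affected scenario folders inside the node before returning:

import shutil
from pathlib import Path


def node_that_generates_part_ds(scenario, **kwargs):
    base = Path("data/07_model_output/my_csv_part_ds")
    res = build_partitions(scenario)  # hypothetical: {"scenario_x/iter_1/run_1": df1, ...}
    # delete only the first-level folders we are about to rewrite
    for prefix in {key.split("/", 1)[0] for key in res}:
        shutil.rmtree(base / prefix, ignore_errors=True)
    return res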
marrrcin
06/19/2023, 7:49 AM
Is there any way to get the interactive starter prompt values parsed as bool? We experience an issue where all values from the interactive prompts are being cast to str, which is really inconvenient for true/false values, because they enforce syntax like: {%- if cookiecutter.my_flag != "False" %}.
Juan Luis
06/20/2023, 10:42 AM
• 127.0.0.1 was not working. I suspect it's because they were using an SSH connection to a Linux machine on AWS; localhost worked perfectly. Any reason to use the IP directly? (The user was on Windows.)
• Their pipelines were huge. They asked me about a way to group sub-pipelines visually, but I'm not versed enough. Is there any way to do it? (See the namespace sketch below.)
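For the grouping question, the direction I'd look at (a sketch; function and dataset names are made up) is namespaced modular pipelines, since kedro-viz renders a namespaced pipeline as a single collapsible box:

from kedro.pipeline import node, pipeline


def clean(raw_data):
    ...


def featurize(clean_data):
    ...


data_processing = pipeline(
    [
        node(clean, "raw_data", "clean_data"),
        node(featurize, "clean_data", "features"),
    ],
    namespace="data_processing",  # shown as one expandable group in kedro viz
)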
Pranav Khurana
06/20/2023, 11:32 AM
Kevin Mills
06/20/2023, 7:32 PM
Idris Benkhelil
06/21/2023, 6:02 AM
[step 1] > [step 2] > [if score_step2 < X] > [step 4]
                    > [if score_step2 >= X] > [step 5]
Do you have any indication of how I can do this? Or an example of code already implemented?
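To give an idea of the fallback I would accept (a sketch; the helper names are made up): since a Kedro DAG itself has no conditional edges, the branching could live inside a single node that receives the score and the data:

def branch_on_score(score_step2, data, threshold):
    # run_step4 / run_step5 are hypothetical helpers for the two branches
    if score_step2 < threshold:
        return run_step4(data)
    return run_step5(data)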
Thanks in advance.
Idris
Marc Gris
06/21/2023, 7:10 AM
I'm using @singledispatchmethod from the functools library to refactor my code and create “per-model type” implementations of fit(), predict(), etc.
Unfortunately, this results in a
ValueError: Invalid Node definition: first argument must be a function, not 'singledispatchmethod'.
And indeed in kedro/pipeline/node.py:72
if not callable(func):
    raise ValueError(
        _node_error_message(
            f"first argument must be a function, not '{type(func).__name__}'."
        )
    )
Is this “rejection” of functools.singledispatchmethod an “un-intended collateral” of the check (in which case I could make a pull request to handle it), or are there some things “down the line” that would justify not allowing the use of functools & co? 🙂
Thx
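(For completeness, the workaround sketch I'm playing with — class and function names are placeholders — is a thin module-level wrapper, so the object passed to node() is a plain callable:

from functools import singledispatchmethod


class Trainer:
    @singledispatchmethod
    def fit(self, model, data):
        raise NotImplementedError(f"No fit() registered for {type(model)}")

    # per-model-type implementations would be added via @fit.register(...)


_trainer = Trainer()


def fit_node(model, data):
    # plain function, so Kedro's callable(func) check passes
    return _trainer.fit(model, data)
)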
Marc Gris
06/21/2023, 9:31 AM
I have random_state: 42 in one config file, and in conf/model_training.yml:
random_state: ${random_state}
kedro run
>>> [...]
TypeError: Cannot cast scalar from dtype('<U15') to dtype('int64') according to the rule 'safe'
If I get this correctly, the consolidation / interpolation process resulted in random_state being assigned the value "42" instead of 42.
Granted, I could easily circumvent this issue with int(params['random_state']), but I'm curious and would like to know whether this is expected behavior, and whether there is a more robust / elegant way of handling it.
Thx in advance
M