noam
05/04/2023, 2:55 PM
validation_data:
  type: kedro.io.PartitionedDataSet
  path: data/03_primary/validation_data/
  dataset: pickle.PickleDataSet
  filename_suffix: ".df"
  versioned: True
Thanks in advance!
marrrcin
05/05/2023, 8:45 AM
kedro new --starter=spaceflights works out of the box, but the starter we've developed for Kedro Snowflake additionally requires specifying the --checkout= flag, because of the default mechanism in Kedro:
Error: Kedro project template not found at git+https://github.com/getindata/kedro-snowflake. Specified tag 0.18.8. The following tags are available: 0.0.1, 0.1.0, 0.1.1.
Is there a way (or if not, I think there should be) for the plugin to specify the default tag to use? The versioning of Kedro should not affect the versioning of custom starters/plugins 🤔
noam
05/05/2023, 12:01 PM
…the versioned: True argument in the data catalog for this kind of dataset).
Perhaps it is better that I explain the root issue/challenge, in case there are solutions I am missing.
The Problem: By default, Kedro overwrites data objects with each run, using the paths set in the data catalog.
The Question: What is a convenient solution/tech stack for enabling the execution of multiple parallel ML experiments in my Kedro pipeline, while ensuring that…
1. Each experiment triggers the data to be versioned effectively. Ideally…
   a. When there are changes to the data, the data is copied and assigned a unique ID (SHA, MD5, timestamp), perhaps with metadata about the parameters that were used to generate it. In this case, it is important that the data is stored in a sensible, organized manner.
   b. When there have been no changes to the data, the same unique ID (and metadata) is reused and can be extracted.
2. The unique IDs (and metadata) for each dataset relevant to the ML run can be extracted and stored alongside the (presumably lighter) results of the experiment.
3. Given 1. and 2. above, the results are reproducible (they offer point-in-time correctness).
The solutions I have come across thus far are problematic:
• Writing a class to set dynamic dataset filepaths
   ◦ The main issue with this approach is that it is incredibly high-maintenance. It requires continuous, careful attention to the parameters used to define the dynamic filepaths.
      ▪️ For example, if I set the filepath to training_data_{a}_{b} using parameters a and b, and I then change parameter c, altering the composition of the data, the new dataset will overwrite the previous one. If I wanted to keep them both, I would have had to remember to update the filepath parameters to include c. Of course, with many different data-defining parameters, this becomes problematic rather quickly.
• Using Kedro versioning: add the versioned: True argument in the catalog underneath the datasets you want to version
   ◦ The first issue with this approach is that it appears to version all of the data with every new run, presenting a massive storage issue and the need for a custom retention policy to clear useless/outdated data.
   ◦ The second issue is that this doesn't work with PartitionedDataSet datasets.
Are there any effective solutions I am missing?
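A minimal sketch of one building block for points 1 and 2 (an assumption, not an established Kedro feature): derive a deterministic ID from the data-defining parameters, so unchanged parameters reuse the same ID while changed parameters get a new one, and emit it as metadata next to the experiment results. The names experiment_id and tag_training_data are made up for illustration.
import hashlib
import json


def experiment_id(data_params: dict) -> str:
    """Stable short hash of the parameters that define the data."""
    canonical = json.dumps(data_params, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()[:12]


def tag_training_data(training_data, data_params: dict):
    """Node: pass the data through and emit its ID + parameters as metadata."""
    metadata = {"experiment_id": experiment_id(data_params), "params": data_params}
    return training_data, metadata
The metadata output could then be saved alongside the (lighter) experiment results, and the ID templated into the filepaths of the heavy datasets so that only genuinely new data gets written.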
Ofir
05/05/2023, 2:22 PM
Ofir
05/05/2023, 3:04 PM
kedro run, but it doesn't accept output dirs / workspace as a parameter. It assumes you have the configuration files already in place.
Perhaps I need to come up with my own kedro wrapper? (to auto-generate the configuration files and then call kedro run)
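One possible shape for that wrapper, as a minimal sketch: generate the configuration files for a workspace first, then trigger the run through Kedro's Python session API instead of shelling out to kedro run. The function name, environment name and extra parameters below are assumptions.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def run_in_workspace(project_path: str, env: str, params: dict):
    """Run the default pipeline against a given config environment."""
    project_path = Path(project_path)
    bootstrap_project(project_path)  # load the project's settings/metadata
    with KedroSession.create(
        project_path=project_path,
        env=env,  # e.g. an environment whose conf/ files were auto-generated
        extra_params=params,
    ) as session:
        return session.run()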
Ofir
05/05/2023, 3:11 PM
Ofir
05/05/2023, 3:16 PM
Adrien
05/05/2023, 3:38 PM
Jose Nuñez
05/05/2023, 7:51 PM
The Run Command of every function starts with None. For instance, in the picture you can read kedro run --to-nodes=None.clean_mdt. Why is this? My pipeline executes just fine without any issues after a regular kedro run.
If I do a kedro run --to-nodes=None.clean_mdt I'll get an error, so I manually need to erase the None. before running.
Running this instead works just fine: kedro run --to-nodes=clean_mdt
Rob
05/06/2023, 3:59 AM
Has anyone had nodes appear collapsed when serving kedro viz over Host '0.0.0.0'? I want to show them expanded by default. Here's an example of the behavior I'm seeing: https://brawlstars-retention-pipeline-6u27jcczha-uw.a.run.app/, and this doesn't happen with localhost.
Can someone suggest a way to modify the code used in this thread to show the nodes expanded? Thanks!
Dawid Bugajny
05/08/2023, 10:34 AM
Andreas_Kokolantonakis
05/09/2023, 2:51 PM
Juan Luis
05/09/2023, 3:38 PM
It's python -m build that is failing, but I can't reproduce it locally.
Elena Mironova
05/09/2023, 3:43 PM
Struggling with kedro pull micropkg and sdist here. Would anyone know how to best deal with the multiple egg-info error in the screenshot?
Context: with kedro==0.18.3 on a Mac, I used python -m build --sdist path/to/package to create the tar.gz inside our Git repo (I know an alternative is to create the sdist through kedro micropkg package, but I can't do it from inside a Kedro project). I see the archive, but when doing kedro pull micropkg (from the Kedro project root), the following error comes up. This may or may not be related to the existing issue.
Javier del Villar
05/10/2023, 1:45 PM
SparkDataSet. More details in thread.
Richard Bownes
05/10/2023, 2:46 PM
Mate Scharnitzky
05/10/2023, 3:33 PM
Our Kedro pin is ==0.18.3. When we relax it to ~=0.18.3, pip installs 0.18.8 while compiling and we get the below error for some of our pipelines:
KeyError: 'logging'
Can you provide some pointers on what could be the reason behind this? 0.18.* should have no breaking changes based on the RELEASE notes, so I'm not sure what could explain this.
Thanks for the help!
Panos P
05/10/2023, 5:08 PM
params:
  - p1
  - p2
p1: value1
p2: value2
I want to create a pipeline with 2 nodes where each node takes one of these params as input, e.g.
nodes = [node(func, f"params:{p}", f"output_{p}") for p in params]
pipeline(nodes)
Is that possible, and how?
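A minimal sketch of how this could look, assuming func takes a single value and the parameter names are known when the pipeline is assembled; func and the node names are placeholders.
from kedro.pipeline import Pipeline, node, pipeline


def func(value):
    # Placeholder for the real processing logic applied to each parameter.
    return value


def create_pipeline(**kwargs) -> Pipeline:
    param_names = ["p1", "p2"]  # mirrors the `params` list in parameters.yml
    return pipeline(
        [
            node(
                func,
                inputs=f"params:{p}",   # each node reads one parameter
                outputs=f"output_{p}",  # and writes its own output dataset
                name=f"process_{p}",
            )
            for p in param_names
        ]
    )
The caveat is that the list has to be available at pipeline-creation time (hard-coded or read from a file), since register_pipelines runs before parameters are injected into nodes.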
Brandon Meek
05/10/2023, 11:58 PM
Melvin Kok
05/11/2023, 1:21 AM
run function directly, Great Expectations is doing it, and the config only allows for passing in serializable objects
Artur Dobrogowski
05/11/2023, 1:35 PM
Erwin
05/11/2023, 1:38 PM
Giuseppe Ughi
05/11/2023, 3:34 PM
When I define my conf/base/catalog.yml as follows,
{% for region in ['parasubicular', 'parainsular'] %}
{{ region }}.data_right:
  type: PartitionedDataSet
  path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
  dataset: pandas.CSVDataSet
  filename_suffix: /{{ region }}_R_T1.nii.gz
{{ region }}.data_right_output:
  type: pandas.CSVDataSet
  filepath: data/03_primary/{{ region }}_output.csv
{% endfor %}
everything works fine. However, I need to iterate over a list that is not practical to hard-code, therefore I was hoping to have something like the following:
regions:
  - 'parasubicular'
  - 'parainsular'
{% for region in regions %}
{{ region }}.data_right:
  type: PartitionedDataSet
  path: data/01_raw/ClinicalDTI/R_VIM/seedmasks/
  dataset: pandas.CSVDataSet
  filename_suffix: /{{ region }}_R_T1.nii.gz
{{ region }}.data_right_output:
  type: pandas.CSVDataSet
  filepath: data/03_primary/{{ region }}_output.csv
{% endfor %}
but no matter where I define the regions list (I tried defining it in different .yml files) I stumble on the same error, screenshotted below.
Do you by chance know whether I have to save the Jinja pattern in a different file, whether there is a specific place where I have to save the list that I want to read, or whether I have to change the parsing somehow?
Thank you in advance!!
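A minimal sketch of an alternative, assuming the region list lives in parameters and the entries can be registered from Python instead of a Jinja loop: an after_catalog_created hook that adds one PartitionedDataSet and one output CSVDataSet per region. The hook class name is made up, it would need to be listed in HOOKS in settings.py, the paths mirror the catalog entries above, and kedro.extras.datasets is assumed (kedro_datasets.pandas.CSVDataSet would be the equivalent on newer setups).
from kedro.framework.hooks import hook_impl
from kedro.io import PartitionedDataSet
from kedro.extras.datasets.pandas import CSVDataSet


class RegionCatalogHooks:
    """Register the per-region datasets programmatically."""

    @hook_impl
    def after_catalog_created(self, catalog, feed_dict):
        regions = feed_dict.get("params:regions", [])
        for region in regions:
            catalog.add(
                f"{region}.data_right",
                PartitionedDataSet(
                    path="data/01_raw/ClinicalDTI/R_VIM/seedmasks/",
                    dataset=CSVDataSet,
                    filename_suffix=f"/{region}_R_T1.nii.gz",
                ),
            )
            catalog.add(
                f"{region}.data_right_output",
                CSVDataSet(filepath=f"data/03_primary/{region}_output.csv"),
            )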
Toni - TomTom - Madrid
05/12/2023, 7:52 AM
Nitin Soni
05/12/2023, 12:23 PM
Ofir
05/14/2023, 8:43 AM
Chengjun Jin
05/14/2023, 8:55 PM
name: pm_stat_checkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template:
expectation_suite_name:
batch_request: {}
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []
evaluation_parameters: {}
runtime_configuration: {}
validations:
  - batch_request:
      datasource_name: default_pandas_datasource
      data_asset_name: my_runtime_asset_name
      data_connector_name: default_runtime_data_connector_name
    expectation_suite_name: pm_expectation_suite
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
INFO - Loading data from 'pm_sales_raw' (ParquetDataSet)...
INFO - FileDataContext loading fluent config
INFO - Loading 'datasources' ->
[{'name': 'default_pandas_datasource', 'type': 'pandas'}]
INFO - Loaded 'datasources' ->
[]
INFO - Of 1 entries, no 'datasources' could be loaded
...
...
DatasourceError: Cannot initialize datasource default_pandas_datasource, error: The given datasource
could not be retrieved from the DataContext; please confirm that your configuration is accurate.
It seems that there is a problem in loading the data source. Did I miss some steps?
Thank you
Mate Scharnitzky
05/15/2023, 12:03 PM
A question about the kedro-datasets package: we ran into some pip resolver issues, and it turned out that kedro-datasets==1.0.0 and above require kedro~=0.18.4. We can verify this by pip install kedro-datasets==0.0.7 --dry-run, but we can't find where this dependency is actually defined. In setup.py it's actually not mentioned.
Thank you!
@Kasper Janehag
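One way to see where the pin comes from (a sketch, assuming the package is installed in the environment being checked): read the declared requirements straight from the distribution metadata, which reflects whatever the build backend put into the wheel/sdist even when it is not spelled out in setup.py itself.
from importlib.metadata import requires

# Print every declared requirement of kedro-datasets that mentions kedro.
for requirement in requires("kedro-datasets") or []:
    if requirement.startswith("kedro"):
        print(requirement)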
Andrew Doherty
05/15/2023, 2:27 PM
When I pass "neptune_run" as an input to a pipeline node I get the following error:
ValueError: Pipeline input(s) {'NAMESPACE.neptune_run'} not found in the DataCatalog
where "NAMESPACE" is my namespace pipeline name.
Is there a way to use Neptune along with namespaced pipelines?
Thanks again.
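A minimal sketch of one thing worth trying, assuming the namespace wrapper is what prefixes the dataset name: datasets listed in the inputs argument of the modular pipeline helper are left un-prefixed, so neptune_run would be looked up under its original name. create_pipeline and the module path are placeholders.
from kedro.pipeline import pipeline

# Hypothetical import: the pipeline factory of the namespaced pipeline.
from my_project.pipelines.training import create_pipeline


def create_namespaced_pipeline():
    return pipeline(
        create_pipeline(),
        namespace="NAMESPACE",
        # Keep "neptune_run" as-is instead of renaming it to "NAMESPACE.neptune_run".
        inputs={"neptune_run"},
    )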
05/15/2023, 2:52 PMlib/python3.7/site-packages/kedro/io/core.py", line 191, in load
raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while loading data from data set SparkPostgresJDBCDataSet(load_args={'properties': {'connectTimeout': 300, 'driver': org.postgresql.Driver}}, option_args=True, save_args={'properties': {'driver': org.postgresql.Driver}}, table= , url=jdbc:postgresql:).
An error occurred while calling o49135.setProperty. Trace:
py4j.Py4JException: Method setProperty([class java.lang.String, class java.lang.Integer]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Does anyone know what I am doing wrong and how to fix it? Or which files am I supposed to update for this timeout argument?
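A guess at the cause, with a minimal sketch: the py4j error says setProperty(String, Integer) does not exist, and JDBC connection properties go through java.util.Properties, which only accepts string values, so the integer connectTimeout: 300 in load_args is the likely trigger. Quoting the value as a string in the catalog ("300") should avoid it; the equivalent call in plain PySpark looks like this (URL and table are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.jdbc(
    url="jdbc:postgresql://host:5432/db",   # placeholder connection URL
    table="my_table",                       # placeholder table name
    properties={
        "driver": "org.postgresql.Driver",
        "connectTimeout": "300",            # must be a string, not an int
    },
)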