Muhammad Ghazalli
05/31/2023, 3:04 AM

Lucas Hattori
05/31/2023, 1:52 PM
If pipe2 depends on an output from pipe1, will they always run in the correct order, i.e. pipe1 -> pipe2? If they were regular pipelines I know they would, but I'm not sure about modular pipelines (though I can't imagine why they would be different).
Mock code below:
pipe1 = modular_pipeline.pipeline(
    pipe=func,
    namespace="pipe1_ns",
    inputs={"input": "pipe1_ns.input"},
)
pipe2 = modular_pipeline.pipeline(
    pipe=func,
    namespace="pipe2_ns",
    inputs={"input": "pipe1_ns.output"},
)
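A minimal sketch to sanity-check the resolved order, assuming pipe1 and pipe2 are the modular pipelines from the mock code above: Kedro schedules nodes from dataset dependencies, and Pipeline.nodes returns a topologically sorted list, so the node that produces pipe1's output should appear before the node that consumes it.

# Sketch: combine the two modular pipelines and inspect the resolved node order.
combined = pipe1 + pipe2
print([n.name for n in combined.nodes])  # the pipe1 node should be listed before the pipe2 node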
Nan Dong
05/31/2023, 3:20 PM

Gabriel Bandeira
05/31/2023, 5:12 PM
kedro jupyter notebook is failing, saying Error: No such command 'jupyter'.
How can I make it work?
Package versions in thread.

Ezekiel Day
05/31/2023, 7:49 PM

Manilson António Lussati
05/31/2023, 8:13 PM

Ezekiel Day
06/01/2023, 8:07 PM

Richard Purvis
06/01/2023, 8:18 PM
I tried setting os.environ before the import statements in my node scripts, to no avail.
Edit: I resolved this by putting the os.environ["VAR"] = "value" call at the top of my settings.py file. I don't know if this is the best solution, but since this is a workaround until a bugfix lands in one of the project dependencies, I'm happy to leave it there.
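A minimal sketch of that workaround, for reference (VAR and "value" are placeholders from the message above; settings.py is imported early when the project is configured, so the assignment runs before the project's nodes are imported):

# settings.py — set the variable before anything that reads it gets imported
import os

os.environ["VAR"] = "value"  # placeholder name/value

# ...the usual Kedro settings (HOOKS, SESSION_STORE_CLASS, etc.) follow...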
Ofir
06/01/2023, 9:08 PM

Tomás Rojas
06/02/2023, 6:15 AM
I am getting ModularPipelineError: Failed to map datasets and/or parameters: followed by a list of datasets that do exist in the catalog.
This is the code:
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

from .nodes import select_columns


def create_pipeline(**kwargs) -> Pipeline:
    nominal_template = pipeline(
        [
            node(
                func=select_columns,
                inputs=["nominal_raw_data_normalized", "params:columns"],
                outputs="nominal_raw_data_features",
                name="extracting_columns_nominal_data",
                namespace="data_preprocessing",
            )
        ]
    )
    faulty_template = pipeline(
        [
            node(
                func=select_columns,
                inputs=[f"fault_{i}_raw_data_normalized", "params:columns"],
                outputs=f"fault_{i}_raw_data_features",
                name=f"extracting_columns_fault_{i}",
                namespace="data_preprocessing",
            )
            for i in range(1, 29)
        ]
    )
    reactor_nominal = pipeline(
        pipe=nominal_template,
        inputs={f"fault_{i}_raw_data_normalized" for i in range(1, 29)},
        parameters={"params:columns": "params:reactor_columns"},
        namespace="reactor",
    )
    reactor_faulty = pipeline(
        pipe=faulty_template,
        inputs={f"fault_{i}_raw_data_normalized" for i in range(1, 29)},
        parameters={"params:columns": "params:reactor_columns"},
        namespace="reactors",
    )
    reactor = reactor_nominal + reactor_faulty
    return reactor
Any ideas on what the error is? Maybe I am not using the module correctly. Thanks in advance 🙂
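One hedged guess based on the snippet above: the inputs argument of pipeline() has to reference dataset names that actually appear in the wrapped pipeline (not just in the catalog), and nominal_template only uses nominal_raw_data_normalized, so mapping the fault_* names onto it cannot be resolved. A minimal sketch of what the nominal wrapper might look like under that assumption:

reactor_nominal = pipeline(
    pipe=nominal_template,
    inputs={"nominal_raw_data_normalized"},  # only names that exist in nominal_template
    parameters={"params:columns": "params:reactor_columns"},
    namespace="reactor",
)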
Riley Brady
06/02/2023, 4:44 PM
PIPELINE1
    node1
        tags=["task1", variable, model, region]
    node2
        tags=["task2", variable, model, region]
I want to run all node1s under PIPELINE1 for a certain variable and model, but over all regions (we work on geospatial data).
We run from the kedro CLI, launching AWS Batch jobs. I found that I could run jobs from a config spec here. So I set up the following `config.yml`:
run:
  tags: task1, temperature, GFDL-ESM4  # don't declare region so all regions are run
  pipeline: PIPELINE1
  env: dev
  runner: cra_data_pipelines.runner.BatchRunner
Then I run kedro run --config=config.yml.
RESULT: It ends up launching all 700 jobs from PIPELINE1 without any distinction for the listed tags above. I of course just want the 20 or so that meet the AND condition of those three tags.
I recall having this issue back in the fall and asked about it, and at the time I don't think there was any way to run tags with AND logic. I was told that recent versions of kedro updated this, and I saw on the config page that it listed multiple tags, so I assumed that's how it should work.
Any help would be great here! I would prefer a simple solution like this rather than looping through each node manually in a shell script.
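For context, a minimal sketch of AND-style tag filtering done in code rather than via the tags config key, assuming the union/OR behaviour of tag selection is what is being hit here (the tag names are the ones from the message above; nodes_with_all_tags is a hypothetical helper, not a Kedro API):

from kedro.pipeline import Pipeline

def nodes_with_all_tags(pipe: Pipeline, *tags: str) -> Pipeline:
    # keep only the nodes that carry every one of the requested tags
    wanted = set(tags)
    return Pipeline([n for n in pipe.nodes if wanted <= n.tags])

# e.g. inside register_pipelines(), starting from the full PIPELINE1 object:
# filtered = nodes_with_all_tags(pipeline1, "task1", "temperature", "GFDL-ESM4")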
Tomás Rojas
06/03/2023, 5:02 AM
I ran kedro jupyter lab on a project and it seems to run OK, but sometimes I get an error: it crashes and the cell returns ERROR! Session/line number was not unique in database. History logging moved to new session 668. Any idea on what could be the issue?

Artur Janik
06/04/2023, 8:03 PM
kedro ipython appears to accept and tolerate the old way of doing things, with the extras folder, while kedro run and kedro viz do not, and cannot find the dataset definitions.
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets doesn't appear to provide any new advice as to how to declare datasets that are not in the extras folder.
What is the correct way to declare custom datasets in kedro 0.18.*?

Artur Janik
06/05/2023, 12:55 AM
kedro ipython is still happy with it, while neither kedro viz nor kedro jupyter is.

Dan Knott
06/05/2023, 1:18 PM

Nok Lam Chan
06/05/2023, 1:56 PM

Joseph Mehltretter
06/05/2023, 5:36 PM

Zhe An
06/05/2023, 10:47 PM
node(
    func=create_class_code_list,
    inputs=[
        "full_raw_data_dump",
        "params:feature_engineering.primary_policy_key",
        "params:feature_engineering.class_code_col",
    ],
    outputs="full_data_with_agg_features",
)
I want to test the inputs:
1. full_raw_data_dump is a dataframe from catalog.yaml. I want to test keys in this df.
2. params:feature_engineering.primary_policy_key is a str from catalog.yaml. I want to test the string using a keyword pattern.
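For reference, a hypothetical pytest-style sketch of testing those two inputs by loading them through the project's catalog and parameters (the project path, the expected column name, and the keyword pattern below are all placeholders):

import re
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

def test_node_inputs():
    bootstrap_project(".")  # assumes the test runs from the project root
    with KedroSession.create(project_path=".") as session:
        context = session.load_context()
        df = context.catalog.load("full_raw_data_dump")
        params = context.params["feature_engineering"]
        assert "class_code" in df.columns  # placeholder key check on the dataframe
        assert re.search(r"policy", params["primary_policy_key"])  # placeholder keyword pattern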
charles
06/05/2023, 11:15 PM

Iñigo Hidalgo
06/06/2023, 7:40 AM
_infer_copy_mode? In this issue it was mentioned as a possibility but was discarded because it's too "heavy", but I think adding one additional branch to the already-existing pandas check could be worth it for incorporating Ibis functionality.

Andreas_Kokolantonakis
06/06/2023, 2:03 PM

Iñigo Hidalgo
06/07/2023, 7:02 AM

fmfreeze
06/07/2023, 9:48 AM
kedro run showed this error (attached screenshot).
How can I "reset" kedro's experiment tracking?

fmfreeze
06/07/2023, 10:29 AM
There are MemoryDataSet outputs flowing around a pipeline, and I want to inspect them for individual tracked pipeline runs after the run (e.g. load them again like with session.run(to_outputs...), but for a specific experiment run from the past).

Manilson António Lussati
06/07/2023, 11:33 AM

Julius Hetzel
06/08/2023, 6:30 AM
torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
torchvision==0.15.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
the Lambda is not able to access S3 and fails with Install s3fs to access S3.
If I install everything locally on my Linux machine and run kedro run, it runs fine.
Has anyone come across this problem, or has an idea on how to fix it?

Hannes
06/08/2023, 1:53 PM
DataSetError: Failed while loading data from data set CSVDataSet(filepath=/home/foo/dev.csv, load_args={}, protocol=sftp, save_args={'index': False}).
<urlopen error unknown url type: sftp>
The file is referenced in conf\base\catalog.yml using the following syntax:
input_data:
  type: pandas.CSVDataSet
  filepath: "sftp:///home/foo/dev.csv"
  credentials: cluster_credentials
where cluster_credentials is defined as follows in my conf\local\credentials.yml:
cluster_credentials:
  username: username
  host: localhost
  port: 22
  password: password
I am running Kedro version 0.18.8 and I have Paramiko version 3.2.0 installed, running on a Windows machine.
I have followed the instructions in the data catalog docs here.
I would greatly appreciate any insights or suggestions on how to debug and resolve this issue. Thank you in advance for your help!
Best Regards
Hannes

Iñigo Hidalgo
06/08/2023, 5:01 PM

Melvin Kok
06/09/2023, 7:14 AM
> kedro new --starter=spaceflights
kedro.framework.cli.utils.KedroCliError: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.10. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.18.4, 0.18.5, 0.18.6, 0.18.7, 0.18.8, 0.18.9. The aliases for the official Kedro starters are:
- astro-airflow-iris
- astro-iris
- pandas-iris
- pyspark
- pyspark-iris
- spaceflights
- standalone-datacatalog
Run with --verbose to see the full exception
Error: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.10. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.18.4, 0.18.5, 0.18.6, 0.18.7, 0.18.8, 0.18.9. The aliases for the official Kedro starters are:
- astro-airflow-iris
- astro-iris
- pandas-iris
- pyspark
- pyspark-iris
- spaceflights
- standalone-datacatalog
Sebastian Cardona Lozano
06/10/2023, 12:31 AM
import fsspec
from pathlib import PurePosixPath
from typing import Any, Dict

from annoy import AnnoyIndex

from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class AnnoyIndexDataSet(AbstractDataSet[AnnoyIndex, AnnoyIndex]):
    """``AnnoyIndexDataSet`` loads / saves an Annoy index from a given filepath."""

    def __init__(self, filepath: str, dimension: int, metric: str):
        """Creates a new instance of AnnoyIndexDataSet to load / save an Annoy
        index at the given filepath.

        Args:
            filepath (str): The path to the file where the index will be saved
                or loaded from.
            dimension (int): The length of the vectors that will be indexed.
            metric (str): The distance metric to use. One of "angular",
                "euclidean", "manhattan", "hamming", or "dot".
        """
        # parse the path and protocol (e.g. file, http, s3, etc.)
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)
        self.dimension = dimension
        self.metric = metric

    def _load(self) -> AnnoyIndex:
        """Load the index from the file.

        Returns:
            An instance of AnnoyIndex.
        """
        # using get_filepath_str ensures that the protocol and path are
        # appended correctly for different filesystems
        load_path = get_filepath_str(self._filepath, self._protocol)
        annoy_index = AnnoyIndex(self.dimension, self.metric)
        annoy_index.load(load_path)
        return annoy_index

    def _save(self, annoy_index: AnnoyIndex) -> None:
        """Save the index to the file.

        Args:
            annoy_index: An instance of AnnoyIndex.
        """
        save_path = get_filepath_str(self._filepath, self._protocol)
        annoy_index.save(save_path)

    def _describe(self) -> Dict[str, Any]:
        """Return a dict describing the dataset.

        Returns:
            A dict with keys "filepath", "dimension", and "metric".
        """
        return {
            "filepath": self._filepath,
            "dimension": self.dimension,
            "metric": self.metric,
        }
And in the data catalog I have this:
annoy_index:
  type: pricing.extras.datasets.annoy_dataset.AnnoyIndexDataSet
  dimension: 1026
  metric: angular
  filepath: /data/06_models/products_index.ann
  layer: model_input
My goal is to save the .ann file in Google Cloud Storage or in a local folder, but I get the following error when running the node that saves the file:
DataSetError: Failed while saving data to data set AnnoyIndexDataSet(dimension=1026,
filepath=/data/06_models/products_index.ann, metric=angular).
Unable to open: No such file or directory (2)
I'd appreciate your help. Thanks!!
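One hedged observation on the error above: Unable to open: No such file or directory usually means AnnoyIndex.save() is being pointed at a folder that doesn't exist, and with the leading slash, /data/06_models/products_index.ann is an absolute path rather than the project's data/06_models folder. A minimal sketch of a guard inside _save, under that assumption:

from pathlib import Path

def _save(self, annoy_index: AnnoyIndex) -> None:
    # AnnoyIndex.save() writes straight to a local path and does not create
    # missing parent directories, so make sure the target folder exists first.
    save_path = get_filepath_str(self._filepath, self._protocol)
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    annoy_index.save(save_path)

Saving directly to Google Cloud Storage would need a different approach (for example writing to a local temporary file and uploading it via self._fs), since Annoy only writes to local paths.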