Filip Panovski
02/06/2023, 3:55 PM
Has anyone used Prefect >= 2.0.0 to run Kedro pipelines? Currently, only Prefect 1.x is tested to work with 0.18.x according to the docs, which makes us hesitate a bit on that end. We're currently evaluating both as a higher-level orchestration platform for our Kedro pipelines, and both seem great for generic workflows, so some community feedback would be much appreciated.
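A minimal sketch (untested) of one way to drive a Kedro run from a Prefect 2 flow; it assumes the flow process starts in the Kedro project root and uses only the public Kedro 0.18.x session API:

# Minimal sketch (untested): running a Kedro pipeline from a Prefect 2 flow.
# Assumes the flow is executed from the Kedro project root directory.
from pathlib import Path

from prefect import flow

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

@flow(name="kedro-run")
def run_kedro(pipeline_name: str = "__default__") -> None:
    project_path = Path.cwd()
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        session.run(pipeline_name=pipeline_name)

if __name__ == "__main__":
    run_kedro()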
Zoran
02/06/2023, 5:26 PM
MarioFeynman
02/07/2023, 3:00 AM
JOEL WILSON
02/07/2023, 7:15 AM
pyarrow==0.14.0
java version "1.8.0_341"
Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/
Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_341
Branch HEAD
Compiled by user yumwang on 2022-10-15T09:47:01Z
Revision fbbcf9434ac070dd4ced4fb9efe32899c6db12a9
Url https://github.com/apache/spark
Vassilis Kalofolias
02/07/2023, 3:06 PM
Vassilis Kalofolias
02/07/2023, 3:15 PM
Is there anything wrong with running pipelines directly with a SequentialRunner instead of using the session of jupyter?
For example, I would like to run the same pipeline in a loop with different partitions of a PartitionedDataSet, and I find it weird to call %reload_ext kedro.ipython in a loop.
Is this discouraged practice? What is the benefit of having a session in jupyter if you develop interactively? (related but not answering my question: https://kedro-org.slack.com/archives/C03RKP2LW64/p1668423931294329)
Thanks a lot!
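A rough sketch of what that loop could look like (names hypothetical; it reuses the catalog object the kedro.ipython extension already provides, and some 0.18.x releases also require passing a hook_manager to runner.run):

# Rough sketch (names hypothetical): looping a pipeline over the partitions
# of a PartitionedDataSet with a bare SequentialRunner, reusing the `catalog`
# object that the kedro.ipython extension already injected.
from kedro.framework.project import pipelines
from kedro.io import MemoryDataSet
from kedro.runner import SequentialRunner

runner = SequentialRunner()
pipeline = pipelines["my_pipeline"]               # hypothetical pipeline name
partitions = catalog.load("my_partitioned_data")  # {partition_id: load_function}

for partition_id, load_partition in partitions.items():
    # swap the pipeline's input for the current partition, kept in memory
    catalog.add("pipeline_input", MemoryDataSet(load_partition()), replace=True)
    runner.run(pipeline, catalog)  # some 0.18.x releases also want a hook_manager here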
Lawrence Shaban
02/07/2023, 7:32 PM
I can't get debug-level logging to show up. My logging config has:
handlers:
  console:
    class: logging.StreamHandler
    level: DEBUG
    formatter: simple
    stream: ext://sys.stdout
import logging

logger = logging.getLogger(__name__)

def example_node(input):
    logger.debug(input)
    output = input + 1
    return output
I might just be doing something simple wrong, but any help would be appreciated! It works for info, so I'm just using that for now, but it would be good to have the option of debug! 🙂
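One hedged guess at the cause: the handler level alone is not enough if the logger itself still sits at INFO, in which case the logging config would also need a loggers entry along these lines (package name hypothetical):

# Sketch: the logger's own level must also allow DEBUG records through,
# since records are filtered at the logger before reaching the handler.
loggers:
  my_project:        # hypothetical package name
    level: DEBUG
    handlers: [console]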
Dustin
02/08/2023, 12:58 AM
Afaque Ahmad
02/08/2023, 4:46 AM
I'm running kedro on EMR. The run fails because it is not able to find the conf folder. Is there a way to package the conf folder together when doing kedro package?
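For reference, a sketch of the usual workaround (kedro package deliberately leaves conf out of the wheel; the --conf-source option only exists in newer 0.18.x releases, so verify it against your version):

# build the wheel, then ship conf separately, e.g. as an archive
kedro package
tar -czf conf.tar.gz conf
# on the EMR side, point the run at the shipped configuration
kedro run --conf-source conf.tar.gz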
user
02/08/2023, 8:28 AM
David Pérez
02/08/2023, 10:41 AM
Szymon Czop
02/08/2023, 10:52 AM
Massinissa Saïdi
02/08/2023, 1:59 PM
I have conf/base and conf/prod environments, and my credentials.yml file is in conf/local. If I run kedro run, will conf/local/credentials.yml overwrite conf/base/credentials.yml? And if I run kedro run --env prod, which credentials file will be used? I have the impression that the local file is always used. Thank you
Oscar Villa
02/08/2023, 9:42 PM
Ankar Yadav
02/09/2023, 12:18 PM
KeyError: "logging"
I immediately get this message as soon as I run the pipeline. Any idea why this is happening?
user
02/09/2023, 2:18 PM
Chouaib Nemri
02/09/2023, 4:39 PM
Jorge sendino
02/09/2023, 5:15 PM
Is there a way to modify ConfigLoader to namespace catalog and parameter entries using the folder structure inside conf? For example, I have:
conf/
  catalog/
    ns1/
    ns2/
  parameters/
    ns1/
    ns2/
Ideally I would modify ConfigLoader to automatically add ns1 and ns2 as namespaces for all entries in the catalog and parameters below that folder. Is this possible?
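A very rough sketch of one direction (untested; note that the framework calls get() with its own glob patterns, so a robust version would have to avoid double-loading the namespaced files):

# Rough sketch (untested): prefix entries found under catalog/<ns>/ and
# parameters/<ns>/ with "<ns>.". Namespace names are hypothetical.
from kedro.config import ConfigLoader, MissingConfigException

class NamespacedConfigLoader(ConfigLoader):
    NAMESPACES = ("ns1", "ns2")  # could also be discovered by scanning conf/

    def get(self, *patterns):
        merged = super().get(*patterns)
        kind = "catalog" if any("catalog" in p for p in patterns) else "parameters"
        for ns in self.NAMESPACES:
            try:
                ns_config = super().get(f"{kind}/{ns}/*")
            except MissingConfigException:
                continue
            # caveat: these entries may also appear unprefixed via the original
            # patterns; a real implementation would drop those duplicates
            merged.update({f"{ns}.{key}": value for key, value in ns_config.items()})
        return merged

Registering it would then be a matter of setting CONFIG_LOADER_CLASS = NamespacedConfigLoader in settings.py.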
Sebastian Pehle
02/10/2023, 12:06 AM
Andrew Stewart
02/10/2023, 5:10 AM
Wojciech Szenic
02/10/2023, 6:37 AM
I'd like my predict pipeline to output predictions. Ideally, this predict pipeline could be run as kedro run --pipeline=predict --date=2023-01-05, and this would ingest the dataset for the 5th of January 2023 and run the prediction on it.
I'm wondering: how can I pass the CLI argument into the dataset catalog?
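There is no --date flag as such, but a hedged sketch of one common direction: feed kedro run --params values into a templated catalog entry (untested; whether runtime params are forwarded to the config loader this way should be verified for your Kedro version):

# settings.py -- rough sketch (untested): expose `kedro run --params` values
# as template variables for catalog.yml, e.g. filepath: data/input_${date}.parquet
from kedro.config import TemplatedConfigLoader

class RuntimeTemplatedConfigLoader(TemplatedConfigLoader):
    def __init__(self, conf_source, env=None, runtime_params=None, **kwargs):
        globals_dict = dict(kwargs.pop("globals_dict", None) or {})
        globals_dict.update(runtime_params or {})  # assumption: CLI --params arrive here
        super().__init__(
            conf_source,
            env=env,
            runtime_params=runtime_params,
            globals_dict=globals_dict,
            **kwargs,
        )

CONFIG_LOADER_CLASS = RuntimeTemplatedConfigLoader

The invocation would then look something like kedro run --pipeline=predict --params date:2023-01-05 (0.18.x uses the key:value syntax).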
Jong Hyeok Lee
02/10/2023, 9:32 AM
Sergei Benkovich
02/12/2023, 8:35 PM
Olivia Lihn
02/13/2023, 1:49 PM
I'm trying to overwrite parameters in a before_pipeline_run hook, but the new values never seem to reach the pipeline:
def before_pipeline_run(self, run_params, catalog: DataCatalog) -> None:
    """Change feature inclusion parameters for scoring pipeline."""
    if run_params["pipeline_name"] == "scoring":
        # retrieve feature_list from catalog
        feature_list_df = catalog.load("modeling.feature_selection_report")
        feature_list = list(feature_list_df[feature_list_df.selected == True].feature.unique())
        # get list of feature engineering pipelines
        params = catalog.load("parameters")
        feateng_pipes = [fteng_name for fteng_name in params.keys() if fteng_name.endswith("_fteng")]
        # overwrite parameters
        for pipeline in feateng_pipes:
            catalog.add_all(
                {f"params:{pipeline}.feature_inclusion_params.feature_list": feature_list,
                 f"params:{pipeline}.feature_inclusion_params.enable_regex": True},
                replace=True,
            )
I also tried using run_params["params"] without any luck, and tried returning the catalog, but no luck. The hook runs (tested with print statements), so my guess is I'm missing something. Thanks!
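One hedged guess (based on the DataCatalog API, not verified against this project): add_all expects dataset objects, whereas raw values such as a plain list need to be wrapped in a MemoryDataSet, which add_feed_dict does automatically:

# Sketch: add_feed_dict wraps plain Python values in MemoryDataSet,
# so the params:... entries remain loadable by downstream nodes.
catalog.add_feed_dict(
    {
        f"params:{pipeline}.feature_inclusion_params.feature_list": feature_list,
        f"params:{pipeline}.feature_inclusion_params.enable_regex": True,
    },
    replace=True,
)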
Rob
02/13/2023, 4:52 PM
{category}_output_1.parquet, {category}_output_2.parquet and so on...
Any alternative suggestion is welcome 🙂
Akshay
02/14/2023, 5:03 AM
/mnt/testmount/data/05_model_input/partitions
Details:
I am running Kedro pipelines in an Azure Databricks notebook. There are 4 pipelines in the project. The first two, Parse and Clean, work fine: they read the raw data from ADLS, do the transformation and write the data back to ADLS.
The third pipeline, 'optimize', has a Spark dataset as input and generates 2 outputs: a PartitionedDataSet and a transformed pandas DataFrame.
Optimize.partition@spark:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *spark_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/partitions
model_input@pandas:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *pandas_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/model_data
Note: the pipeline works fine when run in a local environment.
Kedro = 0.18.3
Python = 3.8.10
Cluster = Spark 3.2.1
Filip Wójcik
02/14/2023, 9:40 AM
Is there a Kedro-idiomatic way to append to an existing dataset? I've tried pandas.CSVDataSet with save_args: mode: "a" and PartitionedDataSet, but every time the dataset is overwritten.
I cannot find any such case in the docs. Should I create my own implementation, deriving from AbstractDataSet?
I've heard from many fellow DS Kedro users that a similar use case happens from time to time, so I'm probably not alone.
Thanks in advance, and best regards. Kedro is an absolute blast!
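A minimal sketch of the AbstractDataSet route mentioned above (untested, class name hypothetical):

# Minimal sketch (untested): a CSV dataset that appends on every save
# instead of overwriting, derived from AbstractDataSet.
from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataSet

class AppendableCSVDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        # write the header only when creating the file, then keep appending
        write_header = not self._filepath.exists()
        data.to_csv(self._filepath, mode="a", header=write_header, index=False)

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath)}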
Filip Panovski
02/14/2023, 12:58 PM
Say my catalog contains input, output and wrong entries. wrong has a configuration problem (e.g. no credentials could be found), but I'm running a pipeline mypipeline which only uses input and output.
Why does kedro run --pipeline mypipeline fail if wrong is configured improperly in this case? I get that you usually want to be able to view the entire catalog, but is --pipeline <...> not enough information to let Kedro know that I potentially don't want that?
Zirui Xu
02/14/2023, 3:01 PM
Is setuptools a Kedro dependency? It gets ignored when I pip-compile a requirements.in that contains kedro, because setuptools is "considered to be unsafe in a requirements file".
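Possibly relevant: pip-tools will emit such packages when asked (a standard pip-tools flag; whether Kedro actually needs setuptools at runtime is not confirmed here):

pip-compile --allow-unsafe requirements.in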
FlorianGD
02/14/2023, 4:32 PM
Is there a reason pandas.ParquetDataSet does not use pandas all the time? I would like to use it for partitioned data, and I want to use the filters argument that pandas.read_parquet provides, but it is not available for pyarrow.parquet.ParquetDataset.read. Doing a quick test, using pd.read_parquet every time seems to work OK, even though it does not behave exactly the same.
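A minimal sketch of the workaround described (untested, class name hypothetical): a custom dataset that always delegates to pandas.read_parquet so that filters stays available for partitioned data:

# Minimal sketch (untested): always go through pandas.read_parquet, which
# supports `filters` for partitioned data via its pyarrow engine.
import pandas as pd
from kedro.io import AbstractDataSet

class PandasOnlyParquetDataSet(AbstractDataSet):
    def __init__(self, filepath: str, load_args: dict = None, save_args: dict = None):
        self._filepath = filepath
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _load(self) -> pd.DataFrame:
        return pd.read_parquet(self._filepath, **self._load_args)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_parquet(self._filepath, **self._save_args)

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "load_args": self._load_args}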