Jo Stichbury
11/20/2022, 4:11 PM
user
11/20/2022, 9:38 PM
Leo Casarsa
11/21/2022, 1:57 PM
Ahmed Afify
11/21/2022, 3:40 PM
Francisca Grandón
11/21/2022, 7:16 PM
user
11/21/2022, 7:48 PM
Zihao Xu
11/21/2022, 11:16 PM
INFO     Loading data from 'modeling.model_best_params_' (JSONDataSet)...  data_catalog.py:343
DataSetError: Loading not supported for 'JSONDataSet'
where we have the following catalog entry:
modeling.model_best_params_:
type: tracking.JSONDataSet
filepath: "${folders.tracking}/model_best_params.json"
layer: reporting
The same code runs completely fine locally, but it is failing within Databricks.
Could you please help us understand why?
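For context, Kedro's tracking datasets are save-only: loading one raises exactly the DataSetError shown above. A minimal sketch (node and function names are hypothetical) of how such an entry is normally populated, by returning a plain dict from a node rather than reading the dataset back:
from kedro.pipeline import Pipeline, node

def select_best_params() -> dict:
    # Hypothetical stand-in for the real hyperparameter search.
    return {"learning_rate": 0.1, "max_depth": 5}

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=select_best_params,
                inputs=None,
                # written to the tracking.JSONDataSet entry; no node should consume it
                outputs="modeling.model_best_params_",
            ),
        ]
    )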
Moinak Ghosal
11/22/2022, 8:30 AM
Ankar Yadav
11/22/2022, 11:49 AM
Ankar Yadav
11/22/2022, 1:04 PM
Andreas Adamides
11/23/2022, 12:09 PM
Kedro now uses the Rich library to format terminal logs and tracebacks.
Is there any way to revert to plain console logging and not use Rich logging when running a Kedro pipeline using the SequentialRunner from the API rather than via the Kedro CLI?
runner = SequentialRunner()
runner.run(pipeline_object, catalog, hook_manager)
I tried to look for configuration options, but I believe you can only add configuration if you are in a Kedro project and intend to run with the Kedro CLI.
Any ideas?
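One possible approach when driving the runner from plain Python, sketched below using only the standard library (pipeline_object, catalog and hook_manager are the same objects as in the snippet above; this is not an official toggle): configure ordinary logging handlers before the run so console output bypasses any Rich handler set up elsewhere.
import logging

from kedro.runner import SequentialRunner

# Plain stdlib console logging; force=True replaces any handlers already installed.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    force=True,
)

runner = SequentialRunner()
runner.run(pipeline_object, catalog, hook_manager)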
Afaque Ahmad
11/24/2022, 9:43 AM
I want a cache made available inside the _load method of multiple Kedro Datasets. How do I go about it? Can we use hooks, or is there anything simpler?
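One possible pattern, sketched below with hypothetical class and dataset names (Kedro also ships a built-in kedro.io.CachedDataSet wrapper that may already cover simpler cases): a thin custom dataset whose _load consults a cache shared at class level, so several catalog entries reuse the same store.
from typing import Any, Dict

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import AbstractDataSet

class CachedCSVDataSet(AbstractDataSet):
    _shared_cache: Dict[str, Any] = {}  # shared by every instance of this class

    def __init__(self, filepath: str):
        self._filepath = filepath
        self._wrapped = CSVDataSet(filepath=filepath)

    def _load(self) -> Any:
        if self._filepath not in self._shared_cache:
            self._shared_cache[self._filepath] = self._wrapped.load()
        return self._shared_cache[self._filepath]

    def _save(self, data: Any) -> None:
        self._wrapped.save(data)
        self._shared_cache[self._filepath] = data

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "cached": self._filepath in self._shared_cache}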
Fabian
11/24/2022, 12:13 PM
Jose Alejandro Montaña Cortes
11/24/2022, 7:40 PM
Afaque Ahmad
11/25/2022, 6:53 AM
I have a get_spark function inside the ProjectContext which I need to access in the register_catalog hook. How can I access that function?
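One way to sidestep that coupling, sketched below under the assumption that get_spark only builds and returns a SparkSession (module path and names are hypothetical): keep it as a plain module-level helper that both the ProjectContext and the hook import, instead of reaching into the context from the hook.
# e.g. src/my_project/spark_utils.py -- importable from context.py and hooks.py alike
from pyspark.sql import SparkSession

def get_spark() -> SparkSession:
    # Returns the active session if one exists, otherwise creates it.
    return SparkSession.builder.appName("my_project").getOrCreate()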
Elias
11/25/2022, 10:13 AM
kedro.io.core.DataSetError:
__init__() got an unexpected keyword argument 'table_name'.
DataSet 'inspection_output' must only contain arguments valid for the constructor of `kedro.extras.datasets.pandas.sql_dataset.SQLQueryDataSet`.
Elias
11/25/2022, 10:13 AM
catalog.yml:
inspection_output:
type: pandas.SQLQueryDataSet
credentials: postgresql_credentials
table_name: shuttles
layer: model_output
save_args:
index: true
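For reference, table_name is a constructor argument of SQLTableDataSet, not SQLQueryDataSet (which expects a sql query or a filepath instead), which is what the error above is pointing at. A minimal Python-form sketch of the table-based dataset, with a placeholder connection string:
from kedro.extras.datasets.pandas import SQLTableDataSet

inspection_output = SQLTableDataSet(
    table_name="shuttles",
    credentials={"con": "postgresql://user:password@host:5432/db"},  # placeholder
    save_args={"index": True},
)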
Elias
11/25/2022, 10:13 AM
Shreyas Nc
11/25/2022, 10:26 AM
from kedro.io import DataCatalog
from kedro.extras.datasets.pillow import ImageDataSet
io = DataCatalog(
    {
        "cauliflower": ImageDataSet(filepath="data/01_raw/cauliflower"),
    }
)
But I don't see this in the catalog, and I get an error when I reference it in a pipeline node that the entry doesn't exist in the catalog.
Am I missing something here?
Note: this is on the latest version of Kedro (kedro, version 0.18.3).
I just joined the channel, so if I am not using the right format or channel to ask this question, please let me know.
Thanks in advance!
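Worth noting as context, not a diagnosis: a DataCatalog constructed in Python is a standalone object, and kedro run only sees datasets declared in conf/<env>/catalog.yml (or registered through hooks). A minimal sketch of inspecting the programmatic catalog directly:
from kedro.io import DataCatalog
from kedro.extras.datasets.pillow import ImageDataSet

io = DataCatalog(
    {"cauliflower": ImageDataSet(filepath="data/01_raw/cauliflower")}
)

print(io.list())                # ['cauliflower'] -- the entry exists on this object
image = io.load("cauliflower")  # but `kedro run` builds its own catalog from YAML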
Anu Arora
11/25/2022, 1:45 PM
When running dbx execute <workflow-name> --cluster-id=<cluster-id>, Kedro is failing with the error below:
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f0037269-19cc-4c81-9dc2-43bcd22cd8ff/lib/python3.8/site-packages/kedro/framework/startup.py in _get_project_metadata(project_path)
64
65 if not pyproject_toml.is_file():
---> 66 raise RuntimeError(
67 f"Could not find the project configuration file '{_PYPROJECT}' in {project_path}. "
68 f"If you have created your project with Kedro "
RuntimeError: Could not find the project configuration file 'pyproject.toml' in /databricks/driver.
I can see that the file was never packaged, but I am not sure whether it was supposed to be packaged or not. Plus, it is somehow pointing to /databricks/driver as the working directory. Below is the Python file I am running as the spark_python_task:
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

package_name = "project_comm"
configure_project(package_name)

with KedroSession.create(package_name, env="base") as session:
    session.run()
Any help would be great!!
PS: I have tried with dbx deploy and launch as well and am still facing the same issue.
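One thing that may be worth checking, sketched below as an assumption rather than a confirmed fix: KedroSession.create accepts an explicit project_path, which stops Kedro from falling back to the current working directory (here /databricks/driver) when it looks for pyproject.toml. The path below is a placeholder and assumes the project files are shipped to the cluster.
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

package_name = "project_comm"
configure_project(package_name)

with KedroSession.create(
    package_name,
    project_path="/dbfs/FileStore/project_comm",  # placeholder: wherever pyproject.toml lives
    env="base",
) as session:
    session.run()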
Karl
11/26/2022, 12:27 AM
Fabian
11/26/2022, 1:27 PM
Yousri
11/28/2022, 3:27 PM
python3 -m project_name.run
But I have a question about parameters: when I run the packaged project, I can no longer pass parameters to the project or modify parameters.yml. So my question is: how do I pass arguments when I run a packaged Kedro project?
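One possible approach for a packaged project, sketched below with a hypothetical package name and parameter: run it from a small driver script and pass runtime values through extra_params on KedroSession.create, which overrides the corresponding entries from parameters.yml for that run only.
import sys

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

configure_project("project_name")

with KedroSession.create(
    "project_name",
    extra_params={"test_size": float(sys.argv[1])},  # hypothetical parameter
) as session:
    session.run()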
Afaque Ahmad
11/29/2022, 6:56 AM
How can I access the catalog dict in the after_node_run hook?
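For reference, the after_node_run hook specification already receives the catalog as an argument, so an implementation can simply declare it. A minimal sketch (the class name is arbitrary):
from kedro.framework.hooks import hook_impl

class CatalogInspectionHooks:
    @hook_impl
    def after_node_run(self, node, catalog, inputs, outputs):
        # e.g. list which datasets are registered once this node has run
        print(node.name, catalog.list())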
Fabian
11/29/2022, 9:59 AM
Hi Team,
Another beginner's question: I have created a pipeline that nicely analyzes my DataFrame. Now I am adding a new level of complexity to my DataFrame and want to execute the pipeline on each level, similar to a function in groupby.apply.
Can I do this without modifying the pipeline itself? E.g., splitting the DataFrame ahead of the pipeline and re-merging it afterwards, while leaving the existing pipeline as it is?
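One pattern that may fit, sketched below with hypothetical dataset and namespace names and assuming an existing, unmodified base_pipeline: instantiate the pipeline once per group with the modular-pipeline pipeline() wrapper, remapping only its inputs and outputs, and do the splitting and re-merging in small nodes outside it.
from functools import reduce
from operator import add

from kedro.pipeline import pipeline

groups = ["level_a", "level_b"]  # hypothetical grouping levels

per_group_pipelines = [
    pipeline(
        base_pipeline,                                # the existing pipeline, untouched
        inputs={"input_df": f"{group}.input_df"},     # remap its input dataset
        outputs={"result_df": f"{group}.result_df"},  # and its output dataset
        namespace=group,
    )
    for group in groups
]

full_pipeline = reduce(add, per_group_pipelines)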
Ankar Yadav
11/29/2022, 11:37 AM
Balazs Konig
11/29/2022, 3:09 PM
I'd like to filter configuration in two ways:
1. by environment (conf/dev/)
2. by pipeline (conf/base/data_connectors/xyz/)
Is there a simple way to achieve this double filter without much hacking?
Jan
11/30/2022, 10:29 AM
Is there a way to make sure that conf/base will not be loaded? I would like to do something like kedro run --env=prod, and in the prod env I have a catalog that is prefixed (e.g. file: data/prod/01_raw/file.txt) so that I can keep the prod data separated. I would like to avoid leakage of development data into the prod env. For example, if I add a new step and create a new entry in the data catalogue (base) and forget to add this entry in the prod catalog, it will be used later on in the prod environment by default because it is not overwritten. Instead, I would like to get an error or implicitly use a MemoryDataset; in other words: don't load conf/base. Does this make sense? 😄
Edit: Just realizing that this behaviour would be possible if I just used conf/base as the prod env and always developed in a conf/dev env. However, ideally I would like to use conf/base by default and only work in prod by specifying it explicitly, to avoid mistakenly changing something there 🤔
Qiuyi Chen
11/30/2022, 6:35 PM
from typing import Dict

import pandas as pd
from pyspark.sql import DataFrame

def function_a(params: Dict, *df_lst: DataFrame):
    report = pd.DataFrame()
    for df in df_lst:
        temp = function(df, params)  # `function` is a helper defined elsewhere
        report = pd.concat([report, temp])
    return report
I can run the function like this:
function_a(params, df1, df2, df3)
But in the pipeline, how can I define the node and catalog in this situation? Here is what I did; please let me know where I went wrong:
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", "df_lst"],
                outputs="report",
            ),
        ]
    )

catalog = DataCatalog(
    data_sets={"df_lst": df1},
    feed_dict={"params": params},
)
I can only run the pipeline when df_lst is just one dataframe, but I do want it to be something like “df_lst”: df_1, df_2, df_3, ..., df_n (n > 3).
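A minimal sketch of one way this could be wired (dataset names df_1, df_2, df_3 are hypothetical catalog entries): listing each DataFrame as a separate node input passes them positionally, so they land in the *df_lst varargs of function_a.
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", "df_1", "df_2", "df_3"],  # each DataFrame is its own entry
                outputs="report",
            ),
        ]
    )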
Fabian
12/01/2022, 10:59 AM