Jo Stichbury
12/01/2022, 11:14 AM
reporting.
More context:
I took the basic spaceflights starter as it will be after 0.18.4, which means I stripped out the namespaces/modular pipelines, so the example code is more straightforward. You can see the starter on the repo here (when we put out release 0.18.4 it'll be merged and available immediately via kedro new --starter=spaceflights), but right now you'll need to use kedro new --starter=spaceflights --checkout=68a27db42335366b07f9362f677d69684ec4e942
OK, so here's my example code with a reporting pipeline, but when I run kedro run and then kedro viz, I see a different graphic from the one in the docs:
TL;DR -- what are the questions?
Q1: Is this viz correct? If it is not supposed to look like this, please roast my pipeline.
Q2: I tried to save my visualisation with kedro viz --save-file my_shareable_pipeline.json, but when I then reload it with kedro viz --load-file my_shareable_pipeline.json, I don't see the chart. So question 2 is: what's wrong with my viz?
Thanks in advance for any advice. LMK if you need more information.
shawn
12/01/2022, 3:34 PM
shawn
12/01/2022, 3:38 PM
ValueError: Given configuration path either does not exist or is not a valid directory: /databricks/driver/conf/base
Q1: Is the issue due to the .whl file itself or the way I am configuring the job?
Q2: Would this be due to a permissions issue on the environment I am using?
shawn
12/01/2022, 3:38 PM
Jan
12/02/2022, 8:17 AM
run_only_missing
as _hook_manager._ Can anyone assist? 🙂
Anu Arora
12/02/2022, 3:58 PM
Eugene P
12/02/2022, 4:41 PM
sample_sql_query_data:
  type: pandas.SQLQueryDataSet
  credentials: postgres_re_db
  sql: SELECT * FROM rr_norm.sample_gov_torgi
Unfortunately, the number of queries grows fast and catalog.yaml starts bloating with long query strings. Also, it doesn't seem like a good idea to keep SQL query strings within catalog.yaml itself, for reproducibility.
What would be the most Kedro-ic/Pythonic approach to extracting queries from catalog.yaml into a separate folder/module? AFAIK (or as I understood from googling), YAML doesn't natively have include/import features?
shawn
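One option for the SQL-in-catalog question above (a sketch, assuming you are on a Kedro version whose pandas.SQLQueryDataSet accepts a filepath argument as an alternative to sql; the sql/ folder and file name are illustrative) is to keep each query in its own .sql file and point the catalog entry at it:

```yaml
# catalog.yaml -- query now lives in its own file instead of inline
sample_sql_query_data:
  type: pandas.SQLQueryDataSet
  credentials: postgres_re_db
  filepath: sql/sample_gov_torgi.sql
```

The query files can then be versioned and reviewed like any other source file, and catalog.yaml stays short.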
12/05/2022, 3:07 PM
ValueError: Given configuration path either does not exist or is not a valid directory: /databricks/driver/conf/base
marrrcin
12/06/2022, 8:45 AM
0.18.4 is already on PyPI, but the starters are not tagged yet, making our CI/CD pipelines fail:
kedro.framework.cli.utils.KedroCliError: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.4. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3.
Can we expect tagging today? 🤔 Maybe there should be some fallback mechanism for Kedro starters to use versioning similar to Python's (e.g. ~=0.18.0, but for tags).
Yifan
12/06/2022, 10:40 AM
Pallavi Kumari
12/06/2022, 11:41 AM
user
12/06/2022, 12:18 PM
Fabian
12/07/2022, 9:14 AM
Jan
12/07/2022, 9:30 AM
Fabian
12/07/2022, 12:06 PM
Olivia Lihn
12/07/2022, 12:29 PM
OperationalError: (sqlite3.OperationalError) unable to open database file
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)
My guess is that the run session info cannot be saved because of writing permissions on the Databricks Repo. We have deleted logging.yml, and to be honest this is more of an annoying error (as the pipeline runs). Any ideas on how we can avoid this?
Maurits
12/07/2022, 5:29 PM
java.lang.OutOfMemoryError: Java heap space
error when storing a JSON file of 2.5M rows on AWS S3 via a Kedro pipeline. The ECS compute already has 104 GB of memory.
Any recommendations on how to configure this? Repartitioning experience? Spark config? Or a way to work around it?
Olga Chumakova
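A possible starting point for the heap-space question above (a sketch, assuming the project loads a conf/base/spark.yml into its SparkSession builder via a hook, as in Kedro's PySpark docs; the memory values are illustrative, not recommendations for this workload):

```yaml
# conf/base/spark.yml -- tune values to your cluster; these are examples
spark.driver.memory: 32g
spark.driver.maxResultSize: 16g
```

Independently of the heap settings, writing one huge JSON partition often blows the heap on its own; repartitioning the DataFrame before saving (e.g. df.repartition(100)) spreads the write across smaller tasks.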
12/07/2022, 9:33 PM
Tooba Mukhtar
12/07/2022, 9:53 PM
Jaakko
12/08/2022, 8:53 AM
kedro build-reqs
but when running kedro build-reqs I get the following deprecation warning:
DeprecationWarning: Command 'kedro build-reqs' is deprecated and will not be available from Kedro 0.19.0.
How should project dependencies be managed once build-reqs is no longer available? Can the documentation be updated accordingly?
Jo Stichbury
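For the build-reqs question above: kedro build-reqs is a thin wrapper around pip-compile from pip-tools, so one likely migration path (a sketch; the src/requirements.txt location is an assumption based on the default 0.18.x project layout) is to call pip-tools directly:

```shell
# install pip-tools, then compile top-level requirements into a pinned lock file
pip install pip-tools
pip-compile src/requirements.txt --output-file src/requirements.lock
# install the pinned dependencies
pip install -r src/requirements.lock
```

This keeps the same workflow (loose top-level requirements in, fully pinned lock file out) without depending on the deprecated Kedro command.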
12/08/2022, 10:39 AM
Shreyas Nc
12/08/2022, 1:09 PM
imageset:
  type: PartitionedDataSet
  dataset:
    type: pillow.ImageDataSet
  path: <path_to_data>
  filename_suffix: ".jpg"
I'm getting the error below:
kedro.io.core.DataSetError:
Object 'ImageDataSet' cannot be loaded from 'kedro.extras.datasets.pillow'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.pillow.ImageDataSet:
<https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html>.
Failed to instantiate DataSet 'imageset' of type 'kedro.io.partitioned_dataset.PartitionedDataSet'.
Run with --verbose to see the full exception
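The error above means the optional Pillow dependency behind pillow.ImageDataSet is not installed in the environment. A likely fix (the extras syntax follows Kedro's dataset-dependency docs, so treat the exact extra name as an assumption for your version):

```shell
# install the optional dependency group for the pillow dataset
pip install "kedro[pillow.ImageDataSet]"
# or install Pillow directly
pip install Pillow
```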
Shreyas Nc
12/08/2022, 1:11 PM
Manilson António Lussati
12/09/2022, 2:19 AM
Sebastian Pehle
12/09/2022, 9:37 AM
Max S
12/09/2022, 10:26 AM
yaml
files?)
Or am I thinking about this the wrong way, and is there a good reason that this is not possible?
Thanks!
Balazs Konig
12/09/2022, 12:19 PM
schema for a SparkDataSet in the catalog entry itself? What's the best practice to represent the StructType() object in YAML?
EDIT: or is the best practice to always save the schema to a separate params file and add just the file_path to the catalog entry?
Adam_D
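For the SparkDataSet schema question above, one sketch (assuming your Kedro version's spark.SparkDataSet accepts a schema argument pointing at a JSON-serialised StructType, e.g. the output of StructType().json(); the dataset name and paths are illustrative):

```yaml
weather:
  type: spark.SparkDataSet
  filepath: data/01_raw/weather.csv
  file_format: csv
  schema:
    filepath: conf/base/schemas/weather_schema.json
```

This keeps the catalog entry small and the schema in a separate, versionable file rather than hand-writing the StructType in YAML.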
12/09/2022, 3:49 PM
John Melendowski
12/10/2022, 12:41 AM
Mathilde Lavacquery
12/12/2022, 2:54 PM
def register_pipelines():
    countries = ["a", "b"]
    brands = ["1", "2", "3"]
    return {
        "preprocess_macro": preprocess_macro_pipeline(countries=countries),
        "preprocess_brand": preprocess_brand_pipeline(countries=countries, brands=brands),
        "train_model": train_model_pipeline(countries=countries, brands=brands),
    }
and my catalog looks like this:
{% for country in ["a", "b"] %}
{% for brand in ["1", "2", "3"] %}
{{ country }}.pre_master_macro:
  ...
{{ country }}.{{ brand }}.master:
  ...
{{ country }}.{{ brand }}.model:
  ...
{% endfor %}
{% endfor %}
Would there be a way to pass countries / brands once and have both pick them up?
The use case is that we are developing a generic pipeline that can be replicated in different regions / for different brands, according to the client.