Elias
10/18/2022, 1:00 PM
parameters.yml:
t_-0:
  filters:
    date_max: 2022/07/01
t_-1:
  filters:
    date_max: 2022/06/01
So I want to avoid doing this, as I would need to pass 12 or more variables on each induction, whereas they are actually all dependent on the first one.
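For illustration only (the start of Elias's question is cut off above), a hedged sketch of one alternative: keep a single date_max parameter and derive the other periods inside a node. The function name build_date_filters and the one-month step are assumptions based on the snippet, not Elias's actual setup.

# Hypothetical helper: derive the per-period filters from one `date_max` parameter
# instead of listing 12+ nearly identical entries in parameters.yml.
import pandas as pd

def build_date_filters(date_max: str, n_periods: int = 12) -> dict:
    """Return {'t_-0': {'filters': ...}, 't_-1': ...}, stepping back one month per period."""
    base = pd.Timestamp(date_max.replace("/", "-"))
    return {
        f"t_-{i}": {
            "filters": {"date_max": (base - pd.DateOffset(months=i)).strftime("%Y/%m/%d")}
        }
        for i in range(n_periods)
    }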
user
10/19/2022, 2:38 PM
Sean Westgate
10/19/2022, 3:00 PM
kedro build-docs pulled down the latest jinja2, version 3.1.2, which then caused an error because contextfunction was removed in version 3.1.0. I manually downgraded jinja2 to version 3.0.3 and everything worked fine. Not sure if it is just me or a general issue.
Is posting bugs like this here the right thing to do? I had a look at your open issues on the GitHub repo but couldn't find anything related.
Shubham Gupta
10/20/2022, 3:40 AM
Shubham Gupta
10/20/2022, 3:43 AM
Shubham Gupta
10/20/2022, 3:44 AM
Suryansh Soni
10/21/2022, 1:56 PM
Ian Whalen
10/21/2022, 5:01 PM
Jinja black magic 🧙
High level: I want to add a global variable to globals_dict in settings.py and use it in a loop in my catalog.
See thread for an example.
Any ideas?
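For illustration, a hedged sketch of how this is commonly wired up with TemplatedConfigLoader (details vary by Kedro version, and this is not necessarily what Ian ended up doing). Jinja2 rendering happens before the ${...} global substitution, so in this sketch the loop collection lives inline in the template while globals_dict only feeds ${...} placeholders; base_path and the speed list are placeholders.

# settings.py (sketch)
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_dict": {"base_path": "data/01_raw"},  # hypothetical global
}

# catalog.yml (sketch)
{% for speed in ["fast", "slow"] %}
{{ speed }}_cars:
  type: pandas.CSVDataSet
  filepath: ${base_path}/{{ speed }}_cars.csv
{% endfor %}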
Jordan
10/21/2022, 8:49 PM
catalog.yml entries for the inputs and outputs of each batch process.
Therefore, I was hoping to implement a solution using hooks that would avoid this tedium. If possible, I would like the solution to:
1. Dynamically populate the catalog with input and output entries for each partitioned dataset.
2. Instantiate and run the modular pipeline using each partitioned dataset's dynamically populated catalog entries.
3. Make the output datasets of each run available via the data catalog at any time.
This should (maybe) be possible with some combination of the after_context_created, after_catalog_created and before_pipeline_run hooks, but I'm unsure how to actually implement this.
Any guidance would be much appreciated, cheers.
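For illustration, a minimal, hypothetical sketch of point 1 only (not a full answer): an after_catalog_created hook that registers one input and one output entry per partition. The partition ids, file paths and dataset type are placeholders, the hook would still need to be registered in HOOKS in settings.py, and points 2 and 3 would need additional before_pipeline_run logic on top.

# hooks.py (sketch)
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.framework.hooks import hook_impl

class DynamicCatalogHooks:
    PARTITIONS = ["batch_a", "batch_b"]  # placeholder; would normally be discovered dynamically

    @hook_impl
    def after_catalog_created(self, catalog):
        # Add one input and one output entry per partition to the catalog.
        for part in self.PARTITIONS:
            catalog.add(f"{part}_input", CSVDataSet(filepath=f"data/01_raw/{part}.csv"))
            catalog.add(f"{part}_output", CSVDataSet(filepath=f"data/07_model_output/{part}.csv"))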
user
10/23/2022, 7:58 AM
user
10/24/2022, 8:18 AM
Yetunde
10/24/2022, 8:52 AM
user
10/25/2022, 8:18 AM
>> ds = GenMsmtsDataSet()
>> catalog.add('ipy_msmts', ds)
>> session.run(pipeline_name='sim', from_inputs=['ipy_msmts', 'params:simulation'])
ValueError: Pipeline does not contain data_sets named...
Toni
10/25/2022, 10:35 AM
node(
    func=some_function,
    inputs="some_input",
    outputs="the_output",
    name="node",
),
the_output:
  type: pandas.CSVDataSet
  filepath: data/output_csv.csv
the_output:
  type: pandas.ParquetDataSet
  filepath: data/output_parquet.parquet
Luis Gustavo Souza
10/25/2022, 1:03 PM
Yuchu Liu
10/25/2022, 1:40 PM
Danhua Yan
10/25/2022, 2:00 PM
delta datasets created by Databricks in pandas. The current configs look like below:
_pandas_parquet: &_pandas_parquet
  type: pandas.ParquetDataSet

_spark_parquet: &_delta_parquet
  type: spark.SparkDataSet
  file_format: delta
What I want to achieve:
node1:
  outputs: dataset@spark
node2:
  inputs: dataset@pandas
Unfortunately pandas doesn't support reading delta as is. I found the workaround below, which requires additional steps: https://mungingdata.com/pandas/read-delta-lake-dataframe/
How should I create a dataset that does something like this internally when it is being loaded?
from deltalake import DeltaTable
dt = DeltaTable("resources/delta/1")
df = dt.to_pandas()
I tried looking into https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction, but nothing is mentioned there about using pandas to interact with delta. Thank you!
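For illustration, a hedged sketch of a minimal custom dataset (not an official Kedro dataset) that wraps the deltalake workaround above, so a transcoded dataset@pandas entry can load a Delta table as a pandas DataFrame. The class and module names are hypothetical, and saving is deliberately left unimplemented so the Spark side of the transcoded pair owns the writes.

# pandas_delta.py (sketch)
from typing import Any, Dict

import pandas as pd
from deltalake import DeltaTable
from kedro.io import AbstractDataSet

class PandasDeltaDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pd.DataFrame:
        # Same as the workaround above: open the Delta table and materialise it as pandas.
        return DeltaTable(self._filepath).to_pandas()

    def _save(self, data: pd.DataFrame) -> None:
        raise NotImplementedError("Read-only sketch; writes go through the @spark dataset")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath}

# catalog.yml (sketch; module path is hypothetical)
dataset@pandas:
  type: my_project.extras.datasets.pandas_delta.PandasDeltaDataSet
  filepath: /dbfs/path/to/delta/table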
Denis Araujo da Silva
10/25/2022, 2:40 PM
kedro build-reqs? Just saw the message that it will be deprecated in 0.19.
Sasha Collin
10/26/2022, 9:53 AM
- 05_model_input (folder)
--- master_table_1 (folder)
------ master_table_1.csv (file)
------ split_1 (folder)
--------- X_train.csv
--------- X_test.csv
--------- y_train.csv
--------- y_test.csv
------ split_2 (folder)
--------- X_train.csv
--------- X_test.csv
--------- y_train.csv
--------- y_test.csv
Would you say this is good practice? Or would you advise not saving the splits and instead parametrising the selected split method in parameters.yml, for instance?
Thanks a lot for your help!
Nichita Morcotilo
10/26/2022, 10:02 AM
conf/base, conf/test, and conf/local (empty directory).
My conf/test/pipelines.yml is an empty file, and executing kedro run --env=test results in creating folders in the /data directory for each of the nodes listed in conf/base/pipelines.yml.
Is this expected behavior for Kedro? I mean, if one of the environments has an empty pipelines.yml, does it fall back to the base env?
Thank you!
Erwin
10/26/2022, 11:43 AM
branch = git.Repo(search_parent_directories=True).a
Is there any way to disable experiment tracking at runtime? Or what would be a better approach to check whether kedro can at least create the graph and detect circular dependencies?
Detailed log:
Run kedro run --tag tag_dict
As an open-source project, we collect usage analytics.
We cannot see nor store information contained in a Kedro project.
You can find out more by reading our privacy notice:
<https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice>
Do you opt into usage analytics? [y/N]: [10/25/22 20:27:24] WARNING Failed to confirm consent. No data plugin.py:210
was sent to Heap. Exception:
[10/25/22 20:27:24] INFO Kedro project session.py:343
Pipelines started
[10/25/22 20:27:24] INFO Seeding sklearn, numpy and random seed_file.py:41
libraries with the seed 42
INFO Loading data from data_catalog.py:343
'tag_dictionary'
(ExcelDataSet)...
[10/25/22 20:27:25] INFO Running node: create_td: node.py:327
create_td([tag_dictionary]) -> [td]
INFO Saving data to 'td' data_catalog.py:382
(PickleDataSet)...
INFO Completed 1 out of 1 tasks sequential_runner.py:85
INFO Pipeline execution completed runner.py:90
successfully.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/hostedtoolcache/Python/3.8.0/x64/bin/kedro:8 in <module> │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
│ ework/cli/cli.py:211 in main │
│ │
│ 208 │ """ │
│ 209 │ _init_plugins() │
│ 210 │ cli_collection = KedroCLI(project_path=Path.cwd()) │
│ ❱ 211 │ cli_collection() │
│ 212 │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
│ .py:1130 in __call__ │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
│ ework/cli/cli.py:139 in main │
│ │
│ 136 │ │ ) │
│ 137 │ │ │
│ 138 │ │ try: │
│ ❱ 139 │ │ │ super().main( │
│ 140 │ │ │ │ args=args, │
│ 141 │ │ │ │ prog_name=prog_name, │
│ 142 │ │ │ │ complete_var=complete_var, │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
│ .py:1055 in main │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
│ .py:1657 in invoke │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
│ .py:1404 in invoke │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
│ .py:760 in invoke │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
│ ework/cli/project.py:366 in run │
│ │
│ 363 │ node_names = _get_values_as_tuple(node_names) if node_names else n │
│ 364 │ │
│ 365 │ with KedroSession.create(env=env, extra_params=params) as session: │
│ ❱ 366 │ │ session.run( │
│ 367 │ │ │ tags=tag, │
│ 368 │ │ │ runner=runner(is_async=is_async), │
│ 369 │ │ │ node_names=node_names, │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
│ ework/session/session.py:293 in __exit__ │
│ │
│ 290 │ def __exit__(self, exc_type, exc_value, tb_): │
│ 291 │ │ if exc_type: │
│ 292 │ │ │ self._log_exception(exc_type, exc_value, tb_) │
│ ❱ 293 │ │ self.close() │
│ 294 │ │
│ 295 │ def run( # pylint: disable=too-many-arguments,too-many-locals │
│ 296 │ │ self, │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
│ ework/session/session.py:285 in close │
│ │
│ 282 │ │ if `save_on_close` attribute is True. │
│ 283 │ │ """ │
│ 284 │ │ if self.save_on_close: │
│ ❱ 285 │ │ │ self._store.save() │
│ 286 │ │
│ 287 │ def __enter__(self): │
│ 288 │ │ return self │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro_viz/ │
│ integrations/kedro/sqlite_store.py:68 in save │
│ │
│ 65 │ │ engine, session_class = create_db_engine(self.location) │
│ 66 │ │ Base.metadata.create_all(bind=engine) │
│ 67 │ │ database = next(get_db(session_class)) │
│ ❱ 68 │ │ session_store_data = RunModel(id=self._session_id, blob=<http://self.to|self.to> │
│ 69 │ │ database.add(session_store_data) │
│ 70 │ │ database.commit() │
│ 71 │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro_viz/ │
│ integrations/kedro/sqlite_store.py:52 in to_json │
│ │
│ 49 │ │ │ │ try: │
│ 50 │ │ │ │ │ import git # pylint: disable=import-outside-toplev │
│ 51 │ │ │ │ │ │
│ ❱ 52 │ │ │ │ │ branch = git.Repo(search_parent_directories=True).a │
│ 53 │ │ │ │ │ value["branch"] = branch.name │
│ 54 │ │ │ │ except ImportError as exc: # pragma: no cover │
│ 55 │ │ │ │ │ logger.warning("%s:%s", exc.__class__.__name__, exc │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/git/repo/b │
│ ase.py:865 in active_branch │
│ │
│ 862 │ │ :raises TypeError: If HEAD is detached │
│ 863 │ │ :return: Head to the active branch""" │
│ 864 │ │ # reveal_type(self.head.reference) # => Reference │
│ ❱ 865 │ │ return self.head.reference │
│ 866 │ │
│ 867 │ def blame_incremental(self, rev: str | HEAD, file: str, **kwargs: │
│ 868 │ │ """Iterator for blame information for the given file at the g │
│ │
│ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/git/refs/s │
│ ymbolic.py:309 in _get_reference │
│ │
│ 306 │ │ │ to a reference, but to a commit""" │
│ 307 │ │ sha, target_ref_path = self._get_ref_info(self.repo, self.path │
│ 308 │ │ if target_ref_path is None: │
│ ❱ 309 │ │ │ raise TypeError("%s is a detached symbolic reference as it │
│ 310 │ │ return self.from_path(self.repo, target_ref_path) │
│ 311 │ │
│ 312 │ def set_reference( │
╰──────────────────────────────────────────────────────────────────────────────╯
TypeError: HEAD is a detached symbolic reference as it points to
'b508cdadd62cf912ab26104388cef8e08d1066eb'
Error: Process completed with exit code 1.
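For illustration: the traceback above shows kedro-viz's SQLite session store failing while reading the active git branch on a CI checkout with a detached HEAD. A hedged sketch of one way to switch experiment tracking off at runtime, assuming it was enabled in settings.py via SESSION_STORE_CLASS = SQLiteStore as in the standard setup (the environment variable name here is hypothetical):

# settings.py (sketch)
import os
from pathlib import Path

from kedro.framework.session.store import BaseSessionStore

if os.environ.get("KEDRO_DISABLE_TRACKING"):
    # Fall back to the default in-memory store, so nothing is written on session close.
    SESSION_STORE_CLASS = BaseSessionStore
else:
    from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

    SESSION_STORE_CLASS = SQLiteStore
    SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}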
Suryansh Soni
10/26/2022, 3:17 PM
Michał Stachowicz
10/27/2022, 9:56 AM
Zirui Xu
10/27/2022, 4:34 PM
Is it possible to use kedro.extras.datasets.spark.SparkDataSet without installing the dependencies specified in kedro[spark]? I am on a Databricks cluster where the installation of pyspark is blocked.
Eivind Samseth
10/28/2022, 11:05 AM
Seth
10/28/2022, 2:45 PM
Lorenzo Castellino
10/31/2022, 8:48 AM
np.random.set_state() before the nodes that require it, but I would like to hear what your solution to "reproducible randomness" looks like 🙂
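For illustration, a minimal sketch of one common pattern (not necessarily Lorenzo's setup): a before_node_run hook that re-seeds the random number generators from a seed parameter before every node, so results don't depend on node execution order. It assumes a top-level seed entry in parameters.yml and that the hook is registered in HOOKS in settings.py.

# hooks.py (sketch)
import random

import numpy as np
from kedro.framework.hooks import hook_impl

class SeedingHooks:
    @hook_impl
    def before_node_run(self, catalog):
        # Reset the global RNG state before each node runs.
        seed = catalog.load("params:seed")
        random.seed(seed)
        np.random.seed(seed)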
Jordan
10/31/2022, 5:03 PM
credentials.yml available for use in a hook? I know I can just use the yaml loader, but this doesn't feel correct.
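For illustration, a hedged sketch of one possibility (not necessarily the recommended approach): read the credentials config once in after_context_created via the context's config_loader and keep it on the hook instance for later hooks to use. The glob patterns mirror Kedro's defaults for credentials files; behaviour may differ with other config loaders.

# hooks.py (sketch)
from kedro.framework.hooks import hook_impl

class CredentialsHooks:
    def __init__(self):
        self._credentials = {}

    @hook_impl
    def after_context_created(self, context):
        # Reuse the project's config loader instead of parsing the YAML by hand.
        self._credentials = context.config_loader.get("credentials*", "credentials*/**")

    @hook_impl
    def before_pipeline_run(self, run_params):
        # self._credentials is available here (and in any other hook on this instance).
        pass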
Pedro Arthur
10/31/2022, 7:52 PM