Yetunde
08/26/2022, 2:35 PM

Asch Harwood
10/28/2022, 1:37 PM

Filip Panovski
11/10/2022, 1:36 PM
dask.ParquetDataSet. I had a use case where I needed to parse an existing Avro schema and transform it into a pyarrow schema so that the dask.to_parquet function behaves nicely. I was wondering whether this is something the community would be interested in and would appreciate feedback.
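To make the idea concrete, here is a minimal sketch of such a conversion (not Filip's actual code; the helper name and the primitive-only type mapping are assumptions for illustration):

```python
# Sketch: map primitive Avro types to pyarrow types so an explicit
# schema can be passed to dask.dataframe.to_parquet(..., schema=...).
import json

import pyarrow as pa

# Primitive-only mapping; real Avro schemas also have records, arrays,
# multi-branch unions, logical types, etc.
AVRO_TO_PYARROW = {
    "string": pa.string(),
    "int": pa.int32(),
    "long": pa.int64(),
    "float": pa.float32(),
    "double": pa.float64(),
    "boolean": pa.bool_(),
    "bytes": pa.binary(),
}

def avro_to_pyarrow_schema(avro_schema_json: str) -> pa.Schema:
    """Convert a primitive-only Avro record schema to a pyarrow schema."""
    avro_schema = json.loads(avro_schema_json)
    fields = []
    for f in avro_schema["fields"]:
        avro_type = f["type"]
        # Treat ["null", T] unions as a nullable field of type T.
        nullable = isinstance(avro_type, list) and "null" in avro_type
        if nullable:
            avro_type = next(t for t in avro_type if t != "null")
        fields.append(pa.field(f["name"], AVRO_TO_PYARROW[avro_type], nullable=nullable))
    return pa.schema(fields)
```

The resulting schema could then be passed as dask_df.to_parquet(path, schema=...) so the pyarrow engine doesn't have to infer column types itself.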
Sean Westgate
11/28/2022, 3:42 PM
As the docs command (kedro build-docs) will be discontinued with version 0.19, I wonder if a plugin should fill its place. @Asch Harwood suggested that there is a need for communicating with non-technical stakeholders - are there more users thinking this way? Would a plugin that assists with the creation of documentation be useful?
To aid discussion, I played around with a prototype static site as an example of project documentation. I used the Kedro spaceflights tutorial project as a base, and you can explore the finished documentation here. Given that the Kedro framework defines much of the information needed for project documentation, I think it would be pretty straightforward to create a plugin (see the sketch after this list) that would:
- create the basic documentation structure
- fill in details about pipelines, nodes, data and parameters automatically
- insert an interactive Kedro-Viz graph
- provide empty templates for writing additional notes
- show an example of how to publish as a static website, for example to GitHub Pages
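Mechanically, such a plugin could hook into the Kedro CLI the way existing plugins do, via the kedro.project_commands entry point. A minimal sketch, with a hypothetical package and command name:

```python
# kedro_docs_plugin/plugin.py -- hypothetical skeleton; the package,
# group and command names are made up for illustration.
import click

@click.group(name="docs")
def commands():
    """Kedro plugin providing project documentation commands."""

@commands.command(name="docs")
@click.pass_obj
def build_docs(metadata):
    """Scaffold and build a static documentation site for this project."""
    # metadata is the ProjectMetadata object Kedro passes to project
    # commands; a real implementation would walk the pipeline registry,
    # catalog and parameters from here.
    click.echo(f"Building docs for {metadata.project_name}...")
```

registered in the plugin's pyproject.toml:

```toml
[project.entry-points."kedro.project_commands"]
docs = "kedro_docs_plugin.plugin:commands"
```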
Just to clarify, this is not a real plugin yet, just a "fake" output for discussion. I would like to:
- find out whether such a plugin would be useful
- find a few projects other than spaceflights that could be used during development/specification
- get clarity on the desired functionality
- find collaborators interested in building it
If you want to find out more, you can also clone the project repo. There are instructions in the manual on how to build the docs locally and use them. Leave your feedback here on Slack, or for concrete ideas please create issues in the repo.
Looking forward to hearing from you
Sean

Yetunde
12/06/2022, 3:02 PM

Matthias Roels
01/14/2023, 9:04 PM

Andrew Stewart
01/21/2023, 7:30 AM

Leo Casarsa
02/09/2023, 5:09 PM
Workflow Orchestration tools..
I have been crushing through the documentation of a bunch of different workflow orchestration tools. This is my inner map so far. [...]
Kubeflow, Metaflow, Flyte, Kedro, and ZenML focus more on ML pipelines and experimentation usability, like easy switching between local and cloud. Kubeflow is for ML what Argo is for data flows, so expect a steep learning curve if you are not a Kubernetes expert, which most data scientists are not; this might explain why it is frowned upon. All of these are new and shiny, but again I need to dig a little deeper to understand the differences. Kedro is opinionated about project structure and does not seem to be built with big, scalable workflows in mind. I got the feeling that Kedro is like DVC but aimed more towards ML specifically, so it might be a good fit for consultants who are building many smaller projects (?). Metaflow, Flyte, and ZenML all deal with how to utilize compute clusters in an easy way. ZenML seems to me like it might have some gaps, but it is also the newest one, so that is to be expected at this point in time.

Another member then replies:
Thanks for starting the thread, it's very interesting!
I'd like to clarify that Kedro is a Python library for building modular data science pipelines. Kedro helps you write data science workflows that are made of reusable components, each with a "single responsibility".
Kedro is not an orchestration tool like Argo Workflows or Kubeflow Pipelines. Check out the deployment guide for how to run Kedro pipelines on Airflow, Argo Workflows or Kubeflow Pipelines. We have successfully used Kedro to build data-science-friendly pipelines that we can still run at scale with Kubeflow Pipelines.
https://mlops-community.slack.com/archives/C015J2Y9RLM/p1675865574676169
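For anyone skimming the thread, this is roughly what those "reusable components with a single responsibility" look like in code; a minimal sketch with made-up function and dataset names:

```python
# Minimal sketch of a modular Kedro pipeline (toy names, not from the
# thread). Each node is a plain Python function with one responsibility;
# the strings refer to dataset names defined in the catalog.
from kedro.pipeline import node, pipeline

def clean_companies(raw_companies):
    return raw_companies.dropna()

def score_companies(clean_companies, parameters):
    return clean_companies.assign(score=parameters["weight"])

data_pipeline = pipeline(
    [
        node(clean_companies, inputs="raw_companies", outputs="clean_companies"),
        node(score_companies, inputs=["clean_companies", "parameters"], outputs="scored_companies"),
    ]
)
```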
Amanda
03/02/2023, 1:16 PM

Polly
03/02/2023, 1:22 PM

Victoria Sicking
03/14/2023, 5:23 PM

Polly
03/18/2023, 3:05 PM

Deepyaman Datta
03/18/2023, 4:08 PM

Oleg Pilipenok
03/20/2023, 6:55 AM

Polly
03/22/2023, 12:52 PM

Polly
04/04/2023, 4:30 PM

Stephanie Kaiser
04/05/2023, 2:23 PM

Polly
04/26/2023, 2:20 PM

Merel
06/01/2023, 8:54 AM

Juan Luis
06/21/2023, 8:54 AM

Nok Lam Chan
08/09/2023, 8:26 PM

Yetunde
08/22/2023, 5:16 PM
• jupyter Work with catalog.yml & parameters.yml straight from a Jupyter/Databricks/AWS SageMaker notebook without a project template or an IDE (see the sketch after this list).
• party wizard Use the project creation wizard to add features to your project template. Don't need the files and folders for linting, testing, and documentation? No worries! Just skip those to get a simpler template.
We'd love your help testing these ideas! If you can spare 30 minutes to try either of them, then indicate your interest with jupyter or party wizard. Your feedback will help make Kedro more flexible.
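As a rough illustration of the jupyter idea above (a sketch only, not the actual proposal; DataCatalog.from_config is existing Kedro API, the file layout and dataset name are assumed):

```python
# Sketch of the "jupyter" idea: using catalog.yml and parameters.yml
# from a plain notebook, with no project template or IDE around it.
import yaml

from kedro.io import DataCatalog

with open("catalog.yml") as f:
    catalog = DataCatalog.from_config(yaml.safe_load(f))
with open("parameters.yml") as f:
    parameters = yaml.safe_load(f)

companies = catalog.load("companies")  # "companies" assumed to be defined in catalog.yml
```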
datajoely
09/28/2023, 1:56 PM

Juan Luis
10/13/2023, 12:17 PM

Deepyaman Datta
10/22/2023, 1:32 PM
PartitionedDataset users out there! We have a question for you, related to enabling versioning for PartitionedDataset -- which of the below options makes the most sense to you?
1. https://github.com/kedro-org/kedro/pull/521 proposes to enable versioning of the underlying dataset by specifying versioned: true in the dataset config:
station_data:
  type: PartitionedDataset
  path: data/03_primary/station_data
  dataset:
    type: pandas.CSVDataset
    versioned: true
On the plus side, having the versioned: true flag inside the dataset config makes it clear that the versioning is applied to the underlying dataset, not to the PartitionedDataset. However, there are some edge cases (see https://github.com/kedro-org/kedro/pull/521#issuecomment-744653023, if you're keen).
2. Alternatively, we can move the versioned: true flag to the top-level PartitionedDataset config:
station_data:
  type: PartitionedDataset
  path: data/03_primary/station_data
  versioned: true
  dataset:
    type: pandas.CSVDataset
Note that the versioning still applies to the underlying dataset (e.g. data/03_primary/station_data/first_station.csv/<version>/first_station.csv), even though the config is at the top level.
3. None of these options make sense; what you really need is versioning of the top-level dataset. (Note that we don't have a solution designed for this case, but it would be great to know nonetheless!)
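For context, regardless of which option wins: a PartitionedDataset loads as a dictionary of lazy partition loaders. A small sketch, equivalent to the station_data YAML above (the import path assumes a recent kedro-datasets):

```python
# Python equivalent of the station_data config, for illustration only.
from kedro_datasets.partitions import PartitionedDataset

station_data = PartitionedDataset(
    path="data/03_primary/station_data",
    dataset="pandas.CSVDataset",
)
partitions = station_data.load()  # dict: partition id -> lazy load function
for partition_id, load_partition in sorted(partitions.items()):
    df = load_partition()  # one pandas.DataFrame per CSV partition
```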
Please feel free to vote using 1️⃣2️⃣3️⃣, and elaborate further on your thoughts in the thread below!

Juan Luis
11/02/2023, 1:37 PM
• where do you create your virtual environments: in a centralized location (~/.miniconda, ~/.virtualenvs) or next to the code (~/Projects/spaceflights/.venv)?
• when you create a new Kedro project, what are the steps you usually follow? for example: 1. create and activate a conda environment, 2. pip install kedro, 3. kedro new
• what do you think of the current process?
(please leave a reply on the thread 🧵, 1 comment per person to keep the conversation tidy)
your feedback and ideas are very much welcome 🙏🏼

Роман Белый
11/02/2023, 1:53 PM

Juan Luis
11/06/2023, 9:23 AM
requirements.txt and then read them in pyproject.toml:
https://github.com/kedro-org/kedro/blob/93dc1a91e4bb476287040ea3db4a610696cacb0c/k[…]project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/pyproject.toml
but you can also just avoid requirements.txt files entirely. what do you think of this approach?
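For reference, the dynamic-dependencies pattern in the linked template looks roughly like this (paraphrased, not copied from the file; project name and version are placeholders):

```toml
# pyproject.toml -- setuptools reads the dependency list from requirements.txt
[project]
name = "spaceflights"
version = "0.1.0"
dynamic = ["dependencies"]

[tool.setuptools.dynamic]
dependencies = { file = "requirements.txt" }
```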
Juan Luis
11/06/2023, 10:20 AM
kedro new if you haven't installed Kedro yet? 🙃 cc @Lukas Innig

Juan Luis
12/07/2023, 11:57 AM
(A node cannot have the same inputs and outputs), so it requires you to define a read-only version of the dataset and an appendable version, both referring to the same underlying storage.
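A minimal sketch of that workaround in catalog terms (the entry names and file path are made up): two entries point at the same file; the node reads one, appends the new rows in memory, and writes the other:

```yaml
# Hypothetical catalog entries for the append workaround described above.
events_history:           # read-only view, used as the node's input
  type: pandas.CSVDataset
  filepath: data/02_intermediate/events.csv

events_history_updated:   # same underlying file, used as the node's output
  type: pandas.CSVDataset
  filepath: data/02_intermediate/events.csv
```

The node itself would then take events_history plus the new records as inputs, concatenate them, and return the combined frame as events_history_updated.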
any thoughts on this approach?