Andrew Stewart
09/13/2023, 8:42 PMkedro-linter
... NOT for linting Python code, but rather for linting the kedro project's config files and general file tree for adherence to best practices.
For example, it could report the number of implicit in-memory datasets not present in catalog.yml
, report the presence of non-notebook files in notebooks
or other file types out of place (independent of .gitignore
), report "feature coverage" of various kedro features (parameters, named datasets, etc.) across pipelines, and so on.
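A minimal sketch of what one such check could look like (illustrative only, not an existing plugin; the function name and entry point are made up): it boots the project, reads the catalog, and lists every pipeline dataset that would fall back to an implicit MemoryDataset.

from pathlib import Path

from kedro.framework.project import pipelines
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def report_implicit_datasets(project_path: Path, env: str = "local") -> list[str]:
    """List pipeline datasets that are not declared in the catalog."""
    metadata = bootstrap_project(project_path)
    with KedroSession.create(
        package_name=metadata.package_name,
        project_path=metadata.project_path,
        env=env,
    ) as session:
        declared = set(session.load_context().catalog.list())
    implicit = []
    for pipeline_name, pipeline in pipelines.items():
        # data_sets() on Kedro 0.18.x; renamed to datasets() in later releases
        for ds in pipeline.data_sets():
            if ds in declared or ds.startswith("params:") or ds == "parameters":
                continue
            implicit.append(f"{pipeline_name}: {ds} (implicit MemoryDataset)")
    return implicit


if __name__ == "__main__":
    for line in report_implicit_datasets(Path.cwd()):
        print(line)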
Hugo Evers
09/19/2023, 1:50 PM
import bentoml
from pathlib import Path


def _find_kedro_project(current_dir):  # pragma: no cover
    from kedro.framework.startup import _is_project

    while current_dir != current_dir.parent:
        if _is_project(current_dir):
            return current_dir
        current_dir = current_dir.parent
    return None


def retrieve_kedro_context(env="local"):
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = _find_kedro_project(Path.cwd())
    metadata = bootstrap_project(project_path)
    with KedroSession.create(
        package_name=metadata.package_name,
        project_path=metadata.project_path,
        env=env,
    ) as kedro_session:
        return kedro_session.load_context()


def download_model(name: str) -> bentoml.Model:
    try:
        return bentoml.transformers.get(name)
    except bentoml.exceptions.NotFound:
        catalog = retrieve_kedro_context().catalog
        pipeline = catalog.load(name)
        return bentoml.transformers.save_model(name, pipeline)


def get_runner(name: str, init_local: bool = False):
    runner = download_model(name).to_runner()
    if init_local:
        runner.init_local(quiet=True)
    return runner
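For reference, a short usage sketch for the helpers above, assuming BentoML 1.x and a transformers text pipeline; the service and model names are placeholders:

import bentoml
from bentoml.io import JSON, Text

# get_runner is the helper defined above; "summarizer" is a placeholder model name
runner = get_runner("summarizer")
svc = bentoml.Service("summarizer-service", runners=[runner])


@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    # For transformers runners, run/async_run invoke the underlying pipeline
    return await runner.async_run(text)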
This allows one to use any arbitrary catalog entry (so you can store your model as a pickle on S3, or in MLflow); however, integration with KedroPipeline seems very complicated, as Bento needs to be aware of the model framework. In addition, CI/CD now needs access to the kedro context, while I'd prefer to simply link the MLflow storage to BentoML and maybe use some adapters for pre- and post-processing.
Additionally, Bento uses its own storage location, which as far as I know I can't move to the cloud (which is quite catastrophic when using Llama-70B, since it will instantly fill up your local storage). Have you found any ways around this?
Do you make storing bentos part of your kedro pipelines? In the above code, I look in the bento storage, and if not found, in the kedro catalog. But this obviously only works when you actively manage this (otherwise you'd be pulling old models). Any thoughts or preferences?
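One possible pattern (a sketch under assumptions, not an established integration): make model registration the final node of the training pipeline, saving into the Bento store and immediately exporting the model to object storage so local disk never becomes the system of record. bentoml.models.export_model and the bucket path below are assumptions to verify against your BentoML version.

import bentoml


def register_and_export(pipeline, name: str = "my_transformer_model") -> str:
    """Kedro node: persist a transformers pipeline to the Bento model store,
    then push it to remote storage so the local store stays small."""
    model = bentoml.transformers.save_model(name, pipeline)
    # export_model accepts fsspec-style destinations; the bucket is a placeholder
    bentoml.models.export_model(str(model.tag), "s3://my-bucket/bento-models/")
    return str(model.tag)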
Also, do you use pip, conda or poetry? I'm looking to use different dependency groups to separate model deps from dev/training deps, also because there have been quite a few breaking changes lately when upgrading packages. Are there any special tricks you employ w.r.t. staggered updating of deps combined with tests? Do you link the poetry deps with mlflow, or do you use bento's inver_packages?
Also, what are your opinions when it comes to deploying the packaged models to k8s? Do you simply deploy docker containers directly, or use something like Seldon or KServe? Or even Bento's Yatai?
I'm curious!Juan Luis
09/21/2023, 10:38 AMJuan Luis
09/21/2023, 10:42 AMMarc Gris
09/22/2023, 4:27 PMPipelineML
pipeline. Does anyone have experience with this or suggestions on how to implement it? Thanks in advance!
Slack conversationWilliam Caicedo
09/26/2023, 9:52 PMdocker buildx build --platform=linux/amd64
to build the images I eventually push to ECR. Is there a way to specify what platform target kedro sagemaker
should use?marrrcin
10/10/2023, 7:21 AMkedro-azureml
with GREAT help from external contributors. Keep it going guys!
@Florian d, @Tomas Van Pottelbergh
https://www.linkedin.com/posts/getindata_getindata-kedro-azure-activity-7117156892325388289-C-bt?utm_source=share&utm_medium=member_desktopAntonio Perelló Moragues
10/10/2023, 10:24 AMstreamlit
, but I could not do so with dash
or vizro
... Do you know how to do that or if there are any other libraries to do so? Thank you!Fazil B. Topal
10/11/2023, 1:23 PM
import logging
from pathlib import Path

from hera.auth import ArgoCLITokenGenerator
from hera.shared import global_config
from hera.workflows import (
    DAG,
    Container,
    Env,
    Parameter,
    RetryPolicy,
    RetryStrategy,
    Workflow,
)
from kedro.framework.project import pipelines
from kedro.framework.startup import bootstrap_project
from kedro.pipeline import Pipeline

# More info in the hera docs
global_config.host = "https://argo-workflows.io"  # Put YOUR OWN INSTANCE PROFILE
global_config.token = ArgoCLITokenGenerator
global_config.namespace = "NAMESPACE OF THE ARGO WORKFLOWS IN K8s"
global_config.service_account_name = "SERVICE ACCOUNT OF THE ARGO WORKFLOWS IN K8s"

IMAGE_REGISTRY = "TO BE FILLED"
IMAGE = "IMAGE NAME TO ADD HERE"
WORKFLOW_DEFAULTS = {}
# Assumes this file lives at the Kedro project root; bootstrap_project needs the
# directory containing pyproject.toml, not the file itself.
PROJECT_DIR = Path(__file__).parent

logger = logging.getLogger()


def convert_camel_case_to_kebab_case(name: str):
    return "".join(["-" + c.lower() if c.isupper() else c for c in name]).lstrip("-")


def get_container(image_tag: str, envs: list[Env] = None) -> Container:
    return Container(
        name="k8s-pod",
        inputs=[
            Parameter(name="cmd"),
            Parameter(name="memory"),
            Parameter(name="cpu"),
        ],
        retry_strategy=RetryStrategy(limit="5", retry_policy=RetryPolicy.on_error),
        image=f"{IMAGE_REGISTRY}/{IMAGE}:{image_tag}",
        termination_message_policy="FallbackToLogsOnError",
        image_pull_policy="Always",
        # In order to define pod resources for each task, use pod_spec_patch
        pod_spec_patch="""
        {
            "containers":[
                {
                    "name":"main",
                    "resources":{
                        "limits":{
                            "cpu": "{{inputs.parameters.cpu}}",
                            "memory": "{{inputs.parameters.memory}}"
                        },
                        "requests":{
                            "cpu": "{{inputs.parameters.cpu}}",
                            "memory": "{{inputs.parameters.memory}}"
                        }
                    }
                }
            ]
        }
        """,
        env=envs,  # Add user specified envs
        command=["bash"],
        args=["-c", "{{inputs.parameters.cmd}}"],
    )


def get_pipeline(pipeline_name: str = None) -> Pipeline:
    metadata = bootstrap_project(Path(PROJECT_DIR))
    logger.info("Project name: %s", metadata.project_name)
    logger.info("Initializing Kedro...")
    pipeline_name = pipeline_name or "__default__"
    pipeline = pipelines.get(pipeline_name)
    return pipeline


def create_workflow(
    image_tag: str,
    envs: list[Env] = None,
    generate_name="kedro-wf-",
    kedro_env: str = "staging",
    kedro_pipeline_name: str = None,
    **extra_params,
) -> Workflow:
    """Create a workflow"""
    if envs is None:
        envs = []
    with Workflow(
        generate_name=generate_name,
        entrypoint="main",
        **WORKFLOW_DEFAULTS,
    ) as w:
        k8s_pod = get_container(image_tag=image_tag, envs=envs)
        with DAG(name="main"):
            for node, deps in get_pipeline(
                kedro_pipeline_name
            ).node_dependencies.items():
                # node.name, [d.name for d in deps]
                kedro_cmd = (
                    f"kedro "
                    f"run "
                    f"--env "
                    f"{kedro_env} "
                    f"--nodes "
                    f"{node.name} "
                )
                if extra_params:
                    kedro_cmd = kedro_cmd + f"--params {extra_params}"
                # Users can add tags to kedro nodes for compute heavy tasks to
                # automatically assign more CPU resources. The code below can be
                # changed depending on user need. Similar logic can be implemented
                # for memory as well.
                if "ComputeHeavyTask" in node.tags:
                    cpu = "6"
                else:
                    cpu = "1"
                memory = "5Gi"
                k8s_pod(
                    name=convert_camel_case_to_kebab_case(f"{node.name}"),
                    arguments={
                        "cmd": kedro_cmd,
                        "cpu": cpu,
                        "memory": memory,
                    },
                    dependencies=[
                        convert_camel_case_to_kebab_case(d.name) for d in deps
                    ],
                )
    return w
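A small usage sketch for the code above, assuming the placeholders (registry, image, host, namespace) are filled in: build the workflow from the default Kedro pipeline and submit it through hera.

if __name__ == "__main__":
    # Build an Argo Workflow from the default Kedro pipeline and submit it
    # to the cluster configured via global_config above.
    workflow = create_workflow(image_tag="latest", kedro_env="staging")
    workflow.create()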
Guang Yang
10/11/2023, 3:40 PMGalen Seilis
10/25/2023, 3:16 PMEd Henry
10/25/2023, 9:38 PMkedro kubeflow
with a repo that has multiple pipelines and lots and lots of parameters in aggregate across pipelines, and I'm receiving an error when trying to upload a compiled manifest : The pipeline spec is invalid.: Invalid input error: The input parameter length exceed maximum size of 10000.
This looks to be a limitation of K8s, specifically : https://github.com/kubeflow/pipelines/issues/2286
I've tried modifying some of the kedro-kubeflow plugin code to account for passing in specific pipelines via kedro kubeflow compile --pipeline <blah> -o <pipeline_blah>.yml
and the manifest looks clean, but I'm getting another error now: kedro: error: argument --configure_logging: invalid <lambda> value: 'config.yaml'
and I was just curious if folks, including @marrrcin, had seen this previously, or not.Takieddine Kadiri
10/30/2023, 12:11 PMpip install kedro-boot
• Visit the Kedro Boot GitHub repo, give feedback and star it if you like it
• Try & Learn with Kedro Boot Examples and give feedback!
This enables using Kedro pipelines in a wide range of online and low-latency use cases, including model serving, data apps (streamlit, dash), statistical simulations, parallel processing of unstructured data, streaming, and more.
Kedro Boot proposes an answer to many issues: #2627, #2169, #1993, #2626, #devrel7, #2663, #2182, #2879, #1846, #795, #933, #2058, #143, #1041 and numerous slack questions about dynamic pipelines, injecting external data, serving a pipeline as an API, and exposing kedro's resources to a generic application. @datajoely wanted this feature for ages!
If the plugin proves to be valuable and effectively addresses the problem, @Yolan Honoré-Rougé and I will ensure its maintenance and support.
Takieddine & YolanRennan Haro
11/07/2023, 2:02 PMkedro-mlflow
Is it possible to retrieve a model from S3 based on its name and stage, directly from a catalog entry? E.g., I'd like to get the production
version of the foobar
model. The .pkl
is saved as an artifact on S3.
I was thinking of building a custom dataset that calls the Mlflow REST API, gets the path to the model with the given name and stage, and then downloads the artifact from S3, but I wonder if there is a simpler/better way of doing it.
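A possible shortcut worth noting: MLflow can already resolve a registered model by name and stage through a models:/<name>/<stage> URI, so a custom dataset may not need to call the REST API directly. A minimal sketch, with a made-up class name (AbstractDataSet is named AbstractDataset in newer Kedro releases):

import mlflow
from kedro.io import AbstractDataSet  # AbstractDataset in newer Kedro releases


class MlflowStageModelDataset(AbstractDataSet):
    """Load a registered MLflow model by name and stage (read-only)."""

    def __init__(self, model_name: str, stage: str = "Production"):
        self._model_name = model_name
        self._stage = stage

    def _load(self):
        # MLflow resolves the registry entry to its artifact location (e.g. the
        # .pkl on S3) and downloads it behind the scenes.
        return mlflow.pyfunc.load_model(f"models:/{self._model_name}/{self._stage}")

    def _save(self, data) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> dict:
        return {"model_name": self._model_name, "stage": self._stage}

The catalog entry would then only need the dataset type plus model_name and stage, with no S3 path at all.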
Hugo Evers
11/08/2023, 3:13 PMkedro-boot
allows one to override pipeline inputs dynamically; maybe it's also possible to overwrite arguments passed to datasets?
I'm specifically asking because kedro-boot
seems to be the answer to using kedro
with something like FastApi
, but FastApi
is routinely used for CRUD on a database, and kedro has this nice mechanism for handling credentials and such. So actually passing the credentials into a node to access a database is quite ugly.
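For reference, a minimal sketch of the mechanism being alluded to: Kedro resolves credentials at the dataset layer, so a node only ever receives the loaded data. The class is SQLQueryDataSet in kedro-datasets 1.x (renamed SQLQueryDataset later), and the connection string below is a placeholder that would normally live in conf/<env>/credentials.yml and be referenced from catalog.yml.

from kedro_datasets.pandas import SQLQueryDataSet

# The credentials dict is what Kedro would inject from conf/<env>/credentials.yml
# when the dataset is declared in catalog.yml with a `credentials:` key.
users = SQLQueryDataSet(
    sql="SELECT * FROM users",
    credentials={"con": "postgresql://user:password@host:5432/db"},
)
users_df = users.load()  # a downstream node only ever sees this DataFrame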
Marc Gris
11/10/2023, 7:35 AMkedro-airflow
and am having a little concern. Please correct me if I'm wrong.
Since individual nodes are turned into airflow tasks as instances of the KedroOperator
whose execute()
method creates a new session
, if one were to use versioned datasets, one would be in for a little surprise:
A single airflow run would produce non-homogeneously timestamped artifacts… Correct?
Thanks in advance for your inputs / comments.
M.Hugo Evers
11/13/2023, 3:52 PMAppPipeline
objects work with find_pipelines
?
my AppPipeline
is found by find_pipelines, but then when in the app's route I do:
kedro_boot_session.run(
    name="web_api",
    …)
the view cannot be found. Only when I do kedro boot --app … --pipeline web_api
, my AppPipeline
is included in the kedro_boot_session._pipeline.viewsWilliam Caicedo
11/15/2023, 4:02 AMkedro-sagemaker
and got hit by an AttributeError: cython_sources
error that seems to be related to this issue. I managed to work around it by installing pyyaml
in advance: pip install "cython<3.0.0" && pip install --no-build-isolation pyyaml==5.4.1
Has anyone seen anything like it?William Caicedo
11/15/2023, 10:32 PMkedro-sagemaker
question: I manage to get the pipeline showing in Processing jobs but then I get an Error: No such command 'sagemaker'.
error. I have kedro-sagemaker
in my requirements.txt
file and I'm building and pushing the image myself, so I just do a kedro sagemaker run
. Any ideas what I'm doing wrong?Andrew Stewart
11/16/2023, 5:35 AMWilliam Caicedo
11/21/2023, 11:11 AM
with KedroSession.create(...) as session:
    session.run(SageMakerPipelinesRunner())
and run a pipeline with kedro-sagemaker
from inside a Jupyter notebook?
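A sketch of how that might look, assuming the plugin exposes its runner as kedro_sagemaker.runner.SageMakerPipelinesRunner (worth verifying against the kedro-sagemaker docs) and that the notebook's working directory is the project root:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Assumed import path; verify against the kedro-sagemaker documentation
from kedro_sagemaker.runner import SageMakerPipelinesRunner

bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    # runner is a keyword argument of session.run()
    session.run(runner=SageMakerPipelinesRunner())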
Artur Dobrogowski
11/22/2023, 12:33 PM0.10.0
of kedro-vertexai
plugin is here!
I'm kinda new here so I'm not sure if this belongs to #announcements also, as it's a plugin update.
We added the grouping feature to allow more freedom in shaping your pipelines at vertexai. More details in the docs:
https://kedro-vertexai.readthedocs.io/en/0.10.0/source/02_installation/02_configuration.html#example
https://github.com/getindata/kedro-vertexai/releases/tag/0.10.0Hugo Evers
11/27/2023, 2:15 PMMlflowModelRegistryDataSet
in the Kedro-Mlflow integration for logging models to MLflow's model registry, as documented in the Kedro-MLflow Python Objects section.
Initially, I followed the documentation for using MlflowModelLoggerDataSet
in the catalog.yml
file, which I implemented successfully. However, I encountered confusion with MlflowModelRegistryDataSet
. My initial attempt was based on the following configuration:
my_transformer_model:
  type: kedro_mlflow.io.models.MlflowModelRegistryDataSet
  flavor: mlflow.transformers
  model_name: my_transformer_model_name
  stage_or_version: staging
When trying to save a model using catalog.save("my_transformer_model", model)
, I received a DatasetError
indicating that the 'save' method is not implemented for MlflowModelRegistryDataSet
. The documentation provides parameters for this dataset but lacks a clear example for its correct usage in saving and registering a model to MLflow.
Moving forward, I found a working solution for logging the transformer model in YAML API:
my_transformer_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.transformers
  save_args:
    registered_model_name: "my_transformer_model_name"
This allowed me to save and load the model to MLflow successfully; this, however, is not documented as such. For model loading, I could indeed use the initial catalog entry to load specific versions directly. Yet I still have unresolved queries w.r.t. model staging/versioning: how to stage or version the model directly through the API instead of using the MLflow UI, i.e. using the MlflowModelLoggerDataSet to save, but also specifying a version/stage.
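For the staging part specifically, a minimal sketch of promoting a registered version through the MLflow client rather than the UI (the model name and version are placeholders); this could run in a node or hook right after the model is logged with registered_model_name:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# Promote a specific registered version to the "Staging" stage; the name and
# version here are placeholders for whatever was just logged/registered.
client.transition_model_version_stage(
    name="my_transformer_model_name",
    version=1,
    stage="Staging",
)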
In addition, I was wondering how to view the metrics associated with the model training run in the MLflow model UI, to efficiently promote the best model to staging.
I can imagine that including practical examples in the official documentation would significantly enhance the user experience.Gilad Rubin
11/29/2023, 9:04 PMpuneet makhija
11/30/2023, 1:39 PMpuneet makhija
11/30/2023, 1:44 PMAppPipeline
by using app_pipeline
factory
but how can we register it?puneet makhija
11/30/2023, 1:52 PMGilad Rubin
12/09/2023, 4:36 PM