Hey Team Is there any `mypy` equivalent for kedro i e When I Kedro #questions

Abhishek Bhatia

02/22/2024, 11:04 AM

Hey Team! Is there any `mypy` equivalent for kedro? i.e. something that catches it when I define a node function with certain type hints (e.g. Spark) but, in my node declaration, pass it an incompatible type (e.g. pandas). Thanks! 🙂

datajoely

02/22/2024, 11:04 AM

Pandera is good for this

datajoely

02/22/2024, 11:04 AM

The QB team contributed PySpark support to the library last year

Abhishek Bhatia

02/22/2024, 11:04 AM

Ah! Thanks thanks! 🙏

datajoely

02/22/2024, 11:05 AM

you bind the checks to the function bound to the node

datajoely

02/22/2024, 11:05 AM

nothing kedro node

Abhishek Bhatia

02/22/2024, 11:05 AM

So pandera will validate at runtime, right?

datajoely

02/22/2024, 11:06 AM

yes at runtime currently

Abhishek Bhatia

02/22/2024, 11:08 AM

And not before that, right? For e.g.

```yaml
my_dataset@pandas:
  type: pandas.CSVDataSet

my_dataset@spark:
  type: spark.SparkDataSet
```

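```python
import pyspark.sql
from kedro.pipeline import node


def my_func(sdf: pyspark.sql.DataFrame):
    pass


# The function expects a Spark DataFrame, but the node is wired to the pandas dataset.
node(
    my_func,
    inputs=["my_dataset@pandas"],
    outputs=None,
)
```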

Abhishek Bhatia

02/22/2024, 11:09 AM

In the above, Pandera would check before running the node but after loading the dataset, right?

datajoely

02/22/2024, 11:35 AM

so you would apply the pandera check to `my_func` and it would run before the data is processed by the function
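
The pattern described here, a check bound to the node's function that runs before the function body, can be illustrated in plain Python. This is a hedged sketch and not Pandera's API: `require_annotated_types` is a hypothetical decorator, and `PandasFrame` / `SparkFrame` are stand-in classes so that the example runs without pandas or PySpark installed.

```python
import functools
import inspect
from typing import get_type_hints


class PandasFrame:
    """Stand-in for pandas.DataFrame (assumption, keeps the sketch dependency-free)."""


class SparkFrame:
    """Stand-in for pyspark.sql.DataFrame (assumption)."""


def require_annotated_types(func):
    """Hypothetical decorator: check each argument against its annotation at call time."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; generics such as list[int] are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: argument {name!r} expected "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@require_annotated_types
def my_func(sdf: SparkFrame) -> str:
    return "processed"
```

Because the check lives on the function, nothing changes in the node declaration: the decorated `my_func` would be passed to `node(...)` as before, and a mismatch only surfaces when the node runs, after the dataset has been loaded.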

Deepyaman Datta

02/22/2024, 2:29 PM

My understanding is that Pandera would be more useful for data validation checks, but if you're passing a wrong type object altogether, it's not doing quite what you want. I'm not aware of this kind of static type checking between catalog and node/pipeline. It sounds interesting, but it's also the first time I've heard this request. My two cents:

1. Create an issue on the Kedro repo to track this, in case there are more people interested/it may become a priority
2. If you're keen to play around with it, this could be a Kedro plugin?

I'm not sure what other alternatives are. I don't know much about mypy plugins, but I don't see support for something like this at a glance. Alternatively, in CI, you could "compile" a catalog and convert it into Python code, and then type check the result.
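
One way the "compile the catalog to Python and type check it" idea might look is sketched below. Everything here is an assumption for illustration: `compile_type_stub`, the `LOADED_TYPES` mapping from catalog `type:` strings to loaded Python types, and the module name `my_project.nodes` are hypothetical and not part of Kedro.

```python
# Assumed mapping from catalog `type:` strings to the type each dataset loads.
LOADED_TYPES = {
    "pandas.CSVDataSet": "pandas.DataFrame",
    "spark.SparkDataSet": "pyspark.sql.DataFrame",
}


def compile_type_stub(catalog, nodes, func_module):
    """Render Python source in which each node call receives typed parameters."""
    lines = ["import pandas", "import pyspark.sql", f"import {func_module}", ""]
    for index, (func_name, inputs) in enumerate(nodes):
        # One typed parameter per node input, typed by what the dataset loads.
        params = [
            f"arg{pos}: {LOADED_TYPES[catalog[name]]}"
            for pos, name in enumerate(inputs)
        ]
        call_args = ", ".join(f"arg{pos}" for pos in range(len(inputs)))
        lines.append(f"def _check_node_{index}({', '.join(params)}) -> None:")
        lines.append(f"    {func_module}.{func_name}({call_args})")
        lines.append("")
    return "\n".join(lines)


# Catalog and node wiring taken from the example earlier in the thread.
catalog = {
    "my_dataset@pandas": "pandas.CSVDataSet",
    "my_dataset@spark": "spark.SparkDataSet",
}
nodes = [("my_func", ["my_dataset@pandas"])]

stub_source = compile_type_stub(catalog, nodes, "my_project.nodes")
print(stub_source)
```

Writing `stub_source` to a file and running a type checker such as mypy over it in CI should, assuming type information is available for both libraries, report the call where a pandas DataFrame is passed to a parameter annotated as a Spark DataFrame.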

Iñigo Hidalgo

02/22/2024, 2:43 PM

I've definitely had a similar "wish" in the past, but the technical complexity involved would probably make it infeasible. You depend on all the datasets being properly typed, and additionally the `AbstractDataset`, which is what actually calls the inner `_load`, would add a layer of confusion. @Juan Luis shared an issue a couple of weeks back about potential IDE plugins which, although they wouldn't address this specific use case, could be useful for pre-runtime validation that datasets are at least properly defined
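
The dependency on properly typed datasets can be made concrete with a small sketch. It assumes, hypothetically, that each dataset class annotates the return type of `_load`; the dataset and frame classes below are stand-ins, not Kedro or pandas/PySpark classes, and `find_type_mismatches` is an illustrative helper.

```python
import inspect
from typing import get_type_hints


class PandasFrame:
    """Stand-in for pandas.DataFrame (assumption)."""


class SparkFrame:
    """Stand-in for pyspark.sql.DataFrame (assumption)."""


class CsvLikeDataset:
    def _load(self) -> PandasFrame:
        return PandasFrame()


class SparkLikeDataset:
    def _load(self) -> SparkFrame:
        return SparkFrame()


class UntypedDataset:
    def _load(self):  # no return annotation, so nothing can be inferred
        return object()


def find_type_mismatches(func, input_datasets):
    """Compare each parameter annotation with the matching dataset's `_load` return type."""
    hints = get_type_hints(func)
    param_names = list(inspect.signature(func).parameters)
    problems = []
    for param, dataset_cls in zip(param_names, input_datasets):
        expected = hints.get(param)
        loaded = get_type_hints(dataset_cls._load).get("return")
        if expected is None or loaded is None:
            problems.append(f"{param}: cannot verify, missing annotation")
        elif not issubclass(loaded, expected):  # plain classes only, no generics
            problems.append(
                f"{param}: expects {expected.__name__}, "
                f"dataset loads {loaded.__name__}"
            )
    return problems


def my_func(sdf: SparkFrame) -> None:
    pass


print(find_type_mismatches(my_func, [CsvLikeDataset]))
print(find_type_mismatches(my_func, [UntypedDataset]))
```

The `UntypedDataset` case shows the limitation raised here: as soon as one dataset lacks an annotation, the check can only report that it is unable to verify that input.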
