Hey Team! Is there any `mypy` equivalent for kedr...
# questions
a
Hey Team! Is there any `mypy` equivalent for Kedro? That is, something that flags it when I define a node function with certain type hints (e.g. Spark) but, in my node declaration, pass it an incompatible type (e.g. pandas). Thanks! 🙂
d
Pandera is good for this
The QB team contributed PySpark support to the library last year
a
Ah! Thanks thanks! 🙏
d
you bind the checks to the function that the node wraps
nothing on the Kedro node itself
a
So pandera will validate at runtime, right?
d
yes at runtime currently
a
And not before that, right? For e.g.
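```yaml
my_dataset@pandas:
  type: pandas.CSVDataSet

my_dataset@spark:
  type: spark.SparkDataSet
```
```python
import pyspark.sql
from kedro.pipeline import node


def my_func(sdf: pyspark.sql.DataFrame):
    pass


node(
    my_func,
    # This entry loads a pandas DataFrame, but my_func is annotated for Spark.
    inputs=[
        "my_dataset@pandas"
    ],
    outputs=None,
)
```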
In the above, Pandera would run its check after the dataset is loaded but before the node runs, right?
d
so you would apply the Pandera check to `my_func`, and it would run before the data is processed by the function
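To illustrate what "binding the check to the function" means, here is a minimal sketch using only the standard library. It is not Pandera: `check_argument_types` is a hypothetical stand-in decorator, and `PandasFrame` / `SparkFrame` are placeholder classes standing in for `pandas.DataFrame` and `pyspark.sql.DataFrame`. The point is only that the check lives on the function and fires at call time, after the data has been loaded.

```python
import functools
import inspect
from typing import get_type_hints


def check_argument_types(func):
    """Hypothetical stand-in for a validation decorator: checks annotated arguments at call time."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; generics such as list[int] are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: argument {name!r} expected "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


class PandasFrame:  # placeholder for pandas.DataFrame (assumption)
    pass


class SparkFrame:  # placeholder for pyspark.sql.DataFrame (assumption)
    pass


@check_argument_types
def my_func(sdf: SparkFrame) -> str:
    return "processed"
```

Calling `my_func(PandasFrame())` raises a `TypeError` before the function body runs, but only once the pipeline is actually executing, which is why this kind of check cannot replace a static one.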
d
My understanding is that Pandera would be more useful for data validation checks, but if you're passing a wrong type object altogether, it's not doing quite what you want. I'm not aware of this kind of static type checking between catalog and node/pipeline. It sounds interesting, but it's also the first time I've heard this request.
My two cents:
1. Create an issue on the Kedro repo to track this, in case more people are interested or it becomes a priority
2. If you're keen to play around with it, this could be a Kedro plugin?
I'm not sure what other alternatives there are. I don't know much about mypy plugins, but I don't see support for something like this at a glance. Alternatively, in CI, you could "compile" a catalog and convert it into Python code, and then type check the result.
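A rough sketch of what such a pre-runtime check could look like, as a lighter variant of the "compile the catalog" idea: instead of generating code for mypy, it compares each catalog entry's dataset type with the node function's annotations directly. Everything here is an assumption for illustration: `LOADED_TYPES` is a hand-written, hypothetical mapping from catalog `type` strings to the Python type a load would yield, `find_mismatches` is not a Kedro API, and the placeholder classes stand in for the real pandas and Spark DataFrames.

```python
import inspect
from typing import get_type_hints


class PandasFrame:  # placeholder for pandas.DataFrame (assumption)
    pass


class SparkFrame:  # placeholder for pyspark.sql.DataFrame (assumption)
    pass


# Hypothetical mapping: catalog `type` string -> Python type its load yields.
LOADED_TYPES = {
    "pandas.CSVDataSet": PandasFrame,
    "spark.SparkDataSet": SparkFrame,
}

# The catalog as it might look after parsing the YAML into a dict.
catalog = {
    "my_dataset@pandas": {"type": "pandas.CSVDataSet"},
    "my_dataset@spark": {"type": "spark.SparkDataSet"},
}


def my_func(sdf: SparkFrame) -> None:
    pass


def find_mismatches(func, inputs, catalog):
    """Compare each input's catalog type with the matching parameter annotation."""
    hints = get_type_hints(func)
    params = list(inspect.signature(func).parameters)
    problems = []
    # Positional pairing: the i-th input name feeds the i-th parameter.
    for param, dataset_name in zip(params, inputs):
        expected = hints.get(param)
        loaded = LOADED_TYPES.get(catalog[dataset_name]["type"])
        if loaded is None or not isinstance(expected, type):
            continue  # unknown dataset type or unannotated/generic parameter
        if not issubclass(loaded, expected):
            problems.append(
                f"{dataset_name} -> {func.__name__}({param}): "
                f"catalog yields {loaded.__name__}, node expects {expected.__name__}"
            )
    return problems
```

Run in CI over every node in a pipeline, `find_mismatches(my_func, ["my_dataset@pandas"], catalog)` would report the pandas/Spark mismatch without loading any data. The hard part, as the sketch makes obvious, is maintaining the mapping from dataset types to loaded types.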
i
I've definitely had a similar "wish" in the past, but the technical complexity involved would probably make it infeasible. You depend on all the datasets being properly typed, and additionally the `AbstractDataset`, which is what actually calls the inner `_load`, would add a layer of confusion. @Juan Luis shared an issue a couple of weeks back about potential IDE plugins which, although they wouldn't address this specific use case, could be useful in providing pre-runtime validation that datasets are at least properly defined