# questions
Hi, our team is new to Kedro and we would like to use it as a data engineering tool. The concerns we have are:
• If we work with Ibis or Snowpark, we don't want to define each table/view on the database. As far as I understand, the DataSets are the persistence objects that connect the different transformations in the pipeline. Is there a way to get around defining these?
• How many nodes could we run in parallel? Is there an upper limit if the heavy computing is mainly happening on Snowflake?
• As I understand it, the nodes/transformations have to be molded into a pipeline. Is there an option to do that implicitly by referencing another node?
• Is there a proper way to handle data quality, including generic tests and custom tests?
• Is there an example project that we could benefit from?
Thanks for your inputs!
Raif - welcome to the community, some excellent questions here, and you're very much adopting my view of what a modern Kedro stack should look like, with Ibis at the core. @Deepyaman Datta and I are delighted to see your thinking here.
• Catalog sprawl is a real issue when you have many views/tables to declare. There are two points here:
◦ You only need to declare catalog entries for the datasets you want to persist in some way; within a pipeline, any 'free' inputs/outputs are passed between nodes in memory (see the pipeline sketch at the end of this message).
◦ For situations where you want to persist many similar objects, we've introduced dataset factories, which allow you to define a pattern. This makes the catalog much more concise and DRY, at the expense of giving up some explicitness at rest. You can use `kedro catalog resolve` to review what Kedro compiles at runtime.
• Can you explain this one a bit more? My view with Kedro (and any other framework) is that you should adopt the principle of 'loose coupling, high cohesion'. That means your business logic should live in an independently well-tested, pure Python package, and your Kedro pipeline should be dumb in the sense that you are not coupling business logic to your expression of flow. In my opinion, the `nodes.py` we generate is purely for newbies; in practice you should only need a `pipeline.py` which imports the Python functions from elsewhere (again, see the pipeline sketch below).
• There are a couple of different ways to achieve parallelism in Kedro. The `ParallelRunner` uses multiprocessing and is a good fit for local processing engines like Pandas/Polars. The `ThreadRunner` is perfect for Spark/Snowpark/Ibis, since it delegates execution to a remote computation backend. We don't have a great deal of control over how that gets delegated beyond the number of threads, so I think some tuning will be required on the engine side if performance is a bottleneck (see the runner sketch below).
• Data quality is an interesting topic; many frameworks have come and gone, so it's been hard to build long-term integrations. `Pandera` is by far my favourite way of doing runtime expectation testing. It has support for Spark/Polars/Ibis (Snowpark may work since it's Spark-like at an API level, but don't rely on me saying that). The other advantage is that you annotate the pure Python functions I mentioned above, without coupling to your flow framework like Kedro (see the Pandera sketch below). In summary, if you have well unit-tested, pure Python functions annotated with Pandera schemas, you have a solid foundation of trust to work against. This doesn't give you expectation tests on persisted data the way `dbt` or Great Expectations do. `kedro-pandera` does exist, but I'm not sure how up to date it is.
• In truth, most examples of Kedro at scale are not open to the public; beyond our tutorial docs, maybe explore the GitHub dependents view?
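To make the first two bullets concrete, here's a minimal sketch of a thin `pipeline.py`. The package, function, and dataset names are invented for illustration: only `orders` and `orders_summary` would need catalog entries if you want them persisted (and if your persisted outputs share a naming convention, a single dataset factory pattern key such as `"{name}_summary"` in `catalog.yml` can cover all of them), while `orders_cleaned` has no entry and is simply passed between the nodes in memory. All business logic is imported from an independently tested package.
```python
# pipeline.py - a minimal sketch; package, function and dataset names are invented.
from kedro.pipeline import Pipeline, node, pipeline

# Hypothetical pure-Python functions, unit tested outside of Kedro.
from my_package.cleaning import clean_orders
from my_package.reporting import summarise_orders


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            # "orders" and "orders_summary" live in catalog.yml; "orders_cleaned"
            # has no entry, so Kedro passes it between the nodes in memory.
            node(clean_orders, inputs="orders", outputs="orders_cleaned"),
            node(summarise_orders, inputs="orders_cleaned", outputs="orders_summary"),
        ]
    )
```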
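For the parallelism question, `kedro run --runner=ThreadRunner` from the CLI is usually all you need. A rough programmatic sketch follows; the exact session setup varies a little between Kedro versions, and the `max_workers` value is just an example.
```python
# Sketch of running a Kedro project with ThreadRunner programmatically.
# The heavy lifting still happens on the remote backend (e.g. Snowflake);
# max_workers only caps how many nodes are submitted concurrently.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ThreadRunner

project_root = Path.cwd()  # your Kedro project root
bootstrap_project(project_root)

with KedroSession.create(project_path=project_root) as session:
    session.run(runner=ThreadRunner(max_workers=8))
```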
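And for data quality, a hedged sketch of a Pandera-annotated pure function (pandas shown for simplicity; the column names and checks are invented, and recent Pandera releases call the base class `DataFrameModel`, while older ones used `SchemaModel`):
```python
# Sketch of runtime expectation testing with Pandera on a pure function.
# Column names and checks are illustrative only.
import pandera as pa
from pandera.typing import DataFrame, Series


class RawOrders(pa.DataFrameModel):
    order_id: Series[int]
    amount: Series[float] = pa.Field(nullable=True)


class CleanOrders(pa.DataFrameModel):
    order_id: Series[int] = pa.Field(unique=True)
    amount: Series[float] = pa.Field(ge=0)


@pa.check_types  # validates inputs and outputs against the schemas at runtime
def clean_orders(orders: DataFrame[RawOrders]) -> DataFrame[CleanOrders]:
    # Plain, unit-testable pandas logic with no Kedro coupling.
    cleaned = orders.dropna().drop_duplicates(subset="order_id")
    return cleaned.loc[cleaned["amount"] >= 0]
```
If a node's output drifts from the declared schema, the run fails at that node rather than silently persisting bad data.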