# random
d
Looks like Airflow is considering built-in support for Ibis, partially to enable out-of-the-box lineage. https://lists.apache.org/thread/qx3yh6h0l6jb0kh3fz9q95b3x5b4001l
g
Amateurish question here, but what is "lineage" in this context?
d
Data lineage, specifically column-level lineage. Basically, it tells you which columns in your source data are used to create each column in downstream datasets. It's very hard (read: impossible) to track lineage across something like pandas transformations, but when you have SQL(-backed) transformations, it's possible to compute this information.
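To make "column-level lineage" concrete, here is a minimal, library-free sketch. The dataset and column names (`orders_raw`, `orders_clean`, `report`) are made up for illustration, and the mapping is written by hand; a real tool would derive it from the SQL or expression tree rather than have you type it out.

```python
# Hypothetical lineage map: each derived column lists the columns it is computed from.
# Assumes the mapping is acyclic (a column never depends on itself).
DERIVED_FROM = {
    "orders_clean.total": ["orders_raw.price", "orders_raw.qty"],
    "report.revenue": ["orders_clean.total"],
    "report.region": ["orders_raw.region"],
}


def source_columns(column, derived_from=DERIVED_FROM):
    """Return the set of root source columns that feed `column`."""
    parents = derived_from.get(column)
    if not parents:
        # No recorded inputs, so treat the column as a source column.
        return {column}
    roots = set()
    for parent in parents:
        roots |= source_columns(parent, derived_from)
    return roots


print(sorted(source_columns("report.revenue")))
# ['orders_raw.price', 'orders_raw.qty']
```

Walking the map backwards answers "where does `report.revenue` come from?", which is the question lineage tooling is meant to answer automatically.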
g
Thank you, that's an interesting feature. And does Ibis currently allow for that? If so, does that mean that if I just "wrap" my regular pd.DataFrames inside Ibis, I would gain that as a feature?
d
Ibis does not directly provide lineage. Maybe, if there is a lot of demonstrated user need, this is something that could be prioritized in the future. 🙂 One of the maintainers wrote a pretty detailed guide on how you can extract lineage right now, though: https://github.com/ibis-project/ibis/discussions/7248#discussioncomment-7138710 (The expression tree visualization comes out of the box, so you can definitely visualize the lineage!) @datajoely also did a prototype of extracting lineage in Kedro using Ibis: https://linen-slack.kedro.org/t/16603380/wave-hiya-i-d-like-to-test-an-experiment-with-you-all-test-t
If so, does that mean that if i just "wrap" my regular pd.DataFrames inside ibis, would i gain that as a feature?
You would need to define your logic using Ibis to enable any of this. If you really want, Ibis does support a pandas backend, so you can write code in Ibis and execute it (lazily) using pandas; however, we generally recommend using something like the (default) DuckDB backend for local execution, since it's much more performant.
g
Massive!! Thanks a lot for the detailed answer