is anyone here using Dolt with Kedro? “Dolt is a...
# questions
h
Is anyone here using Dolt with Kedro? “Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository.” They authored a hook about 3 years ago, which no longer works. I'm using Dolt with a SQLQueryDataset, which works fine (since it's basically MySQL under the hood), and I'm looking to make a versioned SQL query dataset (not based on a hook; I'm interested in exploring a setup similar to kedro-mlflow in terms of using run IDs to version). Anyway, if anyone would like to compare notes, I'd be up for that!
d
I've been aware of Dolt for years but have never used it, so I may misunderstand some things. I think you may benefit from extending the Ibis table dataset instead of the SQL query dataset, since the query dataset is read-only. It does seem you could then potentially call dolt_commit in the dataset implementation on save (as you say, somewhat similar to the MLflow dataset). Let me reach out to a colleague on the Ibis team who talked to the Dolt team recently, to see if they have any thoughts...
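For anyone following along, here is a minimal sketch of the commit-on-save idea. It assumes a DB-API 2.0 connection to a Dolt SQL server and a driver that uses %s placeholders (PyMySQL, for example). DoltCommitOnSave, its constructor arguments, and save(rows, message) are hypothetical names, not part of Kedro, Ibis, or Dolt; only the DOLT_ADD and DOLT_COMMIT stored procedures come from Dolt itself.

```python
from typing import Any, Iterable, Sequence


class DoltCommitOnSave:
    """Hypothetical helper: insert rows, then record a Dolt commit.

    `connection` is any DB-API 2.0 connection to a Dolt SQL server.
    Table and column names are trusted caller input and are NOT escaped.
    """

    def __init__(self, connection: Any, table: str, columns: Sequence[str]):
        self._connection = connection
        self._table = table
        self._columns = list(columns)

    def save(self, rows: Iterable[Sequence[Any]], message: str) -> None:
        # Assumes the driver's paramstyle is "format"/"pyformat" (%s).
        placeholders = ", ".join(["%s"] * len(self._columns))
        insert_sql = (
            f"INSERT INTO {self._table} ({', '.join(self._columns)}) "
            f"VALUES ({placeholders})"
        )
        cursor = self._connection.cursor()
        try:
            cursor.executemany(insert_sql, list(rows))
            # Stage every changed table, then create the Dolt commit.
            cursor.execute("CALL DOLT_ADD('-A')")
            cursor.execute("CALL DOLT_COMMIT('-m', %s)", (message,))
            self._connection.commit()
        finally:
            cursor.close()
```

In a real project this logic would probably live inside a custom Kedro dataset's save method, and the commit message is one place a run ID could go. As far as I know, DOLT_COMMIT raises an error when there is nothing to commit, so a production version would need to handle that case.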
h
Thanks, that would be great! I would also be very interested in a diff dataset that would only load the difference between commits. However, I'm not sure yet how to ‘map’ the API to any pre-existing Kedro dataset, so I'm definitely interested in their experiences.
d
Do you mind laying out your requirements/what you want a bit more?
• "I'm looking to make a versioned SQL query dataset"
  ◦ Should it support writing back to the database?
  ◦ Do you want to make a commit on each pipeline run (similar to the existing hook, and closer to how Kedro versioning works) or on each write?
• "I would also be very interested in a diff dataset that would only load the difference between commits. However, I'm not sure yet how to ‘map’ the API to any pre-existing Kedro dataset, so I'm definitely interested in their experiences."
  ◦ Would you want to load this in the pipeline? Dolt provides a SQL interface for accessing diffs, but I'm not sure this is what you want: https://docs.dolthub.com/concepts/dolt/git/diff#sql
As for the Ibis team, I was a bit mistaken; it seems they talked a while back, but nothing clear came out of it, except that things should "just work" because MySQL is well supported.
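On the SQL diff interface linked above: a minimal sketch of building a query against Dolt's dolt_diff table function, which takes a "from" revision, a "to" revision, and a table name. The helper name dolt_diff_query and the validation patterns are my own assumptions; the patterns are deliberately conservative and may reject revisions or table names that Dolt itself would accept.

```python
import re

# Conservative, assumed patterns; real Dolt revisions and table names
# can be richer than this.
_REVISION = re.compile(r"^[A-Za-z0-9_./~^-]+$")
_TABLE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def dolt_diff_query(from_revision: str, to_revision: str, table: str) -> str:
    """Build a SELECT over Dolt's `dolt_diff` table function."""
    for revision in (from_revision, to_revision):
        if not _REVISION.match(revision):
            raise ValueError(f"unexpected revision: {revision!r}")
    if not _TABLE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    # Values are validated above, so inlining them as literals is safe here.
    return (
        "SELECT * FROM dolt_diff("
        f"'{from_revision}', '{to_revision}', '{table}')"
    )


# "orders" is a made-up table name used only for illustration.
print(dolt_diff_query("HEAD~1", "HEAD", "orders"))
```

The resulting string could be used as the query of a SQL query dataset, so the diff would load in the pipeline like any other read-only table.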
h
Yes, it should write to the database. I prefer using SQLAlchemy data models, but inserting using pandas is also an option. And indeed, commit per pipeline run. In terms of diffing, I'd love to load only the rows that were updated/inserted since the last commit. I'll probably try to make something similar to an IncrementalDataset, where you can process only the new data in increments. I've worked extensively with Kedro combined with SQL, either custom datasets to use SQLAlchemy, or very hacky solutions to work with SQLQueryDataset/SQLTableDataset, but I'm looking to clean up my act. I don't know whether there are action items right now, but basically I searched for Dolt in this Slack channel and found nothing, so if there is interest in collecting some best practices for Dolt and Kedro, we can keep this thread alive and maybe move the discussion somewhere more structured if there is some momentum for it?
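To make the "only updated/inserted rows" idea concrete, here is a rough sketch that post-processes rows returned by a Dolt diff query. It assumes each row is a dict with a diff_type column ('added', 'modified', or 'removed') and from_/to_ prefixed copies of the table columns, which is how I understand Dolt's diff output. The function name changed_rows and the sample data are made up for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

# Diff metadata columns that are not part of the underlying table.
_METADATA = ("to_commit", "to_commit_date")


def changed_rows(diff_rows: Iterable[Row]) -> List[Row]:
    """Keep inserted/updated rows and strip the `to_` prefix from columns."""
    result = []
    for row in diff_rows:
        if row.get("diff_type") not in ("added", "modified"):
            continue  # skip removed rows
        result.append(
            {
                key[len("to_"):]: value
                for key, value in row.items()
                if key.startswith("to_") and key not in _METADATA
            }
        )
    return result


# Made-up diff rows for a hypothetical table with columns `id` and `qty`.
sample = [
    {"diff_type": "added", "from_id": None, "to_id": 1,
     "from_qty": None, "to_qty": 5, "to_commit": "WORKING"},
    {"diff_type": "modified", "from_id": 2, "to_id": 2,
     "from_qty": 3, "to_qty": 7, "to_commit": "WORKING"},
    {"diff_type": "removed", "from_id": 3, "to_id": None,
     "from_qty": 9, "to_qty": None, "to_commit": "WORKING"},
]
print(changed_rows(sample))  # -> [{'id': 1, 'qty': 5}, {'id': 2, 'qty': 7}]
```

An incremental-style dataset would additionally need to remember the last processed commit somewhere (a checkpoint), which this sketch leaves out.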
d
"I've worked extensively with Kedro combined with SQL, either custom datasets to use SQLAlchemy, or very hacky solutions to work with SQLQueryDataset/SQLTableDataset, but I'm looking to clean up my act."
Highly recommend using Ibis for this (via https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-4.1.0/api/kedro_datasets.ibis.TableDataset.html). I am biased as I work on Ibis; that said, I do believe it's the best way to do this now in Kedro, especially given the Python dataframe model fits well. Happy to share some more context/answer questions, if you'd like.
"I don't know whether there are action items right now, but basically I searched for Dolt in this Slack channel and found nothing, so if there is interest in collecting some best practices for Dolt and Kedro, we can keep this thread alive and maybe move the discussion somewhere more structured if there is some momentum for it?"
Yeah, haven't seen any recent activity on this front. If you do write up/share your implementation along the way, will be happy to see it!
n
https://github.com/dolthub/kedro-dolt There was a plugin, but honestly I haven’t tried it. A diffing dataset sounds like a huge engineering challenge.