# questions
f
Hello! I noticed that my Kedro project contains a requirements.txt file, that looks like this:
black~=22.0
ipython>=8.10
jupyterlab>=3.0
kedro-datasets[pandas.CSVDataset, pandas.ExcelDataset, pandas.ParquetDataset, spark.SparkDataset, plotly.PlotlyDataset, plotly.JSONDataset, matplotlib.MatplotlibWriter, pickle.PickleDataset, tracking.JSONDataset, huggingface.HFTransformerPipelineDataset]>=1.0
kedro-telemetry>=0.3.1
kedro-viz>=6.7.0
kedro~=0.19.1
notebook
pytest-cov~=3.0
pytest-mock>=1.7.1, <2.0
pytest~=7.2
ruff~=0.0.290
scikit-learn~=1.0
seaborn~=0.12.1
pyspark~=3.5.0
langid~=1.1.6
pandas~=2.1.4
plotly~=5.18.0
nltk~=3.8.1
skorch~=0.15.0
First of all, I'm wondering why this is very different from the requirements.txt I get when running pip freeze > requirements.txt. There are many more packages with pip freeze. Also, pip freeze lists the package kedro-datasets, but does not enumerate the different dataset extras (pandas.CSVDataset, pandas.ExcelDataset, etc.). And why are the package requirements not strict (for instance, package_name==3.0)? It looks like this could lead to reproducibility problems. When installing a new package in my environment, what actions should I take to update the requirements? At the end of the day, I just want my project to be as reproducible as possible. I'm a bit lost about all this! Maybe you can direct me to the proper documentation. Many thanks!
d
A requirements file is a concise way to state what is required for a package or program to work. pip freeze outputs the exact packages in your installation, pinned and including transitive dependencies.
It is generally bad practice to pin dependencies; specify them as loosely as possible such that they work. If you pin everything, you miss bug fixes, upgrades, etc. You also get into dependency hell when you try to upgrade or add a dependency.
šŸ‘ 1
If you're concerned about reproducibility, you can use a "lock" file that you periodically update (e.g. for production builds). In this case, you keep a loosely-defined requirements file, plus a process for compiling it into pinned requirements. There are tools that do this for you, like poetry, pip-compile (from pip-tools), or pdm, I think.
šŸ‘ 1
j
also, famously, pip freeze doesn't account for extras (hence [optional-dependencies]) ...
šŸ‘ 1
to add on top of what @Deepyaman Datta said, it's a bad practice to pin dependencies for reusable libraries, but it's absolutely a good practice for applications. requirements.txt is the closest thing we have now to a standard lock file in Python (there have been several other attempts: Pipfile, poetry lock files, pdm lock files, and more). What pip-tools does is put the not-fully-resolved dependencies in a file called requirements.in and then lock them into a requirements.txt
https://github.com/jazzband/pip-tools/
šŸ‘ 1
f
Thank you! I come from R, where "renv" is used to manage dependencies. "renv::snapshot()" is used to take a snapshot of the environment. It creates (or updates) a lockfile with the installed package versions. Conversely, "renv::restore()" will update the environment using the specified versions in the lockfile. So after installing a new package, you run "renv::snapshot()", which will record the new package in the lockfile. When updating a package, you should verify that this does not break your code. If it does, you run "renv::restore()", which will restore the environment to the lockfile (in other words, it will downgrade your updated package to the previous version). If it does not break your code, then you can record the new version in the lockfile with "renv::snapshot()". From what you say, I think that this kind of workflow can be achieved with poetry or pip-compile.
šŸ‘ 1
Is there an automatic way of updating this requirements.txt file? Like pip freeze, but for loose requirements? Or should I update it by hand every time I install a new package?
d
If you use something like poetry, you do poetry add, and it will manage the inserts for you. For a vanilla requirements.txt file, I'm not aware of a way to "manage" it; I generally add by hand.
šŸ‘ 1
j
I do pip-compile -P nltk (the -P/--upgrade-package flag re-locks just that one package)
y
Hi @Francis Duval So pip freeze would almost always differ a lot from the contents of requirements.txt, as they serve different purposes:
• requirements.txt is used to reproduce your environment and should include only packages that you actually call from your IDE (like linters) or import in your code (like pandas, probably). And as the guys said above, ideally it should define loose requirements instead of literal versions.
• the output of pip freeze just logs all packages in your env, including the "direct dependencies" mentioned in requirements and their dependencies (which you don't control). Another way to say this: that file records the outcome of pip resolving your loosely defined requirements to specific versions of each package.
šŸ‘ 1