# questions
f
Hello! I noticed that my Kedro project contains a requirements.txt file, that looks like this:
black~=22.0
ipython>=8.10
jupyterlab>=3.0
kedro-datasets[pandas.CSVDataset, pandas.ExcelDataset, pandas.ParquetDataset, spark.SparkDataset, plotly.PlotlyDataset, plotly.JSONDataset, matplotlib.MatplotlibWriter, pickle.PickleDataset, tracking.JSONDataset, huggingface.HFTransformerPipelineDataset]>=1.0
kedro-telemetry>=0.3.1
kedro-viz>=6.7.0
kedro~=0.19.1
notebook
pytest-cov~=3.0
pytest-mock>=1.7.1, <2.0
pytest~=7.2
ruff~=0.0.290
scikit-learn~=1.0
seaborn~=0.12.1
pyspark~=3.5.0
langid~=1.1.6
pandas~=2.1.4
plotly~=5.18.0
nltk~=3.8.1
skorch~=0.15.0
First of all, I'm wondering why this is very different from the requirements.txt I get when running pip freeze > requirements.txt. There are many more packages with pip freeze. Also, pip freeze lists the package kedro-datasets, but does not enumerate the different dataset extras (pandas.CSVDataset, pandas.ExcelDataset, etc.). And why are the package requirements not strict (for instance, package_name==3.0)? It looks like this could lead to reproducibility problems. When installing a new package in my environment, what actions should I take to update the requirements? At the end of the day, I just want my project to be as reproducible as possible. I'm a bit lost about all this! Maybe you can direct me to the proper documentation. Many thanks!
d
A requirements file is a concise way to state what is required for a package or program to work. pip freeze outputs the exact packages in your installation, pinned and including transitive dependencies.
It is generally bad practice to pin dependencies; specify them as loosely as possible such that they work. If you pin everything, you miss bug fixes, upgrades, etc. You also get into dependency hell when you try to upgrade or add a dependency.
šŸ‘ 1
If you're concerned about reproducibility, you can use a "lock" file that you periodically update (e.g. for production builds). In this case, you keep a loosely-defined requirements file, plus a process for compiling it into pinned requirements. There are tools that do this for you, like poetry, pip-compile (from pip-tools), or pdm, I think.
šŸ‘ 1
j
also, famously, pip freeze doesn't account for extras (hence [optional-dependencies]) ...
šŸ‘ 1
to add on top of what @Deepyaman Datta said, it's a bad practice to pin dependencies for reusable libraries, but it's absolutely a good practice for applications. requirements.txt is the closest thing we have now to a standard lock file in Python (there have been several other attempts: Pipfile, poetry lock files, pdm lock files, and more). What pip-tools does is put the not-fully-resolved dependencies in a file called requirements.in and then lock them into a requirements.txt
https://github.com/jazzband/pip-tools/
šŸ‘ 1
f
Thank you! I come from R, where "renv" is used to manage dependencies. "renv::snapshot()" is used to take a snapshot of the environment. It creates (or updates) a lockfile with the installed package versions. Conversely, "renv::restore()" will update the environment using the specified versions in the lockfile. So after installing a new package, you run "renv::snapshot()", which will record the new package in the lockfile. When updating a package, you should verify that this does not break your code. If it does, you run "renv::restore()", which will restore the environment to the lockfile (in other words, it will downgrade your updated package to the previous version). If it does not break your code, then you can record the new version in the lockfile with "renv::snapshot()". From what you say, I think that this kind of workflow can be achieved with poetry or pip-compile.
šŸ‘ 1
Is there an automatic way of updating this requirements.txt file? Like pip freeze, but for loose requirements? Or should I update it by hand every time I install a new package?
d
If you use something like poetry, you do poetry add, and it will manage the inserts for you. For a vanilla requirements.txt file, I'm not aware of a way to "manage" it; I generally add by hand.
šŸ‘ 1
j
I do pip-compile -P nltk (the -P/--upgrade-package flag re-locks just that one package)
y
Hi @Francis Duval So pip freeze would almost always differ a lot from the contents of requirements.txt, as they serve different purposes:
• requirements.txt is used to reproduce your environment and should include only packages that you actually call from your IDE (like linters) or import in your code (like pandas, probably). And as the guys said above, ideally it should define loose requirements instead of literal versions.
• the output of pip freeze just logs all packages in your env, including the "direct dependencies" mentioned in requirements and their dependencies (which you don't control). Another way to say this: that file records the outcome of pip resolving your loosely defined requirements to specific versions of each package.
šŸ‘ 1