Possibility for integrating with automated conda build and deployments as well #633

Open
sgbaird opened this issue Mar 29, 2022 · 15 comments


@sgbaird

sgbaird commented Mar 29, 2022

Very nice repo! I'll be promoting this tool to some of the researchers in my group.

See marcelotrevisani/souschef#32 for a breakdown of the methods I've been using with conda. Figured it was worth mentioning.

I have this methodology implemented for a few different repositories, and it's worked pretty nicely.

@abravalheri
Collaborator

abravalheri commented Mar 30, 2022

Hi @sgbaird thank you very much for proposing this discussion.

I currently know just the very basics about packaging with conda.
According to a previous discussion, it seemed that:

  • The main focus of the open-source Python community is to use channels like conda-forge to publish packages.
    • To be honest, nowadays conda-forge seems even more popular than the default Anaconda channel itself.
  • To use these community channels, users are asked to create separate repositories (called feedstocks) that maintain the recipe externally to the main package repository.
  • Grayskull is already a template-based project generator that handles the creation of the feedstock.

Therefore, our conclusion at the time was that PyScaffold would bring very little value to the process of publishing these conda packages.

Maybe something has changed in the ecosystem since then, so it might be worth revisiting this decision.

I tried to have a look at the linked examples/discussions, but I am not sure I understood them correctly. Maybe you can help by summarizing the approach you are taking?

Is the idea here not to use conda-forge but instead a custom channel, and therefore keep the recipe files inside the same repository as the main code?

@mfhepp
Contributor

mfhepp commented May 18, 2022

IMO, many people nowadays use the conda package manager for all kinds of Python projects, not just for Data Science projects or work related to the Anaconda environment. conda has some advantages over pip, in my biased opinion (e.g. it is not limited to Python packages, so you can avoid a mixture of pip, brew, etc. with potentially hard-to-debug conflicts). While you can use pip inside conda, arbitrary mixing will lead to frustration.

If you use pip in parallel to conda, then pip will install the package into the currently active conda environment, and if none is active, into the base conda environment. However, conda's ability to manage dependencies and package metadata for pip is limited. Hence, a conda user will want to minimize the usage of pip, and if a package is not available via conda, the most reliable way of adding a pip package is to create a new conda environment and add the pip package or packages as the last step.

This is why conda users try hard to find packages with conda installation support; it has become a very relevant feature for any Python package. Hence, a ready-to-go template and workflow for Python projects that can be installed both via pip and via conda would be a really big thing.
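
To illustrate the "conda first, pip last" pattern described above, here is a minimal sketch; the environment name and package names are made-up examples, not taken from any project in this thread:

# Sketch of the "conda first, pip last" pattern: everything that conda can
# manage is installed first, and pip-only packages are added as the final step.
# "myenv" and the package names below are placeholders.
import subprocess

# Create a fresh environment with the conda-installable dependencies
subprocess.run(
    ["conda", "create", "-n", "myenv", "-y", "python=3.9", "numpy", "pandas"],
    check=True,
)

# Only afterwards, install the pip-only package inside that environment
subprocess.run(
    ["conda", "run", "-n", "myenv", "pip", "install", "some-pip-only-package"],
    check=True,
)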

@abravalheri
Collaborator

Hi @mfhepp, thank you very much.

The needs of the community and the use cases are very clear now. However, I don't yet understand what the original suggestion is for how we can start to tackle this challenge.

In my limited understanding, I have the impression that what the community uses the most is conda-forge, but conda-forge seems to require two separate repositories...

I still have the same doubts I posted in my previous comment. It would be nice to have some clarification so we can start to better understand what a possible implementation would look like.

@mfhepp
Contributor

mfhepp commented May 18, 2022

Hi @abravalheri: Thanks for taking this up so swiftly! I have not yet packaged any work for conda-forge myself, so hopefully someone who has been doing this regularly will chime in.

But as far as I understand from the YAML files linked by @sgbaird, one would basically have to run grayskull and then build and upload the package to conda-forge or a personal conda channel using conda, like so:

        mkdir scratch                             # working directory for the build
        cp LICENSE scratch/                       # license_file referenced by meta.yaml
        python run_grayskull.py                   # generate and touch up meta.yaml (also saved into scratch/)
        conda config --add channels conda-forge   # make conda-forge available for dependency resolution
        conda config --set anaconda_upload yes    # upload automatically after a successful build
        cd scratch
        conda build .                             # build the recipe (and upload, per the setting above)
        cd ..

https://github.com/sparks-baird/chem_wasserstein seems to be available via PyPI and conda-forge [taken from here]:

conda install -c sgbaird chem_wasserstein
pip install chem_wasserstein

In this particular case, @sgbaird seems to use a personal conda channel rather than conda-forge, but the basic workflow seems to be the same.

As for the "two repositories" issue: As far as I understand, the build files for conda will just be uploaded to the conda channel and will not clutter the main project repository. Only the workflow files (YAML etc.) and customized scripts for the packaging, like https://github.com/sparks-baird/chem_wasserstein/blob/main/run_grayskull.py, will be in the Git repository.

This looks like a straightforward and rewarding approach to me, but hopefully @sgbaird or someone from the group can help.

@sgbaird
Author

sgbaird commented May 19, 2022

@abravalheri thanks for your patience with me rejoining the discussion. @mfhepp great summary of both the needs and the basic workflow involved with my suggestion.

Uploading to a personal channel instead of conda-forge was an oversight of mine. I ran into some difficulty using conda-smithy, thinking I had to use something like that to get it onto conda-forge. I later found out it would be much easier and that conda-smithy was overkill for my needs. One of the important nuances of publishing packages to the conda-forge channel is that all dependencies also need to be available via conda-forge. The next time I publish a package for one of my main projects to conda, I plan to shift everything over to the conda-forge channel rather than my personal sgbaird channel. This is the documentation I plan to follow (sparks-baird/mat_discover#35) when the time comes: https://conda-forge.org/docs/maintainer/adding_pkgs.html.

For reference, here is what one of the run_grayskull.py scripts looks like (which could be more generalized — see the sketch further below):

"""Touch up the conda recipe from grayskull using conda-souschef."""
import os
from os.path import join
from souschef.recipe import Recipe
import chem_wasserstein

os.system(
    "grayskull pypi {0}=={1}".format(
        chem_wasserstein.__name__, chem_wasserstein.__version__
    )
)

fpath = join("chem_wasserstein", "meta.yaml")
fpath2 = join("scratch", "meta.yaml")
my_recipe = Recipe(load_file=fpath)
my_recipe["requirements"]["host"].append("flit")
my_recipe.save(fpath)
my_recipe.save(fpath2)

This would produce, e.g., the following YAML file based on the corresponding PyPI metadata (in this case for v1.0.8):

{% set name = "chem_wasserstein" %}
{% set version = "1.0.8" %}


package:
  name: {{ name|lower }}
  version: {{ version }}

source:
  url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/chem_wasserstein-{{ version }}.tar.gz
  sha256: 61dfd2069eb2579870f31d16ca156553757cc84824e2356e038cf1ae0bbf2fcf

build:
  number: 0
  noarch: python
  script: {{ PYTHON }} -m pip install . -vv

requirements:
  host:
    - pip
    - flit
    - python >=3.6,<3.10
  run:
    - colorama
    - dist_matrix >=1.0.2
    - elmd ==0.4.8
    - numba >=0.53.1
    - pandas
    - plotly
    - pqdm ==0.1.0
    - python >=3.6,<3.10
    - scikit-learn
    - scipy
    - tqdm
    - umap-learn

test:
  imports:
    - chem_wasserstein
  commands:
    - pip check
  requires:
    - pip

about:
  home: https://pypi.org/project/chem_wasserstein/
  summary: A high performance mapping class to construct ElM2D plots from large datasets of inorganic compositions.
  license: GPL-3.0
  license_file: LICENSE

extra:
  recipe-maintainers:
    - sgbaird

(Note that the run_grayskull.py line - my_recipe["requirements"]["host"].append("flit") - is irrelevant for PyScaffold.)
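
On the "could be more generalized" remark above: a minimal sketch of how the script might avoid hardcoding the package name. The CLI argument handling and the use of importlib.metadata are my assumptions, not part of the actual script:

"""Hypothetical generalization of run_grayskull.py (a sketch, not the real script)."""
import subprocess
import sys
from importlib.metadata import version

name = sys.argv[1]  # package name passed on the command line
ver = version(name)  # assumes the package is installed in the current environment

# Generate the recipe from PyPI metadata, exactly as the original script does
subprocess.run(["grayskull", "pypi", f"{name}=={ver}"], check=True)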

One possibility for incorporating this into PyScaffold would be to make a .github/workflows template part of the scaffold. Another would be to just include a run_grayskull.py template with some usage instructions. I don't have a very good grasp of the history and vision for PyScaffold, so these suggestions may be off-base. I'm not too familiar with tox, but I could imagine something like tox -e publish --repository conda-forge or similar. Also, I just noticed https://github.com/pyscaffold/pyscaffoldext-dsproject. A rough sketch of what such a helper might automate follows below.
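
To make that idea a bit more concrete, here is a sketch that simply wraps the same conda commands shown earlier in this thread; the file name, function name, and defaults are hypothetical, not an agreed design:

"""Hypothetical conda-publish helper that PyScaffold could ship as a template."""
import subprocess


def publish(recipe_dir: str = "scratch") -> None:
    # Make conda-forge available for resolving host/run dependencies
    subprocess.run(["conda", "config", "--add", "channels", "conda-forge"], check=True)
    # Upload automatically after a successful build
    subprocess.run(["conda", "config", "--set", "anaconda_upload", "yes"], check=True)
    # Build (and, per the setting above, upload) the recipe
    subprocess.run(["conda", "build", recipe_dir], check=True)


if __name__ == "__main__":
    publish()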

@abravalheri
Collaborator

Hi @sgbaird, thank you very much for the clarifications.
I have some doubts about your run_grayskull.py script. I hope you don't mind me asking them below 😅.

  1. Is the script only running the grayskull CLI and adding flit as a dependency? So if a project does not use flit, would it be the case that running grayskull is enough?

  2. Another question is, since the build step is defined as pip install . -vv, why does flit need to be available on the host? I am under the impression that pip would automatically install flit for the build as long as it is listed in the pyproject.toml file, under [build-system] requires. Is there anything I am missing here?

@sgbaird
Author

sgbaird commented May 19, 2022

Hi @sgbaird, thank you very much for the clarifications. I have some doubts about your run_grayskull.py script. I hope you don't mind me asking them below 😅.

  1. Is the script only running the grayskull CLI and adding flit as a dependency? So if a project does not use flit, would it be the case that running grayskull is enough?

In that case, running grayskull via the CLI should be enough. chem_wasserstein didn't have anything that needed to be changed, but there were others that did. For example, mat_discover required adding pytorch and cudatoolkit as dependencies and renaming kaleido to python-kaleido, since the package names differ between PyPI and Anaconda. For some reason, grayskull didn't recognize it as a noarch distribution, so I add that in manually and make sure that a build section gets deleted if it pops up, since there aren't any fancy compilation requirements. dist_matrix, on the other hand, was similar to chem_wasserstein in that I just included flit as a host dependency.

  2. Another question is, since the build step is defined as pip install . -vv, why does flit need to be available on the host? I am under the impression that pip would automatically install flit for the build as long as it is listed in the pyproject.toml file, under [build-system] requires. Is there anything I am missing here?

See pypa/flit#461. Probably not super important here, since I doubt many people using PyScaffold would also be using flit. It seemed to be something fairly specific to flit (and maybe poetry too).

@abravalheri
Collaborator

Thank you very much @sgbaird. The following are issues in grayskull that (when/if solved) would also simplify this process a lot:

@sgbaird
Author

sgbaird commented May 20, 2022

Figured these were worth xref-ing: #422 and https://pyscaffold.org/en/stable/dependencies.html#creating-a-conda-package

@sgbaird
Author

sgbaird commented May 26, 2022

@abravalheri I got around to trying out packaging on conda-forge with PyScaffold by adapting some of my other workflows. One of the main differences between my other workflows and PyScaffold is that I hardcode __version__ in mymodule/__init__.py (flit workflow) rather than basing it on a git tag. I'm wondering whether you think I should try to grab the version from the git tag or remove the .post... suffix. Also, how should I deal with the "unknown" case, assuming I go with trying to import the version instead of using the git tag?
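
For the suffix-stripping option, one possibility (a sketch only; using packaging.version is my suggestion here, and xtal2png is just an example package) would be to normalize the version before handing it to grayskull:

# Sketch: normalize a setuptools-scm version like "0.1.3.post1.dev4+g123abc"
# down to its base release "0.1.3" before passing it to grayskull.
# Assumes `packaging` is available; `xtal2png` is just an example package.
from packaging.version import Version

import xtal2png

clean_version = Version(xtal2png.__version__).base_version
print(clean_version)  # e.g. "0.1.3"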

There's also a conflict between grayskull's meta.yaml and pre-commit's check-yaml hook.

trim trailing whitespace.................................................Passed
check for added large files..............................................Passed
check python ast.....................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check for merge conflicts................................................Passed
check xml............................................(no files to check)Skipped
check yaml...............................................................Failed
- hook id: check-yaml
- exit code: 1

while scanning for the next token
found character that cannot start any token
  in "src/xtal2png/meta.yaml", line 1, column 2

debug statements (python)............................(no files to check)Skipped
fix end of files.........................................................Passed
fix requirements.txt.................................(no files to check)Skipped
mixed line ending........................................................Passed
autoflake............................................(no files to check)Skipped
isort................................................(no files to check)Skipped
black................................................(no files to check)Skipped
blacken-docs.........................................(no files to check)Skipped
flake8...............................................(no files to check)Skipped

First two lines from meta.yaml:

{% set name = "xtal2png" %}
{% set version = "0.1.3" %}

So, it seems to be referring to the first instance of the % character, as shown. Probably related to this SO post. It's the same issue I experienced in a different context related to safe_load(): conda/grayskull#253

I might just ignore the meta.yaml file, since it doesn't technically need to be committed if I'm putting it on conda-forge anyway.

@sgbaird
Author

sgbaird commented May 26, 2022

For now, I can work around it by ignoring meta.yaml and using version = os.popen("git describe --abbrev=0 --tags").read().replace("\n", ""), slightly modified from https://pyscaffold.org/en/v4.0.1/contributing.html#troubleshooting
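
A slightly more robust spelling of the same workaround (a sketch; the behavior is identical, it just avoids os.popen):

import subprocess

# Take the most recent tag as the version string, stripping the trailing newline
version = subprocess.run(
    ["git", "describe", "--abbrev=0", "--tags"],
    capture_output=True, text=True, check=True,
).stdout.strip()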

@abravalheri
Collaborator

Hi @sgbaird, regarding the pre-commit error, you can also ignore that specific file (since it is not exactly a YAML file, but rather a Jinja template...) by adding an exclude pattern in your .pre-commit-config.yaml:

...
- id: check-yaml
  exclude: 'meta.yaml$'
...

In terms of version, setuptools-scm will try to generate a unique version identifier for each state of your repo (that is why the .post... suffix is attached), so you don't have the case that two wheels containing divergent implementations are marked with the same version. If you commit everything (git status should be clean) and then run git tag just before building the wheel/sdist, the version will not have a suffix.

setuptools-scm will also write the version number into the package metadata once you build the wheel/sdist. This means that the only situations in which you get unknown are when:

  • the installation went wrong
  • the site-packages directory was tampered with
  • there is a problem in importlib.metadata.

I personally never handle this scenario directly, but I believe setuptools-scm also allows you to write the version number into a file (see write_to).
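
For reference, a minimal sketch of the write_to option in a setup.py-style configuration; the module path is a placeholder, and setuptools-scm's documented use_scm_version API is assumed:

from setuptools import setup

# setuptools-scm (assumed installed) writes the computed version into the
# given file at build time, so the package can read it without relying on
# importlib.metadata at runtime.
setup(use_scm_version={"write_to": "src/mypkg/_version.py"})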

@sgbaird
Author

sgbaird commented May 27, 2022

@abravalheri thanks for the clarifications and suggestions! These are very helpful.

@mfhepp
Contributor

mfhepp commented Jul 26, 2022

Note that #422 describes the basic procedure and is referenced from the documentation.

It would be good to also add this issue (#633) to the documentation.

mfhepp added a commit to mfhepp/pyscaffold that referenced this issue Jul 26, 2022
Mention pyscaffold#633 in the documentation regarding deployment to `conda-forge`, as it contains useful information on current tooling.
FlorianWilhelm pushed a commit that referenced this issue Jul 27, 2022
Mention #633 in the documentation regarding deployment to `conda-forge`, as it contains useful information on current tooling.
@mfhepp
Contributor

mfhepp commented Sep 27, 2022

FYI: I personally think that the syncing of PyPI and conda packages will become much less of an issue in the future, because of a fundamental shift in the ecosystems for Data Science:

In the past, having conda packages available was mainly motivated by the need to keep your local system/machine clean, because pip and conda do not work well in parallel; over time, you will clutter your local environments, etc.

In the past months, however, the risk of supply-chain attacks on the open Python ecosystem has grown so tremendously (the risk has actually always been there, but is now more likely to be exploited) that nobody should lightheartedly install external Python packages on his or her local machine, because

  1. every single one could be compromised and that would be sufficient to compromise the entire machine, and
  2. local user rights are sufficient for very bad stuff, even on macOS and Linux machines, and
  3. it is impossible to judge this risk from fuzzy trust measures, like "a package developed at a major US university", code signing, or SHA256 integrity checking.

Without jumping into this too deeply: IMO the only viable consequence is to run all Python Data Science workflows (this applies to other languages like R, too, btw) in constrained Docker containers (no access to the full file system, no outbound/inbound network access, ...) or on virtual or actual external machines.

Otherwise, a single pip install foo can put your entire machine at risk.

Now, with this said, mixing pip, conda, Homebrew, and GitHub libraries is no big deal, because you can always create a fresh Docker image from your Dockerfile. Or even just work without a virtual environment.

This might be a bit off-topic, but I think it is important to share this information.

If you want a concrete example, read about the PyMafka incident on PyPI.
