Distributions vs packages in Python
Motivation
While gathering material for a planned series of articles on Python package development, I came across some interesting facts that I hadn’t thought about before.
One of them is the ambiguity in terminology for distributions and packages in Python and their naming.
So I decided to share some thoughts about it.
Terminology
Package
If we search for a definition of “package”, The Hitchhiker’s Guide to Packaging says that, among other things, it is:
A directory containing an __init__.py file …, and also usually containing modules (possibly along with other packages).
This definition is probably incomplete, because a directory without an __init__.py file can still be used as a namespace package, and there is a great article about that.
Moreover, the glossary in the Python documentation (also discussed in the article above) notes that there is little difference between packages and modules:
Technically, a package is a Python module with an __path__ attribute.
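To make the two definitions above concrete, here is a minimal sketch that checks the __path__ attribute on a regular package, a plain module and a namespace package. The helper names and the temporary directory name ns_demo_pkg are my own and used only for illustration.

```python
import importlib
import json
import math
import sys
import tempfile
from pathlib import Path


def is_package(module) -> bool:
    """A module counts as a package if it exposes a __path__ attribute."""
    return hasattr(module, "__path__")


def import_namespace_package(name: str = "ns_demo_pkg"):
    """Create a directory WITHOUT __init__.py and import it as a namespace package.

    The directory name is hypothetical and exists only for this demo.
    """
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / name).mkdir()
        sys.path.insert(0, root)
        importlib.invalidate_caches()  # make the import system notice the new directory
        try:
            return importlib.import_module(name)
        finally:
            sys.path.remove(root)


print(is_package(json))                        # True: json is a directory with __init__.py
print(is_package(math))                        # False: math is a plain module
print(is_package(import_namespace_package()))  # True: no __init__.py, still a package
```

So a directory without __init__.py still behaves as a package from the import system’s point of view, which is why the first definition is incomplete.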
Distribution
For a definition of “distribution”, there is a useful note in the glossary of the same guide, The Hitchhiker’s Guide to Packaging:
A Python distribution is a versioned compressed archive file that contains Python packages, modules, and other resource files
At the same time:
… it is not uncommon in Python to refer to a distribution using the term package
To complicate things even more, the glossary in the Python Packaging User Guide has the term “Distribution Package” and contains the following remark:
A distribution package is more commonly referred to with the single words “package” or “distribution”
Even in the PyPI documentation we can read about a “package” term which is essentially a distribution:
A “file”, also known as a “package”, on PyPI is something that you can download and install. Because of different hardware, operating systems, and file formats, a release may have several files (packages), like an archive containing source code or a binary wheel.
Also, Python libraries are usually hosted on services like GitHub or GitLab, so the repository name can differ from the distribution name as well.
As a result:
- We have two terms (distribution and package) that have different meanings but are sometimes used interchangeably and can therefore confuse beginners
- The names for installation (distribution), working with code (repository) and import (package) can be different
It was also complicated for me at the beginning, and I suppose that you start to clearly distinguish between these terms when you install libraries like “scikit-learn” (distribution name) but import “sklearn” (package name) or you delve into building Python libraries yourself.
Naming
Package
Python naming best practices for the packages (and modules) are clearly defined in Style Guide for Python Code:
Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.
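As a rough illustration of this recommendation, the check below accepts only lowercase names that are valid Python identifiers. It is my own simplification: PEP 8 also asks for short names and merely discourages underscores in package names, and neither of those is checked here.

```python
import keyword


def follows_pep8_package_naming(name: str) -> bool:
    """Rough check: a lowercase, importable identifier that is not a keyword."""
    return (
        name.isidentifier()
        and name == name.lower()
        and not keyword.iskeyword(name)
    )


print(follows_pep8_package_naming("sklearn"))       # True
print(follows_pep8_package_naming("dms_tools"))     # True (underscores allowed, though discouraged)
print(follows_pep8_package_naming("scikit-learn"))  # False: a hyphen is not legal in an identifier
print(follows_pep8_package_naming("PyYAML"))        # False: not all-lowercase
```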
Distribution
What about distribution names? To be honest, I didn’t find any official information about it (e.g. in Python documentation or PEPs) and it is confirmed here:
PEP 8 has nothing to do with the question as it doesn’t talk about distribution, only about importable packages and modules.
This surprised me quite a lot: if I need to use underscores in the package name, which symbol should I use in the distribution name?
Personally, I would choose hyphens without hesitation. Why? I don’t know.
Probably I’ve seen distribution names with hyphens much more often, and that is my main reason for choosing them, but how can I confirm it?
Are there any compelling reasons to use hyphens?
Research
To be more objective I decided to research it using different methods:
1. Search for information on the internet
2. Gather distribution name statistics from the PyPI repository
3. Ask people who have a lot of experience in building Python libraries
For methods 2 and 3, you can find all the code and extended statistics in the GitHub repository.
Information from the internet
- Some answers from online resources contain statements like these:
  - … Long, concatenated words are hard to understand. …
  - “_” is harder to type than “-”
- I’ve found 4 popular cookiecutter templates for Python packages:
  - Cookiecutter Data Science
    Version 2 of this template accepts hyphens for distribution names.
  - Py-Pkgs
    This template complements an incredibly useful book and enforces underscores in distribution names.
  - Hypermodern Python
    This template is based on an awesome article series and explicitly recommends hyphens for distribution names.
  - pylibrary
    This template makes an explicit distinction between repository, distribution and package names, and hyphens are converted to underscores for distribution names.
- There is also an interesting study of repository naming, but I was looking for facts about Python distribution names themselves, based on more complete data.
Naming statistics from PyPI repository
First of all, we get the list of all distributions using the solution from here:
distributions = get_distributions_list(base_url)
distributions_count = len(distributions)
print(distributions_count)
378373
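The helper get_distributions_list is not shown here. One possible way to obtain such a list is the JSON form of PyPI’s simple index, sketched below. This is an assumption on my part rather than the linked solution, and the function names fetch_distribution_names and parse_project_names are hypothetical.

```python
import json
import urllib.request

SIMPLE_INDEX_URL = "https://pypi.org/simple/"
# Media type of the JSON simple index (PEP 691).
JSON_MEDIA_TYPE = "application/vnd.pypi.simple.v1+json"


def parse_project_names(payload: dict) -> list[str]:
    """Extract project names from a JSON simple-index response."""
    return [project["name"] for project in payload["projects"]]


def fetch_distribution_names(index_url: str = SIMPLE_INDEX_URL) -> list[str]:
    """Download the index and return every project name (performs a network request)."""
    request = urllib.request.Request(index_url, headers={"Accept": JSON_MEDIA_TYPE})
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_project_names(json.load(response))


# Usage (not executed here because it downloads a large index):
# distributions = fetch_distribution_names()
# print(len(distributions))
```

The count you get today will differ from the number above, since the index keeps growing.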
Distribution names consist of uppercase and lowercase letters, digits and punctuation - hyphen, underscore and dot:
name_chars = set(''.join(distributions))
print(''.join(sorted(name_chars)))
-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
Let’s calculate statistics for common char combinations which are interesting for us:
import string

no_char = set()
hyphen = set('-')
underscore = set('_')
dot = set('.')
punctuation = hyphen | underscore | dot
digits = set(string.digits)
letters = set(string.ascii_letters)
letters_digits = letters | digits
# Letters and digits
names_letters_digits = get_names_stats(distributions, letters_digits)
Possible chars: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Required chars:
Names count: 192944
Names proportion: 0.50973
Examples: ['harrpy', 'logconf', 'irate']
# With only hyphens
names_with_only_hyphens = get_names_stats(distributions, letters_digits | hyphen,
required_chars=hyphen)
Possible chars: -0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Required chars: -
Names count: 161045
Names proportion: 0.42546
Examples: ['sentry-tablestore', 'requests-async-session', 'dash-loading-spinners']
# With only underscores
names_with_only_underscores = get_names_stats(distributions, letters_digits | underscore,
required_chars=underscore)
Possible chars: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
Required chars: _
Names count: 10650
Names proportion: 0.02814
Examples: ['nextcloud_news_updater', 'nose_priority', 'dms_tools']
# With only dots
names_with_only_dots = get_names_stats(distributions, letters_digits | dot,
required_chars=dot)
Possible chars: .0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Required chars: .
Names count: 11489
Names proportion: 0.03035
Examples: ['plone.introspector', 'RBX.py', 'danse.ins']
# With all punctuation
names_with_all_punctuation = get_names_stats(distributions, letters_digits | punctuation,
required_chars=punctuation)
Possible chars: -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
Required chars: -._
Names count: 13
Names proportion: 0.00003
Examples: ['vision_utils-0.1.1', '0-._.-._.-._.-._.-._.-._.-0', 'sloth-ci.ext.docker_exec']
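The implementation of get_names_stats lives in the repository and is not reproduced here. Judging by its output, it selects the names built only from the “possible” chars that contain every “required” char. A minimal sketch under that assumption (the name get_names_stats_sketch and its parameters are mine, and it shows the first matches instead of random examples):

```python
import string


def get_names_stats_sketch(names, possible_chars, required_chars=frozenset(), n_examples=3):
    """Select names that use only possible_chars and contain all required_chars."""
    possible = set(possible_chars)
    required = set(required_chars)
    # Chained set comparison: required is a subset of the name's chars,
    # which in turn are a subset of the allowed chars.
    matched = [name for name in names if required <= set(name) <= possible]
    proportion = len(matched) / len(names) if names else 0.0
    print("Possible chars:", "".join(sorted(possible)))
    print("Required chars:", "".join(sorted(required)))
    print("Names count:", len(matched))
    print(f"Names proportion: {proportion:.5f}")
    print("Examples:", matched[:n_examples])
    return matched


letters_digits = set(string.ascii_letters + string.digits)
sample = ["requests", "scikit-learn", "dms_tools", "zope.interface", "sloth-ci.ext.docker_exec"]
only_hyphens = get_names_stats_sketch(sample, letters_digits | {"-"}, required_chars={"-"})
```

On this toy sample only scikit-learn is reported for the “only hyphens” case, because every other name either lacks a hyphen or contains another punctuation char.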
Additionally, I created a visualization of the names which contain one or multiple punctuation chars using the matplotlib-venn library:
For technical reasons, the diagram reserves an area for the intersection of all 3 punctuation chars; this is expected behaviour and is explained here.
Another way I found to visualize relationships between multiple sets is UpSetPlot, and I decided to use it for our case:
Findings
Based on these results we can draw several interesting conclusions:
- More than half (~51%) of the distributions have no punctuation chars in their names at all
- A lot of distributions use only hyphens in their names (~43%)
- Distribution names with only dots are also used (~3%), and slightly more frequently than names with only underscores (~2.8%)
- There are distributions with all punctuation chars in their names, but very few of them (13 distributions)
As we can see in the examples above, even distribution names like 0-._.-._.-._.-._.-._.-._.-0 are accepted by the PyPI repository.
Answers from experts
I decided to ask the authors of the Py-Pkgs, pylibrary and Hypermodern Python cookiecutter templates why they chose underscores or hyphens.
Py-Pkgs
The co-author of the Py-Pkgs template, Tomas Beuzen, explains their choice:
… because it is simpler, less cognitive-load (esp. for the beginner), and to us, it seems logical.
Tomas also shared an interesting link: Google now explicitly recommends using hyphens in file and directory names in its developer documentation (but doesn’t explain why).
pylibrary
The author of the pylibrary template, Ionel Cristian Mărieș, gives the following reason for choosing hyphens:
… you’ll see dashes more often than underscores in urls in general. I haven’t given it much thought, just went with what people usually have in urls (dashes instead of underscores) …
He also gave examples of when repository, distribution and package names may differ:
I guess the main reason I have this dist/package/project name distinction is that I don’t always have a consistent scheme for all my projects:
- Sometimes the dist name I want is already taken on PyPI.
- Django projects have a differently styled project name.
About the dashes… it’s just styling at this point. https://pypi.org/project/lazy-object-proxy/ is the same as https://pypi.org/project/lazy_object_proxy/
Hypermodern Python
The author of the Hypermodern Python template, Claudio Jolowicz, really surprised me with his comprehensive answer, which includes both technical and historical notes.
His personal choice is also based on convenience:
Personally, I prefer package names with hyphens because I find them easier to read.
This is obviously a personal choice, and there are well-known projects using underscores in their names. To name one prominent example, the import sorter previously used by this template was reorder_python_imports.
Historical review is very interesting:
Hyphens have been a part of standard typography since Gutenberg.
By contrast, we’ve had the underscore character only since the advent of typewriters. According to this thesis, it was a fixture of the keyboard by 1881, and used for underlining by backing up the carriage and typing over the previous letter. It became a part of ASCII in 1963, and the C programming language made it a legal character in identifiers in the 1970s. I think you’d be hard-pressed to find a text with underscores that’s not aimed at technical people. What’s more, the original purpose of underscores was to provide underlining for devices that support inserting multiple characters in the same position. I still find it somewhat weird to read identifiers that underline the gaps between words, but not the words themselves.
The technical details are also very useful:
Using underscores for both distribution and import names has the advantage of consistency, as hyphens are not legal characters in Python identifiers, while underscores are.
Worth noting that PEP 503 treats the three non-alphanumeric characters (., -, _) in package names as equivalent to -. So the above example appears as reorder-python-imports on PyPI.
PyPI preserves the original name in the package metadata, as specified by the package author. The URL uses the normalized name (PEP 503), but name variants are accepted and redirected to the normalized name.
PEP 503 normalization also includes transforming to lowercase. For example, the metadata name for pyyaml is PyYAML, whose canonical form is pyyaml.
There is also the question of normalizing distribution filenames. For wheels, this is specified in PEP 427 and based on underscores and original case (for pyyaml, the wheel filename uses PyYAML). For sdists, this is unspecified (see this blog and this issue).
Tools like pip or Poetry will work with any name variant. For example, Poetry uses the original name from the package metadata (PyYAML) in pyproject.toml and poetry.lock, but you can poetry add pyyaml.
As an aside, name variants also exist for the console script (where applicable).
- cogapp (cog)
- rst-to-myst (rst2myst)
Sometimes, only the “marketing name” and/or the repository name are different:
- coverage (distribution, package, and script are named coverage, repository is named coveragepy, human-friendly name is Coverage.py)
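The wheel filename rule mentioned in the quote above can be sketched with the escaping rule given in PEP 427: every run of characters other than alphanumerics, underscores and dots is replaced with a single underscore, and case is preserved. The function name escape_wheel_component is mine, and newer packaging tools may normalize names further than this.

```python
import re


def escape_wheel_component(component: str) -> str:
    """Escape a wheel filename component following the rule described in PEP 427."""
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)


print(escape_wheel_component("reorder-python-imports"))  # reorder_python_imports
print(escape_wheel_component("PyYAML"))                  # PyYAML (case is kept)
```

This is why a distribution published with hyphens in its name still ends up with underscores in its wheel filename.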
I am grateful to the authors of these templates for their answers and insights.
Additional experiments
URL normalization
I was interested in trying the URL normalization mentioned by the experts, so I ran a small experiment: take the reorder_python_imports package and watch how variants of its PyPI URL are handled:
check_alternative_urls(distribution_name)
https://pypi.python.org/project/reorder-python_imports -> https://pypi.org/project/reorder-python-imports/
https://pypi.python.org/project/reorder.python_imports -> https://pypi.org/project/reorder-python-imports/
https://pypi.python.org/project/reorder_python-imports -> https://pypi.org/project/reorder-python-imports/
https://pypi.python.org/project/reorder_python.imports -> https://pypi.org/project/reorder-python-imports/
And all URLs are redirected to the URL with normalized distribution name as expected.
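The redirects above are consistent with the normalization rule from PEP 503, which collapses every run of hyphens, underscores and dots into a single hyphen and lowercases the result. The sketch below uses the regular expression given in the PEP; the helper separator_variants is my own and only generates alternative spellings locally, without contacting PyPI.

```python
import itertools
import re


def normalize(name: str) -> str:
    """PEP 503 normalization: runs of -, _ and . become one hyphen; lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def separator_variants(name: str, separators: str = "-_.") -> list[str]:
    """Generate every spelling of a name with each separator swapped independently."""
    parts = re.split(r"[-_.]+", name)
    gaps = len(parts) - 1
    return [
        "".join(part + sep for part, sep in zip(parts, combo + ("",)))
        for combo in itertools.product(separators, repeat=gaps)
    ]


variants = separator_variants("reorder_python_imports")
print(len(variants))                      # 9 spellings for two separator positions
print({normalize(v) for v in variants})   # {'reorder-python-imports'}
print(normalize("Reorder.-_PYTHON__imports"))  # reorder-python-imports
```

Every generated spelling maps to the same normalized name, which matches the target of the redirects in the experiment.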
Installation
Interestingly, you can install a distribution while replacing any punctuation char in its name with any number of hyphens, underscores and dots, or changing the case of its letters, as mentioned in the Python Packaging User Guide.
For example, the following commands will successfully install the reorder-python-imports distribution:
pip install reorder-.__.-pYtHoN_-..-_imports
poetry add reorder-.__.-pYtHoN_-..-_imports
For uninstallation it is a little more complicated:
- For pip you can uninstall using any alternative name:
pip uninstall reorder---pYtHoN...imports
- It seems that for poetry you can use only normalized name:
poetry remove reorder-python-imports
Open questions
What I still don’t understand is the inconsistent behavior across different packages.
For example, I looked for the most recently updated package with all punctuation chars in its name; at the time of writing it is carson-tool.create_template. Let’s run the same experiments and compare the results.
URL normalization
distribution_update_stats, incorrect_distributions = get_distribution_update_stats(base_url, names_with_all_punctuation)
distribution_stats = pd.Series(distribution_update_stats)
distribution_name = distribution_stats[distribution_stats == distribution_stats.max()].index.item()
distribution_info = get_distribution_info(base_url, distribution_name)['info']
distribution_url = distribution_info['project_url']
print(distribution_url)
'https://pypi.org/project/carson-tool.create_template/'
Here the URL variants are redirected to the original name, not to the normalized version of the name:
check_alternative_urls(distribution_name)
https://pypi.python.org/project/carson.tool.create_template -> https://pypi.org/project/carson-tool.create_template/
https://pypi.python.org/project/carson_tool.create_template -> https://pypi.org/project/carson-tool.create_template/
https://pypi.python.org/project/carson-tool-create_template -> https://pypi.org/project/carson-tool.create_template/
https://pypi.python.org/project/carson-tool_create_template -> https://pypi.org/project/carson-tool.create_template/
https://pypi.python.org/project/carson-tool.create-template -> https://pypi.org/project/carson-tool.create_template/
https://pypi.python.org/project/carson-tool.create.template -> https://pypi.org/project/carson-tool.create_template/
Installation
If I install the package using pip:
pip install carson-tool.create_template
Then the package list shows that the underscore was replaced by a hyphen, while the dot remained unchanged:
pip list
carson-tool.create-template 0.2.0
I would appreciate it if somebody could clarify this behaviour, because I didn’t find any clear explanation for it.
Summary
- Terminology for distributions and packages in Python can be quite ambiguous, and it takes practice to distinguish between them
- Package naming has the most specific requirements
- Distribution naming is less obvious, though a choice can be supported from both historical and technical points of view