Introduction

Unsatisfied with the current state of Linux operating systems, I have decided to build my own distro with its own package manager. I have created this book in order to evaluate how current package formats work, what they do well, and where they can be improved.

Terminology

In this post I will use several terms that are not explained in the glossary. This section will (hopefully) explain them so that everyone knows what I mean when I use each of them. If this section, or any other section of this post, is confusing, please contact me and tell me. I am afraid I am not perfect, or a mind-reader, so I need you to tell me if something is wrong or if I have omitted some important information.

Package Building

Building a package will be used to refer to the process of compiling, or otherwise making a package’s source code executable, usually so that some file inside (usually a shell script or file written in object code) can be executed in order to start the program the package is for.

Package Installation

The act of installing a software package involves moving the package’s files into a file system. Most of the time, when someone says that they are installing a package, they are installing the package’s files to the live file system (that is, the file system of the PC that is being used).

Package Preparation

An important step in package development is the preparation of the sources being used to build the package.

Source Compilation

When one is creating a software package, quite frequently one will need to compile the package’s source code, which means converting source code written in some programming language into something that is machine-readable and executable. This conversion is performed using so-called “compilers”, like the GNU Compiler Collection (GCC), which create object code from the source code.

Glossary

Basic Definitions

Hardware

A computer’s hardware comprises the physical components of the computer, like its hard drive (or hard disk drive [HDD]) on which information is stored (analogous to a person’s long-term memory), random-access memory (RAM) devices (analogous to someone’s working memory) and central processing unit (CPU) or processor (which is where information processing is done; this processing is analogous to a person’s thoughts). The term “computer architecture” or “instruction set architecture” (ISA) refers to the set of instructions that the CPU in question can understand.

Software

A computer’s software comprises the non-physical components of the computer, which are stored on the computer’s hard drive. It includes the data stored on the hard drive, like one’s files, as well as the computer programs, which are more or less just instructions for the computer’s processor. For example, as I am writing this post I am using the Atom text editor, which is a computer program.

Source Code

Source code is a collection of computer instructions written in at least one human-readable computer language. The computer languages most frequently used to write computer programs are so-called “programming languages” such as C, C++ and Python. Other common computer languages sometimes used to write computer programs (usually alongside programming languages) include markup languages (like HTML and Markdown) and style sheet languages (like CSS). For example, the Atom text editor is written in CoffeeScript, HTML, JavaScript and CSS, where CoffeeScript and JavaScript are programming languages.

Free and Open-Source Software

Free and open-source software (FOSS) refers to software whose source code is licensed such that it can be freely shared, distributed, studied and even modified and re-released by anyone. FOSS licenses usually require that anyone who modifies and re-releases the source code gives some recognition to the original authors of the source code, and some licenses (which are called copyleft licenses) even require that the modified source code be released under a similar (if not identical) license to the original source code. The most popular copyleft license is the GNU General Public License (GPL), while the most popular non-copyleft (or permissive) FOSS licenses include the BSD licenses and the MIT License.

Proprietary Software

Proprietary software (PS) refers to software whose source code is licensed such that it cannot be freely and legally shared. PS is usually sold for a charge, that is, you have to pay to use it. Some PS, however, is so-called “freeware”: you can use it for free, but you cannot freely and legally access the program’s source code.

Fork

A fork, or project fork, is when developers take a copy of the source code from one software project and start independently working on it (or developing it) themselves. In the FOSS world this is commonplace due to the relative lack of restrictions imposed by FOSS licenses.

Downstream / Upstream

The terms downstream and upstream refer to the direction away from and toward, respectively, the original authors or maintainers of software that is distributed as source code.

Back-End / Front-End

The terms back-end and front-end are used together to refer to a relationship between two programs. The front-end program runs the back-end program in the background (often without the user noticing), to perform a specified action (usually package management on this site), but usually the front-end has some higher-level features that it brings to the table, that the back-end program lacks. Quite often, on this site, I will refer to graphical package managers as being front-ends for a command-line package manager, which the graphical package manager will run in the background. What the graphical package manager brings to the table is a more intuitive user interface.

Operating System

An operating system (OS) is the base set of system software that forms the foundation for other programs to run on top of. It manages all communication between the computer’s hardware and the application software that runs on top of it. At its heart, each OS has what is known as a kernel, which manages all communication with the computer’s hardware. Notable examples of operating systems include:

  • FreeBSD
  • Linux
  • Mac OS
  • MS-DOS
  • OS X
  • Windows 1.0 to Windows ME (including Windows 95 and 98)
  • Windows NT (including Windows XP, Vista, 7, 8, 10, etc)

Kernel

An operating system kernel is the piece of system software that manages all communication between a computer’s hardware and its software. Most kernels also perform virtual memory allocation. Some of this virtual memory is allocated to the kernel (which is called kernel mode), while the rest is allocated to the user’s processes (or user mode; application software runs in user mode). The kernel is in many ways the operating system’s brain: without it the OS will not work. There are three main types of operating system kernel, based on their respective design:

  • Hybrid kernel: as its name suggests, it is mid-way between the below designs. Examples include the kernels of OS X and Windows NT. Linus Torvalds, the lead and original developer of the Linux kernel, amongst others, believes that the hybrid kernel category is purely marketing and that these kernels fit quite easily into one of the following two categories.

  • Microkernel (or μkernel): kernels that keep as few processes running in kernel mode as possible. They, as their name suggests, usually do the bare minimum needed of a kernel. Consequently, their source code usually has substantially fewer lines than that of their monolithic counterparts. Extreme variants of microkernels include nanokernels and picokernels, which have even fewer lines of source code. The most notable example is MINIX.

  • Monolithic kernel: kernels that perform all operating system functions in kernel mode. They usually consist of far more lines of code than their microkernel counterparts. The most notable examples are the FreeBSD, Linux, NetBSD and OpenBSD kernels.

Microkernels are easier for software developers to work on and are supposedly more easily portable to additional instruction set architectures (ISAs). Despite this, the monolithic Linux kernel has been ported to the greatest number of ISAs of any operating system kernel.

Unix

Unix (or UNIX for the trademark) is a family of operating systems that share several common characteristics and are certified as being compliant to at least one version of the Single UNIX Specification (SUS). These common characteristics are best summarized with what is sometimes called the “Unix philosophy”. The Unix philosophy is that the operating system provides a set of simple command-line tools or utilities (which I will sometimes refer to as Unix utilities) that perform a limited, well-defined function, which usually involves manipulating components (like files) in the unified file system that is also characteristic of Unix systems. Additionally Unix systems also have what is known as a command language or shell scripting language, called a Unix shell, with which users can call on these simple tools. Unix is also notable in that it is designed to be easily portable to a variety of different computers (distinguished from one another, mostly by the details of their CPU).
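To make the Unix philosophy a little more concrete, here is a minimal sketch of several small utilities being chained together from the shell. The word list is made up purely for illustration; the utilities themselves (printf, sort, uniq, head and awk) are standard ones found on practically any *nix system.

```shell
#!/bin/sh
# Each utility does one small job; the shell's pipe (|) chains them together.
words="tar
ar
tar
xz
tar
ar"

# sort groups identical lines, uniq -c counts each group,
# sort -rn orders the groups by count (highest first),
# head -n 1 keeps the top entry and awk prints just the word.
most_common=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "$most_common"
```

None of these tools knows anything about the others; the shell is what combines them into something more useful.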

Unix-like

A Unix-like operating system is one that behaves and to all the world seems like a Unix system, despite not being certified compliant to any version of the SUS. These systems can also be referred to as Unix clones. The only real practical difference between Unix and Unix-like systems is that Unix systems are usually developed by a commercial entity with the funds required to get their system SUS-certified.

*nix

*nix is my shorthand way of referring collectively to both Unix and Unix-like systems. Despite this, you will also see other authors, including those that know far more than I do about these systems, using *nix to refer to just Unix-like systems.

Linux

Linux is a large family of Unix-like operating systems (its members are referred to, collectively, as Linux distributions) that share one common defining characteristic: they use the Linux kernel as their kernel. The Linux kernel is a predominantly free and open-source kernel originally developed by Linus Torvalds, who started developing the kernel in 1991 when he was a 21-year-old computer science student studying at the University of Helsinki in Finland. I used the phrasing “predominantly free and open-source” deliberately, as while most of the kernel’s source code is licensed under the GNU GPL, there are some binary blobs in the kernel (which mostly just add to its compatibility with available hardware) that are not free and open-source.

GNU

The GNU Project is a FOSS project that was founded in the 1980s by Richard M. Stallman (or rms). Its stated mission is to develop a free and open-source Unix clone. This clone, referred to as GNU, is incredibly unpopular as an operating system in its own right. The GNU Project does, however, provide the Unix utilities (contained in the GNU Coreutils package) and the Unix shell (called Bash) used by the vast majority of Linux distributions. Due to this, some people refer to the family of operating systems known as Linux as GNU/Linux.

Free Software Foundation

The Free Software Foundation (FSF) is a non-profit organization set up by Stallman that advocates the widespread use of FOSS and was originally set up to hire people to develop software for the GNU Project. The FSF also provides its own set of predominantly copyleft licenses, which includes the GNU GPL mentioned earlier.

Linux-libre

The Linux-libre kernel is essentially the Linux kernel with its binary blobs removed. This makes the kernel entirely free and open-source, as these binary blobs are proprietary software. A comparatively small number of Linux distributions use the libre kernel. These distributions are often called GNU/Linux-libre distributions and are fairly unpopular compared to their non-libre counterparts.

Berkeley Software Distribution

The Berkeley Software Distribution (BSD) is a Unix operating system that was developed at the University of California, Berkeley between 1977 and 1995. It was originally closed-source (that is, not open-source), but later releases were licensed under permissive BSD licenses. Since then several descendants of BSD have emerged with the most notable and popular one being FreeBSD. I use the terminology *BSD to collectively refer to BSD and its descendants.

File Archive

A file archive is a type of file that stores one or more files, potentially with associated metadata, in a more portable and easily-compressed format. On Linux and other *nix systems, the most common type of file archive is a tar archive (which has the .tar file extension), with rar archives (file extension: .rar) being a popular alternative, especially for non-Linux platforms such as Windows. tar is a basic Unix utility; the specifics of this utility (its syntax, supported file formats, etc.) are dependent on the *nix system it belongs to. Most Linux distributions use GNU tar as their default tar utility, but also have FreeBSD’s tar utility (usually called by the bsdtar command) available from their software repositories.
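As a rough illustration of the above, the following sketch bundles two files into a gzip-compressed tar archive and then lists its contents. The directory and file names are made up for the example, and I have stuck to options (-c, -z, -f, -C and -t) that, as far as I know, both GNU tar and bsdtar accept.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing on the real system is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo"
echo "hello" > "$workdir/demo/a.txt"
echo "world" > "$workdir/demo/b.txt"

# -c creates an archive, -z gzip-compresses it, -f names the archive file,
# and -C changes into the given directory before adding files.
tar -czf "$workdir/demo.tar.gz" -C "$workdir" demo

# -t lists the archive's members without extracting them.
listing=$(tar -tzf "$workdir/demo.tar.gz")
echo "$listing"
```

Replacing -t with -x in the last tar command would extract the archive instead of merely listing it.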

Software Package

Software packages are distributions (or packages) of software and data contained within a file archive. This archive is sometimes compressed so as to save space and make it easier to transfer over a network. Software packages usually also contain some metadata, that is, information about the contents of the package. Example software package formats include:

  • Debian package format (file extension: .deb) — used by Debian’s dpkg package manager. They are archives created using the Unix ar utility.
  • pacman package format (file extension: .pkg.tar.xz) — used by Arch Linux and other Linux distributions using the pacman package manager. They are xz-compressed tar archives.
  • RPM package format (file extension: .rpm) — used by Red Hat’s RPM package manager.

Package Management System

A package management system (PMS, plural form: PMSs) or package manager is a collection of system software that automates the process of installing, configuring, removing and upgrading software packages. Traditionally Linux PMSs were operated solely from the command-line but nowadays several graphical PMSs also exist.

Software Repository

A software repository is essentially an archive of software packages from which package managers can download said packages and then install them.

Linux Distribution

A Linux distribution (LD or distro) is an individual, specific operating system that uses the Linux kernel (that is, a specific member of the Linux family of operating systems). Most Linux distributions use GNU Project software for their Unix utilities, meaning that most Linux distributions can also be called GNU/Linux distributions. An LD consists of, at least, the following basic components:

  • The Linux kernel (or the Linux-libre kernel, in some cases)
  • GNU Project software, such as Bash, the coreutils package, glibc, etc.
  • A package management system, such as APT, DNF, Entropy, pacman, Portage, etc.

Software Version Nomenclature

In the field of software development, software releases are usually distinguished from one another by use of a numbering scheme. Each project has a different numbering scheme. The most common scheme is of the form:

  • x.y.z, where x, y and z are all natural numbers (that is, x, y, z ∈ {0, 1, 2, 3, ...}). x is the major version of the software. Major versions have major feature differences in them, for example, the 4.0 Linux kernel had far more features than the 3.0 version of the Linux kernel. y is the minor version of the software. Minor releases, where the only difference is the minor version number (for example, 4.6.0 is one minor release ahead of 4.5.0), usually differ by only a few, often minor, features. z is the patch (or bug-fix) version of the software. If you are sick of my algebra I will give you an example of a program that follows this versioning scheme: the Linux kernel. An example version of the kernel is the 4.9.6 kernel. 4 is the major version, 9 is the minor version and 6 is the patch version.
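The x.y.z scheme is easy to take apart mechanically. Here is a minimal sketch, in plain POSIX shell, that splits the 4.9.6 example above into its three components; the variable names are my own and it assumes the version string has exactly three dot-separated fields.

```shell
#!/bin/sh
# Split an x.y.z version string into its major, minor and patch components.
version="4.9.6"

# Setting IFS to "." for the read command makes it split the input on dots.
IFS=. read -r major minor patch <<EOF
$version
EOF

echo "major=$major minor=$minor patch=$patch"
# prints: major=4 minor=9 patch=6
```

Projects that use a different numbering scheme (two fields, four fields, date-based versions and so on) would need a different split.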

Release Models: Fixed vs. Rolling

In the context of operating system development, there are two main types of release model: fixed and rolling. The most popular of these is the fixed release model, with the rolling release model being mostly used by bleeding edge operating systems geared towards advanced users like Arch Linux and Gentoo Linux.

Operating systems that follow a fixed release model (FRM) have fixed releases, which users are required to upgrade to in order to keep their system up-to-date. This upgrade process can be (1) via downloading or otherwise acquiring the installation media for the latest version and going through its installer, or (2) via the distribution’s package manager or other methods (e.g., Ubuntu uses the do-release-upgrade command). These upgrades usually include major updates for several pieces of software, including core system components like the kernel, C libraries, compilers, core utilities, etc. In my experience, this process takes hours, uses up a lot of bandwidth and usually breaks at least one program on one’s system. This is why I, personally, am not a big fan of systems following an FRM.

Systems following a rolling release model (RRM) have no fixed releases, rather software updates, including major ones, simply become available for installation as soon as they have been packaged by the repository’s maintainers.

Virtualization

Virtualization is the act of creating a virtual (rather than actual) version of something. Most commonly this “something” is an operating system. Virtualization is usually performed using computer programs that can be referred to as virtualization programs. In this context, the host is the computer being used to run the virtualization program, while the guest is the virtual system being run. For example, I am presently typing this glossary section on Arch Linux, so if I fire up a CentOS 7 VM in Oracle VM VirtualBox, then my host operating system would be Arch Linux and my guest operating system would be CentOS 7.

Cross-Distribution Package Formats

A cross-distribution package format (CDPF; plural form is CDPFs) is a type of Linux software package format that is designed to run on most, if not all, Linux distributions. There are four main, distinct CDPFs I am aware of:

  • AppImage (previously known as klik), a type of self-mounting image file (created using SquashFS) that contains all the libraries, executables, desktop configuration files, icon files, etc. used by the application it provides. AppImages need no root privileges in order to be run and need no special package manager of their own as, unlike other CDPFs, they are not installed, and unlike binary archives they need not be extracted in order to be run. They merely need to be marked as executable (with chmod +x) and run (with ./<AppImage>, where <AppImage> is the name of the AppImage, including its file extension). I have built quite a few AppImages, usually with the help of Simon Peter, the creator of this package format.

  • Binary archives, which have been around for decades, are essentially archives (usually zipped or otherwise compressed) that contain all the libraries and executables required to run the program they provide on Linux. They need to be extracted in order for the program contained within to be run. SageMath and Scilab are both programs that are distributed like this.

  • Flatpak (previously known as xdg-app), a package format that is best supported, although not officially, by the Fedora Linux distribution (for example, the flatpak package is in the official Fedora repositories) and the GNOME desktop environment. Most GNOME applications are officially available as Flatpaks. GNOME Software (a graphical front-end for package management) also has support for Flatpaks as of the 3.22.0 release. Flatpaks are installed using a package manager also known as Flatpak and called by the command flatpak. As one can probably guess, one has to install the Flatpak package manager using one’s distribution’s own package manager (e.g., APT, DNF, pacman, URPMI, Yum, ZYpp) in order to use it to install Flatpak packages. The only distributions I know of with Flatpak in their official repositories are Arch Linux, Fedora and Manjaro Linux. CentOS, Debian, Gentoo, openSUSE and Ubuntu users have to use unofficial repositories in order to install it. Like most package managers, Flatpak requires administrative privileges in order to be run. Flatpaks are perhaps the most challenging package format to build, in my opinion.

  • Snap, a package format officially developed by Canonical Ltd, the same company responsible for the development of the Ubuntu operating system. Like Flatpak, it requires its own package manager to be installed in order for one to run Snap packages. This package manager is written in Google’s Go programming language (which was first publicly released in 2009, so it is fairly new). Snap packages are built using a program called snapcraft, which is written in Python. The source files used to tell snapcraft how to build Snap packages and the metadata to include with them are YAML files that often call on plugins (written in Python) in order to build the Snap package. I have never built a Snap package from scratch as I do not understand Python well enough to write my own plugins, but I have managed to create custom packages using existing source files.
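Of the four formats above, AppImage has the simplest workflow, and its two steps (mark as executable, then run) can be sketched as follows. Note that the file used here is only a stand-in: a real AppImage is a binary with an embedded SquashFS image, whereas this sketch writes a tiny shell script under a made-up name so that the steps can be tried without downloading anything.

```shell
#!/bin/sh
# Stand-in for a downloaded AppImage: a tiny shell script with a made-up name.
workdir=$(mktemp -d)
app="$workdir/Example-x86_64.AppImage"
printf '#!/bin/sh\necho "application started"\n' > "$app"

# Step 1: mark the file as executable.
chmod +x "$app"

# Step 2: run it directly; no installation and no root privileges are involved.
output=$("$app")
echo "$output"
```

With a real AppImage sitting in the current directory, the second step would be the ./<AppImage> invocation described in the first bullet.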

Governance

In the context of open-source projects, the term governance refers to how the project is governed or led. Some projects are authoritarian in governance, in other words they are dictatorships, with a single leader in charge. A notable example of such a project is the Slackware Linux Project. Others are donation-sponsored democracies, wherein the volunteer developers of the project usually have the power to make decisions via consensus. Others still are company-sponsored semi-democracies, wherein a company sponsors the project and has some say over decisions, but the majority of the authority to make decisions about the project resides with the community of volunteer developers (or so it should; whether it does in practice is something I will likely not know until I work in one).

Acronyms

NOTE

  • Acronyms covered in the previous sections are not repeated in this list.
  • Hyperlinks are to sources that explain, in greater detail, the term.
  • ABS: Arch Build System, a system whereby all the PKGBUILDs, patch files and assorted other files used to build packages in the pacman repositories are stored in subdirectories of /var/abs. It is essentially Arch’s equivalent to the Portage Tree.
  • AUR: Arch User Repository, a repository of user-supplied PKGBUILDs, patch files and assorted other files. While these are not used to build any packages in the pacman repositories, users can use them to build Arch packages locally (on their own Arch machine), provided they have no bugs preventing them from building successfully.
  • Deb: the package format used by Debian’s dpkg package manager.
  • PC: Personal computer. Can be a desktop or laptop, does not matter if it once ran OS X or Windows, still to me it is a PC.
  • RPM: RPM Package Manager, originally Red Hat Package Manager. This is the name of both a package manager and the package format used by this package manager.

Package Formats

In order to effectively build packages one must understand the basics of the package format one intends to build. There are four major types of Linux package format that I have worked with:

  • Arch Linux packages (ALPs, file extension: .pkg.tar.xz), the package format used by Arch Linux, its derivatives and select “independent” distributions such as Frugalware Linux and KaOS. They are built based on the contents of PKGBUILDs, which are Bash scripts with build instructions for the package along with its associated metadata.
  • Debian packages (or Deb packages, file extension: .deb), the package format used by Debian and its derivatives such as Ubuntu and its derivative, Linux Mint. They are built based on the contents of a whole directory and its subdirectories. The build instructions are found within the rules file.
  • Gentoo packages (file extension: .tbz2), the package format used by Gentoo Linux and its derivatives like Sabayon Linux. tbz2 files are built based on the contents of a specialized Bash script called an ebuild (with the .ebuild file extension). ebuilds are stored within a set of directories and subdirectories (called overlays), usually managed by Git (git) or some other version control system (VCS) like Mercurial (hg) or Subversion (svn). They are essentially like more complicated (and hence more difficult to write) equivalents to PKGBUILDs and like PKGBUILDs they include package metadata and build instructions.
  • RPM packages (file extensions: .rpm, .src.rpm), a package format used by select distributions such as CentOS, Fedora, Mageia and openSUSE. They are built based on the contents of a whole directory, entitled rpmbuild, and its subdirectories. The most important file in the rpmbuild directory and its subdirectories is called a spec file, which has the .spec file extension. This spec file contains package metadata and build instructions, similarly to ebuilds and PKGBUILDs.

Arch Linux Packages

Arch Linux packages are xz-compressed tar archives that are built and installed using commands provided by the pacman package on Arch Linux. ALPs are the package format used by Arch Linux derivatives (like Manjaro Linux) along with the “independent” distributions, Frugalware Linux and KaOS, which also use the pacman package manager, so this information should be applicable to packaging on these distributions too.

ALP Contents

ALPs have the following contents:

$INSTALLED_FILES
.BUILDINFO
.INSTALL
.MTREE
.PKGINFO

where $INSTALLED_FILES are, of course, the installed files of the package with its respective file structure. For example, for the broadcom-wl package the $INSTALLED_FILES have the directory structure:

usr/
 - lib/
   - modprobe.d/
     - broadcom-wl.conf
   - modules/
     - extramodules-4.4-ARCH/
       - wl.ko.gz
 - share/
   - licenses/
     - broadcom-wl/
       - LICENSE

The package metadata (which is used by pacman when it installs new packages to check for file conflicts and such) is stored in the four hidden files (that is, those whose filenames begin with a .) in the package’s top-level directory.
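To see this layout without needing a real package to hand, the sketch below builds a mock archive with the same top-level structure and then lists only its hidden metadata files. Everything in it is a placeholder: the package name is made up, the metadata files contain dummy text, and I have used gzip rather than xz compression merely so that the example does not depend on xz being installed.

```shell
#!/bin/sh
# Build a mock package tree; the file contents are placeholders, not real metadata.
workdir=$(mktemp -d)
mkdir -p "$workdir/pkg/usr/share/licenses/example"
echo "placeholder license" > "$workdir/pkg/usr/share/licenses/example/LICENSE"
for meta in .BUILDINFO .INSTALL .MTREE .PKGINFO; do
    echo "placeholder" > "$workdir/pkg/$meta"
done

# Archive the tree so that the metadata files sit at the archive's top level.
tar -czf "$workdir/example.pkg.tar.gz" -C "$workdir/pkg" \
    .BUILDINFO .INSTALL .MTREE .PKGINFO usr

# List only the top-level hidden files, that is, the metadata.
metadata=$(tar -tzf "$workdir/example.pkg.tar.gz" | grep '^\.[A-Z]*$' | sort)
echo "$metadata"
```

The same kind of tar -t listing can be pointed at a real package file to see its installed files and metadata side by side.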

PKGBUILD Structure

ALPs are built from PKGBUILDs using the makepkg command that comes bundled with the pacman package manager. They are the easiest packages to build, in my opinion. PKGBUILDs have the following general format (for more details see the PKGBUILD(5) man page):

# ~ Maintainer/Contributor name and email ~
pkgname=      # The package's name.
pkgver=       # The upstream package version, e.g., 1.5.0 for Atom 1.5.0.
pkgrel=       # The PKGBUILD revision number.
pkgdesc=      # A short description of the package.
arch=         # The architecture(s) on which the package is to be built.
url=          # The website of the package.
license=      # The legal license of the package.
depends=      # Runtime dependencies.
makedepends=  # Build dependencies.
optdepends=   # Optional dependencies.
provides=     # What the package provides.
conflicts=    # The package conflicts.
source=       # The source files required; also includes patches.
sha256sums=   # SHA256 sums of the source files.
md5sums=      # MD5 sums of the source files. Usually used INSTEAD of sha256sums.
install=      # Install files.

prepare() {   # Prepare the sources. Most commonly you will find sed commands
}             # and patches being applied here.

build() {     # Perform any compiling of the source code that may be necessary.
}             # You may also see configure scripts being run here.

package() {   # This will actually build the package. If more than one package is
}             # built from the one PKGBUILD then more than one package() function is provided.

The sha256sums array can be replaced with sha512sums, and sometimes GPG signatures are used as well. For example, the Linux kernel PKGBUILD, in the core pacman repository, uses GPG and sha256sums to check package integrity and validity. The variable definition lines (that is, the pkgname line through to the install line) mostly provide the package’s metadata and security checks (as well as variables that can be used in the following functions), while the prepare(), build() and package() functions are responsible for the actual building of the package. The install line defines the .install file that contains pre-, peri- and post-install checks and functions that need to be executed for the package. Here is an example PKGBUILD I have used to build gVim 7.4.1525:

# Maintainer: Brenton Horne <brentonhorne77 at gmail dot com>
# Contributor: Peter Mattern <pmattern at arcor dot de>

_pkgname=vim
pkgname="gvim"
pkgver=7.4.1525
pkgrel=1
pkgdesc="Vim the editor. CLI version and GTK2 GUI providing majority of features."
arch=("i686" "x86_64")
url="http://www.vim.org"
license=("custom:vim")
depends=("gtk2" "hicolor-icon-theme" "gtk-update-icon-cache" "desktop-file-utils")
optdepends=("lua: Lua interpreter" "perl: Perl interpreter" "python: Python 3 interpreter"
            "python2: Python 2 interpreter" "ruby: Ruby interpreter")
makedepends=("lua" "python" "python2" "ruby")
provides=("gvim" "xxd" "vim-runtime")
conflicts=("vim-minimal-git" "vim-git"
           "vim-minimal" "vim" "vim-python3" "gvim" "gvim-python3")
source=("https://github.com/vim/vim/archive/v$pkgver.tar.gz"
        "gvim.desktop")
sha256sums=('SKIP'
            'c346da4725b2db6f7b58c5b72bdf9e7efbba2a3275e97c17db48689e4de674ca')
install=gvim.install

prepare() {

    # set global configuration files to /etc/[g]vimrc
    sed -i 's|^.*\(#define SYS_.*VIMRC_FILE.*"\) .*$|\1|' "${srcdir}/${_pkgname}-${pkgver}/src/feature.h"

}

build() {

    cd "${srcdir}/${_pkgname}-${pkgver}"
    ./configure \
      --enable-fail-if-missing \
      --with-compiledby='Arch Linux AUR' \
      --prefix=/usr \
      --enable-gui=gtk2 \
      --with-features=huge \
      --enable-cscope \
      --enable-multibyte \
      --enable-perlinterp=dynamic \
      --enable-pythoninterp=dynamic \
      --enable-python3interp=dynamic \
      --enable-rubyinterp=dynamic \
      --enable-luainterp=dynamic
    make

}

package() {

    # actual installation
    cd "${srcdir}/${_pkgname}-${pkgver}"
    make DESTDIR="$pkgdir" install

    # desktop entry file and corresponding icon
    install -D -m644 ../gvim.desktop      "$pkgdir/usr/share/applications/gvim.desktop"
    install -D -m644 runtime/vim48x48.png "$pkgdir/usr/share/icons/hicolor/48x48/apps/gvim.png"

    # remove ex/view and man pages (normally provided by package 'vi' on Arch Linux)
    # (&& ensures rm only runs if the cd succeeded)
    cd "$pkgdir/usr/bin" && rm ex view
    find "$pkgdir/usr/share/man" -type d -name 'man1' 2>/dev/null | \
      while read -r _mandir; do
        cd "${_mandir}"
        rm -f ex.1 view.1
      done

    # add license
    install -D -m644 "${srcdir}/${_pkgname}-${pkgver}/runtime/doc/uganda.txt" \
      "$pkgdir/usr/share/licenses/$pkgname/LICENSE"
}

prepare() is used to prepare the source. makepkg itself extracts any archives listed in source (like a gz-compressed tar archive) before prepare() is run, so this function is typically where patches are applied and other edits are made to the extracted sources before the build() and package() functions use them. build() is used to build, or compile, the source, if this needs to be done (some PKGBUILDs actually build ALPs from Debian or RPM packages, so no source code compiling is required). package() is what builds a package from either the compiled source (that is, the source after the build() function is run) or the prepared pre-compiled sources (that is, the contents of Debian/RPM binaries).

In the package() function, the objective of the game is to move all the files you wish to be in the end package from the products of the earlier functions (whether compiled source code or unpacked Debian package contents) into the $pkgdir directory. The $pkgdir directory is meant to have the same internal structure as the file system the package will be installed to. For example, GTK themes are usually installed to /usr/share/themes, so this is an example package() function for such a case (this one is taken from the osx-el-capitan-theme PKGBUILD):

package() {
  mkdir -p "$pkgdir/usr/share/themes/"
  cp -a "$srcdir/${_pkgname}-${pkgver}/OS X El Capitan" "$pkgdir/usr/share/themes/"
}

As you can see, the package’s contents are copied to ${pkgdir}/usr/share/themes/OS X El Capitan.
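To see how the staging tree mirrors the final file system, here is a sketch that runs the same package() function against throw-away directories and then lists the result relative to $pkgdir. The values demo-theme and 1.0 and the index.theme file are placeholders of mine, not taken from the real PKGBUILD.

```shell
#!/bin/sh
# Run a package() function against temporary directories and show the
# resulting paths as they would appear on the installed system.
work=$(mktemp -d)
srcdir="$work/src"
pkgdir="$work/pkg"
_pkgname=demo-theme   # placeholder name
pkgver=1.0            # placeholder version

# fake "extracted source" containing a directory with spaces in its name
mkdir -p "$srcdir/${_pkgname}-${pkgver}/OS X El Capitan"
echo 'placeholder' > "$srcdir/${_pkgname}-${pkgver}/OS X El Capitan/index.theme"

package() {
  mkdir -p "$pkgdir/usr/share/themes/"
  cp -a "$srcdir/${_pkgname}-${pkgver}/OS X El Capitan" "$pkgdir/usr/share/themes/"
}
package

# paths relative to $pkgdir are the paths the files get once installed
staged=$(cd "$pkgdir" && find . -type f | sed 's|^\.||')
echo "$staged"
```

The quotes around the paths are what keep the spaces in "OS X El Capitan" from splitting the directory name into several arguments.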

Building ALPs

To build an ALP you run:

user $  makepkg

from within the directory in which the PKGBUILD you intend to build is located. You may not have the package’s build dependencies installed, in which case this command will return an error stating that build dependencies are missing. To fix this by installing all required build dependencies prior to the build (assuming all the dependencies are in the presently-enabled pacman repositories), run:

user $  makepkg -s

Debian Packages

The Debian package format (file extension: .deb) was one of the first Linux package formats developed. It was first developed by Ian Murdock and other members of the Debian development team. The package manager that was originally developed to work with Debian packages (installing, uninstalling, upgrading, etc. these packages) is called dpkg (invoked by the dpkg command), while APT, aptitude and Synaptic are front-ends that perform repository management, dependency resolution, etc. and then use dpkg to perform the actual installation of Debian packages. Debian packages are built based on the contents of several different files in a directory named debian, which has its own set file structure, including subdirectories and the like. Debian packages are ar archives, that is, archives generated with the ar Unix utility. They are built using the debuild command (provided by the devscripts package) or the dpkg-buildpackage command (provided by the dpkg-dev package); both of these packages are separate from the package that provides the dpkg command.

Package Contents

As previously mentioned, Debian packages are ar archives and they have the following three files inside them:

debian-binary
control.tar.xx
data.tar.xx

where .xx denotes the compression file extension of the contained archives. Many Debian packages use gzip compression for their control and data tar archives, so in this case .xx is replaced with .gz. Others have xz-compressed control and data tar archives inside them (.xz). The debian-binary file is a plain text file containing the format version number of the Debian package (currently 2.0). The control.tar.xx archive contains the package’s metadata, while the data.tar.xx archive contains the package’s installed files.
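The three-member layout can be reproduced by hand, which I find makes it easier to picture. The following is only a sketch: it assumes the ar utility (from binutils) and a tar with gzip support are installed, the package name demo and its metadata are made up, and I have not checked whether dpkg would accept the resulting toy archive.

```shell
#!/bin/sh
# Assemble a toy .deb by hand and list its members.
work=$(mktemp -d)
cd "$work" || exit 1

printf '2.0\n' > debian-binary        # format version number

# made-up metadata and payload
mkdir control data
printf 'Package: demo\nVersion: 1.0\nArchitecture: all\nMaintainer: Nobody <nobody@example.com>\nDescription: toy package\n' > control/control
mkdir -p data/usr/share/demo
echo 'hello' > data/usr/share/demo/hello.txt

tar -czf control.tar.gz -C control .
tar -czf data.tar.gz -C data .

# members are added in the order given: debian-binary, control, data
ar rc demo.deb debian-binary control.tar.gz data.tar.gz

members=$(ar t demo.deb)
echo "$members"
```

Running ar t on a real Debian package should list the same three kinds of members.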

Build Directory Structure

The debian directory used to build Debian packages has the structure:

debian/
  - changelog
  - compat
  - control
  - copyright
  - rules
  - source/
    - format

The changelog and copyright files have pretty self-explanatory contents, so I will not bother describing them. The compat file contains a single number, the debhelper compatibility level, which in this case is nine (9). The control file contains the package metadata, like its description, name, version, dependencies, etc. The rules file is an executable makefile that contains the package build instructions. The format file contains the source format of the Debian package being described; for example, most packages at the moment use the 3.0 (quilt) format.
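As a rough illustration, a skeleton of this directory can be created with a few shell commands. The package name demo, the maintainer address and the field values are placeholders, and the changelog and copyright files are left empty here, so this is a starting point and not a directory that would build as-is.

```shell
#!/bin/sh
# Create a skeleton debian/ directory for a made-up package called "demo".
work=$(mktemp -d)
deb="$work/demo-1.0/debian"
mkdir -p "$deb/source"

echo '9' > "$deb/compat"                    # debhelper compatibility level
echo '3.0 (quilt)' > "$deb/source/format"   # source package format

# package metadata (all values are placeholders)
cat > "$deb/control" <<'EOF'
Source: demo
Maintainer: Nobody <nobody@example.com>
Build-Depends: debhelper (>= 9)

Package: demo
Architecture: all
Description: toy package used as an example
EOF

# rules is a makefile; the recipe line must start with a tab
printf '#!/usr/bin/make -f\n%%:\n\tdh $@\n' > "$deb/rules"
chmod +x "$deb/rules"

: > "$deb/changelog"    # left empty here; real packages need proper entries
: > "$deb/copyright"

(cd "$work/demo-1.0" && find debian | sort)
```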

Gentoo/Sabayon Packages

Gentoo packages (file extension: .tbz2) are bz2-compressed binary packages used by Gentoo Linux and its derivatives. They are produced and installed using the Portage package manager. Sabayon Linux’s Entropy package manager uses a slightly different package format (with the same file extension, .tbz2), generated from the corresponding Gentoo packages using Entropy. Most Gentoo users will not install their software from tbz2 files, as Portage is a source code package manager, and installing packages from source code is usually the reason why people use Gentoo in the first place. As a result, most packages are built from source code and not installed from binary packages. Portage installs software from source code by following the instructions found in a specialized Bash script called an ebuild. Portage can be used to install tbz2 binary packages, however, and it can be configured to work with binary package repositories (that is, to install, remove, upgrade, etc. packages in said repositories). This is just an uncommon Portage configuration.

Package Contents

Running:

user $  qtbz2 $package.tbz2

where $package.tbz2 is a tbz2 binary, extracts an xpak file (file extension: .xpak), which contains the package metadata, and a .tar.bz2 archive containing the installed files of the package.

ebuild Structure

Syntactically, I would say that ebuilds are most similar to PKGBUILDs, but there are several key differences. For one, they make use of eclasses, collections of specialized Bash functions designed specifically for ebuilds, many of which are poorly documented, in my opinion. Secondly, PKGBUILDs are all named PKGBUILD, while ebuilds only share the same file extension, .ebuild. Their name consists of the package’s name and its version, e.g., gVim 7.4.1342 would have an ebuild named gvim-7.4.1342.ebuild. ebuilds also come with manifests (files entitled Manifest) that include checksums for all the source files and the ebuilds themselves. Here is an ebuild for gVim that you can compare to the PKGBUILD and spec file for gVim; it is over 400 lines long so I am not going to include it in this post. To build a Gentoo binary package from an ebuild run:

user $  ebuild $package.ebuild package

while to build a Sabayon binary package, one has to run one additional command:

root #  equo pkg inflate $package.tbz2
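The ebuild naming scheme described above can be taken apart with plain shell parameter expansion. This sketch assumes the simple name-version.ebuild form used in the gVim example; file names that carry a revision suffix such as -r1 would need extra handling that is not shown here.

```shell
#!/bin/sh
# Split an ebuild file name into package name and version.
# Assumes the simple "<name>-<version>.ebuild" form.
file="gvim-7.4.1342.ebuild"

base=${file%.ebuild}      # strip the extension: gvim-7.4.1342
version=${base##*-}       # everything after the last hyphen
name=${base%-*}           # everything before the last hyphen

echo "name=$name version=$version"
```

For the example file name this prints name=gvim version=7.4.1342.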

RPM Packages

RPM packages (file extension: .rpm, source RPMs have the .src.rpm file extension) are the package format used by Red Hat Linux (RHL), its derivatives (such as CentOS, Fedora, Korora, Oracle Linux, Red Hat Enterprise Linux, Scientific Linux), openSUSE, SUSE Linux Enterprise, etc. They are built using the rpmbuild command, provided by the rpm-build package on most distributions. From what I understand, RPMs are a type of file archive (which can be extracted using bsdtar, or using rpm2cpio piped into cpio). They are not ar archives, however. RPM is a binary package format, although a source code version also exists, which is called an SRPM. SRPMs can also be extracted using bsdtar.

RPM Contents

Extracting an RPM package using user $ bsdtar -xf $package.rpm extracts just the package’s installed files. This might make it seem like RPM packages have no metadata, but they do; it is just not readily apparent when extracting them using bsdtar. To show a summary of the metadata inside these packages you need to run user $ rpm -qip $package.rpm.

rpmbuild

By default, rpmbuild works within a directory called rpmbuild in the current user’s home directory, which has its own set of subdirectories. This is its general structure:

rpmbuild/
  - BUILD
  - BUILDROOT
  - RPMS
  - SOURCES
  - SPECS
  - SRPMS

The BUILD and BUILDROOT subdirectories are used for compiling the source code and collecting the necessary installed files for packaging, respectively. The SOURCES subdirectory contains the source files, including any patches, and SPECS contains the all-important spec files, which instruct the rpmbuild utility how to build the package and what metadata the RPM should contain. The RPM is stored in the RPMS subdirectory and the SRPM is stored in the SRPMS subdirectory.
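The directory layout above can be created by hand with a short loop. In this sketch a temporary directory stands in for the home directory so that nothing real is touched; the variable names fakehome and topdir are my own. The rpmdev-setuptree command from the rpmdevtools package can set up a similar tree for you.

```shell
#!/bin/sh
# Create the rpmbuild directory layout inside a stand-in home directory.
fakehome=$(mktemp -d)
topdir="$fakehome/rpmbuild"

for d in BUILD BUILDROOT RPMS SOURCES SPECS SRPMS; do
    mkdir -p "$topdir/$d"
done

created=$(ls "$topdir" | tr '\n' ' ')
echo "$created"
```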

spec files look sort of like PKGBUILDs, except they use macros instead of many of the variables and functions found in PKGBUILDs. I would provide an example here in this post of Vim’s spec file (the one I use to build Vim in the Open Build Service) but it is over 520 lines long (as opposed to 72 lines for the gVim PKGBUILD shown earlier). So to view it see here. I personally find writing spec files significantly more complicated than writing PKGBUILDs, as a PKGBUILD reads much like a shell script you would write to install the software package locally on your machine. The use of macros can make spec files harder to follow for newcomers to package development.