This is a decomposition of the testing-python.nix and build-vms.nix
files into modules.
By refactoring the glue, we accomplish the following:
- NixOS tests can now use `imports` and other module system features.
- Network-wide test setup can now be reusable; example:
- A setup with all VMs configured to use a DNS server
- Split long, slow tests into multiple tests that import a
common module that has most of the setup.
- Type checking for the test arguments
- (TBD) "generated" options reference docs
- Aspects that had to be wired through all the glue are now in their
own files.
- Chief example: interactive.nix.
- Also: network.nix
In rewriting this, I've generally stuck as close as possible to the
existing code; copying pieces of logic and rewiring them, without
changing the logic itself.
I've made two exceptions to this rule
- Introduction of `extraDriverArgs` instead of hardcoded
interactivity logic.
- Incorporation of https://github.com/NixOS/nixpkgs/pull/144110
in testScript.nix.
I might revert the latter and split it into a new commit.
deprecate literalDocBook by adding a warning (that will not fire yet) to
its uses and other docbook literal strings by adding optional warning
message to mergeJSON.
For example, the wait_for_unit() call in the Moodle test times out for
myself and others[1], so it would be good to be able to increase it to
something less likely to be hit by a test that would otherwise pass.
[1]: https://github.com/NixOS/nixpkgs/pull/177052#issue-1266336706
conversions were done using https://github.com/pennae/nix-doc-munge
using (probably) rev f34e145 running
nix-doc-munge nixos/**/*.nix
nix-doc-munge --import nixos/**/*.nix
the tool ensures that only changes that could affect the generated
manual *but don't* are committed, other changes require manual review
and are discarded.
leaving some newlines around after an admonition was closed causes the
newline rule to match, which in turn inserts literallayout newlines into
te xml output. that's not what we want.
Pass `-t` to pixz to prevent it from appending an index to the end of
the uncompressed stream, confusing tools such as `machinectl import-tar`.
Fixes: #187816
with ever more options being markdown rather than docbook the conversion
time is starting to become a significant factor of doc build time.
luckily we can pre-convert all nixos option docs to MD and cache the
result of this conversion, then merge the already-converted json file
with user option docs. we leave options.json unconverted to keep it as
close to the actual nix code as possible.
during docs conversion it can be very useful to know exactly *where* the
error the script complained about is. the name of the option should be
sufficient since option merging is rather rare, and won't merge doc
attributes anyway.
Within a dual VM test-setup a strange behaviour was observed.
The two VMs are connected via one vde_switch instance
(instancevirtualisation.vlans = [ 1 ]; IMO a bad attribute name for
switch instances, has nothing to do with VLANs in sense of 802.1Q).
A ping on the base interface (eth1) works, but not on VLAN
subinterfaces (vlan1@eth1). A tcpdump of eth1 includes the ARP requests
tagged with the subinterfaces VLAN ID, but responses seems not to pass
the vde_switch. This works fine if performed on the base interface.
Putting the vde_switch in hub mode results in flooding
traffic to all vde_switch ports. This results in a expected behaviour
and a ping on a VLAN subinterface works as expected.
Signed-off-by: Philippe Schaaf <philippe.schaaf@secunet.com>
Previously, the location logic was hardcoded, supporting only
Nixpkgs and NixOps properly, leaving other uses of the module
system without good location support.
- initialSystem was keeping track of the evaluating system
- it had been used by `nesting.children`
- since, 20.09, `nesting.children` has been replaced with named
specializations
It appears that this option was left over and not cleand up properly.
Without this fix, setting the shellopts in `machine.execute` is
inconsitent. When no timeout is used, shellopts `set -euo pipefail` are
applied to the command as expected. When a timeout is specified, the
shellopts are not applied to the command itself (which is called inside
a `sh -c` that doesn't inherit the shellopts) but rather to the
`timeout` command, leading to the following full command:
```bash
(set -euo pipefail; timeout 900 sh -c 'cmd') | (base64 --wrap 0; echo)\n
```
With this fix, this is the command we get:
```bash
timeout 900 sh -c 'set -euo pipefail; false | true') | (base64 --wrap 0; echo)\n
```
Documents the _module.args option, motivated by many usages in Flakes,
especially with the deprecation of extraArgs
(78ada83361)
The documentation rendering for this option had to be handled a bit
specially, since it's not declared in nixos/modules like all the other
NixOS options.
Co-Authored-By: pennae <github@quasiparticle.net>
Co-Authored-By: Robert Hensing <robert@roberthensing.nl>
This patch allows creation of files like
/etc/systemd/system/user-.slice.d/limits.conf with
systemd.units."user-.slice.d/limits.conf" = {
text = ''
[Slice]
CPUAccounting=yes
CPUQuota=50%
'';
};
which previously threw an error
Also renames the systemd-unit-path test to sytsemd-misc, and extends it to
test that `systemd.units` can handle directories. In this case we make
sure that resource limits specified in user slices apply.
Naively deduplicate VLANs in the python driver for NixOS tests. The
current implementation accidentally works, since the VLan class mutates
the environment. On construction it sets QEMU_VDE_SOCKET_${id} and this
environment variable gets overwritten once a second VLAN with the same
id is constructed. Because the NIC flags passed to qemu just use the
QEMU_VDE_SOCKET_${id} environment variable, this implicitly chooses a
single vde_switch process for each VLAN.
However, this leads to unusable vde_switch processes being spawned in
each test run and as a side effect makes it impossible to access the
correct VLan objects in the interactive test driver. It also makes it
remarkably hard to understand why the current implementation ever
worked.
it's really easy to accidentally write the wrong systemd Exec* directive, ones
that works most of the time but fails when users include systemd metacharacters
in arguments that are interpolated into an Exec* directive. add a few functions
analogous to escapeShellArg{,s} and some documentation on how and when to use them.
Makes all options rendered in the manual throw an error if they don't
have a type specified.
This is a follow-up to #76184
Co-Authored-By: Silvan Mosberger <contact@infinisil.com>
With the previous change that enabled error propagation through
`inherit_errexit`, the script would fail if `errexit` was set, but
`inherit_errexit` was not. This is due to `shopt -p` exiting with an
error if the option is disabled. To work around this, use the exit
code instead of the text value returned by `shopt -p`.
Fixes#160869.
The aarch64-linux versions of the boot.uefiUsb and boot.uefiCdrom tests
were broken by b0fc9da879.
That commit was a refactor which omitted the qemuBinary option, which was
previously available in the legacy start command. This restores that
option and fixes the tests previously mentioned.
If an error occurs while trying to read a secret file, we want that
error to propagate to the main shell context. That means we have to
set the `inherit_errexit` option, which allows errors from subshells
to propagate to the outer shell. Also, the subshell cannot run as part
of another command, such as `export`, since that will simply ignore
the subshell exit status and only respect `export`s exit status; first
assigning the value to a variable and then exporting it solves issue.
this partially solves the problem of "missing description" warnings of the
options doc build being lost by nix build, at the cost of failing builds that
previously ran. an option to disable this behaviour is provided.
link to search.nixos.org instead of pulling package metadata out of pkgs. this
lets us cache docs of a few more modules and provides easier access to package
info from the HTML manual, but makes the manpage slightly less useful since
package description are no longer rendered.
most modules can be evaluated for their documentation in a very
restricted environment that doesn't include all of nixpkgs. this
evaluation can then be cached and reused for subsequent builds, merging
only documentation that has changed into the cached set. since nixos
ships with a large number of modules of which only a few are used in any
given config this can save evaluation a huge percentage of nixos
options available in any given config.
in tests of this caching, despite having to copy most of nixos/, saves
about 80% of the time needed to build the system manual, or about two
second on the machine used for testing. build time for a full system
config shrank from 9.4s to 7.4s, while turning documentation off
entirely shortened the build to 7.1s.
by default all cores are used
hoping this will fix the hydra i686 squashfs build issues as all the
failures were using 64 cores
Parallel mksquashfs: Using 64 processors
Creating 4.0 filesystem on ..., block size 1048576.
FATAL ERROR: mangle2:: xz compress failed with error code 5
to do this we must replace derivations with attrsets in make-options-doc, since
xml can represent derivations differently from attrset but json cannot. this
also given asciidoc and mddoc the ability to handle derivation differently,
which they previously didn't have.
use the json file derivation we already have to also generate the asciidoc and
md options docs instead of formatting the options in nix. docbook docs are
already produced in derivations.
the new script produce the exact same output as the old in-nix generation.
When displaying the amount of time some step took, with no other
context, it becomes nigh impossible (especially in longer tests) to see
when specific steps finished.
The flag -cpu max leaves QEMU 6.1.0 stuck on some systems,
for example when /dev/kvm is not read-writable.
This does not happen with -cpu qemu64.
Getting stuck like that is a regression in 6.1.0 not yet present in 6.0.0
and should be fixed with 6.2.0 according to early testing with rc1.
We should consider reverting this change when we merge QEMU 6.2.0.
See #146526.
fixes#141596
By using the new extendModules function to produce the specialisations,
we avoid reimplementing the eval-config.nix logic in reverse and fix
cross compilation support for specialisations in the process.
This reverts commit e2bea4427b.
While this commit was probably fine, I want to be conservative
with changes right before the release branch-off.
So far the extendModules commits have been adding and refactoring
expressions that did not affect derivation hashes, whereas this
commit changes the module ordering. I will open a separate PR for
it.
The involved test was nixosTests.nextcloud.basic21.
It has a testScript that is strict in nodes.nextcloud.config.system.build.vm,
in assertions about imagemagick being in the system closure.
The recursion was introduced in 329a4461a7 from
https://github.com/NixOS/nixpkgs/pull/140792
Make sure the all derivations referenced by the test script are
available on the nodes. Accessing these derivations works just fine
without this change when using 9p to mount the host's store, but when
an image is built (virtualisation.buildRootImage), the dependencies
need to be copied to the image. We don't want to copy the script
itself, though, since that would trigger unnecessary image rebuilds.
Add a copyChannel argument which controls whether the current source
tree will be made available as a nix channel in the image or
not. Previously, it always was. Making it available is useful for
interactive use of nix utils, but changes the hash of the image when
the sources are updated.
The current implementation just forks off a thread to read
QEMU's stdout and lets it exist forever. This, however,
makes the interpreter shutdown racy, as the thread could
still be running and writing out buffered stdout when the
main thread exits (and since it's using the low level API,
the worker thread does not get cleaned up by the atexit hooks
installed by `threading`, either). So, instead of doing that,
let's create a real `threading.Thread` object, and also
explicitly `join` it along with the other stuff when cleaning up.
we need the file itself as a dependency for the docbook build, but we don't need
it to be properly sorted at the nix level. push the sort out to a python script
instead to save eval time. on the machine used to write this `nix-instantiate
<nixos/nixos> -A system` went down from 7.1s to 5.4s and GC heap size decreased
by 50MB (or 70MB max RSS).
We have no usecase for manually/selectively starting or stopping VLANs
in integration tests.
By starting and stopping the VLANs with the constructor and destructor
of VLAN objects, we remove the obligation and complexity to maintain
network lifetime separately.
This commit encapsulates the involved domain into classes and
defines explicit and typed arguments where untyped dicts where used.
It preserves backwards compatibility through legacy wrappers.
The current name is misleading: it doesn't contain cli arguments,
but several constants and utility functions related to qemu.
This commit also removes the use of `with import ...` for clarity.
This is a private interface for internal NixOS use. It is similar
to `make-disk-image` except it is much more opinionated about what
kind of disk image it'll make.
Specifically, it will always create *two* disks:
1. a `boot` disk formatted with FAT in a hybrid GPT mode.
2. a `root` disk which is completely owned by a single zpool.
The partitioning and FAT decisions should make the resulting images
bootable under EFI or BIOS, with systemd-boot or grub.
The root disk's zpools options are highly customizable, including
fully customizable datasets and their options.
Because the boot disk and partition are highly opinionated, it is
expected that the `boot` disk will be mounted at `/boot`. It is
always labeled ESP even on BIOS boot systems.
In order for the datasets to be mounted properly, the `datasets`
passed in to `make-zfs-image` are turned in to NixOS configuration
stored at /etc/nixos/configuration.nix inside the VM.
NOTE: The function accepts a system configuration in the `config`
argument. The *caller* must manually configure the system
in `config` to have each specified `dataset` be represented
by a corresponding `fileSystems` entry.
One way to test the resulting images is with qemu:
```sh
boot=$(find ./result/ -name '*.boot.*');
root=$(find ./result/ -name '*.root.*');
echo '`Ctrl-a h` to get help on the monitor';
echo '`Ctrl-a x` to exit';
qemu-kvm \
-nographic \
-cpu max \
-m 16G \
-drive file=$boot,snapshot=on,index=0,media=disk \
-drive file=$root,snapshot=on,index=1,media=disk \
-boot c \
-net user \
-net nic \
-msg timestamp=on
```
Previous to this commit, the entire test driver environment was shared
with the actual python test environment.
This is a hefty api surface. This commit selectively exposes only those
symbols to the test environment that are actually meant to be used by
tests.
This is the case when the test-script is empty. `nixos-build-vms(8)` is
primarily supposed to be used as tool to test changes or to reproduce
bugs (IMHO) where "just spinning up a few VMs" is the primary use-case.
In the ongoing discussion about these changes[1] it was suggested to
only expose it when needed (i.e. in the case I described above) to keep
the API surface as slim as possible.
[1] https://github.com/NixOS/nixpkgs/pull/133675#discussion_r688112485
This is relevant for `nixos-build-vms(8)` which doesn't have a
test-script. In that case it's more intuitive to directly go into the
interactive mode which is IMHO more intuitive.
Previously the driver was configured exclusively through convoluted
environment variables.
Now the driver's defaults are configured through env variables.
Some additional concerns are in the github comments of this PR.
the use of python further restricts possible RFC1035 host labels since
dash is not allowed for use in python identifiers.
The previous implementation of this check was flawed, since it did not
check the `hostName` value that is actually used to construe the
identifier, but the node name, which can be anything, e.g. just `machine`.
The previous implementation, by further restricting RFC1035 labels, only
for the sake of testing seems to be an unacceptable restriction and should
be addressed separately.
The disk image calculator was using find + exec forking du for every
file in the disk image, making it very slow. Change du to accept files,
nul delimeted on stdin to speed it back up.
Before change:
nix-build nixos/tests/image-contents.nix 9.71s user 1.06s system 8% cpu 2:13.11 total
After change:
nix-build nixos/tests/image-contents.nix 9.93s user 1.23s system 21% cpu 51.601 total
nixos tests are blended with other system configurations, hence
their settings must be either enforced or defaulted.
This particular setting is set via lib.nixosSystem as
`system.nixos.revision = final.mkIf (self ? rev) self.rev;` which would
mean that without this change no flake generated nixos could be blended
with nixos testing.
This setting was made previously constant in
169c6b4b14 in order to avoid pointless
rebuilds of the testing VMs, but was set without enforcing it.
Apparently this looks like it was forgotten when doing commit
3884ff70ba, which refactored the test
runner and driver a bit.
The passthru argument actually was correctly reintroduced in
setupDriverForTest, but the actual makeTest function didn't use it.
This fixes the nixpkgs tarball job, which previously failed with:
attribute 'elkPackages' missing, at /build/source/pkgs/tools/misc/logstash/6.x.nix:58:30
Signed-off-by: aszlig <aszlig@nix.build>
Acked-by: David Arnold <dar@xoe.solutions>
Fixes: https://github.com/NixOS/nixpkgs/issues/127274
Merges: https://github.com/NixOS/nixpkgs/pull/127346
Less nesting, where that improves readability. More nesteing, where
that improves readability, but most importantly:
Expose individual functions separately so that they can be more easily
built directly, eg.:
`nix build --impure --expr '(import ./testing-python.nix {system = builtins.currentSystem;}).mkTestDriver'`
At nixpkgs root:
`rg redirectSerial ./` does not result in any other match
nor does
`rg USE_SERIAL ./` except for an unrelated match in:
pkgs/tools/graphics/argyllcms/default.nix
Bash's standard behavior of not propagating non-zero exit codes
through a pipeline is unexpected and almost universally
unwanted. Default to setting `pipefail` for the command being run;
it can still be turned off by prefixing the pipeline with
`set +o pipefail` if needed.
Also, set `errexit` and `nonunset` options to make the first command
of consecutive commands separated by `;` fail, and disallow
dereferencing unset variables respectively.
When importing Nixpkgs within Nixpkgs, we should not consider aliases
to ensure we don't rely on them internally.
There are probably more places that need to be converted.
The root filesystem resizing step, `resize2fs -M', does not provide any
control over the amount of slack left in the result. It can produce an
arbitrarily tight fit, depending on how well the payload aligns with
ext4 data structures.
This is problematic, as NixOS must create a few files and directories
during its first boot, before the root is enlarged to match the size of
the containing SD card.
An overly tight fit can cause failures in the first stage:
mkdir: can't create directory '/mnt-root/proc': No space left on device
or in the second stage:
install: cannot create directory '/var': No space left on device
A previous version of `make-ext4-fs' (before PR #79368) was explicitly
"reserving" 16 MiB of free space in the final filesystem. Manually
calculating the size of an ext4 filesystem is a perilous endeavor,
however, and the method it employed was apparently unreliable.
Reverting is consequently not a good option.
A solution would be to create some sort of "balloon" occupying inodes
and blocks in the image prior to invoking `resize2fs -M', and to remove
these temporary files/directories before the compression step.
This changeset takes the simpler approach of simply dropping the
resizing step.
Note that this does *not* result in a larger image in general, as the
current procedure does not truncate the `.img' file anyway. In fact, it
has been observed to yield *smaller* compressed images---probably
because of some "noise" left after resizing. E.g., before-vs-after:
-r--r--r-- 2 root root 607M 1. Jan 1970 nixos-sd-image-21.11pre-git-x86_64-linux.img.zst
-r--r--r-- 2 root root 606M 1. Jan 1970 nixos-sd-image-21.11pre-git-x86_64-linux.img.zst
For now you had to know that the actions are retried for 900s when
seeing an error like
> Traceback (most recent call last):
> File "/nix/store/dbvmxk60sv87xsxm7kwzzjm7a4fhgy6y-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 927, in run_tests
> exec(tests, globals())
> File "<string>", line 1, in <module>
> File "<string>", line 31, in <module>
> File "/nix/store/dbvmxk60sv87xsxm7kwzzjm7a4fhgy6y-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 565, in wait_for_file
> retry(check_file)
> File "/nix/store/dbvmxk60sv87xsxm7kwzzjm7a4fhgy6y-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 142, in retry
> raise Exception("action timed out")
> Exception: action timed out
in your (hydra) build failure. Due to the absence of timestamps you were
left guessing if the machine was just slow, someone passed a low timeout
value (which they couldn't until now) or whatever might have happened.
By making this error a bit more descriptive (by including the elapsed
time) these hopefully become more useful.
Our test driver exposes a bunch of variables and functions, which
pyflakes doesn't recognise by default because it assumes that the test
script is executed standalone. In reality however the test driver script
is using exec() on the testScript.
Fortunately pyflakes has $PYFLAKES_BUILTINS, which are the attributes
that are globally available on all modules to be checked. Since we only
have one module, using this environment variable is fine as opposed to
my first approach to this, which tried to use the unstable internal API
of pyflakes.
The attributes are gathered by the main derivation of the test driver,
because we don't want to end up defining a new attribute in the test
driver module just to being confused why using it in a test will result
in an error.
Another way we could have gathered these attributes would be in
mkDriver, which is where the linting takes place. However, we do have a
different set of Python dependencies in scope and duplicating these will
again just cause confusion over having it at one location only.
Signed-off-by: aszlig <aszlig@nix.build>
Co-Authored-By: aszlig <aszlig@nix.build>
So far, we have used "black" for formatting the test code, which is
rather strict and opinionated and when used inline in Nix expressions it
creates all sorts of trouble.
One of the main annoyances is that when using strings coming from Nix
expressions (eg. store paths or option definitions from NixOS modules),
completely unrelated changes could cause tests to fail, since eg. black
wants lines to be broken.
Another downside of enforcing a certain kind of formatting is that it
makes the Nix expression code inconsistent because we're mixing two
spaces of indentation (common in nixpkgs) with four spaces of
indentation as defined in PEP-8. While this is perfectly fine for
standalone Python files, it really looks ugly and inconsistent IMO when
used within Nix strings.
What we actually want though is a linter that catches problems early on
before actually running the test, because this is *actually* helping in
development because running the actual VM test takes much longer.
This is the reason why I switched from black to pyflakes, because the
latter actually has useful checks, eg. usage of undefined variables,
invalid format arguments, duplicate arguments, shadowed loop vars and
more.
Signed-off-by: aszlig <aszlig@nix.build>
Closes: https://github.com/NixOS/nixpkgs/issues/72964
On my system I have XWayland disabled and therefore only WAYLAND_DISPLAY
is set. This ensures that the graphical output will still be enabled on
such setups (both Wayland and X11 are supported by the viewer).
Work around missing /dev files inside runInLinuxVM by creating a
symlink before calling nixos-enter.
This fixes https://github.com/NixOS/nixpkgs/issues/93381.
I ran into this issue when trying to create a VMware image that boots from EFI.
Thanks @colemickens for reporting this and @danielfullmer for fixing the same thing in in qemu-vm.nix (37676e77cb) and explaining what the issue was.
This ensures the following gptfdisk warning won't happen:
```
Warning: File size is not a multiple of 512 bytes! Misbehavior is likely!
```
Additionally, helps towards aligning the partition to be more optimal
for the underlying storage.
It is actually impossible to align for the actual underlying storage
optimally because we don't know what the block device will be!
But aligning on 1MiB should help.
This is a bit of a thorny issue. See, the actual `diskSize` variable is
for the *total* disk size, not for the filesystem!
The automatic numbers are meant to compute the *filesystem* required
space. So we have to add any other reserved space!
We have different requirements for reserved space. E.g. there could be
none (when it's actually a filesystem image). There could also be 1MiB
for alignment for an MBR image, legacy+gpt needs 2MiB, then GPT with an
ESP ("bootSize") needs to take the boot partition and GPT size into
account too!
Though luckily(?) for this latter situation we can cheat! As noted in the
change, `bootSize` is NOT the boot partition size. It is actually the
offset where the target filesystem starts.
Reserved space includes:
- inodes space in use (2 blocks per)
- about 5.2% of the space
The 5.2% reserved space was computed empirically when working on a
previous EXT4 image builder. It seems to stabilize around 5% even for
much larger filesystems.
On some filesystems, `du` without `--apparent-size` will not give the
actual size for a file. Using `--apparent-size` will give us the actual
file size.
Though, this is not actually correct still. 1000 × 1 bytes is not 1000
bytes. It is 1000 × ceil(filesize/blockSize)*blockSize.
So instead of adding up the actual file sizes. We are adding up the
block sizes.
Note that this also changes the builder to work with *bytes*, rather
than with any other units. Doing maths on bytes is less likely to go
awry than doing it on other units.
When performing OCR, some of the Tesseract settings perform better than
others on a variety of different workloads, but they mostly take
~negligible incremental time to run compared to the overhead of running
the ImageMagick filters.
After this commit, we try using all three of the current Tesseract
models (classic, LSTM, and classic+LSTM) to generate output text. This
fixes chromium-90's tests at release-20.09, and should make cases where
you're looking for *specific* text better, with the tradeoff of running
Tesseract multiple times.
To make it sensible to cherrypick this into release-20.09, this doesn't
change the existing API surface for the test driver. In particular,
get_screen_text continues to have the existing behaviour.
Make the revision metadata constant, in order to avoid needless retesting.
The human version (e.g. 21.05-pre) is left as is, because it is useful
for external modules that test with e.g. nixosTest and rely on that
version number.
the nix store may contain hardlinks: derivations may output them
directly, or users may be using store optimization which automatically
hardlinks identical files in the nix store.
The presence of these links are intended to be a 'transparent'
optimization. However, when creating a squashfs image, the image
will be different depending on whether hard links were present
on the filesystem, leading to reproducibility problems.
By passing '-no-hardlinks' to mksquashfs the files are stored
as duplicates in the squashfs image. Since squashfs has support
for duplicate files this does not lead to a larger image.
For more details see
https://github.com/NixOS/nixpkgs/issues/114331
/var/lib/nixos is used by update-users-groups.pl in the activation
script for storing uid/gid mappings. If this has its own mountpoint
(as is the case in some setups with fine-grained bind mounts pointing
into persistent storage), the mappings are written to /var/lib, /var,
or /. These may be backed by a tmpfs or (otherwise ephemeral storage),
resulting in the mappings not persisting between reboots.