Sunday, March 3, 2024

vimrc tips

On Debian-family systems, vim.tiny may be providing the vim command through the alternatives system.  If I bring in my dotfiles before installing a full vim package such as vim-gtk3, sourcing my vimrc produces dozens of errors, because vim.tiny supports very few features.
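
If I need to check which program is actually behind the vim name, or switch it after installing a full package, the alternatives system can report on itself.  A quick sketch, using the stock Debian commands:

$ update-alternatives --display vim
$ sudo apt install vim-gtk3
$ sudo update-alternatives --config vim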

Other times, I run gvim -ZR to quickly check some code in restricted, read-only mode.  In that case, anything that wants to run a shell command will fail.  Restricted mode is also a signal that I don’t trust the files I’m viewing, so I don’t want to process their modelines at all.

To deal with these scenarios, my vimrc is shaped like this (line count heavily reduced for illustration):

set nocompatible ruler laststatus=2 nomodeline modelines=2
if has('eval')
    call plug#begin('~/.vim/plugged')
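    " In restricted mode (-Z), system() raises E145; probe for it up front.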
    try
        call system('true')
        Plug 'dense-analysis/ale'
        Plug 'mhinz/vim-signify' | set updatetime=150
        Plug 'pskpatil/vim-securemodelines'
    catch /E145/
    endtry
    Plug 'editorconfig/editorconfig-vim'
    Plug 'luochen1990/rainbow'
    Plug 'tpope/vim-sensible'
    Plug 'sapphirecat/garden-vim'
    Plug 'ekalinin/Dockerfile.vim', { 'for': 'Dockerfile' }
    Plug 'rhysd/vim-gfm-syntax', { 'for': 'markdown' }
    Plug 'wgwoods/vim-systemd-syntax', { 'for': 'systemd' }
    call plug#end()
    if !has('gui_running') && exists('&termguicolors')
        set termguicolors
    endif
    let g:rainbow_active=1
    colorscheme garden
endif

We start off with the universally supported settings.  Although I use the abbreviated forms in the editor, my vimrc has the full spellings, for self-documentation.

Next is the feature detection: if has('eval') … endif.  Even without +eval, vim recognizes :if and :endif and skips everything in between, so this ensures that vim.tiny doesn’t process the block.  Sadly, inverting the test and using the finish command inside didn’t work.

If we have full vim, we start loading plugins, with a try/catch for restricted mode.  If we can’t run the true shell command, we catch the resulting E145 and proceed without the plugins that need shell access.  Without that guard, ALE and signify would load in restricted mode, but throw errors as soon as we opened files.

After that, it’s pretty straightforward: we’re running in a full vim, loading things that can run in restricted mode.  Once the plugin list is closed, we finish by configuring and activating the ones that need it.

Friday, February 2, 2024

My Issues with Libvirt / Why I Kept VirtualBox

At work, we use VirtualBox to distribute and run development machines.  The primary reasons for this are:

  1. It is free (gratis), at least the portions we require
  2. It has import/export

However, it isn’t developed in the open, and it has a worrying tendency to print sanitizer warnings on the console when I shut down my laptop.

Can I replace it with kvm/libvirt/virt-manager?  Let’s try!

Sunday, December 31, 2023

No More Realtek WiFi

The current Debian kernel (based on 6.1.66, after the ext4 corruption mess) seems to be locking up with the Realtek USB wireless drivers I use.  Anything that wants the IP address (like agetty or ip addr) hangs, as does shutdown.  It all works fine on the “old” kernel, which is the last version prior to the ext4 issue.

Meanwhile, in Ubuntu 23.10, the in-kernel RTW drivers were flaky and kept bouncing the connection, so I had returned to morrownr’s driver there as well.  But now that I don’t trust any version of this driver?  Forget this company.  In the future, I will take any other option:

  1. A Fenvi PCIe WiFi card with an Intel chip on board, or the like
  2. Using an extra router as a wireless client/media bridge, with its Ethernet connected to the PC
  3. If USB were truly necessary, as opposed to simply “convenient,” a Mediatek adapter

Remember that speed testing and studying dmesg output led me to the conclusion that this chipset comes up in USB 2.0 mode, and even the Windows drivers just use it that way.  While morrownr’s driver offers the ability to switch it to USB 3.0 mode under Linux, doing so makes the device disconnect and fail to come back up properly.  I never researched hard enough to find out whether there is a way to make that work, short of warm rebooting so that the adapter is already in USB 3.0 mode.

It’s clearly deficient by design, and adding injury to insult, the drivers aren’t even stable.  Awful experience, one star ★☆☆☆☆, would not recommend. Intel or Mediatek are much better choices.

Addendum, 2024-01-13: I purchased an AX200-based Fenvi card, the FV-AXE3000Pro.  It seemed not to work at all.  In Windows it would fail to start with error code 10, and in Linux it would fail to load RT ucode with error -110.  And then, Linux would report hangs for thermald, and systemd would wait forever for it to shut down.  When the timer ran out at 1m30s, it would just kick up to 3m.

Embarrassingly enough, all problems were solved by plugging it into the correct PCIe slot.  Apparently, despite being physically compatible, graphics card slots (which already had the punch-outs on my case, um, punched out) are for graphics cards only.  (My desktop is sufficiently vintage that it has two PCIe 3.0 x16 slots, one with 16 lanes and one with 4 lanes, and two classic PCI slots between them.)

Result: my WiFi is 93% faster, matching the WAN rate as seen on the Ethernet side of the router.  Good riddance, Realtek!

Tuesday, December 26, 2023

Diving too deeply into DH Moduli and OpenSSH

tl;dr:

  • Debian/Ubuntu use the /etc/ssh/moduli file as distributed by the OpenSSH project at the time of the distribution’s release
  • This file is only used for the diffie-hellman-group-exchange-* KexAlgorithms
  • The default KEX algorithm on my setup is the post-quantum hybrid sntrup761x25519-sha512@openssh.com instead
  • Therefore, you can generate your own moduli, but it is increasingly irrelevant
  • Having more moduli listed means that sshd will do more processing during every connection attempt that uses the file

There is also a “fallback” behavior if the server can’t read the moduli file or find a match, which I don’t fully understand.
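
On the fourth point: if you want to regenerate the file anyway, something like this should do it, assuming a reasonably modern ssh-keygen with the -M interface (older versions spelled these steps -G and -T).  The file names here are my own:

$ ssh-keygen -M generate -O bits=3072 moduli-3072.candidates
$ ssh-keygen -M screen -f moduli-3072.candidates moduli-3072

Screening is the slow part, and the result still only matters if a diffie-hellman-group-exchange-* algorithm is actually negotiated; ssh -Q kex lists what the client supports.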

Wednesday, December 13, 2023

Viewing the Expiration Date of an SSH Certificate

A while ago, I set up my server with SSH certificates (following a guide like this one), and today, I began to wonder: “When do these things expire?”

Host certificate (option 1)

Among the output of ssh -v (my client is OpenSSH_9.3p1 Ubuntu-1ubuntu3) is this line about the host certificate:

debug1: Server host certificate: ssh-ed25519-cert-v01@openssh.com [...] valid from 2023-04-07T19:58:00 to 2024-04-05T19:59:44

That tells us the server host certificate expiration date, where it says “valid … to 2024-04-05.”  For our local host to continue trusting the server, without using ~/.ssh/known_hosts and the trust-on-first-use (TOFU) model, we must re-sign the server key and install the new signature before that date.
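
If only that one line is wanted, grep can pull it from the debug output, which goes to stderr; demo.example.org stands in for the real server here, as in the keyscan example below:

$ ssh -v demo.example.org exit 2>&1 | grep 'Server host certificate'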

User certificate

I eventually learned that ssh-keygen -L -f id_ed25519-cert.pub will produce some lovely human-readable output, which includes a line:

Valid: from 2023-04-07T20:14:00 to 2023-05-12T20:15:56

Aha!  I seem to have signed the user certificate to last a month-ish beyond the host key’s signing date.  I will be able to log into the server without my key listed in ~/.ssh/authorized_keys (on the server) until 2023-05-12.

This looks like a clever protection mechanism left by my past self.  As long as I log into my server at least once a month, I’ll see an untrusted-host warning before my regular authentication system goes down.  (If that happened, I would probably have to use a recovery image and/or the VPS web console to restore service.)

Host certificate (option 2)

There’s an ssh-keyscan command, which offers a -c option to print certificates instead of keys.  It turns out that we can paste its output to get the certificate validity again.  (Lines shown with $ or > are input, after that prompt; the other lines, including #, are output.)

$ ssh-keyscan -c demo.example.org
# demo.example.org:22 SSH-2.0-OpenSSH_8.9p1
# demo.example.org:22 SSH-2.0-OpenSSH_8.9p1
ssh-ed25519-cert-v01@openssh.com AAAA[.....]mcwo=
# demo.example.org:22 SSH-2.0-OpenSSH_8.9p1

The ssh-ed25519-cert line is the one we need.  We can pass it to ssh-keygen with a filename of - to read standard input, then use the shell’s “heredoc” mechanism to provide the standard input:

$ ssh-keygen -L -f - <<EOF
> ssh-ed25519-cert-v01@openssh.com AAAA[.....]mcwo=
> EOF

Now we have the same information as before, but from the host certificate.  This includes the Valid: from 2023-04-07T19:58:00 to 2024-04-05T19:59:44 line again.
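
In fact, the paste step is avoidable: ssh-keyscan’s # lines go to stderr, so its output should pipe straight into ssh-keygen:

$ ssh-keyscan -c demo.example.org 2>/dev/null | ssh-keygen -L -f -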

Tips for setting up certificates

Document what you did, and remember the passphrases for the CA keys!  This is my second setup, and this time I have scripts that run the commands with all of my previous choices.  They’re merely one-line shell scripts wrapping ssh-keygen, but they still effectively record everything: the server name list, identity, validity period, and so forth.
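
As a sketch of what such a script can look like (every name and the validity period here are invented examples, not my real values):

#!/bin/sh
# Sign the host key: -h marks a host certificate, -n lists the principals.
ssh-keygen -s host-ca -I demo-server -h \
    -n demo.example.org,203.0.113.10 \
    -V +52w ssh_host_ed25519_key.pub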

To sign keys for multiple users/servers, it may be convenient to add the CA key to an SSH agent.  Start a separate one to keep things extra clean, then specify the signing key slightly differently:

$ ssh-agent $SHELL
$ ssh-add user-ca
$ ssh-keygen -Us user-ca.pub ...
(repeat to sign other keys)
$ exit

Note the addition of -U (specifying the CA key is in an agent) and the use of the .pub suffix (the public half of the key) in the signing process.

Saturday, October 7, 2023

The Logging Tarpit

Back in August, Chris Siebenmann had some thoughts on logging.

A popular response in the comments was “error numbers solve everything,” possibly along with a list (provided by the vendor) detailing all error numbers.

The first problem is: what if the error number changes?  MySQL changed its duplicate-key error to report the key name instead of the key index, and consequently changed the error code from 1062 to 1586.  Code or tools that were monitoring for 1062 would never hit a match again.  Conversely, if “unknown errors” were being monitored for emergency alerts, the appearance of 1586 might get more attention than it deserves.

In other cases, the error numbers may not capture enough information to provide a useful diagnostic.  MySQL code 1586 may tell us that there was a duplicate value for a unique key, but we need the log message to tell us which value and key.  Unfortunately, that is still missing the schema and table!

Likewise, one day, my Windows 10 PC hit a blue screen, and the only information logged was a machine-check error code, 0x3E.  The message “clarified” that this was a machine check exception with code 3e.  No running thread/function, no stack trace, no context.

Finally, logging doesn’t always capture intent fully.  If a log message is generated, is it because of a problem, or is it operationally irrelevant?  Deciding this is the real tar pit of log monitoring, and the presence of an error number doesn’t really make a difference.  There’s no avoiding the decisions.

In the wake of Chris’ posts, I changed one of our hacky workaround services to log a message if it decides to take action to fix the problem.  Now we have the opportunity to find out if the service is taking action, not simply being started regularly.  Would allocating an error number (for all time) help with that?

All of this ends up guiding my log-monitoring philosophy: look at the logs sometimes, and find ways to clear the highest-frequency messages.  I don’t want dozens of known-uninteresting lines clogging the log during incident response.  For example, “can’t load font A, using fallback B”: we’d either install font A properly, or mute the message for font A specifically.  But I want to avoid trying to categorize every single message, because that way lies madness.

Friday, September 29, 2023

AWS: Requesting gp3 Volumes in SSM Automation Documents

I updated our EC2 instance-building pipeline to use the gp3 volume type, which offers more IOPS at lower costs.

Our initial build runs as an SSM [Systems Manager] Automation Document.  The first-stage build instance is launched from an Ubuntu AMI (with gp2 storage), and produces a “core” image with our standard platform installed.  This includes things like monitoring tools, our language runtime, and so forth.  The core image is then used to build final AMIs that are customized to specific applications.  That is, the IVR system, Drupal, internal accounting platform, and antique monolith all have separate instances and AMIs underlying them.

Our specific SSM document uses the aws:runInstances action, and one of the optional inputs to it is BlockDeviceMappings.  Through some trial and error, I found that the value it requires is the same structure as the AWS CLI uses:

- DeviceName: "/dev/sda1"
  Ebs:
    VolumeType: gp3
    Encrypted: true
    DeleteOnTermination: true

Note 1: this is YAML, which requires spaces for indentation.  Be sure “Ebs” is indented two spaces, and the subsequent lines four spaces.  The structure above is a one-element array containing a dictionary with two keys, where the “Ebs” key holds another dictionary (with 3 items).

Note 2: the DeviceName I am using comes from the Ubuntu AMI that I am using to start the instance.  DeviceName may vary with different distributions.  Check the AMI you are using for its root device setting.

The last two lines (Encrypted and DeleteOnTermination) may be unnecessary, but I don’t like leaving things to chance.
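
Put together, the relevant step of the document looks roughly like this; the step name and the other inputs are placeholders rather than our real pipeline:

mainSteps:
  - name: launchBuildInstance
    action: aws:runInstances
    inputs:
      ImageId: "{{ SourceAmiId }}"
      InstanceType: t3.medium
      BlockDeviceMappings:
        - DeviceName: "/dev/sda1"
          Ebs:
            VolumeType: gp3
            Encrypted: true
            DeleteOnTermination: true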

Doing this in a launch template remains a mystery.  The best I have managed is that, when I try to use the launch template, Amazon warns me that it’s planning to ignore the entire volume as described in the template.  It appears as if it will replace the volume with the one from the AMI, rather than merging the configurations.

I know I have complained about Amazon in the past for not providing a “launch from template” operation in SSM, but in this case, it appears to have worked out in my favor.