Friday, June 12, 2020

Puppet Acceptance Tests, the hard way (part 1)

The first in a series of posts about building acceptance test infrastructure for my team's Puppet.

This first post is mainly background, thoughts, and ranting.  It's been quite cathartic to write, but it's probably not super useful.  If you just want to see some tech stuff, skip to the next post.

Here are (or will be) the related posts, as I write them:
  1. ... There aren't any yet, this is the first one.

Testing @ LMAX

Our Development team has an amazing CI and Staging environment of their own (my team doesn't use it, we just keep it running).  They've been iterating on their CI/CD toolkit for 10+ years, and while bits and pieces have been swapped out and replaced with other components, it's essentially the same system.  It helps enormously when a culture of continuous delivery is instilled in your development team from the get-go, rather than trying to bolt CI on later.

It is very mature, and while it's not perfect and there are many things I'd like to improve, it is a lot better than most.  Every now and then someone moves on and gets to experience how other companies do their testing.  When meeting up for beers a few months later, the feedback often goes something like "No one does it as good as LMAX".  It's nice to hear, it's nice to be validated, but it's also nice to know there are others that would give us a run for our money.

Over on my side of the fence, where the infrastructure is, things are not as rosy as they are for my coder colleagues.  We are primarily designers and builders; hardware experts, systems and networking specialists.  We're not developers, but we do code to an extent.

We've used Puppet for about 8 years, but the testing of our Puppet code has been lacking; a gap highlighted every time I glance up from my desk at the 60" TVs displaying the Dev team's CI results in real time.

Puppet describes the vast majority of our operating system configuration, and I would say we are on the Power User side of the community.  Our Puppet configures complex storage and networking, makes BIOS and performance tuning changes, and drives the monitoring of our server infrastructure.

Such complexity without tests slowed us down: catalog compilation failures, missing Hiera data, stuff that works for 90% of machines but breaks for a few exceptions.  Touch one thing over here, and something all the way over there breaks.

In the last 4 years we've gotten a lot better.  We introduced rspec-puppet unit tests.  We adopted the Roles and Profiles design pattern.  We built our own CI, using GitLab.  We moved to Puppet 4, then 5, then 6.  We've retired many lmax-* modules and replaced them with upstream ones that are better tested.  We have far fewer catalog compilation failures now, and the culture is shifting to "I'll write a test for that".

As we learn and our testing gets better, I find we are slowed down less by the simple Puppet problems and more by the interactions between our complexity, our data and our platform.  The kind of problems where combining this Puppet role with this Hiera data, on top of a certain shape of system, causes a machine to provision incorrectly.

These complex failures are difficult to model using unit tests.  We need to actually run Puppet on a system, not just compile a catalog.  To test how our Puppet interacts with a system, we need to run it on something that looks as much like real hardware as possible.

Provisioning

A good portion of what my team does is provisioning and re-provisioning systems, both hardware and VMs.  When it works, it works great: you send a piece of hardware for network boot, it kickstarts itself, reboots into what we call our initial Puppet run, and if that has no failures it reboots again.  After the second reboot the machine should be ready for use - it is running the kernel we want, networking has come up on boot, storage is in place, our monitoring auto-magically updates to add the new machine, and everything goes green.

If it doesn't work perfectly, you've got to figure out what went wrong.  In some cases it might be a simple failure that a second Puppet run will fix.  We aim for single-pass Puppet, but sometimes there are bugs.  If it's a serious problem like a catalog compilation failure, it takes longer to repair: console, root login, check the logs, etc.

Being able to continually test provisioning machines over and over, and notify the team of any failures, would make us faster.  Taking the idea further, the long term dream would be to provision sets of machines as different things and run tests across those systems to make sure they interact with each other correctly.  If eventually we could simulate the provisioning of an entire financial exchange at once...  Now that would be something to brag about.

I would need to be able to test the low level things that our Puppet does now; things like creating Logical Volumes and file systems, and making kernel changes.  A container-based acceptance test system would probably be easier to start with, but it's not an accurate representation of hardware, so it would not provide as much value long term.  For these reasons I wanted to concentrate on testing with VMs.
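Those low-level resources are exactly what compiling a catalog can't prove.  As a hedged sketch (not our actual code), the kind of thing we need to exercise for real looks like this, using the puppetlabs/lvm module; the volume group, LV name and size here are made up:

    # Apply a snippet of storage config for real - a sketch using the
    # puppetlabs/lvm module types.  Assumes the module is on the
    # modulepath and the volume group vg00 already exists.
    puppet apply -e '
      logical_volume { "lv_data":
        ensure       => present,
        volume_group => "vg00",
        size         => "20G",
      }
      filesystem { "/dev/vg00/lv_data":
        ensure  => present,
        fs_type => "xfs",
      }
    '

A unit test can only assert that these resources appear in the catalog; whether the LV actually gets created on a given shape of system is exactly the question we can't answer without running it.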

It was simple enough to "break" one of our KVM kickstarts, then convert that disk image into a Vagrant box.  I now had a repeatable image of a machine that looked like it had just finished our kickstart - the perfect starting point from which to test our Puppet Roles.  We use libvirt/KVM exclusively, so having Vagrant create libvirt VMs via the vagrant-libvirt plugin would keep us in familiar territory.
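For the curious, the conversion itself was nothing exotic.  Roughly the following, with the paths, box name and size being illustrative rather than our real values:

    # Normalise the freshly-kickstarted disk image to qcow2.
    qemu-img convert -O qcow2 /var/lib/libvirt/images/kickstart-base.img box.img

    # A vagrant-libvirt box is just a tarball of the image plus metadata;
    # virtual_size is in GB and should match the original disk.
    printf '{"provider": "libvirt", "format": "qcow2", "virtual_size": 40}\n' > metadata.json

    tar czf kickstart-base.box metadata.json box.img
    vagrant box add --name lmax/kickstart-base kickstart-base.box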

Getting Started (or not)

Being a Systems Engineer by trade, my approach to implementing a solution to a problem usually goes something like this:
  1. Figure out that what we've got right now doesn't solve my problem, nor will it do so sensibly
  2. Google about my problem
  3. Get annoyed that nothing appears on page 1 of the results, so add a few more words to Google I think might help
  4. Read some blog posts or marketing material of other people who have sort of done something similar, and learn about some New Thing
  5. Read a bit more about New Thing, maybe read some of the manual, enough to vaguely understand how it works and if it will solve the problem
  6. Look for example code that works with this New Thing that I can borrow
  7. Utilise upstream code in my own estate
  8. Deploy everywhere
  9. Give 60 minute presentation to my team about how amazing I am having done this all by myself and not copied anything from the Internet at all
Using this approach, I managed to get a small number of Puppet acceptance tests running several years ago; some ad-hoc tests of our internal networking module, run from my laptop.

Then I did a Fedora upgrade and Vagrant broke.

I fixed that, then a new version of rspec-puppet clashed with Beaker.

I had to wait for a fix for that, and by then another Fedora upgrade broke Vagrant again.

We managed to get a magical combination of Gems working on a CentOS 6 GitLab Runner at some point in time, then it broke.

Then Beaker 4 got released and a bunch of stuff stopped working.

Then my own Vagrant broke again, or more specifically vagrant-libvirt (I think we were at about Fedora 29 by this point; I've suppressed a lot of these memories).

Then all the Beaker modules were archived on GitHub.  That was a rather large nail in the coffin, but inconsequential at the time, as my Vagrant was still broken.

My SysAdmin approach of "let's just whack some things together" wasn't working.  I wanted an acceptance test pipeline to test complex provisioning, but kept getting stuck provisioning the simplest SUT (system under test).

Even after some people smarter than me solved the latest instance of my vagrant-libvirt problem, I'd pretty much convinced myself that my approach was too fragile.  A lot of my problems boiled down to trying to provision machines using Vagrant from inside a Ruby development environment designed to test Puppet.  I simply didn't have the time or skill set to get it working.

I looked at other things.  Test Kitchen had the same problems with Vagrant and Ruby clashing.  Puppet have their own provisioner built on VMware, vmpooler and ABS, but we don't use VMware.  I even thought about Testinfra - doing the provisioning and testing in Python, a language I'm more familiar with - but mixing two different languages and test suites seemed like a bad idea.

It frustrated me that I could go to a public cloud and do this in 15 seconds, but was struggling so hard with Vagrant and vagrant-libvirt on my own laptop (now Fedora 32).

Getting Started (again)

A year or so ago Puppet announced Litmus, but it was still very new.  From what I gather reading a few conversations, it is mainly designed to run the same tests against multiple SUTs in parallel.  Very useful if you want to spin up 14 different operating systems and run the tests on each, as you might when testing a Puppet module for mass consumption.  Litmus has a provisioning system built on top of Bolt, at the time supporting Docker only, so it didn't immediately help me provision Vagrant machines.
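To give a flavour of it, the day-to-day Litmus workflow is a handful of Rake tasks.  A sketch, assuming a module with puppet_litmus in its Gemfile; the Docker image name is illustrative:

    # Spin up a SUT, then put the Puppet agent and the module under test on it.
    bundle exec rake 'litmus:provision[docker,litmusimage/centos:7]'
    bundle exec rake 'litmus:install_agent'
    bundle exec rake 'litmus:install_module'

    # Run the acceptance tests against every provisioned SUT, then clean up.
    bundle exec rake 'litmus:acceptance:parallel'
    bundle exec rake 'litmus:tear_down'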

Over time Bolt has gained more provisioners.  Then recently someone in the Puppet Community Slack posted a YouTube video from Daniel Carabas that I found very exciting: a quick demonstration of using Terraform to provision some Vagrant machines to run tests against, all driven by Litmus.

At first I missed the point; I saw Terraform and the Terraform Vagrant provider as a way to get around my Vagrant provisioning problems.  I'd never used Terraform, Bolt, or Litmus before, so I paused the video every 10 seconds, typing out exactly what it showed so I could learn the bits and pieces.
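The Terraform side of what I typed out was pleasingly small.  Something along these lines; I'm writing this from memory, so treat the provider and its attributes as an assumption on my part rather than gospel:

    # Write a minimal Terraform config using the third-party
    # bmatcuk/vagrant provider (resource name and attribute assumed;
    # the provider itself needs to be installed separately).
    cat > main.tf <<'EOF'
    # Bring up whatever the Vagrantfile in ./vagrant describes.
    resource "vagrant_vm" "sut" {
      vagrantfile_dir = "./vagrant"
    }
    EOF

    terraform init
    terraform apply -auto-approve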

I was able to provision Vagrant machines using Terraform directly.  Then, when provisioning as part of a Litmus test run, the tests just... stopped.  It turns out Terraform had stopped because it didn't handle a Vagrant failure - and Vagrant had failed in the exact same way it always fails when run from Bundler inside the Ruby environment.  I was right back where I started.

The Actual Point

Litmus runs on top of Bolt, Bolt interfaces with Terraform, and Terraform can do a lot.  Rather than trying so hard to make Vagrant work because it is easy to make Vagrant resemble our hardware, can I spend less effort making cloud infrastructure, which is easy to provision, resemble our hardware?

I'm swapping one problem for another, and hoping the one I swap to is less work.  Terraform also has the potential to pay off long term, as it is very capable of modelling complex infrastructure.

In the next posts we'll do some real work as I explore this.