Friday, June 12, 2020

Puppet Acceptance Tests, the hard way (part 1)

The first in a series of posts about building acceptance test infrastructure for my team's Puppet.

This first post is mainly background, thoughts, and ranting.  It's been quite cathartic to write but probably isn't super useful.  If you just want to see some tech stuff, skip to the next post.

Here are (or will be) the related posts, as I write them:
  1. ... There aren't any yet, this is the first one.

Testing @ LMAX

Our Development team has an amazing CI and Staging environment of their own (my team doesn't use it, we just keep it running).  They've been iterating over their CI/CD toolkit for 10+ years, and while bits and pieces have been swapped out and replaced with other components, it's essentially the same pipeline.  It helps enormously when you have a culture of continuous delivery instilled in your development team from the get-go, rather than trying to add CI in later.

It is very mature, and while it's not perfect and there are many things I'd like to improve, it is a lot better than most.  Every now and then someone moves on and gets to experience how other companies do their testing.  When meeting up for beers a few months later, the feedback often goes something like "No one does it as good as LMAX".  It's nice to hear, it's nice to be validated, but it's also nice to know there are others that would give us a run for our money.

Over on my side, where the infrastructure is, things are not as rosy as they are for my coder colleagues.  We are primarily designers and builders; hardware experts, systems and networking specialists.  We're not developers, but we do code to an extent.

We've used Puppet for about 8 years, however the testing of our Puppet code has been lacking - highlighted all the more when I glance up from my desk at the 60" TVs displaying the Dev team's CI results in real time.

Puppet describes the vast majority of our operating systems, and I would say we are on the Power User side of the community.  Our Puppet configures complex storage and networking, makes BIOS and performance tuning changes, and drives the monitoring of our server infrastructure.

Such complexity without tests slowed us down; catalog compilation failures, missing Hiera data.  Stuff that works for 90% of machines, but then there are those few exceptions.  Touch one thing over here, and something all the way over there breaks.

In the last 4 years we've gotten a lot better.  We introduced rspec-puppet unit tests.  We adopted the Roles and Profiles design pattern.  We built our own CI using GitLab.  We moved to Puppet 4, then 5, then 6.  We've retired many lmax-* modules and replaced them with upstream ones that are better tested.  We have far fewer catalog compilation failures now, and the culture is shifting to "I'll write a test for that".
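For context, the kind of rspec-puppet unit test I mean is a short sketch like the one below.  The role name, profile name and facts are made up for illustration - all it really proves is that the catalog compiles and contains what we expect:

require 'spec_helper'

# A minimal rspec-puppet sketch; 'role::example_web', 'profile::base' and the
# facts are placeholders, not our real code.
describe 'role::example_web' do
  let(:facts) { { 'os' => { 'family' => 'RedHat', 'release' => { 'major' => '7' } } } }

  # The most valuable check of all: does the catalog even compile?
  it { is_expected.to compile.with_all_deps }

  # And does the role pull in the profile we expect?
  it { is_expected.to contain_class('profile::base') }
end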

As we learn and our testing gets better, I find we are slowed down less by the simple Puppet problems and more by the interactions between our complexity, our data and our platform.  The kind of problems where combining this Puppet Role, with this Hiera data, on top of a certain shape of system will cause a machine to provision incorrectly.

These complex failures are difficult to model using unit tests.  We need to actually run Puppet on a system, not just compile a catalog.  To test how our Puppet interacts with a system, we need to run it on something that looks as much like our hardware as possible.
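To make the contrast concrete, the shape of acceptance test I'm after looks roughly like the sketch below, written in the Beaker / Litmus rspec style with a made-up profile and made-up paths.  The important difference is that apply_manifest runs real Puppet against a real machine, and the assertions inspect that machine afterwards rather than a compiled catalog:

require 'spec_helper_acceptance'

# A rough sketch only; 'profile::fast_storage' and the paths below are
# invented for illustration.
describe 'profile::fast_storage' do
  let(:manifest) { 'include profile::fast_storage' }

  it 'applies cleanly and is idempotent' do
    apply_manifest(manifest, catch_failures: true)  # first run must not error
    apply_manifest(manifest, catch_changes: true)   # second run must change nothing
  end

  # These serverspec-style checks look at the real system state.
  describe file('/dev/vg_data/lv_journal') do
    it { is_expected.to be_block_device }
  end

  describe file('/srv/journal') do
    it { is_expected.to be_mounted.with(type: 'xfs') }
  end
end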

Provisioning

A good portion of what my team does is provisioning and re-provisioning systems, both hardware and VMs.  When it works, it works great; you send a piece of hardware for network boot, it kickstarts itself, reboots into what we call our initial Puppet run, and if that has no failures it reboots again.  After the second reboot the machine should be ready for use - it will be running the kernel we want, networking has come up on boot, storage is in place, our monitoring auto-magically updates to add the new machine, and everything goes green.

If it doesn't work perfectly you've got to figure out what went wrong.  In some cases it might be a simple failure that a second Puppet run will fix.  We aim for single-pass Puppet, but sometimes there are bugs.  If it's a serious problem like a catalog compilation failure, it takes longer to repair: console, root login, check the logs, etc.

Being able to continually test provisioning machines over and over again, and notify the team of any failures, would make us faster.  Taking this idea further, the long term dream would be to provision sets of machines as different things and run tests across these systems to make sure they interact with each other correctly.  If eventually we could simulate the provisioning of an entire financial exchange at once...  Now that would be something to brag about.

I would need to be able to test the low level things that our Puppet does now; things like creating Logical Volumes and file systems, and making kernel changes.  A container-based acceptance test system would probably be easier to start with, but it's not an accurate representation of hardware, so it would not provide as much value long term.  For these reasons I wanted to concentrate on testing with VMs.

It was simple enough to "break" one of our KVM kickstarts, then convert that disk image into a Vagrant box.  I now had a repeatable image of a machine that looked like it had just finished our kickstart - the perfect starting point from which to test our Puppet Roles.  We use libvirt/KVM exclusively, so having Vagrant create libvirt VMs via the vagrant-libvirt plugin would keep us in familiar territory.
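The Vagrantfile side of this is tiny - roughly the sketch below, where the box name is a placeholder for our internal post-kickstart image:

# Vagrantfile sketch; 'lmax/el7-postkickstart' is a made-up name standing in
# for the box built from the broken-off kickstart.
Vagrant.configure('2') do |config|
  config.vm.box = 'lmax/el7-postkickstart'

  # vagrant-libvirt provider settings, sized like a small test VM.
  config.vm.provider :libvirt do |libvirt|
    libvirt.driver = 'kvm'
    libvirt.cpus   = 2
    libvirt.memory = 4096
  end
end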

Getting Started (or not)

Being a Systems Engineer by trade, my approach to implementing a solution to a problem usually goes something like this:
  1. Figure out that what we've got right now doesn't solve my problem, nor will it do so sensibly
  2. Google about my problem
  3. Get annoyed that nothing appears on page 1 of the results, so add a few more words to Google I think might help
  4. Read some blog posts or marketing material of other people who have sort of done something similar, and learn about some New Thing
  5. Read a bit more about New Thing, maybe read some of the manual, enough to vaguely understand how it works and if it will solve the problem
  6. Look for example code that works with this New Thing that I can borrow
  7. Utilise upstream code in my own estate
  8. Deploy everywhere
  9. Give 60 minute presentation to my team about how amazing I am having done this all by myself and not copied anything from the Internet at all
Using this approach, I managed to get a small number of Puppet acceptance tests running several years ago; I had some tests of our internal networking module running ad-hoc on my laptop.

Then I did a Fedora upgrade and Vagrant broke.

I fixed that, then a new version of rspec-puppet clashed with Beaker.

I had to wait for a fix for that, and by then another Fedora upgrade broke Vagrant again.

We managed to get a magical combination of Gems on a CentOS 6 GitLab Runner working at some point in time, then it broke.

Then Beaker 4 got released and a bunch of stuff stopped working.

Then my own Vagrant broke again, or more specifically vagrant-libvirt (I think we were at about Fedora 29 at this point, I've suppressed a lot of these memories).

Then all the Beaker modules were archived on GitHub.  That was a rather large nail in the coffin, but inconsequential at the time, as my Vagrant was still broken.

My SysAdmin approach of "let's just whack some things together" wasn't working.  I wanted an acceptance test pipeline to test complex provisioning, but kept getting stuck provisioning the simplest SUT (System Under Test).

Even after some people smarter than me solved the latest instance of my vagrant-libvirt problem, I'd pretty much convinced myself that my approach was too fragile.  A lot of my problems boiled down to trying to provision machines using Vagrant from inside a Ruby development environment designed to test Puppet.  I simply didn't have the time or skill set to get it working.

I looked at other things.  Test Kitchen had the same problem of Vagrant and Ruby clashing.  Puppet have their own provisioner built on VMware, vmpooler and ABS, but we don't use VMware.  I even thought about Testinfra - do the provisioning and testing in Python, a language I'm more familiar with - but mixing two different languages and test suites seemed like a bad idea.

It frustrated me that I could go to a public cloud and do this in 15 seconds, but was struggling so hard with Vagrant and vagrant-libvirt on my own laptop (now Fedora 32).

Getting Started (again)

A year or so ago Puppet announced Litmus, but it was still very new.  From what I gathered reading a few conversations, it is mainly designed to run the same tests on multiple SUTs in parallel - very useful if you want to spin up 14 different operating systems and run the tests on each, like if you were testing a Puppet module for mass consumption.  Litmus has a provisioning system built on top of Bolt, at the time supporting Docker only, so it didn't immediately help me provision Vagrant machines.

Over time Bolt has gotten more provisioners.  Then recently someone in the Puppet Community Slack posted this YouTube video from Daniel Carabas that I found very exciting.  It was a quick demonstration of using Terraform to provision some Vagrant machines to run tests against, all driven by Litmus.

At first I missed the point; I saw Terraform and the Terraform Vagrant Provider as a way to get around my Vagrant provisioning problems.  I'd never used Terraform, Bolt, or Litmus before, so I paused the video every 10 seconds, typing out exactly what the video described so I could learn the bits and pieces.

I was able to provision Vagrant machines using Terraform directly.  Then, when provisioning as part of running the Litmus tests, the tests just... stopped.  It turned out Terraform had stopped because it didn't handle a Vagrant failure - Vagrant failing in exactly the same way it always fails when run from Bundler inside the Ruby environment.  I was right back where I started.

The Actual Point

Litmus runs on top of Bolt, Bolt interfaces with Terraform, and Terraform can do a lot.  Rather than trying so hard to make Vagrant work because it is easy to make Vagrant resemble our hardware, can I spend less effort making cloud infrastructure - which is easy to provision - resemble our hardware?

I'm swapping one problem for another, and I hope what I swap to is less work.  Terraform also has the potential to pay off long term, as it is very capable of modelling complex infrastructure.

In the next posts we'll do some real work as I explore this.

Thursday, February 16, 2017

Test Driven Infrastructure - Validating Layer 1 Networking with Nagios

Previously we've talked about how we use Nagios / Icinga for three broad types of monitoring at LMAX: alerting, metrics, and validation. The difference between our definitions of alerting and validation is a fine one; it has more to do with the importance of the state of the thing we are checking and the frequency with which we check it. An example of what I consider an "Alert" is whether Apache is running or not on a web server. The version of Apache might be something I "Validate" with Nagios as well, but I wouldn't bother checking it every few minutes, and if there was a discrepancy I wouldn't react as fast as if the entire Apache service was down. It's a loose distinction but a distinction none the less.

The vast majority of our network infrastructure is implemented physically in a data centre by a human being. Someone has to go plug in all those cables, and there's usually some form of symmetry, uniformity and standard to how we patch things that gives Engineers like me warm fuzzy feelings. Over many years of building our Exchange platforms we've found that going back to correct physical work costs a lot of time, so we like to get it right the first time, or be told very quickly if something is not where it's expected to be. Thus enters our Test Driven Networking Infrastructure - our approach uses Nagios / Icinga as the validation tool, Puppet as the configuration and deployment engine, LLDP as the protocol everything runs on top of, and Patch Manager as the source of truth.

Validating Network Patching

I've written about our Networking Puppet module before and how we use it to separate our logical network design from its physical implementation. The same Puppet Networking module also defines the monitoring and validation for our network interfaces. Specifically, this is defined inside the Puppet class networking::monitoring::interface, which has a hard dependency on LMAX's internal Nagios module, which unfortunately is not Open Source at this time (and would be one long blog post of its own to explain).

So since you can't see the code I'll skip over all the implementation and go straight to the result. Here is what our Puppet Networking module gives us in terms of alerts:



Pretty self explanatory. Here's the end result of our networking infrastructure validation, with server names and switch names obfuscated:



However a green "everything-is-ok" screenshot is probably not a helpful example of why this is so useful, so here are some examples of failing checks from our build and test environments:




To summarise the above, our validation fails when:
  • we think an interface should be patched somewhere but it's not up or configured
  • an interface is patched in to something different than to what it should be
  • an interface is up (and maybe patched in to something) but not in our source of truth
Next I'll describe how the Nagios check works. Combined with a specific provisioning process which I describe below, the above checks give us Test Driven Infrastructure that helps us quickly correct physical patching errors.

How The Nagios Check Actually Works

The idea behind the check is for the Nagios server to first retrieve what the server says the LLDP neighbour of each interface is, then compare this with its own source of truth and raise an appropriate OK, WARNING or CRITICAL check result.

Nagios knows what interfaces to check for because Puppet describes every interface to monitor. Nagios makes an SNMP call to the server, getting back CSV output that looks like this:

em1,yes,switch01,1/31,10,Brocade ICX6450-48
em2,yes,switch02,1/31,10,Brocade ICX6450-48

The fields are:
  1. interface name
  2. link
  3. remote LLDP device name
  4. remote LLDP device port
  5. VLAN
  6. remote LLDP device model
A version of this script is up on GitHub here. It contains a lot of conditional logic to handle the LLDP information from different vendors' hardware. For example, certain Brocade switches don't mention the word "Brocade", so we infer that from the MAC address. Different switches also use different fields for the same information, and the script parses the right field based on the remote side's model type, e.g. Brocades and Linux kernels put the Port ID in the "descr" field but other devices put it in the "id" field.
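Stripped of that vendor-specific handling, the first half of the check boils down to something like this (a simplified sketch, not the real script):

# Simplified sketch: turn the SNMP CSV output into a lookup keyed on
# interface name. The real script handles far more vendor quirks than this.
Neighbour = Struct.new(:interface, :link, :switch, :port, :vlan, :model)

def parse_lldp_output(raw)
  raw.each_line
     .map { |line| Neighbour.new(*line.strip.split(',')) }
     .map { |n| [n.interface, n] }
     .to_h
end

seen = parse_lldp_output("em1,yes,switch01,1/31,10,Brocade ICX6450-48\n")
seen['em1'].switch  # => "switch01"
seen['em1'].port    # => "1/31"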

The Nagios check cross-references this data against its own records - the "source of truth" file - which looks like this:

server01,em1,switch01,0/31
server01,em2,switch02,0/31

The Nagios check script has some smarts built in to handle logical implementations that don't model well in Patch Manager. One of the complexities is stacked switches. The LLDP information from the server will describe a stacked switch port as something like "3/0/10", where 3 is the Stack ID. In Patch Manager it would get confusing if we labelled every device in a stack the same, so instead we name them switch1-3 where the "-3" indicates the stack number. The Nagios script looks for and parses this as Stack ID.
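Put together with the source of truth file, the comparison works out to roughly the sketch below. Again this is a simplification - the real check reports proper Nagios OK / WARNING / CRITICAL states and messages rather than a true/false answer:

# Simplified sketch of the cross-reference, including the stacked switch
# naming convention described above.

# LLDP reports a stacked port as "3/0/10": stack member 3, port 0/10.
def split_stacked_port(port)
  parts = port.split('/')
  parts.length == 3 ? [parts.first, parts[1..2].join('/')] : [nil, port]
end

# Patch Manager names stack members like "switch1-3": device switch1, stack member 3.
def split_stacked_switch(name)
  if (m = name.match(/\A(.+)-(\d+)\z/))
    [m[1], m[2]]
  else
    [name, nil]
  end
end

def patched_correctly?(expected_switch, expected_port, seen_switch, seen_port)
  want_switch, want_stack = split_stacked_switch(expected_switch)
  seen_stack, seen_plain  = split_stacked_port(seen_port)

  want_switch == seen_switch &&
    expected_port == seen_plain &&
    (want_stack.nil? || want_stack == seen_stack)
end

patched_correctly?('switch1-3', '0/10', 'switch1', '3/0/10')  # => true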

Our TDI Workflow

The Nagios checks are the critical part of a much larger workflow which gives us Test Driven Infrastructure when we provision new machines. The workflow roughly follows the steps below, and I go into each step in more detail in the following sections:
  1. Physical design is done in Patch Manager, including placement in the rack and patching connections
  2. Connections are exported from Patch Manager into a format that our Nagios servers can parse easily
  3. Logical design is done in Puppet - Roles are assigned and necessary data is put in Hiera
  4. Hardware is physically racked and the management patches are put in first
  5. Server is kickstarted and does its first Puppet run, Nagios updates itself and begins to run checks against the new server
  6. Engineers use the Nagios checks as their test results, fixing any issues
As you might have deduced already the workflow is not perfectly optimised; the "tests" (Nagios checks) come from Puppet, so you need a machine to be installed before you get any test output. Also we need at least some patching done in order to kickstart the servers before we can get feedback on any of the other patching.

Physical Design in Patch Manager

We use Patch Manager's Software-As-A-Service solution to model our physical infrastructure in our data centres. It is our source of truth for what's in our racks and what connections are between devices. Here's an example of a connection (well, two connections really) going from Gb1 in a server, through a top of rack patch panel, and into a switch:



Exporting Patch Manager Connections

Having all our Nagios servers continually reach out to the Patch Manager API to search for connections would be wasteful, considering that day to day the data in Patch Manager doesn't change much. Instead we export the connections from Patch Manager and at the same time filter out any intermediate patch panels or devices we don't care about - we only want to know about both ends of the connection. Each Nagios server has a copy of the "patchplan.txt" file, which is an easy-to-parse CSV that looks like this:

server01,em1,switch01,0/31
server01,em2,switch02,0/31


Logical Design In Puppet

As part of creating the new server in Puppet, the networking configuration is defined and modelled in line with what has been planned in Patch Manager. So, for example, if a Dell server has its first two on-board NICs connected to management switches in Patch Manager, somewhere in Puppet a bonded interface will be defined with NICs em1 and em2 as slaves (em1 and em2 being the default on-board NIC names on a Dell server).
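If you wanted to pin that down in an rspec-puppet unit test, it might look something like the sketch below. The role name and the networking::bond defined type are stand-ins for our internal, non-public module - the point is just that the Puppet model is asserted to match the Patch Manager design:

require 'spec_helper'

# Illustrative only: the role and the networking::bond defined type are
# placeholders for resources in our internal module.
describe 'role::example_dell_server' do
  it 'bonds the first two on-board NICs for management' do
    is_expected.to contain_networking__bond('bond0').with(
      'slaves' => ['em1', 'em2'],
    )
  end
end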

How we model our logical network design in Puppet is covered in much more detail here.

Hardware is Physically Racked

Obviously someone needs to go to the data centre and rack the hardware. If it's a large build it can take several days, or weeks if there's restricted time we can work in the data centre (like only on weekends). We try to prioritise the management patching first so we're able to kickstart machines as quickly as possible.

Kickstarts and Puppet Runs

Once a new server has done its first Puppet run and its catalog has compiled, a set of Exported Puppet Resources that describe Nagios checks for this server is available for collection. The Puppet runs on our Nagios servers collect all these resources, turn them into the relevant Nagios configuration files, and begin running these service checks.

Make the Red and Yellow go Green

Since this is a newly built server, it's expected that a lot of the validation-style Nagios checks will fail, especially if only the management networks are patched but our Puppet code and Patch Manager are expecting other NICs to be connected. Our engineers use the Nagios check results for the new server as the feedback in our Test Driven Infrastructure approach to provisioning new servers - make the tests pass (make the red and yellow go green) and the server is ready for production.

Monday, January 30, 2017

Puppet Networking Example

The majority of our Puppet modules - and I think most organisations that adopted Puppet over 5 years ago are in the same boat - are nothing to be proud of. They were written very quickly, for a specific internal purpose and a single operating system, and have no tests. These internal modules aren't really worth sharing, as there are much better modules on the Forge or GitHub. Over the years I've been slowly retiring our home grown modules in favour of superior, publicly available modules, but it's a long process.

There are a couple of our internal modules that I am quite proud of though. One of them is our Networking module, which we use to write the configuration files that describe the interfaces, bonds, VLANs, bridges, routes and rules on our Red Hat derived systems. Our Networking module allows us to quickly define an interface for a VM with a piece of Hiera config if we want something quick, but its real strength comes from how we use it to model our defence in depth networking architecture for our platform.

The module's not perfect, but we've been able to largely abstract our logical network design from how we implement it physically. Our Puppet roles and profiles describe themselves as having "Application Networks" and the module takes care of what that looks like on servers in different environments - perhaps it's an untagged bonded network in production but it's vlan tagged in staging with a completely different IP range.

Here is the module + accompanying documentation on GitHub, along with the first few paragraphs of the Preface.


LMAX-Exchange/puppet-networking-example

This is not a "real" Puppet module - it's not designed to be cloned or put on the Forge. It even refers to other Puppet modules that are not publicly available. In fact, if you did blindly install this module into your infrastructure, I guarantee it will break your servers, eat your homework, and kill your cat.

This module is a fork of LMAX's internal networking module with a lot of internal information stripped out of it. The idea behind releasing this is to demonstrate a method of abstracting networking concepts from networking specifics. It is designed to educate new LMAX staff, plus a few people on the Puppet Users list who expressed some interest. The discussion in Puppet Users thread How to handle predictable network interface names is what motivated me to fork our internal module to describe it to other people.

I'm now going to fabricate a scenario (or a "story") that will explain the goals we are trying to reach by doing networking this way in Puppet. While the scenario's business is very loosely modelled on our own Financial systems architecture, the culture and values of the Infrastructure team in the scenario match our own Infrastructure team much more closely - which is how our Puppet networking evolved into what it is now.

If the scenario sounds completely alien to you - for example if you run a Cloud web farm where every instance is a transient short-lived VM - then the design pattern this module is promoting probably won't be that helpful to you. Likewise if you are a one-man Sys Admin shop then this level of abstraction will read like a monumental waste of time. If however you run an "enterprise" shop, manage several hundred servers and "things being the same" is very important to you, then hopefully you'll get something from this.