Friday, January 20, 2012

Diagnosing a Drupal 7 cache generation problem

My company launched their new website recently. When we launched before Christmas we encountered a reoccurring problem that was more difficult than most to diagnose. The problem itself is very specific to our site so I doubt the exact details will help many people, but maybe the troubleshooting steps involved will prove interesting to someone. I'm not particularly proud of the time it took to track down nor our exact thought process (hardly blowing my own horn with this post) but here we go anyway.

The website platform was built for us by a third party, the technology is mostly Drupal 7 with some custom modules written for functionality we required. We wrote our own "Drupal deployment interface" that mirrors the contents of one Drupal site (our Dev server) onto our UAT or Live platform. The Live platform is a simple Apache / Drupal / Varnish stack with a load balancer in front of several web servers, the back-end is several MySQL servers.

When we deployed our final site to launch, we ran into a problem where a specific image on our front page was not displaying. Looking at the HTML source when the image is broken and we see that the image source rather bizarrely contains the hostname 127.0.0.1:

<img alt="" class="media-image" typeof="foaf:Image"
  src="http://127.0.0.1/sites/default/files/LMAX-intro-video.jpg" />
Not being Drupal 7 experts we can't code dive into it's PHP with much confidence so on comes the black box testing and some facts we discovered:
  1. We confirm this section of HTML is dynamically generated - it's not a hard coded link to 127.0.0.1 someone's typed into the Drupal interface.
  2. No other images are broken, just this one.
  3. Looking at other images, the source of the image should be starting with "http://www.lmax.com/...".
  4. If we request the correct image link directly it loads fine, so the image file is not missing nor does there appear to be a problem with Apache serving the file.
  5. We restart Varnish on all web servers to see if this is problem between Varnish and Drupal but it does not fix the problem.
  6. We dump all databases and grep for the offending string and pin the problem to one table.
The bad HTML is being stored in Drupal 7's cache_filter table. We delete the entry from the cache_filter table, refresh our site in a browser and the problem is solved, but unfortunately not for good.

The next day the problem re-occurred - our web developer says that he deployed a new copy of the site onto production and the image is missing again. We investigate again and find the same bad HTML on the database servers. We delete the entry from the cache_filter table again then check our website - the image is still broken. Looking back in the cache_filter table and we see that the same bad HTML has been regenerated, despite us deleting that row. Just to be sure, we truncate the entire cache_filter table on both databases and refresh - still contains 127.0.0.1. What we thought resolved our problem yesterday has not worked a second time and we now have no quick fix way of solving it.

We convey to the business that we can't fix this in five minutes and settle down for some more serious investigation. We now know:
  1. The tail end of the problem is the Drupal generated HTML stored in the cache_filter table in our database(s).
  2. The problem appears to occur after a deployment of new content from our Dev server to our Live servers.
  3. We specifically don't restore any cache table content when doing a deployment to avoid any "stale" cache from Dev reaching Live - so after deployment, the cache tables are empty.
  4. Something is continually repopulating the cache with bad HTML.
We have a UAT environment to test deployment specific problems, built for exactly these kinds of problems. We only managed to reproduce the problem once in several test deployments of the same content to our UAT environment - the issue is very intermittent on UAT and practically constant on Live. UAT does not have any load balancing normally, we add some but still cannot reproduce.

We search through the core Drupal 7 PHP code, our custom modules and contributed modules for mention of the host 127.0.0.1. It appears a few times but leads no where relevant. We also spend time playing with Varnish on UAT: we know that each Varnish server's Apache backend is configured over the loopback interface and it's written in the Varnish configuration file with '127.0.0.1'. Our work proves unhelpful there as well.

Trying a different approach, we turn on full query logging on the UAT database, deploy to UAT and browse around our website, looking for "insert into cache_filter" lines. Our thinking is to trace back through the queries for an idea of what occurs before the cache_filter insert and thus hint at what's populating this table. The UAT query log does not help much: we find the insert query but the problem has not occurred after the deployment so the cache_filter contents is correct. Other than witnessing a lot of queries against the domain table, the UAT query log is not very helpful.

We decide to turn on the full query log for one of the production MySQL servers, as we were not happy that our efforts in UAT had exhausted this avenue. We finally have our eureka moment: within seconds of turning on the log we see a queries against the domain table, but these are ever so slightly different:
SELECT domain_id, subdomain, sitename FROM domain WHERE subdomain = '127.0.0.1'
We had a general idea of the domain module and that it works by what hostname someone puts in the browser, so these queries said to us that someone or something was hitting localhost with URL requests and they are getting far enough into our web stack for Drupal to query for it. We immediately revisit Varnish but can't prove it is the cause on UAT yet again. We compare the Apache logs with the Varnish logs, we think on how UAT (unfortunately) differs to production and finally the sack of pennies drops.

The answer was in front of us the entire time - The load balancers use HTTP health checks of "GET /" against the web servers. The load balancer health checks run continuously almost every second and so when a deployment occurs against Live, the load balancers will almost always be the first request to the front page of the website. Since we effectively truncate the cache tables when we deploy, the load balancer health check triggers Drupal to repopulate it's cache. Something about the load balancer's request is causing Drupal to search for a '127.0.0.1' domain, perhaps incomplete HTTP headers, or maybe a REMOTE_HOST header of 127.0.0.1. Since we don't have a domain of 127.0.0.1 the request falls back to our default domain (a feature of the domain module) but somehow content for the front page is being generated incorrectly with details from the original request and cached.

To confirm what was only a theory at this point we changed the load balancer health check to just test the TCP connection rather than a HTTP test, waited for a request to come through and checked the site - the generated content from our request was correct. Rather than keep the TCP health check we found an example Drupal PHP script that does a minimal Drupal bootstrap to check the database health and return HTTP status codes appropriately.

The clarity of hindsight:
  • When we had the issue the very first time, after deleting the bad cache_filter entry I must have refreshed the website faster than the load balancers check, hiding the problem until the next day.
  • When we added the load balancer to UAT, we mustn't have set up a health check (or if so, only a TCP connection check), as we were unable to reproduce the problem in UAT. Lesson: if trying to mirror production, mirror production.
The vast majority of the problem is now worked around, but it is not solved - the issue still re-occurs every once every couple of weeks, in Live and UAT now as well. There are still several questions that I would like to answer:
  1. What is the exact part of the load balancer request that caused Drupal to generate it's cache incorrectly? Is it the REMOTE_HOST header?
  2. Is it just the load balancer or was Varnish also a catalyst? If we take Varnish out of the mix and just have the load balancer point to Apache directly, do we still have a problem?
  3. What's causing the very infrequent re-occurrences of the problem now? Could it be the Varnish cache expiring and requesting a new copy of the object?
Like all Systems Administration problems though, it will get attention when it annoys someone enough to justify spending the time to fix it permanently. If only computers didn't exist, our lives would be so much simpler...