Monday, March 18, 2013

Sometimes, Running a Business Stinks

The past 72 hours have been a pretty wild ride. And they've highlighted some of the less glamorous aspects of running an online business. I think things are starting to stabilize this afternoon, but I'm pretty exhausted from the experience.

Things Were Running Smoothly When...

Friday was a pretty normal day. Quiet, even. Most of the day was spent working on plot development for NEO Scavenger. The day didn't go without its hitches, but as days go, it was nothing unusual. I was happy to finish the day with some plot knots to untie over the weekend, so I wrapped up, had some dinner, and Rochelle and I went bowling with some friends.

The next morning, I was pretty excited to see a sudden spike in traffic in my logs. NEO Scavenger was mentioned in a post-apocalyptic survival reddit post, and dozens of folks were popping by the site to check out the game. Cool!

Site Maintenance

Except, my ISP was doing scheduled maintenance early Saturday morning. It wasn't until I tried loading the site that I realized it was down.

5-day Site Visitor Graph: Hourly

Well crap. That stinks. Just as a bunch of new folks are drawn to check out my game, the site goes dark. And worse, dark for close to 12 hours.

"Oh well, " I said. It's a bummer, but nobody's fault, really. I mean, I guess I could blame myself for not having a redundant server or something, but I'm not that high tech (or deep-pocketed). I decided to just live with it. Besides, friendly Reddit server take-downs were nothing new. Folks probably would just figure my little server was flooded temporarily.

Site Flakiness

Later that day, the site was back up and running. I did my usual email and forum check, to verify nobody was reporting any issues. And all seemed clear.

Except for one thing: every other page load seemed to turn up blank. No error message, no content, nothing. Just a blank white screen. Refreshing the page seemed to fix it, so I figured it was a temporary thing. "It'll sort itself out."

No, it won't. The following day, I was replying to a customer having issues with NEO Scavenger on their Mac, and it was still happening. Happening everywhere. Sometimes it was a content page on the site. Other times it was a forum. Even some of the site admin pages were failing. And as before, usually a reload would fix it. But the reloading was becoming less and less reliable. Sometimes, I had to reload the page 4-5 times to see anything. And if my customers were seeing the same thing, then that was unacceptable.

The White Screen of Death

I decided to dig into the issue a bit more. I started searching for related issues on the web, and was initially happy to see others reporting the same issue. Blank screens in Drupal (the content management system I use, v6.28) were pretty common. Maybe finding a solution will be easy?

However, upon further investigation, I was less happy to discover I had the same problem. This problem, as it was known to the Drupal community, was the "White Screen of Death."

The WSOD is a common issue with Drupal, but it doesn't have a common solution. In fact, there are almost as many causes of the WSOD as there are Drupal installations, and finding the right one for you can be a real quest.

The link above is probably the biggest authority on the issue, and even there, no fewer than 28 different causes are listed in the article, along with some 80+ comments detailing other issues customers have had. Basically, Drupal's WSOD is a symptom of practically every Drupal disease. I'm having trouble thinking of a human analogue. Headache? Fever? Common cold?

I spent hours on Sunday trying to make rhyme or reason of it. None of the remedies I saw seemed to help. I couldn't even get error or log messages. And worse, it was an intermittent problem, so I couldn't even reliably cause it to happen.

The only things I could verify about it were:

  1. I only get the WSOD on my live server. Migrating the db from live to my localhost didn't duplicate the WSOD issues.
  2. I only started getting the WSOD since my ISP's scheduled maintenance, when the server was (apparently) upgraded/restarted.
  3. I only get a WSOD in Chrome.
  4. Additionally, Chrome seemed to exhibit stylesheet issues when the page did load. Textareas would be too narrow to fit their <div>, or the page would nudge upward when I clicked a link (seemingly to adjust alignment with the Admin Menu module's topnav bar). Something was stalling the css until I clicked a link, upon which it would fix itself, then load a new page with stalled css (or js, or WSOD).
  5. Firefox never exhibits any issues.
  6. IE seems to work too, except for one page partially loading (later determined to be a known issue with YouTube embedding)
  7. No errors appeared on the page, in Drupal's logs, nor server logs, even with error reporting hard-coded to be on in index.php.
  8. Using Chrome's "Inspect Element" context menu option revealed that the page was entirely missing the <body> tag and contents. It was just an <html> and <head>, and the <head> seemed to be missing some elements. Also, Chrome usually complained of "Failed to load resource" on the page itself, but all css/js/images were loaded ok.
  9. Using "View Source" on Chrome apparently reloaded the page, and showed the correct, full content source.
  10. The WSOD appeared more frequently when js and css optimization was turned on (but still occurred when all Drupal optimization/compression/caching was disabled).
  11. Flushing all caches had no effect.
  12. Running update.php had no effect.
  13. Manually truncating Drupal db tables had no effect.
  14. Using Backup & Restore on Drupal's db had no effect.
  15. Disabling all modules outside of Core and Core Optional had no effect.
  16. Switching from BBG's custom theme to the Garland theme had no effect.
  17. Rebuilding permissions had no effect.

By the time Sunday evening rolled around, the site was stripped down to core modules, no theme, no caching, and had a rebuilt database. And it was still flaking out.
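
Since the problem was intermittent, one way to put a number on it is to fetch a page repeatedly and count how often the response comes back without a usable <body> (item 8 above). The following is only a rough sketch in Python, not something from the actual debugging session: the function names are made up for illustration, and "blank" is assumed to mean "no <body> element, or a <body> with no visible text." Because the WSOD only showed up in Chrome, a script like this might well report zero blanks, which would itself hint that the server is sending complete pages and the trouble lies somewhere between server and browser.

```python
import re
import urllib.request

# Hypothetical helpers for illustration; none of these names come from Drupal.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def looks_blank(html: str) -> bool:
    """Return True if the markup has no <body>, or a <body> with no visible text."""
    match = BODY_RE.search(html)
    if match is None:
        return True
    # Strip tags crudely; a body holding only images would also count as blank.
    visible = TAG_RE.sub("", match.group(1)).strip()
    return visible == ""


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its markup as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def blank_rate(pages) -> float:
    """Fraction of the given page sources that look blank (0.0 for no pages)."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(looks_blank(page) for page in pages) / len(pages)
```

Calling something like `blank_rate(fetch("http://example.com/") for _ in range(20))` against a test URL (the address here is a placeholder) would give a failure rate to compare before and after each attempted fix, instead of relying on a feel for how often reloads were needed.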

Worse, the node access modules I disabled caused the permissions table to get out-of-date, which caused all site content to disappear for all users, all the time. Basically, when the WSOD wasn't happening, all of the site's pages were empty blue shells, and the forums all had 0 posts in them.

Even worse still, to rebuild the permissions and fix the empty content, I had to run a script via Drupal. And that script failed with a WSOD whenever I attempted it.

I had totally messed up my site. The "Site Crash" label in the above image refers to the time when I took the site offline to avoid any more users seeing the empty shell of a site.

Fortunately, Firefox was able to run the permission rebuild script without any WSOD. And I was able to at least get the site showing content again.

But as my efforts continued past 10pm, I decided it wasn't going to be solved anytime soon. I started making changes necessary to return the site to the formerly flaky intermittent WSOD. It wasn't ideal, but the occasional user reload was a far cry better than no site at all. If nothing else, I wanted the forums and contact page online for users to report issues, if needed.

I posted a news item to the homepage alerting customers to the WSOD issue, and apologized for the inconvenience of the downtime. Then, I went to bed.

Come Monday, It'll Be Alright

I was back at the computer at 6:15am. And unsurprisingly, the site hadn't fixed itself. I fired up the usual websites, checked messages, looked for forum posts. Some users reported seeing similar WSOD issues. And, bless them, they blamed their internet connection instead of me.

I decided to try a different approach this time. Instead of grasping at straws offered by forums on the net, I decided to debug Drupal. I added print statements to Drupal's index.php, to see if I could trace the value of the content. And when that failed to reveal anything, I started adding traces to Drupal's core code (*.inc files).

I don't like doing this sort of thing, as I'm nervous about screwing things up worse than they are. Plus, doing it in a way that doesn't affect the live site's users is tricky. But in retrospect, it's the only way to really know what's going on.

I found a function that loads various Drupal bits in phases: drupal_bootstrap($phase). Every page on a Drupal site calls this function first, doing a full bootstrap (all phases). I added a trace inside the while loop that executes for each phase as it loads, and I printed the ID of the phase. Testing on my local site, I could see that my site loads phases 0-8.
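
The shape of that trace can be sketched outside Drupal. What follows is a Python analogue, not Drupal's PHP: the phase names are assumed to mirror Drupal 6's DRUPAL_BOOTSTRAP_* constants, and `bootstrap`, `run_phase`, and `trace` are hypothetical names invented for the example. The point is simply that the trace fires before each phase runs, so if a page dies partway through, the last ID printed shows which phase it died in.

```python
from enum import IntEnum


class Phase(IntEnum):
    # Assumed to mirror Drupal 6's bootstrap phase IDs (0-8).
    CONFIGURATION = 0
    EARLY_PAGE_CACHE = 1
    DATABASE = 2
    ACCESS = 3
    SESSION = 4
    LATE_PAGE_CACHE = 5
    LANGUAGE = 6
    PATH = 7
    FULL = 8


def bootstrap(target, run_phase, trace):
    """Run every phase up to and including target, tracing each one first."""
    completed = []
    for phase in Phase:
        if phase > target:
            break
        trace(phase)        # report the phase ID before doing any work
        run_phase(phase)    # stand-in for the real per-phase loading
        completed.append(phase)
    return completed
```

With `trace` set to something like `lambda p: print(int(p))`, a healthy page load prints 0 through 8, while a load that fails during the database phase would stop after printing 2.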

When I uploaded the modified file to the live server, I saw traces for all phases. Reload. All phases again. Repeat this another dozen times, on different pages where I most often encountered the WSOD. Everything loads normally.

Was the WSOD gone?

Tentatively, I backed out the changes, so it was back to normal. Still, the site seemed to be loading ok. I re-enabled caching. Still ok. Turned on page compression. Still ok. I stopped short of turning on optimized js and css. Maybe I'll muster the courage for that tomorrow.

A Wizard Did It

So what happened? Uploading that file seemed to stop the WSOD, and leave it fixed even after that file is restored. What dark magic is this?

That's actually my first theory: dark magic. But if you pin me down for a more mundane answer, I'm going to guess it was some sort of behind-the-scenes cache. I'm not sure what else would explain a site's complete performance alteration when a single file is uploaded and then un-uploaded again.

That was at about 11am today. As of 5pm, I haven't seen a WSOD. Mercifully, no players have posted in the White Screen tech support thread on my forums, either. I'm hopeful this issue is resolved.

But What About That Mac User?

Oh yeah, remember me mentioning way up there that this whole investigation started when trying to help a Mac user with NEO Scavenger? Yeah, Mac compatibility is an issue in its own right, concurrent with the site debacle.

I'm not going to detail that issue here, as it's a pretty lengthy topic of its own. The forum thread linked above has all the details. And what's more, I've partially touched on it in the past.

The short version is this: Flash is rapidly becoming as much a burden as a boon. For someone trying to develop a stand-alone application, I'm at the point where I am highly reluctant to recommend Flash as a viable option. Issues include:

  1. "Create projector" no longer supported on Linux, as of v11.2.
  2. Flash CS6 no longer supported on OSX 10.8.
  3. Any projectors one does create are going to trip security on modern OSes. And in Mac's case, Gatekeeper is a sticky issue for OSX 10.7.3 and 10.7.4.
  4. Digitally signing Flash projectors appears to have spotty support, unless one uses 3rd party wrappers (which are, in themselves, reportedly unreliable).
  5. Adobe's recommended solution, building AIR apps, is unsupported on Linux. Also, AIR's installation process is flawed, at best. Also, AIR has the periodic "Update AIR" nagging dialog.
  6. Flash is no longer officially supported on Android nor iOS.

I was a longtime adherent of Flash. It served me well. NEO Scavenger wouldn't exist if it weren't for Flash's ease of use and then-universal deployment options.

So it's unfortunate that this era appears to be in its winter.

All's Well That Ends Well

The good news? At least we're back to normal. The site seems to be running again. Upgrading OSX to 10.7.5 seems to fix Flash projector Gatekeeper woes until I can find another way to certify projectors. I think I may actually be able to return to plot development tomorrow.

Let's just hope that stretch of actual game dev continues for a while!