How we solved outages and performance issues in Helpmonks

#1

At the end of March 2019 and until the middle of April 2019, we experienced outages and performance issues in Helpmonks. What follows is a report of what happened and how we solved it.

Performance issues arose because we moved our core database to a new cloud provider whose full service package did not perform the same in production as it did during our stress tests. After several support calls and upgrades to the underlying operating system and virtualization layer, we managed to bring performance back. However, having now experienced the provider's "Support Service" first hand, we realized it was no better than what our own team can provide. Long story short, after two weeks of what seemed like an endless nightmare of issues, we moved everything back to our private cloud. We also added more database servers, so the database is now scaled across several servers and networks.

Outages were caused partly by the load balancer provided by our cloud provider and partly by issues in our code when parsing incoming emails. Furthermore, we identified a problem with our distributed storage that failed under certain circumstances.

The provider confirmed the issue with the load balancer and told us that no fix was coming soon. Subsequently, we had to look for an alternative. However, as this is the first point of entry to our web app and API, we could not just “configure” something. Besides, we wanted to keep caching calls to static assets and also create a fail-safe setup. In the end, we settled on HAProxy, with a second HAProxy server and keepalived: if the active load balancer becomes unavailable, traffic switches to the standby automatically.
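The keepalived side of such a setup can be sketched roughly as below. This is a minimal illustration, not Helpmonks' actual configuration; the interface name, virtual IP, and password are placeholders:

```
# /etc/keepalived/keepalived.conf on the primary HAProxy node (sketch;
# interface, VIP, and password are hypothetical placeholders)
vrrp_script chk_haproxy {
    script "pidof haproxy"   # node is healthy only while haproxy runs
    interval 2
}

vrrp_instance VI_1 {
    state MASTER             # the standby node uses "state BACKUP"
    interface eth0
    virtual_router_id 51
    priority 101             # standby uses a lower priority, e.g. 100
    authentication {
        auth_type PASS
        auth_pass example
    }
    virtual_ipaddress {
        203.0.113.10         # floating VIP that clients connect to
    }
    track_script {
        chk_haproxy
    }
}
```

With this arrangement, the standby node takes over the floating IP within seconds when the primary stops answering VRRP advertisements or HAProxy dies on it.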

The storage issue was fixed by clustering several storage servers with GlusterFS. The significant benefit of this is that the whole GlusterFS cluster (no pun intended) can be mounted via a “volume file”, so the volume stays mounted even when some servers in the group are unavailable or suffer a network interruption.
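A mount of that kind can be sketched as follows (the server and volume names are hypothetical). GlusterFS clients fetch the volume file from one server at mount time, and the `backup-volfile-servers` option lists fallbacks for when that server is down; after mounting, the client talks to all bricks directly:

```
# Mount a GlusterFS volume; "gluster1"-"gluster3" and "mailstore" are
# placeholder names. If gluster1 is unreachable at mount time, the
# client fetches the volume file from gluster2 or gluster3 instead.
mount -t glusterfs \
  -o backup-volfile-servers=gluster2:gluster3 \
  gluster1:/mailstore /mnt/mailstore
```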

In addition to all the hardware and fail-over configurations, we also enhanced our code by running multiple threads within one application, so that a single email that throws a parsing error does not take down the entire parsing application. This had rarely been an issue, but Murphy’s law hit us full force and caused another outage during the period of the problems outlined above. Nowadays, we run multiple threads within the parsing task, which guarantees that this will not happen again.
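The isolation described above can be sketched in Python. The `parse_email` function is a hypothetical stand-in for a real parser; the point is that each message runs in its own worker and any exception is caught per-message, so one bad email cannot crash the whole batch:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_email(raw):
    # Hypothetical stand-in for a real email parser: one message fails.
    if "corrupt" in raw:
        raise ValueError("unparseable message")
    return raw.upper()

def parse_batch(messages):
    # Each message is parsed in its own thread; a failure in one future
    # is caught individually and quarantined instead of crashing the app.
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(parse_email, m): m for m in messages}
        for future, msg in futures.items():
            try:
                results.append(future.result())
            except Exception as exc:
                errors.append((msg, exc))
    return results, errors

results, errors = parse_batch(["hello", "corrupt payload", "world"])
# The two good messages parse; the corrupt one lands in `errors`.
```

The same pattern works with processes instead of threads when the parser itself can crash the interpreter, at the cost of more overhead per message.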

We learned a lot during this time and have made sure that every single service you depend on is configured in a fail-over manner and remains available under any circumstance, even if a server or a network is down.

Today, we can report that all the issues are a thing of the past and that our performance is not only back to where it was before but surpasses all expectations.

We’re fully aware that many of our customers around the world experienced performance issues and sometimes couldn’t access their emails in Helpmonks. We are sorry about this. We know this caused a severe interruption in the daily workflow for many of you. We are working diligently to ensure that this will not happen again in Helpmonks.

Thank you for being a customer, thank you for your patience during these rough times, and thank you for your understanding.

Maintenance over the next few weeks