Background
This week about 1,000 servers were affected by an issue in one of our storage nodes. It is unrelated to last year's issue in our Gluster environment. Some servers only saw reduced performance, while others were down for up to a few hours.
The storage node in question is a straightforward redundant NFS/ZFS solution where, in some cases, we have used a form of deduplication to make better use of the actual storage space. We first noticed a performance degradation on Tuesday. Yesterday (Thursday) it culminated in a far greater degradation, and ultimately many VMs did not get enough I/O performance to stay alive. The deduplication system uses a lot of memory, and when there is not enough memory it starts using disk, which is simply not fast enough to meet the needs of City Cloud. This slowed general disk performance and in turn either slowed machines down or took them offline.
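To give a feel for why deduplication is so memory-hungry, here is a rough back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of roughly 320 bytes of memory per unique block in the ZFS deduplication table. The function name, the 10 TiB figure and the block sizes are illustrative assumptions only, not measurements from our storage nodes.

```python
# Rough estimate of the RAM needed to hold a ZFS deduplication table (DDT).
# Assumption: about 320 bytes of memory per unique block (a rule of thumb).
# All figures are illustrative, not City Cloud's actual numbers.

BYTES_PER_DDT_ENTRY = 320  # assumed rule-of-thumb value


def estimate_ddt_ram_bytes(stored_bytes: int, avg_block_bytes: int) -> int:
    """Return the approximate RAM needed for the dedup table, in bytes."""
    if stored_bytes < 0 or avg_block_bytes <= 0:
        raise ValueError("stored_bytes must be >= 0 and avg_block_bytes > 0")
    # Ceiling division: a partially filled block still needs a table entry.
    unique_blocks = -(-stored_bytes // avg_block_bytes)
    return unique_blocks * BYTES_PER_DDT_ENTRY


if __name__ == "__main__":
    TIB = 1024 ** 4
    GIB = 1024 ** 3
    for block_kib in (128, 8):
        ram = estimate_ddt_ram_bytes(10 * TIB, block_kib * 1024)
        print(f"10 TiB at {block_kib} KiB blocks: ~{ram / GIB:.1f} GiB of RAM")
```

Under these assumptions, smaller blocks mean many more table entries for the same amount of data, so the memory requirement grows quickly. Once the table no longer fits in memory, lookups have to go to disk, which is the kind of slowdown described above.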
What will happen going forward
We have had extreme stability in these types of storage nodes over the last 5 years. They also offer good value in performance and allow us to offer a good price/quality ratio to all our customers. We will remove every form of deduplication that we have used, so that we do not risk this scenario going forward. To do so, we will move a lot of servers off the storage node in question to a new node which is similar in setup but newer and configured with no deduplication.
We expect this work to happen during February. When a VM is moved, it could mean 10-15 minutes of downtime. Any such maintenance will of course be scheduled at night and kept to a minimum.
We apologize for any inconvenience this may have caused you. We will work very hard to make sure it does not happen again.
Thank you for being a customer of our City Cloud service.