At the beginning of April 2020, eBag turned to us for help with this issue, which was holding the company back at a crucial moment. Because the store's infrastructure is hosted by Delta.BG, we already had an overview of the current resources and were able to carry out a thorough analysis in a very short time. It turned out that simply scaling up would patch the problem for the time being. However, with traffic expected to grow further, the issue would inevitably return. The optimal solution was therefore a combination of horizontal scaling of server resources, full automation of the deployment process, and a change in the approach to error and failure detection.
Vertical or horizontal scaling?
Vertical scaling usually means adding hardware resources (CPU, RAM, storage) to the existing system. Horizontal scaling means adding whole nodes alongside the existing ones and balancing the load between them.
Imagine a snack kiosk serving a queue of hungry customers. Vertical scaling would mean increasing the number of people handling orders inside the kiosk. Horizontal scaling would mean opening parallel kiosks, copies of the original, to meet the rising demand. When resources become scarce, the obvious question is in which direction to scale. If the server in question is on-premises, there is just one option: vertical (scaling within our own hypervisor is in fact still vertical). Having placed its infrastructure in the cloud, eBag had the opportunity to scale horizontally, gaining the following advantages:
- Possibility for unlimited hardware upgrade;
- Close binding between operational expenditures and revenues;
- More system stability;
- Dropping the necessity to pay constantly at the rate of the highest used capacity.
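To make "adding whole nodes and balancing the load between them" concrete, here is a minimal Python sketch of round-robin balancing. It is an illustration only, not the balancer used for eBag: the class name, the node names (`web-1`, `web-2`, `web-3`) and the request counts are all hypothetical, and real load balancers also track node health and weights.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend nodes in strict rotation (hypothetical helper)."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)
        self._rotation = cycle(self.nodes)

    def add_node(self, node):
        # Horizontal scaling: a new node joins the pool and the rotation restarts.
        self.nodes.append(node)
        self._rotation = cycle(self.nodes)

    def route(self):
        # Return the node that should serve the next request.
        return next(self._rotation)


def simulate(balancer, requests):
    """Count how many requests each node receives."""
    return Counter(balancer.route() for _ in range(requests))


balancer = RoundRobinBalancer(["web-1", "web-2"])
before = simulate(balancer, 600)  # 300 requests per node
balancer.add_node("web-3")
after = simulate(balancer, 600)   # 200 requests per node
```

With the same 600 requests, adding a third node lowers the load on each node from 300 to 200 without touching the hardware of the existing ones, which is the kiosk analogy in code form.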
Automation of the deployment process
To implement flexible horizontal scaling of the infrastructure in line with established best practices, we suggested moving to Ansible as the configuration management solution. This approach has many advantages over shell scripting; a few of the most essential are parametrization, templates for config files, and full automation of the deployment process. The last one is crucial for optimizing development because it allows testing and going live at the press of a single button, without interruptions and without customers noticing that work is being done on the store. In case of a failure, the automated deployment also allows a full rollback to the last stable version, so that additional error testing can be done. The technologies we proposed are all open-source, so they can be audited or managed by a third party without locking the customer in to a particular service or developer.
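The rollback behaviour described above can be sketched in a few lines of Python. This is not Ansible and not the actual eBag playbook; it only illustrates the logic of "deploy node by node, and restore the last stable version if anything fails". The function `deploy`, the `health_check` callable, the node names and the version strings are hypothetical.

```python
class DeploymentError(Exception):
    """Raised when a node fails its post-deploy health check."""


def deploy(nodes, new_version, health_check):
    """Roll new_version out node by node; restore every touched node on failure.

    nodes: dict mapping node name -> currently running version (mutated in place).
    health_check: callable(node_name, version) -> bool, a hypothetical probe.
    Returns True if the rollout completed, False if it was rolled back.
    """
    previous = {}  # node name -> version it ran before this rollout
    try:
        for name in nodes:
            previous[name] = nodes[name]
            nodes[name] = new_version
            if not health_check(name, new_version):
                raise DeploymentError(f"{name} is unhealthy on {new_version}")
    except DeploymentError:
        # Roll back only the nodes that were actually changed.
        for name, old_version in previous.items():
            nodes[name] = old_version
        return False
    return True


fleet = {"web-1": "1.4.0", "web-2": "1.4.0"}
ok = deploy(fleet, "1.5.0", lambda name, version: name != "web-2")
# ok is False and both nodes are back on 1.4.0
```

Because every step is scripted, the same procedure runs identically in testing and in production, and a failed rollout leaves the store on the version it had before.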
Employing best practices for error and failure detection
Even with the most flexible and sturdy infrastructure and a fully automated deployment process, issues can still occur and should be detected and resolved as quickly as possible. Logging in to every single server to check its logs can quickly turn from a minor challenge into a nightmare, especially as the number of servers grows to meet the increase in traffic. The eBag team knew that centralizing logs would avoid such a scenario, but they needed a more specific solution. We suggested installing a small log shipper from the ELK Stack on every server. This would allow the important information to be searched, archived and processed. Besides optimizing the work process, this would also free up server resources. Again, the solution was open-source, which saved the customer additional licensing fees.
Following the above, a plan for renewing the systems was developed and presented to eBag, and after their approval Delta.BG was able to test and deploy it in three weeks. A virtual testing platform was built for the project, and a copy of the new store was subjected to peak traffic simulations. The results were excellent and we went into production. Going live was done without any interruptions, so not a single customer was disturbed in any way.
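The idea of a per-server shipper feeding one searchable store can be illustrated with a small in-memory Python sketch. It does not use or imitate any ELK Stack API; `CentralLogStore`, `ship`, the field names and the sample log lines are hypothetical stand-ins chosen for illustration.

```python
import json
from datetime import datetime, timezone


class CentralLogStore:
    """In-memory stand-in for a central log index."""

    def __init__(self):
        self.events = []

    def ingest(self, payload):
        # The store receives serialized events, as it would over the network.
        self.events.append(json.loads(payload))

    def search(self, text=None, host=None):
        # Filter by message substring and/or originating host.
        return [
            e for e in self.events
            if (text is None or text in e["message"])
            and (host is None or e["host"] == host)
        ]


def ship(host, lines, store):
    """Tag each raw log line with its host and a timestamp, then forward it."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = {
            "host": host,
            "shipped_at": datetime.now(timezone.utc).isoformat(),
            "message": line,
        }
        store.ingest(json.dumps(event))


store = CentralLogStore()
ship("web-1", ["GET /cart 200", "GET /checkout 500"], store)
ship("web-2", ["GET /checkout 500", ""], store)
errors = store.search(text=" 500")  # two events, one from each host
```

A single query over the central store replaces logging in to each server in turn, and the number of servers no longer affects how long it takes to find a failing request.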