Based on the noteworthy events in the beginning of 2021 in this article we will share some thoughts about Scalability in the Cloud. We referred to Signal outage in January 2021 as well as mentioned some other companies and people solely for comparing and contrasting purposes, that’s why any of the figures and numbers provided in the text may not be accurate, but will likely reflect the level of magnitude necessary for making certain logical conclusions. The end goal is to learn from complex while keeping things simple.
In this article we will look at and analyze what’s behind the numbers to learn from this example.
Everything is relative. That’s why to better understand the level of magnitude of the numbers let’s first understand what’s perceived “large”, “huge” and “massive” in 2021. Let’s assume there are 7.8 billion people in the world, 5.1 unique mobile phone users, 2.7 billion unique smart phone users. In December 2020 WhatsApp arguably had 400 million monthly active users and WhatsApp Messenger app was downloaded from Google Play Store (Android Store) over 5 billion times. At the same time in December 2020 Signal arguably had 20 million monthly active users and the number of Signal Private Messenger app downloads from Google Play Store (Android Store) raised from over 10 million to over 50 million times until mid-January 2021. Many of these downloads (circa 7.5 million) might have happened between January 6 and January 10 2021 which would present a substantial increase in numbers after just 5 days and it was a signal to the company to quickly start adding new servers for extra capacity. This is an example of a scalability challenge, in fact, while we are living in the Cloud era nowadays, and Signal were able to promptly restore their operations by January 16 2021.
Now as for an aftermath exercise, so how many servers (X) of type Y would we need to enable a capacity of Z? If this is not a linear algebraic equation, what other considerations should be taken into account when designing and implementing a solution architecture at scale for a particular use case or workload? And what capacity is available in the Cloud nowadays for us to possibly tap into?
Perception of scale
According to the Moore’s law which created an industry expectation for ever increasing computing performance every N years, we’ve been observing constant technology advancements over the years. In fact, lately it seems like Moore’s law was slowing and, for example, the processor core performance is now forecasted to double every 20 years (not every 2 years). And, obviously, with the increase of technological capacities customers demands across the globe do increase too. There’re several industries which could be great examples of a planet-scale demands such as Media & Communications, Retail, Public Sector and more. Speaking of Media & Communications, in January 2021 Netflix streaming turned 14 years old and its Cloud Edge Architecture is described here. Next let’s talk some numbers and try to define how a “massive” data center might look like. For this we will watch a CES 2021 “Microsoft reveals its massive data center (Full Tour)” video here published on January 13 2021 in which Microsoft’s Brad Smith gives a tour of Microsoft’s data center in Quincy, WA (USA) that houses circa 500000 servers (again, that is 0.5 million servers in a single data center). And, for example, Microsoft has hundreds of data centers around the globe. Now coming back to Signal, in the end of January 2020 the app may have well over 500 million monthly active users which would be a remarkable growth. It is noteworthy to review Twitter reply by Octave Klaba offering help to Signal which talks about “>10K servers with 16 cores/24 cores/48 cores in USA, Canada, Europe and APAC” and could possibly serve as a starting point (rough estimation) for capacity increase in the situation. Just for the reference, according to OVHcloud web site here in January 2021 the company has 27 operational data centers with hosting capacity of 1 million servers and serving 1300000 customers. Then, say 10000 servers allocated to Signal (1 customer) might represent 1% (10000/1000000) of overall company’s server capacity world-wide.
On January 15 2020 Signal twitted “We have been adding new servers and extra capacity …”. So how much extra capacity of Z do we get by adding X servers of type Y? The answer you likely hear will be: It depends. In the next sections we’ll decompose this complex question into a set of simpler questions and introduce a number of assumptions to help us formulate some concrete considerations and conclusions.
Definition of the problem domain
In the attempt to simply things let’s look at the server side and describe server resources as CPU, Memory, Disk (I/O), Network, etc. For a particular app, product, service, etc. that we’re building we’ll also going to have a set of functional and non-functional requirements. These requirements will likely inform some of the decisions which we will be making about the solution architecture and the code. For example, while talking about Instant Messaging and Privacy, from the data handling perspective it would be logical to minimize the amount of the customer-related data stored on the backend and maybe try to store that data in the app on user’s device and obviously properly protect the data (messages) travelling over the internet between user’s devices (clients) with the help of the backend (servers). One of the important aspects of the entire system would be an efficient backend message routing. Of course, there’re features, functions, etc. and the devil will be in the details (for example, sign-up/sign-in/sign-out processes, handling multimedia content in messages, etc. which will require to further beef up the server side), but we are trying to simplify the understanding and focus on essentials and basics in this article. In this light, we will just point out Elon Musk’s Twitter reply saying “You server-side code is doing too much”. So, while your scaling strategy may be based on the robust solution architecture which scales-up and -out quasi-linearly, it might also be so that you could put the foundation of this scaling strategy by optimizing your code in the first place. We see the task of architecting and building robust, resilient, performant solutions which scale as an optimization problem with constraints. Going forward let’s focus on the following to-do list for the further analysis:
- Server resources
- Server capabilities (we’ll pinpoint the web technology specifically)
- Solution architecture and code
The list above is not the most exhaustive at all, but it will help us focus and to keep things simple and be organized.
CPU, Memory, I/O, Network, etc. While in terms of the complexity systems may be CPU-bound, Memory-bound, Disk-bound, Network-bound, bound to other resources and any combination of the above, to keep things simple we will focus solely on CPU. CPU resource is typically characterized by the number of cores. In describing processing methods, we will highlight concurrent and parallel processing. To be precise we will look up some fundamental term definitions from Intel and AMD. According to Intel a thread (or thread of execution) is a software term for the basic ordered sequence of instructions that can be passed through or processed on a single CPU core. And core is a hardware term that describes the number of independent central processing units in a single computer component (chip). For concurrent processing using a single core thread is a unit of execution. Then in addition to Multi-threading which essentially improves the throughput of the system there’re a few vendor specific terms to be mentioned. Namely, Intel’s Hyper-threading delivers 2 processing threads per physical core, this way highly threaded applications can get more work done in parallel*, completing tasks sooner. Also, similarly, AMD’s Simultaneous multi-threading (SMT) is the process of a CPU splitting each of its physical cores into virtual cores (also known as threads) and this is done to increase performance and allow each core to run 2 instruction steams at once. This defines the relationship between the number of cores and the number of threads for modern CPUs as [Number of threads] = 2 x [Number of cores]. That’s why in your computer which may have a CPU with 2 cores you may observe 4 virtual cores (CPU0, CPU1, CPU2, CPU3) representing 4 supported threads. Another example would be a server with 48 cores which supports 96 threads. By increasing the number of cores, we increase the amount of work accomplished at a time. Thus, for parallel processing using multiple cores multiple tasks can be processed at the same time. Finally, on a dual socket system with, for example, 2 CPUs each of them having 64 cores with HT/SMT enabled the number of supported threads would be 256 (2 x 64 x 2). Now we have sorted out some terminology around CPUs.
HTTP and gRPC. While there may be many specialized protocols for client-server and server-server communication applicable to different use cases, for the sake of simplicity we will pinpoint the web technology and highlight HTTP(S) and gRPC protocols widely used to build micro-services based web applications on Cloud and Edge nowadays. HTTP is a ubiquitous, stateless, text-based, request-response protocol that uses the client-server computing model. gRPC is modern open-source high performance RPC framework that can run in any environment. It can effectively connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in last mile of distributed computing to connect devices, mobile applications and browsers to backend services. You can find a good and concise comparison of HTTP and gRPC here, we will just highlight a few points related to Microservices: 1) gRPC is designed for low latency and high throughput communication; 2) gRPC is great for lightweight microservices where efficiency is critical; and Point-to-point real-time communication: 1) gRPC has excellent support for bi-directional streaming; 2) gRPC services can push messages in real-time without polling. Thus, the selection of the appropriate protocols of communication for your solution may help optimizing the usage of your server(s) capabilities.
RPS and CPS. Next let’s consider 2 typical measures to assess server capabilities in a context of a specific solution: RPS (Requests per second) and CPS (Connections per second) which are often used for estimation purposes.
RPS (Requests per second) measure in general implies no persistent connection and request-response interaction between the client and the server. To achieve a large number of RPS we may want to minimize (optimize) the response time per request. In fact, HTTP keep-alive (HTTP persistent connection) is an instruction that allows a single TCP connection to remain open for multiple HTTP connections requests/responses. By default, HTTP connections close after each request. Keep-alive also reduces both CPU and memory usage on server. Pipelining requests for parallel execution is yet another technique which can be used for optimization. Depending on circumstances and requirements modern servers (assuming the complexity of modern applications) might handle, say, x10K or x100K +/- RPS.
CPS (Connections per second) measure does imply a persistent connection, Socket (or WebSocket) connection. WebSocket API (WebSockets) is a technology that makes possible to open a two-way interactive communication session between the user’s browser and a server. With this API, you can send messages to a server and receive event-driven responses without having to poll the server for reply. And typically, the goal of the optimization is to maximize the number of concurrent connections. In the past (1999) C10K (per server) problem described here was considered to be a canonical problem in this area, then, for example, later it became C10M (per server) problem described here setting up a new baseline to count in x1M or x10M +/- CPS.
Events streaming with Kafka. We’ve just contrasted a request-response (no connection) model with Socket (or WebSocket) connection model for assessing server capabilities, in fact, for the sake of completeness we will also mention an events-driven model and namely event streaming where in general no persistent connection is implied and events are asynchronous in nature. For example, Kafka is in general a publish-subscribe messaging system. Kafka uses a binary protocol over TCP. The protocol defines all APIs as request-response message pairs. However, there’s Kafka-websocket which is a simple WebSocket server interface to Kafka where a client may produce and consume messages on the same connection. Thus, in the real world the line between different approaches is blurry, and the real-world modern applications will likely be leveraging a combination of approaches depending on specific requirements and rationales.
To make things more concrete at this stage we may use the following intuitive RPS formula for our estimations:
The illustration above combines the information about the estimation formula for RPS, some quick-hand calculations of the estimated RPS for different number of cores, benchmarking results for RPS (depending on CPUs) and CPS (depending on CPUs) for Nginx (and Nginx Plus) which can be found here, approximate hardware costs from Nginx benchmarking which can be found here and here, and brief benchmarking results for Fast HTTP package for Go assessing RPS and CPS for application server(s) which can be found here. The clear observation we can make from it is that the numbers for RPS and CPS will vary depending on the function the server is performing, whether it is proxying or load balancing requests (typically faster), or executing some application logic (typically slower). It also helps to make a connection with the economy of scale (more information can be found here).
Scalability and Performance. In general, when estimating or optimizing operations we typically talk about Scalability and Performance, and it will be beneficial to delineate the two for clarity. Performance is a measurement for efficiency, it describes a relative capacity of the system, for example, 1 server can handle X RPS under a certain SLA. Scalability describes how effectively you can translate additional resources into additional capacity. For example, by increasing the number of servers to 100 we’ll be able to handle Y RPS under a certain SLA. In the best case, while increasing resources we expect to increase our capacity linearly (Linear scaling). However, in the real-world the optimizations we introduce typically lead to sub-optimal results (Non-linear scaling) and the bottlenecks can be found in the application itself, upstream applications, or downstream applications. In a few words, the performance problem would manifest itself if the system is slow for a single user, and we would deal with a scalability problem is the system is fast enough for a single user and slow for many users.
Solution architecture and code
Load Balancer and API Gateway. In the previous section we’ve already mentioned general purpose proxies and load balancers in addition to application servers, however, when developing Web APIs you typically need a specialized API management tool that is in between clients and a collection of your backend services, and this is called API Gateway. Clarifying the distinction between API Gateway and Load Balancer will help us going forward. Both can manage and balance out incoming traffic. Simply speaking, Load Balancer aims to distribute requests evenly to a set of backend servers. And API Gateway may direct requests to specific resources based on its configuration. By digging deeper, we could also undig further distinctions between API Gateway and Service Mesh, but for simplicity we will keep it out of scope in this article (if you are interested, “Istio Up & Running” book listed in the very bottom will be a great source of information on this topic). For the purposes of this article API Gateway will allow to connect micro-services, and Load Balancer(s) may help to scale out individual micro-services (each micro-service may be hosted on multiple servers).
Now we’ll put all components together to illustrate how we could deploy our Web API(s) in production:
Reverse proxy helps to properly hide the server side, Web Application Firewall (WAF) protects us from malicious activities, API Gateway allows to get our Web API(s) well-organized and control request/response information flow, Load Balancer helps to balance out the traffic across multiple API Servers which in their turn may connect to other downstream applications, for example, databases. Please note that some components may be bundled together in the Cloud by using multi-purpose solution architecture components, for example, as it is described for Nginx Plus here. It might also be some upstream applications, for example, other Load Balancers, Service Buses, Messaging Brokers, etc. if our Web API(s) would be just a part of a bigger picture/architecture.
Logic complexity. Another important consideration will be the overall complexity of implementation for our Web API(s) — the code. A simple way to illustrate a computational complexity may be to look at the Fibonacci set. Fibonacci sequence calculation may be implemented in a programming language of your choice using different approaches: 1) a simple loop, 2) recursion, or 3) one of the above with hashing/caching optimization. Fibonacci set calculation can be presented as F(n) = F(n-1) + F(n-2) where F(0) = 0 and F(1) = 1, and in general the bigger the input the longer it will take to calculate the output, you can find an academic explanation of the complexity theory and computational complexity of Fibonacci set calculation approaches, for example, here. The following illustration just highlights the growth pattern of Fibonacci set (as an example) and provides an additional insight into the storage requirements (against core data types) should we decide to implement an intelligent caching as an optimization technique:
For the resulting implementation of our Web API(s) we can further assess its runtime, latency, etc. And this is where observability and monitoring tools come to the rescue so we can distinguish the value-adding activities (runtime) from an overhead (latency).
The original question is too broad, and we don’t have the necessary information to answer it properly, other than possibly using our previous experience and intuition. Thus, instead of attempting to boil the whole ocean at once, let’s further simplify things and focus on a related task within a scope of a team building a function or a micro-service (and not the scope of an organization building a product or a service offering). The following challenge may be something you can practically come across in the Engineering and Data Science domains nowadays:
Your team is delivering a constant time function that takes 500 milliseconds (half a second) to execute. The function takes an input and produces an outcome. How do you deliver this function as a service and support 5 million requests per second?
This is a better scoped question with clearly defined parameters. To further clarify it, we will assume that our function is sophisticated enough according to its estimated duration (an artificial example we used in the section above was Fibonacci set calculation). We will consider this challenge as an optimization problem with constraints. To apply this generic Engineering example to the Data Science domain we can assume that our function is running an inference on a pre-trained Machine Learning (ML) model, say, a multiclass image classification model.
Reference architectures. There’re many reference architectures and benchmarks available that we might look at while designing the solution architecture for our specific requirements. For example, from Real-World Web Application Benchmarking from Facebook Engineering (2009) described here focusing on classic web apps to the recent posting about Netflix’s Cloud Edge Architecture (2021) described here featuring BFFs (Backend For Frontend pattern). Also, lots of more granular benchmarks focusing on certain aspects such as scalable proxying and load balancing, for example, High-Performance Load Balancing Nginx described here, Sizing recommendations HAProxy (HAProxy Enterprise 2.1r1) — here, Multithreading in HAProxy — here, Nginx & HAProxy comparison (Nginx’s POV) — here, etc. We’ll just highlight one conclusion from this section: we’ll have to optimize the solution architecture on different levels, for example, distinguishing between the capacity of load balancer(s) and the capacity of application server(s). Certain technology choices will have certain implications, if we choose to use HAProxy for load balancing, “On the same server, HAProxy is able to saturate approximately: 100–1000 application servers depending on the technology in use” as described here, and we will need to estimate a sufficient number of load balancers to satisfy the requirements. Thus, the overall capacity of the system will be defined by the capacity of its weakest link (bottleneck).
Load Balancing tier. At this state we’ve identified that we will have multiple application servers (API servers layer) to support the workload, and to keep going let’s add multiple load balancers to the architecture (Load Balancing tier). Load Balancing may be implemented using different approaches such as 1) DNS-based; 2) Hardware-based or 3) Software-based, etc.
In case the incoming traffic is geographically dispersed we might be better off with using DNS-based Load Balancing. DNS-based Load Balancing is the approach to configure a domain in DNS in such a way that client requests to the domain are distributed across a group of servers. To support this configuration individual load balancers will be assigned with dedicated IP addresses, and multiple A records will be associated with the domain associated with those IP addresses. The actual load balancing function may be provided by using already mentioned Nginx, HAProxy, or Traefik (benchmarks here), or Envoy (benchmarks here).
While implementing a Load Balancing tier ourselves is an option, there’s always an option to use a Cloud PaaS service for that too. For example, Azure Application Gateway acts as a reverse-proxy and provides L7 load balancing, routing, Web Application Firewall (WAF) and other services. Description of interesting use cases for using Azure API Management and Azure Application Gateway can also be found here.
At this level of approximation, the solution architecture diagram will look like the following:
API Servers tier. API Servers tier may be implemented using different options such as VMs (IaaS), Cloud Web Apps (PaaS), as individual container instances, for example, Azure Container Instances (ACI), or containers in a Kubernetes cluster, for example, Azure Kubernetes Service, etc. In general case we may assume that API Servers are PaaS Web Apps so we can take care of the Load Balancing tier ourselves. However, if using managed Kubernetes service(s) in the Cloud we might take advantage of some deployment automation that helps to scale individual deployments. Specifically, when deploying a containerized stateless workload into Azure Kubernetes Service (AKS) the resulting deployment will already have Kubernetes Workload objects (such as Deployments, ReplicaSets and Pods) and Kubernetes Networking objects (such as Services and Load Balancers) in place. Below illustration depicts such sample deployment and the immediate value it may bring:
QoS and SLA. When providing Web APIs as a service it will be important to ensure the appropriate Quality of Service (QoS) which describes the overall performance of a service. Cloud services nowadays have SLAs (Service Level Agreements) which describe guarantees about their uptime (high availability), for example, a service may be available 99.9% of the time, etc. For the Web API(s) you develop you may also introduce a response time guarantees in form of a percentile (p95, p99 and p999), for example, if the 95th percentile response time is 0.5 seconds, it means that 95 out of 100 requests are expected to take less than 0.5 seconds, and 5 out of 100 requests — 0.5 seconds or longer. Logically to meet our percentile-based guarantees we’ll be trying to minimize requests variance and make them as predictable as possible.
Intermediate results. With Load Balancing tier and API Servers tier introduced we might address our challenge very roughly with, say, Nx10000 servers (N=1..5..) — more beefy ones (more CPUs) for API Servers and more shy (less CPUs) for Load Balancers (with the assumption that, for example, 1 Load Balancer may serve 100–1000 API Servers).
Cache tier. Depending on the requirements it may be possible to reuse the results of the previous lengthy computations instead of doing everything from scratch every time. This is an opportunity for an intelligent caching. Redis is a ubiquitous and flexible caching solution which also supports distributed operations with Redis cluster(s). Distributed Redis cache allows to manage large volumes of data by leveraging the memory on many servers. This also allows to leverage other resources such as CPU, Network, etc. on many servers. In practice Redis is usually either memory or network bound. In general, a separation of a large amount of data into smaller chunks is typically referred to as partitioning. Partitioning and Sharding. However, there’s a distinction between the two. Partitioning may be about grouping subsets of data in a single instance (server), for example, for data isolation purposes. And Sharding usually means to split up the data and store it on different instances (servers). The split logic suitable for Redis cache may be: 1) Range-based keys (0..1000 = 1, 1001..2000 = 2, ..X); 2) Hash-based keys (hash, modulo, 0..X), etc.
Redis cluster is the native Sharding implementation available within Redis that allows you to automatically distribute your data across multiple nodes, and Redis cluster will take care of data sharding, replication, synchronization, failover operations for you. Thus, in addition to Load Balancing tier and API Servers tier we can now introduce a distributed Redis cache and at this level of approximation the solution architecture diagram will look like the following:
The assumption is that we’re not using any additional client libraries to encapsulate and proxy the calls to the cache and the function implemented with an API, that’s why API Servers tier is directly connected with Caching tier on the diagram.
As described here in Redis cluster you must have at least 6 Redis nodes — 3 for masters and 3 for slaves. Redis cluster may have a few master instances and each can have one more (up to 1000 recommended) slaves. The maximum recommended number of Redis master nodes is ~ 1000. Redis cluster can deliver high performance and linear scalability up to 1000 nodes.
As described here Redis can handle up to 2³² keys and was tested in practice to handle at least 250 million key per instance. Every hash, list, set and sorted set can hold 2³² elements. And your limit is likely to be the available memory on the server. Redis is single threaded and to take advantage of multiple CPU/cores on a server you may, for example, use pipelining and on an average Linux system this approach can help delivering over 1 million requests per second (RPS). To maximize CPU usage it is also possible you can start multiple instances of Redis in the same box and treat them as separate servers, however, using sharding in the first place instead may yield much better results overall.
Over time Redis has already become and is becoming even more threaded. The official benchmarks for Redis can be found here. And the optimization we may achieve by introducing a Caching tier based on Redis may be significant. Say, our usual request takes 500 milliseconds to compute, and in case of retrieving a pre-computer result from the in-memory cache it might be x10 milliseconds instead (representing x50 — x100 possible performance improvement).
Intermediate results. Assuming that our function in the challenge is deterministic (it returns the same outcome when it is called with the same input values) and we can cache and reuse the cache effectively, with an additional Caching tier introduced we might bring in an optimization and reduce the number of servers (between multiple tiers) possibly down by half or more.
When you make your Web API(s) available, to make sure it is functioning as expected and copes with expected loads well, it will be necessary to perform Load testing. There’re many tools available for Load testing which you can install On-premises or in the Cloud, a comparison of open source load testing tools can be found here. You may also be interested in taking a look at the famous Netflix’s Chaos Monkey here.
Throttling and Capping. While load testing may help to validate that you are prepared for the anticipated volumes and peak volumes of legitimate requests, not all the requests may be as such. And unless you are properly protected from malicious requests, for example, by using Web Application Firewalls (WAF), etc., you may be susceptible to DoS and DDoS attacks. Thus, you definitely want to add an appropriate Security tier to your solution architecture (and this could be a separate discussion outside of the scope of this article), but in addition to this you may want to introduce Throttling and Capping policies for your Web API(s). Throttling may help to control server bandwidth by limiting the amount of data (or number of requests) that a server may accept for a period of time, and Capping may help to prevent clogging the pipe based on the overall amount of data (or number of requests) over time.
External Dependencies. In the sections above we’ve mentioned how your Web API(s) may be related to upstream and downstream services. In fact, often your Web API may be dependent upon certain external dependencies. One typical example of such external dependency is Identity service. Identity as a Service options available from Cloud vendors or ISVs. For example, to implement authentication and authorization for your Web API(s) you may be leveraging an external solution, and providers of those solutions usually publish Load testing policies to govern how load testing activities should be performed. For example, as described here Auth0 defines its load testing policy as the following:
Auth0 strongly recommends including a brief “ramp up” period to the desired load testing numbers. For example, a load test request of 100 RPS might be preceded by three five minute periods: 5 mins at 25 RPS, 5 minutes at 50 RPS, and 5 mins at 75 RPS. This ramp up period allows Auth0 and the customer to observe and compare effects at increasing RPS levels prior to peak RPS. If the ramp up period is not possible, please indicate why.
Moreover, in addition to load testing policies different providers may have different limitations around their services which you would have to factor in into the capacity planning for your solution.
On-Premises and Cloud. Standing up a solution itself is a considerable effort, and standing up a load testing infrastructure to test the solution may be quite an effort too. In fact, Serverless compute in the Cloud might be a solution to this call. Many Cloud providers manage Serverless compute infrastructure around the globe which is available as a development platform under SLA for a programming language of your choice. Thus, Serverless compute may help to make load testing setup more doable (and possibly more affordable), and potentially help to make the solution itself more robust.
Serverless compute in the Cloud allows you to focus on the workload you are implementing and the code you are writing, while the Cloud provider is responsible for managing and optimizing the underlying infrastructure, help you to scale up and out the capacity when needed, etc. A few ways of taking advantage of Serverless compute in the Cloud are Azure Function, AWS Lambda, CGP Cloud Functions, etc.
Serverless compute may help when you are Engineering Web APIs or deploying Data Science models in the Cloud for inference, and in many more use cases. Thus, you Web API(s) may be delivered as functions in the Cloud. Also, to illustrate how you can deploy a Machine Learning (ML) model as a function in the Cloud we will consider using Azure Cognitive Services Custom Vision and Azure Functions. In the Custom Vision portal (here) you can take advantage of existing ML algorithms, for example, for object detection or image classification, bring your own data to train those models in the Cloud, and then export already trained model as a Docker container as shown below:
This export package contains everything needed to expose your model as a Web API for inference in the Cloud. Namely, in this concrete example Python Flask web framework is used to expose a TensorFlow model (.pb) as Web API. Using the similar approach of packaging your models into Docker containers and deploying them as functions in the Cloud, you could serve models from .pkl (.pickle) or .joblib for ScikitLearn, .pb or .h5 for TensorFlow/Keras, .caffemodel (.h5) or .prototxt for Caffe, .pt or .pth for PyTorch for inference (also assuming that there’re libraries to work with those models in the programming language of your choice for the web framework).
Now let’s review different implementations of Serverless compute in the Cloud and look at their elasticity parameters.
Resources. When you create an Azure Function you specify a hosting plan. Different hosting plans are described here. The default hosting plan is Consumption plan, there’re also Premium plan and Dedicated (App Service) plan. Function instance sizes which define the amount of resource allocated per instance will vary depending on the plan selected. Each instance in the Consumption plan is limited to 1.5 GB of memory and one CPU. For example, for Premium plan the options may be 1 CPU core + 3.5GB RAM, 2 CPU cores + 7GB RAM or 4 CPU cores + 14GB RAM. With Dedicated plan you may take advantage of a more flexible and expandable setup similar to setting up server farms. When you create a Function app you will see the following resources being provisioned: Microsoft.Web/sites, Microsoft.Web/serverfarm. In the Consumption and Premium plans, Azure Functions scales CPU and memory resources by adding additional instances of the host where The number of instances is determined on the number of events that trigger a function — event-driven scaling described here. Azure Functions currently support the following runtimes: .NET, NodeJS, Python, Java, PowerShellCore, CustomHandler.
Scaling considerations and best practices for Azure Functions are described here. Thus, the hosting plan you choose dictates the following behaviors: 1) how your function app is scaled; 2) the resources available for each function app instance. When your function is first enabled, there is only one instance of the function. For example, in Consumption plan Azure Functions scale automatically and you pay for compute resources when your functions are running. Event-driven auto-scaling. The instances of the Functions host are dynamically added and removed based on the number of incoming events. Automatic scaling continues to happen even during periods of high load. You pay only for the time your functions run. Billing is based on number of executions, execution time, and memory used. Cold start behavior. App may scale to zero when idle, meaning some requests may have additional latency at startup. After your function app has been idle for a number of minutes, the platform may scale the number of instances on which your app runs down to zero. Maximum instances. A single function app only scales out to a maximum of 200 instances. A single instance may process more than one request at a time, so there isn’t a set limit of number of concurrent executions. You can specify a lower maximum to throttle scale as required. New instance rate. For HTTP triggers, new instances are allocated, at most, once per second. For Non-HTTP triggers, new instances are allocated, at most, once every 30 seconds. Scaling is faster for Premium plan.
As described here you may leverage certain configurations to your advantage when using Azure Functions: 1) Use multiple worker processes. By default, any host instance for Functions uses a single worker process. To improve performance, especially with single-threaded runtimes like Python, use FUNCTIONS_WORKER_PROCESS_COUNT to increase the number of worker processes per host (up to 10). Azure Functions then tries to evenly distribute simultaneous function invocations across these workers. The FUNCTIONS_WORKER_PROCESS_COUNT applies to each host that Functions creates when scaling out your application to meet demand. 2) Configure host behaviors to better handle concurrency. The host.json file in the function app allows for configuration of host runtime and trigger behaviors. You can manage concurrency for triggers. Settings in the host.json file apply across all functions within the app, within a single instance of the function. For example, maxConcurrentRequests = 25, 10 instances = 250 concurrent requests (10 instances x 25 concurrent requests per instance).
Some limitations may be imposed by external dependencies and a relevant use case is described here when a function app shares resources such as connections: HTTP connections, database connections, and connections to services such as Azure Storage. When many functions are running concurrently, it’s possible to run out of available connections. Connection limit. The number of available connections is limited partly because a function app runs in a sandbox environment. One of the restrictions that the sandbox imposes on your code is a limit on the number of outbound connections, which is currently 600 active (1200 total) connections per instance.
As we’ve shown earlier Azure Functions can accommodate for an ML model in a Docker container, or any kind of logic packaged in a Docker container, also you could deploy, for example, a class library as a function as described here. It is also quite convenient to pre-package your code for deployment as an archive and then deploy your function leveraging a Zip deployment option described here. To monitor invocations of Azure Functions you can connect it with Azure Application Insights. And in case you want to save the state between executions of your functions, Azure Functions provide Durable Functions for that described here.
Frontend. When you invoke an Azure Function, the response comes back from Kestrel server. Kestrel is an open-source, cross-platform, lightweight and a default web server used for ASP.NET Core applications. Now if we do a banner grabbing exercise using the IP address of the response, then we’ll see that the response is coming back from IIS (Microsoft Internet Information Services) this time.
Resources. When you create an AWS Lambda Function you can choose between the following options: 1) Author from scratch: Start with a simple Hello World example; 2) Use blueprint: Build a lambda application from sample code and configuration presets from common use cases; 3) Container image: Select a container image to deploy for your function; 4) Browse serverless app repository: Deploy a sample lambda application from AWS Serverless Application Repository. Lambda then provisions an instance of the function and its supporting resources. According to the description here Lambda manages the compute fleet that offers a balance of memory, CPU, network and other resources. This is in exchange for flexibility, which means you can not log in to compute instances, or customize the operating system on provided runtimes. These constraints enable Lambda to perform operational and administrative activities on your behalf, including provisioning capacity, monitoring fleet health, applying security patches, deploying code, monitoring and logging.
You can choose an amount of memory to be allocation for your Lambda Function between 128MB and 10240MB. Lambda Functions currently support the following runtimes: .NET Core, Go, Java, NodeJS, Python, Ruby.
AWS Lambda Function scaling guidance is provided here: The first time you invoke your function, AWS Lambda creates an instance of the function and runs its handler method to process the event. When the function returns a response, it stays active and waits to process additional events. If you invoke the function again while the first event is being processed, Lambda initiates another instance, and the function processes the two events concurrently. As more events come in, Lambda routes them to available instances and creates new instances as needed. When the number of requests decreases, Lambda stops unused instances to free up scaling capacity for other functions. Your functions’ concurrency is the number of instances that serve requests at a given time. For an initial burst of traffic, your function’s cumulative concurrency in a region can reach an initial level of between 500 and 3000 which varies per region. After the initial burst, your functions’ concurrency can scale by an additional 500 instances each minute. This continues until there are enough instances to serve all requests, or until a concurrency limit is reached. When requests come in faster than your function can scale, or when your function is at the maximum concurrency, additional requests fail with a throttling error. If this is not enough concurrency to serve all the requests, additional requests are throttled and should be retired. The function continues to scale until the account’s concurrency limit for the function’s region is reached. The function catches up to demand, requests subside (become less intense), and unused instances of the function are stopped after being idle for some time. Unused instances are frozen while they’re waiting for requests and don’t incur any charges. The current burst concurrency quotas are: 3000 (US/Europe), 1000 (Asia), 500 (Other). There’re approaches you may use for further optimization such as Reserved concurrency (>limit) which creates a pool (server pool) of resources, Provisioned concurrency to enable scaling without fluctuations in latency (hot/warm start versus a cold start), Auto-scaling for provisioned concurrency, Asynchronous invocation, etc.
In the settings (Basic Settings) of your Lambda Function you find an explanation about CPU (resource) allocation: The Memory (MB) setting determines the amount of memory available for your Lambda function during invocation. Lambda allocates CPU power nearly linearly in proportion to the amount of memory configured. At 1769MB, a function has the equivalent of 1 vCPU (one vCPU-second of credit per second). To increase or decrease the memory and CPU power allocated to your function, set a value between 128MB and 10240MB. And also, a link to AWS Compute Optimizer.
In case you want to save the state between executions of your AWS Lambda Functions this can be achieved with AWS Step Functions Workflows & State machines described here. In the logs of the provisioned functions you can also find references to container images used for different runtimes, for example, something like “gallery.ecr.aws/lambda/nodejs” for NodeJS image.
Frontend. One of the recommended ways to expose AWS Lambda Function is to use AWS API Gateway which provide an integration with Lambda — this is how you get the “execute-api” URI like https://abc.execute-api.xyz.amazonaws.com/prod/ for invocation. Then when you invoke AWS Lambda Function via AWS API Gateway, by default any extra headers (including potentially OWASP unsafe headers such as Server, etc.) will be stripped out unless you configure AWS API Gateway to pass through the required headers. So, we’ll be gutting a response from an IP address without knowing what Server serves the request. Now if we do a banner grabbing exercise using that IP address, then we’ll see that the response is coming back from awselb/2.0 — ELB (AWS Elastic Load Balancer). As described here Elastic Load Balancing in AWS supports the following load balancers: Application Load Balancers, Network Load Balancers, Gateway Load Balancers, and Classic Load Balancers. For example, a Classic Load Balancer may distribute your incoming traffic to across multiple targets where each of them is mapped to a specific address, and Application Load Balancer may help to route requests based on the content of the URL and direct them to a subset of servers registered with the load balancer.
GCP Cloud Functions
Resources. When you create an GCP Cloud Function you specify its basic info (name, etc.), a trigger, choose an authentication option, set networking, ingress/egress parameters, etc. Starting a new function instance involves loading the runtime and your code. As described here you can control function’s scaling behavior: Max instances in Cloud Functions is a feature that allows you to limit the degree to which your function will scale in response to incoming requests. In Cloud Functions, scaling is achieved by creating new instances of your function. Each of these instances can only handle one request at a time, so large spikes in request volume might result in creating many instances. Note: Increasing your function’s memory can improve the throughput of your function and decrease the need for many instances. Usually this is desirable but in some cases you may want to limit the total number of instances that co-exist at any given time. For example, your function might interact with a database that can only handle a certain number of concurrent open connections. You can set max instances for an individual function during deployment. Each function can have its own max instances limit. Functions scale independently of each other. There’re limits and best practices for functions described as: 1) Guard against excessive scale-ups: When no max instances is specified, Cloud Functions is designed to favor scaling up to meet demand over limiting throughput. This means that the number of simultaneous instances that your function may have is effectively unlimited unless you’ve configured such a limit. We recommend assigning a max instances limit to any functions that send requests to throughput-constrained or otherwise unscalable downstream services. Such a limit improves overall system stability and helps guard against abnormally high request levels; 2) Request handling when all instances are busy: Under normal circumstances, your function scales up by creating new instances to handle incoming traffic load. But when you have set a max instances limit, you may encounter a scenario where there are insufficient instances to meet incoming traffic load. In such scenario, Cloud Functions will attempt to serve a new inbound request for up to 30 seconds — apply a retry logic; 3) Max instances limits that exceed Cloud Functions scaling ability: When you specify a max instances limit, you are specifying an upper limit. Setting a large limit does not mean that your function will scale up to the specified number of instances. It only means that the number of instances that co-exist at any point in time should not exceed the limit; 4) Handling traffic spikes: In some cases, such as rapid traffic surges, Cloud Functions may, for a short period of time, create more instances than the specified max instances limit. If your function can not tolerate this temporary behavior, you may want to factor in a safety margin and set a lower max instances value that your function can tolerate; 5) Idle instances and minimizing cold starts: To minimize the impact of cold starts, Cloud Functions will often maintain a reserve of idle instances for your function. These instances are ready to handle requests in case of a sudden traffic spike. For example, when an instance has finished handling a request, the instance may remain idle for a period of time. You are not billed for these idle instances.
Cloud Functions Execution Environment is described here as: Each Cloud Function runs in its own isolated secure execution context, scales automatically, and has a lifecycle independent from other functions. Stateless functions. Cloud Functions implements the serverless paradigm, in which you just run your code without worrying about the underlying infrastructure, such as servers or virtual machines. To allow Google to automatically manage and scale the functions, they must be stateless — one function invocation should not rely on in-memory state set by a previous invocation. However, the existing state can often be reused as a performance optimization. Controlling auto-scaling behavior. In some cases, unbounded scaling is undesirable. For example, your function might depend on a resource (such as database) that can not scale up to the same degree as a Cloud Function. A huge spike in request volume might result in Cloud Functions creating more function instances than your database can tolerate. Cold starts. A new function instance is started in 2 cases: 1) New deployment; 2) New function instance is auto-created to scale up the load, or occasionally to replace an existing instance (crash, etc.). Requests that include function instance startup (cold start) can be slower than requests hitting existing function instances.
Additional information about Auto-scaling and Concurrency can be found here: Cloud Functions handles incoming requests by assigning them to instances of your function. Depending on the volume of requests, as well as the number of existing function instances, Cloud Functions may assign a request to an existing instance or create a new one. Each instance of a function handles only one concurrent request at a time. This means that while your code is processing one request, there is no possibility of a second request being routed to the same instance. Thus the original request can use the full amount of resources (CPU, memory, etc.) that you requested. In cases where inbound request volume exceeds the number of existing instances, Cloud Functions may start multiple new instances to handle requests. This automatic scaling behavior allows Cloud Functions to handle many requests in parallel, each using a different instance of your function.
You can choose an amount of memory to be allocation for your GCP Cloud Function between 128MB and 4GB. GCP Cloud Functions currently support the following runtimes: .NET Core, Go, Java, NodeJS, Python, Ruby.
In case you want to save the state between executions of your GCP Cloud Functions this might be achieved with Cloud Composer described here or by implementing the state management yourself using Cloud Storage, etc. In the build logs of the provisioned functions you can also find references to container images used for different runtimes, for example, something like “us.gcr.io/fn-img/…”.
Frontend. When you invoke an GCP Cloud Function, the response comes back from Google Frontend (and X-Powered-By = Express which means NodeJS). This server (Google Frontend, GFE) is a reverse proxy that terminates TCP connections. Now if we do a banner grabbing exercise using the IP address/port of the response, then we’ll see that IP address showed up in IPv6 format and extra response headers were not provided.
By looking at the implementations of Serverless compute by different Cloud providers and their accompanying documentation, we can better assess the capabilities of the Serverless compute overall, its typical possibilities (resource allocations, scaling behaviors and optimizations, etc.) and limitations. Thus, we can design our solutions more confidently and effectively.
Simple, Medium Hard and Very Very Hard. In the previous section we came across Google Frontend (GFE) server which is (interestingly enough) also mentioned in “Site Reliability Engineering” book and here in the explanation about the conceptual approach Google takes to organize and optimize its production environments from the Site Reliability Engineering (SRE) perspective. The diagram below has a similar conceptual composition and components which we have considered earlier for our challenge. One interesting take away we drew from this explanation is about the design approach uniformity regardless of complexity of the task and with scalability in mind: “designing in a way that a simple task may be resolved with a medium complexity, but at the same time a very very hard task may be resolved with a medium complexity as well”.
Serverless compute approach. In this article while working on a scalability challenge, we considered different aspects on a potential solution architecture. Now if we would want to take advantage of a Serverless compute in the Cloud using Azure Cloud the solution architecture diagram might like the following:
On this diagram Azure Application Gateway helps with load balancing and provides a Web Application Firewall (WAF) capability, Azure API Management serves the purpose of an API Gateway for routing, etc. (please note that Azure API Management doesn’t perform load balancing as described here), Azure Functions deliver the application logic, and Azure Cache for Redis is a managed Redis cluster in Azure Cloud. You can find more information about designing APIs for micro-services in Azure Cloud here.
Intermediate results. If we decide to mix up our original approach with Serverless compute (based on Functions) we might further reduce the number of servers (between multiple tiers) possibly down by half or more in favor of growing the fleet of additional Serverless compute. Or possibly we might switch gears towards Serverless compute-based implementation altogether depending on the cost comparisons and start designing a robust scaling strategy based on the capabilities and limitations of the selected Cloud provider(s).
Is Cloud elasticity, fully-managed services and Serverless compute the silver bullet which helps us every time for every challenge? It may or may not be, because scaling in the Cloud still requires a lot of considerations and experimentations. Scaling out and adding x1000s and x10000s of instances (servers) at once may not be necessarily smooth, for example, AWS Transit Gateways (TGW) auto-scaling issues which caused Slack outage on January 4 2021 described here. You can also find high level information about Slack architecture on AWS here for your reference.
Estimations and Facts. Robust thoughtful designs, reference architectures and proper estimations help to get off to a good start with your solution architecture. In fact, experimentations, benchmarking, and appropriate testing allow to confirm or unconfirm the feasibility of a solution and also help point out to the possible incremental improvements which need to be made.
Managed Cloud Infrastructure and Cloud Native apps. Serverless compute capabilities offer fully managed environments in the Cloud where your provider takes care of all the infrastructure, operating systems, etc. on your behalf, which is great for developers who can focus on their applications with faster time to value, for businesses who can receive a faster return on investment (ROI) and for Cloud providers themselves who can keep expanding their market share. In fact, there will always be considerations such as cost, legacy systems in place, Cloud plus Edge use cases (and more) making hybrid approaches applicable. In the “Istio Up & Running” book by Lee Calcote you can find a laconic description of the Evolution of Cloud native as shown below:
The conclusion drawn from the illustration above is about Cloud Native application characteristics which are:
- Dynamic and decoupled from physical resources
- Distributed and decentralized
- Resilient and scalable
- Observable and predictable
When Elon Musk joined Clubhouse in the beginning of February 2021 one might anticipate the growing interest to the app as well. So to ease the scaling efforts one could apply a gated sign-up approach which would be a valid process solution to a challenge. Thus per Clubhouse’s Apple App Store page here: “… We’re working hard to add people to Clubhouse as fast as we can, but right now you need an invite to sign up. Anyone can get one by joining the waitlist, or by asking an existing user for one. We really appreciate your patience and can’t wait to welcome you. …”.
Opinions expressed are solely of the author. All product names, logos, brands, trademarks and registered trademarks are property of their respective owners.
Books you might enjoy reading
- “Site Reliability Engineering: How Google runs production systems” by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy (2016) (link)
- “Go Web Programming” by Sau Sheong Chang (2016) (link)
- “Istio Up & Running: Using a Service Mesh to Connect, Secure, Control, and Observe” by Lee Calcote & Zach Butcher (2019) (link)
- “Kafka The Definitive Guide: Real-Time Data and Stream Processing at Scale” by Neha Narkhede, Gwen Shapira & Todd Palino (2017) (link)
- “Redis In Action” by Josiah L. Carlson (2013) (link)
- “Designing Data-Intensive Applications: The Big Ideas behind Reliable, Scalable and Maintainable Systems” by Martin Kleppmann (2017) (link)