
1 Introduction

Function as a Service (FaaS) provides enterprises with a cloud-native serverless solution for building robust, scalable, and loosely-coupled distributed applications at a low operational cost. Clients can use such a platform to encapsulate complex business logic into independent micro-services that communicate with each other via provided application programming interfaces (APIs). The FaaS platform is responsible for responding properly to outside events by triggering calls to such APIs in a loosely-coupled manner [16]. Amazon Web Services (AWS) and Google provide enterprise-scale realizations of this paradigm as the AWS Lambda [1] and Google Cloud Functions [12] services, respectively. Because FaaS is intrinsically suited to hosting applications that follow decoupled architecture principles, it is believed to play a major role in future service-oriented architectures (SOA) [1, 14, 18]. In this paper, we use the terms FaaS and Lambda platform interchangeably.

By using a combination of in-house and cloud-based FaaS servers, developers can focus their attention fully on the design and implementation of business logic without worrying about server maintenance activities (such as server provisioning, capacity planning, configuration setup, deploying the micro-services, and so on). The main idea behind using a FaaS platform is to remove the need for the traditional “always on” servers running behind the users’ scripts [26]. FaaS can remarkably bring down the operational cost in at least two ways. First, its adoption realizes “pay-per-use” pricing at a finer granularity than current hourly-based cloud pricing. Second, it enables users to create applications much faster by developing fine-grained actions (e.g., micro-services) rather than handling coarse-grained components (e.g., monolithic applications); this in turn contributes to cost reduction.

To enable better scaling, the service provider of a FaaS platform may decide to host thousands of function services (or Lambda functions in AWS Lambda) on the available resources to achieve both client and operator goals at low cost. In many situations, however, these goals conflict with each other, e.g., the fast execution time demanded by end-users versus the high resource utilization targeted by the service provider. Scheduling and resource allocation play a crucial role in reconciling these conflicting objectives. Current resource allocation strategies for distributed systems and virtualized platforms are often QoS-oblivious: resource allocation is carried out irrespective of the QoS requirement of each application or the ever-changing resource utilization level of each host [29, 32]. Although each application has its own utilization characteristics (e.g., CPU/memory requirements) and each event source has a different incoming traffic rate, none of this is known to the scheduler in advance [28, 30, 33].

In this paper, we present a closed-loop (feedback) resource controller that increases the overall utilization of resources while achieving the QoS levels enforced by end-users. The proposed controller makes its decisions in each time epoch based on the following parameters: (1) an estimation of the generation rate of events associated with each FaaS function in the near-future time periods, (2) the number of QoS violation incidents that occurred in past epochs, fed back through the feedback loop, and (3) the reconfiguration cost (similar to the migration cost in a hypervisor-based system). We evaluated our solution against two existing heuristics (round-robin and best-effort) with respect to three metrics: resource utilization, QoS violations, and scalability. Our solution outperforms both the round-robin and best-effort strategies with an average improvement of 21% in overall resource utilization, while it reduces QoS violation incidents by a factor of 3 on average.

The rest of the article is organized as follows. In Sect. 2, we present background and related work associated with the FaaS platform. In Sect. 3, we define a metric for measuring QoS violation incidents. Section 4 formally presents the design principles of our resource allocation controller. In Sect. 5, we evaluate the performance of our solution through experiments on real systems. We then draw our conclusions in Sect. 6.

2 Background and Related Work

To build an application using a FaaS (Lambda) platform, the software development team needs to represent the whole business logic in terms of two core components: actions and event sources. An event is simply the detection of an internal or external condition that triggers a signal to be sent to a set of proper actions [6]. Examples include a change in database records, data read by an Internet of Things (IoT) sensor, the posting of a new tweet, an HTTP request, and a file upload notification. Each event normally invokes a corresponding action by triggering a specific set of rules defined by the application owner.

In turn, a FaaS action (also called a Lambda function in platforms like AWS Lambda) is a piece of code that must be executed immediately whenever a corresponding event trigger fires. In some platforms, a chain of actions can be defined such that each action is executed one after another once the associated event occurs [17]. Each action needs to be designed as a stateless (or idempotent) component; hence, all the required data must be given as input parameters [16]. This allows the platform to execute multiple instantiations of an action at the same time, while each instantiation keeps its own state. Typical use-cases that fit this paradigm well include the decomposition of traditional applications into micro-services, mobile server-side applications, file processing, big data analytics, and web servers [1, 17, 18].

The kernel of a FaaS/Lambda platform is responsible for determining the amount of CPU, RAM, and other resources that must be devoted to the runtime in order to execute instantiations of every action. We assume that every action in the system is accompanied by a QoS enforcement level stated in the service level agreement (SLA). This value defines the minimum service level (expressed as a set of performance metrics) to be guaranteed by the platform for the action.

QoS enforcement and the concept of fairness can significantly affect the way a resource allocation strategy works. Many studies in the context of distributed platforms focus on devising a “fair” resource allocation strategy, e.g., [5, 9, 10, 23]. Some suggest that simply minimizing the total number of QoS violations is sufficient for satisfying the SLA, e.g., [3]. In contrast, Gabor et al. [9] argued that employing a fair schema (as suggested by [5, 10, 23]) cannot always provide the satisfaction level that such systems promise. They also showed that under a fair resource allocation strategy, a situation can be considered good as long as almost every action running in the system experiences a similar level of performance degradation, even a severe one.

Obviously such a situation is not acceptable in practice; hence, a fair policy cannot lead to a desirable outcome in all cases. Yet the strategy suggested by [3], which minimizes the number of QoS violations, can have adverse consequences too. Consider a moment when the rates of multiple events abruptly increase at once: applying the avoidance strategy proposed in [3] might end up revoking resources from important actions, since such a strategy is only concerned with reducing the total number of QoS violations across all hosts. Our aim is therefore to find a sensible objective function that explicitly minimizes the number of QoS violations of important actions/clients in case of resource scarcity. A similar metric, first proposed by [15, 16], can be adjusted to the new platform.

The goal of an elastic solution is to devise mechanisms that scale the assigned resources up or down when the rate of requests fluctuates. The authors of [8, 31] introduced several techniques that use threshold-based rules on the actual CPU and I/O capacities to decide when to add/remove resources. In [11], a new metric called the congestion index was introduced to decide the number of replicas in a stream processing engine (SPE) platform. However, almost all of these techniques ignore the different levels of QoS constraints that can be enforced by different applications. Our approach differs from the mentioned projects in that we propose a well-defined controlling mechanism to replace heuristic-based algorithms. To this end, we introduce a set of metrics that address resource utilization, QoS constraints, and the cost of reconfiguration.

3 QoS Detriment Metric

We define a metric to detect situations where a QoS violation happens during execution time (the original idea is borrowed from [15]). Different applications tolerate performance degradation (e.g., delay in average response time) in different manners. For example, the QoS level of actions tied to applications in the high-frequency trading domain can easily be affected by any delay in response time, while an action in the domain of environmental monitoring is less sensitive to such an issue. This confirms that the service provider has to devise a mechanism to categorize and charge application owners independently based on their QoS levels. As the two approaches discussed above (fair allocation and minimizing the total number of violations) might both lead to adverse outcomes in a Lambda platform, we fill this gap by introducing a new metric, called QoS detriment, to quantify QoS violation incidents.

We assume that there are exactly Q different classes, each representing a QoS contract that users can request. We also assume that the desirable performance metric from the user’s perspective is the average end-to-end delay of running the corresponding actions during a given interval \(T=(t, t+\varDelta T)\). Thus, a value \(\omega ^{*}_q\) is assigned to each class \(1\le q\le Q\) (q is the identifier of each QoS class) that represents an upper bound on the absolute delay that is acceptable and must be guaranteed by the Lambda kernel for all actions belonging to class q. To decide whether an action experiences a QoS violation within a given interval, we compare the measured target performance (i.e., end-to-end delay) with the value of \(\omega ^{*}_q\).

However, completely avoiding QoS violation incidents while actions are executing is almost impossible. To relax this limit, we allow the resource allocation controller to violate QoS constraints for some actions in a managed way. Thus, for each class of QoS contract we define a new function, denoted by \(\mathscr {V}_q(\varDelta T)\), that takes class q as its input and whose output regulates the fraction of QoS violation incidents allowed to happen during any interval of size \(\varDelta T\) for all actions belonging to that class.

Based on this concept, we can express the definition of a QoS violation incident as follows. A sensible choice for \(\mathscr {V}\) is a simple linear rule like \(\mathscr {V}_q = 1 - \frac{q}{Q+C}\), where C is a constant and Q denotes the total number of QoS contract classes. Thus, for any arbitrary action \(a_i\) that belongs to class q, we say it experiences a QoS violation incident during an arbitrary interval T if its processing delay is higher than \(\omega ^{*}_q\) for more than a fraction \(\left( 1 - \frac{q}{Q+C}\right) \) of that interval.

We can define QoS detriment, denoted by \(\mathcal {D}_{m,T}\), as a metric to quantify the total amount of QoS violations happening in any host m as follows.

$$\begin{aligned} \mathcal {D}_{m,T}= \sum _{a_i\in \mathcal {V}_{m,T}} \mathcal {I}({a_i}), \end{aligned}$$
(1)

where \(\mathcal {V}_{m,T}\) denotes the set of Lambda functions experiencing a QoS violation during interval T, and \(\mathcal {I}({a_i})\) is the importance coefficient of action \(a_i\). This term expresses the contribution of action \(a_i\) to the total QoS detriment in case \(a_i\) experiences a QoS violation. One good candidate for such an importance function is \(\mathcal {I}({a_i}) = q_{a_i}\), meaning that the higher the QoS enforcement level of an action is, the more it counts in Eq. 1. One of the main goals of our work is to reduce the total QoS detriment over all running hosts, i.e., to decrease \(\sum \limits _{m, T}\mathcal {D}_{m,T}\).
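As a concrete illustration, the following minimal sketch evaluates the violation rule above and the detriment of Eq. 1 for one host over one interval. The function names and the data layout are our own assumptions for exposition, not part of any platform API.

```python
import numpy as np

def allowed_violation_fraction(q, Q, C=1):
    """The linear rule V_q = 1 - q/(Q+C): the fraction of an interval
    during which a class-q action may exceed its delay bound."""
    return 1.0 - q / (Q + C)

def violates_qos(delays, omega_star, q, Q, C=1):
    """True if the action's delay exceeds omega*_q for more than the
    allowed fraction of the sampled interval."""
    frac_over = np.mean(np.asarray(delays) > omega_star)
    return frac_over > allowed_violation_fraction(q, Q, C)

def qos_detriment(host_actions, Q, C=1):
    """Eq. (1): D_{m,T} = sum of importance coefficients I(a_i) = q_{a_i}
    over the violating actions of host m; `host_actions` is a list of
    (delay_samples, omega_star, q) tuples observed during interval T."""
    return sum(q for delays, omega, q in host_actions
               if violates_qos(delays, omega, q, Q, C))
```

Note that with \(\mathcal {I}({a_i}) = q_{a_i}\), a violation of a high-class action contributes more to the detriment, which is exactly the prioritization the controller optimizes for.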

4 Closed-Loop Resource Allocation Controller

The proposed strategy is essentially a closed-loop (feedback) model predictive controller (MPC): it uses a model to predict the dynamic behavior of the underlying platform in the near future and then makes a (near-)optimal decision based on the values of the input vectors fed back through the feedback loop. The resource allocation controller employs a control action that forces the output of the system to follow a “reference trajectory”. Such a method has been widely adopted in multiple domains of computing systems, such as energy-aware capacity provisioning in Cloud platforms [15, 21] and elastic scaling of stream data processing [3, 4]. Interested readers are referred to [25] for a thorough review of the theory and design of MPC.

There are three main components of the proposed controller: the model, the predictor, and the optimizer. The model provides the controller with an abstraction layer of the run-time behavior of the Lambda platform. The predictor can be used by the controller to give a rough estimation of future input values such as incoming traffic rates. The optimizer is responsible for finding the best possible values for controllable variables, which are denoted by \(\mathbf u _{\tau }\), such that the output of the system, shown by \(\mathbf z _{\tau }\), converges to an ideal set-point trajectory, denoted as \(\mathbf r _{\tau }\), at any time \(\tau \).

An important property of the proposed controller is that we apply the supposedly optimal input vector, i.e., \(\tilde{\mathbf{u }}_{\tau +1}\), to the system gradually (i.e., over more than one step). More formally, let \(\mathcal {T}_{ref}>1\) represent the response speed, and let \(\zeta _{\tau } = |\mathbf z _{\tau } - \mathbf r _{\tau }|\) represent the deviation of the current output from the ideal set-point trajectory at time \(\tau \). In our controller, we expect such a deviation to converge to zero at an exponential rate over the next f steps, i.e., \(\zeta _{\tau +f} = e^{-f {\varDelta }/\mathcal {T}_{ref}}\zeta _{\tau }\), where \(\varDelta \) is the sampling interval. For example, choosing the ratio \({\varDelta }/\mathcal {T}_{ref}=1/3\) is sensible in practical situations: it not only imposes a low computational overhead, but also provides an effective mechanism to reduce the adverse impact of errors in the prediction tool or the system model.
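To make the decay rate concrete, with \({\varDelta }/\mathcal {T}_{ref}=1/3\) the expected deviation shrinks per step as

$$\begin{aligned} \zeta _{\tau +1} = e^{-1/3}\,\zeta _{\tau } \approx 0.72\,\zeta _{\tau }, \qquad \zeta _{\tau +3} = e^{-1}\,\zeta _{\tau } \approx 0.37\,\zeta _{\tau }, \end{aligned}$$

so roughly two thirds of any deviation is corrected within three control steps, leaving headroom to absorb prediction errors before the controller fully commits to a new configuration.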

Response Time Model. We use a Kalman filter as a lightweight optimal estimator to effectively infer input parameters from uncertain past observations, taking advantage of correlations between the values of the system state and the input vector. By propagating the current state of the system, including the statistical influence of dynamic perturbations and the outcomes of all previous measurements, the Kalman filter minimizes the mean square error of the estimated input parameters when the system noise is Gaussian [13].

Let \(\mathcal {\bar{N}}(e_j,{\tau })\) and \({\bar{T}}(a_i,\tau )\) denote the average number of events emitted by event source \(e_j\) and the average computation time of the associated action \(a_i\) during an arbitrary interval \(\tau \), respectively. The average response time of event \(e_j\) associated with each instance of action \(a_i\) at machine \(p_k\), denoted \(RT_{a_i|p_k}^{\tau }\), can then be estimated by employing a Kalman filter over the past record of resource usage measurements allocated to the action on each machine. For the scope of this work, we focus our attention on CPU and RAM as the two main resources of each server, but the approach can be extended to other I/O or network resources in the same way. The parameters capturing the dependency of response time on resource utilization are continuously updated whenever new measurement data is collected by the controller.
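As one possible concrete form (an assumption on our side, since the exact state model is implementation-specific), the sketch below tracks the coefficients of a linear response-time model with a random-walk Kalman filter, so the dependency of response time on CPU/RAM utilization is re-estimated at every measurement:

```python
import numpy as np

class ResponseTimeKF:
    """Minimal sketch: track the coefficients w of a linear model
    RT ~ w . [cpu_util, ram_util, 1] with a random-walk Kalman filter."""

    def __init__(self, dim=3, process_var=1e-4, meas_var=1e-2):
        self.w = np.zeros(dim)               # state: model coefficients
        self.P = np.eye(dim)                 # state covariance
        self.Qn = process_var * np.eye(dim)  # process noise (slow drift)
        self.R = meas_var                    # measurement noise variance

    def update(self, cpu_util, ram_util, measured_rt):
        H = np.array([cpu_util, ram_util, 1.0])  # observation vector
        self.P = self.P + self.Qn                # predict: covariance grows
        S = H @ self.P @ H + self.R              # innovation variance
        K = self.P @ H / S                       # Kalman gain
        self.w = self.w + K * (measured_rt - H @ self.w)
        self.P = self.P - np.outer(K, H @ self.P)
        return self.w

    def predict_rt(self, cpu_util, ram_util):
        return np.array([cpu_util, ram_util, 1.0]) @ self.w
```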

Prediction Model. To predict the future values of the non-controllable input parameters (i.e., \(\mathcal {\bar{N}}(e_j,{\tau })\), as an indicator of the future rate of incoming events, and \({\bar{T}}(a_i,\tau )\), as an indicator of the total computational requests), we employ the well-known auto-regressive integrated moving average (ARIMA) model. Using such a model, the future values of a random variable \(\hat{u}\) can be forecast by applying a linear model over a series of past observations as \(\hat{u}_\tau = c + \epsilon _\tau +\sum _{\ell =1}^{h}\left( \beta _{\ell }u_{\tau -\ell }+\theta _{\ell }\epsilon _{\tau -\ell }\right) \), where c is a constant and the \(\epsilon \)’s are independent and identically distributed errors from a normal distribution with mean zero and finite variance (i.e., white noise). The \(\beta _{\ell }\)’s and \(\theta _{\ell }\)’s are coefficients that are updated using the least-squares regression method right after a new observation becomes known.
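For illustration, the sketch below fits the auto-regressive part of this model by ordinary least squares, as described above; the MA terms \(\theta _{\ell }\epsilon _{\tau -\ell }\) would be refit analogously from the residuals and are omitted to keep the sketch short. The function names are our own.

```python
import numpy as np

def fit_ar_least_squares(u, h=3):
    """Fit c and beta_1..beta_h of u_t ~ c + sum_l beta_l * u_{t-l}
    over the observed history `u` by ordinary least squares."""
    u = np.asarray(u, dtype=float)
    # Column l holds the lag-l regressor u_{t-l} for t = h..len(u)-1.
    X = np.column_stack([u[h - l:len(u) - l] for l in range(1, h + 1)])
    X = np.column_stack([np.ones(len(X)), X])     # intercept c
    y = u[h:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs                                  # [c, beta_1..beta_h]

def forecast_next(u, coeffs):
    """One-step-ahead forecast from the last h observations."""
    h = len(coeffs) - 1
    lags = np.asarray(u[-h:], dtype=float)[::-1]   # u_{t-1}, ..., u_{t-h}
    return coeffs[0] + coeffs[1:] @ lags
```

Refitting after every new observation, as the text prescribes, amounts to calling `fit_ar_least_squares` on the extended history at each control step.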

Optimization Process. The controller continuously solves an optimization problem whose objective function is the sum of the three cost functions given below.

  • Resource utilization residue (\(\mathcal {C}_{(U)}\)). The study in [27] discussed the need to keep CPU utilization constantly between 60% and 80% in order to reach the best balance between the performance of each host and its energy consumption (the exact value depends on the CPU architecture). We use a residue function to penalize any deviation from the ideal CPU utilization band. The proposed cost function penalizes deviations above the upper bound more sharply than deviations below the lower threshold; employing such a cost function enables us to avoid exploiting the full CPU capacity, known as the “meltdown point” problem, in which an over-utilized CPU becomes a bottleneck of the system (a minimal sketch of this cost function appears after this list).

    $$\begin{aligned} \mathcal {C}_{(U)} = \left\{ \begin{array}{ll} \left| \frac{U - \mathcal {U}^{*,upper}_{CPU}}{1- \mathcal {U}^{*,upper}_{CPU}}\right| ^2 &{} \text{ if } U \ge \mathcal {U}^{*,upper}_{CPU} \\ 0 &{} \text{ if } \mathcal {U}^{*,lower}_{CPU} \le U \le \mathcal {U}^{*,upper}_{CPU} \\ \left| 1-\frac{U}{\mathcal {U}^{*,lower}_{CPU}}\right| ^2 &{} \text{ if } U \le \mathcal {U}^{*,lower}_{CPU} \end{array} \right. , \end{aligned}$$
    (2)

    where U is the measured average CPU utilization of the host over the given interval.

  • Total QoS detriment (\(\sum _{p_k}\mathcal {D}_{p_k}\)). To favor a resource allocation decision that results in fewer QoS violations, we propose a cost function that explicitly evaluates the sum of QoS detriment over all machines (Sect. 3).

  • Total switching cost (\(\sum {\mathcal {SW}}\)). Changing the current configuration is costly. The switching cost evaluates the difference (e.g., the Euclidean norm) between the decision vectors applied at two successive steps, to avoid excessive changes in the configuration states. This makes the controller more conservative about adopting abrupt changes in reconfiguration decisions.
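The following minimal sketch implements the residue function of Eq. 2; the threshold values correspond to the 60%–80% band used in our experiments, and the function name is our own.

```python
def utilization_cost(U, lower=0.60, upper=0.80):
    """Eq. (2): penalize deviation of measured CPU utilization U from
    the ideal band [lower, upper]; deviations above the band grow more
    sharply, steering the controller away from the 'meltdown point'."""
    if U >= upper:
        return abs((U - upper) / (1.0 - upper)) ** 2
    if U <= lower:
        return abs(1.0 - U / lower) ** 2
    return 0.0
```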

The proposed objective function to be minimized is expressed as the sum of the three above-mentioned costs in Eq. 3.

$$\begin{aligned} \min \mathcal {J}_\tau = \sum _{t=\tau +1}^{\tau +f}\sum _{p_k} \left( \gamma _1\mathcal {C}_{(U)} + \gamma _2\mathcal {D}_{p_k,t} + \gamma _3 {\mathcal {SW}_t} \right) , \end{aligned}$$
(3)

where f is the prediction horizon length, and the \(\gamma _i\) coefficients are weights to be set separately for each cost function. We compute the norm of a normalized vector over all terms in Eq. 3, whose components are the measured/estimated values of the corresponding metrics, each divided by its maximum expected value. For simplicity, we use equal weights for the \(\gamma _i\)’s in this paper. While the optimizer module solves the above problem for the next \(f>1\) steps, the controller only applies the solution of the first step as the system’s input vector. The whole cycle of prediction and optimization is then repeated in the next step (closing the feedback loop).
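The receding-horizon mechanics can be summarized in a few lines; `predictor` and `optimizer` here are hypothetical interfaces standing in for the ARIMA module and the solver described in this section, not concrete APIs:

```python
def control_step(state, predictor, optimizer, f):
    """One MPC cycle: optimize the inputs over the next f steps, but
    apply only the first input vector; the remaining steps are
    re-planned at the next epoch once fresh measurements arrive."""
    forecast = predictor.forecast(state, horizon=f)       # e.g., ARIMA outputs
    u_traj = optimizer.solve(state, forecast, horizon=f)  # f input vectors
    return u_traj[0]   # only the first step is applied to the system
```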

To solve the optimization problem, we use a technique based on the particle swarm optimization (PSO) heuristic. PSO is a population-based stochastic optimization technique developed by Eberhart and Kennedy in 1995 as a fast evolutionary computational technique [20] for solving continuous and discrete optimization problems with multiple local extrema. PSO can converge to (near-)optimal results faster and more cheaply than many other optimization methods [24].

We adopt two additional techniques to reduce the potentially large computational overhead due to the exponential size of the feasible state space. First, we allow the optimization module to run only for a fixed fraction (e.g., 1%) of the control step interval. For example, if \(\varDelta T\) is one minute, then the maximum time the solver is allowed to spend finding a solution is limited to 600 ms. The best solution obtained by the PSO solver within such a period is used as the input vector of the controller in the next step. Second, we allow the PSO solver to continue searching for a better solution until the data of the next step arrives. While such a solution cannot be used as the system input at the current step, it is greatly beneficial as the starting point for the next round of the PSO solver.
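A minimal sketch of such a time-budgeted, warm-started PSO loop is shown below; the hyper-parameters and function names are illustrative assumptions, not the exact values of our implementation.

```python
import time
import numpy as np

def pso_minimize(J, dim, bounds, budget_s, warm_start=None,
                 n_particles=30, w=0.7, c1=1.5, c2=1.5):
    """Search for a decision vector minimizing objective J within
    `budget_s` seconds (e.g., 1% of the control interval), optionally
    warm-started from the best solution of the previous control step."""
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))
    if warm_start is not None:
        x[0] = warm_start                  # reuse last round's best solution
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([J(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()   # global best so far

    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:     # stop when the budget expires
        r1, r2 = np.random.rand(2, n_particles, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([J(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g   # best input vector found within the time budget
```

Passing the returned vector back as `warm_start` in the next control step implements the second technique described above.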

5 Experimental Evaluation

In this section, we present our evaluation results primarily in terms of (1) response time (latency), (2) resource utilization, and (3) QoS violations. We also present a sensitivity analysis and study the scalability of our resource controller.

5.1 Experimental Setup

System Environment. We evaluated our approach by conducting an extensive set of experiments on our local cluster, measuring its effectiveness with respect to three parameters: resource utilization, QoS violation incidents, and scalability. We used a local cluster consisting of two machines with a total of 16 cores and 32 GB of main memory. Each machine is equipped with a 3.40 GHz i7 CPU, 16 GB of RAM, and 8 MB of LLC, running Ubuntu 14.04. To imitate a heterogeneous environment, we used Xen hypervisor 4.4.2 to create 8 virtual machines, each with one dedicated core and 2 GB of main memory (one VM shared with Dom-0), and another 4 virtual machines (VMs), each with two dedicated cores and 4 GB of RAM. All Dom-0 and guest VMs run the same Linux kernel, version 4.2.0.

The proposed feedback controller for the above-mentioned platform is implemented in Python 2.7 and runs on a dedicated machine equipped with an Intel i7-4712HQ 2.3 GHz CPU, 16 GB of RAM, and a 512 GB Samsung PM851 SSD. We installed the Dask framework [22] on all guest VMs to implement a Lambda platform as a distributed cluster. Equipped with a versatile library for distributed computing over a cluster of hundreds of machines, Dask runs a set of pre-defined functions in parallel and/or in an out-of-core computational fashion [19]. The Dask model allows us to build a complex network of actions, possibly depending on each other, to be run once the associated event occurs. Its dynamic asynchronous primitives provide a very low-latency mechanism among worker threads. Due to its asynchronous nature, the task scheduler of the Dask framework can flexibly handle a variety of functions simultaneously [22].

Workload Attributes. We created a synthetic event/action data-set by analyzing a subset of real Twitter data gathered by [2]. For the scope of this paper, our analysis relies on a synthetic workload that runs in our own test-bed; we leave exploring such an analysis on real industrial deployments as a subject for future investigation.

We created \(|\mathcal {A}|\in \{10, 20, 30, 40, 50, 60\}\) functions, each running either a web-service script, representing latency-sensitive workloads, or a data-analytic script, representing data-intensive workloads. Both workloads are taken from the CloudSuite benchmark [7]. Each action \(a_i\in \mathcal {A}\) is associated with exactly one event source \(e_{i}\). The rate of event generation of each event source is drawn from a Poisson distribution with rate parameter \(\theta \in \{1, 3, 6\}\), where \(\theta \) indicates the average number of events generated per millisecond. The execution time required to process each event ranges from 40 ms up to 21 s, with an average of 1078 ms. The number of generated events per action in each scenario varies from 5000 to 10000 depending on the scenario parameters, with an average of 7000 events per action. We allow each scenario to run for a period of one hour. There are two QoS enforcement classes in our setting, i.e., \(Q=2\), with associated upper bounds \(\mathscr {V}_{q=1..2}\in \{0.99, 0.90\}\); each stream is assigned to one of the QoS classes randomly. We choose the sampling interval epoch to be one second and the maximum number of CPU cores used in each scenario to be \(|\mathcal {M}|=16\).
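A minimal sketch of how such a Poisson event trace can be generated for one event source follows; the function name and seed are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed for reproducibility

def generate_event_counts(theta, n_slots):
    """Per-slot event counts for one event source, drawn from a
    Poisson distribution with rate `theta` (the average number of
    events per time slot in our scenarios)."""
    return rng.poisson(lam=theta, size=n_slots)

counts = generate_event_counts(theta=3, n_slots=10_000)  # a short window
```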

Compared Heuristics. The proposed solution is compared against two other heuristics, namely round-robin and best-effort. The former uses a round-robin policy to balance the associated events among the worker threads, with the main aim of distributing incoming events evenly across the Lambda functions. This is the policy implemented in major Lambda engines, including IBM OpenWhisk. In our implementation, we fixed the number of threads associated with each action based on the QoS class it belongs to (i.e., 9 and 7 threads for the two QoS classes, respectively).

The best-effort approach uses the first-fit decreasing (FFD) algorithm to determine the appropriate number of worker threads per Lambda function in order to achieve a compromise between resource usage and QoS violation incidents. Best-effort adds an additional worker thread only if the amount of QoS violation experienced by the corresponding function exceeds a certain threshold (i.e., 2 min in our experiments). Further, if a physical host becomes fully utilized, best-effort looks for the next machine on which to execute a thread.
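For concreteness, the core of such an FFD placement can be sketched as follows; this is a simplification under our own assumptions, and the actual baseline additionally tracks the per-function violation threshold described above.

```python
def ffd_place(thread_demands, host_capacities):
    """First-fit decreasing: sort thread demands in decreasing order
    and place each on the first host with enough spare capacity; a
    fully utilized host is skipped in favor of the next one."""
    order = sorted(range(len(thread_demands)),
                   key=lambda i: thread_demands[i], reverse=True)
    spare = list(host_capacities)
    placement = {}
    for i in order:
        for k, cap in enumerate(spare):
            if thread_demands[i] <= cap:
                spare[k] -= thread_demands[i]
                placement[i] = k
                break
        else:
            placement[i] = None   # no host can fit this thread
    return placement
```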

5.2 Results

All reported results reflect the behavior of the system once its performance has stabilized after a short transient state. During such a transient period, the latency of serving events might be noticeably higher than its steady-state average. We leave the study of the transient-period behavior of a Lambda platform as future work.

Fig. 1. Average latency achieved by the proposed algorithm against round-robin and best-effort as the number of actions varies from 10 to 60. Scenarios are distinguished by different values of \(\theta \in \{3, 6\}\) and the number of cores \(|\mathcal {M}|\in \{8, 16\}\).

Response Time. Figure 1 demonstrates the average response time (latency) achieved by our approach compared with the other two heuristics in four different scenarios. The x-axis in all figures represents the number of Lambda functions, which increases gradually from 10 to 60. The scenarios differ from one another with respect to either the event generation rate, \(\theta \), or the maximum number of cores that can be employed, denoted by \(|\mathcal {M}|\). The results achieved by the proposed approach when \(\theta =1\) are similar to the ones shown here, so we do not repeat them.

The trend confirms that the response time monotonically increases when the number of actions increases (from 10 to 60) or when the rate of event generation increases (from 1 to 6), irrespective of the resource allocation strategy. Further, no anomalies can be seen in any scenario. This result is expected because the workload of each worker thread monotonically increases in both cases. However, both the round-robin and best-effort schemes are less effective than the proposed algorithm, mainly because ours can dynamically adapt to spikes in the event generation rates by assigning more computing resources to those actions that cannot obtain enough resources to process their corresponding events (as reflected via the first term of the objective function). In particular, the improvement in average processing time per Lambda function achieved by the proposed controller is more significant when \(\theta =6\) (high incoming traffic rate) and fewer resources are available. Overall, the proposed controller improves the average processing time by 19.9% on average compared with the best outcome of the other two heuristics.

Fig. 2. Steady-state core utilization for the active CPUs appointed by the different resource allocation policies to run Lambda functions. Two scenarios are selected based on different values of \(\theta \), the event generation rate, and the total number of available cores, \(|\mathcal {M}|\), to be employed by each policy.

Fig. 3. Normalized percentage of QoS violation incidents achieved by each resource allocation heuristic as the number of Lambda functions varies from 10 to 60, for the two extreme scenarios with \(|\mathcal {M}|=8\). The improvement of the proposed solution is 301% on average (max 358%) compared to the round-robin policy.

Resource Utilization. Figure 2 depicts a summary of the average core utilization across all machines achieved by the three resource allocation strategies under two synthetic scenarios distinguished by different values of \(\theta \), the event generation rate, and the total number of cores accessible to each policy, \(|\mathcal {M}|\). A significant achievement of the proposed controller is its ability to keep the utilization of all employed CPU cores around the ideal utilization level in most scenarios (set to the band [60%, 80%] throughout the experiments). This can be leveraged by putting the remaining idle cores into deep sleep mode to save energy.

In contrast, both the round-robin and best-effort policies are oblivious to such an ideal level. By blindly employing almost all available cores, these policies keep the CPU utilization of some cores higher than the ideal level, i.e., more than 80%, while letting the rest of the cores run at a level much below the ideal value. It is worth noting that because core utilization has a direct impact on the total energy consumption of each host, it is desirable to force each core to work either at 0% or close to the ideal level. Altogether, the results obtained from all experimental scenarios (including those not depicted here) reveal that the proposed controller improves the utilization of working CPUs by 21% on average compared with the best outcome of the other heuristics, which is achieved by the best-effort policy.

QoS Violation. Figure 3 depicts the percentage of QoS violation incidents, according to the definition of the QoS detriment metric in Sect. 3. The results compare the amount of QoS violation achieved by the proposed controller against those achieved by the other two strategies. We only depict the results for scenarios in which the event generation rate is deliberately high, i.e., \(\theta =6\), while the number of CPU cores available in each scenario is low (\(|\mathcal {M}|=8\)).

As the rate of incoming events and the processing time requested by each corresponding action are substantially high, it is difficult for a QoS-oblivious scheduler to assign enough resources to the most important actions to avoid QoS violations for those actions. The experimental results confirm that the proposed QoS-aware controller can effectively reduce QoS violation incidents by a factor of 3.0 on average compared to the round-robin strategy, which uses all available cores in an almost balanced manner and shows the best result among the baselines with regard to this factor.

Sensitivity Analysis. When one tries to build a model of a complex system (such as a Lambda platform), it is almost impossible to prevent errors in the prediction phase. A promising controller must be tolerant of the negative consequences of such errors in the decision-making phase. To reduce the risk of such errors, we incorporate two methods in the proposed controller:

  • using \(\epsilon _t\) in the prediction model to explicitly account for randomness;

  • choosing the value of the response speed rate, \(\mathcal {T}_{ref}\), strictly greater than one (in our case 3). Such a selection allows the system to adapt gradually to input changes over more than one step (see Sect. 4).

To perform the sensitivity analysis, we first start with a prediction model with zero error. Progressively, we inject errors ranging from 10% to 90% into the prediction of the input variables, and then measure the influence of such errors on the system outputs. We define a parameter called the sensitivity coefficient, denoted by \(\kappa \), for each performance metric Z, as follows.

$$\begin{aligned} \kappa _{\epsilon ,Z} = \frac{\left\| Z(x) - Z(x\pm \epsilon )\right\| }{\left\| Z(x)\right\| } \end{aligned}$$
(4)

\(\kappa \) reflects how sensitive the target output is when the input parameter x is estimated with an error of \(\epsilon \).
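A minimal sketch of how Eq. 4 can be evaluated for a given metric is shown below; the function names are illustrative, and the error is injected additively here, whereas in the experiments it is a relative error on the predicted inputs.

```python
import numpy as np

def sensitivity_coefficient(Z, x, eps):
    """Eq. (4): relative change in performance metric Z when the
    input x is estimated with error eps."""
    base = np.asarray(Z(x), dtype=float)
    pert = np.asarray(Z(x + eps), dtype=float)   # x estimated with error eps
    return np.linalg.norm(base - pert) / np.linalg.norm(base)
```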

Fig. 4. The sensitivity coefficient curve for the two parameters of average latency (left) and average CPU utilization (right) as the prediction error varies from 10% to 90% (x-axis).

Figure 4 shows a summary of the average sensitivity coefficients for both response time and CPU utilization with respect to errors in the prediction model. The trend confirms that even an error of 90% puts only a little stress on the target performance metrics (below 34% in the worst-case scenario).

Scalability. As we force the optimizer module to return the best achievable solution found within 1% of the control interval, the computational time of the proposed controller is bounded by a fixed amount (e.g., 600 ms in our experiments). We performed a set of experiments (increasing the number of active cores and Lambda actions) to examine the scalability of the proposed controller: we allowed the optimizer module to find an approximate solution within 10% of the optimal one and collected its running time. Table 1 presents the computational time the optimizer module needs to find such a solution. The results confirm that the technique can find a reasonably effective solution in less than 2.15 s when the number of machines and Lambda actions increases to 100 and 800, respectively.

Table 1. Average running time of the optimizer module to find a 1.1-approximation solution as the number of cores and FaaS actions varies.

6 Conclusion

Understanding the run-time behavior of FaaS/Lambda functions is of great practical importance for designing efficient resource allocation strategies for a FaaS/Lambda platform. We have presented a solution based on the well-known model predictive control (MPC) framework for achieving dynamic QoS-aware resource allocation in such a platform. Our solution makes appropriate resource allocation decisions by predicting the future rate of events coming into the system and by considering the QoS enforcement requested by each function. The proposed controller achieves an average improvement of 21% in resource utilization and a 3-times reduction in QoS violation incidents compared with the best result achieved by the round-robin or best-effort strategies, while keeping the mean latency of actions 19.9% lower than the result achieved by the best-effort strategy.

As reported by several past research projects (such as [27, 29, 32]), collocated applications can compete fiercely with each other for shared resources (e.g., CPU cache, memory bandwidth). Such contention not only causes overall performance degradation, but can also increase the power consumption of the whole system. We leave an investigation of the effect of the proposed method on the energy consumption of an in-house Lambda platform as future work.