1 Introduction

In the Internet era, end-users’ personal data gain more and more value and attention, as they are considered the basic building block for the provision of digital services. Service providers rely upon them to offer personalized services to their customers, while adversaries try to get access to these data for fun and profit. These data, structured or unstructured, are in most cases stored in databases that provide the appropriate Application Programming Interfaces (APIs) to the applications and services that handle them. Specifically, these interactions are accomplished through the Structured Query Language (SQL) [6], which offers transparent access to the data requested by different applications and sources. Beneficial though SQL may be for data management, adversaries exploit its structure to gain access to otherwise private data. To do so, they inject (legitimate) SQL commands into the input data of a given application in order to modify the initial SQL command and gain access to, or modify, private information. This type of flaw is known as an SQL Injection Attack (SQLIA) [9], in which adversaries exploit the fact that the database makes no distinction between end-users’ actual data and SQL commands. Note that SQLIA has remained among the top twenty most known vulnerabilities for more than a decade, even though various countermeasures, such as [3, 11], have been proposed in the literature. This trend endures because most existing protection approaches either focus on specific applications or cannot be applied transparently to existing databases.

Defensive programming could be considered an alternative solution for enhancing database security; however, it presupposes that the developed application is “error” free, e.g., by introducing sanitization techniques on end-users’ inputs, which is not the case under all circumstances. Furthermore, developers do not take into consideration the details of securing the developed application [4, 19]. On the other hand, SQL injection vulnerabilities can be easily identified and exploited using open source tools, e.g., Grabber (Footnote 1). Thus, we believe that other solutions, orthogonal to existing ones, are required to enhance database security.

In this paper, we propose a practical framework that enables the transparent enforcement of randomization on any given database to enhance protection against SQLIA. Our main goal is a solution that requires as little intervention as possible, while building on the advantages of well-known practical security mechanisms. This way, we enhance security against SQLIA for otherwise “unprotected” databases, minimizing their attack surface. To the best of our knowledge, this is the first work of its kind. In this direction, we introduce a methodology that builds on the benefits of static and dynamic analysis to randomize the SQL statements [3] of any given database-related application, and that enforces a runtime environment for SQL de-randomization without requiring modification of either the database source code or the middleware interfaces. We evaluate our framework in terms of introduced overhead using the well-known MySQL database under different configurations. The outcomes indicate that employing the proposed framework is feasible.

The rest of the paper is structured as follows. In Sect. 2 we describe in detail our proposed framework for applying the randomization technique transparently to any given database application. In Sect. 3 we evaluate our framework in terms of the overhead it introduces. In Sect. 4 we discuss related work and compare it with our approach. Finally, in Sect. 5 we conclude the paper and give some pointers for future work.

2 Proposed Framework

The core protection mechanism of our framework builds on randomization [1, 10], because randomization is considered among the most effective solutions for protecting services against injection attacks. It should be noted that access to the Server Side Script (SSS) is mandatory in order to employ such a technique, as the SSS is the entry point for accessing the database and it is through the SSS that attack vectors are created. The same holds for binary randomization [16]: if there is no access to the binary itself, the randomization countermeasure cannot be employed.

Briefly, the proposed framework is composed of three main components: (a) SQL statement identification, (b) SQL randomization, and (c) run-time enforcement. Assuming the availability of the SSS, we parse it through a meta-compiler to identify all the SQL statements included in it. Afterwards, the identified queries are randomized through a function f and the SSS is updated correspondingly. As the SSS generates randomized (“SQL”) statements according to users’ inputs and forwards them to the database, the latter is not in a position to “understand” these statements. Consequently, the database should incorporate the de-randomization function for transforming each incoming randomized statement into a “normal” SQL statement; otherwise the statement cannot be executed successfully. The run-time enforcement realizes this functionality in a transparent way: no access to the database source code is required, and the enforcement is agnostic to the underlying database.

2.1 SQL Statements Identification

To automatically identify SQL statements in any given application, we assume that the SSS is available for analysis, as mentioned previously. To this end, we built a meta-compiler based on the well-known tools lex and yacc [14]. At this point one might argue that SQL statements could be identified by simply searching inside the SSS for the corresponding SQL keywords. Indeed, this could be the case for “explicit” SQL statement definitions, i.e., statements built in a “single” line. However, such a search cannot identify variables that include part of an SQL statement and influence the final statement, e.g., when part of the statement is built inside a conditional statement. This is because “searching” tools have no capability of identifying data flows between variables. For instance, consider the code example illustrated in Listing 1.1, in which a variable x is concatenated to sql if “userinput” equals 1. This means that the content of variable x should be protected, since it is part of the SQL statement; otherwise the database remains vulnerable to SQLIA.

[Listing 1.1: code example in which variable x is concatenated to sql if “userinput” equals 1]

Thus, our proposed SQL identification solution defines the grammar of a sample SSS programming language, e.g., a PHP-like one (Footnote 2). We rely on this grammar to build a parser able to identify all the parts of an SQL statement that are included in a given SSS, i.e., variables, function parameters, etc., and that should be protected. This means that variables are examined to determine whether or not they include an SQL statement, in which case they are “tainted”, and are monitored to determine whether other variables influence the initial statement. If this is the case, the influencing variable is also tainted.

The start rule is the program, which consists of a series of statements. We have defined various types of statements, e.g., VARIABLE POINTER STRING ‘(’ expr_list ‘)’ and CMD expr_list; however, in this version of the grammar we consider only a limited number of the statements that can be included in a PHP server side script. We plan to extend the grammar to include all the available statements in future work.

First, the SSS is split into tokens through lex; the tokens are passed to yacc, which computes the program’s statements and expressions included in the server side script and evaluates each of them as to whether or not it includes SQL code. All variables, function parameters, etc. are evaluated in this way. If a variable includes an SQL statement, it is marked and “monitored” to determine whether it influences other variables as well, which are consequently marked too. The analysis tool reports the parts of the code that include SQL statements and should be randomized. This evaluation is accomplished by simple keyword identification on the values of the variables.
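As a rough illustration of the marking step, the following Python sketch propagates a “taint” over a simplified list of assignments. It is not the lex/yacc meta-compiler described above: the assignment representation, the keyword list, and all function and variable names are assumptions made for this example.

```python
import re

# Assumed keyword subset; the text does not enumerate the keywords searched for.
SQL_KEYWORDS = ("select", "insert", "update", "delete",
                "from", "where", "and", "or", "union")
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(SQL_KEYWORDS) + r")\b", re.IGNORECASE)


def contains_sql(literal):
    """Return True if a string literal contains one of the assumed SQL keywords."""
    return KEYWORD_RE.search(literal) is not None


def tainted_variables(assignments):
    """Return the set of variables that hold, or are influenced by, SQL text.

    Each assignment is (target, operands), where an operand is either
    ("lit", text) for a string literal or ("var", name) for a variable.
    Control flow is ignored: an assignment inside a branch is treated as if
    it were always executed, which over-approximates the data flow.
    """
    tainted = set()
    changed = True
    while changed:  # repeat until no new variable gets marked
        changed = False
        for target, operands in assignments:
            if target in tainted:
                continue
            for kind, value in operands:
                is_sql_literal = kind == "lit" and contains_sql(value)
                is_tainted_var = kind == "var" and value in tainted
                if is_sql_literal or is_tainted_var:
                    tainted.add(target)
                    changed = True
                    break
    return tainted


# Hypothetical script in the spirit of Listing 1.1: x is appended to sql
# only when the user input equals 1.
script = [
    ("sql", [("lit", "SELECT * FROM users WHERE id = 7")]),
    ("x", [("lit", " AND active = 1")]),
    ("sql", [("var", "sql"), ("var", "x")]),  # inside: if userinput == 1
    ("query", [("var", "sql")]),              # marked only through data flow
    ("title", [("lit", "Welcome page")]),
]

print(sorted(tainted_variables(script)))
```

In this example sql, x, and query are reported, while title is not. The variable query contains no keyword of its own and is marked only because a marked variable flows into it, which is the case a plain keyword search would miss.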

2.2 SQL Statements Randomization

The instruction set randomization technique was initially proposed in [1, 10] for protecting binaries against code injection attacks. In this approach the code is transformed through a transformation function F into new executable code. This transformation is known only to the system in which the code will be executed, so that it can be translated back to the original code. The approach assumes that the adversary is not aware of the transformation; as a result, any code injected into the application yields unknown commands, causing the application to reject it. Conversely, an attacker who knows the transformation can inject and execute malicious code without being identified.

In this direction, Boyd and Keromytis [3] apply the concept of randomization to protect databases against SQL injection attacks. Specifically, in their initial design they transform the SQL keywords into new keywords by appending a “random” integer to them. As this transformation is known only to the database, the adversary’s injected code is transformed into unknown statements and consequently its execution is rejected. To enforce a runtime environment that implements the proposed framework (randomization) in a transparent way, meaning that no access to the underlying database source code is required, we rely on the advantages of adaptive defenses [7]. This is because adaptive defenses enable “virtual” partitioning of software based on its innate properties.
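The keyword-appending idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword subset, the function names, and the rejection rule (any keyword without the expected suffix raises an error) are choices made for this example and are not taken from [3] or from our implementation.

```python
import re
import secrets

# Assumed keyword subset for illustration only.
KEYWORDS = ("SELECT", "FROM", "WHERE", "AND", "OR", "UNION")
ALTERNATION = "|".join(KEYWORDS)


def new_key():
    """Draw a random integer, returned as the string to append to keywords."""
    return str(secrets.randbelow(10**9))


def randomize(template, key):
    """Append the key to every keyword of a statement template."""
    pattern = re.compile(r"\b(" + ALTERNATION + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: m.group(1) + key, template)


def derandomize(statement, key):
    """Strip the key from keywords; reject keywords that lack the right key."""
    pattern = re.compile(r"\b(" + ALTERNATION + r")(\d*)\b", re.IGNORECASE)

    def restore(match):
        if match.group(2) != key:
            raise ValueError("unrandomized keyword: " + match.group(0))
        return match.group(1)

    return pattern.sub(restore, statement)


key = "4711"  # fixed here for reproducibility; new_key() would be used in practice
template = randomize("SELECT * FROM users WHERE name = '", key)

benign = template + "alice" + "'"
print(derandomize(benign, key))

attack = template + "x' OR '1'='1" + "'"
try:
    derandomize(attack, key)
except ValueError as error:
    print("rejected:", error)
```

The injected OR carries no suffix, so the statement is rejected before it reaches the SQL parser. A limitation of this simplified matcher is that it also rejects legitimate user data that happens to contain a keyword as a separate word.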

To do so, we build a database execution monitor using Intel’s Pin [13] dynamic binary instrumentation framework. We rely on Pin because it can run executables as is, while enabling developers to instrument any executable and to develop tools that supervise various aspects of its execution at different levels of granularity. In our case, we implemented a Pin tool that injects small pieces of monitoring code before every function entry, recording the function as well as its parameters. To identify the appropriate hook point we execute software supporting (1) a database SQL connection and (2) an SQL statement execution. Using our monitor we record all the function calls for both cases. Afterwards, we analyze the collected “traces” to determine execution differentiation points that constitute potential SQL hooks. The analysis for discovering the differentiation points is based on formula (1). This is a heuristic based on the observation that the software supporting only the database SQL connection feature generates a subset of the function calls generated by the software supporting the execution of an SQL statement. The outcome of this analysis provides a set of possible hooks for employing the de-randomization function.
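One plausible reading of this heuristic is a set difference between the two recorded traces. The Python sketch below follows that reading; it is an assumption about the shape of the analysis rather than a transcription of formula (1), and the trace contents and function names are invented for illustration.

```python
def candidate_hooks(connect_trace, execute_trace):
    """Return the functions seen only in the statement-execution trace.

    Candidates are listed once each, in the order of their first call, so
    that earlier functions (closer to the arrival of the statement) come first.
    """
    connect_calls = set(connect_trace)
    candidates = []
    for function in execute_trace:
        if function not in connect_calls and function not in candidates:
            candidates.append(function)
    return candidates


# Hypothetical traces; these names do not come from any real database.
connect_trace = ["net_init", "accept_client", "check_auth"]
execute_trace = [
    "net_init", "accept_client", "check_auth",
    "read_packet", "parse_statement", "run_statement", "parse_statement",
]

print(candidate_hooks(connect_trace, execute_trace))
```

Keeping the call order matters for the next step, in which the first candidate that carries a complete SQL statement is selected.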

Since this analysis might produce more than one hook, as our results show, a further process is required to determine the appropriate point. This process analyzes the parameters and return values of the possible SQL hook points; note that this information is also recorded by our monitor tool. We choose as the appropriate hook point the first function that carries a complete SQL statement, either as a return value or as a parameter of a function call. If the complete SQL statement does not exist in such locations, because the database uses a global variable to store the statement, then the current approach fails to determine such a point; however, we did not encounter such a case in our results. We developed and tested this procedure on x86 Linux over two well-known databases, i.e., MySQL and PostgreSQL. We consider these databases as they are among the most widely employed ones. To do so, we implemented the corresponding clients, incorporating the SQL connect and statement execution, using the C programming language APIs of both databases. The analysis identifies the following functions as possible hooks: (a) _Z16dispatch_command19enum for MySQL and (b) pg_parse_query(char *) for PostgreSQL.

We validate the outcomes of this analysis by employing the complete scheme of the protection framework, so that incoming randomized statements can be executed; note that randomized statements sent without the de-randomization in place cannot be executed by the database. We also check the control flow graph (CFG) to show that the identified calls are executed before the statement execution. Another way to validate the outcomes of the analysis is through code inspection. Note that these are possible points, so other functions could be suitable as well, meaning that every function invoked before the statement execution is a suitable candidate hook point. We believe that the identification procedure cannot practically be carried out manually by inspecting the source code, as we have no hint of where the SQL statements are executed and the code bases of databases comprise thousands of lines.

Once the appropriate SQL hook point is identified, the de-randomization should take place on the database side. The enforcement could be implemented in different ways, i.e., by modifying the database source code directly, by function interposition, or by database instrumentation. However, to do it transparently we developed another Pin-based tool that instruments the identified SQL hook point before its execution and modifies the corresponding parameters. We report in detail on database performance under different configurations in Sect. 3.

3 Evaluation

We implement the proposed approach on the MySQL ver. 5.5.44 database, as it is one of the most widely employed open source databases, and evaluate it using the MySQL select-benchmark. As the randomization function we relied on an XOR transformation with a key length of eight bytes. The database server runs on a single host featuring an Intel i5 processor with 4 GB of RAM running Ubuntu (14.04.3 LTS). All the experiments were repeated 40 times, with the client and the server executing on different machines. We use various configurations to demonstrate the performance implications under different employments. We run the benchmark on the MySQL database natively, both without and with the randomization protection enabled, as well as over null and randomization-enabled Pin tools, and using Pin probe mode for function interposition.
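For concreteness, an XOR transformation with a repeating eight-byte key can be sketched as follows. The sketch assumes the key is applied to the whole statement byte by byte, which the text does not specify; the function name and the example key are hypothetical.

```python
from itertools import cycle


def xor_transform(data, key):
    """XOR every byte of data with the repeating key.

    The same call both randomizes and de-randomizes, because applying
    XOR with the same key twice restores the original bytes.
    """
    if len(key) != 8:
        raise ValueError("this sketch expects an eight-byte key")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


key = bytes.fromhex("0f1e2d3c4b5a6978")  # arbitrary example key
statement = b"SELECT name FROM users WHERE id = 7"

randomized = xor_transform(statement, key)      # what the SSS would send
restored = xor_transform(randomized, key)       # what the hook would recover
print(restored.decode("ascii"))
```

Because the transformation is its own inverse, the randomization in the SSS and the de-randomization at the hook point can share one routine, which keeps the cost of the protection itself small.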

Figure 1 shows the time required to complete the MySQL select-benchmark under these configurations. Note that the native and Pin null tool configurations are used as points of reference to indicate the overhead introduced by the protection-enabled scenarios. The native execution of the MySQL select-benchmark without database protection requires on average 96 s to accomplish its tasks, while the native enforcement of the runtime environment requires 112 s, corresponding to an overhead of 16 %. When the runtime environment is implemented as a Pin tool the overhead increases significantly, to almost 330 %, which highly affects the performance of the database. However, using the Pin probe mode for function interposition scales down the overhead up to 2.2x, whereas in comparison with the native runtime enforcement the overhead is as little as 20 %.

Fig. 1. Total execution time for the MySQL select-benchmark under different implementations of our proposed framework. Native and Pin null tool are used as references to demonstrate the overhead introduced by protection enforcement.
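The overhead percentages in this section are relative slowdowns with respect to a baseline. A minimal sketch of that computation for the native case, using only the two averages reported above (the function name is hypothetical):

```python
def overhead_percent(baseline_seconds, measured_seconds):
    """Relative slowdown of a measured run over a baseline run, in percent."""
    return (measured_seconds - baseline_seconds) / baseline_seconds * 100.0


# 96 s without protection versus 112 s with native run-time enforcement.
print(round(overhead_percent(96, 112), 1))  # 16.7
```

The result, about 16.7 %, matches the roughly 16 % stated in the text.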

4 Related Work

The very first approaches for protecting services against SQLIA focused on the sanitization of end-users’ input. They were implemented as built-in functionality in web-based frameworks such as PHP, ASP.Net, etc., as well as in intrusion detection systems, e.g., SNORT (Footnote 3), in which the incoming traffic is inspected through pre-defined signatures. Effective though these approaches might be, adversaries can bypass them, and they rely on developers’ and administrators’ capabilities to develop the appropriate controls, which cannot always be assumed. Thus, various other approaches combining static and dynamic analysis have been proposed in the literature in order to enhance database security.

Halfond et al. [8] introduce a model-based approach to detect malicious SQL statements generated by users’ inputs. Briefly, this model consists of two parts: (a) the static part, which builds a model of the legitimate statements that could be generated by an application, and (b) the dynamic part, which inspects the generated statements at runtime and compares them with the statically built model. In the same direction, Bisht et al. [2] propose an approach based on symbolic execution, instead of static analysis, for constructing applications’ legitimate SQL statements.

Su et al. [17] introduce a solution named SqlCheck in which the syntactic structure of the original SQL statements is compared with that of the statements generated by end-users’ inputs in order to detect SQLIA. Complementarily, Wei et al. [18] focus on protecting stored procedures against SQLIA. In their approach they rely on static analysis to model SQL statements as a Finite State Automaton (FSA), while they check at runtime whether the generated statements follow the static analysis model.

SQLProb [12] employs a dynamic user input extraction analysis that takes into consideration the context of the query’s syntactic structure to detect SQLIA; however, in contrast to other solutions, it incorporates a black box approach. Mitropoulos et al. [15] propose a novel methodology for preventing SQLIA by introducing a middleware interface between the application and the underlying database. In an alternative approach, Felt et al. [5], working towards the least privilege principle, introduce the notion of data separation in database-related applications, where each application develops a policy describing its access rights in the database. This way, different applications’ data are isolated from each other. This policy is enforced through a proxy server.

The work of Boyd and Keromytis [3] is considered the very first employment of randomization for database protection against SQLIA, enabling the prevention of such attacks. As mentioned previously, they suggest randomizing applications’ SQL statements by appending a random integer to the SQL keywords, while the enforcement requires a proxy server in order to forward de-randomized SQL statements to the database.

5 Conclusions and Future Work

SQLIA is still an open security problem that requires further attention, not only for enhancing database security but also for strengthening end-users’ trust in the provided Internet-based services. In this paper we introduce a framework that enables the application of well-established protection solutions, such as randomization, to any given database-related application in a completely transparent way and with as little intervention as possible. We develop an automatic methodology for SQL statement randomization, and we demonstrate different employments for enforcing the protection mechanism.

Our findings show that the runtime enforcement is feasible to implement, without modifying the database source code, through function interposition. We believe that the proposed framework is not only orthogonal to existing defenses, but also enables database administrators and server side script developers to minimize the database attack surface against SQLIA. Currently, we are looking to extend the SQL statement identification approach to other server side programming languages and to enable it on binary applications. In addition, we intend to carry out a more thorough evaluation using various well-known reversible transformations, e.g., AES.