figure a

1 Introduction

A growing number of applications of static analysis techniques have benefited from automated reasoning tools for string constraints. The effect of such tools on symbolic execution in particular has been transformative. At a high level, symbolic execution runs a program under analysis by representing its input values symbolically and tracking the program variables as expressions over these symbolic values, together with other concrete values in the program. The collected expressions are then analyzed by an automated reasoning tool to determine path feasibility at branches or other properties such as security vulnerabilities at points of interest. With the ever-increasing hardware and automated reasoning capabilities, symbolic execution has enjoyed much success in practice. A recent example in the cybersecurity realm was the DARPA Cyber Grand Challenge,Footnote 1 which featured a number of Cyber Reasoning Systems that heavily relied on symbolic execution techniques (see, e.g., [7, 24]).

Prior to the availability of string-capable reasoning tools, developers of symbolic execution engines had to adopt various ad-hoc heuristics to cope with strings and other variable-length inputs. One popular heuristic is to impose an artificial upper-bound on the length of these inputs. Unfortunately, this not only compromises analysis accuracy, since the chosen upper bounds may be too low in face of adversarial inputs, but it also leads to inefficiencies in solvers when practitioners, in an attempt to mitigate this problem, may end up setting the upper bounds too high. To address this issue, a number of SMT solvers have been extended recently with native support for unbounded strings and length constraints [2, 19, 26, 30]. These solvers have dramatically improved both in performance and robustness over the past few years, enabling a new generation of symbolic execution tools that support faithful string reasoning.

This paper revisits approaches for solving extended string constraints, which allow a rich language of string terms over operators such as \(\mathsf {contains}, \mathsf {index\_of}\) and \(\mathsf {replace}\). Earlier techniques for extended string constraints [5, 18, 27, 30] often rely on eager reductions to a core language of constraints, with the effect of requiring the solver to deal with fairly large or complex combination of basic constraints. For instance, encoding the constraint \(\lnot \mathsf {contains}(x, y)\) commonly involves bounded universal quantification over the integers, to state that string y does not occur at any position in x. In this work, we start with the observation that DPLL(T)-based SMT solvers [12] often reason in contexts where string variables are equated to (partially) concrete values. Based on this, we have developed a way to leverage efficient context-dependent simplification techniques and reduce extended constraints to a core language lazily instead.

Contribution and Significance. We extend a calculus by Liang et al. [19] to handle extended string constraints (i.e. constraints over \(\mathsf {substr}, \mathsf {contains}, \mathsf {index\_of}\) and \(\mathsf {replace}\)) using a combination of two techniques:

  • a reduction of extended string constraints to basic ones involving bounded quantification (Sect. 3.1), and

  • an inference technique based on context-dependent simplification (Sect. 3.2) which supplements this reduction and in practice significantly improves the scalability of our approach on constraints coming from symbolic execution.

Additionally, we provide a new set of 25,421 publicly-available benchmarks over extended string constraints. These benchmarks were generated by running PyEx, a new symbolic executor for Python programs based on PyExZ3 [3], over a test suite of 19 target functions sampled from four popular Python packages. We discuss an experimental evaluation showing that our implementation in the SMT solver cvc4 significantly outperforms other state-of-the-art string solvers in finding models for these benchmarks.

Finally, we discuss how the superior performance of cvc4 in comparison to other solvers translates into real-life benefits for Python developers using PyEx.

Structure of the Paper. After some formal preliminaries, we briefly review a calculus for basic string constraints in Sect. 2 that is an abbreviated version of [19]. We present new techniques for extended string constraints in Sect. 3, and evaluate these techniques on real-world queries generated by PyEx in Sect. 5.

1.1 Related Work

The satisfiability of word equations was proven decidable by Makanin [21] and then given a PSPACE algorithm by Plandowski [22]. The decidability of the fairly restricted language of word equations with length constraints is an open problem [11]. In practice, a number of approaches for solving string and regular expression constraints rely on reductions to automata [10, 15, 28] or bit-vector constraints for fixed-length strings [16]. More recently, new approaches have been developed for the satisfiability problem for (unbounded) word equations and memberships with length [2, 19, 20, 27, 30] within SMT solvers. Among these, z3-str [30] and S3 [27] are third-party extensions of the SMT solver z3 [9] adding support for string constraints via reductions to linear arithmetic and uninterpreted functions. This support includes extended string constraints over a signature similar to the one we consider in this paper. With respect to these solvers, our string solver is fully integrated into the architecture of cvc4, meaning it can be combined with other theories of cvc4, such as algebraic datatypes and arrays.

This paper is similar in scope to work by Bjørner et al. [5], which gives decidability results and an approach for string library functions, including \(\mathsf {contains}\) and \(\mathsf {replace}\). As in that work, we reduce the satisfiability problem for extended string constraints to a core language with bounded quantification. We also incorporate simplification techniques that improve performance by completely or partially avoiding this reduction.

Our target application, symbolic execution, has a rich history, starting from the seminal work of King [17]. Common modern symbolic execution tools include SAGE [13], KLEE [6], S2E [8], and Mayhem [7], which are all designed to analyze low-level binary or source code. In contrast, this paper considers constraints generated from the symbolic executor PyEx, which is designed to analyze Python code and includes support for string variables.

1.2 Formal Preliminaries

We work in the context of many-sorted first-order logic with equality and assume the reader is familiar with the notions signature, term, literal, (quantified) formula, and free variable. We consider many-sorted signatures \(\varSigma \) that contain an (infix) logical symbol \(\approx \) for equality—which has type \(\sigma \times \sigma \) for all sorts \(\sigma \) in \(\varSigma \) and is always interpreted as the identity relation. We also assume signatures \(\varSigma \) contain the Boolean sort \(\mathsf {Bool}\) and Boolean constant symbols \(\top \) and \(\bot \) for true and false. Without loss of generality, we assume \(\approx \) is the only predicate symbol in \(\varSigma \), as all other predicates may be modeled as functions with return sort \(\mathsf {Bool}\). If P is a function with return sort \(\mathsf {Bool}\), we will commonly write \(P(\varvec{t})\) as shorthand for \(P( \varvec{t} ) \approx \top \). If e is a term or a formula, we denote by \(\mathcal {V}( e )\) and \(\mathcal {T}(e)\) the set of free variables and subterms of e respectively, extending these notations to tuples and sets of terms or formulas as expected.

A theory is a pair \(T = (\varSigma , \mathbf {I})\) where \(\varSigma \) is a signature and \(\mathbf {I}\) is a class of \(\varSigma \)-interpretations, the models of T. A \(\varSigma \)-formula \(\varphi \) is satisfiable (resp., unsatisfiable) in T if it is satisfied by some (resp., no) interpretation in \(\mathbf {I}\). A set \(\varGamma \) of \(\varSigma \)-formulas entails in T a \(\varSigma \)-formula \(\varphi \), written \(\varGamma \, \models _{T} \, \varphi \), if every interpretation in \(\mathbf {I}\) that satisfies all formulas in \(\varGamma \) satisfies \(\varphi \) as well. We write \(\varGamma \, \models \, \varphi \) to denote entailment in the (empty) theory of equality. We say \(\varGamma \) propositionally entails \(\varphi \), written \(\varGamma \, \models _p \, \varphi \), if \(\varGamma \) entails \(\varphi \) when considering all atoms in \(\gamma \) and \(\varphi \) as propositional variables. Two \(\varSigma \)-terms or \(\varSigma \)-formulas are equivalent in T if they are satisfied by the same models of T. Two formulas \(\varphi _1\) and \(\varphi _2\) are equisatisfiable in T if \(\exists \varvec{x}_1.\,\varphi _1\) and \(\exists \varvec{x}_2.\,\varphi _2\) are equivalent in T where \(\varvec{x}_i\) collects the free variables of \(\varphi _i\) that do not occur free in \(\varphi _j\) with \(i \ne j\).

Fig. 1.
figure 1

Functions in signature \(\varSigma _\mathsf {ASX}\). \(\mathsf {Str}\) and \(\mathsf {Int}\) denote strings and integers respectively.

We consider an extended theory \(T_\mathsf {ASX}\) of strings and length equations, whose signature \(\varSigma _\mathsf {ASX}\) is given in Fig. 1. We assume a fixed finite alphabet \(\mathcal {A}\) of characters. The signature includes the sorts \(\mathsf {Str}\) and \(\mathsf {Int}\) denoting strings and integers respectively. Figure 1 divides the signature \(\varSigma _\mathsf {ASX}\) into three parts, which we denote by \(\varSigma _\mathsf {A}, \varSigma _\mathsf {S}\) and \(\varSigma _\mathsf {X}\). We will write \(\varSigma _\mathsf {AS}\) to denote \(\varSigma _\mathsf {A}\cup \varSigma _\mathsf {S}\) and so on. The subsignature \(\varSigma _\mathsf {A}\) is provided on the top line of Fig. 1 and includes the usual symbols of linear integer arithmetic, interpreted as expected. We will commonly write \(t_1 \bowtie t_2\), with \(\bowtie \ \in \{>, <, \leqslant \}\), as syntactic sugar for the equivalent inequality between \(t_1\) and \(t_2\) expressed using only \(\geqslant \). The subsignature \(\varSigma _\mathsf {S}\) is provided on the second line and includes: a constant symbol, or string constant, for each word of \(\mathcal {A}^*\) (including \(\mathsf {\epsilon }\) for the empty word), interpreted as that word; a variadic function symbol \(\mathsf {con}: \mathsf {Str}\times \ldots \times \mathsf {Str}\rightarrow \mathsf {Str}\), interpreted as word concatenation; and a function symbol \(\mathsf {len}: \mathsf {Str}\rightarrow \mathsf {Int}\), interpreted as the word length function. The subsignature \(\varSigma _\mathsf {X}\) is provided in the remainder of the figure.

We refer to the function symbols in \(\varSigma _\mathsf {X}\) as extended functions, and terms whose top symbol is in \(\varSigma _\mathsf {X}\) as extended function terms. A position in a string x is a non-negative integer smaller than the length of x that identifies a character in x — with 0 identifying the first character, 1 the second, and so on. For all xyznm, the term \(\mathsf {substr}( x, n, m )\) is interpreted as the maximal substring of x starting at position n with length at most m, or the empty string if n is an invalid position; \(\mathsf {contains}( x, y )\) is interpreted as true if and only if string x contains string y; \(\mathsf {index\_of}( x, y, n )\) is interpreted as the position of the first occurrence of y in x starting at position n, or \(-1\) if y is empty, n is an invalid position, or if no such occurrence exists; \(\mathsf {replace}( x, y, z )\) is interpreted as the result of replacing the first occurrence in x of y by z, or just x if y is empty or x does not contain y.

An atomic term is either a constant or a variable. A flat term is a term of the form \(f( x_1, \ldots , x_n )\), where \(x_1, \ldots , x_n\) are variables. A string term is one that contains function symbols from \(\varSigma _{\mathsf {SX}}\) only. A string term is basic if contains function symbols from \(\varSigma _\mathsf {S}\) only, and extended otherwise. A (basic, extended) string constraint is a (dis)equality between (basic, extended) string terms. An arithmetic constraint is an inequality or (dis)equality between linear combinations of atomic and/or string terms with integer sort. Notice that (dis)equalities between integer variables and constraints such as \(\mathsf {len}\, x \approx \mathsf {len}\, y\) are both string and arithmetic constraints. A \(T_\mathsf {ASX}\)-constraint is either a string or an arithmetic constraint. Without loss of generality, we consider the satisfiability problem for \(\varSigma _\mathsf {ASX}\)-formulas composed of \(T_\mathsf {ASX}\)-constraints only.

If E is a finite set of basic string constraints, the congruence closure of E is the set

$$\begin{aligned} \widehat{E}= & {} \{s \approx t \mid s,t \in \mathcal {T}(E),\, E \, \models _{}\, s \approx t\} \cup \\&\{s \not \approx t \mid s,t \in \mathcal {T}(E),\, s' \not \approx t' \in E \cup L,\, E \, \models _{}\, s \approx s' \wedge t \approx t' \text { for some } s', t'\} \end{aligned}$$

where \(L = \{ l_1 \not \approx l_2 \mid \text {for all distinct } l_1, l_2 \in \mathcal {A}^* \} \cup \{ \top \not \approx \bot \}\). The congruence closure of E induces an equivalence relation over \(\mathcal {T}(E)\) where two terms st are equivalent iff \(s \approx t \in \widehat{E}\). For all \(t \in \mathcal {T}(E)\), we denote its equivalence class in \(\widehat{E}\) by \( [ t ] _E\) or just \( [ t ] \) when E is clear.

Fig. 2.
figure 2

Simplification rules for \(\varSigma _\mathsf {AS}\)-terms.

Given a term t, we write \({t}{\downarrow }\) to denote its simplified form, where \({t}{\downarrow }\) is a term that is equivalent to t in \(T_\mathsf {ASX}\) and is not simplifiable further (i.e., \({({t}{\downarrow })}{\downarrow } = {t}{\downarrow }\)). We do not insist that simplified forms be canonical, that is, equivalent terms need not have the same simplified form. Rules for computing the simplified form of \(\varSigma _\mathsf {AS}\)-terms are given in Fig. 2. Rules for computing the simplified form of other \(\varSigma _\mathsf {ASX}\)-terms are fairly sophisticated and will be described in detail in Sect. 3.2. Given a tuple of basic string terms \(\varvec{t}\), we write \({\varvec{t}}{\downarrow }\) to denote a tuple of atomic string terms corresponding to the arguments of \({\mathsf {con}(\varvec{t})}{\downarrow }\). For example, if \(c_1, c_2 \in \mathcal {A}\), \({( c_1, \mathsf {con}( c_2, x ), y )}{\downarrow } = ( c_1 c_2, x, y )\) and \({ (x, \mathsf {\epsilon }) }{\downarrow }=(x)\). Given an arbitrary \(\varSigma _\mathsf {ASX}\)-formula \(\varphi \), we write \(\lfloor \varphi \rfloor \) to denote an equisatisfiable (purified) formula whose atoms are \(T_\mathsf {ASX}\)-constraints, and where t is in simplified form (i.e., \(t = {t}{\downarrow }\)) for all its subterms t. For example, \(\lfloor \mathsf {substr}( x, n_0, n_1+1 ) \approx y \rfloor \) is \(\mathsf {substr}( x, n_0, n_2 ) \approx y \wedge n_2 \approx n_1+1\) for a fresh integer variable \(n_2\).

2 A Calculus for Basic String Constraints

Liang et al. [19] developed a calculus for the satisfiability of finite conjunctions of constraints in a theory of strings with length and regular expressions. This section presents a modified version of that calculus. We focus on the portion of that calculus that handles string equalities and length constraints, and omit discussion of its other aspects. Furthermore, we extend that calculus with support for constraints involving extended functions. To simplify the description of this extension and make it self-contained, we also extend the calculus to model propositional reasoning as well, making it applicable to \(\varSigma _\mathsf {ASX}\)-formulas instead of just conjunctions of \(T_\mathsf {ASX}\)-constraints.

Definition 1

(Configurations). A configuration is either the distinguished symbol \(\mathsf {unsat} \) or a tuple of the form \(\langle \mathsf {G}, \mathsf {S}, \mathsf {A} \rangle \) where \(\mathsf {G}\) is a set of \(\varSigma _\mathsf {ASX}\)-formulas, \(\mathsf {A}\) is a set of arithmetic constraints, and \(\mathsf {S}\) is a tuple of the form \(( \mathsf {E}, \mathsf {X}, \mathsf {F}, \mathsf {N})\), where:

  • \(\mathsf {E}\) is a set of basic string equalities;

  • \(\mathsf {X}\) is a set of equalities of the form \(x \approx t\), where x is a variable and t is a flat extended function term;

  • \(\mathsf {F}\) is a set of pairs \(s \mapsto \varvec{a}\) where \(s \in \mathcal {T}(\mathsf {E})\) and \(\varvec{a}\) is a tuple of atomic string terms;

  • \(\mathsf {N}\) is a set of pairs \(e \mapsto \varvec{a}\) where e is an equivalence class of \(\widehat{\mathsf {E}}\) and \(\varvec{a}\) is a tuple of atomic string terms.    \(\square \)

A configuration \(\langle \mathsf {G}, \mathsf {S}, \mathsf {A} \rangle \) models the internal state of various modules of a DPLL(T)-based solver. The component \(\mathsf {G}\) collects the formulas being processed by the solver’s propositional satisfiability (SAT) engine; \(\mathsf {S}\) models the state of a theory solver for strings; and \(\mathsf {A}\) collects the constraints given to a solver for linear integer arithmetic. Initial configurations have the form \(\langle \{ \varphi \}, \overline{\varnothing }, \varnothing \rangle \) where \(\varphi \) is a quantifier-free \(T_\mathsf {ASX}\)-formula to be checked for satisfiability and \(\overline{\varnothing }\) abbreviates the tuple \((\varnothing , \varnothing , \varnothing , \varnothing )\).

We describe a calculus for the satisfiability of string constraints by a set of derivation rules that modify configurations. The rules are given in guarded assignment form, where the top of the rule describes the conditions under which the rule can be applied, and the bottom of the rule either is \(\mathsf {unsat} \), or otherwise describes the resulting modifications to the components of our configuration. A rule may have multiple, alternative conclusions separated by \(\parallel \). An application of a rule is redundant if it has a conclusion where each component in the derived configuration is a subset of the corresponding component in the premise configuration. A configuration other than \(\mathsf {unsat}\) is saturated if every possible application of a derivation rule to it is redundant. A derivation tree is a tree where each node is a configuration whose children, if any, are obtained by a non-redundant application of a rule of the calculus. A closed derivation tree (where all terminal nodes are \(\mathsf {unsat} \)) with root node \(\langle \{ \varphi \}, \overline{\varnothing }, \varnothing \rangle \) is a proof that \(\varphi \) is unsatisfiable in \(T_\mathsf {ASX}\). A derivation tree with root node \(\langle \{ \varphi \}, \overline{\varnothing }, \varnothing \rangle \) and a saturated leaf is, under certain assumptions (see Theorem 1) a witness that \(\varphi \) is satisfiable in \(T_\mathsf {ASX}\).

Fig. 3.
figure 3

Rules modeling the interaction between the propositional, string and arithmetic subsolvers.

To discuss the rules, we first introduce the notation for updating the internal state \(S = ( E, X, F, N )\) of our string solver. Let M be a set of \(T_\mathsf {ASX}\)-constraints. By introducing enough fresh variables, we can construct an equisatisfiable set \(M_E \cup M_X\) where \(M_E\) is a set of \(\varSigma _\mathsf {AS}\)-literals and \(M_X\) is a set of equalities of the form \(y \approx t\), with y a variable and t a flat extended function term. For simplicity, we assume here that \(M_E\) does not contain string disequalities.Footnote 2 We define the external update of S with M, written \(S \oplus M\), to be the tuple \(( E \cup M_E, X \cup M_X, \varnothing , \varnothing )\). For \(\varSigma _\mathsf {AS}\)-terms \(t_1, t_2\), we define the internal update of S with \(t_1 \approx t_2\), written \(S \odot t_1 \approx t_2\), to be the tuple \(( E \cup \{ t_1 \approx t_2 \}, X, F \sigma , \varnothing )\), where \(\sigma \) is the substitution \(\{ t_1 \mapsto t_2 \}\) if \(t_1\) is a string variable and is the empty substitution otherwise, and \(F \sigma \) is the result of replacing the right hand side of all pairs \(y \mapsto \varvec{a}\) in F with \({(\varvec{a}\sigma )}{\downarrow }\).

The rules in Fig. 3 model the basic interaction between the various subsolvers in our approach. The rule Prop-Assign considers each propositional satisfying assignment M for a quantifier-free formula \(\varphi \in \mathsf {G}\), i.e., each set M of literals such as every atom of \(\varphi \) occurs either positively or negatively (but not both) in M, all the atoms in M occur in \(\varphi \), and \(M \, \models _p \, \varphi \). For each such M, Prop-Assign has a conclusion where the string constraints in M (denoted as \(M|_S\)) are given to the string subsolver and the arithmetic constraints (denoted as \(M|_A\)) are given to the arithmetic subsolver. The arithmetic and string solvers use rules A-Prop and S-Prop to share equalities between (shared) arithmetic terms, and use A-Conf and S-Conf to report that their respective set of constraints is unsatisfiable. In those rules and in the rest of the paper, \(\models _{\mathsf {LIA}}\) denotes entailment in linear integer arithmetic. The rules L-Eq and L-Geq respectively infer and guess arithmetic constraints involving string length.

Example 1

Consider the formula \(\varphi \) of the form \(\mathsf {len}\, x > \mathsf {len}\, y \wedge y \approx \mathsf {con}( x, \mathsf {a} )\). Starting from configuration \(\langle \{ \varphi \}, \overline{\varnothing }, \varnothing \rangle \), we may apply Prop-Assign to remove \(\varphi \) from \(\mathsf {G}\) and update \(\mathsf {S}\) and \(\mathsf {A}\) based the propositional satisfying assignment for \(\varphi \). We obtain a configuration where \(\mathsf {A}\) is \(\{ \mathsf {len}\, x > \mathsf {len}\, y \}\) and the \(\mathsf {E}\) component of \(\mathsf {S}\) is \(\{ y \approx \mathsf {con}( x, \mathsf {a} ) \}\). Since \({(\mathsf {len}(\mathsf {con}(x,\mathsf {a} )))}{\downarrow } = \mathsf {len}\, x + 1\), we may add \(\mathsf {len}\,y \approx \mathsf {len}\, x + 1\) to \(\mathsf {A}\) by the rule L-Eq. Since \(\mathsf {len}\, x > \mathsf {len}\, y, \mathsf {len}\,y \approx \mathsf {len}\, x + 1 \, \models _{\mathsf {LIA}} \, \bot \), we may apply A-Conf to derive the \(\mathsf {unsat}\) configuration, establishing that \(\varphi \) is unsatisfiable in \(T_\mathsf {ASX}\).    \(\square \)

Fig. 4.
figure 4

String derivation rules. The rules construct flat forms \(\mathsf {F}\) and normal forms \(\mathsf {N}\) for string terms. The letter l denotes a string constant, and z denotes a fresh string variable.

The rules in Fig. 4 are used by the string solver for building the mappings \(\mathsf {F}\) and \(\mathsf {N}\). We defer discussion of the \(\mathsf {X}\) component of configurations until Sect. 3. In the rules, we assume a total ordering \(\prec \) on string terms, whose only restriction is that \(t_1 \prec t_2\) if \(t_1\) is variable and \(t_2\) is not. For a term t, we call \(\mathsf {F}\,t\) the flat form of t, and for an equivalence class e, we call \(\mathsf {N}\,e\) the normal form of e. We construct the mappings \(\mathsf {F}\) and \(\mathsf {N}\) using the rules F-Form1, F-Form2, N-Form1 and N-Form2 in a mutually recursive fashion. The remaining three rules apply to cases where the above rules do not result in complete mappings for \(\mathsf {F}\) and \(\mathsf {N}\). In the case where we compute flat forms for two terms s and t in the same equivalence class that share a common prefix \(\varvec{w}\) followed by two distinct variables u and v, we apply F-Unify to infer \(u \approx v\) in the case that the lengths of u and v are equal, otherwise we apply F-Split to infer that u is a prefix of v or vice versa in the case that the lengths of u and v are disequal. The rule L-Split splits the derivation based on equality between the lengths of string variables x and y, which we use to derive configurations where one of these two rules applies.

Example 2

Consider a configuration where \(\mathsf {E}\) is \(\{ y \approx \mathsf {con}( \mathsf {a} , x ), x \approx \mathsf {con}( u, z ), y \approx \mathsf {con}( \mathsf {a} , v, z ), w \approx \mathsf {a} , \mathsf {len}\,u \approx \mathsf {len}\,v \}\). The equivalence classes of \(\widehat{\mathsf {E}}\) are

$$ \{ y, \mathsf {con}( \mathsf {a} , x ), \mathsf {con}( \mathsf {a} , v, \mathsf {a} ) \} \quad \{ x, \mathsf {con}( u, \mathsf {a} ) \} \quad \{ u \} \quad \{ v \} \quad \{ w, \mathsf {a} \} $$

Using N-Form2, we obtain \(\mathsf {N}\, [ u ] = (u)\) and \(\mathsf {N}\, [ v ] = ( v )\). Using F-Form1 and N-Form1, we obtain \(\mathsf {F}\, \mathsf {con}( u, \mathsf {a} ) = \mathsf {N}\, [ x ] = ( u, z )\) and \(\mathsf {F}\, \mathsf {a} = \mathsf {N}\, [ w ] = ( \mathsf {a} )\). For \( [ y ] \), we use F-Form1 to obtain \(\mathsf {F}\, \mathsf {con}( \mathsf {a} , x ) = ( \mathsf {a} , u, \mathsf {a} )\) and \(\mathsf {F}\, \mathsf {con}( \mathsf {a} , v, \mathsf {a} ) = ( \mathsf {a} , v, \mathsf {a} )\). Since the flat forms of these terms are not the same and \(\mathsf {len}\,u \approx \mathsf {len}\,v \in \mathsf {E}\), we may apply F-Unify to conclude \(v \approx u\). We update \(\mathsf {S}\) to \(\mathsf {S}\odot u \approx v\), after which the equivalence classes are:

$$ \{ y, \mathsf {con}( \mathsf {a} , x ), \mathsf {con}( \mathsf {a} , v, \mathsf {a} ) \} \quad \{ x, \mathsf {con}( u, \mathsf {a} ) \} \quad \{ u, v \} \quad \{ w, \mathsf {a} \} $$

and \(\mathsf {F}\, \mathsf {con}( \mathsf {a} , v, \mathsf {a} )\) is now \(( \mathsf {a} , u, \mathsf {a} )\). Then we can use N-Form1 and N-Form2 to reconstruct \(\mathsf {N}\). Since \(\mathsf {F}\, \mathsf {con}( \mathsf {a} , x ) =\mathsf {F}\, \mathsf {con}( \mathsf {a} , v, \mathsf {a} ) = ( \mathsf {a} , u, \mathsf {a} )\), we can obtain \(\mathsf {N}\, [ y ] = ( \mathsf {a} , u, \mathsf {a} )\). This results in a configuration where \(\mathsf {N}\) is a complete mapping over the equivalence classes and no more rules apply, indicating that \(\mathsf {E}\) is satisfiable in \(T_\mathsf {ASX}\).    \(\square \)

We say a configuration is cyclic if \(\widehat{\mathsf {E}}\) either contains a chain of equalities of the form \(s \approx \mathsf {con}( \varvec{t_1} ), s_1 \approx \mathsf {con}( \varvec{ t_2 } ), \ldots , s_{n-1} \approx \mathsf {con}( \varvec{t_n} ), s_n \approx s\) where \(s_i\) is a term from \(\varvec{t_i}\) for each i, or an equality of the form \(s \approx t\) where \(\mathsf {F}\,s = ( \varvec{w}, \varvec{u} ), \mathsf {F}\,t = ( \varvec{w}, \varvec{v} ), \varvec{w}\) is the maximal common prefix of \(\mathsf {F}\,s\), and \(\mathsf {F}\,t\) and \(\mathcal {V}( \varvec{u} ) \cap \mathcal {V}( \varvec{v} )\) is non-empty. Recent techniques have been proposed for cyclic string constraints [19, 29]. We instead focus primarily on acyclic string constraints in the following result.

Theorem 1

For all quantifier-free \(\varSigma _\mathsf {AS}\)-formulas \(\varphi \), the following hold.

  1. 1.

    There is a closed derivation tree with root \(\langle \{ \varphi \}, \overline{\varnothing }, \varnothing \rangle \) only if \(\varphi \) is unsat in \(T_\mathsf {ASX}\).

  2. 2.

    There is a derivation tree with root \(\langle \{ \varphi \}, \overline{\varnothing }, \varnothing \rangle \) containing an acyclic saturated configuration only if \(\varphi \) is sat in \(T_\mathsf {ASX}\).

3 Techniques for Extended String Constraints

This section gives a novel extension of the calculus in the previous section for determining the satisfiability of \(T_\mathsf {ASX}\)-constraints. While the decidability of this problem is not known [5], we focus on techniques that are both refutation-sound and model-sound but guaranteed to terminate in general. We introduce two techniques for establishing the (un)satisfiability of \(T_\mathsf {ASX}\)-constraints \(\mathsf {S}\), described by the additional derivation rules in Figs. 5 and 6 which supplement those from Sect. 2.

3.1 Expanding Extended Function Terms to Bounded Integer Quantification

The satisfiability problem for equalities over \(\varSigma _\mathsf {ASX}\)-terms can be reduced to the satisfiability problem for possibly quantified \(\varSigma _\mathsf {AS}\)-formulas. This reduction is provided by rule Ext-Expand in Fig. 5 which, given \(x \approx t \in \mathsf {X}\), adds an equisatisfiable formula \([\![ x \approx t ]\!]\) to \(\mathsf {G}\), the expanded form of \(x \approx t\). The rules also removes \(x \approx t\) from X, effectively marking it as processed. Since \([\![ x \approx t ]\!]\) keeps the (free) variables of \(x \approx t\), any interpretation later found to satisfy \([\![ x \approx t ]\!]\) will also satisfy \(x \approx t\).

Fig. 5.
figure 5

Rules for reducing \(\varSigma _\mathsf {ASX}\)-constraints to \(\varSigma _\mathsf {AS}\)-constraints, where \(z_1, z_2\) are fresh variables, denotes the maximum of \(n_1-n_2\) and 0, and \(\mathsf {ite}\) is the if-then-else connective.

The definition of expanded form for the possible cases of t are given below rule Ext-Expand.Footnote 3 We remark that this rule can be applied only finitely many times. For an intuition of why, consider any ordering \(\prec \) on function symbols such that \(f \prec g\) if g is an extended function and f is not, and \(\mathsf {substr}\prec \mathsf {contains}\prec \mathsf {index\_of}\prec \mathsf {replace}\). In all cases, all function symbols of \([\![ x \approx f( \varvec{t} ) ]\!]\) are smaller than f in this ordering.

Note that the reduction introduces formulas with bounded integer quantification, that is, formulas of the form \(\forall k.\, 0 \leqslant k \leqslant t \Rightarrow \varphi \), where t does not contain k and \(\varphi \) is quantifier-free. Special consideration is needed to handle formulas of this form. We employ a pragmatic approach, modeled by the other two rules in Fig. 5, which guesses upper bounds on certain arithmetic terms and eliminates those quantified formulas based on these bounds. Specifically, rule B-Val splits the search into two cases \(t \leqslant \mathsf {n} \) and \(t > \mathsf {n} \), where t is an integer term and \(\mathsf {n} \) is a numeral. In the rule B-Inst, if \(\mathsf {G}\) contains a formula \(\varphi \) having a subformula \(\forall k.\, 0 \leqslant k \leqslant t \Rightarrow \psi \) and \(\mathsf {A}\) entails a (concrete) upper bound on t, then that subformula is replaced by a finite conjunction. Since all quantifiers introduced in a configuration are bounded, these two rules used in combination suffice to eliminate them.

Example 3

Consider the formula \(\varphi \,=\, \mathsf {contains}( y, z ) \wedge 0< \mathsf {len}\, y \leqslant 3 \wedge 0 < \mathsf {len}\,z\). Applying Prop-Assign to \(\varphi \) results in a configuration where \(\mathsf {E}, \mathsf {X}\) and \(\mathsf {A}\) respectively are \(\{ x \approx \top \}, \{ x \approx \mathsf {contains}( y, z ) \}\), and \(\{ 0< \mathsf {len}\, y \leqslant 3, 0 < \mathsf {len}\,z \}\). Using Ext-Expand, we remove \(x \approx \top \) from \(\mathsf {X}\) and add to \(\mathsf {G}, [\![ x \approx \mathsf {contains}( y, z ) ]\!]\) which is:

$$ ( x \not \approx \top ) \Leftrightarrow \forall k.\, 0 \leqslant k \leqslant \mathsf {len}\,y-\mathsf {len}\,z \Rightarrow \mathsf {substr}( y, k, \mathsf {len}\,z ) \not \approx z. $$

Since \(\mathsf {A}\, \models _{\mathsf {LIA}} \, \mathsf {len}\,y-\mathsf {len}\,z \leqslant 2\), by B-Inst we can replace this formula with:

$$ \begin{array}{c} ( ( x \not \approx \top ) \Leftrightarrow \wedge _{j=0}^2\, \mathsf {substr}( y, n_j, \mathsf {len}\,z ) \not \approx z ) \wedge \bigwedge \nolimits _{j=0}^2\, n_j \approx j \end{array} $$

where \(n_j\) is a fresh integer variable for \(j=0,1,2\). Applying Prop-Assign to this formula gives a branch where \(\mathsf {E}\) and \(\mathsf {X}\) are updated to \(\{ x \approx \top , m_0 \approx \mathsf {len}\,z, x_0 \approx z \}\) and \(\{ x_0 \approx \mathsf {substr}( y, n_0, m_0 ) \}\) for fresh variables \(x_0\) and \(m_0\). Using the rule Ext-Expand, we remove the equality from \(\mathsf {X}\) and add \([\![ x_0 \approx \mathsf {substr}( y, n_0, m_0 ) ]\!]\) to \(\mathsf {G}\), which is:

$$ \begin{array}{ll} \mathsf {ite}( &{} 0 \leqslant n_0 < \mathsf {len}\,y \wedge 0 \leqslant m_0,\\ &{} y \approx \mathsf {con}( z_1, x_0, z_2 ) \wedge \mathsf {len}\,z_1 \approx 0 \wedge \mathsf {len}\,z_2 \approx \mathsf {len}\,y \dot{-} m_0,\, x_0 \approx \mathsf {\epsilon }) \end{array} $$

with \(z_1, z_2\) fresh string variables. Applying Prop-Assign again produces a branch with \(\mathsf {E}= \{ x \approx \top , x_0 \approx z, m_0 \approx \mathsf {len}\,z, y \approx \mathsf {con}( z_1, x_0, z_2 ) \}\) and \(\mathsf {X}\) empty. The set \(\mathsf {E}\) is satisfiable in \(T_\mathsf {ASX}\). Deriving a saturated configuration from this point indicates that \(\varphi \) is satisfiable in \(T_\mathsf {ASX}\) as well. Indeed, all interpretations that satisfy both \(y \approx \mathsf {con}( z_1, x_0, z_2 )\) and \(x_0 \approx z\) also satisfy \(\mathsf {contains}( y, z )\).    \(\square \)

Although these rules give the general idea of our approach, our implementation actually handles the bounded quantifiers in a more sophisticated way, using model-based quantifier instantiation [23]. In a nutshell, we avoid generating all the instances of a quantified formula either by backtracking when a subset of them are unsatisfiable, or by determining that (sets of) instances are already satisfied by a candidate model.

3.2 Context-Dependent Simplification of Extended Function Terms

The reductions described above may be impractical due to the size and complexity of the formulas they introduce. For this reason, we have developed techniques for recognizing when the interpretation of an extended function term can be deduced based on the constraints in the current context. As a simple example, if \(\mathsf {contains}( \mathsf {abc} , x )\) is a term belonging to the string theory (with \(\mathsf {abc} \) a string constant), and the string solver has inferred that x is equal to a concrete value (e.g., \(\mathsf {d} \)) or even a partially concrete value (e.g., \(\mathsf {con}( \mathsf {d} , y )\)), then we can already infer that \(\mathsf {contains}( \mathsf {abc} , x )\) is equivalent to \(\bot \), thereby avoiding the construction of its expanded form. We present next a generic technique for inferring such facts that has a substantial performance impact in practice.

Fig. 6.
figure 6

Rules for context-dependent simplification of extended functions terms.

The rule Ext-Simplify from Fig. 6 applies to configurations in which an extended function term t can be simplified to an equivalent form, modulo the current set of constraints, that does not involve extended functions. In this rule, we derive a set of equalities \(\varvec{y} \approx \varvec{s}\) that are consequences of our current set of string constraints \(\mathsf {E}\), where typically \(\varvec{y}\) are the (free) variables of t. We will refer to \(\{ \varvec{y} \mapsto \varvec{s} \}\) as a derivable substitution (in \(\mathsf {E}\textit{)}\). If the simplified form of t under this substitution is a \(\varSigma _\mathsf {AS}\)-term, then we add the equality \(x \approx {( t \{ \varvec{y} \mapsto \varvec{s} \} )}{\downarrow }\) to \(\mathsf {G}\), and remove \(x \approx t\) from \(\mathsf {X}\). Similarly when t is of sort \(\mathsf {Bool}\), the rule Ext-Simplify-Pred adds an equivalence to \(\mathsf {G}\) based on the result of simplifying the formula \(t \approx \top \) under a derivable substitution, and removes \(x \approx t\) from \(\mathsf {X}\). The rule Ext-Eq is used to deduce equalities between extended terms that are syntactically identical after simplification under a derivable substitution.

These rules require methods for computing the simplified form \({t}{\downarrow }\) of \(\varSigma _\mathsf {ASX}\)-terms t, as well as for choosing substitutions \(\{ \varvec{y} \mapsto \varvec{s} \}\). We describe these methods in the following, and give several examples.

Simplification Rules for Extended String Functions. Recall that by construction, a term t and its simplified form \({t}{\downarrow }\) are equivalent in \(T_\mathsf {ASX}\). It is generally advantageous to use techniques that often simplify \(\varSigma _\mathsf {ASX}\)-terms t to \(\varSigma _\mathsf {AS}\)-terms \({t}{\downarrow }\), since this eliminates the need to apply Ext-Expand to compute the expanded form of t. For this reason, we use aggressive and non-trivial simplification techniques when considering \(\varSigma _\mathsf {ASX}\)-terms.

Fig. 7.
figure 7

Examples of simplification rules for \(\mathsf {contains}\).

Examples of some of the simplification rules for \(\mathsf {contains}\) are given in Fig. 7. There, for string constants \(l_1, l_2\), we write \(l_1 \setminus l_2\) to denote the empty string if \(l_1\) does not contain \(l_2\), and the remainder obtained from removing the largest prefix of \(l_1\) containing \(l_2\) otherwise. We use \(l_1 \sqcup _l l_2\) (resp. \(l_1 \sqcup _r l_2\)) to denote \(l_2\) if \(l_1\) contains \(l_2\), and the largest suffix (resp. prefix) of \(l_1\) that is a prefix (resp. suffix) of \(l_2\) otherwise. For example, \((\mathsf {abcde} \setminus \mathsf {cd} ) = \mathsf {e} , (\mathsf {abcde} \setminus \mathsf {ba} ) = \mathsf {\epsilon }\), \((\mathsf {abcde} \sqcup _l \mathsf {def} ) = \mathsf {de} , (\mathsf {abcde} \sqcup _r \mathsf {def} ) = \mathsf {\epsilon }\), and \((\mathsf {abcdc} \sqcup _l \mathsf {cd} ) = \mathsf {cd} \). Also, \(s \rightarrow ^*t\) indicates that t can be obtained from s by zero or more applications of the rules in the figure. One can prove that his rewrite system is terminating by noting that all conditions and right hand side of each non-trivial rule involve only concatenation terms with strictly fewer arguments than its left hand side.

In practice, the rules are implemented by a handful of recursive passes over the arguments of \(\mathsf {contains}\) terms. Computing the simplified form of other operators is also fairly sophisticated and not shown here. (Our simplifier is around 2000 lines of C++ code.) Despite its complexity, simplification often results in significant performance improvements, by eliminating the need to generate the expanded form of \(\varSigma _\mathsf {ASX}\)-terms. We illustrate this in the following examples.

Example 4

Given input \(y \approx \mathsf {bc} \wedge \mathsf {contains}( \mathsf {con}(\mathsf {a} ,y), \mathsf {con}(\mathsf {b} ,z,\mathsf {a} ) )\), our calculus considers a configuration where \(\mathsf {E}\) and \(\mathsf {X}\) respectively are

$$ \begin{array}{rcl} \{ y \approx \mathsf {bc} , x_1 \approx \top , z_1 \approx \mathsf {con}(\mathsf {a} ,y), z_2 \approx \mathsf {con}(\mathsf {b} ,z,\mathsf {a} ) \}&\text { and }&\{ x_1 \approx \mathsf {contains}( z_1, z_2 ) \} \end{array} $$

where \(x_1, z_1, z_2\) are fresh string variables. We have that \(\mathsf {E}\, \models \, z_1 \approx \mathsf {abc} \wedge z_2 \approx \mathsf {con}(\mathsf {b} ,z,\mathsf {a} )\). Hence the substitution \(\sigma = \{ z_1 \mapsto \mathsf {abc} , z_2 \mapsto \mathsf {con}(\mathsf {b} ,z,\mathsf {a} ) \}\) is derivable in this configuration. Since \({ (\mathsf {contains}( z_1, z_2 )\sigma )}{\downarrow } ={ \mathsf {contains}( \mathsf {abc} , \mathsf {con}( \mathsf {b} , z, \mathsf {a} ) ) }{\downarrow } = \bot \), we may apply Ext-Simplify to remove \(x_1 \approx \mathsf {contains}( z_1, z_2 )\) from \(\mathsf {X}\) and add \(x_1 \approx \bot \) to \(\mathsf {E}\), after which \(\mathsf {unsat}\) may be derived, since \(x_1 \approx \top \in \mathsf {E}\). In this example, we have avoided expanding the input formula by reasoning that \(\mathsf {con}(\mathsf {a} ,y)\) does not contain \(\mathsf {con}(\mathsf {b} ,z,\mathsf {a} )\) in the context where y is \(\mathsf {bc} \).    \(\square \)

Example 5

Given input \(y \approx \mathsf {con}( \mathsf {a} , z ) \wedge \mathsf {contains}( \mathsf {con}( x, y ), \mathsf {bc} )\), our calculus considers the configuration where \(\mathsf {E}\) and \(\mathsf {X}\) respectively are

$$ \begin{array}{rclrcl} \{ y \approx \mathsf {con}( \mathsf {a} , z ), x_1 \approx \top , z_1 \approx \mathsf {con}(x,y), z_2 \approx \mathsf {bc} \}&\text { and }&\{ x_1 \approx \mathsf {contains}( z_1, z_2 ) \} \end{array} $$

The substitution \(\sigma = \{ z_1 \approx \mathsf {con}( x, \mathsf {a} , z ), z_2 \mapsto \mathsf {bc} \}\) is derivable in this configuration. Computing \({(\mathsf {contains}( z_1, z_2 ) \sigma )}{\downarrow }\) results in \(\mathsf {contains}( x, \mathsf {bc} ) \vee \mathsf {contains}( z, \mathsf {bc} )\). We may apply Ext-Simplify-Pred to remove \(x_1 \approx \mathsf {contains}( z_1, z_2 )\) from \(\mathsf {X}\) and add this formula to \(\mathsf {G}\), after which we consider the two disjuncts independently.    \(\square \)

Example 6

Given input \(y \approx \mathsf {ab} \wedge \mathsf {contains}( \mathsf {con}(\mathsf {b} ,z), y ) \wedge \lnot \mathsf {contains}( z, y )\), our calculus considers the configuration where \(\mathsf {E}\) and \(\mathsf {X}\) respectively are

$$\begin{aligned} \begin{array}{rcl} \{ y \approx \mathsf {ab} , x_1 \approx \top , x_2 \approx \bot , z_1 \approx \mathsf {con}(\mathsf {b} ,z) \}&\text { and }&\{ x_1 \approx \mathsf {contains}( z_1, y ), x_2 \approx \mathsf {contains}( z, y ) \} \end{array} \end{aligned}$$

The substitution \(\sigma = \{ y \approx \mathsf {ab} , z_1 \approx \mathsf {con}( \mathsf {b} , z ) \}\) is derivable in this configuration, and \({ (\mathsf {contains}( z_1, y ) \sigma ) }{\downarrow } = \mathsf {contains}( z, \mathsf {ab} ) = { (\mathsf {contains}( z, y ) \sigma )}{\downarrow }\). Hence we can apply Ext-Eq to add \(x_1 \approx x_2\) to \(\mathsf {E}\), after which \(\mathsf {unsat}\) can be derived.    \(\square \)

Choosing Substitutions. A simple and general heuristic for choosing substitutions \(\{ \varvec{y} \mapsto \varvec{s} \}\) for terms t in the rules from Fig. 6 is to map each variable y in t to some representative of its equivalence class \( [ y ] \). We assume string constants are chosen as representatives whenever possible. We call this the representative substitution for t (in \(\mathsf {E}\)). Representative substitutions are both easy to compute and often enough for reducing \(\varSigma _\mathsf {ASX}\)-terms. A more powerful method for choosing substitutions is to consider substitutions that map each free variable y in t to \({\mathsf {con}( a_1,\ldots , a_n )}{\downarrow }\) where \(\mathsf {N}\, [ y ] = ( a_1, \ldots , a_n )\). We call this the normal form substitution for t. Intuitively, the normal form of t is a schema representing all known information about t. In this sense, a substitution mapping variables to their normal forms gives the highest likelihood of enabling our simplification techniques. In practice, our implementation takes advantage of both of these heuristics for choosing substitutions.

Example 7

Say we are in a configuration where the equivalence classes of \(\widehat{\mathsf {E}}\) are:

$$\begin{aligned} \{ y, \mathsf {con}( \mathsf {a} , x ) \} \quad \{ x, \mathsf {con}( z, \mathsf {c} ), u \} \quad \{ z, w, \mathsf {b} \} \end{aligned}$$

and the normal forms are \(\mathsf {N} [ y ] = \mathsf {abc} , \mathsf {N} [ x ] = \mathsf {bc} \) and \(\mathsf {N} [ z ] = \mathsf {b} \). If yx, and \(\mathsf {b} \) are chosen as the representatives of these classes, the representative substitution \(\sigma _r\) for this configuration is \(\{ y \mapsto y, x \mapsto x, u \mapsto x, z \mapsto \mathsf {b} , w \mapsto \mathsf {b} \}\), whereas the normal form substitution \(\sigma _n\) is \(\{ y \mapsto \mathsf {abc} , x \mapsto \mathsf {bc} , u \mapsto \mathsf {bc} , z \mapsto \mathsf {b} , w \mapsto \mathsf {b} \}\). Only the latter substitution suffices to show that \(\mathsf {contains}( y, \mathsf {con}( z, z ) )\) is false in the current context, noting \({(\mathsf {contains}( y, \mathsf {con}( z, z ) ) \sigma _r )}{\downarrow } = \mathsf {contains}( y, \mathsf {bb} )\) and \({(\mathsf {contains}( y, \mathsf {con}( z, z ) ) \sigma _n )}{\downarrow } = \bot \).    \(\square \)

4 Implementation

We have implemented all of these techniques in the DPLL(T)-based SMT solver cvc4 [4]. At a high level, our implementation can be summarized as a particular strategy for applying the rules of the calculus, which we outline in the following.

Strategy 1

Start with a derivation tree consisting of (root) node \(\langle \{ \varphi \}, \varnothing , \varnothing \rangle \). Let \(t_\mathsf {len}\) be \(\mathsf {len}\,x_1 + \ldots + \mathsf {len}\,x_m\) where \(x_1, \ldots , x_m\) are the string variables of \(\varphi \).

While the tree is not closed, consider as current configuration the left-most leaf in the tree that is not \(\mathsf {unsat} \) and apply to it a derivation rule to that configuration, based on the steps below.

  1. 1.

    Let \(\mathsf {n} \) be the smallest numeral such that \(t_\mathsf {len}> \mathsf {n} \not \in \mathsf {A}\). If \(t_\mathsf {len}\leqslant \mathsf {n} \not \in \mathsf {A}\), apply B-Val for \(t_\mathsf {len}\) and \(\mathsf {n} \).

  2. 2.

    If \(\mathsf {G}\) contains a formula \(\varphi \) with subformula \(\forall k.\, 0 \leqslant k \leqslant t \Rightarrow \psi \), then let \(\mathsf {n} \) be the smallest numeral such that \(t > \mathsf {n} \not \in \mathsf {A}\). If \(t \leqslant \mathsf {n} \in \mathsf {A}\), apply B-Inst for \(\varphi \) and \(\mathsf {n} \). Otherwise, apply B-Val for t and \(\mathsf {n} \).

  3. 3.

    If possible, apply a rule from Fig. 3, giving priority to A-Conf and S-Conf.

  4. 4.

    If possible, apply a rule from Fig. 6 based on representative substitutions.

  5. 5.

    If possible, apply a rule from Fig. 4.

  6. 6.

    If possible, apply a rule from Fig. 6 based on normal form substitutions.

  7. 7.

    If \(\mathsf {X}\) is non-empty, apply the rule Ext-Expand for some equality \(x \approx t\) in \(\mathsf {X}\).

If no rule applies and the current configuration is acyclic, return \(\mathsf {sat}\). If the tree is closed, return \(\mathsf {unsat}\).    \(\square \)

The strategy above is sound both for refutations and models, although it is not terminating in general.

Theorem 2

For all initial configurations \(\langle \{ \varphi \}, \varnothing , \varnothing \rangle \) where \(\varphi \) is a quantifier-free \(\varSigma _\mathsf {ASX}\)-formula:

  1. 1.

    Strategy 1 returns \(\mathsf {unsat}\) only if \(\varphi \) is unsatisfiable in \(T_\mathsf {ASX}\).

  2. 2.

    Strategy 1 returns \(\mathsf {sat}\) only if \(\varphi \) is satisfiable in \(T_\mathsf {ASX}\).

Implementation. While a comprehensive description of our implementation is beyond the scope of this work, we mention a few salient implementation details. The rule Prop-Assign is implemented by converting \(\mathsf {G}\) to clausal normal form and giving the resulting clauses to a SAT solver with support for conflict-driven clause learning. The rule A-Conf is implemented by a standard theory solver for linear integer arithmetic. The rules of the calculus that modify the \(\mathsf {S}\) component of our configuration are implemented in a dedicated DPLL(T) theory solver for strings which generates conflict clauses when branches of a derivation tree are closed, and theory lemmas for rules that add formulas to \(\mathsf {G}\) or \(\mathsf {A}\) and those that have multiple conclusions. Conflict clauses are generated by tracking explanations so that each literal internally added to \(\mathsf {S}\) can be justified in terms of input literals. Finally, we do not explicitly introduce fresh variables when constructing the set \(\mathsf {X}\), and instead record the set of extended terms that occur in \(\mathsf {E}\), which are implicitly treated as variables. We now revisit a few of the examples, giving concrete details on the operation of the solver.

Example 8

In Example 1, our input was \(\mathsf {len}\, x > \mathsf {len}\, y \wedge y \approx \mathsf {con}( x, \mathsf {a} )\). For this input, the SAT solver finds a propositionally satisfying assignment that assigns both conjuncts to true, which causes the literal \(\mathsf {len}\, x > \mathsf {len}\, y\) to be given to the theory solver for linear integer arithmetic, and \(y \approx \mathsf {con}( x, \mathsf {a} )\) to be given to the theory solver for strings. This corresponds to an application of the rule \(\textsf {\small Prop-Assign}\). The string solver sends \(( \lnot y \approx \mathsf {con}( x, \mathsf {a} ) \vee \mathsf {len}\, y \approx \mathsf {len}\, x + 1 )\) as a theory lemma to the SAT solver, corresponding to an application of the rule \(\textsf {\small L-Eq}\). After that, the SAT solver assigns \(\mathsf {len}\, y \approx \mathsf {len}\, x + 1\) to true, causing that literal to be asserted to the arithmetic solver, which subsequently generates a conflict clause of the form \((\lnot \mathsf {len}\, x > \mathsf {len}\, y \vee \lnot \mathsf {len}\, y \approx \mathsf {len}\, x + 1)\) corresponding to an application of \(\textsf {\small A-Conf}\). After receiving this clause, the SAT solver is unable to find another satisfying assignment and causes the system to terminate with “unsat.”    \(\square \)

Example 9

In Example 4, the string solver is given as input the literals \(y \approx \mathsf {bc} \) and \(\mathsf {contains}( \mathsf {con}(\mathsf {a} ,y), \mathsf {con}(\mathsf {b} ,z,\mathsf {a} ) ) \approx \top \). The intermediate variables \(z_1\) and \(z_2\) are not explicitly introduced. Instead, using the substitution \(\sigma = \{ y \mapsto \mathsf {bc} \}\), the solver directly infers that \({\mathsf {contains}( \mathsf {con}(\mathsf {a} ,y), \mathsf {con}(\mathsf {b} ,z,\mathsf {a} )) \sigma }{\downarrow } = \bot \). Based on this simplification, it infers \(\mathsf {contains}( \mathsf {con}(\mathsf {a} ,y), \mathsf {con}(\mathsf {b} ,z,\mathsf {a} )) \approx \bot \) with the explanation \(y \approx \mathsf {bc} \). Since the inferred literal conflicts with the second input literal, the string solver reports the conflict clause \(\lnot y \approx \mathsf {bc} \vee \mathsf {contains}( \mathsf {con}(\mathsf {a} ,y), \mathsf {con}(\mathsf {b} ,z,\mathsf {a} )) \approx \top \).    \(\square \)

Example 10

In Example 6, the equality \(y \approx \mathsf {ab} \) is the explanation for the substitution \(\sigma = \{ y \mapsto \mathsf {ab} \}\) under which \({\mathsf {contains}( \mathsf {con}( b, z ), y ) \sigma }{\downarrow } = \mathsf {contains}( z, \mathsf {ab} ) = {\mathsf {contains}( \mathsf {con}( z, y ) ) \sigma }{\downarrow }\). Hence, the solver reports \((\lnot y \approx \mathsf {ab} \vee \lnot \mathsf {contains}( \mathsf {con}(\mathsf {b} ,z), y ) \vee \mathsf {contains}( z, y ) )\) as a conflict clause in this example.    \(\square \)

Example 11

Explanations are tracked for normal form substitutions as well. In Example 7, a possible explanation for the substitution \(\sigma _n\) is \(y \approx \mathsf {con}( \mathsf {a} , x ) \wedge x \approx \mathsf {con}( z, \mathsf {c} ) \wedge u \approx x \wedge z \approx \mathsf {b} \wedge w \approx z\). Explanations for simplifications that occur under the substitution \(\sigma _n\) must include these equalities.    \(\square \)

In practice, we minimize explanations by only including the variables in substitutions that are relevant for certain inferences. In particular, the domain of derivable substitutions is restricted to the free variables of the terms they apply to. We further reduce this set based on dependency analysis. For example, \(\mathsf {contains}( \mathsf {abc} , \mathsf {con}( x, y ) ) \approx \bot \) can be explained by \(x \approx \mathsf {d} \wedge y \approx \mathsf {a} \). However, \(x \approx \mathsf {d} \) alone is enough.

5 Evaluation

This section reports on our experimental evaluation of our approach for extended string constraints as implemented in the SMT solver cvc4.Footnote 4 We used benchmark queries generated by running PyEx, a symbolic executor for Python programs, over a test suite that mimics the usage by a real-world Python developer. The technical details of our benchmark generation process are provided in Sect. 5.2.

We considered several configurations of cvc4 that differ in the subset of steps from Strategy 1 they apply. The default configuration, denoted cvc4, performs Steps 2, 3, 5 and 7 only. Configurations with suffix f (for “finite model finding”) perform Step 1, and configurations with suffix s (for “simplification”) perform Steps 4 and 6. For example, configuration cvc4+fs performs all seven steps. We consider other solvers for string constraints, including z3-str [30] (git revision e398f81) and z3 [9] (version 4.5, git revision 24eae3f) which was recently extended with native support for strings.

5.1 Comparison with Other String Solvers

We first evaluated the raw performance of cvc4, z3, and z3-str on the string benchmarks we collected with PyEx. We considered three sets of benchmarks produced by PyEx using cvc4+fs, z3, and z3-str as the path constraint solver during program exploration. We denote these sets, which collectively consisted of 25,421 benchmark problems, as PyEx-cfs, PyEx-z3 and PyEx-z32, respectively. We omit a small number (35) of these benchmarks for the following reasons: for 13 of them, at least one of the solvers produced a parse error; for the other 22, one solver returned a model (i.e., a satisfying assignment for the variables in the input problem) and agreed with its own model, but another solver answered “unsat” and disagreed with the model of the first solver.Footnote 5 We attribute the parse errors to how the solvers process certain escape sequences in string constants, and the model discrepancies to minor differences in the semantics of \(\mathsf {substr}\) when input indices are out of bounds.

Fig. 8.
figure 8

Results of running each solver over benchmarks generated by PyEx over our test suite. All benchmarks run with a 30 s timeout.

All results were produced on StarExec [25], a public execution service for running comparative evaluations of logical solvers. The results for the three solvers on the three benchmark sets are shown in Fig. 8 based on a 30 s timeout. The columns show the number of benchmarks that were determined to be satisfiable and unsatisfiable by each solver. The column with heading \(\times \) indicates the number of times the solver either timed out or terminated with an inconclusive response such as “unknown.” The best configuration of cvc4 (cvc4+fs) had a factor of 3.8 fewer timeouts than z3, and a factor of 7.6 fewer timeouts or failures than z3-str in total over all benchmark sets. In particular, we note that cvc4+fs solved 1,451 unique benchmarks with respect to z3 among those generated during a symbolic execution run using z3 as the solver (PyEx-z3). Since PyEx supports concurrent solver invocation, this suggests a mixed-solver strategy that employs both cvc+fs and z3 would likely have reduced the number of failed queries for that run. For unsatisfiable benchmarks, the solvers were relatively closer in performance, where z3 solved 24 more unsatisfiable benchmarks than cvc4+fs, which is not tuned for the unsatisfiable case due to its use of finite model finding. This further suggests a mixed-solver strategy would likely be beneficial for symbolic execution since it is often used for both program exploration (where \(\mathsf {sat}\) leads to progress) and vulnerability checking (where \(\mathsf {unsat}\) implies safety).

In addition to solving more benchmarks, cvc4+fs was significantly faster over them. Figure 9 plots the cumulative run time of the three solvers on benchmarks that each solves. With respect to z3, which took 11 h and 33 min on the 19,368 benchmarks its solves, cvc4+fs solved its first 19,368 benchmarks in 1 h and 23 min, and overall took only 5 h and 8 min on the 23,802 benchmarks it solves.

Fig. 9.
figure 9

Cactus plot of configurations of cvc4, z3 and z3-str on solved benchmarks across all three benchmark sets.

Using the context-dependent simplification techniques from Sect. 3.2, cvc4+s was able to solve 663 more benchmarks than cvc4, which does not apply simplification. By incorporating finite model finding, cvc4+fs was able to solve 536 more benchmarks in significantly less time, with a cumulative difference of more than 3 h on solved benchmarks compared to cvc4+s. Taking the virtual best configuration of cvc4, our techniques find 20,594 benchmarks to be satisfiable, 567 more than cvc4+fs, indicating that a portfolio approach for the various configurations would be advantageous.

We also measured how often cvc4 resorted to expanding extended function terms. We considered a modified configuration cvc4+fs’ that is identical to cvc4+fs except that it does not use the rule Ext-Simplify. The 23,738 benchmarks solved by both cvc4+fs and cvc4+fs’ had 619.2 extended function terms on average. On average over these benchmarks, the configuration cvc4+fs’ found that 63.5 unique extended functions terms were relevant to satisfiability (e.g., were added to a configuration), and of these 24.3 were expanded (38%). Likewise, cvc4+fs found that 66.4 unique extended functions terms were relevant to satisfiability, and of these 12.6 were expanded (19%). With Ext-Simplify, it inferred 405.2 equalities per benchmark on average based on context-dependent simplification, showing that simplification is possible in a majority of contexts (97%). Limited to expansions that introduce universal quantification, which excludes expansions of \(\mathsf {substr}\) and positively asserted \(\mathsf {contains}\), cvc4+fs’ considered 11.4 expansions on average compared to 2.7 considered by cvc4+fs. This means that approximately 4 times fewer quantified formulas were introduced thanks to context-dependent simplification in cvc4+fs.

5.2 Symbolic Execution for Python

Our benchmarks were generated by PyEx, which is a symbolic executor designed to assist Python developers achieve high-coverage testing in their daily development routine, e.g., as part of the nightly tests run on the most recent version of a code base. To demonstrate the relative performance of cvc4, z3, and z3-str in our nightly tests scenario, we ran PyEx on a test suite of 19 functions sampled from 4 popular Python packages: \(\mathsf {httplib2}, \mathsf {pip}, \mathsf {pymongo}\), and \(\mathsf {requests}\). The set of queries generated during this experiment was used for our evaluation of the raw solver performance in Sect. 5.1. In this section, we show that the superior raw performance of cvc4 over other current solvers also translates into real-life benefits for PyEx users.

Our experiment was conducted on a developer machine featuring an Intel E3-1275 v3 quad-core processor (3.5 GHz) and 32 GB of memory. PyEx was run on each of the 19 functions for a maximum CPU time of 2 h. By design, PyEx issues concurrent queries using Python multiprocessing when multiple queries are pending. To reflect the configuration of our test machine, we capped the number of concurrent processes to 8. Note that, due to the nature of our test infrastructure, each of the 19 functions was tested in sequence and thus concurrency happened only within the testing of each individual function. In addition, PyEx has a notion of a per-path timeout, which is a heuristic to steer code exploration away from code paths that are stuck with hard queries. For this experiment, that timeout was set to 10 min.

We argue that the most important metrics for a developer are (i) the wall-clock time to run PyEx over the test suite and (ii) the coverage achieved over this time. In our experiments, PyEx with z3 and with z3-str finished in 717 min and 829 min, respectively. By comparison, PyEx with the recommended configuration of cvc4 (cvc4+fs) finished in 295 min, which represents a speedup of \(59\% \) and \(64\% \) respectively over the other solvers. To compare coverage, we used the Python coverage library to measure both line coverage, the percentage of executed source lines, and branch coverage, the number of witnessed branch outcomes, of the test suite during symbolic execution. The line coverage of PyEx with cvc4+fs, z3, and z3-str was respectively \(8.48\%, 8.41\% \), and \(8.34\% \),Footnote 6 whereas the branch coverage was 3612, 3895, and 3500. Taking both metrics into account, we conclude that cvc4+fs is highly effective for PyEx since it achieves similar coverage as the other tools while running significantly faster.

6 Concluding Remarks and Future Work

We have presented a calculus for extended string constraints that relies on both bounded quantifier elimination and context-dependent simplification techniques. The latter led to significant performance benefits for constraints coming from symbolic execution of Python programs. An implementation of these techniques in cvc4 has 3.8 times fewer timeouts and enables the PyEx symbolic executor to achieve comparable program coverage on our test suite while using only \(41\% \) of the runtime compared to other solvers.

Our analysis on program coverage indicates that an interesting research avenue for future work would be to determine correlations between certain features of models generated by string solvers and their utility in a symbolic executor, since different models may lead to different symbolic executions and hence different overall analyses.

We plan to develop context-dependent simplification techniques for other string functions, including conversion functions \(\mathsf {str\_to\_int}\) and \(\mathsf {int\_to\_str}\), and to adapt these techniques to other SMT theories. Notably, we would like to use context-dependent simplification to optimize lazy approaches for fixed-width bit-vectors [14] where it is beneficial to avoid bit-blasting bit-vector operators, such as multiplication, that require elaborate encodings.