Strong mixed-integer programming formulations for trained neural networks

Abstract

We present strong mixed-integer programming (MIP) formulations for high-dimensional piecewise linear functions that correspond to trained neural networks. These formulations can be used for a number of important tasks, such as verifying that an image classification network is robust to adversarial inputs, or solving decision problems where the objective function is a machine learning model. We present a generic framework, which may be of independent interest, that provides a way to construct sharp or ideal formulations for the maximum of d affine functions over arbitrary polyhedral input domains. We apply this result to derive MIP formulations for a number of the most popular nonlinear operations (e.g. ReLU and max pooling) that are strictly stronger than other approaches from the literature. We corroborate this computationally, showing that our formulations are able to offer substantial improvements in solve time on verification tasks for image classification networks.

Notes

  1. Note that if \(D = \left\{ x \in {\mathbb {R}}^\eta \,\big |\, Ax \leqslant b\right\} \) is polyhedral, then \(z \cdot D = \left\{ x \in {\mathbb {R}}^\eta \,\big | \, Ax \leqslant bz\right\} \).

  2. As is standard in a Benders’ decomposition approach, we can address this by adding a feasibility cut describing the domain of \({\overline{g}}\) (the region where it is finite valued) instead of an optimality cut of the form (10a).

  3. Alternatively, a constructive proof of validity and idealness using Fourier–Motzkin elimination is given in the extended abstract of this work [4, Proposition 1].

  4. In this context, logits are non-normalized predictions of the neural network so that \({\tilde{x}} \in [0,1]^{28 \times 28}\) is predicted to be digit \(i-1\) with probability \(\exp (f({\tilde{x}})_i)/\sum _{j=1}^{10}\exp (f({\tilde{x}})_j)\) [1].

  5. We use cut callbacks in Gurobi to inject separated inequalities into the cut loop. While this offers little control over when the separation procedure is run, it allows us to take advantage of Gurobi’s sophisticated cut management implementation.
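
The formula in Note 4 and the callback pattern in Note 5 can be sketched in a few lines of Python. Both snippets below are illustrative only: the logit vector is randomly generated rather than produced by a trained network, and separate_cuts is a hypothetical stand-in for a routine that returns violated inequalities; neither is taken from the authors' implementation. First, the softmax probability from Note 4:

import numpy as np

def softmax(logits):
    # Shift by the maximum for numerical stability; the probabilities are unchanged.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.random.randn(10)             # stand-in for the network's 10 logits f(x)
probs = softmax(logits)                  # probs[j] = exp(logits[j]) / sum_i exp(logits[i])
predicted_digit = int(np.argmax(probs))  # with 0-based indexing, entry j corresponds to digit j

Second, a minimal sketch of the cut-callback pattern described in Note 5, using the gurobipy interface:

import gurobipy as gp
from gurobipy import GRB

def cut_callback(model, where):
    # Act only once the LP relaxation at the current branch-and-bound node is solved.
    if where == GRB.Callback.MIPNODE and \
            model.cbGet(GRB.Callback.MIPNODE_STATUS) == GRB.OPTIMAL:
        point = model.cbGetNodeRel(model._vars)             # fractional point to separate
        for lhs, rhs in separate_cuts(point, model._vars):  # hypothetical separation routine
            model.cbCut(lhs <= rhs)                         # hand the cut to Gurobi's cut loop

model = gp.Model()
# ... build the MIP formulation of the trained network here ...
model._vars = model.getVars()
model.Params.PreCrush = 1        # allow user cuts to be applied to the presolved model
model.optimize(cut_callback)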

References

  1. https://developers.google.com/machine-learning/glossary/#logits. Accessed 6 Feb 2020

  2. Alipanahi, B., Delong, A., Weirauch, M.T., Frey, B.J.: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015)

  3. Amos, B., Xu, L., Kolter, J.Z.: Input convex neural networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 146–155. PMLR, International Convention Centre, Sydney (2017)

  4. Anderson, R., Huchette, J., Tjandraatmadja, C., Vielma, J.P.: Strong mixed-integer programming formulations for trained neural networks. In: A. Lodi, V. Nagarajan (eds.) Proceedings of the 20th Conference on Integer Programming and Combinatorial Optimization, pp. 27–42. Springer International Publishing, Cham (2019). arxiv:1811.08359

  5. Arora, R., Basu, A., Mianjy, P., Mukherjee, A.: Understanding deep neural networks with rectified linear units (2016). arXiv preprint arXiv:1611.01491

  6. Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A.: Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34(6), 26–38 (2017)

  7. Atamtürk, A., Gómez, A.: Strong formulations for quadratic optimization with M-matrices and indicator variables. Math. Program. 170, 141–176 (2018)

  8. Balas, E.: Disjunctive programming and a hierarchy of relaxations for discrete optimization problems. SIAM J. Algorithmic Discrete Methods 6(3), 466–486 (1985)

  9. Balas, E.: Disjunctive programming: properties of the convex hull of feasible points. Discrete Appl. Math. 89, 3–44 (1998)

  10. Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Neuron constraints to model complex real-world problems. In: International Conference on the Principles and Practice of Constraint Programming, pp. 115–129. Springer, Berlin (2011)

  11. Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Optimization and controlled systems: a case study on thermal aware workload dispatching. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 427–433 (2012)

  12. Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A.V., Criminisi, A.: Measuring neural net robustness with constraints. In: Advances in Neural Information Processing Systems, pp. 2613–2621 (2016)

  13. Belotti, P., Bonami, P., Fischetti, M., Lodi, A., Monaci, M., Nogales-Gomez, A., Salvagnin, D.: On handling indicator constraints in mixed integer programming. Comput. Optim. Appl. 65(3), 545–566 (2016)

  14. Bertsimas, D., Tsitsiklis, J.: Introduction to Linear Optimization. Athena Scientific, Belmont, MA (1997)

  15. Bienstock, D., Muñoz, G., Pokutta, S.: Principled deep neural network training through linear programming (2018). arXiv preprint arXiv:1810.03218

  16. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)

  17. Bonami, P., Lodi, A., Tramontani, A., Wiese, S.: On mathematical programming with indicator constraints. Math. Program. 151(1), 191–223 (2015)

  18. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2559–2566 (2010)

  19. Bunel, R., Turkaslan, I., Torr, P.H., Kohli, P., Kumar, M.P.: A unified view of piecewise linear neural network verification. In: Advances in Neural Information Processing Systems (2018)

  20. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57 (2017)

  21. Chen, L., Ma, W., Natarajan, K., Simchi-Levi, D., Yan, Z.: Distributionally robust linear and discrete optimization with marginals. Available at SSRN 3159473 (2018)

  22. Cheng, C.H., Nührenberg, G., Ruess, H.: Maximum resilience of artificial neural networks. In: International Symposium on Automated Technology for Verification and Analysis. Springer, Cham (2017)

  23. Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., Coppin, B.: Deep reinforcement learning in large discrete action spaces (2015). arxiv:1512.07679

  24. Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Output range analysis for deep feedforward neural networks. In: NASA Formal Methods Symposium (2018)

  25. Dvijotham, K., Gowal, S., Stanforth, R., Arandjelovic, R., O’Donoghue, B., Uesato, J., Kohli, P.: Training verified learners with learned verifiers (2018). arxiv:1805.10265

  26. Dvijotham, K., Stanforth, R., Gowal, S., Mann, T., Kohli, P.: A dual approach to scalable verification of deep networks. In: Thirty-Fourth Conference Annual Conference on Uncertainty in Artificial Intelligence (2018)

  27. Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks. In: International Symposium on Automated Technology for Verification and Analysis. Springer, Cham (2017)

  28. Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., Madry, A.: Exploring the landscape of spatial robustness. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 1802–1811. PMLR, Long Beach, CA (2019). http://proceedings.mlr.press/v97/engstrom19a.html. Accessed 6 Feb 2020

  29. Fischetti, M., Jo, J.: Deep neural networks and mixed integer linear optimization. Constraints 23, 296–309 (2018)

  30. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style (2015). arxiv:1508.06576

  31. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)

  32. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)

  33. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28, pp. 1319–1327 (2013)

  34. Grimstad, B., Andersson, H.: ReLU networks as surrogate models in mixed-integer linear programs. Comput. Chem. Eng. 131, 106580 (2019)

  35. Haneveld, W.K.K.: Robustness against dependence in PERT: an application of duality and distributions with known marginals. In: Stochastic Programming 84 Part I, pp. 153–182. Springer (1986)

  36. Hanin, B.: Universal function approximation by deep neural nets with bounded width and ReLU activations (2017). arXiv preprint arXiv:1708.02691

  37. Hijazi, H., Bonami, P., Cornuéjols, G., Ouorou, A.: Mixed-integer nonlinear programs featuring “on/off” constraints. Comput. Optim. Appl. 52(2), 537–558 (2012)

  38. Hijazi, H., Bonami, P., Ouorou, A.: A note on linear on/off constraints (2014). http://www.optimization-online.org/DB_FILE/2014/04/4309.pdf. Accessed 6 Feb 2020

  39. Huber, B., Rambau, J., Santos, F.: The Cayley trick, lifting subdivisions and the Bohne-Dress theorem on zonotopal tilings. J. Eur. Math. Soc. 2(2), 179–198 (2000)

  40. Huchette, J.: Advanced mixed-integer programming formulations: methodology, computation, and application. Ph.D. thesis, Massachusetts Institute of Technology (2018)

  41. Jeroslow, R., Lowe, J.: Modelling with integer variables. Math. Program. Study 22, 167–184 (1984)

  42. Jeroslow, R.G.: Alternative formulations of mixed integer programs. Ann. Oper. Res. 12, 241–276 (1988)

  43. Katz, G., Barrett, C., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: an efficient SMT solver for verifying deep neural networks. In: International Conference on Computer Aided Verification, pp. 97–117 (2017)

  44. Khalil, E.B., Gupta, A., Dilkina, B.: Combinatorial attacks on binarized neural networks. In: International Conference on Learning Representations (2019)

  45. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2014). arxiv:1412.6980

  46. Korte, B., Vygen, J.: Combinatorial Optimization: Theory and Algorithms. Springer, Berlin (2000)

  47. Kumar, A., Serra, T., Ramalingam, S.: Equivalent and approximate transformations of deep neural networks (2019). arxiv:1905.11428

  48. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  49. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)

  50. Liu, C., Arnon, T., Lazarus, C., Barrett, C., Kochenderfer, M.J.: Algorithms for verifying deep neural networks (2019). arxiv:1903.06758

  51. Lombardi, M., Gualandi, S.: A Lagrangian propagator for artificial neural networks in constraint programming. Constraints 21(4), 435–462 (2016)

  52. Lombardi, M., Milano, M.: Boosting combinatorial problem modeling with machine learning. In: Proceedings IJCAI, pp. 5472–5478 (2018)

  53. Lombardi, M., Milano, M., Bartolini, A.: Empirical decision model learning. Artif. Intell. 244, 343–367 (2017)

  54. Lomuscio, A., Maganti, L.: An approach to reachability analysis for feed-forward ReLU neural networks (2017). arxiv:1706.07351

  55. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language (2013)

  56. Mladenov, M., Boutilier, C., Schuurmans, D., Elidan, G., Meshi, O., Lu, T.: Approximate linear programming for logistic Markov decision processes. In: Proceedings of the Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 2486–2493. Melbourne, Australia (2017)

  57. Mordvintsev, A., Olah, C., Tyka, M.: Inceptionism: Going deeper into neural networks (2015). https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html. Accessed 6 Feb 2020

  58. Natarajan, K., Song, M., Teo, C.P.: Persistency model and its applications in choice modeling. Manage. Sci. 55(3), 453–469 (2009)

  59. Olah, C., Mordvintsev, A., Schubert, L.: Feature visualization. Distill (2017). https://distill.pub/2017/feature-visualization. Accessed 6 Feb 2020

  60. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The limitations of deep learning in adversarial settings. In: IEEE European Symposium on Security and Privacy, pp. 372–387 (2016)

  61. Raghunathan, A., Steinhardt, J., Liang, P.: Semidefinite relaxations for certifying robustness to adversarial examples. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp. 10900–10910. Curran Associates Inc. (2018)

  62. Ryu, M., Chow, Y., Anderson, R., Tjandraatmadja, C., Boutilier, C.: CAQL: Continuous action Q-learning (2019). arxiv:1909.12397

  63. Salman, H., Yang, G., Zhang, H., Hsieh, C.J., Zhang, P.: A convex relaxation barrier to tight robustness verification of neural networks (2019). arxiv:1902.08722

  64. Say, B., Wu, G., Zhou, Y.Q., Sanner, S.: Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 750–756 (2017)

  65. Schweidtmann, A.M., Mitsos, A.: Global deterministic optimization with artificial neural networks embedded. J. Optim. Theory Appl. 180, 925–948 (2019)

  66. Serra, T., Ramalingam, S.: Empirical bounds on linear regions of deep rectifier networks (2018). arxiv:1810.03370

  67. Serra, T., Tjandraatmadja, C., Ramalingam, S.: Bounding and counting linear regions of deep neural networks. In: Thirty-Fifth International Conference on Machine Learning (2018)

  68. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations (2014)

  69. Tawarmalani, M., Sahinidis, N.: Convexification and Global Optimization in Continuous and Mixed-Integer Nonlinear Programming: Theory, Algorithms, Software and Applications, vol. 65. Springer, Berlin (2002)

  70. Tjeng, V., Xiao, K., Tedrake, R.: Verifying neural networks with mixed integer programming. In: International Conference on Learning Representations (2019)

  71. Trespalacios, F., Grossmann, I.E.: Improved big-M reformulation for generalized disjunctive programs. Comput. Chem. Eng. 76, 98–103 (2015)

  72. Vielma, J.P.: Mixed integer linear programming formulation techniques. SIAM Rev. 57(1), 3–57 (2015)

  73. Vielma, J.P.: Embedding formulations and complexity for unions of polyhedra. Manage. Sci. 64(10), 4471–4965 (2018)

  74. Vielma, J.P.: Small and strong formulations for unions of convex sets from the Cayley embedding. Math. Program. 177, 21–53 (2018)

  75. Vielma, J.P., Nemhauser, G.: Modeling disjunctive constraints with a logarithmic number of binary variables and constraints. Math. Program. 128(1–2), 49–72 (2011)

  76. Weibel, C.: Minkowski sums of polytopes: combinatorics and computation. Ph.D. thesis, École Polytechnique Fédérale de Lausanne (2007)

  77. Weiss, G.: Stochastic bounds on distributions of optimal value functions with applications to PERT, network flows and reliability. Oper. Res. 34(4), 595–605 (1986)

  78. Wong, E., Kolter, J.Z.: Provable defenses against adversarial examples via the convex outer adversarial polytope. In: International Conference on Machine Learning (2018)

  79. Wong, E., Schmidt, F., Metzen, J.H., Kolter, J.Z.: Scaling provable adversarial defenses. In: 32nd Conference on Neural Information Processing Systems (2018)

  80. Wu, G., Say, B., Sanner, S.: Scalable planning with TensorFlow for hybrid nonlinear domains. In: Advances in Neural Information Processing Systems, pp. 6276–6286 (2017)

  81. Xiao, K.Y., Tjeng, V., Shafiullah, N.M., Madry, A.: Training for faster adversarial robustness verification via inducing ReLU stability. In: International Conference on Learning Representations (2019)

  82. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolution network (2015). arxiv:1505.00853

  83. Zeng, H., Edwards, M.D., Liu, G., Gifford, D.K.: Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics 32(12), 121–127 (2016)

Acknowledgements

The authors gratefully acknowledge Yeesian Ng and Ondřej Sýkora for many discussions on the topic of this paper, and for their work on the development of the tf.opt package used in the computational experiments.

Author information

Corresponding author

Correspondence to Joey Huchette.

Additional information

An extended abstract version of this paper appeared in [4].

A tight big-M formulation for Max on polyhedral domains

We present a tightened big-M formulation for the maximum of d affine functions over an arbitrary polytope input domain. The formulation can be viewed as a relaxation of the system in Proposition 4 in which we select only d inequalities from each of (10a) and (10b): those corresponding to \({\overline{\alpha }}, \underline{\alpha } \in \{ w^1, \ldots , w^d \}\). This subset still yields a valid formulation, and it obviates the need for a separation procedure. The formulation can also be viewed as an application of Proposition 6.2 of Vielma [72], and it is similar to the big-M formulations for generalized disjunctive programs of Trespalacios and Grossmann [71].

Proposition 15

Take coefficients N such that, for each \(\ell , k \in [\![d ]\!]\) with \(\ell \ne k\),

$$\begin{aligned} N^{\ell ,k,+}&\geqslant \max _{x^k \in D_{|k}}\{(w^k - w^\ell ) \cdot x^k\} \end{aligned}$$
(32a)
$$\begin{aligned} N^{\ell ,k,-}&\leqslant \min _{x^k \in D_{|k}}\{(w^k - w^\ell ) \cdot x^k\} , \end{aligned}$$
(32b)

and \(N^{k,k,+} = N^{k,k,-} = 0\) for all \(k \in [\![d ]\!]\). Then a valid formulation for \({\text {gr}}(\texttt {Max}{} \circ (f^1,\ldots ,f^d); D)\) is:

$$\begin{aligned}&y \leqslant w^\ell \cdot x + \sum _{k=1}^d (N^{\ell ,k,+} + b^k) z_k\quad \forall \ell \in [\![d ]\!] \end{aligned}$$
(33a)
$$\begin{aligned}&y \geqslant w^\ell \cdot x + \sum _{k=1}^d (N^{\ell ,k,-} + b^k) z_k\quad \forall \ell \in [\![d ]\!] \end{aligned}$$
(33b)
$$\begin{aligned}&(x,y,z) \in D \times {\mathbb {R}}\times \varDelta ^d \end{aligned}$$
(33c)
$$\begin{aligned}&z \in \{0,1\}^d \end{aligned}$$
(33d)

The tightest possible coefficients in (32) can be computed exactly by solving an LP for each ordered pair of input affine functions \(\ell \ne k\). While this may be prohibitively expensive when d is large, it is potentially viable when d is a small fixed constant. For example, the max pooling neuron computes the maximum over a rectangular window of a larger array [32, Sect. 9.3] and is frequently used in image classification architectures. Max pooling units typically use a \(2 \times 2\) or a \(3 \times 3\) window, in which case \(d=4\) or \(d=9\), respectively.
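
The LPs in (32) are straightforward to set up with an off-the-shelf solver. Below is a minimal sketch using gurobipy, under the assumption that \(D = \{x : Ax \leqslant b\}\) and that \(D_{|k}\) denotes the portion of D on which the k-th affine function attains the maximum; the helper name and its interface are illustrative, not the authors' implementation.

import gurobipy as gp
from gurobipy import GRB

def bigM_coefficient(A, b, w, bias, ell, k, sense):
    # Optimize (w^k - w^ell) . x over D_{|k} to obtain N^{ell,k,+} (sense=GRB.MAXIMIZE)
    # or N^{ell,k,-} (sense=GRB.MINIMIZE).
    d, eta = len(w), len(w[0])
    m = gp.Model()
    m.Params.OutputFlag = 0
    x = m.addVars(eta, lb=-GRB.INFINITY)
    m.addConstrs(gp.quicksum(A[i][j] * x[j] for j in range(eta)) <= b[i]
                 for i in range(len(b)))                         # x in D
    for j in range(d):                                           # x in D_{|k}: the k-th function is largest
        if j != k:
            m.addConstr(gp.quicksum((w[k][t] - w[j][t]) * x[t] for t in range(eta))
                        >= bias[j] - bias[k])
    m.setObjective(gp.quicksum((w[k][t] - w[ell][t]) * x[t] for t in range(eta)), sense)
    m.optimize()
    if m.Status != GRB.OPTIMAL:
        return None    # D_{|k} is empty (LP infeasible): the k-th function can be pruned (see below)
    return m.ObjVal

Computing all coefficients for a single neuron requires 2d(d-1) such LPs (a maximization and a minimization for each ordered pair \(\ell \ne k\)); for a \(2 \times 2\) max pooling window this amounts to 24 small LPs.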

In addition, if in practice we observe that the set \(D_{|k}\) is empty, then the k-th input function is never the maximum; the neuron is therefore not irreducible, and we can safely prune that function. In particular, if the LPs used to compute the coefficients for \(z_k\) are proven infeasible, we can prune the k-th function.
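
Given coefficients N satisfying (32), with any functions whose \(D_{|k}\) is empty already pruned, formulation (33) can be written down directly. The sketch below uses the same gurobipy style and illustrative names as above; N_plus and N_minus are d-by-d arrays of coefficients with zero diagonals, as required by the proposition.

import gurobipy as gp
from gurobipy import GRB

def build_max_formulation(m, A, b, w, bias, N_plus, N_minus):
    # Model y = max_k (w^k . x + b^k) over D = {x : Ax <= b} via constraints (33a)-(33d).
    d, eta = len(w), len(w[0])
    x = m.addVars(eta, lb=-GRB.INFINITY, name="x")
    y = m.addVar(lb=-GRB.INFINITY, name="y")
    z = m.addVars(d, vtype=GRB.BINARY, name="z")                       # (33d)
    m.addConstrs(gp.quicksum(A[i][j] * x[j] for j in range(eta)) <= b[i]
                 for i in range(len(b)))                               # x in D, part of (33c)
    m.addConstr(gp.quicksum(z[k] for k in range(d)) == 1)              # z in Delta^d, part of (33c)
    for ell in range(d):
        wx = gp.quicksum(w[ell][j] * x[j] for j in range(eta))
        m.addConstr(y <= wx + gp.quicksum((N_plus[ell][k] + bias[k]) * z[k]
                                          for k in range(d)))          # (33a)
        m.addConstr(y >= wx + gp.quicksum((N_minus[ell][k] + bias[k]) * z[k]
                                          for k in range(d)))          # (33b)
    return x, y, z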

Cite this article

Anderson, R., Huchette, J., Ma, W. et al. Strong mixed-integer programming formulations for trained neural networks. Math. Program. 183, 3–39 (2020). https://doi.org/10.1007/s10107-020-01474-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10107-020-01474-5

Keywords

Mathematics Subject Classification

Navigation