Picket: guarding against corrupted data in tabular data during learning and inference

Liu, Zifan; Zhou, Zhechun; Rekatsinas, Theodoros

doi:10.1007/s00778-021-00699-w

Special Issue Paper
Published: 12 October 2021

Volume 31, pages 927–955 (2022)

The VLDB Journal

Abstract

Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a simple framework to safeguard against data corruption during both training and deployment of machine learning models over tabular data. For the training stage, Picket identifies and removes corrupted data points from the training data to avoid obtaining a biased model. For the deployment stage, Picket flags, in an online manner, corrupted query points submitted to a trained machine learning model that, because of noise, would result in incorrect predictions. To detect corrupted data, Picket uses a self-supervised deep learning model for mixed-type tabular data, which we call PicketNet. To minimize the burden of deployment, learning a PicketNet model does not require any human-labeled data. Picket is designed as a plugin that can increase the robustness of any machine learning pipeline. We evaluate Picket on a diverse array of real-world data, considering different corruption models that include systematic and adversarial noise during both training and testing. We show that Picket consistently safeguards against corrupted data during both training and deployment of various models ranging from SVMs to neural networks, beating a diverse array of competing methods that span from data quality validation models to robust outlier detection models.
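
The two stages described above (filter the training set, then flag query points online) can be illustrated with a minimal sketch. This is not PicketNet and not the authors' implementation: the self-supervised deep model is replaced here by a simple robust z-score (median and median absolute deviation per column), the data are assumed to be purely numeric, and all function names and the threshold value are hypothetical choices made only for illustration.

```python
from statistics import median


def fit_column_stats(rows):
    """Compute (median, MAD) for every column of a numeric table."""
    stats = []
    for col in zip(*rows):
        med = median(col)
        # Fall back to 1.0 when the MAD is zero to avoid dividing by zero.
        mad = median(abs(v - med) for v in col) or 1.0
        stats.append((med, mad))
    return stats


def corruption_score(row, stats):
    """Stand-in score: the largest robust z-score among the row's cells."""
    return max(abs(v - med) / mad for v, (med, mad) in zip(row, stats))


def filter_training_data(rows, threshold=5.0):
    """Training stage: drop rows whose score exceeds the threshold."""
    stats = fit_column_stats(rows)
    clean = [r for r in rows if corruption_score(r, stats) <= threshold]
    return clean, stats


def flag_query(row, stats, threshold=5.0):
    """Deployment stage: return True if a query row looks corrupted."""
    return corruption_score(row, stats) > threshold


# Toy table: the last row carries an implausible value in its first column.
train = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [1.1, 10.5], [50.0, 10.2]]
clean_train, col_stats = filter_training_data(train)
suspicious = flag_query([1.0, 99.0], col_stats)
```

In this sketch the detector is fitted without any labels, which mirrors the label-free setting of the abstract only in spirit; a downstream model would be trained on `clean_train`, and `flag_query` would be called on each incoming query before prediction.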

Availability of data and material

Data are open source.

Acknowledgements

This work was supported by the National Science Foundation under Grants 1755676 and 1815538 and DARPA under Grant ASKE HR00111990013. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA or the US Government.

Funding

This work was supported by the National Science Foundation under Grants 1755676 and 1815538 and Defense Advanced Research Projects Agency under Grant ASKE HR00111990013.

Author information

Authors and Affiliations

University of Wisconsin-Madison, Wisconsin, USA
Zifan Liu & Theodoros Rekatsinas
University of Southern California, California, USA
Zhechun Zhou

Corresponding author

Correspondence to Zifan Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

The code with data is available at https://github.com/rekords-uw/Picket

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zhechun Zhou: Work done at University of Wisconsin-Madison.

About this article

Cite this article

Liu, Z., Zhou, Z. & Rekatsinas, T. Picket: guarding against corrupted data in tabular data during learning and inference. The VLDB Journal 31, 927–955 (2022). https://doi.org/10.1007/s00778-021-00699-w

Received: 20 February 2021
Revised: 08 July 2021
Accepted: 30 August 2021
Published: 12 October 2021
Issue Date: September 2022
DOI: https://doi.org/10.1007/s00778-021-00699-w
