A Two-Stage Authorship Attribution Method Using Text and Structured Data for De-Anonymizing User-Generated Content

Schneider, Matthew J.; Mankad, Shawn

doi:10.1007/s40547-021-00116-x

Research Article
Published: 06 August 2021

Volume 8, pages 66–83, (2021)

Customer Needs and Solutions

Abstract

User-generated content (UGC) is an important source of information on products and services for consumers and firms. Although incentivizing high-quality UGC is an important business objective for any content platform, we show that it is also possible to identify anonymous posters by exploiting the characteristics of posted content. We present a novel two-stage authorship attribution methodology that combines structured and text data by identifying an author first by the amount and granularity of structured data (e.g., location, first name) posted with the UGC and second by the author’s writing style. As a case study, we show that 75% of the 1.3 million users in data publicly released by Yelp are uniquely identified by three structured variable combinations. For the remaining 25%, when the number of potential authors with (nearly) identically structured data ranges from 100 to 5 and sufficient training data exists for text analysis, the average probabilities of identification range from 40 to 81%. Our findings suggest that UGC platforms concerned with the potential negative effects of privacy-related incidents should limit or generalize their posters’ structured data when it is adjoined with textual content or mentioned in the text itself. We also show that although protection policies that focus on structured data remove the most predictive elements of authorship, they also have a small negative effect on the usefulness of content.
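
As a rough illustration of the first stage described above, the sketch below measures what share of users is uniquely identified by a combination of structured variables and lists the candidate authors that remain for one posting. The toy records, the field names (`first_name`, `city`), and the helper functions are hypothetical; none of them come from the paper or from the Yelp data.

```python
from collections import Counter

# Hypothetical toy records standing in for a platform's public user data.
users = [
    {"user_id": "u1", "first_name": "Ana", "city": "Phoenix"},
    {"user_id": "u2", "first_name": "Ana", "city": "Las Vegas"},
    {"user_id": "u3", "first_name": "Ben", "city": "Phoenix"},
    {"user_id": "u4", "first_name": "Ben", "city": "Phoenix"},
    {"user_id": "u5", "first_name": "Cara", "city": "Toronto"},
]


def class_sizes(records, keys):
    """Count how many records share each combination of the given keys."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def share_unique(records, keys):
    """Fraction of records whose key combination appears exactly once."""
    sizes = class_sizes(records, keys)
    unique = sum(1 for r in records if sizes[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


def candidates(records, keys, observed):
    """Stage one: users whose structured data matches the observed values."""
    return [
        r["user_id"]
        for r in records
        if all(r[k] == observed[k] for k in keys)
    ]


share_by_name = share_unique(users, ["first_name"])
share_by_name_city = share_unique(users, ["first_name", "city"])
remaining = candidates(
    users, ["first_name", "city"], {"first_name": "Ben", "city": "Phoenix"}
)
```

In this toy data, adding `city` to `first_name` raises the share of uniquely identified users, and the two users who still share identical structured data form the candidate set that a second, text-based stage would have to separate.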

Notes

As reviewed in Shu et al. [57], this problem is known by a number of names, including User Identity Linkage, Social Identity Linkage, User Identity Resolution, Social Network Reconciliation, User Account Linkage Inference, Profile Linkage, Anchor Link Prediction, and Detecting me edges.
We explored several different kernels for SVM, including polynomial (2nd and 3rd order) and other nonlinear specifications. A linear kernel achieved the best results and is therefore presented throughout the paper.
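
To make the second, text-based stage more concrete, the sketch below ranks a small set of candidate authors by writing style. It does not reproduce the linear-kernel SVM used in the paper. As a simpler stand-in, it compares character-trigram profiles by cosine similarity. The texts, author labels, and function names are all hypothetical.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Character-trigram counts, used here as a simple stylometric feature set."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(anonymous_text, known_texts):
    """Rank candidate authors by similarity to the anonymous text (best first)."""
    target = trigram_profile(anonymous_text)
    scores = {}
    for author, texts in known_texts.items():
        profile = Counter()
        for t in texts:
            profile.update(trigram_profile(t))
        scores[author] = cosine(target, profile)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical training texts for two candidates left after a first stage.
known = {
    "u3": ["omg!!! loved it!!! sooo good!!!"],
    "u4": ["The service was adequate; however, the entree was cold."],
}
ranked = attribute("omg!!! sooo tasty!!! loved it!!!", known)
```

A nearest-profile ranking like this is only meant to show the shape of the second stage. A trained classifier such as the linear SVM mentioned in the note would normally replace the `attribute` function.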

References

Abbasi A, Chen H (2008) Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transact Inform Syst (TOIS) 26(2):1–29
Abbasi A, Chen H, Nunamaker JF (2008) Stylometric identification in electronic markets: scalability and robustness. J Manag Inf Syst 25(1):49–78
Aggarwal CC, Yu PS (2008) A general survey of privacy-preserving data mining models and algorithms. In: Privacy-preserving data mining. Springer, Boston, pp 11–52
Ahn D-Y, Duan JA, Mela CF (2015) Managing user-generated content: a dynamic rational expectations equilibrium approach. Mark Sci 35(2):284–303
Almishari M, Tsudik G (2012) Exploring linkability of user reviews. In: European Symposium on Research in Computer Security. Springer, Berlin, pp 307–324
AMZ Tracker, 2018. How to deal with negative reviews. URL: https://www.amztracker.com/blog/deal-negative-reviews/. Accessed: July 24, 2020.
André Q, Carmon Z, Wertenbroch K, Crum A, Frank D, Goldstein W, Huber J, Van Boven L, Weber B, Yang H (2018) Consumer choice and autonomy in the age of artificial intelligence and big data. Cust Needs Solut 5(1):28–37
Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6(Sep):1579–1619
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Brennan M, Afroz S, Greenstadt R (2012) Adversarial stylometry: circumventing authorship recognition to preserve privacy and anonymity. ACM Transac Inform Syst Secur (TISSEC) 15(3):1–22
Brizan DG, Tansel AU (2006) A survey of entity resolution and record linkage methodologies. Commun IIMA 6(3):5
Büschken J, Allenby GM (2016) Sentence-based text analysis for customer reviews. Mark Sci 35(6):953–975
Campbell J, Goldfarb A, Tucker C (2015) Privacy regulation and market structure. J Econ Manag Strateg 24(1):47–73
Caselaw, (2017). ZL TECHNOLOGIES INC v. GLASSDOOR INC. Court of Appeal, First District, Division 4, California. URL: https://caselaw.findlaw.com/ca-court-of-appeal/1868279.html. Accessed July 24, 2020.
De Jong MG, Pieters R, Fox JP (2010) Reducing social desirability bias through item randomized response: an application to measure underreported desires. J Mark Res 47(1):14–27
De Montjoye YA, Hidalgo CA, Verleysen M, Blondel VD (2013) Unique in the crowd: the privacy bounds of human mobility. Sci Rep 3(1):1376
De Montjoye YA, Radaelli L, Singh VK (2015) Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347(6221):536–539
Douglas DM (2016) Doxing: a conceptual analysis. Ethics Inf Technol 18(3):199–210
DuBay WH, (2004). The principles of readability. Accessed April 7, 2020. http://en.copian.ca/library/research/readab/readab.pdf.
Elmagarmid AK, Ipeirotis PG, Verykios VS (2006) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16
Farr C, (2018). Facebook sent a doctor on a secret mission to ask hospitals to share data. CNBC. URL: https://www.cnbc.com/2018/04/05/facebook-building-8-explored-data-sharing-agreement-with-hospitals.html. Accessed: July 24, 2020.
Getoor L, Machanavajjhala A (2012) Entity resolution: theory, practice & open challenges. Proc VLDB Endowment 5(12):2018–2019
Ghose A, Ipeirotis PG (2010) Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics. IEEE Trans Knowl Data Eng 23(10):1498–1512
Goldfarb A, Tucker C (2013) Why managing consumer privacy can be an opportunity. MIT Sloan Manag Rev 54(3):10
Gravano L, Ipeirotis PG, Koudas N and Srivastava D, (2003). Text joins for data cleansing and integration in an rdbms. In Proceedings 19th International Conference on Data Engineering (Cat. No. 03CH37405) (pp. 729-731). IEEE.
Hewett K, Rand W, Rust RT, van Heerde HJ (2016) Brand buzz in the echoverse. J Mark 80(3):1–24
Hill S, Provost F (2003) The myth of the double-blind review? Author identification using only citations. Acm Sigkdd Explor Newslett 5(2):179–184
Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415–425
Hu M and Liu B, 2004. Mining opinion features in customer reviews. In AAAI (Vol. 4, No. 4, pp. 755-760).
Jones R, (2017). Court rules Yelp must identify anonymous user in defamation case. Gizmodo. URL: https://gizmodo.com/court-rules-yelp-must-identify-anonymous-user-in-defama-1820433103. Accessed: July 24, 2020.
Juola P (2012) Large-scale experiments in authorship attribution. Engl Stud 93(3):275–283
Juola P and Vescovi D, (2010). Empirical evaluation of authorship obfuscation using JGAAP. In Proceedings of the 3rd ACM workshop on Artificial Intelligence and Security (pp. 14-18).
Kincaid JP, Fishburne Jr RP, Rogers RL and Chissom BS, (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch.
Klemko R (2021) A small group of sleuths has been identifying right-wing extremists long before the attack on the Capitol. URL: https://www.washingtonpost.com/national-security/antifa-far-right-doxing-identities/2021/01/10/41721de0-4dd7-11eb-bda4-615aaefd0555_story.html. Accessed January 2, 2021.
Koppel M, Schler J, Argamon S (2009) Computational methods in authorship attribution. J Am Soc Inf Sci Technol 60(1):9–26
Kourtis I, Stamatatos E (2011) Author identification using semi-supervised learning. In: CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers). Amsterdam, The Netherlands
Krishnamoorthy S (2015) Linguistic features for review helpfulness prediction. Expert Syst Appl 42(7):3751–3759
Kroft S, (2014). The data brokers: selling your personal information. 60 Minutes. URL: https://www.cbsnews.com/news/the-data-brokers-selling-your-personal-information/. Accessed: July 24, 2020.
Kumar V, Reinartz W (2018) Customer privacy concerns and privacy protective responses. In: Customer relationship management. Springer, Berlin, pp 285–309
Li XB, Qin J (2017) Anonymizing and sharing medical text records. Inf Syst Res 28(2):332–352
Li XB, Sarkar S (2006) Privacy protection in data mining: a perturbation approach for categorical data. Inf Syst Res 17(3):254–270
Mankad S, Han HS, Goh J, Gavirneni S (2016) Understanding online hotel reviews through automated text analysis. Serv Sci 8(2):124–138
Martin KD, Murphy PE (2017) The role of data privacy in marketing. J Acad Mark Sci 45(2):135–155
Menon S, Sarkar S (2016) Privacy and big data: scalable approaches to sanitize large transactional databases for sharing. MIS Q 40(4):963–981
Moe WW, Schweidel DA (2012) Online product opinions: incidence, evaluation, and evolution. Mark Sci 31(3):372–386
Narayanan A, Paskov H, Gong NZ, Bethencourt J, Stefanov E, Shin ECR and Song D, (2012). On the feasibility of internet-scale author identification. In 2012 IEEE Symposium on Security and Privacy (pp. 300-314). IEEE.
Narayanan A and Shmatikov V, 2008, May. Robust de-anonymization of large datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.
The Associated Press, (2017). Yelp says lawsuit might eliminate all negative reviews. New York Daily News. URL: https://www.nydailynews.com/news/national/yelp-lawsuit-eliminate-negative-reviews-article-1.2796087. Accessed July 24, 2020.
Payer M, Huang L, Gong NZ, Borgolte K, Frank M (2014) What you submit is who you are: a multimodal approach for deanonymizing scientific publications. IEEE Transact Inform Forensics Secur 10(1):200–212
Peer E, Vosgerau J, Acquisti A (2014) Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav Res Methods 46(4):1023–1031
Porter J, (2019). Fraudulent Yelp posting protected under the law, ridiculous. Tahoe Daily Tribune, May 20, 2019. URL: https://www.tahoedailytribune.com/news/jim-porter-fraudulent-yelp-posting-protected-under-the-law-ridiculous/. Accessed July 24, 2020.
Proserpio D, Zervas G (2017) Online reputation management: estimating the impact of management responses on consumer reviews. Mark Sci 36(5):645–665
Qian T, Liu B, Chen L and Peng Z, (2014). Tri-training for authorship attribution with limited training data. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 345-351).
Rochet JC, Tirole J (2003) Platform competition in two-sided markets. J Eur Econ Assoc 1(4):990–1029
Schneider MJ, Jagpal S, Gupta S, Li S, Yu Y (2017) Protecting customer privacy when marketing with second-party data. Int J Res Mark 34(3):593–603
Schneider MJ, Jagpal S, Gupta S, Li S, Yu Y (2018) A flexible method for protecting marketing data: an application to point-of-sale data. Mark Sci 37:153–171. https://doi.org/10.1287/mksc.2017.1064
Shu K, Wang S, Tang J, Zafarani R, Liu H (2017) User identity linkage across online social networks: a review. Acm Sigkdd Explor Newslett 18(2):5–17
Singh JP, Irani S, Rana NP, Dwivedi YK, Saumya S, Roy PK (2017) Predicting the “helpfulness” of online consumer reviews. J Bus Res 70:346–355
Snyder P, Doerfler P, Kanich C and McCoy D, (2017). Fifteen minutes of unwanted fame: detecting and characterizing doxing. In Proceedings of the 2017 Internet Measurement Conference (pp. 432-444).
Stamatatos E (2009) A survey of modern authorship attribution methods. J Am Soc Inf Sci Technol 60(3):538–556
Steyvers M, Griffiths T (2007) Probabilistic topic models. Handb Latent Semantic Anal 427(7):424–440
Stone EF, Spool MD, Rabinowitz S (1977) Effects of anonymity and retaliatory potential on student evaluations of faculty performance. Res High Educ 6(4):313–325
Sweeney L (2000) Simple demographics often identify people uniquely. Health (San Francisco) 671(2000):1–34
Sweeney L (2002a) k-anonymity: a model for protecting privacy. Int J Uncertaint Fuzziness Knowl-Based Syst 10(05):557–570
Sweeney L (2002b) Achieving k-anonymity privacy protection using generalization and suppression. Int J Uncertaint Fuzziness Knowl-Based Syst 10(05):571–588
Tirunillai S, Tellis GJ (2014) Mining marketing meaning from online chatter: strategic brand analysis of big data using latent dirichlet allocation. J Mark Res 51(4):463–479
Turjeman D and Feinberg FM, (2019). When the data are out: measuring behavioral changes following a data breach. Available at SSRN 3427254.
Tweedie FJ, Baayen RH (1998) How variable may a constant be? Measures of lexical richness in perspective. Comput Hum 32(5):323–352
US Census Bureau, (2016). Decennial Census Surname Files (2010, 2000). URL: https://www.census.gov/data/developers/data-sets/surnames.html. Accessed July 24, 2020.
US Social Security Administration, (2019). Baby names from social security card applications - national data. Data.gov. URL: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data. Accessed: July 24, 2020.
Wedel M, Kannan PK (2016) Marketing analytics for data-rich environments. J Mark 80(6):97–121
Winkler WE, (1999). The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau.
Xia D, Mankad S, Michailidis G (2016) Measuring influence of users in Twitter ecosystems using a counting process modeling framework. Technometrics 58(3):360–370
Xu J, Ding M (2019) Using the double transparency of autonomous vehicles to increase fairness and social welfare. Cust Needs Solut 6(1):26–35
Yelp, 2020. https://terms.yelp.com/privacy/en_us/20200101_en_us/#Controlling-Your-Personal-Data.
Yule, G.U., 1944. The statistical study of literary vocabulary. In Mathematical Proceedings of the Cambridge Philosophical Society (Vol. 42, pp. b1-9).
Zhang Y, Moe WW, Schweidel DA (2017) Modeling the role of message content and influencers in social media rebroadcasting. Int J Res Mark 34(1):100–119
Zhao Y, Yang S, Narayan V, Zhao Y (2013) Modeling consumer learning from online product reviews. Mark Sci 32(1):153–169

Acknowledgements

We are thankful to Elea Feit, Sachin Gupta, Cameron Bale, and Sharan Jagpal for their helpful comments on earlier versions of this paper.

Author information

Authors and Affiliations

Drexel University, Philadelphia, PA, 19104, USA
Matthew J. Schneider
Cornell University, Ithaca, NY, 14853, USA
Shawn Mankad

Corresponding author

Correspondence to Matthew J. Schneider.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Appendix. Expanded results for the Yelp data

Figure 4 provides out-of-sample accuracy results for the Yelp data. Accuracy consistently improves as the sophistication of the data intruder and the size of the training data increase.

About this article

Cite this article

Schneider, M.J., Mankad, S. A Two-Stage Authorship Attribution Method Using Text and Structured Data for De-Anonymizing User-Generated Content. Cust. Need. and Solut. 8, 66–83 (2021). https://doi.org/10.1007/s40547-021-00116-x

Accepted: 05 July 2021
Published: 06 August 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s40547-021-00116-x
