Formalising Document Structure and Automatically Recognising Document Elements: A Case Study on Automobile Repair Manuals

  • Conference paper
Digital Libraries at the Crossroads of Digital Information for the Future (ICADL 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11853)

Abstract

Correct understanding of document structures is vital to enhancing various practices and applications, such as authoring and information retrieval. In this paper, with a focus on procedural documents of automobile repair manuals, we report a formalisation of document structure and an experiment to automatically classify sentences according to document structure. To formalise document structure, we first investigated what types of content are included in the target documents. Through manual annotation, we identified 26 indicative content categories (content elements). We then employed an existing standard for technical documents—the Darwin Information Typing Architecture—and formalised a detailed document structure for automobile repair tasks that specifies the arrangement of the content elements. We also examined the feasibility of automatic content element recognition in given documents by implementing sentence classifiers using multiple machine learning methods. The evaluation results revealed that support vector machine-based classifiers generally demonstrated high performance for in-domain data, while convolutional neural network-based classifiers are advantageous for out-of-domain data. We also conducted in-depth analyses to identify classification difficulties.
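As a rough illustration of the feature setting reported in Note 3 (uni-grams+bi-grams over segmented words), the sketch below counts both kinds of n-gram in a single sentence. It assumes the sentence has already been segmented into tokens (the paper used MeCab for Japanese). The function name `ngram_features` and the example tokens are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def ngram_features(tokens: Sequence[str]) -> Counter:
    """Count uni-grams and bi-grams in an already-segmented sentence."""
    tokens = list(tokens)
    # Represent uni-grams as 1-tuples so they never collide with bi-grams.
    unigrams = [(tok,) for tok in tokens]
    # Adjacent token pairs; empty when the sentence has fewer than two tokens.
    bigrams = list(zip(tokens, tokens[1:]))
    return Counter(unigrams + bigrams)


# Hypothetical, pre-segmented example sentence.
features = ngram_features(["remove", "the", "front", "bumper"])
```

Counts of this kind could then be turned into a sparse vector and passed to any of the classifiers mentioned in the abstract. The exact vectorisation used by the authors is not stated here.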

Notes

  1. We used the PRIUS Repair Manual (November 2017 version) provided by Toyota Motor Corporation.

  2. The feasibility of this annotation scheme will be validated by measuring inter-rater agreement in future work.

  3. For Japanese word segmentation, MeCab ver.0.996 (http://taku910.github.io/mecab) and mecab-ipa-NEologd (https://github.com/neologd/mecab-ipadic-neologd) were used. Although we tried three feature settings (uni-grams, bi-grams and uni-grams+bi-grams), we report only the uni-grams+bi-grams setting, which achieved the highest score, in this paper.

  4. We used publicly available pre-trained vectors: http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/.

  5. The F-score is the harmonic mean of precision and recall.
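To make the definition in Note 5 concrete, here is a minimal sketch that computes precision, recall and their harmonic mean for one category from raw counts. The function name `f_score` and the counts in the usage line are hypothetical. They are not results from the paper.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for a single category."""
    predicted = true_pos + false_pos  # sentences the classifier assigned to the category
    actual = true_pos + false_neg     # sentences that truly belong to the category
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; treat the score as 0
    precision = true_pos / predicted
    recall = true_pos / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 8/10 = 0.8, recall = 8/16 = 0.5.
score = f_score(true_pos=8, false_pos=2, false_neg=8)
```

Because the harmonic mean is pulled towards the smaller of the two values, a classifier cannot obtain a high F-score by maximising precision or recall alone.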

Acknowledgement

This work was partly supported by JSPS KAKENHI Grant Numbers 17H06733 and 19H05660. The automobile manuals used in this study were provided by Toyota Motor Corporation.

Author information

Correspondence to Hodai Sugino or Rei Miyata.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Sugino, H., Miyata, R., Sato, S. (2019). Formalising Document Structure and Automatically Recognising Document Elements: A Case Study on Automobile Repair Manuals. In: Jatowt, A., Maeda, A., Syn, S. (eds) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science, vol 11853. Springer, Cham. https://doi.org/10.1007/978-3-030-34058-2_23

  • DOI: https://doi.org/10.1007/978-3-030-34058-2_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34057-5

  • Online ISBN: 978-3-030-34058-2

  • eBook Packages: Computer Science (R0)
