Using Beautiful Soup

  • Chapter
Website Scraping with Python

Abstract

In this chapter, you will learn how to use Beautiful Soup, a lightweight Python library, to extract and navigate HTML content easily, so you can forget about overly complex regular expressions and text parsing.

Notes

  1. Unless you are lucky. Once I encountered a site where all the links to the remaining pages were there in the HTML code but had been hidden with some JS-magic.
  2. OOP: object-oriented programming
  3. For example, the Builder or Factory patterns, or a constructor with all arguments.
  4. https://docs.python.org/3/library/csv.html
  5. I have to admit, every time I write CSV files I use spamwriter as my variable’s name. I guess this gives me a global understanding of what’s happening.
  6. Set theory: https://en.wikipedia.org/wiki/Union_(set_theory)
  7. https://docs.python.org/3/library/json.html
  8. https://github.com/coleifer/peewee
  9. ORM: object-relational mapping
  10. I have worked with ORM tools since 2007, and I like the idea, but some queries can become quite complex.
  11. https://docs.mongodb.com/getting-started/python/
  12. Hard cache: Get all information from the cache, and if there are attempts to gather anything from the Internet, refuse them. This makes scraping a bit more consistent between runs.
  13. For more information, visit: https://blake2.net/
  14. Alternatively, to be more consistent, you can create a downloader, which hides the cache from the users of your code.
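
Notes 12 to 14 describe a hard cache, BLAKE2 hashing, and a downloader that hides the cache from its callers. The chapter's own code is not part of this preview, so the following is only a minimal sketch of how those three ideas might fit together. The names `CachedDownloader`, `CacheMiss`, `cache_key`, and the injected `fetch` function are hypothetical, and using a BLAKE2b digest of the URL as the cache file name is an assumption, not necessarily the author's approach.

```python
import hashlib
from pathlib import Path
from typing import Callable


class CacheMiss(Exception):
    """Raised in hard-cache mode when a URL is not in the cache."""


def cache_key(url: str) -> str:
    # Assumption: a BLAKE2b digest of the URL serves as a
    # filesystem-safe file name for the cached page.
    return hashlib.blake2b(url.encode("utf-8"), digest_size=16).hexdigest()


class CachedDownloader:
    """Hypothetical downloader that hides the cache from its callers."""

    def __init__(self, cache_dir, fetch: Callable[[str], str],
                 hard_cache: bool = False):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch            # function that really downloads a URL
        self.hard_cache = hard_cache  # True: never touch the Internet

    def get(self, url: str) -> str:
        path = self.cache_dir / cache_key(url)
        if path.exists():
            # Cache hit: serve the stored copy.
            return path.read_text(encoding="utf-8")
        if self.hard_cache:
            # Hard cache: refuse any attempt to go online.
            raise CacheMiss(url)
        content = self.fetch(url)
        path.write_text(content, encoding="utf-8")
        return content
```

Passing the real download function in as `fetch` keeps the sketch independent of any particular HTTP library. A caller only ever sees `get(url)`, which is the point of note 14.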

Copyright information

© 2018 Gábor László Hajba

About this chapter

Cite this chapter

Hajba, G.L. (2018). Using Beautiful Soup. In: Website Scraping with Python. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-3925-4_3
