Beautiful Soup – Parsing only parts of a document

There are many situations where you need only specific kinds of information from a page (for example, only the <a> tags). The SoupStrainer class in Beautiful Soup 4 lets you parse just those parts of a document instead of the whole thing.

To do this, create a SoupStrainer and pass it as the parse_only argument to the BeautifulSoup constructor. Note that parse_only has no effect with the html5lib parser, so use html.parser or lxml.

SoupStrainer

A SoupStrainer tells Beautiful Soup which elements to keep, and the resulting parse tree consists of only those elements. If the information you need is confined to a specific part of the HTML, this can reduce parsing time and memory use, and later searches run over a smaller tree.

from bs4 import BeautifulSoup, SoupStrainer

product = SoupStrainer('div', {'id': 'products_list'})
soup = BeautifulSoup(html, 'html.parser', parse_only=product)

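The code above parses only the div element whose id is products_list, together with everything nested inside it; the rest of the page is skipped. Here html is assumed to hold the page's markup as a string.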
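To make the effect of the strainer concrete, here is a minimal sketch that runs it against a small page. The sample markup, the product names, and the "product" class are invented for illustration and are not taken from any real site.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample page, invented for this illustration.
html = """
<html><head><title>Shop</title></head>
<body>
<div id="header"><a href="/">Home</a></div>
<div id="products_list">
  <p class="product">Kettle</p>
  <p class="product">Toaster</p>
</div>
<div id="footer"><a href="/about">About</a></div>
</body></html>
"""

# Keep only the div with id "products_list" and its contents.
product = SoupStrainer("div", {"id": "products_list"})
soup = BeautifulSoup(html, "html.parser", parse_only=product)

# Elements inside the kept div are available as usual.
product_names = [p.get_text() for p in soup.find_all("p", class_="product")]

# Elements outside the kept div were never added to the tree.
links = soup.find_all("a")
title = soup.title

print(product_names)
print(len(links), title)
```

Because the header, footer, and title were never parsed into the tree, searching for them finds nothing. If you need anything outside the strained region, parse the document a second time without parse_only.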

In the same way, other SoupStrainer objects can select elements by tag name, by attribute value, or by string content. Here are some examples:

from bs4 import BeautifulSoup, SoupStrainer

# Keep only <a> tags
only_a_tags = SoupStrainer("a")

# Keep only tags whose id is one of the values listed below
# (my_document is assumed to hold the markup as a string)
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

# Keep only strings shorter than 10 characters
def is_short_string(string):
    return string is not None and len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)
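As a rough illustration of the tag-name and id strainers, the sketch below applies them to a small document. The contents of my_document are made up for this example; only the strainer definitions mirror the ones shown in this article.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document, invented for this illustration.
my_document = """
<html><body>
<p id="first">First paragraph with <a href="/one">one</a> link.</p>
<p id="second">Second paragraph, no links here.</p>
<p id="third">Third paragraph with <a href="/two">two</a> links.</p>
</body></html>
"""

# Strainer 1: keep only <a> tags.
only_a_tags = SoupStrainer("a")
a_soup = BeautifulSoup(my_document, "html.parser", parse_only=only_a_tags)
hrefs = [a["href"] for a in a_soup.find_all("a")]

# Strainer 2: keep only tags whose id is in the list.
by_id = SoupStrainer(id=["first", "third", "my_unique_id"])
id_soup = BeautifulSoup(my_document, "html.parser", parse_only=by_id)
kept_ids = [p["id"] for p in id_soup.find_all("p")]

print(hrefs)
print(kept_ids)
```

The paragraph with id "second" is dropped by the second strainer. An id in the list that never appears in the document, such as "my_unique_id", simply matches nothing.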
