Beautiful Soup – Modifying the Tree
Beyond searching the parse tree, Beautiful Soup lets you modify it, so you can reshape an HTML or XML document to your specifications. You can change a tag through attributes such as .name and .string, or add content to it with the .append() method. New strings and tags are created with the .new_string() and .new_tag() methods and then attached to existing tags. Other methods, such as .insert(), .insert_before(), and .insert_after(), place content at a specific position in the document.
Changing Tag Names and Attributes
Once you've created your soup, you can modify it directly: rename tags, change attribute values, add new attributes, and remove attributes.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="bolder">Very Bold</b>', 'html.parser')
>>> tag = soup.b
Modify and add new attributes as follows:
>>> tag.name = 'Blockquote'
>>> tag['class'] = 'Bolder'
>>> tag['id'] = 1.1
>>> tag
<Blockquote class="Bolder" id="1.1">Very Bold</Blockquote>
Delete attributes as follows:
>>> del tag['class']
>>> tag
<Blockquote id="1.1">Very Bold</Blockquote>
>>> del tag['id']
>>> tag
<Blockquote>Very Bold</Blockquote>
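The same operations work in a script as well as in the interactive shell. The following is a minimal sketch, assuming the bundled "html.parser" parser; the markup, the attribute values, and the variable name removed are illustrative only. It also uses tag.attrs, the plain dictionary in which a tag stores its attributes.

```python
from bs4 import BeautifulSoup

# Illustrative markup; "html.parser" is the parser bundled with Python.
soup = BeautifulSoup('<p><b class="bolder" id="x1">Very Bold</b></p>', "html.parser")
tag = soup.b

tag.name = "strong"                   # rename <b> to <strong>
tag["class"] = "highlight"            # overwrite an existing attribute
tag["title"] = "emphasis"             # add a new attribute
removed = tag.attrs.pop("id", None)   # remove an attribute and keep its old value

print(tag)
print(removed)
```

Because tag.attrs is a dictionary, ordinary dictionary methods such as pop() and get() can be used alongside the tag['name'] syntax.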
Modifying .string
Assigning to a tag's .string attribute replaces the tag's contents with the string you give:
>>> markup = '<a>Must for every <i>Learner</i></a>'
>>> Bsoup = BeautifulSoup(markup, 'html.parser')
>>> tag = Bsoup.a
>>> tag.string = "My Favourite spot."
>>> tag
<a>My Favourite spot.</a>
As we can see above, if this tag contains any other tags, they and all their contents will be replaced with the new data.
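To make the replacement of child tags concrete, here is a small sketch (the markup is illustrative and the parser choice is an assumption) that counts the children before and after the assignment and confirms that the nested <i> tag is gone.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>Must for every <i>Learner</i></a>", "html.parser")
tag = soup.a

children_before = len(tag.contents)   # a string plus the <i> tag
tag.string = "My Favourite spot."     # replaces every existing child
children_after = len(tag.contents)    # only the new string remains

print(children_before, children_after)
print(soup.i)                         # the <i> tag is no longer in the tree
```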
append()
Use the tag.append() method to add new data/content to an existing tag. It’s very similar to the append() method in Python lists.
>>> markup = '<a>Must for every <i>Learner</i></a>'
>>> Bsoup = BeautifulSoup(markup, 'html.parser')
>>> Bsoup.a.append(" Really Liked it")
>>> Bsoup
<a>Must for every <i>Learner</i> Really Liked it</a>
>>> Bsoup.a.contents
['Must for every ', <i>Learner</i>, ' Really Liked it']
NavigableString() and .new_tag()
If you want to append a string to a document, this can be easily accomplished using append() or the NavigableString() constructor.
>>> soup = BeautifulSoup("<b></b>", "html.parser")
>>> tag = soup.b
>>> tag.append("Start")
>>>
>>> new_string = NavigableString(" Your")
>>> tag.append(new_string)
>>> tag
<b>Start Your</b>
>>> tag.contents
['Start', ' Your']
Note: If calling NavigableString() raises a NameError such as the one below,
NameError: name 'NavigableString' is not defined
import the NavigableString class from the bs4 package and run the example again:
>>> from bs4 import NavigableString
To add a comment, or an instance of any other subclass of NavigableString, to an existing tag, call that class's constructor and append the result.
>>> from bs4 import Comment
>>> adding_comment = Comment("Always Learn something Good!")
>>> tag.append(adding_comment)
>>> tag
<b>Start Your<!--Always Learn something Good!--></b>
>>> tag.contents
['Start', ' Your', 'Always Learn something Good!']
Adding a brand new tag (rather than appending to an existing one) can be accomplished using BeautifulSoup’s built-in method: BeautifulSoup.new_tag().
>>> soup = BeautifulSoup("<b></b>", "html.parser")
>>> Otag = soup.b
>>>
>>> Newtag = soup.new_tag("a", href="https://www.tutorialspoint.com")
>>> Otag.append(Newtag)
>>> Otag
<b><a href="https://www.tutorialspoint.com"></a></b>
Only the first argument, the tag name, is required.
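A new tag starts out empty, so it is common to give it some text before attaching it. The sketch below is one way to do that, assuming the bundled "html.parser" parser; the link text "Tutorials" and the variable name link are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# Keyword arguments after the tag name become attributes of the new tag.
link = soup.new_tag("a", href="https://www.tutorialspoint.com")
link.string = "Tutorials"   # give the new tag some text content

soup.b.append(link)         # attach it to an existing tag
print(soup.b)
```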
insert()
tag.insert() works like tag.append(), except that the new element does not have to go at the end of its parent's .contents. It is inserted at whatever numeric position you pass, just like .insert() on a Python list.
>>> markup = '<a>Django Official website <i>Huge Community base</i></a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> tag = soup.a
>>>
>>> tag.insert(1, "Love this framework ")
>>> tag
<a>Django Official website Love this framework <i>Huge Community base</i></a>
>>> tag.contents
['Django Official website ', 'Love this framework ', <i>Huge Community base</i>]
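The position argument counts the parent's direct children, starting at 0. As a rough illustration (the list markup and variable names are hypothetical), the sketch below inserts one item at the front of a list and another at the very end.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>second</li><li>third</li></ul>", "html.parser")
items = soup.ul

first = soup.new_tag("li")
first.string = "first"
items.insert(0, first)                      # position 0: becomes the first child

fourth = soup.new_tag("li")
fourth.string = "fourth"
items.insert(len(items.contents), fourth)   # position len(contents): lands at the end

print(items)
```

Inserting at len(items.contents) gives the same result as calling append().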
insert_before() and insert_after()
To insert a tag or string immediately before another element in the parse tree, use insert_before().
>>> soup = BeautifulSoup("<b>Brave</b>", "html.parser")
>>> tag = soup.new_tag("i")
>>> tag.string = "Be"
>>>
>>> soup.b.string.insert_before(tag)
>>> soup.b
<b><i>Be</i>Brave</b>
Similarly, if you want to insert a tag or string after something in the parse tree, you can use insert_after().
>>> soup.b.i.insert_after(soup.new_string(" Always "))
>>> soup.b
<b><i>Be</i> Always Brave</b>
>>> soup.b.contents
[<i>Be</i>, ' Always ', 'Brave']
clear()
To remove the contents of a tag, use tag.clear():
>>> markup = '<a>For <i>technical &amp; Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> tag = soup.a
>>> tag
<a>For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> tag.clear()
>>> tag
<a></a>
extract()
To remove a tag or string from the tree, use PageElement.extract().
>>> markup = '<a>For <i>technical &amp; Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> a_tag = soup.a
>>>
>>> i_tag = soup.i.extract()
>>>
>>> a_tag
<a>For  Contents</a>
>>>
>>> i_tag
<i>technical &amp; Non-technical</i>
>>>
>>> print(i_tag.parent)
None
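Because extract() returns the element it removed, that element can be attached somewhere else in the tree. A minimal sketch, assuming two hypothetical <div> containers with the ids "src" and "dst":

```python
from bs4 import BeautifulSoup

html = '<div id="src"><i>note</i> text</div><div id="dst"></div>'
soup = BeautifulSoup(html, "html.parser")

note = soup.i.extract()              # remove <i> from the tree and keep a reference
detached = note.parent is None       # an extracted element has no parent

dst = soup.find("div", id="dst")
dst.append(note)                     # re-attach it under a different parent

print(soup)
```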
decompose()
tag.decompose() removes a tag from the tree, then destroys the tag and all its contents.
>>> markup = '<a>For <i>technical &amp; Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> a_tag = soup.a
>>> a_tag
<a>For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> soup.i.decompose()
>>> a_tag
<a>For  Contents</a>
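A typical use of decompose() is discarding elements you never want to see again, since nothing is returned for later reuse. The sketch below, with made-up markup, drops every <script> and <style> tag from a fragment.

```python
from bs4 import BeautifulSoup

html = "<div><script>alert(1)</script><p>Keep me</p><style>p{}</style></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all() collects the matches first, so removing them inside the loop is safe.
for unwanted in soup.find_all(["script", "style"]):
    unwanted.decompose()

print(soup)
```

If you need the removed element afterwards, extract() is the better fit, because it hands the element back instead of destroying it.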
replace_with()
As the name suggests, PageElement.replace_with() removes a tag or string from the tree and puts a new tag or string in its place.
>>> markup = '<a>Complete Python <i>Material</i></a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> a_tag = soup.a
>>>
>>> new_tag = soup.new_tag("Official_site")
>>> new_tag.string = "https://www.python.org/"
>>> a_tag.i.replace_with(new_tag)
<i>Material</i>
>>>
>>> a_tag
<a>Complete Python <Official_site>https://www.python.org/</Official_site></a>
In the output above, you'll notice that replace_with() returns the tag or string that was replaced (<i>Material</i> in our case), so you can inspect it or add it back to another part of the tree.
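As an illustration of reusing the returned element, the following sketch swaps an <i> tag for a new <b> tag and then appends the old <i> tag to a different part of the document. The markup, including the <footer> element, is hypothetical.

```python
from bs4 import BeautifulSoup

html = "<p>Old <i>Material</i></p><footer></footer>"
soup = BeautifulSoup(html, "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "Docs"

old = soup.i.replace_with(new_tag)   # returns the element that was replaced
soup.footer.append(old)              # put the old element back elsewhere

print(soup)
```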
wrap()
PageElement.wrap() wraps an element in the tag you specify and returns the new wrapper.
>>> soup = BeautifulSoup("<p>tutorialspoint.com</p>", "html.parser")
>>> soup.p.string.wrap(soup.new_tag("b"))
<b>tutorialspoint.com</b>
>>>
>>> soup.p.wrap(soup.new_tag("Div"))
<Div><p><b>tutorialspoint.com</b></p></Div>
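wrap() is often applied to several elements at once. A small sketch under assumed markup: each paragraph gets its own freshly created <div> wrapper, because a tag object can only live in one place in the tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

wrappers = []
for paragraph in soup.find_all("p"):
    # Create a new wrapper per element; reusing one tag would move it each time.
    wrappers.append(paragraph.wrap(soup.new_tag("div")))

print(soup)
```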
unwrap()
tag.unwrap() is the opposite of wrap(); it replaces a tag with whatever content is inside it.
>>> soup = BeautifulSoup('<a>I liked <i>tutorialspoint</i></a>', 'html.parser')
>>> a_tag = soup.a
>>>
>>> a_tag.i.unwrap()
<i></i>
>>> a_tag
<a>I liked tutorialspoint</a>
As you can see above, like replace_with(), unwrap() also returns the replaced tag.
Below is another example of unwrap() to better understand it.
>>> soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>", "html.parser")
>>> soup.i.unwrap()
<i></i>
>>> soup
<p>I <strong>AM</strong> a text.</p>
unwrap() is suitable for stripping markup.
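To strip several kinds of inline markup in one pass, unwrap() can be combined with find_all(). This is a minimal sketch; the choice of <strong> and <i> as the tags to strip is only an example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>", "html.parser")

# Replace each inline tag with its own contents, leaving the text in place.
for inline in soup.p.find_all(["strong", "i"]):
    inline.unwrap()

print(soup.p)
```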