Beautiful Soup – Beautiful Objects
The starting point of any Beautiful Soup project is the Beautiful Soup object. A Beautiful Soup object represents the input HTML/XML document used to create it.
We can pass Beautiful Soup either a string or a file-like object. The file-like object can be an open file stored locally on our machine or the content of a web page.
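As a minimal sketch of both input styles, the example below builds one soup from a string and another from a file-like object. Here io.StringIO is only a stand-in for an open file handle, and the sample markup and variable names are made up for illustration.

```python
import io

from bs4 import BeautifulSoup

# Illustrative markup; any HTML/XML string would do.
html = "<html><body><p>Hello</p></body></html>"

# 1. Pass the markup directly as a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2. Pass a file-like object (anything with a read() method).
#    io.StringIO stands in for an open local file here.
file_like = io.StringIO(html)
soup_from_file = BeautifulSoup(file_like, "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Both calls should produce equivalent trees, since Beautiful Soup reads the file-like object's contents before parsing.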
The most common Beautiful Soup objects are:
- Tag
- NavigableString
- BeautifulSoup
- Comment
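To see these four kinds of objects side by side, here is a small sketch that parses a short snippet and checks the type of each piece. The markup and variable names are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Illustrative markup containing a tag, some text, and a comment.
markup = "<p>Hello <b>world</b><!-- a note --></p>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b                   # an element in the tree
text = soup.b.string           # the text inside the <b> tag
comment = soup.p.contents[-1]  # the trailing HTML comment

print(type(soup).__name__)     # the whole document
print(type(tag).__name__)
print(type(text).__name__)
print(type(comment).__name__)
```

Note that Comment is a special kind of NavigableString, so isinstance(comment, NavigableString) also holds.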
Comparing Objects for Equality
According to Beautiful Soup, two NavigableString or Tag objects are equal if they represent the same HTML/XML markup.
Consider the following example. The two <b> tags are considered equal even though they live in different parts of the object tree, because they both look like "<b>Java</b>".
>>> from bs4 import BeautifulSoup
>>> markup = "<p>Learn Python and <b>Java</b> and advanced <b>Java</b>! from Tutorialspoint</p>"
>>> soup = BeautifulSoup(markup, "html.parser")
>>> first_b, second_b = soup.find_all('b')
>>> print(first_b == second_b)
True
>>> print(first_b.previous_element == second_b.previous_element)
False
However, to check whether two variables refer to the same object, use the is operator:
>>> print(first_b is second_b)
False
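Because equality is based on markup, a difference in attributes or contents is enough to make two tags unequal. The following sketch illustrates this; the markup and the class attribute are made up for the example.

```python
from bs4 import BeautifulSoup

# Three <b> tags: two with identical markup, one with an extra attribute.
markup = '<p><b>Java</b> <b>Java</b> <b class="hot">Java</b></p>'
soup = BeautifulSoup(markup, "html.parser")
plain_1, plain_2, with_class = soup.find_all("b")

print(plain_1 == plain_2)                 # same markup
print(plain_1 == with_class)              # attributes differ
print(plain_1.string == with_class.string)  # the text itself still matches
print(plain_1 is plain_2)                 # distinct objects in the tree
```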
Copying Beautiful Soup Objects
To create a copy of any Tag or NavigableString, use the copy.copy() function, like this:
>>> import copy
>>> p_copy = copy.copy(soup.p)
>>> print(p_copy)
<p>Learn Python and <b>Java</b> and advanced <b>Java</b>! from Tutorialspoint</p>
Although the original and the copy contain the same markup, they are not the same object.
>>> print(soup.p == p_copy)
True
>>> print(soup.p is p_copy)
False
The only real difference is that the copy is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it.
>>> print(p_copy.parent)
None
This happens because two different Tag objects cannot occupy the same place in the tree at the same time.
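One practical consequence is that a copy can be edited and then inserted into the tree without disturbing the original. The sketch below shows this, assuming a small made-up document with a wrapping <div>.

```python
import copy

from bs4 import BeautifulSoup

# Illustrative document; the <div> wrapper is an assumption for this example.
soup = BeautifulSoup("<div><p>Learn <b>Java</b></p></div>", "html.parser")

p_copy = copy.copy(soup.p)
p_copy.b.string = "Python"   # edit the copy only

print(soup.p)                # the original is unchanged
print(p_copy.parent)         # the copy is not attached to any tree yet

soup.div.append(p_copy)      # attach the copy next to the original
print(p_copy.parent is soup.div)
print(soup.div)
```

After the append() call, the copy has a parent of its own, and the <div> holds both paragraphs.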