Beautiful Soup – Navigate by Tag

In this chapter, we’ll discuss navigating by tag.

Here is our HTML document –

>>> html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
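<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="link1">Java</a>,
<a href="..." class="prog" id="link2">C</a>,
<a href="https://www.tutorialspoint.com/python/index.htm" class="prog" id="link3">Python</a>,
<a href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" class="prog" id="link4">JavaScript</a> and
<a href="..." class="prog" id="link5">C</a>;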
as per online survey.</p>
<p class="prog">Programming Languages</p>
"""
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>>

Using the document above, we will move from one part of the parse tree to another.

Move Down

Tags may contain strings and other tags; these are the tag’s children. Beautiful Soup provides several attributes for navigating and iterating over a tag’s children.

Navigating by Tag Name

The easiest way to navigate the parse tree is by tag name. If you want the <head> tag, use soup.head –

>>> soup.head
<head><title>Tutorials Point</title></head>
>>> soup.title
<title>Tutorials Point</title>

You can chain tag names to zoom in on part of the tree. To get the first <b> tag beneath the <body> tag, use –

>>> soup.body.b
<b>The Biggest Online Tutorials Library, It's all Free</b>

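Using a tag name as an attribute gives you only the first tag with that name:

>>> soup.a
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>

To get all the <a> tags, use the find_all() method.

>>> soup.find_all("a")
[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="..." id="link2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>, <a class="prog" href="..." id="link5">C</a>]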
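The difference between attribute access and find_all() can be checked with a small sketch. The markup and the names sample, demo and the x1 to x3 ids are made up for illustration and are not part of the chapter's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (not the chapter's html_doc), kept tiny on purpose.
sample = '<p><a id="x1">Java</a>, <a id="x2">C</a> and <a id="x3">Python</a></p>'
demo = BeautifulSoup(sample, "html.parser")

first_link = demo.a              # attribute access: the first <a> only
same_link = demo.find("a")       # find() also returns the first <a>
all_links = demo.find_all("a")   # every <a>, in document order
missing = demo.table             # no <table> in the markup, so this is None

print(first_link["id"])
print([link.get_text() for link in all_links])
print(missing)
```

Because a missing tag comes back as None instead of raising, chained access such as demo.table.tr would fail with an AttributeError, so it is worth checking for None before going deeper.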

.contents and .children

A tag’s children are available in a list called .contents:

>>> Htag = soup.head
>>> Htag
<head><title>Tutorials Point</title></head>
>>>
>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Ttag = Htag.contents[0]
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.contents
['Tutorials Point']

BeautifulSoup objects themselves also have children. In this case, the <html> tag is a child of the BeautifulSoup object.

>>> len(soup.contents)
2
>>> soup.contents[1].name
'html'

A string does not have .contents because it cannot possibly contain anything −

>>> text = Ttag.contents[0]
>>> text.contents
Traceback (most recent call last):
  ...
AttributeError: 'NavigableString' object has no attribute 'contents'

Use the .children generator to access the children of a tag, rather than getting them as a list.

>>> for child in Ttag.children:
    print(child)
Tutorials Point
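As a rough illustration of the two access styles, the sketch below uses a made-up list fragment (the names demo and list_tag are hypothetical). It shows .contents being indexed like a list and .children being consumed by iteration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration.
demo = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
list_tag = demo.ul

# .contents is a real list, so len() and indexing work.
child_count = len(list_tag.contents)
second_item = list_tag.contents[1]

# .children yields the same nodes one at a time.
child_names = [child.name for child in list_tag.children]

print(child_count)
print(second_item)
print(child_names)
```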

.descendants

The .descendants property lets you iterate recursively over all of a tag’s children: its direct children, the children of its direct children, and so on.

>>> for child in Htag.descendants:
    print(child)
<title>Tutorials Point</title>
Tutorials Point

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child string. The BeautifulSoup object has only two direct children (the leading newline string and the <html> tag), but it has many more descendants.

>>> len(list(soup.children))
2
>>> len(list(soup.descendants))
33
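The gap between children and descendants is easier to see on a very small tree. The following sketch assumes a made-up fragment (demo and div are illustrative names only) and separates tags from strings among the descendants.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: one <div> holding a <p>, which holds text and a <b>.
demo = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = demo.div

direct = list(div.children)       # just the <p> tag
nested = list(div.descendants)    # <p>, 'Hello ', <b>, 'world'

# Keep only the Tag objects among the descendants.
tag_names = [node.name for node in nested if isinstance(node, Tag)]

print(len(direct), len(nested))
print(tag_names)
```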

.string

If a tag has only one child, and that child is a NavigableString, the child will appear as .string.

>>> Ttag.string
'Tutorials Point'

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.

>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Htag.string
'Tutorials Point'

However, if a tag contains more than one thing, it’s unclear what .string should refer to, so .string is defined as None −

>>> print(soup.html.string)
None
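When .string is None because a tag has several children, get_text() is one way to still collect the text. The sketch below uses a made-up paragraph (demo is a hypothetical name) to contrast the two.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two children: a string and a <b> tag.
demo = BeautifulSoup("<p>Top languages: <b>Java</b></p>", "html.parser")

single = demo.b.string        # <b> has exactly one string child
ambiguous = demo.p.string     # <p> has two children, so this is None
all_text = demo.p.get_text()  # concatenates every string under <p>

print(repr(single))
print(ambiguous)
print(repr(all_text))
```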

.strings and .stripped_strings

If a tag contains more than one thing, you can still look at just the strings. Use the .strings generator –

>>> for string in soup.strings:
    print(repr(string))
'\n'
'Tutorials Point'
'\n'
'\n'
"The Biggest Online Tutorials Library, It's all Free"
'\n'
'Top 5 most used Programming Languages are: \n'
'Java'
',\n'
'C'
',\n'
'Python'
',\n'
'JavaScript'
' and\n'
'C'
';\n \nas per online survey.'
'\n'
'Programming Languages'
'\n'

To remove excess whitespace, use the .stripped_strings generator –

>>> for string in soup.stripped_strings:
    print(repr(string))
'Tutorials Point'
"The Biggest Online Tutorials Library, It's all Free"
'Top 5 most used Programming Languages are:'
'Java'
','
'C'
','
'Python'
','
'JavaScript'
'and'
'C'
';\n \nas per online survey.'
'Programming Languages'
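A common follow-up is to join the stripped strings into one line of text. This is only a sketch on a made-up fragment (demo, raw, clean and joined are illustrative names). Note that whitespace-only strings are skipped entirely, while whitespace inside a string is left alone.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with newlines and indentation around the text.
demo = BeautifulSoup("<p>\n  Java,\n  <a>C</a>\n</p>", "html.parser")

raw = list(demo.strings)             # strings exactly as parsed
clean = list(demo.stripped_strings)  # stripped, whitespace-only ones dropped
joined = " ".join(clean)

print(raw)
print(clean)
print(joined)
```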

Going Up

In the “family tree” metaphor, every tag and every string has a parent: the tag that contains it.

.parent

To access the parent of an element, use the .parent property.

>>> Ttag = soup.title
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.parent
<head><title>Tutorials Point</title></head>

In our html_doc, the title string itself has a parent: the <title> tag that contains it—

>>> Ttag.string.parent
<title>Tutorials Point</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself.

>>> htmltag = soup.html
>>> type(htmltag.parent)
<class 'bs4.BeautifulSoup'>

A BeautifulSoup object’s .parent is defined as None.

>>> print(soup.parent)
None

.parents

To iterate over all parent elements, use the .parents property.

>>> link = soup.a
>>> link
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
>>>
>>> for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
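One use of .parents is to build a readable path from the document root down to an element. The sketch below assumes a made-up document; the names demo, names and path are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single link nested three levels deep.
demo = BeautifulSoup("<html><body><p><a>Java</a></p></body></html>", "html.parser")
link = demo.a

# Walk upward: nearest ancestor first, the BeautifulSoup object last.
names = [parent.name for parent in link.parents]

# Reverse the ancestors and append the tag itself to get a root-to-leaf path.
path = " > ".join(list(reversed(names)) + [link.name])

print(names)
print(path)
```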

Lateral movement

Here is a simple document, parsed with lxml (which wraps the fragment in <html> and <body> tags) –

>>> sibling_soup = BeautifulSoup("<b>TutorialsPoint</b><c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>", 'lxml')
>>> print(sibling_soup.prettify())
<html>
 <body>
  <b>
   TutorialsPoint
  </b>
  <c>
   <strong>
    The Biggest Online Tutorials Library, It's all Free
   </strong>
  </c>
 </body>
</html>

In the document above, the <b> and <c> tags are at the same level: both are direct children of the same tag. We call them siblings.

.next_sibling and .previous_sibling

Use .next_sibling and .previous_sibling to navigate between page elements at the same level in the parse tree:

>>> sibling_soup.b.next_sibling
<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
>>>
>>> sibling_soup.c.previous_sibling
<b>TutorialsPoint</b>

The <b> tag has a .next_sibling but no .previous_sibling, because nothing comes before it on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling.

>>> print(sibling_soup.b.previous_sibling)
None
>>> print(sibling_soup.c.next_sibling)
None

The strings 'TutorialsPoint' and "The Biggest Online Tutorials Library, It's all Free" are not siblings, because they do not have the same parent.

>>> sibling_soup.b.string
'TutorialsPoint'
>>>
>>> print(sibling_soup.b.string.next_sibling)
None

.next_siblings and .previous_siblings

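To iterate over a tag’s siblings, use .next_siblings and .previous_siblings.

>>> for sibling in soup.a.next_siblings:
    print(repr(sibling))
',\n'
<a class="prog" href="..." id="link2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
' and\n'
<a class="prog" href="..." id="link5">C</a>
';\n \nas per online survey.'
>>> for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
',\n'
<a class="prog" href="..." id="link2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
'Top 5 most used Programming Languages are: \n'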
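In real documents the sibling of a tag is often a string of punctuation or whitespace rather than another tag, as the output above shows. The sketch below is one possible way to skip those strings, using a made-up fragment; demo, first and the x1 to x3 ids are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: three links separated by punctuation and newlines.
sample = '<p><a id="x1">Java</a>,\n<a id="x2">C</a>;\n<a id="x3">Python</a></p>'
demo = BeautifulSoup(sample, "html.parser")
first = demo.a

# The immediate sibling is the string between the two links.
raw_sibling = first.next_sibling

# find_next_sibling() skips ahead to the next sibling that is a matching tag.
next_link = first.find_next_sibling("a")

# Filtering .next_siblings by type keeps only the tags.
tag_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_sibling))
print(next_link["id"])
print(tag_ids)
```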

Moving Back and Forward

Now let’s go back to the first three lines of our “html_doc” example:

<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>

An HTML parser takes the string above and converts it into a series of events, such as “open an <html> tag,” “open a <head> tag,” “open a <title> tag,” “add a string,” “close the <title> tag,” “close the <head> tag,” “open a <body> tag,” “open a <p> tag,” and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element property of a tag or string points to whatever was parsed immediately after it. It can look similar to .next_sibling, but it is not the same thing. Here’s the final <a> tag from our “html_doc” example document.

>>> last_a_tag = soup.find("a", id="link5")
>>> last_a_tag
<a class="prog" href="..." id="link5">C</a>
>>> last_a_tag.next_sibling
';\n \nas per online survey.'

However, the .next_element of that <a> tag, the thing parsed immediately after the opening tag, is not the rest of the sentence: it is the string “C”:

>>> last_a_tag.next_element
'C'

This happens because the string “C” appears before the semicolon in the original markup. The parser encountered an <a> tag, then the string “C,” then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the string “C” was encountered first.

The .previous_element property is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one.

>>> last_a_tag.previous_element
' and\n'
>>>
>>> last_a_tag.previous_element.next_element
<a class="prog" href="..." id="link5">C</a>

.next_elements and .previous_elements

Use these iterators to move forward or backward through the document in the order it was parsed.

>>> for element in last_a_tag.next_elements:
    print(repr(element))
'C'
';\n \nas per online survey.'
'\n'
<p class="prog">Programming Languages</p>
'Programming Languages'
'\n'
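To see .next_sibling and .next_element side by side, here is a minimal sketch on a made-up one-line fragment (demo and link are hypothetical names). The sibling is the text after the closing tag, while the next element is the string inside the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a link followed by the rest of a sentence.
demo = BeautifulSoup("<p><a>C</a>; as per survey.</p>", "html.parser")
link = demo.a

sibling = link.next_sibling   # same level: the text after </a>
element = link.next_element   # parse order: the string inside <a>

# Everything parsed after the opening <a> tag, in order.
following = [str(item) for item in link.next_elements]

print(repr(sibling))
print(repr(element))
print(following)
```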
