Beautiful Soup – Types of Objects

When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup transforms a complex HTML page into a tree of Python objects. Below we'll discuss the four main kinds of objects:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

Tag Object

An HTML tag is used to define various types of content. A Tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
>>> tag = soup.html
>>> type(tag)
<class 'bs4.element.Tag'>

A tag has many attributes and methods. Two important features of a tag are its name and attributes.

Name (tag.name)

Every tag has a name, which can be read through its .name attribute. tag.name returns the element name of the tag.

>>> tag.name
'html'

If we change a tag's name, the change is reflected in the HTML markup that Beautiful Soup generates.

>>> tag.name = "Strong"
>>> tag
<Strong><body><b class="boldest">TutorialsPoint</b></body></Strong>
>>> tag.name
'Strong'
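
The renaming step can be tried in isolation with the short sketch below. It assumes the built-in 'html.parser' rather than lxml, so no <html> or <body> wrapper is added and the <b> element is accessed directly; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# html.parser ships with Python, so no extra parser install is assumed.
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', "html.parser")

tag = soup.b                 # first <b> element in the document
original_name = tag.name     # name before the change

tag.name = "strong"          # rename the element in place
renamed_markup = str(tag)    # serialized markup now uses the new name

print(original_name)
print(renamed_markup)
```

Because the rename happens in place, the same Tag object is afterwards reachable as soup.strong, and soup.b no longer finds anything.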

Attributes (tag.attrs)

A tag object can have any number of attributes. The tag <b class="boldest"> above has an attribute 'class' whose value is "boldest". You can read an attribute by treating the tag like a dictionary and using the attribute name as the key (such as "class" in the example above), or get the whole attribute dictionary through .attrs.

>>> tutorialsP = BeautifulSoup("<div class='tutorialsP'></div>",'lxml')
>>> tag2 = tutorialsP.div
>>> tag2['class']
['tutorialsP']

We can perform various modifications (add/remove/edit) to the attributes of our tags.

>>> tag2['class'] = 'Online-Learning'
>>> tag2['style'] = '2007'
>>> tag2
<div class="Online-Learning" style="2007"></div>
>>> del tag2['style']
>>> tag2
<div class="Online-Learning"></div>
>>> b_tag = soup.b
>>> b_tag['SecondAttribute'] = '2'
>>> del b_tag['class']
>>> b_tag
<b SecondAttribute="2">TutorialsPoint</b>
>>> del b_tag['SecondAttribute']
>>> b_tag
<b>TutorialsPoint</b>
>>> tag2['class']
'Online-Learning'
>>> tag2['style']
Traceback (most recent call last):
  ...
KeyError: 'style'
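
Since a missing key raises KeyError, it can help to read attributes defensively. The following is a minimal sketch, assuming the built-in 'html.parser' and an illustrative <div> with an extra id attribute that is not part of the example above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='tutorialsP' id='main'></div>", "html.parser")
div = soup.div

# .attrs is the dictionary behind the tag['key'] syntax.
all_attrs = dict(div.attrs)

# .get() returns None (or a supplied default) instead of raising KeyError.
missing = div.get("style")
fallback = div.get("style", "none")

# has_attr() tests for presence without reading the value.
has_id = div.has_attr("id")
has_style = div.has_attr("style")

print(all_attrs, missing, fallback, has_id, has_style)
```

Using .get() mirrors the behaviour of a plain Python dictionary, which is usually the simplest choice when an attribute may or may not be present.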

Multi-valued Attributes

Some HTML attributes can have multiple values. The most common is the class attribute, since a tag can belong to more than one CSS class. Others include "rel", "rev", "headers", "accesskey", and "accept-charset". Beautiful Soup presents the value of a multi-valued attribute as a list.

>>> from bs4 import BeautifulSoup
>>>
>>> css_soup = BeautifulSoup('<p class="body"></p>')
>>> css_soup.p['class']
['body']
>>>
>>> css_soup = BeautifulSoup('<p class="body bold"></p>')
>>> css_soup.p['class']
['body', 'bold']

However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the value alone as a single string.

>>> id_soup = BeautifulSoup('<p id="body bold"></p>')
>>> id_soup.p['id']
'body bold'
>>> type(id_soup.p['id'])
<class 'str'>

When you convert a tag back to a string, the values of a multi-valued attribute are joined into a single space-separated value.

>>> rel_soup = BeautifulSoup("<p> tutorialspoint Main <a rel='Index'> Page</a></p>")
>>> rel_soup.a['rel']
['Index']
>>> rel_soup.a['rel'] = ['Index', 'Online Library, Its all Free']
>>> print(rel_soup.p)
<p> tutorialspoint Main <a rel="Index Online Library, Its all Free"> Page</a></p>
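
The same round trip works for class, which is handy when adding a CSS class to an element. This is a small sketch assuming the built-in 'html.parser'; the class name "highlight" is just an illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body bold">text</p>', "html.parser")
p = soup.p

# The class value is a list, so ordinary list operations apply.
p["class"].append("highlight")
classes_after = list(p["class"])

# Serializing the tag joins the list back into one space-separated value.
markup = str(p)

print(classes_after)
print(markup)
```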

If you use get_attribute_list(), the value you get back is always a list, whether or not the attribute is multi-valued.

>>> id_soup.p.get_attribute_list('id')
['body bold']
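
To see why this is convenient, the sketch below reads a single-valued and a multi-valued attribute the same way. It assumes the built-in 'html.parser' and an illustrative tag that carries both attributes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="body bold" class="body bold"></p>', "html.parser")
p = soup.p

# id is single-valued: the string is wrapped in a one-element list.
id_values = p.get_attribute_list("id")

# class is multi-valued: the existing list is returned.
class_values = p.get_attribute_list("class")

# Either result can be looped over without checking its type first.
for value in id_values + class_values:
    print(value)
```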

However, if you parse the document as 'xml', there are no multi-valued attributes:

>>> xml_soup = BeautifulSoup('<p class="body bold"></p>', 'xml')
>>> xml_soup.p['class']
'body bold'
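
A similar effect can be had for HTML input without switching parsers. The sketch below assumes a reasonably recent Beautiful Soup release in which the constructor accepts the multi_valued_attributes argument, and it uses the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

markup = '<p class="body bold"></p>'

# Default HTML behaviour: class is split into a list.
html_soup = BeautifulSoup(markup, "html.parser")
as_list = html_soup.p["class"]

# With multi_valued_attributes=None, every attribute value stays a string.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
as_string = plain_soup.p["class"]

print(as_list)
print(as_string)
```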

NavigableString

A NavigableString object represents the text inside a tag, not the tag itself. To access it, use .string on the tag.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>>
>>> soup.string
'Hello, Tutorialspoint!'
>>> type(soup.string)
<class 'bs4.element.NavigableString'>

You cannot edit a string in place, but you can replace one string with another using replace_with(). The call returns the string that was replaced.

>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>> soup.string.replace_with("Online Learning!")
'Hello, Tutorialspoint!'
>>> soup.string
'Online Learning!'
>>> soup
<html><body><h2 id="message">Online Learning!</h2></body></html>
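
The following sketch looks a little closer at what a NavigableString is. It assumes the built-in 'html.parser', so the serialized output has no <html> or <body> wrapper, unlike the lxml-style output shown above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", "html.parser")
text_node = soup.h2.string

# A NavigableString is a str subclass that also knows its place in the tree.
is_navigable = isinstance(text_node, NavigableString)
is_str = isinstance(text_node, str)
parent_name = text_node.parent.name

# str() gives a plain Python string that carries no reference to the tree.
plain_text = str(text_node)

# Swap the text node for a new one; the old node is returned.
old_node = soup.h2.string.replace_with("Online Learning!")
result = str(soup)

print(plain_text, parent_name)
print(result)
```

Converting to a plain str is worth doing when the text is kept after the soup is no longer needed, since a NavigableString keeps a reference to the whole parse tree.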

BeautifulSoup

The BeautifulSoup object is created when we parse a document, and it represents that document as a whole. For most purposes it can be treated as a Tag object, although it has no real HTML tag behind it, so its name is the special value '[document]'.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup.name
'[document]'
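
The sketch below illustrates how the BeautifulSoup object can be used like a tag. It assumes the built-in 'html.parser' and imports Tag from bs4.element only to run the isinstance check.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", "html.parser")

# BeautifulSoup is a subclass of Tag, so the usual Tag methods work on it.
is_tag = isinstance(soup, Tag)
doc_name = soup.name

# Searching from the document root uses the same API as searching from a tag.
heading = soup.find("h2")
heading_id = heading["id"]

print(is_tag, doc_name, heading_id)
```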

Comment

The Comment object represents an HTML or XML comment in the document. It is simply a special type of NavigableString.

>>> soup = BeautifulSoup('<p><!-- Everything inside it is COMMENTS --></p>')
>>> comment = soup.p.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> print(soup.p.prettify())
<p>
 <!-- Everything inside it is COMMENTS -->
</p>
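
Because Comment is a NavigableString subclass, comments can be picked out of a document with an isinstance check. This is a minimal sketch assuming the built-in 'html.parser' and a made-up snippet of HTML.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<p>Visible text<!-- hidden note --></p><!-- another -->"
soup = BeautifulSoup(html, "html.parser")

# find_all(string=...) accepts a function that is called with each text node.
comments = soup.find_all(string=lambda node: isinstance(node, Comment))

# str() of a Comment is its content without the <!-- --> delimiters.
texts = [str(c).strip() for c in comments]

print(texts)
```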
