Beautiful Soup – Add Soup to the Page

In the previous code example, we parsed the document by passing a string to the BeautifulSoup constructor. Another approach is to pass the document as an open file handle.

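from bs4 import BeautifulSoup

# Option 1: pass an open file handle
with open("example.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Option 2: pass the markup as a string
soup = BeautifulSoup("<html>data</html>", "html.parser")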

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

import bs4
html = '''<b>tutorialspoint</b>, <i>&web scraping &data science;</i>'''
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup)

Output

<html><body><b>tutorialspoint</b>, <i>&web scraping &data science;</i></body></html>
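The sample above contains bare ampersands rather than well-formed entities, so the conversion is hard to see in it. The following is a minimal sketch with real entities, assuming Python's built-in html.parser backend; the markup string is illustrative and not part of the original example.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the example above):
# &eacute; and &amp; are well-formed HTML entities.
markup = "<p>Caf&eacute; &amp; bar</p>"
soup = BeautifulSoup(markup, "html.parser")

# Inside the parsed tree, the entities are already plain Unicode characters.
text = soup.p.get_text()
print(text)

# When the tree is rendered back to markup, only characters that would
# break the markup (such as a bare ampersand) are escaped again.
rendered = str(soup)
print(rendered)
```

Here text holds "Café & bar", while rendered keeps the é as a Unicode character and writes the ampersand as &amp;.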

Beautiful Soup then parses the document with an HTML parser, unless you explicitly tell it to use an XML parser.
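As a rough sketch of how the parser can be chosen, the second argument of the constructor names it. The markup string below is a made-up sample. The "xml" option relies on the separate lxml package, so this sketch falls back gracefully when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up sample markup, used only for illustration.
markup = "<item><name>Soup</name></item>"

# "html.parser" ships with Python, so it is always available.
html_soup = BeautifulSoup(markup, "html.parser")
html_name = html_soup.find("name").string

# "xml" asks for lxml's XML parser; bs4 raises FeatureNotFound
# when no installed parser provides the requested feature.
try:
    xml_soup = BeautifulSoup(markup, "xml")
    xml_name = xml_soup.find("name").string
except FeatureNotFound:
    xml_soup = None
    xml_name = None

print(html_name, xml_name)
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build different trees from the same imperfect markup.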

HTML Tree Structure

Before we explore the different components of an HTML page, let’s first understand the HTML tree structure.

The root element of the document tree is html. Every other element can have a parent, children, and siblings, determined by its position in the tree. To move between HTML elements, attributes, and text, you move between nodes in the tree.

Let’s assume a simple webpage with a title, a heading, and a line of bold text, built from the following HTML document:

<html><head><title>TutorialsPoint</title></head><body><h1>TutorialsPoint Online Library</h1><p><b>It's all Free</b></p></body></html>

Simply put, the HTML document above has the following tree structure:

html
├── head
│   └── title: TutorialsPoint
└── body
    ├── h1: TutorialsPoint Online Library
    └── p
        └── b: It's all Free
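Moving between nodes can be sketched in a few lines. The example below assumes a well-formed version of the sample document (with the body element present) and the built-in html.parser backend. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Well-formed version of the sample document, written without
# whitespace between tags so that no extra text nodes appear.
html_doc = (
    "<html><head><title>TutorialsPoint</title></head>"
    "<body><h1>TutorialsPoint Online Library</h1>"
    "<p><b>It's all Free</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Children of the root element
root = soup.html
child_names = [child.name for child in root.children]

# Parent and next sibling of the heading
h1 = soup.h1
parent_name = h1.parent.name
sibling_name = h1.next_sibling.name

# Text nested two levels below body
bold_text = soup.p.b.string

# The root element's parent is the BeautifulSoup object itself,
# which represents the whole document rather than an element.
root_parent_name = root.parent.name

print(child_names, parent_name, sibling_name, bold_text, root_parent_name)
```

In this sketch, the children of html are head and body, the heading's parent is body, and its next sibling is the p element.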
