Beautiful Soup – Add Soup to the Page
In the previous code example, we parsed the document by passing it as a string to the BeautifulSoup constructor. Another approach is to pass an open filehandle.
from bs4 import BeautifulSoup
# Option 1: pass an open filehandle
with open("example.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# Option 2: pass the markup as a string
soup = BeautifulSoup("<html>data</html>", "html.parser")
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
import bs4
html = '''<b>tutorialspoint</b>, <i>&web scraping &data science;</i>'''
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup)
Output
<html><body><b>tutorialspoint</b>, <i>&amp;web scraping &amp;data science;</i></body></html>
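As a minimal sketch of the entity conversion described above, the snippet below parses a fragment containing named entities and reads the text back. The sample markup and variable names are illustrative assumptions, and the built-in html.parser backend is used here so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption): &eacute; and &amp; are named HTML entities.
markup = "<p>Caf&eacute; &amp; bar</p>"
soup = BeautifulSoup(markup, "html.parser")

# Inside the parsed tree, the entities are ordinary Unicode characters.
text = soup.p.get_text()
print(text)

# When the tree is written back out, "&" is escaped again so the
# markup stays valid, while "é" is kept as a Unicode character.
rendered = str(soup)
print(rendered)
```

Reading the text gives the decoded characters, whereas converting the soup back to a string re-escapes only what is needed to keep the markup valid.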
Beautiful Soup then parses the document using an HTML parser, unless you explicitly tell it to use an XML parser.
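The parser is chosen through the second argument of the constructor. The sketch below is only an illustration: the markup is made up, and the "xml" option assumes that the lxml package is installed, so the code falls back gracefully if it is not.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up markup with mixed-case tag names (an assumption for illustration).
markup = "<Catalog><Item>Soup</Item></Catalog>"

# HTML parsers treat tag names as case-insensitive and lowercase them.
html_soup = BeautifulSoup(markup, "html.parser")
html_tag_name = html_soup.find("item").name
print(html_tag_name)

# The XML parser keeps tag names exactly as written. It needs lxml.
try:
    xml_soup = BeautifulSoup(markup, "xml")
    xml_tag_name = xml_soup.find("Item").name
except FeatureNotFound:
    xml_tag_name = None  # lxml is not available in this environment
print(xml_tag_name)
```

The difference in tag-name handling is one visible sign of which parser built the tree.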
HTML Tree Structure
Before we explore the different components of an HTML page, let’s first understand the HTML tree structure.
The root element of the document tree is html. Every other element has a parent and may have children and siblings, determined by its position in the tree. To move between HTML elements, attributes, and text, you move between nodes in the tree.
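A short sketch of moving between nodes, using a small made-up document (the markup and variable names are assumptions for illustration). It relies on the .parent, .next_sibling, and .children attributes that Beautiful Soup provides on each tag.

```python
from bs4 import BeautifulSoup

# Small made-up document, written without whitespace between tags
# so that no whitespace-only text nodes appear in the tree.
html = ("<html><head><title>Demo</title></head>"
        "<body><h1>Heading</h1><p>Text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

h1 = soup.h1
parent_name = h1.parent.name                               # element containing h1
sibling_name = h1.next_sibling.name                        # element right after h1
child_names = [child.name for child in soup.html.children]  # direct children of html

print(parent_name, sibling_name, child_names)
```

Here the parent of h1 is body, its next sibling is p, and the children of html are head and body.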
Let’s assume the webpage looks like the one below.
This translates into the following HTML document –
<html><head><title>TutorialsPoint</title></head><body><h1>TutorialsPoint Online Library</h1><p><b>It's all Free</b></p></body></html>
Simply put, for the HTML document above, we have an HTML tree structure as shown below.
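The tree can also be printed from code. The following is a rough sketch, not part of Beautiful Soup itself: tree_lines is a hypothetical helper written for this example, and it walks the parsed document recursively, indenting each tag by its depth.

```python
from bs4 import BeautifulSoup, Tag

html = ("<html><head><title>TutorialsPoint</title></head>"
        "<body><h1>TutorialsPoint Online Library</h1>"
        "<p><b>It's all Free</b></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

def tree_lines(node, depth=0):
    """Hypothetical helper: return one indented line per tag under node."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes, keep only elements
            lines.append("  " * depth + child.name)
            lines.extend(tree_lines(child, depth + 1))
    return lines

outline = tree_lines(soup)
print("\n".join(outline))
```

The printed outline starts with html at the top level, followed by head and body one level down, with title, h1, p, and b nested beneath them.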