Beautiful Soup – Installation

Since Beautiful Soup isn’t a standard Python library, we need to install it first. We will install the latest Beautiful Soup 4 library (also known as BS4).

To isolate our workspace so as not to interfere with our existing setup, let’s first create a virtual environment.

Creating a Virtual Environment (Optional)

A virtual environment allows us to create an isolated Python working copy for a specific project without affecting our external setup.

The best way to install any Python package is with pip. If pip isn’t already installed (you can check by running “pip --version” in your command prompt or shell), you can install it with the following command.

Linux Environment

$sudo apt-get install python-pip (for Python 2.x)
$sudo apt-get install python3-pip (for Python 3.x)

Windows Environment

To install pip on Windows, follow these steps.

  • Download get-pip.py from https://bootstrap.pypa.io/get-pip.py (or from GitHub) to your computer.
  • Open a command prompt and navigate to the folder containing the get-pip.py file.
  • Run the following command:

>python get-pip.py

That’s it, pip is now installed on your Windows machine.

You can verify that pip is installed by running the following command:

>pip --version
pip 19.2.3 from c:\users\yadur\appdata\local\programs\python\python37\lib\site-packages\pip (Python 3.7)

Install a virtual environment

Run the following command in your command prompt:

>pip install virtualenv

After running it, you should see output like that in the following screenshot.

[Screenshot: output of pip install virtualenv]

The following command will create a virtual environment (“myEnv”) in your current directory.

>virtualenv myEnv

[Screenshot: output of virtualenv myEnv]
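On Python 3, a similar environment can also be created without installing virtualenv, using the standard library's venv module. The following is only a sketch of that alternative, not the method used in the rest of this chapter; the function name create_env is illustrative.

```python
import venv
from pathlib import Path


def create_env(path, with_pip=True):
    """Create a virtual environment at `path` and return its location."""
    env_dir = Path(path)
    # EnvBuilder is the standard-library counterpart of the virtualenv command
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(str(env_dir))
    return env_dir
```

Calling create_env("myEnv") would create the environment in the current directory, much like the virtualenv command above.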

To activate your virtual environment, run the following command:

>myEnv\Scripts\activate

Once the environment is active, the prompt shows “(myEnv)” as a prefix, which tells us that we are in the virtual environment “myEnv.”

To exit the virtual environment, run deactivate.

(myEnv) C:\Users\yadur>deactivate
C:\Users\yadur>
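Besides looking at the prompt prefix, you can ask the interpreter itself whether it is running inside a virtual environment. This is a minimal sketch; the helper name in_virtualenv is an assumption made for illustration.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.prefix differ from sys.base_prefix
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtualenv())
```

Run it once with the environment activated and once after deactivate to see the difference.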

Now that our virtual environment is ready, let’s install Beautiful Soup.

Installing Beautiful Soup

We will use the Beautiful Soup 4 package, which is published on PyPI as beautifulsoup4 and imported in code as bs4.

Linux Machines

To install bs4 using your system package manager on Debian or Ubuntu Linux, run the following commands:

$sudo apt-get install python-bs4 (for Python 2.x)
$sudo apt-get install python3-bs4 (for Python 3.x)

If you have problems installing with your system packager, you can install bs4 with easy_install or pip instead. (easy_install is deprecated, so prefer pip where possible.)

$easy_install beautifulsoup4
$pip install beautifulsoup4

(If you’re using Python 3, you may need to use easy_install3 or pip3, respectively.)

Windows Machines

Installing beautifulsoup4 on Windows is very simple, especially if you already have pip installed.

>pip install beautifulsoup4

So now BeautifulSoup4 is installed on our machine. Let’s talk about some issues you might encounter after installation.
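One way to confirm the installation from Python itself is to ask the standard library's importlib.metadata for the installed version. This is a minimal sketch, assuming Python 3.8 or later; the helper name installed_version is illustrative.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "beautifulsoup4" is the distribution name used with pip; the import name is bs4
print("beautifulsoup4:", installed_version("beautifulsoup4"))
```

If this prints None, the package is not installed for the interpreter that ran the script, which can happen when pip belongs to a different Python installation.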

Post-Installation Issues

On Windows machines in particular, the wrong version of the package is sometimes installed. The two most common symptoms are:

  • ImportError “No module named HTMLParser”: you are running the Python 2 version of the code under Python 3.
  • ImportError “No module named html.parser”: you are running the Python 3 version of the code under Python 2.

In both cases, the best fix is to completely remove the existing Beautiful Soup installation and then reinstall it.
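Before reinstalling, it can help to see which interpreter is running and where it would load bs4 from. The following is a rough diagnostic sketch; describe_bs4_install is a hypothetical helper name, and it only locates the package without importing it.

```python
import importlib.util
import sys


def describe_bs4_install():
    """Report the running Python version and where bs4 would be loaded from."""
    spec = importlib.util.find_spec("bs4")  # locates the package without importing it
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        "bs4_location": spec.origin if spec is not None else None,
    }


for key, value in describe_bs4_install().items():
    print(key, "=", value)
```

If bs4_location points into the site-packages directory of a different Python version than the one reported, that mismatch is a likely cause of the ImportError.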

If you get a SyntaxError “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', the Python 2 code needs to be converted to Python 3. You can do this by installing the package:

$ python3 setup.py install

Or by manually running Python’s 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4

Installing a parser

By default, Beautiful Soup supports the HTML parser included in the Python standard library. It also supports several third-party Python parsers, such as the lxml parser and the html5lib parser.

To install the lxml or html5lib parser, use the following commands:

Linux Machines

$apt-get install python-lxml
$apt-get install python-html5lib

Windows Machines

>pip install lxml
>pip install html5lib

Generally speaking, users choose lxml for speed. If you are using an older version of Python 2 (before 2.7.3) or of Python 3 (before 3.2.2), it’s recommended to use the lxml or html5lib parser, because Python’s built-in HTML parser is not very good in those older versions.
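Since the third-party parsers are optional, a script can check which ones are available and fall back to the built-in parser. This is a minimal sketch under that assumption; pick_parser and its preference order are illustrative choices, not part of Beautiful Soup.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, or the built-in "html.parser"."""
    for name in preferred:
        # find_spec returns None when the module cannot be found
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


print("Parser to use:", pick_parser())
```

The returned string can be passed as the second argument to BeautifulSoup, for example BeautifulSoup(markup, pick_parser()).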

Running Beautiful Soup

Now it’s time to test our Beautiful Soup package on an HTML page and extract some information from it. We use https://www.tutorialspoint.com/index.htm as the example, but you can choose any other page you like.

The following code extracts the title from the webpage.

from bs4 import BeautifulSoup
import requests

# Download the page and parse it with Python's built-in HTML parser
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Print the page's <title> tag
print(soup.title)

Output

<title>H2O, Colab, Theano, Flutter, KNime, Mean.js, Weka, Solidity, Org.Json, AWS QuickSight, JSON.Simple, Jackson Annotations, Passay, Boon, MuleSoft, Nagios, Matplotlib, Java NIO, …

To extract every link on the page, loop over all the <a> tags and print each one’s href attribute:

for link in soup.find_all('a'):
    print(link.get('href'))

Output

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.php
https://store.tutorialspoint.com
https://www.tutorialspoint.com/gate_exams_tutorials.htm
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorix.com
https://www.tutorialspoint.com/videotutorials/top-courses.php
https://www.tutorialspoint.com/the_full_stack_web_development/index.asp
….
….
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
http://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint

http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
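The last few entries in the output are relative paths rather than full URLs. A minimal sketch for resolving them against the page address uses urljoin from the standard library; the helper name absolutize and the sample list are made up for illustration, and the base URL is the page used in the example above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tutorialspoint.com/index.htm"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against the page URL, skipping missing values."""
    # link.get('href') returns None for <a> tags without an href, so filter those out
    return [urljoin(base, href) for href in hrefs if href]


sample = ["/about/faq.htm", "https://www.tutorix.com", None]
print(absolutize(sample))
# ['https://www.tutorialspoint.com/about/faq.htm', 'https://www.tutorix.com']
```

Links that are already absolute pass through unchanged, so the same function can be applied to the whole list.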

Similarly, we can use beautifulsoup4 to extract useful information.
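To experiment without network access, the same calls work on an HTML string held in memory. This sketch assumes beautifulsoup4 is installed as described earlier; the markup is invented purely for illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up document standing in for a downloaded page
html = """<html><head><title>Sample page</title></head>
<body>
<a href="/about.htm">About</a>
<a href="https://example.com/">Example</a>
<a>No link here</a>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# .string gives the text inside the <title> tag rather than the tag itself
title_text = soup.title.string

# href=True keeps only the <a> tags that actually have an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(title_text)
print(hrefs)
```

Filtering with href=True avoids the None values that link.get('href') returns for anchors without a target.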

Now let’s take a closer look at the “soup” in the above example.
