Pulling the important data out of HTML using BeautifulSoup

Introduction

This tutorial introduces a small recipe for extracting the important textual content from a webpage using the Beautiful Soup web scraping library.

What is important in a webpage and what is not important?

When extracting data from an HTML page, some tags need to be removed, such as script, style, link, and some meta tags. The content of the navbar and footer elements is usually not very important either, so we exclude those elements from the extraction process.

Importing Libraries

We import Beautiful Soup and the Python requests module.

import requests
from bs4 import BeautifulSoup as bs

Get the webpage content

We use a simple webpage hosted on GitHub for this process; its URL appears in the code below. Let's fetch the page and load its content into a BeautifulSoup object.

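# download the sample page
page = requests.get("https://gayan830.github.io/bs_tutorial/fruits_vegetables.html")
# name the parser explicitly; without it bs4 guesses one and emits a warning
soup = bs(page.content, "html.parser")
print(soup.prettify())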

output:

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="Find out what nutrients are in most common Fruits and Vegetables." name="description"/>
  <meta content="Fruits, vegitable list, Fruit Vegetables vitamin" name="keywords"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js">
  </script>
  <title>
   Fruits and Vegetables
  </title>
  <link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0/dist/css/bootstrap.min.css" integrity="sha384-gH2yIJqKdNHPEq0n4Mqa/HGKIhSkIHeL5AyhkYV8i59U5AR6csBvApHHNl/vI1Bx" rel="stylesheet"/>
  <script crossorigin="anonymous" integrity="sha384-ODmDIVzN+pFdexxHEHFBQH3/9/vQ9uori45z4JjnFsRydbmQbmL5t1tQ0culUzyK" src="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0/dist/js/bootstrap.min.js">
  </script>
  <link href="https://fonts.googleapis.com" rel="preconnect"/>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@100&amp;display=swap" rel="stylesheet"/>
  <link href="style.css" rel="stylesheet"/>
  <style>
   body {
            width: 700px;
        }

        nav {
            background-color: #2596be;
            margin-bottom: 40px;
        }
  </style>
 </head>
 <body class="center">
  <header>
   <h1>
    Fruits and Vegetables
   </h1>
   <nav>
    <ul>
     <li>
      <a href="#home">
       Home
      </a>
     </li>
     <li>
      <a href="#foods">
       Fruits and Vegetables
      </a>
     </li>
     <li>
      <a href="#contact">
       Contact
      </a>
     </li>
     <li>
      <a href="#about">
       About
      </a>
     </li>
    </ul>
   </nav>
  </header>
  <main class="main-text">
   <h2>
    Fruits and Vegetables contain lot of Vitamins, minerals.
            Eating fresh food and Vegetables can protect you from cancer and heart diseases.
   </h2>
  </main>
  <section>
   <div class="container center">
    <figure>
     <img alt="Fruit and Vegetables" src="Fruit_and_Vegetables.jpg"/>
     <figcaption>
      Fruit and Vegetables
     </figcaption>
    </figure>
   </div>
   <div class="container center">
    <b>
     List of common Fruits
    </b>
    <ol id="veg-list">
     <li>
      Banana
     </li>
     <li>
      Mango
     </li>
     <li>
      Guava
     </li>
     <li>
      Jack Fruit
     </li>
     <li>
      Grapes
     </li>
    </ol>
   </div>
   <div class="container center">
    <b>
     List of common Vegetables
    </b>
    <ol id="fruit-list">
     <li>
      Carrot
     </li>
     <li>
      Beans
     </li>
     <li>
      Chille
     </li>
     <li>
      Potatoes
     </li>
     <li>
      Tomatoes
     </li>
    </ol>
   </div>
  </section>
  <section>
   <h1>
    Fruits Vitamins
   </h1>
   <table class="center">
    <thead>
     <tr>
      <th colspan="2">
       Fruits Vitamin Table
      </th>
     </tr>
     <tr>
      <th>
       Name
      </th>
      <th>
       Vitamins and minerals
      </th>
     </tr>
    </thead>
    <tbody>
     <tr>
      <td>
       Banana
      </td>
      <td>
       Potassium, Vitamin B6, Vitamin C
      </td>
     </tr>
     <tr>
      <td>
       Mango
      </td>
      <td>
       Iron, Calcium, Vitamin C, Vitamin B6, Vitamin A
      </td>
     </tr>
     <tr>
      <td>
       Guava
      </td>
      <td>
       Calcium, Vitamin C, Vitamin D, Vitamin A
      </td>
     </tr>
    </tbody>
   </table>
  </section>
  <hr/>
  <footer>
   <p>
    <b>
     Contact the author of this page:
    </b>
   </p>
   <aside>
    <address>
     <a href="123@xyz.com">
      123@xyz.com
     </a>
     <br/>
     <a href="tel:+000000000">
      (000) 000-0000
     </a>
    </address>
   </aside>
   <aside>
    <address>
     <b>
      Address:
     </b>
     <br/>
     addres line 1
     <br/>
     addres line 2
     <br/>
     City, State
    </address>
   </aside>
  </footer>
 </body>
</html>

The prettify() method formats the HTML markup as an indented string for output.
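
To see what prettify() does without fetching anything, here is a minimal sketch on a tiny inline document. The HTML string and the names demo_html, demo_soup, and pretty are made up for illustration and are not part of the tutorial page; the "html.parser" argument selects Python's built-in parser.

```python
from bs4 import BeautifulSoup

# hypothetical one-line document, used only for this illustration
demo_html = "<html><body><h1>Fruits</h1><p>Banana</p></body></html>"

demo_soup = BeautifulSoup(demo_html, "html.parser")

# prettify() returns a string with each tag and text node on its own indented line
pretty = demo_soup.prettify()
print(pretty)
```

The one-line input comes back spread over several indented lines, which is why the output above is so much longer than the page source might suggest.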

Remove unwanted tags and get the values of some of the important attributes

Some meta tags carry relevant information about the page, such as its keywords and description. For example, the meta tags below contain important information, so we extract the values of their content attributes, along with the alt attribute value of the image.

<meta content="Find out what nutrients are in most common Fruits and Vegetables." name="description"/>
<meta content="Fruits, vegitable list, Fruit Vegetables vitamin" name="keywords"/>
<img alt="Fruit and Vegetables" src="Fruit_and_Vegetables.jpg"/>

The code snippet below removes every tag type listed in the find_all() argument, after first extracting the necessary information from it. Beautiful Soup's find_all() method returns all elements in the BeautifulSoup object that match the given tag names, and the decompose() method removes an element, together with its contents, from the BeautifulSoup object. We store all the extracted information in the webpage_content list.

webpage_content = []

for element in soup.find_all(['meta', 'img', 'script', 'link', 'style', 'nav', 'footer', 'form', 'svg']):
  # extract the description and keywords from meta tags
  if element.get('name') in ['keywords', 'description']:
    webpage_content.append(element.get('content'))
  # get the alt text from images
  elif element.get('alt'):
    webpage_content.append(element.get('alt'))
  # remove the element and its contents from the soup
  element.decompose()

After extracting the information from the tag types specified above, let's look at the webpage_content list:

print(webpage_content)

output:

['Find out what nutrients are in most common Fruits and Vegetables.',
 'Fruits, vegitable list, Fruit Vegetables vitamin',
 'Fruit and Vegetables']
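
The same extract-then-remove pattern can be tried offline on a small inline page. This is only a sketch: the sample_html string and the names demo_soup, collected, and remaining_tags are invented for illustration, and none of the targeted tags are nested inside one another, as is also the case on the tutorial page.

```python
from bs4 import BeautifulSoup

# hypothetical miniature page, not the tutorial's real page
sample_html = """
<html>
 <head>
  <meta name="description" content="A short demo page."/>
  <meta name="viewport" content="width=device-width"/>
  <style>body { width: 700px; }</style>
 </head>
 <body>
  <nav><a href="#home">Home</a></nav>
  <img src="apple.jpg" alt="A red apple"/>
  <p>Apples are rich in fibre.</p>
 </body>
</html>
"""

demo_soup = BeautifulSoup(sample_html, "html.parser")
collected = []

for element in demo_soup.find_all(["meta", "img", "style", "nav"]):
    # keep the useful attribute values before the tag disappears
    if element.get("name") in ["keywords", "description"]:
        collected.append(element.get("content"))
    elif element.get("alt"):
        collected.append(element.get("alt"))
    # remove the tag and everything inside it
    element.decompose()

# names of the tags that survived the cleanup, in document order
remaining_tags = [tag.name for tag in demo_soup.find_all(True)]

print(collected)
print(remaining_tags)
```

The viewport meta tag is removed without contributing anything, because its name is neither keywords nor description, and the link inside nav goes away together with its parent.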

Extracting remaining content

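We extract the text from the remaining tags using the following code snippet. The stripped_strings generator yields each text string in the document with leading and trailing whitespace removed, and skips strings that contain only whitespace.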

for string in soup.stripped_strings:
  webpage_content.append(string)

Now our webpage_content list has all the essential text content extracted from the webpage.

print(webpage_content)

output:

['Find out what nutrients are in most common Fruits and Vegetables.',
 'Fruits, vegitable list, Fruit Vegetables vitamin',
 'Fruit and Vegetables',
 'Fruits and Vegetables',
 'Fruits and Vegetables',
 'Fruits and Vegetables contain lot of Vitamins, minerals.\n            Eating fresh food and Vegetables can protect you from cancer and heart diseases.',
 'Fruit and Vegetables',
 'List of common Fruits',
 'Banana',
 'Mango',
 'Guava',
 'Jack Fruit',
 'Grapes',
 'List of common Vegetables',
 'Carrot',
 'Beans',
 'Chille',
 'Potatoes',
 'Tomatoes',
 'Fruits Vitamins',
 'Fruits Vitamin Table',
 'Name',
 'Vitamins and minerals',
 'Banana',
 'Potassium, Vitamin B6, Vitamin C',
 'Mango',
 'Iron, Calcium, Vitamin C, Vitamin B6, Vitamin A',
 'Guava',
 'Calcium, Vitamin C, Vitamin D, Vitamin A']
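
Note that stripped_strings trims only the ends of each string, so the line break and indentation inside the h2 text survive in the output above. One possible follow-up step, sketched here on an invented snippet (the names snippet, snippet_soup, raw_strings, clean_strings, and page_text are assumptions, not part of the original recipe), is to collapse internal whitespace with split() and join():

```python
from bs4 import BeautifulSoup

# hypothetical fragment that mimics the multi-line h2 on the tutorial page
snippet = """
<main>
 <h2>Fruits and Vegetables contain lot of Vitamins, minerals.
            Eating fresh food and Vegetables can protect you.</h2>
 <ol><li> Banana </li><li>Mango</li></ol>
</main>
"""

snippet_soup = BeautifulSoup(snippet, "html.parser")

# strings as stripped_strings yields them: ends trimmed, inner whitespace kept
raw_strings = list(snippet_soup.stripped_strings)

# collapse every run of whitespace (including newlines) into a single space
clean_strings = [" ".join(s.split()) for s in raw_strings]

# one newline-separated block of text, e.g. for saving or indexing
page_text = "\n".join(clean_strings)
print(page_text)
```

If only a single string is needed and inner whitespace does not matter, soup.get_text(" ", strip=True) is a shorter alternative that joins the stripped strings with a space.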

Conclusion

Extracting the important text content with Beautiful Soup is an easy task. Before collecting the text, it is important to remove unnecessary content from the webpage. Scraping is a popular method for getting data. However, scraping website content without proper permission is unethical and sometimes illegal, so we need to obtain proper permission before scraping any website.

Reference

Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
