Introduction
This tutorial introduces a small recipe for extracting the important textual content from a webpage using Beautiful Soup, a Python web scraping library.
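What is important in a webpage and what is not important?
In this recipe, the important content is the visible text of the page, the description and keywords meta tags, and the alt text of images. Scripts, stylesheets, links, navigation bars, footers, forms, and SVG graphics are treated as unimportant and removed.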
Importing Libraries
import requests
from bs4 import BeautifulSoup as bs
Get the webpage content
page = requests.get("https://gayan830.github.io/bs_tutorial/fruits_vegetables.html")
soup = bs(page.content, "html.parser")  # name the parser explicitly so results do not depend on what is installed
print(soup.prettify())
output:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="Find out what nutrients are in most common Fruits and Vegetables." name="description"/>
<meta content="Fruits, vegitable list, Fruit Vegetables vitamin" name="keywords"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js">
</script>
<title>
Fruits and Vegetables
</title>
<link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0/dist/css/bootstrap.min.css" integrity="sha384-gH2yIJqKdNHPEq0n4Mqa/HGKIhSkIHeL5AyhkYV8i59U5AR6csBvApHHNl/vI1Bx" rel="stylesheet"/>
<script crossorigin="anonymous" integrity="sha384-ODmDIVzN+pFdexxHEHFBQH3/9/vQ9uori45z4JjnFsRydbmQbmL5t1tQ0culUzyK" src="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0/dist/js/bootstrap.min.js">
</script>
<link href="https://fonts.googleapis.com" rel="preconnect"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@100&display=swap" rel="stylesheet"/>
<link href="style.css" rel="stylesheet"/>
<style>
body {
width: 700px;
}
nav {
background-color: #2596be;
margin-bottom: 40px;
}
</style>
</head>
<body class="center">
<header>
<h1>
Fruits and Vegetables
</h1>
<nav>
<ul>
<li>
<a href="#home">
Home
</a>
</li>
<li>
<a href="#foods">
Fruits and Vegetables
</a>
</li>
<li>
<a href="#contact">
Contact
</a>
</li>
<li>
<a href="#about">
About
</a>
</li>
</ul>
</nav>
</header>
<main class="main-text">
<h2>
Fruits and Vegetables contain lot of Vitamins, minerals.
Eating fresh food and Vegetables can protect you from cancer and heart diseases.
</h2>
</main>
<section>
<div class="container center">
<figure>
<img alt="Fruit and Vegetables" src="Fruit_and_Vegetables.jpg"/>
<figcaption>
Fruit and Vegetables
</figcaption>
</figure>
</div>
<div class="container center">
<b>
List of common Fruits
</b>
<ol id="veg-list">
<li>
Banana
</li>
<li>
Mango
</li>
<li>
Guava
</li>
<li>
Jack Fruit
</li>
<li>
Grapes
</li>
</ol>
</div>
<div class="container center">
<b>
List of common Vegetables
</b>
<ol id="fruit-list">
<li>
Carrot
</li>
<li>
Beans
</li>
<li>
Chille
</li>
<li>
Potatoes
</li>
<li>
Tomatoes
</li>
</ol>
</div>
</section>
<section>
<h1>
Fruits Vitamins
</h1>
<table class="center">
<thead>
<tr>
<th colspan="2">
Fruits Vitamin Table
</th>
</tr>
<tr>
<th>
Name
</th>
<th>
Vitamins and minerals
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Banana
</td>
<td>
Potassium, Vitamin B6, Vitamin C
</td>
</tr>
<tr>
<td>
Mango
</td>
<td>
Iron, Calcium, Vitamin C, Vitamin B6, Vitamin A
</td>
</tr>
<tr>
<td>
Guava
</td>
<td>
Calcium, Vitamin C, Vitamin D, Vitamin A
</td>
</tr>
</tbody>
</table>
</section>
<hr/>
<footer>
<p>
<b>
Contact the author of this page:
</b>
</p>
<aside>
<address>
<a href="123@xyz.com">
123@xyz.com
</a>
<br/>
<a href="tel:+000000000">
(000) 000-0000
</a>
</address>
</aside>
<aside>
<address>
<b>
Address:
</b>
<br/>
addres line 1
<br/>
addres line 2
<br/>
City, State
</address>
</aside>
</footer>
</body>
</html>
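The fetch-and-parse step above can be wrapped in a small helper that fails early on a bad response. This is a minimal sketch, not part of the original recipe: the name fetch_soup and the 10-second default timeout are assumptions chosen for illustration. It relies only on requests.get with a timeout, Response.raise_for_status, and the BeautifulSoup constructor with the built-in "html.parser".

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (fetch_soup is a hypothetical helper name)."""
    # A timeout keeps the script from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # Passing the parser name avoids Beautiful Soup guessing one and warning about it.
    return BeautifulSoup(response.content, "html.parser")
```

Calling fetch_soup with the tutorial URL used above should return a soup equivalent to the one printed in this section, assuming the page is reachable.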
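Remove unwanted tags and get the values of some of the important attributes
Some tags hold useful text in their attributes rather than in their body, for example the description and keywords meta tags and the alt text of images: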
<meta content="Find out what nutrients are in most common Fruits and Vegetables." name="description"/>
<meta content="Fruits, vegitable list, Fruit Vegetables vitamin" name="keywords"/>
<img alt="Fruit and Vegetables" src="Fruit_and_Vegetables.jpg"/>
webpage_content = []
for element in soup.find_all(['meta', 'img', 'script', 'link', 'style', 'nav', 'footer', 'form', 'svg']):
    # extract meta tag descriptions and keywords
    if element.get('name') in ['keywords', 'description']:
        webpage_content.append(element.get('content'))
    # get alt text from images
    elif element.get('alt'):
        webpage_content.append(element.get('alt'))
    element.decompose()  # remove the element from the tree
After extracting the information from the tags listed above, let's look at the webpage_content list:
print(webpage_content)
['Find out what nutrients are in most common Fruits and Vegetables.',
'Fruits, vegitable list, Fruit Vegetables vitamin',
'Fruit and Vegetables']
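To try the attribute extraction without a network connection, the same loop can be run on a small inline document. The sketch below is illustrative only: SAMPLE_HTML, UNWANTED_TAGS, and harvest_and_prune are hypothetical names, and the sample page is made up and deliberately has no unwanted tag nested inside another one.

```python
from bs4 import BeautifulSoup

# A made-up page used only for this illustration.
SAMPLE_HTML = """
<html>
<head>
<meta name="description" content="A tiny sample page."/>
<meta name="keywords" content="sample, demo"/>
<meta charset="utf-8"/>
<title>Sample</title>
<style>body { width: 700px; }</style>
</head>
<body>
<nav><a href="#home">Home</a></nav>
<img alt="A sample image" src="sample.jpg"/>
<p>Visible paragraph.</p>
<footer><p>Contact</p></footer>
</body>
</html>
"""

UNWANTED_TAGS = ["meta", "img", "script", "link", "style", "nav", "footer", "form", "svg"]


def harvest_and_prune(soup):
    """Collect useful attribute values, then remove the unwanted tags from the tree."""
    harvested = []
    for element in soup.find_all(UNWANTED_TAGS):
        if element.get("name") in ("keywords", "description"):
            # meta description and keywords keep their text in the content attribute
            harvested.append(element.get("content"))
        elif element.get("alt"):
            # images keep their text in the alt attribute
            harvested.append(element.get("alt"))
        element.decompose()  # drop the tag and everything inside it
    return harvested


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
harvested = harvest_and_prune(soup)
print(harvested)
# ['A tiny sample page.', 'sample, demo', 'A sample image']
print(list(soup.stripped_strings))
# ['Sample', 'Visible paragraph.']
```

The charset meta tag, the style block, the navigation bar, and the footer contribute nothing to either list, which is the intended effect of pruning before reading the remaining text.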
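Extracting remaining content
With the unwanted tags removed, soup.stripped_strings yields each remaining text string with leading and trailing whitespace stripped, skipping strings that contain only whitespace.
for string in soup.stripped_strings:
    webpage_content.append(string)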
print(webpage_content)
['Find out what nutrients are in most common Fruits and Vegetables.',
'Fruits, vegitable list, Fruit Vegetables vitamin',
'Fruit and Vegetables',
'Fruits and Vegetables',
'Fruits and Vegetables',
'Fruits and Vegetables contain lot of Vitamins, minerals.\n Eating fresh food and Vegetables can protect you from cancer and heart diseases.',
'Fruit and Vegetables',
'List of common Fruits',
'Banana',
'Mango',
'Guava',
'Jack Fruit',
'Grapes',
'List of common Vegetables',
'Carrot',
'Beans',
'Chille',
'Potatoes',
'Tomatoes',
'Fruits Vitamins',
'Fruits Vitamin Table',
'Name',
'Vitamins and minerals',
'Banana',
'Potassium, Vitamin B6, Vitamin C',
'Mango',
'Iron, Calcium, Vitamin C, Vitamin B6, Vitamin A',
'Guava',
'Calcium, Vitamin C, Vitamin D, Vitamin A']
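The extracted list still contains repeated strings and a heading with an embedded line break. A possible post-processing step, which is not part of the original recipe, is to collapse whitespace and drop exact repeats. In the sketch below, clean_strings is a hypothetical helper name and the sample list is a shortened excerpt of the output above.

```python
def clean_strings(strings):
    """Collapse internal whitespace and drop exact repeats, keeping first-seen order."""
    # str.split() with no arguments splits on any run of whitespace, including "\n".
    normalized = (" ".join(s.split()) for s in strings)
    # dict.fromkeys preserves insertion order (Python 3.7+), so order is kept.
    return list(dict.fromkeys(s for s in normalized if s))


sample = [
    'Fruits and Vegetables',
    'Fruits and Vegetables',
    'Fruits and Vegetables contain lot of Vitamins, minerals.\n Eating fresh food and Vegetables can protect you from cancer and heart diseases.',
    'Banana',
    'Mango',
    'Banana',
]
cleaned = clean_strings(sample)
print(cleaned)
```

Whether to deduplicate depends on the use case: it also removes legitimate repeats, such as a fruit name that appears both in a list and in a table.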
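Conclusion
This recipe collected the description and keywords from the meta tags and the alt text from images, removed the tags that hold no readable content, and gathered the remaining visible text with soup.stripped_strings. For more details, see: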
- Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/