Introduction
This tutorial presents a short recipe for extracting the important text content from a webpage using Beautiful Soup, a Python library for parsing HTML.
What is important in a webpage and what is not important?
When extracting data from an HTML page, some tags need to be removed,
such as script, style, link, and most meta tags. The content of the navbar and footer elements is usually not
very important either, so we exclude those elements from the extraction as well.
Importing Libraries
We import Beautiful Soup and the Python requests module.
import requests
from bs4 import BeautifulSoup as bs
Get the webpage content
We use a simple webpage hosted on GitHub for this process; its URL appears in the code below.
Let's fetch the page and load its content into a BeautifulSoup object.
page = requests.get("https://gayan830.github.io/bs_tutorial/fruits_vegetables.html")
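soup = bs(page.content, "html.parser")  # name the parser explicitly so results do not depend on which parsers are installed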
print(soup.prettify())
output:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="Find out what nutrients are in most common Fruits and Vegetables." name="description"/>
<meta content="Fruits, vegitable list, Fruit Vegetables vitamin" name="keywords"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js">
</script>
<title>
Fruits and Vegetables
</title>
<link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0/dist/css/bootstrap.min.css" integrity="sha384-gH2yIJqKdNHPEq0n4Mqa/HGKIhSkIHeL5AyhkYV8i59U5AR6csBvApHHNl/vI1Bx" rel="stylesheet"/>
<script crossorigin="anonymous" integrity="sha384-ODmDIVzN+pFdexxHEHFBQH3/9/vQ9uori45z4JjnFsRydbmQbmL5t1tQ0culUzyK" src="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0/dist/js/bootstrap.min.js">
</script>
<link href="https://fonts.googleapis.com" rel="preconnect"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@100&display=swap" rel="stylesheet"/>
<link href="style.css" rel="stylesheet"/>
<style>
body {
width: 700px;
}
nav {
background-color: #2596be;
margin-bottom: 40px;
}
</style>
</head>
<body class="center">
<header>
<h1>
Fruits and Vegetables
</h1>
<nav>
<ul>
<li>
<a href="#home">
Home
</a>
</li>
<li>
<a href="#foods">
Fruits and Vegetables
</a>
</li>
<li>
<a href="#contact">
Contact
</a>
</li>
<li>
<a href="#about">
About
</a>
</li>
</ul>
</nav>
</header>
<main class="main-text">
<h2>
Fruits and Vegetables contain lot of Vitamins, minerals.
Eating fresh food and Vegetables can protect you from cancer and heart diseases.
</h2>
</main>
<section>
<div class="container center">
<figure>
<img alt="Fruit and Vegetables" src="Fruit_and_Vegetables.jpg"/>
<figcaption>
Fruit and Vegetables
</figcaption>
</figure>
</div>
<div class="container center">
<b>
List of common Fruits
</b>
<ol id="veg-list">
<li>
Banana
</li>
<li>
Mango
</li>
<li>
Guava
</li>
<li>
Jack Fruit
</li>
<li>
Grapes
</li>
</ol>
</div>
<div class="container center">
<b>
List of common Vegetables
</b>
<ol id="fruit-list">
<li>
Carrot
</li>
<li>
Beans
</li>
<li>
Chille
</li>
<li>
Potatoes
</li>
<li>
Tomatoes
</li>
</ol>
</div>
</section>
<section>
<h1>
Fruits Vitamins
</h1>
<table class="center">
<thead>
<tr>
<th colspan="2">
Fruits Vitamin Table
</th>
</tr>
<tr>
<th>
Name
</th>
<th>
Vitamins and minerals
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Banana
</td>
<td>
Potassium, Vitamin B6, Vitamin C
</td>
</tr>
<tr>
<td>
Mango
</td>
<td>
Iron, Calcium, Vitamin C, Vitamin B6, Vitamin A
</td>
</tr>
<tr>
<td>
Guava
</td>
<td>
Calcium, Vitamin C, Vitamin D, Vitamin A
</td>
</tr>
</tbody>
</table>
</section>
<hr/>
<footer>
<p>
<b>
Contact the author of this page:
</b>
</p>
<aside>
<address>
<a href="123@xyz.com">
123@xyz.com
</a>
<br/>
<a href="tel:+000000000">
(000) 000-0000
</a>
</address>
</aside>
<aside>
<address>
<b>
Address:
</b>
<br/>
addres line 1
<br/>
addres line 2
<br/>
City, State
</address>
</aside>
</footer>
</body>
</html>
The prettify() method formats the HTML markup with each tag and each string on its own line,
which makes the output easier to read.
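The request earlier in this section assumes the download succeeds. A slightly more defensive version might look like the sketch below. The helper names `parse_html` and `fetch_soup` and the 10-second timeout are illustrative choices, not part of the original recipe.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Parse markup with Python's built-in parser, named explicitly."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper).

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently parsing an error page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return parse_html(response.content)


# Offline demonstration of the parsing step on an inline document.
demo = parse_html(
    "<html><head><title>Fruits and Vegetables</title></head><body></body></html>"
)
print(demo.title.string)
```

Calling `fetch_soup` with the tutorial URL should give the same kind of soup object as the two-line version, but a failed request then surfaces as an exception rather than as unexpected output.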
Remove unwanted tags and get the values of some of the important attributes
Some meta tags carry relevant information about the page, such as
keywords and a description. The meta tags below are examples, so we extract the
values of their content attributes, along with the value of the alt attribute of the image.
<meta content="Find out what nutrients are in most common Fruits and Vegetables." name="description"/>
<meta content="Fruits, vegitable list, Fruit Vegetables vitamin" name="keywords"/>
<img alt="Fruit and Vegetables" src="Fruit_and_Vegetables.jpg"/>
webpage_content = []
for element in soup.find_all(['meta', 'img', 'script', 'link', 'style', 'nav', 'footer', 'form', 'svg']):
    # extract meta tag descriptions and keywords
    if element.get('name') in ['keywords', 'description']:
        webpage_content.append(element.get('content'))
    # get alt text from images
    elif element.get('alt'):
        webpage_content.append(element.get('alt'))
    element.decompose()  # remove the element from the tree
After extracting the information from the tags listed above, let's look at the webpage_content list:
print(webpage_content)
output:
['Find out what nutrients are in most common Fruits and Vegetables.',
'Fruits, vegitable list, Fruit Vegetables vitamin',
'Fruit and Vegetables']
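The loop above reads attribute values and removes tags in the same pass. As a variation, the sketch below separates the two steps, so every attribute is read while the tree is still intact and a tag nested inside another unwanted tag is never visited after its parent has been removed. The function name `collect_and_strip`, the two constants, and the sample HTML are all made up for illustration. Unlike the loop above, it also lowercases the meta name and skips meta tags that have no content.

```python
from bs4 import BeautifulSoup

# Made-up page for illustration; not the tutorial's fruits_vegetables.html.
SAMPLE_HTML = """
<html>
  <head>
    <meta name="description" content="A short page about fruit.">
    <meta name="Keywords" content="fruit, vitamins">
    <meta name="viewport" content="width=device-width">
    <meta name="description">
    <style>body { width: 700px; }</style>
  </head>
  <body>
    <nav><a href="#home">Home</a></nav>
    <img src="a.jpg" alt="A bowl of fruit">
    <img src="b.jpg" alt="">
    <p>Bananas contain potassium.</p>
    <footer><form><input name="q"></form>Contact us</footer>
  </body>
</html>
"""

UNWANTED_TAGS = ["meta", "img", "script", "link", "style",
                 "nav", "footer", "form", "svg"]
USEFUL_META_NAMES = {"keywords", "description"}


def collect_and_strip(soup):
    """Return useful attribute text, then remove unwanted tags in place."""
    collected = []

    # Pass 1: read attribute values while every tag is still in the tree.
    for meta in soup.find_all("meta"):
        name = (meta.get("name") or "").lower()
        content = meta.get("content")
        if name in USEFUL_META_NAMES and content:
            collected.append(content)
    for img in soup.find_all("img"):
        if img.get("alt"):
            collected.append(img["alt"])

    # Pass 2: remove one unwanted tag at a time, searching again after
    # each removal so only tags still attached to the tree are touched.
    while (tag := soup.find(UNWANTED_TAGS)) is not None:
        tag.decompose()

    return collected


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(collect_and_strip(sample_soup))
print(list(sample_soup.stripped_strings))
```

Searching again after each removal is slower than a single `find_all` pass, which is a reasonable trade for a small page but may not suit very large documents.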
Extracting remaining content
We extract the text from the remaining tags using the following code snippet:
for string in soup.stripped_strings:
    webpage_content.append(string)
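To see what `stripped_strings` yields in isolation, here is a small sketch on an inline fragment. The HTML and the variable names are made up for illustration. It also shows `get_text()` with `strip=True`, which joins the same stripped strings into one string when a list is not needed.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; it is not the tutorial page.
FRAGMENT = """
<div>
  <h1>  Fruits  </h1>
  <p>
    Banana
  </p>
  <p>   </p>
</div>
"""

fragment_soup = BeautifulSoup(FRAGMENT, "html.parser")

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that contain only whitespace.
strings = list(fragment_soup.stripped_strings)
print(strings)

# get_text() with strip=True joins the stripped strings with a separator.
joined = fragment_soup.get_text(separator=" ", strip=True)
print(joined)
```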
Now our webpage_content list has all the essential text content extracted from the webpage.
print(webpage_content)
output:
['Find out what nutrients are in most common Fruits and Vegetables.',
'Fruits, vegitable list, Fruit Vegetables vitamin',
'Fruit and Vegetables',
'Fruits and Vegetables',
'Fruits and Vegetables',
'Fruits and Vegetables contain lot of Vitamins, minerals.\n Eating fresh food and Vegetables can protect you from cancer and heart diseases.',
'Fruit and Vegetables',
'List of common Fruits',
'Banana',
'Mango',
'Guava',
'Jack Fruit',
'Grapes',
'List of common Vegetables',
'Carrot',
'Beans',
'Chille',
'Potatoes',
'Tomatoes',
'Fruits Vitamins',
'Fruits Vitamin Table',
'Name',
'Vitamins and minerals',
'Banana',
'Potassium, Vitamin B6, Vitamin C',
'Mango',
'Iron, Calcium, Vitamin C, Vitamin B6, Vitamin A',
'Guava',
'Calcium, Vitamin C, Vitamin D, Vitamin A']
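The list above still contains a string with an embedded newline and some repeated entries, such as the page title and the main heading. If that matters for your use case, an optional clean-up step could look like the sketch below. The function name `clean_strings` and the sample list are illustrative. Note that dropping repeats also removes labels that legitimately appear twice, such as 'Banana' in both the list and the table, so treat this as one possible choice rather than a required step.

```python
import re


def clean_strings(strings):
    """Collapse internal whitespace and drop exact repeats.

    Keeps the first occurrence of each string, in the original order.
    """
    seen = set()
    cleaned = []
    for text in strings:
        text = re.sub(r"\s+", " ", text).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# A shortened stand-in for webpage_content.
sample = [
    "Fruits and Vegetables",
    "Fruits and Vegetables",
    "Fruits and Vegetables contain lot of Vitamins, minerals.\n    "
    "Eating fresh food and Vegetables can protect you from cancer and heart diseases.",
    "Banana",
    "Mango",
    "Banana",
]

cleaned_sample = clean_strings(sample)
print(cleaned_sample)
print(" ".join(cleaned_sample))
```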
Conclusion
Extracting the important text content with Beautiful Soup is straightforward. Before collecting the text, it is important to remove unnecessary elements from the parsed page. Scraping is a popular method for getting data. However, scraping website content without proper permission is unethical and sometimes illegal, so make sure you have permission before scraping any website.
Reference
- Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/