Whenever data harvesting comes up, the term web scraping is not far behind; it has become increasingly popular in recent years. With the wealth of resources available online, web scraping has become the go-to tool for collecting data at scale. Owing to its ease of use and the considerable number of libraries available, Python has grown into one of the preferred languages for scraping websites. In this context, one of the most efficient and widely used libraries is BeautifulSoup. This article explains the concept of web scraping in Python, focusing on the BeautifulSoup library and how to use it to extract and process data from web pages.
How Do You Define Web Scraping
To put it simply, web scraping is the extraction of data from websites. Content on most sites is presented in HTML, a format designed for people to read in a browser. The same HTML can also be read by programs, and that is what makes web scraping possible. In other words, web scraping is the process of collecting the data found on various pages and saving it in a desired format for analysis, research or any other future use.

Even if web scraping is remarkably useful, it should be used responsibly. Certain websites explicitly prohibit scraping in their Terms of Use, so always comply with the posted rules of the site you are scraping data from.
Introduction to BeautifulSoup
BeautifulSoup is a Python library that makes parsing and navigating HTML documents faster and easier. It provides convenient methods for moving around the page's HTML tree, which simplifies extracting specific pieces of information. Note that BeautifulSoup only parses the HTML it is given; it does not execute JavaScript that dynamically creates content (more on this later).

One of the reasons so many people reach for BeautifulSoup is its simple, intuitive API. A few lines of code are enough to start scraping. It also copes well with common problems such as broken or non-standard HTML pages. There are several HTML parsers BeautifulSoup can work with, but most of the time the built-in one is fine.
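To see how forgiving it is, here is a minimal sketch with a deliberately malformed snippet (the markup below is made up purely for illustration); BeautifulSoup still builds a tree you can search.

from bs4 import BeautifulSoup

# Deliberately broken HTML: the <b> tag is never closed and </html> is missing.
messy_html = "<html><body><p>Hello, <b>world</p></body>"

soup = BeautifulSoup(messy_html, 'html.parser')

# The parser repairs what it can and still gives you a navigable tree.
print(soup.prettify())
print(soup.find('b').text)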
Setting Up The Necessary Libraries
Before starting to work with web scraping through BeautifulSoup, you have to install the appropriate libraries. The two libraries you will need first are `requests` and `beautifulsoup4`. The `requests` library lets you fetch the contents of a web page, and `beautifulsoup4` handles parsing and traversing the HTML. To set up these libraries, you can use the command below in your terminal or command prompt:
pip install requests beautifulsoup4

As soon as they are installed, it's time to start writing Python code to crawl through sites.
Getting the Contents of a Web Page
The first step of any web scraping task is fetching the page of the site you wish to scrape. To do this, you use the `requests` library: `requests.get()` issues an HTTP GET request to the given URL and brings back the page's HTML. Use the code below to see how to make a request and get a webpage:
import requests
url = "https://example.com"
response = requests.get(url)
html_content = response.text

In this example, `response.text` holds the HTML of the page. Now that you have this HTML, you can pass it to BeautifulSoup, which will take care of parsing it and extracting data.
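Before parsing, it is also worth checking that the request actually succeeded. The snippet below is just one common pattern with `requests` (the URL is a placeholder, and the timeout value is arbitrary):

import requests

url = "https://example.com"
response = requests.get(url, timeout=10)

# Raise an exception if the server answered with an error status (4xx or 5xx).
response.raise_for_status()

print(response.status_code)                   # e.g. 200 when everything is fine
print(response.headers.get("Content-Type"))   # typically something like text/html
html_content = response.text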
BeautifulSoup HTML Parsing
When you have the HTML content of a page, it is not very useful in that raw state; it needs to be parsed before you can work with it. This is where BeautifulSoup comes in. The `BeautifulSoup` class takes the HTML content as its argument and constructs a parse tree that lets you search and traverse the HTML. Wondering how to create a soup? Here's how:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

In this code, `'html.parser'` tells BeautifulSoup which parser to employ. Other parsers are available, but the standard one suffices for the majority of tasks.
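If you ever want to try one of those other parsers, `lxml` is a common choice; note that it has to be installed separately (for example with `pip install lxml`). A minimal sketch:

from bs4 import BeautifulSoup

html_content = "<html><body><p>Hello</p></body></html>"

# Same call as before, only with the third-party lxml parser instead of
# the built-in html.parser. Requires: pip install lxml
soup = BeautifulSoup(html_content, 'lxml')
print(soup.p.text)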
How to Navigate the HTML
An HTML document is structured as a tree, with its elements arranged hierarchically. BeautifulSoup provides functions that let you navigate and search through this tree: you can look for specific tags, get the content enclosed within them, or read their attributes.

Searching for an Exact Tag
Whenever you need to extract a particular HTML element, you simply use the `find()` method. For example, to get the content of the first `<h1>` header on the page:

h1_tag = soup.find('h1')
print(h1_tag.text)

Here `soup.find('h1')` returns the first `<h1>` tag found on the page, and `.text` retrieves the textual content inside that tag.
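`find()` can also be narrowed down with attributes, which helps when a page contains many tags of the same name. The class name and id below are made up for illustration:

# Match on the tag name plus its attributes; note the trailing underscore in
# class_, which avoids clashing with Python's class keyword.
article = soup.find('div', class_='article-body')
intro = soup.find(id='intro')

# find() returns None when nothing matches, so check before using the result.
if article is not None:
    print(article.text)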
Retrieving Tags in Bulk
If you want to widen the search, for example to find all the anchor tags `<a>` on the page, you can use the `find_all()` method. This method returns every matching tag as a list.

a_tags = soup.find_all('a')
for a in a_tags:
    print(a.get('href'))

In this example, `soup.find_all('a')` collects all the anchor tags on the page, and `a.get('href')` then reads the href attribute of each individual link in the list.
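In practice you often want only some of those links, for instance the ones that actually carry an href attribute and point to absolute URLs. One way to filter the results (the startswith check is just an example):

# Keep only anchor tags that actually have an href attribute.
links = soup.find_all('a', href=True)

# Narrow the list further down to absolute http(s) URLs.
absolute_links = [a['href'] for a in links if a['href'].startswith('http')]

for href in absolute_links:
    print(href)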
Extracting Information inside the Tags
Many HTML tags carry additional information in their attributes that can be useful for certain tasks. For example, anchor tags come with an `href` attribute that specifies the link's destination URL. To read an attribute, the `get()` method does the trick.
For instance, if you wish to retrieve the `src` attribute of an image tag, which holds the URL of the image, it can be done in the following way.
img_tag = soup.find('img')
img_url = img_tag.get('src')
print(img_url)

This will print the URL of the image, that is, the `src` attribute of the first image tag found on the page.
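It is worth knowing that `get()` returns None when an attribute is missing, whereas dictionary-style access raises an error, so `get()` is the safer choice when you are not sure the attribute exists. A small sketch (the fallback text is just an example):

img_tag = soup.find('img')

if img_tag is not None:
    # Dictionary-style access (img_tag['src']) raises KeyError if the
    # attribute is missing; get() lets you supply a fallback value instead.
    alt_text = img_tag.get('alt', 'no alt text provided')
    print(alt_text)

    # All of the tag's attributes are also available as a plain dict.
    print(img_tag.attrs)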
Traversing the HTML Document
Traversing the HTML document is one of the defining features of BeautifulSoup. Every tag in the document comes with attributes and methods that let you move through the tree structure, for example to access the parent, children or siblings of a given tag. Suppose you want to know the parent element of a given tag; then you can run the following code.
child_tag = soup.find('p')
parent_tag = child_tag.parent
print(parent_tag)

This will print the parent tag of the first `<p>` tag found on the webpage.
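Parents are only one direction you can move in; tags also expose their siblings, children and ancestors. A minimal sketch of a few of these navigation attributes:

first_p = soup.find('p')

if first_p is not None:
    # The next tag at the same level of the tree (None if there is none).
    print(first_p.find_next_sibling())

    # Direct children of the parent element; plain text nodes show up here too.
    for child in first_p.parent.children:
        print(child)

    # Every ancestor, from the immediate parent up to the document itself.
    for ancestor in first_p.parents:
        print(ancestor.name)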
Looking up Content that is Dynamically Loaded
Many websites rely on JavaScript to load their content after the page is delivered. Unfortunately, BeautifulSoup cannot help here, as it only parses static HTML and does not render anything. However, you can use Python libraries such as Selenium or Playwright to execute the JavaScript and fetch the HTML content the scripts produce. Once you have that final HTML in hand, you can pass it to BeautifulSoup for parsing as usual.
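As a rough sketch of that idea, here is one way it might look with Selenium (this assumes Chrome and a matching driver are available and that `selenium` has been installed, e.g. with `pip install selenium`; the URL is a placeholder, and real pages may need an explicit wait before the content appears):

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://example.com"

# Launch a real browser so the page's JavaScript actually runs.
driver = webdriver.Chrome()
driver.get(url)

# Grab the HTML as the browser sees it after rendering, then close the browser.
rendered_html = driver.page_source
driver.quit()

# From here on it is ordinary BeautifulSoup work.
soup = BeautifulSoup(rendered_html, 'html.parser')
print(soup.title.text if soup.title else "no <title> found")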