Web Scraping with BeautifulSoup and Requests in Python

The internet is a vast source of information, but not all data is easily accessible in structured formats. That’s where web scraping comes in. Web scraping allows you to extract, process, and analyze data from websites, turning unstructured web pages into structured datasets.

In this guide, we’ll cover:

✅ What web scraping is and how it works
✅ The ethics and legality of web scraping
✅ How to use Requests and BeautifulSoup for scraping
✅ Practical examples with Python
✅ Common challenges and how to overcome them

Let’s dive in!


What is Web Scraping?

Web scraping is the process of extracting data from websites and converting it into a structured format like CSV, JSON, or databases. It is widely used in fields like:

  • Market research – Extracting competitor prices and product details
  • Data science – Collecting datasets for machine learning models
  • Finance – Tracking stock prices and financial news
  • Real estate – Aggregating property listings
  • Academic research – Gathering information for analysis

A typical web scraping process involves:

  1. Sending an HTTP request to a website
  2. Downloading the webpage content (HTML)
  3. Parsing the HTML to extract relevant data
  4. Saving the data in a structured format

For this, Python provides powerful libraries like requests and BeautifulSoup.
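
To make those four steps concrete, here is a minimal end-to-end sketch using example.com as a stand-in URL; each step is covered in detail later in this guide.

import csv

import requests
from bs4 import BeautifulSoup

# Steps 1 and 2: send the request and download the HTML
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

# Step 3: parse the HTML and extract the data we care about
soup = BeautifulSoup(response.text, "html.parser")
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]

# Step 4: save the result in a structured format (CSV here)
with open("headings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading"])
    writer.writerows([h] for h in headings)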


Is Web Scraping Legal and Ethical?

Before scraping, you must ensure you’re following ethical and legal guidelines:

Check the website’s robots.txt file
Websites often publish a robots.txt file (e.g., example.com/robots.txt) that tells crawlers which parts of the site they may and may not access. Check it before you scrape.
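
You don’t have to parse robots.txt by hand; Python’s standard library includes urllib.robotparser. A minimal sketch (the URLs are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch a given page
if rp.can_fetch("*", "https://example.com/products"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")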

Avoid overloading the server
Sending too many requests in a short time can slow down or crash a website. Always use time delays between requests.
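
The simplest polite approach is to sleep between requests; the one-second pause below is an arbitrary starting point, so adjust it to the site:

import time

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # wait one second before the next request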

Scrape only publicly available data
Avoid scraping private or sensitive information without permission.

Comply with terms of service
Read the website’s terms of service to understand their data usage policies.


Getting Started: Installing Required Libraries

To follow along, install the necessary libraries:

pip install requests beautifulsoup4

  • requests – For making HTTP requests and retrieving webpage content
  • beautifulsoup4 – For parsing and extracting data from HTML

Step 1: Fetching a Webpage with Requests

The first step in web scraping is retrieving the webpage’s HTML.

import requests

url = "https://example.com"
response = requests.get(url)

print(response.status_code)  # Check if the request was successful
print(response.text)  # Print the HTML content

Handling Request Errors

Sometimes, a request might fail due to a bad URL, server issues, or denied access. Always handle exceptions:

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # Raises an error for 4xx or 5xx status codes
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

Step 2: Parsing HTML with BeautifulSoup

Once we fetch the webpage, we need to extract useful information from the HTML.

Creating a BeautifulSoup Object

from bs4 import BeautifulSoup

html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")

print(soup.prettify())  # Prints formatted HTML

Extracting Specific Elements

Extracting the Title of the Page

title = soup.title.text if soup.title else ""  # soup.title is None when the page has no <title>
print("Page Title:", title)

Extracting All Links from a Page

for link in soup.find_all("a"):
    print(link.get("href"))
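
Note that href values are often relative paths like /about. Continuing from the url and soup defined above, urllib.parse.urljoin resolves them into full URLs:

from urllib.parse import urljoin

for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # some <a> tags have no href attribute
        print(urljoin(url, href))  # turns /about into https://example.com/about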

Extracting Specific Data Using CSS Selectors

heading = soup.select_one("h1")
if heading:  # select_one returns None when nothing matches
    print("First Heading:", heading.text)
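
While select_one returns the first match, select returns every element matching a CSS selector. The selector below is hypothetical; adjust it to the page you’re scraping:

# All level-2 headings inside a (hypothetical) main content div
for h2 in soup.select("div.content h2"):
    print(h2.get_text(strip=True))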

Step 3: Scraping a Real-World Website

Let’s scrape product details from an e-commerce site. Suppose we want to extract product names and prices from a page like example.com/products, where each product sits in a div with the class product-item. Class names vary from site to site, so inspect the page’s HTML before writing your selectors.

url = "https://example.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

products = soup.find_all("div", class_="product-item")  # one div per product card

for product in products:
    name = product.find("h2").text                       # product name in an <h2>
    price = product.find("span", class_="price").text    # price in a span
    print(f"Product: {name}, Price: {price}")

Saving Data to a CSV File

import csv

with open("products.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Price"])

    for product in products:
        name = product.find("h2").text
        price = product.find("span", class_="price").text
        writer.writerow([name, price])

print("Data saved successfully!")

Handling Dynamic Websites (JavaScript-Rendered Pages)

Some websites load data dynamically using JavaScript, making standard HTML parsing ineffective. Solutions include:

Using Selenium – Automates browser interactions
Using Scrapy – A more advanced web scraping framework
Accessing APIs – Some sites provide APIs with structured data
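
When a site exposes a JSON API, requests can consume it directly with no HTML parsing at all. The endpoint and response shape below are assumptions; real endpoints usually show up in your browser’s network tab:

import requests

api_url = "https://example.com/api/products"  # hypothetical endpoint
response = requests.get(api_url, timeout=10)
response.raise_for_status()

data = response.json()  # parse the JSON body into Python objects
for item in data.get("products", []):  # assumed response shape
    print(item.get("name"), item.get("price"))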

Scraping a JavaScript-Rendered Page with Selenium

pip install selenium

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # Selenium 4.6+ downloads a matching ChromeDriver automatically
driver.get("https://example.com")

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)
driver.quit()

Avoiding Common Web Scraping Challenges

1. Handling IP Bans

Websites may block repeated requests from the same IP. Solutions:

  • Use rotating proxies: Services like ScraperAPI or Bright Data
  • Use User-Agent rotation: Mimic real browsers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get(url, headers=headers)
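
For proxies, requests accepts a proxies mapping, which is how rotating-proxy services are typically wired in. The proxy address below is a placeholder; a real service gives you its own endpoints:

proxies = {
    "http": "http://user:password@proxy.example.com:8080",   # placeholder address
    "https": "http://user:password@proxy.example.com:8080",
}

response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
print(response.status_code)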

2. Dealing with CAPTCHA

Some sites use CAPTCHA to block bots. Solutions:

  • Use OCR-based CAPTCHA solvers like Tesseract
  • Use APIs like 2Captcha

Web Scraping Best Practices

Respect robots.txt – Don’t scrape restricted pages
Use delays – Avoid overwhelming the server
Use proxy rotation – Prevent IP bans
Store data efficiently – Use databases like PostgreSQL or MongoDB
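
As a lightweight stand-in for PostgreSQL or MongoDB, Python’s built-in sqlite3 module is enough for small scraping jobs. A minimal sketch with placeholder data:

import sqlite3

conn = sqlite3.connect("products.db")
conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)")

rows = [("Widget", "$9.99"), ("Gadget", "$19.99")]  # placeholder rows from your scraper
conn.executemany("INSERT INTO products VALUES (?, ?)", rows)

conn.commit()
conn.close()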


Real-World Applications of Web Scraping

  1. E-commerce – Price tracking and competitor analysis
  2. News Aggregation – Collecting articles from multiple sources
  3. Finance – Extracting stock market data
  4. Job Market Analysis – Scraping job postings
  5. Sports Analytics – Collecting match statistics

Conclusion

Web scraping is a powerful tool for extracting and analyzing web data. In this guide, we covered:

✅ How web scraping works and its ethical considerations
✅ Using requests to fetch HTML pages
✅ Parsing and extracting data with BeautifulSoup
✅ Handling JavaScript-rendered pages with Selenium
✅ Avoiding IP bans and CAPTCHA challenges

By following best practices and legal guidelines, you can leverage web scraping for research, business intelligence, and data-driven decision-making.

What’s Next?

If you found this guide helpful, try applying these techniques to real-world projects. You can:

🔹 Scrape job postings for job market trends
🔹 Build a stock market scraper
🔹 Automate data collection for research

Let us know in the comments how you’re using web scraping in your projects! 🚀
