Web Scraper Python: Complete Tutorial (2026)

Python web scraping uses libraries like BeautifulSoup for static HTML and Selenium or Playwright for JavaScript-heavy sites. This tutorial covers both approaches with working code examples — then explains why modern AI testing tools can handle scraping automatically without writing selector code that breaks every time a site updates.

Key Takeaways

BeautifulSoup is for static HTML. If the data you need is in the initial HTML response (view source and see it), BeautifulSoup + Requests is faster, simpler, and less fragile than browser automation.

Selenium and Playwright are for JavaScript-rendered content. When data loads after the page — via AJAX, React, Vue, or Angular — you need a real browser to execute the JavaScript before scraping.

Selectors break constantly. The biggest maintenance burden in web scraping is that CSS selectors and XPath expressions stop working whenever the site redesigns. Brittle selectors are why scrapers require regular maintenance.

Always handle pagination, rate limiting, and errors. Production scrapers need exponential backoff, user-agent rotation, and logic to resume from where they left off. Tutorial code omits this — production code can't.

AI-powered tools skip the selector problem entirely. Instead of writing div.product-card > h2.title, you tell an AI agent "get all product names" and it figures out the selectors, handles JavaScript, and adapts when the site changes.

What Is Web Scraping?

Web scraping is the automated extraction of data from websites. Instead of manually copying information, you write code that loads pages, parses the HTML, and extracts structured data.

Common use cases:

  • Price monitoring — track competitor prices across e-commerce sites
  • Research data — collect publicly available datasets for analysis
  • Lead generation — extract business contact information from directories
  • Content aggregation — pull news, reviews, or listings from multiple sources
  • Testing — verify your own site's content or structure

Python is the most popular language for web scraping because of its mature library ecosystem: Requests for HTTP, BeautifulSoup for HTML parsing, and Selenium or Playwright for browser automation.

Python Web Scraping Libraries

Requests + BeautifulSoup (Static Sites)

The simplest combination. Requests fetches the raw HTML; BeautifulSoup parses it.

Install:

pip install requests beautifulsoup4

Basic example:

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"
response = requests.get(url, timeout=10)  # requests has no default timeout
soup = BeautifulSoup(response.content, "html.parser")

# Extract all book titles and prices
books = soup.find_all("article", class_="product_pod")
for book in books:
    title = book.find("h3").find("a")["title"]
    price = book.find("p", class_="price_color").text
    print(f"{title}: {price}")

When to use:

  • Data is in the initial HTML (not loaded by JavaScript)
  • No login or session management needed
  • High-volume scraping where speed matters
  • You need to run thousands of requests in parallel
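
Those last two points go together: because plain HTTP requests are cheap, you can fan them out across a thread pool. A minimal sketch of that pattern (the fetch function is stubbed so it runs without network access; in real code it would call requests.get):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stubbed so the sketch runs offline; in real use this would be
    # requests.get(url, timeout=10).text
    return f"<html>{url}</html>"

def fetch_all(urls, max_workers=8):
    # Threads suit this workload: it is I/O-bound, so each worker spends
    # most of its time waiting on the network rather than holding the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))  # map preserves input order

urls = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in (1, 2, 3)]
pages = fetch_all(urls)
print(len(pages))  # 3
```

Scale max_workers to the target site's tolerance, not your machine's; a pool of 8 against a small site is already aggressive.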

Limitations:

  • Can't execute JavaScript — if the data loads via AJAX, you get an empty result
  • No browser context — cookies, localStorage, service workers don't apply
  • Easy to block — no real browser fingerprint

Selenium (Browser Automation)

Selenium drives a real browser, so it handles JavaScript-rendered content, logins, and interactions like clicking and form submission.

Install:

pip install selenium
# Selenium 4.6+ downloads a matching ChromeDriver automatically (Selenium Manager);
# older versions need a chromedriver binary that matches your Chrome version

Example — scraping JavaScript-rendered content:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example-shop.com/products")

    # Wait for products to load (JavaScript renders them)
    wait = WebDriverWait(driver, 10)
    products = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card"))
    )

    results = []
    for product in products:
        name = product.find_element(By.CSS_SELECTOR, ".product-name").text
        price = product.find_element(By.CSS_SELECTOR, ".product-price").text
        results.append({"name": name, "price": price})

    print(results)

finally:
    driver.quit()

When to use:

  • JavaScript-heavy single-page applications (React, Vue, Angular)
  • Login-protected pages
  • Sites that require user interaction (click through pagination, fill forms)
  • When you need to test while scraping

Limitations:

  • Slow — browser startup adds 2-5 seconds per session
  • Resource-heavy — each Chrome instance uses 200-500MB RAM
  • ChromeDriver version must match Chrome — constant maintenance
  • Flaky — timing issues cause intermittent failures

Playwright (Modern Browser Automation)

Playwright is the modern alternative to Selenium. Faster, more reliable, with better async support and built-in waiting.

Install:

pip install playwright
playwright install chromium

Example — same scraping task, Playwright style:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto("https://example-shop.com/products")

    # Wait for products to appear
    page.wait_for_selector(".product-card")

    products = page.query_selector_all(".product-card")

    results = []
    for product in products:
        name = product.query_selector(".product-name").inner_text()
        price = product.query_selector(".product-price").inner_text()
        results.append({"name": name, "price": price})

    print(results)
    browser.close()

Async version (better for multiple pages):

import asyncio
from playwright.async_api import async_playwright

async def scrape_products():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        await page.goto("https://example-shop.com/products")
        await page.wait_for_selector(".product-card")

        products = await page.query_selector_all(".product-card")
        results = []

        for product in products:
            name_el = await product.query_selector(".product-name")
            price_el = await product.query_selector(".product-price")
            name = await name_el.inner_text() if name_el else ""
            price = await price_el.inner_text() if price_el else ""
            results.append({"name": name, "price": price})

        await browser.close()
        return results

asyncio.run(scrape_products())

When to use Playwright over Selenium:

  • New projects — Playwright's API is cleaner
  • You need async scraping at scale
  • You want built-in browser state management (save/restore login sessions)
  • You need Firefox or WebKit in addition to Chrome

Scrapy (Large-Scale Scraping)

For scraping thousands of pages, Scrapy is the professional framework. It handles request queuing, concurrency, middlewares, and pipelines.

Install:

pip install scrapy
scrapy startproject myproject

Example spider:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it:

scrapy crawl products -o products.json

When to use Scrapy:

  • Hundreds or thousands of pages
  • You need built-in rate limiting and retry logic
  • Your team needs a maintainable scraping codebase
  • You're building a production data pipeline

Limitations:

  • Learning curve — Scrapy has a framework-specific way of doing things
  • Static HTML only by default (add scrapy-playwright for JavaScript support)
  • Overkill for simple, one-off scraping tasks

Handling Common Challenges

Pagination

Almost every real scraping job needs to handle multiple pages.

BeautifulSoup — URL-based pagination:

import requests
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []

for page_num in range(1, 51):  # Up to 50 pages
    url = base_url.format(page_num)
    response = requests.get(url, timeout=10)

    if response.status_code == 404:
        break

    soup = BeautifulSoup(response.content, "html.parser")
    books = soup.find_all("article", class_="product_pod")

    for book in books:
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text
        })

print(f"Scraped {len(all_books)} books")

Playwright — clicking "Next" button:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")

    all_products = []

    while True:
        page.wait_for_selector(".product-item")
        items = page.query_selector_all(".product-item")

        for item in items:
            all_products.append(item.inner_text())

        next_button = page.query_selector("button.next-page")
        if not next_button or not next_button.is_enabled():
            break

        next_button.click()
        page.wait_for_load_state("networkidle")

    browser.close()

Rate Limiting and Delays

Scraping too fast gets you blocked. Add polite delays:

import time
import random
import requests

def polite_get(url, min_delay=1.0, max_delay=3.0):
    """Fetch URL with a random delay to avoid getting blocked."""
    time.sleep(random.uniform(min_delay, max_delay))

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }

    return requests.get(url, headers=headers, timeout=10)

Error Handling and Retries

Production scrapers need retry logic for network failures:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    session = requests.Session()

    retry_strategy = Retry(
        total=3,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=2,  # Waits 2s, 4s, 8s between retries
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    return session

session = create_session()
# Retries don't help if a request hangs forever — always set a timeout too
response = session.get("https://example.com/data", timeout=10)

Login and Session Management

Many sites require authentication before you can scrape:

from playwright.sync_api import sync_playwright
import json
import os

def get_authenticated_page(playwright, login_url, username, password):
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()

    # Check if we have a saved session
    if os.path.exists("session.json"):
        with open("session.json") as f:
            cookies = json.load(f)
        context.add_cookies(cookies)
        page.goto(login_url)

        # Check if session is still valid
        if page.url != login_url:
            return page  # Already authenticated

    # Perform login
    page.goto(login_url)
    page.fill('input[name="email"]', username)
    page.fill('input[name="password"]', password)
    page.click('button[type="submit"]')
    page.wait_for_url("**/dashboard**")

    # Save session for reuse
    cookies = context.cookies()
    with open("session.json", "w") as f:
        json.dump(cookies, f)

    return page

CSS Selector Reference

# BeautifulSoup
soup.select("div.product-card")           # By class
soup.select("#main-content")              # By ID
soup.select("ul > li")                    # Direct children only
soup.select("a[href*='product']")         # Attribute contains value
soup.select("p:first-child")              # First child pseudo-class

# Playwright
page.query_selector_all("div.product-card")
page.query_selector_all("[data-product-id]")     # Any element with attribute
page.query_selector_all("form input:not([type='hidden'])")

XPath Reference

XPath is more expressive than CSS selectors, useful for complex conditions:

from lxml import html as lxmlhtml

tree = lxmlhtml.fromstring(html_content)
titles = tree.xpath("//h2[@class='product-title']/text()")
prices = tree.xpath("//span[contains(@class, 'price')]/text()")

Common XPath patterns:

//div[@class='product']              — exact class match
//div[contains(@class, 'product')]   — class contains substring
//a[text()='Click here']             — exact link text
//a[contains(text(), 'Click')]       — link text contains
//input[@type='submit']              — attribute equals value
.//span                              — descendant relative to current node
//tr[position() > 1]                 — skip header row
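
A few of these patterns in action, as a small self-contained lxml example (the HTML snippet is made up for illustration):

```python
from lxml import html as lxmlhtml

doc = lxmlhtml.fromstring("""
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td><a href="/p/1">Widget</a></td><td><span class="price sale">$9</span></td></tr>
  <tr><td><a href="/p/2">Gadget</a></td><td><span class="price">$12</span></td></tr>
</table>
""")

# //tr[position() > 1] skips the header row; the .// prefix makes the
# inner queries relative to each row rather than the whole document.
rows = doc.xpath("//tr[position() > 1]")
data = [
    (row.xpath(".//a/text()")[0],
     row.xpath(".//span[contains(@class, 'price')]/text()")[0])
    for row in rows
]
print(data)  # [('Widget', '$9'), ('Gadget', '$12')]
```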

Comparison: Which Library to Use

|                           | Requests + BS4 | Selenium       | Playwright  | Scrapy           |
|---------------------------|----------------|----------------|-------------|------------------|
| Static HTML               | ✅ Best        | ✅ Yes         | ✅ Yes      | ✅ Best          |
| JavaScript sites          | ❌ No          | ✅ Yes         | ✅ Yes      | ❌ Plugin needed |
| Speed                     | ✅ Fast        | ⚠️ Slow        | ✅ Fast     | ✅ Fast (async)  |
| Memory usage              | ✅ Low         | ❌ High        | ⚠️ Medium   | ✅ Low           |
| Login handling            | ⚠️ Manual      | ✅ Yes         | ✅ Yes      | ⚠️ Manual        |
| Async support             | ✅ aiohttp     | ❌ Limited     | ✅ Native   | ✅ Built-in      |
| Ease of learning          | ✅ Easy        | ⚠️ Medium      | ✅ Easy     | ❌ Complex       |
| Large scale (1000+ pages) | ✅ Yes         | ❌ Impractical | ⚠️ Possible | ✅ Best          |
| Maintenance burden        | ⚠️ Medium      | ❌ High        | ⚠️ Medium   | ⚠️ Medium        |

Why Python Web Scrapers Break

Even working scrapers need constant maintenance. Here's what goes wrong:

1. Selectors Break on Site Redesigns

Your CSS selector .product-card > h2.title works today. Next month, the site redesigns and the class becomes .product-name or the structure becomes section.item > h3. Your scraper silently returns nothing.

This happens constantly. Teams using selectors-based scrapers spend significant time on maintenance every time the target site releases an update.

2. JavaScript Framework Updates

React, Vue, and Angular apps generate different HTML when they upgrade major versions. Component class names change. Server-side rendering gets added or removed. IDs that used to be stable become dynamic.

3. Anti-Bot Measures

Sites deploy Cloudflare, DataDome, or custom bot detection. They:

  • Check browser fingerprints (Canvas, WebGL, screen resolution)
  • Verify JavaScript execution patterns
  • Rate-limit by IP address
  • Challenge with CAPTCHAs
  • Serve different HTML to headless browsers
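
There is no clean workaround for any of these, but basic hygiene helps: realistic headers, randomized delays, and rotating through a small pool of user agents. A sketch of the rotation part (the UA strings are ordinary desktop values used for illustration):

```python
import random

# Illustrative pool of desktop user agents; refresh these periodically,
# since an ancient browser version is itself a bot signal.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0",
]

def random_headers() -> dict:
    """Headers for the next request, with a freshly picked user agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

print(random_headers()["User-Agent"] in USER_AGENTS)  # True
```

Pass the result as the headers argument to requests.get or as extra_http_headers in a Playwright context.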

4. ChromeDriver Version Mismatch

Selenium requires ChromeDriver to match your installed Chrome version. Chrome auto-updates, and your production server runs a different version than your dev machine. (Selenium 4.6+ bundles Selenium Manager, which fetches a matching driver automatically, but pinned driver binaries and older setups still hit this.)

selenium.common.exceptions.SessionNotCreatedException:
Message: session not created: This version of ChromeDriver only
supports Chrome version 114

This breaks in CI environments constantly, typically right after a routine Chrome auto-update.

5. Async Timing Issues

Sites with lazy loading, infinite scroll, or complex React state need careful waiting logic. Too little delay and you miss dynamically loaded content. Too much and the scraper is slow. Getting timing right requires ongoing calibration as the target site evolves.

Full Working Example: E-Commerce Product Scraper

Complete production-ready Playwright scraper:

import asyncio
import json
import logging
import random
from dataclasses import dataclass, asdict
from pathlib import Path
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class Product:
    name: str
    price: str
    url: str
    rating: str = ""


async def scrape_page(page, url: str) -> list[Product]:
    """Scrape products from a single listing page."""
    products = []

    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_selector(".product-card", timeout=10000)

        items = await page.query_selector_all(".product-card")

        for item in items:
            try:
                name_el = await item.query_selector(".product-name")
                price_el = await item.query_selector(".product-price")
                link_el = await item.query_selector("a.product-link")
                rating_el = await item.query_selector(".product-rating")

                name = await name_el.inner_text() if name_el else ""
                price = await price_el.inner_text() if price_el else ""
                href = await link_el.get_attribute("href") if link_el else ""
                rating = await rating_el.inner_text() if rating_el else ""

                if name and price:
                    products.append(Product(
                        name=name.strip(),
                        price=price.strip(),
                        url=href or url,
                        rating=rating.strip()
                    ))
            except Exception as e:
                logger.warning(f"Skipping product due to error: {e}")
                continue

    except PlaywrightTimeout:
        logger.error(f"Timeout loading {url}")
    except Exception as e:
        logger.error(f"Failed to scrape {url}: {e}")

    return products


async def scrape_all_pages(base_url: str, output_file: str = "products.json"):
    """Scrape all paginated pages and save to JSON."""
    all_products = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"]
        )

        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/121.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1280, "height": 800}
        )

        page = await context.new_page()
        page_num = 1

        while True:
            url = f"{base_url}?page={page_num}"
            logger.info(f"Scraping page {page_num}: {url}")

            products = await scrape_page(page, url)

            if not products:
                logger.info(f"No products found on page {page_num}, stopping")
                break

            all_products.extend(products)
            page_num += 1

            # Polite delay between pages
            await asyncio.sleep(random.uniform(1.5, 3.5))

        await browser.close()

    # Save results
    output = [asdict(p) for p in all_products]
    Path(output_file).write_text(json.dumps(output, indent=2))
    logger.info(f"Saved {len(all_products)} products to {output_file}")

    return all_products


if __name__ == "__main__":
    asyncio.run(scrape_all_pages("https://example-shop.com/products"))

The Alternative: AI Web Scraping

Modern AI tools change the scraping equation. Instead of writing selector code you maintain forever, you describe what you want and AI figures out how to get it.

HelpMeTest is an AI testing platform that runs tests in a real cloud browser. The same technology that runs automated tests can verify and extract web content using plain-English instructions:

*** Test Cases ***
Verify Product Listing Page
    Go To       https://example-shop.com/products
    Wait For    products to load
    Should See  at least 10 product cards
    Each product card should have a name and price

HelpMeTest handles JavaScript rendering, waiting for dynamic content, and self-heals when selectors change — without you writing a single CSS selector.

Comparison: Python vs AI Tools

|                       | Python (Playwright)         | HelpMeTest AI     |
|-----------------------|-----------------------------|-------------------|
| Setup time            | 30-60 min                   | 5 min             |
| Code length           | 50-100+ lines               | 5-10 lines        |
| Selector maintenance  | Manual when site changes    | Self-healing      |
| JavaScript handling   | Explicit waits required     | Automatic         |
| ChromeDriver setup    | Required, must match Chrome | Not needed        |
| Running in CI         | Configure headless mode     | Automatic         |
| Error debugging       | Stack traces                | Visual recordings |

When to Use Python vs AI Tools

Use Python when:

  • You need to process or transform data (Pandas, ML pipelines)
  • You're scraping at very high scale (thousands of pages/hour)
  • Data is in static, well-structured HTML
  • You need deep integration with existing Python infrastructure

Use AI tools (like HelpMeTest) when:

  • You want quick extraction without a coding project
  • The site changes frequently (avoid selector maintenance)
  • Non-technical team members need to run or adjust the scraper
  • You're verifying your own site's content as part of QA testing
  • You want scraping and functional testing in one tool

Legal and Ethical Considerations

Web scraping operates in a legally complex area:

Check robots.txt at https://example.com/robots.txt — it shows which paths the site owner permits or disallows for bots. Respecting it is standard practice.
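
The standard library can do this check for you: urllib.robotparser parses robots.txt rules and answers per-URL questions. A short sketch (the rules below are a made-up example; in real use you would fetch the live robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly -- no network access needed here.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))     # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/admin/users"))  # False
print(rp.crawl_delay("MyScraper/1.0"))                                   # 5
```

Checking can_fetch before each request, and honoring crawl_delay in your delay logic, costs almost nothing and keeps your scraper on the right side of the site owner's stated rules.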

Review the Terms of Service — many sites explicitly prohibit scraping. Violating ToS can lead to account termination, IP blocking, and in some jurisdictions, legal liability.

Don't collect personal data — GDPR (EU) and CCPA (California) restrict collecting personal information. Names, emails, and phone numbers of individuals are protected even when publicly displayed.

Rate limit your requests — aggressive scraping degrades performance for real users. Treat external sites as shared infrastructure.

Use official APIs when available — APIs are faster, more reliable, and legally unambiguous.

Conclusion

Python web scraping gives you full control over data extraction — but it comes with real maintenance costs. Selectors break, browser drivers go out of sync, and anti-bot measures evolve.

Choose your approach based on the task:

  • BeautifulSoup + Requests for simple, static HTML
  • Playwright for modern JavaScript-heavy sites
  • Scrapy for large-scale production pipelines

For verifying your own site's content and behavior, AI-powered tools like HelpMeTest let you write plain-English test descriptions that self-heal when your UI changes — no selector management, no ChromeDriver maintenance.

Try HelpMeTest free — write your first test in minutes.
