Web Scraper Python: Complete Tutorial (2026)
Python web scraping uses libraries like BeautifulSoup for static HTML and Selenium or Playwright for JavaScript-heavy sites. This tutorial covers both approaches with working code examples — then explains why modern AI testing tools can handle scraping automatically without writing selector code that breaks every time a site updates.
Key Takeaways
BeautifulSoup is for static HTML. If the data you need is already in the initial HTML response (you can see it with View Source), BeautifulSoup + Requests is faster, simpler, and less fragile than browser automation.
Selenium and Playwright are for JavaScript-rendered content. When data loads after the initial page load — via AJAX or a framework like React, Vue, or Angular — you need a real browser to execute the JavaScript before scraping.
Selectors break constantly. The biggest maintenance burden in web scraping is that CSS selectors and XPath expressions stop working whenever the site redesigns. Brittle selectors are why scrapers require regular maintenance.
Always handle pagination, rate limiting, and errors. Production scrapers need exponential backoff, user-agent rotation, and logic to resume from where they left off. Tutorial code omits this — production code can't.
AI-powered tools skip the selector problem entirely. Instead of writing div.product-card > h2.title, you tell an AI agent "get all product names" and it figures out the selectors, handles JavaScript, and adapts when the site changes.
What Is Web Scraping?
Web scraping is the automated extraction of data from websites. Instead of manually copying information, you write code that loads pages, parses the HTML, and extracts structured data.
Common use cases:
- Price monitoring — track competitor prices across e-commerce sites
- Research data — collect publicly available datasets for analysis
- Lead generation — extract business contact information from directories
- Content aggregation — pull news, reviews, or listings from multiple sources
- Testing — verify your own site's content or structure
Python is the most popular language for web scraping because of its mature library ecosystem: Requests for HTTP, BeautifulSoup for HTML parsing, and Selenium or Playwright for browser automation.
Python Web Scraping Libraries
Requests + BeautifulSoup (Static Sites)
The simplest combination. Requests fetches the raw HTML; BeautifulSoup parses it.
Install:
pip install requests beautifulsoup4
Basic example:
import requests
from bs4 import BeautifulSoup
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# Extract all book titles and prices
books = soup.find_all("article", class_="product_pod")
for book in books:
    title = book.find("h3").find("a")["title"]
    price = book.find("p", class_="price_color").text
    print(f"{title}: {price}")
When to use:
- Data is in the initial HTML (not loaded by JavaScript)
- No login or session management needed
- High-volume scraping where speed matters
- You need to run thousands of requests in parallel
Limitations:
- Can't execute JavaScript — if the data loads via AJAX, you get an empty result (a quick check for this follows the list)
- No browser context — localStorage, service workers, and other client-side state don't exist
- Easy to block — no real browser fingerprint
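A quick way to tell whether Requests + BeautifulSoup is enough: fetch the page and check whether your target elements appear in the raw HTML. A minimal sketch, assuming a hypothetical listing page and an illustrative .product-card selector:
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML without executing any JavaScript
html = requests.get("https://example-shop.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
cards = soup.select(".product-card")  # illustrative selector, not from a real site
if cards:
    print(f"Found {len(cards)} cards in the raw HTML; no browser needed")
else:
    print("No cards in the raw HTML; the page likely renders them with JavaScript")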
Selenium (Browser Automation)
Selenium drives a real browser, so it handles JavaScript-rendered content, logins, and interactions like clicking and form submission.
Install:
pip install selenium
# Also need ChromeDriver matching your Chrome version
Example — scraping JavaScript-rendered content:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example-shop.com/products")
    # Wait for products to load (JavaScript renders them)
    wait = WebDriverWait(driver, 10)
    products = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card"))
    )
    results = []
    for product in products:
        name = product.find_element(By.CSS_SELECTOR, ".product-name").text
        price = product.find_element(By.CSS_SELECTOR, ".product-price").text
        results.append({"name": name, "price": price})
    print(results)
finally:
    driver.quit()
When to use:
- JavaScript-heavy single-page applications (React, Vue, Angular)
- Login-protected pages
- Sites that require user interaction (click through pagination, fill forms)
- When you need to test while scraping
Limitations:
- Slow — browser startup adds 2-5 seconds per session
- Resource-heavy — each Chrome instance uses 200-500MB RAM
- ChromeDriver version must match Chrome — constant maintenance
- Flaky — timing issues cause intermittent failures
Playwright (Modern Browser Automation)
Playwright is the modern alternative to Selenium: faster, more reliable, with first-class async support and built-in auto-waiting.
Install:
pip install playwright
playwright install chromium
Example — same scraping task, Playwright style:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-shop.com/products")
    # Wait for products to appear
    page.wait_for_selector(".product-card")
    products = page.query_selector_all(".product-card")
    results = []
    for product in products:
        name = product.query_selector(".product-name").inner_text()
        price = product.query_selector(".product-price").inner_text()
        results.append({"name": name, "price": price})
    print(results)
    browser.close()
Async version (better for multiple pages):
import asyncio
from playwright.async_api import async_playwright
async def scrape_products():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example-shop.com/products")
        await page.wait_for_selector(".product-card")
        products = await page.query_selector_all(".product-card")
        results = []
        for product in products:
            name_el = await product.query_selector(".product-name")
            price_el = await product.query_selector(".product-price")
            name = await name_el.inner_text() if name_el else ""
            price = await price_el.inner_text() if price_el else ""
            results.append({"name": name, "price": price})
        await browser.close()
        return results

asyncio.run(scrape_products())
When to use Playwright over Selenium:
- New projects — Playwright's API is cleaner
- You need async scraping at scale
- You want built-in browser state management (save/restore login sessions; see the sketch after this list)
- You need Firefox or WebKit in addition to Chrome
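On the browser-state point: Playwright can persist cookies and localStorage to a file and restore them later, so you log in once and reuse the session. A minimal sketch, assuming a hypothetical example-shop.com login flow:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # First run: log in once, then persist cookies + localStorage to disk
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example-shop.com/login")
    # ... fill the login form and submit here ...
    context.storage_state(path="auth_state.json")

    # Later runs: restore the saved state instead of logging in again
    context = browser.new_context(storage_state="auth_state.json")
    page = context.new_page()
    page.goto("https://example-shop.com/account")
    browser.close()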
Scrapy (Large-Scale Scraping)
For scraping thousands of pages, Scrapy is the professional framework. It handles request queuing, concurrency, middlewares, and pipelines.
Install:
pip install scrapy
scrapy startproject myproject
Example spider:
import scrapy
class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }
        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
Run it:
scrapy crawl products -o products.json
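Rate limiting, concurrency, and retries are configured in the project's settings.py rather than in the spider. A sketch of the relevant built-in settings (the values are illustrative):
# settings.py
DOWNLOAD_DELAY = 1.0                  # base delay between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True       # jitter the delay to look less mechanical
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed response times
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]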
When to use Scrapy:
- Hundreds or thousands of pages
- You need built-in rate limiting and retry logic
- Your team needs a maintainable scraping codebase
- You're building a production data pipeline
Limitations:
- Learning curve — Scrapy has a framework-specific way of doing things
- Static HTML only by default (add scrapy-playwright for JavaScript support; a minimal setup sketch follows)
- Overkill for simple, one-off scraping tasks
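For reference, wiring in scrapy-playwright is mostly a settings change. This is a hedged sketch based on the plugin's documented setup, so check the docs for the version you install:
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into browser rendering:
#   yield scrapy.Request(url, meta={"playwright": True})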
Handling Common Challenges
Pagination
Almost every real scraping job needs to handle multiple pages.
BeautifulSoup — URL-based pagination:
import requests
from bs4 import BeautifulSoup
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []
for page_num in range(1, 51):  # Up to 50 pages
    url = base_url.format(page_num)
    response = requests.get(url)
    if response.status_code == 404:
        break
    soup = BeautifulSoup(response.content, "html.parser")
    books = soup.find_all("article", class_="product_pod")
    for book in books:
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text
        })
print(f"Scraped {len(all_books)} books")
Playwright — clicking "Next" button:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    all_products = []
    while True:
        page.wait_for_selector(".product-item")
        items = page.query_selector_all(".product-item")
        for item in items:
            all_products.append(item.inner_text())
        next_button = page.query_selector("button.next-page")
        if not next_button or not next_button.is_enabled():
            break
        next_button.click()
        page.wait_for_load_state("networkidle")
    browser.close()
Rate Limiting and Delays
Scraping too fast gets you blocked. Add polite delays:
import time
import random
import requests
def polite_get(url, min_delay=1.0, max_delay=3.0):
    """Fetch URL with a random delay to avoid getting blocked."""
    time.sleep(random.uniform(min_delay, max_delay))
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
    return requests.get(url, headers=headers)
Error Handling and Retries
Production scrapers need retry logic for network failures:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def create_session():
    session = requests.Session()
    retry_strategy = Retry(
        total=3,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=2,  # Waits 2s, 4s, 8s between retries
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
session = create_session()
response = session.get("https://example.com/data")
Login and Session Management
Many sites require authentication before you can scrape:
from playwright.sync_api import sync_playwright
import json
import os
def get_authenticated_page(playwright, login_url, username, password):
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    # Check if we have a saved session
    if os.path.exists("session.json"):
        with open("session.json") as f:
            cookies = json.load(f)
        context.add_cookies(cookies)
        page.goto(login_url)
        # Check if session is still valid
        if page.url != login_url:
            return page  # Already authenticated
    # Perform login
    page.goto(login_url)
    page.fill('input[name="email"]', username)
    page.fill('input[name="password"]', password)
    page.click('button[type="submit"]')
    page.wait_for_url("**/dashboard**")
    # Save session for reuse
    cookies = context.cookies()
    with open("session.json", "w") as f:
        json.dump(cookies, f)
    return page
CSS Selector Reference
# BeautifulSoup
soup.select("div.product-card") # By class
soup.select("#main-content") # By ID
soup.select("ul > li") # Direct children only
soup.select("a[href*='product']") # Attribute contains value
soup.select("p:first-child") # First child pseudo-class
# Playwright
page.query_selector_all("div.product-card")
page.query_selector_all("[data-product-id]") # Any element with attribute
page.query_selector_all("form input:not([type='hidden'])")
XPath Reference
XPath is more expressive than CSS selectors, useful for complex conditions:
from lxml import html as lxmlhtml
tree = lxmlhtml.fromstring(html_content)  # html_content is the page HTML you already fetched (e.g., response.text)
titles = tree.xpath("//h2[@class='product-title']/text()")
prices = tree.xpath("//span[contains(@class, 'price')]/text()")
Common XPath patterns:
- //div[@class='product'] — exact class match
- //div[contains(@class, 'product')] — class contains substring
- //a[text()='Click here'] — exact link text
- //a[contains(text(), 'Click')] — link text contains
- //input[@type='submit'] — attribute equals value
- .//span — descendant relative to current node
- //tr[position() > 1] — skip header row
Comparison: Which Library to Use
| | Requests + BS4 | Selenium | Playwright | Scrapy |
|---|---|---|---|---|
| Static HTML | ✅ Best | ✅ Yes | ✅ Yes | ✅ Best |
| JavaScript sites | ❌ No | ✅ Yes | ✅ Yes | ❌ Plugin needed |
| Speed | ✅ Fast | ⚠️ Slow | ✅ Fast | ✅ Fast (async) |
| Memory usage | ✅ Low | ❌ High | ⚠️ Medium | ✅ Low |
| Login handling | ⚠️ Manual | ✅ Yes | ✅ Yes | ⚠️ Manual |
| Async support | ✅ aiohttp | ❌ Limited | ✅ Native | ✅ Built-in |
| Ease of learning | ✅ Easy | ⚠️ Medium | ✅ Easy | ❌ Complex |
| Large scale (1000+ pages) | ✅ Yes | ❌ Impractical | ⚠️ Possible | ✅ Best |
| Maintenance burden | ⚠️ Medium | ❌ High | ⚠️ Medium | ⚠️ Medium |
Why Python Web Scrapers Break
Even working scrapers need constant maintenance. Here's what goes wrong:
1. Selectors Break on Site Redesigns
Your CSS selector .product-card > h2.title works today. Next month, the site redesigns and the class becomes .product-name or the structure becomes section.item > h3. Your scraper silently returns nothing.
This happens constantly. Teams using selector-based scrapers spend significant time on maintenance every time the target site releases an update.
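One way to soften the blow is to try several candidate selectors and fail loudly when none match, instead of silently returning nothing. A minimal sketch (the selector strings are illustrative, not from any real site):
from bs4 import BeautifulSoup

# Ordered from most specific (current layout) to broader fallbacks
CANDIDATE_SELECTORS = [".product-card > h2.title", ".product-name", "section.item > h3"]

def extract_titles(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    for selector in CANDIDATE_SELECTORS:
        titles = [el.get_text(strip=True) for el in soup.select(selector)]
        if titles:
            return titles
    # Raising is better than returning [] silently: the breakage gets noticed
    raise ValueError("No known selector matched; the page layout probably changed")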
2. JavaScript Framework Updates
React, Vue, and Angular apps generate different HTML when they upgrade major versions. Component class names change. Server-side rendering gets added or removed. IDs that used to be stable become dynamic.
3. Anti-Bot Measures
Sites deploy Cloudflare, DataDome, or custom bot detection. They:
- Check browser fingerprints (Canvas, WebGL, screen resolution)
- Verify JavaScript execution patterns
- Rate-limit by IP address
- Challenge with CAPTCHAs
- Serve different HTML to headless browsers
4. ChromeDriver Version Mismatch
Selenium requires chromedriver to match your installed Chrome version. Chrome auto-updates. Your production server has a different version than your dev machine.
selenium.common.exceptions.SessionNotCreatedException:
Message: session not created: This version of ChromeDriver only
supports Chrome version 114
This breaks in CI environments constantly. And if the scraper's failure isn't monitored, downstream jobs keep consuming stale data instead of failing loudly.
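Two common mitigations: Selenium 4.6+ ships Selenium Manager, which resolves a matching driver automatically when you don't supply one, and the third-party webdriver-manager package downloads a matching ChromeDriver at runtime. A sketch of the latter:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager  # pip install webdriver-manager

# Downloads and caches a ChromeDriver that matches the installed Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://books.toscrape.com/")
print(driver.title)
driver.quit()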
5. Async Timing Issues
Sites with lazy loading, infinite scroll, or complex React state need careful waiting logic. Too little delay and you miss dynamically loaded content. Too much and the scraper is slow. Getting timing right requires ongoing calibration as the target site evolves.
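The most robust fix is to wait for a concrete condition rather than a fixed sleep. A Playwright sketch (the selector and expected count are illustrative):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-shop.com/products")
    # Wait for conditions that prove the content is actually there
    page.wait_for_selector(".product-card")  # at least one card rendered
    page.wait_for_function("document.querySelectorAll('.product-card').length >= 20")
    page.wait_for_load_state("networkidle")  # no in-flight network requests
    browser.close()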
Full Working Example: E-Commerce Product Scraper
A complete, production-style Playwright scraper:
import asyncio
import json
import logging
import random
from dataclasses import dataclass, asdict
from pathlib import Path
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class Product:
    name: str
    price: str
    url: str
    rating: str = ""

async def scrape_page(page, url: str) -> list[Product]:
    """Scrape products from a single listing page."""
    products = []
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_selector(".product-card", timeout=10000)
        items = await page.query_selector_all(".product-card")
        for item in items:
            try:
                name_el = await item.query_selector(".product-name")
                price_el = await item.query_selector(".product-price")
                link_el = await item.query_selector("a.product-link")
                rating_el = await item.query_selector(".product-rating")
                name = await name_el.inner_text() if name_el else ""
                price = await price_el.inner_text() if price_el else ""
                href = await link_el.get_attribute("href") if link_el else ""
                rating = await rating_el.inner_text() if rating_el else ""
                if name and price:
                    products.append(Product(
                        name=name.strip(),
                        price=price.strip(),
                        url=href or url,
                        rating=rating.strip()
                    ))
            except Exception as e:
                logger.warning(f"Skipping product due to error: {e}")
                continue
    except PlaywrightTimeout:
        logger.error(f"Timeout loading {url}")
    except Exception as e:
        logger.error(f"Failed to scrape {url}: {e}")
    return products

async def scrape_all_pages(base_url: str, output_file: str = "products.json"):
    """Scrape all paginated pages and save to JSON."""
    all_products = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"]
        )
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/121.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1280, "height": 800}
        )
        page = await context.new_page()
        page_num = 1
        while True:
            url = f"{base_url}?page={page_num}"
            logger.info(f"Scraping page {page_num}: {url}")
            products = await scrape_page(page, url)
            if not products:
                logger.info(f"No products found on page {page_num}, stopping")
                break
            all_products.extend(products)
            page_num += 1
            # Polite delay between pages
            await asyncio.sleep(random.uniform(1.5, 3.5))
        await browser.close()
    # Save results
    output = [asdict(p) for p in all_products]
    Path(output_file).write_text(json.dumps(output, indent=2))
    logger.info(f"Saved {len(all_products)} products to {output_file}")
    return all_products

if __name__ == "__main__":
    asyncio.run(scrape_all_pages("https://example-shop.com/products"))
The Alternative: AI Web Scraping
Modern AI tools change the scraping equation. Instead of writing selector code you maintain forever, you describe what you want and AI figures out how to get it.
HelpMeTest is an AI testing platform that runs tests in a real cloud browser. The same technology that runs automated tests can verify and extract web content using plain-English instructions:
*** Test Cases ***
Verify Product Listing Page
    Go To https://example-shop.com/products
    Wait For products to load
    Should See at least 10 product cards
    Each product card should have a name and price
HelpMeTest handles JavaScript rendering, waiting for dynamic content, and self-heals when selectors change — without you writing a single CSS selector.
Comparison: Python vs AI Tools
| | Python (Playwright) | HelpMeTest AI |
|---|---|---|
| Setup time | 30-60 min | 5 min |
| Code length | 50-100+ lines | 5-10 lines |
| Selector maintenance | Manual when site changes | Self-healing |
| JavaScript handling | Explicit waits required | Automatic |
| ChromeDriver setup | Required, must match Chrome | Not needed |
| Running in CI | Configure headless mode | Automatic |
| Error debugging | Stack traces | Visual recordings |
When to Use Python vs AI Tools
Use Python when:
- You need to process or transform data (Pandas, ML pipelines)
- You're scraping at very high scale (thousands of pages/hour)
- Data is in static, well-structured HTML
- You need deep integration with existing Python infrastructure
Use AI tools (like HelpMeTest) when:
- You want quick extraction without a coding project
- The site changes frequently (avoid selector maintenance)
- Non-technical team members need to run or adjust the scraper
- You're verifying your own site's content as part of QA testing
- You want scraping and functional testing in one tool
Legal and Ethical Considerations
Web scraping operates in a legally complex area:
Check robots.txt at https://example.com/robots.txt — it shows which paths the site owner permits or disallows for bots. Respecting it is standard practice (a standard-library check is sketched below).
Review the Terms of Service — many sites explicitly prohibit scraping. Violating ToS can lead to account termination, IP blocking, and in some jurisdictions, legal liability.
Don't collect personal data — GDPR (EU) and CCPA (California) restrict collecting personal information. Names, emails, and phone numbers of individuals are protected even when publicly displayed.
Rate limit your requests — aggressive scraping degrades performance for real users. Treat external sites as shared infrastructure.
Use official APIs when available — APIs are faster, more reliable, and legally unambiguous.
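On the robots.txt point, Python's standard library can check a URL against the site's published rules before you fetch it. A minimal sketch, using the same example.com placeholder as above:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/listing"
if rp.can_fetch("*", url):  # "*" = rules that apply to any user agent
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows", url)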
Conclusion
Python web scraping gives you full control over data extraction — but it comes with real maintenance costs. Selectors break, browser drivers go out of sync, and anti-bot measures evolve.
Choose your approach based on the task:
- BeautifulSoup + Requests for simple, static HTML
- Playwright for modern JavaScript-heavy sites
- Scrapy for large-scale production pipelines
For verifying your own site's content and behavior, AI-powered tools like HelpMeTest let you write plain-English test descriptions that self-heal when your UI changes — no selector management, no ChromeDriver maintenance.
Try HelpMeTest free — write your first test in minutes.