AI Agents for Web Scraping: API/XHR, Playwright

AI agents for web scraping in 2026: API/XHR first, embedded JSON next, Playwright/browser agents for JS/auth, and LLM extraction only after verification.

AI agents for web scraping: the 2026 stack

The best web scraping AI agent workflow is not agent-first. Start with official API or XHR JSON, check embedded JSON next, use Playwright or a browser agent only when JavaScript, login or interaction is required, and reserve LLM extraction for small reviewed cleanup.

Layer	Use it for	Do not use it for
API/XHR	Repeatable data jobs, search results, product lists, comments, profiles and pagination.	Private endpoints you cannot monitor or requests you do not understand.
Embedded JSON	Next.js data, JSON-LD, hydration blobs and article metadata already shipped in HTML.	Large recurring jobs without saved fixtures and schema checks.
Playwright or browser agent	Rendered pages, auth cookies, infinite scroll, forms, dashboards and consent flows.	High-volume scraping where a stable API/XHR source exists.
LLM extraction	One-off cleanup from a small page, copied text, screenshot or PDF.	Unverified production data, selectors, financial facts or personal data collection.

My rule: use the agent to inspect and choose the data source, then verify the rows and turn recurring extraction into boring code with logs, retries and tests.

In 2021 I wrote an article called «How to scrape any website?» on Habr, the biggest Russian-language tech platform. It was a methodical guide: five levels of data extraction, ranked from cleanest to ugliest. The article did well. People bookmarked it, shared it, left comments arguing about Selenium vs Playwright.

Quick answer: what AI agents changed

AI agents did not make scraping rules disappear. They changed the workflow. For one-off extraction, I often ask an agent to inspect the page, find the cleanest data source, extract the rows, and return a CSV. For production pipelines, I still want code: API/XHR first, embedded JSON second, browser automation only when necessary, raw HTML last.

The new skill is not “write BeautifulSoup faster”. It is knowing when an agent is enough and when the result needs a monitored parser.

If you want the practical hierarchy, I keep the updated step-by-step guide here: How to parse any website.

Five years later, I look at that article the way you look at a photo of yourself in high school. It's technically correct. The fundamentals hold up. But the world around it changed so much that following the guide step-by-step feels like developing film in a darkroom when your phone takes better photos.

This is the story of what replaced manual web scraping for me, and why I think most one-off data extraction tasks will never require a handwritten scraper again.

The 2021 hierarchy: a time capsule

My original article laid out five levels of scraping, ordered by reliability and maintainability. The idea was simple: don't reach for BeautifulSoup until you've exhausted the cleaner options.

Level 1: Official API. Check if the site has a public API, an RSS feed, or a sitemap. If yes, use it. This is the cleanest approach and the least likely to break.

Level 2: XHR requests. Open Chrome DevTools, go to the Network tab, filter by XHR/Fetch, and watch what happens when the page loads. Many modern sites load their data through JavaScript calls to internal endpoints, and those endpoints usually return JSON. You can replicate the requests with Python's requests library and skip the HTML entirely.
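Here is a minimal sketch of the Level 2 idea. It assumes a hypothetical endpoint that accepts page and limit query parameters and returns a JSON object with an items list; the URL, parameter names and response shape are illustrative, not taken from any real site. I use the standard library's urllib instead of requests so the block has no dependencies.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url, params, timeout=10):
    """GET url with query params and decode the JSON body (stdlib only)."""
    full_url = f"{url}?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(
        full_url,
        headers={
            "Accept": "application/json",
            "User-Agent": "example-scraper/0.1",  # hypothetical identifier
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def fetch_all_items(url, limit=20, max_pages=50, fetch=fetch_json):
    """Walk page=1, 2, ... until a page comes back empty or max_pages is hit.

    Assumes the (hypothetical) endpoint returns {"items": [...]}.
    """
    items = []
    for page in range(1, max_pages + 1):
        payload = fetch(url, {"page": page, "limit": limit})
        batch = payload.get("items", [])
        if not batch:
            break
        items.extend(batch)
    return items
```

The fetch argument is there so the pagination loop can be exercised against a saved fixture instead of the live site, and max_pages keeps a misbehaving endpoint from looping forever.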

Level 3: Embedded JSON. Server-rendered frameworks like Next.js embed data as JSON inside the HTML. Look for __NEXT_DATA__, __NUXT__, or similar patterns. The data is right there in the page source, already structured. You just need to extract the JSON blob.
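A small sketch of pulling that blob out of the page source, assuming the site ships a script tag with id="__NEXT_DATA__". What sits inside the blob (for example under props and pageProps) is site-specific, so treat any path into it as an assumption you verify against a saved copy of the page.

```python
import json
import re

# Matches <script id="__NEXT_DATA__" ...>{...}</script> and captures the body.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON blob, or None if it is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))
```

A regex is enough here because the target is a single, well-delimited script tag; for anything less regular, a real HTML parser is the safer choice.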

Level 4: Headless browser. When a site requires JavaScript execution, authentication tokens, or cookie management to render content, fire up Selenium or Playwright. Automate a real browser, let the page render, then grab the DOM.
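As a rough sketch of Level 4 with Playwright's synchronous Python API: it assumes the playwright package and its browser binaries are installed, and the selector you wait for is whatever marks the content as rendered on your target page. Both the function name and its parameters are my own illustration.

```python
def render_page_html(url, wait_selector, timeout_ms=15000):
    """Load url in headless Chromium and return the rendered DOM as HTML.

    wait_selector is a CSS selector that only appears once the data has
    rendered; it is site-specific and must be chosen by inspecting the page.
    """
    # Imported lazily so the module still loads where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            # Wait for the content itself rather than sleeping a fixed time.
            page.wait_for_selector(wait_selector, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Waiting on a selector instead of a fixed delay is the main design choice: the call returns as soon as the content exists and fails loudly with a timeout when it never shows up.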

Level 5: HTML parsing. Last resort. Parse the raw HTML with BeautifulSoup or Scrapy. Write CSS selectors or XPath queries. This is fragile because any redesign breaks your selectors.
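To show what Level 5 looks like in practice, here is a dependency-free sketch that uses the standard library's html.parser in place of BeautifulSoup. It collects the text of every element carrying a given CSS class; the class name is hypothetical and is exactly the part that breaks when the site is redesigned.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.results = []
        self._depth = 0  # nesting depth inside the current matching element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)
```

Usage is two lines: create ClassTextExtractor("name"), call feed(html), then read results. The sketch assumes reasonably well-formed markup; on messy pages a tolerant parser such as BeautifulSoup is the more practical tool.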

This hierarchy was genuinely useful. It saved me and many readers from writing brittle HTML parsers when a clean JSON endpoint was hiding in the Network tab. The mental model was solid: start clean, get dirtier only when you have to.

I was proud of it. I taught people to think before they scraped.

What I actually do in 2026

Last month I needed pricing data from a SaaS comparison site. In 2021, I would have opened DevTools, spent 15 minutes poking around the Network tab, found the API endpoint the frontend was calling, figured out the pagination parameters, written a Python script with requests and json, handled edge cases, and saved to CSV. An hour of work, give or take.

Instead, I opened my terminal and typed:

claude
"Go to [site URL], extract all product names, pricing tiers, and feature lists. Save as a CSV file."

Claude Code read the page. It noticed the site loaded data via a paginated JSON API. It wrote a script that called that API directly, iterated through the pages, parsed the response, and saved a clean CSV. The whole thing took about two minutes. I didn't open DevTools. I didn't write a single line of Python.

This is not a hypothetical demo. This is what my Tuesday looks like now.

The hierarchy from 2021 still applies. Claude Code follows the same logic: it checks for APIs, looks at network requests, inspects embedded JSON, and falls back to HTML parsing when needed. The difference is that the agent does the inspection, makes the decision, writes the code, runs it, and hands me the result. I described what I wanted. The agent figured out how.

The tools that made this possible

Three things converged to make manual scraping feel quaint.

AI agents that can execute code

Claude Code isn't a chatbot that suggests code for you to copy-paste. It runs in your terminal and executes commands directly. It can write a Python script, run it, read the output, fix the errors, and run it again. This loop is what makes it effective for data extraction. The agent doesn't just plan the scraping approach; it carries it out and delivers the data. I wrote about my full Claude Code workflow in a separate post.

MCP tools for web access

Model Context Protocol (MCP) is an open standard for connecting AI agents to external tools and data sources. Combined with Claude Code's built-in WebFetch and WebSearch tools, it lets the agent read web pages, follow links, and process HTML without you writing any HTTP code. The agent can browse the web much the way you do: it reads a page, decides what data is relevant, and extracts it.

This is the real paradigm shift. The agent doesn't generate a script for you to run later. The agent IS the scraper. It fetches, parses, and delivers in the same conversation.

agent-browser replaces Selenium

For sites with anti-bot protection or heavy JavaScript rendering, agent-browser (by Vercel) handles headless browsing as an MCP server. Claude Code connects to it and controls a real browser session. No more fighting with ChromeDriver versions. No more tuning WebDriverWait timeouts or padding scripts with arbitrary sleep timers. No more selenium.common.exceptions.StaleElementReferenceException at 2 AM.

In my experience, agent-browser has been more stable and more capable than the Chrome DevTools Protocol MCP server. It handles cookie consent popups, lazy-loaded content, and infinite scroll pages. The agent navigates the site, waits for content to load, and extracts what it sees. It's Selenium for the agent era.

The contrast, side by side

Let me make this concrete. Say you need to extract a list of 200 tech companies from a directory site, with names, descriptions, funding amounts, and website URLs.

The 2021 approach

1. Open the site in Chrome. Open DevTools (F12).
2. Go to Network tab. Reload. Filter by XHR.
3. Scroll through requests. Find the one that returns company data as JSON.
4. Inspect the request: URL, headers, query parameters.
5. Notice it uses pagination: ?page=1&limit=20.
6. Open VS Code. Create scraper.py.
7. Write the requests.get() call with the right headers.
8. Add a loop for pagination.
9. Parse the JSON response, extract the fields you need.
10. Handle the edge case where some companies don't have funding data.
11. Write to CSV with csv.DictWriter.
12. Run the script. Get a 403 error because you forgot a header.
13. Add the missing User-Agent header. Run again.
14. Get the data. Notice the descriptions have HTML entities. Add a cleanup step.
15. Run again. Check the CSV. Done.

Time: 30 to 60 minutes, depending on how cooperative the site is.
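The last few steps of that list (missing funding data, HTML entities, csv.DictWriter) can be sketched as follows. The field names are my assumption about what such a directory might return, not a real schema.

```python
import csv
import html
import io

# Hypothetical column set for the company directory example.
FIELDS = ["name", "description", "funding", "website"]


def normalize_company(raw):
    """Map one raw API record to a flat CSV row.

    Missing or null values become empty strings, and HTML entities
    such as &amp; in text fields are decoded.
    """
    funding = raw.get("funding")
    return {
        "name": html.unescape(raw.get("name") or "").strip(),
        "description": html.unescape(raw.get("description") or "").strip(),
        "funding": "" if funding is None else funding,
        "website": (raw.get("website") or "").strip(),
    }


def write_companies_csv(records, out):
    """Write normalized records to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for raw in records:
        writer.writerow(normalize_company(raw))
```

To write a real file, open it with open("companies.csv", "w", newline="", encoding="utf-8") and pass the handle as out; in tests an io.StringIO works the same way.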

The 2026 approach

1. Open terminal. Type claude.
2. "Go to [directory URL]. Extract all companies: name, description, funding amount, website URL. Save as companies.csv."
3. Watch the agent work for about 90 seconds.
4. Open companies.csv. Verify the data looks right.

Time: 2 to 3 minutes.

The agent did the same steps I would have done manually. It found the API endpoint, figured out the pagination, handled the missing fields, cleaned up the text. It just did all of it without me supervising each step.

For one-off extraction tasks, this is a categorical improvement. Not incremental. Categorical.

When you still need a handwritten scraper

I'm not going to pretend agents replace everything. There are real scenarios where manual scraping is still the right call.

Scale. If you need to scrape millions of pages, you need infrastructure: proxy rotation, rate limiting, retry logic, distributed queues. Think Scrapy with a Redis queue and rotating proxies. An agent in a terminal session is not the right tool for crawling the entire internet.

Production pipelines. If the scraper runs on a schedule, feeds data into a production database, and needs monitoring, alerts, and error recovery, you want a proper codebase. Something you can deploy, test, and maintain. An agent conversation is ephemeral; a production pipeline is not.
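Error recovery in a pipeline usually starts with something as plain as retries with exponential backoff. This is a minimal sketch, not a full resilience layer; the function name, the default delays and the choice to retry only on OSError (which covers connection and timeout errors in the standard library) are my assumptions.

```python
import logging
import time

logger = logging.getLogger("scraper")


def with_retries(operation, attempts=4, base_delay=1.0,
                 retry_on=(OSError,), sleep=time.sleep):
    """Call operation() and retry on the given exceptions.

    Waits base_delay * 2**n seconds after the n-th failure (n starts at 0)
    and re-raises the last error once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on as error:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt + 1, error, delay)
            sleep(delay)
```

The sleep parameter exists so tests can record the delays instead of actually waiting, which keeps the retry logic itself testable.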

Sophisticated anti-bot. Cloudflare Turnstile, advanced CAPTCHAs, fingerprint detection, headless browser detection. These arms races are still real. agent-browser handles basic protections, but the serious anti-bot vendors design their systems to block exactly this kind of automated access.

Real-time data feeds. If you need WebSocket connections, streaming data, or sub-second latency, you're writing custom code. No way around it.

The pattern is clear: agents dominate one-off and ad-hoc extraction. Humans still own production-grade, large-scale, ongoing collection systems. The split is roughly: if you'll run it once, use the agent. If you'll run it every day for the next year, write the code.

The legal and ethical part hasn't changed

Here's something that stays exactly the same regardless of whether a human or an AI agent does the scraping.

Check robots.txt. If the site says don't crawl, don't crawl. The fact that your AI agent is doing the crawling doesn't change the legal analysis.
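Python ships a robots.txt parser in the standard library, so the check costs a few lines. This sketch parses rules you have already downloaded; the user agent string and the helper name are placeholders of mine.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent):
    """Return a function that tells whether user_agent may fetch a URL.

    robots_txt is the text of the site's robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

Against a live site you would instead call set_url() with the robots.txt address and then read() on the parser, which downloads and parses the file in one step.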

Respect rate limits. Don't hammer a server with 1000 requests per second just because the agent can do it without you feeling the tedium.
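A simple way to stay polite is to enforce a minimum interval between requests. The class below is a sketch under that assumption; the right interval depends on the site and on any limits it documents.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call wait() right before each request, for example with Throttle(1.0) for at most one request per second. The injectable clock and sleep make the behaviour checkable without real delays.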

Read the terms of service. Some sites explicitly prohibit automated access. This applies to agents too.

Personal data matters. GDPR, CCPA, and similar regulations apply to data collection regardless of the method. If you're scraping personal information, you need a legal basis for processing it.

The agent makes extraction faster. It doesn't make it more legal. Don't be reckless just because the barrier to entry dropped.

What happened to the scraping skill

I used to be proud of my scraping abilities. Knowing how to reverse-engineer a site's API from the Network tab felt like a superpower. Understanding how __NEXT_DATA__ worked gave me an edge. Writing a clean Scrapy spider with proper item pipelines was craftsmanship.

None of that knowledge is wasted. It's the reason I can tell when Claude Code picks a bad approach. When the agent tries to parse HTML and I know there's an API endpoint it missed, I can redirect it. When it writes a fragile CSS selector, I can say "look for the JSON in the page source instead." The hierarchy I wrote in 2021 is now my code review framework for agent-generated scrapers.

The skill shifted from doing to directing. From writing the scraper to evaluating the scraper. From knowing how to extract data to knowing whether the data was extracted correctly.
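Part of that evaluation can be automated. Here is a minimal sketch of a row check I might run on an agent-produced CSV before trusting it; the required fields, the URL rule and the duplicate rule are illustrative assumptions that you would adapt to your own data.

```python
def validate_rows(rows, required=("name", "website"), min_rows=1):
    """Return a list of human-readable problems; an empty list means OK.

    rows is a list of dicts, e.g. from csv.DictReader.
    """
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")

    seen_names = set()
    for index, row in enumerate(rows):
        # Every required field must be present and non-blank.
        for field in required:
            if not str(row.get(field) or "").strip():
                problems.append(f"row {index}: missing {field}")

        # A website, when given, should be an absolute http(s) URL.
        website = str(row.get("website") or "").strip()
        if website and not website.startswith(("http://", "https://")):
            problems.append(f"row {index}: website is not an absolute URL")

        # Repeated names often mean a pagination bug.
        name = row.get("name")
        if name and name in seen_names:
            problems.append(f"row {index}: duplicate name {name!r}")
        if name:
            seen_names.add(name)
    return problems
```

Returning a list of problems rather than raising on the first one makes the output useful as feedback to paste straight back to the agent.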

If you're just starting out and want to understand how the web actually works, my 2021 article is still a good teacher. Understanding how sites serve data, what XHR requests are, how server-side rendering embeds JSON into HTML, these fundamentals make you a better director of AI agents. You'll rarely apply them by hand, but you'll apply them by taste.

Where I think this goes next

Right now, agent-driven extraction works best when you can describe what you want in natural language. "Get me the prices" is easy. "Get me the prices but only for products that were added this month and have more than 10 reviews" is harder because it requires the agent to understand the site's data model.

The next step is agents that maintain persistent knowledge about data sources. An agent that knows "last time we scraped this site, the API endpoint was /api/v2/products and it required an auth token from the login page." Agents with memory and context about previous extraction sessions.

We're also going to see specialized MCP tools for common data sources. Instead of scraping LinkedIn, you'll have an MCP server that wraps LinkedIn's data access. Instead of scraping Amazon, you'll have a structured product data tool. The scraping layer gets abstracted away entirely.

My full AI tools setup already includes several MCP servers that give agents direct access to APIs I used to scrape. The trend is obvious: every data source eventually gets an agent-friendly interface.

The punchline

In 2021, I taught people a five-level system for extracting data from any website. The article's implicit promise was: follow this hierarchy and you can scrape anything.

In 2026, the hierarchy is still valid. The steps are still correct. But the person following them is no longer me. It's an AI agent running in my terminal. I describe the destination, and the agent navigates the hierarchy on its own.

Web scraping as a skill isn't dead. Web scraping as a manual activity is dying. The knowledge transfers; the labor doesn't.

If you told 2021 me that in five years I'd be nostalgic about writing BeautifulSoup selectors, I would have laughed. But here I am, a little wistful about the craft, while fully aware that I'm never going back.

The data still needs extracting. I just don't do the extracting anymore.

FAQ

What is the best web scraping AI agent workflow?

The best workflow is API/XHR first, embedded JSON second, browser automation third, raw HTML last. The AI agent is most useful as an inspector: it finds the cleanest source, documents the steps and helps turn recurring work into tested code.

Is web scraping dead in 2026?

No. One-off manual scraping is less common because agents can inspect pages quickly, but production extraction still needs stable sources, retries, monitoring, rate limits and clear legal boundaries.

Can AI agents replace web scraping?

They can replace many one-off scraping scripts, but not durable data pipelines. If the job runs daily, affects customers or feeds a database, use the agent for discovery and then ship tested extraction code.

When should I use Playwright or a browser agent instead of API/XHR?

Use Playwright or a browser agent when useful data appears only after JavaScript rendering, login, scrolling, form interaction or cookie consent. If a clean JSON endpoint exists, API/XHR remains simpler and more reliable.

Can LLM extraction parse websites reliably?

LLM extraction is useful for small, reviewed, low-frequency tasks. It should not be the source of truth for large tables, finance, personal data or production jobs unless another check verifies the output.

What should be a workflow, code or an agent?

My rule: if the extraction can be written as a checklist or graph, make it a workflow; if it must run every day, turn it into code; if it requires fresh judgement, let an agent inspect and propose the next step. That same distinction is showing up in public talks about what to leave to people, workflows and agents.