How to Parse Any Website: API, XHR, HTML and AI Agents
A practical hierarchy for parsing any website: official APIs, XHR requests, embedded JSON, headless browsers, HTML parsing, and when AI agents are enough.

I wrote the first version of this guide in 2021. The advice still holds: do not start with BeautifulSoup if there is a cleaner data source. But the workflow changed.
In 2026, I still use the same hierarchy for serious parsers. The difference is that I now let AI agents do more of the exploration: inspect the page, find XHR requests, test selectors, write the first scraper, and tell me where the fragile parts are.
Quick answer: the parsing hierarchy
To parse any website, check sources in this order: official API or RSS, XHR/fetch requests, embedded JSON in the HTML, rendered DOM through a headless browser, and only then raw HTML parsing. For one-off tasks, an AI agent with browser automation may be enough. For production, you still need stable data sources, rate limits, retries, and monitoring.
Do not start with HTML
HTML is the part of a website most likely to change. Companies run A/B tests, redesign pages, rename CSS classes, move blocks around, and ship experiments without warning.
If your parser depends on a fragile XPath to a random div, it will break. Maybe tomorrow. Maybe during the one night when you actually needed the data.
So the rule is simple: parse the highest-level structured source you can find. HTML is the fallback, not the plan.
Level 1: official API, RSS, sitemap
First, check if the site already gives you the data.
- Public API docs.
- RSS/Atom feeds.
- Sitemaps.
- Export buttons.
- Structured files linked from the page.
This sounds obvious and people still skip it. Do not spend two hours fighting cookies if the site has a JSON endpoint documented in the footer.
Official APIs are also easier to make reliable: status codes are clearer, response schemas are more stable, and rate limits are explicit.
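As a rough illustration of how little code a Level 1 source needs, here is a minimal sketch that reads a sitemap with the Python standard library. The sample XML and the example.com URLs are made up for the demo; a real sitemap may instead be a sitemap index that points to further files, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (url, lastmod) pairs from a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


# Hypothetical sample, inlined so the sketch runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2026-01-02</lastmod></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

for loc, lastmod in parse_sitemap(SAMPLE):
    print(loc, lastmod)
```

The lastmod field, when present, is a cheap way to re-fetch only pages that changed.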
Level 2: XHR / fetch requests
Many modern sites render the layout first and fetch the data afterwards. Open DevTools, go to Network, filter by Fetch/XHR, reload the page, and click around.
You are looking for requests that return JSON:
- search results;
- product lists;
- comments;
- profile data;
- pagination endpoints;
- GraphQL responses.
If you find the internal endpoint, you can often reproduce it with a normal HTTP client and skip browser automation entirely.
Things to capture:
- URL and query parameters;
- method: GET/POST;
- headers that actually matter;
- request body;
- pagination cursor;
- auth/session requirements.
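Once those details are captured, reproducing the endpoint can look roughly like the sketch below. The response shape (an "items" list plus a "next_cursor" field), the "cursor" query parameter, and the example.com URL are all assumptions for illustration; a real endpoint will use its own names, and the pattern is what matters.

```python
import json
import urllib.parse
import urllib.request


def http_get_json(url, headers=None, timeout=10):
    """Fetch a URL and decode the JSON body."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_items(base_url, params, fetch=http_get_json, max_pages=50):
    """Yield items from a cursor-paginated endpoint.

    Assumes a hypothetical response shape:
    {"items": [...], "next_cursor": "<token>" or null}
    """
    cursor = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        query = dict(params)
        if cursor:
            query["cursor"] = cursor
        url = base_url + "?" + urllib.parse.urlencode(query)
        page = fetch(url)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break
```

Passing the fetch function as a parameter keeps the pagination logic testable against saved responses, without touching the network.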
This is where AI agents are useful. I can ask Claude Code to inspect a HAR file or Network log and identify which requests contain the data. It does not remove the need to understand the site. It removes the boring clicking.
Level 3: embedded JSON
Frameworks like Next.js often embed initial data directly into the HTML. Look for:
- __NEXT_DATA__;
- application/ld+json structured data;
- hydration state blobs;
- Redux/Apollo cache dumps;
- inline scripts with product or article objects.
This is more stable than parsing visible HTML because the structure usually mirrors application data, not visual layout.
For article pages, JSON-LD is especially useful. It often contains title, author, date, image, description, breadcrumbs, and canonical URL. If that is all you need, do not parse the DOM at all.
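A minimal sketch of pulling both kinds of blob out of a page with only the standard library is shown here. The class and function names are my own, and the sample HTML is invented; real pages may hold several JSON-LD blocks or none at all.

```python
import json
from html.parser import HTMLParser


class EmbeddedJSONParser(HTMLParser):
    """Collect JSON-LD blocks and the Next.js __NEXT_DATA__ payload."""

    def __init__(self):
        super().__init__()
        self.json_ld = []
        self.next_data = None
        self._kind = None  # "ld", "next", or None when outside a target script
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attrs = dict(attrs)
        if attrs.get("type") == "application/ld+json":
            self._kind = "ld"
        elif attrs.get("id") == "__NEXT_DATA__":
            self._kind = "next"
        self._buffer = []

    def handle_data(self, data):
        if self._kind:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._kind:
            return
        try:
            payload = json.loads("".join(self._buffer).strip())
        except json.JSONDecodeError:
            payload = None  # malformed blob: skip it rather than crash
        if payload is not None:
            if self._kind == "ld":
                self.json_ld.append(payload)
            else:
                self.next_data = payload
        self._kind = None


def extract_embedded_json(html):
    """Return (list_of_json_ld_objects, next_data_or_None)."""
    parser = EmbeddedJSONParser()
    parser.feed(html)
    parser.close()
    return parser.json_ld, parser.next_data
```

For an article page, the headline, author, and publication date can often be read straight from the first JSON-LD object returned.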
Level 4: rendered page with a headless browser
Some sites do not expose useful JSON. The data appears only after JavaScript runs. Then you need Playwright, Puppeteer, or an agent-browser style tool.
This is heavier but sometimes unavoidable:
- client-side rendering;
- infinite scroll;
- auth-gated dashboards;
- buttons that mutate state before data appears;
- sites that compute signatures in JavaScript.
Browser automation is also the best way to prototype a parser before turning it into code. Ask the agent to navigate like a human, find the data, and explain which selectors or requests are stable.
But do not confuse “it worked once in a browser” with a production parser. Production still needs retries, timeouts, screenshots/logs on failure, and change detection.
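The retry part of that list can be sketched as a small wrapper. Here `task` stands for any hypothetical function that drives the browser with its own timeout (for example a Playwright or Puppeteer session); the wrapper itself is library-agnostic and is only one possible shape.

```python
import logging
import time

logger = logging.getLogger("parser")


def with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task() up to `attempts` times with exponential backoff.

    The last exception is re-raised, so a failure is never silent.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %r", attempt, attempts, exc)
            if attempt == attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The caller can catch the final exception to save a screenshot or a raw HTML dump before giving up.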
Level 5: parse HTML
Only parse raw HTML when better sources are unavailable.
If you must do it:
- prefer semantic tags over CSS classes;
- use multiple fallback selectors;
- normalize whitespace and dates;
- store raw HTML snapshots for debugging;
- monitor extraction rates;
- fail loudly when required fields disappear.
BeautifulSoup, Scrapy, Cheerio, and lxml are all fine. The library is not the hard part. The hard part is choosing selectors that survive the next redesign.
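To make the fallback and fail-loudly ideas concrete, here is a minimal sketch that uses only the standard library so it runs anywhere; with BeautifulSoup or lxml the same pattern is shorter. The helper names and the choice of h1 then title as the fallback order are assumptions for illustration.

```python
import re
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # ignore later elements with the same tag

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


def first_text(html, tag):
    """Return whitespace-normalized text of the first <tag>, or None."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    text = re.sub(r"\s+", " ", "".join(collector.parts)).strip()
    return text or None


def extract_title(html, tags=("h1", "title")):
    """Try semantic tags in order; raise if none yields any text."""
    for tag in tags:
        text = first_text(html, tag)
        if text:
            return text
    raise ValueError(f"required field 'title' not found via {tags}")
```

Raising on a missing required field turns a silent redesign into an alert instead of a column of empty values.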
Where AI agents fit
AI agents changed the economics of scraping, not the physics.
For one-off research, I often do not write a parser anymore. I ask an agent to open the page, find the data, extract the rows, and return a CSV. That is enough for a quick analysis.
For repeated jobs, I still want code. The agent helps with:
- finding the cleanest data source;
- generating the first parser;
- writing tests against saved fixtures;
- adding retries and pagination;
- documenting failure modes;
- turning the workflow into a scheduled job.
The practical split: use agents for exploration and low-frequency extraction. Use proper parsers for anything scheduled, business-critical, or large-scale.
Legal and ethical boundaries
Parsing is not a magic permission slip. Read robots.txt and terms. Respect rate limits. Do not hammer small sites. Do not bypass paywalls or auth boundaries. Do not collect personal data just because it is technically visible.
The best scraper is boring and polite: limited, cached, monitored, and easy to turn off.
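Two of those habits are easy to sketch with the standard library: checking robots.txt before fetching, and spacing requests out. The bot name, the sample rules, and the `Throttle` class are hypothetical; `urllib.robotparser` is the real stdlib module, used here on text that has already been downloaded.

```python
import time
import urllib.robotparser


def build_robots(robots_txt):
    """Parse robots.txt text that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


# Hypothetical rules for the demo.
RULES = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
robots = build_robots(RULES)
print(robots.can_fetch("my-bot", "https://example.com/public/page"))
print(robots.can_fetch("my-bot", "https://example.com/private/page"))
```

In a real job, calling `throttle.wait()` before each request, with the interval taken from `robots.crawl_delay(...)` when the site declares one, keeps the scraper within the pace the site asked for.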
My checklist
Before writing code, answer these:
- Is there an official API or export?
- Does the page call JSON endpoints?
- Is useful data embedded in scripts or JSON-LD?
- Do I need JavaScript rendering?
- How often will this run?
- What breaks if the parser fails silently?
- Can an AI agent do this as a one-off instead?
If the answer is “I need this every day”, build the parser properly. If the answer is “I need this once for a report”, let the agent do the boring work and move on.
I wrote a separate 2026 piece on the broader shift here: Web Scraping with AI Agents.