How to parse any website

I'll share my approach, which helps me build reliable parsers for any website.

I'm not going to show actual scripts here, but I'll describe how I extract data from any website. It should be useful when you need to decide how, and which part of, a website should be parsed.

TL;DR:

  1. Find an official API
  2. Find XHR requests in the browser's dev tools
  3. Find JSON data in page's HTML
  4. Render page using any headless browser
  5. Parse HTML

Pro tip: don't start with BS4/Scrapy

Cool companies perform a ton of A/B tests to increase conversions and improve the funnels of their websites. That means the layout (the HTML) and data formatting may easily change next week. Ideally, you want to code the parser once and not have to adapt it to a new layout every other week.

Don't parse HTML unless nothing else works. XPath & BeautifulSoup are not robust against website changes.

Find official API

Sometimes websites simply give you an API to let you extract the raw data you need. Check whether an API exists before trying to parse anything. Don't spend time fighting with HTML and cookies if you don't have to.

Some websites still have RSS feeds. They were made to be easily parsed by machines, but they are not popular anymore, so most websites don't have them. If you want to extract data from blogs or news websites, you should definitely check for RSS feeds!
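
If a site does expose a feed, a few lines are enough to read it. A minimal sketch, assuming the feedparser package and a made-up feed URL:

import feedparser  # pip install feedparser

# Hypothetical feed URL - replace it with the one you actually find on the site
feed = feedparser.parse("https://example.com/blog/rss.xml")

for entry in feed.entries:
    print(entry.title, entry.link, entry.get("published", ""))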

Look for XHR requests

All modern websites (not the ones on the dark web, lol) use JavaScript to fetch data. It lets a website open smoothly and load the content (data) after the basic layout (the HTML skeleton) is shown.

Usually, this data is fetched by JavaScript using GET/POST requests. A great example is Producthunt:

Just take the Raw data JSON! 

Workflow is simple:

  1. Open a webpage you want to parse
  2. Right-click --> Inspect (open dev tools of your browser)
  3. Open Network tab and select XHR filter (to display only requests you might need)
  4. Find requests that return data you need
  5. Copy this request as cURL and use it in your parser's code
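
Here is roughly what replaying such a request looks like in Python. The endpoint, headers, and query parameters below are made up, so substitute the ones you copy from the Network tab:

import requests

# Hypothetical endpoint & params - take the real ones from the copied cURL command
url = "https://example.com/api/posts"
headers = {
    "User-Agent": "Mozilla/5.0",     # some backends reject requests without it
    "Accept": "application/json",
}
params = {"page": 1, "per_page": 20}

response = requests.get(url, headers=headers, params=params, timeout=30)
response.raise_for_status()

data = response.json()               # already structured - no HTML parsing needed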

Sometimes you'll notice that these XHR requests require long strings: tokens, cookies, and session IDs. They are generated by JavaScript on the page or by the backend. Don't spend time digging into the JS code to understand how they were made.

Instead, you can try to simply copy-paste them from your browser into your code; they are usually valid for 7-30 days. Perhaps that is enough for your needs. Or you may find another XHR request (e.g. /login) that returns all the tokens and session values you need. Otherwise, I'd suggest switching to browser automation (Selenium, Puppeteer, Splash, etc.).
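
If copy-pasting works for you, it can be as simple as this. The cookie and header names below are placeholders for whatever you grab from your browser's dev tools:

import requests

session = requests.Session()

# Placeholder values - paste the real cookies/tokens from your browser here
session.cookies.set("session_id", "PASTE_VALUE_FROM_BROWSER")
session.headers.update({"Authorization": "Bearer PASTE_TOKEN_FROM_BROWSER"})

data = session.get("https://example.com/api/feed", timeout=30).json()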

Look for JSON data in HTML code

Sometimes, for SEO purposes, some data is already embedded into the HTML page: if you open the page source of any Linkedin page, you'll notice huge JSON blobs in its HTML.

Just use these JSONs from the HTML code, because the chance that their structure will change is really low. Write your parser to extract these JSONs using simple regular expressions or the basic tools you already know.

If you see JSON data in a page's HTML, you don't need to render it using headless browsers – you can simply use GET calls (e.g. requests.get in Python).
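
Here is a minimal sketch of the idea. It assumes a hypothetical page that embeds its state in a window.__INITIAL_STATE__ variable; the actual variable name and script tag differ from site to site:

import json
import re
import requests

html = requests.get("https://example.com/some-page", timeout=30).text

# Hypothetical pattern: many sites embed state as `window.__INITIAL_STATE__ = {...};`
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(list(data.keys()))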

Render JS

For scalability and simplicity, I suggest using clusters of remote browsers. Recently I found Selenoid, a cool remote Selenium grid. Their support team is awful, but the microservice itself is quite easy to run on your machine. As a result, you get a magic URL that you plug into your Selenium driver to render everything remotely instead of locally (e.g. via a chromedriver binary). Super useful.

This is my Selenoid usage snippet (Python). Notice the enable_vnc param: it opens a VNC connection so you can actually see what is happening in the remote Selenoid browser.

from selenium import webdriver

# Example values - point these at your own Selenoid instance
SELENOID_URL = "http://localhost:4444/wd/hub"
SELENOID_WEB_URL = "http://localhost:8080"


def get_selenoid_driver(
    enable_vnc=False, browser_name="firefox"
):
    capabilities = {
        "browserName": browser_name,
        "version": "",
        "enableVNC": enable_vnc,
        "enableVideo": False,
        "screenResolution": "1280x1024x24",
        "sessionTimeout": "3m",

        # Chrome-specific extras borrowed from other setups:
        # hide the automation banner and password manager pop-ups
        "goog:chromeOptions": {"excludeSwitches": ["enable-automation"]},
        "prefs": {
            "credentials_enable_service": False,
            "profile.password_manager_enabled": False
        },
    }

    driver = webdriver.Remote(
        command_executor=SELENOID_URL,
        desired_capabilities=capabilities,  # Selenium 3-style API
    )
    driver.implicitly_wait(10)  # wait up to 10s when looking up elements

    if enable_vnc:
        print(f"You can view VNC here: {SELENOID_WEB_URL}")
    return driver
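
A quick usage example (assuming your Selenoid config provides a Chrome image):

driver = get_selenoid_driver(enable_vnc=True, browser_name="chrome")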

After the page renders, you may well find a JSON with all the data you need in the page's HTML. Again, try to extract that JSON and take all the data from it instead of parsing HTML tags & fields that might change next week.

driver.get(url_to_open)
html = driver.page_source
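
From here, the extraction is the same trick as before. A sketch, assuming the rendered page embeds its state in a hypothetical window.__DATA__ variable:

import json
import re

match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.DOTALL)
data = json.loads(match.group(1)) if match else None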

Parse HTML

If you can find neither a proper XHR query nor JSON in the page's HTML, that means the website uses server-side rendering (SSR), which pre-fills all the data into the HTML. In this rare case (probably an old website), the only thing left is to extract the data you need using CSS or XPath selectors. My only advice here is to keep the queries & filters as minimal as possible, to avoid overfitting to the current layout and getting NULL output when some page blocks are swapped or slightly changed.
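
When it comes to that, keep the selectors as shallow as possible. A minimal sketch with BeautifulSoup; the .post-title class is made up, use whatever the target page actually has:

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = requests.get("https://example.com/old-school-page", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# One short selector instead of a long brittle chain of nested tags
titles = [tag.get_text(strip=True) for tag in soup.select(".post-title")]
print(titles)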


Hope this was useful. Please tweet me your comments and questions. See you soon for more parsing & ELT tutorials! 👋