How to protect content against scrapers with headless browsers?

We live in the age of information. Data can bring huge value to a business, which is why so many scrapers exist. I guess every website owner has thought at least once about how to protect their content. One way is to render the content with JavaScript. However, modern tooling makes it easy to run a headless browser on a server and simply scrape everything that gets rendered. Let’s make the scraper author’s life a bit harder…

I want to show you an example of how to protect content from scrapers that use headless browsers for rendering. First, though, I need to cover a bit of theory. Feel free to skip the next section if you are a smart ass.

Shadow Root

Let me introduce you to the shadow root. If you don’t know what it is, I’ll try to explain. A shadow root lets you create a separate DOM tree inside your current DOM. It’s kind of a document inside a document. Why is that useful? For example, you can build a web component with its own HTML IDs and CSS classes without worrying about clashes with other code on the page. If you want to learn more, read the documentation and this article.

Protection

So, how can we use a shadow root to protect content from scrapers? A shadow root has two modes: open and closed. If the mode is closed, you can’t reach the elements inside it from JavaScript through the usual element.shadowRoot property. That stops a scraper from reading those elements by simply evaluating JS inside the page.

I’ll show you an example. Let’s create an empty HTML document with a div for sensitive data.

<strong>Following content should be protected:</strong>
<div id="protected-content">
</div>

We need to use JavaScript, so let’s create a script block after our div and put the following code there for the test. I use the shadow root in open mode for now.

<strong>Following content should be protected:</strong>
<div id="protected-content">
</div>
<script>
    const element = document.querySelector('#protected-content');
    element.attachShadow({ mode: 'open' });
    element.shadowRoot.innerHTML = '<p>Hello from protected area, buddy!</p>';
</script>

What do we get as a result?

Shadow root example in open mode

Looks like our shadow root code is working. However, it doesn’t give any protection: anyone can reach the content through element.shadowRoot. Let’s switch the mode to closed.

element.attachShadow({ mode: 'closed' });

Oops… Looks like my code doesn’t work now.

Shadow root is not accessible if mode is closed

As I said before, a closed shadow root can’t be accessed from JavaScript. That means element.shadowRoot is always null if the mode is closed, so the line that sets innerHTML on it fails.
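
To see the difference from the scraper’s side, here is a minimal sketch of the kind of helper a scraper might evaluate inside the page. The name readShadowText is hypothetical, I made it up for this illustration. The only real API it relies on is the element.shadowRoot property, which holds the shadow root in open mode and null in closed mode.

```javascript
// Hypothetical helper: roughly what a scraper could evaluate inside the page.
// Returns the text inside the element's shadow root, or null when the root
// is closed (or when the element has no shadow root at all).
function readShadowText(element) {
    // element.shadowRoot is null for closed roots and for plain elements.
    const root = element.shadowRoot;
    return root ? root.textContent : null;
}
```

In a browser console, calling readShadowText(document.querySelector('#protected-content')) should give back the greeting for the open-mode example and null once the mode is closed.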

How can we fix that? Let’s create a custom HTML element. First, I removed the div with the id protected-content and the old script tag. Then I added a new script block inside the head and put the following code there. It declares the ProtectedWebComponent class with its constructor and a connectedCallback method. You can find similar examples on the internet.

class ProtectedWebComponent extends HTMLElement {
    constructor() {
        super();
        this._protected_root = this.attachShadow({ mode: "closed" });
    }
    connectedCallback() {
        this._protected_root.innerHTML = `<p>Hello from protected area, buddy!</p>`;
    }
}
window.customElements.define("protected-web-component", ProtectedWebComponent);

Then I added the new component to the HTML.

<strong>Following content should be protected:</strong>
<protected-web-component></protected-web-component>

Does it work? Yeah, looks good.

Closed shadow root example

However, there is still a problem. You can’t reach the shadow root through element.shadowRoot, but the _protected_root field is still there, so anyone can get the shadow root by reading element._protected_root. How can we fix that? Let’s fill in the content inside the constructor and not save the reference to the shadow root anywhere.

class ProtectedWebComponent extends HTMLElement {
    constructor() {
        super();
        const shadowRootLink = this.attachShadow({ mode: "closed" });
        shadowRootLink.innerHTML = `<p>Hello from protected area, buddy!</p>`;
    }
}
window.customElements.define("protected-web-component", ProtectedWebComponent);

Looks much better now.

It's impossible to access closed shadow root from JavaScript
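
If you want to double-check that a component doesn’t leak its root through a field like _protected_root, something along these lines could help. It is only a sketch: findLeakedShadowRoots is a hypothetical name, and it looks only at the element’s own properties, so it says nothing about references kept in other places. It relies on the fact that a shadow root points back to its element through the host property.

```javascript
// Hypothetical audit helper: lists the own properties of a host element
// that still point at a shadow root attached to that element.
function findLeakedShadowRoots(host) {
    return Object.getOwnPropertyNames(host).filter((name) => {
        const value = host[name];
        // A ShadowRoot exposes the element it is attached to as `host`.
        return value !== null && typeof value === 'object' && value.host === host;
    });
}
```

For the first version of ProtectedWebComponent it should report the _protected_root field, and for the second version it should return an empty array.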

Source code

If you want to check the source code, you are welcome to visit my GitHub repository.

Results

As you can see, it’s pretty easy to protect some content by using a shadow root. However, you should understand that this approach is not a magic wand. Using a shadow root to protect content can make a scraper developer’s life harder, but it doesn’t put the content in bulletproof armor. If you are interested in how to get around this protection, write a comment below and I’ll try to explain.

P.S. This is my first article in English. Don’t judge me if I made some language mistakes. Better tell me about them and I’ll try to fix them.
