Scalable Web Scrapers

Exploring the best tools for scalable, cost-effective web scraping solutions.

Harrison Strowd

We recently began working on a project focused on collecting publicly available data from a wide range of websites. These sites do not have publicly documented APIs, so we opted to crawl each site and scrape the relevant information from pages containing our target data.

To set ourselves up for long-term success and ease of maintenance, we thoroughly reviewed the available tools and services in the web crawling and scraping space. Below, we’ve summarized our analysis and shared insights into the approach we selected and our experience with these tools. We hope it proves helpful for others considering similar projects.

Context

Before diving into our analysis, it’s important to understand the context of our specific project: our team’s skills, our budget, and our goals for crawling and scraping all shaped how we weighed each tool/service.

Teams with different skills, larger budgets, or alternative goals for crawling/scraping might arrive at a very different set of conclusions when performing a similar analysis.


Evaluation Criteria

We leveraged a number of public comparisons of web scraping tools and identified 34 tools and services to evaluate. We based our evaluation on nine criteria, including price, tool type, crawling support, JavaScript execution, and anti-blocking features.

Our complete evaluation and findings for each tool/service can be found in this spreadsheet.

We only gathered enough information on each tool/service to rule it in or out of further consideration. Therefore, you’ll find several empty cells in our results where this information wasn’t crucial for our decision-making.


Evaluation Results

After completing this survey of available tools/services, we selected three for hands-on experimentation: Diffbot, Apify, and Browse AI. Diffbot was powerful but expensive, Apify was too costly at scale, and Browse AI couldn’t support full site crawling. None of the three met our specific requirements.


Selected Approach

Ultimately, we chose Crawlee.dev, a Node.js library developed by Apify, as our solution. It’s the foundation for many Actors on the Apify platform, and it aligned with our team’s technical expertise.

We initially tried crawling with Playwright and headless Chrome, but the setup was complex and performance was slow. We switched to Cheerio, a lightweight server-side HTML parsing library, which provided a significant performance boost. Although Cheerio cannot execute JavaScript or interact with page elements, this limitation has not prevented us from scraping any target data from the initial set of sites we’ve deployed crawlers for.
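
To make this concrete, here’s a minimal sketch of a Cheerio-based Crawlee crawler. The target URL and selectors are hypothetical placeholders; the real extraction logic is specific to each site we crawl.

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Cheerio parses the raw HTML on the server: no browser, no JavaScript execution.
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Scraping ${request.url}`);

        // Hypothetical selector; real ones depend on each target site's markup.
        await Dataset.pushData({
            url: request.url,
            title: $('h1').first().text().trim(),
        });

        // Follow discovered links (same hostname by default) to crawl the rest of the site.
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```

One nice property of this design: if a site later turns out to require JavaScript, Crawlee’s browser-based crawlers expose a nearly identical request-handler interface, so swapping one in is a small change.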


We also opted to run our crawlers on Render using cron jobs. Since we already had a Dockerfile set up for our application, the deployment took only minutes. We now run the crawlers daily, paying just a few cents per execution. The process was incredibly easy and cost-effective.
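
For reference, the Dockerfile for a setup like this can be quite small. This is a sketch rather than our exact file; in particular, the entry-point path (src/main.js) is a hypothetical placeholder.

```dockerfile
# Minimal image for a Node.js crawler run as a Render cron job.
FROM node:20-slim
WORKDIR /app

# Install production dependencies first so this layer is cached between builds.
COPY package*.json ./
RUN npm ci --omit=dev

# Copy the rest of the source and run the crawler once per invocation.
COPY . .
CMD ["node", "src/main.js"]
```

Render starts the container on whatever schedule you configure for the cron job, so the process simply boots, crawls, and exits.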

We’re very pleased with the results and believe this solution will scale to meet our needs. One potential challenge remains: how often will the structure of our target sites change in ways that require us to adjust our crawlers? Time will tell.

Have you built web crawlers or site scrapers in the past? What tools and techniques did you use? Any lessons learned that could help us? We’d love to hear your thoughts below. Stay tuned for more updates as we continue to develop this solution. Thanks for following along!
