We recently began working on a project focused on collecting publicly available data from a wide range of websites. These sites do not have publicly documented APIs, so we opted to crawl each site and scrape the relevant information from pages containing our target data.
To set ourselves up for long-term success and ease of maintenance, we thoroughly reviewed the available tools and services in the web crawling and scraping space. Below, we’ve summarized our analysis and shared insights into the approach we selected and our experience with these tools. We hope it proves helpful for others considering similar projects.
Context
Before diving into our analysis, it’s important to understand the context of our specific project. The following key characteristics guided our evaluation of each tool/service:
- Number of sites to scrape: 10s to 100s
- Desired scraping frequency (i.e., how often does the data change?): Weekly to monthly
- Acceptable price range: Free or up to ~$50 per month
- Target data format: JSON or other structured formats
- Crawling requirements: Crawl all pages/links within an area of each target site
- Site complexity (e.g., dynamic JS content, AJAX requests, etc.): Not likely
- Expected scraping obstacles (e.g., CAPTCHAs, IP blocking, etc.): Not likely
- Programming languages: Ruby, JavaScript, or hosted service with an HTTP API
Teams with different skills, larger budgets, or alternative goals for crawling/scraping might arrive at a very different set of conclusions when performing a similar analysis.
Evaluation Criteria
We leveraged a number of public comparisons of web scraping tools (source, source, source, source, and source) and identified 34 tools and services to evaluate. We based our evaluation on the following nine criteria:
- Price: How expensive is the tool to operate on a monthly basis?
- Tool/Service Type: Is this an open-source library we need to set up, maintain, and host ourselves? Is it a web platform that runs on its own? Or is it an installed application that runs on an individual’s machine?
- Crawling Support: Does the tool support navigating an entire website and identifying pages of interest?
- Scraping Support: Does the tool support extracting content from pages of interest?
- JavaScript Execution: Does the tool allow JavaScript to run (e.g., enabling AJAX calls, opening modals, etc.)?
- Interface/Programming Language: What interfaces are available for setting up and interacting with the tool? For open-source libraries, what programming languages are supported?
- Scheduled Execution: Does the tool allow scraping to be scheduled and executed automatically on a regular basis?
- IP Proxy Support: Does the tool support routing requests through proxy IP addresses to avoid rate limiting and crawler blocking?
- Anti-Blocking Features: Does the tool offer anti-blocking features like CAPTCHA handling and fingerprint management?
Our complete evaluation and findings for each tool/service can be found in this spreadsheet.
We only gathered enough information on each tool/service to rule it in or out of further consideration. Therefore, you’ll find several empty cells in our results where this information wasn’t crucial for our decision-making.
Evaluation Results
After completing this survey of available tools/services, we selected three for hands-on experimentation. Here’s what we found:
- Diffbot: Diffbot is a powerful tool, ideal for extracting data from unpredictable pages: you can drop any URL into it and get back a structured dataset representing the page’s contents. However, when we tested its site crawling functionality, we hit a blocker: the feature is only available on their "Plus Plan," which costs a hefty $899 per month. Since we have a known, predictable set of sites with structured data, that price wasn’t justified for our needs.
- Apify: Apify lets you execute "actors," which are prebuilt or custom scripts for scraping specific websites (see the sketch after this list). We couldn’t find an existing actor that suited our needs, so we developed a custom one. Apify’s actor templates made this relatively easy, but performance was slow: scraping 100 pages took around 10 minutes, and because Apify bills actor runs by compute time, that slowness made the solution costly, about $0.286 for just the first 100 pages. This didn’t meet our scalability or cost requirements.
- Browse AI: Browse AI allows non-technical users to create web scraping robots using a point-and-click interface. Initially, it seemed like a good fit. However, when we tried to scale it by crawling all related pages on the site, we found that Browse AI doesn’t support full site crawling. Their workaround requires users to manually supply a list of URLs, which didn’t suit our needs.
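For readers unfamiliar with the format, here is a minimal sketch of what an actor’s entry point looks like. The URL, the `.listing` selector, and the field names are purely illustrative, not from our actual crawler; the lifecycle calls (`Actor.init`, `Actor.pushData`, `Actor.exit`) come from Apify’s Node.js SDK.

```typescript
import { Actor } from 'apify';
import * as cheerio from 'cheerio';

await Actor.init();

// Hypothetical target page; a real actor would iterate over many pages.
const response = await fetch('https://example.com/listings');
const $ = cheerio.load(await response.text());

// Push one structured record per listing into the actor's default dataset.
const items = $('.listing')
    .map((_, el) => ({
        title: $(el).find('h2').text().trim(),
        url: $(el).find('a').attr('href'),
    }))
    .get();
await Actor.pushData(items);

await Actor.exit();
```

Apify’s templates scaffold this structure for you, which is what made the custom actor quick to build; the cost came from how long each run took on their platform.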
Selected Approach
Ultimately, we chose Crawlee.dev, a Node.js library developed by Apify, as our solution. It’s the foundation for many actors on their platform, and it aligned with our team's technical expertise.
We initially tried crawling with Playwright and headless Chrome, but the setup was complex and performance was slow. We switched to Crawlee’s Cheerio-based crawler, which fetches pages over plain HTTP and parses the HTML with Cheerio, giving us a significant performance boost. Although this approach cannot execute JavaScript or interact with page elements, that hasn’t prevented us from scraping any of the target data on the sites we’ve deployed crawlers for so far.
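To give a feel for the setup, here is a stripped-down sketch of a crawler like the ones we run. The starting URL, the glob pattern, and the selectors are placeholders rather than our actual targets; the `CheerioCrawler`, `enqueueLinks`, and `Dataset` APIs are Crawlee’s.

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Scraping ${request.url}`);

        // Extract whatever structured data the page holds (selectors are illustrative).
        await Dataset.pushData({
            url: request.url,
            title: $('h1').first().text().trim(),
        });

        // Keep crawling: follow links within the section of the site we care about.
        await enqueueLinks({ globs: ['https://example.com/listings/**'] });
    },
});

// Seed the crawl with the section's landing page; results land in the default dataset.
await crawler.run(['https://example.com/listings/']);
```

Because the handler only parses static HTML, each page costs a single HTTP request rather than a full browser render, which is where the performance win over the Playwright setup comes from.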
We also opted to run our crawlers on Render using cron jobs. Since we already had a Dockerfile set up for our application, the deployment took only minutes. We now run the crawlers daily, paying just a few cents per execution. The process was incredibly easy and cost-effective.
We’re very pleased with the results and believe this solution will scale to meet our needs. One potential challenge remains: how often will the structure of our target sites change in ways that require us to adjust our crawlers? Time will tell.
Have you built web crawlers or site scrapers in the past? What tools and techniques did you use? Any lessons learned that could help us? We’d love to hear your thoughts below. Stay tuned for more updates as we continue to develop this solution. Thanks for following along!