Scrapy is probably the most popular open-source framework for web scraping. It has been around since at least 2008, which is when I first used it. It started out as an open-source release of a Python framework built to scrape a large number of websites for a commercial enterprise. The framework turned out to be so successful on its own that its creators formed a company around it: scrapinghub.com.
In this article, I will first compare the visual web scraping tool ParseHub to Scrapy as an open-source Python project, and then I will compare ParseHub to Scrapinghub, the paid service that runs Scrapy spiders for a fee.
ParseHub and Scrapy
Comparing ParseHub to Scrapy is somewhat of an apples-to-oranges comparison, because one is a UI tool and the other is a programming library. A more apples-to-apples comparison would be to the associated open-source project Portia. But since Scrapy is so established, and Portia is relatively new, I will confine this article to the first comparison, and leave Portia for another day and another blog post.
|Feature||ParseHub||Scrapy|
|Authoring environment||Desktop app (Mac, Windows and Linux)||Python plus scrapy command line tool|
|Scraper logic||Variables, loops, conditionals, function calls (via templates)||Variables, loops, conditionals, function calls (arbitrary Python)|
|Pop-ups, infinite scroll, hover content||Yes||With external libraries|
|Debugging||Visual debugger||Python logs|
|Knowledge of HTML and HTTP||None required||Required|
|Selecting elements||Point-and-click, CSS selectors, XPath||CSS selectors, XPath|
|Speed||Fast parallel execution||Fast parallel execution|
|Hosting||Hosted on cloud of hundreds of ParseHub servers||Hosted on your local machine or your own servers|
|IP Rotation||Included in paid plans||Must pay external service|
|Sites (AKA spiders, scrapers, projects)||Free plan: 5, $99/month: 20, $499/month: 120||Limited by your infrastructure|
|Support||Free professional support||Community support|
|Data export||CSV, JSON, API||CSV, JSON, API|
|Run-time configuration||Passed in as a JSON object||Passed on the command line, arbitrary Python|
ParseHub and Scrapy: Conclusion
ParseHub offers most of the web scraping power and scale of Scrapy in a much easier-to-use package. Because we're actually big fans of Scrapy, we still recommend it for a few situations:
- Tight integration with an existing Python codebase and infrastructure
- Crawling hundreds of websites and grabbing all HTML or just some keywords
We are working on solving the second use case with ParseHub right now. Stay tuned!
ParseHub and Scrapinghub
Scrapinghub is a paid service for running web scrapers (AKA spiders or projects) created with the open-source Python framework Scrapy. It is equivalent to ParseHub's "run on server" and "run on a schedule" service, which is integrated into the ParseHub desktop app.
At first glance, the main difference between the two services appears to be their pricing. ParseHub packages capabilities into conventional software-as-a-service (SaaS) plans: Free, Standard ($99/month) and Professional ($499/month). Scrapinghub prices its service in $9 "scrapy cloud units", similar to infrastructure-as-a-service (IaaS) offerings such as Amazon EC2.
On closer inspection, the two are more alike than they look. Both services offer a generous free plan that covers multiple projects and hundreds of pages or more, and both offer more speed for more money. ParseHub clearly defines how many pages a minute it will provide for each plan. Scrapinghub offers additional "concurrent crawls" for $9 each. It would require some benchmarking to estimate how much faster each additional crawl makes Scrapy in terms of pages per minute.
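For anyone who wants to run that benchmark, the arithmetic is simple enough to sketch. The helper below is my own illustration, not part of Scrapy or Scrapinghub: `run_crawl` is a hypothetical callable that performs one crawl however you like and returns the number of pages it fetched.

```python
import time


def pages_per_minute(pages, elapsed_seconds):
    """Convert a page count and a wall-clock duration into pages per minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return pages * 60.0 / elapsed_seconds


def benchmark(run_crawl):
    """Time one crawl and return its throughput in pages per minute.

    `run_crawl` is a hypothetical zero-argument callable that runs a
    crawl and returns how many pages were fetched.
    """
    start = time.perf_counter()
    pages = run_crawl()
    elapsed = time.perf_counter() - start
    return pages_per_minute(pages, elapsed)


def speedup(baseline_ppm, candidate_ppm):
    """How many times faster the candidate setup is than the baseline."""
    return candidate_ppm / baseline_ppm
```

Running the same crawl once with one concurrent crawl and once with two, then passing both results to `speedup`, gives a rough idea of what each extra $9 buys. A single run is noisy, so several runs per setting would give a fairer number.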
ParseHub and Scrapinghub both offer IP rotation, but Scrapinghub sells it in a separate service, Crawlera, starting at $25 a month and up to $500 or more.
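As a rough way to line the two pricing models up, the sketch below adds up a Scrapinghub bill from the figures quoted above. It assumes the $9 cloud unit price is charged per month, which this article does not state outright, and the function name and example quantities are made up for illustration.

```python
# Prices quoted in this article, in US dollars.
CLOUD_UNIT_USD = 9        # one "scrapy cloud unit" (assumed to be per month)
CRAWLERA_ENTRY_USD = 25   # entry-level Crawlera plan, per month
PARSEHUB_PLANS_USD = {"Free": 0, "Standard": 99, "Professional": 499}


def scrapinghub_cost(cloud_units, crawlera_usd=0):
    """Estimated monthly Scrapinghub bill: cloud units plus optional Crawlera."""
    if cloud_units < 0 or crawlera_usd < 0:
        raise ValueError("quantities and prices must be non-negative")
    return cloud_units * CLOUD_UNIT_USD + crawlera_usd


if __name__ == "__main__":
    # Hypothetical setup: 4 cloud units plus the entry-level Crawlera plan.
    estimate = scrapinghub_cost(4, CRAWLERA_ENTRY_USD)
    print(f"Scrapinghub estimate: ${estimate}/month")
    print(f"ParseHub Standard:    ${PARSEHUB_PLANS_USD['Standard']}/month")
```

This only compares list prices. It says nothing about how many pages per minute each setup delivers, which is why a throughput benchmark is still needed.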
ParseHub and Scrapinghub: Conclusion
Like the earlier comparison, ParseHub vs Scrapinghub is somewhat of an apples-to-oranges comparison. ParseHub is designed to work at a higher level, where most of the features of Scrapinghub are bundled together. Scrapinghub is a good choice if you are already convinced that Scrapy is for you. If you are just starting out, we encourage you to try ParseHub, which will get you up and running much faster and at a similar price.
As a final note, Scrapinghub's monitoring dashboard is really nice. Kudos.