The internet provides us with access to an incredible amount of data and information.
Just think about the amount of data that a simple e-commerce site might have: product names, models, availability, prices, descriptions, reviews, photos, discount codes and more.
Now think of larger websites like Twitter or Amazon and the scale of the data they hold.
Web Scraping and Data
Unfortunately, most websites do not provide users with simple access to their public data. For example, Amazon does not provide you with a way to download a spreadsheet with all the details of the products you’re interested in to make a better buying decision.
After all, Amazon doesn’t want you to make a good buying decision; they just want you to buy something.
Here is where web scraping comes in, providing you access to valuable data and information in order to make better decisions.
What is Web Scraping?
Web scraping refers to the extraction of data from a website into a new format. In most cases, the data from a website is extracted into an Excel sheet or JSON file.
Web scraping is usually an automated process done by a piece of software, although it can still be done manually. Because manual extraction is slow and tedious, most people prefer to use web scraping software to save time and money.
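To make the idea of "extracting data into a new format" concrete, here is a minimal sketch that pulls product names out of a small piece of HTML and turns them into JSON. It uses only Python's standard library. The HTML fragment, the `product` class name and the `extract_products` helper are all made up for illustration; real websites use different markup.

```python
import json
from html.parser import HTMLParser

# Hypothetical page fragment; real sites use different markup.
SAMPLE_HTML = """
<ul>
  <li class="product">Blue Phone Case</li>
  <li class="product">Red Phone Case</li>
</ul>
"""


class ProductNameParser(HTMLParser):
    """Collects the text of every <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and ("class", "product") in attrs:
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.names.append(data.strip())


def extract_products(html):
    """Return one dictionary per product found in the HTML."""
    parser = ProductNameParser()
    parser.feed(html)
    return [{"name": name} for name in parser.names]


# The same data, now in a structured format that other tools can read.
products_json = json.dumps(extract_products(SAMPLE_HTML), indent=2)
print(products_json)
```

A dedicated web scraper does this kind of work for you, without requiring you to write or maintain any code.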
While it might sound simple, web scraping can be used in numerous ways to unlock value from many different websites.
Want to learn more about web scraping? Read our definitive guide on web scraping and its uses.
What is Web Scraping Used For?
Due to its versatility, web scraping can be used in various scenarios. We could spend hours reviewing each use case, but here are some of the most common.
Lead Generation
Imagine that you work for a company that sells and distributes dental equipment. You might be interested in creating a database or spreadsheet with information about every dentist in your city.
You could build this spreadsheet manually, one entry at a time, or you could use a web scraper to scrape a website like Yellow Pages or Yelp for information on dentist offices, including their business names, addresses, phone numbers and more.
Interested in lead generation? Read our guide on how to power your lead generation efforts with web scraping.
Competitor Analysis / Market Research
Let’s say you are looking into starting your own e-commerce business selling smartphone cases online. Building a database of similar product listings can give you insights into how to position and price your products.
Statistical Analysis
Many people use web scraping to generate datasets they can later use for statistical analysis.
For example, you could use a web scraper to extract stock prices for specific companies on a daily basis and get a better sense of how a specific industry is performing overall.
On the other hand, you could also use web scraping for more “fun” statistical analysis, such as scraping sports stats that will fuel your fantasy league choices.
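As a rough illustration of what that analysis can look like once the data is scraped, the sketch below computes day-over-day percent changes from a list of closing prices. The dates and prices are invented sample values, and `daily_changes` is a hypothetical helper, not part of any scraping tool.

```python
from statistics import mean

# Hypothetical closing prices, as a daily scrape might collect them.
closing_prices = [
    ("2024-01-02", 100.0),
    ("2024-01-03", 102.0),
    ("2024-01-04", 99.96),
]


def daily_changes(prices):
    """Return the percent change between each pair of consecutive days."""
    changes = []
    for (_, previous), (day, current) in zip(prices, prices[1:]):
        changes.append((day, (current - previous) / previous * 100))
    return changes


changes = daily_changes(closing_prices)
average_change = mean(change for _, change in changes)

for day, change in changes:
    print(f"{day}: {change:+.2f}%")
print(f"Average daily change: {average_change:+.2f}%")
```

The same pattern applies to sports stats or any other numeric data you collect on a schedule.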
As we mentioned earlier, there are many more uses for web scraping, including:
- Social Media scraping for sentiment analysis
- Scraping for archival purposes
- Scraping websites for research purposes
- Scraping your own site before a website migration
- Scraping data for comparison shopping
What is the Best Web Scraper?
This question is asked a lot.
The true answer is that it depends.
Given your project’s needs and specifications, one web scraper might be better than another. We’ve actually written an in-depth guide on what makes the best web scraper and which features are must-haves.
However, we are obviously biased towards ParseHub. Not only is it incredibly powerful, versatile and easy to use (being able to scrape any dynamic website), but it is also free to download and use.
We also provide awesome customer support, in case you ever hit a snag while running your scrape jobs.
How to Scrape a Website
Now, let’s walk you through your very first web scraping project.
For this example, we are going to keep it simple. We will scrape listings from Amazon’s search results page for the term “tablet”. We will be scraping the product name, listing URL, price, review score, number of reviews and image URL.
- Make sure to download and open ParseHub.
- Click on New Project and submit the Amazon URL we’ve selected. The website will now be rendered inside the application.
- Scroll past the sponsored listings and click on the product name of the first search result.
- The product name will be highlighted in green to indicate that it has been selected. Click on the second product name to select all the listings on the page. All product names will now be highlighted in green.
- On the left sidebar, rename your selection to product.
- ParseHub is now extracting both the product name and URL. Now we will tell it to extract the product’s price.
- First, click on the PLUS(+) sign next to the product selection you created and choose the Relative Select command.
- Using the Relative Select command, click on the first product name and then on its price. An arrow will appear to connect the two data points.
- Rename your new selection to price.
- Using the icon next to your price selection, expand your selection and remove the URL extraction.
- Next, repeat the previous four steps (from clicking the PLUS(+) sign through removing the URL extraction) to also extract the product’s star rating, number of reviews and image URL. Remember to name your selections accordingly as you create them.
Your final project should look like this:
Pro Tip: Want to scrape and also download the images for every product? Read our guide on how to scrape and download images from any site, including Amazon.
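If you are curious what the selections above amount to, here is a minimal Python sketch of the same idea: find every listing, then read each field relative to that listing. It is not how ParseHub works internally. The HTML, class names and URLs are hypothetical stand-ins, since Amazon's real markup is different and changes often.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; Amazon's real HTML differs.
SAMPLE_RESULTS = """
<div class="result">
  <a class="title" href="/dp/EXAMPLE1">Example Tablet 10</a>
  <span class="price">$199.99</span>
  <span class="rating">4.5</span>
  <span class="reviews">1,204</span>
  <img src="/img/example1.jpg">
</div>
<div class="result">
  <a class="title" href="/dp/EXAMPLE2">Example Tablet 8</a>
  <span class="price">$129.00</span>
  <span class="rating">4.2</span>
  <span class="reviews">310</span>
  <img src="/img/example2.jpg">
</div>
"""

# Maps a (made-up) CSS class to the name of the selection it fills.
TEXT_FIELDS = {"title": "product", "price": "price",
               "rating": "rating", "reviews": "reviews"}


class ListingParser(HTMLParser):
    """Builds one dictionary per listing, with fields read relative to it."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self.current = None  # the listing currently being filled in
        self.field = None    # the text field currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class")
        if tag == "div" and css_class == "result":
            self.current = {}
            self.listings.append(self.current)
        elif self.current is not None:
            if css_class in TEXT_FIELDS:
                self.field = TEXT_FIELDS[css_class]
            if tag == "a" and css_class == "title":
                self.current["url"] = attrs.get("href")
            if tag == "img":
                self.current["image_url"] = attrs.get("src")

    def handle_endtag(self, tag):
        self.field = None
        if tag == "div":
            self.current = None

    def handle_data(self, data):
        if self.current is not None and self.field and data.strip():
            self.current[self.field] = data.strip()


parser = ListingParser()
parser.feed(SAMPLE_RESULTS)
for listing in parser.listings:
    print(listing)
```

Each dictionary holds the same six pieces of data we selected in the project: product name, URL, price, rating, number of reviews and image URL.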
Dealing with Pagination
We want to keep this project simple, but we could not pass up the chance to showcase one of ParseHub’s best features. We will now tell ParseHub to navigate beyond the first page of results and keep scraping further pages of results.
- Click on the PLUS(+) sign next to your page selection and choose the Select command.
- Now scroll all the way down to the bottom of the page and click on the “Next” page link. It will be highlighted in green to show it has been selected.
- Rename your selection to next.
- Expand your selection and remove the extract commands under it.
- Now use the PLUS(+) sign next to the next command and select the Click command.
- A pop-up will appear asking you if this is a Next Page button. Click Yes and enter the number of times you’d like to repeat your scrape. For this example, we will enter 4. Then click on Repeat Current Template.
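The logic behind these pagination steps can be sketched in a few lines of Python: scrape a page, follow the "next" link, and stop after a fixed number of clicks or when there are no more pages. The `PAGES` data and the `fetch_page` function below are hypothetical stand-ins for real page requests, and we are assuming the repeat count means the number of additional pages to visit.

```python
# Hypothetical in-memory "site": each page has items and an optional next page.
PAGES = {
    "page1": {"items": ["Tablet A", "Tablet B"], "next": "page2"},
    "page2": {"items": ["Tablet C"], "next": "page3"},
    "page3": {"items": ["Tablet D"], "next": None},
}


def fetch_page(page_id):
    """Stand-in for downloading and parsing a results page."""
    return PAGES[page_id]


def scrape_with_pagination(start, max_clicks):
    """Scrape the start page, then follow "next" at most max_clicks times."""
    items = []
    page_id = start
    clicks = 0
    while page_id is not None:
        page = fetch_page(page_id)
        items.extend(page["items"])
        if clicks == max_clicks:
            break  # repeat limit reached
        page_id = page["next"]  # None when there is no next page
        clicks += 1
    return items


# Allow up to 4 clicks on "next"; this sample site runs out of pages first.
print(scrape_with_pagination("page1", max_clicks=4))
```

Setting a repeat limit keeps the job bounded even when a site has hundreds of result pages.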
Running your Scrape Job
You are now ready to run your very first web scraping job. Just click on the Get Data button on the left sidebar and then on Run.
ParseHub will now scrape all the data you’ve selected. Feel free to keep working on other tasks while the scrape job runs on our servers. Once the job is completed you will be able to download the scraped data as an Excel or JSON file.
Pro Tip: For longer and more complex scrape jobs, we recommend running a Test Run before submitting your entire project. This way, you can confirm that your project will be formatted correctly.
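Once you have the JSON file, you can reshape it however you like. The sketch below flattens a JSON export into CSV using Python's standard library. The structure of `SAMPLE_EXPORT` (a top-level `product` list of objects) is an assumption based on the selection names used in this guide, so check your own download before relying on it; `export_to_csv` is a hypothetical helper.

```python
import csv
import io
import json

# Assumed shape of the downloaded JSON; verify against your own export.
SAMPLE_EXPORT = """
{"product": [
  {"name": "Example Tablet 10", "url": "https://example.com/1", "price": "$199.99"},
  {"name": "Example Tablet 8", "url": "https://example.com/2"}
]}
"""


def export_to_csv(json_text, list_key="product"):
    """Convert the list stored under list_key into CSV text."""
    rows = json.loads(json_text)[list_key]
    # Collect column names in first-seen order; some rows may lack a field.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = export_to_csv(SAMPLE_EXPORT)
print(csv_text)
```

Listings that are missing a field, such as a product without a price, simply get an empty cell in that column.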
Your Next Web Scraping Project
Congratulations! You just completed your very first scraping job.
With the skills and knowledge you’ve just acquired from this guide, you are now ready to take on your next web scraping project.
Which site will you scrape next?