Sometimes very useful information from a website is not available via an API or a spreadsheet export.
... and copy/pasting information from a website to a spreadsheet can be time-consuming when dealing with formatting issues and multiple web pages.
How Web Scraping and Google Sheets can help
Web scraping is a much better alternative to get content from any website into a more usable format - such as into Google Sheets. Web scraping builds an API for you, for websites that don't provide them. You can use a web scraping tool and save time copying and pasting web data.
Additionally, Google Sheets has a wonderful IMPORTDATA() function where you can pass in the API key and project token created by a web scraper like ParseHub, a visual web scraping tool.
Doing this will get you a continuous refresh of data every time content is updated on the website you are scraping.
In this tutorial we will show you how to:
- Scrape events from a website using ParseHub.
- Schedule the project to scrape new events continuously.
- Import the event name, date and location into Google Sheets, by using the IMPORTDATA() function and the ParseHub API key and project token.
Before we start
- Make sure to download ParseHub for free - this web scraping tool will go and collect the website data we need
- For this example, we will scrape event data from the City of Toronto website. However, ParseHub can be used to scrape data from all other kinds of websites.
1. Scrape data from a website
Open ParseHub and the Toronto events website
- Open the ParseHub desktop application.
- Open the following link - http://www.toronto.com/events/
in the ParseHub browser.
- Click on "Create New Project" and "Start project on this URL".
Select and scrape all of the events on the page
- Select the name of the first event on the webpage by clicking on it.
- Click on another event name and all of the events on the page will be selected.Rename the selection to events. You have now extracted the name and the URL of each event. Because of the "Begin New Entry" command, each event will be given its own row in your data set.
Scrape the dates of each event
- Add a Relative Select command by clicking the "plus" button next to the command "Begin new entry in events" command.
- Click on one event name.
- Hover over the "Dates" section of each event. You will notice that the highlight surrounds both the dates and the location. We need to zoom in to select only the dates.
- While hovering over the date, hold down the Control key (or Command key on Mac) and press the number 2 on your keyboard.
- As soon as you see the blue highlight which says "Text node" that surrounds only the dates - click on the dates. The dates for all of the events should now be selected for you.
- Rename the selection to date. They will be automatically extracted for you.
Scrape the location for each event
- Add another Relative Select command from the plus button beside "Begin new entry in events".
- Click on the event name, then click on the location of the same event to connect the two with an arrow. All of the locations will be selected and extracted.
- Rename the selection to locations.
Do you want to get events from all of the pages? Check out this pagination tutorial.
2. Schedule the project to scrape new events continuously
You can run this project one time and get the data to populate our Google Sheets in step 3 only once. However, we will show you how to schedule the project to run one time a day instead.
- Click on the "Get Data" button in the ParseHub application.
- Click on the "Schedule" button.
- From the dropdown, you can select "every week" or "every hour". For this project "every day" is already selected for you.
- You can also choose the time. By default, the project will run at midnight. Of course, you can just select the time you want from the dropdowns.
- Click on the "Save and Schedule" button.
- Now your project will run automatically at your scheduled times. A new tab for this schedule will be created on the "Get Data" page. You can click on this tab to open a page where your data can be downloaded, after the first time it is scheduled to run. Your data will be available in Excel and JSON format.
3. Import events into a Google Sheet
Instead of downloading data in Excel from the ParseHub extension just one time, we will import the data into Google Sheets. In this example, we will get the latest run of data from our events project.
Every time the project runs and scrapes the website, this Google Sheet will be uploaded with new events.
Find your API key and project token
- Open the project we just worked on.
- Find your project token in the "Settings" tab of the project design.
- Find your API key by clicking on the "profile" icon in the top-right corner of the toolbar. Click "Account" and you will see your API key listed.
Open Google Sheets and create your IMPORTDATA function
- Open a new Google Sheet.
- Click on the A1 cell and type in =IMPORTDATA()
- In the =IMPORTDATA() function create your URL like the following:
- Replace the PROJECT_TOKEN with the actual project token from the "Settings" tab of your project.
- Replace the API_KEY with the API key from your account.
We created this URL based on the ParseHub API reference. Take a look at it for more information.
If you did everything correctly, you should see the data from your project appear almost immediately.
We are always open to hearing about your ideas and hacks. If you know a great way to use ParseHub with another tool, shoot us an email at hello[at]parsehub[dot]com.
[This post was originally written on December 15, 2015 and updated on August 1, 2019]