Introduction
Welcome to the ParseHub API documentation. ParseHub’s API enables you to programmatically manage and run your projects and retrieve extracted data.
The ParseHub API is designed around REST. It aims to have predictable URLs and uses HTTP verbs where possible.
To the right, you can find sample code in a variety of languages. By default, curl is selected so that you can try out the commands in your terminal.
Authentication
Each request must include your API Key for authentication. If you’re logged in, the examples will have your API key filled in.
You can find your API Key on your account page.
Requests
POST requests must have a form-encoded body and the Content-Type: application/x-www-form-urlencoded; charset=utf-8 header.
GET requests must be URL-encoded.
All requests must be made over HTTPS. Any HTTP request receives an HTTP 302 redirect to the equivalent HTTPS address.
If you are using the curl examples, make sure that any data you replace is properly shell-escaped.
ParseHub limits API usage to 5 requests per second. Requests beyond that rate are queued, up to a maximum of 25 requests per second. Any requests beyond that will return a 429 status code.
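One way a client might cope with the 429 responses described above is to wait and try again. The following is a minimal sketch, not an official recommendation: `send` is a hypothetical callable that performs a single API request and returns a `(status_code, body)` pair, and the backoff delays are assumptions rather than values documented by ParseHub.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns something other than HTTP 429.

    send is a hypothetical zero-argument callable that performs one API
    request and returns (status_code, body). The delays are illustrative
    assumptions, not values documented by ParseHub.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Exponential backoff between attempts: 1s, 2s, 4s, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

The `sleep` parameter exists only so that the waiting can be replaced in tests. If the last attempt is still rate-limited, the 429 response is returned to the caller unchanged.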
Responses
Unless explicitly mentioned, JSON will be returned in all responses.
Errors
ParseHub returns standard HTTP errors when possible.
Backwards Compatibility
The ParseHub API may be changed in a backwards-compatible way at any time. Backwards compatibility means that new methods, objects, statuses, fields in responses, etc. may be added at any time, but existing ones will never be renamed or removed.
If there are backward incompatible changes that need to be made to our API, we will release a new API version. The previous API version will be maintained for a year after releasing the new version.
Client Libraries
Some developers in the community have built unofficial client libraries for using ParseHub in various development environments. ParseHub makes no guarantees as to their quality.
Python
PHP
Node
C
Go
If you’ve written a client library (with a corporate-friendly license) that you’d like added to this list, please contact us.
Objects
There are two primary types of objects that the ParseHub API operates with.
Project
{
"token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"title": "My Project",
"templates_json": "<LARGE_JSON_STRING_HERE>",
"main_template": "main_template",
"main_site": "http://www.example.com",
"options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}",
"last_run": "tRdThBFSU7CWKAQtjAqUzZWy",
"last_ready_run": "tglTFX7IFQlrAl7W6fl1GRjw"
}
This object represents a project that was created using the ParseHub client. It has the following properties:
Property | Description |
---|---|
token | A globally unique id representing this project. |
title | The title given by the user when creating this project. |
templates_json | The JSON-stringified representation of all the instructions for running this project. This representation is not yet documented, but will eventually allow developers to create plugins for ParseHub. |
main_template | The name of the template with which ParseHub should start executing the project. |
main_site | The default URL at which ParseHub should start running the project. |
options_json | An object containing several advanced options for the project. |
last_run | The run object of the most recently started run (ordered by start_time) for the project. |
last_ready_run | The run object of the most recent ready run (ordered by start_time) for the project. A ready run is one whose data_ready attribute is truthy. The last_run and last_ready_run for a project may be the same. |
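In the sample project above, options_json arrives as a JSON-encoded string rather than a nested object, so a client would typically decode it a second time. A small sketch using a trimmed copy of the sample values (the keys inside options_json are taken from the sample only and are not otherwise documented here):

```python
import json

# Trimmed copy of the sample project object shown above.
project = {
    "token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
    "title": "My Project",
    "options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}",
}

# options_json is a string, so it needs its own JSON decode.
options = json.loads(project["options_json"])
print(options["loadJs"])      # True
print(options["maxWorkers"])  # 1 (a string in this sample)

# In the sample, startingValue is itself JSON-stringified.
starting_value = json.loads(options["startingValue"])
print(starting_value)         # {}
```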
Run
{
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "complete",
"data_ready": true,
"start_time": "2015-02-03T23:09:38",
"end_time": "2015-02-03T23:10:40",
"pages": 1,
"md5sum": "f82f56816560943564803e005cb71d26",
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{}"
}
This object represents an instance of a project that was run at a given time with a given set of parameters. It has the following properties:
Property | Description |
---|---|
project_token | A globally unique id representing the project that this run belongs to. |
run_token | A globally unique id representing this run. |
status | The status of the run. It can be one of initialized, queued, running, cancelled, complete, or error. |
data_ready | Whether the data for this run is ready to download. If the status is complete, this will always be truthy. If the status is cancelled or error, then this may be truthy or falsy, depending on whether any data is available. |
start_time | The time that this run was started at, in UTC +0000. |
end_time | The time that this run was stopped. This field will be null if the run is either initialized or running. Time is in UTC +0000. |
pages | The number of pages that have been traversed by this run so far. |
md5sum | The md5sum of the results. This can be used to check if any results data has changed between two runs. |
start_url | The url that this run was started on. |
start_template | The template that this run was started with. |
start_value | The starting value of the global scope for this run. |
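The run properties above lend themselves to a few small helpers. The sketch below assumes that timestamps always follow the format of the sample (`2015-02-03T23:09:38`, UTC) and that complete, cancelled and error are the statuses after which a run no longer changes; both are inferences from this page, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: these are the statuses after which a run stops changing.
TERMINAL_STATUSES = {"complete", "cancelled", "error"}


def parse_time(value):
    """Parse a timestamp such as "2015-02-03T23:09:38" as UTC (None stays None)."""
    if value is None:
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def is_finished(run):
    """True once the run has reached one of the assumed terminal statuses."""
    return run["status"] in TERMINAL_STATUSES


def duration_seconds(run):
    """Elapsed seconds between start_time and end_time, or None if still running."""
    start, end = parse_time(run["start_time"]), parse_time(run["end_time"])
    return None if end is None else (end - start).total_seconds()


def results_changed(run_a, run_b):
    """Compare two runs by md5sum to see whether their results differ."""
    return run_a["md5sum"] != run_b["md5sum"]


sample_run = {
    "status": "complete",
    "start_time": "2015-02-03T23:09:38",
    "end_time": "2015-02-03T23:10:40",
    "md5sum": "f82f56816560943564803e005cb71d26",
}
print(is_finished(sample_run), duration_seconds(sample_run))  # True 62.0
```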
Methods
This section describes all of the HTTP endpoints that can be used to manipulate projects and runs.
The typical way to use the ParseHub API is the following:
1. Start a run. This is done by one of the following:
- Running the project from the ParseHub client,
- Setting up a schedule (also in the client) for runs to be started automatically,
- Using the POST /api/v2/projects/{PROJECT_TOKEN}/run method below.
2. Wait for the run to finish. You can use one of:
- Webhooks (recommended)
- Polling the GET /api/v2/runs/{RUN_TOKEN} method below. Note there are rate limits on this method.
For simple use cases, you can skip this step entirely by using the GET /api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data method.
3. Get the data from the run. Once the run’s data_ready is truthy, you can:
- Download the data from either the ParseHub client or website.
- Use the GET /api/v2/runs/{RUN_TOKEN}/data method below.
- For simple use cases, use the GET /api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data method. This will automatically get the data from the most recent run where data_ready is truthy.
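The three steps above can be sketched as a single function for the polling variant. This is only an illustration: `api` is a hypothetical callable `api(method, path)` that adds the api_key, performs the HTTP request and returns the decoded JSON, and the set of statuses treated as final is an assumption. The default interval of 180 seconds is deliberately conservative so that polling stays within the per-run limit on GET /api/v2/runs/{RUN_TOKEN}.

```python
import time

# Assumption: a run in one of these statuses will not change any further.
FINAL_STATUSES = ("complete", "cancelled", "error")


def run_and_fetch(api, project_token, poll_interval=180.0, sleep=time.sleep):
    """Start a run, poll until it stops, then return its data (or None).

    api is a hypothetical callable api(method, path) -> decoded JSON.
    """
    # Step 1: start a run.
    run = api("POST", f"/api/v2/projects/{project_token}/run")
    run_token = run["run_token"]

    # Step 2: wait for the run to finish by polling its status.
    while run["status"] not in FINAL_STATUSES:
        sleep(poll_interval)
        run = api("GET", f"/api/v2/runs/{run_token}")

    # Step 3: fetch the data, but only if any is available.
    if not run["data_ready"]:
        return None
    return api("GET", f"/api/v2/runs/{run_token}/data")
```

Where a webhook is available, step 2 would be replaced by waiting for the webhook call instead of the loop.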
Get a project
curl -X GET "https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}?api_key={YOUR_API_KEY}&offset=0&include_options=1"
import requests
params = {
"api_key": "{YOUR_API_KEY}",
"offset": "0",
"include_options": "1"
}
r = requests.get('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}', params=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}",
:offset => "0",
:include_options => "1"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}')
url.query = URI.encode_www_form(params)
puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
"api_key" => "{YOUR_API_KEY}",
"offset" => "0",
"include_options" => "1"
));
$result = file_get_contents(
'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}?'.$params,
false,
stream_context_create(array(
'http' => array(
'method' => 'GET'
)
))
);
echo($result);
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}',
method: 'GET',
qs: {
api_key: "{YOUR_API_KEY}",
offset: "0",
include_options: "1"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
params.Add("offset", "0")
params.Add("include_options", "1")
Url.RawQuery = params.Encode()
resp, _ := http.Get(Url.String())
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))
}
{
"token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"title": "My Project",
"templates_json": "<LARGE_JSON_STRING_HERE>",
"main_template": "main_template",
"main_site": "http://www.example.com",
"options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}",
"last_run": {
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "complete",
"data_ready": true,
"start_time": "2015-02-03T23:09:38",
"end_time": "2015-02-03T23:10:02",
"pages": 53,
"md5sum": "f82f56816560943564803e005cb71d26",
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
},
"last_ready_run": {
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "complete",
"data_ready": true,
"start_time": "2015-02-03T23:09:38",
"end_time": "2015-02-03T23:10:02",
"pages": 53,
"md5sum": "f82f56816560943564803e005cb71d26",
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
},
"run_list": [
{
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "complete",
"data_ready": true,
"start_time": "2015-02-03T23:09:38",
"end_time": "2015-02-03T23:10:02",
"pages": 53,
"md5sum": "f82f56816560943564803e005cb71d26",
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
}
]
}
This will return the project object for a specific project.
HTTP Request
GET https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
offset (Optional) | Specifies the offset from which to start the run_list. E.g. in order to get most recent runs 21-40, specify an offset of 20. Defaults to 0. |
include_options (Optional) | Includes the “options_json” key in the result returned. For performance reasons, we exclude this key by default. |
Response
If successful, returns the project identified by {PROJECT_TOKEN}. The project will have an additional run_list attribute which has a list of the most recent 20 runs, starting at the offset-th most recent. The run_list has no order guarantees; you must sort it yourself if you’d like to have it sorted by some attribute.
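Because run_list carries no ordering guarantee, a client that wants the newest run first has to sort the list itself. A minimal sketch, assuming every start_time uses the same timestamp format as the samples on this page (in which case the strings sort chronologically); the function name is hypothetical:

```python
def newest_first(run_list):
    """Return the runs sorted by start_time, most recent first."""
    # Timestamps in the same ISO-like format sort correctly as plain strings.
    return sorted(run_list, key=lambda run: run["start_time"], reverse=True)


runs = [
    {"run_token": "a", "start_time": "2015-02-03T23:09:38"},
    {"run_token": "b", "start_time": "2015-02-05T08:00:00"},
    {"run_token": "c", "start_time": "2015-02-04T12:30:00"},
]
print([run["run_token"] for run in newest_first(runs)])  # ['b', 'c', 'a']
```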
Run a project
curl "https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run" -X POST \
-d api_key={YOUR_API_KEY} \
-d start_url=http%3A%2F%2Fwww.example.com \
-d start_template=main_template \
-d start_value_override=%7B%22query%22%3A+%22San+Francisco%22%7D \
-d send_email=1
import requests
params = {
"api_key": "{YOUR_API_KEY}",
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value_override": "{\"query\": \"San Francisco\"}",
"send_email": "1"
}
r = requests.post("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run", data=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}",
:start_url => "http://www.example.com",
:start_template => "main_template",
:start_value_override => "{\"query\": \"San Francisco\"}",
:send_email => "1"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run')
# Send the parameters in the form-encoded body and print the response body.
puts Net::HTTP.post_form(url, params).body
<?php
$params = array(
"api_key" => "{YOUR_API_KEY}",
"start_url" => "http://www.example.com",
"start_template" => "main_template",
"start_value_override" => "{\"query\": \"San Francisco\"}",
"send_email" => "1"
);
$options = array(
'http' => array(
'method' => 'POST',
'header' => 'Content-Type: application/x-www-form-urlencoded; charset=utf-8',
'content' => http_build_query($params)
)
);
$context = stream_context_create($options);
$result = file_get_contents('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run', false, $context);
echo($result);
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run',
method: 'POST',
form: {
api_key: "{YOUR_API_KEY}",
start_url: "http://www.example.com",
start_template: "main_template",
start_value_override: "{\"query\": \"San Francisco\"}",
send_email: "1"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
params.Add("start_url", "http://www.example.com")
params.Add("start_template", "main_template")
params.Add("start_value_override", "{\"query\": \"San Francisco\"}")
params.Add("send_email", "1")
resp, _ := http.PostForm(Url.String(), params)
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))
}
{
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "initialized",
"data_ready": false,
"start_time": "2015-02-03T23:09:38",
"end_time": null,
"pages": 0,
"md5sum": null,
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
}
This will start running an instance of the project on the ParseHub cloud. It will create a new run object. This method will return immediately, while the run continues in the background. You can use webhooks or polling to figure out when the data for this run is ready in order to retrieve it.
HTTP Request
POST https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
start_url (Optional) | The URL to start running on. Defaults to the project’s start_site. |
start_template (Optional) | The template to start running with. Defaults to the project’s start_template (inside the options_json). |
start_value_override (Optional) | The starting global scope for this run. This can be used to pass parameters to your run. For example, you can pass {"query": "San Francisco"} to use the query somewhere in your run. Defaults to the project’s start_value. |
send_email (Optional) | If set to anything other than 0, send an email when the run either completes successfully or fails due to an error. Defaults to 0. |
Response
If successful, returns the run object that was created.
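Since start_value_override is a JSON value passed inside a form-encoded body, it has to be JSON-stringified first and then URL-encoded. The sketch below shows one way to build such a body with the Python standard library; `build_run_body` is a hypothetical helper and `abc123` is a placeholder key.

```python
import json
from urllib.parse import urlencode


def build_run_body(api_key, start_value=None, **extra):
    """Build the form-encoded POST body for starting a run.

    start_value is a Python object that is JSON-stringified into
    start_value_override; extra holds other optional parameters
    such as start_url or send_email.
    """
    fields = {"api_key": api_key, **extra}
    if start_value is not None:
        fields["start_value_override"] = json.dumps(start_value)
    return urlencode(fields)


print(build_run_body("abc123", {"query": "San Francisco"}))
# api_key=abc123&start_value_override=%7B%22query%22%3A+%22San+Francisco%22%7D
```

The encoded start_value_override matches the value used in the curl example above. Libraries such as requests perform the URL-encoding step themselves when given a dict via `data=`, so only the `json.dumps` step is needed there.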
List all projects
curl -X GET "https://www.parsehub.com/api/v2/projects?api_key={YOUR_API_KEY}&offset=0&limit=20&include_options=1"
import requests
params = {
"api_key": "{YOUR_API_KEY}",
"offset": "0",
"limit": "20",
"include_options": "1"
}
r = requests.get('https://www.parsehub.com/api/v2/projects', params=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}",
:offset => "0",
:limit => "20",
:include_options => "1"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects')
url.query = URI.encode_www_form(params)
puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
"api_key" => "{YOUR_API_KEY}",
"offset" => "0",
"limit" => "20",
"include_options" => "1"
));
$result = file_get_contents(
'https://www.parsehub.com/api/v2/projects?'.$params,
false,
stream_context_create(array(
'http' => array(
'method' => 'GET'
)
))
);
echo($result);
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/projects',
method: 'GET',
qs: {
api_key: "{YOUR_API_KEY}",
offset: "0",
limit: "20",
include_options: "1"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
params.Add("offset", "0")
params.Add("limit", "20")
params.Add("include_options", "1")
Url.RawQuery = params.Encode()
resp, _ := http.Get(Url.String())
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))
}
{
"projects": [
{
"token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"title": "My Project",
"templates_json": "<LARGE_JSON_STRING_HERE>",
"main_template": "main_template",
"main_site": "http://www.example.com",
"options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}",
"last_run": {
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "complete",
"data_ready": true,
"start_time": "2015-02-03T23:09:38",
"end_time": "2015-02-03T23:10:02",
"pages": 53,
"md5sum": "f82f56816560943564803e005cb71d26",
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
},
"last_ready_run": {
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "complete",
"data_ready": true,
"start_time": "2015-02-03T23:09:38",
"end_time": "2015-02-03T23:10:02",
"pages": 53,
"md5sum": "f82f56816560943564803e005cb71d26",
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
}
}
]
}
This gets a list of projects in your account.
HTTP Request
GET https://www.parsehub.com/api/v2/projects
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
offset (Optional) | Specifies the offset from which to start the projects. E.g. in order to get projects 21-40, specify an offset of 20. Defaults to 0. |
limit (Optional) | Specifies how many entries will be returned in projects. Accepts values between 1 and 20 inclusively. Defaults to 20. |
include_options (Optional) | Adds options_json, main_template, main_site and webhook to the entries of projects. Set this parameter to 1 if you intend to use them in ParseHub API calls. This parameter requires use of the offset and limit parameters to access the full list of projects. |
Response
If successful, returns an object with the following properties:
Property | Description |
---|---|
projects | A list of the projects in your account. |
total_projects | The total number of projects in your account. |
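With at most 20 projects per response, listing a larger account means walking through the pages using offset and limit until total_projects has been reached. A sketch of that loop, where `fetch_page` is a hypothetical callable `fetch_page(offset, limit)` that performs GET /api/v2/projects and returns the decoded response:

```python
def iter_projects(fetch_page, page_size=20):
    """Yield every project in the account, one page at a time.

    fetch_page is a hypothetical callable fetch_page(offset, limit) that
    returns the decoded response of GET /api/v2/projects.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        projects = page["projects"]
        yield from projects
        offset += len(projects)
        # Stop on an empty page or once all projects have been seen.
        if not projects or offset >= page["total_projects"]:
            break
```

The empty-page check is a safeguard so the loop still terminates if projects are deleted while it is running.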
Get a run
curl -X GET "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?api_key={YOUR_API_KEY}"
import requests
params = {
"api_key": "{YOUR_API_KEY}"
}
r = requests.get('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}', params=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}')
url.query = URI.encode_www_form(params)
puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
"api_key" => "{YOUR_API_KEY}"
));
$result = file_get_contents(
'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?'.$params,
false,
stream_context_create(array(
'http' => array(
'method' => 'GET'
)
))
);
echo($result);
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}',
method: 'GET',
qs: {
api_key: "{YOUR_API_KEY}"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
Url.RawQuery = params.Encode()
resp, _ := http.Get(Url.String())
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))
}
{
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "initialized",
"data_ready": false,
"start_time": "2015-02-03T23:09:38",
"end_time": null,
"pages": 0,
"md5sum": null,
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
}
This returns the run object for a given run token. You can call this method repeatedly to poll for when a run is done, though we recommend using a webhook instead. This method is rate-limited. For each run, you may make at most 25 calls during the first 5 minutes after the run started, and at most one call every 3 minutes after that.
HTTP Request
GET https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
Response
If successful, returns the run identified by {RUN_TOKEN}.
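If you do poll, the per-run limit described above (at most 25 calls in the first 5 minutes, then at most one call every 3 minutes) suggests a two-phase schedule. The sketch below spreads the early calls evenly, 12 seconds apart, and then falls back to 180 seconds. How the limit is enforced at the boundaries is not specified here, so treat the exact numbers as an assumption; the function names are hypothetical.

```python
def next_poll_delay(seconds_since_start):
    """Seconds to wait before the next status check for a run."""
    if seconds_since_start < 300:
        return 12.0   # 300 seconds / 25 calls
    return 180.0      # one call every 3 minutes afterwards


def poll_offsets(max_wait_seconds):
    """Yield the times (in seconds after the run started) at which to poll."""
    elapsed = 0.0
    while True:
        elapsed += next_poll_delay(elapsed)
        if elapsed > max_wait_seconds:
            return
        yield elapsed


offsets = list(poll_offsets(900))
print(offsets[:3], offsets[-3:])  # [12.0, 24.0, 36.0] [480.0, 660.0, 840.0]
```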
Get data for a run
curl -X GET "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data?api_key={YOUR_API_KEY}&format=csv" | gunzip
import requests
params = {
"api_key": "{YOUR_API_KEY}",
"format": "csv"
}
r = requests.get('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data', params=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}",
:format => "csv"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data')
url.query = URI.encode_www_form(params)
puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
"api_key" => "{YOUR_API_KEY}",
"format" => "csv"
));
$result = file_get_contents(
'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data?'.$params,
false,
stream_context_create(array(
'http' => array(
'method' => 'GET'
)
))
);
echo(gzdecode($result)); // the response body is gzip-encoded
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data',
method: 'GET',
gzip: true,
qs: {
api_key: "{YOUR_API_KEY}",
format: "csv"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
params.Add("format", "csv")
Url.RawQuery = params.Encode()
resp, _ := http.Get(Url.String())
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))
}
"questions_title", "questions_text"
"This is a title", "This is some text"
"This is another title", "This is another text"
This returns the data that was extracted by a run.
HTTP Request
GET https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
format (Optional) | The format that you would like to get the data in. Possible values are csv or json. Defaults to json. |
Response
If successful, returns the data in either csv or json format, depending on the format parameter.
Note: The Content-Encoding of this response is always gzip.
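Many HTTP clients (the Python requests library among them) decode gzip responses transparently. For a client that hands back the raw bytes, the body has to be decompressed before it is parsed. The following sketch handles the csv format and assumes the decompressed text is UTF-8; `decode_run_data` is a hypothetical helper name.

```python
import csv
import gzip
import io


def decode_run_data(raw_body):
    """Gunzip a raw response body and parse it as CSV rows."""
    # Assumption: the decompressed payload is UTF-8 text.
    text = gzip.decompress(raw_body).decode("utf-8")
    # skipinitialspace handles the space after each comma in the sample output.
    return list(csv.reader(io.StringIO(text), skipinitialspace=True))


sample = (
    '"questions_title", "questions_text"\n'
    '"This is a title", "This is some text"\n'
)
rows = decode_run_data(gzip.compress(sample.encode("utf-8")))
print(rows[0])  # ['questions_title', 'questions_text']
```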
Get last ready data
curl -X GET "https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data?api_key={YOUR_API_KEY}&format=csv" | gunzip
import requests
params = {
"api_key": "{YOUR_API_KEY}",
"format": "csv"
}
r = requests.get('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data', params=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}",
:format => "csv"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data')
url.query = URI.encode_www_form(params)
puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
"api_key" => "{YOUR_API_KEY}",
"format" => "csv"
));
$result = file_get_contents(
'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data?'.$params,
false,
stream_context_create(array(
'http' => array(
'method' => 'GET'
)
))
);
echo(gzdecode($result)); // the response body is gzip-encoded
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data',
method: 'GET',
gzip: true,
qs: {
api_key: "{YOUR_API_KEY}",
format: "csv"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
params.Add("format", "csv")
Url.RawQuery = params.Encode()
resp, _ := http.Get(Url.String())
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))
}
"questions_title", "questions_text"
"This is a title", "This is some text"
"This is another title", "This is another text"
This returns the data for the most recent ready run for a project. You can use this method in order to have a synchronous interface to your project.
HTTP Request
GET https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
format (Optional) | The format that you would like to get the data in. Possible values are csv or json. Defaults to json. |
Response
If successful, returns the data in either csv or json format, depending on the format parameter.
Note: The Content-Encoding of this response is always gzip.
Cancel a run
curl "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel" -X POST \
-d api_key={YOUR_API_KEY}
import requests
params = {
"api_key": "{YOUR_API_KEY}"
}
r = requests.post("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel", data=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel')
# Send the parameters in the form-encoded body and print the response body.
puts Net::HTTP.post_form(url, params).body
<?php
$params = array(
"api_key" => "{YOUR_API_KEY}"
);
$options = array(
'http' => array(
'method' => 'POST',
'header' => 'Content-Type: application/x-www-form-urlencoded; charset=utf-8',
'content' => http_build_query($params)
)
);
$context = stream_context_create($options);
$result = file_get_contents('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel', false, $context);
echo($result);
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel',
method: 'POST',
form: {
api_key: "{YOUR_API_KEY}"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
resp, _ := http.PostForm(Url.String(), params)
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))
}
{
"project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
"status": "cancelled",
"data_ready": false,
"start_time": "2015-02-03T23:09:38",
"end_time": null,
"pages": 52,
"md5sum": null,
"start_url": "http://www.example.com",
"start_template": "main_template",
"start_value": "{\"query\": \"San Francisco\"}"
}
This cancels a run and changes its status to cancelled. Any data that was extracted so far will be available.
HTTP Request
POST https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
Response
If successful, returns the run identified by {RUN_TOKEN}.
Delete a run
curl -X DELETE "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?api_key={YOUR_API_KEY}"
import requests
params = {
"api_key": "{YOUR_API_KEY}"
}
r = requests.delete('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}', params=params)
print(r.text)
require 'net/http'
params = {
:api_key => "{YOUR_API_KEY}"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}')
url.query = URI.encode_www_form(params)
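# Net::HTTP has no class-level delete helper, so use a connection object.
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
puts http.delete(url.request_uri).body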
<?php
$params = http_build_query(array(
"api_key" => "{YOUR_API_KEY}"
));
$result = file_get_contents(
'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?'.$params,
false,
stream_context_create(array(
'http' => array(
'method' => 'DELETE'
)
))
);
echo($result);
?>
var request = require('request');
request({
uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}',
method: 'DELETE',
qs: {
api_key: "{YOUR_API_KEY}"
}
}, function(err, resp, body) {
console.log(body);
});
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func main() {
var Url * url.URL
Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}")
params := url.Values{}
params.Add("api_key", "{YOUR_API_KEY}")
Url.RawQuery = params.Encode()
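// net/http has no Delete helper, so build the request explicitly.
req, _ := http.NewRequest(http.MethodDelete, Url.String(), nil)
resp, _ := http.DefaultClient.Do(req)
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body))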
}
{
"run_token": "tCcB4hfFP6wvBRe2gwZv9aJp"
}
This cancels a run if running, and deletes the run and its data.
HTTP Request
DELETE https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}
Parameters
Parameter | Description |
---|---|
api_key | The API key for your account. |
Response
If successful, returns an object with the following property:
Property | Description |
---|---|
run_token | The run_token of the run that was deleted. |
Webhooks
ParseHub has webhooks which you can use to get notified about the status of a project’s runs. You can use webhooks instead of having to write logic for polling the status of a run.
You can set up a webhook for a project in the ‘Settings’ tab of the project in the ParseHub client. This should be a valid URL that is visible from the internet.
ParseHub will send a POST request to that URL whenever any of the project’s runs’ status or data_ready fields change. The POST body will be the run object.
If the status of a run is error, ParseHub may automatically retry the run if it thinks there’s a good chance that the run will succeed the second time. In this case, there will be an additional new_run field with the metadata (run token, etc.) for the restarted run.
We will retry every request once per hour up to 3 times or until we get an HTTP 200 response.
Note that this is a traditional POST with valid JSON data encoded as application/x-www-form-urlencoded.
You can test your endpoint with: curl -X POST [webhook url] -H "Content-Type: application/x-www-form-urlencoded" -d '{"some": "json"}'
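On the receiving side, the exact layout of the POST body is not spelled out here beyond "JSON data" sent with a form-urlencoded content type. A defensive handler might therefore accept both shapes: a raw JSON document (as in the curl test above) or ordinary form fields. The sketch below is an assumption-laden illustration rather than a description of what ParseHub actually sends, and `parse_webhook_body` is a hypothetical helper.

```python
import json
from urllib.parse import parse_qs


def parse_webhook_body(raw_body):
    """Best-effort decode of a webhook POST body into a dict."""
    text = raw_body.decode("utf-8")
    try:
        # Case 1: the body is a JSON document, as in the curl test.
        payload = json.loads(text)
        if isinstance(payload, dict):
            return payload
    except ValueError:
        pass
    # Case 2 (assumed): ordinary form fields; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(text).items()}


print(parse_webhook_body(b'{"some": "json"}'))  # {'some': 'json'}
```

Values parsed from form fields are always strings, so a field such as data_ready would still need to be converted by the caller.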