API Reference

Introduction

Welcome to the ParseHub API documentation. ParseHub’s API enables you to programmatically manage and run your projects and retrieve extracted data.

The ParseHub API is designed around REST. It aims to have predictable URLs and uses HTTP verbs where possible.

To the right, you can find sample code in a variety of languages. By default, curl is selected so that you can try out the commands in your terminal.

Authentication

Each request must include your API Key for authentication. If you’re logged in, the examples will have your API key filled in.

You can find your API Key on your account page.

Requests

POST requests must have a form-encoded body and the Content-Type: application/x-www-form-urlencoded; charset=utf-8 header.

GET request parameters must be URL-encoded.

All requests must be made over HTTPS. Any plain HTTP request receives an HTTP 302 redirect to the equivalent HTTPS address.

If you are using the curl examples, make sure that any data you replace is properly shell-escaped.
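
As a rough illustration of the encoding rules above, the following Python sketch builds a form-encoded string with the standard library. The parameter values are placeholders borrowed from the "Run a project" example, not real credentials.

```python
import json
from urllib.parse import urlencode

# Placeholder values, mirroring the "Run a project" example.
params = {
    "api_key": "{YOUR_API_KEY}",
    "start_url": "http://www.example.com",
    "start_value_override": json.dumps({"query": "San Francisco"}),
}

# urlencode() produces an application/x-www-form-urlencoded string that can
# serve as a POST body or as a GET query string.
encoded = urlencode(params)
print(encoded)
```

The same string can be appended after `?` for GET requests or sent as the body of a POST request.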

Responses

Unless explicitly mentioned, JSON will be returned in all responses.

Errors

ParseHub returns standard HTTP errors when possible.

Backwards Compatibility

The ParseHub API may be changed in a backwards-compatible way at any time. Backwards compatibility means that new methods, objects, statuses, fields in responses, etc. may be added at any time, but existing ones will never be renamed or removed.

If there are backward incompatible changes that need to be made to our API, we will release a new API version. The previous API version will be maintained for a year after releasing the new version.

Client Libraries

Some developers in the community have built unofficial client libraries for using ParseHub in various development environments. ParseHub makes no guarantees as to their quality.

Python

py-parsehub

PHP

parsehub-php

Node

parsehub

C#

ParseHub-CSharp

Go

parsehub

If you’ve written a client library (with a corporate-friendly license) that you’d like added to this list, please contact us.

Objects

There are two primary types of objects that the ParseHub API operates with.

Project

{
  "token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
  "title": "My Project",
  "templates_json": "<LARGE_JSON_STRING_HERE>",
  "main_template": "main_template",
  "main_site": "http://www.example.com",
  "options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}",
  "last_run": "tRdThBFSU7CWKAQtjAqUzZWy",
  "last_ready_run": "tglTFX7IFQlrAl7W6fl1GRjw"
}

This object represents a project that was created using the ParseHub client. It has the following properties:

Property Description
token A globally unique id representing this project.
title The title given by the user when creating this project.
templates_json The JSON-stringified representation of all the instructions for running this project. This representation is not yet documented, but will eventually allow developers to create plugins for ParseHub.
main_template The name of the template with which ParseHub should start executing the project.
main_site The default URL at which ParseHub should start running the project.
options_json A JSON-stringified object containing several advanced options for the project.
last_run The run object of the most recently started run (ordered by start_time) for the project.
last_ready_run The run object of the most recent ready run (ordered by start_time) for the project. A ready run is one whose data_ready attribute is truthy. The last_run and last_ready_run for a project may be the same.
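
Because options_json is delivered as a string, it has to be decoded separately from the surrounding response. The sketch below works on a trimmed copy of the sample project object; treating startingValue as a nested JSON string and maxWorkers as a numeric string is an assumption based only on that sample.

```python
import json

# Trimmed copy of the sample project object shown above.
project = {
    "token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
    "options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}",
}

# options_json arrives as a string, so it needs its own json.loads() call.
options = json.loads(project["options_json"])

# In the sample, startingValue is itself a JSON string and maxWorkers is a
# string rather than a number, so both are converted explicitly.
starting_value = json.loads(options["startingValue"])
max_workers = int(options["maxWorkers"])

print(options["loadJs"], max_workers, starting_value)
```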

Run

{
  "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4",
  "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp",
  "status": "complete",
  "data_ready": true,
  "start_time": "2015-02-03T23:09:38",
  "end_time": "2015-02-03T23:10:40",
  "pages": 1,
  "md5sum": "f82f56816560943564803e005cb71d26",
  "start_url": "http://www.example.com",
  "start_template": "main_template",
  "start_value": "{}"
}

This object represents an instance of a project that was run at a given time with a given set of parameters. It has the following properties:

Property Description
project_token A globally unique id representing the project that this run belongs to.
run_token A globally unique id representing this run.
status The status of the run. It can be one of initialized, queued, running, cancelled, complete, or error.
data_ready Whether the data for this run is ready to download. If the status is complete, this will always be truthy. If the status is cancelled or error, then this may be truthy or falsy, depending on whether any data is available.
start_time The time that this run was started at, in UTC +0000.
end_time The time that this run was stopped. This field will be null if the run is either initialized or running. Time is in UTC +0000.
pages The number of pages that have been traversed by this run so far.
md5sum The md5sum of the results. This can be used to check if any results data has changed between two runs.
start_url The url that this run was started on.
start_template The template that this run was started with.
start_value The starting value of the global scope for this run.
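
A small sketch of how the time and md5sum fields might be used on the client side. The timestamp format is inferred from the sample values (no fractional seconds, no offset suffix), and the helper names `parse_run_time` and `results_changed` are illustrative only.

```python
from datetime import datetime, timezone


def parse_run_time(value):
    """Parse a timestamp such as '2015-02-03T23:09:38' as UTC."""
    if value is None:  # end_time is null while a run is initialized or running
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def results_changed(previous_run, current_run):
    """Compare md5sum values to see whether the results differ between runs."""
    return previous_run["md5sum"] != current_run["md5sum"]


# Trimmed copy of the sample run object shown above.
run = {
    "status": "complete",
    "start_time": "2015-02-03T23:09:38",
    "end_time": "2015-02-03T23:10:40",
    "md5sum": "f82f56816560943564803e005cb71d26",
}

started = parse_run_time(run["start_time"])
ended = parse_run_time(run["end_time"])
duration_seconds = (ended - started).total_seconds() if ended else None
print(duration_seconds)  # 62.0
```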

Methods

This section describes all of the HTTP endpoints that can be used to manipulate projects and runs.

The typical way to use the ParseHub API is the following:

  1. Start a run

    This is done by one of the following:

    1. Running the project from the ParseHub client,
    2. Setting up a schedule (also in the client) for runs to be started automatically,
    3. Using the POST /api/v2/projects/{PROJECT_TOKEN}/run method below.

  2. Wait for the run to finish

    You can use one of:

    1. Webhooks (recommended)
    2. Polling the GET /api/v2/runs/{RUN_TOKEN} method below. Note there are rate limits on this method.

    For simple use cases, you can skip this step entirely by using the GET /api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data method.

  3. Get the data from the run

    Once the run’s data_ready is truthy, you can:

    1. Download the data from either the ParseHub client or website.
    2. Use the GET /api/v2/runs/{RUN_TOKEN}/data method below.
    3. For simple use cases, the GET /api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data method. This will automatically get the data from the most recent run where data_ready is truthy.
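
The first two steps above can be sketched in Python roughly as follows. The helper names `call_api` and `run_and_wait` are hypothetical, and the fixed 180-second polling interval is a conservative assumption chosen to stay within the rate limits mentioned for the polling method. Webhooks remain the recommended approach.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://www.parsehub.com/api/v2"
TERMINAL_STATUSES = {"complete", "cancelled", "error"}


def call_api(method, path, params):
    """Send one request and return the decoded JSON response."""
    query = urlencode(params)
    if method == "GET":
        request = Request(f"{BASE}{path}?{query}")
    else:
        request = Request(
            f"{BASE}{path}",
            data=query.encode("utf-8"),
            headers={
                "Content-Type": "application/x-www-form-urlencoded; charset=utf-8"
            },
            method=method,
        )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_and_wait(project_token, api_key, call=call_api, sleep=time.sleep,
                 poll_interval=180):
    """Start a run (step 1) and poll until it reaches a final status (step 2)."""
    run = call("POST", f"/projects/{project_token}/run", {"api_key": api_key})
    while run["status"] not in TERMINAL_STATUSES:
        sleep(poll_interval)
        run = call("GET", f"/runs/{run['run_token']}", {"api_key": api_key})
    return run
```

Once the returned run has a truthy data_ready, its run_token can be passed to the data method for step 3. The `call` and `sleep` parameters exist only so the loop can be exercised without network access.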

Get a project

curl -X GET "https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}?api_key={YOUR_API_KEY}&offset=0"
import requests

params = {
  "api_key": "{YOUR_API_KEY}",
  "offset": "0"
}
r = requests.get('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}', params=params)
print(r.text)
require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}",
  :offset => "0"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}')
url.query = URI.encode_www_form(params)

puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
  "api_key" => "{YOUR_API_KEY}",
  "offset" => "0"
));

$result = file_get_contents(
    'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}?'.$params,
    false,
    stream_context_create(array(
        'http' => array(
            'method' => 'GET'
        )
    ))
);
echo($result);
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}',
  method: 'GET',
  qs: {
    api_key: "{YOUR_API_KEY}",
    offset: "0"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")
  params.Add("offset", "0")
  Url.RawQuery = params.Encode()

  resp, _ := http.Get(Url.String())
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))
}
{
  "token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
  "title": "My Project", 
  "templates_json": "<LARGE_JSON_STRING_HERE>", 
  "main_template": "main_template", 
  "main_site": "http://www.example.com", 
  "options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}", 
  "last_run": {
    "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
    "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
    "status": "complete", 
    "data_ready": true, 
    "start_time": "2015-02-03T23:09:38", 
    "end_time": "2015-02-03T23:10:02", 
    "pages": 53, 
    "md5sum": "f82f56816560943564803e005cb71d26", 
    "start_url": "http://www.example.com", 
    "start_template": "main_template", 
    "start_value": "{\"query\": \"San Francisco\"}"
  }, 
  "last_ready_run": {
    "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
    "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
    "status": "complete", 
    "data_ready": true, 
    "start_time": "2015-02-03T23:09:38", 
    "end_time": "2015-02-03T23:10:02", 
    "pages": 53, 
    "md5sum": "f82f56816560943564803e005cb71d26", 
    "start_url": "http://www.example.com", 
    "start_template": "main_template", 
    "start_value": "{\"query\": \"San Francisco\"}"
  }, 
  "run_list": [
    {
      "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
      "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
      "status": "complete", 
      "data_ready": true, 
      "start_time": "2015-02-03T23:09:38", 
      "end_time": "2015-02-03T23:10:02", 
      "pages": 53, 
      "md5sum": "f82f56816560943564803e005cb71d26", 
      "start_url": "http://www.example.com", 
      "start_template": "main_template", 
      "start_value": "{\"query\": \"San Francisco\"}"
    }
  ]
}

This will return the project object for a specific project.

HTTP Request

GET https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}

Parameters

Parameter Description
api_key The API key for your account.
offset (Optional) Specifies the offset from which to start the run_list. E.g. in order to get most recent runs 21-40, specify an offset of 20. Defaults to 0.

Response

If successful, returns the project identified by {PROJECT_TOKEN}. The project will have an additional run_list attribute containing the 20 most recent runs, starting at the position given by offset. The run_list has no order guarantees; you must sort it yourself if you’d like it sorted by some attribute.
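
Since run_list comes back unordered, a client has to sort it. A minimal sketch, assuming the timestamp format seen in the samples; the run tokens below are made up for illustration.

```python
def newest_first(run_list):
    """Return runs sorted by start_time, most recent first."""
    # The timestamps are fixed-width ISO 8601 strings, so plain string
    # comparison orders them chronologically.
    return sorted(run_list, key=lambda run: run["start_time"], reverse=True)


# Hypothetical run_list entries, trimmed to the fields used here.
runs = [
    {"run_token": "a", "start_time": "2015-02-03T23:09:38"},
    {"run_token": "b", "start_time": "2015-02-05T08:00:00"},
    {"run_token": "c", "start_time": "2015-02-04T12:30:00"},
]
print([run["run_token"] for run in newest_first(runs)])  # ['b', 'c', 'a']
```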

Run a project

curl "https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run" -X POST \
  -d api_key={YOUR_API_KEY} \
  -d start_url=http%3A%2F%2Fwww.example.com \
  -d start_template=main_template \
  -d start_value_override=%7B%22query%22%3A+%22San+Francisco%22%7D \
  -d send_email=1
import requests

params = {
  "api_key": "{YOUR_API_KEY}",
  "start_url": "http://www.example.com",
  "start_template": "main_template",
  "start_value_override": "{\"query\": \"San Francisco\"}",
  "send_email": "1"
}
r = requests.post("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run", data=params)

print(r.text)

require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}",
  :start_url => "http://www.example.com",
  :start_template => "main_template",
  :start_value_override => "{\"query\": \"San Francisco\"}",
  :send_email => "1"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run')

# post_form sends params as the form-encoded body and returns a response object.
puts Net::HTTP.post_form(url, params).body
<?php
$params = array(
  "api_key" => "{YOUR_API_KEY}",
  "start_url" => "http://www.example.com",
  "start_template" => "main_template",
  "start_value_override" => "{\"query\": \"San Francisco\"}",
  "send_email" => "1"
);

$options = array(
  'http' => array(
    'method' => 'POST',
    'header' => 'Content-Type: application/x-www-form-urlencoded; charset=utf-8',
    'content' => http_build_query($params)
  )
);

$context = stream_context_create($options);
$result = file_get_contents('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run', false, $context);
echo($result);
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run',
  method: 'POST',
  form: {
    api_key: "{YOUR_API_KEY}",
    start_url: "http://www.example.com",
    start_template: "main_template",
    start_value_override: "{\"query\": \"San Francisco\"}",
    send_email: "1"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")
  params.Add("start_url", "http://www.example.com")
  params.Add("start_template", "main_template")
  params.Add("start_value_override", "{\"query\": \"San Francisco\"}")
  params.Add("send_email", "1")

  resp, _ := http.PostForm(Url.String(), params)
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))
}
{
  "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
  "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
  "status": "initialized", 
  "data_ready": false, 
  "start_time": "2015-02-03T23:09:38", 
  "end_time": null, 
  "pages": 0, 
  "md5sum": null, 
  "start_url": "http://www.example.com", 
  "start_template": "main_template", 
  "start_value": "{\"query\": \"San Francisco\"}"
}

This will start running an instance of the project on the ParseHub cloud. It will create a new run object. This method will return immediately, while the run continues in the background. You can use webhooks or polling to figure out when the data for this run is ready in order to retrieve it.

HTTP Request

POST https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/run

Parameters

Parameter Description
api_key The API key for your account.
start_url (Optional) The url to start running on. Defaults to the project’s start_site.
start_template (Optional) The template to start running with. Defaults to the project’s start_template (inside the options_json).
start_value_override (Optional) The starting global scope for this run. This can be used to pass parameters to your run. For example, you can pass {"query": "San Francisco"} to use the query somewhere in your run. Defaults to the project’s start_value.
send_email (Optional) If set to anything other than 0, send an email when the run either completes successfully or fails due to an error. Defaults to 0.

Response

If successful, returns the run object that was created.

List all projects

curl -X GET "https://www.parsehub.com/api/v2/projects?api_key={YOUR_API_KEY}&offset=0&limit=20&include_options=1"
import requests

params = {
  "api_key": "{YOUR_API_KEY}",
  "offset": "0",
  "limit": "20",
  "include_options": "1"
}
r = requests.get('https://www.parsehub.com/api/v2/projects', params=params)
print(r.text)
require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}",
  :offset => "0",
  :limit => "20",
  :include_options => "1"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects')
url.query = URI.encode_www_form(params)

puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
  "api_key" => "{YOUR_API_KEY}",
  "offset" => "0",
  "limit" => "20",
  "include_options" => "1"
));

$result = file_get_contents(
    'https://www.parsehub.com/api/v2/projects?'.$params,
    false,
    stream_context_create(array(
        'http' => array(
            'method' => 'GET'
        )
    ))
);
echo($result);
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/projects',
  method: 'GET',
  qs: {
    api_key: "{YOUR_API_KEY}",
    offset: "0",
    limit: "20",
    include_options: "1"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")
  params.Add("offset", "0")
  params.Add("limit", "20")
  params.Add("include_options", "1")
  Url.RawQuery = params.Encode()

  resp, _ := http.Get(Url.String())
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))
}
{
  "projects": [
    {
      "token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
      "title": "My Project", 
      "templates_json": "<LARGE_JSON_STRING_HERE>", 
      "main_template": "main_template", 
      "main_site": "http://www.example.com", 
      "options_json": "{\"loadJs\":true,\"rotateIPs\":false,\"maxWorkers\":\"1\",\"startingValue\":\"{}\"}", 
      "last_run": {
        "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
        "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
        "status": "complete", 
        "data_ready": true, 
        "start_time": "2015-02-03T23:09:38", 
        "end_time": "2015-02-03T23:10:02", 
        "pages": 53, 
        "md5sum": "f82f56816560943564803e005cb71d26", 
        "start_url": "http://www.example.com", 
        "start_template": "main_template", 
        "start_value": "{\"query\": \"San Francisco\"}"
      }, 
      "last_ready_run": {
        "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
        "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
        "status": "complete", 
        "data_ready": true, 
        "start_time": "2015-02-03T23:09:38", 
        "end_time": "2015-02-03T23:10:02", 
        "pages": 53, 
        "md5sum": "f82f56816560943564803e005cb71d26", 
        "start_url": "http://www.example.com", 
        "start_template": "main_template", 
        "start_value": "{\"query\": \"San Francisco\"}"
      }
    }
  ]
}

This gets a list of projects in your account.

HTTP Request

GET https://www.parsehub.com/api/v2/projects

Parameters

Parameter Description
api_key The API key for your account.
offset (Optional) Specifies the offset from which to start the projects. E.g. in order to get projects 21-40, specify an offset of 20. Defaults to 0.
limit (Optional) Specifies how many entries will be returned in projects. Accepts values between 1 and 20, inclusive. Defaults to 20.
include_options (Optional) Adds options_json, main_template, main_site and webhook to the entries of projects. Set this parameter to 1 if you intend to use them in ParseHub API calls. This parameter requires use of the offset and limit parameters to access the full list of projects.

Response

If successful, returns an object with the following properties:

Property Description
projects A list of the projects in your account.
total_projects The total number of projects in your account.
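
Walking the full project list with offset and limit might look like the sketch below. `fetch_page` is a hypothetical callable that wraps GET /api/v2/projects and returns the decoded response; here it is replaced by an in-memory stand-in so the loop can run without network access.

```python
def iter_projects(fetch_page, limit=20):
    """Yield every project by requesting the list in pages of `limit`."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=limit)
        projects = page["projects"]
        yield from projects
        offset += len(projects)
        # Stop when a page comes back empty or every project has been seen.
        if not projects or offset >= page["total_projects"]:
            break


# In-memory stand-in for the API, holding 45 made-up projects.
_all_projects = [{"token": f"t{i}"} for i in range(45)]


def fake_fetch_page(offset, limit):
    return {
        "projects": _all_projects[offset:offset + limit],
        "total_projects": len(_all_projects),
    }


print(sum(1 for _ in iter_projects(fake_fetch_page)))  # 45
```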

Get a run

curl -X GET "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?api_key={YOUR_API_KEY}"
import requests

params = {
  "api_key": "{YOUR_API_KEY}"
}
r = requests.get('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}', params=params)
print(r.text)
require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}')
url.query = URI.encode_www_form(params)

puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
  "api_key" => "{YOUR_API_KEY}"
));

$result = file_get_contents(
    'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?'.$params,
    false,
    stream_context_create(array(
        'http' => array(
            'method' => 'GET'
        )
    ))
);
echo($result);
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}',
  method: 'GET',
  qs: {
    api_key: "{YOUR_API_KEY}"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")
  Url.RawQuery = params.Encode()

  resp, _ := http.Get(Url.String())
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))
}
{
  "status": "initialized", 
  "start_time": "2015-02-03T23:09:38", 
  "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
  "start_template": "main_template", 
  "pages": 0, 
  "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
  "data_ready": false, 
  "md5sum": null, 
  "end_time": null, 
  "start_url": "http://www.example.com", 
  "start_value": "{\"query\": \"San Francisco\"}"
}

This returns the run object for a given run token. You can call this method repeatedly to poll for when a run is done, though we recommend using a webhook instead. This method is rate-limited. For each run, you may make at most 25 calls during the first 5 minutes after the run started, and at most one call every 3 minutes after that.
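
One way to stay inside these limits is to derive the wait time from how long the run has been going. The sketch below is an assumption about a reasonable schedule, not an official one: 25 calls in 5 minutes allows one call every 12 seconds, and 15 seconds is used to leave some headroom. The names `poll_delay` and `simulate_poll_times` are illustrative.

```python
def poll_delay(seconds_since_start):
    """Seconds to wait before the next call for a given run."""
    if seconds_since_start < 5 * 60:
        # 300 s / 25 calls = 12 s minimum spacing; 15 s leaves a margin.
        return 15
    # After the first 5 minutes: at most one call every 3 minutes.
    return 3 * 60


def simulate_poll_times(total_seconds):
    """List the times (in seconds since run start) at which polls would happen."""
    now, times = 0, []
    while now < total_seconds:
        times.append(now)
        now += poll_delay(now)
    return times


times = simulate_poll_times(15 * 60)
print(len([t for t in times if t < 300]))  # 20 calls in the first 5 minutes
print([t for t in times if t >= 300])      # [300, 480, 660, 840]
```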

HTTP Request

GET https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}

Parameters

Parameter Description
api_key The API key for your account.

Response

If successful, returns the run identified by {RUN_TOKEN}.

Get data for a run

curl -X GET "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data?api_key={YOUR_API_KEY}&format=csv" | gunzip
import requests

params = {
  "api_key": "{YOUR_API_KEY}",
  "format": "csv"
}
r = requests.get('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data', params=params)
print(r.text)
require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}",
  :format => "csv"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data')
url.query = URI.encode_www_form(params)

puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
  "api_key" => "{YOUR_API_KEY}",
  "format" => "csv"
));

$result = file_get_contents(
    'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data?'.$params,
    false,
    stream_context_create(array(
        'http' => array(
            'method' => 'GET'
        )
    ))
);
// The response body is gzip-encoded; file_get_contents() does not decode it.
echo(gzdecode($result));
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data',
  method: 'GET',
  gzip: true, // decode the gzip-encoded response body
  qs: {
    api_key: "{YOUR_API_KEY}",
    format: "csv"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")
  params.Add("format", "csv")
  Url.RawQuery = params.Encode()

  resp, _ := http.Get(Url.String())
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))
}
"questions_title", "questions_text"

"This is a title", "This is some text"

"This is another title", "This is another text"

This returns the data that was extracted by a run.

HTTP Request

GET https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/data

Parameters

Parameter Description
api_key The API key for your account.
format (Optional) The format that you would like to get the data in. Possible values csv or json. Defaults to json.

Response

If successful, returns the data in either csv or json format, depending on the format parameter.

Note: The Content-Encoding of this response is always gzip.
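
Some HTTP clients decode gzip automatically and others hand back the raw bytes. For the latter case, a minimal sketch of manual decoding follows. The function name `decode_data_body` is hypothetical, and UTF-8 is assumed as the text encoding.

```python
import gzip


def decode_data_body(raw_bytes, content_encoding):
    """Return the response body as text, gunzipping it when needed."""
    if (content_encoding or "").lower() == "gzip":
        raw_bytes = gzip.decompress(raw_bytes)
    return raw_bytes.decode("utf-8")


# Stand-in for a response body; a real one would come from the HTTP client
# together with the value of its Content-Encoding header.
sample = gzip.compress(b'"questions_title","questions_text"\n')
print(decode_data_body(sample, "gzip"))
```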

Get last ready data

curl -X GET "https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data?api_key={YOUR_API_KEY}&format=csv" | gunzip
import requests

params = {
  "api_key": "{YOUR_API_KEY}",
  "format": "csv"
}
r = requests.get('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data', params=params)
print(r.text)
require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}",
  :format => "csv"
}
url = URI.parse('https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data')
url.query = URI.encode_www_form(params)

puts Net::HTTP.get(url)
<?php
$params = http_build_query(array(
  "api_key" => "{YOUR_API_KEY}",
  "format" => "csv"
));

$result = file_get_contents(
    'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data?'.$params,
    false,
    stream_context_create(array(
        'http' => array(
            'method' => 'GET'
        )
    ))
);
// The response body is gzip-encoded; file_get_contents() does not decode it.
echo(gzdecode($result));
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data',
  method: 'GET',
  gzip: true, // decode the gzip-encoded response body
  qs: {
    api_key: "{YOUR_API_KEY}",
    format: "csv"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")
  params.Add("format", "csv")
  Url.RawQuery = params.Encode()

  resp, _ := http.Get(Url.String())
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))
}
"questions_title", "questions_text"

"This is a title", "This is some text"

"This is another title", "This is another text"

This returns the data for the most recent ready run for a project. You can use this method in order to have a synchronous interface to your project.

HTTP Request

GET https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data

Parameters

Parameter Description
api_key The API key for your account.
format (Optional) The format that you would like to get the data in. Possible values csv or json. Defaults to json.

Response

If successful, returns the data in either csv or json format, depending on the format parameter.

Note: The Content-Encoding of this response is always gzip.

Cancel a run

curl "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel" -X POST \
  -d api_key={YOUR_API_KEY}
import requests

params = {
  "api_key": "{YOUR_API_KEY}"
}
r = requests.post("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel", data=params)

print(r.text)

require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel')

# post_form sends params as the form-encoded body and returns a response object.
puts Net::HTTP.post_form(url, params).body
<?php
$params = array(
  "api_key" => "{YOUR_API_KEY}"
);

$options = array(
  'http' => array(
    'method' => 'POST',
    'header' => 'Content-Type: application/x-www-form-urlencoded; charset=utf-8',
    'content' => http_build_query($params)
  )
);

$context = stream_context_create($options);
$result = file_get_contents('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel', false, $context);
echo($result);
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel',
  method: 'POST',
  form: {
    api_key: "{YOUR_API_KEY}"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")

  resp, _ := http.PostForm(Url.String(), params)
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))
}
{
  "status": "cancelled", 
  "start_time": "2015-02-03T23:09:38", 
  "project_token": "t-0WMEZ-Bc9sWGHAMsYvP7y4", 
  "start_template": "main_template", 
  "pages": 52, 
  "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp", 
  "data_ready": false, 
  "md5sum": null, 
  "end_time": null, 
  "start_url": "http://www.example.com", 
  "start_value": "{\"query\": \"San Francisco\"}"
}

This cancels a run and changes its status to cancelled. Any data that was extracted so far will be available.

HTTP Request

POST https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}/cancel

Parameters

Parameter Description
api_key The API key for your account.

Response

If successful, returns the run identified by {RUN_TOKEN}.

Delete a run

curl -X DELETE "https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?api_key={YOUR_API_KEY}"
import requests

params = {
  "api_key": "{YOUR_API_KEY}"
}
r = requests.delete('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}', params=params)
print(r.text)
require 'net/http'

params = {
  :api_key => "{YOUR_API_KEY}"
}
url = URI.parse('https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}')
url.query = URI.encode_www_form(params)

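# Net::HTTP has no class-level delete helper, so open a session explicitly.
response = Net::HTTP.start(url.host, url.port, :use_ssl => true) do |http|
  http.delete(url.request_uri)
end
puts response.body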
<?php
$params = http_build_query(array(
  "api_key" => "{YOUR_API_KEY}"
));

$result = file_get_contents(
    'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}?'.$params,
    false,
    stream_context_create(array(
        'http' => array(
            'method' => 'DELETE'
        )
    ))
);
echo($result);
?>
var request = require('request');

request({
  uri: 'https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}',
  method: 'DELETE',
  qs: {
    api_key: "{YOUR_API_KEY}"
  }
}, function(err, resp, body) {
  console.log(body);
});
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
  "net/url"
)

func main() {
  var Url * url.URL
  Url, _ = url.Parse("https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}")

  params := url.Values{}
  params.Add("api_key", "{YOUR_API_KEY}")
  Url.RawQuery = params.Encode()

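  // net/http has no Delete helper, so build the request explicitly.
  req, _ := http.NewRequest("DELETE", Url.String(), nil)
  resp, _ := http.DefaultClient.Do(req)
  defer resp.Body.Close()
  body, _ := ioutil.ReadAll(resp.Body)
  fmt.Print(string(body))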
}
{
  "run_token": "tCcB4hfFP6wvBRe2gwZv9aJp"
}

This cancels a run if running, and deletes the run and its data.

HTTP Request

DELETE https://www.parsehub.com/api/v2/runs/{RUN_TOKEN}

Parameters

Parameter Description
api_key The API key for your account.

Response

If successful, returns an object with the following properties:

Property Description
run_token The run_token of the run that was deleted.

Webhooks

ParseHub has webhooks which you can use to get notified about the status of a project’s runs. You can use webhooks instead of having to write logic for polling the status of a run.

You can set up a webhook for a project in the ‘Settings’ tab of the project in the ParseHub client. The webhook should be a valid URL that is reachable from the internet.

ParseHub will send a POST request to that url whenever any of the project’s runs’ status or data_ready fields change. The POST body will be the run object.

If the status of a run is error, ParseHub may automatically retry the run if it thinks there’s a good chance that the run will succeed the second time. In this case, there will be an additional new_run field with the metadata (run token, etc.) for the restarted run.

If a webhook request does not receive an HTTP 200 response, we will retry it once per hour, up to 3 times.

Note that this is a traditional POST with valid JSON data encoded as application/x-www-form-urlencoded. You can test your endpoint with:

curl -X POST [webhook url] -H "Content-Type: application/x-www-form-urlencoded" -d '{"some": "json"}'
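
A receiving endpoint could be sketched with Python's standard library as shown below. The exact layout of the webhook body is not spelled out here, so `parse_webhook_body` is a hypothetical helper that accepts either a raw JSON document (as in the curl test above) or ordinary form-encoded key/value pairs. The port number and handler name are likewise placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def parse_webhook_body(body):
    """Decode a webhook POST body into a dict of run fields."""
    text = body.decode("utf-8")
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except ValueError:
        pass
    # Fall back to form decoding, keeping the first value of each key.
    return {key: values[0] for key, values in parse_qs(text).items()}


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        run = parse_webhook_body(self.rfile.read(length))
        print(run.get("run_token"), run.get("status"), run.get("data_ready"))
        # Reply with 200 so the delivery is not retried.
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Block forever, handling webhook deliveries on the given port."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. Values decoded from form fields arrive as strings, so a field such as data_ready may need converting before use.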