
API

Collections API

The Library offers unmediated access to our original materials collections (photographs, manuscripts, pictorial material and more) through its search API, powered by Funnelback, the information retrieval system used to search our Manuscripts, Oral History and Pictures Catalogue.

Collections can be accessed via a REST-style API service and a JSON-style API service.

For more information, see Funnelback's API documentation; specifically, the REST-style API documentation and the JSON-style API documentation.

WW1 Diary and Letter Transcripts API

Approximately 11,000 volumes of letters and diaries of WW1 Australian soldiers have been collected by the Library. For the WWI Centenary Commemoration we have committed to making them accessible to all through digitisation and transcription.

All the materials have been digitised and approximately half of them have been transcribed. To foster the use, reuse and research of the diaries, we have released this API from our transcription tool.

The transcriptions of our diaries are an ongoing effort by members of the public and volunteers, so the data is constantly being updated. If you would like to work with a complete set of transcribed diaries, you will need to regularly harvest the collection items available through the API to keep your dataset up to date.

We currently have two APIs available for access to our transcribed World War I diaries and letters collection. Below is the documentation for the most up-to-date API, most recently reviewed on 2 July 2015. This API supersedes the original API; however, documentation for the original is still available should you need it.

Data Model

There are three entities: Transcript Document, Transcript Page and Transcript Image. Transcript Documents are generally diaries, though they can also be letters and other manuscripts. Transcript Pages represent the pages within a diary or manuscript; there is a one-to-many relationship between Transcript Document and Transcript Page, as diaries and letters usually contain multiple pages. Transcript Image entities represent a scanned image of a given page; there is a one-to-one relationship between Transcript Page and Transcript Image, as each page has a single image.

Transcript Document

type (string) - Always "transcipt_document" (the misspelling is intentional)
nid (integer) - The unique id for the document
collection (integer) - For GovHack this should always be 4
title (string) - The title of the diary/letter
record_number (integer) - Used to view the item in our archival collection system: http://www.acmssearch.sl.nsw.gov.au/search/itemDetailPaged.cgi?itemID=record_number
percentage_unlocked (integer) - Percentage of the document that is available for transcription but not yet complete. You can see this percentage displayed on the transcript tool website
transcript_pages (array of Transcript Page entities) - List of Transcript Pages contained in the diary/letter
created (integer) - Created timestamp
changed (integer) - Last modified timestamp
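As a sketch of how these fields fit together, here is a hypothetical Transcript Document record in Python. The field names follow the data model above; the sample values are taken from the worked example later on this page. It shows how record_number plugs into the archival catalogue URL and how the Unix timestamps convert to dates.

```python
from datetime import datetime, timezone

# A hypothetical Transcript Document record using the field names above;
# sample values are borrowed from the worked example later on this page.
doc = {
    "type": "transcipt_document",  # the misspelling is intentional
    "nid": 100365,
    "collection": 4,
    "title": "Clarke war diary, 20 May 1916-13 August 1916",
    "record_number": 908842,
    "percentage_unlocked": 0,
    "created": 1389067203,  # Unix timestamp
}

# record_number slots into the archival collection system URL:
catalogue_url = (
    "http://www.acmssearch.sl.nsw.gov.au/search/itemDetailPaged.cgi?itemID="
    + str(doc["record_number"])
)

# created/changed are Unix timestamps; convert to a readable UTC date:
created = datetime.fromtimestamp(doc["created"], tz=timezone.utc)

print(catalogue_url)
print(created.date())
```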

Transcript Page

type (string) - Always "transcript_page"
nid (integer) - The unique id for the page
title (string) - The title of the specific page
page_number (integer) - The page number
transcript_image (Transcript Image entity) - The digitised Transcript Image of the page
transcription (string) - The textual transcription
status (integer) - Indicates what state the page is in:

  1.  Creation
  2.  Not yet started
  3.  Partially transcribed
  4.  Ready for review
  5.  Accepted
  6.  Published

created (integer) - Created timestamp
changed (integer) - Last modified timestamp
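The status field is a bare integer, so in practice you may want to map it to the labels above. A minimal sketch:

```python
# Status codes from the Transcript Page data model above,
# mapped to human-readable labels.
PAGE_STATUS = {
    1: "Creation",
    2: "Not yet started",
    3: "Partially transcribed",
    4: "Ready for review",
    5: "Accepted",
    6: "Published",
}

def status_label(status):
    """Return a readable label for a Transcript Page status code."""
    return PAGE_STATUS.get(status, "Unknown")

print(status_label(1))
```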

Transcript Image

fid (integer) - The unique id for the image
name (string) - The filename
mime (string) - The file's MIME type
size (integer) - The filesize in bytes
url (string) - URL to view or download the image
timestamp (integer) - Last updated unix timestamp
type (string) - Will always be "image"

API

This HTTP web service has been built specifically for GovHack 2015. The default response format is XML; JSON is also supported by sending the HTTP header Accept: application/json

Our API has three endpoints: one for listing Transcript Documents and Pages, one for retrieving a specific document or page, and one for retrieving a specific image entity.
The root URI for all endpoints is http://transcripts.sl.nsw.gov.au/
All HTTP responses are compressed with gzip.
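A minimal request-building sketch using only the Python standard library. The build_request helper is an illustrative name, not part of the API; it shows the root URI, the default-XML/opt-in-JSON behaviour, and notes the gzip caveat.

```python
import urllib.request

ROOT = "http://transcripts.sl.nsw.gov.au"

def build_request(path, want_json=False):
    """Build a request for the transcripts API (illustrative helper).

    XML is the default format; pass want_json=True to ask for JSON
    via the Accept header. Responses are gzip-compressed, so after
    urlopen() the body would still need gzip.decompress().
    """
    req = urllib.request.Request(ROOT + path)
    if want_json:
        req.add_header("Accept", "application/json")
    return req

req = build_request("/api/entity_node/", want_json=True)
print(req.full_url)
print(req.get_header("Accept"))
```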

Document/Page Listing Endpoint

GET /api/entity_node/

Optional Parameters
fields (string) - A comma separated list of fields to get. See above data model for possible fields.
parameters (array) - Filter parameters array such as parameters[type]="transcipt_document". See above data model for possible field parameters.
page (int) - The zero-based index of the page to get, defaults to 0.
pagesize (int) - Number of records to get per page.
sort (string) - Field to sort by.
direction (string) - Direction of the sort. ASC or DESC.
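Assembling these optional parameters by hand is error-prone, so here is a sketch that builds the listing URL with urllib. The safe=",[]" argument keeps the commas in fields and the brackets in the parameters[...] filters literal, matching the worked example later on this page.

```python
from urllib.parse import urlencode

# Build the listing query string from the optional parameters above.
query = urlencode(
    {
        "fields": "nid,title",
        "parameters[type]": "transcipt_document",  # intentional misspelling
        "parameters[collection]": 4,
        "page": 0,
        "pagesize": 100,
        "sort": "title",
        "direction": "ASC",
    },
    safe=",[]",  # keep commas and filter brackets unescaped
)
url = "http://transcripts.sl.nsw.gov.au/api/entity_node/?" + query
print(url)
```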

Document/Page Retrieval Endpoint

GET /api/entity_node/[node_id]/

Optional Parameters
fields (string) - A comma separated list of fields to get. See above data model for possible fields.

Image Retrieval Endpoint

GET /api/entity_file/[node_id]/

Optional Parameters
fields (string) - A comma separated list of fields to get. See above data model for possible fields.

Example

This example uses the default XML format, though the same can be done with JSON. We will start by listing all the documents (paginated), pick one at random and retrieve more data about it. Then we will pick a random page from that diary and dig deeper into the page to bring back its status, image and transcription. Finally, we will investigate the image file in more detail.

STEP 1. LIST ALL DOCUMENTS, NID AND TITLE FIELDS, 100 PER PAGE, SORTED BY TITLE ASCENDING

http://transcripts.sl.nsw.gov.au/api/entity_node/?fields=nid,title&parameters[collection]=4&parameters[type]=transcipt_document&page=0&pagesize=100&sort=title&direction=ASC

Sample of result...

<?xml version="1.0" encoding="utf-8"?>
<result is_array="true">
  ...
  <item>
    <nid>100365</nid>
    <title>Clarke war diary, 20 May 1916-13 August 1916</title>
  </item>
  <item>
    <nid>182206</nid>
    <title>Cocks letter diary, 1916-1919 / Verner Cocks</title>
  </item>
  ...
</result>

Here we have limited our results to Transcript Documents only using parameters[type]=transcipt_document. Also we are only interested in two fields, fields=nid,title.
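The Step 1 response can be parsed with the standard library. This sketch extracts (nid, title) pairs from a trimmed copy of the sample above:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Step 1 listing response shown above.
xml_body = """<?xml version="1.0" encoding="utf-8"?>
<result is_array="true">
  <item>
    <nid>100365</nid>
    <title>Clarke war diary, 20 May 1916-13 August 1916</title>
  </item>
  <item>
    <nid>182206</nid>
    <title>Cocks letter diary, 1916-1919 / Verner Cocks</title>
  </item>
</result>"""

root = ET.fromstring(xml_body)
# Each <item> carries the two fields we asked for: nid and title.
documents = [
    (int(item.findtext("nid")), item.findtext("title"))
    for item in root.findall("item")
]
print(documents)
```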

STEP 2. USING ONE OF THE ITEM NODES, IN THIS CASE FOR "CLARKE WAR DIARY, 20 MAY 1916-13 AUGUST 1916", TAKE THE NID AND RETRIEVE THE NODE WITH MORE DETAILS

http://transcripts.sl.nsw.gov.au/api/entity_node/100365/?fields=nid,title,record_number,percentage_unlocked,transcript_pages

Sample of result...

<?xml version="1.0" encoding="utf-8"?>
<result>
  <record_number>908842</record_number>
   <transcript_pages is_array="true">
    ...
    <item>
      <uri>http://transcripts.sl.nsw.gov.au/api/entity_node/103636</uri>
      <id>103636</id>
    </item>
    <item>
      <uri>http://transcripts.sl.nsw.gov.au/api/entity_node/103637</uri>
      <id>103637</id>
    </item>
    ...
  </transcript_pages>
  <percentage_unlocked>0</percentage_unlocked>
  <nid>100365</nid>
  <title>Clarke war diary, 20 May 1916-13 August 1916</title>
</result>

So now we have (amongst other things) a listing of all Transcript Pages contained in the diary.
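To walk the diary page by page, you can pull the page ids (or uris) out of the transcript_pages array. A sketch against a trimmed copy of the Step 2 response:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Step 2 document-retrieval response shown above.
xml_body = """<?xml version="1.0" encoding="utf-8"?>
<result>
  <record_number>908842</record_number>
  <transcript_pages is_array="true">
    <item>
      <uri>http://transcripts.sl.nsw.gov.au/api/entity_node/103636</uri>
      <id>103636</id>
    </item>
    <item>
      <uri>http://transcripts.sl.nsw.gov.au/api/entity_node/103637</uri>
      <id>103637</id>
    </item>
  </transcript_pages>
  <percentage_unlocked>0</percentage_unlocked>
  <nid>100365</nid>
</result>"""

root = ET.fromstring(xml_body)
# Collect the nid of each Transcript Page in document order.
page_ids = [
    int(item.findtext("id"))
    for item in root.find("transcript_pages").findall("item")
]
print(page_ids)
```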

STEP 3. GRAB THE NID OF A RANDOM PAGE IN THE DIARY. RETRIEVE THE PAGE DETAILS.

http://transcripts.sl.nsw.gov.au/api/entity_node/103636/?fields=nid,title,status,transcript_image,transcription

Full result...

<?xml version="1.0" encoding="utf-8"?>
<result>
  <status>1</status>
  <transcription>
    <value>
      <p class="page" id="a5633033">[Page 33]</p>
<p>a letter to-day, giving my experiences from the time I last wrote to the present & chance to it getting through<br/>Yesterday some Allemandes dropped Bombs on Amiens just after we passed through & killed several French women<br/>The majority of French women are very loyal & I have seen them in all kinds of trouble just say, Vive la France."<br/>Things are hurrying towards a big offensive in this region where it will end it is impossible to say at present.<br/>The weather is bitterly cold here & none of us can keep warm. Coming straight from Egypt we feel it worse. I have got a bad cold & feel very ill, but I am hoping to soon shake it off.<br/>19 out of every 20 of the men have colds but it is only to be expected, as our blood must be very thin after our long sojourn in the land of the Pharoas.<br/>They have placed Paris out of bounds to the Troops but as it is 185 miles from here there is no need to worry over the matter.<br/>Anyway there is going to be an enquiry into it as the French people are very indignant over it.<br/>The Australian's names went before them on account</p>
    </value>
  </transcription>
  <transcript_image>
    <file>
      <uri>http://transcripts.sl.nsw.gov.au/api/entity_file/104534</uri>
      <id>104534</id>
     </file>
  </transcript_image>
  <nid>103636</nid>
  <title>Clarke war diary, 20 May 1916-13 August 1916 - Page 33</title>
</result>

So now we have some more information about a Transcript Page, including the actual transcript marked up with some basic HTML.
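Since the transcription value is marked up with basic HTML (p and br tags), you may want plain text for analysis. This sketch uses the standard library's HTMLParser on a short excerpt of the Step 3 transcription, turning the br and p breaks into newlines:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, turning <br/> and </p> into newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_startendtag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

# A short excerpt of the Step 3 transcription value shown above.
markup = (
    '<p class="page" id="a5633033">[Page 33]</p>'
    '<p>a letter to-day, giving my experiences<br/>'
    'Yesterday some Allemandes dropped Bombs on Amiens</p>'
)

extractor = TextExtractor()
extractor.feed(markup)
plain = "".join(extractor.parts).strip()
print(plain)
```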

STEP 4. FINALLY USING THE URI OF THE IMAGE, RETRIEVE SOME MORE IMAGE DETAILS.

http://transcripts.sl.nsw.gov.au/api/entity_file/104534

Full result...

<?xml version="1.0" encoding="utf-8"?>
<result>
  <fid>104534</fid>
  <name>a5633033h.jpg</name>
  <mime>image/jpeg</mime>
  <size>973982</size>
  <url>http://transcripts.sl.nsw.gov.au/sites/all/files/a5633033h.jpg</url>
  <timestamp>1389067203</timestamp>
</result>

So now we know the image is around 974 KB, and we have a URL to reference or download. You can open it in your browser, or use wget, curl or another HTTP client library to download it.
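The size field is raw bytes; a small helper (illustrative, not part of the API) turns it into the "around 974 KB" figure quoted above:

```python
def human_size(num_bytes):
    """Render a byte count in KB/MB/GB, e.g. 973982 -> '974 KB'."""
    for unit in ("bytes", "KB", "MB", "GB"):
        if num_bytes < 1000:
            return f"{num_bytes:.0f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.0f} TB"

# The size value from the Step 4 response above.
print(human_size(973982))
```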

Trove

You can also access State Library of NSW data through the Trove API. Through it you can access collection items such as maps, photographs and newspaper articles contributed to Trove from the State Library of NSW's collections.

The Library’s National Union Catalogue (NUC) symbol is NSL.