A primary harvest is usually run each June, with targeted updates to coincide with special events including:
- Council amalgamations
- NSW State elections
- NSW Ministry changes
The State Library collects government information in print, and the collection of government websites extends that into the digital realm.
This ensures that the Library holds a permanent record of government materials in all their forms for future researchers.
Australian libraries have a long history of archiving websites, including Pandora, which commenced in 1996, and Tasmania's Digital Island Archive, which commenced in 1998. In 2005, the National Library began broad-scale harvesting of Commonwealth websites, and it developed a public interface in 2014.
As of 30 November 2017, the State Library’s harvested collection contains approximately 5.4 TB of data in over 100 million documents. The archive has attracted visitors from around the world. One of the challenges in dealing with this information is the sheer size, breadth and depth of the archive. The collection contains a wide variety of material, including budget papers for both the state government and local councils, planning documents, videos, information, announcements and public library websites.
Addressing these challenges revolves around what the data can be used for and what sorts of material can be made easily accessible. When dealing with an archive of this magnitude, it's difficult to know where to begin. Visualising the underlying content can provide a starting point for people to engage with the collection.
With that in mind, I isolated a test sample of NSW Public Library websites for further exploration. The dataset was relatively small, containing approximately 68 GB of data across 61 compressed WARC (Web ARChive) files.
An international group, supported by a grant from the Andrew W. Mellon Foundation, has developed Archives Unleashed, a suite of tools for examining web archives. I installed the tools locally and worked through their key examples to extract a 355 KB file of URLs. Compared with the 68 GB sample, this file is far smaller and much more manageable: it contains 2,647 URLs with 4,401 connections between them.
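To give a sense of what a file of URLs and connections represents, here is a minimal sketch of link extraction using only the Python standard library. It is not the Archives Unleashed code, and it does not read WARC files itself; it assumes the pages have already been decoded into (URL, HTML) pairs. The names `LinkParser`, `domain_of`, `domain_links` and the sample pages are all hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_of(url):
    """Return the host of a URL in lower case, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_links(pages):
    """Count links from one domain to another.

    `pages` is an iterable of (page_url, html_text) pairs. How those pairs
    are read out of the archive is outside this sketch (an assumption).
    """
    edges = Counter()
    for page_url, html_text in pages:
        parser = LinkParser()
        parser.feed(html_text)
        source = domain_of(page_url)
        for href in parser.hrefs:
            # Resolve relative links against the page they appear on.
            target = domain_of(urljoin(page_url, href))
            # Skip links with no host (mailto:, javascript:) and
            # links that stay within the same site.
            if source and target and target != source:
                edges[(source, target)] += 1
    return edges


# Hypothetical sample pages, not taken from the Library's archive.
sample_pages = [
    ("http://www.library.example.gov.au/index.html",
     '<a href="/about">About</a> '
     '<a href="http://www.sl.example.gov.au/">State Library</a>'),
    ("http://library.example.gov.au/events",
     '<a href="http://sl.example.gov.au/events">Events</a> '
     '<a href="mailto:info@example.org">Email</a>'),
]

edges = domain_links(sample_pages)
for (source, target), count in sorted(edges.items()):
    print(source, target, count)
```

Counting links per pair of sites, rather than keeping every individual hyperlink, is one reason a link file can be so much smaller than the pages it was drawn from.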
The next step was to load the file into Gephi, an open-source visualisation tool for network data, and examine it further. Following the initial load, I was able to explore the network’s density and gain a sense of what the websites were connecting to, as demonstrated in the picture below.
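As a rough illustration of this step, the sketch below writes a set of links as a CSV edge list with `Source`, `Target` and `Weight` columns, a layout that Gephi's spreadsheet import can generally read, and computes a simple density figure. The function names and the example edges are hypothetical. The density formula assumes a directed network with no self-links, which may not match the settings used for the analysis described here.

```python
import csv
import io


def write_edge_list(edges, handle):
    """Write {(source, target): weight} as a Source,Target,Weight CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


def directed_density(node_count, edge_count):
    """Fraction of all possible directed links (no self-links) that exist."""
    if node_count < 2:
        return 0.0
    return edge_count / (node_count * (node_count - 1))


# Hypothetical edges, not taken from the Library's archive.
example_edges = {
    ("a.example", "b.example"): 3,
    ("b.example", "c.example"): 1,
}

# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_edge_list(example_edges, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Node and connection counts quoted in the post (assumed directed here).
density = directed_density(2647, 4401)
print(f"{density:.6f}")
```

Under those assumptions the density comes out at roughly 0.0006, meaning only a tiny fraction of the possible connections between the 2,647 nodes are actually present. A sparse network like this is common for web link data.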
As the library continues to harvest and develop these collections, we will explore approaches to making this data more accessible. One possibility is the creation of smaller subsets, like the one above, which can provide the basis for further analysis and visualisation.
- Written By Sean Volke, Online Resources Specialist Librarian