In partnership with Helge Holzmann and Vinay Goel at the L3S Research Center and the Internet Archive, the MHL has been working on a tool that lets researchers take advantage of an Apache Spark framework that enables easy data extraction as well as derivation.
The MHLonArchiveSpark tool on GitHub includes everything needed to work with MHL collections via this framework. Users will need some familiarity with working at the ‘command line’; familiarity with Scala or with Apache Spark commands is also recommended.
Currently, users must either have access to a computer cluster or server, or use a Docker container, in order to run ArchiveSpark. The Docker option works on a laptop or desktop computer, and it automatically creates a Jupyter Notebook in which to work with the MHL collections.
For those already familiar with Apache Spark environments, an example of the MHL project can be found here. We are working to make search ‘recipes’ available to users.