Create

If you are planning a novel research project that could make use of our data, please contact the MHL!

Get Data

You can always extract data from the Medical Heritage Library collection using the Internet Archive’s own advanced search tool. Just make sure you’re searching using the collection tag “medicalheritagelibrary” along with your other search terms.
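As a rough illustration of that kind of search, the sketch below builds a query URL for the Internet Archive's advanced search endpoint and always combines the “medicalheritagelibrary” collection tag with the user's own terms. The helper name `build_mhl_search_url`, the default fields, and the sample query are assumptions made for this example. They are not part of any MHL tool.

```python
from urllib.parse import urlencode

ADVANCED_SEARCH_URL = "https://archive.org/advancedsearch.php"
MHL_COLLECTION = "medicalheritagelibrary"


def build_mhl_search_url(terms, fields=("identifier", "title"), rows=50, page=1):
    """Return an advanced-search URL restricted to the MHL collection.

    `terms` is a query fragment written by the user, for example
    'title:"yellow fever"'. This helper is hypothetical and only
    assembles the URL; it does not send the request.
    """
    # Always AND the collection tag with the user's own search terms.
    query = f"collection:{MHL_COLLECTION} AND ({terms})"
    params = [("q", query)]
    # The fl[] parameter is repeated once per requested metadata field.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return f"{ADVANCED_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_mhl_search_url('title:"yellow fever"'))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching records as JSON.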

ArchiveSpark

In partnership with Helge Holzmann and Vinay Goel at the L3S Research Center and the Internet Archive, the MHL has been working on a tool that lets researchers take advantage of an Apache Spark framework that enables easy data extraction as well as derivation.

The MHLonArchiveSpark tool on GitHub includes all the required elements to work with MHL collections via this framework. Users will need some familiarity with ‘command line’ coding; familiarity with Scala or with the Apache Spark commands list is recommended.

Currently, to run ArchiveSpark users need either access to a computer cluster or server, or a Docker container. ArchiveSpark in a Docker environment can be run on a laptop or desktop computer. Docker will automatically create a Jupyter Notebook where work can be done with the MHL collections.

For those already familiar with Apache Spark environments, an example of the MHL project can be found here. We are working to make ‘recipes’ for searches available for users.

Future Projects

The MHL is interested in working with researchers and educators to:

  • Gather user stories that illustrate how the MHL corpus (including the UKMHL) has been used to support scholarship and to share those stories on its website
  • Share how MHL content is being used in the classroom and enable students to share how they’ve engaged with MHL content
  • Promote tools created or employed to work with the MHL corpus to create new knowledge
  • Have people experiment with ArchiveSpark and provide feedback for us and the developers, as we’d like to see this become a tool that less tech-savvy users can use to create custom datasets
  • Share their ideas about what tools/services/functionality would improve access to (and the use of) MHL content

Additionally, we have started to reach out to Library Science and other programs with prospective ideas for hackathons and data challenges:

Make ArchiveSpark with MHL more intuitive by developing a user-friendly interface (or other mechanism) that makes ArchiveSpark functionality more broadly accessible. MHL constituencies come from a variety of academic disciplines and have varying levels of comfort and familiarity with tools like ArchiveSpark (https://github.com/helgeho/MHLonArchiveSpark). In a nutshell, this project seeks to make ArchiveSpark workflows broadly accessible to the public. These workflows typically require users to:

  1. Go to the MHL’s Advanced Search Tool to identify a set of texts meeting their criteria and retrieve all of them from the Internet Archive
  2. Use ArchiveSpark to extract the full text of a results set (including metadata) and then perform additional queries against that set
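To make the two steps concrete, here is a simplified sketch in plain Python. It does not use ArchiveSpark or reproduce its API. It assumes the search result has already been parsed from the Internet Archive's JSON output, that an item's OCR text follows the `<identifier>_djvu.txt` naming pattern, and that the texts have already been downloaded into a dictionary. All function names and sample data are made up for illustration.

```python
import re


def identifiers_from_search(result):
    """Step 1: pull item identifiers out of a parsed advanced-search JSON result."""
    return [doc["identifier"] for doc in result["response"]["docs"]]


def full_text_url(identifier):
    """Build the download URL for an item's OCR text.

    Assumption: the item exposes its full text as <identifier>_djvu.txt,
    which is common for scanned books but not guaranteed for every item.
    """
    return f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"


def count_term(texts, term):
    """Step 2: run a follow-up query (a whole-word count) against the result set."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return {identifier: len(pattern.findall(text)) for identifier, text in texts.items()}


if __name__ == "__main__":
    # Made-up stand-ins for a search result and for downloaded full texts.
    sample_result = {"response": {"docs": [{"identifier": "example-item-1"},
                                           {"identifier": "example-item-2"}]}}
    sample_texts = {
        "example-item-1": "Cholera spread quickly. The cholera ward was full.",
        "example-item-2": "A treatise on fevers.",
    }
    for identifier in identifiers_from_search(sample_result):
        print(identifier, full_text_url(identifier))
    print(count_term(sample_texts, "cholera"))
```

A user-friendly interface for this challenge would hide these steps behind a form, while ArchiveSpark would carry them out at scale on a cluster or in the Docker environment.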

Products of this project could include a number of canned recipes for searching content with ArchiveSpark, as well as new approaches to searching the dataset that make extraction and analysis easier for researchers. An online tutorial will be available in advance of the hackathon.

Connect IndexCat to journal articles that have been digitized by the MHL. This challenge involves matching IndexCat entries with full-text articles residing in the Medical Heritage Library.

The Index-Catalogue of the Library of the Surgeon-General’s Office (Index-Catalogue) is a multi-part printed bibliography or list of items in the Library of the Surgeon-General’s Office, U.S. Army. It contains material dated from the 1400s through 1950 and is an important resource for researchers in the history of medicine, history of science, and for clinical research. The Index-Catalogue was published in five (5) series in sixty-one (61) volumes from 1880 to 1961. Since it is a list of holdings for a specific library, it does not claim to be an index of all material published in medicine. By 1895, however, the Surgeon-General’s Library was the world’s largest medical library. Therefore, its catalog became a major source for accessing medical literature. The scope of the Index-Catalogue extends beyond medicine and includes, for example, the basic sciences, scientific research, military medicine, public health, and hospital administration. Language coverage is international with citations in European and Slavic languages, Greek script, and Romanized Chinese and Japanese titles – some with English translations. The catalogue covers a wide assortment of materials including books, journal articles, dissertations, pamphlets, reports, newspaper clippings, case studies, obituary notices, letters, portraits, as well as rare books and manuscripts (see https://www.nlm.nih.gov/hmd/indexcat/).

The software platform for IndexCat™ is IBM InfoSphere Data Explorer (DE). As installed by NLM, it permits simultaneous searching across all collections in the IndexCat™ database. XML data is available from the IndexCat™ database. It reflects both the Index-Catalogue and eTK/eVK2 collections. The data are available to all, both within and outside the United States. There is no charge for obtaining the files.
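One possible starting point for the matching challenge is fuzzy comparison of titles. The sketch below uses Python's standard `difflib` module to pair a catalogue entry title with the closest article title. The function names, the 0.85 threshold, and the sample titles are assumptions chosen for illustration; they do not reflect the actual IndexCat XML schema or MHL metadata.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and replace punctuation with spaces before comparing."""
    return " ".join(re.sub(r"[^\w\s]+", " ", title.lower()).split())


def best_match(entry_title, article_titles, threshold=0.85):
    """Return (closest article title, similarity score).

    The title is None when no candidate reaches the threshold.
    The threshold is an arbitrary assumption that would need tuning
    against real catalogue data.
    """
    target = normalize(entry_title)
    best_title, best_score = None, 0.0
    for candidate in article_titles:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_title, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best_title, best_score


if __name__ == "__main__":
    # Made-up titles standing in for a catalogue entry and MHL article titles.
    articles = ["On the Treatment of Yellow Fever", "Notes on hospital administration"]
    print(best_match("On the treatment of yellow fever.", articles))
```

Title similarity alone will produce false matches for common titles, so a fuller approach would probably also compare author, journal name, volume, and year.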

Create an index of archaic medical terminology using medical dictionaries found in the Medical Heritage Library, map those terms to contemporary medical terminology (such as the Unified Medical Language System), and index the Medical Heritage Library corpus to facilitate the discovery of published content from the perspective of contemporary medicine.

Research using history of medicine primary resources requires a highly specialized vocabulary of medical terms. At the outset of a research project, humanities scholars, behavioral scientists, students of the history of medicine and others may not possess the full assemblage of biological and medical terminology needed to uncover a comprehensive body of primary source material. Even for a researcher who is knowledgeable about archaic medical terminology, the specificity of contemporary medical terms and the increasing degree of specialization within medicine present barriers to the analysis of an idea or process over time and its impact on society. By applying semantic web technology and the lexical tools of the Unified Medical Language System, we can enable a more lexically open discovery process that supports multi-disciplinary approaches to history of medicine sources. Technical documentation for the UMLS API is available here: https://documentation.uts.nlm.nih.gov/. Alternatively, the scoping of the project could be limited to MeSH subject headings.
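The indexing idea can be sketched with a small lookup table. The example below maps a few archaic terms to contemporary equivalents and records which documents use them, so that a search on the modern term finds older texts. The three term pairs are hand-picked illustrations and the document texts are made up. A real mapping would be derived from the MHL's medical dictionaries and from UMLS or MeSH, neither of which is called here.

```python
import re
from collections import defaultdict

# Hand-picked illustrative pairs; not drawn from UMLS, MeSH, or any MHL dictionary.
ARCHAIC_TO_MODERN = {
    "consumption": "tuberculosis",
    "dropsy": "edema",
    "apoplexy": "stroke",
}


def build_index(documents, mapping=ARCHAIC_TO_MODERN):
    """Map each modern term to the set of document ids that use an archaic equivalent."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Collect the distinct lowercase words that appear in the document.
        words = set(re.findall(r"[a-z]+", text.lower()))
        for archaic, modern in mapping.items():
            if archaic in words:
                index[modern].add(doc_id)
    return dict(index)


if __name__ == "__main__":
    # Made-up snippets standing in for MHL full texts.
    documents = {
        "doc-1": "A case of dropsy following consumption.",
        "doc-2": "Apoplexy in the aged.",
    }
    print(build_index(documents))
```

Single-word lookups like this are ambiguous (“consumption” also has a non-medical sense), which is one reason the challenge points to the lexical tools of the UMLS rather than a flat word list.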
