Welcome to ner_for_protein_structures’s documentation!
ner_for_protein_structures is a collection of command-line tools published alongside a scientific publication that describes the development of a human-in-the-loop named entity recognition algorithm specific to protein structures.
The tools described here convert annotations in BioC formatted XML files into a number of other formats, both to calculate performance statistics and to prepare input data for model training. An inference script is also provided, which submits other BioC formatted XML files for inference via the Hugging Face inference API.
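To give a feel for what such a conversion involves, here is a minimal sketch that flattens the annotations of a BioC XML string into CSV rows using only the Python standard library. It is not the implementation shipped with this tool collection: the function names `bioc_annotations_to_rows` and `rows_to_csv`, the column names, and the sample document with its placeholder ID are all assumptions made for illustration. Only the BioC element names (`document`, `passage`, `annotation`, `infon`, `location`, `text`) come from the BioC format itself.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; the document ID is a placeholder, not a real article.
SAMPLE_BIOC = """<collection>
  <document>
    <id>PMC0000000</id>
    <passage>
      <offset>0</offset>
      <text>The lysozyme structure was solved.</text>
      <annotation id="1">
        <infon key="type">protein</infon>
        <location offset="4" length="8"/>
        <text>lysozyme</text>
      </annotation>
    </passage>
  </document>
</collection>"""

# Column names are an assumption for this sketch, not the tool's actual output schema.
FIELDS = ["document_id", "annotation_id", "type", "offset", "length", "text"]


def bioc_annotations_to_rows(xml_text):
    """Collect one dict per <annotation> element found in a BioC XML string."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id", default="")
        for passage in document.iter("passage"):
            for annotation in passage.iter("annotation"):
                # The entity type is stored in an <infon key="type"> child.
                entity_type = ""
                for infon in annotation.findall("infon"):
                    if infon.get("key") == "type":
                        entity_type = infon.text or ""
                location = annotation.find("location")
                rows.append({
                    "document_id": doc_id,
                    "annotation_id": annotation.get("id", ""),
                    "type": entity_type,
                    "offset": int(location.get("offset")) if location is not None else None,
                    "length": int(location.get("length")) if location is not None else None,
                    "text": annotation.findtext("text", default=""),
                })
    return rows


def rows_to_csv(rows):
    """Serialise the rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(bioc_annotations_to_rows(SAMPLE_BIOC)))
```

The sketch keeps only character offsets and entity types. For the options and output formats the tools actually support, refer to the conversion pages listed in the contents.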
Contents:
- Getting Started
- Models
- Data Preparation
- Conversion of BioC formatted annotations
- Make predictions on un-annotated BioC XML files
- Calculating performance stats following SemEval procedure
- List of functions in the tool collection
- Overview
- Convert annotations in BioC formatted XML to CSV
- Convert annotations in BioC formatted XML to JSON
- Fetch fulltext, open access publications in BioC formatted XML using PubMedCentral IDs
- Run NER predictions on BioC formatted XML files with a trained model - locally
- Run NER predictions on BioC formatted XML files with a trained model - remotely
- Processing BioC formatted XML files to turn annotations into IOB format for SemEval calculations
- Processing BioC formatted XML files to turn annotations into IOB format for model training
- Running SemEval to calculate performance statistics
- Converting EuropePMC/JATS style XML to BioC XML