===============================================
Make predictions on un-annotated BioC XML files
===============================================

Two command-line tools are provided for using the models in the context of this
project. One tool contacts the Huggingface inference API to make predictions;
the other relies on the model having been downloaded locally. In both cases the
input text must be provided as BioC-formatted XML files. The publications can
either be downloaded from an open access collection, already pre-formatted in
BioC as described in :ref:`get-publications`, or be created from a local PDF by
other means.

Accessing the models through Huggingface inference API
------------------------------------------------------

Accessing the models through the inference API requires the input text
formatted as BioC XML, an account with Huggingface, and an access token. If you
followed the instructions for creating a virtual environment for the project
and activated it, all the necessary dependencies should already be installed.
The name of the remote model on Huggingface must be given with
``--model-repo``.

**Example**

.. code-block:: bash

    ner_for_protein_structures.run_hf_inference_for_ner \
        --xml-dir=test/data/not_annotated_BioC_XML/ \
        --model-repo="PDBEurope/Bioformer8L-ProteinStructure-NER-v0.1" \
        --auth-token="" \
        --output-dir=test/results/predictions/

Although a waiting step has been added to the inference script, the API has a
lag phase which may leave some text passages without any annotations. Also,
anyone running requests on a free plan will find that the API has an hourly
access limit, which can easily be reached by processing just two publications
of average length.

Accessing the models locally
----------------------------

The models can also be accessed locally after download. They use between 350MB
and 900MB of disk space. The input publications need to be formatted in BioC
and provided as XML.
The model location must be given as a full path with ``--model-dir``. A
shorthand for the model name should be given with ``--model-name``; it is
added as "annotator" to each annotation to record which model produced it.
Both the smaller bioformer8L and the larger PubmedBERT models have been tested
locally on a MacBook Pro M1 with 16GB RAM running Sonoma 14.1.1.

**Example**

.. code-block:: bash

    ner_for_protein_structures.run_local_inference_for_ner \
        --xml-dir=test/data/not_annotated_BioC_XML/ \
        --model-dir= \
        --model-name="bioformer8L_v0.1" \
        --output-dir=test/results/predictions/

.. code-block:: bash

    ner_for_protein_structures.run_local_inference_for_ner \
        --xml-dir=test/data/not_annotated_BioC_XML/ \
        --model-dir= \
        --model-name="pubmedbert_v2.1" \
        --output-dir=test/results/predictions/
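
Both tools read every publication from the directory passed as ``--xml-dir``,
so it can help to check that directory before starting a run, especially when
the hourly limit of the inference API is a concern. The following is a minimal
sketch and not part of the project: the function name ``summarize_bioc_dir`` is
hypothetical, and the code assumes the usual BioC layout in which each
``passage`` element holds its content in a ``text`` child, with files ending in
``.xml``. It uses only the Python standard library.

.. code-block:: python

    from pathlib import Path
    import xml.etree.ElementTree as ET


    def summarize_bioc_dir(xml_dir):
        """Map each BioC XML file name to its number of non-empty passages.

        Hypothetical helper for illustration; assumes the standard BioC
        nesting collection > document > passage > text.
        """
        summary = {}
        for xml_path in sorted(Path(xml_dir).glob("*.xml")):
            # ET.parse raises ParseError if the file is not well-formed XML.
            root = ET.parse(str(xml_path)).getroot()
            non_empty = [
                passage
                for passage in root.iter("passage")
                if (passage.findtext("text") or "").strip()
            ]
            summary[xml_path.name] = len(non_empty)
        return summary


    if __name__ == "__main__":
        # Directory taken from the examples above.
        counts = summarize_bioc_dir("test/data/not_annotated_BioC_XML/")
        for name, n_passages in counts.items():
            print(f"{name}: {n_passages} passages with text")

A file that reports zero passages would give the models nothing to annotate,
and a file that fails to parse is better fixed before it is sent for inference.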