<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://cthoyt.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://cthoyt.com/" rel="alternate" type="text/html" /><updated>2026-03-03T12:34:41+00:00</updated><id>https://cthoyt.com/feed.xml</id><title type="html">Biopragmatics</title><subtitle>Unraveling complex biology with biological knowledge graphs. Content licensed under CC BY 4.0.
</subtitle><author><name>Charles Tapley Hoyt</name></author><entry><title type="html">International Society of Biocuration Presents: Curate This!</title><link href="https://cthoyt.com/2026/03/03/curate-this.html" rel="alternate" type="text/html" title="International Society of Biocuration Presents: Curate This!" /><published>2026-03-03T10:13:00+00:00</published><updated>2026-03-03T10:13:00+00:00</updated><id>https://cthoyt.com/2026/03/03/curate-this</id><content type="html" xml:base="https://cthoyt.com/2026/03/03/curate-this.html"><![CDATA[<p>While researchers typically communicate their work through poster presentations,
oral presentations, and written communication, programmers often give (live)
demonstrations. I’m not aware of any technical or practical barriers preventing
curators from doing the same, and I’ve always wished that curators did this more
often. This post is about how I planned to make this a reality by starting a
podcast with the
<a href="https://www.biocuration.org">International Society for Biocuration (ISB)</a>
entitled
<a href="https://youtube.com/playlist?list=PLYM0tkKvhlX54EMQGIbAOOhMKlDDPF21P&amp;si=7AO9ur6yWiFHhtzL">ISB Presents: Curate This!</a>.</p>

<p>The key first step was to decide on the goals of the podcast and its target
audience. The primary goal of <em>Curate This!</em> is to explicitly show the process
of curation and have an informal discussion about the challenges associated with
it. Episodes should be short and require as little preparatory work from both
the interviewer and the interviewee as possible, so that the format can scale. I
also decided with the ISB that it should be hosted as an ISB podcast, not as
something just from me. This better fits the message for the curation community
and is overall a better governance decision to support longevity (if we’re
successful).</p>

<p>It’s not a goal of this podcast to give a background on curation - there are
plenty of
<a href="https://www.biocuration.org/home-3/isb-publications/">resources available from the ISB</a>
that cover this. It’s also not a goal of this podcast to focus on the curator
themselves, such as how they became a curator - the ISB hosts two <em>Careers in
Biocuration</em> sessions each year, one at the in-person conference and one
virtually, that cover this. The target audience for this podcast is
practitioners.</p>

<h2 id="script">Script</h2>

<p><em>Curate This!</em> interviews are split into several segments. The first and last
are recorded by the interviewer after the interview is done to give an
introduction and parting remarks. The bulk of the episode is contained within
three segments: introduction, demonstration, and reflections.</p>

<p>I’ve written out the questions and the concept for each segment below to serve
as a resource for potential interviewees to read ahead of time and prepare
themselves, as well as a resource for interviewers to follow and stay on task.</p>

<h3 id="introduction">Introduction</h3>

<p>The goal of the first segment of the interview is to describe the history,
goals, and uses of your curated resource in around five minutes. We’ll loosely
use the following question lists:</p>

<ul>
  <li>Basic
    <ul>
      <li>What is your resource called (if not already mentioned in the introduction)?</li>
      <li>When was your resource established?</li>
      <li>What kind of information is in your resource?</li>
      <li>Do you develop/reuse any data standards?</li>
    </ul>
  </li>
  <li>Impact
    <ul>
      <li>Who uses (or could use) your resource and why? How do you assess this?</li>
      <li>Have you seen any cool citations of your resource?</li>
      <li>What is the broader impact in the basic and translational research space in
biomedicine (or beyond)?</li>
    </ul>
  </li>
  <li>Personnel
    <ul>
      <li>What does your resource’s team look like?
        <ul>
          <li>How many people/groups work on your resource?</li>
          <li>Is it developed and maintained by a group within your institution, as a
community effort, or somewhere in between?</li>
        </ul>
      </li>
      <li>If it’s a community effort, how do you do project management and
communication? E.g., Slack, GitHub, Trello, etc.</li>
      <li>How do you onboard new curators? If there’s a difference between
internal/external, what does this dichotomy look like?</li>
    </ul>
  </li>
</ul>

<h3 id="demonstration">Demonstration</h3>

<p>The goal of the second segment of the interview is to demonstrate, live, how
a curation is contributed to your resource, in ten to thirty minutes. Here’s
what makes a satisfying live demonstration:</p>

<ul>
  <li>Show how you select what you’re going to curate.
    <ul>
      <li>How do you find content? For example, if you curate text from literature or
patents, do you have a search query that runs on a regular schedule?</li>
      <li>How do you prioritize content? For example, do you use ranking from a search
system, or a more sophisticated document classifier?</li>
    </ul>
  </li>
  <li>What do you look for in the text?
    <ul>
      <li>Do you use external ontologies, terminologies, or semantic spaces to tag
named entities?</li>
    </ul>
  </li>
  <li>What kinds of assumptions do you make as a curator? For example, if you’re
curating relationships between proteins, do you assume that authors refer to
proteins using their corresponding gene names?</li>
  <li>How do you report the confidence of your curation (and its components)?</li>
  <li>What kind of metadata do you capture, e.g., the curator’s ORCiD, the time of
curation, or anything else?</li>
</ul>

<p>Ideally, you should prepare a curation ahead of time so you can quickly walk
through the process during the live demo, rather than needing time to think and
consider (though that might be more realistic!).</p>

<h3 id="reflections">Reflections</h3>

<p>The goal of the third segment of the interview is to reflect on the
demonstration and conclude the interview with parting thoughts, in around five
minutes.</p>

<ul>
  <li>Next Steps
    <ul>
      <li>What happens next after curation?
        <ul>
          <li>Does the data get reflected on the website immediately?</li>
          <li>Does a second curator check things?</li>
          <li>Are more substantial releases made periodically?</li>
        </ul>
      </li>
      <li>What are some difficulties/challenges in curating your resource?</li>
      <li>What could authors/journals/publishers do to make it easier to curate?
        <ul>
          <li>What data should they include (that they don’t)?</li>
          <li>How should data look?</li>
          <li>What kinds of standards would you like to see developed?</li>
        </ul>
      </li>
      <li>Contrast curating a “good” paper versus a “bad” paper.</li>
    </ul>
  </li>
  <li>Longevity and Sustainability
    <ul>
      <li>How/where do you think AI has a place in the curation and maintenance
of your resource?</li>
      <li>What’s the funding situation like?</li>
      <li>How much do you estimate it costs to maintain this resource per year?</li>
    </ul>
  </li>
</ul>

<h2 id="first-episode">First episode</h2>

<p>In our inaugural episode, I interviewed Dr. Susan (Sue) Bello, a curator for the
<a href="https://www.informatics.jax.org/">Mouse Genome Informatics (MGI)</a> knowledge
base and the <a href="https://www.alliancegenome.org/">Alliance of Genome Resources (AGR)</a>,
who works at the Jackson Laboratory in Maine. Sue is also the ISB executive
committee chair. She showed us how she curates alleles in MGI using the paper
<a href="https://doi.org/10.1016/j.isci.2024.111587">Mice deficient in TWIK-1 are more susceptible to kainic acid-induced seizures</a>
(Kim <em>et al.</em>, 2025).</p>

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/rNwvZ9KhfCM?si=Kt16Y_aWV24sqkBQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<h2 id="let-us-interview-you">Let Us Interview You</h2>

<p>If you curate a resource and want to be featured on the podcast, please reach
out. My contact information is at the bottom of my blog, or the ISB can be contacted
<a href="https://www.biocuration.org/contact-the-isb/">here</a>.</p>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="International Society of Biocuration" /><category term="curation" /><category term="biocuration" /><summary type="html"><![CDATA[While researchers typically communicate their work through poster presentations, oral presentations, and written communication, programmers often give (live) demonstrations. I’m not aware of any technical nor practical barriers for why curators couldn’t do the same, and always wished that curators did this more often. This post is about how I planned to make this a reality by starting a podcast with the International Society for Biocuration (ISB) entitled ISB Presents: Curate This!.]]></summary></entry><entry><title type="html">Efficient Bulk Access to Citations in OpenCitations</title><link href="https://cthoyt.com/2026/02/10/opencitations-client.html" rel="alternate" type="text/html" title="Efficient Bulk Access to Citations in OpenCitations" /><published>2026-02-10T09:46:00+00:00</published><updated>2026-02-10T09:46:00+00:00</updated><id>https://cthoyt.com/2026/02/10/opencitations-client</id><content type="html" xml:base="https://cthoyt.com/2026/02/10/opencitations-client.html"><![CDATA[<p><a href="https://opencitations.net">OpenCitations</a> aggregates and deduplicates
bibliographic information from CrossRef, Europe PubMed Central, and other
sources to construct a comprehensive, open index of citations between scientific
works. This post describes the
<a href="https://github.com/cthoyt/opencitations-client"><code class="language-plaintext highlighter-rouge">opencitations-client</code></a> package
which wraps the OpenCitations API and implements an automated pipeline for
locally downloading, caching, and accessing OpenCitations in bulk.</p>

<h2 id="background">Background</h2>

<p>OpenCitations both provides access via an <a href="https://api.opencitations.net">API</a>
and <a href="https://download.opencitations.net">bulk data downloads</a> distributed across
FigShare and Zenodo. Importantly, it publishes its data under the CC0 public
domain license to democratize access to citations - previously, this data was
only available through paid access to commercial databases owned by publishers.</p>

<p>While API access can be convenient for <em>ad-hoc</em> usage, it’s generally slow,
rate-limited, susceptible to DDoS (e.g., from crawlers), and therefore difficult
(if not impossible) to use in bulk. My solution is to write software that
automates downloading, processing, and caching databases in bulk and provides
fast, highly available, local access. I’ve previously written about developing
standalone software packages for several large databases including
<a href="/2020/12/14/taming-drugbank.html">DrugBank</a>,
<a href="/2021/08/05/taming-chembl-sql.html">ChEMBL</a>,
<a href="/2023/09/01/umls.html">UMLS</a>,
<a href="/2024/06/08/easy-orcid.html">ORCiD</a>, and
<a href="/2025/01/23/clinical-trials-data-modeling.html">ClinicalTrials.gov</a>.
I also maintain several similar workflows in the
<a href="https://github.com/biopragmatics/pyobo">PyOBO software package</a> for converting
resources into ontology-like data structures. I previously wrote about how this
looks for <a href="/2025/10/14/databases-as-ontologies-2-hgnc.html">HGNC</a>.</p>

<h2 id="building-on-an-existing-ecosystem">Building on an Existing Ecosystem</h2>

<p>I’ve been developing a software ecosystem over the last decade to support common
workflows in research data management and data integration. When I start a new
project, I try to reuse or improve existing components from that ecosystem
wherever possible. Importantly, I try to find meaningful ways of organizing
code across my ecosystem to reduce duplication, separate concerns, reduce the
burden of testing, and ease maintenance.</p>

<p>OpenCitations publishes its
<a href="https://download.opencitations.net/">bulk data dumps</a> across several records in
Figshare and Zenodo. I’ve previously written
<a href="https://github.com/cthoyt/zenodo-client/"><code class="language-plaintext highlighter-rouge">zenodo-client</code></a>, which interacts with
Zenodo’s API and orchestrates downloading and caching. <code class="language-plaintext highlighter-rouge">zenodo-client</code> heavily
builds on <a href="https://github.com/cthoyt/pystow"><code class="language-plaintext highlighter-rouge">pystow</code></a>, which implements I/O and
filesystem operations to enable reproducible, automated downloading, caching,
and opening of data.</p>

<p>I had not previously written software to interact with Figshare, so I followed
the form of <code class="language-plaintext highlighter-rouge">zenodo-client</code> and created a new package,
<a href="https://github.com/cthoyt/figshare-client"><code class="language-plaintext highlighter-rouge">figshare-client</code></a>. I’m able to
quickly create new high-quality packages because I’ve encoded all the wisdom and
experience I’ve gained over the years in a Cookiecutter template,
<a href="https://github.com/cthoyt/cookiecutter-snekpack">cookiecutter-snekpack</a>, which
I can use to set up a new project in mere minutes.</p>

<p>Along the way, I realized that the archives in Zenodo and Figshare were a
combination of TAR and ZIP archives, each with many CSV files inside. In Python,
TAR and ZIP archives have lots of weird quirks, even though they mostly do the
same thing. However, rather than addressing those issues in
<code class="language-plaintext highlighter-rouge">opencitations-client</code>, it made more sense to add utility functions in PyStow in
<a href="https://github.com/cthoyt/pystow/pull/125">cthoyt/pystow#125</a> (tar and zip
archive iteration), which I was much better able to test in the PyStow repository.</p>
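<p>The kind of utility this refers to can be sketched with the standard library
alone. The following is an illustrative sketch, not PyStow’s actual API; the
function names are hypothetical:</p>

<pre><code class="language-python">import csv
import io
import tarfile
import zipfile


def iter_csv_rows_from_zip(source):
    """Yield parsed CSV rows from every .csv file inside a ZIP archive.

    Hypothetical sketch; the real PyStow utilities differ in naming and scope.
    """
    with zipfile.ZipFile(source) as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                with zf.open(name) as raw:
                    yield from csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))


def iter_csv_rows_from_tar(path):
    """Same idea for TAR archives, which have a subtly different API."""
    with tarfile.open(path) as tf:
        for member in tf.getmembers():
            if member.name.endswith(".csv"):
                raw = tf.extractfile(member)
                if raw is not None:  # skip directories and special members
                    yield from csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
</code></pre>

<p>Having one generator interface over both archive types is what lets the
downstream processing code ignore how a given dump happens to be packaged.</p>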

<p>A key functionality of OpenCitations is to implement graph-like queries to find
incoming and outgoing citations. I considered several solutions for efficiently
caching and querying graph-like data including pickles and SQLite, but these
were respectively slow and disk inefficient. I found better solutions based on
NumPy’s memory maps and was surprised that I couldn’t find an implementation in
a popular package (e.g., SciPy). So, I had to decide where to put an
implementation of disk-based cached graph. I didn’t want to put it in
OpenCitations nor make a tiny package for just this one operation, so I decided
to expand the scope of PyStow and add it there in
<a href="https://github.com/cthoyt/pystow/pull/121">cthoyt/pystow#121</a>.</p>
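<p>To give a flavor of the query idea, here is a simplified, in-memory stand-in
using only the standard library (the names are illustrative, not the actual
PyStow API): a sorted edge list already supports fast adjacency lookup via
binary search.</p>

<pre><code class="language-python">from bisect import bisect_left


def build_edge_index(edges):
    """Sort (source, target) pairs so each source's edges are contiguous."""
    return sorted(edges)


def get_outgoing(index, source):
    """Binary-search for the first edge of ``source``, then scan forward."""
    start = bisect_left(index, (source,))
    targets = []
    for s, t in index[start:]:
        if s != source:
            break
        targets.append(t)
    return targets
</code></pre>

<p>The memory-map-based approach takes the same idea but keeps the sorted
arrays on disk, so startup is instant and memory usage stays low even for a
citation graph with billions of edges.</p>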

<p>Finally, OpenCitations deals with a variety of identifier spaces including
first-party <a href="https://semantic.farm/omid">OpenCitations Metadata IDs (OMIDs)</a> and
<a href="https://semantic.farm/oci">OpenCitations Citation IDs (OCIs)</a> as well as
third-party identifiers from Wikidata, OpenAlex, PubMed, DOI, and others. I’ve
written the <a href="https://github.com/biopragmatics/curies"><code class="language-plaintext highlighter-rouge">curies</code></a> package to handle
identifiers in an explicit and transparent way. In the end, the
<code class="language-plaintext highlighter-rouge">opencitations-client</code> relies on several components from my ecosystem, and of
course, several more generic and popular packages. Here’s how the dependencies
look:</p>

<pre><code class="language-mermaid">flowchart LR
    opencitations-client -- depends on --&gt; figshare-client
    opencitations-client -- depends on --&gt; zenodo-client
    opencitations-client -- depends on --&gt; curies
    figshare-client -- depends on --&gt; pystow
    zenodo-client -- depends on --&gt; pystow
</code></pre>
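<p>The core idea behind handling identifiers as CURIEs can be sketched in a few
lines (a minimal, hypothetical sketch; the real
<code class="language-plaintext highlighter-rouge">curies</code> package adds prefix maps, validation, and URI expansion
on top of this):</p>

<pre><code class="language-python">def parse_curie(curie):
    """Split a CURIE into a (prefix, local unique identifier) pair.

    Minimal sketch only; not the ``curies`` package's actual implementation.
    """
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a valid CURIE: {curie}")
    return prefix, identifier
</code></pre>

<p>For example, <code class="language-plaintext highlighter-rouge">doi:10.1038/s41597-022-01807-3</code> splits into the
<code class="language-plaintext highlighter-rouge">doi</code> prefix and the local unique identifier that follows it.</p>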

<h2 id="demo">Demo</h2>

<p>It’s important for software packages to implement simple, top-level APIs that
cover 99% of use cases with reasonable defaults. Most use cases for
OpenCitations are to get incoming/outgoing citations for a DOI, PubMed
identifier, or OpenCitations identifier. Here’s how this looks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">curies</span> <span class="kn">import</span> <span class="n">Reference</span>
<span class="kn">from</span> <span class="nn">opencitations_client</span> <span class="kn">import</span> <span class="n">get_incoming_citations</span><span class="p">,</span> <span class="n">get_outgoing_citations</span>

<span class="c1"># a CURIE for the DOI for the Bioregistry paper
</span><span class="n">bioregistry_curie</span> <span class="o">=</span> <span class="s">"doi:10.1038/s41597-022-01807-3"</span>

<span class="c1"># who did the Bioregistry paper cite?
</span><span class="n">outgoing</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Reference</span><span class="p">]</span> <span class="o">=</span> <span class="n">get_outgoing_citations</span><span class="p">(</span><span class="n">bioregistry_curie</span><span class="p">)</span>

<span class="c1"># who cited the Bioregistry paper?
</span><span class="n">incoming</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Reference</span><span class="p">]</span> <span class="o">=</span> <span class="n">get_incoming_citations</span><span class="p">(</span><span class="n">bioregistry_curie</span><span class="p">)</span>
</code></pre></div></div>

<p>Importantly, each of these functions has a <code class="language-plaintext highlighter-rouge">backend</code> argument that defaults to
<code class="language-plaintext highlighter-rouge">api</code> and can be swapped to <code class="language-plaintext highlighter-rouge">local</code>. Because everything is built on software
that is smart about caching, loading, and data workflows, the first time
<code class="language-plaintext highlighter-rouge">backend='local'</code> is used, all processing happens automatically (warning: this takes
a few hours on a single core). These functions also have a <code class="language-plaintext highlighter-rouge">return_value</code> argument
that can be used to swap between principled <code class="language-plaintext highlighter-rouge">curies.Reference</code> data structures
that explicitly encode identifiers, simple string local unique identifiers that
match the input prefix, or full citation objects (only available through
OpenCitations API).</p>
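<p>The backend switch itself is a simple dispatch pattern. Here is a
hypothetical sketch with stub backends, to show the shape of the design rather
than the package’s actual internals:</p>

<pre><code class="language-python">def _incoming_via_api(curie):
    """Stub standing in for a rate-limited HTTP request."""
    return []


def _incoming_via_local(curie):
    """Stub standing in for a lookup against the local bulk cache."""
    return []


_BACKENDS = {"api": _incoming_via_api, "local": _incoming_via_local}


def get_incoming(curie, backend="api"):
    """Dispatch to the requested backend, defaulting to the API."""
    try:
        func = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend}") from None
    return func(curie)
</code></pre>

<p>Keeping the dispatch in one place means new backends can be added without
touching the public functions’ signatures.</p>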

<p>See the <code class="language-plaintext highlighter-rouge">opencitations-client</code> code on GitHub
(<a href="https://github.com/cthoyt/opencitations-client">https://github.com/cthoyt/opencitations-client</a>)
and documentation on ReadTheDocs
(<a href="https://opencitations-client.readthedocs.io">https://opencitations-client.readthedocs.io</a>).</p>

<hr />

<p>While I’ve been thinking about adding citations to the bibliographic components
of knowledge graph construction workflows for several years, I was finally
pushed to implement <code class="language-plaintext highlighter-rouge">opencitations-client</code> for the
<a href="https://catalaix.com">Catalaix project</a>, where we’re developing new methods for
recycling and reuse of (bio)plastics. I wanted to get all seventeen
laboratories’ publications, who they cited, and who cited them as a seed for
information extraction and curation. Here’s a small example of a citation
network from those queries:</p>

<pre><code class="language-mermaid">flowchart TD
    26802344["Mechanism-specific and whole-organism ecotoxicity of mono-rhamnolipids.
Blank (2016)"]
34492827["The Green toxicology approach: Insight towards the eco-toxicologically safe development of benign catalysts.
Herres-Pawlis (2021)"]
28779508["Highly Active N,O Zinc Guanidine Catalysts for the Ring-Opening Polymerization of Lactide.
Herres-Pawlis (2017)"]
33195133["Genetic Cell-Surface Modification for Optimized Foam Fractionation.
Blank (2020)"]
32974309["Integration of Genetic and Process Engineering for Optimized Rhamnolipid Production Using
Jupke, Blank (2020)"]
30811863["New Kids in Lactide Polymerization: Highly Active and Robust Iron Guanidine Complexes as Superior Catalysts.
Pich, Herres-Pawlis (2019)"]
30758389["Tuning a robust system: N,O zinc guanidine catalysts for the ROP of lactide.
Pich, Herres-Pawlis (2019)"]
28524364["Biofunctional Microgel-Based Fertilizers for Controlled Foliar Delivery of Nutrients to Plants.
Pich, Schwaneberg (2017)"]
34865895["A plea for the integration of Green Toxicology in sustainable bioeconomy strategies - Biosurfactants and microgel-based pesticide release systems as examples.
Pich, Blank, Schwaneberg (2022)"]
32449840["Robust Guanidine Metal Catalysts for the Ring-Opening Polymerization of Lactide under Industrially Relevant Conditions.
Herres-Pawlis (2020)"]
34492827 --&gt; 30811863
34492827 --&gt; 30758389
34492827 --&gt; 28779508
34492827 --&gt; 32449840
32974309 --&gt; 33195133
34865895 --&gt; 26802344
34865895 --&gt; 32974309
34865895 --&gt; 28524364
34865895 --&gt; 34492827
</code></pre>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="bibliometrics" /><category term="citations" /><category term="citation networks" /><summary type="html"><![CDATA[OpenCitations aggregates and deduplicates bibliographic information from CrossRef, Europe PubMed Central, and other sources to construct a comprehensive, open index of citations between scientific works. This post describes the opencitations-client package which wraps the OpenCitations API and implements an automated pipeline for locally downloading, caching, and accessing OpenCitations in bulk.]]></summary></entry><entry><title type="html">Challenges with Semantic Mappings</title><link href="https://cthoyt.com/2026/01/20/semantic-mapping-challenges.html" rel="alternate" type="text/html" title="Challenges with Semantic Mappings" /><published>2026-01-20T10:42:00+00:00</published><updated>2026-01-20T10:42:00+00:00</updated><id>https://cthoyt.com/2026/01/20/semantic-mapping-challenges</id><content type="html" xml:base="https://cthoyt.com/2026/01/20/semantic-mapping-challenges.html"><![CDATA[<p>There are many challenges associated with the curation, publication,
acquisition, and usage of semantic mappings. This post examines their
philosophical, technical, and practical implications, highlights existing
solutions, and describes opportunities for next steps for the community of
curators, semantic engineers, software developers, and data scientists who make
and use semantic mappings.</p>

<h3 id="proliferation-of-formats">Proliferation of Formats</h3>

<p>The first challenge with semantic mappings is the variety of forms they can
take. This includes both different data models and different serializations of
those models. This problem is effectively solved, but I think it is worth reviewing for
historical purposes (please let me know if I missed something):</p>

<p><img src="https://forge.extranet.logilab.fr/uploads/-/system/project/avatar/107/external-content.duckduckgo.com.jpeg" align="left" style="max-height: 3em;" alt="SKOS logo" />
<a href="https://www.w3.org/TR/skos-reference">Simple Knowledge Organization System (SKOS)</a>
is a data model for RDF to represent controlled vocabularies, taxonomies,
dictionaries, thesauri, and other semantic artifacts. It defines several
semantic mapping predicates including for broad matches, narrow matches, close
matches, related matches, and exact matches.</p>

<p><a href="https://gbv.github.io/jskos/#mapping">JSKOS (JSON for Knowledge Organization Systems)</a>,
is a JSON-based extension of the SKOS data model. I recently wrote a post about
converting between <a href="/2026/01/15/sssom-to-jskos.html">SSSOM and JSKOS</a>.</p>

<p><img src="https://www.jean-delahousse.net/wp-content/uploads/2020/09/Owl_logo-258x300.png" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="OWL logo" />
<a href="https://www.w3.org/TR/owl2-syntax/">Web Ontology Language (OWL)</a> is primarily
used for ontologies. It has first-class language support for encoding
equivalences between classes, properties, or individuals. Other semantic
mappings can be encoded as annotation properties on classes, properties, or
individuals, e.g., using SKOS predicates.</p>

<p><img src="https://obofoundry.org/images/foundrylogo.png" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="OBO logo" />
The
<a href="https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html">OBO Flat File Format</a>
is a simplified version of OWL with macros most useful for curating biomedical
ontologies. It has the same abilities as OWL, but adds the <code class="language-plaintext highlighter-rouge">xref</code> macro,
which corresponds to <code class="language-plaintext highlighter-rouge">oboInOwl:hasDbXref</code> relations; these are by nature imprecise and
therefore used in a variety of ways.</p>

<p><img src="https://avatars.githubusercontent.com/u/77892844?v=4" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="SSSOM logo" />
The
<a href="https://mapping-commons.github.io/sssom/">Simple Standard for Sharing Ontological Mappings (SSSOM)</a>
is a fit-for-purpose format for semantic mappings between classes, properties,
or individuals. SSSOM guides curators towards inputting key metadata that are
typically missing from other formalisms and is gaining wider community adoption.
Importantly, SSSOM integrates into ontology curation workflows, especially for
<a href="https://incatools.github.io/ontology-development-kit">Ontology Development Kit (ODK)</a>
users.</p>

<p>The
<a href="https://moex.gitlabpages.inria.fr/alignapi/edoal.html">Expressive and Declarative Ontology Alignment Language (EDOAL)</a>
lives in a similar space to SSSOM, but IMO is much less approachable (cf.
XML + Java), and has not seen much traction in the biomedical space.</p>

<p><img src="https://ontoportal.org/images/logo.png" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="OntoPortal logo" />
<a href="https://ontoportal.org/">OntoPortal</a> has its own data model for semantic
mappings that has low metadata precision. I recently wrote a post on converting
<a href="/2025/11/23/sssom-from-bioportal.html">OntoPortal to SSSOM</a>. OntoPortal would also like
to invest more in SSSOM infrastructure if it can organize funding and human resources.</p>

<p><img src="https://upload.wikimedia.org/wikipedia/commons/6/66/Wikidata-logo-en.svg" align="left" style="max-height: 3em" alt="Wikidata logo" />
<a href="https://www.wikidata.org">Wikidata</a> has its own data model for semantic
mappings that include higher precision metadata. I recently wrote a post on
mapping between the data models from <a href="/2026/01/08/sssom-to-wikidata.html">SSSOM and
Wikidata</a>.</p>

<p>Finally, there’s a long tail of mappings that live in poorly annotated CSV, TSV,
Excel, and other formats. Similarly, mappings can live in plain RDF files, e.g.,
encoded with SKOS predicates, but without high precision metadata.</p>

<h3 id="scattered-partially-overlapping-and-incomplete">Scattered, Partially Overlapping, and Incomplete</h3>

<p>Semantic mappings are not centralized, meaning that multiple sources of semantic
mappings often need to be integrated to map between two semantic spaces. Even
then, these integrated mappings are often incomplete. Using
<a href="https://semantic.farm/mesh">Medical Subject Headings (MeSH)</a> and the
<a href="https://semantic.farm/hpo">Human Phenotype Ontology (HPO)</a> as an example, we
can see the following:</p>

<ol>
  <li>MeSH doesn’t maintain any mappings to HPO.</li>
  <li>HPO maintains some mappings as primary mappings.</li>
  <li>The <a href="https://semantic.farm/umls">Unified Medical Language System (UMLS)</a>
maintains some mappings as secondary mappings. HPO suggests using UMLS as a
supplementary mapping resource.</li>
  <li><a href="https://github.com/biopragmatics/biomappings">Biomappings</a> maintains some
community-curated mappings as secondary mappings.</li>
</ol>

<p><a href="https://github.com/biopragmatics/semra/blob/main/notebooks/umls-inference-analysis.ipynb"><img src="/img/mappings-are-hard/scattered.png" alt="" /></a></p>

<p>This actually might not be the best example - it would have been better to show
a pair of resources that both partially map to the other. When I first made this
chart, I had to engineer the UMLS inference by hand. Eventually, the need to
generalize this workflow led to the development of the
<a href="https://github.com/biopragmatics/semra">Semantic Mapping Reasoner and Assembler (SeMRA)</a>
Python package which does this automatically and at scale. The fact that there
were missing mappings that even UMLS inference couldn’t retrieve led to
establishing the <a href="https://github.com/biopragmatics/biomappings">Biomappings</a>
project for prediction and semi-automated curation of semantic mappings. The
underlying technology stack from Biomappings eventually got spun out to
<a href="https://github.com/cthoyt/sssom-curator">SSSOM Curator</a> and is now fully
domain-agnostic.</p>

<h3 id="different-precision-or-conflicts">Different Precision or Conflicts</h3>

<p>Another challenge with semantic mappings is when different resources have
different levels of precision. In the example below, OrphaNet uses low-precision
mapping predicates (i.e., <code class="language-plaintext highlighter-rouge">oboInOwl:hasDbXref</code>) while MONDO uses high-precision
mapping predicates (i.e., <code class="language-plaintext highlighter-rouge">skos:exactMatch</code>). It makes sense to take the highest
quality mapping in this situation, but having a coherent software stack to do
this at scale was the big challenge (solved by SeMRA).</p>

<p><a href="https://docs.google.com/drawings/d/1jBK1-FxzfsBFd6Ro0YjQSvwJCZs1rqlLQq9FdtcEU-w/edit?usp=sharing"><img src="/img/mappings-are-hard/precision.svg" alt="" /></a></p>
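<p>A toy version of “take the highest-precision mapping” might look like the
following. The precedence numbers and identifiers here are illustrative only;
SeMRA’s actual confidence scheme is richer:</p>

<pre><code class="language-python"># Illustrative precedence: a higher number means a more precise predicate.
PREDICATE_PRECISION = {
    "oboInOwl:hasDbXref": 1,
    "skos:closeMatch": 2,
    "skos:exactMatch": 3,
}


def most_precise(mappings):
    """Pick the (subject, predicate, object) triple with the most precise
    predicate among candidates relating the same pair of terms."""
    return max(mappings, key=lambda m: PREDICATE_PRECISION.get(m[1], 0))
</code></pre>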

<p>This can get a bit dicier when there might be conflicting information, for
example, if one resource says exact match and another says broader match. In
SeMRA, I devised a confidence assessment scheme (which should get its own post
later).</p>

<h3 id="common-conflations">Common Conflations</h3>

<p>I want to highlight three flavors of conflation that make curating and
reviewing semantic mappings difficult.</p>

<h4 id="different-ontology-encodings">Different Ontology Encodings</h4>

<p>Classes, instances, and properties are mutually exclusive by design. This means
that any semantic mappings between them are nonsense, but there are many
situations where these mappings might get produced by an automated system or by
a curator who is less knowledgeable about the ontology aspect of semantic
mappings. There’s also a much more subtle discussion about classes, instances,
and metaclasses (see
<a href="https://github.com/OBOFoundry/OBOFoundry.github.io/issues/2454">this discussion</a>)
that I would set aside.</p>

<p>As a concrete example, the
<a href="https://semantic.farm/registry/iao">Information Artifact Ontology (IAO)</a> has a
class that represents the section of a document that contains its abstract:
<a href="http://purl.obolibrary.org/obo/IAO_0000315">abstract (IAO:0000315)</a>. Schema.org
has an annotation property whose domain is a creative work and whose range is
the text of the abstract itself: <a href="http://schema.org/abstract">schema:abstract</a>.
These both have the same label <code class="language-plaintext highlighter-rouge">abstract</code>, which means that it’s possible to
conflate them (i.e., accidentally map one to the other).</p>
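<p>A first line of defense is to check that both sides of a mapping are the same kind of OWL entity before accepting it. The sketch below hard-codes the entity kinds as a stand-in for querying the ontologies themselves.</p>

```python
# Hypothetical lookup table standing in for a query against the ontologies.
ENTITY_KIND = {
    "IAO:0000315": "class",
    "schema:abstract": "annotation_property",
}


def is_coherent(subject: str, obj: str) -> bool:
    """A mapping is only coherent if both sides are the same kind of entity."""
    return ENTITY_KIND[subject] == ENTITY_KIND[obj]


# The two "abstract" terms share a label but cross entity kinds,
# so a mapping between them should be rejected.
assert not is_coherent("IAO:0000315", "schema:abstract")
```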

<h4 id="different-entity-types">Different Entity Types</h4>

<p>The second kind of conflation is even more subtle, when two classes, instances,
or properties come from similar but distinct hierarchies.</p>

<p>For example, there’s a subtle difference between what is a phenotype and what is
a disease. Ontologies are well suited to encoding this subtlety with <em>axioms</em>
that can then be used by reasoners. This can become a problem for curating and
reviewing semantic mappings because some diseases are named after the phenotypes
that they present or that cause them. Using MeSH’s disease hierarchy and HPO’s
phenotype hierarchy as an example, we can see that
<a href="https://semantic.farm/mesh:D000069856">Staghorn Calculi (mesh:D000069856)</a> and
<a href="https://semantic.farm/hp:0033591">Staghorn calculus (HP:0033591)</a> should not
get mapped.</p>

<p>Many more examples can be produced (which also show there are even more
subtleties here) using SSSOM Curator with the command:
<code class="language-plaintext highlighter-rouge">sssom_curator predict lexical doid hp</code>. See the
<a href="https://sssom-curator.readthedocs.io/en/latest/projects.html#making-predictions">SSSOM Curator documentation</a>
for more information on the lexical matching workflow.</p>
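<p>Such conflations can also be pre-filtered automatically. Below is a deliberately coarse sketch that rejects candidate mappings whose vocabularies curate different kinds of entities; the prefix-to-type table is an assumption for illustration, not how SSSOM Curator actually works.</p>

```python
# Coarse, illustrative assignment of entity types to vocabulary prefixes.
PREFIX_TYPE = {"mesh": "disease", "doid": "disease", "hp": "phenotype"}


def compatible(subject_curie: str, object_curie: str) -> bool:
    """Check whether two CURIEs come from vocabularies of the same entity type."""
    subject_prefix = subject_curie.split(":", 1)[0].lower()
    object_prefix = object_curie.split(":", 1)[0].lower()
    return PREFIX_TYPE[subject_prefix] == PREFIX_TYPE[object_prefix]


# "Staghorn Calculi" appears in both hierarchies, but one is a disease
# and the other is a phenotype, so the pair is flagged.
assert not compatible("mesh:D000069856", "hp:0033591")
```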

<h4 id="different-senses">Different Senses</h4>

<p>The <a href="https://basic-formal-ontology.org">basic formal ontology (BFO)</a> is an
upper-level ontology that is used by many ontologies, including almost the
entire <a href="https://obofoundry.org">Open Biomedical Ontologies (OBO) Foundry</a>.
However, as Chris Mungall described in his blog post,
<a href="https://douroucouli.wordpress.com/2022/08/10/shadow-concepts-considered-harmful/">Shadow Concepts Considered Harmful</a>,
there are many different senses in which an entity can be described, each
falling under a different, mutually exclusive branch of BFO. The figure below,
from Chris’s post, represents different senses in which a human heart can be
described:</p>

<p><a href="https://douroucouli.wordpress.com/2022/08/10/shadow-concepts-considered-harmful/"><img src="/img/mappings-are-hard/mungalls-ontology-design-guidelines-12.png" alt="" /></a></p>

<p>This problem is particularly bad in disease modeling. Here are only a few
examples (of many more) that illustrate this:</p>

<ul>
  <li>the <a href="https://semantic.farm/ogms">Ontology for General Medical Science (OGMS)</a>
term for
<a href="http://purl.obolibrary.org/obo/OGMS_0000031">disease (OGMS:0000031)</a>, the
<a href="https://semantic.farm/efo">Experimental Factor Ontology (EFO)</a> term for
<a href="http://www.ebi.ac.uk/efo/EFO_0000408">disease (EFO:0000408)</a>, and the
<a href="https://semantic.farm/mondo">Monarch Disease Ontology (MONDO)</a> term for
<a href="http://purl.obolibrary.org/obo/MONDO_0000001">disease (MONDO:0000001)</a> are each a
<a href="http://purl.obolibrary.org/obo/BFO_0000016">disposition (BFO:0000016)</a></li>
  <li>the
<a href="https://semantic.farm/gsso">Gender, Sex, and Sexual Orientation Ontology (GSSO)</a>
term for <a href="http://purl.obolibrary.org/obo/GSSO_000486">disease (GSSO:000486)</a>
is a <a href="http://purl.obolibrary.org/obo/BFO_0000015">process (BFO:0000015)</a></li>
  <li>the <a href="https://semantic.farm/doid">Human Disease Ontology (DOID)</a> informally
mentions that a disease is a disposition, but doesn’t make an ontological
commitment to BFO</li>
  <li>many more controlled vocabularies, including NCIT, SNOMED-CT, and MI, have
their own terms for diseases but don’t use BFO as an upper-level ontology, nor
are they constructed in a way conducive to integration with other ontologies</li>
</ul>

<p>Schultz <em>et al.</em> (2011) proposed a way to formalize the connections between the
various senses for diseases in
<a href="https://link.springer.com/article/10.1186/2041-1480-2-S2-S6">Scalable representations of diseases in biomedical ontologies</a>.
However, the OBO community has yet to resolve the
<a href="https://github.com/OBOFoundry/COB/pull/226">long and taxing discussion</a> on how
to standardize disease modeling practices.</p>

<p>For semantic mappings, this becomes a problem because a reasoner will explode if
diseases under two different BFO branches get marked as equivalent, because the
BFO upper-level terms are marked as disjoint - this is a feature, not a bug.
However, while these modeling choices are useful for creating carefully
constructed, logically (self-)consistent descriptions of diseases, they can be
confusing when curating or reviewing mappings. They might not be so important in
downstream applications, such as assembling a knowledge graph to support graph
machine learning, where many different knowledge sources with lower levels of
accuracy and precision must be merged. In practice, I have merged triples using
conflicting senses for diseases in a useful way, without issue.</p>

<h3 id="interpretation-is-important">Interpretation is Important</h3>

<p>While the last few examples were cautionary tales for when things (probably)
shouldn’t be mapped, the next examples are about when things (probably) should
be mapped.</p>

<h4 id="definitions">Definitions</h4>

<p>Here are three vocabularies’ terms for proteins and their textual definitions
(though, many more contain their own term for proteins):</p>

<table>
  <thead>
    <tr>
      <th>Entity</th>
      <th>Label</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://www.wikidata.org/wiki/Q8054">wikidata:Q8054</a></td>
      <td>protein</td>
      <td>biomolecule or biomolecule complex largely consisting of chains of amino acid residues</td>
    </tr>
    <tr>
      <td><a href="http://semanticscience.org/resource/SIO_010043">SIO:010043</a></td>
      <td>protein</td>
      <td>A protein is an organic polymer that is composed of one or more linear polymers of amino acids.</td>
    </tr>
    <tr>
      <td><a href="http://purl.obolibrary.org/obo/PR_000000001">PR:000000001</a></td>
      <td>protein</td>
      <td>An amino acid chain that is canonically produced <em>de novo</em> by ribosome-mediated translation of a genetically-encoded mRNA, and any derivatives thereof.</td>
    </tr>
  </tbody>
</table>

<p>As semantic mapping curators, we have two options:</p>

<ol>
  <li>We can reasonably assume that all three resources intended to represent the
same thing, despite the definitions being quite different. This assumption
builds on our prior knowledge about what a protein is and why Wikidata, SIO,
and PR exist, from which we can infer the intent of each definition’s author</li>
  <li>We can make a very literal reading of each definition and conclude that these
three terms represent very different things</li>
</ol>

<p>I think that the latter is really unconstructive for several reasons, though I
have worked with colleagues, especially those with a linguistics background, who
take this approach. First, it is unconstructive because it means you’ll probably
never map anything.</p>

<p>Second, if you want to be rigorous, use an ontology formalism with proper
logical definitions. For example, the
<a href="https://semantic.farm/cl">Cell Ontology (CL)</a> exhaustively defines its cells
using appropriate logical axioms. However, this also has a caveat: to make
mappings based on logical definitions, the different modelers have to agree
on the same axioms and the same modeling paradigm. As far as I know, any groups
that use the same modeling paradigm have already combined forces to work on the
same resource. So we’re stuck back at option 1 either way :)</p>

<h4 id="context-sometimes-matters">Context Sometimes Matters</h4>

<p>In contrast to the discussion about mapping phenotypes and diseases, there are
context-dependent reasons to make semantic mappings, which can be illustrated in
biomedicine using genes and proteins. Let’s start with some definitions:</p>

<ol>
  <li><a href="http://purl.obolibrary.org/obo/SO_0000704">SO:0000704</a> A gene is a region of
a chromosome that encodes a transcript</li>
  <li><a href="http://purl.obolibrary.org/obo/PR_000000001">PR:000000001</a> A protein is a
chain of amino acids</li>
</ol>

<p>The biomedical literature often uses gene symbols to discuss the proteins they
encode. While this isn’t precise, it’s still useful in many cases. For example,
when reading the COVID-19 literature, you will likely see discussion of the
IL6-STAT cascade, where IL6 is the HGNC gene symbol for the Interleukin 6
protein. Most of the time, the HGNC-approved gene symbol is an initialism or
other abbreviation of the protein’s name, but this isn’t always the case.</p>

<p>Edit: Sue Bello pointed out that most journals require gene names to be set in
italics (<em>IL6</em>) and protein names without italics, though this requires the
author and reader to know that distinction, as well as for formatting to be
preserved, which it often isn’t unless you’re reading the original PDF or the
publisher’s HTML.</p>

<p>Similar to the literature, many pathway databases that accumulate knowledge
about the processes and reactions in which proteins take part actually use gene
symbols (or other gene identifiers) to curate proteins.</p>

<p><img src="/img/mappings-are-hard/context-dependent.svg" alt="" /></p>

<p>The take-home message here is that genes and proteins are indeed not the same
thing, but in some contexts, it’s useful to map between them. There’s also a
compromise - the <a href="https://semantic.farm/ro">Relation Ontology (RO)</a> has a
predicate <a href="https://semantic.farm/RO:0002205">has gene product (RO:0002205)</a> that
explicitly models the relationship between IL6 and Interleukin 6, from which a
less precise mapping can be automatically inferred for certain scenarios
(SeMRA implements this).</p>
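<p>Here’s a minimal sketch of that relaxation: when a use case tolerates conflating genes with their products, “has gene product” links can be turned into context-specific mapping pairs. The identifiers are my own illustration of the IL6 example (HGNC:6018 for the gene, UniProt P05231 for the protein); SeMRA’s actual implementation is more involved.</p>

```python
# One illustrative triple linking the IL6 gene to its protein product.
TRIPLES = [
    ("hgnc:6018", "RO:0002205", "uniprot:P05231"),  # IL6 has gene product Interleukin 6
]


def relax(triples, allow_gene_protein_conflation: bool):
    """Yield mapping pairs, optionally treating gene/product links as mappings."""
    for subject, predicate, obj in triples:
        if predicate == "RO:0002205" and allow_gene_protein_conflation:
            yield subject, obj


# In a context where genes and proteins may be conflated, the link becomes a mapping.
assert list(relax(TRIPLES, True)) == [("hgnc:6018", "uniprot:P05231")]
# In a strict context, no mapping is produced.
assert list(relax(TRIPLES, False)) == []
```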

<p>Outside of biomedicine, I have also heard that context-specific mappings are
very important in the digital humanities. As I better understand the use cases
of colleagues in other NFDI consortia that focus on the digital humanities, I
will try to update this section with alternate perspectives.</p>

<h3 id="evidence">Evidence</h3>

<p>A key challenge that motivated the development of SSSOM as a standard was to
associate high-quality metadata with semantic mappings, such as the reason the
mapping was produced (e.g., manual curation, lexical matching, structural
matching), who produced it (e.g., a person, algorithm, agent), when, how, and
more.</p>

<p><a href="https://docs.google.com/drawings/d/1rBofcaQxBFuYX0OzhCvBkigSNFWLclAbQ_X7zG7PRKA/edit?usp=sharing"><img src="/img/mappings-are-hard/evidence.svg" alt="" /></a></p>

<p>We developed the
<a href="https://semantic.farm/registry/semapv">Semantic Mapping Vocabulary (semapv)</a> to
encode different kinds of evidence, such as manual curation of mappings,
lexical matching, and structural matching. SSSOM is well suited to capturing
simple evidence (blue in the figure above).</p>

<h4 id="provenance-for-inferences">Provenance for Inferences</h4>

<p>The purple evidence from the figure in the last section requires a more detailed
data model to represent provenance for inferred semantic mappings, which simply
doesn’t fit in the SSSOM paradigm (and shouldn’t be hacked in, either). I
proposed a more detailed data model for capturing how inference is done in
<a href="https://doi.org/10.1093/bioinformatics/btaf542">Assembly and reasoning over semantic mappings at scale for biomedical data integration</a>
and provided a reference implementation in the
<a href="https://github.com/biopragmatics/semra">Semantic Mapping Reasoner and Assembler (SeMRA)</a>
Python software package. Here’s what that data model looks like, which also has
a Neo4j counterpart:</p>

<p><a href="https://docs.google.com/drawings/d/1C5l1UmwKohMsgprSXRK6Lo2egLsRWhXPIfoVo09tJ9I/edit?usp=sharing"><img src="/img/mappings-are-hard/semra-data-model.svg" alt="" /></a></p>

<h3 id="negative-semantic-mappings">Negative Semantic Mappings</h3>

<p>SSSOM also has first-class support for encoding <em>negative</em> relationships,
meaning that the following can be represented:</p>

<p><a href="https://docs.google.com/drawings/d/1AfCR35ra3FyQMulaTlynVZKswLbj8gp5MA4N2ipFe1I/edit?usp=sharing"><img src="/img/mappings-are-hard/negatives.svg" alt="" /></a></p>

<p>This means that SSSOM curators can keep track of non-trivial negative mappings,
e.g., when curating the results of semantic mapping prediction or automated
inference. In a semi-automated curation loop, this allows us to avoid
re-reviewing <a href="https://doi.org/10.32388/DYZ5J3">zombie mappings</a> over and over
again.</p>
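<p>In code, avoiding zombie mappings amounts to a set-difference against the curated negatives before anything reaches a reviewer. The data structures below are illustrative, not SSSOM Curator’s internals.</p>

```python
# Previously rejected mapping, stored as a curated negative.
curated_negatives = {("mesh:D000069856", "hp:0033591")}


def needs_review(predictions, negatives):
    """Keep only predictions that were not already rejected by a curator."""
    return [pair for pair in predictions if pair not in negatives]


predictions = [
    ("mesh:D000069856", "hp:0033591"),  # a zombie: already rejected once
    ("mesh:C000089", "CHEBI:28646"),    # genuinely new, goes to review
]
assert needs_review(predictions, curated_negatives) == [("mesh:C000089", "CHEBI:28646")]
```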

<p>High-quality, non-trivial negative mappings also enable more accurate machine
learning than negative sampling does. For example, we have been working on
graph machine learning-based ontology matching and merging
using <a href="https://github.com/pykeen/pykeen/">PyKEEN</a> (a graph machine learning
package I helped develop and maintain).</p>

<p>An open challenge is that we have support neither from data modeling formalisms
(e.g., ontologies in OWL, knowledge graphs in RDF or Neo4j) nor from tooling to
encode negative knowledge (in this case, negative mappings). This means that
when we output SSSOM to RDF, we use our own formalism, which won’t be correctly
recognized by any other tooling that wasn’t developed with SSSOM in mind. I’m
keeping notes about this in a separate <a href="/2025/10/07/negative-rdf.html">post about negative
knowledge</a> that I update periodically.</p>

<hr />

<p>Despite the challenges, I think that the mapping world is actually getting quite
mature. I am currently working with NFDI and RDA colleagues to further unify the
SSSOM and JSKOS worlds, especially given that the
<a href="https://coli-conc.gbv.de/cocoda/">Cocoda</a> mapping curation tool solved many of
these problems (from the digital humanities perspective) many years ago, and we
simply were unaware of it.</p>

<p>I hope this post can continue as a living document - if I missed something,
please let me know and I will update the post to include it!</p>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="SSSOM" /><category term="semantic mappings" /><category term="knowledge graphs" /><summary type="html"><![CDATA[There are many challenges associated with the curation, publication, acquisition, and usage of semantic mappings. This post examines their philosophical, technical, and practical implications, highlights existing solutions, and describes opportunities for next steps for the community of curators, semantic engineers, software developers, and data scientists who make and use semantic mappings.]]></summary></entry><entry><title type="html">Semantic Mappings Enable Automated Assembly</title><link href="https://cthoyt.com/2026/01/16/mappings-for-automated-assembly.html" rel="alternate" type="text/html" title="Semantic Mappings Enable Automated Assembly" /><published>2026-01-16T10:42:00+00:00</published><updated>2026-01-16T10:42:00+00:00</updated><id>https://cthoyt.com/2026/01/16/mappings-for-automated-assembly</id><content type="html" xml:base="https://cthoyt.com/2026/01/16/mappings-for-automated-assembly.html"><![CDATA[<p>Data and knowledge originating from heterogeneous sources often use
heterogeneous controlled vocabularies and/or ontologies for annotating named
entities. Semantic mappings are essential for resolving these discrepancies
and integrating data in a coherent way. This post highlights how this looks in two
scenarios: when constructing a knowledge graph for graph machine learning and
when constructing a comprehensive lexicon for natural language processing, text
mining, and curation.</p>

<h2 id="background">Background</h2>

<p>Data and knowledge integration are challenging because there exist many
controlled vocabularies, ontologies, taxonomies, thesauri, classifications, and
other resources that mint identifiers with some degree of conceptual overlap,
redundancies, and discrepancies.</p>

<h3 id="problem-statement">Problem Statement</h3>

<p>When integrating data and knowledge from heterogeneous sources that refer to the
same concepts using different identifiers, we get non-trivial duplications and
missing connections in our results.</p>

<p>For example, if we constructed a knowledge graph by combining the
<a href="https://ctdbase.org">Comparative Toxicogenomics Database (CTD)</a> and the
<a href="https://semantic.farm/mondo">Monarch Disease Ontology (MONDO)</a>, we would get
disconnected mechanisms describing how
<a href="https://ctdbase.org/detail.go?type=chem&amp;acc=C003402">sapropterin</a> is used to
treat <a href="https://semantic.farm/MONDO:0009861">phenylketonuria</a> because the CTD
uses the <a href="https://semantic.farm/mesh">Medical Subject Headings (MeSH)</a> to
describe genes/proteins and MONDO uses the
<a href="https://semantic.farm/hgnc">HUGO Gene Nomenclature Committee (HGNC)</a>.</p>

<pre><code class="language-mermaid">flowchart LR
    a["sapropterin (mesh:C003402)"] -- activates --&gt; b["Phenylalanine Hydroxylase (mesh:D010651)"]
    c["phenylalanine hydroxylase (HGNC:8582)"] -- decreases --&gt; d["phenylketonuria (MONDO:0009861)"]
    b -.-|missing connection, these should be collapsed together| c
</code></pre>

<p>As a consequence, these redundancies lead to inaccurate results, for example,
when making queries between drugs and diseases or when using machine learning
algorithms to make predictions for new edges.</p>

<h3 id="causes">Causes</h3>

<p>I roughly classify these redundancies into three bins (from left to right in the
figure): resources covering a similar domain, resources covering hierarchically
related domains, and resources that are not specific to any domain. Below, I’ll
give some concrete examples from the life sciences to illustrate.</p>

<p><img src="/img/mappings-automated-assembly/overlaps.svg" alt="" /></p>

<p>In chemistry, there are dozens of resources that assign identifiers to small
molecules. Many have been constructed with a unique scope or purpose, such as
MetaboLights for metabolites, SwissLipids for lipids, and DrugBank for drugs.
Some have similar scope and purpose, but have been constructed in parallel for
scientific modeling reasons, such as different disease ontologies modeling
diseases with different parts of the Basic Formal Ontology (BFO). Some have
similar scope and purpose, but have been constructed in parallel for
non-scientific reasons, such as PubChem and ChEMBL for small molecules with
assay information. As an aside, building resources in an open and collaborative
manner can help reduce proliferation, with the (major) caveat that such efforts
don’t so easily satisfy funding bodies or the requirements for career
progression.</p>

<p>In medicine and epidemiology, there are many resources describing diseases,
transmission, response, adverse outcomes, and other facets. Particularly during
the COVID-19 pandemic, many independent controlled vocabularies were constructed
to model information at various levels of specificity. The figure shows the
hierarchical relationships between the
<a href="https://semantic.farm/doid">Disease Ontology (DOID)</a>, the
<a href="https://semantic.farm/ido">Infectious Disease Ontology (IDO)</a>, the
<a href="https://semantic.farm/vido">Viral Infectious Disease Ontology (VIDO)</a>, the
<a href="https://semantic.farm/cido">Coronavirus Infectious Disease Ontology (CIDO)</a>,
and the
<a href="https://semantic.farm/idocovid19">COVID-19 Infectious Disease Ontology (IDOCOVID19)</a>.
When terms are effectively reused (as OBO Foundry ontologies often do), this
doesn’t create an issue, but in practice, many resources do not reuse terms for
various reasons.</p>

<p>In the life sciences, several controlled vocabularies cover a large number of
domains, such as the
<a href="https://semantic.farm/mesh">Medical Subject Headings (MeSH)</a>,
<a href="https://semantic.farm/ncit">National Cancer Institute Thesaurus (NCIT)</a>, and
<a href="https://semantic.farm/umls">Unified Medical Language System (UMLS)</a>. While they
give good coverage across many domains, these resources are often not detailed
or precise enough, nor are they curated as ontologies. Therefore, many
controlled vocabularies use terms from these resources as a base and curate
further. However, this causes redundancy, and in many cases, the deriving group
does not correctly cross-reference back to the source. The
<a href="https://semantic.farm/registry/omit">Ontology for MicroRNA Target (OMIT)</a> even
imported the entirety of MeSH, but didn’t make any cross-references back to the
source, creating even more redundancy.</p>

<p>If you were wondering why for each domain, we couldn’t just have a single
resource, then please have a look at
<a href="https://xkcd.com/927">https://xkcd.com/927</a> :) Though, some resources that have
been around for a long time basically have a monopoly. For example, nobody in
their right mind in 2026 would start their own protein database to compete with
<a href="https://uniprot.org">UniProt</a>.</p>

<h2 id="assembly">Assembly</h2>

<p>I want to highlight two groups for whom resolving redundancy has a high impact,
but not necessarily high visibility. The first group is data scientists who
consume knowledge graphs, for example, for graph machine learning. This group is
often unaware of how graphs were constructed (see: any graph machine learning
literature since 2013 that blindly uses FB15k and WN18).</p>

<p>The second group is curators, who want to use a combination of terminology
services like the <a href="https://www.ebi.ac.uk/ols4/">Ontology Lookup Service (OLS)</a>
and text mining tools to annotate the literature with controlled vocabulary
terms. Curators don’t want to (and shouldn’t have to) understand the landscape
of related controlled vocabularies for their domain; they should be able to rely
on terminology services and text mining tools to find <em>any</em> appropriate
term for their curation.</p>

<p>The important point is that software should solve the problem of redundancy, and
it needs to do so by consuming semantic mappings that bridge the gap illustrated
above in the phenylketonuria example.</p>

<p><img src="/img/mappings-automated-assembly/overlaps-deal-with-it.svg" alt="" /></p>

<p>This leads to the main goal of the post, which is to describe two high-level
workflows that use semantic mappings to resolve redundancies and discrepancies
when integrating data and knowledge. This post isn’t about where semantic
mappings come from - see my other posts on SSSOM, JSKOS, and SeMRA for more
background on that.</p>

<h3 id="knowledge-graph-assembly">Knowledge Graph Assembly</h3>

<p>Resolving redundancies when constructing a knowledge graph means standardizing
the subjects, predicates, and objects in triples. For example, if we have
knowledge about ethanol from multiple sources and some identify it using the
ChEBI identifier <code class="language-plaintext highlighter-rouge">16236</code> while others identify it using the DrugBank identifier
<code class="language-plaintext highlighter-rouge">DB00898</code>, we will have a similar issue to the phenylketonuria example. If we
have semantic mappings denoting that <code class="language-plaintext highlighter-rouge">CHEBI:16236</code> and <code class="language-plaintext highlighter-rouge">drugbank:DB00898</code> are
equivalent, as well as the rule that ChEBI identifiers take precedence over
DrugBank identifiers, then we can map the triples from the resource that uses
DrugBank as described in the figure below:</p>

<p><img src="/img/mappings-automated-assembly/knowledge-assembly.svg" alt="" /></p>

<p>Let’s take the ChEBI Ontology and DrugBank pharmacological data as two examples
that both annotate chemical roles. Here are a few scenarios for a given DrugBank
entry (assuming they both use the same <code class="language-plaintext highlighter-rouge">rdfs:subClassOf</code> relationship):</p>

<ol>
  <li>There are no semantic mappings linking it to a ChEBI entry. In this case, the
subject doesn’t need to be mapped.</li>
  <li>There is a semantic mapping linking it to a ChEBI entry, but there’s no
semantic mapping linking the object to a ChEBI entry. For example, DrugBank
annotates ethanol as
<a href="https://go.drugbank.com/categories/DBCAT003935">Agents Causing Muscle Toxicity (drugbank.category:DBCAT003935)</a>.
Therefore, the subject is mapped but the object is retained.</li>
  <li>There is a semantic mapping linking it to a ChEBI entry and a semantic
mapping linking the object to a ChEBI entry. For example, ChEBI annotates
ethanol as a
<a href="https://semantic.farm/CHEBI:60643">NMDA receptor antagonist (CHEBI:60643)</a>
and DrugBank annotates ethanol as a
<a href="https://go.drugbank.com/categories/DBCAT002723">NMDA Receptor Antagonists (drugbank.category:DBCAT002723)</a>.
In this case, the DrugBank triple fully duplicates the ChEBI one.
However, it may have valuable metadata worth keeping.</li>
</ol>
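<p>The scenarios above boil down to remapping each subject and object through an equivalence index that encodes the priority rule (ChEBI over DrugBank), leaving unmapped terms untouched. A minimal sketch, using the ethanol example (DrugBank’s identifier for ethanol is DB00898):</p>

```python
# Equivalence index oriented by the priority rule: DrugBank terms point to
# their preferred ChEBI equivalents.
TO_PRIORITY = {
    "drugbank:DB00898": "CHEBI:16236",               # ethanol
    "drugbank.category:DBCAT002723": "CHEBI:60643",  # NMDA receptor antagonist
}


def standardize(triple):
    """Remap the subject and object of a triple; keep unmapped terms as-is."""
    subject, predicate, obj = triple
    return TO_PRIORITY.get(subject, subject), predicate, TO_PRIORITY.get(obj, obj)


# Scenario 2: the subject is mapped, the object is retained.
assert standardize(
    ("drugbank:DB00898", "rdfs:subClassOf", "drugbank.category:DBCAT003935")
) == ("CHEBI:16236", "rdfs:subClassOf", "drugbank.category:DBCAT003935")

# Scenario 3: both are mapped, so the triple collapses onto ChEBI's.
assert standardize(
    ("drugbank:DB00898", "rdfs:subClassOf", "drugbank.category:DBCAT002723")
) == ("CHEBI:16236", "rdfs:subClassOf", "CHEBI:60643")
```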

<p>As another example, DrugBank annotates ethanol as a
<a href="https://go.drugbank.com/categories/DBCAT003232">Cytochrome P-450 CYP3A4 Inhibitors (drugbank.category:DBCAT003232)</a>.
There’s a corresponding term in ChEBI,
<a href="https://www.ebi.ac.uk/chebi/search?query=CYP3A4%20Inhibitors">EC 1.14.13.97 (taurochenodeoxycholate 6α-hydroxylase) inhibitor (CHEBI:86501)</a>,
but there isn’t a semantic mapping capturing this. This is a job for the
<a href="https://github.com/cthoyt/sssom-curator/">SSSOM Curator</a> software and a
semantic mapping repository like
<a href="https://github.com/biopragmatics/biomappings">Biomappings</a> to store it.</p>

<p>There are <em>many</em> examples of manually constructed workflows that do this
process. While these work (e.g., <a href="https://het.io">Hetionet</a> was one of the
best), they are brittle when expanding to new datasets and mappings, and often
hard to keep up-to-date. My goal has been to implement a fully generic and
automated version of this workflow, which I did as part of the
<a href="https://github.com/biopragmatics/semra">Semantic Mapping Reasoner and Assembler (SeMRA)</a>.
However, describing how it works on a technical level will be part of a future
post.</p>

<h3 id="lexicon-assembly">Lexicon Assembly</h3>

<p>Controlled vocabularies often contain labels and synonyms for their terms that
are useful when constructing lexical indexes (i.e., databases of labels and
synonyms). These indexes can be fed into named entity recognition (NER) and
named entity normalization (NEN) workflows - crucial components of the natural
language processing and text mining pipelines that curators commonly use to
annotate the literature with relationships, which eventually become part of
databases used to construct knowledge graphs.</p>

<p>However, much like knowledge, synonyms for the same concept might be spread over
multiple different resources. Therefore, semantic mappings can be used to group
multiple terms together and pool all of their synonyms, which both improves
recall and reduces the number of duplicate groundings that might be given for a
given part of text.</p>

<p><img src="/img/mappings-automated-assembly/synonyms-assembly.svg" alt="" /></p>
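<p>A minimal sketch of this pooling: group each term under its canonical equivalent and merge the synonym lists. The synonym lists and the MeSH-to-ChEBI equivalence for water are illustrative.</p>

```python
from collections import defaultdict

# Illustrative equivalence: the MeSH term for water maps onto ChEBI's.
EQUIVALENCES = {"mesh:D014867": "CHEBI:15377"}

SYNONYMS = {
    "mesh:D014867": ["Water", "Hydrogen Oxide"],
    "CHEBI:15377": ["water", "H2O"],
}


def assemble_lexicon(synonyms, equivalences):
    """Pool synonyms from equivalent terms under one canonical identifier."""
    lexicon = defaultdict(set)
    for term, names in synonyms.items():
        canonical = equivalences.get(term, term)
        lexicon[canonical].update(names)
    return dict(lexicon)


lexicon = assemble_lexicon(SYNONYMS, EQUIVALENCES)
# Both resources' synonyms are pooled under the ChEBI identifier.
assert lexicon["CHEBI:15377"] == {"Water", "Hydrogen Oxide", "water", "H2O"}
```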

<p>I’ve already made a technical implementation of this workflow in the
<a href="https://github.com/biopragmatics/biolexica">Biolexica</a> project, but I’m working
towards generalizing and rebranding it for use outside the biomedical domain. I
previously posted a <a href="/2025/12/19/annotating-the-literature-demo.html">simple
demonstration</a> using
the underlying NER and NEN technology stack, but just using MeSH - a future post
will show how this works with coherent lexica for diseases, genes, and other
entity types.</p>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="SSSOM" /><category term="semantic mappings" /><category term="knowledge graphs" /><summary type="html"><![CDATA[Data and knowledge originating from heterogeneous sources often use heterogeneous controlled vocabularies and/or ontologies for annotating named entities. Semantic mappings are essential towards resolving these discrepancies and integrating in a coherent way. This post highlights how this looks in two scenarios: when constructing a knowledge graph for graph machine learning and when constructing a comprehensive lexica for natural language processing, text mining, and curation.]]></summary></entry><entry><title type="html">Mapping from SSSOM to JSKOS</title><link href="https://cthoyt.com/2026/01/15/sssom-to-jskos.html" rel="alternate" type="text/html" title="Mapping from SSSOM to JSKOS" /><published>2026-01-15T10:42:00+00:00</published><updated>2026-01-15T10:42:00+00:00</updated><id>https://cthoyt.com/2026/01/15/sssom-to-jskos</id><content type="html" xml:base="https://cthoyt.com/2026/01/15/sssom-to-jskos.html"><![CDATA[<p><a href="https://gbv.github.io/jskos/">JSKOS (JSON for Knowledge Organization Systems)</a>
is a JSON-based data model for representing terminologies, thesauri,
classifications, and other semantic artifacts. Like the
<a href="https://mapping-commons.github.io/sssom/">Simple Standard for Sharing Ontological Mappings (SSSOM)</a>,
it can also encode semantic mappings. This post is about developing and
implementing a crosswalk between them in the
<a href="https://github.com/cthoyt/sssom-pydantic/pull/26">sssom-pydantic</a> Python
package.</p>

<h2 id="background-on-jskos">Background on JSKOS</h2>

<p>At its core, JSKOS implements the Simple Knowledge Organization System (SKOS)
data model and extends it with a Wikidata-inspired data model that includes the
following types:</p>

<p><img src="https://gbv.github.io/jskos/types.svg" alt="" /></p>

<p>JSKOS enables representing semantic mappings in two ways:</p>

<ol>
  <li>using the <code class="language-plaintext highlighter-rouge">narrower</code>, <code class="language-plaintext highlighter-rouge">broader</code>, and <code class="language-plaintext highlighter-rouge">related</code> slots in the
<a href="https://gbv.github.io/jskos/#concept">Concept</a> class that correspond to SKOS
relations <code class="language-plaintext highlighter-rouge">skos:narrowMatch</code>, <code class="language-plaintext highlighter-rouge">skos:broadMatch</code>, and <code class="language-plaintext highlighter-rouge">skos:relatedMatch</code></li>
  <li>using the <code class="language-plaintext highlighter-rouge">mappings</code> slot in the
<a href="https://gbv.github.io/jskos/#concept">Concept</a> class, which accepts a list
of instances of the more generic
<a href="https://gbv.github.io/jskos/#mapping">Mapping</a> class</li>
</ol>

<p>Here’s how JSKOS represents an exact match from the
<a href="https://github.com/biopragmatics/biomappings">Biomappings</a> community curated
mappings database between a
<a href="https://semantic.farm/mesh">Medical Subject Headings (MeSH)</a> term and
<a href="https://semantic.farm/chebi">Chemical Entities of Biological Interest (ChEBI) ontology</a>
term for the chemical <a href="https://en.wikipedia.org/wiki/Ammeline">ammeline</a>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"license"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"uri"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://spdx.org/licenses/CC0-1.0"</span><span class="w"> </span><span class="p">}],</span><span class="w">
  </span><span class="nl">"uri"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"mappings"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"http://www.w3.org/2004/02/skos/core#exactMatch"</span><span class="p">],</span><span class="w">
      </span><span class="nl">"subject_bundle"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"member_set"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"uri"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://id.nlm.nih.gov/mesh/C000089"</span><span class="w"> </span><span class="p">}]</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"object_bundle"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"member_set"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"uri"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://purl.obolibrary.org/obo/CHEBI_28646"</span><span class="w"> </span><span class="p">}]</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"justification"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://w3id.org/semapv/vocab/ManualMappingCuration"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
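<p>To give a concrete sense of the structure, here’s how the subject and object
URIs could be pulled out of a JSKOS document like the one above using only the
Python standard library (an illustrative sketch, not part of any JSKOS
library):</p>

```python
import json

# A pared-down version of the JSKOS mapping document shown above
document = json.loads("""
{
  "uri": "https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv",
  "mappings": [
    {
      "type": ["http://www.w3.org/2004/02/skos/core#exactMatch"],
      "subject_bundle": {"member_set": [{"uri": "http://id.nlm.nih.gov/mesh/C000089"}]},
      "object_bundle": {"member_set": [{"uri": "http://purl.obolibrary.org/obo/CHEBI_28646"}]}
    }
  ]
}
""")

for mapping in document["mappings"]:
    # each bundle's member_set holds one or more concepts; here, exactly one
    subject = mapping["subject_bundle"]["member_set"][0]["uri"]
    obj = mapping["object_bundle"]["member_set"][0]["uri"]
    print(subject, mapping["type"][0], obj)
```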

<p>Notably, JSKOS is baked into the <a href="https://coli-conc.gbv.de/cocoda/">Cocoda</a>
mapping editor, which is being widely adopted in the humanities consortia of the
NFDI.</p>

<h2 id="interoperability-between-sssom-and-jskos">Interoperability between SSSOM and JSKOS</h2>

<p>Given the overlapping ability of the
<a href="https://mapping-commons.github.io/sssom/">Simple Standard for Sharing Ontological Mappings (SSSOM)</a>
and JSKOS to represent semantic mappings, the JSKOS and SSSOM teams developed a
<a href="https://github.com/gbv/jskos/issues/108">crosswalk</a> between JSKOS and SSSOM.
Along the way, the SSSOM and JSKOS data models each evolved to incorporate good
ideas from the other; for example, a
<a href="https://github.com/mapping-commons/sssom/issues/359">mapping identifier</a>
was added to SSSOM records to allow for referencing the SSSOM mapping itself.</p>

<p>The crosswalk is not (yet) lossless; for example, JSKOS does not yet have a
mechanism to express
<a href="https://github.com/gbv/jskos/issues/152">information about lexical and other automated mappings</a>.
However, lossless conversion between data models isn’t always possible, nor is
it always necessary, considering the different domains for which JSKOS and SSSOM
were developed. That JSKOS was developed by researchers in the digital
humanities and SSSOM was developed by researchers in the life and natural
sciences can contextualize some of their discrepancies.</p>

<h2 id="technical-implementation">Technical Implementation</h2>

<p>The <a href="https://github.com/gbv/sssom-js">sssom-js</a> JavaScript package contains the
first SSSOM to JSKOS converter and has an
<a href="https://github.com/gbv/sssom-js/issues/5">open issue</a> for conversion back to
SSSOM (TSV). It was developed by the JSKOS team, meaning that I have high
confidence that the implementation of the crosswalk is accurate.</p>

<p>While it can be invoked from the command line using <code class="language-plaintext highlighter-rouge">npx</code>, as in
<code class="language-plaintext highlighter-rouge">npx sssom-js --from tsv --to jskos --output output.json input.sssom.tsv</code>, it
can also be explored in the first-party SSSOM Validation and Transformation
<a href="https://gbv.github.io/sssom-js/">website</a>.</p>

<p><img src="/img/sssom-js-validator.png" alt="" /></p>

<p>Originally, my plan was to implement SSSOM to JSKOS export in the
<a href="https://github.com/cthoyt/sssom-pydantic">sssom-pydantic</a> package so it can be easily
incorporated into other SSSOM-aware applications like
<a href="https://github.com/cthoyt/sssom-curator/">SSSOM Curator</a> and the
<a href="https://github.com/biopragmatics/semra">Semantic Mapping Reasoner and Assembler</a>.</p>

<p>I started by implementing an object model for JSKOS in Python using Pydantic in
<a href="https://github.com/cthoyt/jskos">a dedicated package</a>. This actually turned out
to be very difficult to get to work in general because the JSKOS data model is
hierarchical and does not always contain fields that make it possible to
discriminate which class a given arbitrary JSON object belongs to. This
makes it difficult to use Pydantic’s
<a href="https://docs.pydantic.dev/latest/concepts/unions/#nested-discriminated-unions">nested discriminated unions</a>
feature, so I had to implement a custom solution.</p>
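<p>To illustrate the issue with hypothetical, pared-down stand-ins (not the
actual JSKOS classes): when two classes share most of their fields and there is
no dedicated discriminator field, the class of an arbitrary JSON object can
only be guessed structurally, and the guess is ambiguous for objects that carry
only shared fields:</p>

```python
# Hypothetical, simplified stand-ins for two JSKOS-like classes; the real
# data model (https://gbv.github.io/jskos/) has many more shared fields.
CONCEPT_KEYS = {"uri", "narrower", "broader", "related", "mappings"}
MAPPING_KEYS = {"uri", "type", "subject_bundle", "object_bundle"}

def guess_class(obj: dict) -> str:
    """Structurally guess which class a JSON object belongs to.

    There is no discriminator field guaranteed to be present, so the
    best we can do is score key overlap, which is ambiguous for
    objects that only carry shared fields like "uri".
    """
    keys = set(obj)
    concept_score = len(keys & CONCEPT_KEYS)
    mapping_score = len(keys & MAPPING_KEYS)
    if concept_score == mapping_score:
        return "ambiguous"
    return "Concept" if concept_score > mapping_score else "Mapping"

print(guess_class({"uri": "x", "narrower": []}))        # → Concept
print(guess_class({"uri": "x", "subject_bundle": {}}))  # → Mapping
print(guess_class({"uri": "x"}))                        # → ambiguous
```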

<p>Ultimately, I scrapped the idea of re-implementing the crosswalk myself (for
now) and instead defer to the <code class="language-plaintext highlighter-rouge">sssom-js</code> implementation (while wrapping it in an
idiomatic Python API). Once <code class="language-plaintext highlighter-rouge">sssom-js</code> implements a TSV exporter, I will have a
high-quality oracle against which to test my implementation. These first steps
were implemented in
<a href="https://github.com/cthoyt/sssom-pydantic/pull/26">cthoyt/sssom-pydantic#26</a>.</p>

<p>Here’s what this looks like in Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sssom_pydantic</span>
<span class="kn">from</span> <span class="nn">sssom_pydantic.contrib.jskos_export</span> <span class="kn">import</span> <span class="n">to_jskos</span>

<span class="n">url</span> <span class="o">=</span> <span class="s">"https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv"</span>
<span class="n">mappings</span><span class="p">,</span> <span class="n">converter</span><span class="p">,</span> <span class="n">metadata</span> <span class="o">=</span> <span class="n">sssom_pydantic</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>

<span class="n">jskos_concept</span> <span class="o">=</span> <span class="n">to_jskos</span><span class="p">(</span><span class="n">mappings</span><span class="p">,</span> <span class="n">converter</span><span class="o">=</span><span class="n">converter</span><span class="p">,</span> <span class="n">metadata</span><span class="o">=</span><span class="n">metadata</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="SSSOM" /><category term="SKOS" /><category term="semantic mappings" /><category term="mappings" /><category term="interoperability" /><category term="JSKOS" /><summary type="html"><![CDATA[JSKOS (JSON for Knowledge Organization Systems) is a JSON-based data model for representing terminologies, thesauri, classifications, and other semantic artifacts. Like the Simple Standard for Sharing Ontological Mappings (SSSOM), it can also encode semantic mappings. This post is about developing and implementing a crosswalk between them in the sssom-pydantic Python package.]]></summary></entry><entry><title type="html">Mapping from SSSOM to Wikidata</title><link href="https://cthoyt.com/2026/01/08/sssom-to-wikidata.html" rel="alternate" type="text/html" title="Mapping from SSSOM to Wikidata" /><published>2026-01-08T15:47:00+00:00</published><updated>2026-01-08T15:47:00+00:00</updated><id>https://cthoyt.com/2026/01/08/sssom-to-wikidata</id><content type="html" xml:base="https://cthoyt.com/2026/01/08/sssom-to-wikidata.html"><![CDATA[<p>At the
<a href="https://nfdi4chem.de/event/4-workshop-ontologies4chem">4th Ontologies4Chem Workshop</a>
in Limburg an der Lahn, I proposed an initial crosswalk between the
<a href="https://mapping-commons.github.io/sssom">Simple Standard for Sharing Ontological Mappings (SSSOM)</a>
and the <a href="https://www.wikidata.org">Wikidata</a> semantic mapping data model. This
post describes the motivation for this proposal and the concrete implementation
I’ve developed in <a href="https://github.com/cthoyt/sssom-pydantic"><code class="language-plaintext highlighter-rouge">sssom-pydantic</code></a>.</p>

<p>This work is part of the NFDI’s
<a href="https://github.com/nfdi-de/section-metadata-wg-onto">Ontology Harmonization and Mapping Working Group</a>,
which is interested in enabling interoperability between SSSOM and related data
standards that encode semantic mappings.</p>

<p>The TL;DR for this post is that I implemented a mapping from SSSOM to Wikidata
in <code class="language-plaintext highlighter-rouge">sssom-pydantic</code> in
<a href="https://github.com/cthoyt/sssom-pydantic/pull/32">cthoyt/sssom-pydantic#32</a>.
One high-level entrypoint is the following function, which reads an SSSOM file
and prepares
<a href="https://www.wikidata.org/wiki/Help:QuickStatements">QuickStatements</a> which can
be reviewed in the web browser, then uploaded to Wikidata.</p>

<script src="https://gist.github.com/cthoyt/f38d37426a288989158a9804f74e731a.js"></script>

<p>This script can be run from Gist with
<code class="language-plaintext highlighter-rouge">uv run https://gist.github.com/cthoyt/f38d37426a288989158a9804f74e731a#file-sssom-wikidata-demo-py</code>.</p>

<h2 id="semantic-mappings-in-sssom">Semantic Mappings in SSSOM</h2>

<p>The
<a href="https://mapping-commons.github.io/sssom">Simple Standard for Sharing Ontological Mappings (SSSOM)</a>
is a community-driven data standard for semantic mappings, which are necessary
to support (semi-)automated data integration and knowledge integration, such as
in the construction of knowledge graphs.</p>

<p>While SSSOM is primarily a tabular data format that is best serialized as TSV, it
uses <a href="https://linkml.io">LinkML</a> to formalize the semantics of each field such
that SSSOM can be serialized to and read from OWL, RDF, and JSON-LD. Here’s a
brief example:</p>

<table>
  <thead>
    <tr>
      <th>subject_id</th>
      <th>subject_label</th>
      <th>predicate_id</th>
      <th>object_id</th>
      <th>object_label</th>
      <th>mapping_justification</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>wikidata:Q128700</td>
      <td>cell wall</td>
      <td>skos:exactMatch</td>
      <td>GO:0005618</td>
      <td>cell wall</td>
      <td>semapv:ManualMappingCuration</td>
    </tr>
    <tr>
      <td>wikidata:Q47512</td>
      <td>acetic acid</td>
      <td>skos:exactMatch</td>
      <td>CHEBI:15366</td>
      <td>acetic acid</td>
      <td>semapv:ManualMappingCuration</td>
    </tr>
  </tbody>
</table>
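<p>For reference, the same two mappings in the TSV serialization would look
roughly like this, with the prefix map embedded as a commented YAML header (a
sketch assuming a minimal <code class="language-plaintext highlighter-rouge">curie_map</code>; a real file carries more metadata):</p>

```text
#curie_map:
#  wikidata: "http://www.wikidata.org/entity/"
#  GO: "http://purl.obolibrary.org/obo/GO_"
#  CHEBI: "http://purl.obolibrary.org/obo/CHEBI_"
#  skos: "http://www.w3.org/2004/02/skos/core#"
#  semapv: "https://w3id.org/semapv/vocab/"
subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification
wikidata:Q128700	cell wall	skos:exactMatch	GO:0005618	cell wall	semapv:ManualMappingCuration
wikidata:Q47512	acetic acid	skos:exactMatch	CHEBI:15366	acetic acid	semapv:ManualMappingCuration
```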

<h2 id="semantic-mappings-in-wikidata">Semantic Mappings in Wikidata</h2>

<p>Wikidata has two complementary formalisms for representing semantic mappings.
The first uses the
<a href="https://www.wikidata.org/wiki/Property:P2888">exact match (P2888)</a> property
with a URI as the object. For example,
<a href="https://www.wikidata.org/wiki/Q128700">cell wall (Q128700)</a> maps to the Gene
Ontology (GO) term for <a href="https://purl.obolibrary.org/obo/GO_0005618">cell wall</a>
by its URI <code class="language-plaintext highlighter-rouge">http://purl.obolibrary.org/obo/GO_0005618</code>.</p>

<p><img src="/img/sssom-to-wikidata/cell-wall.png" alt="A screenshot of the exact match section of webpage for Wikidata's cell wall record" /></p>

<p>The second formalism uses semantic space-specific properties (e.g.
<a href="https://www.wikidata.org/wiki/Property:P683">P683</a> for ChEBI) with local unique
identifiers as the object. For example,
<a href="https://www.wikidata.org/wiki/Q47512">acetic acid (Q47512)</a> maps to the ChEBI
term for
<a href="https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15366">acetic acid</a> using
the <a href="https://www.wikidata.org/wiki/Property:P683">P683</a> property for ChEBI and
local unique identifier for acetic acid (within ChEBI) <code class="language-plaintext highlighter-rouge">15366</code>.</p>

<p><img src="/img/sssom-to-wikidata/acetic-acid.png" alt="A screenshot of the ChEBI mapping section of webpage for Wikidata's acetic acid record" /></p>

<p>Wikidata has a data structure that enables annotating qualifiers onto triples.
Therefore, other parts of semantic mappings modeled in SSSOM can be ported:</p>

<ol>
  <li>Authors and reviewers can be mapped from ORCiD identifiers to Wikidata
identifiers, then encoded using the
<a href="https://www.wikidata.org/wiki/Property:P50">S50</a> and
<a href="https://www.wikidata.org/wiki/Property:P4032">S4032</a> properties,
respectively</li>
  <li>A SKOS-flavored mapping predicate (i.e., exact, narrow, broad, close,
related) can be encoded using the
<a href="https://www.wikidata.org/wiki/Property:P4390">S4390</a> property</li>
  <li>The publication date can be encoded using the
<a href="https://www.wikidata.org/wiki/Property:P577">S577</a> property</li>
  <li>The license can be mapped from text to a Wikidata identifier, then encoded
using the <a href="https://www.wikidata.org/wiki/Property:P275">S275</a> property</li>
</ol>

<p>Note that properties that normally start with a <code class="language-plaintext highlighter-rouge">P</code> when used in triples are
changed to start with an <code class="language-plaintext highlighter-rouge">S</code> when used as qualifiers. Other fields in SSSOM
could potentially be mapped to Wikidata later.</p>
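<p>As a small sketch of how this P-to-S convention plays out when serializing a
statement with qualifiers for QuickStatements (a stdlib-only illustration; the
function name and the placeholder QID <code class="language-plaintext highlighter-rouge">Q0</code> are hypothetical, not from
<code class="language-plaintext highlighter-rouge">sssom-pydantic</code>):</p>

```python
def to_quickstatements(subject_qid: str, prop: str, value: str, qualifiers: dict) -> str:
    """Serialize one statement as a tab-separated QuickStatements command.

    Hypothetical helper for illustration: qualifier properties are
    rewritten from a leading ``P`` to a leading ``S``, following the
    convention described above.
    """
    parts = [subject_qid, prop, f'"{value}"']
    for qualifier_prop, qualifier_value in qualifiers.items():
        parts.append("S" + qualifier_prop.removeprefix("P"))
        parts.append(qualifier_value)
    return "\t".join(parts)

line = to_quickstatements(
    "Q47512",  # acetic acid
    "P683",    # ChEBI ID property
    "15366",   # local unique identifier within ChEBI
    {"P4390": "Q0"},  # mapping relation type; "Q0" is a placeholder QID
)
print(line)
```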

<h3 id="finding-wikidata-properties-using-the-semantic-farm">Finding Wikidata Properties using the Semantic Farm</h3>

<p>The <a href="https://semantic.farm">Semantic Farm</a> (previously called the Bioregistry)
maintains mappings between prefixes that appear in compact URIs (CURIEs) and
their corresponding Wikidata properties. For example, the prefix
<a href="https://semantic.farm/chebi"><code class="language-plaintext highlighter-rouge">CHEBI</code></a> maps to the Wikidata property
<a href="https://www.wikidata.org/wiki/Property:P683">P683</a>.</p>

<p><img src="/img/sssom-to-wikidata/bioregistry.png" alt="" /></p>

<p>These mappings can be accessed in several ways:</p>

<ol>
  <li>via the Semantic Farm’s
<a href="https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/sssom/bioregistry.sssom.tsv">SSSOM</a>
export. Note: this requires subsetting to mappings where Wikidata properties
are the object.</li>
  <li>via the Semantic Farm’s
<a href="https://semantic.farm/api/metaregistry/wikidata/mappings.json">live API</a>,</li>
  <li>
    <p>via the Bioregistry Python package (this will get renamed to match Semantic
Farm, eventually) using the following code:</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">bioregistry</span>

<span class="c1"># get bulk
</span><span class="n">prefix_to_property</span> <span class="o">=</span> <span class="n">bioregistry</span><span class="p">.</span><span class="n">get_registry_map</span><span class="p">(</span><span class="s">"wikidata"</span><span class="p">)</span>

<span class="c1"># get for a single resource
</span><span class="n">resource</span> <span class="o">=</span> <span class="n">bioregistry</span><span class="p">.</span><span class="n">get_resource</span><span class="p">(</span><span class="s">"chebi"</span><span class="p">)</span>
<span class="n">chebi_wikidata_property_id</span> <span class="o">=</span> <span class="n">resource</span><span class="p">.</span><span class="n">get_mapped_prefix</span><span class="p">(</span><span class="s">"wikidata"</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
</ol>

<h2 id="notable-implementation-details">Notable Implementation Details</h2>

<p>I’ve previously built two packages that were key to making this work:</p>

<ol>
  <li><a href="https://github.com/cthoyt/wikidata-client"><code class="language-plaintext highlighter-rouge">wikidata-client</code></a>, which
interacts with the Wikidata SPARQL endpoint and has high-level wrappers
around lookup functionality. I’m also aware of
<a href="https://github.com/SuLab/WikidataIntegrator">WikidataIntegrator</a> - I’ve
contributed several improvements, but working with its codebase doesn’t spark
joy and the last time I tried to use it, it was fully broken due to some of
its dependencies not working on modern Python.</li>
  <li><a href="https://github.com/cthoyt/quickstatements-client"><code class="language-plaintext highlighter-rouge">quickstatements-client</code></a>,
which implements an object model for
<a href="https://www.wikidata.org/wiki/Help:QuickStatements">QuickStatements v2</a> and
an API client.</li>
</ol>

<p>Along the way to this PR, I made improvements to the wikidata-client in
<a href="https://github.com/cthoyt/wikidata-client/pull/2">cthoyt/wikidata-client#2</a> to
add high-level functionality for looking up multiple Wikidata records based on
values for a property (e.g., to support ORCID lookup in bulk).</p>

<p>All other changes were made in <code class="language-plaintext highlighter-rouge">sssom-pydantic</code> in
<a href="https://github.com/cthoyt/sssom-pydantic/pull/32">cthoyt/sssom-pydantic#32</a>.</p>

<p>The other key challenge was to avoid adding duplicate information to Wikidata -
unlike a simple triple store, we could accidentally end up with duplicate
statements. Therefore, the sssom-pydantic implementation looks up all existing
semantic mappings in Wikidata for entities appearing in an SSSOM file, then
filters appropriately to avoid uploading duplicate mappings to Wikidata.</p>
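<p>Conceptually, that filtering step is a set difference over (subject,
property, value) triples, as in this simplified, stdlib-only illustration (not
the actual <code class="language-plaintext highlighter-rouge">sssom-pydantic</code> code, where the existing statements would be
fetched via SPARQL):</p>

```python
def filter_new_statements(candidates, existing):
    """Drop candidate statements that already exist in Wikidata.

    Both arguments are iterables of (subject_qid, property_id, value)
    triples; ``existing`` would come from a SPARQL query in practice.
    """
    seen = set(existing)
    return [triple for triple in candidates if triple not in seen]

existing = [("Q47512", "P683", "15366")]
candidates = [
    ("Q47512", "P683", "15366"),  # duplicate, will be skipped
    ("Q128700", "P2888", "http://purl.obolibrary.org/obo/GO_0005618"),
]
print(filter_new_statements(candidates, existing))
```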

<h2 id="pulling-it-all-together">Pulling it All Together</h2>

<p>This new module in <code class="language-plaintext highlighter-rouge">sssom-pydantic</code> implements the following interactive
workflows:</p>

<ol>
  <li>Read an SSSOM file, convert mappings to Wikidata schema, then open a
QuickStatements tab in the web browser using
<code class="language-plaintext highlighter-rouge">read_and_open_quickstatements()</code></li>
  <li>Convert in-memory semantic mappings to the Wikidata schema, then open a
QuickStatements tab in the web browser using <code class="language-plaintext highlighter-rouge">open_quickstatements()</code></li>
</ol>

<p>Here’s what the QuickStatements web interface looks like after preparing some
demo mappings:</p>

<p><img src="/img/sssom-to-wikidata/quickstatements.png" alt="A screenshot of the QuickStatements queue" /></p>

<p>It also implements the following non-interactive workflows, which should be used
with caution since they write directly to Wikidata:</p>

<ol>
  <li>Read an SSSOM file, convert mappings to Wikidata schema, then post
non-interactively to Wikidata via QuickStatements using <code class="language-plaintext highlighter-rouge">read_and_post()</code></li>
  <li>Convert in-memory semantic mappings to the Wikidata schema, then post
non-interactively to Wikidata via QuickStatements using <code class="language-plaintext highlighter-rouge">post()</code></li>
</ol>

<hr />

<p>I’m a bit hesitant to start uploading SSSOM content to Wikidata in bulk, because
I don’t yet have a plan for how to maintain mappings that might change over time
in their upstream single source of truth, e.g., mappings curated in
<a href="https://github.com/biopragmatics/biomappings">Biomappings</a>. Otherwise, I think
this is a good proof of concept and would like to get feedback about additional
qualifiers that could be added, and if the ones I chose so far were the best.</p>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="SSSOM" /><category term="Wikidata" /><category term="SKOS" /><category term="semantic mappings" /><category term="mappings" /><category term="interoperability" /><summary type="html"><![CDATA[At the 4th Ontologies4Chem Workshop in Limburg an der Lahn, I proposed an initial crosswalk between the Simple Standard for Sharing Ontological Mappings (SSSOM) and the Wikidata semantic mapping data model. This post describes the motivation for this proposal and the concrete implementation I’ve developed in sssom-pydantic.]]></summary></entry><entry><title type="html">Validating Prefix Maps in LinkML Schemas</title><link href="https://cthoyt.com/2026/01/06/bioregistry-linkml-validation.html" rel="alternate" type="text/html" title="Validating Prefix Maps in LinkML Schemas" /><published>2026-01-06T09:36:00+00:00</published><updated>2026-01-06T09:36:00+00:00</updated><id>https://cthoyt.com/2026/01/06/bioregistry-linkml-validation</id><content type="html" xml:base="https://cthoyt.com/2026/01/06/bioregistry-linkml-validation.html"><![CDATA[<p><a href="https://linkml.io">LinkML</a> enables defining data models and data schemas in
YAML informed by semantic web best practices. As such, each definition includes
a prefix map. Similarly to my previous posts on validating the prefix maps
appearing in <a href="/2025/09/04/bioregistry-turtle-validation.html">Turtle
files</a> and <a href="/2025/09/11/nfdi4culture-prefix-validation.html">in
unfamiliar SPARQL
endpoints</a>, this post
describes a new extension to
<a href="https://github.com/biopragmatics/bioregistry">the Bioregistry</a> that validates
prefix maps in LinkML definitions.</p>

<p>Here’s an abridged excerpt of a LinkML
<a href="https://github.com/HendrikBorgelt/CatCore/blob/main/src/catcore/schema/catcore.yaml">definition</a>
borrowed from <a href="https://github.com/HendrikBorgelt/CatCore">CatCore</a>, a data model
under development by NFDI4Cat, the NFDI consortium interested in catalysis:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">id</span><span class="pi">:</span> <span class="s">https://w3id.org/nfdi4cat/catcore</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">catcore-metadata</span>
<span class="na">title</span><span class="pi">:</span> <span class="s">CatCore Metadata Reference Model</span>

<span class="na">prefixes</span><span class="pi">:</span>
  <span class="na">catcore</span><span class="pi">:</span> <span class="s">https://w3id.org/nfdi4cat/catcore/</span>
  <span class="na">voc4cat</span><span class="pi">:</span> <span class="s">https://w3id.org/nfdi4cat/voc4cat_</span>
  <span class="na">CHMO</span><span class="pi">:</span> <span class="s">http://purl.obolibrary.org/obo/CHMO_</span>
  <span class="na">OBI</span><span class="pi">:</span> <span class="s">http://purl.obolibrary.org/obo/OBI_</span>
  <span class="na">AFR</span><span class="pi">:</span> <span class="s">http://purl.allotrope.org/ontologies/result#AFR_</span>
  <span class="na">AFP</span><span class="pi">:</span> <span class="s">http://purl.allotrope.org/ontologies/process#AFP_</span>
  <span class="na">AFQ</span><span class="pi">:</span> <span class="s">http://purl.allotrope.org/ontologies/quality#AFQ_</span>
  <span class="na">NCIT</span><span class="pi">:</span> <span class="s">http://purl.obolibrary.org/obo/NCIT_</span>
  <span class="na">nmrCV</span><span class="pi">:</span> <span class="s2">"</span><span class="s">http://nmrML.org/nmrCV#NMR:"</span>
  <span class="na">linkml</span><span class="pi">:</span> <span class="s">https://w3id.org/linkml/</span>
  <span class="na">AFRL</span><span class="pi">:</span> <span class="s">http://purl.allotrope.org/ontologies/role#AFRL_</span>
  <span class="na">APOLLO_SV</span><span class="pi">:</span> <span class="s">http://purl.obolibrary.org/obo/APOLLO_SV_</span>
  <span class="na">SIO</span><span class="pi">:</span> <span class="s">http://semanticscience.org/resource/SIO_</span>

<span class="na">default_prefix</span><span class="pi">:</span> <span class="s">catcore</span>
</code></pre></div></div>

<p>In
<a href="https://github.com/biopragmatics/bioregistry/pull/1786">biopragmatics/bioregistry#1786</a>,
I implemented the <code class="language-plaintext highlighter-rouge">bioregistry validate linkml</code> command. It can be used to check
the prefix map in this file and give feedback on non-standard CURIE prefix
usage, unknown CURIE prefixes, etc. while giving suggestions for fixes, when
possible.</p>

<p>Running the command on the file that contains the example prefixes from above
gives the following output:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>bioregistry validate linkml <span class="nt">--tablefmt</span> github <span class="nt">--use-preferred</span> https://github.com/HendrikBorgelt/CatCore/raw/refs/heads/main/src/catcore/schema/catcore.yaml
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>prefix</th>
      <th>uri_prefix</th>
      <th>issue</th>
      <th>solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>catcore</td>
      <td>https://w3id.org/nfdi4cat/catcore/</td>
      <td>unknown CURIE prefix</td>
      <td> </td>
    </tr>
    <tr>
      <td>AFR</td>
      <td>http://purl.allotrope.org/ontologies/result#AFR_</td>
      <td>unknown CURIE prefix</td>
      <td> </td>
    </tr>
    <tr>
      <td>AFP</td>
      <td>http://purl.allotrope.org/ontologies/process#AFP_</td>
      <td>unknown CURIE prefix</td>
      <td> </td>
    </tr>
    <tr>
      <td>AFQ</td>
      <td>http://purl.allotrope.org/ontologies/quality#AFQ_</td>
      <td>unknown CURIE prefix</td>
      <td> </td>
    </tr>
    <tr>
      <td>nmrCV</td>
      <td>http://nmrML.org/nmrCV#NMR:</td>
      <td>non-standard CURIE prefix</td>
      <td>Switch to preferred prefix: NMR</td>
    </tr>
    <tr>
      <td>AFRL</td>
      <td>http://purl.allotrope.org/ontologies/role#AFRL_</td>
      <td>unknown CURIE prefix</td>
      <td> </td>
    </tr>
    <tr>
      <td>SIO</td>
      <td>http://semanticscience.org/resource/SIO_</td>
      <td>non-standard CURIE prefix</td>
      <td>Switch to preferred prefix: sio</td>
    </tr>
  </tbody>
</table>
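<p>Under the hood, this kind of check boils down to comparing each entry of the
prefix map against a registry. Here’s a toy, stdlib-only illustration with a
hypothetical three-entry registry (the Bioregistry’s actual data and API are
much richer):</p>

```python
# A hypothetical mini-registry: canonical CURIE prefix -> URI prefix,
# plus a separate map of preferred capitalizations.
REGISTRY = {
    "chmo": "http://purl.obolibrary.org/obo/CHMO_",
    "sio": "http://semanticscience.org/resource/SIO_",
    "nmr": "http://nmrML.org/nmrCV#NMR:",
}
PREFERRED = {"chmo": "CHMO", "sio": "sio", "nmr": "NMR"}

def validate(prefix: str, uri_prefix: str) -> str:
    """Classify a prefix map entry the way the validator's table does."""
    for canonical, registered_uri in REGISTRY.items():
        if uri_prefix == registered_uri:
            if prefix == PREFERRED[canonical]:
                return "ok"
            return f"non-standard CURIE prefix; use {PREFERRED[canonical]}"
    return "unknown CURIE prefix"

print(validate("CHMO", "http://purl.obolibrary.org/obo/CHMO_"))            # → ok
print(validate("SIO", "http://semanticscience.org/resource/SIO_"))          # non-standard
print(validate("AFR", "http://purl.allotrope.org/ontologies/result#AFR_"))  # unknown
```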

<p>Curation feedback is not absolute - it’s always possible that the Bioregistry is
missing key content. Luckily, it conforms to the
<a href="https://www.nature.com/articles/s41597-024-03406-w">open data, open code, open infrastructure (O3)</a>
guidelines, so it’s easy for anyone to perform a
<a href="https://doi.org/10.32388/KBX9VO">drive-by curation</a> to fix any minor issues.
The Bioregistry has public, well-defined
<a href="https://github.com/biopragmatics/bioregistry?tab=contributing-ov-file">curation guidelines</a>,
<a href="https://github.com/biopragmatics/bioregistry?tab=coc-ov-file">code of conduct</a>,
and
<a href="https://github.com/biopragmatics/bioregistry/blob/main/docs/GOVERNANCE.md">project governance</a>
to support making curation contributions. Alternatively, the
<a href="https://github.com/biopragmatics/bioregistry/issues">issue tracker</a> allows
non-technical users to post requests that the Bioregistry team can follow up on.</p>

<p>Based on the output above, I made improvements to the Bioregistry in
<a href="https://github.com/biopragmatics/bioregistry/pull/1788">biopragmatics/bioregistry#1788</a>
to add four new prefixes for the Allotrope semantic spaces and add <code class="language-plaintext highlighter-rouge">SIO</code>
(stylized with capital letters) as the “preferred prefix” for the
<a href="https://semantic.farm/sio">Semantic Science Integrated Ontology</a>.</p>

<p>Note that LinkML is developed by members of the OBO Community, so its
prefixes often skew towards OBO community preferences. You might therefore want
to use the <code class="language-plaintext highlighter-rouge">--use-preferred</code> flag if a lot of your prefixes are stylized in
uppercase or with mixed case.</p>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="LinkML" /><category term="Bioregistry" /><category term="prefix maps" /><category term="CURIEs" /><category term="URIs" /><summary type="html"><![CDATA[LinkML enables defining data models and data schemas in YAML informed by semantic web best practices. As such, each definition includes a prefix map. Similarly to my previous posts on validating the prefix maps appearing in Turtle files and in unfamiliar SPARQL endpoints, this post describes a new extension to the Bioregistry that validates prefix maps in LinkML definitions.]]></summary></entry><entry><title type="html">Books I Read in 2025</title><link href="https://cthoyt.com/2026/01/01/books-in-2025.html" rel="alternate" type="text/html" title="Books I Read in 2025" /><published>2026-01-01T21:22:00+00:00</published><updated>2026-01-01T21:22:00+00:00</updated><id>https://cthoyt.com/2026/01/01/books-in-2025</id><content type="html" xml:base="https://cthoyt.com/2026/01/01/books-in-2025.html"><![CDATA[<p>Here are the books I read in 2025. My goals for the year were to get some more
variety, and I think I managed that.</p>

<ol>
  <li>Turning Darkness Into Light (The Memoirs of Lady Trent, #6) by Marie Brennan</li>
  <li>Jade City (The Green Bone Saga #1) by Fonda Lee</li>
  <li>Jade War (The Green Bone Saga #2) by Fonda Lee</li>
  <li>Jade Legacy (The Green Bone Saga #3) by Fonda Lee</li>
  <li>A Court of Mist and Fury (ACOTAR, #2) by Sarah J. Maas</li>
  <li>Everyone You Hate is Going to Die by Daniel Sloss</li>
  <li>Intimacies by Katie Kitamura</li>
  <li>A Court of Wings and Ruin (ACOTAR, #3) by Sarah J. Maas</li>
  <li>The Raven Tower by Ann Leckie</li>
  <li>The Night Circus by Erin Morgenstern</li>
  <li>The Fall by Albert Camus</li>
  <li>The Lies of Locke Lamora (Gentleman Bastard, #1) by Scott Lynch</li>
  <li>The Will of the Many (Hierarchy, #1) by James Islington</li>
  <li>The Empress of Salt and Fortune by Nghi Vo</li>
  <li>Red Seas Under Red Skies (Gentleman Bastard, #2) by Scott Lynch</li>
  <li>Reckless by Cornelia Funke</li>
  <li>The Republic of Thieves (Gentleman Bastard, #3) by Scott Lynch</li>
  <li>Isles of the Emberdark by Brandon Sanderson</li>
  <li>The Priory of the Orange Tree by Samantha Shannon</li>
  <li>The River Has Roots by Amal El-Mohtar</li>
  <li>Martyr! by Kaveh Akbar</li>
  <li>A Man Called Ove by Fredrik Backman</li>
  <li>Open Throat by Henry Hoke</li>
  <li>The Midnight Library by Matt Haig</li>
  <li>The Tainted Cup (Shadow of the Leviathan, #1) by Robert Jackson Bennett</li>
  <li>The Catcher in the Rye by J.D. Salinger</li>
  <li>The Strength of the Few (Hierarchy, #2) by James Islington</li>
  <li>A Drop Of Corruption (Shadow of the Leviathan, #2) by Robert Jackson Bennett</li>
  <li>Piranesi by Susanna Clarke</li>
  <li>The Sun Also Rises by Ernest Hemingway</li>
</ol>

<p>All comments below are spoiler-free, except a minor note about the ending of
<em>Martyr!</em>.</p>

<p>Highlights:</p>

<ol>
  <li>The Night Circus was my favorite. It evoked a really special magical
feeling.</li>
  <li>The Green Bone Saga was an excellent trilogy, with perfect pacing and
progression, and it knew the perfect spot to end.</li>
  <li>Title drops are very important. The ending of Open Throat was very cathartic.</li>
  <li>There was only one Brandon sandwich this year, but… SHARD GUNS</li>
</ol>

<p>Comments:</p>

<ul>
  <li>I was in a reading slump in the summer, so I also re-read Stormlight Archive.
Now knowing the end of Wind and Truth, it’s crazy to see the foreshadowing of
that ending.</li>
  <li>I liked <em>The Tainted Cup</em> and <em>A Drop of Corruption</em> because Din and Ana are
very interesting characters and the world building is great. But I’m not sure
I’m sold on mysteries, at least in books. I enjoy film and television
adaptations, though.</li>
  <li>A new bookstore opened in Fall in Beuel that is hosting
<a href="https://beuelerbuchladen.de/veranstaltungen/">silent reading nights</a> once per
month. I had a great time going there with friends, and we even instituted our
own silent reading nights (with snacks).</li>
</ul>

<p>Disappointments:</p>

<ol>
  <li><em>The Will of the Many</em> perfected the genres and archetypes that it pulls
from, but <em>The Strength of the Few</em> mostly missed the mark. Because of the
twist from the end of the first book, it had to do a lot of world-building
and create new plot and character arcs for the protagonist(s). I think it
didn’t match the tone promise of the first novel and ultimately felt like the
whole book was three side quests with new side characters in whom I wasn’t as
invested as in the first book. <em>The Strength of the Few</em> had severe middle
child syndrome and felt bloated, but I think that it can be salvaged in the
next (and last?) installment.</li>
  <li>(minor spoilers) <em>Martyr!</em> told the story of a deeply troubled man who wanted
to write a book on what makes death meaningful and to decide for himself if he
wanted to continue living. While it had a positive ending, I think it
ultimately failed to deliver any (well-organized) revelations about what
makes a meaningful life, a meaningful death, or martyrdom, either in-story
to the protagonist or on a meta level to me as the reader. I would instead
look to novels by Kurt Vonnegut for this kind of cleverness.</li>
  <li>I’m still motivated to read classics, but was pretty disappointed with
<em>Catcher in the Rye</em> and <em>The Sun Also Rises</em>. I didn’t feel like they had
much going on in terms of character, plot, or theme that secures their
longevity. Obviously, it was an achievement to write a character as hateable
as Holden Caulfield, but neither book had clear tone/plot/theme promises nor
delivered on them.</li>
</ol>

<p>Won’t Finish:</p>

<ol>
  <li>The Name of the Rose by Umberto Eco. The world building, characters, and
writing annoyed me, and I was completely checked out by the time the plot
started. We read this as part of book club in preparation for a trip to
<a href="https://kloster-eberbach.de">Kloster Eberbach</a>, where the adaptation of the
book was filmed. However, 4/6 of our book club DNF’d this one, and we
instituted a 200-page maximum for future books. At least the Kloster was
beautiful and had good wine!</li>
  <li>Neuromancer by William Gibson. This book single-handedly created the
cyberpunk aesthetic. However, the story was weak and the characters were
uncomfortably outdated, so I put this one down. I imagine many writers had a
similar experience and thought they could do better. This is probably why
there are so many good stories with cyberpunk aesthetic!</li>
</ol>

<p>Reading goals for 2026: more variety and more reading in German!</p>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="books" /><summary type="html"><![CDATA[Here are the books I read in 2025. My goals for the year were to get some more variety, and I think I managed that.]]></summary></entry><entry><title type="html">Annotating the Literature with Named Entity Recognition</title><link href="https://cthoyt.com/2025/12/19/annotating-the-literature-demo.html" rel="alternate" type="text/html" title="Annotating the Literature with Named Entity Recognition" /><published>2025-12-19T09:01:00+00:00</published><updated>2025-12-19T09:01:00+00:00</updated><id>https://cthoyt.com/2025/12/19/annotating-the-literature-demo</id><content type="html" xml:base="https://cthoyt.com/2025/12/19/annotating-the-literature-demo.html"><![CDATA[<p>Annotating the literature with mentions of key concepts from a given domain is
often the first step towards extracting more substantial structured knowledge.
This can be challenging, as it typically encompasses acquiring and processing
the relevant literature and ontologies, then installing and applying
difficult-to-use named entity recognition (NER) workflows. This post highlights
software components I’ve implemented to simplify this workflow. I demonstrate it
by annotating the biomedical literature available through
<a href="https://pubmed.ncbi.nlm.nih.gov/">PubMed</a> with
<a href="https://semantic.farm/mesh">Medical Subject Headings (MeSH)</a> terms, and also
comment on how this can be generalized to other natural sciences, engineering,
and humanities disciplines.</p>

<p>For the last ten years, I’ve been building software that simplifies and
democratizes access to these resources. Here, I’m going to highlight three
components:</p>

<ol>
  <li><a href="https://pubmed.ncbi.nlm.nih.gov/"><strong>PubMed Downloader</strong></a> provides a wrapper
around PubMed’s API and around bulk download and processing of the source
data. While this resource only contains biomedical text, its place in the
workflow can be replaced with any other text source.</li>
  <li><a href="https://github.com/cthoyt/ssslm"><strong>SSSLM</strong></a> provides a wrapper around NER
methods such as <a href="https://github.com/gyorilab/gilda">Gilda</a> and
<a href="https://github.com/explosion/spaCy">spaCy</a>. SSSLM uses a pared-down version
of Gilda as its default NER tool because Gilda is fast, interpretable, and
easy to install (after removing some parts). SSSLM and the methods it wraps
are fully domain-agnostic.</li>
  <li><a href="https://github.com/biopragmatics/pyobo"><strong>PyOBO</strong></a> provides a wrapper around
fetching and processing ontologies, controlled vocabularies, databases, and
other resources that can be used as a dictionary. It also has a high-level
workflow,
<a href="https://pyobo.readthedocs.io/en/latest/api/pyobo.get_grounder.html"><code class="language-plaintext highlighter-rouge">pyobo.get_grounder()</code></a>,
for getting content into <code class="language-plaintext highlighter-rouge">ssslm</code>. It’s built on the
<a href="https://semantic.farm">Semantic Farm</a> (previously called the Bioregistry) to
enable it to find and access ontologies across disciplines.</li>
</ol>

<h2 id="demonstration">Demonstration</h2>

<p>The following demonstrates how to get the abstracts of 5 articles from
PubMed, perform named entity recognition (NER) using Medical Subject Headings
(MeSH), and output the results (below). Note that the following code can be run
as a script using <code class="language-plaintext highlighter-rouge">uv run</code>, as it declares its dependencies as
<a href="https://peps.python.org/pep-0723/">PEP-723</a> inline metadata.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># /// script
# requires-python = "&gt;=3.12"
# dependencies = [
#     "click&gt;=8.3.1",
#     "pubmed-downloader&gt;=0.0.12",
#     "pyobo[gilda-slim]&gt;=0.12.13",
#     "tabulate&gt;=0.9.0",
# ]
# ///
</span>
<span class="kn">import</span> <span class="nn">click</span>
<span class="kn">import</span> <span class="nn">pubmed_downloader</span>
<span class="kn">import</span> <span class="nn">pyobo</span>
<span class="kn">import</span> <span class="nn">ssslm</span>
<span class="kn">from</span> <span class="nn">tabulate</span> <span class="kn">import</span> <span class="n">tabulate</span>

<span class="c1"># get a grounder loaded up with a specific version of MeSH.
# if you don't specify a version, the latest will be used.
</span><span class="n">grounder</span><span class="p">:</span> <span class="n">ssslm</span><span class="p">.</span><span class="n">Grounder</span> <span class="o">=</span> <span class="n">pyobo</span><span class="p">.</span><span class="n">get_grounder</span><span class="p">(</span><span class="s">"mesh"</span><span class="p">,</span> <span class="n">versions</span><span class="o">=</span><span class="s">"2018"</span><span class="p">)</span>

<span class="c1"># get 5 PubMed identifiers about diabetes. note that the
# PubMed API has been horrifically slow lately, so please be patient
</span><span class="n">pubmed_ids</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">pubmed_downloader</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="s">"diabetes"</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s">"api"</span><span class="p">,</span> <span class="n">retmax</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">click</span><span class="p">.</span><span class="n">echo</span><span class="p">(</span><span class="sa">f</span><span class="s">"got </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">pubmed_ids</span><span class="p">)</span><span class="si">}</span><span class="s"> pubmed IDs"</span><span class="p">)</span>

<span class="k">for</span> <span class="n">article</span> <span class="ow">in</span> <span class="n">pubmed_downloader</span><span class="p">.</span><span class="n">get_articles</span><span class="p">(</span><span class="n">pubmed_ids</span><span class="p">,</span> <span class="n">error_strategy</span><span class="o">=</span><span class="s">"skip"</span><span class="p">,</span> <span class="n">progress</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="n">abstract</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">article</span><span class="p">.</span><span class="n">get_abstract</span><span class="p">()</span>

    <span class="c1"># get a list of annotations, which contain the offsets of the entity
</span>    <span class="c1"># and the grounding to a Bioregistry-standardized CURIE.
</span>    <span class="c1"># more generally, this can be applied to any string from any source
</span>    <span class="n">annotations</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="s">"ssslm.Annotation"</span><span class="p">]</span> <span class="o">=</span> <span class="n">grounder</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">abstract</span><span class="p">)</span>

    <span class="n">rows</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">(</span>
            <span class="n">annotation</span><span class="p">.</span><span class="n">start</span><span class="p">,</span>
            <span class="n">annotation</span><span class="p">.</span><span class="n">end</span><span class="p">,</span>
            <span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">annotation</span><span class="p">.</span><span class="n">curie</span><span class="si">}</span><span class="s">](https://semantic.farm/</span><span class="si">{</span><span class="n">annotation</span><span class="p">.</span><span class="n">curie</span><span class="si">}</span><span class="s">)"</span><span class="p">,</span>
            <span class="n">annotation</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
            <span class="nb">round</span><span class="p">(</span><span class="n">annotation</span><span class="p">.</span><span class="n">score</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
        <span class="p">)</span>
        <span class="k">for</span> <span class="n">annotation</span> <span class="ow">in</span> <span class="n">annotations</span>
    <span class="p">]</span>
    <span class="n">headers</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Start"</span><span class="p">,</span> <span class="s">"End"</span><span class="p">,</span> <span class="s">"CURIE"</span><span class="p">,</span> <span class="s">"Name"</span><span class="p">,</span> <span class="s">"Score"</span><span class="p">]</span>
    <span class="n">table</span> <span class="o">=</span> <span class="n">tabulate</span><span class="p">(</span><span class="n">rows</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span> <span class="n">tablefmt</span><span class="o">=</span><span class="s">"github"</span><span class="p">)</span>

    <span class="n">click</span><span class="p">.</span><span class="n">echo</span><span class="p">(</span>
        <span class="sa">f</span><span class="s">"**</span><span class="si">{</span><span class="n">article</span><span class="p">.</span><span class="n">title</span><span class="p">.</span><span class="n">rstrip</span><span class="p">().</span><span class="n">rstrip</span><span class="p">(</span><span class="s">'.'</span><span class="p">)</span><span class="si">}</span><span class="s">** "</span>
        <span class="sa">f</span><span class="s">"([pubmed:</span><span class="si">{</span><span class="n">article</span><span class="p">.</span><span class="n">pubmed</span><span class="si">}</span><span class="s">](https://semantic.farm/pubmed:</span><span class="si">{</span><span class="n">article</span><span class="p">.</span><span class="n">pubmed</span><span class="si">}</span><span class="s">))"</span>
        <span class="sa">f</span><span class="s">"</span><span class="se">\n\n</span><span class="s">&gt; </span><span class="si">{</span><span class="n">abstract</span><span class="si">}</span><span class="se">\n\n</span><span class="si">{</span><span class="n">table</span><span class="si">}</span><span class="se">\n\n</span><span class="s">"</span>
    <span class="p">)</span>
</code></pre></div></div>
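<p>To make the output concrete: the <code class="language-plaintext highlighter-rouge">start</code> and <code class="language-plaintext highlighter-rouge">end</code> values on each annotation are character offsets into the annotated string, so the matched surface text can always be recovered by slicing. Here’s a minimal, self-contained sketch (the sentence and offsets below are made up for illustration):</p>

```python
# The start/end fields on an annotation are character offsets into the
# annotated string, so slicing recovers the matched surface text.
abstract = "Shift workers experience regular changes in their waking hours."

# hypothetical offsets, standing in for annotation.start and annotation.end
start, end = 0, 13

matched_text = abstract[start:end]
print(matched_text)  # Shift workers
```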

<h2 id="parting-thoughts">Parting Thoughts</h2>

<p>Normally I post parting thoughts at the bottom of each post, but since the
results take up a lot of space, I’ll put them here.</p>

<p>There are many directions to take these tools. The first might be to use a
subset of MeSH that’s most appropriate for the annotation task. For example, if
we just wanted to see diseases, then it only makes sense to use the MeSH Disease
branch. Similarly, there are many other ontologies, controlled vocabularies, and
databases in the disease space, such as MONDO, DOID, SNOMED-CT, and many more.
These can be incorporated into the grounder with
<code class="language-plaintext highlighter-rouge">pyobo.get_grounder(["mesh", "mondo", "doid", "snomedct"])</code>, but this will lead to
redundancy issues. I’ve previously published
<a href="https://github.com/biopragmatics/semra">SeMRA</a>, where I addressed mapping
between equivalent entities, but am currently working on using these results to
assemble coherent and comprehensive lexica that can be easily reused by SSSLM in
the <a href="https://github.com/biopragmatics/biolexica">Biolexica project</a> (which will
also get renamed to be domain-agnostic).</p>
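<p>As a hypothetical sketch of what resolving that redundancy could look like: when matches from several vocabularies cover the same span of text, one simple policy is to keep only the highest-scoring match per span. The annotation tuples below are made up for illustration, standing in for the annotation objects returned by the grounder:</p>

```python
# Hypothetical sketch: a grounder built from overlapping vocabularies
# (e.g., MeSH + MONDO) may annotate the same span twice. One simple
# policy keeps the highest-scoring match per (start, end) span.
annotations = [
    (112, 124, "mesh:D006973", 0.762),
    (112, 124, "mondo:0005044", 0.741),  # same span, lower score
    (236, 242, "mesh:D006262", 0.762),
]

best = {}
for start, end, curie, score in annotations:
    span = (start, end)
    if span not in best or score > best[span][3]:
        best[span] = (start, end, curie, score)

deduplicated = sorted(best.values())
print(deduplicated)
# [(112, 124, 'mesh:D006973', 0.762), (236, 242, 'mesh:D006262', 0.762)]
```

A real solution would also need to handle partially overlapping spans and equivalent entities across vocabularies, which is where the mapping work mentioned above comes in.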

<p>These tools can be applied directly in other domains. For example, in the
energy domain, the
<a href="https://semantic.farm/oeo">Open Energy Ontology</a> can be used with
<code class="language-plaintext highlighter-rouge">pyobo.get_grounder("oeo")</code>. In general, the
<a href="https://semantic.farm">Semantic Farm</a> can be used to find ontologies from other
domains. Within the <a href="https://www.nfdi.de/?lang=en">NFDI</a>, there are
<a href="https://semantic.farm/collection/">collections</a> for each NFDI consortium that
contain lists of relevant ontologies, controlled vocabularies, databases, and
other resources that mint identifiers.</p>

<p>I hope this was a helpful introduction! If you’ve got questions about these
workflows or want to see a demo on your favorite literature source/ontology/NER
method, post an issue to the relevant package’s issue tracker.</p>

<h2 id="results">Results</h2>

<p><strong>Investigation of intake pattern of SGLT2 inhibitors among shift workers with
diabetes: a crossover study</strong>
(<a href="https://semantic.farm/pubmed:41413602">pubmed:41413602</a>)</p>

<blockquote>
  <p>Shift workers experience regular changes in their waking hours due to
fluctuating work schedules. The timing of their medication intake differs
depending on whether they are working a day or night shift. Sodium-glucose
co-transporter 2 (SGLT2) inhibitors are prescribed once a day and are often
taken before or after breakfast. However, studies on the optimal dosing times
for the effective treatment of shift workers are lacking. In this study, we
investigated whether the effects were different by the pattern of SGLT2
inhibitor intake for shift workers with diabetes. Seven shift workers with
diabetes who were taking an SGLT2 inhibitor were analyzed. All participants
took the medication upon waking for 14 days, followed by administration at a
fixed time for another 14 days. Glucose levels were measured over 14 days when
the drug was administered either upon waking or at a fixed time of day. The
time in range (TIR), which indicates the percentage of time during which the
glucose level is within the range of 70-180 mg/dL, was used as the main
evaluation index. The mean HbA1c of the participants was 7.1%. The TIR was
88.5% in the administration upon waking group and 84.9% in the administration
at a fixed time group. No significant difference in TIR values was observed
between the two administration groups. A TIR of 70% or higher is recommended
to prevent the onset of diabetic complications. Consistent intake of SGLT2
inhibitors, regardless of whether it is during the day or night shift, may
help stabilize blood glucose levels in shift workers throughout the day and
night, thereby preventing the development of complications.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Start</th>
      <th>End</th>
      <th>CURIE</th>
      <th>Name</th>
      <th>Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>6</td>
      <td>13</td>
      <td><a href="https://semantic.farm/mesh:D009274">mesh:D009274</a></td>
      <td>Occupational Groups</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>82</td>
      <td>96</td>
      <td><a href="https://semantic.farm/mesh:D010561">mesh:D010561</a></td>
      <td>Personnel Staffing and Scheduling</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>219</td>
      <td>233</td>
      <td><a href="https://semantic.farm/mesh:D027981">mesh:D027981</a></td>
      <td>Symporters</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>317</td>
      <td>326</td>
      <td><a href="https://semantic.farm/mesh:D062408">mesh:D062408</a></td>
      <td>Breakfast</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>410</td>
      <td>417</td>
      <td><a href="https://semantic.farm/mesh:D009274">mesh:D009274</a></td>
      <td>Occupational Groups</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>515</td>
      <td>530</td>
      <td><a href="https://semantic.farm/mesh:D000077203">mesh:D000077203</a></td>
      <td>Sodium-Glucose Transporter 2 Inhibitors</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>548</td>
      <td>555</td>
      <td><a href="https://semantic.farm/mesh:D009274">mesh:D009274</a></td>
      <td>Occupational Groups</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>583</td>
      <td>590</td>
      <td><a href="https://semantic.farm/mesh:D009274">mesh:D009274</a></td>
      <td>Occupational Groups</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>624</td>
      <td>639</td>
      <td><a href="https://semantic.farm/mesh:D000077203">mesh:D000077203</a></td>
      <td>Sodium-Glucose Transporter 2 Inhibitors</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>729</td>
      <td>743</td>
      <td><a href="https://semantic.farm/mesh:D009934">mesh:D009934</a></td>
      <td>Organization and Administration</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>781</td>
      <td>788</td>
      <td><a href="https://semantic.farm/mesh:D005947">mesh:D005947</a></td>
      <td>Glucose</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>832</td>
      <td>836</td>
      <td><a href="https://semantic.farm/mesh:D004364">mesh:D004364</a></td>
      <td>Pharmaceutical Preparations</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>981</td>
      <td>988</td>
      <td><a href="https://semantic.farm/mesh:D005947">mesh:D005947</a></td>
      <td>Glucose</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1064</td>
      <td>1069</td>
      <td><a href="https://semantic.farm/mesh:D020481">mesh:D020481</a></td>
      <td>Index</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1141</td>
      <td>1155</td>
      <td><a href="https://semantic.farm/mesh:D009934">mesh:D009934</a></td>
      <td>Organization and Administration</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1191</td>
      <td>1205</td>
      <td><a href="https://semantic.farm/mesh:D009934">mesh:D009934</a></td>
      <td>Organization and Administration</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1298</td>
      <td>1312</td>
      <td><a href="https://semantic.farm/mesh:D009934">mesh:D009934</a></td>
      <td>Organization and Administration</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1383</td>
      <td>1405</td>
      <td><a href="https://semantic.farm/mesh:D048909">mesh:D048909</a></td>
      <td>Diabetes Complications</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1428</td>
      <td>1444</td>
      <td><a href="https://semantic.farm/mesh:D000077203">mesh:D000077203</a></td>
      <td>Sodium-Glucose Transporter 2 Inhibitors</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>1524</td>
      <td>1537</td>
      <td><a href="https://semantic.farm/mesh:D001786">mesh:D001786</a></td>
      <td>Blood Glucose</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1554</td>
      <td>1561</td>
      <td><a href="https://semantic.farm/mesh:D009274">mesh:D009274</a></td>
      <td>Occupational Groups</td>
      <td>0.54</td>
    </tr>
  </tbody>
</table>

<p><strong>Men’s health needs assessment in the Toledo District of Southern Belize</strong>
(<a href="https://semantic.farm/pubmed:41413521">pubmed:41413521</a>)</p>

<blockquote>
  <p>Belize is a small country in Central America with a growing burden of
non-communicable disease (NCD), including hypertension and diabetes. Toledo
District is the southernmost and poorest district in the country. Reliable
national level health data for Belize is readily available, but the data is
rarely disaggregated by sex or district. Reducing the burden of NCDs is a high
priority for the Ministry of Health and Wellness. Belize’s progress to date on
Sustainable Development Goal (SDG) 3 (Good Health and Wellbeing) has been
modest with many indicators stagnating or progress increasing at less than 50%
of the required rate. SDG 3 describes the need to reduce the risks of NCDs and
to strengthen the capacities of the healthcare workforce. The objective was to
perform a men’s health needs assessment to identify and prioritize men’s
health needs in the Toledo District. This was a mixed methods study.
Qualitative data were collected from semi-structured interviews. Interviews
were recorded, transcribed, and analyzed using Thematic Analysis. Quantitative
data included epidemiological data from national vital statistics or disease
registries and other public sources. Data were collected between January and
June 2017. Belizean men have among the highest risk for cardiac or diabetes
related illness or death in the Americas. Diabetes and hypertension are
responsible for 4.49% and 1.23% of Disability Adjusted Life Years in men
respectively and are increasing by 2.51% annually. Fifty-seven interviews (55
individuals and two groups) from nine villages were carried out. Four themes
emerged from the qualitative data. Men in Toledo: • have poor health literacy;
• have reasonable access to health resources, but do not use them; • inability
to clearly articulate health priorities; • do not process risk well. Men in
Toledo suffer from a high prevalence of NCDs including hypertension and
diabetes and understand health and risks poorly. This may contribute to
Belize’s struggle to achieve the goals of SDG 3.4.1. Strengthening the
healthcare workforce by improved training of community health workers (CHWs)
and providing health education to men in Toledo is required to address these
concerns.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Start</th>
      <th>End</th>
      <th>CURIE</th>
      <th>Name</th>
      <th>Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>6</td>
      <td><a href="https://semantic.farm/mesh:D001531">mesh:D001531</a></td>
      <td>Belize</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>29</td>
      <td>44</td>
      <td><a href="https://semantic.farm/mesh:D002489">mesh:D002489</a></td>
      <td>Central America</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>70</td>
      <td>94</td>
      <td><a href="https://semantic.farm/mesh:D000073296">mesh:D000073296</a></td>
      <td>Noncommunicable Diseases</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>112</td>
      <td>124</td>
      <td><a href="https://semantic.farm/mesh:D006973">mesh:D006973</a></td>
      <td>Hypertension</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>236</td>
      <td>242</td>
      <td><a href="https://semantic.farm/mesh:D006262">mesh:D006262</a></td>
      <td>Health</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>252</td>
      <td>258</td>
      <td><a href="https://semantic.farm/mesh:D001531">mesh:D001531</a></td>
      <td>Belize</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>321</td>
      <td>324</td>
      <td><a href="https://semantic.farm/mesh:D012723">mesh:D012723</a></td>
      <td>Sex</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>405</td>
      <td>411</td>
      <td><a href="https://semantic.farm/mesh:D006262">mesh:D006262</a></td>
      <td>Health</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>426</td>
      <td>432</td>
      <td><a href="https://semantic.farm/mesh:D001531">mesh:D001531</a></td>
      <td>Belize</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>455</td>
      <td>483</td>
      <td><a href="https://semantic.farm/mesh:D000076502">mesh:D000076502</a></td>
      <td>Sustainable Development</td>
      <td>0.556</td>
    </tr>
    <tr>
      <td>498</td>
      <td>504</td>
      <td><a href="https://semantic.farm/mesh:D006262">mesh:D006262</a></td>
      <td>Health</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>546</td>
      <td>556</td>
      <td><a href="https://semantic.farm/mesh:D007202">mesh:D007202</a></td>
      <td>Indicators and Reagents</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>669</td>
      <td>674</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>723</td>
      <td>733</td>
      <td><a href="https://semantic.farm/mesh:D003695">mesh:D003695</a></td>
      <td>Delivery of Health Care</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>734</td>
      <td>743</td>
      <td><a href="https://semantic.farm/mesh:D000078329">mesh:D000078329</a></td>
      <td>Workforce</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>776</td>
      <td>788</td>
      <td><a href="https://semantic.farm/mesh:D054526">mesh:D054526</a></td>
      <td>Men’s Health</td>
      <td>0.725</td>
    </tr>
    <tr>
      <td>789</td>
      <td>805</td>
      <td><a href="https://semantic.farm/mesh:D020380">mesh:D020380</a></td>
      <td>Needs Assessment</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>833</td>
      <td>845</td>
      <td><a href="https://semantic.farm/mesh:D054526">mesh:D054526</a></td>
      <td>Men’s Health</td>
      <td>0.725</td>
    </tr>
    <tr>
      <td>846</td>
      <td>851</td>
      <td><a href="https://semantic.farm/mesh:D006301">mesh:D006301</a></td>
      <td>Health Services Needs and Demand</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>893</td>
      <td>900</td>
      <td><a href="https://semantic.farm/mesh:D008722">mesh:D008722</a></td>
      <td>Methods</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1112</td>
      <td>1128</td>
      <td><a href="https://semantic.farm/mesh:D014798">mesh:D014798</a></td>
      <td>Vital Statistics</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1132</td>
      <td>1139</td>
      <td><a href="https://semantic.farm/mesh:D004194">mesh:D004194</a></td>
      <td>Disease</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1140</td>
      <td>1150</td>
      <td><a href="https://semantic.farm/mesh:D012042">mesh:D012042</a></td>
      <td>Registries</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1237</td>
      <td>1240</td>
      <td><a href="https://semantic.farm/mesh:D008571">mesh:D008571</a></td>
      <td>Men</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1264</td>
      <td>1268</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1312</td>
      <td>1317</td>
      <td><a href="https://semantic.farm/mesh:D003643">mesh:D003643</a></td>
      <td>Death</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1325</td>
      <td>1333</td>
      <td><a href="https://semantic.farm/mesh:D000569">mesh:D000569</a></td>
      <td>Americas</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>1348</td>
      <td>1360</td>
      <td><a href="https://semantic.farm/mesh:D006973">mesh:D006973</a></td>
      <td>Hypertension</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1400</td>
      <td>1430</td>
      <td><a href="https://semantic.farm/mesh:D000087509">mesh:D000087509</a></td>
      <td>Disability-Adjusted Life Years</td>
      <td>0.556</td>
    </tr>
    <tr>
      <td>1434</td>
      <td>1437</td>
      <td><a href="https://semantic.farm/mesh:D008571">mesh:D008571</a></td>
      <td>Men</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1628</td>
      <td>1631</td>
      <td><a href="https://semantic.farm/mesh:D008571">mesh:D008571</a></td>
      <td>Men</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>1655</td>
      <td>1670</td>
      <td><a href="https://semantic.farm/mesh:D057220">mesh:D057220</a></td>
      <td>Health Literacy</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1700</td>
      <td>1716</td>
      <td><a href="https://semantic.farm/mesh:D006295">mesh:D006295</a></td>
      <td>Health Resources</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1773</td>
      <td>1790</td>
      <td><a href="https://semantic.farm/mesh:D006292">mesh:D006292</a></td>
      <td>Health Priorities</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1809</td>
      <td>1813</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1820</td>
      <td>1823</td>
      <td><a href="https://semantic.farm/mesh:D008571">mesh:D008571</a></td>
      <td>Men</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>1853</td>
      <td>1863</td>
      <td><a href="https://semantic.farm/mesh:D015995">mesh:D015995</a></td>
      <td>Prevalence</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1882</td>
      <td>1894</td>
      <td><a href="https://semantic.farm/mesh:D006973">mesh:D006973</a></td>
      <td>Hypertension</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1923</td>
      <td>1929</td>
      <td><a href="https://semantic.farm/mesh:D006262">mesh:D006262</a></td>
      <td>Health</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1934</td>
      <td>1939</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1971</td>
      <td>1977</td>
      <td><a href="https://semantic.farm/mesh:D001531">mesh:D001531</a></td>
      <td>Belize</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>2004</td>
      <td>2009</td>
      <td><a href="https://semantic.farm/mesh:D006040">mesh:D006040</a></td>
      <td>Goals</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>2042</td>
      <td>2052</td>
      <td><a href="https://semantic.farm/mesh:D003695">mesh:D003695</a></td>
      <td>Delivery of Health Care</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>2053</td>
      <td>2062</td>
      <td><a href="https://semantic.farm/mesh:D000078329">mesh:D000078329</a></td>
      <td>Workforce</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>2087</td>
      <td>2111</td>
      <td><a href="https://semantic.farm/mesh:D003150">mesh:D003150</a></td>
      <td>Community Health Workers</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>2133</td>
      <td>2149</td>
      <td><a href="https://semantic.farm/mesh:D006266">mesh:D006266</a></td>
      <td>Health Education</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>2153</td>
      <td>2156</td>
      <td><a href="https://semantic.farm/mesh:D008571">mesh:D008571</a></td>
      <td>Men</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>2182</td>
      <td>2189</td>
      <td><a href="https://semantic.farm/mesh:D019484">mesh:D019484</a></td>
      <td>Address</td>
      <td>0.762</td>
    </tr>
  </tbody>
</table>

<p><strong>Risk factors of ventilator-associated pneumonia in patients with acute
exacerbation of chronic obstructive pulmonary disease: a meta-analysis and
systematic review</strong> (<a href="https://semantic.farm/pubmed:41413500">pubmed:41413500</a>)</p>

<blockquote>
  <p>This meta-analysis aimed to identify risk factors for ventilator-associated
pneumonia (VAP) in patients with Acute exacerbations of Chronic obstructive
pulmonary disease (AECOPD). We systematically searched PubMed, Web of Science,
CINAHL, Cochrane Library, Embase, CNKI, and other databases for studies
investigating risk factors for VAP in patients experiencing AECOPD. The search
encompassed records from database inception up to July 2, 2025. The quality of
the studies was assessed using the Newcastle-Ottawa Scale. Meta-analysis was
performed using Stata 18.0. A total of 16 articles were included, encompassing
3,664 subjects and 16 risk factors. Meta-analysis results showed that, Age
(OR: 2.49, 95%CI : 1.49, 4.17; P&lt;0.001), Smoking history (OR: 2.70, 95%CI :
1.65, 4.44; P&lt;0.001), Acute physiology and chronic health evaluation composite
score (APACHE Ⅱ) score (OR: 3.03, 95%CI : 1.98, 4.65; P&lt;0.001), Sequential
organ failure assessment (SOFA) score (OR: 2.75, 95%CI : 1.90, 3.99; P&lt;0.001),
Diabetes (OR: 2.11, 95%CI : 1.38, 3.24; P = 0.001), Underlying Diseases (OR:
3.42, 95%CI : 1.85, 6.32; P&lt;0.001), Duration of mechanical ventilation (OR:
4.53, 95%CI : 2.68, 7.65; P&lt;0.001), Tracheal intubation (OR: 4.21, 95%CI :
1.85, 9.57; P = 0.001), Indwelling gastric tube ( OR: 3.31, 95%CI : 1.38,
7.95; P = 0.008), Total parenteral nutrition (OR: 1.86, 95%CI : 1.29, 2.70; P
= 0.001), Combined antibiotics (OR: 2.79, 95%CI : 1.32, 5.93; P = 0.007),
Tracheotomy (OR: 2.92, 95%CI : 2.04, 4.17; P&lt;0.001), History of mechanical
ventilation within one year (OR: 2.92, 95%CI : 2.04, 4.17; P = 0.005), Use
acid suppressants (OR: 2.10, 95%CI : 1.49, 2.97; P&lt;0.001) were associated with
the development of VAP in AECOPD patients. This study identified 14 risk
factors associated with the risk of VAP in AECOPD patients. This finding is
helpful for early identification of high-risk patients, which is of great
value for reducing mortality and improving the clinical prognosis of patients
with mechanical ventilation.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Start</th>
      <th>End</th>
      <th>CURIE</th>
      <th>Name</th>
      <th>Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>5</td>
      <td>18</td>
      <td><a href="https://semantic.farm/mesh:D017418">mesh:D017418</a></td>
      <td>Meta-Analysis</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>37</td>
      <td>49</td>
      <td><a href="https://semantic.farm/mesh:D012307">mesh:D012307</a></td>
      <td>Risk Factors</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>54</td>
      <td>85</td>
      <td><a href="https://semantic.farm/mesh:D053717">mesh:D053717</a></td>
      <td>Pneumonia, Ventilator-Associated</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>132</td>
      <td>169</td>
      <td><a href="https://semantic.farm/mesh:D029424">mesh:D029424</a></td>
      <td>Pulmonary Disease, Chronic Obstructive</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>207</td>
      <td>213</td>
      <td><a href="https://semantic.farm/mesh:D039781">mesh:D039781</a></td>
      <td>PubMed</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>222</td>
      <td>229</td>
      <td><a href="https://semantic.farm/mesh:D012586">mesh:D012586</a></td>
      <td>Science</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>248</td>
      <td>255</td>
      <td><a href="https://semantic.farm/mesh:D007990">mesh:D007990</a></td>
      <td>Libraries</td>
      <td>0.556</td>
    </tr>
    <tr>
      <td>317</td>
      <td>329</td>
      <td><a href="https://semantic.farm/mesh:D012307">mesh:D012307</a></td>
      <td>Risk Factors</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>394</td>
      <td>401</td>
      <td><a href="https://semantic.farm/mesh:D011996">mesh:D011996</a></td>
      <td>Records</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>407</td>
      <td>415</td>
      <td><a href="https://semantic.farm/mesh:D019991">mesh:D019991</a></td>
      <td>Database</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>520</td>
      <td>533</td>
      <td><a href="https://semantic.farm/mesh:D017418">mesh:D017418</a></td>
      <td>Meta-Analysis</td>
      <td>0.772</td>
    </tr>
    <tr>
      <td>639</td>
      <td>651</td>
      <td><a href="https://semantic.farm/mesh:D012307">mesh:D012307</a></td>
      <td>Risk Factors</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>653</td>
      <td>666</td>
      <td><a href="https://semantic.farm/mesh:D017418">mesh:D017418</a></td>
      <td>Meta-Analysis</td>
      <td>0.772</td>
    </tr>
    <tr>
      <td>735</td>
      <td>742</td>
      <td><a href="https://semantic.farm/mesh:D012907">mesh:D012907</a></td>
      <td>Smoking</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>743</td>
      <td>750</td>
      <td><a href="https://semantic.farm/mesh:D006664">mesh:D006664</a></td>
      <td>History</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>794</td>
      <td>840</td>
      <td><a href="https://semantic.farm/mesh:D018806">mesh:D018806</a></td>
      <td>APACHE</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>858</td>
      <td>866</td>
      <td><a href="https://semantic.farm/mesh:D018806">mesh:D018806</a></td>
      <td>APACHE</td>
      <td>0.53</td>
    </tr>
    <tr>
      <td>1072</td>
      <td>1080</td>
      <td><a href="https://semantic.farm/mesh:D004194">mesh:D004194</a></td>
      <td>Disease</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>1072</td>
      <td>1080</td>
      <td><a href="https://semantic.farm/obo:mesh#C">obo:mesh#C</a></td>
      <td>Diseases</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>1136</td>
      <td>1158</td>
      <td><a href="https://semantic.farm/mesh:D012121">mesh:D012121</a></td>
      <td>Respiration, Artificial</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1211</td>
      <td>1221</td>
      <td><a href="https://semantic.farm/mesh:D007440">mesh:D007440</a></td>
      <td>Intubation</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1332</td>
      <td>1358</td>
      <td><a href="https://semantic.farm/mesh:D010289">mesh:D010289</a></td>
      <td>Parenteral Nutrition, Total</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>1411</td>
      <td>1422</td>
      <td><a href="https://semantic.farm/mesh:D000900">mesh:D000900</a></td>
      <td>Anti-Bacterial Agents</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1466</td>
      <td>1477</td>
      <td><a href="https://semantic.farm/mesh:D014140">mesh:D014140</a></td>
      <td>Tracheotomy</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>1521</td>
      <td>1528</td>
      <td><a href="https://semantic.farm/mesh:D006664">mesh:D006664</a></td>
      <td>History</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>1532</td>
      <td>1554</td>
      <td><a href="https://semantic.farm/mesh:D012121">mesh:D012121</a></td>
      <td>Respiration, Artificial</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1767</td>
      <td>1779</td>
      <td><a href="https://semantic.farm/mesh:D012307">mesh:D012307</a></td>
      <td>Risk Factors</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1800</td>
      <td>1804</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1941</td>
      <td>1950</td>
      <td><a href="https://semantic.farm/mesh:D009026">mesh:D009026</a></td>
      <td>Mortality</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1978</td>
      <td>1987</td>
      <td><a href="https://semantic.farm/mesh:D011379">mesh:D011379</a></td>
      <td>Prognosis</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>2005</td>
      <td>2027</td>
      <td><a href="https://semantic.farm/mesh:D012121">mesh:D012121</a></td>
      <td>Respiration, Artificial</td>
      <td>0.54</td>
    </tr>
  </tbody>
</table>

<p><strong>Randomized trial assessing transverse supraumbilical incisions for cesarean
sections in morbid obese women with pannus</strong>
(<a href="https://semantic.farm/pubmed:41413498">pubmed:41413498</a>)</p>

<blockquote>
  <p>BACKGROUND AND OBJECTIVE: The high prevalence of Morbidly obese Egyptian
patients presents surgical problems for cesarean sections (CS), including a
higher risk of wound infections. This study examines the impact of a
transverse supraumbilical (TSU) incision in these patients. We conducted a
randomized controlled trial on 72 morbidly obese patients (BMI &gt;40 kg/m²)
scheduled for cesarean section at Ain Shams University Hospital from March
2016 to August 2018. Participants were divided into Group A (36 patients) with
a transverse supraumbilical (TSU) incision and Group B (36 patients) with a
conventional Pfannenstiel incision. The primary outcome measured was the
incidence of wound infection, while secondary outcomes included operative
time, postoperative pain, hospital stay, blood loss, postoperative mobility,
and intestinal motility. The results indicated no significant differences
between the groups regarding age, BMI, parity, diabetes mellitus, and history
of previous cesarean sections. The incidence of surgical site infection was
significantly lower in the transverse supraumbilical group (11.1%, 4/36)
compared to the Pfannenstiel group (58.3%, 21/36), with an absolute risk
reduction of 47.2% (95% CI: 27.8% to 66.6%). Other parameters like operative
time, hematocrit drop, pain score, hospital stay, and intestinal motility
showed no significant differences between the groups (P&gt;0.05). Supraumbilical
transverse incisions are a safe, effective alternative to Pfannenstiel
incisions in morbidly obese women, with better wound infection rates and
easier access. Further research is needed to confirm the benefits and to
assess patient satisfaction. This study was registered prospectively in
clinicaltrials.gov (NCT02692729) on 1.3.2016.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Start</th>
      <th>End</th>
      <th>CURIE</th>
      <th>Name</th>
      <th>Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>35</td>
      <td>45</td>
      <td><a href="https://semantic.farm/mesh:D015995">mesh:D015995</a></td>
      <td>Prevalence</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>113</td>
      <td>130</td>
      <td><a href="https://semantic.farm/mesh:D002585">mesh:D002585</a></td>
      <td>Cesarean Section</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>156</td>
      <td>160</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>164</td>
      <td>180</td>
      <td><a href="https://semantic.farm/mesh:D014946">mesh:D014946</a></td>
      <td>Wound Infection</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>293</td>
      <td>320</td>
      <td><a href="https://semantic.farm/mesh:D016449">mesh:D016449</a></td>
      <td>Randomized Controlled Trial</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>381</td>
      <td>397</td>
      <td><a href="https://semantic.farm/mesh:D002585">mesh:D002585</a></td>
      <td>Cesarean Section</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>411</td>
      <td>421</td>
      <td><a href="https://semantic.farm/mesh:D014495">mesh:D014495</a></td>
      <td>Universities</td>
      <td>0.556</td>
    </tr>
    <tr>
      <td>422</td>
      <td>430</td>
      <td><a href="https://semantic.farm/mesh:D006761">mesh:D006761</a></td>
      <td>Hospitals</td>
      <td>0.556</td>
    </tr>
    <tr>
      <td>670</td>
      <td>679</td>
      <td><a href="https://semantic.farm/mesh:D015994">mesh:D015994</a></td>
      <td>Incidence</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>683</td>
      <td>698</td>
      <td><a href="https://semantic.farm/mesh:D014946">mesh:D014946</a></td>
      <td>Wound Infection</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>734</td>
      <td>748</td>
      <td><a href="https://semantic.farm/mesh:D061646">mesh:D061646</a></td>
      <td>Operative Time</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>750</td>
      <td>768</td>
      <td><a href="https://semantic.farm/mesh:D010149">mesh:D010149</a></td>
      <td>Pain, Postoperative</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>770</td>
      <td>783</td>
      <td><a href="https://semantic.farm/mesh:D007902">mesh:D007902</a></td>
      <td>Length of Stay</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>785</td>
      <td>810</td>
      <td><a href="https://semantic.farm/mesh:D019106">mesh:D019106</a></td>
      <td>Postoperative Hemorrhage</td>
      <td>0.502</td>
    </tr>
    <tr>
      <td>825</td>
      <td>844</td>
      <td><a href="https://semantic.farm/mesh:D005769">mesh:D005769</a></td>
      <td>Gastrointestinal Motility</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>934</td>
      <td>940</td>
      <td><a href="https://semantic.farm/mesh:D010298">mesh:D010298</a></td>
      <td>Parity</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>942</td>
      <td>959</td>
      <td><a href="https://semantic.farm/mesh:D003920">mesh:D003920</a></td>
      <td>Diabetes Mellitus</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>965</td>
      <td>972</td>
      <td><a href="https://semantic.farm/mesh:D006664">mesh:D006664</a></td>
      <td>History</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>985</td>
      <td>1002</td>
      <td><a href="https://semantic.farm/mesh:D002585">mesh:D002585</a></td>
      <td>Cesarean Section</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1008</td>
      <td>1017</td>
      <td><a href="https://semantic.farm/mesh:D015994">mesh:D015994</a></td>
      <td>Incidence</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1021</td>
      <td>1044</td>
      <td><a href="https://semantic.farm/mesh:D013530">mesh:D013530</a></td>
      <td>Surgical Wound Infection</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1181</td>
      <td>1204</td>
      <td><a href="https://semantic.farm/mesh:D061366">mesh:D061366</a></td>
      <td>Numbers Needed To Treat</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1262</td>
      <td>1276</td>
      <td><a href="https://semantic.farm/mesh:D061646">mesh:D061646</a></td>
      <td>Operative Time</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1278</td>
      <td>1288</td>
      <td><a href="https://semantic.farm/mesh:D006400">mesh:D006400</a></td>
      <td>Hematocrit</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1295</td>
      <td>1299</td>
      <td><a href="https://semantic.farm/mesh:D010146">mesh:D010146</a></td>
      <td>Pain</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1307</td>
      <td>1320</td>
      <td><a href="https://semantic.farm/mesh:D007902">mesh:D007902</a></td>
      <td>Length of Stay</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1326</td>
      <td>1345</td>
      <td><a href="https://semantic.farm/mesh:D005769">mesh:D005769</a></td>
      <td>Gastrointestinal Motility</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1523</td>
      <td>1528</td>
      <td><a href="https://semantic.farm/mesh:D014930">mesh:D014930</a></td>
      <td>Women</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1542</td>
      <td>1557</td>
      <td><a href="https://semantic.farm/mesh:D014946">mesh:D014946</a></td>
      <td>Wound Infection</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1591</td>
      <td>1599</td>
      <td><a href="https://semantic.farm/mesh:D012106">mesh:D012106</a></td>
      <td>Research</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1648</td>
      <td>1668</td>
      <td><a href="https://semantic.farm/mesh:D017060">mesh:D017060</a></td>
      <td>Patient Satisfaction</td>
      <td>0.762</td>
    </tr>
  </tbody>
</table>

<p><strong>Associations of Perfluoroalkyl and Polyfluoroalkyl Substances With
Cardiovascular Disease Incidence in Adults With Prediabetes: Findings From the
Diabetes Prevention Program</strong>
(<a href="https://semantic.farm/pubmed:41413398">pubmed:41413398</a>)</p>

<blockquote>
  <p>Perfluoroalkyl and polyfluoroalkyl substances (PFAS) are persistent,
widespread environmental contaminants linked to cardiometabolic outcomes
including obesity, hyperlipidemia, and diabetes. We examined whether baseline
plasma PFAS concentrations are associated with incident cardiovascular disease
(CVD) in adults with prediabetes, leveraging data from DPPOS (Diabetes
Prevention Program Outcomes Study). Among 1382 participants, we quantified
baseline plasma concentrations of 6 PFAS. We used Cox proportional hazards
models to estimate the risks of developing CVD outcomes during a median of 21
years of follow-up for each PFAS and used quantile g-computation to evaluate
the joint effect of all 6 PFAS. Effect modification by age, sex, menopausal
status, diet, and physical activity was explored. The incidence of major
adverse cardiovascular events was 9.6%; 3.9% had CVD-related death. Each
increase in interquartile range (1.1 ng/mL) in 2-( In adults with prediabetes,
higher plasma concentrations of select PFAS, but not their mixture, were
prospectively associated with increased CVD risk. These findings underscore
PFAS as a potential environmental risk factor for CVD in high-risk
populations.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Start</th>
      <th>End</th>
      <th>CURIE</th>
      <th>Name</th>
      <th>Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>152</td>
      <td>159</td>
      <td><a href="https://semantic.farm/mesh:D009765">mesh:D009765</a></td>
      <td>Obesity</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>161</td>
      <td>175</td>
      <td><a href="https://semantic.farm/mesh:D006949">mesh:D006949</a></td>
      <td>Hyperlipidemias</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>220</td>
      <td>226</td>
      <td><a href="https://semantic.farm/mesh:D010949">mesh:D010949</a></td>
      <td>Plasma</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>276</td>
      <td>298</td>
      <td><a href="https://semantic.farm/mesh:D002318">mesh:D002318</a></td>
      <td>Cardiovascular Diseases</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>308</td>
      <td>314</td>
      <td><a href="https://semantic.farm/mesh:D000328">mesh:D000328</a></td>
      <td>Adult</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>320</td>
      <td>331</td>
      <td><a href="https://semantic.farm/mesh:D011236">mesh:D011236</a></td>
      <td>Prediabetic State</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>381</td>
      <td>388</td>
      <td><a href="https://semantic.farm/mesh:D019542">mesh:D019542</a></td>
      <td>Program</td>
      <td>0.778</td>
    </tr>
    <tr>
      <td>454</td>
      <td>460</td>
      <td><a href="https://semantic.farm/mesh:D010949">mesh:D010949</a></td>
      <td>Plasma</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>495</td>
      <td>526</td>
      <td><a href="https://semantic.farm/mesh:D016016">mesh:D016016</a></td>
      <td>Proportional Hazards Models</td>
      <td>0.549</td>
    </tr>
    <tr>
      <td>543</td>
      <td>548</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>679</td>
      <td>684</td>
      <td><a href="https://semantic.farm/mesh:D007596">mesh:D007596</a></td>
      <td>Joints</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>735</td>
      <td>738</td>
      <td><a href="https://semantic.farm/mesh:D012723">mesh:D012723</a></td>
      <td>Sex</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>759</td>
      <td>763</td>
      <td><a href="https://semantic.farm/mesh:D004032">mesh:D004032</a></td>
      <td>Diet</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>769</td>
      <td>786</td>
      <td><a href="https://semantic.farm/mesh:D015444">mesh:D015444</a></td>
      <td>Exercise</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>805</td>
      <td>814</td>
      <td><a href="https://semantic.farm/mesh:D015994">mesh:D015994</a></td>
      <td>Incidence</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>885</td>
      <td>890</td>
      <td><a href="https://semantic.farm/mesh:D003643">mesh:D003643</a></td>
      <td>Death</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>951</td>
      <td>957</td>
      <td><a href="https://semantic.farm/mesh:D000328">mesh:D000328</a></td>
      <td>Adult</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>963</td>
      <td>974</td>
      <td><a href="https://semantic.farm/mesh:D011236">mesh:D011236</a></td>
      <td>Prediabetic State</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>983</td>
      <td>989</td>
      <td><a href="https://semantic.farm/mesh:D010949">mesh:D010949</a></td>
      <td>Plasma</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1093</td>
      <td>1097</td>
      <td><a href="https://semantic.farm/mesh:D012306">mesh:D012306</a></td>
      <td>Risk</td>
      <td>0.762</td>
    </tr>
    <tr>
      <td>1159</td>
      <td>1170</td>
      <td><a href="https://semantic.farm/mesh:D012307">mesh:D012307</a></td>
      <td>Risk Factors</td>
      <td>0.54</td>
    </tr>
    <tr>
      <td>1192</td>
      <td>1203</td>
      <td><a href="https://semantic.farm/mesh:D011153">mesh:D011153</a></td>
      <td>Population</td>
      <td>0.762</td>
    </tr>
  </tbody>
</table>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="named entity recognition" /><category term="text mining" /><category term="natural language processing" /><category term="named entity normalization" /><category term="medical subject headings" /><category term="MeSH" /><category term="PubMed" /><category term="PyOBO" /><category term="SSSLM" /><summary type="html"><![CDATA[Annotating the literature with mentions of key concepts from a given domain is often the first step towards extracting more substantial structured knowledge. This can be challenging, as it typically encompasses acquiring and processing the relevant literature and ontologies then installing and applying difficult-to-use named entity recognition (NER) workflows. This post highlights software components I’ve implemented to simplify this workflow. I demonstrate it by annotating the biomedical literature available through PubMed with Medical Subject Headings (MeSH) terms, and also comment on how this can be generalized to other natural sciences, engineering, and humanities disciplines.]]></summary></entry><entry><title type="html">Machine-Actionable Training Materials at BioHackathon Germany 2025</title><link href="https://cthoyt.com/2025/12/09/biohackathon-de-2025.html" rel="alternate" type="text/html" title="Machine-Actionable Training Materials at BioHackathon Germany 2025" /><published>2025-12-09T11:08:00+00:00</published><updated>2025-12-09T11:08:00+00:00</updated><id>https://cthoyt.com/2025/12/09/biohackathon-de-2025</id><content type="html" xml:base="https://cthoyt.com/2025/12/09/biohackathon-de-2025.html"><![CDATA[<p>I recently attended the
<a href="https://www.denbi.de/de-nbi-events/1840-4th-biohackathon-germany">4<sup>th</sup> BioHackathon Germany</a>
hosted by the
<a href="https://www.denbi.de">German Network for Bioinformatics Infrastructure (de.NBI)</a>.
I participated in the project <em>On the Path to Machine-actionable Training
Materials</em> in order to improve the interoperability between
<a href="https://search.dalia.education/basic">DALIA</a>,
<a href="https://tess.elixir-europe.org">TeSS</a>,
<a href="https://elixirtess.github.io/mTeSS-X">mTeSS-X</a>, and
<a href="https://schema.org">Schema.org</a>. This post gives a summary of the activities
leading up to the hackathon and the results of our happy hacking.</p>

<h2 id="team">Team</h2>

<p><img src="/img/biohackathon2025/team.jpg" alt="" /></p>

<p>Our project,
<a href="https://www.denbi.de/de-nbi-events/1939-4th-biohackathon-germany-on-the-path-to-machine-actionable-training-materials">On the Path to Machine-actionable Training Materials</a>,
had the following active participants throughout the week:</p>

<ul>
  <li>Nick Juty &amp; Phil Reed (University of Manchester)</li>
  <li>Leyla Jael Castro &amp; Roman Baum (Deutsche Zentralbibliothek für Medizin; ZB
Med)</li>
  <li>Petra Steiner (University of Darmstadt)</li>
  <li>Oliver Knodel &amp; Martin Voigt (Helmholtz-Zentrum Dresden-Rossendorf; HZDR)</li>
  <li>Dilfuza Djamalova (Forschungszentrum Jülich; FZJ)</li>
  <li>Jacobo Miranda (European Molecular Biology Laboratory; EMBL)</li>
</ul>

<p>Nick and Petra were our team leaders and Phil acted as the project’s <em>de facto</em>
secretary. On the first day of the hackathon, we were briefly joined by Alban
Gaignard (Nantes University), Dimitris Panouris (SciLifeLab), and Harshita Gupta
(SciLifeLab) to present their current related work. Similarly, Dominik Brilhaus
(Heinrich-Heine-Universität Düsseldorf) joined on the first day to share his
perspective from DataPLANT (the NFDI consortium for plants) as a training
materials creator. Finally, Helena Schnitzer (FZJ) participated in some
Schema.org discussions through the week.</p>

<h2 id="goals">Goals</h2>

<p>We categorized our work plan into three streams:</p>

<ol>
  <li><a href="#training-material-interoperability"><strong>Training Material Interoperability</strong></a> -
survey the landscape of relevant ontologies and schemas for annotating
learning materials, curate mappings/crosswalks between existing data models,
develop a programmatic toolbox, and begin federating between training
material platforms</li>
  <li><a href="#training-material-analysis"><strong>Training Material Analysis</strong></a> - analyze
training materials at scale to group similar training materials, reduce
redundancy, and semi-automatically construct learning paths</li>
  <li><a href="#modeling-learning-paths"><strong>Modeling Learning Paths</strong></a> - collect use cases
and develop a (meta)data model for learning paths</li>
</ol>

<h2 id="training-material-interoperability">Training Material Interoperability</h2>

<p>Interoperability is the third pillar of the
<a href="https://www.nature.com/articles/sdata201618">FAIR data principles</a>. Metadata
describing training materials may be captured and stored in one of several data
models including the DALIA Interchange Format (DIF) v1.3, the format implicitly
defined by the TeSS API, and the Schema.org Learning Material profile. Further,
metadata records conforming to these data models are filled with references to
terms in other ontologies, controlled vocabularies, databases, and other
resources that mint (persistent) identifiers. Our overarching goal at the
hackathon was to improve interoperability on both levels.</p>
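<p>As a minimal illustration of what a Schema.org-compliant metadata record for a training material can look like, consider the following sketch serialized as JSON-LD. The field values are invented for illustration; <code class="language-plaintext highlighter-rouge">LearningResource</code> and the properties used here are defined by Schema.org.</p>

```python
import json

# A minimal, hypothetical training-material record serialized as
# Schema.org-compliant JSON-LD. The values are invented for illustration;
# "LearningResource" and its properties are defined at https://schema.org.
record = {
    "@context": "https://schema.org",
    "@type": "LearningResource",
    "name": "Introduction to Workflow Management",
    "description": "A beginner tutorial on running reproducible workflows.",
    "learningResourceType": "tutorial",
    "inLanguage": "en",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(record, indent=2))
```

<p>A crosswalk between data models then amounts to renaming and restructuring keys like these into the corresponding DIF or TeSS fields.</p>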

<h3 id="indexing-ontologies-and-schemas">Indexing Ontologies and Schemas</h3>

<p>Our first concrete goal for training material interoperability at the hackathon
was to survey ontologies, controlled vocabularies, databases, and other
resources that mint (persistent) identifiers that might appear in the metadata
describing a learning material. For example, TeSS uses the
<a href="https://semantic.farm/edam">EDAM Ontology</a> to annotate topics onto training
materials. For the same purpose, DALIA uses the
<a href="https://semantic.farm/kim.hcrt">Hochschulcampus Ressourcentypen (HCRT)</a> vocabulary (I’ll say more
on how we deal with the conflicting resources in the section below on mappings).</p>

<p>Our second concrete goal was to survey schemas that are used in modeling open
educational resources and training materials, for example,
<a href="https://semantic.farm/sdo">Schema.org</a>,
<a href="https://semantic.farm/oerschema">OERSchema</a>, and
<a href="https://semantic.farm/modalia">MoDALIA</a>, which encodes the DALIA Interchange
Format (DIF) v1.3.</p>

<p>The Semantic Farm (<a href="https://semantic.farm">https://semantic.farm</a>) is a
comprehensive database of metadata about resources that mint (persistent)
identifiers (e.g., ontologies, controlled vocabularies, databases, schemas) such
as their preferred CURIE prefix for usage in SPARQL queries and other semantic
web applications. It imports and aligns with other databases like
<a href="https://identifiers.org">Identifiers.org</a> (for the life sciences) and
<a href="https://bartoc.org">BARTOC</a> (for the digital humanities) to support
interoperability and sustainability. It follows the
<a href="https://www.nature.com/articles/s41597-024-03406-w">open data, open code, and open infrastructure (O3)</a>
guidelines and has well-defined governance to enable community maintenance and
support longevity.</p>
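<p>In practice, a registry of preferred CURIE prefixes boils down to a prefix map that lets compact identifiers be expanded into resolvable IRIs. The following toy sketch uses a hand-picked two-entry map rather than the Semantic Farm’s actual data:</p>

```python
# Toy CURIE expansion against a hand-picked prefix map; a real application
# would load the full map from a registry rather than hard-coding it.
PREFIX_MAP = {
    "mesh": "http://id.nlm.nih.gov/mesh/",
    "edam": "http://edamontology.org/",
}


def expand_curie(curie: str) -> str:
    """Expand a CURIE like ``mesh:D008571`` into a full IRI."""
    prefix, _, identifier = curie.partition(":")
    try:
        return PREFIX_MAP[prefix.lower()] + identifier
    except KeyError:
        raise ValueError(f"unknown prefix: {prefix}") from None


print(expand_curie("mesh:D008571"))
```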

<p>It’s the perfect place to index all the learning material and open educational
resource-related ontologies, controlled vocabularies, databases, and schemas.</p>

<p>I gave a tutorial on how to search the Semantic Farm for ontologies, controlled
vocabularies and other resources that mint (persistent) identifiers, and how to
contribute any that are missing. In short, they can be contributed by filling
out the
<a href="https://github.com/biopragmatics/bioregistry/issues/new?template=new-prefix.yml">new prefix request template</a>
on GitHub. If you’re interested in adding a new entry, you can use the form
directly, read the
<a href="https://github.com/biopragmatics/bioregistry/blob/main/docs/CONTRIBUTING.md#submitting-new-prefixes">contribution guidelines</a>,
or watch a
<a href="https://www.youtube.com/watch?v=e-I6rcV2_BE">short YouTube tutorial</a>.</p>

<p>While I had done some significant preparatory work before the hackathon by
creating many new entries in the Semantic Farm, the team found and added several
new and important entries to the Semantic Farm during the hackathon too. Here
are two highlights:</p>

<p><a href="https://orcid.org/0000-0001-5556-838X">Martin Voigt</a> contributed the prefix
<code class="language-plaintext highlighter-rouge">amb</code> for the
<a href="https://dini-ag-kim.github.io/amb/20231019">Allgemeines Metadatenprofil für Bildungsressourcen</a>
(General Metadata Profile for Educational Resources) in
<a href="https://github.com/biopragmatics/bioregistry/pull/1781">biopragmatics/bioregistry#1781</a>.
This is a metadata schema for learning materials produced by the
Kompetenzzentrum Interoperable Metadaten (KIM; Competence Center for
Interoperable Metadata) within the Deutsche Initiative für Netzwerkinformation
e.V. (DINI; German Initiative for Network Information) that was heavily inspired
by <a href="https://schema.org">Schema.org</a> and the Dublin Core
<a href="https://www.dublincore.org/about/lrmi/">Learning Resource Metadata Initiative (LRMI)</a>.</p>

<p><a href="https://orcid.org/0009-0004-7782-2894">Dilfuza Djamalova</a> and
<a href="https://orcid.org/0009-0005-0673-021X">Jacobo Miranda</a> contributed the prefix
<code class="language-plaintext highlighter-rouge">gtn</code> for
<a href="https://training.galaxyproject.org/training-material">Galaxy Training Network</a>
training materials in
<a href="https://github.com/biopragmatics/bioregistry/pull/1779">biopragmatics/bioregistry#1779</a>.
This resource contains multi- and cross-disciplinary training materials for
using the Galaxy workflow management system. Below, I describe how we ingested
and transformed the training materials from GTN into a common format such that
they can be represented according to the DALIA Interchange Format (DIF) v1.3,
the implicit data model expected by TeSS, and Schema.org-compliant RDF.</p>

<p>Ultimately, we collated relevant ontologies, controlled vocabularies, schemas
and other resources that mint (persistent) identifiers in a
<a href="https://semantic.farm/collection/0000018">collection</a> such that they can be
easily found and shared.</p>

<h3 id="semantic-mappings-and-crosswalks">Semantic Mappings and Crosswalks</h3>

<p><img src="/img/biohackathon2025/overlaps.svg" alt="" /></p>

<p>I alluded to the different resources used by TeSS and DALIA to annotate
disciplines. The issue of partially overlapping ontologies, controlled
vocabularies, and databases is quite widespread, and can manifest in a few
different ways. The figure above shows that redundancy can arise because of
different focus within a domain (i.e., the chemistry example), different
hierarchical specificity (i.e., the disease example), and due to massive generic
resources having overlap across many domains (e.g., UMLS, MeSH, and NCIT).</p>

<p>This is problematic when integrating learning materials from different sources,
e.g., TeSS and DALIA, because two learning materials may be annotated with
different terms describing the same discipline. Therefore, the solution is to
create semantic mappings between these terms.</p>
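<p>For orientation, a semantic mapping in SSSOM boils down to a
subject-predicate-object triple plus provenance describing how the mapping was
established. Here is an illustrative sketch of a single mapping record as a
Python dict (abbreviated; real SSSOM files are TSV and carry more metadata
columns):</p>

```python
# One SSSOM-style mapping record, shown as a Python dict for illustration.
# MeSH D003933 and NCIT C15220 both denote "Diagnosis".
mapping = {
    "subject_id": "mesh:D003933",
    "subject_label": "Diagnosis",
    "predicate_id": "skos:exactMatch",
    "object_id": "ncit:C15220",
    "object_label": "Diagnosis",
    # how the mapping was established, using a term from the SEMAPV vocabulary
    "mapping_justification": "semapv:ManualMappingCuration",
}
print(mapping["predicate_id"])
```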

<p>I’ve worked for several years on the
<a href="https://mapping-commons.github.io/sssom/">Simple Standard for Sharing Ontological Mappings (SSSOM)</a>
standard for storing semantic mappings, so this was naturally the target for our
work. Further, I have been working on a domain-agnostic workflow for predicting
semantic mappings with lexical matching and deploying a curation interface
called <a href="https://github.com/cthoyt/sssom-curator/">SSSOM Curator</a>. I gave a tutorial for
using SSSOM Curator to the team based on a previous tutorial I made (that can be
found on YouTube <a href="https://www.youtube.com/watch?v=FkXkOhT8gdc&amp;t=293s">here</a>). We
prepared predicted semantic mappings between several learning material-related
ontologies in
<a href="https://github.com/biopragmatics/biomappings/pull/204">biopragmatics/biomappings#204</a>,
but we didn’t prioritize semantic mapping curation during the hackathon. Here’s
what they look like in the SSSOM Curator interface for Biomappings:</p>

<p><img src="/img/biohackathon2025/sssom-curator-disciplines.png" alt="" /></p>

<p>Where curating correspondences between concepts in ontologies, controlled
vocabularies, and databases is often called semantic mapping, curating
correspondences between schemas and the properties therein is often called
crosswalking. We put a bigger emphasis on producing crosswalks between Schema.org
and MoDALIA. This is a more complex problem because correspondences between
elements in schemas can be more sophisticated (e.g., mapping two fields for
first and last names to a single name field), but there are at least a few
places where properties can be mapped with SSSOM.</p>

<p><img src="/img/biohackathon2025/crosswalks.png" alt="" /></p>

<p>An interesting lesson learned is that some curators find using SKOS
relationships challenging because the narrower and broader relations point in
the opposite direction from what they would expect. For example,
<code class="language-plaintext highlighter-rouge">X skos:narrowMatch Y</code> reads as “X has narrower match Y” and means that Y is
narrower than X, not that X is a narrow match for Y. Many vocabularies use a
verb as part of the predicate to reduce this confusion - I’m sure that if it were
<code class="language-plaintext highlighter-rouge">X skos:hasNarrowerMatch Y</code>, then this would not
have been a problem. Deep down, the real issue is that transparent identifiers
(i.e., human-readable ones) are problematic, since they can’t be changed over
time. See the excellent article,
<a href="https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2001414">Identifiers for the 21<sup>st</sup> century</a>,
by McMurry <em>et al.</em> (2017) for a more detailed discussion on what makes a good
identifier.</p>

<h3 id="operationalizing-crosswalks">Operationalizing Crosswalks</h3>

<p>The next step was to translate the abstract crosswalks between DALIA, TeSS, and
Schema.org into a concrete implementation using a general purpose programming
language (i.e., Python).</p>

<h4 id="the-scaling-problem">The Scaling Problem</h4>

<p>Given that we only focused on these three data models, it’s not unrealistic to
produce a DALIA-TeSS crosswalk, TeSS-Schema.org crosswalk, and DALIA-Schema.org
crosswalk. However, this approach does not scale well - in general, it requires
curating and implementing $\binom{N}{2}$ crosswalks, with $N$ being the number
of schemas.</p>
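<p>To make the counting concrete, here is a minimal sketch comparing the two
approaches (the function is my own, not part of any package):</p>

```python
from math import comb

def crosswalks_needed(n_schemas: int, hub_and_spoke: bool = False) -> int:
    """Count the crosswalks needed to connect ``n_schemas`` data models."""
    if hub_and_spoke:
        # one crosswalk per schema, into the shared intermediary
        return n_schemas
    # one crosswalk per unordered pair of schemas
    return comb(n_schemas, 2)

# with the seven data models discussed in this post
print(crosswalks_needed(7))                      # all-to-all: 21
print(crosswalks_needed(7, hub_and_spoke=True))  # hub-and-spoke: 7
```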

<p>An alternative is to use a hub-and-spoke model, in which one data model is
targeted as the intermediary used for interchange and storage. This reduces the
burden on curators of crosswalks, as they only have to curate a single crosswalk
for any given data model into the intermediary. Similarly, it reduces the burden
on code maintainers as only a single crosswalk has to be implemented.</p>

<p>The challenge with open educational resources and learning materials is that no
existing data model is sufficient to cover the (most important) aspects of all
other data models. This motivated us to implement a unified, generic data model
for learning materials to serve as the interoperability hub between DALIA, TeSS,
Schema.org, and other data models.</p>

<pre><code class="language-mermaid">graph TD
    subgraph alltoall ["All-to-All (complex, burdensome)"]
        dalia[DALIA] &lt;--&gt; tess[TeSS]
        dalia &lt;--&gt; schema[Schema.org]
        dalia &lt;--&gt; oerschema[OERschema]
        dalia &lt;--&gt; amb["Allgemeines Metadatenprofil für Bildungsressourcen (AMB)"]
        dalia &lt;--&gt; lrmi["Learning Resource Metadata Initiative (LRMI)"]
        dalia &lt;--&gt; erudite[ERuDIte]
        tess &lt;--&gt; schema
        tess &lt;--&gt; oerschema
        tess &lt;--&gt; amb
        tess &lt;--&gt; lrmi
        tess &lt;--&gt; erudite
        schema &lt;--&gt; oerschema
        schema &lt;--&gt; amb
        schema &lt;--&gt; lrmi
        schema &lt;--&gt; erudite
        oerschema &lt;--&gt; amb
        oerschema &lt;--&gt; lrmi
        oerschema &lt;--&gt; erudite
        amb &lt;--&gt; lrmi
        amb &lt;--&gt; erudite
        lrmi &lt;--&gt; erudite
    end

    subgraph hub ["Hub-and-Spoke (maintainable, extensible)"]
        direction TB
        hubn[Unified OER Data Model] &lt;--&gt; daliaspoke[DALIA]
        hubn[Unified OER Data Model] &lt;--&gt; tessspoke[TeSS]
        hubn[Unified OER Data Model] &lt;--&gt; schemaspoke[Schema.org]
        hubn[Unified OER Data Model] &lt;--&gt; oerschemaspoke[OERschema]
        hubn[Unified OER Data Model] &lt;--&gt; ambspoke["Allgemeines Metadatenprofil für Bildungsressourcen (AMB)"]
        hubn[Unified OER Data Model] &lt;--&gt; lrmispoke["Learning Resource Metadata Initiative (LRMI)"]
        hubn[Unified OER Data Model] &lt;--&gt; eruditespoke[ERuDIte]
    end

    alltoall --&gt; hub
</code></pre>

<p>The famous XKCD comic, <a href="https://xkcd.com/927">Standards (https://xkcd.com/927)</a>,
warns that any proposal of a unified standard that covers everyone’s use cases
is doomed to become the $(N+1)$<sup>th</sup> competing standard. While I’m doing my best to
present the work done in preparation for the hackathon and at the hackathon in a
linear way, the truth is that most steps also included discussion, hacking,
trying, failing, and repeating. Therefore, I can confidently say that for
practical reasons, implementing a new <em>de facto</em> standard was the only realistic
choice.</p>

<h4 id="the-oerbservatory-data-model">The OERbservatory Data Model</h4>

<p><img src="/img/biohackathon2025/oerbservatory-schematic.png" alt="" /></p>

<p>During the hackathon, we implemented the open source
<a href="https://github.com/data-literacy-alliance/oerbservatory">OERbservatory</a> Python
package. I first want to talk about three major features that it includes:</p>

<ol>
  <li>a unified, generic
<a href="https://github.com/data-literacy-alliance/oerbservatory/blob/main/src/oerbservatory/model.py">object model</a>
for open educational resources that’s effectively the union of the best parts
of DALIA, TeSS, Schema.org, and a few other data models we found</li>
  <li>import and export to two open educational resource and learning materials
data models - DALIA and TeSS. We didn’t have time during the hackathon to
implement import and export to Schema.org.</li>
  <li>import from three external learning material repositories -
<a href="https://oerhub.at">OERhub</a>, <a href="https://oersi.org">OERSI</a>, and the
<a href="https://training.galaxyproject.org">Galaxy Training Network (GTN)</a></li>
</ol>

<p>Here’s an excerpt of the object model, implemented using
<a href="https://github.com/pydantic/pydantic">Pydantic</a>. Note that Pydantic uses a
combination of Python’s type system and type annotations to express constraints
and rules, similarly to how SHACL does. However, we get the benefit of Python
type checking and the Python runtime to check that we’ve encoded this all
correctly. Finally, all Pydantic models can be serialized and deserialized from
JSON.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EducationalResource</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="s">"""Represents an educational resource."""</span>

    <span class="n">model_config</span> <span class="o">=</span> <span class="n">ConfigDict</span><span class="p">(</span><span class="n">arbitrary_types_allowed</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="n">reference</span><span class="p">:</span> <span class="n">Reference</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="bp">None</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"The primary reference for this learning material"</span><span class="p">,</span>
        <span class="n">examples</span><span class="o">=</span><span class="p">[</span><span class="n">Reference</span><span class="p">(</span><span class="n">prefix</span><span class="o">=</span><span class="s">"dalia"</span><span class="p">,</span> <span class="n">identifier</span><span class="o">=</span><span class="s">""</span><span class="p">)]</span>
    <span class="p">)</span>
    <span class="n">title</span><span class="p">:</span> <span class="n">InternationalizedStr</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(...,</span> <span class="n">description</span><span class="o">=</span><span class="s">"The title of the learning material"</span><span class="p">)</span>
    <span class="n">authors</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Author</span> <span class="o">|</span> <span class="n">Organization</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"An ordered list of authors (i.e., persons or organizations) of the learning material"</span><span class="p">,</span>
        <span class="n">examples</span><span class="o">=</span><span class="p">[</span>
            <span class="n">Author</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"Charles Tapley Hoyt"</span><span class="p">,</span> <span class="n">orcid</span><span class="o">=</span><span class="s">"0000-0003-4423-4370"</span><span class="p">),</span>
            <span class="n">Organization</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"NFDI"</span><span class="p">,</span> <span class="n">ror</span><span class="o">=</span><span class="s">"05qj6w324"</span><span class="p">),</span>
        <span class="p">],</span>
        <span class="n">min_length</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="p">...</span>
</code></pre></div></div>

<details>
<summary>Technology Comparison (content warning: programming culture wars)</summary>
<p>
DALIA and Schema.org are built on top of semantic web principles. Records about
learning materials encoded in these data models are stored in RDF and queryable
via SPARQL. However, while powerful, SPARQL is a querying language that is
inherently limited in its expressibility and utility. A general purpose
programming language is more suited for building data science workflows, search
engines, APIs, web interfaces, and other tools on top of open educational
resource and learning material data. That's why we emphasized concretizing the
crosswalks between DALIA, TeSS, and Schema.org in a software implementation.
</p><p>
We chose Python as the target language because of its ubiquity and ease of use.
When the TeSS platform was initially developed in the early 2010s, the Ruby
programming language and the Ruby on Rails framework were a popular choice for
developing web applications. Unfortunately for TeSS, the scientific Python stack
and machine learning ecosystem have led to Python becoming the clear winner for
academics and scientists. This creates an issue: only a small number of
academics are skilled in Ruby and can participate in the development of TeSS.
</p><p>
It was also crucial to use Python so that our implementation would be reusable.
For example, the DALIA 1.0 platform was implemented using Django, which made it
effectively impossible to reuse any of the underlying code outside of the web
application, e.g., in a data science workflow. The same issue is also true for the
TeSS implementation using Ruby-on-Rails. While these batteries-included
frameworks can get a minimal web application running quickly, they generally
lead developers towards writing code that isn't reusable.
</p>
</details>

<h4 id="oerbservatory-as-an-interoperability-hub-between-dalia-and-tess">OERbservatory as an Interoperability Hub between DALIA and TeSS</h4>

<p>Before we even started working on the OERbservatory, we had implemented two
packages for working with data in DALIA and TeSS:</p>

<ol>
  <li><a href="https://github.com/data-literacy-alliance/dalia-dif">data-literacy-alliance/dalia-dif</a>
implements a parser for the DALIA DIF v1.3 tabular format, an internal
representation of the content (also using Pydantic), and an RDF serializer
(using <a href="https://github.com/cthoyt/pydantic-metamodel">pydantic-metamodel</a>)</li>
  <li><a href="https://github.com/cthoyt/tess-downloader">cthoyt/tess-downloader</a>
implements an API client to TeSS and an internal representation of the
learning resource data model (using Pydantic)</li>
</ol>

<p>Because each of these packages already implemented an internal (lossless)
representation of the data models for DALIA and TeSS, respectively, we only had
to write code in the OERbservatory that mapped the fields between them to
OERbservatory’s data model.</p>

<p>This was a <strong>big</strong> milestone towards interoperability. We demonstrated its
potential by programmatically downloading all learning materials from the ELIXIR
TeSS instance’s API and exporting them as DALIA RDF. Similarly, we converted all
learning materials curated for DALIA into the TeSS JSON format. Later, I’ll
describe how we took this workflow one step further to implement syncing between
DALIA and TeSS.</p>

<p>Note that this mapping can’t simply be expressed using SSSOM, SHACL, or other
declarative languages, because it relies on more sophisticated logic. For
example, topics annotated with ontology terms in the DALIA data model only store
the URI reference, whereas topics annotated with ontology terms in the TeSS data
model require both the URI reference and the term’s label. Since we’re encoding
our crosswalks using a general purpose programming language, we have a larger
toolkit available. Here, we could use
<a href="https://github.com/biopragmatics/pyobo">PyOBO</a>, a generic package I’ve written
for working with ontologies, to look up labels.</p>
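<p>A minimal sketch of that logic (the function name is my own, and a toy
lookup table stands in for a real label service like PyOBO):</p>

```python
def dalia_topic_to_tess(uri: str, get_label) -> dict[str, str]:
    """Convert a DALIA topic annotation (a bare URI) into the URI + label
    pair that the TeSS data model expects.

    ``get_label`` is any callable from a URI to a human-readable label,
    e.g., one backed by an ontology lookup service.
    """
    label = get_label(uri)
    if label is None:
        raise ValueError(f"no label available for {uri}")
    return {"uri": uri, "label": label}

# toy lookup table; EDAM topic_0091 is "Bioinformatics"
labels = {"http://edamontology.org/topic_0091": "Bioinformatics"}
print(dalia_topic_to_tess("http://edamontology.org/topic_0091", labels.get))
```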

<p>Unfortunately, we did not have time to implement an importer/exporter for
Schema.org. We deprioritized this because Schema.org felt the least approachable
due to the way its documentation is written, the complexity of its models, and
its prolific use of mixins. We considered if we could automatically
generate Pydantic classes from Schema.org - and it turns out that
<a href="https://github.com/lexiq-legal/pydantic_schemaorg">pydantic-schemaorg</a> has
already done it! Unfortunately, the code is not compatible with modern versions
of Pydantic, and the project appears abandoned. We only had so much time at the
hackathon, so forking/reviving/rewriting <code class="language-plaintext highlighter-rouge">pydantic-schemaorg</code> was left as a task
for later.</p>

<h4 id="the-oerbservatory-as-an-aggregator">The OERbservatory as an Aggregator</h4>

<p>Besides open educational resources and learning materials that are encoded in
the DALIA, TeSS, and Schema.org formats, there are many repositories of learning
materials that do not conform to a well-defined schema. Prior to the hackathon,
I had already explored the Austrian <a href="https://oerhub.at">OERhub</a> and
<a href="https://oersi.org/resources">Open Educational Resources Search Index (OERSI)</a>
and written importers into <code class="language-plaintext highlighter-rouge">dalia-dif</code>. At the hackathon, I reimplemented those
importers using the newly formed OERbservatory unified, generic data model.</p>

<p>On the Thursday morning of the BioHackathon, I had an excellent
<a href="https://en.wikipedia.org/wiki/Team_programming#Mob_programming">mob programming</a>
session with <a href="https://orcid.org/0009-0004-7782-2894">Dilfuza Djamalova</a> and
<a href="https://orcid.org/0009-0005-0673-021X">Jacobo Miranda</a> to import training
materials from the
<a href="https://training.galaxyproject.org">Galaxy Training Network (GTN)</a>. It turns
out that there are already several open educational resources and learning
materials that are automatically scraped and imported by TeSS. However, those
importers are limited by TeSS’s relatively rigid data model, which is bound to
their database and can therefore not be easily evolved. Dilfuza and Jacobo had a
few goals for our hacking:</p>

<ul>
  <li>There are fields in GTN that aren’t yet captured by TeSS. They wanted to
implement those fields in OERbservatory, demonstrate their usage, then gently
nudge TeSS to evolve its data model to support their use cases</li>
  <li>They wanted to index their content in DALIA, which becomes much easier if they
only have to maintain one importer in OERbservatory which can already export
to DALIA</li>
  <li>GTN is part of the <a href="https://github.com/dekcd">DeKCD</a> consortium, which wants to
deduplicate training materials. Adding an importer here gives access to the
workflows we’re building for reconciling metadata curated in different places
about the same materials, identifying similar materials to reduce duplicate
effort, and connecting people working on the same kinds of materials</li>
</ul>

<p>We implemented the GTN importer in
<a href="https://github.com/data-literacy-alliance/oerbservatory/pull/8">data-literacy-alliance/oerbservatory#8</a>
which covers tutorials in GTN and later could be extended to slide decks. Along
the way, we updated the main educational resource model in OERbservatory to
include a few new fields: the status (a field that TeSS also has, which now
needs to be incorporated), the publication date, and the modified date. We
did not make a complete mapping for all fields in GTN due to time constraints,
so we implemented logging that summarizes fields that haven’t yet been mapped
(see the PR for examples of each). For example, the way that contributor
information is incorporated into the API from the frontmatter in the source is
interesting - it resolves the keys in the frontmatter to entries in
<a href="https://github.com/galaxyproject/training-material/blob/main/CONTRIBUTORS.yaml">this YAML file</a>
in the GTN GitHub repository. We will want to think about the best way to map
the authors into OERbservatory, and this also might be a time to extend the
author list to include contributor role annotations.</p>

<p>I was very excited that Dilfuza and Jacobo were motivated to work on this and
contribute following the hackathon. We’ll see if the OERbservatory is approachable
enough for future external contributions! For example, Robert Hasse of
NFDI4BIOIMAGE already proactively prepared a script that exports their
consortium’s training materials into the DALIA DIF v1.3 tabular format. I don’t
consider this a very approachable format, and I’m sure efforts like his could
have been eased by using OERbservatory as a target. The next steps are to
incorporate the
<a href="https://www.dariah.ch">Swiss Digital Research Infrastructure for Arts and Humanities (DARIAH-CH)</a>
and <a href="https://www.psdi.ac.uk">Physical Sciences Data Infrastructure (PSDI)</a>
learning materials, which appeared on the schematic diagram for OERbservatory
earlier. There are also a lot of other potential learning material repositories
to scrape, like Glittr.com. If you have a suggestion, you can drop it in the
<a href="https://github.com/data-literacy-alliance/oerbservatory/issues">OERbservatory issue tracker</a>.
Further, given that Martin Voigt was in the room during this hacking and
discussion, and he is the maintainer for TeSS’s
<a href="https://github.com/ElixirTeSS/TeSS_scrapers">scraper code</a>, we already started
formulating plans on how we might be able to deduplicate efforts.</p>

<h3 id="federation-of-open-educational-resources-and-learning-materials">Federation of Open Educational Resources and Learning Materials</h3>

<p><img src="/img/biohackathon2025/federation.svg" alt="" /></p>

<p>The next step towards interoperability beyond the demonstration of converting
between formats used by DALIA and TeSS was to demonstrate actually posting the
content to the live services.</p>

<p>While we are currently in the process of implementing submission of open
educational resources and learning materials in DALIA, TeSS already has a
web-based interface for
<a href="https://tess.elixir-europe.org/materials/new">registering new learning materials</a>.
TeSS doesn’t have a documented API endpoint for posting learning materials, but
luckily, Martin knew where it was and helped to figure out the correct way to
pass credentials to use it. We managed this by a combination of reading the Ruby
implementation of TeSS and good ol’ trial and error. In the end, we implemented
posting learning materials in the TeSS-specific Python package in
<a href="https://github.com/cthoyt/tess-downloader/pull/2">cthoyt/tess-downloader#2</a>.
Then, it was only a matter of stringing together code that converts DALIA to
OERbservatory and OERbservatory to TeSS, and then uploads to TeSS.</p>
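<p>The resulting pipeline can be sketched as follows (the function names are
illustrative placeholders, not the actual OERbservatory API):</p>

```python
def sync_dalia_to_tess(dalia_records, dalia_to_hub, hub_to_tess, post_to_tess):
    """Push DALIA records to a TeSS instance via the unified hub model.

    Each argument after ``dalia_records`` is a callable implementing one
    step of the hub-and-spoke conversion chain.
    """
    for record in dalia_records:
        hub_record = dalia_to_hub(record)      # DALIA -> unified OER model
        tess_record = hub_to_tess(hub_record)  # unified OER model -> TeSS JSON
        post_to_tess(tess_record)              # upload via the TeSS API

# toy demonstration with identity conversions and a collecting "uploader"
uploaded = []
sync_dalia_to_tess([{"title": "FAIR 101"}], lambda r: r, lambda r: r, uploaded.append)
print(uploaded)
```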

<p>In parallel, Martin worked on improving the devops behind the
<a href="https://tesshub.hzdr.de">PaNOSC TeSSHub</a> to enable quickly spinning up new TeSS
instances that each have their own subdomain. He created a different subdomain
for each of DALIA, OERSI, GTN/KCD, and OERhub. Finally, we wrote a script that
uploaded all open educational resources and learning material from each source
to the appropriate TeSS instance in
<a href="https://github.com/data-literacy-alliance/oerbservatory/pull/3">data-literacy-alliance/oerbservatory#3</a>.
The results in each space can be explored here:</p>

<table>
  <thead>
    <tr>
      <th>Source</th>
      <th>Domain</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DALIA</td>
      <td><a href="https://dalia.tesshub.hzdr.de">https://dalia.tesshub.hzdr.de</a></td>
    </tr>
    <tr>
      <td>OERhub</td>
      <td><a href="https://oerhub.tesshub.hzdr.de">https://oerhub.tesshub.hzdr.de</a></td>
    </tr>
    <tr>
      <td>OERSI</td>
      <td><a href="https://oersi.tesshub.hzdr.de">https://oersi.tesshub.hzdr.de</a></td>
    </tr>
    <tr>
      <td>GTN/deKCD</td>
      <td><a href="https://kcd.tesshub.hzdr.de">https://kcd.tesshub.hzdr.de</a></td>
    </tr>
    <tr>
      <td>PaNOSC</td>
      <td><a href="https://panosc.tesshub.hzdr.de">https://panosc.tesshub.hzdr.de</a></td>
    </tr>
  </tbody>
</table>

<p>A full list of spaces can be found
<a href="https://pan-training.tesshub.hzdr.de/spaces">here</a>.</p>

<h4 id="european-open-science-cloud">European Open Science Cloud</h4>

<p>The great specter looming over most NFDI-related projects is how to interface
with the European Open Science Cloud (EOSC). On the surface, EOSC is a massive
undertaking to democratize access to research infrastructure on the European
level. However, having just entered the NFDI bubble at the end of the summer, I
have been overwhelmed by the high pressure to participate in EOSC combined with
the lack of funding and lack of direction on how best to go about doing that.
All of that being said, Oliver Knödel spent the hackathon preparing the concept
for how we could connect TeSSHub to the EOSC open educational resource and
training materials registry using the
<a href="https://www.openarchives.org/pmh/">Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)</a>.
Once TeSSHub can demonstrate federating its content through this mechanism, we
can use it as inspiration to make a generic implementation in OERbservatory.</p>
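<p>For orientation, OAI-PMH is a plain HTTP protocol in which the harvesting
operation is selected via a verb query parameter. A minimal sketch of building
a harvesting request URL (the endpoint is a placeholder):</p>

```python
from urllib.parse import urlencode

def oai_pmh_request_url(base_url: str, verb: str = "ListRecords",
                        metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH harvesting URL.

    ``ListRecords`` and the ``metadataPrefix`` parameter are defined by the
    OAI-PMH 2.0 specification; ``oai_dc`` (Dublin Core) is the metadata
    format that every compliant repository must support.
    """
    query = urlencode({"verb": verb, "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

# hypothetical endpoint for illustration
print(oai_pmh_request_url("https://example.org/oai"))
# → https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```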

<h4 id="governance-and-provenance">Governance and Provenance</h4>

<p>Now that it’s possible to copy training materials from one platform to another,
we have started to consider governance and provenance issues like:</p>

<ul>
  <li>If a training material originally curated in DALIA is displayed in TeSS, how
is that attributed? We will have to carefully consider how metadata records
about learning resources are identified, and how those identifiers are passed
around during interchange/syncing.</li>
  <li>If a training material originally from TeSS is enriched in the DALIA platform,
should that information flow back to TeSS, and how? We will have to carefully
consider how information is deduplicated and reconciled.</li>
  <li>How do we implement technical systems that can keep many federated platforms
up-to-date with each other?</li>
</ul>

<p>I’m sure there will be many more questions along these lines. Luckily, the
mTeSS-X group has already begun discussions on a smaller scale, since they care
about how to federate between many disparate TeSS instances.</p>

<h2 id="training-material-analysis">Training Material Analysis</h2>

<p>Our team split into two for the analysis of training materials. The first team
looked into algorithmic mechanisms for featurizing open educational resources
and learning materials and applications of those features. The second team
looked into using large language models (LLMs) for the automated construction of
learning paths.</p>

<h3 id="featurization-and-application">Featurization and Application</h3>

<p>The first team looked into two techniques for featurizing open educational
resources and learning materials (i.e., assigning them dense vectors).</p>

<p>The first and most interpretable technique was to concatenate the free text
fields and the labels from structured fields of each learning resource and index
the entire corpus (i.e., all learning resources) using the
<a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">term frequency-inverse document frequency (TF-IDF)</a>
algorithm. This does a small amount of text preprocessing, builds a word list
for the entire corpus, then scores each word by its frequency within a given
learning material, weighted against how common the word is across the entire
corpus. Each learning material is thus assigned a vector with one value in
$[0, 1]$ per word in the word list. Learning materials can then be compared,
e.g., using the cosine similarity between their respective vectors.</p>
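<p>This procedure can be sketched in a few lines with scikit-learn, using a toy
corpus standing in for the concatenated text of real learning materials (an
illustration, not the actual OERbservatory code):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus: concatenated titles/descriptions of three learning materials
corpus = [
    "introduction to python for data analysis",
    "python data analysis for beginners",
    "genome annotation with galaxy workflows",
]
matrix = TfidfVectorizer().fit_transform(corpus)  # one row per learning material
similarities = cosine_similarity(matrix)          # all-by-all cosine similarities

# the two Python courses are more similar to each other than to the genomics one
assert similarities[0, 1] > similarities[0, 2]
```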

<p>The second technique was to use the <a href="https://sbert.net">sentence transformers</a>
machine learning architecture, which relies on a pre-trained (not large)
language model to accomplish a similar vectorization. Both methods run in less
than a few minutes for the corpus of learning resources from DALIA, TeSS,
OERHub, and OERSI. We also pre-calculated the all-by-all similarities and
applied a cutoff of 0.7 to shorten the list. Both the TF-IDF and sentence
transformers vector indexes and similarities are committed to the OERbservatory
repository and are available
<a href="https://github.com/data-literacy-alliance/oerbservatory/tree/main/output">here</a>.</p>

<p><img src="/img/biohackathon2025/similarities.png" alt="" /></p>

<p>After we had embeddings, Dilfuza began to investigate some of the following:</p>

<ol>
  <li>Identify duplicate metadata records corresponding to the same learning
material, e.g., when two different platforms scraped the same
learning material</li>
  <li>Semi-automatically identify similar training materials to improve
suggestions to learners, to connect the learning material creators, and to
help de-duplicate training material creation efforts</li>
</ol>
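<p>For the first application, here is a minimal sketch of flagging candidate
duplicates from a precomputed all-by-all similarity matrix (the record
identifiers and scores are made up):</p>

```python
def find_candidate_duplicates(similarities, record_ids, threshold=0.7):
    """Flag pairs of metadata records whose similarity exceeds a cutoff."""
    pairs = []
    for i in range(len(record_ids)):
        for j in range(i + 1, len(record_ids)):
            if similarities[i][j] >= threshold:
                pairs.append((record_ids[i], record_ids[j], similarities[i][j]))
    return pairs

# toy similarity matrix for three records from different platforms
sims = [
    [1.00, 0.93, 0.12],
    [0.93, 1.00, 0.08],
    [0.12, 0.08, 1.00],
]
print(find_candidate_duplicates(sims, ["tess:123", "dalia:abc", "gtn:xyz"]))
# → [('tess:123', 'dalia:abc', 0.93)]
```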

<p>We only managed to get this far on the last day of the hackathon, so there is
still a lot more to do here! Originally, I had also planned on using these
embeddings to train classifiers for key metadata such as topic, target audience,
and difficulty level, then to create a semi-automated curation workflow for
enriching sparsely annotated learning material records. These will be next
steps.</p>

<h3 id="automated-construction-of-learning-paths">Automated Construction of Learning Paths</h3>

<p>Nick looked into using large language models (LLMs) to construct learning paths
through machine-assisted dialog. This part is highly experimental so there isn’t
much to point to yet, but the idea was to take in a list of learning materials
(either hard-coded or as a URL for the chat system to retrieve) and a prompt to
ask the LLM to collect similar materials based on objectives and keywords, then
create a learning path based on difficulty (which is infrequently annotated) and
suggest a title.</p>

<p>This workflow was used to produce three learning paths on the following
topics; each was ordered and had reference links, a difficulty rating, a title,
and a provider:</p>

<ol>
  <li>Sequencing and QC (10 items)</li>
  <li>Git and Version Control (6 items)</li>
  <li>Genome Annotation (8 items)</li>
</ol>

<p>More on this in future work!</p>

<h2 id="modeling-learning-paths">Modeling Learning Paths</h2>

<p>While there isn’t a clear consensus on what a learning path is, a simple
definition is that a learning path is a sequence of learning materials that a
learner consumes to achieve a specific level of competence on a topic.
TeSS implements a data model for learning paths based on this definition and the
ELIXIR TeSS instance has
<a href="https://tess.elixir-europe.org/learning_paths">eleven examples</a>. Our team had
the goal of developing an extension to Schema.org (via Bioschemas) to capture
learning paths.</p>

<p>For transparency, I didn’t actively participate in this track, but I think
it’s worth sharing the results, most of which are adapted from Phil’s repository
in
<a href="https://github.com/BioSchemas/LearningPath-sandbox">BioSchemas/LearningPath-sandbox</a>.</p>

<h3 id="proposed-data-model">Proposed Data Model</h3>

<p>Phil, Alban, and Leyla proposed two new Bioschemas profiles and a small change
to
<a href="https://bioschemas.org/profiles/TrainingMaterial/1.0-RELEASE">one Bioschemas profile</a>
with the help of Nick and Roman:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">LearningPath</code>: inherits from <code class="language-plaintext highlighter-rouge">Course</code></li>
  <li><code class="language-plaintext highlighter-rouge">LearningPathModule</code>: inherits from <code class="language-plaintext highlighter-rouge">Course</code>, <code class="language-plaintext highlighter-rouge">Syllabus</code>, <code class="language-plaintext highlighter-rouge">ListItem</code>, and
<code class="language-plaintext highlighter-rouge">ItemList</code></li>
  <li><code class="language-plaintext highlighter-rouge">TrainingMaterial</code>: inherits from <code class="language-plaintext highlighter-rouge">LearningResource</code> and <code class="language-plaintext highlighter-rouge">ListItem</code></li>
</ul>

<p>Here’s a class diagram describing the proposed data model, where 🔺 marks a
Schema.org type, 🟩 a Bioschemas profile, and 🔵 a new profile:</p>

<pre><code class="language-mermaid">classDiagram
    direction TB
    class Event["Event🔺"] {
    }
    class CourseInstance["CourseInstance🔺🟩"] {
    }
    class Course["Course🔺🟩"] {
        syllabusSections
    }
    class new_LearningPath["new:LearningPath🔵"] {
        Syllabus[] syllabusSections
    }
    class ListItem["ListItem🔺"] {
        nextItem
    }
    class Syllabus["Syllabus🔺"] {
    }
    class new_LearningPathModule["new:LearningPathModule🔵"] {
        ListItem[] itemListElement
        LearningPathTopic nextItem
    }
    class LearningResource["LearningResource🔺"] {
    }
    class bio_TrainingMaterial["bio:TrainingMaterial🟩"] {
    }
    Course &lt;|-- new_LearningPath
    Course &lt;|-- new_LearningPathModule
    Syllabus &lt;|-- new_LearningPathModule
    ListItem &lt;|-- new_LearningPathModule
    LearningResource &lt;|-- Course
    LearningResource &lt;|-- bio_TrainingMaterial
    LearningResource &lt;|-- Syllabus
    Event &lt;|-- CourseInstance
</code></pre>

<h3 id="concrete-example-from-galaxy-training-network">Concrete Example from Galaxy Training Network</h3>

<p>The team mocked encoding the
<a href="https://tess.elixir-europe.org/learning_paths/introduction-to-galaxy-and-sequence-analysis-6384c0ed-3546-41cf-ac30-bff8680dd96c">Introduction to Galaxy and Sequence analysis</a>
learning path on TeSS in this new schema. This learning path has the following
structure:</p>

<ol>
  <li><strong>Module 1: Introduction to Galaxy</strong>
    <ol>
      <li>A short introduction to Galaxy</li>
      <li>Galaxy Basics for genomics</li>
    </ol>
  </li>
  <li><strong>Module 2: Basics of Genome Sequence Analysis</strong>
    <ol>
      <li>Quality Control</li>
      <li>Mapping</li>
      <li>An Introduction to Genome Assembly</li>
      <li>Chloroplast genome assembly</li>
    </ol>
  </li>
</ol>

<p>Here’s a mockup of how this could look in RDF:</p>

<div class="language-turtle highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">@prefix</span><span class="w"> </span><span class="nn">dct:</span><span class="w"> </span><span class="nl">&lt;http://purl.org/dc/terms/&gt;</span><span class="w"> </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">ex:</span><span class="w"> </span><span class="nl">&lt;http://example.org/&gt;</span><span class="w"> </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">schema:</span><span class="w"> </span><span class="nl">&lt;https://schema.org/&gt;</span><span class="w"> </span><span class="p">.</span><span class="w">

</span><span class="nn">ex:</span><span class="n">GA_learning_path</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">schema:</span><span class="n">Course</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">dct:</span><span class="n">conformsTo</span><span class="w"> </span><span class="nl">&lt;https://bioschemas.org/profiles/LearningPath&gt;</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">courseCode</span><span class="w"> </span><span class="s">"GSA101"</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">description</span><span class="w"> </span><span class="s">"This learning path aims to teach you the basics of Galaxy and analysis of sequencing data. "</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">name</span><span class="w"> </span><span class="s">"Introduction to Galaxy and Sequence analysis"</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">provider</span><span class="w"> </span><span class="nn">ex:</span><span class="n">ExampleUniversity</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">syllabusSections</span><span class="w"> </span><span class="nn">ex:</span><span class="n">Module_1,
</span><span class="w">        </span><span class="nn">ex:</span><span class="n">Module_2</span><span class="w"> </span><span class="p">.</span><span class="w">

</span><span class="nn">ex:</span><span class="n">Module_1</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">schema:</span><span class="n">ItemList,
</span><span class="w">        </span><span class="nn">schema:</span><span class="n">ListItem,
</span><span class="w">        </span><span class="nn">schema:</span><span class="n">Syllabus</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">dct:</span><span class="n">conformsTo</span><span class="w"> </span><span class="nl">&lt;https://bioschemas.org/profiles/LearningPathModule&gt;</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">itemListElement</span><span class="w"> </span><span class="nn">ex:</span><span class="n">TM11,
</span><span class="w">        </span><span class="nn">ex:</span><span class="n">TM12</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">name</span><span class="w"> </span><span class="s">"Module 1: Introduction to Galaxy"</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">nextItem</span><span class="w"> </span><span class="nn">ex:</span><span class="n">Module_2</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">teaches</span><span class="w"> </span><span class="s">"Learn how to create a workflow"</span><span class="w"> </span><span class="p">.</span><span class="w">

</span><span class="nn">ex:</span><span class="n">TM11</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">schema:</span><span class="n">LearningResource,
</span><span class="w">        </span><span class="nn">schema:</span><span class="n">ListItem</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">dct:</span><span class="n">conformsTo</span><span class="w"> </span><span class="nl">&lt;https://bioschemas.org/profiles/TrainingMaterial&gt;</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">description</span><span class="w"> </span><span class="s">"What is Galaxy"</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">name</span><span class="w"> </span><span class="s">"(1.1) A short introduction to Galaxy"</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">nextItem</span><span class="w"> </span><span class="nn">ex:</span><span class="n">TM12</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">schema:</span><span class="n">url</span><span class="w"> </span><span class="s">"https://tess.elixir-europe.org/materials/hands-on-for-a-short-introduction-to-galaxy-tutorial?lp=1%3A1"</span><span class="w"> </span><span class="p">.</span><span class="w">
</span></code></pre></div></div>

<p>Here’s the same thing from a graphical perspective:</p>

<pre><code class="language-mermaid">graph TD
    N1["Module 1: Introduction to Galaxy"]
    N3["(1.2) Galaxy Basics for genomics"]
    N1 -- itemListElement --&gt; N3
    N1["Module 1: Introduction to Galaxy"]
    N2["(1.1) A short introduction to Galaxy"]
    N1 -- itemListElement --&gt; N2
    N4["Module 2: Basics of Genome Sequence Analysis"]
    N8["(2.4) Chloroplast genome assembly"]
    N4 -- itemListElement --&gt; N8
    N2["(1.1) A short introduction to Galaxy"]
    N3["(1.2) Galaxy Basics for genomics"]
    N2 -- nextItem --&gt; N3
    N1["Module 1: Introduction to Galaxy"]
    N4["Module 2: Basics of Genome Sequence Analysis"]
    N1 -- nextItem --&gt; N4
    N7["(2.3) An Introduction to Genome Assembly"]
    N8["(2.4) Chloroplast genome assembly"]
    N7 -- nextItem --&gt; N8
    N4["Module 2: Basics of Genome Sequence Analysis"]
    N5["(2.1) Quality Control"]
    N4 -- itemListElement --&gt; N5
    N4["Module 2: Basics of Genome Sequence Analysis"]
    N6["(2.2) Mapping"]
    N4 -- itemListElement --&gt; N6
    N4["Module 2: Basics of Genome Sequence Analysis"]
    N7["(2.3) An Introduction to Genome Assembly"]
    N4 -- itemListElement --&gt; N7
    N6["(2.2) Mapping"]
    N7["(2.3) An Introduction to Genome Assembly"]
    N6 -- nextItem --&gt; N7
    N3["(1.2) Galaxy Basics for genomics"]
    N5["(2.1) Quality Control"]
    N3 -- nextItem --&gt; N5
    N5["(2.1) Quality Control"]
    N6["(2.2) Mapping"]
    N5 -- nextItem --&gt; N6
</code></pre>

<p>Something that I became aware of while listening to discussions about
learning paths is the way that Schema.org models lists. I wonder why it doesn’t
use the built-in RDF notion of lists and instead implemented its own formalism.
I saw that this caused a lot of confusion for the team, both during mocking and
during SPARQL querying.</p>

<p>I think the next step for learning paths is to create a concrete
implementation in OERbservatory - we have the benefit that the Python
programming language provides a much more ergonomic abstraction over lists and
collections. There’s a lot of content inside the Galaxy Training Network (GTN)
that could be ingested into such a learning path. Towards this end, I gave a
quick demo of Pydantic to the learning paths team and showed them how I
typically go about data modeling.</p>
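<p>To illustrate what I mean by a more ergonomic abstraction, here is the rough
shape of such a data model, rewritten with stdlib dataclasses so it runs
without dependencies (the demo itself used Pydantic; the class and field names
mirror the proposed profiles but are illustrative, not the final schema):</p>

```python
from dataclasses import dataclass, field


@dataclass
class TrainingMaterial:
    name: str
    url: str


@dataclass
class LearningPathModule:
    name: str
    materials: list[TrainingMaterial] = field(default_factory=list)


@dataclass
class LearningPath:
    name: str
    modules: list[LearningPathModule] = field(default_factory=list)

    def materials_in_order(self):
        """Python lists give ordering for free - no nextItem bookkeeping."""
        return [m for module in self.modules for m in module.materials]


path = LearningPath(
    name="Introduction to Galaxy and Sequence analysis",
    modules=[
        LearningPathModule(
            name="Module 1: Introduction to Galaxy",
            materials=[
                TrainingMaterial(
                    name="A short introduction to Galaxy",
                    url="https://tess.elixir-europe.org/materials/hands-on-for-a-short-introduction-to-galaxy-tutorial?lp=1%3A1",
                ),
            ],
        ),
    ],
)
print(len(path.materials_in_order()))  # → 1
```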

<hr />

<p>I really enjoyed the BioHackathon, and in general, I am very happy to be
attending more events to network with other academics in Germany. It was totally
exhausting, too, which is why I didn’t manage to finish this post in the week
following the event.</p>

<p>In other open educational resource and learning materials news, we
pre-printed the first academic article describing a specific use case for DALIA
on arXiv in September:
<a href="https://arxiv.org/abs/2509.18902">Teaching RDM in a smart advanced inorganic lab course and its provision in the DALIA platform</a>.
We’re currently finalizing a second article fully dedicated to describing the
DALIA platform, which I hope can go on arXiv in early January. Stay
tuned!</p>]]></content><author><name>Charles Tapley Hoyt</name></author><category term="LinkML" /><category term="Bioregistry" /><category term="prefix maps" /><category term="CURIEs" /><category term="URIs" /><summary type="html"><![CDATA[I recently attended the 4th BioHackathon Germany hosted by the German Network for Bioinformatics Infrastructure (de.NBI). I participated in the project On the Path to Machine-actionable Training Materials in order to improve the interoperability between DALIA, TeSS, mTeSS-X, and Schema.org. This post gives a summary of the activities leading up to the hackathon and the results of our happy hacking.]]></summary></entry></feed>