A team led by Satya Sri Rajiteswari Nimmagadda and Aniruddha Maiti has shown you can stuff scientific sentences into structured JSON without throwing away what matters. The paper, accepted at IEEE ICSC 2026, describes how they fine-tuned a lightweight LLM to generate hierarchical JSON from scientific text. Each JSON has a 'core' field for the main claim and a 'hierarchy' field for relationships and connected components. Think conditions, temporal qualifiers, logical exceptions. The stuff that makes scientific sentences hard to summarize.
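To make the shape concrete, here is a minimal sketch of what such a record might look like. The field names beyond 'core' and 'hierarchy', the nesting, and the example sentence are all assumptions for illustration, not the authors' actual schema:

```python
import json

# Hypothetical record for an invented sentence; only the top-level
# 'core' and 'hierarchy' fields come from the paper's description.
record = {
    "core": "Compound A inhibits enzyme B",
    "hierarchy": {
        "conditions": ["at low pH", "in aqueous solution"],
        "temporal": ["during the first hour of incubation"],
        "exceptions": ["unless cofactor C is present"],
    },
}

print(json.dumps(record, indent=2))
```

The main claim stays isolated in 'core', while qualifiers that would otherwise be lost hang off 'hierarchy' as structured children.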
The researchers used GPT-4o to generate training data in the target JSON format, then fine-tuned a smaller model with a custom structural loss function. They tested on 1,370 sentences pulled from arXiv, bioRxiv, ChemRxiv, and PubMed across physics, medicine, and five other research domains. To evaluate, they reconstructed the original sentences from the JSON representations and measured semantic and lexical similarity between originals and reconstructions.
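A toy round-trip check in the spirit of that evaluation: rebuild a sentence from a hypothetical JSON record, then score lexical overlap against the original. The reconstruction function, record layout, and metric here are illustrative assumptions; the paper's actual method may differ, and semantic similarity would typically use sentence embeddings rather than string matching:

```python
import difflib

def reconstruct(record: dict) -> str:
    # Naive reconstruction: core claim followed by its qualifiers.
    parts = [record["core"]]
    parts += record["hierarchy"].get("conditions", [])
    parts += record["hierarchy"].get("exceptions", [])
    return " ".join(parts)

def lexical_similarity(a: str, b: str) -> float:
    # SequenceMatcher ratio: 1.0 means the strings are identical.
    return difflib.SequenceMatcher(None, a, b).ratio()

original = "Compound A inhibits enzyme B at low pH unless cofactor C is present"
record = {
    "core": "Compound A inhibits enzyme B",
    "hierarchy": {
        "conditions": ["at low pH"],
        "exceptions": ["unless cofactor C is present"],
    },
}

score = lexical_similarity(original, reconstruct(record))
print(f"{score:.2f}")  # → 1.00 for this hand-built example
```

A lossy representation would surface here as a score below 1.0: whatever the JSON failed to capture simply cannot reappear in the reconstruction.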
Hierarchical JSON beat flat A-relation-B structures. That's not shocking if you've spent time with scientific prose. A sentence like "Under conditions X and Y, Z occurs unless W" doesn't compress into a simple triple without losing something important. The hierarchical format captures more of that conditional messiness.
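The loss is easy to see side by side. The flat triple below is an illustrative stand-in, not the paper's actual baseline format:

```python
# A flat triple has exactly three slots, so the exception clause in
# "Under conditions X and Y, Z occurs unless W" has nowhere to go.
flat = ("X and Y", "cause", "Z")

# A nested form keeps the exception attached to the claim it modifies.
nested = {
    "core": "Z occurs",
    "hierarchy": {
        "conditions": ["X", "Y"],
        "exceptions": ["unless W"],
    },
}

# Qualifiers the flat form silently drops:
dropped = [q for q in nested["hierarchy"]["exceptions"] if q not in flat]
print(dropped)  # → ['unless W']
```

Stuffing "unless W" into one of the triple's three strings is possible, but then it is opaque text again rather than a queryable field.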
For agent builders who need machines to actually understand research papers, standard RAG approaches drop too much context. If you're building an agent that reads scientific literature and reasons over it, hierarchical structures could give it better retention of what it parsed. The work suggests you don't have to pick between structured data and semantic fidelity. You can have both, if you design the representation carefully.