Implement RDF/XML streaming functionality for robust large-scale dataset processing #58

Open
opened 2026-01-29 11:34:08 +00:00 by aditya · 0 comments
Member

Streaming was first implemented for Turtle (TTL) and N-Triples (NT) because triple boundaries are easy to identify in those formats: N-Triples has one triple per line, and Turtle uses "." to terminate statements. RDF/XML streaming is more complex because of its hierarchical structure: triples are nested within rdf:Description elements, making it hard to identify where a triple ends without parsing the full XML tree.

The current implementation uses Python's xml.etree.ElementTree.iterparse function for incremental XML parsing. We identify rdf:Description elements using the _is_description_element() helper, which checks if an element's tag ends with 'Description' and contains 'rdf' or 'Description' (handling both namespaced and non-namespaced forms). Descriptions are used as processing units because each rdf:Description groups multiple triples about a single resource, which enables chunked processing and parallelization. The xml:base attribute is read from the root element's 'start' event so that relative URIs within Description elements can still be resolved against the correct base.
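The approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `is_description_element`, `iter_descriptions`, and `collect_subjects` are hypothetical names, the tag check only mirrors the prose description of `_is_description_element()`, and restricting output to top-level Descriptions (direct children of the root) is an assumption made so that nested Descriptions are not cleared before their parent is processed.

```python
import io
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def is_description_element(elem):
    """Hypothetical stand-in for _is_description_element(), per the prose above."""
    tag = elem.tag
    return tag.endswith("Description") and ("rdf" in tag or "Description" in tag)


def iter_descriptions(source):
    """Yield (base_uri, element) for each top-level rdf:Description in source."""
    base_uri = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                # Attributes are already populated on a 'start' event,
                # so xml:base can be read from the root here.
                base_uri = elem.get(XML_BASE)
            continue
        # 'end' event: the element and its children are fully parsed.
        if depth == 2 and is_description_element(elem):
            yield base_uri, elem
            # Drop the subtree once the consumer has moved on.
            elem.clear()
        depth -= 1


SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xml:base="http://example.org/data/">
  <rdf:Description rdf:about="a">
    <dc:title>First</dc:title>
  </rdf:Description>
  <rdf:Description rdf:about="b">
    <dc:title>Second</dc:title>
  </rdf:Description>
</rdf:RDF>
"""


def collect_subjects(data):
    """Resolve each Description's rdf:about against the detected xml:base."""
    subjects = []
    for base, desc in iter_descriptions(io.BytesIO(data)):
        about = desc.get(f"{{{RDF_NS}}}about")
        subjects.append(urljoin(base or "", about))
    return subjects
```

Calling `collect_subjects(SAMPLE)` resolves the relative `rdf:about` values against the root's `xml:base`. Note that `elem.clear()` empties each processed Description but the root still keeps a reference to the emptied element, so very large files may also need those children removed from the root.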

Acceptance Criteria:

  • RDF/XML streaming correctly handles multi-byte UTF-8 characters across line boundaries
  • Memory usage is optimized for large XML files during streaming
  • Robust error handling and recovery for malformed XML sections without stopping the entire stream
  • Checkpoint/resume functionality works reliably for RDF/XML streaming (tracking Description elements)
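
As a starting point for the UTF-8, memory, and checkpoint criteria, here is a rough sketch that feeds raw byte chunks to `xml.etree.ElementTree.XMLPullParser` and counts Description elements as the checkpoint unit. It is only an illustration under stated assumptions: `stream_descriptions` and its `resume_after` parameter are hypothetical, resuming is done by re-parsing and skipping already processed Descriptions rather than by seeking, and recovery from malformed XML is not addressed.

```python
import xml.etree.ElementTree as ET

RDF_DESCRIPTION = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description"


def stream_descriptions(chunks, resume_after=0):
    """Yield (index, element) for top-level rdf:Description elements.

    chunks: iterable of bytes objects in document order (hypothetical input).
    resume_after: number of Descriptions handled in an earlier run; these are
    parsed again but not yielded.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    seen = 0

    def drain():
        nonlocal depth, seen
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                continue
            if depth == 2 and elem.tag == RDF_DESCRIPTION:
                seen += 1
                if seen > resume_after:
                    # 'seen' is the value a caller would persist as a checkpoint.
                    yield seen, elem
                # Free the subtree to keep memory bounded.
                elem.clear()
            depth -= 1

    for chunk in chunks:
        # Bytes go straight to the parser, which handles multi-byte
        # characters that are split across chunk boundaries.
        parser.feed(chunk)
        yield from drain()
    # Some parser versions defer work until close(), so drain once more.
    parser.close()
    yield from drain()


DOC = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<rdf:Description rdf:about="http://example.org/a">'
    '<dc:title>caf\u00e9</dc:title></rdf:Description>'
    '<rdf:Description rdf:about="http://example.org/b">'
    '<dc:title>\u65e5\u672c</dc:title></rdf:Description>'
    '</rdf:RDF>'
).encode("utf-8")


def titles(chunks, resume_after=0):
    """Collect (index, dc:title) pairs; used here only to exercise the sketch."""
    dc_title = "{http://purl.org/dc/elements/1.1/}title"
    return [(i, d.findtext(dc_title))
            for i, d in stream_descriptions(chunks, resume_after)]
```

Feeding the document one byte at a time, for example `titles([DOC[i:i + 1] for i in range(len(DOC))])`, splits every multi-byte character across chunks and is a simple way to test the first criterion. Passing `resume_after=1` skips the first Description. Re-parsing on resume keeps the namespace declarations and `xml:base` from the root in scope, at the cost of reading the skipped portion again.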
aditya self-assigned this 2026-01-29 11:34:08 +00:00
Reference
cleverdatasets/dataset-uploader#58