Implement RDF/XML streaming functionality for robust large-scale dataset processing #58


Open

opened 2026-01-29 11:34:08 +00:00 by aditya · 0 comments

aditya commented

2026-01-29 11:34:08 +00:00

Member


Initially, streaming was implemented for Turtle (TTL) and N-Triples (NT) because identifying triples is straightforward: N-Triples has one triple per line, and Turtle uses "." to mark triple boundaries. RDF/XML streaming is more complex because of its hierarchical structure: triples are nested within rdf:Description elements, making it hard to identify where a triple ends without parsing the full XML tree.

The current implementation uses the iterparse function from Python's xml.etree.ElementTree module for incremental XML parsing. We identify rdf:Description elements using the _is_description_element() helper, which checks if an element's tag ends with 'Description' and contains 'rdf' or 'Description' (handling both namespaced and non-namespaced forms). Descriptions are used as processing units because each rdf:Description groups multiple triples about a single resource, enabling chunked processing and parallelization. The xml:base attribute is detected from the root element's 'start' event so that relative URIs within Description elements can still be expanded correctly.

### **Acceptance Criteria:**

- [ ] RDF/XML streaming correctly handles multi-byte UTF-8 characters across line boundaries
- [ ] Memory usage is optimized for large XML files during streaming
- [ ] Robust error handling and recovery for malformed XML sections without stopping the entire stream
- [ ] Checkpoint/resume functionality works reliably for RDF/XML streaming (tracking Description elements)
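For discussion, here is a minimal sketch of the approach described above, not the project's actual code. It assumes top-level rdf:Description elements are the processing unit. The names `stream_descriptions`, `is_description_element`, and the `skip` checkpoint parameter are hypothetical, and the helper is a stricter stand-in for `_is_description_element()` (exact tag match instead of a suffix check). The source is opened in binary mode so that decoding is left to the XML parser rather than to line-based reads. Clearing the root after each Description keeps the in-memory tree small.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_BASE_ATTR = "{http://www.w3.org/XML/1998/namespace}base"
# Namespaced and non-namespaced forms of the Description tag.
DESCRIPTION_TAGS = {f"{{{RDF_NS}}}Description", "Description"}


def is_description_element(elem):
    # Hypothetical stand-in for the helper named in this issue.
    return elem.tag in DESCRIPTION_TAGS


def stream_descriptions(source, skip=0):
    """Yield (index, xml_base, element) for each top-level Description.

    `skip` is an assumed checkpoint value: the number of Description
    elements already processed by an earlier run.
    """
    xml_base = None
    root = None
    depth = 0
    index = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                # First 'start' event is the root; attributes are set here.
                root, xml_base = elem, elem.get(XML_BASE_ATTR)
            depth += 1
            continue
        depth -= 1
        # depth == 1 here means elem is a direct child of the root.
        if depth == 1 and is_description_element(elem):
            if index >= skip:
                yield index, xml_base, elem
            index += 1
            # Drop finished children so memory does not grow with file size.
            root.clear()


if __name__ == "__main__":
    sample = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xml:base="http://example.org/base/">'
        '<rdf:Description rdf:about="a"/>'
        '<rdf:Description rdf:about="b"/>'
        '</rdf:RDF>'
    ).encode("utf-8")
    # Resume after one already-processed Description.
    for index, base, elem in stream_descriptions(io.BytesIO(sample), skip=1):
        print(index, base, elem.get(f"{{{RDF_NS}}}about"))
```

A caller could persist the last yielded `index` as the checkpoint and pass `skip=index + 1` on resume. Resuming this way still re-parses the skipped portion of the file. The sketch also does not attempt recovery from malformed XML: `ET.ParseError` would propagate and end the stream, so the error-recovery criterion needs a separate design.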
