Implement RDF/XML streaming functionality for robust large-scale dataset processing #58
Reference
cleverdatasets/dataset-uploader#58
Streaming was initially implemented for Turtle (TTL) and N-Triples (NT) because triple boundaries are easy to identify in both: N-Triples has one triple per line, and Turtle terminates each statement with ".". RDF/XML streaming is more complex because of its hierarchical structure: triples are nested within rdf:Description elements, which makes it hard to tell where a triple ends without parsing the full XML tree.
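For contrast with RDF/XML, here is a minimal sketch of why N-Triples is easy to stream: because each non-empty, non-comment line is one complete triple, chunking reduces to grouping lines. The function name `iter_ntriples_chunks` and the `chunk_size` parameter are hypothetical and are not taken from the existing code base.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_ntriples_chunks(lines: Iterable[str], chunk_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `chunk_size` N-Triples statements.

    Hypothetical helper: assumes one triple per line, which is what makes
    N-Triples streamable without any parser state.
    """
    # Strip whitespace, then drop blank lines and comment lines.
    stripped = (line.strip() for line in lines)
    statements = (line for line in stripped if line and not line.startswith("#"))
    while True:
        chunk = list(islice(statements, chunk_size))
        if not chunk:
            return
        yield chunk
```

Any iterable of text lines works as input, including an open file object, so the whole file never has to be held in memory.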
The current implementation uses Python's xml.etree.ElementTree.iterparse function for incremental XML parsing. rdf:Description elements are identified by the _is_description_element() helper, which checks whether an element's tag ends with 'Description' and contains 'rdf' or 'Description' (covering both namespaced and non-namespaced forms). Descriptions serve as the processing unit because each rdf:Description groups multiple triples about a single resource, which enables chunked processing and parallelization. The xml:base attribute is read from the root element on its 'start' event so that relative URIs inside Description elements can still be resolved against the correct base.
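The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `iter_descriptions` is a hypothetical name, `is_description_element` is a stand-in that mirrors the described tag check rather than the real `_is_description_element()` helper, and only top-level Description elements (direct children of the root) are treated as processing units.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def is_description_element(elem: ET.Element) -> bool:
    """Stand-in for the helper described above (hypothetical reimplementation)."""
    tag = elem.tag
    return tag.endswith("Description") and ("rdf" in tag or "Description" in tag)


def iter_descriptions(source) -> Iterator[Tuple[Optional[str], ET.Element]]:
    """Yield (xml_base, element) for each top-level Description element.

    `source` is a filename or a binary file object. Assumption: nested
    Descriptions stay inside their parent and are not yielded separately.
    """
    root = None
    base = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                # The first 'start' event is the root; read xml:base here.
                root = elem
                base = elem.get(XML_BASE)
            depth += 1
        else:
            depth -= 1
            # depth == 1 after closing means elem is a direct child of the root.
            if depth == 1 and is_description_element(elem):
                yield base, elem
                # Detach the processed element so the tree does not grow.
                root.remove(elem)
```

Detaching each yielded element from the root keeps memory use roughly proportional to the largest single Description rather than to the whole file, while the caller can still hold on to elements it has already received.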
Acceptance Criteria: