Optimize RDF Streaming with Description Skipping and Parallel Parsing #52

Open
opened 2026-01-16 14:39:03 +00:00 by aditya · 0 comments
Member

Description:
Optimize RDF/XML parsing performance in two ways: resumable description-level skipping, so that descriptions already processed before an interruption are not re-parsed, and parallel parsing of description batches using multiprocessing. Together these reduce processing time for large datasets and allow a run to resume efficiently after an interruption.
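
A minimal sketch of the skipping idea, assuming the checkpoint is simply the number of top-level `<rdf:Description>` elements already processed. The name `iter_descriptions` and the `skip` parameter are hypothetical and not taken from the repository. The stream is still scanned from the start, so this version only avoids serializing and processing the skipped elements; it does not avoid tokenizing them.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION_TAG = f"{{{RDF_NS}}}Description"

# Tiny illustrative document (made up for this sketch).
SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/ns#">
  <rdf:Description rdf:about="http://example.org/a"><ex:name>A</ex:name></rdf:Description>
  <rdf:Description rdf:about="http://example.org/b"><ex:name>B</ex:name></rdf:Description>
  <rdf:Description rdf:about="http://example.org/c"><ex:name>C</ex:name></rdf:Description>
</rdf:RDF>"""


def iter_descriptions(source, skip=0):
    """Yield (index, xml_bytes) for each top-level rdf:Description.

    `skip` is the assumed checkpoint value: the count of descriptions
    that were fully processed in an earlier run.
    """
    index = 0
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # After the decrement, depth == 1 means a direct child of the root
        # just closed; nested descriptions are left inside their parent.
        if depth == 1 and elem.tag == DESCRIPTION_TAG:
            if index >= skip:
                yield index, ET.tostring(elem)
            index += 1
            elem.clear()  # release the element's children to limit memory use


if __name__ == "__main__":
    for idx, xml in iter_descriptions(io.BytesIO(SAMPLE), skip=1):
        print(idx, len(xml))
```

A caller would persist the last yielded index plus one as the next checkpoint once that description has been handled.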

Acceptance Criteria:

  • Skip already-processed <rdf:Description> elements based on the checkpoint
  • Parse XML description batches in parallel using a process pool
  • Maintain ordered output despite parallel processing
  • Detect and propagate xml:base for proper URI resolution
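
The last three criteria could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `make_batches`, `detect_xml_base`, `parse_batch`, and `parse_in_parallel` are hypothetical names, each batch is assumed to be a list of serialized `<rdf:Description>` elements, and "parsing" is reduced to resolving `rdf:about` against the base. Ordering relies on `Executor.map`, which returns results in the order the inputs were submitted regardless of which worker finishes first.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import Executor
from typing import List, Optional, Tuple
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def detect_xml_base(document: bytes) -> Optional[str]:
    """Return the root element's xml:base, reading only up to its start tag."""
    for _event, elem in ET.iterparse(io.BytesIO(document), events=("start",)):
        return elem.get(f"{{{XML_NS}}}base")
    return None


def make_batches(items: List[bytes], size: int) -> List[List[bytes]]:
    """Split serialized descriptions into consecutive batches of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parse_batch(task: Tuple[List[bytes], Optional[str]]) -> List[str]:
    """Worker: resolve each description's rdf:about against the base URI.

    The base travels with the batch because a worker process cannot see
    the xml:base declared on the original document's root element.
    """
    batch, base = task
    subjects = []
    for xml in batch:
        about = ET.fromstring(xml).get(f"{{{RDF_NS}}}about", "")
        subjects.append(urljoin(base, about) if base else about)
    return subjects


def parse_in_parallel(batches: List[List[bytes]], base: Optional[str],
                      executor: Executor) -> List[str]:
    """Parse batches on the given executor, keeping document order."""
    ordered: List[str] = []
    # Executor.map yields results in input order, so no re-sorting is needed.
    for result in executor.map(parse_batch, [(b, base) for b in batches]):
        ordered.extend(result)
    return ordered
```

Passing the executor in keeps the function usable with `concurrent.futures.ProcessPoolExecutor` for real runs; on platforms that spawn worker processes, the pool should be created under an `if __name__ == "__main__":` guard.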
aditya self-assigned this 2026-01-16 14:39:03 +00:00
Reference
cleverdatasets/dataset-uploader#52