Implement Streaming Pipeline for Large RDF Datasets #50

Open
opened 2026-01-16 14:26:09 +00:00 by aditya · 0 comments

Description:
Implement an incremental streaming and upload pipeline for large RDF datasets (RDF/XML, Turtle, N-Triples) to the HuggingFace Hub. The pipeline should process multi-GB datasets without loading them fully into memory, with automatic shard generation and incremental Parquet uploads.
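
For the memory-bounded streaming requirement, a minimal sketch in Python is below. It assumes N-Triples input, the only line-oriented serialization of the three (RDF/XML and Turtle would need a true streaming parser, e.g. rdflib's); the names `stream_triples` and `_parse_nt_line` are illustrative, not an existing API.

```python
# Minimal sketch: yield triples one at a time without buffering the source.
# Assumes well-formed N-Triples; a production version would use a real RDF
# parser (e.g. rdflib) for escapes, comments, and other serializations.
from typing import Iterator, Optional, Tuple

import requests

Triple = Tuple[str, str, str]


def _parse_nt_line(line: str) -> Optional[Triple]:
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # skip blank lines and comments
    # In N-Triples the subject and predicate never contain whitespace, so
    # two splits isolate them; the remainder is the object plus the final "."
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip().removesuffix(".").rstrip()
    return subject, predicate, obj


def stream_triples(source: str) -> Iterator[Triple]:
    """Yield triples one at a time from a local path or an HTTP(S) URL."""
    if source.startswith(("http://", "https://")):
        resp = requests.get(source, stream=True)  # body is never fully buffered
        resp.raise_for_status()
        lines = resp.iter_lines(decode_unicode=True)
    else:
        lines = open(source, encoding="utf-8")
    for line in lines:
        triple = _parse_nt_line(line)
        if triple is not None:
            yield triple
```

Because the sketch exposes a plain generator, a real parser could be swapped in later without changing the shard-writing code that consumes it.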

Acceptance Criteria:

  • Stream RDF data from HTTP sources or local files (see the streaming sketch above)
  • Convert to Parquet shards incrementally
  • Upload shards to HuggingFace during processing (see the shard/upload sketch below)
  • Support checkpoint/resume functionality for interrupted uploads (see the checkpoint sketch below)
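
The shard-generation and in-flight upload criteria could look like the loop below, consuming the `stream_triples` generator from the sketch above. `HfApi.upload_file` and `create_repo` are real `huggingface_hub` calls; `upload_in_shards`, the `shard_rows` default, and the `data/part-*.parquet` layout are illustrative assumptions.

```python
import itertools
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi


def upload_in_shards(source: str, repo_id: str,
                     shard_rows: int = 1_000_000,
                     workdir: str = "shards") -> None:
    api = HfApi()
    # Assumes the dataset repo exists; otherwise create it first with
    # api.create_repo(repo_id, repo_type="dataset", exist_ok=True).
    Path(workdir).mkdir(exist_ok=True)
    triples = stream_triples(source)  # generator from the sketch above
    for shard_idx in itertools.count():
        # Materialize at most `shard_rows` triples per shard, so memory
        # stays bounded no matter how large the source dataset is.
        batch = list(itertools.islice(triples, shard_rows))
        if not batch:
            break  # stream exhausted
        subjects, predicates, objects = zip(*batch)
        table = pa.table({"subject": list(subjects),
                          "predicate": list(predicates),
                          "object": list(objects)})
        local_path = Path(workdir) / f"part-{shard_idx:05d}.parquet"
        pq.write_table(table, local_path)
        # Upload each shard as soon as it is written, then drop the
        # local copy so disk usage stays bounded too.
        api.upload_file(path_or_fileobj=str(local_path),
                        path_in_repo=f"data/part-{shard_idx:05d}.parquet",
                        repo_id=repo_id,
                        repo_type="dataset")
        local_path.unlink()
```

Uploading each shard as soon as it is written, then deleting it locally, bounds disk usage the same way the row limit bounds memory.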
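
For the checkpoint/resume criterion, one plausible scheme is to persist a small JSON state file after every successful shard upload; the file name, schema, and fast-forward strategy below are assumptions, not project decisions, and `stream_triples` is again the generator from the first sketch.

```python
import itertools
import json
from pathlib import Path

# Hypothetical checkpoint file; any durable location works.
CHECKPOINT = Path("upload_checkpoint.json")


def load_checkpoint() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"shards_uploaded": 0, "triples_consumed": 0}


def save_checkpoint(shards_uploaded: int, triples_consumed: int) -> None:
    # Write to a temp file and rename, so a crash mid-write cannot
    # leave a corrupt checkpoint behind.
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps({"shards_uploaded": shards_uploaded,
                               "triples_consumed": triples_consumed}))
    tmp.replace(CHECKPOINT)


def resume_stream(source: str):
    """Return (triple iterator, next shard index), honoring the checkpoint."""
    state = load_checkpoint()
    triples = stream_triples(source)
    # Standard itertools recipe to advance an iterator n steps.
    n = state["triples_consumed"]
    next(itertools.islice(triples, n, n), None)
    return triples, state["shards_uploaded"]
```

`upload_in_shards` would call `save_checkpoint` after each successful upload and, on restart, take its iterator and starting shard index from `resume_stream` instead of a fresh `stream_triples`. Fast-forwarding re-reads the already-consumed prefix, which is cheap next to re-uploading it; for seekable local files, recording a byte offset would avoid even that.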
aditya self-assigned this 2026-01-16 14:26:09 +00:00
Reference: cleverdatasets/dataset-uploader#50