Fix upload_all_datasets.2.py: streaming converter, default behavior, and dry-run mode #20

Open
opened 2025-11-28 13:46:01 +00:00 by aditya · 0 comments
Member

Description
Fixed three critical issues in the dataset upload pipeline.

Problems

[ ] Streaming converter failure: convert_rdf_to_hf_dataset_streaming_parallel.py failed silently because the code that creates the temp_chunks directory was commented out. Parquet chunks were never saved, which caused "no valid data files" errors on large datasets (e.g., wikidata-truthy).
[ ] Unsafe default behavior: running the script without arguments processed ALL datasets instead of showing help, risking unintended operations.
[ ] Dry-run crashes: --dry-run crashed with an AssertionError when accessing file paths that do not exist, because downloads are skipped in that mode.
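
The first failure mode can be reproduced in isolation. The sketch below is not the converter's actual code: `write_chunk`, the chunk file naming, and the swallowed exception are assumptions made for illustration, and plain bytes stand in for Parquet data. It only shows why a missing temp_chunks directory leaves no chunk files behind, and that creating the directory first avoids this.

```python
import tempfile
from pathlib import Path


def write_chunk(chunk_dir: Path, index: int, payload: bytes, create_dir: bool) -> bool:
    """Write one chunk file; return True on success, False if the write failed.

    Hypothetical helper: the real converter writes Parquet chunks, plain bytes
    are used here so the example needs only the standard library.
    """
    if create_dir:
        # The step that was commented out: make sure the directory exists.
        chunk_dir.mkdir(parents=True, exist_ok=True)
    try:
        (chunk_dir / f"chunk_{index:05d}.parquet").write_bytes(payload)
        return True
    except FileNotFoundError:
        # Assumption: a swallowed error like this is what makes the failure silent.
        return False


with tempfile.TemporaryDirectory() as root:
    chunk_dir = Path(root) / "temp_chunks"
    # Without the directory, the write fails and nothing is saved.
    without_dir = write_chunk(chunk_dir, 0, b"data", create_dir=False)
    # With the directory created first, the chunk is written.
    with_dir = write_chunk(chunk_dir, 0, b"data", create_dir=True)
    saved = sorted(p.name for p in chunk_dir.glob("*.parquet"))
```

Logging or re-raising the error instead of swallowing it would also have surfaced the problem earlier than the later "no valid data files" error.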

Fixes

[x] Uncommented temp directory creation in streaming-parallel converter
[x] Changed default behavior to show help message
[x] Added None-path handling for dry-run mode
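
The last two fixes could look roughly like the sketch below. It is a minimal illustration, not the code in upload_all_datasets.2.py: only the `--dry-run` flag comes from this issue, while the `--dataset` option, the `describe_upload` helper, and the `downloads` directory are hypothetical names introduced for the example.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="upload_all_datasets")
    # Hypothetical option: datasets must be named explicitly.
    parser.add_argument("--dataset", action="append", default=[],
                        help="dataset to process (repeatable)")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would happen without downloading or uploading")
    return parser


def describe_upload(name: str, path: Optional[Path], dry_run: bool) -> str:
    """Return a status line, tolerating a missing path in dry-run mode."""
    if path is None:
        if dry_run:
            # Downloads are skipped in dry-run mode, so there is no file to inspect.
            return f"[dry-run] {name}: download skipped, no local file to check"
        raise FileNotFoundError(f"{name}: no downloaded file available")
    return f"{name}: would upload {path}"


def main(argv: List[str]) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.dataset:
        # Safe default: no arguments means show help, not process everything.
        parser.print_help()
        return 0
    for name in args.dataset:
        # Assumption: a dry run never produces a local path.
        path = None if args.dry_run else Path("downloads") / name
        print(describe_upload(name, path, args.dry_run))
    return 0
```

Calling `main([])` prints the help text and returns, and `main(["--dataset", "wikidata-truthy", "--dry-run"])` reports the skipped download instead of failing on a missing path.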

aditya self-assigned this 2025-11-28 13:47:15 +00:00
Reference
cleverdatasets/dataset-uploader#20