feat: add rdf_to_hf_incremental.py script for true streaming of entire pipeline #39
I have added a new script, rdf_to_hf_incremental.py. This module implements the download, decompression, conversion, and upload in a true streaming manner with zero disk utilization. The RDF-to-HF-dataset conversion writes Parquet incrementally in memory and then uploads it directly to Hugging Face, without storing the entire dataset cache on disk. The script is configurable through the CLI to handle either true streaming of the entire pipeline or streaming of just the conversion and upload.
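A minimal sketch of the incremental in-memory Parquet approach described above; `triples()`, `REPO_ID`, and the shard size are placeholders rather than the script's actual API:

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi

REPO_ID = "user/dataset"  # placeholder target repo
SCHEMA = pa.schema(
    [("subject", pa.string()), ("predicate", pa.string()), ("object", pa.string())]
)

def upload_shard(rows: list[tuple[str, str, str]], shard_idx: int) -> None:
    # Write one Parquet shard into an in-memory buffer (zero disk use),
    # then push it straight to the Hugging Face Hub.
    buf = io.BytesIO()
    columns = [pa.array(list(col)) for col in zip(*rows)]
    pq.write_table(pa.Table.from_arrays(columns, schema=SCHEMA), buf)
    buf.seek(0)
    HfApi().upload_file(
        path_or_fileobj=buf,
        path_in_repo=f"data/shard-{shard_idx:05d}.parquet",
        repo_id=REPO_ID,
        repo_type="dataset",
    )

rows: list[tuple[str, str, str]] = []
shard = 0
for s, p, o in triples():  # placeholder: streaming RDF triple source
    rows.append((str(s), str(p), str(o)))
    if len(rows) >= 100_000:  # shard size is arbitrary here
        upload_shard(rows, shard)
        rows, shard = [], shard + 1
if rows:
    upload_shard(rows, shard)
```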
VERY IMPORTANT:
This review is only about functionality -- whether it can be used to upload the Wikipedia dataset.
I'm going to try using it over the weekend.
Thank you for your hard work!
```
@@ -0,0 +247,4 @@
# ============================================================================
def _infer_filename_from_url(url: str) -> str:
```

This code already exists in extract_stream. Don't repeat code!

Fixed!
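If helpful, the dedupe can be as simple as importing the existing helper instead of redefining it (the exact module path is an assumption based on this thread):

```python
from extract_stream import _infer_filename_from_url
```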
```
@@ -0,0 +759,4 @@
elif input_type == "file":
    # Infer from file extension
    ext = Path(input_value).suffix.lower()
    format_map = {".ttl": "turtle", ".nt": "ntriples", ".nq": "nquads", ".xml": "xml"}
```

The list of expected formats from line 70 includes .rdf, .jsonld, .xml, .n3, and .trig. We will definitely need .rdf for RDF/XML format files.

Fixed!!
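A sketch of a map covering the full extension set from line 70; the format names on the right are my assumption and may differ from what the script actually passes to its parser:

```python
format_map = {
    ".ttl": "turtle",
    ".nt": "ntriples",
    ".nq": "nquads",
    ".rdf": "xml",       # RDF/XML files commonly use .rdf
    ".xml": "xml",
    ".jsonld": "json-ld",
    ".n3": "n3",
    ".trig": "trig",
}
```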
```
@@ -0,0 +765,4 @@
else:
    # Infer from URL
    ext = Path(urlparse(input_value).path).suffix.lower()
    format_map = {".ttl": "turtle", ".nt": "ntriples", ".gz": "turtle"}
```

The list of expected formats from line 70 includes .nq, .rdf, .xml, .jsonld, .n3, and .trig. Also, .gz just means that it is compressed; it does not guarantee that the format is turtle.

Fixed!!
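One way to handle this, sketched under the assumption that the inner extension is the real signal: strip the compression suffix first, then infer from what remains.

```python
from urllib.parse import urlparse

FORMAT_MAP = {".ttl": "turtle", ".nt": "ntriples", ".nq": "nquads",
              ".rdf": "xml", ".xml": "xml", ".jsonld": "json-ld",
              ".n3": "n3", ".trig": "trig"}  # same map as the sketch above

def infer_format_from_url(url: str) -> str | None:
    # ".gz" only signals compression; infer from the extension underneath.
    path = urlparse(url).path.lower()
    if path.endswith(".gz"):
        path = path[: -len(".gz")]
    dot = path.rfind(".")
    return FORMAT_MAP.get(path[dot:]) if dot != -1 else None

assert infer_format_from_url("https://example.org/dump.nt.gz") == "ntriples"
```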
```
@@ -0,0 +792,4 @@
console.print("\n[yellow]To authenticate, use one of:[/yellow]")
console.print(" 1. export HF_TOKEN='hf_...'")
console.print(" 2. huggingface-cli login")
console.print(" 3. --token hf_...")
```

Could you please include the error message? It's possible that "authentication" appears in the error message but the underlying problem isn't with authentication. (The easiest way to do this is to remove lines 796-797 and dedent lines 798 and 799. This would always print the error message and always return 1.)
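A sketch of the suggested flow, with do_upload() standing in for the script's real entry point: always print the underlying error before the authentication hints, and always return 1.

```python
from rich.console import Console

console = Console()

def main() -> int:
    try:
        do_upload()  # placeholder for the script's upload entry point
    except Exception as exc:
        # Always show the real error; "authentication" appearing in the
        # message doesn't guarantee authentication is the actual problem.
        console.print(f"[red]Error:[/red] {exc}")
        console.print("\n[yellow]To authenticate, use one of:[/yellow]")
        console.print(" 1. export HF_TOKEN='hf_...'")
        console.print(" 2. huggingface-cli login")
        console.print(" 3. --token hf_...")
        return 1
    return 0
```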
```
@@ -0,0 +21,4 @@
def _infer_name_from_url(url: str) -> str:
    try:
        return Path(urlparse(url).path).name
```

This code confuses two different kinds of path:
- urlparse(url).path returns everything between the netloc and the parameters of a URL. (See urllib.parse.)
- Path().name uses a file on the file system.

Those two ideas aren't the same. For example, this code will fail if it's run on a Windows system, where the path separator is \ instead of /.

Just use string utilities to work with the result of urlparse(url).path, not Path().name.
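A sketch using only string operations on the URL path:

```python
from urllib.parse import urlparse

def _infer_name_from_url(url: str) -> str:
    # URL paths always use "/" regardless of the host OS, so plain
    # string splitting is safe where pathlib.Path is not.
    path = urlparse(url).path
    return path.rstrip("/").rsplit("/", 1)[-1]

assert _infer_name_from_url("https://example.org/dumps/wiki.ttl.gz") == "wiki.ttl.gz"
```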
```
@@ -0,0 +55,4 @@
        if out:
            yield out
    return
```

Could you please add .xz to the set of decompressors? It uses the lzma library.

Fixed!
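A sketch of an .xz decompressor using the stdlib lzma module, shaped as a streaming generator (single-stream .xz files only):

```python
import lzma
from collections.abc import Iterable, Iterator

def _decompress_xz(byte_iter: Iterable[bytes]) -> Iterator[bytes]:
    # Incremental decompression: never materializes the whole file.
    dec = lzma.LZMADecompressor()
    for chunk in byte_iter:
        out = dec.decompress(chunk)
        if out:
            yield out
```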
```
@@ -0,0 +71,4 @@
    Yields (member_name, member_byte_iterator) pairs."""
    raw = _IterableBytesIO(byte_iter)
    mode = "r|*"
```

When I read tarfile, I see that the modes look like "r:gz" instead of "r|gz". It uses a colon instead of a pipe.

```
@@ -0,0 +75,4 @@
    if tar_name.endswith(".tar.gz") or tar_name.endswith(".tgz"):
        mode = "r|gz"
    elif tar_name.endswith(".tar"):
        mode = "r|"
```

One more mode we'll need:
```
@@ -0,0 +48,4 @@
        headers=req_headers,
        follow_redirects=cfg.follow_redirects,
        timeout=timeout,
    ) as resp:
```

According to https://www.python-httpx.org/api/ , you should

It might be okay here; I don't know this library.
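For reference, a sketch of a streaming download per the linked httpx docs (the function name and defaults are placeholders):

```python
import httpx
from collections.abc import Iterator

def stream_url(url: str, timeout: float = 30.0) -> Iterator[bytes]:
    # httpx.stream() yields the body incrementally instead of buffering it.
    with httpx.stream("GET", url, follow_redirects=True, timeout=timeout) as resp:
        resp.raise_for_status()
        yield from resp.iter_bytes()
```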
You should include all of the functionality of the old code. For example, --list should be included on the command line.

Commits:
- dataset_config (c152323097)
- ruff format (7f50361683)
- ruff check --fix (c5707689d6)