feat: add rdf_to_hf_incremental.py script for true streaming of entire pipeline #39

Open
aditya wants to merge 37 commits from stream-download-convert-upload into push-to-hub
Member

I have added a new script, `rdf_to_hf_incremental.py`. This module implements download, decompression, conversion, and upload in a true streaming manner with zero disk utilization. The RDF-to-HF-dataset conversion is implemented by incrementally writing parquet in memory and then uploading to Hugging Face directly, without storing the entire dataset cache on disk. The script is configurable through the CLI to handle either true streaming of the entire pipeline or streaming of just the conversion and upload.
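For reviewers, a minimal sketch of the incremental-write idea, assuming pyarrow and huggingface_hub; the function name, record batching, repo id, and shard layout are illustrative, not the script's actual API:

```
import io

import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi

def upload_shard(records, repo_id, shard_idx):
    """Write one batch of records to an in-memory parquet buffer and push it.

    Nothing touches local disk: the parquet bytes live in a BytesIO buffer
    that is handed straight to the Hub client.
    """
    buf = io.BytesIO()
    pq.write_table(pa.Table.from_pylist(records), buf)  # records: list of dicts
    buf.seek(0)
    HfApi().upload_file(
        path_or_fileobj=buf,
        path_in_repo=f"data/shard-{shard_idx:05d}.parquet",  # illustrative layout
        repo_id=repo_id,
        repo_type="dataset",
    )
```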
brent.edwards left a comment
Member

VERY IMPORTANT:

This review is only about functionality -- whether it can be used to upload the wikipedia dataset.

I'm going to try using it over the weekend.

Thank you for your hard work!

@@ -0,0 +247,4 @@
# ============================================================================
def _infer_filename_from_url(url: str) -> str:
Member

This code already exists in `extract_stream`. Don't repeat code!
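A hedged sketch of the fix, assuming `extract_stream` is importable from the new script:

```
# rdf_to_hf_incremental.py: reuse the existing helper instead of redefining it
from extract_stream import _infer_filename_from_url
```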
Author
Member

Fixed!
@@ -0,0 +759,4 @@
elif input_type == "file":
# Infer from file extension
ext = Path(input_value).suffix.lower()
format_map = {".ttl": "turtle", ".nt": "ntriples", ".nq": "nquads", ".xml": "xml"}
Member

The list of expected formats from line 70 includes `.rdf`, `.jsonld`, `.xml`, `.n3`, and `.trig`.

We will definitely need `.rdf` for RDF/XML format files.
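One possible expansion of the map; the parser format names follow the script's existing entries and common rdflib conventions, so double-check them against the parser actually in use:

```
format_map = {
    ".ttl": "turtle",
    ".nt": "ntriples",
    ".nq": "nquads",
    ".rdf": "xml",       # RDF/XML often ships with a .rdf extension
    ".xml": "xml",
    ".jsonld": "json-ld",
    ".n3": "n3",
    ".trig": "trig",
}
```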
Author
Member

Fixed!
@@ -0,0 +765,4 @@
else:
# Infer from URL
ext = Path(urlparse(input_value).path).suffix.lower()
format_map = {".ttl": "turtle", ".nt": "ntriples", ".gz": "turtle"}
Member

The list of expected formats from line 70 includes `.nq`, `.rdf`, `.xml`, `.jsonld`, `.n3`, and `.trig`.

`.gz` just means that it is compressed; it does not guarantee that the format is turtle.
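A sketch of one way to handle this: strip any compression suffix first, then infer the format from the remaining extension (the helper name and suffix set are illustrative):

```
from urllib.parse import urlparse

COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz")

def infer_rdf_extension(url: str) -> str:
    """Return the RDF extension of a URL, ignoring any compression suffix."""
    name = urlparse(url).path.rsplit("/", 1)[-1].lower()
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]  # "dump.nt.gz" -> "dump.nt"
            break
    return ("." + name.rsplit(".", 1)[-1]) if "." in name else ""
```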
Author
Member

Fixed!
@@ -0,0 +792,4 @@
console.print("\n[yellow]To authenticate, use one of:[/yellow]")
console.print(" 1. export HF_TOKEN='hf_...'")
console.print(" 2. huggingface-cli login")
console.print(" 3. --token hf_...")
Member

Could you please include the error message?

It's possible that "authentication" appears in the error message but the underlying problem isn't with authentication.

(The easiest way to do this is to remove lines 796-797, and dedent lines 798 and 799. This would always print the error message and always return 1.)

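A sketch of the suggested shape; `run_upload` and `do_upload` are hypothetical stand-ins for the script's own structure, and the error is always printed and failure always propagated:

```
from rich.console import Console

console = Console()

def run_upload() -> int:
    try:
        do_upload()  # hypothetical stand-in for the script's upload step
    except Exception as exc:
        console.print(f"[red]Error: {exc}[/red]")  # always show the real message
        console.print("\n[yellow]If this is an authentication issue, use one of:[/yellow]")
        console.print("  1. export HF_TOKEN='hf_...'")
        console.print("  2. huggingface-cli login")
        console.print("  3. --token hf_...")
        return 1
    return 0
```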
@@ -0,0 +21,4 @@
def _infer_name_from_url(url: str) -> str:
try:
return Path(urlparse(url).path).name
Member

This code confuses two different kinds of path:

`urlparse(url).path` returns everything between the `netloc` and the `parameters` of a URL. (See [urllib.parse](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse).)

`Path().name` uses a file on the file system.

Those two ideas aren't the same. For example, this code will fail if it's run on a Windows system, where the path separator is `\` instead of `/`.

Just use string utilities to work with the result of `urlparse(url).path`, not `Path().name`.
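A sketch of the string-based version; it splits on `/` explicitly, which is what URL paths always use regardless of the host OS:

```
from urllib.parse import urlparse

def _infer_name_from_url(url: str) -> str:
    # URL paths are always '/'-separated, so avoid the OS-dependent pathlib.
    path = urlparse(url).path
    return path.rstrip("/").rsplit("/", 1)[-1]
```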
@@ -0,0 +55,4 @@
if out:
yield out
return
Member

Could you please add `.xz` to the set of decompressors? It uses the `lzma` library.
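A sketch of an `.xz` branch mirroring the existing decompressor structure (the generator name and chunk protocol are assumed from the surrounding code):

```
import lzma

def _iter_xz(byte_iter):
    """Stream-decompress .xz chunks without buffering the whole file."""
    decomp = lzma.LZMADecompressor()
    for chunk in byte_iter:
        out = decomp.decompress(chunk)
        if out:
            yield out
```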
Author
Member

Fixed!
@@ -0,0 +71,4 @@
Yields (member_name, member_byte_iterator) pairs.
"""
raw = _IterableBytesIO(byte_iter)
mode = "r|*"
Member

When I read [tarfile](https://docs.python.org/3/library/tarfile.html#tarfile.open), I see that the modes look like `"r:gz"` instead of `"r|gz"`. It uses a colon instead of a pipe.
@@ -0,0 +75,4 @@
if tar_name.endswith(".tar.gz") or tar_name.endswith(".tgz"):
mode = "r|gz"
elif tar_name.endswith(".tar"):
mode = "r|"
Member

One more mode we'll need:

```
elif tar_name.endswith(".xz"):
    mode = "r:xz"
```
@@ -0,0 +48,4 @@
headers=req_headers,
follow_redirects=cfg.follow_redirects,
timeout=timeout,
) as resp:
Member

According to https://www.python-httpx.org/api/ , you should

> Only use these functions if you're testing HTTPX in a console or making a small number of requests. Using a Client will enable HTTP/2 and connection pooling for more efficient and long-lived connections.

It might be okay here; I don't know this library.
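For reference, the pooled form the httpx docs recommend looks roughly like this (`url`, `req_headers`, and `process` are stand-ins for the script's own values):

```
import httpx

# A single Client reuses connections (and can enable HTTP/2) across requests,
# unlike the module-level httpx.stream() helper.
with httpx.Client(follow_redirects=True, timeout=60.0) as client:
    with client.stream("GET", url, headers=req_headers) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_bytes():
            process(chunk)
```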
brent.edwards left a comment
Member

You should include all of the functionality of the old code. For example, `--list` should be included on the command line.
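A minimal stub for porting the flag, assuming the script uses argparse (the help text is illustrative):

```
parser.add_argument(
    "--list",
    action="store_true",
    help="list available dump files and exit without downloading",
)
```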