feat: add rdf_to_hf_incremental.py script for true streaming of entire pipeline #39

Open
aditya wants to merge 37 commits from stream-download-convert-upload into push-to-hub
Member

I have added a new script, `rdf_to_hf_incremental.py`. This module implements download, decompression, conversion, and upload in a true streaming manner with zero disk utilization. The RDF-to-HF-dataset conversion is implemented by incrementally writing parquet in memory and then uploading to Hugging Face directly, without storing the entire dataset cache on disk. The script is configurable through the CLI to handle either true streaming of the entire pipeline or streaming of just the conversion and upload.
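For reviewers, a minimal sketch of the incremental-write idea, assuming pyarrow and huggingface_hub; the function name, record batching, repo id, and shard layout are illustrative, not the script's actual API:

```
import io

import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi

def upload_shard(records, repo_id, shard_idx):
    """Write one batch of records to an in-memory parquet buffer and push it.

    Nothing touches local disk: the parquet bytes live in a BytesIO buffer
    that is handed straight to the Hub client.
    """
    buf = io.BytesIO()
    pq.write_table(pa.Table.from_pylist(records), buf)  # records: list of dicts
    buf.seek(0)
    HfApi().upload_file(
        path_or_fileobj=buf,
        path_in_repo=f"data/shard-{shard_idx:05d}.parquet",  # illustrative layout
        repo_id=repo_id,
        repo_type="dataset",
    )
```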
brent.edwards left a comment
Member

VERY IMPORTANT:

This review is only about functionality -- whether it can be used to upload the wikipedia dataset.

I'm going to try using it over the weekend.

Thank you for your hard work!

@@ -0,0 +247,4 @@
# ============================================================================
def _infer_filename_from_url(url: str) -> str:
Member

This code already exists in `extract_stream`. Don't repeat code!
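A hedged sketch of the fix, assuming `extract_stream` is importable from the new script:

```
# rdf_to_hf_incremental.py: reuse the existing helper instead of redefining it
from extract_stream import _infer_filename_from_url
```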
Author
Member

Fixed!
@@ -0,0 +759,4 @@
elif input_type == "file":
# Infer from file extension
ext = Path(input_value).suffix.lower()
format_map = {".ttl": "turtle", ".nt": "ntriples", ".nq": "nquads", ".xml": "xml"}
Member

The list of expected formats from line 70 includes `.rdf`, `.jsonld`, `.xml`, `.n3`, and `.trig`.

We will definitely need `.rdf` for RDF/XML format files.
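One possible expansion of the map; the parser format names follow the script's existing entries and common rdflib conventions, so double-check them against the parser actually in use:

```
format_map = {
    ".ttl": "turtle",
    ".nt": "ntriples",
    ".nq": "nquads",
    ".rdf": "xml",       # RDF/XML often ships with a .rdf extension
    ".xml": "xml",
    ".jsonld": "json-ld",
    ".n3": "n3",
    ".trig": "trig",
}
```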
Author
Member

Fixed!
@@ -0,0 +765,4 @@
else:
# Infer from URL
ext = Path(urlparse(input_value).path).suffix.lower()
format_map = {".ttl": "turtle", ".nt": "ntriples", ".gz": "turtle"}
Member

The list of expected formats from line 70 includes `.nq`, `.rdf`, `.xml`, `.jsonld`, `.n3`, and `.trig`.

`.gz` just means that it is compressed; it does not guarantee that the format is turtle.
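A sketch of one way to handle this: strip any compression suffix first, then infer the format from the remaining extension (the helper name and suffix set are illustrative):

```
from urllib.parse import urlparse

COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz")

def infer_rdf_extension(url: str) -> str:
    """Return the RDF extension of a URL, ignoring any compression suffix."""
    name = urlparse(url).path.rsplit("/", 1)[-1].lower()
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]  # "dump.nt.gz" -> "dump.nt"
            break
    return ("." + name.rsplit(".", 1)[-1]) if "." in name else ""
```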
Author
Member

Fixed!
@@ -0,0 +792,4 @@
console.print("\n[yellow]To authenticate, use one of:[/yellow]")
console.print(" 1. export HF_TOKEN='hf_...'")
console.print(" 2. huggingface-cli login")
console.print(" 3. --token hf_...")
Member

Could you please include the error message?

It's possible that "authentication" appears in the error message but the underlying problem isn't with authentication.

(The easiest way to do this is to remove lines 796-797, and dedent lines 798 and 799. This would always print the error message and always return 1.)

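A sketch of the suggested shape; `run_upload` and `do_upload` are hypothetical stand-ins for the script's own structure, and the error is always printed and failure always propagated:

```
from rich.console import Console

console = Console()

def run_upload() -> int:
    try:
        do_upload()  # hypothetical stand-in for the script's upload step
    except Exception as exc:
        console.print(f"[red]Error: {exc}[/red]")  # always show the real message
        console.print("\n[yellow]If this is an authentication issue, use one of:[/yellow]")
        console.print("  1. export HF_TOKEN='hf_...'")
        console.print("  2. huggingface-cli login")
        console.print("  3. --token hf_...")
        return 1
    return 0
```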
@@ -0,0 +21,4 @@
def _infer_name_from_url(url: str) -> str:
try:
return Path(urlparse(url).path).name
Member

This code confuses two different kinds of path:

`urlparse(url).path` returns everything between the `netloc` and the `parameters` of a URL. (See [urllib.parse](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse).)

`Path().name` uses a file on the file system.

Those two ideas aren't the same. For example, this code will fail if it's run on a Windows system, where the path separator is `\` instead of `/`.

Just use string utilities to work with the result of `urlparse(url).path`, not `Path().name`.
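A sketch of the string-based version; it splits on `/` explicitly, which is what URL paths always use regardless of the host OS:

```
from urllib.parse import urlparse

def _infer_name_from_url(url: str) -> str:
    # URL paths are always '/'-separated, so avoid the OS-dependent pathlib.
    path = urlparse(url).path
    return path.rstrip("/").rsplit("/", 1)[-1]
```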
@@ -0,0 +55,4 @@
if out:
yield out
return
Member

Could you please add `.xz` to the set of decompressors? It uses the `lzma` library.
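A sketch of an `.xz` branch mirroring the existing decompressor structure (the generator name and chunk protocol are assumed from the surrounding code):

```
import lzma

def _iter_xz(byte_iter):
    """Stream-decompress .xz chunks without buffering the whole file."""
    decomp = lzma.LZMADecompressor()
    for chunk in byte_iter:
        out = decomp.decompress(chunk)
        if out:
            yield out
```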
Author
Member

Fixed!
@@ -0,0 +71,4 @@
Yields (member_name, member_byte_iterator) pairs.
"""
raw = _IterableBytesIO(byte_iter)
mode = "r|*"
Member

When I read [tarfile](https://docs.python.org/3/library/tarfile.html#tarfile.open), I see that the modes look like `"r:gz"` instead of `"r|gz"`. It uses a colon instead of a pipe.
@@ -0,0 +75,4 @@
if tar_name.endswith(".tar.gz") or tar_name.endswith(".tgz"):
mode = "r|gz"
elif tar_name.endswith(".tar"):
mode = "r|"
Member

One more mode we'll need:

```
elif tar_name.endswith(".xz"):
    mode = "r:xz"
```
@@ -0,0 +48,4 @@
headers=req_headers,
follow_redirects=cfg.follow_redirects,
timeout=timeout,
) as resp:
Member

According to https://www.python-httpx.org/api/ , you should

> Only use these functions if you're testing HTTPX in a console or making a small number of requests. Using a Client will enable HTTP/2 and connection pooling for more efficient and long-lived connections.

It might be okay here; I don't know this library.
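For reference, the pooled form the httpx docs recommend looks roughly like this (`url`, `req_headers`, and `process` are stand-ins for the script's own values):

```
import httpx

# A single Client reuses connections (and can enable HTTP/2) across requests,
# unlike the module-level httpx.stream() helper.
with httpx.Client(follow_redirects=True, timeout=60.0) as client:
    with client.stream("GET", url, headers=req_headers) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_bytes():
            process(chunk)
```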
brent.edwards left a comment
Member

You should include all of the functionality of the old code. For example, `--list` should be included on the command line.
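A minimal stub for porting the flag, assuming the script uses argparse (the help text is illustrative):

```
parser.add_argument(
    "--list",
    action="store_true",
    help="list available dump files and exit without downloading",
)
```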