WIP: feat: Unify 5 RDF converter scripts using Strategy Pattern #22

Draft
aditya wants to merge 5 commits from 15-refactor-rdf-converters into 19-rewrite-upload-all-datasets
Member

Unifies five separate RDF converter scripts into a single maintainable codebase (convert_rdf_to_hf_dataset_unified.py) using the Strategy Pattern to support standard, streaming, and parallel processing.

Implements an auto-selection mode that dynamically chooses the optimal conversion strategy based on file size and format, alongside improved Rich-based progress tracking.

Refactors the main upload_all_datasets.2.py script to simplify the pipeline by removing complex branching logic and delegating all conversions to the new unified tool.

brent.edwards requested changes 2025-12-02 03:50:43 +00:00
Dismissed
brent.edwards left a comment
Member

Are the files convert_rdf_to_hf_dataset.*.py other than convert_rdf_to_hf_dataset_unified.py supposed to go away? If so, then please ignore my comments inside them and delete those files instead of fixing them.

I have read about 800 lines of convert_rdf_to_hf_dataset_unified.py, and I know that I'll have more comments.

But I need to study for the evening, so I'll pass you 72 notes.

@ -49,6 +49,8 @@ import json
import logging
import multiprocessing as mp
import shutil
import shutil
Member


Line 51 is `import shutil`. You don't need to duplicate it.
brent.edwards marked this conversation as resolved
@ -577,0 +584,4 @@
# Save final dataset
console.print("[yellow]Saving final dataset...[/yellow]")
dataset_dict.save_to_disk(str(output_path))
Member

Lines 563-587 are almost identical to lines 589-613.

This is really bad form; a future programmer will make a change to one and forget the other.

Could you please rewrite these lines to remove duplicate code? (If you have any problems, I am glad to help.)

I like the use of line 561 to refer to the directory being written. You can use the same idea when clean_cache is off with something similar to:

if clean_cache:
    my_cache_dir = tempfile.TemporaryDirectory()
else:
    my_cache_dir = temp_chunks_dir

with tempfile.TemporaryDirectory() as temp_cache_dir:
    console.print(...)

The above code would require clean-up at the end.

If you don't want to do clean-up, you should be able to do something like

with tempfile.TemporaryDirectory() if clean_cache else my_cache_dir as temp_cache_dir

but I find that harder to understand. Your choice.

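For illustration, here is a minimal sketch of one way to collapse the two branches using `contextlib.nullcontext`; the function name is invented, and `clean_cache`, `temp_chunks_dir`, and `output_path` are assumed from the surrounding code:

```python
import tempfile
from contextlib import nullcontext
from pathlib import Path

from datasets import Dataset, DatasetDict
from rich.console import Console

console = Console()


def save_final_dataset(temp_chunks_dir: Path, output_path: Path, clean_cache: bool) -> None:
    # nullcontext wraps the existing chunks directory so both branches share one
    # `with` block; TemporaryDirectory cleans itself up when clean_cache is set.
    cache_ctx = (
        tempfile.TemporaryDirectory() if clean_cache else nullcontext(str(temp_chunks_dir))
    )
    with cache_ctx as cache_dir:
        dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=cache_dir)
        dataset_dict = DatasetDict({"data": dataset})
        console.print("[yellow]Saving final dataset...[/yellow]")
        dataset_dict.save_to_disk(str(output_path))
```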
brent.edwards marked this conversation as resolved
@ -634,7 +668,9 @@ def main():
parser.add_argument("--description", help="Dataset description")
parser.add_argument("--citation", help="Dataset citation")
parser.add_argument("--homepage", help="Dataset homepage URL")
parser.add_argument("--homepage", help="Dataset homepage URL")
Member

Lines 670 and 671 are duplicates.

brent.edwards marked this conversation as resolved
@ -704,0 +727,4 @@
console.print(f"[red]Error saving dataset: {e}[/red]")
console.print(f"[yellow]Output path: {output_path}[/yellow]")
console.print(f"[yellow]Check disk space and permissions[/yellow]")
raise e
Member

Lines 707-730 and lines 732-755 are almost duplicates.

I don't know if you copy/pasted this or if an LLM generated this, but duplicate code is usually bad, especially when it's as easy as this to fix.

Here's an explainer about code duplication: https://axify.io/blog/code-duplication

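As a sketch only, both blocks could delegate to a single hypothetical helper (the name and signature here are invented; the diagnostics are the ones quoted in the hunk above):

```python
from pathlib import Path

from datasets import DatasetDict
from rich.console import Console

console = Console()


def save_with_diagnostics(dataset_dict: DatasetDict, output_path: Path) -> None:
    """Save the dataset, keeping the duplicated error diagnostics in one place."""
    try:
        dataset_dict.save_to_disk(str(output_path))
    except Exception as e:
        console.print(f"[red]Error saving dataset: {e}[/red]")
        console.print(f"[yellow]Output path: {output_path}[/yellow]")
        console.print("[yellow]Check disk space and permissions[/yellow]")
        raise  # re-raise unchanged rather than `raise e`
```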
brent.edwards marked this conversation as resolved
@ -407,0 +413,4 @@
# Save final dataset
console.print("[yellow]Saving final dataset...[/yellow]")
dataset_dict.save_to_disk(str(output_path))
Member

Lines 401-416 and lines 418-433 are fairly similar, except for where and how in the code dataset_dict is initialized.

They should be easy to merge.

brent.edwards marked this conversation as resolved
@ -308,3 +311,2 @@
# Read all Parquet files as a single dataset
dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
print(f"\nPROGRESS: 90", flush=True)
# Read all Parquet files as a single dataset
Member

Duplicate line.

brent.edwards marked this conversation as resolved
@ -322,0 +330,4 @@
# Save final dataset
console.print("[yellow]Saving final dataset...[/yellow]")
print(f"\nPROGRESS: 95", flush=True)
dataset_dict.save_to_disk(str(output_path))
Member

Lines 315-333 and lines 335-352 are almost duplicates.

brent.edwards marked this conversation as resolved
@ -367,7 +394,9 @@ def main():
)
parser.add_argument("--description", type=str, help="Dataset description")
parser.add_argument("--homepage", type=str, help="Dataset homepage")
parser.add_argument("--homepage", type=str, help="Dataset homepage")
Member


You don't need another `--homepage` argument.
brent.edwards marked this conversation as resolved
@ -0,0 +1,2208 @@
#!/usr/bin/env python3
Member

When I run ruff check, it reports that the following lines are too long. I don't have the patience to copy and paste for each of these lines:

275, 413, 451, 460, 466, 485, 495, 518, 563, 594, 631, 647, 668, 676, 689, 716, 718, 720, 721, 729, 845, 855, 858, 865, 867, 875, 878, 884, 895, 914, 930, 944, 959, 971, 989, 1021, 1042, 1088, 1161, 1163, 1174, 1176, 1188, 1190, 1192, 1201, 1204, 1211, 1229, 1235, 1236, 1256, 1260, 1263, 1265, 1303, 1312, 1330, 1382, 1383, 1393, 1395, 1402, 1404, 1411, 1414, 1420, 1439, 1440, 1459, 1463, 1472, 1473, 1488, 1504, 1528, 1538, 1553, 1601, 1602, 1603, 1613, 1615, 1617, 1627, 1632, 1634, 1643, 1647, 1655, 1673, 1679, 1680, 1682, 1701, 1709, 1713, 1717, 1718, 1733, 1735, 1741, 1771, 1776, 1795, 1802, 1804, 1820, 1830, 1838, 1974, 1979, 1989, 1993, 2011, and 2055.

brent.edwards marked this conversation as resolved
@ -0,0 +34,4 @@
"object": "value",
"object_type": "uri|literal|blank_node",
"object_datatype": "xsd:string or None",
"object_language": "en or None"
Member

We DEFINITELY have data that's not in English.

Author
Member

Corrected the docstring

brent.edwards marked this conversation as resolved
@ -0,0 +56,4 @@
import time
from abc import ABC, abstractmethod
from contextlib import contextmanager
from dataclasses import dataclass, field
Member


`field` is never used.
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:59:36: F401 [*] `dataclasses.field` imported but unused
   |
57 | from abc import ABC, abstractmethod
58 | from contextlib import contextmanager
59 | from dataclasses import dataclass, field
   |                                    ^^^^^ F401
60 | from io import BytesIO
61 | from multiprocessing import Manager, Pool
   |
   = help: Remove unused import: `dataclasses.field`
brent.edwards marked this conversation as resolved
@ -0,0 +60,4 @@
from io import BytesIO
from multiprocessing import Manager, Pool
from pathlib import Path
from typing import Any, Iterator, Optional
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:63:1: UP035 [*] Import from `collections.abc` instead: `Iterator`
   |
61 | from multiprocessing import Manager, Pool
62 | from pathlib import Path
63 | from typing import Any, Iterator, Optional
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UP035
64 |
65 | import pyarrow as pa
   |
   = help: Import from `collections.abc`
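The suggested fix is just to move the `Iterator` import, for example:

```python
from collections.abc import Iterator  # UP035: Iterator now lives here
from typing import Any, Optional
```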
brent.edwards marked this conversation as resolved
@ -0,0 +73,4 @@
SpinnerColumn,
TextColumn,
TimeElapsedColumn,
TimeRemainingColumn,
Member


`TimeRemainingColumn` is never used.
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:76:5: F401 [*] `rich.progress.TimeRemainingColumn` imported but unused
   |
74 |     TextColumn,
75 |     TimeElapsedColumn,
76 |     TimeRemainingColumn,
   |     ^^^^^^^^^^^^^^^^^^^ F401
77 | )
   |
   = help: Remove unused import: `rich.progress.TimeRemainingColumn`
brent.edwards marked this conversation as resolved
@ -0,0 +91,4 @@
output_path: Path
rdf_format: str = "turtle"
chunk_size: int = 10000
num_workers: Optional[int] = None
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:94:18: UP007 [*] Use `X | Y` for type annotations
   |
92 |     rdf_format: str = "turtle"
93 |     chunk_size: int = 10000
94 |     num_workers: Optional[int] = None
   |                  ^^^^^^^^^^^^^ UP007
95 |     metadata: Optional[dict[str, Any]] = None
96 |     create_train_test_split: bool = False
   |
   = help: Convert to `X | Y`

(I find Optional easier to understand myself... Crap.)

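For reference, the two spellings are equivalent on Python 3.10+; UP007 prefers the second (the field names here are just for illustration):

```python
from typing import Optional

workers_old_style: Optional[int] = None  # what the file uses today
workers_new_style: int | None = None     # PEP 604 form that ruff wants
```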
brent.edwards marked this conversation as resolved
@ -0,0 +92,4 @@
rdf_format: str = "turtle"
chunk_size: int = 10000
num_workers: Optional[int] = None
metadata: Optional[dict[str, Any]] = None
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:95:15: UP007 [*] Use `X | Y` for type annotations
   |
93 |     chunk_size: int = 10000
94 |     num_workers: Optional[int] = None
95 |     metadata: Optional[dict[str, Any]] = None
   |               ^^^^^^^^^^^^^^^^^^^^^^^^ UP007
96 |     create_train_test_split: bool = False
97 |     test_size: float = 0.05
   |
   = help: Convert to `X | Y`
brent.edwards marked this conversation as resolved
@ -0,0 +105,4 @@
success: bool
total_triples: int = 0
processing_time_seconds: float = 0.0
output_path: Optional[Path] = None
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:108:18: UP007 [*] Use `X | Y` for type annotations
    |
106 |     total_triples: int = 0
107 |     processing_time_seconds: float = 0.0
108 |     output_path: Optional[Path] = None
    |                  ^^^^^^^^^^^^^^ UP007
109 |     error_message: Optional[str] = None
110 |     strategy_used: Optional[str] = None
    |
    = help: Convert to `X | Y`
brent.edwards marked this conversation as resolved
@ -0,0 +106,4 @@
total_triples: int = 0
processing_time_seconds: float = 0.0
output_path: Optional[Path] = None
error_message: Optional[str] = None
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:109:20: UP007 [*] Use `X | Y` for type annotations
    |
107 |     processing_time_seconds: float = 0.0
108 |     output_path: Optional[Path] = None
109 |     error_message: Optional[str] = None
    |                    ^^^^^^^^^^^^^ UP007
110 |     strategy_used: Optional[str] = None
    |
    = help: Convert to `X | Y`
brent.edwards marked this conversation as resolved
@ -0,0 +107,4 @@
processing_time_seconds: float = 0.0
output_path: Optional[Path] = None
error_message: Optional[str] = None
strategy_used: Optional[str] = None
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:110:20: UP007 [*] Use `X | Y` for type annotations
    |
108 |     output_path: Optional[Path] = None
109 |     error_message: Optional[str] = None
110 |     strategy_used: Optional[str] = None
    |                    ^^^^^^^^^^^^^ UP007
    |
    = help: Convert to `X | Y`
brent.edwards marked this conversation as resolved
@ -0,0 +123,4 @@
"""
@staticmethod
def extract_triple(subject, predicate, obj) -> dict[str, Any]:
Member

It looks like you can use

    def extract_triple(subject, predicate, obj) -> dict[str, str]:

at this time. But make sure that the TripleExtractor handles objects that are themselves triples!

Author
Member

A few clarifications on the current implementation:

  • TripleExtractor class removed: We refactored extract_triple to be a module-level function (line 117) instead of a class method. This was necessary for multiprocessing pickling support; class methods can have serialization issues when passed to worker processes.

  • Return type dict[str, Any] is correct: We can't use dict[str, str] because object_datatype and object_language can be None.

  • Triple terms (RDF-star/RDF 1.2): Currently, the code handles this by falling through to blank_node.

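For context, a rough sketch of the branch order being discussed, mirroring the hunks above; the None-able datatype/language fields are what force `dict[str, Any]` rather than `dict[str, str]`:

```python
from typing import Any

from rdflib import Literal, URIRef


def extract_triple(subject, predicate, obj) -> dict[str, Any]:
    if isinstance(obj, Literal):
        object_type = "literal"
        object_datatype = str(obj.datatype) if obj.datatype else None
        object_language = obj.language if obj.language else None
    elif isinstance(obj, URIRef):  # rdflib's URIRef carries IRIs as well as URIs
        object_type = "uri"
        object_datatype = None
        object_language = None
    else:  # BNode -- and, for now, RDF-star triple terms also land here
        object_type = "blank_node"
        object_datatype = None
        object_language = None
    return {
        "subject": str(subject),
        "predicate": str(predicate),
        "object": str(obj),
        "object_type": object_type,
        "object_datatype": object_datatype,
        "object_language": object_language,
    }
```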
brent.edwards marked this conversation as resolved
@ -0,0 +141,4 @@
object_type = "literal"
object_datatype = str(obj.datatype) if obj.datatype else None
object_language = obj.language if obj.language else None
elif isinstance(obj, URIRef):
Member

What happens if the object is an IRI but not a URI? See https://www.w3.org/TR/rdf12-concepts/#section-IRIs

Author
Member

The RDFLib documentation and implementation confirm that the URIRef class is designed to handle both URIs (Uniform Resource Identifiers) and IRIs (Internationalized Resource Identifiers), even though its name explicitly references only "URI".

brent.edwards marked this conversation as resolved
@ -0,0 +145,4 @@
object_type = "uri"
object_datatype = None
object_language = None
else:
Member

What happens if the object is a triple term? (See https://www.w3.org/TR/rdf12-n-triples/#triple-terms )

Author
Member

Triple terms (RDF-star/RDF 1.2) currently fall through to the else branch and are classified as blank_node.

Do we have any RDF-star datasets that need explicit handling? If so, I can add explicit triple term support as a follow-up.

brent.edwards marked this conversation as resolved
@ -0,0 +214,4 @@
"""Handle file I/O operations including compression."""
@staticmethod
def open_file(file_path: Path, mode: str = "r"):
Member


I'm surprised that `pyright` didn't point out that this method doesn't list what it returns...
brent.edwards marked this conversation as resolved
@ -0,0 +261,4 @@
return False
try:
with open(file_path, "r", encoding="utf-8") as f:
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:264:18: UP015 [*] Unnecessary mode argument
    |
263 |         try:
264 |             with open(file_path, "r", encoding="utf-8") as f:
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UP015
265 |                 first_line = f.readline().strip()
266 |                 return first_line.startswith("http://") or first_line.startswith("https://")
    |
    = help: Remove mode argument
brent.edwards marked this conversation as resolved
@ -0,0 +264,4 @@
with open(file_path, "r", encoding="utf-8") as f:
first_line = f.readline().strip()
return first_line.startswith("http://") or first_line.startswith("https://")
except Exception:
Member


I think that all file errors descend from `OSError`, so you should be able to use that instead of `Exception`.
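Combining that with the UP015 fix above (and folding the two startswith calls into one tuple check), the helper could look roughly like this; the function name is illustrative:

```python
from pathlib import Path


def first_line_is_url(file_path: Path) -> bool:
    try:
        with open(file_path, encoding="utf-8") as f:  # "r" is already the default mode
            first_line = f.readline().strip()
            return first_line.startswith(("http://", "https://"))
    except OSError:  # covers FileNotFoundError, PermissionError, IsADirectoryError, ...
        return False
```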
brent.edwards marked this conversation as resolved
@ -0,0 +328,4 @@
yield progress, task
@contextmanager
def progress_bar(self, description: str, total: Optional[int] = None,
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:331:53: UP007 [*] Use `X | Y` for type annotations
    |
330 |     @contextmanager
331 |     def progress_bar(self, description: str, total: Optional[int] = None,
    |                                                     ^^^^^^^^^^^^^ UP007
332 |                      show_count: bool = False, extra_fields: Optional[dict] = None):
333 |         """Context manager for progress bar."""
    |
    = help: Convert to `X | Y`
brent.edwards marked this conversation as resolved
@ -0,0 +329,4 @@
@contextmanager
def progress_bar(self, description: str, total: Optional[int] = None,
show_count: bool = False, extra_fields: Optional[dict] = None):
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:332:62: UP007 [*] Use `X | Y` for type annotations
    |
330 |     @contextmanager
331 |     def progress_bar(self, description: str, total: Optional[int] = None,
332 |                      show_count: bool = False, extra_fields: Optional[dict] = None):
    |                                                              ^^^^^^^^^^^^^^ UP007
333 |         """Context manager for progress bar."""
334 |         columns = [
    |
    = help: Convert to `X | Y`
brent.edwards marked this conversation as resolved
@ -0,0 +352,4 @@
task_kwargs = {"total": total}
if extra_fields:
task_kwargs.update(extra_fields)
task = progress.add_task(f"[yellow]{description}[/yellow]", **task_kwargs)
Member

pyright gives the following error:

/home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:355:75 - error: Argument of type "int | None" cannot be assigned to parameter "start" of type "bool" in function "add_task"
    Type "int | None" is not assignable to type "bool"
      "int" is not assignable to "bool" (reportArgumentType)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:355:75 - error: Argument of type "int | None" cannot be assigned to parameter "completed" of type "int" in function "add_task"
    Type "int | None" is not assignable to type "int"
      "None" is not assignable to "int" (reportArgumentType)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:355:75 - error: Argument of type "int | None" cannot be assigned to parameter "visible" of type "bool" in function "add_task"
    Type "int | None" is not assignable to type "bool"
      "int" is not assignable to "bool" (reportArgumentType)
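One possible way to quiet this, sketched against the snippet in the hunk: pass `total` explicitly and only unpack the extra fields, so the `int | None` value never lines up with `add_task`'s `bool`/`int` parameters:

```python
# `progress`, `description`, `total`, and `extra_fields` are the local names from
# the surrounding method; rich accepts total=None for an indeterminate bar.
task = progress.add_task(
    f"[yellow]{description}[/yellow]",
    total=total,
    **(extra_fields or {}),
)
```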
brent.edwards marked this conversation as resolved
@ -0,0 +402,4 @@
"""
pass
def estimate_memory_usage_mb(self, file_size_mb: float) -> float:
Member


`estimate_memory_usage_mb` is never used in this code.
Author
Member

removed the dead code

brent.edwards marked this conversation as resolved
@ -0,0 +406,4 @@
"""Estimate memory requirements in MB."""
return file_size_mb * 3.0 # Default: 3x file size
def supports_format(self, rdf_format: str) -> bool:
Member


`supports_format` is never used in the code. Is the check missing?
Author
Member

removed the dead code

brent.edwards marked this conversation as resolved
@ -0,0 +414,4 @@
"""Save dataset to disk with optional cache cleaning."""
config.output_path.mkdir(parents=True, exist_ok=True)
dataset_dict.save_to_disk(str(config.output_path))
self.progress.print(f"Dataset saved to {config.output_path}", "green")
Member

The description says "with optional cache cleaning". Is the cache cleaning missing?

The description says "with optional cache cleaning". Is the cache cleaning missing?
brent.edwards marked this conversation as resolved
@ -0,0 +448,4 @@
def supports_format(self, rdf_format: str) -> bool:
"""Standard strategy supports all formats including TSV."""
return rdf_format in ["turtle", "nt", "ntriples", "xml", "n3", "trig", "nquads", "tsv"]
Member


Although `supports_format` is never used, if it were used, should you include `geonames` here?
brent.edwards marked this conversation as resolved
@ -0,0 +517,4 @@
with self.progress.progress_bar("Parsing TSV triples...", show_count=True) as (progress, task):
# Count lines
with open(file_path, "r", encoding="utf-8") as f:
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:520:18: UP015 [*] Unnecessary mode argument
    |
518 |         with self.progress.progress_bar("Parsing TSV triples...", show_count=True) as (progress, task):
519 |             # Count lines
520 |             with open(file_path, "r", encoding="utf-8") as f:
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UP015
521 |                 total_lines = sum(1 for _ in f)
522 |             progress.update(task, total=total_lines)
    |
    = help: Remove mode argument
brent.edwards marked this conversation as resolved
@ -0,0 +521,4 @@
total_lines = sum(1 for _ in f)
progress.update(task, total=total_lines)
with open(file_path, "r", encoding="utf-8") as f:
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:524:18: UP015 [*] Unnecessary mode argument
    |
522 |             progress.update(task, total=total_lines)
523 |
524 |             with open(file_path, "r", encoding="utf-8") as f:
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UP015
525 |                 for i, line in enumerate(f):
526 |                     line = line.strip()
    |
    = help: Remove mode argument
brent.edwards marked this conversation as resolved
@ -0,0 +532,4 @@
continue
subject, predicate, obj = parts
triples.append({
Member


You're duplicating code. You should use `TripleExtractor::extract_triple` here.
Author
Member

The TSV parsing intentionally doesn't use extract_triple() because the two paths handle different data types. There's no RDFLib parsing involved; we can't create Literal or URIRef objects from these strings because they're not valid IRIs or typed literals.

brent.edwards marked this conversation as resolved
@ -0,0 +567,4 @@
def update_progress():
while not stop_updates.is_set():
elapsed = time.time() - parse_start
pct = min(95, (elapsed / estimated_parse_time) * 100)
Member

Might this line be better as

pct = (elapsed / estimated_parse_time) * 95

The way the code is currently written, the last 5% of updates will show pct holding at 95% without change.

Author
Member

pct = min(95, (elapsed / estimated_parse_time) * 95). This ensures progress smoothly increases from 0% → 95% over the full estimated time, then jumps to 100% when parsing actually completes.

brent.edwards marked this conversation as resolved
@ -0,0 +589,4 @@
file_content = f.read()
self.progress.print("Parsing RDF content...", "dim")
estimated_parse_time = file_size_mb * 2.5
Member

I doubt that size. Look through my comments in CleverDatasets.

brent.edwards marked this conversation as resolved
@ -0,0 +599,4 @@
last_progress = 0
while not stop_updates.is_set():
elapsed = time.time() - parse_start
pct = min(95, (elapsed / estimated_parse_time) * 100)
Member

Same comment about the last 5 percent of the estimated time...

brent.edwards marked this conversation as resolved
@ -0,0 +613,4 @@
thread.start()
try:
graph.parse(BytesIO(file_content), format=config.rdf_format)
Member

pyright gives the following error:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:876:62 - error: Argument of type "dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict" cannot be assigned to parameter "dataset" of type "Dataset" in function "_create_dataset_dict"
    Type "dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict" is not assignable to type "Dataset"
      "DatasetDict" is not assignable to "Dataset" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +630,4 @@
with self.progress.progress_bar("Converting triples...",
total=len(graph_list), show_count=True) as (progress, task):
for idx, (s, p, o) in enumerate(graph_list):
triples.append(self.triple_extractor.extract_triple(s, p, o))
Member

This looks like it's replicating the work of TripleExtractor::extract_triples_from_graph (except that extract_triples_from_graph doesn't have a progress bar.)

Should you unify that code?

Author
Member

These are intentionally different implementations serving different purposes:

  • extract_triples_from_graph(graph): Quick extraction without progress

  • Loop with progress bar: Large files where user needs visual feedback

The key difference is progress tracking. Python's list comprehension doesn't allow mid-iteration callbacks. To show a progress bar, we need an explicit loop.

brent.edwards marked this conversation as resolved
@ -0,0 +666,4 @@
all_triples = []
with self.progress.progress_bar("Processing GeoNames chunks...", total=100) as (progress, task):
with Pool(processes=num_workers) as pool:
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:668:9: SIM117 Use a single `with` statement with multiple contexts instead of nested `with` statements
    |
666 |           all_triples = []
667 |
668 | /         with self.progress.progress_bar("Processing GeoNames chunks...", total=100) as (progress, task):
669 | |             with Pool(processes=num_workers) as pool:
    | |_____________________________________________________^ SIM117
670 |                   result = pool.map_async(_parse_geonames_chunk, worker_args)
    |
    = help: Combine `with` statements
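The combined form ruff is asking for (parenthesized multi-item `with`, which needs Python 3.10+) would look something like:

```python
with (
    self.progress.progress_bar(
        "Processing GeoNames chunks...", total=100
    ) as (progress, task),
    Pool(processes=num_workers) as pool,
):
    result = pool.map_async(_parse_geonames_chunk, worker_args)
```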
brent.edwards marked this conversation as resolved
@ -0,0 +879,4 @@
dataset_dict.save_to_disk(str(config.output_path))
else:
dataset = Dataset.from_parquet(str(data_dir / "*.parquet"))
dataset_dict = self._create_dataset_dict(dataset, config)
Member

pyright gives the following error:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:882:58 - error: Argument of type "dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict" cannot be assigned to parameter "dataset" of type "Dataset" in function "_create_dataset_dict"
    Type "dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict" is not assignable to type "Dataset"
      "DatasetDict" is not assignable to "Dataset" (reportArgumentType)
`pyright` gives the following error: ``` /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:882:58 - error: Argument of type "dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict" cannot be assigned to parameter "dataset" of type "Dataset" in function "_create_dataset_dict"   Type "dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict" is not assignable to type "Dataset"     "DatasetDict" is not assignable to "Dataset" (reportArgumentType) ```
brent.edwards marked this conversation as resolved
@ -0,0 +892,4 @@
self._save_dataset_info(config, total_triples, dataset_dict, elapsed)
self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:895:33: F541 [*] f-string without any placeholders
    |
893 |             self._save_dataset_info(config, total_triples, dataset_dict, elapsed)
894 |
895 |             self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
    |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
896 |
897 |             return ConversionResult(
    |
    = help: Remove extraneous `f` prefix
brent.edwards marked this conversation as resolved
@ -0,0 +948,4 @@
with self.file_handler.open_file(config.input_path) as f:
for line in f:
line = line.strip()
if not line or line.startswith("#"):
Member

pyright reports the following:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:951:48 - error: Argument of type "Literal['#']" cannot be assigned to parameter "prefix" of type "ReadableBuffer | tuple[ReadableBuffer, ...]" in function "startswith"
    Type "Literal['#']" is not assignable to type "ReadableBuffer | tuple[ReadableBuffer, ...]"
      "Literal['#']" is incompatible with protocol "Buffer"
        "__buffer__" is not present
      "Literal['#']" is not assignable to "tuple[ReadableBuffer, ...]" (reportArgumentType)

    

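If `open_file` can legitimately hand back a binary handle (e.g. a `GzipFile` opened in `"rb"`), one sketch that keeps both pyright and the runtime happy is to normalize each line to `str` first:

```python
with self.file_handler.open_file(config.input_path) as f:
    for raw in f:
        # Compressed handles yield bytes unless opened in text mode; decode so the
        # string prefix checks below type-check and behave consistently.
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # ... rest of the line handling unchanged
```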
brent.edwards marked this conversation as resolved
@ -0,0 +973,4 @@
graph = Graph()
with self.file_handler.open_file(config.input_path) as f:
graph.parse(f, format="turtle")
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:976:25 - error: Argument of type "GzipFile | TextIOWrapper[_WrappedBuffer] | BZ2File | IO[Any]" cannot be assigned to parameter "source" of type "IO[bytes] | TextIO | InputSource | str | bytes | PurePath | None" in function "parse"
    Type "GzipFile | TextIOWrapper[_WrappedBuffer] | BZ2File | IO[Any]" is not assignable to type "IO[bytes] | TextIO | InputSource | str | bytes | PurePath | None"
      Type "GzipFile" is not assignable to type "IO[bytes] | TextIO | InputSource | str | bytes | PurePath | None"
        "GzipFile" is not assignable to "IO[bytes]"
        "GzipFile" is not assignable to "TextIO"
        "GzipFile" is not assignable to "InputSource"
        "GzipFile" is not assignable to "str"
        "GzipFile" is not assignable to "bytes"
        "GzipFile" is not assignable to "PurePath"
    ... (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1016,4 @@
for worker_chunks in results:
for chunk in worker_chunks:
yield chunk
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1018:13: UP028 Replace `yield` over `for` loop with `yield from`
     |
1017 |           for worker_chunks in results:
1018 | /             for chunk in worker_chunks:
1019 | |                 yield chunk
     | |___________________________^ UP028
1020 |
1021 |       def _create_dataset_dict(self, dataset: Dataset, config: ConversionConfig) -> DatasetDict:
     |
     = help: Replace with `yield from`
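The UP028 rewrite is mechanical:

```python
for worker_chunks in results:
    yield from worker_chunks  # replaces the inner `for chunk in worker_chunks: yield chunk`
```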
brent.edwards marked this conversation as resolved
@ -0,0 +1199,4 @@
if config.clean_cache:
with tempfile.TemporaryDirectory() as temp_cache_dir:
dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=temp_cache_dir)
dataset_dict = DatasetDict({"data": dataset})
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1202:36 - error: No overloads for "__init__" match the provided arguments (reportCallIssue)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1202:49 - error: Argument of type "dict[str, dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict]" cannot be assigned to parameter "iterable" of type "Iterable[tuple[str | NamedSplit, Dataset]]" in function "__init__"
    "Literal['data']" is not assignable to "tuple[str | NamedSplit, Dataset]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1206,4 @@
dataset_dict.save_to_disk(str(config.output_path))
else:
dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
dataset_dict = DatasetDict({"data": dataset})
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1209:32 - error: No overloads for "__init__" match the provided arguments (reportCallIssue)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1209:45 - error: Argument of type "dict[str, dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict]" cannot be assigned to parameter "iterable" of type "Iterable[tuple[str | NamedSplit, Dataset]]" in function "__init__"
    "Literal['data']" is not assignable to "tuple[str | NamedSplit, Dataset]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1232,4 @@
with open(config.output_path / "dataset_info.json", "w") as f:
json.dump(info, f, indent=2)
self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1235:33: F541 [*] f-string without any placeholders
     |
1233 |                 json.dump(info, f, indent=2)
1234 |
1235 |             self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
     |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
1236 |             self.progress.print(f"  Data split: {len(dataset_dict['data']):,} triples", "green")
1237 |             self.progress.print(f"  Processing time: {elapsed:.1f} seconds", "green")
     |
     = help: Remove extraneous `f` prefix
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1235:33: F541 [*] f-string without any placeholders | 1233 | json.dump(info, f, indent=2) 1234 | 1235 | self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green") | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541 1236 | self.progress.print(f" Data split: {len(dataset_dict['data']):,} triples", "green") 1237 | self.progress.print(f" Processing time: {elapsed:.1f} seconds", "green") | = help: Remove extraneous `f` prefix ```
brent.edwards marked this conversation as resolved
@ -0,0 +1255,4 @@
def _stream_turtle_file(self, config: ConversionConfig, re) -> Iterator[list[dict[str, Any]]]:
"""Stream Turtle file with prefix handling."""
# Detect compression
Member


Why are you using this detection when you have the `FileHandler` class?
Author
Member

Fixed! _stream_turtle_file now uses FileHandler.open_file() instead of manual compression detection

brent.edwards marked this conversation as resolved
@ -0,0 +1260,4 @@
is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".ttl.bz2")
if is_bz2:
file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1263:24: SIM115 Use a context manager for opening files
     |
1262 |         if is_bz2:
1263 |             file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
     |                        ^^^^^^^^ SIM115
1264 |         elif is_gzipped:
1265 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
     |
brent.edwards marked this conversation as resolved
@ -0,0 +1262,4 @@
if is_bz2:
file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
elif is_gzipped:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1265:24: SIM115 Use a context manager for opening files
     |
1263 |             file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
1264 |         elif is_gzipped:
1265 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
     |                        ^^^^^^^^^ SIM115
1266 |         else:
1267 |             file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
     |
brent.edwards marked this conversation as resolved
@ -0,0 +1264,4 @@
elif is_gzipped:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
else:
file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1267:24: UP015 [*] Unnecessary mode argument
     |
1265 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
1266 |         else:
1267 |             file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UP015
1268 |
1269 |         try:
     = help: Remove mode argument

scripts/convert_rdf_to_hf_dataset_unified.py:1267:24: SIM115 Use a context manager for opening files
     |
1265 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
1266 |         else:
1267 |             file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
     |                        ^^^^ SIM115
1268 |
1269 |         try:
     |
brent.edwards marked this conversation as resolved
@ -0,0 +1409,4 @@
if config.clean_cache:
with tempfile.TemporaryDirectory() as temp_cache_dir:
dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=temp_cache_dir)
dataset_dict = DatasetDict({"data": dataset})
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1412:36 - error: No overloads for "__init__" match the provided arguments (reportCallIssue)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1412:49 - error: Argument of type "dict[str, dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict]" cannot be assigned to parameter "iterable" of type "Iterable[tuple[str | NamedSplit, Dataset]]" in function "__init__"
    "Literal['data']" is not assignable to "tuple[str | NamedSplit, Dataset]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1415,4 @@
dataset_dict.save_to_disk(str(config.output_path))
else:
dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
dataset_dict = DatasetDict({"data": dataset})
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1418:32 - error: No overloads for "__init__" match the provided arguments (reportCallIssue)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1418:45 - error: Argument of type "dict[str, dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict]" cannot be assigned to parameter "iterable" of type "Iterable[tuple[str | NamedSplit, Dataset]]" in function "__init__"
    "Literal['data']" is not assignable to "tuple[str | NamedSplit, Dataset]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1436,4 @@
with open(config.output_path / "dataset_info.json", "w") as f:
json.dump(info, f, indent=2)
self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1439:33: F541 [*] f-string without any placeholders
     |
1437 |                 json.dump(info, f, indent=2)
1438 |
1439 |             self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
     |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
1440 |             self.progress.print(f"  Data split: {len(dataset_dict['data']):,} triples", "green")
     |
     = help: Remove extraneous `f` prefix
brent.edwards marked this conversation as resolved
@ -0,0 +1493,4 @@
with self.file_handler.open_file(config.input_path) as f:
for line in f:
if line.startswith("http://") or line.startswith("https://"):
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1496:36 - error: Argument of type "Literal['http://']" cannot be assigned to parameter "prefix" of type "ReadableBuffer | tuple[ReadableBuffer, ...]" in function "startswith"
    Type "Literal['http://']" is not assignable to type "ReadableBuffer | tuple[ReadableBuffer, ...]"
      "Literal['http://']" is incompatible with protocol "Buffer"
        "__buffer__" is not present
      "Literal['http://']" is not assignable to "tuple[ReadableBuffer, ...]" (reportArgumentType)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1496:66 - error: Argument of type "Literal['https://']" cannot be assigned to parameter "prefix" of type "ReadableBuffer | tuple[ReadableBuffer, ...]" in function "startswith"
    Type "Literal['https://']" is not assignable to type "ReadableBuffer | tuple[ReadableBuffer, ...]"
      "Literal['https://']" is incompatible with protocol "Buffer"
        "__buffer__" is not present
      "Literal['https://']" is not assignable to "tuple[ReadableBuffer, ...]" (reportArgumentType)
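My guess at the root cause: `FileHandler.open_file` is typed (or falls through) to a binary handle, so pyright sees `line` as bytes. A tiny repro of the mismatch, using a made-up file name:

```
# A binary handle yields bytes, and bytes.startswith() only accepts bytes
# (or a tuple of bytes) prefixes:
with open("dump.nt", "rb") as f:
    for raw in f:
        raw.startswith(b"http://")    # fine for bytes
        # raw.startswith("http://")   # the reportArgumentType error above
```

If `open_file` always opens in text mode ("rt"), annotating its return so callers get a str-yielding handle (e.g. `IO[str]`) should make these errors disappear at every call site.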
brent.edwards marked this conversation as resolved
@ -0,0 +1542,4 @@
with self.file_handler.open_file(config.input_path) as f:
for line in f:
line = line.strip()
if not line or line.startswith("#"):
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1545:48 - error: Argument of type "Literal['#']" cannot be assigned to parameter "prefix" of type "ReadableBuffer | tuple[ReadableBuffer, ...]" in function "startswith"
    Type "Literal['#']" is not assignable to type "ReadableBuffer | tuple[ReadableBuffer, ...]"
      "Literal['#']" is incompatible with protocol "Buffer"
        "__buffer__" is not present
      "Literal['#']" is not assignable to "tuple[ReadableBuffer, ...]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1642,4 @@
with tempfile.TemporaryDirectory() as temp_cache_dir:
dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=temp_cache_dir)
self.progress.emit_progress(90)
dataset_dict = DatasetDict({"data": dataset})
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1645:36 - error: No overloads for "__init__" match the provided arguments (reportCallIssue)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1645:49 - error: Argument of type "dict[str, dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict]" cannot be assigned to parameter "iterable" of type "Iterable[tuple[str | NamedSplit, Dataset]]" in function "__init__"
    "Literal['data']" is not assignable to "tuple[str | NamedSplit, Dataset]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1650,4 @@
else:
dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
self.progress.emit_progress(90)
dataset_dict = DatasetDict({"data": dataset})
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1653:32 - error: No overloads for "__init__" match the provided arguments (reportCallIssue)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1653:45 - error: Argument of type "dict[str, dict[str, IterableDataset] | IterableDataset | Dataset | DatasetDict]" cannot be assigned to parameter "iterable" of type "Iterable[tuple[str | NamedSplit, Dataset]]" in function "__init__"
    "Literal['data']" is not assignable to "tuple[str | NamedSplit, Dataset]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1676,4 @@
with open(config.output_path / "dataset_info.json", "w") as f:
json.dump(info, f, indent=2)
self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1679:33: F541 [*] f-string without any placeholders
     |
1677 |                 json.dump(info, f, indent=2)
1678 |
1679 |             self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")
     |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
1680 |             self.progress.print(f"  Data split: {len(dataset_dict['data']):,} triples", "green")
1681 |             self.progress.print(f"  Processing time: {elapsed:.1f} seconds", "green")
     |
     = help: Remove extraneous `f` prefix
brent.edwards marked this conversation as resolved
@ -0,0 +1749,4 @@
current_doc = []
for line in f:
if line.startswith("http://") or line.startswith("https://"):
Member

pyright reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1752:36 - error: Argument of type "Literal['http://']" cannot be assigned to parameter "prefix" of type "ReadableBuffer | tuple[ReadableBuffer, ...]" in function "startswith"
    Type "Literal['http://']" is not assignable to type "ReadableBuffer | tuple[ReadableBuffer, ...]"
      "Literal['http://']" is incompatible with protocol "Buffer"
        "__buffer__" is not present
      "Literal['http://']" is not assignable to "tuple[ReadableBuffer, ...]" (reportArgumentType)
  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/convert_rdf_to_hf_dataset_unified.py:1752:66 - error: Argument of type "Literal['https://']" cannot be assigned to parameter "prefix" of type "ReadableBuffer | tuple[ReadableBuffer, ...]" in function "startswith"
    Type "Literal['https://']" is not assignable to type "ReadableBuffer | tuple[ReadableBuffer, ...]"
      "Literal['https://']" is incompatible with protocol "Buffer"
        "__buffer__" is not present
      "Literal['https://']" is not assignable to "tuple[ReadableBuffer, ...]" (reportArgumentType)
brent.edwards marked this conversation as resolved
@ -0,0 +1794,4 @@
def _stream_turtle_parallel(self, config: ConversionConfig) -> Iterator[list[dict[str, Any]]]:
"""Stream Turtle file with chunked parsing."""
# Detect compression
Member

Why are you detecting compression manually here when you have the `FileHandler` class?

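To make the suggestion concrete, here is a self-contained sketch of the kind of helper I mean: one text-mode opener for plain/.gz/.bz2 files, used under `with`. If `FileHandler.open_file` already does this, just call it and drop the local branching; that would also clear the SIM115/UP015 findings below. (`open_rdf_file` is my placeholder name, not something in the script.)

```
import bz2
import gzip
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import IO


@contextmanager
def open_rdf_file(path: Path) -> Iterator[IO[str]]:
    """Yield a text-mode handle for plain, gzip-, or bz2-compressed files."""
    if path.suffix == ".bz2":
        f = bz2.open(path, "rt", encoding="utf-8", errors="ignore")
    elif path.suffix == ".gz":
        f = gzip.open(path, "rt", encoding="utf-8", errors="ignore")
    else:
        f = open(path, encoding="utf-8", errors="ignore")
    try:
        yield f
    finally:
        f.close()


# Usage inside a strategy:
# with open_rdf_file(config.input_path) as f:
#     for line in f:
#         ...
```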
brent.edwards marked this conversation as resolved
@ -0,0 +1799,4 @@
is_bz2 = config.input_path.suffix == ".bz2"
if is_bz2:
file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1802:24: SIM115 Use a context manager for opening files
     |
1801 |         if is_bz2:
1802 |             file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
     |                        ^^^^^^^^ SIM115
1803 |         elif is_gzipped:
1804 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
     |
brent.edwards marked this conversation as resolved
@ -0,0 +1801,4 @@
if is_bz2:
file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
elif is_gzipped:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1804:24: SIM115 Use a context manager for opening files
     |
1802 |             file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
1803 |         elif is_gzipped:
1804 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
     |                        ^^^^^^^^^ SIM115
1805 |         else:
1806 |             file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
     |
brent.edwards marked this conversation as resolved
@ -0,0 +1803,4 @@
elif is_gzipped:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
else:
file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1806:24: UP015 [*] Unnecessary mode argument
     |
1804 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
1805 |         else:
1806 |             file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UP015
1807 |
1808 |         try:
     |
     = help: Remove mode argument

scripts/convert_rdf_to_hf_dataset_unified.py:1806:24: SIM115 Use a context manager for opening files
     |
1804 |             file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
1805 |         else:
1806 |             file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")
     |                        ^^^^ SIM115
1807 |
1808 |         try:
     |
brent.edwards marked this conversation as resolved
@ -0,0 +1939,4 @@
"""
# Strategy name to class mapping
STRATEGIES = {
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1942:18: RUF012 Mutable class attributes should be annotated with `typing.ClassVar`
     |
1941 |       # Strategy name to class mapping
1942 |       STRATEGIES = {
     |  __________________^
1943 | |         "standard": StandardStrategy,
1944 | |         "streaming": StreamingStrategy,
1945 | |         "streaming-turtle": StreamingTurtleStrategy,
1946 | |         "streaming-simple": SimpleStreamingStrategy,
1947 | |         "streaming-parallel": ParallelStreamingStrategy,
1948 | |     }
     | |_____^ RUF012
1949 |
1950 |       def __init__(self, progress_tracker: ProgressTracker):
     |
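The fix is just an annotation. A sketch with stub classes standing in for the real strategies and a placeholder class name; in the script, only the `ClassVar[...]` annotation on `STRATEGIES` needs to be added:

```
from typing import ClassVar


class StandardStrategy: ...    # stand-ins for the real strategy classes
class StreamingStrategy: ...


class StrategySelector:        # placeholder name for the class holding STRATEGIES
    # ClassVar marks this dict as shared class-level state rather than a
    # mutable per-instance default, which is what RUF012 is warning about.
    STRATEGIES: ClassVar[dict[str, type]] = {
        "standard": StandardStrategy,
        "streaming": StreamingStrategy,
    }
```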
brent.edwards marked this conversation as resolved
@ -0,0 +1971,4 @@
# Selection logic
if is_geonames:
# GeoNames always benefits from parallel processing
self.progress.print(f"Auto-selected: streaming-parallel (GeoNames detected)", "cyan")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1974:33: F541 [*] f-string without any placeholders
     |
1972 |         if is_geonames:
1973 |             # GeoNames always benefits from parallel processing
1974 |             self.progress.print(f"Auto-selected: streaming-parallel (GeoNames detected)", "cyan")
     |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
1975 |             return self._strategies["streaming-parallel"]
     |
     = help: Remove extraneous `f` prefix
brent.edwards marked this conversation as resolved
@ -0,0 +1976,4 @@
if rdf_format in ("turtle", "ttl") and file_size_mb > 100:
# Large Turtle files need special handling
self.progress.print(f"Auto-selected: streaming-turtle (large Turtle file)", "cyan")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1979:33: F541 [*] f-string without any placeholders
     |
1977 |         if rdf_format in ("turtle", "ttl") and file_size_mb > 100:
1978 |             # Large Turtle files need special handling
1979 |             self.progress.print(f"Auto-selected: streaming-turtle (large Turtle file)", "cyan")
     |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
1980 |             return self._strategies["streaming-turtle"]
     |
     = help: Remove extraneous `f` prefix
brent.edwards marked this conversation as resolved
@ -0,0 +1981,4 @@
if file_size_mb < 100:
# Small files: use standard in-memory
self.progress.print(f"Auto-selected: standard (file < 100MB)", "cyan")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1984:33: F541 [*] f-string without any placeholders
     |
1982 |         if file_size_mb < 100:
1983 |             # Small files: use standard in-memory
1984 |             self.progress.print(f"Auto-selected: standard (file < 100MB)", "cyan")
     |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
1985 |             return self._strategies["standard"]
     |
     = help: Remove extraneous `f` prefix
brent.edwards marked this conversation as resolved
@ -0,0 +2130,4 @@
# Validate input file
if not args.input.exists():
print(f"Error: Input file not found: {args.input}")
Member

I would move line 2137 above here, so that it could use progress.print.

(This would be very useful when output needs to go somewhere other than the screen: for example, when the script is responding to a web request, or running on a headless server and writing its output to a file.)

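Something like this is what I'm picturing; the `ProgressTracker()` construction and the exit path are guesses on my part:

```
import sys

# Build the tracker before any validation so even argument errors go through
# the same output channel as the rest of the run.
progress = ProgressTracker()  # constructor arguments are a guess

if not args.input.exists():
    progress.print(f"Error: Input file not found: {args.input}", "bold red")
    sys.exit(1)
```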
brent.edwards marked this conversation as resolved
@ -984,3 +980,1 @@
Path(__file__).parent
/ "convert_rdf_to_hf_dataset_streaming_parallel.py"
)
# Use unified converter for all standard RDF datasets
Member

Since you're changing how the software is run, could you also change the descriptions in lines 802-820?

brent.edwards marked this conversation as resolved
brent.edwards left a comment
Member

GREAT work. Great simplification of multiple files into one. Thank you for your hard work!

This pull request has changes conflicting with the target branch.
  • scripts/convert_rdf_to_hf_dataset_streaming_parallel.py
  • scripts/upload_all_datasets.2.py