WIP: feat: Unify 5 RDF converter scripts using Strategy Pattern #22
Unifies five separate RDF converter scripts into a single maintainable codebase (convert_rdf_to_hf_dataset_unified.py) using the Strategy Pattern to support standard, streaming, and parallel processing.
Implements an auto-selection mode that dynamically chooses the optimal conversion strategy based on file size and format, alongside improved Rich-based progress tracking.
Refactors the main upload_all_datasets.2.py script to simplify the pipeline by removing complex branching logic and delegating all conversions to the new unified tool.
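For orientation, here is a minimal sketch of the shape the Strategy Pattern takes in this PR. The names ConversionConfig, ConversionResult, and STRATEGIES come from the diff hunks quoted in the review below; the bodies are illustrative only, not the actual implementation:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ConversionConfig:
    input_path: Path
    output_path: Path
    rdf_format: str = "turtle"
    chunk_size: int = 10000


@dataclass
class ConversionResult:
    success: bool
    total_triples: int = 0
    strategy_used: Optional[str] = None


class ConversionStrategy(ABC):
    """Interface every conversion strategy implements."""

    @abstractmethod
    def convert(self, config: ConversionConfig) -> ConversionResult: ...


class StandardStrategy(ConversionStrategy):
    def convert(self, config: ConversionConfig) -> ConversionResult:
        # Parse the whole file in memory (small inputs).
        return ConversionResult(success=True, strategy_used="standard")


class StreamingTurtleStrategy(ConversionStrategy):
    def convert(self, config: ConversionConfig) -> ConversionResult:
        # Stream the file in chunks (large Turtle inputs).
        return ConversionResult(success=True, strategy_used="streaming-turtle")


# Strategy name to class mapping, as in the diff's STRATEGIES dict.
STRATEGIES: dict[str, type] = {
    "standard": StandardStrategy,
    "streaming-turtle": StreamingTurtleStrategy,
}


def select_strategy(config: ConversionConfig) -> ConversionStrategy:
    """Auto-selection in the spirit of the PR: driven by size and format."""
    file_size_mb = config.input_path.stat().st_size / (1024 * 1024)
    if config.rdf_format in ("turtle", "ttl") and file_size_mb > 100:
        return STRATEGIES["streaming-turtle"]()
    return STRATEGIES["standard"]()
```

The calling script then only picks a strategy name (or lets auto-selection pick one) and calls convert(); the per-format branching stays inside the strategies.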
Are the files convert_rdf_to_hf_dataset.*.py other than convert_rdf_to_hf_dataset_unified.py supposed to go away? If so, then please ignore my comments inside them and delete those files instead of fixing them.

I have read about 800 lines of convert_rdf_to_hf_dataset_unified.py, and I know that I'll have more comments. But I need to study for the evening, so I'll pass you 72 notes.
@ -49,6 +49,8 @@
> import json
> import logging
> import multiprocessing as mp
> import shutil
> import shutil

Line 51 is import shutil. You don't need to duplicate it.

@ -577,0 +584,4 @@
> # Save final dataset
> console.print("[yellow]Saving final dataset...[/yellow]")
> dataset_dict.save_to_disk(str(output_path))

Lines 563-587 are almost identical to lines 589-613. This is really bad form; a future programmer will make a change to one and forget the other. Could you please rewrite these lines to remove duplicate code? (If you have any problems, I am glad to help.)

I like the use of line 561 to refer to the directory being written. You can use the same idea when clean_cache is off, with something similar to the first variant sketched below. That code would require clean-up at the end. If you don't want to do clean-up, you should be able to do something like the second variant, but I find that harder to understand. Your choice.
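A minimal sketch of what the two variants could look like (assuming the temp_chunks_dir, output_path, clean_cache, and DatasetDict names used elsewhere in the diff; these are not the reviewer's original snippets):

```python
import contextlib
import shutil
import tempfile
from pathlib import Path

from datasets import Dataset, DatasetDict


def save_with_explicit_cleanup(temp_chunks_dir: Path, output_path: Path) -> None:
    """Variant 1: name the cache directory yourself and clean it up at the end."""
    cache_dir = tempfile.mkdtemp(prefix="rdf_convert_cache_")
    try:
        dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=cache_dir)
        DatasetDict({"data": dataset}).save_to_disk(str(output_path))
    finally:
        shutil.rmtree(cache_dir, ignore_errors=True)


def save_without_cleanup(temp_chunks_dir: Path, output_path: Path, clean_cache: bool) -> None:
    """Variant 2: pick the cache context up front; nothing to clean up afterwards."""
    cache_ctx = tempfile.TemporaryDirectory() if clean_cache else contextlib.nullcontext(None)
    with cache_ctx as cache_dir:
        dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=cache_dir)
        DatasetDict({"data": dataset}).save_to_disk(str(output_path))
```

Either way, the near-duplicate save blocks collapse into a single call to one helper.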
@ -634,7 +668,9 @@ def main():
> parser.add_argument("--description", help="Dataset description")
> parser.add_argument("--citation", help="Dataset citation")
> parser.add_argument("--homepage", help="Dataset homepage URL")
> parser.add_argument("--homepage", help="Dataset homepage URL")

Lines 670 and 671 are duplicates.
@ -704,0 +727,4 @@
> console.print(f"[red]Error saving dataset: {e}[/red]")
> console.print(f"[yellow]Output path: {output_path}[/yellow]")
> console.print(f"[yellow]Check disk space and permissions[/yellow]")
> raise e

Lines 707-730 and lines 732-755 are almost duplicates. I don't know if you copy/pasted this or if an LLM generated this, but duplicate code is usually bad, especially when it's as easy as this to fix. Here's an explainer about code duplication: https://axify.io/blog/code-duplication
@ -407,0 +413,4 @@
> # Save final dataset
> console.print("[yellow]Saving final dataset...[/yellow]")
> dataset_dict.save_to_disk(str(output_path))

Lines 401-416 and lines 418-433 are fairly similar, except for where and how in the code dataset_dict is initialized. They should be easy to merge.

@ -308,3 +311,2 @@
> # Read all Parquet files as a single dataset
> dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
> print(f"\nPROGRESS: 90", flush=True)
> # Read all Parquet files as a single dataset

Duplicate line.

@ -322,0 +330,4 @@
> # Save final dataset
> console.print("[yellow]Saving final dataset...[/yellow]")
> print(f"\nPROGRESS: 95", flush=True)
> dataset_dict.save_to_disk(str(output_path))

Lines 315-333 and lines 335-352 are almost duplicates.
@ -367,7 +394,9 @@ def main():
> )
> parser.add_argument("--description", type=str, help="Dataset description")
> parser.add_argument("--homepage", type=str, help="Dataset homepage")
> parser.add_argument("--homepage", type=str, help="Dataset homepage")

You don't need another --homepage argument.

@ -0,0 +1,2208 @@
> #!/usr/bin/env python3

When I run ruff check, it reports that the following lines are too long. I don't have the patience to copy and paste for each of these lines: 275, 413, 451, 460, 466, 485, 495, 518, 563, 594, 631, 647, 668, 676, 689, 716, 718, 720, 721, 729, 845, 855, 858, 865, 867, 875, 878, 884, 895, 914, 930, 944, 959, 971, 989, 1021, 1042, 1088, 1161, 1163, 1174, 1176, 1188, 1190, 1192, 1201, 1204, 1211, 1229, 1235, 1236, 1256, 1260, 1263, 1265, 1303, 1312, 1330, 1382, 1383, 1393, 1395, 1402, 1404, 1411, 1414, 1420, 1439, 1440, 1459, 1463, 1472, 1473, 1488, 1504, 1528, 1538, 1553, 1601, 1602, 1603, 1613, 1615, 1617, 1627, 1632, 1634, 1643, 1647, 1655, 1673, 1679, 1680, 1682, 1701, 1709, 1713, 1717, 1718, 1733, 1735, 1741, 1771, 1776, 1795, 1802, 1804, 1820, 1830, 1838, 1974, 1979, 1989, 1993, 2011, and 2055.
@ -0,0 +34,4 @@"object": "value","object_type": "uri|literal|blank_node","object_datatype": "xsd:string or None","object_language": "en or None"We DEFINITELY have data that's not in English.
Corrected the docstring
@ -0,0 +56,4 @@
> import time
> from abc import ABC, abstractmethod
> from contextlib import contextmanager
> from dataclasses import dataclass, field

field is never used. According to ruff check:

@ -0,0 +60,4 @@
> from io import BytesIO
> from multiprocessing import Manager, Pool
> from pathlib import Path
> from typing import Any, Iterator, Optional

According to ruff check:

@ -0,0 +73,4 @@
> SpinnerColumn,
> TextColumn,
> TimeElapsedColumn,
> TimeRemainingColumn,

TimeRemainingColumn is never used. According to ruff check:

@ -0,0 +91,4 @@
> output_path: Path
> rdf_format: str = "turtle"
> chunk_size: int = 10000
> num_workers: Optional[int] = None

According to ruff check:

(I find Optional easier to understand myself... Crap.)

@ -0,0 +92,4 @@
> rdf_format: str = "turtle"
> chunk_size: int = 10000
> num_workers: Optional[int] = None
> metadata: Optional[dict[str, Any]] = None

According to ruff check:

@ -0,0 +105,4 @@
> success: bool
> total_triples: int = 0
> processing_time_seconds: float = 0.0
> output_path: Optional[Path] = None

According to ruff check:

@ -0,0 +106,4 @@
> total_triples: int = 0
> processing_time_seconds: float = 0.0
> output_path: Optional[Path] = None
> error_message: Optional[str] = None

According to ruff check:

@ -0,0 +107,4 @@
> processing_time_seconds: float = 0.0
> output_path: Optional[Path] = None
> error_message: Optional[str] = None
> strategy_used: Optional[str] = None

According to ruff check:
ruff check:@ -0,0 +123,4 @@"""@staticmethoddef extract_triple(subject, predicate, obj) -> dict[str, Any]:It looks like you can use
at this time. But make sure that the
TripleExtractorhandles objects that are themselves triples!A few clarifications on the current implementation:
TripleExtractor class removed: We refactored extract_triple to be a module-level function (line 117) instead of a class method. This was necessary for multiprocessing pickling support, class methods can have serialization issues when passed to worker processes
Return type dict[str, Any] is correct: We can't use dict[str, str] because object_datatype and object_language can be None.
Triple terms (RDF-star/RDF 1.2): Currently, the code handles this by falling through to blank_node
@ -0,0 +141,4 @@object_type = "literal"object_datatype = str(obj.datatype) if obj.datatype else Noneobject_language = obj.language if obj.language else Noneelif isinstance(obj, URIRef):What happens if the object is an IRI but not a URI? See https://www.w3.org/TR/rdf12-concepts/#section-IRIs
The RDFLib documentation and implementation confirm that the URIRef class is designed to handle both URIs (Uniform Resource Identifiers) and IRIs (Internationalized Resource Identifiers), even though its name explicitly references only "URI"
@ -0,0 +145,4 @@object_type = "uri"object_datatype = Noneobject_language = Noneelse:What happens if the object is a triple term? (See https://www.w3.org/TR/rdf12-n-triples/#triple-terms )
Triple terms (RDF-star/RDF 1.2) currently fall through to the else branch and are classified as blank_node.
Do we have any RDF-star datasets that need explicit handling? If so, I can add explicit triple term support as a follow-up.
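To make the discussion concrete, a rough sketch of the module-level extract_triple described above (illustrative; the function in the PR may differ in detail):

```python
from typing import Any, Optional

from rdflib import BNode, Literal, URIRef


def extract_triple(subject, predicate, obj) -> dict[str, Any]:
    """Flatten one RDF triple into the dataset's row schema.

    Module-level (not a method) so it can be pickled and handed to
    multiprocessing workers.
    """
    object_datatype: Optional[str] = None
    object_language: Optional[str] = None

    if isinstance(obj, Literal):
        object_type = "literal"
        object_datatype = str(obj.datatype) if obj.datatype else None
        object_language = obj.language if obj.language else None
    elif isinstance(obj, URIRef):
        # RDFLib's URIRef also carries IRIs, despite the name.
        object_type = "uri"
    else:
        # BNode, and (for now) RDF-star triple terms fall through here too.
        object_type = "blank_node"

    return {
        "subject": str(subject),
        "predicate": str(predicate),
        "object": str(obj),
        "object_type": object_type,
        "object_datatype": object_datatype,
        "object_language": object_language,
    }
```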
@ -0,0 +214,4 @@"""Handle file I/O operations including compression."""@staticmethoddef open_file(file_path: Path, mode: str = "r"):I'm surprised that
pyrightdidn't point out that this method doesn't list what it returns...@ -0,0 +261,4 @@return Falsetry:with open(file_path, "r", encoding="utf-8") as f:According to
ruff check:@ -0,0 +264,4 @@with open(file_path, "r", encoding="utf-8") as f:first_line = f.readline().strip()return first_line.startswith("http://") or first_line.startswith("https://")except Exception:I think that all file errors descend from
OSError, so you should be able to use that instead ofException.@ -0,0 +328,4 @@yield progress, task@contextmanagerdef progress_bar(self, description: str, total: Optional[int] = None,According to
ruff check:@ -0,0 +329,4 @@@contextmanagerdef progress_bar(self, description: str, total: Optional[int] = None,show_count: bool = False, extra_fields: Optional[dict] = None):According to
ruff check:@ -0,0 +352,4 @@task_kwargs = {"total": total}if extra_fields:task_kwargs.update(extra_fields)task = progress.add_task(f"[yellow]{description}[/yellow]", **task_kwargs)pyrightgives the following error:@ -0,0 +402,4 @@"""passdef estimate_memory_usage_mb(self, file_size_mb: float) -> float:estimate_memory_usage_mbis never used in this code.removed the dead code
@ -0,0 +406,4 @@"""Estimate memory requirements in MB."""return file_size_mb * 3.0 # Default: 3x file sizedef supports_format(self, rdf_format: str) -> bool:supports_formatis never used in the code. Is the check missing?removed the dead code
@ -0,0 +414,4 @@"""Save dataset to disk with optional cache cleaning."""config.output_path.mkdir(parents=True, exist_ok=True)dataset_dict.save_to_disk(str(config.output_path))self.progress.print(f"Dataset saved to {config.output_path}", "green")The description says "with optional cache cleaning". Is the cache cleaning missing?
@ -0,0 +448,4 @@def supports_format(self, rdf_format: str) -> bool:"""Standard strategy supports all formats including TSV."""return rdf_format in ["turtle", "nt", "ntriples", "xml", "n3", "trig", "nquads", "tsv"]Although
supports_formatis never used, if it were used, should you includegeonameshere?@ -0,0 +517,4 @@with self.progress.progress_bar("Parsing TSV triples...", show_count=True) as (progress, task):# Count lineswith open(file_path, "r", encoding="utf-8") as f:According to
ruff check:@ -0,0 +521,4 @@total_lines = sum(1 for _ in f)progress.update(task, total=total_lines)with open(file_path, "r", encoding="utf-8") as f:According to
ruff check:@ -0,0 +532,4 @@continuesubject, predicate, obj = partstriples.append({You're duplicating code. You should use
TripleExtractor::extract_triplehere.The TSV parsing intentionally doesn't use extract_triple() because they handle different data types. There's no RDFLib parsing involved, we can't create Literal or URIRef objects from these strings because they're not valid IRIs or typed literals.
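Even if the TSV path keeps its own parsing, both paths could still share the record construction so the output schema is defined in exactly one place. A sketch (make_record is a hypothetical helper; the field names follow the docstring hunk quoted earlier):

```python
from typing import Any, Optional


def make_record(
    subject: str,
    predicate: str,
    obj: str,
    object_type: str,
    object_datatype: Optional[str] = None,
    object_language: Optional[str] = None,
) -> dict[str, Any]:
    """Single definition of the per-triple row schema, shared by the
    RDFLib and TSV code paths."""
    return {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "object_type": object_type,
        "object_datatype": object_datatype,
        "object_language": object_language,
    }
```

extract_triple() would call make_record() after classifying the RDFLib term, and the TSV loop would call it directly with whatever object_type is appropriate for TSV rows.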
@ -0,0 +567,4 @@
> def update_progress():
> while not stop_updates.is_set():
> elapsed = time.time() - parse_start
> pct = min(95, (elapsed / estimated_parse_time) * 100)

Might this line be better as the following? The way the code is currently written, the last 5% of updates will show pct holding at 95% without change.

pct = min(95, (elapsed / estimated_parse_time) * 95). This ensures progress smoothly increases from 0% → 95% over the full estimated time, then jumps to 100% when parsing actually completes.
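For anyone who wants to sanity-check the scaling, a self-contained sketch of the updater thread with the * 95 factor (illustrative; the real code reports through Rich rather than print, and derives the estimate from file size):

```python
import threading
import time

estimated_parse_time = 10.0  # seconds; stands in for file_size_mb * <factor>
parse_start = time.time()
stop_updates = threading.Event()


def update_progress() -> None:
    # Scaling by 95 makes the bar climb steadily to 95% over the whole
    # estimate, instead of reaching 95% early and then stalling there.
    while not stop_updates.is_set():
        elapsed = time.time() - parse_start
        pct = min(95, (elapsed / estimated_parse_time) * 95)
        print(f"PROGRESS: {pct:.0f}", flush=True)
        time.sleep(0.5)


thread = threading.Thread(target=update_progress, daemon=True)
thread.start()
time.sleep(3)  # stands in for graph.parse(...)
stop_updates.set()
thread.join()
print("PROGRESS: 100", flush=True)
```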
@ -0,0 +589,4 @@
> file_content = f.read()
> self.progress.print("Parsing RDF content...", "dim")
> estimated_parse_time = file_size_mb * 2.5

I doubt that size. Look through my comments in CleverDatasets.

@ -0,0 +599,4 @@
> last_progress = 0
> while not stop_updates.is_set():
> elapsed = time.time() - parse_start
> pct = min(95, (elapsed / estimated_parse_time) * 100)

Same comment about the last 5 percent of the estimated time...

@ -0,0 +613,4 @@
> thread.start()
> try:
> graph.parse(BytesIO(file_content), format=config.rdf_format)

pyright gives the following error:

@ -0,0 +630,4 @@
> with self.progress.progress_bar("Converting triples...",
> total=len(graph_list), show_count=True) as (progress, task):
> for idx, (s, p, o) in enumerate(graph_list):
> triples.append(self.triple_extractor.extract_triple(s, p, o))

This looks like it's replicating the work of TripleExtractor::extract_triples_from_graph (except that extract_triples_from_graph doesn't have a progress bar). Should you unify that code?

These are intentionally different implementations serving different purposes:
- extract_triples_from_graph(graph): quick extraction without progress.
- Loop with progress bar: large files where the user needs visual feedback.

The key difference is progress tracking. Python's list comprehension doesn't allow mid-iteration callbacks; to show a progress bar, we need an explicit loop.
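If unifying them ever becomes worthwhile, one option is to pass an optional per-triple callback so both call sites share the same loop. A sketch (extract_all and on_progress are hypothetical names; progress.advance is Rich's standard task-advance call):

```python
from typing import Any, Callable, Iterable, Optional

Triple = tuple[Any, Any, Any]


def extract_all(
    triples_in: Iterable[Triple],
    extract_one: Callable[[Any, Any, Any], dict[str, Any]],
    on_progress: Optional[Callable[[int], None]] = None,
) -> list[dict[str, Any]]:
    """Run one extraction loop, with or without progress reporting."""
    rows: list[dict[str, Any]] = []
    for idx, (s, p, o) in enumerate(triples_in):
        rows.append(extract_one(s, p, o))
        if on_progress is not None:
            on_progress(idx)
    return rows


# Quick path:    extract_all(graph, extract_triple)
# Progress path: extract_all(graph, extract_triple,
#                            on_progress=lambda i: progress.advance(task))
```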
@ -0,0 +666,4 @@
> all_triples = []
> with self.progress.progress_bar("Processing GeoNames chunks...", total=100) as (progress, task):
> with Pool(processes=num_workers) as pool:

According to ruff check:

@ -0,0 +879,4 @@
> dataset_dict.save_to_disk(str(config.output_path))
> else:
> dataset = Dataset.from_parquet(str(data_dir / "*.parquet"))
> dataset_dict = self._create_dataset_dict(dataset, config)

pyright gives the following error:

@ -0,0 +892,4 @@
> self._save_dataset_info(config, total_triples, dataset_dict, elapsed)
> self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")

According to ruff check:

@ -0,0 +948,4 @@
> with self.file_handler.open_file(config.input_path) as f:
> for line in f:
> line = line.strip()
> if not line or line.startswith("#"):

pyright reports the following:

@ -0,0 +973,4 @@
> graph = Graph()
> with self.file_handler.open_file(config.input_path) as f:
> graph.parse(f, format="turtle")

pyright reports:

@ -0,0 +1016,4 @@
> for worker_chunks in results:
> for chunk in worker_chunks:
> yield chunk

According to ruff check:

@ -0,0 +1199,4 @@
> if config.clean_cache:
> with tempfile.TemporaryDirectory() as temp_cache_dir:
> dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=temp_cache_dir)
> dataset_dict = DatasetDict({"data": dataset})

pyright reports:

@ -0,0 +1206,4 @@
> dataset_dict.save_to_disk(str(config.output_path))
> else:
> dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
> dataset_dict = DatasetDict({"data": dataset})

pyright reports:

@ -0,0 +1232,4 @@
> with open(config.output_path / "dataset_info.json", "w") as f:
> json.dump(info, f, indent=2)
> self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")

According to ruff check:

@ -0,0 +1255,4 @@
> def _stream_turtle_file(self, config: ConversionConfig, re) -> Iterator[list[dict[str, Any]]]:
> """Stream Turtle file with prefix handling."""
> # Detect compression

Why are you using this detection when you have the FileHandler class?

Fixed! _stream_turtle_file now uses FileHandler.open_file() instead of manual compression detection.
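For anyone following along, the shared helper could look roughly like this (a sketch based on the bz2/gzip branches quoted below; the actual FileHandler in the PR may differ):

```python
import bz2
import gzip
from pathlib import Path
from typing import IO


class FileHandler:
    """Handle file I/O operations including compression."""

    @staticmethod
    def open_file(file_path: Path, mode: str = "rt") -> IO[str]:
        # Dispatch on the suffix once, instead of repeating the detection
        # in every strategy.
        suffix = file_path.suffix
        if suffix == ".bz2":
            return bz2.open(file_path, mode, encoding="utf-8", errors="ignore")
        if suffix == ".gz":
            return gzip.open(file_path, mode, encoding="utf-8", errors="ignore")
        return open(file_path, mode, encoding="utf-8", errors="ignore")
```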
@ -0,0 +1260,4 @@
> is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".ttl.bz2")
> if is_bz2:
> file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")

According to ruff check:

@ -0,0 +1262,4 @@
> if is_bz2:
> file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
> elif is_gzipped:
> file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")

According to ruff check:

@ -0,0 +1264,4 @@
> elif is_gzipped:
> file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
> else:
> file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")

According to ruff check:

@ -0,0 +1409,4 @@
> if config.clean_cache:
> with tempfile.TemporaryDirectory() as temp_cache_dir:
> dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=temp_cache_dir)
> dataset_dict = DatasetDict({"data": dataset})

pyright reports:

@ -0,0 +1415,4 @@
> dataset_dict.save_to_disk(str(config.output_path))
> else:
> dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
> dataset_dict = DatasetDict({"data": dataset})

pyright reports:

@ -0,0 +1436,4 @@
> with open(config.output_path / "dataset_info.json", "w") as f:
> json.dump(info, f, indent=2)
> self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")

According to ruff check:

@ -0,0 +1493,4 @@
> with self.file_handler.open_file(config.input_path) as f:
> for line in f:
> if line.startswith("http://") or line.startswith("https://"):

pyright reports:

@ -0,0 +1542,4 @@
> with self.file_handler.open_file(config.input_path) as f:
> for line in f:
> line = line.strip()
> if not line or line.startswith("#"):

pyright reports:

@ -0,0 +1642,4 @@
> with tempfile.TemporaryDirectory() as temp_cache_dir:
> dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"), cache_dir=temp_cache_dir)
> self.progress.emit_progress(90)
> dataset_dict = DatasetDict({"data": dataset})

pyright reports:

@ -0,0 +1650,4 @@
> else:
> dataset = Dataset.from_parquet(str(temp_chunks_dir / "*.parquet"))
> self.progress.emit_progress(90)
> dataset_dict = DatasetDict({"data": dataset})

pyright reports:

@ -0,0 +1676,4 @@
> with open(config.output_path / "dataset_info.json", "w") as f:
> json.dump(info, f, indent=2)
> self.progress.print(f"✓ Successfully converted to HuggingFace dataset", "bold green")

According to ruff check:

@ -0,0 +1749,4 @@
> current_doc = []
> for line in f:
> if line.startswith("http://") or line.startswith("https://"):

pyright reports:

@ -0,0 +1794,4 @@
> def _stream_turtle_parallel(self, config: ConversionConfig) -> Iterator[list[dict[str, Any]]]:
> """Stream Turtle file with chunked parsing."""
> # Detect compression

Why are you detecting compression here when you have the FileHandler class?

@ -0,0 +1799,4 @@
> is_bz2 = config.input_path.suffix == ".bz2"
> if is_bz2:
> file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")

According to ruff check:

@ -0,0 +1801,4 @@
> if is_bz2:
> file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
> elif is_gzipped:
> file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")

According to ruff check:

@ -0,0 +1803,4 @@
> elif is_gzipped:
> file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore")
> else:
> file_obj = open(config.input_path, "r", encoding="utf-8", errors="ignore")

According to ruff check:

@ -0,0 +1939,4 @@
> """
> # Strategy name to class mapping
> STRATEGIES = {

According to ruff check:

@ -0,0 +1971,4 @@
> # Selection logic
> if is_geonames:
> # GeoNames always benefits from parallel processing
> self.progress.print(f"Auto-selected: streaming-parallel (GeoNames detected)", "cyan")

According to ruff check:

@ -0,0 +1976,4 @@
> if rdf_format in ("turtle", "ttl") and file_size_mb > 100:
> # Large Turtle files need special handling
> self.progress.print(f"Auto-selected: streaming-turtle (large Turtle file)", "cyan")

According to ruff check:

@ -0,0 +1981,4 @@
> if file_size_mb < 100:
> # Small files: use standard in-memory
> self.progress.print(f"Auto-selected: standard (file < 100MB)", "cyan")

According to ruff check:

@ -0,0 +2130,4 @@
> # Validate input file
> if not args.input.exists():
> print(f"Error: Input file not found: {args.input}")

I would move line 2137 above here, so that it could use progress.print. (This would be very useful when you want to print to something other than the screen; for example, when it's responding to a web request, or when it's running on a server without a screen and printing to a file.)

@ -984,3 +980,1 @@
> Path(__file__).parent / "convert_rdf_to_hf_dataset_streaming_parallel.py")
> # Use unified converter for all standard RDF datasets

Since you're changing how the software is run, could you also change the descriptions in lines 802-820?
GREAT work. Great simplification of multiple files into one. Thank you for your hard work!