Docker: add entrypoint script for upload single dataset #53

Open
aditya wants to merge 6 commits from streaming-upload-docker into xml-streaming-optimization
Member

The code contains an entry point script for uploading a single dataset.
aditya force-pushed streaming-upload-docker from d3d581dfe9 to 415ce6baaf 2026-01-22 10:16:19 +00:00 Compare
aditya force-pushed streaming-upload-docker from 2e68a12951 to 7069bbdd93 2026-01-27 14:43:01 +00:00 Compare
brent.edwards left a comment
Member

When I run `behave features`, I get the following summary:

Failing scenarios:
  features/rdf_converter/cli_integration.feature:76  CLI creates train/test split
  features/rdf_converter/error_handling.feature:49  Handle malformed N-Triples gracefully
  features/rdf_converter/error_handling.feature:60  Skip malformed lines in streaming mode
  features/rdf_converter/error_handling.feature:77  Recover from partially valid file
  features/rdf_converter/error_handling.feature:117  Streaming recovers from parsing errors in ntriples
  features/rdf_converter/file_handling.feature:102  Convert gzip-compressed N-Triples
  features/rdf_converter/file_handling.feature:108  Stream gzip-compressed N-Triples
  features/rdf_converter/file_handling.feature:124  Convert bz2-compressed N-Triples
  features/rdf_converter/file_handling.feature:130  Stream bz2-compressed N-Triples
  features/rdf_converter/file_handling.feature:140  Convert TSV file with standard strategy
  features/rdf_converter/file_handling.feature:147  TSV file creates literal object types
  features/rdf_converter/file_handling.feature:153  TSV file with empty lines and malformed rows
  features/rdf_converter/file_handling.feature:174  Convert GeoNames format file with standard strategy
  features/rdf_converter/file_handling.feature:180  Convert GeoNames format file with streaming strategy
  features/rdf_converter/file_handling.feature:190  Convert GeoNames format file with simple streaming
  features/rdf_converter/parallel_conversion.feature:126  Parallel correctly processes literals with language tags
  features/rdf_converter/parallel_conversion.feature:139  Parallel correctly processes URI objects
  features/rdf_converter/standard_conversion.feature:41  Standard conversion with train/test split
  features/rdf_converter/streaming_conversion.feature:58  Stream gzip-compressed N-Triples file

2 features passed, 6 failed, 0 skipped
91 scenarios passed, 19 failed, 0 skipped
525 steps passed, 19 failed, 18 skipped
Took 0min 4.669s

I can't approve until the tests pass.

@ -122,0 +127,4 @@
continue # Retry
# All retries failed - try to salvage by parsing line-by-line
logger.warning(f"Chunk {idx} failed after {max_retries} attempts, attempting line-by-line recovery: {last_error}")
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:130:89: E501 Line too long (118 > 88)
    |
129 |     # All retries failed - try to salvage by parsing line-by-line
130 |     logger.warning(f"Chunk {idx} failed after {max_retries} attempts, attempting line-by-line recovery: {last_error}")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
131 |
132 |     # Try to parse individual statements
    |
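One way to clear E501s like this one: Python concatenates adjacent f-string literals, so the message can be split across physical lines without changing the logged text. A minimal sketch with stand-in values (`idx`, `max_retries`, `last_error` here are dummies, not the PR's variables):

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

# Stand-in values for illustration only.
idx, max_retries, last_error = 3, 2, "bad triple at line 17"

# Adjacent f-string literals are concatenated, so each physical
# line stays under ruff's 88-character limit.
msg = (
    f"Chunk {idx} failed after {max_retries} attempts, "
    f"attempting line-by-line recovery: {last_error}"
)
logger.warning(msg)
```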
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -253,0 +292,4 @@
]
if triples:
yield triples
parsed_successfully = True
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:295:29: F841 Local variable `parsed_successfully` is assigned to but never used
    |
293 |                             if triples:
294 |                                 yield triples
295 |                             parsed_successfully = True
    |                             ^^^^^^^^^^^^^^^^^^^ F841
296 |                             break
297 |                         except Exception as e:
    |
    = help: Remove assignment to unused variable `parsed_successfully`
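The fix ruff suggests is simply to drop the assignment: in a retry loop, the `break` on success already carries that information. A minimal sketch of the pattern (the chunk data and the parsing stand-in are hypothetical, not the PR's code):

```python
def parse_chunks(chunks, max_retries=2):
    """Yield parsed chunks, retrying each up to max_retries times."""
    for chunk in chunks:
        for attempt in range(max_retries):
            try:
                parsed = chunk.strip().split()  # stand-in for real parsing
                if parsed:
                    yield parsed
                break  # success: no `parsed_successfully = True` flag needed
            except Exception:
                if attempt < max_retries - 1:
                    continue  # retry

print(list(parse_chunks(["a b", "  ", "c"])))  # [['a', 'b'], ['c']]
```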
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -253,0 +298,4 @@
if attempt < max_retries - 1:
continue # Retry
else:
logger.warning(f"Chunk parsing failed after {max_retries} attempts: {e}")
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:301:89: E501 Line too long (105 > 88)
    |
299 |                                 continue  # Retry
300 |                             else:
301 |                                 logger.warning(f"Chunk parsing failed after {max_retries} attempts: {e}")
    |                                                                                         ^^^^^^^^^^^^^^^^^ E501
302 |
303 |                     current_chunk = []
    |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -278,0 +323,4 @@
try:
graph = Graph()
graph.parse(data=chunk_text, format="turtle")
triples = [_convert_rdf_triple_to_dict(s, p, o) for s, p, o in graph]
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:326:89: E501 Line too long (89 > 88)
    |
324 |                     graph = Graph()
325 |                     graph.parse(data=chunk_text, format="turtle")
326 |                     triples = [_convert_rdf_triple_to_dict(s, p, o) for s, p, o in graph]
    |                                                                                         ^ E501
327 |                     if triples:
328 |                         yield triples
    |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -278,0 +331,4 @@
if attempt < max_retries - 1:
continue # Retry
else:
logger.warning(f"Final chunk parsing failed after {max_retries} attempts: {e}")
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:334:89: E501 Line too long (103 > 88)
    |
332 |                         continue  # Retry
333 |                     else:
334 |                         logger.warning(f"Final chunk parsing failed after {max_retries} attempts: {e}")
    |                                                                                         ^^^^^^^^^^^^^^^ E501
335 |         return
    |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -984,0 +1054,4 @@
or "gateway timeout" in error_str
# Network/connection errors
or "connection" in error_str and "refused" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1057:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1056 |                     # Network/connection errors
1057 |                     or "connection" in error_str and "refused" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1058 |                     or "connection" in error_str and "reset" in error_str
1059 |                     or "connection" in error_str and "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1058:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1056 |                     # Network/connection errors
1057 |                     or "connection" in error_str and "refused" in error_str
1058 |                     or "connection" in error_str and "reset" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1059 |                     or "connection" in error_str and "timeout" in error_str
1060 |                     or "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1059:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1057 |                     or "connection" in error_str and "refused" in error_str
1058 |                     or "connection" in error_str and "reset" in error_str
1059 |                     or "connection" in error_str and "timeout" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1060 |                     or "timeout" in error_str
1061 |                     or "timed out" in error_str
     |
     = help: Parenthesize the `and` subexpression
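RUF021's auto-fix just parenthesizes each `and` pair. Since `and` already binds tighter than `or`, the grouping doesn't change behavior here; the parentheses only make the intent explicit. A sketch of the parenthesized form, wrapped in a hypothetical `is_retryable` helper for illustration:

```python
def is_retryable(error_str: str) -> bool:
    # `and` binds tighter than `or`, so these parentheses don't change
    # the result -- they make the grouping explicit, as RUF021 asks.
    return (
        "gateway timeout" in error_str
        # Network/connection errors
        or ("connection" in error_str and "refused" in error_str)
        or ("connection" in error_str and "reset" in error_str)
        or ("connection" in error_str and "timeout" in error_str)
        or "timed out" in error_str
    )

print(is_retryable("connection refused by peer"))  # True
print(is_retryable("permission denied"))           # False
```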
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -984,0 +1064,4 @@
or "timeout" in error_type.lower()
# SSL/TLS temporary issues
or "ssl" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1067:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1066 |                     # SSL/TLS temporary issues
1067 |                     or "ssl" in error_str and "error" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1068 |
1069 |                     # Proxy errors
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -984,0 +1067,4 @@
or "ssl" in error_str and "error" in error_str
# Proxy errors
or "proxy" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1070:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1069 |                     # Proxy errors
1070 |                     or "proxy" in error_str and "error" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1071 |
1072 |                     # Hugging Face specific
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1004,3 +1096,3 @@
raise
else:
# Not a rate limit error, re-raise immediately
# Not a retryable error (authentication, permission, etc.) - fail immediately
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1098:89: E501 Line too long (97 > 88)
     |
1096 |                         raise
1097 |                 else:
1098 |                     # Not a retryable error (authentication, permission, etc.) - fail immediately
     |                                                                                         ^^^^^^^^^ E501
1099 |                     raise
     |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1826,2 +1898,2 @@
"for the dataset[/dim]"
)
# Upload README to the repository with retry logic
import io
Member

It's usually better to put all `import` statements together.
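That is, hoist the function-local `import io` to the top of the module with the rest of the imports, per PEP 8. A toy sketch of the pattern (the helper and its contents are illustrative, not the PR's code):

```python
# All imports grouped at module top -- no late `import io` inside the
# upload function.
import io


def build_readme_buffer(text: str) -> io.BytesIO:
    """Return README content as an in-memory binary file object."""
    return io.BytesIO(text.encode("utf-8"))


buf = build_readme_buffer("# My Dataset\n")
print(buf.read().decode("utf-8"))
```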
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1915,4 @@
commit_message="Add dataset card with documentation"
)
console.print("[green]✓ Dataset card (README.md) uploaded[/green]")
readme_uploaded = True
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1918:21: F841 Local variable `readme_uploaded` is assigned to but never used
     |
1916 |                     )
1917 |                     console.print("[green]✓ Dataset card (README.md) uploaded[/green]")
1918 |                     readme_uploaded = True
     |                     ^^^^^^^^^^^^^^^ F841
1919 |                     break  # Success, exit retry loop
     |
     = help: Remove assignment to unused variable `readme_uploaded`
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1943,4 @@
# Network/connection errors
or "connection" in error_str and "refused" in error_str
or "connection" in error_str and "reset" in error_str
or "connection" in error_str and "timeout" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1944:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1943 |                         # Network/connection errors
1944 |                         or "connection" in error_str and "refused" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1945 |                         or "connection" in error_str and "reset" in error_str
1946 |                         or "connection" in error_str and "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1945:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1943 |                         # Network/connection errors
1944 |                         or "connection" in error_str and "refused" in error_str
1945 |                         or "connection" in error_str and "reset" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1946 |                         or "connection" in error_str and "timeout" in error_str
1947 |                         or "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1946:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1944 |                         or "connection" in error_str and "refused" in error_str
1945 |                         or "connection" in error_str and "reset" in error_str
1946 |                         or "connection" in error_str and "timeout" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1947 |                         or "timeout" in error_str
1948 |                         or "timed out" in error_str
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1951,4 @@
or "timeout" in error_type.lower()
# SSL/TLS temporary issues
or "ssl" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1954:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1953 |                         # SSL/TLS temporary issues
1954 |                         or "ssl" in error_str and "error" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1955 |
1956 |                         # Proxy errors
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1954,4 @@
or "ssl" in error_str and "error" in error_str
# Proxy errors
or "proxy" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1957:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1956 |                         # Proxy errors
1957 |                         or "proxy" in error_str and "error" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1958 |
1959 |                         # Hugging Face specific
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1976,4 @@
time.sleep(wait_time)
else:
console.print(
f"[yellow]⚠ Warning: Could not upload README.md after {max_retries} attempts: "
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1979:89: E501 Line too long (111 > 88)
     |
1977 |                         else:
1978 |                             console.print(
1979 |                                 f"[yellow]⚠ Warning: Could not upload README.md after {max_retries} attempts: "
     |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^ E501
1980 |                                 f"{e}[/yellow]"
1981 |                             )
     |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1986,4 @@
else:
# Not a retryable error - fail immediately
console.print(
f"[yellow]⚠ Warning: Could not upload README.md (non-retryable error): "
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1989:89: E501 Line too long (100 > 88)
     |
1987 |                         # Not a retryable error - fail immediately
1988 |                         console.print(
1989 |                             f"[yellow]⚠ Warning: Could not upload README.md (non-retryable error): "
     |                                                                                         ^^^^^^^^^^^^ E501
1990 |                             f"{e}[/yellow]"
1991 |                         )
     |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1993,4 @@
"[dim]You can manually create a README.md file "
"for the dataset[/dim]"
)
break
Member

Lines 1035-1099 and lines 1922-1996 are extremely similar. Could they be unified into a function?
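One shape such a helper could take (names like `upload_with_retry` and `is_retryable` are illustrative, not from the PR): pass the upload action in as a callable and keep the retry/backoff policy in one place, so both near-duplicate blocks collapse to a call.

```python
import time


def upload_with_retry(action, *, max_retries=3, base_delay=0.01,
                      is_retryable=lambda exc: True):
    """Run `action`, retrying with exponential backoff on retryable errors."""
    for attempt in range(max_retries):
        try:
            return action()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_retries - 1:
                raise  # non-retryable, or out of attempts
            time.sleep(base_delay * (2 ** attempt))


# Demo: an action that fails twice, then succeeds.
attempts = []

def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("connection reset")
    return "uploaded"

print(upload_with_retry(flaky_upload, max_retries=5))  # uploaded
```

Both call sites (the chunk upload and the README upload) could then supply their own `is_retryable` predicate and error message.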
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
brent.edwards left a comment
Member

A lot of your new code didn't pass `ruff check`.
@ -523,2 +582,2 @@
graph = Graph()
graph.parse(str(file_path), format=rdf_format)
# Check if this is GeoNames format (special handling required)
if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):
Member

Crap. I'm sorry to tell you this, but...

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:583:89: E501 Line too long (91 > 88)
    |
581 |     """
582 |     # Check if this is GeoNames format (special handling required)
583 |     if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):
    |                                                                                         ^^^ E501
584 |         # Use GeoNames parser with single worker for non-parallel strategies
585 |         yield from stream_geonames_parallel(file_path, chunk_size, num_workers=1)
    |

Let me know if you're having problems running `ruff check` locally.

Author
Member

fixed !!
@ -828,0 +965,4 @@
subject=subject,
predicate=predicate,
object=obj,
object_type="literal", # TSV typically has literals
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:968:89: E501 Line too long (96 > 88)
    |
966 | …                     predicate=predicate,
967 | …                     object=obj,
968 | …                     object_type="literal",  # TSV typically has literals
    |                                                                   ^^^^^^^^ E501
969 | …                     object_datatype=None,
970 | …                     object_language=None
    |
Author
Member

fixed !!
@ -828,0 +970,4 @@
object_language=None
))
except Exception as e:
logger.debug(f"Skipping malformed TSV line {line_num}: {e}")
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:973:89: E501 Line too long (92 > 88)
    |
971 |                                         ))
972 |                             except Exception as e:
973 |                                 logger.debug(f"Skipping malformed TSV line {line_num}: {e}")
    |                                                                                         ^^^^ E501
974 |                                 continue
975 |                 except Exception as e:
Author
Member

fixed !!
@ -840,0 +990,4 @@
# Create train/test split if requested
if config.create_train_test_split:
assert isinstance(dataset, Dataset), "Expected Dataset instance"
train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:993:89: E501 Line too long (90 > 88)
    |
991 |             if config.create_train_test_split:
992 |                 assert isinstance(dataset, Dataset), "Expected Dataset instance"
993 |                 train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
    |                                                                                         ^^ E501
994 |                 dataset_dict = DatasetDict({
995 |                     "train": train_test["train"],
Author
Member

fixed !!
@ -1161,1 +1324,3 @@
)
# Handle compressed files
is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1325:89: E501 Line too long (101 > 88)
     |
1323 |                 # Check if it's the special GeoNames format (URLs followed by XML)
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
     |                                                                                         ^^^^^^^^^^^^^ E501
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |

scripts/convert_rdf_to_hf_dataset_unified.py:1326:89: E501 Line too long (102 > 88)
     |
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |                                                                                         ^^^^^^^^^^^^^^ E501
1327 |
1328 |                 try:
     |
Author
Member

fixed !!

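An alternative to wrapping these two lines: for ordinary filenames like `data.nt.gz`, `Path.suffix` already returns `".gz"`, so the `endswith()` clause is redundant and dropping it clears E501 on its own. A sketch (the helper name is invented, not code from this PR):

```python
from pathlib import Path


def detect_compression(input_path: Path) -> str:
    # Path("data.nt.gz").suffix == ".gz", so the extra endswith() check only
    # matters for edge cases like a bare ".gz" dotfile.
    if input_path.suffix == ".gz":
        return "gzip"
    if input_path.suffix == ".bz2":
        return "bz2"
    return "none"
```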
@@ -1162,0 +1327,4 @@
try:
if is_gzip:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1330:89: E501 Line too long (104 > 88)
     |
1328 |                 try:
1329 |                     if is_gzip:
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^^ E501
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |
Author
Member

fixed !!

@@ -1162,0 +1329,4 @@
if is_gzip:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
elif is_bz2:
file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1332:89: E501 Line too long (103 > 88)
     |
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^ E501
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |
Author
Member

fixed !!

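One way to shorten all three over-long open calls at once is a suffix-to-opener table: `gzip.open`, `bz2.open`, and the builtin `open` all accept the same text-mode arguments. A sketch under that assumption (the `open_text` helper is invented for illustration):

```python
import bz2
import gzip
from pathlib import Path

_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path: Path):
    # Fall back to the plain builtin open() for uncompressed input; every
    # branch now fits well inside ruff's 88-column limit.
    opener = _OPENERS.get(path.suffix, open)
    return opener(path, "rt", encoding="utf-8", errors="ignore")
```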
@@ -1165,3 +1334,1 @@
f"[yellow]Using standard RDF parser for "
f"{config.rdf_format} (single-threaded)[/yellow]"
)
file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1334:89: E501 Line too long (93 > 88)
     |
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^ E501
1335 |
1336 |                     try:
     |
Author
Member

fixed !!

@@ -1168,0 +1335,4 @@
try:
first_line = file_obj.readline().strip()
if first_line.startswith("http://") or first_line.startswith("https://"):
Member

I don't know why `ruff check` is ignoring this line this time around, but it's 98 characters long, longer than `ruff check`'s usual 88 characters.
Author
Member

fixed !!

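For the over-long `startswith` check, `str.startswith` also accepts a tuple of prefixes, which fits the whole test on one line under 88 columns:

```python
def looks_like_url(line: str) -> bool:
    # Equivalent to the or-chain of two startswith() calls, but shorter.
    return line.strip().startswith(("http://", "https://"))
```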
@@ -1168,0 +1350,4 @@
finally:
file_obj.close()
except Exception as e:
logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1353:89: E501 Line too long (97 > 88)
     |
1351 |                         file_obj.close()
1352 |                 except Exception as e:
1353 |                     logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")
     |                                                                                         ^^^^^^^^^ E501
1354 |                     format_type = "generic"
1355 |             elif config.rdf_format in ("nt", "ntriples"):
Author
Member

fixed !!

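Splitting the f-string via implicit literal concatenation keeps the logged message byte-for-byte identical while satisfying E501. A sketch with a hypothetical helper wrapping the call:

```python
import logging

logger = logging.getLogger("rdf_converter")


def warn_generic_fallback(exc: Exception) -> str:
    # Adjacent string literals are joined at compile time, so the split
    # changes nothing about the emitted log record.
    message = (
        f"Error detecting GeoNames format: {exc}, "
        "using generic parser"
    )
    logger.warning(message)
    return message
```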
@@ -1249,0 +1435,4 @@
# Create train/test split if requested
if config.create_train_test_split:
assert isinstance(dataset, Dataset), "Expected Dataset instance"
train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
Member

Look. You're able to run `ruff check` just as well as I can. I'm just going to report that `ruff check` failed in the code that you wrote. Please look through the rest of the code that you're adding.
Author
Member

fixed !!

aditya force-pushed streaming-upload-docker from 5e3be91d50 to 56abec9431 2026-02-05 13:29:09 +00:00 Compare
This pull request can be merged automatically.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin streaming-upload-docker:streaming-upload-docker
git switch streaming-upload-docker

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

# Create a merge commit:
git switch xml-streaming-optimization
git merge --no-ff streaming-upload-docker

# Rebase, then fast-forward:
git switch streaming-upload-docker
git rebase xml-streaming-optimization
git switch xml-streaming-optimization
git merge --ff-only streaming-upload-docker

# Rebase, then create a merge commit:
git switch streaming-upload-docker
git rebase xml-streaming-optimization
git switch xml-streaming-optimization
git merge --no-ff streaming-upload-docker

# Squash:
git switch xml-streaming-optimization
git merge --squash streaming-upload-docker

# Fast-forward only:
git switch xml-streaming-optimization
git merge --ff-only streaming-upload-docker

# Manual merge:
git switch xml-streaming-optimization
git merge streaming-upload-docker

# Then push:
git push origin xml-streaming-optimization
Reference
cleverdatasets/dataset-uploader!53