Docker: add entrypoint script for upload single dataset #53

Open
aditya wants to merge 6 commits from streaming-upload-docker into xml-streaming-optimization
Member

The code contains an entry point script for uploading a single dataset.
aditya force-pushed streaming-upload-docker from d3d581dfe9 to 415ce6baaf 2026-01-22 10:16:19 +00:00 Compare
aditya force-pushed streaming-upload-docker from 2e68a12951 to 7069bbdd93 2026-01-27 14:43:01 +00:00 Compare
brent.edwards left a comment
Member

When I run `behave features`, I get the following summary:

Failing scenarios:
  features/rdf_converter/cli_integration.feature:76  CLI creates train/test split
  features/rdf_converter/error_handling.feature:49  Handle malformed N-Triples gracefully
  features/rdf_converter/error_handling.feature:60  Skip malformed lines in streaming mode
  features/rdf_converter/error_handling.feature:77  Recover from partially valid file
  features/rdf_converter/error_handling.feature:117  Streaming recovers from parsing errors in ntriples
  features/rdf_converter/file_handling.feature:102  Convert gzip-compressed N-Triples
  features/rdf_converter/file_handling.feature:108  Stream gzip-compressed N-Triples
  features/rdf_converter/file_handling.feature:124  Convert bz2-compressed N-Triples
  features/rdf_converter/file_handling.feature:130  Stream bz2-compressed N-Triples
  features/rdf_converter/file_handling.feature:140  Convert TSV file with standard strategy
  features/rdf_converter/file_handling.feature:147  TSV file creates literal object types
  features/rdf_converter/file_handling.feature:153  TSV file with empty lines and malformed rows
  features/rdf_converter/file_handling.feature:174  Convert GeoNames format file with standard strategy
  features/rdf_converter/file_handling.feature:180  Convert GeoNames format file with streaming strategy
  features/rdf_converter/file_handling.feature:190  Convert GeoNames format file with simple streaming
  features/rdf_converter/parallel_conversion.feature:126  Parallel correctly processes literals with language tags
  features/rdf_converter/parallel_conversion.feature:139  Parallel correctly processes URI objects
  features/rdf_converter/standard_conversion.feature:41  Standard conversion with train/test split
  features/rdf_converter/streaming_conversion.feature:58  Stream gzip-compressed N-Triples file

2 features passed, 6 failed, 0 skipped
91 scenarios passed, 19 failed, 0 skipped
525 steps passed, 19 failed, 18 skipped
Took 0min 4.669s

I can't approve until the tests pass.

@ -122,0 +127,4 @@
continue # Retry
# All retries failed - try to salvage by parsing line-by-line
logger.warning(f"Chunk {idx} failed after {max_retries} attempts, attempting line-by-line recovery: {last_error}")
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:130:89: E501 Line too long (118 > 88)
    |
129 |     # All retries failed - try to salvage by parsing line-by-line
130 |     logger.warning(f"Chunk {idx} failed after {max_retries} attempts, attempting line-by-line recovery: {last_error}")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
131 |
132 |     # Try to parse individual statements
    |
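One way to clear E501s like this one: Python concatenates adjacent f-string literals, so the message can be split across physical lines without changing the logged text. A minimal sketch with stand-in values (`idx`, `max_retries`, `last_error` here are dummies, not the PR's variables):

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

# Stand-in values for illustration only.
idx, max_retries, last_error = 3, 2, "bad triple at line 17"

# Adjacent f-string literals are concatenated, so each physical
# line stays under ruff's 88-character limit.
msg = (
    f"Chunk {idx} failed after {max_retries} attempts, "
    f"attempting line-by-line recovery: {last_error}"
)
logger.warning(msg)
```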
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -253,0 +292,4 @@
]
if triples:
yield triples
parsed_successfully = True
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:295:29: F841 Local variable `parsed_successfully` is assigned to but never used
    |
293 |                             if triples:
294 |                                 yield triples
295 |                             parsed_successfully = True
    |                             ^^^^^^^^^^^^^^^^^^^ F841
296 |                             break
297 |                         except Exception as e:
    |
    = help: Remove assignment to unused variable `parsed_successfully`
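The fix ruff suggests is simply to drop the assignment: in a retry loop, the `break` on success already carries that information. A minimal sketch of the pattern (the chunk data and the parsing stand-in are hypothetical, not the PR's code):

```python
def parse_chunks(chunks, max_retries=2):
    """Yield parsed chunks, retrying each up to max_retries times."""
    for chunk in chunks:
        for attempt in range(max_retries):
            try:
                parsed = chunk.strip().split()  # stand-in for real parsing
                if parsed:
                    yield parsed
                break  # success: no `parsed_successfully = True` flag needed
            except Exception:
                if attempt < max_retries - 1:
                    continue  # retry

print(list(parse_chunks(["a b", "  ", "c"])))  # [['a', 'b'], ['c']]
```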
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -253,0 +298,4 @@
if attempt < max_retries - 1:
continue # Retry
else:
logger.warning(f"Chunk parsing failed after {max_retries} attempts: {e}")
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:301:89: E501 Line too long (105 > 88)
    |
299 |                                 continue  # Retry
300 |                             else:
301 |                                 logger.warning(f"Chunk parsing failed after {max_retries} attempts: {e}")
    |                                                                                         ^^^^^^^^^^^^^^^^^ E501
302 |
303 |                     current_chunk = []
    |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -278,0 +323,4 @@
try:
graph = Graph()
graph.parse(data=chunk_text, format="turtle")
triples = [_convert_rdf_triple_to_dict(s, p, o) for s, p, o in graph]
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:326:89: E501 Line too long (89 > 88)
    |
324 |                     graph = Graph()
325 |                     graph.parse(data=chunk_text, format="turtle")
326 |                     triples = [_convert_rdf_triple_to_dict(s, p, o) for s, p, o in graph]
    |                                                                                         ^ E501
327 |                     if triples:
328 |                         yield triples
    |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -278,0 +331,4 @@
if attempt < max_retries - 1:
continue # Retry
else:
logger.warning(f"Final chunk parsing failed after {max_retries} attempts: {e}")
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:334:89: E501 Line too long (103 > 88)
    |
332 |                         continue  # Retry
333 |                     else:
334 |                         logger.warning(f"Final chunk parsing failed after {max_retries} attempts: {e}")
    |                                                                                         ^^^^^^^^^^^^^^^ E501
335 |         return
    |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -984,0 +1054,4 @@
or "gateway timeout" in error_str
# Network/connection errors
or "connection" in error_str and "refused" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1057:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1056 |                     # Network/connection errors
1057 |                     or "connection" in error_str and "refused" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1058 |                     or "connection" in error_str and "reset" in error_str
1059 |                     or "connection" in error_str and "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1058:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1056 |                     # Network/connection errors
1057 |                     or "connection" in error_str and "refused" in error_str
1058 |                     or "connection" in error_str and "reset" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1059 |                     or "connection" in error_str and "timeout" in error_str
1060 |                     or "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1059:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1057 |                     or "connection" in error_str and "refused" in error_str
1058 |                     or "connection" in error_str and "reset" in error_str
1059 |                     or "connection" in error_str and "timeout" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1060 |                     or "timeout" in error_str
1061 |                     or "timed out" in error_str
     |
     = help: Parenthesize the `and` subexpression
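RUF021's auto-fix just parenthesizes each `and` pair. Since `and` already binds tighter than `or`, the grouping doesn't change behavior here; the parentheses only make the intent explicit. A sketch of the parenthesized form, wrapped in a hypothetical `is_retryable` helper for illustration:

```python
def is_retryable(error_str: str) -> bool:
    # `and` binds tighter than `or`, so these parentheses don't change
    # the result -- they make the grouping explicit, as RUF021 asks.
    return (
        "gateway timeout" in error_str
        # Network/connection errors
        or ("connection" in error_str and "refused" in error_str)
        or ("connection" in error_str and "reset" in error_str)
        or ("connection" in error_str and "timeout" in error_str)
        or "timed out" in error_str
    )

print(is_retryable("connection refused by peer"))  # True
print(is_retryable("permission denied"))           # False
```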
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -984,0 +1064,4 @@
or "timeout" in error_type.lower()
# SSL/TLS temporary issues
or "ssl" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1067:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1066 |                     # SSL/TLS temporary issues
1067 |                     or "ssl" in error_str and "error" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1068 |
1069 |                     # Proxy errors
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -984,0 +1067,4 @@
or "ssl" in error_str and "error" in error_str
# Proxy errors
or "proxy" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1070:24: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1069 |                     # Proxy errors
1070 |                     or "proxy" in error_str and "error" in error_str
     |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1071 |
1072 |                     # Hugging Face specific
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1004,3 +1096,3 @@
raise
else:
# Not a rate limit error, re-raise immediately
# Not a retryable error (authentication, permission, etc.) - fail immediately
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1098:89: E501 Line too long (97 > 88)
     |
1096 |                         raise
1097 |                 else:
1098 |                     # Not a retryable error (authentication, permission, etc.) - fail immediately
     |                                                                                         ^^^^^^^^^ E501
1099 |                     raise
     |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1826,2 +1898,2 @@
"for the dataset[/dim]"
)
# Upload README to the repository with retry logic
import io
Member

It's usually better to put all `import` statements together.
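That is, hoist the function-local `import io` to the top of the module with the rest of the imports, per PEP 8. A toy sketch of the pattern (the helper and its contents are illustrative, not the PR's code):

```python
# All imports grouped at module top -- no late `import io` inside the
# upload function.
import io


def build_readme_buffer(text: str) -> io.BytesIO:
    """Return README content as an in-memory binary file object."""
    return io.BytesIO(text.encode("utf-8"))


buf = build_readme_buffer("# My Dataset\n")
print(buf.read().decode("utf-8"))
```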
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1915,4 @@
commit_message="Add dataset card with documentation"
)
console.print("[green]✓ Dataset card (README.md) uploaded[/green]")
readme_uploaded = True
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1918:21: F841 Local variable `readme_uploaded` is assigned to but never used
     |
1916 |                     )
1917 |                     console.print("[green]✓ Dataset card (README.md) uploaded[/green]")
1918 |                     readme_uploaded = True
     |                     ^^^^^^^^^^^^^^^ F841
1919 |                     break  # Success, exit retry loop
     |
     = help: Remove assignment to unused variable `readme_uploaded`
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1943,4 @@
# Network/connection errors
or "connection" in error_str and "refused" in error_str
or "connection" in error_str and "reset" in error_str
or "connection" in error_str and "timeout" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1944:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1943 |                         # Network/connection errors
1944 |                         or "connection" in error_str and "refused" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1945 |                         or "connection" in error_str and "reset" in error_str
1946 |                         or "connection" in error_str and "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1945:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1943 |                         # Network/connection errors
1944 |                         or "connection" in error_str and "refused" in error_str
1945 |                         or "connection" in error_str and "reset" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1946 |                         or "connection" in error_str and "timeout" in error_str
1947 |                         or "timeout" in error_str
     |
     = help: Parenthesize the `and` subexpression

scripts/rdf_to_hf_incremental.py:1946:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1944 |                         or "connection" in error_str and "refused" in error_str
1945 |                         or "connection" in error_str and "reset" in error_str
1946 |                         or "connection" in error_str and "timeout" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1947 |                         or "timeout" in error_str
1948 |                         or "timed out" in error_str
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1951,4 @@
or "timeout" in error_type.lower()
# SSL/TLS temporary issues
or "ssl" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1954:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1953 |                         # SSL/TLS temporary issues
1954 |                         or "ssl" in error_str and "error" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1955 |
1956 |                         # Proxy errors
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1954,4 @@
or "ssl" in error_str and "error" in error_str
# Proxy errors
or "proxy" in error_str and "error" in error_str
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1957:28: RUF021 [*] Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
     |
1956 |                         # Proxy errors
1957 |                         or "proxy" in error_str and "error" in error_str
     |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF021
1958 |
1959 |                         # Hugging Face specific
     |
     = help: Parenthesize the `and` subexpression
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1976,4 @@
time.sleep(wait_time)
else:
console.print(
f"[yellow]⚠ Warning: Could not upload README.md after {max_retries} attempts: "
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1979:89: E501 Line too long (111 > 88)
     |
1977 |                         else:
1978 |                             console.print(
1979 |                                 f"[yellow]⚠ Warning: Could not upload README.md after {max_retries} attempts: "
     |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^ E501
1980 |                                 f"{e}[/yellow]"
1981 |                             )
     |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1986,4 @@
else:
# Not a retryable error - fail immediately
console.print(
f"[yellow]⚠ Warning: Could not upload README.md (non-retryable error): "
Member

`ruff check` reports:

scripts/rdf_to_hf_incremental.py:1989:89: E501 Line too long (100 > 88)
     |
1987 |                         # Not a retryable error - fail immediately
1988 |                         console.print(
1989 |                             f"[yellow]⚠ Warning: Could not upload README.md (non-retryable error): "
     |                                                                                         ^^^^^^^^^^^^ E501
1990 |                             f"{e}[/yellow]"
1991 |                         )
     |
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
@ -1829,0 +1993,4 @@
"[dim]You can manually create a README.md file "
"for the dataset[/dim]"
)
break
Member

Lines 1035-1099 and lines 1922-1996 are extremely similar. Could they be unified into a function?
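One shape such a helper could take (names like `upload_with_retry` and `is_retryable` are illustrative, not from the PR): pass the upload action in as a callable and keep the retry/backoff policy in one place, so both near-duplicate blocks collapse to a call.

```python
import time


def upload_with_retry(action, *, max_retries=3, base_delay=0.01,
                      is_retryable=lambda exc: True):
    """Run `action`, retrying with exponential backoff on retryable errors."""
    for attempt in range(max_retries):
        try:
            return action()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_retries - 1:
                raise  # non-retryable, or out of attempts
            time.sleep(base_delay * (2 ** attempt))


# Demo: an action that fails twice, then succeeds.
attempts = []

def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("connection reset")
    return "uploaded"

print(upload_with_retry(flaky_upload, max_retries=5))  # uploaded
```

Both call sites (the chunk upload and the README upload) could then supply their own `is_retryable` predicate and error message.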
Author
Member

fixed !!
brent.edwards marked this conversation as resolved
brent.edwards left a comment
Member

A lot of your new code didn't pass `ruff check`.
@ -523,2 +582,2 @@
graph = Graph()
graph.parse(str(file_path), format=rdf_format)
# Check if this is GeoNames format (special handling required)
if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):
Member

Crap. I'm sorry to tell you this, but...

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:583:89: E501 Line too long (91 > 88)
    |
581 |     """
582 |     # Check if this is GeoNames format (special handling required)
583 |     if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):
    |                                                                                         ^^^ E501
584 |         # Use GeoNames parser with single worker for non-parallel strategies
585 |         yield from stream_geonames_parallel(file_path, chunk_size, num_workers=1)
    |

Let me know if you're having problems running `ruff check` locally.

Author
Member

fixed !!
@ -828,0 +965,4 @@
subject=subject,
predicate=predicate,
object=obj,
object_type="literal", # TSV typically has literals
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:968:89: E501 Line too long (96 > 88)
    |
966 | …                     predicate=predicate,
967 | …                     object=obj,
968 | …                     object_type="literal",  # TSV typically has literals
    |                                                                   ^^^^^^^^ E501
969 | …                     object_datatype=None,
970 | …                     object_language=None
    |
Author
Member

fixed !!
@ -828,0 +970,4 @@
object_language=None
))
except Exception as e:
logger.debug(f"Skipping malformed TSV line {line_num}: {e}")
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:973:89: E501 Line too long (92 > 88)
    |
971 |                                         ))
972 |                             except Exception as e:
973 |                                 logger.debug(f"Skipping malformed TSV line {line_num}: {e}")
    |                                                                                         ^^^^ E501
974 |                                 continue
975 |                 except Exception as e:
Author
Member

fixed !!
@ -840,0 +990,4 @@
# Create train/test split if requested
if config.create_train_test_split:
assert isinstance(dataset, Dataset), "Expected Dataset instance"
train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:993:89: E501 Line too long (90 > 88)
    |
991 |             if config.create_train_test_split:
992 |                 assert isinstance(dataset, Dataset), "Expected Dataset instance"
993 |                 train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
    |                                                                                         ^^ E501
994 |                 dataset_dict = DatasetDict({
995 |                     "train": train_test["train"],
Author
Member

fixed !!
@ -1161,1 +1324,3 @@
)
# Handle compressed files
is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1325:89: E501 Line too long (101 > 88)
     |
1323 |                 # Check if it's the special GeoNames format (URLs followed by XML)
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
     |                                                                                         ^^^^^^^^^^^^^ E501
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |

scripts/convert_rdf_to_hf_dataset_unified.py:1326:89: E501 Line too long (102 > 88)
     |
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |                                                                                         ^^^^^^^^^^^^^^ E501
1327 |
1328 |                 try:
     |
Author
Member

fixed !!

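An alternative to wrapping these two lines: for ordinary filenames like `data.nt.gz`, `Path.suffix` already returns `".gz"`, so the `endswith()` clause is redundant and dropping it clears E501 on its own. A sketch (the helper name is invented, not code from this PR):

```python
from pathlib import Path


def detect_compression(input_path: Path) -> str:
    # Path("data.nt.gz").suffix == ".gz", so the extra endswith() check only
    # matters for edge cases like a bare ".gz" dotfile.
    if input_path.suffix == ".gz":
        return "gzip"
    if input_path.suffix == ".bz2":
        return "bz2"
    return "none"
```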
@@ -1162,0 +1327,4 @@
try:
if is_gzip:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1330:89: E501 Line too long (104 > 88)
     |
1328 |                 try:
1329 |                     if is_gzip:
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^^ E501
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |
Author
Member

fixed !!

@@ -1162,0 +1329,4 @@
if is_gzip:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
elif is_bz2:
file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1332:89: E501 Line too long (103 > 88)
     |
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^ E501
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |
Author
Member

fixed !!

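One way to shorten all three over-long open calls at once is a suffix-to-opener table: `gzip.open`, `bz2.open`, and the builtin `open` all accept the same text-mode arguments. A sketch under that assumption (the `open_text` helper is invented for illustration):

```python
import bz2
import gzip
from pathlib import Path

_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path: Path):
    # Fall back to the plain builtin open() for uncompressed input; every
    # branch now fits well inside ruff's 88-column limit.
    opener = _OPENERS.get(path.suffix, open)
    return opener(path, "rt", encoding="utf-8", errors="ignore")
```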
@@ -1165,3 +1334,1 @@
f"[yellow]Using standard RDF parser for "
f"{config.rdf_format} (single-threaded)[/yellow]"
)
file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1334:89: E501 Line too long (93 > 88)
     |
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^ E501
1335 |
1336 |                     try:
     |
Author
Member

fixed !!

@@ -1168,0 +1335,4 @@
try:
first_line = file_obj.readline().strip()
if first_line.startswith("http://") or first_line.startswith("https://"):
Member

I don't know why `ruff check` is ignoring this line this time around, but it's 98 characters long, longer than `ruff check`'s usual 88 characters.
Author
Member

fixed !!

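For the over-long `startswith` check, `str.startswith` also accepts a tuple of prefixes, which fits the whole test on one line under 88 columns:

```python
def looks_like_url(line: str) -> bool:
    # Equivalent to the or-chain of two startswith() calls, but shorter.
    return line.strip().startswith(("http://", "https://"))
```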
@@ -1168,0 +1350,4 @@
finally:
file_obj.close()
except Exception as e:
logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")
Member

`ruff check` reports:

scripts/convert_rdf_to_hf_dataset_unified.py:1353:89: E501 Line too long (97 > 88)
     |
1351 |                         file_obj.close()
1352 |                 except Exception as e:
1353 |                     logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")
     |                                                                                         ^^^^^^^^^ E501
1354 |                     format_type = "generic"
1355 |             elif config.rdf_format in ("nt", "ntriples"):
Author
Member

fixed !!

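Splitting the f-string via implicit literal concatenation keeps the logged message byte-for-byte identical while satisfying E501. A sketch with a hypothetical helper wrapping the call:

```python
import logging

logger = logging.getLogger("rdf_converter")


def warn_generic_fallback(exc: Exception) -> str:
    # Adjacent string literals are joined at compile time, so the split
    # changes nothing about the emitted log record.
    message = (
        f"Error detecting GeoNames format: {exc}, "
        "using generic parser"
    )
    logger.warning(message)
    return message
```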
@@ -1249,0 +1435,4 @@
# Create train/test split if requested
if config.create_train_test_split:
assert isinstance(dataset, Dataset), "Expected Dataset instance"
train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
Member

Look. You're able to run `ruff check` just as well as I can. I'm just going to report that `ruff check` failed in the code that you wrote. Please look through the rest of the code that you're adding.
Author
Member

fixed !!

aditya force-pushed streaming-upload-docker from 5e3be91d50 to 56abec9431 2026-02-05 13:29:09 +00:00 Compare
This pull request can be merged automatically.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin streaming-upload-docker:streaming-upload-docker
git switch streaming-upload-docker

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

# Create a merge commit:
git switch xml-streaming-optimization
git merge --no-ff streaming-upload-docker

# Rebase, then fast-forward:
git switch streaming-upload-docker
git rebase xml-streaming-optimization
git switch xml-streaming-optimization
git merge --ff-only streaming-upload-docker

# Rebase, then create a merge commit:
git switch streaming-upload-docker
git rebase xml-streaming-optimization
git switch xml-streaming-optimization
git merge --no-ff streaming-upload-docker

# Squash:
git switch xml-streaming-optimization
git merge --squash streaming-upload-docker

# Fast-forward only:
git switch xml-streaming-optimization
git merge --ff-only streaming-upload-docker

# Manual merge:
git switch xml-streaming-optimization
git merge streaming-upload-docker

# Then push:
git push origin xml-streaming-optimization
Reference
cleverdatasets/dataset-uploader!53