xml-streaming-optimization #49

Open
aditya wants to merge 7 commits from xml-streaming-optimization into auto-excel-update
Member

This branch contains optimized code that skips description parsing during resume and instead only counts descriptions, which reduces the overhead of parsing the entire dataset again. It also adds a multithreaded implementation for description parsing and checkpoint updates via the tracker sheet.
khird approved these changes 2026-01-27 15:29:10 +00:00
Dismissed
@ -370,0 +406,4 @@
self.sheet.values()
.get(
spreadsheetId=self.spreadsheet_id,
range=f"{self.worksheet_name}!A{row_num}:Z{row_num}",
First-time contributor

Hardcoding A and Z columns looks suspicious - is this just a conservative estimate or do we know that the data we want is in this subset of the sheet?

Author
Member

Fixed !!

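For reference, a minimal sketch of deriving the A1 range end from the header width instead of hardcoding Z. The helper mirrors the _col_letter method added later in this branch; the sample values are hypothetical:

def _col_letter(n: int) -> str:
    """Convert a 1-indexed column number to A1-notation letters (1 -> A, 27 -> AA)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

# Derive the range end from the widest known header column instead of assuming Z.
max_col = 12   # hypothetical: highest 0-indexed column found in the header row
row_num = 406  # hypothetical row being fetched
print(f"Tracker!A{row_num}:{_col_letter(max_col + 1)}{row_num}")  # Tracker!A406:M406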
khird approved these changes 2026-02-02 15:23:12 +00:00
brent.edwards left a comment
Member

When I run behave features on this code, I get the following summary:

Failing scenarios:
  features/rdf_converter/cli_integration.feature:76  CLI creates train/test split
  features/rdf_converter/error_handling.feature:49  Handle malformed N-Triples gracefully
  features/rdf_converter/error_handling.feature:60  Skip malformed lines in streaming mode
  features/rdf_converter/error_handling.feature:77  Recover from partially valid file
  features/rdf_converter/error_handling.feature:117  Streaming recovers from parsing errors in ntriples
  features/rdf_converter/file_handling.feature:102  Convert gzip-compressed N-Triples
  features/rdf_converter/file_handling.feature:108  Stream gzip-compressed N-Triples
  features/rdf_converter/file_handling.feature:124  Convert bz2-compressed N-Triples
  features/rdf_converter/file_handling.feature:130  Stream bz2-compressed N-Triples
  features/rdf_converter/file_handling.feature:140  Convert TSV file with standard strategy
  features/rdf_converter/file_handling.feature:147  TSV file creates literal object types
  features/rdf_converter/file_handling.feature:153  TSV file with empty lines and malformed rows
  features/rdf_converter/file_handling.feature:174  Convert GeoNames format file with standard strategy
  features/rdf_converter/file_handling.feature:180  Convert GeoNames format file with streaming strategy
  features/rdf_converter/file_handling.feature:190  Convert GeoNames format file with simple streaming
  features/rdf_converter/parallel_conversion.feature:126  Parallel correctly processes literals with language tags
  features/rdf_converter/parallel_conversion.feature:139  Parallel correctly processes URI objects
  features/rdf_converter/standard_conversion.feature:41  Standard conversion with train/test split
  features/rdf_converter/streaming_conversion.feature:58  Stream gzip-compressed N-Triples file

2 features passed, 6 failed, 0 skipped
91 scenarios passed, 19 failed, 0 skipped
525 steps passed, 19 failed, 18 skipped
Took 0min 4.620s

I can't pass this code until the tests pass.


@ -114,2 +118,4 @@
["error", "error message"]
)
self.last_shard_col = self._find_column_index(
["last completed shard", "last shard", "shard"]
Member

Should this list include "last_completed_shard" and "last_shard"?

Author
Member

fixed !!

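One way to keep that alias list short is to normalize header names before matching, so underscore and space variants hit the same entry. A sketch (illustrative helper, not the branch's actual _find_column_index):

def _normalize(name: str) -> str:
    # "Last_Completed_Shard", "last completed shard", etc. all compare equal.
    return name.strip().lower().replace("_", " ")

def find_column_index(headers: list[str], aliases: list[str]) -> int | None:
    wanted = {_normalize(a) for a in aliases}
    for idx, header in enumerate(headers):
        if _normalize(header) in wanted:
            return idx
    return None

headers = ["Shard URL", "Last_Completed_Shard", "Error Message"]
print(find_column_index(headers, ["last completed shard", "last shard", "shard"]))  # 1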
@ -158,0 +183,4 @@
# Use dynamic range: A to max_column (or Z if no columns found)
if max_col >= 0:
end_col = self._col_letter(max_col + 1) # +1 because _col_letter is 1-indexed
Member

ruff check reports:

scripts/google_sheets_tracker.py:186:89: E501 Line too long (94 > 88)
    |
184 |             # Use dynamic range: A to max_column (or Z if no columns found)
185 |             if max_col >= 0:
186 |                 end_col = self._col_letter(max_col + 1)  # +1 because _col_letter is 1-indexed
    |                                                                                         ^^^^^^ E501
187 |                 range_str = f"{self.worksheet_name}!A:{end_col}"
188 |             else:
    |
Author
Member

fixed !!

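For this particular E501 the simplest fix is moving the trailing comment onto its own line, keeping the statement itself well under 88 characters:

# +1 because _col_letter is 1-indexed
end_col = self._col_letter(max_col + 1)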
@ -370,0 +447,4 @@
)
.execute()
)
values = result.get("values", [])
Member

Aside from the choice of columns, lines 174-200 and lines 417-450 look extremely similar. You can turn them into a function.

Author
Member

fixed !!

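A sketch of the shared helper the reviewer is suggesting, built from the two call sites shown in this PR. The method name is hypothetical; sheet, spreadsheet_id, and worksheet_name are the attributes already used above:

def _get_values(self, range_suffix: str) -> list[list[str]]:
    """Fetch cell values for an A1 range inside the tracker worksheet.

    range_suffix is the part after the sheet name (e.g. "A5:M5" or "A:M"),
    so the header scan and the per-row read can share one code path.
    """
    result = (
        self.sheet.values()
        .get(
            spreadsheetId=self.spreadsheet_id,
            range=f"{self.worksheet_name}!{range_suffix}",
        )
        .execute()
    )
    return result.get("values", [])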
@ -370,0 +433,4 @@
# Use dynamic range: A to max_column (or Z if no columns found)
if max_col >= 0:
end_col = self._col_letter(max_col + 1) # +1 because _col_letter is 1-indexed
Member

ruff check reports:

scripts/google_sheets_tracker.py:436:89: E501 Line too long (94 > 88)
    |
434 |             # Use dynamic range: A to max_column (or Z if no columns found)
435 |             if max_col >= 0:
436 |                 end_col = self._col_letter(max_col + 1)  # +1 because _col_letter is 1-indexed
    |                                                                                         ^^^^^^ E501
437 |                 range_str = f"{self.worksheet_name}!A{row_num}:{end_col}{row_num}"
438 |             else:
    |
Author
Member

fixed !!

@ -370,0 +532,4 @@
)
if self.last_updated_col is not None:
from datetime import datetime
Member

It's usually best to put all imports at the top of the file. (I can't complain too hard; ruff check doesn't include this any longer.)

Author
Member

fixed !!

@ -150,2 +176,4 @@
console = Console()
console.print("[yellow]Streaming Turtle from line iterator[/yellow]")
if skip_triples > 0:
console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:179:89: E501 Line too long (102 > 88)
    |
177 |     console.print("[yellow]Streaming Turtle from line iterator[/yellow]")
178 |     if skip_triples > 0:
179 |         console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^ E501
180 |
181 |     if num_workers is None:
    |
Author
Member

fixed !!

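Most of the E501 hits in this file are long Rich markup strings. Adjacent string literals concatenate in Python, so they can be split without changing the printed output. A self-contained sketch with a hypothetical skip count:

from rich.console import Console

console = Console()
skip_triples = 1_234_567  # hypothetical resume offset

# Each source line stays under 88 characters; the output is still one message.
console.print(
    f"[dim]Resume mode: will skip {skip_triples:,} "
    "already-processed triples[/dim]"
)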
@ -152,0 +179,4 @@
console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")
if num_workers is None:
num_workers = max(1, mp.cpu_count() - 1)
Member

According to multiprocessing.cpu_count():

This number is not equivalent to the number of CPUs the current process can use. The number of usable CPUs can be obtained with os.process_cpu_count() (or len(os.sched_getaffinity(0))).

It might be better to measure the number of usable CPUs instead of the number of CPUs.

Author
Member

fixed !!

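A sketch of counting usable CPUs with portable fallbacks: os.process_cpu_count() only exists on Python 3.13+, and os.sched_getaffinity() is missing on macOS and Windows, hence the chain.

import multiprocessing as mp
import os

def usable_cpu_count() -> int:
    """CPUs this process may actually run on, falling back to the machine total."""
    if hasattr(os, "process_cpu_count"):  # Python 3.13+
        count = os.process_cpu_count()
        if count:
            return count
    try:
        return len(os.sched_getaffinity(0))  # honors the process affinity mask
    except AttributeError:
        return mp.cpu_count()

num_workers = max(1, usable_cpu_count() - 1)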
@ -193,0 +232,4 @@
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
continue
elif total_triples == skip_triples:
console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:235:89: E501 Line too long (112 > 88)
    |
233 |                     continue
234 |                 elif total_triples == skip_triples:
235 |                     console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^ E501
236 |                     continue
    |
Author
Member

fixed !!

@ -199,1 +262,3 @@
console.print(f"[dim]Processed {line_count:,} lines...[/dim]")
if line_count % PROGRESS_LOG_INTERVAL == 0:
if total_triples < skip_triples:
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:264:89: E501 Line too long (107 > 88)
    |
262 |             if line_count % PROGRESS_LOG_INTERVAL == 0:
263 |                 if total_triples < skip_triples:
264 |                     console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^ E501
265 |                 else:
266 |                     console.print(f"[dim]Processed {line_count:,} lines...[/dim]")
    |
Author
Member

fixed !!

@ -159,0 +189,4 @@
# Extract prefix declarations first
buffered_first_line = None
for line in lines:
line_count += 1
Member

It might be easier to use enumerate.

Author
Member

fixed !!

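That is, letting enumerate do the bookkeeping instead of incrementing line_count by hand:

lines = ["@prefix ex: <http://example.org/> .", "ex:a ex:b ex:c ."]

for line_count, line in enumerate(lines, start=1):
    print(line_count, line)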
@ -190,0 +210,4 @@
if buffered_first_line:
yield buffered_first_line
for line in lines:
yield line
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:212:9: UP028 Replace `yield` over `for` loop with `yield from`
    |
210 |           if buffered_first_line:
211 |               yield buffered_first_line
212 | /         for line in lines:
213 | |             yield line
    | |______________________^ UP028
214 |
215 |       # Sequential processing for single worker
    |
    = help: Replace with `yield from`
Author
Member

fixed !!

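The rewrite ruff suggests, sketched with hypothetical data:

def replay_lines(buffered_first_line, lines):
    if buffered_first_line:
        yield buffered_first_line
    yield from lines  # replaces the explicit for/yield loop flagged by UP028

print(list(replay_lines("first", ["second", "third"])))  # ['first', 'second', 'third']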
@ -193,0 +229,4 @@
if total_triples < skip_triples:
if total_triples % PROGRESS_LOG_INTERVAL == 0:
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:232:89: E501 Line too long (111 > 88)
    |
230 |                 if total_triples < skip_triples:
231 |                     if total_triples % PROGRESS_LOG_INTERVAL == 0:
232 |                         console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^ E501
233 |                     continue
234 |                 elif total_triples == skip_triples:
    |
Author
Member

fixed !!

@ -193,0 +238,4 @@
triple_count += 1
if triple_count >= chunk_size:
chunk_text = prefix_text + "\n" + "".join(current_chunk) + "\n" + line
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:241:89: E501 Line too long (90 > 88)
    |
240 |                 if triple_count >= chunk_size:
241 |                     chunk_text = prefix_text + "\n" + "".join(current_chunk) + "\n" + line
    |                                                                                         ^^ E501
242 |                     try:
243 |                         graph = Graph()
    |
Author
Member

fixed !!

@ -221,1 +365,4 @@
"""
console = Console()
if skip_triples > 0:
console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:368:89: E501 Line too long (102 > 88)
    |
366 |     console = Console()
367 |     if skip_triples > 0:
368 |         console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^ E501
369 |
370 |     batch: list[dict] = []
    |
Author
Member

fixed !!

@ -227,0 +379,4 @@
if total_triples < skip_triples:
if total_triples % PROGRESS_LOG_INTERVAL == 0:
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:382:89: E501 Line too long (103 > 88)
    |
380 |         if total_triples < skip_triples:
381 |             if total_triples % PROGRESS_LOG_INTERVAL == 0:
382 |                 console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^^ E501
383 |             continue
384 |         elif total_triples == skip_triples:
    |
Author
Member

fixed !!

@ -227,0 +382,4 @@
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
continue
elif total_triples == skip_triples:
console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:385:89: E501 Line too long (104 > 88)
    |
383 |             continue
384 |         elif total_triples == skip_triples:
385 |             console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
    |                                                                                         ^^^^^^^^^^^^^^^^ E501
386 |             continue
    |
Author
Member

fixed !!

@ -211,0 +274,4 @@
if triples:
yield triples
except Exception as e:
logger.error(f"Error parsing final Turtle chunk: {e}")
Member

Lines 242-252 and lines 270-277 are extremely similar; should they be one method?

Author
Member

fixed !!

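A sketch of folding both chunk-parsing sites into one function, assuming rdflib and the buffered prefix text shown in this PR (names are illustrative):

import logging

from rdflib import Graph

logger = logging.getLogger(__name__)

def parse_turtle_chunk(prefix_text: str, chunk_lines: list[str]) -> list[tuple]:
    """Parse one buffered Turtle chunk; return its triples, or [] on failure."""
    chunk_text = prefix_text + "\n" + "".join(chunk_lines)
    try:
        graph = Graph()
        graph.parse(data=chunk_text, format="turtle")
        return list(graph)
    except Exception as e:
        # The mid-stream and final-chunk paths can now share this error handling.
        logger.error(f"Error parsing Turtle chunk: {e}")
        return []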
@ -211,0 +303,4 @@
if total_triples < skip_triples:
if total_triples % PROGRESS_LOG_INTERVAL == 0:
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:306:89: E501 Line too long (111 > 88)
    |
304 |                 if total_triples < skip_triples:
305 |                     if total_triples % PROGRESS_LOG_INTERVAL == 0:
306 |                         console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^ E501
307 |                     continue
308 |                 elif total_triples == skip_triples:
    |
Author
Member

fixed !!

@ -211,0 +306,4 @@
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
continue
elif total_triples == skip_triples:
console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:309:89: E501 Line too long (112 > 88)
    |
307 |                     continue
308 |                 elif total_triples == skip_triples:
309 |                     console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^ E501
310 |                     continue
    |
Author
Member

fixed !!

@ -211,0 +309,4 @@
console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
continue
sub_chunk_triple_count += 1
Member

There are a lot of similarities between lines 220-238 and 294-312; can they be unified?

Author
Member

fixed !!

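The repeated fast-forward blocks could collapse into one helper that owns the skip-window logic; each loop body then reduces to "if fast_forwarding(...): continue". A sketch, with PROGRESS_LOG_INTERVAL standing in for the script's constant:

PROGRESS_LOG_INTERVAL = 100_000  # assumed value of the script's constant

def fast_forwarding(total_triples: int, skip_triples: int, console) -> bool:
    """Return True while still inside the resume skip window, logging progress."""
    if total_triples < skip_triples:
        if total_triples % PROGRESS_LOG_INTERVAL == 0:
            console.print(
                f"[dim]Fast-forward: {total_triples:,} / "
                f"{skip_triples:,} triples[/dim]"
            )
        return True
    if total_triples == skip_triples:
        console.print(
            f"[green]✓ Skipped {skip_triples:,} triples, "
            "resuming normal parsing[/green]"
        )
        return True
    return False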
@ -211,0 +319,4 @@
sub_chunk_triple_count = 0
if len(sub_chunks) >= batch_size:
for idx, triples in pool.imap(_parse_turtle_chunk, sub_chunks):
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:322:29: B007 Loop control variable `idx` not used within loop body
    |
321 |                     if len(sub_chunks) >= batch_size:
322 |                         for idx, triples in pool.imap(_parse_turtle_chunk, sub_chunks):
    |                             ^^^ B007
323 |                             if triples:
324 |                                 yield triples
    |
    = help: Rename unused `idx` to `_idx`
Author
Member

fixed !!

@ -319,0 +482,4 @@
return []
def _parse_description_batch(args: tuple[int, str, str | None]) -> tuple[int, list[dict]]:
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:485:89: E501 Line too long (90 > 88)
    |
485 | def _parse_description_batch(args: tuple[int, str, str | None]) -> tuple[int, list[dict]]:
    |                                                                                         ^^ E501
486 |     """Parse a Description XML fragment (worker function for multiprocessing).
    |
Author
Member

fixed !!

@ -211,0 +331,4 @@
if line_count % PROGRESS_LOG_INTERVAL == 0:
if total_triples < skip_triples:
console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:334:89: E501 Line too long (107 > 88)
    |
332 |             if line_count % PROGRESS_LOG_INTERVAL == 0:
333 |                 if total_triples < skip_triples:
334 |                     console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^ E501
335 |                 else:
336 |                     console.print(f"[dim]Processed {line_count:,} lines...[/dim]")
    |
Author
Member

fixed !!

@ -211,0 +342,4 @@
# Process remaining batch
if sub_chunks:
for idx, triples in pool.imap(_parse_turtle_chunk, sub_chunks):
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:345:17: B007 Loop control variable `idx` not used within loop body
    |
343 |         # Process remaining batch
344 |         if sub_chunks:
345 |             for idx, triples in pool.imap(_parse_turtle_chunk, sub_chunks):
    |                 ^^^ B007
346 |                 if triples:
347 |                     yield triples
    |
    = help: Rename unused `idx` to `_idx`
Author
Member

fixed !!

@ -340,2 +523,4 @@
console = Console()
console.print("[yellow]Streaming RDF/XML with standard XML parser[/yellow]")
if skip_descriptions > 0:
console.print(f"[dim]Resuming: will skip {skip_descriptions:,} already-processed descriptions[/dim]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:526:89: E501 Line too long (109 > 88)
    |
524 |     console.print("[yellow]Streaming RDF/XML with standard XML parser[/yellow]")
525 |     if skip_descriptions > 0:
526 |         console.print(f"[dim]Resuming: will skip {skip_descriptions:,} already-processed descriptions[/dim]")
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^ E501
527 |
528 |     if num_workers is None:
    |
Author
Member

fixed !!

@ -359,0 +564,4 @@
elem.clear()
if total_descriptions % PROGRESS_LOG_INTERVAL == 0:
console.print(
f"[dim]Skipping: {total_descriptions:,} / {skip_descriptions:,}[/dim]"
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:567:89: E501 Line too long (98 > 88)
    |
565 |                     if total_descriptions % PROGRESS_LOG_INTERVAL == 0:
566 |                         console.print(
567 |                             f"[dim]Skipping: {total_descriptions:,} / {skip_descriptions:,}[/dim]"
    |                                                                                         ^^^^^^^^^^ E501
568 |                         )
569 |                     continue
    |
Author
Member

fixed !!

@ -376,2 +575,2 @@
triple_batch = []
batch_description_count = 0
if len(parse_batch) >= parse_batch_size:
for idx, triples in pool.imap(_parse_description_batch, parse_batch):
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:576:25: B007 Loop control variable `idx` not used within loop body
    |
575 |                 if len(parse_batch) >= parse_batch_size:
576 |                     for idx, triples in pool.imap(_parse_description_batch, parse_batch):
    |                         ^^^ B007
577 |                         if triples:
578 |                             triple_batch.extend(triples)
    |
    = help: Rename unused `idx` to `_idx`

scripts/rdf_to_hf_incremental.py:576:89: E501 Line too long (89 > 88)
    |
575 |                 if len(parse_batch) >= parse_batch_size:
576 |                     for idx, triples in pool.imap(_parse_description_batch, parse_batch):
    |                                                                                         ^ E501
577 |                         if triples:
578 |                             triple_batch.extend(triples)
    |
Author
Member

fixed !!

@ -388,1 +596,4 @@
logger.error("This may indicate malformed XML in the source file")
finally:
if parse_batch:
for idx, triples in pool.imap(_parse_description_batch, parse_batch):
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:599:17: B007 Loop control variable `idx` not used within loop body
    |
597 |     finally:
598 |         if parse_batch:
599 |             for idx, triples in pool.imap(_parse_description_batch, parse_batch):
    |                 ^^^ B007
600 |                 if triples:
601 |                     triple_batch.extend(triples)
    |
    = help: Rename unused `idx` to `_idx`
Author
Member

fixed !!

@ -440,2 +661,3 @@
yield from stream_turtle_chunks_from_lines(lines, chunk_size, skip_triples, num_workers)
elif format in ("xml", "rdf", "rdfxml"):
yield from stream_rdfxml_chunks_from_lines(lines, chunk_size=chunk_size)
yield from stream_rdfxml_chunks_from_lines(lines, chunk_size=chunk_size, skip_descriptions=skip_descriptions, num_workers=num_workers)
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:663:89: E501 Line too long (142 > 88)
    |
661 | …ines, chunk_size, skip_triples, num_workers)
662 | …
663 | …ines, chunk_size=chunk_size, skip_descriptions=skip_descriptions, num_workers=num_workers)
    |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
664 | …
665 | …
    |
Author
Member

fixed !!

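For long calls like this the standard fix is one keyword argument per line:

yield from stream_rdfxml_chunks_from_lines(
    lines,
    chunk_size=chunk_size,
    skip_descriptions=skip_descriptions,
    num_workers=num_workers,
)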
@ -438,2 +659,3 @@
yield from stream_ntriples_from_lines(lines, skip_triples)
elif format in ("turtle", "ttl"):
yield from stream_turtle_chunks_from_lines(lines, chunk_size=chunk_size)
yield from stream_turtle_chunks_from_lines(lines, chunk_size, skip_triples, num_workers)
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:661:89: E501 Line too long (96 > 88)
    |
659 |         yield from stream_ntriples_from_lines(lines, skip_triples)
660 |     elif format in ("turtle", "ttl"):
661 |         yield from stream_turtle_chunks_from_lines(lines, chunk_size, skip_triples, num_workers)
    |                                                                                         ^^^^^^^^ E501
662 |     elif format in ("xml", "rdf", "rdfxml"):
663 |         yield from stream_rdfxml_chunks_from_lines(lines, chunk_size=chunk_size, skip_descriptions=skip_descriptions, num_workers=num…
    |
Author
Member

fixed !!

@ -638,3 +864,3 @@
chunk_iter = stream_rdf_chunks_from_lines(
lines, format=rdf_format, chunk_size=DEFAULT_BATCH_SIZE
lines, rdf_format, DEFAULT_BATCH_SIZE, skip_descriptions, skip_triples, num_workers
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:866:89: E501 Line too long (95 > 88)
    |
865 |         chunk_iter = stream_rdf_chunks_from_lines(
866 |             lines, rdf_format, DEFAULT_BATCH_SIZE, skip_descriptions, skip_triples, num_workers
    |                                                                                         ^^^^^^^ E501
867 |         )
868 |         return chunk_iter, source_name
    |
Author
Member

fixed !!

@ -1459,0 +1730,4 @@
if checkpoint_source:
console.print(f"[yellow]Resuming from shard {start_shard} (loaded from {checkpoint_source})[/yellow]")
if skip_descriptions > 0:
console.print(f"[yellow]Will skip {skip_descriptions:,} already-processed descriptions[/yellow]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:1733:89: E501 Line too long (109 > 88)
     |
1731 |         console.print(f"[yellow]Resuming from shard {start_shard} (loaded from {checkpoint_source})[/yellow]")
1732 |         if skip_descriptions > 0:
1733 |             console.print(f"[yellow]Will skip {skip_descriptions:,} already-processed descriptions[/yellow]")
     |                                                                                         ^^^^^^^^^^^^^^^^^^^^^ E501
1734 |         if resume_total_rows > 0:
1735 |             console.print(f"[yellow]Continuing from row {resume_total_rows:,}[/yellow]")
     |
Author
Member

fixed !!

@ -1459,0 +1728,4 @@
checkpoint_source = "sheet"
if checkpoint_source:
console.print(f"[yellow]Resuming from shard {start_shard} (loaded from {checkpoint_source})[/yellow]")
Member

ruff check reports:

scripts/rdf_to_hf_incremental.py:1731:89: E501 Line too long (110 > 88)
     |
1730 |     if checkpoint_source:
1731 |         console.print(f"[yellow]Resuming from shard {start_shard} (loaded from {checkpoint_source})[/yellow]")
     |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^ E501
1732 |         if skip_descriptions > 0:
1733 |             console.print(f"[yellow]Will skip {skip_descriptions:,} already-processed descriptions[/yellow]")
     |
Author
Member

fixed !!

This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin xml-streaming-optimization:xml-streaming-optimization
git switch xml-streaming-optimization

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch auto-excel-update
git merge --no-ff xml-streaming-optimization
git push origin auto-excel-update
Reference
cleverdatasets/dataset-uploader!49