xml-streaming-optimization #49
No reviewers · No labels · No milestone · No project · No assignees · 3 participants
Reference: cleverdatasets/dataset-uploader!49
No description provided.
This branch contains optimized code that skips description parsing during resume and instead only counts descriptions, which avoids the overhead of re-parsing the entire dataset. It also parallelizes description parsing across worker processes and updates the checkpoint through the tracker sheet.
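As a rough sketch of the resume path described above: the streamer can count rdf:Description elements with xml.etree.ElementTree.iterparse and discard their contents until the checkpoint is reached, rather than fully parsing each one. The helper name and namespace constant below are illustrative assumptions, not the PR's exact code.

    # Illustrative sketch only; fast_forward and RDF_NS are assumed names.
    import xml.etree.ElementTree as ET

    RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

    def fast_forward(source, skip_descriptions: int):
        """Skip already-processed rdf:Description elements by counting them,
        then yield the remaining elements for full parsing."""
        seen = 0
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag != RDF_NS + "Description":
                continue
            seen += 1
            if seen <= skip_descriptions:
                elem.clear()  # free memory; contents are never inspected
                continue
            yield elem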
@@ -370,0 +406,4 @@
    self.sheet.values().get(
        spreadsheetId=self.spreadsheet_id,
        range=f"{self.worksheet_name}!A{row_num}:Z{row_num}",

Hardcoding A and Z columns looks suspicious - is this just a conservative estimate, or do we know that the data we want is in this subset of the sheet?

Fixed!
When I run bandit features on this code, I get the following summary:

I can't pass this code until the tests pass.
@@ -114,2 +118,4 @@
    ["error", "error message"])
    self.last_shard_col = self._find_column_index(
        ["last completed shard", "last shard", "shard"]

Should this list include "last_completed_shard" and "last_shard"?

Fixed!
@@ -158,0 +183,4 @@
    # Use dynamic range: A to max_column (or Z if no columns found)
    if max_col >= 0:
        end_col = self._col_letter(max_col + 1)  # +1 because _col_letter is 1-indexed

ruff check reports:

Fixed!
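The diff's comment implies _col_letter maps a 1-indexed column number to its A1-notation letter; a minimal sketch of that conversion, assuming that behavior:

    # Sketch assuming _col_letter is 1-indexed: 1 -> "A", 26 -> "Z", 27 -> "AA".
    def _col_letter(n: int) -> str:
        letters = ""
        while n > 0:
            n, rem = divmod(n - 1, 26)
            letters = chr(ord("A") + rem) + letters
        return letters

    # With a 0-indexed max_col of 4, _col_letter(max_col + 1) gives "E",
    # so the range becomes "A{row}:E{row}" instead of a hardcoded "A:Z".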
@@ -370,0 +447,4 @@
    ).execute())
    values = result.get("values", [])

Aside from the choice of columns, lines 174-200 and lines 417-450 look extremely similar. You can turn them into a function.

Fixed!
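One way the duplication could be factored out, sketched under the assumption that both call sites read a single row via the Sheets API (attribute names mirror the diff, but the helper itself is hypothetical):

    # Hypothetical helper; self.sheet is a Google Sheets API values resource.
    def _get_row_values(self, row_num: int, end_col: str) -> list[str]:
        """Fetch one tracker-sheet row as a flat list of cell values."""
        result = (
            self.sheet.values()
            .get(
                spreadsheetId=self.spreadsheet_id,
                range=f"{self.worksheet_name}!A{row_num}:{end_col}{row_num}",
            )
            .execute()
        )
        values = result.get("values", [])
        return values[0] if values else []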
@@ -370,0 +433,4 @@
    # Use dynamic range: A to max_column (or Z if no columns found)
    if max_col >= 0:
        end_col = self._col_letter(max_col + 1)  # +1 because _col_letter is 1-indexed

ruff check reports:

Fixed!
@@ -370,0 +532,4 @@
    )
    if self.last_updated_col is not None:
        from datetime import datetime

It's usually best to put all imports at the top of the file. (I can't complain too hard; ruff check doesn't include this any longer.)

Fixed!
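The fix presumably just hoists the import to module scope, roughly:

    # Sketch of the suggested fix: import once at module top instead of
    # inside the checkpoint-update method. The helper name is hypothetical.
    from datetime import datetime

    def _timestamp() -> str:
        return datetime.now().isoformat()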
@@ -150,2 +176,4 @@
    console = Console()
    console.print("[yellow]Streaming Turtle from line iterator[/yellow]")
    if skip_triples > 0:
        console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")

ruff check reports:

Fixed!
@@ -152,0 +179,4 @@
    console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")
    if num_workers is None:
        num_workers = max(1, mp.cpu_count() - 1)

According to the multiprocessing.cpu_count() documentation, it might be better to measure the number of usable CPUs instead of the total number of CPUs.

Fixed!
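A sketch of the suggested measurement: os.sched_getaffinity(0) reports the CPUs the current process may actually run on, though it is not available on every platform, hence the fallback below.

    import multiprocessing as mp
    import os

    def usable_cpu_count() -> int:
        """Prefer the CPUs this process may use over the machine total."""
        try:
            return len(os.sched_getaffinity(0))
        except AttributeError:  # not available on e.g. macOS or Windows
            return mp.cpu_count()

    num_workers = max(1, usable_cpu_count() - 1)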
@@ -193,0 +232,4 @@
        console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
        continue
    elif total_triples == skip_triples:
        console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")

ruff check reports:

Fixed!
@@ -199,1 +262,3 @@
    console.print(f"[dim]Processed {line_count:,} lines...[/dim]")
    if line_count % PROGRESS_LOG_INTERVAL == 0:
        if total_triples < skip_triples:
            console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")

ruff check reports:

Fixed!
@@ -159,0 +189,4 @@
    # Extract prefix declarations first
    buffered_first_line = None
    for line in lines:
        line_count += 1

It might be easier to use enumerate.

Fixed!
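That is, letting enumerate carry the counter instead of incrementing line_count by hand, roughly:

    # Sketch of the enumerate suggestion (the sample input is illustrative).
    lines = ["@prefix ex: <http://example.org/> .", "ex:a ex:b ex:c ."]
    buffered_first_line = None
    for line_count, line in enumerate(lines, start=1):
        if buffered_first_line is None:
            buffered_first_line = line  # mirrors the diff's first-line buffering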
@@ -190,0 +210,4 @@
    if buffered_first_line:
        yield buffered_first_line
    for line in lines:
        yield line

ruff check reports:

Fixed!
@@ -193,0 +229,4 @@
    if total_triples < skip_triples:
        if total_triples % PROGRESS_LOG_INTERVAL == 0:
            console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")

ruff check reports:

Fixed!
@@ -193,0 +238,4 @@
    triple_count += 1
    if triple_count >= chunk_size:
        chunk_text = prefix_text + "\n" + "".join(current_chunk) + "\n" + line

ruff check reports:

Fixed!
@@ -221,1 +365,4 @@
    """
    console = Console()
    if skip_triples > 0:
        console.print(f"[dim]Resume mode: will skip {skip_triples:,} already-processed triples[/dim]")

ruff check reports:

Fixed!
@@ -227,0 +379,4 @@
    if total_triples < skip_triples:
        if total_triples % PROGRESS_LOG_INTERVAL == 0:
            console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")

ruff check reports:

Fixed!
@@ -227,0 +382,4 @@
        console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
        continue
    elif total_triples == skip_triples:
        console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")

ruff check reports:

Fixed!
@@ -211,0 +274,4 @@
        if triples:
            yield triples
    except Exception as e:
        logger.error(f"Error parsing final Turtle chunk: {e}")

Lines 242-252 and lines 270-277 are extremely similar; should they be one method?

Fixed!
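One possible unification, sketched with a stub standing in for the module's real chunk parser:

    import logging

    logger = logging.getLogger(__name__)

    def _parse_turtle_chunk(chunk_text: str) -> list[tuple]:
        return []  # stub for the module's real parser

    def _flush_chunk(chunk_text: str, label: str = "final Turtle chunk"):
        """Parse one assembled chunk, yield its triples, log any failure."""
        try:
            triples = _parse_turtle_chunk(chunk_text)
            if triples:
                yield triples
        except Exception as e:
            logger.error(f"Error parsing {label}: {e}")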
@@ -211,0 +303,4 @@
    if total_triples < skip_triples:
        if total_triples % PROGRESS_LOG_INTERVAL == 0:
            console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")

ruff check reports:

Fixed!
@@ -211,0 +306,4 @@
        console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
        continue
    elif total_triples == skip_triples:
        console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")

ruff check reports:

Fixed!
@@ -211,0 +309,4 @@
        console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
        continue
    sub_chunk_triple_count += 1

There are a lot of similarities between lines 220-238 and 294-312; can they be unified?

Fixed!
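The fast-forward bookkeeping repeated across several hunks above could plausibly collapse into one helper like the following sketch (the helper name is hypothetical, and PROGRESS_LOG_INTERVAL's value is a placeholder):

    from rich.console import Console

    PROGRESS_LOG_INTERVAL = 100_000  # placeholder value
    console = Console()

    def still_skipping(total_triples: int, skip_triples: int) -> bool:
        """True while triples from a previous run are being fast-forwarded."""
        if total_triples < skip_triples:
            if total_triples % PROGRESS_LOG_INTERVAL == 0:
                console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")
            return True
        if total_triples == skip_triples:
            console.print(f"[green]✓ Skipped {skip_triples:,} triples, resuming normal parsing[/green]")
            return True  # the boundary triple was processed in the previous run
        return False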
@@ -211,0 +319,4 @@
    sub_chunk_triple_count = 0
    if len(sub_chunks) >= batch_size:
        for idx, triples in pool.imap(_parse_turtle_chunk, sub_chunks):

ruff check reports:

Fixed!
@@ -319,0 +482,4 @@
        return []

    def _parse_description_batch(args: tuple[int, str, str | None]) -> tuple[int, list[dict]]:

ruff check reports:

Fixed!
@@ -211,0 +331,4 @@
    if line_count % PROGRESS_LOG_INTERVAL == 0:
        if total_triples < skip_triples:
            console.print(f"[dim]Fast-forward: {total_triples:,} / {skip_triples:,} triples[/dim]")

ruff check reports:

Fixed!
@@ -211,0 +342,4 @@
    # Process remaining batch
    if sub_chunks:
        for idx, triples in pool.imap(_parse_turtle_chunk, sub_chunks):

ruff check reports:

Fixed!
@@ -340,2 +523,4 @@
    console = Console()
    console.print("[yellow]Streaming RDF/XML with standard XML parser[/yellow]")
    if skip_descriptions > 0:
        console.print(f"[dim]Resuming: will skip {skip_descriptions:,} already-processed descriptions[/dim]")

ruff check reports:

Fixed!
@@ -359,0 +564,4 @@
    elem.clear()
    if total_descriptions % PROGRESS_LOG_INTERVAL == 0:
        console.print(f"[dim]Skipping: {total_descriptions:,} / {skip_descriptions:,}[/dim]"

ruff check reports:

Fixed!
@@ -376,2 +575,2 @@
    triple_batch = []
    batch_description_count = 0
    if len(parse_batch) >= parse_batch_size:
        for idx, triples in pool.imap(_parse_description_batch, parse_batch):

ruff check reports:

Fixed!
@@ -388,1 +596,4 @@
    logger.error("This may indicate malformed XML in the source file")
    finally:
        if parse_batch:
            for idx, triples in pool.imap(_parse_description_batch, parse_batch):

ruff check reports:

Fixed!
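The batch/pool pattern these hunks show, reduced to a runnable sketch; only the worker's signature comes from the diff, while the stub body and the tuple-field names are assumptions:

    import multiprocessing as mp

    def _parse_description_batch(args: tuple[int, str, str | None]) -> tuple[int, list[dict]]:
        idx, xml_text, base_uri = args  # field names are assumed
        return idx, []  # stub for the module's real description parser

    if __name__ == "__main__":
        parse_batch = [(0, "<rdf:Description/>", None), (1, "<rdf:Description/>", None)]
        with mp.Pool(processes=2) as pool:
            # imap preserves batch order while parsing in parallel
            for idx, triples in pool.imap(_parse_description_batch, parse_batch):
                print(idx, len(triples))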
@@ -440,2 +661,3 @@
     yield from stream_turtle_chunks_from_lines(lines, chunk_size, skip_triples, num_workers)
 elif format in ("xml", "rdf", "rdfxml"):
-    yield from stream_rdfxml_chunks_from_lines(lines, chunk_size=chunk_size)
+    yield from stream_rdfxml_chunks_from_lines(lines, chunk_size=chunk_size, skip_descriptions=skip_descriptions, num_workers=num_workers)

ruff check reports:

Fixed!
@@ -438,2 +659,3 @@
     yield from stream_ntriples_from_lines(lines, skip_triples)
 elif format in ("turtle", "ttl"):
-    yield from stream_turtle_chunks_from_lines(lines, chunk_size=chunk_size)
+    yield from stream_turtle_chunks_from_lines(lines, chunk_size, skip_triples, num_workers)

ruff check reports:

Fixed!
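Taken together, the two hunks above imply a dispatcher shaped roughly like this; the signature is inferred from the call sites, and stub generators stand in for the real per-format streamers:

    def stream_ntriples_from_lines(lines, skip_triples=0):
        yield from ()  # stub

    def stream_turtle_chunks_from_lines(lines, chunk_size, skip_triples=0, num_workers=None):
        yield from ()  # stub

    def stream_rdfxml_chunks_from_lines(lines, chunk_size, skip_descriptions=0, num_workers=None):
        yield from ()  # stub

    def stream_rdf_chunks_from_lines(lines, format, chunk_size,
                                     skip_descriptions=0, skip_triples=0, num_workers=None):
        """Route to the per-format streamer, forwarding the resume counters."""
        if format in ("nt", "ntriples"):
            yield from stream_ntriples_from_lines(lines, skip_triples)
        elif format in ("turtle", "ttl"):
            yield from stream_turtle_chunks_from_lines(lines, chunk_size, skip_triples, num_workers)
        elif format in ("xml", "rdf", "rdfxml"):
            yield from stream_rdfxml_chunks_from_lines(
                lines, chunk_size=chunk_size,
                skip_descriptions=skip_descriptions, num_workers=num_workers)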
@@ -638,3 +864,3 @@
     chunk_iter = stream_rdf_chunks_from_lines(
-        lines, format=rdf_format, chunk_size=DEFAULT_BATCH_SIZE
+        lines, rdf_format, DEFAULT_BATCH_SIZE, skip_descriptions, skip_triples, num_workers

ruff check reports:

Fixed!
@@ -1459,0 +1730,4 @@
    if checkpoint_source:
        console.print(f"[yellow]Resuming from shard {start_shard} (loaded from {checkpoint_source})[/yellow]")
    if skip_descriptions > 0:
        console.print(f"[yellow]Will skip {skip_descriptions:,} already-processed descriptions[/yellow]")

ruff check reports:

Fixed!
@@ -1459,0 +1728,4 @@
    checkpoint_source = "sheet"
    if checkpoint_source:
        console.print(f"[yellow]Resuming from shard {start_shard} (loaded from {checkpoint_source})[/yellow]")

ruff check reports:

Fixed!