unit-tests-google-tracker-sheet #48

aditya · 2026-01-12T13:04:50Z

aditya commented

2026-01-12 13:04:50 +00:00

This branch contains the behave bdd tests for google_sheets_tracker.py and also for upload_all_datasets.py

aditya added 87 commits

2026-01-12 13:04:51 +00:00

Mostly about convert_conceptnet_progress. b16eddbe71

convert_conceptnet_to_hf. c9c6bff859

convert_fb15k237_to_hf 927a8648f6

convert_nell995_to_hf c058301d1c

convert_rdf_progress ec1211ba51

convert_rdf_streaming_progress 4636770092

Fixes made by ruff format. 7f50361683

feat: enhance error handling and reporting in dataset processing scripts e3bee8af4f

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

feat: improve error handling and logging in RDF dataset downloader e1cabc7069

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

feat: improve error handling for FB15k-237 dataset conversion 576dc842a9

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

fix: correct indentation in FB15k-237 file parsing function 163268b40d

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

refactor: improve error handling for NELL-995 dataset conversion 3862bf2655

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

fix: correct indentation in NELL-995 dataset converter script e6e1932f66

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

refactor: improve error handling and logging in ConceptNet conversion script 24869bef6c

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

fix: correct indentation in ConceptNet CSV parsing function e7bfa3f2c9

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

feat: add comprehensive error handling for RDF to HuggingFace dataset conversion 67a9857ab3

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

One more set of changes for error messages. 69de631c84

Merge branch '16-error-messages-and-logging' into mypy-and-pylint aa069f7527

Changes made by ruff check --fix. c5707689d6

fix: fix ruff check scripts/ errors 7414b333f8

Changes made to make the downloader work. c776025b0f

Refactor: Extract helper functions to remove code duplication a4eca8bcaa

Fix: Pyright type checking errors and improve type annotations 001234afcf

One-half of the conversion is done, and the structure is written. 6b73289c24

Small updates. 33e004a08e

Feat: add initial Dockerfile and compose file for the repo bd44f34ade

upload_all_datasets.2.py written. Will run overnight. 1d5e791963

Minor changes; passing to Aditya. 59e1d8d8bd

Fix: fix bugs in dataset upload and conversion scripts db7733d630

feat(devcontainer): optimize for PyCharm with workspace path fixes 1c73906ea2

Added some fixes to run the wikidad-truthy set. 3d2a9b75a2

Please enter the commit message for your changes. Lines starting

Parallelized the call to dataset_dict.save_to_disk. 68587e5e6c

Replacing upload_all_datasets.py with the previous 9534336c15

upload_all_datasets.2.py

fix(docker): prevent container restart loop 0e17022093

chore: correct the project name 3b57467d9e

Changes requested by Kyle; thank you! a6928d0ce2

Final version; what was used to send wikidata-truthy. 37b3ee6e91

feat: add download via ftp functionality a9b6518839

fix: resolve resume functionality crash for complete files 2342d2dada

Minor improvements. bac71ff1cc

style: simplify string formatting and type hints in script 0f3e5f2e30

refactor: implement true streaming approach for RDF to dataset conversion 29c71c5aea

Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>

fix: remove undefined variables and redundant code in dataset conversion script eba86c37be

Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>

More changes. 18a724eef9

fix: resolve pickling error in streaming converter generator 5a3b22d47f

fix: fix minor linting errors efa08d5c7d

fix: fix ruff check related errors 1e95cb9a91

THIS DOES NOT WORK. 4da8793b18

fix: add hugging face push_to_hub()to upload the dataset directly to hugging face 0a9e8da028

Added .xz compression to code. 8bef6986ea

Fixing dataset_config to download one file. 6f12a78c2c

fix: correct xz decompression command and update error messages e0b0836ffc

Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>

fix: update error message for missing xz command 757d6dd7f8

Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>

Fixes suggested by Aider. 0784e3e275

Merge branch '35-xz-compression-aider' into 35-xz-compression 1df9fcbdf6

fix: improve RDF format detection and parsing for UniProt dataset ff3939d937

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

feat: add rdf_to_hf_incremental.py script for true streamingof entire pipeline b218ab1410

Added dataset_config. 16fd345664

Fixing dataset_config to download one file. a51f1f3703

feat: add script to insert entries into dataset_config.json a01336cd15

Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>

Added all appropriate files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/ 2ea6f3a89e

Removed the quickly-written scripts to edit dataset_config. c152323097

fix: Address code review feedback - deduplication, format support, and error handling improvements 711ee61f9a

feat: add --list --dry-run dataset card feature d5ac95f2aa

Merge branch '35-xz-compression' into stream-download-convert-upload d564739f72

Merge branch '7-dockerfile-setup' into stream-download-convert-upload bbc0d61039

Merge branch '8-devcontainer-setup' into stream-download-convert-upload c487b49f11

Adding the features file. eb8bfdbee5

feat: unify RDF to HF dataset conversion with improved strategies a7f7d279b4

Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>

Moved the new unified code. 02f740ab81

fix: complete the unified script to add add the new features as streaming parallel; update the upload_all_dataset.py to use the unified script 5b215aff76

Merge pull request 'fix: complete the unified script to add add the new features as streaming parallel; update the upload_all_dataset.py to use the unified script' (#41 ) from unified-final into stream-download-convert-upload-merge-15-refactor-rdf-converters 7c93adddc1

Reviewed-on: #41
Reviewed-by: Brent Edwards <brent.edwards@cleverthis.com>

fix: fix path too long issue; fix pickling issue in unified converter script 1bb26305ce

A few minor fixes. 1d33b67198

More work toward pyright approval. e7a540aeb3

fix: fix chunksize argument in multiprocessing as per review 90d77a7077

fix: fix ruff check errors 2b1537d4f9

Adding shell.nix. fbaa30e904

Updating pyproject to require Python 3.10. da9f785326

Minor changes. f8a7739016

Merge pull request 'fix: fix path too long issue; fix pickling issue in unified converter script' (#42 ) from path-too-long-fix into stream-download-convert-upload-merge-15-refactor-rdf-converters 76de466d99

Reviewed-on: #42
Reviewed-by: Brent Edwards <brent.edwards@cleverthis.com>

feat: integrate excel integration using gcp 0aad92eae5

feat: add devcontainer setup for vscode 23bd3eb9f0

fix: fix push to hub in unified converter script 734a58c9c1

fix: fix issues as per review feedback ae3f5bd791

fix: fix sheet updation logic; update README.md for pycharm devcontainer ef924a1aa1

tests: add comprehensive behave bdd tests for google sheet tracker module and upload all datasets module 6af250fc78

aditya changed target branch from master to auto-excel-update

2026-01-12 13:07:25 +00:00

aditya changed title from ~~WIP: unit-tests-google-tracker-sheet~~ to unit-tests-google-tracker-sheet

2026-01-12 13:07:48 +00:00

CoreRasurae reviewed

2026-01-13 21:44:21 +00:00

CoreRasurae left a comment

First-time contributor

I have one comment that i think might be relevant to address.

features/steps/google_sheets_steps.py Outdated

					
				@ -0,0 +143,4 @@

				            context.auth_result = context.tracker.authenticate()

				        except Exception as e:

				            context.auth_result = False

				            context.error = str(e)

I am not sure if i am okay with steps like these that are so generic... i mean, it can silently hide the true failure reason as it tries so many different things depending on the context. i think this step should be split into more fine grained steps that do well defined things and easier to manually check when something fails.

fixed !

aditya force-pushed unit-tests-google-tracker-sheet from 6af250fc78 to 7f6594d99a

2026-01-19 11:57:35 +00:00

Compare

aditya added 1 commit

2026-01-20 13:43:08 +00:00

test: add more behave tests 94a16e06f3

brent.edwards requested changes

2026-02-02 22:08:57 +00:00

brent.edwards left a comment

A lot of little changes. I didn't have time to read the /features/steps files.

A lot of little changes. I didn't have time to read the `/features/steps` files.

features/google_sheets_tracker/column_operations.feature Outdated

					
				@ -0,0 +44,4 @@

				  Scenario: Column letter conversion for column AB

				    When I convert column number 28 to letter

				    Then the column letter result should be "AB"

The last four scenarios are duplicated in features/google_sheets_tracker/column_letter_conversion.feature , and can be removed.

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/convenience_methods.feature

					
				@ -0,0 +41,4 @@

				  Scenario: Convenience methods return success status

				    When I mark "wordnet" as in progress

				    Then the mark_in_progress method should return True

I would be more comfortable if this test added something like

And the mark_completed method should return False
And the mark_error method should return False
And the reset_to_not_started method should return False

and the negative tests for the next three tests as well.

I would be more comfortable if this test added something like ``` And the mark_completed method should return False And the mark_error method should return False And the reset_to_not_started method should return False ``` and the negative tests for the next three tests as well.

fixed !!

My apologies if I weren't clear enough.

The current lines 42-53 of convenience_methods.feature are:

  Scenario: Convenience methods return success status
    When I mark "wordnet" as in progress
    Then the mark_in_progress method should return True

    When I mark "wordnet" as completed
    Then the mark_completed method should return True

    When I mark "wordnet" with error "Test error"
    Then the mark_error method should return True

    When I reset "wordnet" to not started
    Then the reset_to_not_started method should return True

I would be more comfortable if the above were replaced with something like

  Scenario: Convenience methods return success status
    When I mark "wordnet" as in progress
    Then the mark_in_progress method should return True
    And the mark_completed method should return False
    And the mark_error method should return False
    And the reset_to_not_started method should return False

    When I mark "wordnet" as completed
    Then the mark_in_progress method should return False
    And the mark_completed method should return True
    And the mark_error method should return False
    And the reset_to_not_started method should return False

    When I mark "wordnet" with error "Test error"
    Then the mark_in_progress method should return False
    And the mark_completed method should return False
    And the mark_error method should return True
    And the reset_to_not_started method should return False

    When I reset "wordnet" to not started
    Then the mark_in_progress method should return False
    And the mark_completed method should return False
    And the mark_error method should return False
    And the reset_to_not_started method should return True

My apologies if I weren't clear enough. The current lines 42-53 of `convenience_methods.feature` are: ``` Scenario: Convenience methods return success status When I mark "wordnet" as in progress Then the mark_in_progress method should return True When I mark "wordnet" as completed Then the mark_completed method should return True When I mark "wordnet" with error "Test error" Then the mark_error method should return True When I reset "wordnet" to not started Then the reset_to_not_started method should return True ``` I would be more comfortable if the above were replaced with something like ``` Scenario: Convenience methods return success status When I mark "wordnet" as in progress Then the mark_in_progress method should return True And the mark_completed method should return False And the mark_error method should return False And the reset_to_not_started method should return False When I mark "wordnet" as completed Then the mark_in_progress method should return False And the mark_completed method should return True And the mark_error method should return False And the reset_to_not_started method should return False When I mark "wordnet" with error "Test error" Then the mark_in_progress method should return False And the mark_completed method should return False And the mark_error method should return True And the reset_to_not_started method should return False When I reset "wordnet" to not started Then the mark_in_progress method should return False And the mark_completed method should return False And the mark_error method should return False And the reset_to_not_started method should return True ```

I apologize, I incorrectly replied to this comment as fixed.

I don't think the test you suggested matches our real-world use cases. Here's why:

Actually, calling these methods sequentially does happen in production:

Script Interruption (Ctrl+C): First mark_in_progress is called, status becomes IN PROGRESS. Then user presses Ctrl+C. Now reset_to_not_started must work and return True to reset the dataset. If it returns False, the dataset stays stuck as IN PROGRESS forever.

Processing Error: First mark_in_progress is called successfully. Then download fails immediately. Now mark_error must work and return True to record what went wrong. If it returns False, we can't record the error.

Successful Completion: First mark_in_progress is called. Processing succeeds. Then mark_completed must work and return True to mark it done. If it returns False, we can't mark the work as complete.
These are all valid workflows that require methods to work together, not exclude each other. The test expectations would break our error handling and recovery mechanisms.

Moreover, google sheet tracker module is not used independently, its only being used with the automation scripts like rdf_to_hf_incremental.py and upload_all_datasets.py which use the validation method check_should_process before calling any of these convenience methods.

I apologize, I incorrectly replied to this comment as fixed. I don't think the test you suggested matches our real-world use cases. Here's why: Actually, calling these methods sequentially does happen in production: Script Interruption (Ctrl+C): First mark_in_progress is called, status becomes IN PROGRESS. Then user presses Ctrl+C. Now reset_to_not_started must work and return True to reset the dataset. If it returns False, the dataset stays stuck as IN PROGRESS forever. Processing Error: First mark_in_progress is called successfully. Then download fails immediately. Now mark_error must work and return True to record what went wrong. If it returns False, we can't record the error. Successful Completion: First mark_in_progress is called. Processing succeeds. Then mark_completed must work and return True to mark it done. If it returns False, we can't mark the work as complete. These are all valid workflows that require methods to work together, not exclude each other. The test expectations would break our error handling and recovery mechanisms. Moreover, google sheet tracker module is not used independently, its only being used with the automation scripts like rdf_to_hf_incremental.py and upload_all_datasets.py which use the validation method check_should_process before calling any of these convenience methods.

features/google_sheets_tracker/error_handling.feature

					
				@ -0,0 +26,4 @@

				      | Dataset Name | Status      | Error |

				      | wordnet      | NOT STARTED |       |

				    When I attempt to mark "nonexistent" as in progress

				    Then the update should fail

This test seems very close to features/google_sheets_tracker/convenience_methods.feature line 56.

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/error_handling.feature Outdated

					
				@ -0,0 +49,4 @@

				    Given an authenticated sheets tracker

				    And a sheet with empty rows

				    When I get the status of "wordnet"

				    Then the status should be None

This test seems almost identical to features/google_sheets_tracker/error_handling.feature line 16.

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/error_handling.feature Outdated

					
				@ -0,0 +68,4 @@

				    And a sheet with only one row (headers only)

				    When I get the status of "wordnet"

				    Then the status should be None

				    And the row should be None

This test also seems almost identical to features/google_sheets_tracker/error_handling.feature line 16.

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/error_handling.feature Outdated

					
				@ -0,0 +76,4 @@

				      | Dataset Name | Status      | Error |

				      | wordnet      | NOT STARTED |       |

				    When I attempt to update status for "nonexistent-dataset"

				    Then the update should return False

This test is almost the same as features/google_sheets_tracker/convenience_methods.feature lines 56-63.

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/header_loading.feature

					
				@ -0,0 +25,4 @@

				    And a sheet with headers "ID, Status, Error"

				    When the tracker loads headers

				    Then dataset_name_col should be identified as 0

				    And status_col should be 1

Why isn't error checked here?

Why isn't `error` checked here?

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/header_loading.feature

					
				@ -0,0 +32,4 @@

				    And a sheet with headers "Dataset Name, State, Error"

				    When the tracker loads headers

				    Then dataset_name_col should be 0

				    And status_col should be identified as 1

Why isn't error checked here?

Why isn't `error` checked here?

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/header_loading.feature

					
				@ -0,0 +39,4 @@

				    And a sheet with headers "Dataset Name, Status, Error Message"

				    When the tracker loads headers

				    Then dataset_name_col should be 0

				    And error_col should be identified as 2

Why isn't status_col checked here?

Why isn't `status_col` checked here?

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/processing_checks.feature

					
				@ -0,0 +45,4 @@

				  Scenario: Process when tracker not authenticated

				    Given an unauthenticated sheets tracker

				    When I check if "wordnet" should be processed

				    Then the decision should be to process

"should be" to process? Is it possible that this is in the wrong direction?

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/status_operations.feature Outdated

					
				@ -0,0 +50,4 @@

				  Scenario: Error message truncation

				    When I reset "wordnet" to not started with error message longer than 500 characters

				    Then the error for "wordnet" should be at most 500 characters

This test looks similar to features/google_sheets_tracker/error_message_handling.feature line 12.

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/update_status_edge_cases.feature

					
				@ -0,0 +55,4 @@

				      | wordnet      | NOT STARTED |       |

				    When I attempt to update status for "nonexistent"

				    Then the update should fail gracefully

				    And no exception should be raised

This test seems very close to features/google_sheets_tracker/convenience_methods.feature line 56.

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/update_status_edge_cases.feature

					
				@ -0,0 +62,4 @@

				      | Dataset Name | Status      | Error |

				      | wordnet      | NOT STARTED |       |

				    When I attempt to update status with very long value

				    Then the update should handle the long value correctly

In other places, the limit was 500 characters. Is that also true here?

fixed !!

brent.edwards marked this conversation as resolved

features/google_sheets_tracker/update_status_real_api.feature

					
				@ -0,0 +7,4 @@

				    Given an authenticated sheets tracker

				  Scenario: Update status with only status column present (no error column)

				    Given a sheet with headers "Dataset Name, Status"

The Given should probably say something about this sheet being real not virtual. This test looks very similar to features/google_sheets_tracker/update_status_edge_cases.feature lines 10-16, so I would have expected the setup to be the same... but only at line 14 does it say that the update is with a real API call.

The same should be done for the rest of the tests. Compare them against update_status_edge_cases and make sure that there is a difference in the Given statement.

The `Given` should probably say something about this sheet being real not virtual. This test looks very similar to features/google_sheets_tracker/update_status_edge_cases.feature lines 10-16, so I would have expected the setup to be the same... but only at line 14 does it say that the update is with a real API call. The same should be done for the rest of the tests. Compare them against `update_status_edge_cases` and make sure that there is a difference in the `Given` statement.

fixed !!

brent.edwards marked this conversation as resolved

features/steps/google_sheets_steps.py

					
				@ -0,0 +1,2646 @@

				"""Step definitions for Google Sheets tracker tests."""

There are three ways you could organize this file:

(Easiest) Put all @givens together, then all @whens, then all @thens.
(Harder) Group the elements by .feature file. Within each .feature file, put all @givens together, then all @whens, then all @thens.
(NOT RECOMMENDED!!!!) Group all elements by test. The source code should repeat the test before the implementation of the @given, @when and @then. (This will make the file a lot longer, and you'll need to have references whenever you have the same @given or whatnot.)

It looks like this code was kinda-sorta organized according to 3.

There are three ways you could organize this file: 1. (Easiest) Put all `@given`s together, then all `@when`s, then all `@then`s. 2. (Harder) Group the elements by `.feature` file. Within each `.feature` file, put all `@given`s together, then all `@when`s, then all `@then`s. 3. (NOT RECOMMENDED!!!!) Group all elements by test. The source code should repeat the test before the implementation of the `@given`, `@when` and `@then`. (This will make the file a lot longer, and you'll need to have references whenever you have the same `@given` or whatnot.) It looks like this code was kinda-sorta organized according to 3.

fixed !!

Which one of the three ways did you organize the file?

When I look at !48 (commit ac1cc304f9) , I don't see much organization.

(HINT: You could have said "I don't think this is important enough to do.", and I would have agreed. But you reported that you fixed it.)

Which one of the three ways did you organize the file? When I look at https://git.cleverthis.com/cleverdatasets/dataset-uploader/pulls/48/commits/ac1cc304f98e61e04f76e74d09eb1a6c23e9f34c#diff-4b6a0b6dddf1733aa850256d39c4e28cd2a56204 , I don't see much organization. (HINT: You could have said "I don't think this is important enough to do.", and I would have agreed. But you reported that you fixed it.)

I misunderstood. I didn't actually reorganize it. Given the size of the file (2,667 lines), reorganizing it would be a significant effort, and it doesn’t seem important enough to prioritize right now.

features/steps/google_sheets_steps.py

					
				@ -0,0 +95,4 @@

				            if header.strip().lower() in ["status", "processing status"]:

				                context.tracker.status_col = idx

				            elif header.strip().lower() in ["error", "error message"]:

				                context.tracker.error_col = idx

Does the tracker remember the column of the dataset name?

fixed !!

brent.edwards marked this conversation as resolved

features/steps/google_sheets_steps.py

				`@ -0,0 +2630,4 @@`

Is there a reason for this long gap?

fixed !!

There's still an eleven-line gap between lines 2645-2656. Thanks for eliminating two of the lines, but the long gap remains.

(It's not necessarily WRONG. It just very much stands out when one reads the code.)

Could you either drop it down to two lines or just say what the gap is about?

There's still an eleven-line gap between lines 2645-2656. Thanks for eliminating two of the lines, but the long gap remains. (It's not necessarily WRONG. It just very much stands out when one reads the code.) Could you either drop it down to two lines or just say what the gap is about?

scripts/convert_rdf_to_hf_dataset_unified.py Outdated

					
				@ -523,2 +582,2 @@

				    graph = Graph()

				    graph.parse(str(file_path), format=rdf_format)

				    # Check if this is GeoNames format (special handling required)

				    if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:583:89: E501 Line too long (91 > 88)
    |
581 |     """
582 |     # Check if this is GeoNames format (special handling required)
583 |     if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):
    |                                                                                         ^^^ E501
584 |         # Use GeoNames parser with single worker for non-parallel strategies
585 |         yield from stream_geonames_parallel(file_path, chunk_size, num_workers=1)
    |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:583:89: E501 Line too long (91 > 88) | 581 | """ 582 | # Check if this is GeoNames format (special handling required) 583 | if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path): | ^^^ E501 584 | # Use GeoNames parser with single worker for non-parallel strategies 585 | yield from stream_geonames_parallel(file_path, chunk_size, num_workers=1) | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py

					
				@ -828,0 +965,4 @@

				                                            subject=subject,

				                                            predicate=predicate,

				                                            object=obj,

				                                            object_type="literal",  # TSV typically has literals

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:968:89: E501 Line too long (96 > 88)
    |
966 | …                     predicate=predicate,
967 | …                     object=obj,
968 | …                     object_type="literal",  # TSV typically has literals
    |                                                                   ^^^^^^^^ E501
969 | …                     object_datatype=None,
970 | …                     object_language=None
    |

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py

					
				@ -828,0 +970,4 @@

				                                            object_language=None

				                                        ))

				                            except Exception as e:

				                                logger.debug(f"Skipping malformed TSV line {line_num}: {e}")

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:973:89: E501 Line too long (92 > 88)
    |
971 |                                         ))
972 |                             except Exception as e:
973 |                                 logger.debug(f"Skipping malformed TSV line {line_num}: {e}")
    |                                                                                         ^^^^ E501
974 |                                 continue
975 |                 except Exception as e:
    |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:973:89: E501 Line too long (92 > 88) | 971 | )) 972 | except Exception as e: 973 | logger.debug(f"Skipping malformed TSV line {line_num}: {e}") | ^^^^ E501 974 | continue 975 | except Exception as e: | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py

					
				@ -840,0 +990,4 @@

				            # Create train/test split if requested

				            if config.create_train_test_split:

				                assert isinstance(dataset, Dataset), "Expected Dataset instance"

				                train_test = dataset.train_test_split(test_size=config.test_size, seed=42)

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:993:89: E501 Line too long (90 > 88)
    |
991 |             if config.create_train_test_split:
992 |                 assert isinstance(dataset, Dataset), "Expected Dataset instance"
993 |                 train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
    |                                                                                         ^^ E501
994 |                 dataset_dict = DatasetDict({
995 |                     "train": train_test["train"],
    |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:993:89: E501 Line too long (90 > 88) | 991 | if config.create_train_test_split: 992 | assert isinstance(dataset, Dataset), "Expected Dataset instance" 993 | train_test = dataset.train_test_split(test_size=config.test_size, seed=42) | ^^ E501 994 | dataset_dict = DatasetDict({ 995 | "train": train_test["train"], | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py Outdated

					
				@ -1160,2 +1324,2 @@

				                            f"{num_workers} workers[/yellow]"

				                        )

				                # Handle compressed files

				                is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")

According to ruff check:


scripts/convert_rdf_to_hf_dataset_unified.py:1325:89: E501 Line too long (101 > 88)
     |
1323 |                 # Check if it's the special GeoNames format (URLs followed by XML)
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
     |                                                                                         ^^^^^^^^^^^^^ E501
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |

scripts/convert_rdf_to_hf_dataset_unified.py:1326:89: E501 Line too long (102 > 88)
     |
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |                                                                                         ^^^^^^^^^^^^^^ E501
1327 |
1328 |                 try:
     |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1325:89: E501 Line too long (101 > 88) | 1323 | # Check if it's the special GeoNames format (URLs followed by XML) 1324 | # Handle compressed files 1325 | is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz") | ^^^^^^^^^^^^^ E501 1326 | is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2") | scripts/convert_rdf_to_hf_dataset_unified.py:1326:89: E501 Line too long (102 > 88) | 1324 | # Handle compressed files 1325 | is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz") 1326 | is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2") | ^^^^^^^^^^^^^^ E501 1327 | 1328 | try: | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py Outdated

					
				@ -1162,0 +1327,4 @@

				                try:

				                    if is_gzip:

				                        file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115

According to ruff check:


scripts/convert_rdf_to_hf_dataset_unified.py:1330:89: E501 Line too long (104 > 88)
     |
1328 |                 try:
1329 |                     if is_gzip:
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^^ E501
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |

scripts/convert_rdf_to_hf_dataset_unified.py:1332:89: E501 Line too long (103 > 88)
     |
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^ E501
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1330:89: E501 Line too long (104 > 88) | 1328 | try: 1329 | if is_gzip: 1330 | file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 | ^^^^^^^^^^^^^^^^ E501 1331 | elif is_bz2: 1332 | file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 | scripts/convert_rdf_to_hf_dataset_unified.py:1332:89: E501 Line too long (103 > 88) | 1330 | file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 1331 | elif is_bz2: 1332 | file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 | ^^^^^^^^^^^^^^^ E501 1333 | else: 1334 | file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115 | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py

					
				@ -1165,3 +1334,1 @@

				                            f"[yellow]Using standard RDF parser for "

				                            f"{config.rdf_format} (single-threaded)[/yellow]"

				                        )

				                        file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1334:89: E501 Line too long (93 > 88)
     |
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^ E501
1335 |
1336 |                     try:
     |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1334:89: E501 Line too long (93 > 88) | 1332 | file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 1333 | else: 1334 | file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115 | ^^^^^ E501 1335 | 1336 | try: | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py

					
				@ -1168,0 +1350,4 @@

				                    finally:

				                        file_obj.close()

				                except Exception as e:

				                    logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1353:89: E501 Line too long (97 > 88)
     |
1351 |                         file_obj.close()
1352 |                 except Exception as e:
1353 |                     logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")
     |                                                                                         ^^^^^^^^^ E501
1354 |                     format_type = "generic"
1355 |             elif config.rdf_format in ("nt", "ntriples"):
     |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1353:89: E501 Line too long (97 > 88) | 1351 | file_obj.close() 1352 | except Exception as e: 1353 | logger.warning(f"Error detecting GeoNames format: {e}, using generic parser") | ^^^^^^^^^ E501 1354 | format_type = "generic" 1355 | elif config.rdf_format in ("nt", "ntriples"): | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py Outdated

					
				@ -1249,0 +1435,4 @@

				            # Create train/test split if requested

				            if config.create_train_test_split:

				                assert isinstance(dataset, Dataset), "Expected Dataset instance"

				                train_test = dataset.train_test_split(test_size=config.test_size, seed=42)

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1438:89: E501 Line too long (90 > 88)
     |
1436 |             if config.create_train_test_split:
1437 |                 assert isinstance(dataset, Dataset), "Expected Dataset instance"
1438 |                 train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
     |                                                                                         ^^ E501
1439 |                 dataset_dict = DatasetDict({
1440 |                     "train": train_test["train"],
     |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1438:89: E501 Line too long (90 > 88) | 1436 | if config.create_train_test_split: 1437 | assert isinstance(dataset, Dataset), "Expected Dataset instance" 1438 | train_test = dataset.train_test_split(test_size=config.test_size, seed=42) | ^^ E501 1439 | dataset_dict = DatasetDict({ 1440 | "train": train_test["train"], | ```

fixed !!

brent.edwards marked this conversation as resolved

scripts/convert_rdf_to_hf_dataset_unified.py Outdated

					
				@ -1278,0 +1473,4 @@

				                # Use single process for small datasets to avoid index errors

				                # Only use multiprocessing for datasets with > 1000 examples

				                if total_triples > 1000 and num_workers > 1:

				                    dataset_dict.save_to_disk(str(config.output_path), num_proc=num_workers)

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1476:89: E501 Line too long (92 > 88)
     |
1474 |                 # Only use multiprocessing for datasets with > 1000 examples
1475 |                 if total_triples > 1000 and num_workers > 1:
1476 |                     dataset_dict.save_to_disk(str(config.output_path), num_proc=num_workers)
     |                                                                                         ^^^^ E501
1477 |                 else:
1478 |                     dataset_dict.save_to_disk(str(config.output_path))
     |

According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1476:89: E501 Line too long (92 > 88) | 1474 | # Only use multiprocessing for datasets with > 1000 examples 1475 | if total_triples > 1000 and num_workers > 1: 1476 | dataset_dict.save_to_disk(str(config.output_path), num_proc=num_workers) | ^^^^ E501 1477 | else: 1478 | dataset_dict.save_to_disk(str(config.output_path)) | ```

fixed !!

brent.edwards marked this conversation as resolved

aditya added 2 commits

2026-02-04 14:19:42 +00:00

test: fix behave test as per review feedback ac1cc304f9

fix: fix ruff check errors as per review feedback 28701bfb55

aditya added 1 commit

2026-02-06 10:52:39 +00:00

chore: remove blank lines 2391bda6af

This pull request can be merged automatically.

This branch is out-of-date with the base branch

You are not authorized to merge this pull request.

View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.

git fetch -u origin unit-tests-google-tracker-sheet:unit-tests-google-tracker-sheet

git switch unit-tests-google-tracker-sheet

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch auto-excel-update

git merge --no-ff unit-tests-google-tracker-sheet

git switch unit-tests-google-tracker-sheet

git rebase auto-excel-update

git switch auto-excel-update

git merge --ff-only unit-tests-google-tracker-sheet

git switch unit-tests-google-tracker-sheet

git rebase auto-excel-update

git switch auto-excel-update

git merge --no-ff unit-tests-google-tracker-sheet

git switch auto-excel-update

git merge --squash unit-tests-google-tracker-sheet