unit-tests-google-tracker-sheet #48

Open
aditya wants to merge 5 commits from unit-tests-google-tracker-sheet into auto-excel-update
Member

This branch contains the behave bdd tests for google_sheets_tracker.py and also for upload_all_datasets.py

This branch contains the behave bdd tests for google_sheets_tracker.py and also for upload_all_datasets.py
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Please enter the commit message for your changes. Lines starting
upload_all_datasets.2.py
Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>
Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>
Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>
Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>
Co-authored-by: aider (openrouter/openai/o3-mini-high) <aider@aider.chat>
Reviewed-on: #41
Reviewed-by: Brent Edwards <brent.edwards@cleverthis.com>
Reviewed-on: #42
Reviewed-by: Brent Edwards <brent.edwards@cleverthis.com>
aditya changed target branch from master to auto-excel-update 2026-01-12 13:07:25 +00:00
aditya changed title from WIP: unit-tests-google-tracker-sheet to unit-tests-google-tracker-sheet 2026-01-12 13:07:48 +00:00
CoreRasurae left a comment
First-time contributor

I have one comment that i think might be relevant to address.

I have one comment that i think might be relevant to address.
@ -0,0 +143,4 @@
context.auth_result = context.tracker.authenticate()
except Exception as e:
context.auth_result = False
context.error = str(e)
First-time contributor

I am not sure if i am okay with steps like these that are so generic... i mean, it can silently hide the true failure reason as it tries so many different things depending on the context. i think this step should be split into more fine grained steps that do well defined things and easier to manually check when something fails.

I am not sure if i am okay with steps like these that are so generic... i mean, it can silently hide the true failure reason as it tries so many different things depending on the context. i think this step should be split into more fine grained steps that do well defined things and easier to manually check when something fails.
Author
Member

fixed !

fixed !
aditya force-pushed unit-tests-google-tracker-sheet from 6af250fc78 to 7f6594d99a 2026-01-19 11:57:35 +00:00 Compare
brent.edwards left a comment
Member

A lot of little changes. I didn't have time to read the /features/steps files.

A lot of little changes. I didn't have time to read the `/features/steps` files.
@ -0,0 +44,4 @@
Scenario: Column letter conversion for column AB
When I convert column number 28 to letter
Then the column letter result should be "AB"
Member

The last four scenarios are duplicated in features/google_sheets_tracker/column_letter_conversion.feature , and can be removed.

The last four scenarios are duplicated in features/google_sheets_tracker/column_letter_conversion.feature , and can be removed.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +41,4 @@
Scenario: Convenience methods return success status
When I mark "wordnet" as in progress
Then the mark_in_progress method should return True
Member

I would be more comfortable if this test added something like

And the mark_completed method should return False
And the mark_error method should return False
And the reset_to_not_started method should return False

and the negative tests for the next three tests as well.

I would be more comfortable if this test added something like ``` And the mark_completed method should return False And the mark_error method should return False And the reset_to_not_started method should return False ``` and the negative tests for the next three tests as well.
Author
Member

fixed !!

fixed !!
Member

My apologies if I weren't clear enough.

The current lines 42-53 of convenience_methods.feature are:

  Scenario: Convenience methods return success status
    When I mark "wordnet" as in progress
    Then the mark_in_progress method should return True

    When I mark "wordnet" as completed
    Then the mark_completed method should return True

    When I mark "wordnet" with error "Test error"
    Then the mark_error method should return True

    When I reset "wordnet" to not started
    Then the reset_to_not_started method should return True

I would be more comfortable if the above were replaced with something like

  Scenario: Convenience methods return success status
    When I mark "wordnet" as in progress
    Then the mark_in_progress method should return True
    And the mark_completed method should return False
    And the mark_error method should return False
    And the reset_to_not_started method should return False

    When I mark "wordnet" as completed
    Then the mark_in_progress method should return False
    And the mark_completed method should return True
    And the mark_error method should return False
    And the reset_to_not_started method should return False

    When I mark "wordnet" with error "Test error"
    Then the mark_in_progress method should return False
    And the mark_completed method should return False
    And the mark_error method should return True
    And the reset_to_not_started method should return False

    When I reset "wordnet" to not started
    Then the mark_in_progress method should return False
    And the mark_completed method should return False
    And the mark_error method should return False
    And the reset_to_not_started method should return True
My apologies if I weren't clear enough. The current lines 42-53 of `convenience_methods.feature` are: ``` Scenario: Convenience methods return success status When I mark "wordnet" as in progress Then the mark_in_progress method should return True When I mark "wordnet" as completed Then the mark_completed method should return True When I mark "wordnet" with error "Test error" Then the mark_error method should return True When I reset "wordnet" to not started Then the reset_to_not_started method should return True ``` I would be more comfortable if the above were replaced with something like ``` Scenario: Convenience methods return success status When I mark "wordnet" as in progress Then the mark_in_progress method should return True And the mark_completed method should return False And the mark_error method should return False And the reset_to_not_started method should return False When I mark "wordnet" as completed Then the mark_in_progress method should return False And the mark_completed method should return True And the mark_error method should return False And the reset_to_not_started method should return False When I mark "wordnet" with error "Test error" Then the mark_in_progress method should return False And the mark_completed method should return False And the mark_error method should return True And the reset_to_not_started method should return False When I reset "wordnet" to not started Then the mark_in_progress method should return False And the mark_completed method should return False And the mark_error method should return False And the reset_to_not_started method should return True ```
Author
Member

I apologize, I incorrectly replied to this comment as fixed.

I don't think the test you suggested matches our real-world use cases. Here's why:

Actually, calling these methods sequentially does happen in production:

Script Interruption (Ctrl+C): First mark_in_progress is called, status becomes IN PROGRESS. Then user presses Ctrl+C. Now reset_to_not_started must work and return True to reset the dataset. If it returns False, the dataset stays stuck as IN PROGRESS forever.

Processing Error: First mark_in_progress is called successfully. Then download fails immediately. Now mark_error must work and return True to record what went wrong. If it returns False, we can't record the error.

Successful Completion: First mark_in_progress is called. Processing succeeds. Then mark_completed must work and return True to mark it done. If it returns False, we can't mark the work as complete.
These are all valid workflows that require methods to work together, not exclude each other. The test expectations would break our error handling and recovery mechanisms.

Moreover, google sheet tracker module is not used independently, its only being used with the automation scripts like rdf_to_hf_incremental.py and upload_all_datasets.py which use the validation method check_should_process before calling any of these convenience methods.

I apologize, I incorrectly replied to this comment as fixed. I don't think the test you suggested matches our real-world use cases. Here's why: Actually, calling these methods sequentially does happen in production: Script Interruption (Ctrl+C): First mark_in_progress is called, status becomes IN PROGRESS. Then user presses Ctrl+C. Now reset_to_not_started must work and return True to reset the dataset. If it returns False, the dataset stays stuck as IN PROGRESS forever. Processing Error: First mark_in_progress is called successfully. Then download fails immediately. Now mark_error must work and return True to record what went wrong. If it returns False, we can't record the error. Successful Completion: First mark_in_progress is called. Processing succeeds. Then mark_completed must work and return True to mark it done. If it returns False, we can't mark the work as complete. These are all valid workflows that require methods to work together, not exclude each other. The test expectations would break our error handling and recovery mechanisms. Moreover, google sheet tracker module is not used independently, its only being used with the automation scripts like rdf_to_hf_incremental.py and upload_all_datasets.py which use the validation method check_should_process before calling any of these convenience methods.
@ -0,0 +26,4 @@
| Dataset Name | Status | Error |
| wordnet | NOT STARTED | |
When I attempt to mark "nonexistent" as in progress
Then the update should fail
Member

This test seems very close to features/google_sheets_tracker/convenience_methods.feature line 56.

This test seems very close to features/google_sheets_tracker/convenience_methods.feature line 56.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +49,4 @@
Given an authenticated sheets tracker
And a sheet with empty rows
When I get the status of "wordnet"
Then the status should be None
Member

This test seems almost identical to features/google_sheets_tracker/error_handling.feature line 16.

This test seems almost identical to features/google_sheets_tracker/error_handling.feature line 16.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +68,4 @@
And a sheet with only one row (headers only)
When I get the status of "wordnet"
Then the status should be None
And the row should be None
Member

This test also seems almost identical to features/google_sheets_tracker/error_handling.feature line 16.

This test also seems almost identical to features/google_sheets_tracker/error_handling.feature line 16.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +76,4 @@
| Dataset Name | Status | Error |
| wordnet | NOT STARTED | |
When I attempt to update status for "nonexistent-dataset"
Then the update should return False
Member

This test is almost the same as features/google_sheets_tracker/convenience_methods.feature lines 56-63.

This test is almost the same as features/google_sheets_tracker/convenience_methods.feature lines 56-63.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +25,4 @@
And a sheet with headers "ID, Status, Error"
When the tracker loads headers
Then dataset_name_col should be identified as 0
And status_col should be 1
Member

Why isn't error checked here?

Why isn't `error` checked here?
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +32,4 @@
And a sheet with headers "Dataset Name, State, Error"
When the tracker loads headers
Then dataset_name_col should be 0
And status_col should be identified as 1
Member

Why isn't error checked here?

Why isn't `error` checked here?
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +39,4 @@
And a sheet with headers "Dataset Name, Status, Error Message"
When the tracker loads headers
Then dataset_name_col should be 0
And error_col should be identified as 2
Member

Why isn't status_col checked here?

Why isn't `status_col` checked here?
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +45,4 @@
Scenario: Process when tracker not authenticated
Given an unauthenticated sheets tracker
When I check if "wordnet" should be processed
Then the decision should be to process
Member

"should be" to process? Is it possible that this is in the wrong direction?

"should be" to process? Is it possible that this is in the wrong direction?
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +50,4 @@
Scenario: Error message truncation
When I reset "wordnet" to not started with error message longer than 500 characters
Then the error for "wordnet" should be at most 500 characters
Member

This test looks similar to features/google_sheets_tracker/error_message_handling.feature line 12.

This test looks similar to features/google_sheets_tracker/error_message_handling.feature line 12.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +55,4 @@
| wordnet | NOT STARTED | |
When I attempt to update status for "nonexistent"
Then the update should fail gracefully
And no exception should be raised
Member

This test seems very close to features/google_sheets_tracker/convenience_methods.feature line 56.

This test seems very close to features/google_sheets_tracker/convenience_methods.feature line 56.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +62,4 @@
| Dataset Name | Status | Error |
| wordnet | NOT STARTED | |
When I attempt to update status with very long value
Then the update should handle the long value correctly
Member

In other places, the limit was 500 characters. Is that also true here?

In other places, the limit was 500 characters. Is that also true here?
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +7,4 @@
Given an authenticated sheets tracker
Scenario: Update status with only status column present (no error column)
Given a sheet with headers "Dataset Name, Status"
Member

The Given should probably say something about this sheet being real not virtual. This test looks very similar to features/google_sheets_tracker/update_status_edge_cases.feature lines 10-16, so I would have expected the setup to be the same... but only at line 14 does it say that the update is with a real API call.

The same should be done for the rest of the tests. Compare them against update_status_edge_cases and make sure that there is a difference in the Given statement.

The `Given` should probably say something about this sheet being real not virtual. This test looks very similar to features/google_sheets_tracker/update_status_edge_cases.feature lines 10-16, so I would have expected the setup to be the same... but only at line 14 does it say that the update is with a real API call. The same should be done for the rest of the tests. Compare them against `update_status_edge_cases` and make sure that there is a difference in the `Given` statement.
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +1,2646 @@
"""Step definitions for Google Sheets tracker tests."""
Member

There are three ways you could organize this file:

  1. (Easiest) Put all @givens together, then all @whens, then all @thens.
  2. (Harder) Group the elements by .feature file. Within each .feature file, put all @givens together, then all @whens, then all @thens.
  3. (NOT RECOMMENDED!!!!) Group all elements by test. The source code should repeat the test before the implementation of the @given, @when and @then. (This will make the file a lot longer, and you'll need to have references whenever you have the same @given or whatnot.)

It looks like this code was kinda-sorta organized according to 3.

There are three ways you could organize this file: 1. (Easiest) Put all `@given`s together, then all `@when`s, then all `@then`s. 2. (Harder) Group the elements by `.feature` file. Within each `.feature` file, put all `@given`s together, then all `@when`s, then all `@then`s. 3. (NOT RECOMMENDED!!!!) Group all elements by test. The source code should repeat the test before the implementation of the `@given`, `@when` and `@then`. (This will make the file a lot longer, and you'll need to have references whenever you have the same `@given` or whatnot.) It looks like this code was kinda-sorta organized according to 3.
Author
Member

fixed !!

fixed !!
Member

Which one of the three ways did you organize the file?

When I look at !48 (commit ac1cc304f9) , I don't see much organization.

(HINT: You could have said "I don't think this is important enough to do.", and I would have agreed. But you reported that you fixed it.)

Which one of the three ways did you organize the file? When I look at https://git.cleverthis.com/cleverdatasets/dataset-uploader/pulls/48/commits/ac1cc304f98e61e04f76e74d09eb1a6c23e9f34c#diff-4b6a0b6dddf1733aa850256d39c4e28cd2a56204 , I don't see much organization. (HINT: You could have said "I don't think this is important enough to do.", and I would have agreed. But you reported that you fixed it.)
Author
Member

I misunderstood. I didn't actually reorganize it. Given the size of the file (2,667 lines), reorganizing it would be a significant effort, and it doesn’t seem important enough to prioritize right now.

I misunderstood. I didn't actually reorganize it. Given the size of the file (2,667 lines), reorganizing it would be a significant effort, and it doesn’t seem important enough to prioritize right now.
@ -0,0 +95,4 @@
if header.strip().lower() in ["status", "processing status"]:
context.tracker.status_col = idx
elif header.strip().lower() in ["error", "error message"]:
context.tracker.error_col = idx
Member

Does the tracker remember the column of the dataset name?

Does the tracker remember the column of the dataset name?
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -0,0 +2630,4 @@
Member

Is there a reason for this long gap?

Is there a reason for this long gap?
Author
Member

fixed !!

fixed !!
Member

There's still an eleven-line gap between lines 2645-2656. Thanks for eliminating two of the lines, but the long gap remains.

(It's not necessarily WRONG. It just very much stands out when one reads the code.)

Could you either drop it down to two lines or just say what the gap is about?

There's still an eleven-line gap between lines 2645-2656. Thanks for eliminating two of the lines, but the long gap remains. (It's not necessarily WRONG. It just very much stands out when one reads the code.) Could you either drop it down to two lines or just say what the gap is about?
@ -523,2 +582,2 @@
graph = Graph()
graph.parse(str(file_path), format=rdf_format)
# Check if this is GeoNames format (special handling required)
if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:583:89: E501 Line too long (91 > 88)
    |
581 |     """
582 |     # Check if this is GeoNames format (special handling required)
583 |     if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path):
    |                                                                                         ^^^ E501
584 |         # Use GeoNames parser with single worker for non-parallel strategies
585 |         yield from stream_geonames_parallel(file_path, chunk_size, num_workers=1)
    |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:583:89: E501 Line too long (91 > 88) | 581 | """ 582 | # Check if this is GeoNames format (special handling required) 583 | if rdf_format in ("xml", "application/rdf+xml") and _detect_geonames_format(file_path): | ^^^ E501 584 | # Use GeoNames parser with single worker for non-parallel strategies 585 | yield from stream_geonames_parallel(file_path, chunk_size, num_workers=1) | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -828,0 +965,4 @@
subject=subject,
predicate=predicate,
object=obj,
object_type="literal", # TSV typically has literals
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:968:89: E501 Line too long (96 > 88)
    |
966 | …                     predicate=predicate,
967 | …                     object=obj,
968 | …                     object_type="literal",  # TSV typically has literals
    |                                                                   ^^^^^^^^ E501
969 | …                     object_datatype=None,
970 | …                     object_language=None
    |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:968:89: E501 Line too long (96 > 88) | 966 | … predicate=predicate, 967 | … object=obj, 968 | … object_type="literal", # TSV typically has literals | ^^^^^^^^ E501 969 | … object_datatype=None, 970 | … object_language=None | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -828,0 +970,4 @@
object_language=None
))
except Exception as e:
logger.debug(f"Skipping malformed TSV line {line_num}: {e}")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:973:89: E501 Line too long (92 > 88)
    |
971 |                                         ))
972 |                             except Exception as e:
973 |                                 logger.debug(f"Skipping malformed TSV line {line_num}: {e}")
    |                                                                                         ^^^^ E501
974 |                                 continue
975 |                 except Exception as e:
    |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:973:89: E501 Line too long (92 > 88) | 971 | )) 972 | except Exception as e: 973 | logger.debug(f"Skipping malformed TSV line {line_num}: {e}") | ^^^^ E501 974 | continue 975 | except Exception as e: | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -840,0 +990,4 @@
# Create train/test split if requested
if config.create_train_test_split:
assert isinstance(dataset, Dataset), "Expected Dataset instance"
train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:993:89: E501 Line too long (90 > 88)
    |
991 |             if config.create_train_test_split:
992 |                 assert isinstance(dataset, Dataset), "Expected Dataset instance"
993 |                 train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
    |                                                                                         ^^ E501
994 |                 dataset_dict = DatasetDict({
995 |                     "train": train_test["train"],
    |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:993:89: E501 Line too long (90 > 88) | 991 | if config.create_train_test_split: 992 | assert isinstance(dataset, Dataset), "Expected Dataset instance" 993 | train_test = dataset.train_test_split(test_size=config.test_size, seed=42) | ^^ E501 994 | dataset_dict = DatasetDict({ 995 | "train": train_test["train"], | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -1160,2 +1324,2 @@
f"{num_workers} workers[/yellow]"
)
# Handle compressed files
is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
Member

According to ruff check:


scripts/convert_rdf_to_hf_dataset_unified.py:1325:89: E501 Line too long (101 > 88)
     |
1323 |                 # Check if it's the special GeoNames format (URLs followed by XML)
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
     |                                                                                         ^^^^^^^^^^^^^ E501
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |

scripts/convert_rdf_to_hf_dataset_unified.py:1326:89: E501 Line too long (102 > 88)
     |
1324 |                 # Handle compressed files
1325 |                 is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz")
1326 |                 is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2")
     |                                                                                         ^^^^^^^^^^^^^^ E501
1327 |
1328 |                 try:
     |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1325:89: E501 Line too long (101 > 88) | 1323 | # Check if it's the special GeoNames format (URLs followed by XML) 1324 | # Handle compressed files 1325 | is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz") | ^^^^^^^^^^^^^ E501 1326 | is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2") | scripts/convert_rdf_to_hf_dataset_unified.py:1326:89: E501 Line too long (102 > 88) | 1324 | # Handle compressed files 1325 | is_gzip = config.input_path.suffix == ".gz" or str(config.input_path).endswith(".gz") 1326 | is_bz2 = config.input_path.suffix == ".bz2" or str(config.input_path).endswith(".bz2") | ^^^^^^^^^^^^^^ E501 1327 | 1328 | try: | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -1162,0 +1327,4 @@
try:
if is_gzip:
file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
Member

According to ruff check:


scripts/convert_rdf_to_hf_dataset_unified.py:1330:89: E501 Line too long (104 > 88)
     |
1328 |                 try:
1329 |                     if is_gzip:
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^^ E501
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |

scripts/convert_rdf_to_hf_dataset_unified.py:1332:89: E501 Line too long (103 > 88)
     |
1330 |                         file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1331 |                     elif is_bz2:
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^^^^^^^^^^^ E501
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1330:89: E501 Line too long (104 > 88) | 1328 | try: 1329 | if is_gzip: 1330 | file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 | ^^^^^^^^^^^^^^^^ E501 1331 | elif is_bz2: 1332 | file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 | scripts/convert_rdf_to_hf_dataset_unified.py:1332:89: E501 Line too long (103 > 88) | 1330 | file_obj = gzip.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 1331 | elif is_bz2: 1332 | file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 | ^^^^^^^^^^^^^^^ E501 1333 | else: 1334 | file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115 | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -1165,3 +1334,1 @@
f"[yellow]Using standard RDF parser for "
f"{config.rdf_format} (single-threaded)[/yellow]"
)
file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1334:89: E501 Line too long (93 > 88)
     |
1332 |                         file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115
1333 |                     else:
1334 |                         file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115
     |                                                                                         ^^^^^ E501
1335 |
1336 |                     try:
     |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1334:89: E501 Line too long (93 > 88) | 1332 | file_obj = bz2.open(config.input_path, "rt", encoding="utf-8", errors="ignore") # noqa: SIM115 1333 | else: 1334 | file_obj = open(config.input_path, encoding="utf-8", errors="ignore") # noqa: SIM115 | ^^^^^ E501 1335 | 1336 | try: | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -1168,0 +1350,4 @@
finally:
file_obj.close()
except Exception as e:
logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1353:89: E501 Line too long (97 > 88)
     |
1351 |                         file_obj.close()
1352 |                 except Exception as e:
1353 |                     logger.warning(f"Error detecting GeoNames format: {e}, using generic parser")
     |                                                                                         ^^^^^^^^^ E501
1354 |                     format_type = "generic"
1355 |             elif config.rdf_format in ("nt", "ntriples"):
     |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1353:89: E501 Line too long (97 > 88) | 1351 | file_obj.close() 1352 | except Exception as e: 1353 | logger.warning(f"Error detecting GeoNames format: {e}, using generic parser") | ^^^^^^^^^ E501 1354 | format_type = "generic" 1355 | elif config.rdf_format in ("nt", "ntriples"): | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -1249,0 +1435,4 @@
# Create train/test split if requested
if config.create_train_test_split:
assert isinstance(dataset, Dataset), "Expected Dataset instance"
train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1438:89: E501 Line too long (90 > 88)
     |
1436 |             if config.create_train_test_split:
1437 |                 assert isinstance(dataset, Dataset), "Expected Dataset instance"
1438 |                 train_test = dataset.train_test_split(test_size=config.test_size, seed=42)
     |                                                                                         ^^ E501
1439 |                 dataset_dict = DatasetDict({
1440 |                     "train": train_test["train"],
     |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1438:89: E501 Line too long (90 > 88) | 1436 | if config.create_train_test_split: 1437 | assert isinstance(dataset, Dataset), "Expected Dataset instance" 1438 | train_test = dataset.train_test_split(test_size=config.test_size, seed=42) | ^^ E501 1439 | dataset_dict = DatasetDict({ 1440 | "train": train_test["train"], | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
@ -1278,0 +1473,4 @@
# Use single process for small datasets to avoid index errors
# Only use multiprocessing for datasets with > 1000 examples
if total_triples > 1000 and num_workers > 1:
dataset_dict.save_to_disk(str(config.output_path), num_proc=num_workers)
Member

According to ruff check:

scripts/convert_rdf_to_hf_dataset_unified.py:1476:89: E501 Line too long (92 > 88)
     |
1474 |                 # Only use multiprocessing for datasets with > 1000 examples
1475 |                 if total_triples > 1000 and num_workers > 1:
1476 |                     dataset_dict.save_to_disk(str(config.output_path), num_proc=num_workers)
     |                                                                                         ^^^^ E501
1477 |                 else:
1478 |                     dataset_dict.save_to_disk(str(config.output_path))
     |
According to `ruff check`: ``` scripts/convert_rdf_to_hf_dataset_unified.py:1476:89: E501 Line too long (92 > 88) | 1474 | # Only use multiprocessing for datasets with > 1000 examples 1475 | if total_triples > 1000 and num_workers > 1: 1476 | dataset_dict.save_to_disk(str(config.output_path), num_proc=num_workers) | ^^^^ E501 1477 | else: 1478 | dataset_dict.save_to_disk(str(config.output_path)) | ```
Author
Member

fixed !!

fixed !!
brent.edwards marked this conversation as resolved
This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin unit-tests-google-tracker-sheet:unit-tests-google-tracker-sheet
git switch unit-tests-google-tracker-sheet

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch auto-excel-update
git merge --no-ff unit-tests-google-tracker-sheet
git switch unit-tests-google-tracker-sheet
git rebase auto-excel-update
git switch auto-excel-update
git merge --ff-only unit-tests-google-tracker-sheet
git switch unit-tests-google-tracker-sheet
git rebase auto-excel-update
git switch auto-excel-update
git merge --no-ff unit-tests-google-tracker-sheet
git switch auto-excel-update
git merge --squash unit-tests-google-tracker-sheet
git switch auto-excel-update
git merge --ff-only unit-tests-google-tracker-sheet
git switch auto-excel-update
git merge unit-tests-google-tracker-sheet
git push origin auto-excel-update
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleverdatasets/dataset-uploader!48
No description provided.