Create-validator. #56

Open
brent.edwards wants to merge 31 commits from create-validator into streaming-upload-docker
Member

This is so that Aditya can read and comment on the file.

Co-authored-by: aider (openrouter/openai/gpt-5.2-codex) <aider@aider.chat>
aditya left a comment
Member

Mostly memory and performance related issues.

@ -0,0 +53,4 @@
client.dataset_info(repo_id=repo_id)
except HfHubHTTPError as exc:
status_code = exc.response.status_code if exc.response else None
if status_code == 404:
Member

What about 500, 503, or timeouts? These crash the validator.

Author
Member

Fixed.

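For reference, a minimal sketch of the kind of handling this needs, treating 404 as "dataset missing" and retrying transient failures (the helper name and retry policy below are illustrative, not the actual code):

```python
import time

import requests
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError


def dataset_exists_on_hf(repo_id: str, retries: int = 3) -> bool:
    """Return True if the dataset repo exists, False if it does not.

    Transient server errors (5xx) and timeouts are retried instead of
    crashing the validator.  (Hypothetical helper; names are illustrative.)
    """
    client = HfApi()
    for attempt in range(retries):
        try:
            client.dataset_info(repo_id=repo_id)
            return True
        except HfHubHTTPError as exc:
            status_code = exc.response.status_code if exc.response else None
            if status_code == 404:
                return False
            if status_code in (500, 502, 503, 504) and attempt < retries - 1:
                time.sleep(2 ** attempt)  # back off, then retry transient errors
                continue
            raise  # anything else is a real error and should surface
        except requests.exceptions.Timeout:
            if attempt < retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
    return False
```
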
@ -0,0 +100,4 @@
return download_path
def dataset_is_parquet_file(
Member

This function returns False for permission errors, corrupted files, or any legitimate bugs. It should distinguish between validation failures and system errors.

Author
Member

I want to keep every test as simple as possible.

The file was downloaded by `dataset_can_download_from_hf`. If there are permission errors, it means that the file was downloaded into a directory that is writable but not readable. Those directories are very rare.

`dataset_has_required_columns` checks for corrupted files in a separate test.

Finally, I don't know what a "legitimate bug" is.

Let me know if you have other tests that you would like me to add...

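For reference, if we do want that distinction later, a sketch could look like this (pyarrow-based, illustrative only; the real check in this PR may differ):

```python
import pyarrow as pa
import pyarrow.parquet as pq


def dataset_is_parquet_file(file_path) -> bool:
    """Return False only when the file is readable but is not valid parquet.

    Illustrative sketch: PermissionError / OSError deliberately propagate,
    so system errors are not folded into a False validation result.
    """
    try:
        pq.ParquetFile(file_path)   # raises ArrowInvalid on non-parquet content
        return True
    except pa.ArrowInvalid:
        return False
```
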
@ -0,0 +194,4 @@
rdf_format = "xml"
graph = Graph()
graph.parse(str(file_path), format=rdf_format)
Member

It loads the entire RDF file into an in-memory graph; for multi-GB RDF files, this is extremely slow and memory-intensive.

Author
Member

I have changed this to a bloom filter.

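For line-oriented serializations such as N-Triples, the parse itself can also be streamed straight into the filter instead of building a Graph first. A rough sketch, assuming bloom_filter2 and deliberately naive term splitting (function name and sizing values are guesses, not the PR's code):

```python
from bloom_filter2 import BloomFilter


def collect_source_strings_streaming(nt_path, max_elements=100_000_000):
    """Stream an N-Triples file into a Bloom filter without building a Graph.

    Sketch only: works for line-oriented formats (.nt), not RDF/XML, and the
    term splitting below is deliberately naive.
    """
    store = BloomFilter(max_elements=max_elements, error_rate=0.001)
    with open(nt_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.rstrip(" .").split(" ", 2)  # subject, predicate, object
            if len(parts) != 3:
                continue
            for term in parts:
                store.add(term)
    return store
```
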
@ -0,0 +203,4 @@
return store
def collect_parquet_strings(
Member

This function loads entire datasets into memory as string sets for comparison; for large datasets (such as Wikidata, with billions of triples), this will exhaust memory.

Author
Member

Changed to a bloom filter.

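A sketch of the batch-streaming variant, assuming pyarrow and bloom_filter2 (the batch size and the shared-store parameter are illustrative, not the actual signature):

```python
import pyarrow.parquet as pq
from bloom_filter2 import BloomFilter


def collect_parquet_strings(parquet_file, store=None, batch_size=65_536):
    """Add every string cell of a parquet file to a Bloom filter, batch by batch."""
    if store is None:
        store = BloomFilter(max_elements=100_000_000, error_rate=0.001)
    reader = pq.ParquetFile(parquet_file)
    for batch in reader.iter_batches(batch_size=batch_size):
        for column in batch.columns:
            for value in column.to_pylist():
                if isinstance(value, str):
                    store.add(value)
    return store
```
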
@ -1269,0 +1286,4 @@
base_dir: Path,
) -> bool:
"""Validate that source strings and parquet strings match (Level 2/3)."""
source_strings = collect_source_strings(dataset_id, base_dir)
Member

collect_source_strings returns an empty set if no files are found; a warning is printed, but validation continues and passes.
Should this be a failure?

Author
Member

Excellent point. Changed.

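Roughly this shape, assuming collect_source_strings can signal "no files found" (the None convention and message text below are illustrative):

```python
source_strings = collect_source_strings(dataset_id, base_dir)
if source_strings is None:  # no source files found: fail instead of passing
    console.print(f"[bold red]Validation failed for {dataset_id}: no source files found[/bold red]")
    return False
```
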
@ -1586,6 +1718,11 @@ def main() -> int:
console.print(f"[bold cyan]Removing {rdf_file}[/bold cyan]")
remove_dir(rdf_file.parent)
validation_ok = validate(datasets_to_process, args.base_dir)
Member

This validation runs after all the datasets are uploaded, at the very end; the user waits days for uploads, then discovers validation failures. Can we add per-dataset validation before marking each dataset complete?

Author
Member

Done.

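A rough sketch of the per-dataset flow being asked for (upload_dataset and mark_dataset_complete are hypothetical names, not the actual helpers):

```python
failed = []
for dataset_id in datasets_to_process:
    upload_dataset(dataset_id, args.base_dir)        # hypothetical upload helper
    if not validate([dataset_id], args.base_dir):    # validate right after upload
        console.print(f"[bold red]Validation failed for {dataset_id}[/bold red]")
        failed.append(dataset_id)
        continue
    mark_dataset_complete(dataset_id)                # hypothetical completion marker
validation_ok = not failed
```
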
aditya left a comment
Member

I have pointed out an error that I encountered while running the latest upload_all_datasets.py. I have also highlighted some points related to the use of Bloom filters.

@ -1269,0 +1299,4 @@
else:
parquet_strings = _new_string_store()
for parquet_file in parquet_files:
parquet_strings = parquet_strings.union(
Member

Line 1302 is giving me an AttributeError: 'NoneType' object has no attribute 'union'.

This is the reason I found on the internet:

In bloom_filter2, union() is implemented as a mutating operation, similar to set.update() in Python.

`parquet_strings = parquet_strings.union(collect_parquet_strings(parquet_file))` modifies the BloomFilter in place and returns None.

parquet_strings is now None.

Trying to call None.union(...) raises AttributeError: 'NoneType' object has no attribute 'union'.

Author
Member

Thank you.

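Given the behaviour described above (union() mutating in place and returning None), the fix is presumably to stop reassigning the return value, e.g.:

```python
parquet_strings = _new_string_store()
for parquet_file in parquet_files:
    # union() mutates parquet_strings in place and returns None,
    # so its result must not be assigned back.
    parquet_strings.union(collect_parquet_strings(parquet_file))
```

Alternatively, the shared filter could be passed into collect_parquet_strings so each file adds its strings directly.
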
@ -1269,0 +1303,4 @@
collect_parquet_strings(parquet_file)
)
missing_in_parquet = parquet_strings in source_strings
Member

I think the `in` operator here checks if one BloomFilter object is contained in another, which isn't a valid operation. BloomFilters support testing individual string membership (`string in bloom_filter`), but not subset/superset checks between filters.

Please correct me if I am missing something here.

Author
Member

You were correct. I have added a new function and tests for the function.

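For reference, for two Bloom filters built with identical max_elements/error_rate (and therefore identical size and hash configuration), "A is a subset of B" reduces to a bit-level check: every bit set in A must also be set in B. A sketch over raw bit arrays (how to extract those bytes from a bloom_filter2 filter depends on library internals, so this stops at bytes and is not the helper actually added in the PR):

```python
def bits_subset(inner_bits: bytes, outer_bits: bytes) -> bool:
    """True when every bit set in inner_bits is also set in outer_bits.

    Only meaningful when both filters were created with identical
    parameters, so their bit arrays line up.
    """
    return len(inner_bits) == len(outer_bits) and all(
        (a & b) == a for a, b in zip(inner_bits, outer_bits)
    )
```

A filter-level is_sub_bloom_filter wrapper would extract the underlying bit array from each BloomFilter and call this check.
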
aditya left a comment
Member

When I ran `python scripts/upload_all_datasets.py --dataset wordnet --validate`, it started and completed the validation process, but I did not get any success or failure message.

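A plausible way to surface the outcome at the end of main(), reusing validation_ok from the snippet quoted earlier (messages are illustrative):

```python
validation_ok = validate(datasets_to_process, args.base_dir)
if validation_ok:
    console.print("[bold green]Validation succeeded[/bold green]")
else:
    console.print("[bold red]Validation failed[/bold red]")
return 0 if validation_ok else 1
```
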
@ -1269,0 +1322,4 @@
)
missing_in_parquet = is_sub_bloom_filter(parquet_strings, source_strings)
if missing_in_parquet:
Member

I think you should invert this `if` condition to check for the missing parquet strings.
Please correct me if I am missing anything.

Author
Member

You are correct.

@ -1269,0 +1329,4 @@
return False
missing_in_source = is_sub_bloom_filter(source_strings, parquet_strings)
if missing_in_source:
Member

I think you should invert the `if` condition to check for the missing source and missing parquet strings.

Author
Member

You are correct.

@ -1269,0 +1321,4 @@
collect_parquet_strings(parquet_file)
)
missing_in_parquet = is_sub_bloom_filter(parquet_strings, source_strings)
Member

I think the variable name `missing_in_parquet` is misleading. `is_sub_bloom_filter` returns True when parquet is a subset of source, not when strings are missing.

Author
Member

When I added a `not`, the variable's meaning became correct.

@ -1269,0 +1328,4 @@
)
return False
missing_in_source = is_sub_bloom_filter(source_strings, parquet_strings)
Member

Same comment as the previous one.

Author
Member

When I added a `not`, the variable's meaning became correct.

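Putting the last four comments together, the corrected comparison presumably ends up roughly like this (sketch; names come from the diff above, messages are illustrative):

```python
# not is_sub_bloom_filter(A, B) is True when some strings in A are absent from B
missing_in_parquet = not is_sub_bloom_filter(parquet_strings, source_strings)
if missing_in_parquet:
    console.print("[bold red]Parquet contains strings missing from the source[/bold red]")
    return False
missing_in_source = not is_sub_bloom_filter(source_strings, parquet_strings)
if missing_in_source:
    console.print("[bold red]Source contains strings missing from the parquet files[/bold red]")
    return False
return True
```
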
aditya left a comment
Member

Requesting changes due to an unhandled potential runtime error.

@ -0,0 +384,4 @@
count = 0
filter_fn = filter_fn or filter_string
for field in _generate_fields_from_source(dataset_id, base_dir):
Member

This code should be inside a try/except block, because _generate_fields_from_source can raise RuntimeError; not catching it will cause the validation to crash instead of gracefully reporting the issue.

Author
Member

Done.

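Roughly this shape (the loop body and the failure signal are assumptions, since only the first lines of the function are visible in the diff):

```python
count = 0
filter_fn = filter_fn or filter_string
try:
    for field in _generate_fields_from_source(dataset_id, base_dir):
        # ... existing loop body (filtering / counting) ...
        if filter_fn(field):
            count += 1
except RuntimeError as exc:
    # Report instead of crashing the validator; returning None as the
    # failure signal is an assumption about this function's contract.
    console.print(f"[bold red]Failed to read source for {dataset_id}: {exc}[/bold red]")
    return None
```
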
Author
Member

One other problem, pointed out in https://chat.google.com/room/AAQAc8wOydI/lpx8tu4iHQk/lpx8tu4iHQk?cls=10 :

> The --validate argument is showing missing triples between source file
> The parquet files downloaded from HuggingFace contain string identifiers like n663aa0870337408bb1b65f769bf5d1a4b262 that were generated during the original RDF-to-parquet conversion
> These strings represent RDF blank nodes (anonymous nodes without URIs)
> Source Re-Parse Generates Different Blank Node IDs
> When validation re-parses the source RDF file, RDFlib generates completely different identifiers for the same blank nodes
> Result: Parquet has strings like n663aa...262 but the re-parsed source has different IDs like Na1b2c... for the same semantic nodes
> The validator reports these parquet strings as "missing" from source

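One common way to sidestep this class of mismatch (not necessarily what we end up doing here) is to exclude blank-node identifiers from the comparison entirely, since they are regenerated on every parse; rdflib exposes them as BNode terms:

```python
from rdflib import Graph, BNode


def iter_stable_terms(graph: Graph):
    """Yield only terms whose string form is stable across re-parses.

    Blank node identifiers are regenerated on every parse, so they are
    skipped; URIs and literals are kept.  (Illustrative helper name.)
    """
    for subject, predicate, obj in graph:
        for term in (subject, predicate, obj):
            if isinstance(term, BNode):
                continue
            yield str(term)
```

rdflib also provides Graph.skolemize(), which replaces blank nodes with stable IRIs, but that would change the strings stored on both sides of the comparison.
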
brent.edwards changed title from "So that there is something: The code has been written." to "Create-validator." 2026-02-05 01:19:08 +00:00
This pull request has changes conflicting with the target branch.
  • features/steps/google_sheets_steps.py
  • scripts/convert_rdf_to_hf_dataset_unified.py
  • scripts/rdf_to_hf_incremental.py
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin create-validator:create-validator
git switch create-validator

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch streaming-upload-docker
git merge --no-ff create-validator
git push origin streaming-upload-docker