Create-validator. #56

Open
brent.edwards wants to merge 31 commits from create-validator into streaming-upload-docker
Member

This is so that Aditya can read and comment on the file.

Co-authored-by: aider (openrouter/openai/gpt-5.2-codex) <aider@aider.chat>
aditya left a comment
Member

Mostly memory and performance related issues.

@ -0,0 +53,4 @@
client.dataset_info(repo_id=repo_id)
except HfHubHTTPError as exc:
status_code = exc.response.status_code if exc.response else None
if status_code == 404:
Member

What about 500, 503, or timeouts? These crash the validator.

Author
Member

Fixed.

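For reference, a minimal sketch of the kind of handling this needs, treating 404 as "dataset missing" and retrying transient failures (the helper name and retry policy below are illustrative, not the actual code):

```python
import time

import requests
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError


def dataset_exists_on_hf(repo_id: str, retries: int = 3) -> bool:
    """Return True if the dataset repo exists, False if it does not.

    Transient server errors (5xx) and timeouts are retried instead of
    crashing the validator.  (Hypothetical helper; names are illustrative.)
    """
    client = HfApi()
    for attempt in range(retries):
        try:
            client.dataset_info(repo_id=repo_id)
            return True
        except HfHubHTTPError as exc:
            status_code = exc.response.status_code if exc.response else None
            if status_code == 404:
                return False
            if status_code in (500, 502, 503, 504) and attempt < retries - 1:
                time.sleep(2 ** attempt)  # back off, then retry transient errors
                continue
            raise  # anything else is a real error and should surface
        except requests.exceptions.Timeout:
            if attempt < retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
    return False
```
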
@ -0,0 +100,4 @@
return download_path
def dataset_is_parquet_file(
Member

This function returns False for permission errors, corrupted files, or any legitimate bugs. It should distinguish between validation failures and system errors.

Author
Member

I want to keep every test as simple as possible.

The file was downloaded by `dataset_can_download_from_hf`. If there are permission errors, it means that the file was downloaded into a directory that is writable but not readable. Those directories are very rare.

`dataset_has_required_columns` checks for corrupted files in a separate test.

Finally, I don't know what a "legitimate bug" is.

Let me know if you have other tests that you would like me to add...

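For reference, if we do want that distinction later, a sketch could look like this (pyarrow-based, illustrative only; the real check in this PR may differ):

```python
import pyarrow as pa
import pyarrow.parquet as pq


def dataset_is_parquet_file(file_path) -> bool:
    """Return False only when the file is readable but is not valid parquet.

    Illustrative sketch: PermissionError / OSError deliberately propagate,
    so system errors are not folded into a False validation result.
    """
    try:
        pq.ParquetFile(file_path)   # raises ArrowInvalid on non-parquet content
        return True
    except pa.ArrowInvalid:
        return False
```
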
@ -0,0 +194,4 @@
rdf_format = "xml"
graph = Graph()
graph.parse(str(file_path), format=rdf_format)
Member

It loads the entire RDF file into an in-memory graph; for multi-GB RDF files, this is extremely slow and memory-intensive.

Author
Member

I have changed this to a bloom filter.

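For line-oriented serializations such as N-Triples, the parse itself can also be streamed straight into the filter instead of building a Graph first. A rough sketch, assuming bloom_filter2 and deliberately naive term splitting (function name and sizing values are guesses, not the PR's code):

```python
from bloom_filter2 import BloomFilter


def collect_source_strings_streaming(nt_path, max_elements=100_000_000):
    """Stream an N-Triples file into a Bloom filter without building a Graph.

    Sketch only: works for line-oriented formats (.nt), not RDF/XML, and the
    term splitting below is deliberately naive.
    """
    store = BloomFilter(max_elements=max_elements, error_rate=0.001)
    with open(nt_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.rstrip(" .").split(" ", 2)  # subject, predicate, object
            if len(parts) != 3:
                continue
            for term in parts:
                store.add(term)
    return store
```
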
@ -0,0 +203,4 @@
return store
def collect_parquet_strings(
Member

This function loads entire datasets into memory as string sets for comparison; for large datasets (such as Wikidata, with billions of triples), this will exhaust memory.

Author
Member

Changed to a bloom filter.

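A sketch of the batch-streaming variant, assuming pyarrow and bloom_filter2 (the batch size and the shared-store parameter are illustrative, not the actual signature):

```python
import pyarrow.parquet as pq
from bloom_filter2 import BloomFilter


def collect_parquet_strings(parquet_file, store=None, batch_size=65_536):
    """Add every string cell of a parquet file to a Bloom filter, batch by batch."""
    if store is None:
        store = BloomFilter(max_elements=100_000_000, error_rate=0.001)
    reader = pq.ParquetFile(parquet_file)
    for batch in reader.iter_batches(batch_size=batch_size):
        for column in batch.columns:
            for value in column.to_pylist():
                if isinstance(value, str):
                    store.add(value)
    return store
```
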
@ -1269,0 +1286,4 @@
base_dir: Path,
) -> bool:
"""Validate that source strings and parquet strings match (Level 2/3)."""
source_strings = collect_source_strings(dataset_id, base_dir)
Member

collect_source_strings returns an empty set if no files are found; a warning is printed, but validation continues and passes.
Should this be a failure?

Author
Member

Excellent point. Changed.

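Roughly this shape, assuming collect_source_strings can signal "no files found" (the None convention and message text below are illustrative):

```python
source_strings = collect_source_strings(dataset_id, base_dir)
if source_strings is None:  # no source files found: fail instead of passing
    console.print(f"[bold red]Validation failed for {dataset_id}: no source files found[/bold red]")
    return False
```
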
@ -1586,6 +1718,11 @@ def main() -> int:
console.print(f"[bold cyan]Removing {rdf_file}[/bold cyan]")
remove_dir(rdf_file.parent)
validation_ok = validate(datasets_to_process, args.base_dir)
Member

This validation runs after all the datasets are uploaded, at the very end; the user waits days for uploads, then discovers validation failures. Can we add per-dataset validation before marking each dataset complete?

Author
Member

Done.

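A rough sketch of the per-dataset flow being asked for (upload_dataset and mark_dataset_complete are hypothetical names, not the actual helpers):

```python
failed = []
for dataset_id in datasets_to_process:
    upload_dataset(dataset_id, args.base_dir)        # hypothetical upload helper
    if not validate([dataset_id], args.base_dir):    # validate right after upload
        console.print(f"[bold red]Validation failed for {dataset_id}[/bold red]")
        failed.append(dataset_id)
        continue
    mark_dataset_complete(dataset_id)                # hypothetical completion marker
validation_ok = not failed
```
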
aditya left a comment
Member

I have pointed out an error that I encountered while running the latest upload_all_datasets.py. I have also highlighted some points related to the use of Bloom filters.

@ -1269,0 +1299,4 @@
else:
parquet_strings = _new_string_store()
for parquet_file in parquet_files:
parquet_strings = parquet_strings.union(
Member

Line 1302 is giving me an AttributeError: 'NoneType' object has no attribute 'union'.

This is the reason I found on the internet:

In bloom_filter2, union() is implemented as a mutating operation, similar to set.update() in Python.

`parquet_strings = parquet_strings.union(collect_parquet_strings(parquet_file))` modifies the BloomFilter in place and returns None.

parquet_strings is now None.

Trying to call None.union(...) raises AttributeError: 'NoneType' object has no attribute 'union'.

Author
Member

Thank you.

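Given the behaviour described above (union() mutating in place and returning None), the fix is presumably to stop reassigning the return value, e.g.:

```python
parquet_strings = _new_string_store()
for parquet_file in parquet_files:
    # union() mutates parquet_strings in place and returns None,
    # so its result must not be assigned back.
    parquet_strings.union(collect_parquet_strings(parquet_file))
```

Alternatively, the shared filter could be passed into collect_parquet_strings so each file adds its strings directly.
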
@ -1269,0 +1303,4 @@
collect_parquet_strings(parquet_file)
)
missing_in_parquet = parquet_strings in source_strings
Member

I think the `in` operator here checks if one BloomFilter object is contained in another, which isn't a valid operation. BloomFilters support testing individual string membership (`string in bloom_filter`), but not subset/superset checks between filters.

Please correct me if I am missing something here.

Author
Member

You were correct. I have added a new function and tests for the function.

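For reference, for two Bloom filters built with identical max_elements/error_rate (and therefore identical size and hash configuration), "A is a subset of B" reduces to a bit-level check: every bit set in A must also be set in B. A sketch over raw bit arrays (how to extract those bytes from a bloom_filter2 filter depends on library internals, so this stops at bytes and is not the helper actually added in the PR):

```python
def bits_subset(inner_bits: bytes, outer_bits: bytes) -> bool:
    """True when every bit set in inner_bits is also set in outer_bits.

    Only meaningful when both filters were created with identical
    parameters, so their bit arrays line up.
    """
    return len(inner_bits) == len(outer_bits) and all(
        (a & b) == a for a, b in zip(inner_bits, outer_bits)
    )
```

A filter-level is_sub_bloom_filter wrapper would extract the underlying bit array from each BloomFilter and call this check.
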
aditya left a comment
Member

When I ran `python scripts/upload_all_datasets.py --dataset wordnet --validate`, it started and completed the validation process, but I did not get any success or failure message.

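A plausible way to surface the outcome at the end of main(), reusing validation_ok from the snippet quoted earlier (messages are illustrative):

```python
validation_ok = validate(datasets_to_process, args.base_dir)
if validation_ok:
    console.print("[bold green]Validation succeeded[/bold green]")
else:
    console.print("[bold red]Validation failed[/bold red]")
return 0 if validation_ok else 1
```
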
@ -1269,0 +1322,4 @@
)
missing_in_parquet = is_sub_bloom_filter(parquet_strings, source_strings)
if missing_in_parquet:
Member

I think you should invert this `if` condition to check for the missing parquet strings.
Please correct me if I am missing anything.

Author
Member

You are correct.

@ -1269,0 +1329,4 @@
return False
missing_in_source = is_sub_bloom_filter(source_strings, parquet_strings)
if missing_in_source:
Member

I think you should invert the `if` condition to check for the missing source and missing parquet strings.

Author
Member

You are correct.

@ -1269,0 +1321,4 @@
collect_parquet_strings(parquet_file)
)
missing_in_parquet = is_sub_bloom_filter(parquet_strings, source_strings)
Member

I think the variable name `missing_in_parquet` is misleading. `is_sub_bloom_filter` returns True when parquet is a subset of source, not when strings are missing.

Author
Member

When I added a `not`, the variable's meaning became correct.

@ -1269,0 +1328,4 @@
)
return False
missing_in_source = is_sub_bloom_filter(source_strings, parquet_strings)
Member

Same comment as the previous one.

Author
Member

When I added a `not`, the variable's meaning became correct.

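Putting the last four comments together, the corrected comparison presumably ends up roughly like this (sketch; names come from the diff above, messages are illustrative):

```python
# not is_sub_bloom_filter(A, B) is True when some strings in A are absent from B
missing_in_parquet = not is_sub_bloom_filter(parquet_strings, source_strings)
if missing_in_parquet:
    console.print("[bold red]Parquet contains strings missing from the source[/bold red]")
    return False
missing_in_source = not is_sub_bloom_filter(source_strings, parquet_strings)
if missing_in_source:
    console.print("[bold red]Source contains strings missing from the parquet files[/bold red]")
    return False
return True
```
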
aditya left a comment
Member

Requesting changes due to an unhandled potential runtime error.

@ -0,0 +384,4 @@
count = 0
filter_fn = filter_fn or filter_string
for field in _generate_fields_from_source(dataset_id, base_dir):
Member

This code should be inside a try/except block, because _generate_fields_from_source can raise RuntimeError; not catching it will cause the validation to crash instead of gracefully reporting the issue.

Author
Member

Done.

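Roughly this shape (the loop body and the failure signal are assumptions, since only the first lines of the function are visible in the diff):

```python
count = 0
filter_fn = filter_fn or filter_string
try:
    for field in _generate_fields_from_source(dataset_id, base_dir):
        # ... existing loop body (filtering / counting) ...
        if filter_fn(field):
            count += 1
except RuntimeError as exc:
    # Report instead of crashing the validator; returning None as the
    # failure signal is an assumption about this function's contract.
    console.print(f"[bold red]Failed to read source for {dataset_id}: {exc}[/bold red]")
    return None
```
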
Author
Member

One other problem, pointed out in https://chat.google.com/room/AAQAc8wOydI/lpx8tu4iHQk/lpx8tu4iHQk?cls=10 :

> The --validate argument is showing missing triples between source file
> The parquet files downloaded from HuggingFace contain string identifiers like n663aa0870337408bb1b65f769bf5d1a4b262 that were generated during the original RDF-to-parquet conversion
> These strings represent RDF blank nodes (anonymous nodes without URIs)
> Source Re-Parse Generates Different Blank Node IDs
> When validation re-parses the source RDF file, RDFlib generates completely different identifiers for the same blank nodes
> Result: Parquet has strings like n663aa...262 but the re-parsed source has different IDs like Na1b2c... for the same semantic nodes
> The validator reports these parquet strings as "missing" from source

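One common way to sidestep this class of mismatch (not necessarily what we end up doing here) is to exclude blank-node identifiers from the comparison entirely, since they are regenerated on every parse; rdflib exposes them as BNode terms:

```python
from rdflib import Graph, BNode


def iter_stable_terms(graph: Graph):
    """Yield only terms whose string form is stable across re-parses.

    Blank node identifiers are regenerated on every parse, so they are
    skipped; URIs and literals are kept.  (Illustrative helper name.)
    """
    for subject, predicate, obj in graph:
        for term in (subject, predicate, obj):
            if isinstance(term, BNode):
                continue
            yield str(term)
```

rdflib also provides Graph.skolemize(), which replaces blank nodes with stable IRIs, but that would change the strings stored on both sides of the comparison.
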
brent.edwards changed title from "So that there is something: The code has been written." to "Create-validator." 2026-02-05 01:19:08 +00:00
This pull request has changes conflicting with the target branch.
  • features/steps/google_sheets_steps.py
  • scripts/convert_rdf_to_hf_dataset_unified.py
  • scripts/rdf_to_hf_incremental.py
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin create-validator:create-validator
git switch create-validator

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch streaming-upload-docker
git merge --no-ff create-validator
git push origin streaming-upload-docker