So that there is something: The code has been written. #56

Open
brent.edwards wants to merge 16 commits from create-validator into streaming-upload-docker
Member

This is so that Aditya can read and comment on the file.
Co-authored-by: aider (openrouter/openai/gpt-5.2-codex) <aider@aider.chat>
aditya left a comment
Member

Mostly memory- and performance-related issues.
@@ -0,0 +53,4 @@
client.dataset_info(repo_id=repo_id)
except HfHubHTTPError as exc:
status_code = exc.response.status_code if exc.response else None
if status_code == 404:
Member

What about 500, 503, and timeouts? These crash the validator.
Author
Member

Fixed.
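The fix itself isn't quoted in the thread, but the shape of it would be something like the sketch below; `dataset_exists_on_hf`, the retry count, and the backoff are illustrative, not the PR's actual code.

```python
import time

import requests
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

def dataset_exists_on_hf(repo_id: str, retries: int = 3) -> bool:
    client = HfApi()
    for attempt in range(retries):
        try:
            client.dataset_info(repo_id=repo_id)
            return True
        except HfHubHTTPError as exc:
            status_code = exc.response.status_code if exc.response else None
            if status_code == 404:
                return False  # the dataset genuinely does not exist
            if status_code in (500, 502, 503, 504):
                time.sleep(2 ** attempt)  # transient server error: back off and retry
                continue
            raise  # any other HTTP error is unexpected: surface it
        except requests.exceptions.Timeout:
            time.sleep(2 ** attempt)  # network timeout: back off and retry
    raise RuntimeError(f"Hugging Face Hub unreachable after {retries} attempts: {repo_id}")
```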
@@ -0,0 +100,4 @@
return download_path
def dataset_is_parquet_file(
Member

This function returns `False` for permission errors, corrupted files, or any legitimate bugs. It should distinguish between validation failures and system errors.
Author
Member

I want to keep every test as simple as possible.

The file was downloaded by `dataset_can_download_from_hf`. If there are permission errors, it means the file was downloaded into a directory that is writable but not readable. Such directories are very rare.

`dataset_has_required_columns` checks for corrupted files in a separate test.

Finally, I don't know what a "legitimate bug" is.

Let me know if there are other tests you would like me to add...
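For context, a minimal sketch of the split being asked for, assuming pyarrow is the parquet reader (the PR's actual reader isn't shown in the hunk): validation failures return `False`, system errors propagate.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def dataset_is_parquet_file(path: str) -> bool:
    try:
        pq.ParquetFile(path)  # raises if the file is not readable parquet
        return True
    except pa.ArrowInvalid:
        return False  # validation failure: the file exists but is not parquet
    except OSError:
        raise  # system error (permissions, I/O): propagate rather than return False
```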
@@ -0,0 +194,4 @@
rdf_format = "xml"
graph = Graph()
graph.parse(str(file_path), format=rdf_format)
Member

It loads the entire RDF file into an in-memory graph; for multi-GB RDF files this is extremely slow and memory-intensive.
Author
Member

I have changed this to a bloom filter.
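The bloom-filter version isn't quoted in the thread; here is a rough sketch of the idea, assuming an N-Triples source (one statement per line) so the file can be streamed without building an rdflib `Graph`. The hand-rolled filter and its sizing are illustrative.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; k bit positions per item derived from one SHA-256 digest."""

    def __init__(self, size_bits: int = 1 << 27, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def collect_source_strings(file_path: str) -> BloomFilter:
    bloom = BloomFilter()
    with open(file_path, encoding="utf-8") as fh:
        for line in fh:  # one statement per line; nothing is retained beyond the filter
            bloom.add(line.strip())
    return bloom
```

Worth keeping in mind: a bloom filter proves absence with certainty but presence only probabilistically, so the comparison has to be oriented so that a false positive merely risks missing a mismatch at a small, tunable rate.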
@@ -0,0 +203,4 @@
return store
def collect_parquet_strings(
Member

This function loads entire datasets into memory as string sets for comparison; for large datasets (such as Wikidata, with billions of triples), this will exhaust memory.
Author
Member

Changed to a bloom filter.
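And the parquet side, streamed in record batches through pyarrow instead of materialised as a string set; the column name and batch size are assumptions, and `BloomFilter` is the sketch above.

```python
import pyarrow.parquet as pq

def parquet_strings_match_source(path: str, column: str, source_bloom) -> bool:
    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(columns=[column], batch_size=65_536):
        for value in batch.column(0):  # only the requested column is in the batch
            if str(value) not in source_bloom:  # a definite miss: the string is absent
                return False
    return True
```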
@@ -1269,0 +1286,4 @@
base_dir: Path,
) -> bool:
"""Validate that source strings and parquet strings match (Level 2/3)."""
source_strings = collect_source_strings(dataset_id, base_dir)
Member

`collect_source_strings` returns an empty set if no files are found; a warning is printed, but validation continues and passes. Should this be a failure?
Author
Member

Excellent point. Changed.
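The change isn't quoted, but presumably it amounts to turning the warning into a hard failure, along these lines (using the names visible in the hunk):

```python
source_strings = collect_source_strings(dataset_id, base_dir)
if not source_strings:
    console.print(f"[bold red]No source files found for {dataset_id}[/bold red]")
    return False  # an empty source set now fails validation instead of passing
```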
@@ -1586,6 +1718,11 @@ def main() -> int:
console.print(f"[bold cyan]Removing {rdf_file}[/bold cyan]")
remove_dir(rdf_file.parent)
validation_ok = validate(datasets_to_process, args.base_dir)
Member

This validation runs after all the datasets are uploaded, at the very end; the user waits days for uploads, then discovers validation failures. Can we add per-dataset validation before marking each dataset complete?
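A sketch of the flow being suggested; `upload_dataset` and `mark_complete` are illustrative names, not the PR's API.

```python
for dataset_id in datasets_to_process:
    upload_dataset(dataset_id, args.base_dir)
    # Validate immediately so a failure surfaces within hours, not days.
    if not validate([dataset_id], args.base_dir):
        console.print(f"[bold red]{dataset_id} failed validation; not marked complete[/bold red]")
        continue
    mark_complete(dataset_id)
```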