So that there is something: The code has been written. #56
No description provided.
This is so that Aditya can read and comment on the file.
Mostly memory- and performance-related issues.
@@ -0,0 +53,4 @@
    client.dataset_info(repo_id=repo_id)
except HfHubHTTPError as exc:
    status_code = exc.response.status_code if exc.response else None
    if status_code == 404:

What about 500, 503, or timeouts? Those currently crash the validator.
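For example, something along these lines would cover the non-404 cases. This is only a sketch assuming the public huggingface_hub API (HfApi.dataset_info, HfHubHTTPError); the function name and retry policy are illustrative, not the PR's code:

```python
import time

import requests
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError


def dataset_exists_on_hf(repo_id: str, retries: int = 3) -> bool:
    """Illustrative only: 404 means missing, transient 5xx/timeouts retry, anything else re-raises."""
    client = HfApi()
    for attempt in range(retries):
        try:
            client.dataset_info(repo_id=repo_id)
            return True
        except HfHubHTTPError as exc:
            status_code = exc.response.status_code if exc.response else None
            if status_code == 404:
                return False  # dataset genuinely does not exist
            if status_code in (500, 502, 503, 504) and attempt < retries - 1:
                time.sleep(2 ** attempt)  # back off and retry transient server errors
                continue
            raise  # unexpected error: fail loudly instead of crashing the validator later
        except requests.exceptions.Timeout:
            if attempt < retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
    return False
```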
Fixed.
@@ -0,0 +100,4 @@
    return download_path

def dataset_is_parquet_file(

This function returns False for permission errors, corrupted files, or any legitimate bugs. It should distinguish between validation failures and system errors.
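A sketch of the distinction I mean, assuming pyarrow is already a dependency (the body is illustrative, not the PR's implementation):

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq


def dataset_is_parquet_file(download_path: Path) -> bool:
    """Return False only for genuine format problems; let system errors propagate."""
    try:
        pq.ParquetFile(download_path)  # reads the footer and magic bytes
        return True
    except pa.ArrowInvalid:
        # Not a valid parquet file: a real validation failure.
        return False
    # PermissionError / OSError are deliberately not caught, so a broken
    # environment raises an exception instead of being reported as "not parquet".
```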
I want to keep every test as simple as possible.
The file was downloaded by dataset_can_download_from_hf. If there are permission errors, it means the file was downloaded into a directory that is writable but not readable; such directories are very rare. dataset_has_required_columns checks for corrupted files in a separate test. Finally, I don't know what a "legitimate bug" is.
Let me know if you have other tests that you would like me to add...
@@ -0,0 +194,4 @@
    rdf_format = "xml"
graph = Graph()
graph.parse(str(file_path), format=rdf_format)

This loads the entire RDF file into an in-memory graph. For multi-GB RDF files that is extremely slow and memory-intensive.
I have changed this to a bloom filter.
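Roughly, that means keeping a fixed-size probabilistic set of source strings instead of the full graph contents. A hand-rolled sketch of the data structure (not the code that ended up in the branch):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: fixed memory, no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 8 * 1024 * 1024, num_hashes: int = 5) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)  # ~1 MiB regardless of input size

    def _positions(self, item: str):
        # Derive num_hashes independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Memory stays constant no matter how many strings the RDF source contributes. The trade-off: a string reported as "missing" is a definite mismatch, while a "present" answer is probabilistic because of false positives.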
@@ -0,0 +203,4 @@
    return store

def collect_parquet_strings(

This function loads entire datasets into memory as string sets for comparison. For large datasets (such as Wikidata, with billions of triples), this will exhaust memory.
Changed to a bloom filter.
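On the parquet side, the strings can then be streamed batch by batch instead of materialised as one set. A sketch assuming pyarrow's ParquetFile.iter_batches and the Bloom-filter interface sketched above; the column name "text" is a placeholder:

```python
import pyarrow.parquet as pq


def check_parquet_strings_against_source(parquet_path, source_filter, column: str = "text"):
    """Stream one parquet column in batches and report strings missing from the source filter."""
    missing = []
    parquet_file = pq.ParquetFile(parquet_path)
    for batch in parquet_file.iter_batches(batch_size=10_000, columns=[column]):
        for value in batch.column(0).to_pylist():
            if value is not None and value not in source_filter:
                missing.append(value)  # definite mismatch: Bloom filters have no false negatives
    return missing
```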
@@ -1269,0 +1286,4 @@
    base_dir: Path,
) -> bool:
    """Validate that source strings and parquet strings match (Level 2/3)."""
    source_strings = collect_source_strings(dataset_id, base_dir)

collect_source_strings returns an empty set if no files are found; a warning is printed, but validation continues and passes.
Should this be a failure?
Excellent point. Changed.
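Presumably the fix inside this validation helper amounts to something like the fragment below (a sketch of the behaviour, not the actual commit; collect_source_strings and console come from the diff above):

```python
source_strings = collect_source_strings(dataset_id, base_dir)
if not source_strings:
    console.print(f"[bold red]No source files found for {dataset_id}; failing validation[/bold red]")
    return False  # an empty comparison baseline should not count as a pass
```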
@@ -1586,6 +1718,11 @@ def main() -> int:
        console.print(f"[bold cyan]Removing {rdf_file}[/bold cyan]")
        remove_dir(rdf_file.parent)
    validation_ok = validate(datasets_to_process, args.base_dir)

This validation runs only after all the datasets have been uploaded, at the very end. A user can wait days for the uploads and only then discover validation failures. Can we add per-dataset validation before marking each dataset complete?
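One possible shape for that, hedged because only a few lines of main() are visible in the hunk: validate each dataset right after its upload and only then mark it complete, keeping the end-of-run validate() as a final cross-check. upload_dataset, validate_dataset, and mark_complete below are hypothetical names standing in for whatever main() actually calls:

```python
failed: list[str] = []
for dataset_id in datasets_to_process:
    upload_dataset(dataset_id, args.base_dir)
    if validate_dataset(dataset_id, args.base_dir):
        mark_complete(dataset_id)  # only mark done once this dataset validates
    else:
        failed.append(dataset_id)
        console.print(f"[bold red]Validation failed for {dataset_id}[/bold red]")

# The existing end-of-run check can stay as a final cross-dataset sanity pass.
validation_ok = validate(datasets_to_process, args.base_dir) and not failed
```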