16-error-messages-and-logging #17
Reference
cleverdatasets/dataset-uploader!17
No description provided.
I used qwen3-coder to suggest additions for error messages. They seemed good. Please read through them and make sure that I didn't approve things that I should not have.
Mostly suggestions, but the big commented block in the middle is more concerning.
@ -92,1 +103,4 @@
    console.print(f" - {path}")
    # List what files are actually there
    try:
        all_files = list(input_dir.rglob("*"))
Would you rather use a globstar here? You're searching in nested directories, so just seeing the contents of the top-level dir mightn't be as useful as giving the user the whole tree. I can imagine that if the program runs in a context where it doesn't have traverse permissions on one of the intermediate dirs, it'd be very helpful to the user to see immediately that the file tree accessible to the program is different from what he can see himself.
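To illustrate the traverse-permission point, here is a minimal sketch of listing the tree as the process itself sees it, while also recording directories it could not read. The helper name `list_visible_tree` is hypothetical and not from the PR; it relies on the standard `os.walk` `onerror` callback rather than on `rglob`.

```python
import os
from pathlib import Path


def list_visible_tree(root: Path) -> tuple[list[Path], list[str]]:
    """Return (files the process can see, directories it could not list).

    Hypothetical helper for illustration; not part of the PR.
    """
    unreadable: list[str] = []

    def note_error(err: OSError) -> None:
        # os.walk calls this instead of raising when a directory cannot be listed.
        unreadable.append(f"{err.filename}: {err.strerror}")

    files: list[Path] = []
    for dirpath, _dirnames, filenames in os.walk(root, onerror=note_error):
        files.extend(Path(dirpath) / name for name in sorted(filenames))
    return files, unreadable
```

Printing both lists in the error message would show the user where the program's view of the tree stops.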
@ -719,0 +741,4 @@
        ProgressFileReader(f, update_progress, file_size_bytes) as pf,
    ):
        file_content = pf.read()
    except (gzip.BadGzipFile, bz2.BadGzipFile) as e:
I don't see BadGzipFile in bz2. From source here:
@ -618,3 +695,1 @@
    total_triples = 0
    chunk_count = 0
    parquet_files = []
    total_triples = 0 # DELETE THIS LINE!
This looks like WIP code; did you forget to change something before filing the PR? I see a block of ninety-odd lines of commented-out code, minus three lines that were left active, and one of the three has an all-caps comment to DELETE THIS LINE!.
Yeah, later on I see that you're still trying to operate on the Parquet files despite having commented out the place where they're generated. If this is what's supposed to be happening, could you please make your intent clearer?
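On the BadGzipFile comment: a minimal sketch of one way to catch corrupt input from both formats using only names that exist in the standard library. The function `read_all` is hypothetical. Note that the tuple in an `except` clause is only evaluated once an exception reaches it, so a missing `bz2.BadGzipFile` would surface as an `AttributeError` exactly when a corrupt file is hit.

```python
import bz2
import gzip
import io

# gzip defines BadGzipFile (a subclass of OSError, Python 3.8+); bz2 does not.
BZ2_HAS_BAD_GZIP_FILE = hasattr(bz2, "BadGzipFile")


def read_all(fileobj, kind: str) -> bytes:
    """Decompress a binary file object; `kind` is "gz" or "bz2".

    Hypothetical helper for illustration; not part of the PR.
    """
    opener = gzip.open if kind == "gz" else bz2.open
    try:
        with opener(fileobj, "rb") as f:
            return f.read()
    except (OSError, EOFError) as e:
        # OSError covers gzip.BadGzipFile and bz2's invalid-stream error;
        # EOFError covers input that ends before the end-of-stream marker.
        raise RuntimeError(f"could not decompress {kind} data: {e}") from e
```

Usage: `read_all(io.BytesIO(gzip.compress(b"abc")), "gz")` returns the original bytes, while non-compressed input raises `RuntimeError` for either kind.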
@ -221,0 +252,4 @@
    # Check if file size is reasonable compared to expected size
    if dataset_info.compressed_size_gb:
        expected_bytes = dataset_info.compressed_size_gb * 1024**3
        if file_size < expected_bytes * 0.1: # Less than 10% of expected size
Please double-check this logic; quite possibly it's correct but I'm unsure. If compressed_size actually refers to the compressed size (i.e. the size of the .gz/.bz2 file) then I would expect the file_size to be (very roughly) a factor of ten larger, not smaller, than the compressed size. This comparison looks reasonably correct if you mean the uncompressed size - if that's the case, I think it'd be clearer to say "uncompressed"/"decompressed" in the comment rather than "expected".
@ -574,9 +574,14 @@ def process_dataset(
    Returns:
        True if successful, False otherwise
This can also return None; both the docstring and the type annotation claim otherwise.
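One way to make the size check unambiguous is to put what is being compared into the names. This is a minimal sketch that assumes (and the PR does not confirm this) that both values are sizes of the compressed file on disk, so the check detects a truncated download. The function and parameter names are hypothetical.

```python
from typing import Optional


def is_suspiciously_small(
    actual_compressed_bytes: int,
    expected_compressed_gb: Optional[float],
    min_fraction: float = 0.1,
) -> bool:
    """True if the file on disk is below `min_fraction` of its expected size.

    Assumption: both sizes describe the compressed (.gz/.bz2) file.
    """
    if not expected_compressed_gb:
        return False  # No expectation recorded, so nothing to compare against.
    expected_bytes = expected_compressed_gb * 1024**3
    return actual_compressed_bytes < expected_bytes * min_fraction
```

If the expected figure is really the uncompressed size, the same shape works, but the parameter names and the threshold would need to say so.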
@ -889,2 +925,3 @@
    # Filter out README files if there are other options
    non_readme_files = [f for f in rdf_files if "readme" not in f.name.lower()]
    non_readme_files = [f for f in rdf_files if "readme" not in f.name.lower()
    and f.name.endswith(".bz2")]
This looks incorrect. Suppose you have rdf_files == ["README", "foo.rdf"]: this change will make non_readme_files empty, and then you'll pick "README" out of rdf_files in a few lines.
@ -1008,3 +1003,1 @@
    rdf_file = rdf_files[0]
    # Test again
    if rdf_file.endswith(".bz2"):
In what is now line 934, you checked this as if str(rdf_file).endswith(".bz2"):. Is the cast to string necessary here too, or do we know a priori that it's something that supports .endswith()?
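A minimal sketch of how the README filter and the .bz2 preference could be kept separate, so that a missing .bz2 file does not cause a fallback to the README. It assumes the entries are `pathlib.Path` objects (which is why `.name` works but `.endswith()` would need a `str()` cast) and that preferring .bz2 is the intent of the change. `pick_rdf_file` is a hypothetical name.

```python
from pathlib import Path


def pick_rdf_file(rdf_files: list[Path]) -> Path:
    """Pick one input file: skip READMEs if possible, then prefer .bz2.

    Hypothetical helper for illustration; not part of the PR.
    """
    if not rdf_files:
        raise ValueError("no candidate files found")
    # Step 1: drop README files only if something else remains.
    non_readme = [f for f in rdf_files if "readme" not in f.name.lower()]
    candidates = non_readme or rdf_files
    # Step 2: among the remaining candidates, prefer .bz2 but do not require it.
    bz2_files = [f for f in candidates if f.suffix == ".bz2"]
    return (bz2_files or candidates)[0]
```

With the example from the comment, `pick_rdf_file([Path("README"), Path("foo.rdf")])` selects `foo.rdf` rather than the README.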