16-error-messages-and-logging #17

Open
brent.edwards wants to merge 11 commits from 16-error-messages-and-logging into master
Member

I used qwen3-coder to suggest additions for error messages. They seemed good. Please read through them and make sure that I didn't approve things that I should not have.

I used qwen3-coder to suggest additions for error messages. They seemed good. Please read through them and make sure that *I* didn't approve things that I should not have.
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
khird left a comment
First-time contributor

Mostly suggestions, but the big commented block in the middle is more concerning.

Mostly suggestions, but the big commented block in the middle is more concerning.
@ -92,1 +103,4 @@
console.print(f" - {path}")
# List what files are actually there
try:
all_files = list(input_dir.rglob("*"))
First-time contributor

Would you rather use a globstar here? You're searching in nested directories, so just seeing the contents of the top-level dir mightn't be as useful as giving the user the whole tree. I can imagine that if the program runs in a context where it doesn't have traverse permissions on one of the intermediate dirs, it'd be very helpful to the user to see immediately that the file tree accessible to the program is different from what he can see himself.

Would you rather use a globstar here? You're searching in nested directories, so just seeing the contents of the top-level dir mightn't be as useful as giving the user the whole tree. I can imagine that if the program runs in a context where it doesn't have traverse permissions on one of the intermediate dirs, it'd be very helpful to the user to see immediately that the file tree accessible to the program is different from what he can see himself.
@ -719,0 +741,4 @@
ProgressFileReader(f, update_progress, file_size_bytes) as pf,
):
file_content = pf.read()
except (gzip.BadGzipFile, bz2.BadGzipFile) as e:
First-time contributor

I don't see BadGzipFile in bz2. From source here:

__all__ = ["BZ2File", "BZ2Compressor", "BZ2Decompressor",
           "open", "compress", "decompress"]
I don't see `BadGzipFile` in `bz2`. From source [here](https://github.com/python/cpython/blob/3.14/Lib/bz2.py): ``` __all__ = ["BZ2File", "BZ2Compressor", "BZ2Decompressor", "open", "compress", "decompress"] ```
@ -618,3 +695,1 @@
total_triples = 0
chunk_count = 0
parquet_files = []
total_triples = 0 # DELETE THIS LINE!
First-time contributor

This looks like WIP code; did you forget to change something before filing the PR? I see a block of ninety-odd lines of commented-out code, minus three lines that were left active, and one of the three has an all-caps comment to DELETE THIS LINE!.

This looks like WIP code; did you forget to change something before filing the PR? I see a block of ninety-odd lines of commented-out code, minus three lines that were left active, and one of the three has an all-caps comment to `DELETE THIS LINE!`.
First-time contributor

Yeah, later on I see that you're still trying to operate on the Parquet files despite having commented out the place where they're generated. If this is what's supposed to be happening, could you please make your intent clearer?

Yeah, later on I see that you're still trying to operate on the Parquet files despite having commented out the place where they're generated. If this is what's supposed to be happening, could you please make your intent clearer?
@ -221,0 +252,4 @@
# Check if file size is reasonable compared to expected size
if dataset_info.compressed_size_gb:
expected_bytes = dataset_info.compressed_size_gb * 1024**3
if file_size < expected_bytes * 0.1: # Less than 10% of expected size
First-time contributor

Please double-check this logic; quite possibly it's correct but I'm unsure. If compressed_size actually refers to the compressed size (i.e. the size of the .gz/.bz2 file) then I would expect the file_size to be (very roughly) a factor of ten larger, not smaller, than the compressed size. This comparison looks reasonably correct if you mean the uncompressed size - if that's the case, I think it'd be clearer to say "uncompressed"/"decompressed" in the comment rather than "expected".

Please double-check this logic; quite possibly it's correct but I'm unsure. If `compressed_size` actually refers to the compressed size (i.e. the size of the `.gz`/`.bz2` file) then I would expect the `file_size` to be (very roughly) a factor of ten *larger*, not smaller, than the compressed size. This comparison looks reasonably correct if you mean the uncompressed size - if that's the case, I think it'd be clearer to say "uncompressed"/"decompressed" in the comment rather than "expected".
@ -574,9 +574,14 @@ def process_dataset(
Returns:
True if successful, False otherwise
First-time contributor

This can also return None; both the docstring and the type annotation claim otherwise.

This can also return None; both the docstring and the type annotation claim otherwise.
@ -889,2 +925,3 @@
# Filter out README files if there are other options
non_readme_files = [f for f in rdf_files if "readme" not in f.name.lower()]
non_readme_files = [f for f in rdf_files if "readme" not in f.name.lower()
and f.name.endswith(".bz2")]
First-time contributor

This looks incorrect. Suppose you have rdf_files == ["README", "foo.rdf"] this change will make non_readme_files empty and then you'll pick "README" out of rdf_files in a few lines.

This looks incorrect. Suppose you have `rdf_files == ["README", "foo.rdf"]` this change will make `non_readme_files` empty and then you'll pick `"README"` out of `rdf_files` in a few lines.
@ -1008,3 +1003,1 @@
rdf_file = rdf_files[0]
# Test again
if rdf_file.endswith(".bz2"):
First-time contributor

In what is now line 934, you checked this as if str(rdf_file).endswith(".bz2"):. Is the cast to string necessary here too, or do we know a priori that it's something that supports .endswith()?

In what is now line 934, you checked this as `if str(rdf_file).endswith(".bz2"):`. Is the cast to string necessary here too, or do we know a priori that it's something that supports `.endswith()`?
This pull request can be merged automatically.
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin 16-error-messages-and-logging:16-error-messages-and-logging
git switch 16-error-messages-and-logging

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch master
git merge --no-ff 16-error-messages-and-logging
git switch 16-error-messages-and-logging
git rebase master
git switch master
git merge --ff-only 16-error-messages-and-logging
git switch 16-error-messages-and-logging
git rebase master
git switch master
git merge --no-ff 16-error-messages-and-logging
git switch master
git merge --squash 16-error-messages-and-logging
git switch master
git merge --ff-only 16-error-messages-and-logging
git switch master
git merge 16-error-messages-and-logging
git push origin master
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleverdatasets/dataset-uploader!17
No description provided.