16-error-messages-and-logging #17

brent.edwards · 2025-11-21T01:39:06Z

brent.edwards commented

2025-11-21 01:39:06 +00:00

I used qwen3-coder to suggest additions for error messages. They seemed good. Please read through them and make sure that *I* didn't approve things that I should not have.

brent.edwards added 10 commits

2025-11-21 01:39:07 +00:00

feat: enhance error handling and reporting in dataset processing scripts e3bee8af4f

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

feat: improve error handling and logging in RDF dataset downloader e1cabc7069

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

feat: improve error handling for FB15k-237 dataset conversion 576dc842a9

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

fix: correct indentation in FB15k-237 file parsing function 163268b40d

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

refactor: improve error handling for NELL-995 dataset conversion 3862bf2655

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

fix: correct indentation in NELL-995 dataset converter script e6e1932f66

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

refactor: improve error handling and logging in ConceptNet conversion script 24869bef6c

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

fix: correct indentation in ConceptNet CSV parsing function e7bfa3f2c9

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

feat: add comprehensive error handling for RDF to HuggingFace dataset conversion 67a9857ab3

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>

One more set of changes for error messages. 69de631c84

brent.edwards added 1 commit

2025-11-24 00:26:00 +00:00

Changes made to make the downloader work. c776025b0f

khird reviewed

2025-12-05 14:59:33 +00:00

khird left a comment

First-time contributor

Mostly suggestions, but the big commented block in the middle is more concerning.

scripts/convert_nell995_to_hf.py

					
@ -92,1 +103,4 @@
            console.print(f"  - {path}")
        # List what files are actually there
        try:
            all_files = list(input_dir.rglob("*"))

Would you rather use a globstar here? You're searching in nested directories, so just seeing the contents of the top-level dir mightn't be as useful as giving the user the whole tree. I can imagine that if the program runs in a context where it doesn't have traverse permissions on one of the intermediate dirs, it'd be very helpful to the user to see immediately that the file tree accessible to the program is different from what he can see himself.

Done.
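For illustration, a minimal sketch of listing everything reachable under a directory with `Path.rglob("*")`. Only `input_dir.rglob("*")` comes from the diff; the helper name `list_visible_files` and the `limit` parameter are hypothetical.

```python
from pathlib import Path


def list_visible_files(input_dir: Path, limit: int = 50) -> list[str]:
    """Hypothetical helper: relative paths of everything reachable under input_dir."""
    try:
        # rglob("*") walks nested directories, not just the top level.
        entries = sorted(input_dir.rglob("*"))
    except OSError as exc:  # e.g. PermissionError while listing
        return [f"<could not list {input_dir}: {exc}>"]
    # Cap the output so a huge tree does not flood the error message.
    return [p.relative_to(input_dir).as_posix() for p in entries[:limit]]
```

Printing this list next to the paths the script expected lets the user compare the tree the program can see with the one visible from their own shell.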

scripts/convert_rdf_to_hf_dataset.py Outdated

					
@ -719,0 +741,4 @@
                        ProgressFileReader(f, update_progress, file_size_bytes) as pf,
                    ):
                        file_content = pf.read()
            except (gzip.BadGzipFile, bz2.BadGzipFile) as e:

I don't see `BadGzipFile` in `bz2`. From source [here](https://github.com/python/cpython/blob/3.14/Lib/bz2.py):

```
__all__ = ["BZ2File", "BZ2Compressor", "BZ2Decompressor",
           "open", "compress", "decompress"]
```

Thank you. That does not exist. I'm less open to letting LLMs write my code now...
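As a rough sketch of what the standard library does raise: `gzip.BadGzipFile` exists (and subclasses `OSError`), while invalid bz2 input surfaces as a plain `OSError`. The helper `try_decompress` below is hypothetical and works on in-memory bytes rather than the PR's `ProgressFileReader`; the exception list is not claimed to be exhaustive.

```python
import bz2
import gzip


def try_decompress(data: bytes, kind: str) -> str:
    """Hypothetical helper: classify the outcome of decompressing in-memory bytes."""
    try:
        if kind == "gzip":
            gzip.decompress(data)
        else:
            bz2.decompress(data)
    except gzip.BadGzipFile as exc:
        # Must come first: BadGzipFile is a subclass of OSError.
        return f"bad gzip: {exc}"
    except (OSError, EOFError, ValueError) as exc:
        # bz2 reports an invalid stream as OSError; truncated input can
        # raise EOFError (gzip) or ValueError (bz2.decompress).
        return f"corrupt or truncated stream: {exc}"
    return "ok"
```

Catching `OSError` alone would cover both modules for invalid headers, at the cost of a less specific message for gzip.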

scripts/convert_rdf_to_hf_dataset_streaming_parallel.py Outdated

					
@ -618,3 +695,1 @@
    total_triples = 0
    chunk_count = 0
    parquet_files = []
    total_triples = 0 # DELETE THIS LINE!

This looks like WIP code; did you forget to change something before filing the PR? I see a block of ninety-odd lines of commented-out code, minus three lines that were left active, and one of the three has an all-caps comment to `DELETE THIS LINE!`.

Yeah, later on I see that you're still trying to operate on the Parquet files despite having commented out the place where they're generated. If this is what's supposed to be happening, could you please make your intent clearer?

I messed up. I had edited this file to speed up running `python scripts/upload_all_datasets.py --dataset wikidata-truthy`, but I didn't clean that up before I sent it out. I have cleaned that up now.

scripts/rdf_dataset_downloader.py Outdated

					
@ -221,0 +252,4 @@
                # Check if file size is reasonable compared to expected size
                if dataset_info.compressed_size_gb:
                    expected_bytes = dataset_info.compressed_size_gb * 1024**3
                    if file_size < expected_bytes * 0.1:  # Less than 10% of expected size

Please double-check this logic; quite possibly it's correct but I'm unsure. If `compressed_size` actually refers to the compressed size (i.e. the size of the `.gz`/`.bz2` file) then I would expect the `file_size` to be (very roughly) a factor of ten *larger*, not smaller, than the compressed size. This comparison looks reasonably correct if you mean the uncompressed size - if that's the case, I think it'd be clearer to say "uncompressed"/"decompressed" in the comment rather than "expected".

Wow. That is definitely in the wrong direction.

Since:

1. I expect any dataset to be very compressible
2. I don't want to give false warnings

I'm changing the factor from 0.1 to 1.5. I don't want false warnings.
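One possible reading of the revised check, sketched under assumptions that the thread does not confirm: `file_size` is the size of the decompressed file, the comparison stays `<`, and only the factor changes. The function name `smaller_than_expected` and the `min_ratio` parameter are hypothetical.

```python
def smaller_than_expected(
    file_size: int, compressed_size_gb: float, min_ratio: float = 1.5
) -> bool:
    """Hypothetical check: True when file_size looks suspiciously small.

    Assumes file_size is the decompressed size in bytes, so it should
    exceed the compressed size by at least min_ratio.
    """
    if not compressed_size_gb:
        return False  # no reference size recorded, nothing to compare against
    expected_bytes = compressed_size_gb * 1024**3
    return file_size < expected_bytes * min_ratio
```

With a 1 GiB compressed size, a 1 GiB decompressed file would trigger the warning and a 10 GiB one would not. A low threshold such as 1.5 keeps false warnings rare for data that compresses well.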

scripts/upload_all_datasets.py Outdated

					
@ -574,9 +574,14 @@ def process_dataset(
    Returns:
        True if successful, False otherwise

This can also return None; both the docstring and the type annotation claim otherwise.

Fixed.
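The thread does not say how this was fixed. Two common options are to make every path return a `bool`, or to widen the annotation and docstring. A sketch of the second option follows; the body, the `skip` parameter, and the meaning given to `None` are hypothetical stand-ins, not the real `process_dataset`.

```python
from typing import Optional


def process_dataset(name: str, skip: bool = False) -> Optional[bool]:
    """Hypothetical stand-in for the real function.

    Returns:
        True if successful, False on failure, None if nothing was attempted.
    """
    if skip:
        return None  # now documented and reflected in the annotation
    return bool(name)  # placeholder for the real success/failure result
```

Callers that test the result with `if not result:` treat `None` and `False` alike, so a widened annotation is a prompt to check them with `is None` where the difference matters.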

scripts/upload_all_datasets.py Outdated

					
@ -889,2 +925,3 @@
            # Filter out README files if there are other options
            non_readme_files = [f for f in rdf_files if "readme" not in f.name.lower()]
            non_readme_files = [f for f in rdf_files if "readme" not in f.name.lower()
                                and f.name.endswith(".bz2")]

This looks incorrect. Suppose you have `rdf_files == ["README", "foo.rdf"]`: this change will make `non_readme_files` empty, and then you'll pick `"README"` out of `rdf_files` a few lines later.

Fixed. I don't know why I did that.
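The failure case from the comment can be reproduced in a few lines. The two list comprehensions mirror the diff; the fallback `(filtered or rdf_files)[0]` is an assumption about the surrounding code, which the hunk does not show.

```python
from pathlib import Path

rdf_files = [Path("README"), Path("foo.rdf")]

# Filter as changed in the diff: additionally requires a .bz2 name.
changed = [
    f for f in rdf_files
    if "readme" not in f.name.lower() and f.name.endswith(".bz2")
]

# Filter as it was before the change: only excludes README files.
original = [f for f in rdf_files if "readme" not in f.name.lower()]

# Assumed fallback: use the filtered list when it is non-empty,
# otherwise fall back to the unfiltered list.
picked_changed = (changed or rdf_files)[0]
picked_original = (original or rdf_files)[0]
```

Here `changed` is empty, so the fallback picks `README`, while the original filter picks `foo.rdf`.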

scripts/upload_all_datasets.py Outdated

					
@ -1008,3 +1003,1 @@
                    rdf_file = rdf_files[0]
                    # Test again
                if rdf_file.endswith(".bz2"):

In what is now line 934, you checked this as `if str(rdf_file).endswith(".bz2"):`. Is the cast to string necessary here too, or do we know a priori that it's something that supports `.endswith()`?

Fixed.
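For context on why the cast matters: `pathlib.Path` objects do not have an `.endswith()` method, so the check only works unmodified if `rdf_file` is a `str`. A small sketch, using a hypothetical file name:

```python
from pathlib import Path

rdf_file = Path("data/example.nt.bz2")  # hypothetical file name

# Path objects do not expose str methods such as .endswith().
has_endswith = hasattr(rdf_file, "endswith")

# Casting to str works whether rdf_file is a str or a Path.
via_str = str(rdf_file).endswith(".bz2")

# Wrapping in Path and comparing the suffix also works for both types.
via_suffix = Path(rdf_file).suffix == ".bz2"
```

Either form avoids an `AttributeError` when the value turns out to be a `Path`.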

brent.edwards added 1 commit

2025-12-06 02:03:54 +00:00

Fixing the pyright and ruff check mistakes. dae85ef6b6

khird approved these changes

2025-12-06 02:56:47 +00:00
