fix: fix path too long issue; fix pickling issue in unified converter script #42

Merged
aditya merged 8 commits from path-too-long-fix into stream-download-convert-upload-merge-15-refactor-rdf-converters 2026-01-02 10:12:23 +00:00
Member

The branch fixes a crash caused by hitting a Linux operating system limit: the file path used for interprocess communication (the AF_UNIX socket path created by multiprocessing) became too long. This made the saving process crash, so the dataset was never written to disk and the upload failed as a result. The branch also fixes the pickling issue in the unified converter script.
brent.edwards left a comment
Member

I found a few ways to speed up the code or simplify it, but overall, good work! Approved!
@ -377,0 +389,4 @@
try:
with mp.Pool(processes=num_workers) as pool:
for triples in pool.imap_unordered(process_geonames_lines, batches, chunksize=1):
Member

According to https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap :

For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

(Also on line 414.)
Author
Member

I now set it equal to 10
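
For context, a minimal sketch of the call shape being discussed; `process_batch` is a hypothetical stand-in for `process_geonames_lines`:

```python
import multiprocessing as mp

def process_batch(lines):
    # Hypothetical stand-in for process_geonames_lines: parse one batch
    # of input lines into a list of records.
    return [line.upper() for line in lines]

if __name__ == "__main__":
    # Fake workload: many small batches, as the converter produces.
    batches = [[f"line-{i}-{j}" for j in range(100)] for i in range(1000)]

    with mp.Pool(processes=4) as pool:
        # chunksize=10 hands each worker ten batches per dispatch,
        # reducing the per-task IPC overhead of the default chunksize=1.
        for result in pool.imap_unordered(process_batch, batches, chunksize=10):
            pass  # consume results as they complete, in arbitrary order
```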
@ -377,0 +424,4 @@
Module-level function for use in picklable generators.
"""
is_gzipped = file_path.suffix == ".gz"
Member

This can be simplified somewhat. The only time `is_gzipped` is used is line 433, so you don't need to create a variable. Just put the condition in line 433.
Author
Member

Fixed!
@ -377,0 +425,4 @@
Module-level function for use in picklable generators.
"""
is_gzipped = file_path.suffix == ".gz"
is_bz2 = file_path.suffix == ".bz2" or str(file_path).endswith(".ttl.bz2")
Member

The second condition is not needed. For example:

>>> from pathlib import Path
>>> p = Path("/mydir/foo.ttl.bz2")
>>> p.suffix
'.bz2'
Member

Same comment about not needing to create the variable `is_bz2`.
Author
Member

Fixed!
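
For context, a minimal sketch of how both review suggestions combine: rely on `Path.suffix` alone and use the conditions directly instead of storing `is_gzipped` / `is_bz2`. `open_rdf_file` is a hypothetical name, assuming the converter only needs a text-mode file handle; the real code differs.

```python
import bz2
import gzip
from pathlib import Path

def open_rdf_file(file_path: Path):
    # Path.suffix already yields ".bz2" for "foo.ttl.bz2", so one suffix
    # check per format is enough, and the checks can be used inline.
    if file_path.suffix == ".gz":
        return gzip.open(file_path, "rt", encoding="utf-8")
    if file_path.suffix == ".bz2":
        return bz2.open(file_path, "rt", encoding="utf-8")
    return open(file_path, "rt", encoding="utf-8")

print(Path("/mydir/foo.ttl.bz2").suffix)  # '.bz2'
```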
@ -377,0 +465,4 @@
triples = []
for s, p, o in graph:
if isinstance(o, Literal):
Member

This code already exists in the `extract_triple` method in lines 214-235.
Author
Member

Fixed!
@ -377,0 +510,4 @@
triples = []
for s, p, o in graph:
if isinstance(o, Literal):
Member

This code already exists in the `extract_triple` code in lines 214-235.
Author
Member

Fixed!
@ -377,0 +553,4 @@
current_chunk = []
for s, p, o in graph:
if isinstance(o, Literal):
Member

This code already exists in the `extract_triple` code of lines 214-235.
Author
Member

Fixed!
@ -1022,3 +1148,3 @@
rate = triple_count / elapsed if elapsed > 0 else 0
progress_pct = 10 + min(60, int((chunk_count / estimated_chunks) * 60))
self.progress.print(
print(
Member

Why isn't this using `self.progress`?
Author
Member

`self.progress` contains a Rich Console object, which is not picklable. Capturing `self.progress` in the generator closure would make it non-picklable.
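
For context, a minimal sketch of the constraint being described. The `Progress` class is a stand-in: Rich's Console holds a thread lock internally, modeled here with `threading.RLock`, which is exactly the kind of state the pickle module refuses to serialize.

```python
import pickle
import threading

class Progress:
    def __init__(self):
        self._lock = threading.RLock()  # not picklable, like a Rich Console

    def emit_progress(self, pct):
        print(f"PROGRESS: {pct}", flush=True)

progress = Progress()

try:
    pickle.dumps(progress)
except TypeError as exc:
    print(f"pickling failed: {exc}")  # cannot pickle '_thread.RLock' object

# Anything captured by code shipped to a multiprocessing worker gets
# pickled, so the worker-side generator falls back to a bare print:
print("PROGRESS: 42", flush=True)
```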
@ -1027,2 +1153,3 @@
flush=True,
)
self.progress.emit_progress(progress_pct)
print(f"PROGRESS: {progress_pct}", flush=True)
Member

(Same question.)
Author
Member

`self.progress` contains a Rich Console object, which is not picklable. Capturing `self.progress` in the generator closure would make it non-picklable.
@ -1205,1 +1206,4 @@
# Fix for "AF_UNIX path too long" error in multiprocessing
# This forces the temporary directory to be /tmp (short path) instead of a potentially deep workspace path
os.environ["TMPDIR"] = "/tmp"
Member

Superb choice. Thanks.
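
For context, a minimal sketch of why the override helps and why it has to run early: `tempfile.gettempdir()` caches its answer on first use, and multiprocessing places its AF_UNIX sockets under that directory, whose paths Linux caps at roughly 108 bytes.

```python
import os

# Must run before tempfile/multiprocessing first resolve the temp
# directory, because tempfile.gettempdir() caches its result.
os.environ["TMPDIR"] = "/tmp"

import multiprocessing as mp
import tempfile

if __name__ == "__main__":
    print(tempfile.gettempdir())  # -> /tmp

    # Manager listens on an AF_UNIX socket created under the temp
    # directory; a deep workspace path can push it past the limit.
    with mp.Manager() as manager:
        shared = manager.list([1, 2, 3])
        print(list(shared))
```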
aditya merged commit 76de466d99 into stream-download-convert-upload-merge-15-refactor-rdf-converters 2026-01-02 10:12:23 +00:00