fix: fix path too long issue; fix pickling issue in unified converter script #42
2 participants
Reference
cleverdatasets/dataset-uploader!42
No description provided.
This branch fixes a crash that occurred when the system hit a Linux operating system limit: the file path of the socket created for interprocess communication became too long. The crash interrupted the saving process, so the dataset was never written to disk and the upload failed as a result. The branch also fixes the pickling issue in the unified converter script.
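To make the failure mode concrete, here is a minimal sketch of a check for the limit described above. It assumes Linux, where the `sun_path` field of a Unix domain socket address holds 108 bytes including the terminating NUL. The function name, the sample socket file name, and the sample workspace path are hypothetical and not taken from the branch.

```python
import os

# Size of sockaddr_un.sun_path on Linux, including the trailing NUL byte.
# Other platforms use different sizes (assumption: this runs on Linux).
LINUX_SUN_PATH_SIZE = 108


def socket_path_fits(directory, socket_name="pymp-abcdefgh/listener-abcdefgh"):
    """Return True if a socket created under `directory` stays within the limit.

    `socket_name` is an illustrative placeholder for the name the IPC layer
    appends; the real name is chosen at runtime.
    """
    full_path = os.path.join(directory, socket_name)
    # The kernel counts bytes, not characters, and needs room for the NUL.
    return len(os.fsencode(full_path)) < LINUX_SUN_PATH_SIZE


# Hypothetical deep workspace path, built only to exceed the limit.
deep_workspace = "/home/user/" + "/".join(["nested_directory"] * 8)

print(socket_path_fits("/tmp"))
print(socket_path_fits(deep_workspace))
```

A check like this could run before starting worker processes, so an overly deep workspace is reported early instead of crashing the save step.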
I found a few ways to speed up the code or simplify it, but overall, good work! Approved!
@ -377,0 +389,4 @@
try:
    with mp.Pool(processes=num_workers) as pool:
        for triples in pool.imap_unordered(process_geonames_lines, batches, chunksize=1):

According to https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap :
(Also on line 414.)
I now set it equal to 10
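The chunksize change discussed in this thread can be sketched as follows. The worker and helper names are hypothetical stand-ins for `process_geonames_lines` and its caller; only `mp.Pool` and `imap_unordered(func, iterable, chunksize)` are the real standard-library API. A larger `chunksize` hands tasks to workers in groups, which reduces per-task communication overhead when each task is small.

```python
import multiprocessing as mp


def count_fields(batch):
    """Hypothetical module-level worker: total tab-separated fields in a batch."""
    return sum(len(line.split("\t")) for line in batch)


def count_all_fields(batches, num_workers=2, chunksize=10):
    """Sum the worker results; arrival order does not matter for a sum."""
    total = 0
    with mp.Pool(processes=num_workers) as pool:
        # chunksize=10 submits batches to workers ten at a time
        # instead of one by one.
        for partial in pool.imap_unordered(count_fields, batches, chunksize=chunksize):
            total += partial
    return total
```

Call `count_all_fields(batches)` from under an `if __name__ == "__main__":` guard so that it also works with the spawn and forkserver start methods.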
@ -377,0 +424,4 @@
Module-level function for use in picklable generators.
"""
is_gzipped = file_path.suffix == ".gz"

This can be simplified somewhat. The only time is_gzipped is used is line 433, so you don't need to create a variable. Just put the condition in line 433.

Fixed!
@ -377,0 +425,4 @@
Module-level function for use in picklable generators.
"""
is_gzipped = file_path.suffix == ".gz"
is_bz2 = file_path.suffix == ".bz2" or str(file_path).endswith(".ttl.bz2")

The second condition is not needed. For example:

Same comment about not needing to create the variable is_bz2.

Fixed!
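A short sketch of why the extra `.ttl.bz2` check is redundant: `Path.suffix` returns only the final extension, so a `.ttl.bz2` file already has the suffix `.bz2`. The helper name `pick_opener` and the file name are hypothetical; the sketch also shows the conditions used inline, without intermediate variables.

```python
import bz2
import gzip
from pathlib import Path


def pick_opener(file_path):
    """Return the open function that matches the file's final suffix."""
    suffix = Path(file_path).suffix
    if suffix == ".gz":
        return gzip.open
    if suffix == ".bz2":
        return bz2.open
    return open


print(Path("dump.ttl.bz2").suffix)    # .bz2
print(Path("dump.ttl.bz2").suffixes)  # ['.ttl', '.bz2']
```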
@ -377,0 +465,4 @@
triples = []
for s, p, o in graph:
    if isinstance(o, Literal):

This code already exists in the extract_triple method in lines 214-235.

Fixed!
@ -377,0 +510,4 @@
triples = []
for s, p, o in graph:
    if isinstance(o, Literal):

This code already exists in the extract_triple code in lines 214-235.

Fixed!
@ -377,0 +553,4 @@
current_chunk = []
for s, p, o in graph:
    if isinstance(o, Literal):

This code already exists in the extract_triple code of lines 214-235.

Fixed!
@ -1022,3 +1148,3 @@
rate = triple_count / elapsed if elapsed > 0 else 0
progress_pct = 10 + min(60, int((chunk_count / estimated_chunks) * 60))
self.progress.print(
print(

Why isn't this using self.progress?

self.progress contains a Rich Console object, which is not picklable. Capturing self.progress in the generator closure would make it non-picklable.
@ -1027,2 +1153,3 @@
flush=True,
)
self.progress.emit_progress(progress_pct)
print(f"PROGRESS: {progress_pct}", flush=True)

(Same question.)

self.progress contains a Rich Console object, which is not picklable. Capturing self.progress in the generator closure would make it non-picklable.
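The pickling constraint described in these replies can be illustrated with the standard library alone. In this sketch a `threading.Lock` stands in for the Console purely as an example of an unpicklable member; it is an assumption for illustration and says nothing about Rich internals. The names `is_picklable` and `report_progress` are hypothetical.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


def report_progress(progress_pct):
    """Module-level reporter: it captures no state, so nothing needs pickling."""
    print(f"PROGRESS: {progress_pct}", flush=True)


# State that drags an unpicklable object along (the lock is a stand-in).
state_with_console = {"chunk_count": 3, "progress": threading.Lock()}
# The same state once the unpicklable member is left out.
state_without_console = {"chunk_count": 3}

print(is_picklable(state_with_console))     # False
print(is_picklable(state_without_console))  # True
```

Printing directly, as the diff does, keeps the unpicklable object out of whatever gets sent to worker processes.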
@ -1205,1 +1206,4 @@
# Fix for "AF_UNIX path too long" error in multiprocessing
# This forces the temporary directory to be /tmp (short path) instead of a potentially deep workspace path
os.environ["TMPDIR"] = "/tmp"

Superb choice. Thanks.
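One detail worth noting about this approach: `tempfile` caches its directory after the first lookup, so the `TMPDIR` override only takes effect if it is set before anything in the process calls `tempfile.gettempdir()`, or if the cache is cleared. A minimal sketch, assuming the temporary directory is resolved through `tempfile`; the helper name `use_short_tempdir` is hypothetical.

```python
import os
import tempfile


def use_short_tempdir(short_dir="/tmp"):
    """Point TMPDIR at short_dir and return the directory tempfile now reports."""
    os.environ["TMPDIR"] = short_dir
    # tempfile caches its choice after the first lookup; clearing the cache
    # makes the next gettempdir() call re-read TMPDIR.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

Setting the variable at the very start of the script, as the diff does, avoids the need to reset the cache in most cases.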
pyright approval. e7a540aeb3
pyproject to require Python 3.10. da9f785326