27-allow-download-ftp #28
Reference
cleverdatasets/dataset-uploader!28
No description provided.
Don't let the number of comments get you down; they're all small and you don't need to do anything about "I wouldn't have created this function."
@ -109,6 +111,18 @@ class RDFDatasetDownloader:
    raise
    self.console = console or Console()
    def _get_protocol(self, url: str) -> str:

(This is minuscule; feel free to ignore it.)

Functions

Since:
- _get_protocol is just used once in line 196
- urlparse(url).scheme.lower() is almost as readable as _get_protocol.
- _get_protocol() is going to be used again

I wouldn't have created this function.
(IMPORTANT NOTE: You don't need to do anything.)
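For comparison, here is a minimal sketch of the two forms the comment weighs against each other. The helper name `scheme_of` and the example URL are hypothetical stand-ins; the body of the PR's actual `_get_protocol` is not shown in this review, so this only assumes it wraps the `urlparse(url).scheme.lower()` expression quoted above.

```python
from urllib.parse import urlparse


def scheme_of(url: str) -> str:
    """Hypothetical stand-in for a one-line protocol helper."""
    return urlparse(url).scheme.lower()


# Example URL is made up for illustration.
url = "ftp://ftp.example.org/pub/dataset.nt.gz"

# Helper form: one extra name to look up when reading the call site.
protocol_via_helper = scheme_of(url)

# Inlined form: the same expression written at the single call site.
protocol_inline = urlparse(url).scheme.lower()

# A bare path has no scheme, so both forms yield an empty string.
no_scheme = urlparse("/tmp/dataset.nt.gz").scheme.lower()
```

Both forms give the same result; the trade-off is only whether a single call site justifies a named method.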
@ -211,1 +213,3 @@
    else:
    elif protocol == "ftp":
    self.console.print(f"[cyan]Using FTP protocol for: {dataset_info.url}[/cyan]")

ruff check reported:

@ -213,0 +220,4 @@
    else:
    self.console.print(f"[red]Unsupported protocol: {protocol}://[/red]")
    self.console.print(f"[yellow]Supported: http://, https://, ftp://, file://[/yellow]")

ruff check reported:

@ -218,1 +231,3 @@
    if "Name or service not known" in error_msg or "Errno -2" in error_msg:
    # Enhanced error context including FTP errors
    if "unsupported protocol" in error_msg.lower():
    self.console.print(f"[yellow]→ Protocol '{protocol}://' is not supported[/yellow]")

ruff check reported:

@ -321,0 +347,4 @@
    if server_size > 0:
    if partial_size >= server_size:
    # File is already complete!
    self.console.print(f"[green]✓ File already complete ({partial_size / (1024**2):.2f} MB)[/green]")

ruff check reports:

@ -321,0 +351,4 @@
    return # Exit successfully
    else:
    # File incomplete, server doesn't support resume
    self.console.print(f"[yellow]Server doesn't support resume. Have {partial_size / (1024**2):.2f} MB, need {server_size / (1024**2):.2f} MB[/yellow]")

ruff check reports:

@ -321,0 +352,4 @@
    else:
    # File incomplete, server doesn't support resume
    self.console.print(f"[yellow]Server doesn't support resume. Have {partial_size / (1024**2):.2f} MB, need {server_size / (1024**2):.2f} MB[/yellow]")
    self.console.print("[yellow]Deleting partial file and restarting...[/yellow]")

ruff check reports:

@ -321,0 +354,4 @@
    self.console.print(f"[yellow]Server doesn't support resume. Have {partial_size / (1024**2):.2f} MB, need {server_size / (1024**2):.2f} MB[/yellow]")
    self.console.print("[yellow]Deleting partial file and restarting...[/yellow]")
    destination.unlink()
    # Force retry by raising exception that IS caught by retry loop

ruff check reports:

@ -321,0 +355,4 @@
    self.console.print("[yellow]Deleting partial file and restarting...[/yellow]")
    destination.unlink()
    # Force retry by raising exception that IS caught by retry loop
    raise httpx.NetworkError("Restarting download without resume")

ruff check reports:

@ -321,0 +359,4 @@
    # Case 2: No Content-Length header, assume file is complete
    else:
    self.console.print(f"[green]✓ File exists ({partial_size / (1024**2):.2f} MB), assuming complete[/green]")

ruff check reports:

@ -321,0 +360,4 @@
    # Case 2: No Content-Length header, assume file is complete
    else:
    self.console.print(f"[green]✓ File exists ({partial_size / (1024**2):.2f} MB), assuming complete[/green]")
    self.console.print("[dim]Use --force to re-download if needed.[/dim]")

ruff check reports:

@ -380,6 +424,131 @@ class RDFDatasetDownloader:
    self.console.print(f"[red]Unexpected error during download: {e}[/red]")
    raise
    def _download_file_ftp(self, url: str, destination: Path, max_retries: int = 3) -> None:

ruff check reports:

@ -383,0 +441,4 @@
    while retry_count <= max_retries:
    try:
    if destination.exists():
    self.console.print("[yellow]FTP doesn't support resume - starting fresh download[/yellow]")

ruff check reports:

@ -383,0 +464,4 @@
    total_size = int(response.headers['Content-Length'])
    progress.update(task, total=total_size)
    else:
    self.console.print("[yellow]FTP server didn't provide file size - progress will be indeterminate[/yellow]")

ruff check reports:

@ -383,0 +482,4 @@
    chunk_count += 1
    if chunk_count % 100 == 0:
    f.flush()

Are lines 484-485 necessary? Usually, the operating system is good at knowing the best times to flush the cache.
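To illustrate the question about the periodic flush, here is a minimal sketch of a chunked write loop with no explicit `f.flush()`. The function name `write_chunks` and the generated chunks are hypothetical, and the sketch assumes the PR writes inside a `with open(...)` block, which the hunk above does not show.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], destination: Path) -> int:
    """Write chunks to destination and return the number of bytes written."""
    downloaded = 0
    # The buffered writer hands data to the OS whenever its buffer fills,
    # and leaving the with-block flushes and closes the file.
    with open(destination, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            downloaded += len(chunk)
    return downloaded


# Usage with made-up data: 250 chunks of 1 KiB each.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "download.bin"
    written = write_chunks((b"x" * 1024 for _ in range(250)), target)
    size_on_disk = target.stat().st_size
```

Note that `f.flush()` only moves data from Python's buffer to the operating system; it does not force the data onto the disk (that would take `os.fsync`). A periodic flush therefore mainly helps if something else reads the partial file while the download is still running.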
@ -383,0 +486,4 @@
    f.flush()
    self.console.print(f"[green]✓ Downloaded {downloaded / (1024**2):.2f} MB via FTP[/green]")

ruff check reports:

@ -383,0 +490,4 @@
    return
    except urllib.error.URLError as e:
    retry_count += 1

It is probably easier to have the retry_count += 1 and the time.sleep(...) outside of the except section. Though this is correct, if we need another except clause, the new clause would need its own retry_count += 1.

@ -383,0 +496,4 @@
    reason = str(e.reason)
    if '530' in reason or 'Login incorrect' in reason:
    self.console.print(f"[red]FTP authentication failed: {url}[/red]")

ruff check reports:

@ -383,0 +497,4 @@
    if '530' in reason or 'Login incorrect' in reason:
    self.console.print(f"[red]FTP authentication failed: {url}[/red]")
    self.console.print("[yellow]The server rejected anonymous login[/yellow]")

ruff check reports:

@ -383,0 +501,4 @@
    raise
    elif '550' in reason or 'No such file' in reason:
    self.console.print(f"[red]File not found on FTP server: {url}[/red]")

ruff check reports:

@ -383,0 +502,4 @@
    elif '550' in reason or 'No such file' in reason:
    self.console.print(f"[red]File not found on FTP server: {url}[/red]")
    self.console.print("[yellow]The file may have been moved or removed[/yellow]")

ruff check reports:

@ -383,0 +508,4 @@
    elif 'timed out' in reason.lower():
    if retry_count <= max_retries:
    self.console.print(f"[yellow]FTP timeout. Retrying {retry_count}/{max_retries}...[/yellow]"

ruff check reports:

@ -383,0 +513,4 @@
    time.sleep(min(2**retry_count, 30))
    continue
    else:
    self.console.print(f"[red]FTP download failed after {max_retries} retries[/red]")

ruff check reports:

@ -383,0 +519,4 @@
    else:
    if retry_count <= max_retries:
    self.console.print(f"[yellow]FTP error: {reason}. Retrying {retry_count}/{max_retries}...[/yellow]"

ruff check reports:

@ -383,0 +532,4 @@
    retry_count += 1
    if retry_count <= max_retries:
    self.console.print(f"[yellow]FTP connection timeout. Retrying {retry_count}/{max_retries}...[/yellow]"

ruff check reports:

@ -383,0 +536,4 @@
    )
    time.sleep(min(2**retry_count, 30))
    else:
    self.console.print(f"[red]FTP download failed after {max_retries} retries[/red]")

ruff check reports:
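The earlier suggestion about moving `retry_count += 1` and `time.sleep(...)` out of the `except` section could look roughly like the sketch below. This is not the PR's code: `download_with_retries`, `fetch`, and the injectable `sleep` parameter are hypothetical names, and the sketch leaves out the PR's immediate re-raise for non-retryable FTP errors (530, 550). Only the backoff expression `min(2**retry_count, 30)` is taken from the diff.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def download_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_retries retries are used up."""
    retry_count = 0
    while True:
        try:
            return fetch()
        except urllib.error.URLError as e:
            last_error: Exception = e
        except TimeoutError as e:
            # A second clause needs no counter or sleep of its own.
            last_error = e

        # Shared bookkeeping, written once for every except clause.
        retry_count += 1
        if retry_count > max_retries:
            raise last_error
        sleep(min(2**retry_count, 30))  # same capped backoff as the diff


# Usage with a made-up fetch that fails twice, then succeeds.
attempts = {"count": 0}
delays: list = []


def flaky_fetch() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise urllib.error.URLError("timed out")
    return "ok"


result = download_with_retries(flaky_fetch, max_retries=3, sleep=delays.append)
```

With this shape, each `except` clause only records the error, so adding another clause does not require repeating the increment or the backoff.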