auto-excel-update #46
Reference: cleverdatasets/dataset-uploader!46
No description provided.
This branch contains code integrating the automatic Excel update via Google Cloud Platform when using `upload_all_datasets.py`, plus a devcontainer setup for VS Code.
Thank you for being careful to follow the `pyright` and `ruff check` conditions for `upload_all_datasets`.

I still don't understand how a user would run this code. In particular, `creds.json` and `.env.example` would be inside the Docker container. Is there a way to provide the information they use from outside the Docker container, so that every user can just do a `docker run` with parameters saying where to find that information?

```
@@ -4,3 +4,4 @@
    "context": "..",
    "customizations": {
        "vscode": {
```

Sadly, I wasn't able to get dev containers working with VSCode locally (due to NixOS), so I couldn't test this.
This is how we use the environment credentials inside the dev container.
```
@@ -0,0 +23,4 @@
try:
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
```

This behavior is extremely unusual. Why are these imports separated from everything else? Why wouldn't someone have these libraries installed?

I recommend removing the `GOOGLE_SHEETS_AVAILABLE` flag and lines 49-53. If someone doesn't have `google-api-python-client` or `google-auth` installed, the software should fail with the usual `ImportError`.

fixed !!
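For reference, the unconditional import style being asked for is just this (a minimal sketch; the module paths are the standard ones from `google-auth` and `google-api-python-client`):

```python
# Plain top-level imports: if google-api-python-client or google-auth
# is missing, this fails immediately with the usual ImportError,
# which is the desired behavior once both are listed in pyproject.toml.
from google.oauth2 import service_account
from googleapiclient.discovery import build
```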
```
@@ -0,0 +49,4 @@
if not GOOGLE_SHEETS_AVAILABLE:
    raise ImportError(
        "Google Sheets API not available. "
        "Install with: pip install google-api-python-client google-auth"
```

The packages need to be added to `pyproject.toml`.

fixed !!
```
@@ -0,0 +105,4 @@
        """Load column headers from the sheet."""
        try:
            result = (
                self.sheet.values()
```

`pyright` reports:

fixed !!
```
@@ -0,0 +114,4 @@
            )
            values = result.get("values", [])
            if values:
                self.headers = [str(h).strip() for h in values[0]]
```

You will save a LOT of trouble if you change this line to

fixed !!
```
@@ -0,0 +118,4 @@
        self.dataset_name_col = self._find_column_index([
            "DATASET NAME",
            "dataset name",
```

If you change line 117, then you don't need duplicate column names.

fixed !!
```
@@ -0,0 +124,4 @@
            "id",
            "name",
        ])
```

This code should post warnings if the column names aren't found...

fixed !!
```
@@ -0,0 +148,4 @@
        """
        for name in possible_names:
            try:
                return self.headers.index(name)
```

If you change line 117, then you should also change lines 149-159 to

See how simple that is? AND it works when comparing `Dataset_Id` and `Dataset_ID`.

fixed !!
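The suggested replacement itself isn't shown above, but reading this cluster of comments together, the shape being asked for is probably something like the following (a sketch only; the method names come from the quoted diff, the bodies are assumed):

```python
import logging

logger = logging.getLogger(__name__)


class GoogleSheetsTracker:
    """Sketch of the header-handling suggestions; not the PR's actual code."""

    def _load_headers(self, values: list[list[str]]) -> None:
        # Normalize once at load time: strip + casefold means later
        # lookups are case-insensitive ("Dataset_Id" matches "Dataset_ID").
        self.headers = [str(h).strip().casefold() for h in values[0]]

    def _find_column_index(self, name: str) -> int | None:
        # One canonical spelling per column; no duplicate name lists.
        try:
            return self.headers.index(name.casefold())
        except ValueError:
            logger.warning("Column %r not found in sheet headers", name)
            return None
```

With this in place, duplicate entries like `"DATASET NAME"` / `"dataset name"` collapse to a single name, and a missing column is reported instead of silently ignored.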
```
@@ -0,0 +173,4 @@
        try:
            result = (
                self.sheet.values()
```

`pyright` reports:

fixed !!
```
@@ -0,0 +189,4 @@
            if (
                len(row) > self.dataset_name_col
                and str(row[self.dataset_name_col]).strip().lower()
                == dataset_id.lower()
```

Because data is multilingual, it's much better to use `.casefold()` instead of `.lower()`. It doesn't matter in Devanagari-based languages or English, but in other languages like German, `upper` and `lower` aren't inverses. `.casefold()` puts everything into one well-structured format.

fixed !!
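A concrete example of the difference, using standard Python string behavior:

```python
# German sharp s: lower() leaves "ß" alone, but casefold() expands it,
# so only casefold() makes "STRASSE" and "straße" compare equal.
assert "STRASSE".lower() == "strasse"
assert "straße".lower() == "straße"        # lower() changes nothing here
assert "straße".casefold() == "strasse"    # casefold() maps ß -> ss
assert "STRASSE".casefold() == "straße".casefold()
```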
```
@@ -0,0 +193,4 @@
            ):
                status = (
                    row[self.status_col].strip()
                    if self.status_col and len(row) > self.status_col
```

`pyright` reports:

fixed !!
```
@@ -0,0 +212,4 @@
        dataset_id: str,
        status: str,
        error_message: str | None = None,
        notes: str | None = None,
```

Why is `notes` a parameter? Is there supposed to be a `notes` column?

fixed !!
```
@@ -0,0 +262,4 @@
        if updates:
            body = {"valueInputOption": "RAW", "data": updates}
            (
                self.sheet.values()
```

`pyright` reports:

fixed !!
```
@@ -0,0 +315,4 @@
        if status_upper == "DONE":
            if force:
                return (True, "Force processing enabled")
```

IMHO -- `if force:` should be its own boolean condition at or near the start of the `check_should_process` function. It certainly should not be repeated in lines 317-318 and 322-323.

fixed !!
```
@@ -0,0 +326,4 @@
        if status_upper == "NOT STARTED":
            return (True, "Status is NOT STARTED, processing")

        return (True, "Status allows processing")
```

I'm very nervous about this line. If the status were "NO! DON'T PROCESS", then this line means that the processing will continue anyway. I would rather return `(False, f"Unknown status {status_upper}: Not moving forward until the status is 'NOT STARTED'")`.

fixed !!
```
@@ -66,0 +70,4 @@
except ImportError:
    pass
except Exception:
    pass
```

Both `except` statements are the same. You can get rid of one of them. Keep `ImportError` and delete `Exception` if the only reasonable exceptions are import errors. If other exceptions are reasonable (for example, if `load_dotenv` throws `FileNotFoundError`), then include those exceptions explicitly.

fixed !!
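One explicit shape this could take, assuming the block wraps a `python-dotenv` import and call (the diff doesn't show the block's contents):

```python
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # python-dotenv not installed; .env support is optional
except FileNotFoundError:
    pass  # no .env file; fall back to the real environment variables
```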
```
@@ -71,2 +80,4 @@
console = Console()

try:
    from google_sheets_tracker import GoogleSheetsTracker, create_tracker
```

Same comments as in `google_sheets_tracker`. Any requirements should be in the `pyproject.toml` file, and we shouldn't have conditional imports.

fixed !!
```
@@ -73,0 +88,4 @@
    GoogleSheetsTracker = None

try:
    from huggingface_hub import HfApi
```

Same comments.

fixed !!
```
@@ -1094,0 +1181,4 @@
            "Organization not found in config file. "
            "Please set 'organization' field in dataset_config.json"
        )
    repo_id = f"{organization}/{dataset_id}"
```

Lines 1176-1184 are very conceptually similar to lines 262-291, but these assume that the repo_id must be `f"{organization}/{dataset_id}"` instead of checking and allowing `f"{organization}/{dataset_id}-v1"`. Are you SURE that the repo_id here can only be `f"{organization}/{dataset_id}"`? Would it be better to use `check_dataset_exists_on_hf`?

fixed !!
```
@@ -1236,0 +1346,4 @@
    try:
        sheet_tracker = create_tracker(
            spreadsheet_id=None,
            worksheet_name="Sheet1",
```

The worksheet name should come from a file or an environment variable. It might not always be named "Sheet1".

fixed !!
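For example (a sketch; the `GOOGLE_SHEETS_WORKSHEET` variable name is invented for illustration, and `create_tracker` is the function from the PR):

```python
import os

# Read the worksheet name from the environment, keeping "Sheet1"
# only as a fallback default.
worksheet_name = os.environ.get("GOOGLE_SHEETS_WORKSHEET", "Sheet1")
sheet_tracker = create_tracker(
    spreadsheet_id=None,
    worksheet_name=worksheet_name,
)
```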
```
@@ -1249,0 +1397,4 @@
        )
        if current_status_value != "DONE":
            sheet_tracker.update_status(dataset_id, "DONE")
```

Is it possible that a partial sheet was uploaded, but an error occurred and had been correctly listed as an error in the sheet? If so, this is eliminating the error. I understand the idea of read repair -- but this update makes me nervous that it will eliminate important information.

If you must keep this line, why are you using `update_status` instead of `mark_completed`?

fixed !!
```
@@ -1249,0 +1370,4 @@
        )
        if current_status_value not in ["DONE", "ERROR"]:
            sheet_tracker.mark_completed(dataset_id)
```

This code still makes me VERY nervous. Imagine that someone changed the status line to something that's not in the registered list -- for example, `SKIP`, or they mistype and write `NOTSTARTED`. This line overwrites what the user sent, changes it to `DONE`, and does NOT download anything. This will badly surprise both the `SKIP` people and the `NOTSTARTED` people.

As a rule, if the code does not understand an instruction, it should do nothing but post an error message. It should never overwrite things that it does not understand.

Please remove lines 1372-1377 (and change line 1378 to a simple 'if').

Fixed !!
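The "do nothing on input it doesn't understand" rule might look like this (a hypothetical sketch; `KNOWN_ACTIVE_STATUSES` and the `logger` are invented names for illustration):

```python
# Only statuses the code explicitly understands may be rewritten.
# Anything else (e.g. SKIP, NOTSTARTED) is left untouched and reported.
if current_status_value in KNOWN_ACTIVE_STATUSES:
    sheet_tracker.mark_completed(dataset_id)
elif current_status_value not in ("DONE", "ERROR"):
    logger.error(
        "Unrecognized status %r for %s; leaving the sheet unchanged",
        current_status_value,
        dataset_id,
    )
```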
You're still missing attaching the `google_sheets_tracker` to `process_datasets_parallel.py`.

Otherwise, excellent work!
````
@@ -44,0 +37,4 @@
### 3. Open in DevContainer

```bash
code .
````

The standard inside CleverThis is PyCharm, not VS Code. Could you also include instructions on how to do this with PyCharm?

Fixed !!
```
@@ -44,0 +42,4 @@
When prompted: **"Reopen in Container?"** → Click **"Reopen in Container"**

Wait 2-5 minutes for first-time setup.
```

In `devcontainer.md`, you wrote that this takes 8-10 minutes.

Fixed !!
```
@@ -30,3 +26,1 @@
    "pyarrow>=14.0.0",   # Parquet/Arrow data
    "httpx>=0.25.0",     # HTTP client (async)
    "requests>=2.31.0",  # HTTP client (sync)
    "datasets>=2.0.0",
```

Why did you remove the comments? (I'm not saying that it's wrong; I'm saying that they were useful.)

got it !! added them back.
Please consider my comment, as it may be valid and worth the effort of addressing it.
```
@@ -1332,1 +1606,4 @@
        )
    except Exception:
        pass
    return 130
```

This looks like a very long method to me... it should be logically split, and I think the code should be better documented. I see long chains of conditionals with not a single business-logic explanation for them. Since this is a diff, it is not totally clear to me where the method starts and ends, but it looks quite long...
*Force-pushed from `ef924a1aa1` to `53fcbbbb04`*

```
@@ -1332,0 +1514,4 @@
        except Exception:
            pass
        return 1
    except Exception as e:
```

Why catch `ValueError` and `Exception` separately here? You do almost exactly the same thing with each, but don't print the error message for the generic `Exception`.
```
@@ -1332,0 +1545,4 @@
    try:
        upload(args, dataset_id)
        break
    except ValueError as e:
```

Uploading the dataset can fail in many ways that aren't `ValueError`s. I see that your `upload()` function wraps them all as `ValueError`s, but I think this is incorrect. For example, a network failure has nothing to do with a value error. To me it makes more sense to raise the appropriate type of exception and respond accordingly. For example, you should probably retry on a network error (which could be transient and might resolve itself by the next attempt), but not on a permissions error (which is very unlikely to do so).
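A sketch of the retry shape described here; the exception classes are illustrative stand-ins, since the real ones depend on which client library `upload()` uses:

```python
import time

MAX_ATTEMPTS = 3  # illustrative constant

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        upload(args, dataset_id)
        break
    except (ConnectionError, TimeoutError) as e:
        # Transient network trouble: retry with a short backoff.
        if attempt == MAX_ATTEMPTS:
            raise
        print(f"Network error ({e}); retry {attempt}/{MAX_ATTEMPTS}")
        time.sleep(2 ** attempt)
    except PermissionError:
        # Permissions problems will not fix themselves; fail fast.
        raise
```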
These are good, but you're still missing attaching the `google_sheets_tracker` to `process_datasets_parallel.py`.

Otherwise, excellent work!
# run the sample CLI
```
@@ -32,2 +10,2 @@
# run the sample CLI
python -m boilerplate
git clone <repository-url>
cd datasets
```

The directory is named `dataset-uploader`.

Fixed !!
````
@@ -44,0 +26,4 @@
```

Required values:
- `GOOGLE_SHEETS_ID`: Your Google Sheets spreadsheet ID from the URL
````

Is there anything else about the spreadsheet? For example, should there be column names? What should those column names be? Should some of those columns (like the list of datasets) be filled in before running?

Fixed !!
```
@@ -44,0 +29,4 @@
- `GOOGLE_SHEETS_ID`: Your Google Sheets spreadsheet ID from the URL
- `HF_TOKEN`: Your HuggingFace token from https://huggingface.co/settings/tokens

Edit `creds.json` with your Google Cloud service account credentials:
```

If there are any hints from Google on how to set up Google Cloud service accounts, they would be very useful here. There will always be people who have never visited https://console.cloud.google.com.

Fixed !!
````
@@ -44,0 +42,4 @@
code .
```

When prompted: **"Reopen in Container?"** → Click **"Reopen in Container"**
````

I still cannot test this. Could you try using your Nix machine and check how to install containers here?