auto-excel-update #46

Open
aditya wants to merge 8 commits from auto-excel-update into stream-download-convert-upload-merge-15-refactor-rdf-converters
Member

This branch contains code for integrating the automatic Excel update using Google Cloud Platform when `upload_all_datasets.py` is run, plus a devcontainer setup for VS Code.

brent.edwards left a comment
Member

Thank you for being careful to follow the `pyright` and `ruff check` conditions for `upload_all_datasets`.

I still don't understand how a user would run this code. In particular, `creds.json` and `.env.example` would be inside the Docker container. Is there a way to provide the information they use from outside the Docker container, so that every user can just do a `docker run` with parameters of where to find that information?

@@ -4,3 +4,4 @@
"context": "..",
"customizations": {
"vscode": {
Member

Sadly, I wasn't able to get dev containers working with VS Code locally (due to NixOS), so I couldn't test this.

Author
Member

This is how to use the environment credentials inside the dev container:

  • Clone the repo
  • `cp .env.example .env` and `cp creds-example.json creds.json`
  • Edit the files with your values
  • Open the DevContainer → credentials are automatically available
  • No manual Docker commands needed, just click "Reopen in Container" in VS Code.
brent.edwards marked this conversation as resolved
@@ -0,0 +23,4 @@
try:
from google.oauth2 import service_account
from googleapiclient.discovery import build
Member

This behavior is extremely unusual. Why are these imports separated from everything else? Why wouldn't someone have these libraries installed?

I recommend removing the `GOOGLE_SHEETS_AVAILABLE` and lines 49-53. If someone doesn't have `google-api-python-client` or `google-auth` installed, the software should fail with the usual `ImportError`.

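A minimal sketch of the change being requested -- the same imports the diff guards, moved to the top level so a missing dependency fails with the normal `ImportError`:

```
# google_sheets_tracker.py (sketch): import the Google API clients
# unconditionally. If google-api-python-client or google-auth is not
# installed, Python raises the usual ImportError at import time,
# so no GOOGLE_SHEETS_AVAILABLE flag is needed.
from google.oauth2 import service_account
from googleapiclient.discovery import build
```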
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +49,4 @@
if not GOOGLE_SHEETS_AVAILABLE:
raise ImportError(
"Google Sheets API not available. "
"Install with: pip install google-api-python-client google-auth"
Member

The packages need to be added to `pyproject.toml`.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +105,4 @@
"""Load column headers from the sheet."""
try:
result = (
self.sheet.values()
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:108:28 - error: "values" is not a known attribute of "None" (reportOptionalMemberAccess)
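One way to satisfy `pyright` here is to narrow the Optional before the chained call. This is a sketch: it assumes `self.sheet` is an Optional attribute set during authentication, and the argument names are taken from elsewhere in this PR rather than verified:

```
# Rule out None so pyright knows `self.sheet` has a `values` method here.
if self.sheet is None:
    raise RuntimeError("Google Sheets service is not initialized")
result = (
    self.sheet.values()
    .get(spreadsheetId=self.spreadsheet_id, range=self.worksheet_name)
    .execute()
)
```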
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +114,4 @@
)
values = result.get("values", [])
if values:
self.headers = [str(h).strip() for h in values[0]]
Member

You will save a LOT of trouble if you change this line to

self.headers = [str(h).strip().casefold() for h in values[0]]
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +118,4 @@
self.dataset_name_col = self._find_column_index(
[
"DATASET NAME",
"dataset name",
Member

If you change line 117, then you don't need duplicate column names.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +124,4 @@
"id",
"name",
]
)
Member

This code should post warnings if the column names aren't found...

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +148,4 @@
"""
for name in possible_names:
try:
return self.headers.index(name)
Member

If you change line 117, then you should also change lines 149-159 to

for name in possible_names:
    try:
        return self.headers.index(name.casefold())
    except ValueError:
        continue

return None

See how simple that is? AND it works when comparing `Dataset_Id` and `Dataset_ID`.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +173,4 @@
try:
result = (
self.sheet.values()
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:176:28 - error: "values" is not a known attribute of "None" (reportOptionalMemberAccess)
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +189,4 @@
if (
len(row) > self.dataset_name_col
and str(row[self.dataset_name_col]).strip().lower()
== dataset_id.lower()
Member

Because data is multilingual, it's much better to use `.casefold()` instead of `.lower()`.

It doesn't matter in Devanagari-based languages or English, but in other languages like German, `upper` and `lower` aren't inverses. For example:

>>> "fuß".lower()
'fuß'
>>> "fuß".upper().lower()
'fuss'

`.casefold()` puts everything into one well-structured format.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +193,4 @@
):
status = (
row[self.status_col].strip()
if self.status_col and len(row) > self.status_col
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:190:21 - error: Operator ">" not supported for types "int" and "int | None"
    Operator ">" not supported for types "int" and "None" (reportOperatorIssue)
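A sketch of the fix `pyright` is asking for -- test the Optional column index against `None` explicitly before comparing (the `else` fallback is a placeholder, since that branch isn't shown in the diff):

```
# `self.status_col` is Optional[int]; `len(row) > None` is the type error,
# so rule out None before the comparison.
status = (
    row[self.status_col].strip()
    if self.status_col is not None and len(row) > self.status_col
    else ""
)
```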
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +212,4 @@
dataset_id: str,
status: str,
error_message: str | None = None,
notes: str | None = None,
Member

Why is `notes` a parameter? Is there supposed to be a `notes` column?

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +262,4 @@
if updates:
body = {"valueInputOption": "RAW", "data": updates}
(
self.sheet.values()
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:265:32 - error: "values" is not a known attribute of "None" (reportOptionalMemberAccess)
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +315,4 @@
if status_upper == "DONE":
if force:
return (True, "Force processing enabled")
Member

IMHO -- `if force:` should be its own boolean condition at or near the start of the `check_should_process` function.

It certainly should not be repeated in lines 317-318 and 322-323.

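A sketch of the suggested restructuring, with `force` checked once at the top and (anticipating the comment below) unrecognized statuses rejected. The statuses and most messages come from the diff; the simplified signature and the DONE message are assumptions:

```
def check_should_process(status: str, force: bool) -> tuple[bool, str]:
    # Check `force` once, up front, instead of repeating it per status.
    if force:
        return (True, "Force processing enabled")
    status_upper = status.strip().upper()
    if status_upper == "DONE":
        return (False, "Status is DONE, skipping")
    if status_upper == "NOT STARTED":
        return (True, "Status is NOT STARTED, processing")
    # Refuse anything the code does not recognize.
    return (
        False,
        f"Unknown status {status_upper}: "
        "Not moving forward until the status is 'NOT STARTED'",
    )
```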
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +326,4 @@
if status_upper == "NOT STARTED":
return (True, "Status is NOT STARTED, processing")
return (True, "Status allows processing")
Member

I'm very nervous about this line. If the status were "NO! DON'T PROCESS", then this line means that the processing will continue anyway.

I would rather return `(False, f"Unknown status {status_upper}: Not moving forward until the status is 'NOT STARTED'")`.

I'm very nervous about this line. If the status were "NO! DON'T PROCESS", then this line means that the processing will continue anyway. I would rather return `(False, f"Unknown status {status_upper}: Not moving forward until the status is 'NOT STARTED'")`.
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -66,0 +70,4 @@
except ImportError:
pass
except Exception:
pass
Member

Both `except` statements are the same. You can get rid of one of them.

Keep `ImportError` and delete `Exception` if the only reasonable exceptions are import errors.

If other exceptions are reasonable (for example, if `load_dotenv` throws `FileNotFoundError`), then include those exceptions explicitly.

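A sketch of the narrower handling, assuming (from the comment) that the `try` block wraps an optional `python-dotenv` import:

```
try:
    # Optional dependency: load variables from a .env file when
    # python-dotenv is installed; otherwise use the process environment.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass
```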
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -71,2 +80,4 @@
console = Console()
try:
from google_sheets_tracker import GoogleSheetsTracker, create_tracker
Member

Same comments as in `google_sheets_tracker`. Any requirements should be in the `pyproject.toml` file, and we shouldn't have conditional imports.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -73,0 +88,4 @@
GoogleSheetsTracker = None
try:
from huggingface_hub import HfApi
Member

Same comments.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1094,0 +1181,4 @@
"Organization not found in config file. "
"Please set 'organization' field in dataset_config.json"
)
repo_id = f"{organization}/{dataset_id}"
Member

Lines 1176-1184 are very conceptually similar to lines 262-291, but these assume that the repo_id must be `f"{organization}/{dataset_id}"` instead of checking and allowing `f"{organization}/{dataset_id}-v1"`.

Are you SURE that the repo_id here can only be `f"{organization}/{dataset_id}"`?

Would it be better to use `check_dataset_exists_on_hf`?

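A sketch of the reviewer's suggestion. `check_dataset_exists_on_hf` is named in the review, but the signature and return value used here are assumptions:

```
# Reuse the existing lookup instead of hard-coding the bare repo id, so a
# suffixed repo such as f"{organization}/{dataset_id}-v1" is also found.
# Hypothetical signature: returns the matching repo_id, or None if absent.
existing_repo = check_dataset_exists_on_hf(organization, dataset_id)
repo_id = existing_repo or f"{organization}/{dataset_id}"
```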
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1236,0 +1346,4 @@
try:
sheet_tracker = create_tracker(
spreadsheet_id=None,
worksheet_name="Sheet1",
Member

The worksheet name should come from a file or an environment variable. It might not always be named "Sheet1".

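A sketch of the fix; the environment variable name `GOOGLE_SHEETS_WORKSHEET` is illustrative and not defined anywhere in this PR:

```
import os

# Read the worksheet name from the environment, defaulting to "Sheet1"
# only when no override is configured.
sheet_tracker = create_tracker(
    spreadsheet_id=None,
    worksheet_name=os.environ.get("GOOGLE_SHEETS_WORKSHEET", "Sheet1"),
)
```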
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1249,0 +1397,4 @@
)
if current_status_value != "DONE":
sheet_tracker.update_status(dataset_id, "DONE")
Member

Is it possible that a partial sheet was uploaded, but an error occurred and had been correctly listed as an error in the sheet?

If so, this is eliminating the error.

I understand the idea of read repair -- but this update makes me nervous that it will eliminate important information.

Member

If you must keep this line, why are you using `update_status` instead of `mark_completed`?

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1249,0 +1370,4 @@
)
if current_status_value not in ["DONE", "ERROR"]:
sheet_tracker.mark_completed(dataset_id)
Member

This code still makes me VERY nervous.

Imagine that someone changed the status line to something that's not in the registered list -- for example, `SKIP`, or they mistype and write `NOTSTARTED`.

This line overwrites what the user sent, changes it to `DONE`, and does NOT download anything.

This will badly surprise both the `SKIP` people and the `NOTSTARTED` people.

As a rule, if the code does not understand an instruction, it should do nothing but post an error message. It should never overwrite things that it does not understand.

Please remove lines 1372-1377 (and change line 1378 to a simple `if`).

Author
Member

Fixed !!

brent.edwards marked this conversation as resolved
brent.edwards left a comment
Member

You're still missing

  1. Attaching the `google_sheets_tracker` to `process_datasets_parallel.py`.

Otherwise, excellent work!

README.md Outdated
@@ -44,0 +37,4 @@
### 3. Open in DevContainer
```bash
code .
Member

The standard inside CleverThis is PyCharm, not VS Code. Could you also include instructions on how to do this with PyCharm?

Author
Member

Fixed !!

brent.edwards marked this conversation as resolved
README.md Outdated
@@ -44,0 +42,4 @@
When prompted: **"Reopen in Container?"** → Click **"Reopen in Container"**
Wait 2-5 minutes for first-time setup.
Member

In `devcontainer.md`, you wrote that this takes 8-10 minutes.

Author
Member

Fixed !!

brent.edwards marked this conversation as resolved
pyproject.toml Outdated
@@ -30,3 +26,1 @@
"pyarrow>=14.0.0", # Parquet/Arrow data
"httpx>=0.25.0", # HTTP client (async)
"requests>=2.31.0", # HTTP client (sync)
"datasets>=2.0.0",
Member

Why did you remove comments? (I'm not saying that it's wrong; I'm saying that they were useful.)

Author
Member

got it !! added them back.

brent.edwards marked this conversation as resolved
CoreRasurae left a comment
First-time contributor

Please consider my comment, as it may be valid and worth the effort of addressing it.

@@ -1332,1 +1606,4 @@
)
except Exception:
pass
return 130
First-time contributor

This looks like a very long method to me... it should be logically split, and I think the code should be better documented. I see long chains of conditionals without a single business-logic explanation for them. Since this is a diff, it is not totally clear to me where the method starts and ends, but it looks quite long...

aditya force-pushed auto-excel-update from ef924a1aa1 to 53fcbbbb04 2026-01-14 14:17:44 +00:00
@@ -1332,0 +1514,4 @@
except Exception:
pass
return 1
except Exception as e:
First-time contributor

Why catch `ValueError` and `Exception` separately here? You do almost exactly the same thing with each, but don't print the error message for the generic `Exception`.

@@ -1332,0 +1545,4 @@
try:
upload(args, dataset_id)
break
except ValueError as e:
First-time contributor

Uploading the dataset can fail in many ways that aren't `ValueError`s. I see that your `upload()` function wraps them all as `ValueError`s, but I think this is incorrect. For example, a network failure has nothing to do with a value error. To me it makes more sense to raise the appropriate type of exception and respond accordingly. For example, you should probably retry on a network error (which could be transient and might resolve itself by the next attempt), but not a permissions error (which is very unlikely to do so).

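A sketch of the distinction being described -- retry only on plausibly transient failures. The exception classes are illustrative, since the real types raised by `upload()` depend on the underlying HTTP client:

```
import time
import requests

for _ in range(3):
    try:
        upload(args, dataset_id)
        break
    except requests.ConnectionError:
        # Transient network failure: wait briefly and retry.
        time.sleep(5)
    except PermissionError:
        # A permissions problem will not resolve itself; fail immediately.
        raise
```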
khird approved these changes 2026-01-16 15:02:01 +00:00
brent.edwards left a comment
Member

These are good, but you're still missing

Attaching the `google_sheets_tracker` to `process_datasets_parallel.py`.

Otherwise, excellent work!

@@ -32,2 +10,2 @@
# run the sample CLI
python -m boilerplate
git clone <repository-url>
cd datasets
Member

The directory is named `dataset-uploader`.

Author
Member

Fixed !!

@@ -44,0 +26,4 @@
```
Required values:
- `GOOGLE_SHEETS_ID`: Your Google Sheets spreadsheet ID from the URL
Member

Is there anything else about the spreadsheet? For example, should there be column names? What should those column names be?

Should some of those columns (like the list of datasets) be filled before running?

Author
Member

Fixed !!

README.md Outdated
@@ -44,0 +29,4 @@
- `GOOGLE_SHEETS_ID`: Your Google Sheets spreadsheet ID from the URL
- `HF_TOKEN`: Your HuggingFace token from https://huggingface.co/settings/tokens
Edit `creds.json` with your Google Cloud service account credentials:
Member

If there are any hints from Google on how to set up the Google Cloud service accounts, they would be very useful here. There will always be people who have never visited https://console.cloud.google.com .

Author
Member

Fixed !!

README.md Outdated
@ -44,0 +42,4 @@
code .
```
When prompted: **"Reopen in Container?"** → Click **"Reopen in Container"**
Member

I still cannot test this. Could you try it on your Nix machine and check how to install the containers there?

This pull request can be merged automatically.
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin auto-excel-update:auto-excel-update
git switch auto-excel-update

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

Create a merge commit:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --no-ff auto-excel-update

Rebase then fast-forward:
git switch auto-excel-update
git rebase stream-download-convert-upload-merge-15-refactor-rdf-converters
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --ff-only auto-excel-update

Rebase then create a merge commit:
git switch auto-excel-update
git rebase stream-download-convert-upload-merge-15-refactor-rdf-converters
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --no-ff auto-excel-update

Squash merge:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --squash auto-excel-update

Fast-forward only:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --ff-only auto-excel-update

Manual merge:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge auto-excel-update

Finally, push the result:
git push origin stream-download-convert-upload-merge-15-refactor-rdf-converters
Reference
cleverdatasets/dataset-uploader!46