auto-excel-update #46

Open
aditya wants to merge 8 commits from auto-excel-update into stream-download-convert-upload-merge-15-refactor-rdf-converters
Member

This branch contains code for integrating the automatic Excel update using Google Cloud Platform when `upload_all_datasets.py` is run, plus a devcontainer setup for VS Code.

brent.edwards left a comment
Member

Thank you for being careful to follow the `pyright` and `ruff check` conditions for `upload_all_datasets`.

I still don't understand how a user would run this code. In particular, `creds.json` and `.env.example` would be inside the Docker container. Is there a way to provide the information they use from outside the Docker container, so that every user can just do a `docker run` with parameters of where to find that information?

@@ -4,3 +4,4 @@
"context": "..",
"customizations": {
"vscode": {
Member

Sadly, I wasn't able to get dev containers working with VS Code locally (due to NixOS), so I couldn't test this.

Author
Member

This is how to use the environment credentials inside the dev container:

  • Clone the repo
  • `cp .env.example .env` and `cp creds-example.json creds.json`
  • Edit the files with your values
  • Open the DevContainer → credentials are automatically available
  • No manual Docker commands needed, just click "Reopen in Container" in VS Code.
brent.edwards marked this conversation as resolved
@@ -0,0 +23,4 @@
try:
from google.oauth2 import service_account
from googleapiclient.discovery import build
Member

This behavior is extremely unusual. Why are these imports separated from everything else? Why wouldn't someone have these libraries installed?

I recommend removing the `GOOGLE_SHEETS_AVAILABLE` and lines 49-53. If someone doesn't have `google-api-python-client` or `google-auth` installed, the software should fail with the usual `ImportError`.

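A minimal sketch of the change being requested -- the same imports the diff guards, moved to the top level so a missing dependency fails with the normal `ImportError`:

```
# google_sheets_tracker.py (sketch): import the Google API clients
# unconditionally. If google-api-python-client or google-auth is not
# installed, Python raises the usual ImportError at import time,
# so no GOOGLE_SHEETS_AVAILABLE flag is needed.
from google.oauth2 import service_account
from googleapiclient.discovery import build
```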
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +49,4 @@
if not GOOGLE_SHEETS_AVAILABLE:
raise ImportError(
"Google Sheets API not available. "
"Install with: pip install google-api-python-client google-auth"
Member

The packages need to be added to `pyproject.toml`.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +105,4 @@
"""Load column headers from the sheet."""
try:
result = (
self.sheet.values()
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:108:28 - error: "values" is not a known attribute of "None" (reportOptionalMemberAccess)
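One way to satisfy `pyright` here is to narrow the Optional before the chained call. This is a sketch: it assumes `self.sheet` is an Optional attribute set during authentication, and the argument names are taken from elsewhere in this PR rather than verified:

```
# Rule out None so pyright knows `self.sheet` has a `values` method here.
if self.sheet is None:
    raise RuntimeError("Google Sheets service is not initialized")
result = (
    self.sheet.values()
    .get(spreadsheetId=self.spreadsheet_id, range=self.worksheet_name)
    .execute()
)
```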
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +114,4 @@
)
values = result.get("values", [])
if values:
self.headers = [str(h).strip() for h in values[0]]
Member

You will save a LOT of trouble if you change this line to

self.headers = [str(h).strip().casefold() for h in values[0]]
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +118,4 @@
self.dataset_name_col = self._find_column_index(
[
"DATASET NAME",
"dataset name",
Member

If you change line 117, then you don't need duplicate column names.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +124,4 @@
"id",
"name",
]
)
Member

This code should post warnings if the column names aren't found...

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +148,4 @@
"""
for name in possible_names:
try:
return self.headers.index(name)
Member

If you change line 117, then you should also change lines 149-159 to

for name in possible_names:
    try:
        return self.headers.index(name.casefold())
    except ValueError:
        continue

return None

See how simple that is? AND it works when comparing `Dataset_Id` and `Dataset_ID`.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +173,4 @@
try:
result = (
self.sheet.values()
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:176:28 - error: "values" is not a known attribute of "None" (reportOptionalMemberAccess)
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +189,4 @@
if (
len(row) > self.dataset_name_col
and str(row[self.dataset_name_col]).strip().lower()
== dataset_id.lower()
Member

Because data is multilingual, it's much better to use `.casefold()` instead of `.lower()`.

It doesn't matter in Devanagari-based languages or English, but in other languages like German, `upper` and `lower` aren't inverses. For example:

>>> "fuß".lower()
'fuß'
>>> "fuß".upper().lower()
'fuss'

`.casefold()` puts everything into one well-structured format.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +193,4 @@
):
status = (
row[self.status_col].strip()
if self.status_col and len(row) > self.status_col
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:190:21 - error: Operator ">" not supported for types "int" and "int | None"
    Operator ">" not supported for types "int" and "None" (reportOperatorIssue)
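A sketch of the fix `pyright` is asking for -- test the Optional column index against `None` explicitly before comparing (the `else` fallback is a placeholder, since that branch isn't shown in the diff):

```
# `self.status_col` is Optional[int]; `len(row) > None` is the type error,
# so rule out None before the comparison.
status = (
    row[self.status_col].strip()
    if self.status_col is not None and len(row) > self.status_col
    else ""
)
```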
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +212,4 @@
dataset_id: str,
status: str,
error_message: str | None = None,
notes: str | None = None,
Member

Why is `notes` a parameter? Is there supposed to be a `notes` column?

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +262,4 @@
if updates:
body = {"valueInputOption": "RAW", "data": updates}
(
self.sheet.values()
Member

`pyright` reports:

  /home/brent.edwards/Workspace-2/dataset-uploader/scripts/google_sheets_tracker.py:265:32 - error: "values" is not a known attribute of "None" (reportOptionalMemberAccess)
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +315,4 @@
if status_upper == "DONE":
if force:
return (True, "Force processing enabled")
Member

IMHO -- `if force:` should be its own boolean condition at or near the start of the `check_should_process` function.

It certainly should not be repeated in lines 317-318 and 322-323.

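A sketch of the suggested restructuring, with `force` checked once at the top and (anticipating the comment below) unrecognized statuses rejected. The statuses and most messages come from the diff; the simplified signature and the DONE message are assumptions:

```
def check_should_process(status: str, force: bool) -> tuple[bool, str]:
    # Check `force` once, up front, instead of repeating it per status.
    if force:
        return (True, "Force processing enabled")
    status_upper = status.strip().upper()
    if status_upper == "DONE":
        return (False, "Status is DONE, skipping")
    if status_upper == "NOT STARTED":
        return (True, "Status is NOT STARTED, processing")
    # Refuse anything the code does not recognize.
    return (
        False,
        f"Unknown status {status_upper}: "
        "Not moving forward until the status is 'NOT STARTED'",
    )
```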
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -0,0 +326,4 @@
if status_upper == "NOT STARTED":
return (True, "Status is NOT STARTED, processing")
return (True, "Status allows processing")
Member

I'm very nervous about this line. If the status were "NO! DON'T PROCESS", then this line means that the processing will continue anyway.

I would rather return `(False, f"Unknown status {status_upper}: Not moving forward until the status is 'NOT STARTED'")`.

I'm very nervous about this line. If the status were "NO! DON'T PROCESS", then this line means that the processing will continue anyway. I would rather return `(False, f"Unknown status {status_upper}: Not moving forward until the status is 'NOT STARTED'")`.
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -66,0 +70,4 @@
except ImportError:
pass
except Exception:
pass
Member

Both `except` statements are the same. You can get rid of one of them.

Keep `ImportError` and delete `Exception` if the only reasonable exceptions are import errors.

If other exceptions are reasonable (for example, if `load_dotenv` throws `FileNotFoundError`), then include those exceptions explicitly.

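A sketch of the narrower handling, assuming (from the comment) that the `try` block wraps an optional `python-dotenv` import:

```
try:
    # Optional dependency: load variables from a .env file when
    # python-dotenv is installed; otherwise use the process environment.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass
```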
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -71,2 +80,4 @@
console = Console()
try:
from google_sheets_tracker import GoogleSheetsTracker, create_tracker
Member

Same comments as in `google_sheets_tracker`. Any requirements should be in the `pyproject.toml` file, and we shouldn't have conditional imports.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -73,0 +88,4 @@
GoogleSheetsTracker = None
try:
from huggingface_hub import HfApi
Member

Same comments.

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1094,0 +1181,4 @@
"Organization not found in config file. "
"Please set 'organization' field in dataset_config.json"
)
repo_id = f"{organization}/{dataset_id}"
Member

Lines 1176-1184 are very conceptually similar to lines 262-291, but these assume that the repo_id must be `f"{organization}/{dataset_id}"` instead of checking and allowing `f"{organization}/{dataset_id}-v1"`.

Are you SURE that the repo_id here can only be `f"{organization}/{dataset_id}"`?

Would it be better to use `check_dataset_exists_on_hf`?

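A sketch of the reviewer's suggestion. `check_dataset_exists_on_hf` is named in the review, but the signature and return value used here are assumptions:

```
# Reuse the existing lookup instead of hard-coding the bare repo id, so a
# suffixed repo such as f"{organization}/{dataset_id}-v1" is also found.
# Hypothetical signature: returns the matching repo_id, or None if absent.
existing_repo = check_dataset_exists_on_hf(organization, dataset_id)
repo_id = existing_repo or f"{organization}/{dataset_id}"
```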
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1236,0 +1346,4 @@
try:
sheet_tracker = create_tracker(
spreadsheet_id=None,
worksheet_name="Sheet1",
Member

The worksheet name should come from a file or an environment variable. It might not always be named "Sheet1".

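A sketch of the fix; the environment variable name `GOOGLE_SHEETS_WORKSHEET` is illustrative and not defined anywhere in this PR:

```
import os

# Read the worksheet name from the environment, defaulting to "Sheet1"
# only when no override is configured.
sheet_tracker = create_tracker(
    spreadsheet_id=None,
    worksheet_name=os.environ.get("GOOGLE_SHEETS_WORKSHEET", "Sheet1"),
)
```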
Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1249,0 +1397,4 @@
)
if current_status_value != "DONE":
sheet_tracker.update_status(dataset_id, "DONE")
Member

Is it possible that a partial sheet was uploaded, but an error occurred and had been correctly listed as an error in the sheet?

If so, this is eliminating the error.

I understand the idea of read repair -- but this update makes me nervous that it will eliminate important information.

Member

If you must keep this line, why are you using `update_status` instead of `mark_completed`?

Author
Member

fixed !!

brent.edwards marked this conversation as resolved
@@ -1249,0 +1370,4 @@
)
if current_status_value not in ["DONE", "ERROR"]:
sheet_tracker.mark_completed(dataset_id)
Member

This code still makes me VERY nervous.

Imagine that someone changed the status line to something that's not in the registered list -- for example, `SKIP`, or they mistype and write `NOTSTARTED`.

This line overwrites what the user sent, changes it to `DONE`, and does NOT download anything.

This will badly surprise both the `SKIP` people and the `NOTSTARTED` people.

As a rule, if the code does not understand an instruction, it should do nothing but post an error message. It should never overwrite things that it does not understand.

Please remove lines 1372-1377 (and change line 1378 to a simple `if`).

Author
Member

Fixed !!

brent.edwards marked this conversation as resolved
brent.edwards left a comment
Member

You're still missing

  1. Attaching the `google_sheets_tracker` to `process_datasets_parallel.py`.

Otherwise, excellent work!

README.md Outdated
@@ -44,0 +37,4 @@
### 3. Open in DevContainer
```bash
code .
Member

The standard inside CleverThis is PyCharm, not VS Code. Could you also include instructions on how to do this with PyCharm?

Author
Member

Fixed !!

brent.edwards marked this conversation as resolved
README.md Outdated
@@ -44,0 +42,4 @@
When prompted: **"Reopen in Container?"** → Click **"Reopen in Container"**
Wait 2-5 minutes for first-time setup.
Member

In `devcontainer.md`, you wrote that this takes 8-10 minutes.

Author
Member

Fixed !!

brent.edwards marked this conversation as resolved
pyproject.toml Outdated
@@ -30,3 +26,1 @@
"pyarrow>=14.0.0", # Parquet/Arrow data
"httpx>=0.25.0", # HTTP client (async)
"requests>=2.31.0", # HTTP client (sync)
"datasets>=2.0.0",
Member

Why did you remove comments? (I'm not saying that it's wrong; I'm saying that they were useful.)

Author
Member

got it !! added them back.

brent.edwards marked this conversation as resolved
CoreRasurae left a comment
First-time contributor

Please consider my comment, as it may be valid and worth the effort of addressing it.

@@ -1332,1 +1606,4 @@
)
except Exception:
pass
return 130
First-time contributor

This looks like a very long method to me... it should be logically split, and I think the code should be better documented. I see long chains of conditionals without a single business-logic explanation for them. Since this is a diff, it is not totally clear to me where the method starts and ends, but it looks quite long...

aditya force-pushed auto-excel-update from ef924a1aa1 to 53fcbbbb04 2026-01-14 14:17:44 +00:00
@@ -1332,0 +1514,4 @@
except Exception:
pass
return 1
except Exception as e:
First-time contributor

Why catch `ValueError` and `Exception` separately here? You do almost exactly the same thing with each, but don't print the error message for the generic `Exception`.

@@ -1332,0 +1545,4 @@
try:
upload(args, dataset_id)
break
except ValueError as e:
First-time contributor

Uploading the dataset can fail in many ways that aren't `ValueError`s. I see that your `upload()` function wraps them all as `ValueError`s, but I think this is incorrect. For example, a network failure has nothing to do with a value error. To me it makes more sense to raise the appropriate type of exception and respond accordingly. For example, you should probably retry on a network error (which could be transient and might resolve itself by the next attempt), but not a permissions error (which is very unlikely to do so).

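A sketch of the distinction being described -- retry only on plausibly transient failures. The exception classes are illustrative, since the real types raised by `upload()` depend on the underlying HTTP client:

```
import time
import requests

for _ in range(3):
    try:
        upload(args, dataset_id)
        break
    except requests.ConnectionError:
        # Transient network failure: wait briefly and retry.
        time.sleep(5)
    except PermissionError:
        # A permissions problem will not resolve itself; fail immediately.
        raise
```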
khird approved these changes 2026-01-16 15:02:01 +00:00
brent.edwards left a comment
Member

These are good, but you're still missing

Attaching the `google_sheets_tracker` to `process_datasets_parallel.py`.

Otherwise, excellent work!

@@ -32,2 +10,2 @@
# run the sample CLI
python -m boilerplate
git clone <repository-url>
cd datasets
Member

The directory is named `dataset-uploader`.

Author
Member

Fixed !!

@@ -44,0 +26,4 @@
```
Required values:
- `GOOGLE_SHEETS_ID`: Your Google Sheets spreadsheet ID from the URL
Member

Is there anything else about the spreadsheet? For example, should there be column names? What should those column names be?

Should some of those columns (like the list of datasets) be filled before running?

Author
Member

Fixed !!

README.md Outdated
@@ -44,0 +29,4 @@
- `GOOGLE_SHEETS_ID`: Your Google Sheets spreadsheet ID from the URL
- `HF_TOKEN`: Your HuggingFace token from https://huggingface.co/settings/tokens
Edit `creds.json` with your Google Cloud service account credentials:
Member

If there are any hints from Google on how to set up the Google Cloud service accounts, they would be very useful here. There will always be people who have never visited https://console.cloud.google.com .

Author
Member

Fixed !!

README.md Outdated
@ -44,0 +42,4 @@
code .
```
When prompted: **"Reopen in Container?"** → Click **"Reopen in Container"**
Member

I still cannot test this. Could you try it on your Nix machine and check how to install the containers there?

This pull request can be merged automatically.
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin auto-excel-update:auto-excel-update
git switch auto-excel-update

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

Create a merge commit:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --no-ff auto-excel-update

Rebase then fast-forward:
git switch auto-excel-update
git rebase stream-download-convert-upload-merge-15-refactor-rdf-converters
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --ff-only auto-excel-update

Rebase then create a merge commit:
git switch auto-excel-update
git rebase stream-download-convert-upload-merge-15-refactor-rdf-converters
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --no-ff auto-excel-update

Squash merge:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --squash auto-excel-update

Fast-forward only:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge --ff-only auto-excel-update

Manual merge:
git switch stream-download-convert-upload-merge-15-refactor-rdf-converters
git merge auto-excel-update

Finally, push the result:
git push origin stream-download-convert-upload-merge-15-refactor-rdf-converters
Reference
cleverdatasets/dataset-uploader!46