Feat: add initial Dockerfile and compose file for the repo #24

Open
aditya wants to merge 2 commits from 7-dockerfile-setup into mypy-and-pylint
Member

Initial docker setup
aditya changed target branch from mypy-and-pylint to 16-error-messages-and-logging 2025-12-03 14:38:40 +00:00
aditya changed target branch from 16-error-messages-and-logging to mypy-and-pylint 2025-12-03 14:39:55 +00:00
brent.edwards left a comment
Member

Is this the expected behavior?
@@ -36,2 +12,2 @@
RUN pip install --no-cache-dir /tmp/*.whl && \
rm -rf /tmp/*
# Install uv for fast dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
Member

Is this the best way to install `uv`? https://docs.astral.sh/uv/guides/integration/docker/#installing-uv recommends:

# Download the latest installer
ADD https://astral.sh/uv/install.sh /uv-installer.sh

# Run the installer then remove it
RUN sh /uv-installer.sh && rm /uv-installer.sh

# Ensure the installed binary is on the `PATH`
ENV PATH="/root/.local/bin/:$PATH"
Author
Member

Both methods are documented and valid, but `COPY --from=` is actually the preferred approach in the uv docs.
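
The same docs also suggest pinning a specific uv release rather than tracking `latest`, so rebuilds stay reproducible. A minimal sketch of the pinned variant (the `0.5.11` tag is illustrative only, not the version this PR uses):

# Pin a specific uv release instead of :latest so rebuilds are reproducible
# (the 0.5.11 tag below is an example, not the version used in this PR)
FROM python:3.13-slim
COPY --from=ghcr.io/astral-sh/uv:0.5.11 /uv /usr/local/bin/uv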
brent.edwards marked this conversation as resolved
@@ -0,0 +1,49 @@
version: '3.8'
Member

When I try running `docker compose up`, I get the following:

❯ docker compose up
WARN[0000] /home/brent.edwards/Workspace-2/dataset-uploader/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Running 1/1
 ! cleverdatasets Warning pull access denied for cleverdatasets, repository does not exist or may require 'docker login': denied: requested access to the resource is denied                                           1.6s 
WARN[0001] Docker Compose is configured to build using Bake, but buildx isn't installed 
[+] Building 18.0s (21/21) FINISHED                                                                                                                                                                          docker:default
 => [cleverdatasets internal] load build definition from Dockerfile                                                                                                                                                    0.2s
 => => transferring dockerfile: 1.67kB                                                                                                                                                                                 0.0s
 => [cleverdatasets] resolve image config for docker-image://docker.io/docker/dockerfile:1                                                                                                                             1.3s
 => CACHED [cleverdatasets] docker-image://docker.io/docker/dockerfile:1@sha256:b6afd42430b15f2d2a4c5a02b919e98a525b785b1aaff16747d2f623364e39b6                                                                       0.0s
 => [cleverdatasets internal] load metadata for docker.io/library/python:3.13-slim                                                                                                                                     0.5s 
 => [cleverdatasets internal] load metadata for ghcr.io/astral-sh/uv:latest                                                                                                                                            0.7s
 => [cleverdatasets internal] load .dockerignore                                                                                                                                                                       0.0s
 => => transferring context: 980B                                                                                                                                                                                      0.0s 
 => [cleverdatasets stage-0  1/11] FROM docker.io/library/python:3.13-slim@sha256:05b118ecc93ea09e30569706568fb251c71b77d2a3908d338b77be033e162b59                                                                     0.0s 
 => [cleverdatasets internal] load build context                                                                                                                                                                       0.1s 
 => => transferring context: 2.09kB                                                                                                                                                                                    0.0s 
 => [cleverdatasets] FROM ghcr.io/astral-sh/uv:latest@sha256:4c1ad814fe658851f50ff95ecd6948673fffddb0d7994bdb019dcb58227abd52                                                                                          0.0s 
 => CACHED [cleverdatasets stage-0  2/11] RUN apt-get update && apt-get install -y --no-install-recommends     bzip2     gzip     curl     ca-certificates     && rm -rf /var/lib/apt/lists/*                          0.0s 
 => CACHED [cleverdatasets stage-0  3/11] COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv                                                                                                                0.0s 
 => [cleverdatasets stage-0  4/11] WORKDIR /app                                                                                                                                                                        0.1s 
 => [cleverdatasets stage-0  5/11] COPY pyproject.toml ./                                                                                                                                                              0.2s 
 => [cleverdatasets stage-0  6/11] RUN uv pip install --system --no-cache-dir     rdflib>=7.0.0     datasets>=2.0.0     huggingface-hub     rich>=13.0.0     httpx>=0.25.0     pyarrow>=14.0.0     pandas>=2.0.0       8.9s 
 => [cleverdatasets stage-0  7/11] COPY scripts/ ./scripts/                                                                                                                                                            0.3s 
 => [cleverdatasets stage-0  8/11] COPY src/ ./src/                                                                                                                                                                    0.2s 
 => [cleverdatasets stage-0  9/11] RUN mkdir -p /data/dataset_processing &&     mkdir -p /data/.cache/huggingface                                                                                                      1.5s 
 => [cleverdatasets stage-0 10/11] RUN useradd -m -u 1000 -s /bin/bash appuser &&     chown -R appuser:appuser /app /data                                                                                              0.7s 
 => [cleverdatasets stage-0 11/11] WORKDIR /data                                                                                                                                                                       0.2s 
 => [cleverdatasets] exporting to image                                                                                                                                                                                2.4s 
 => => exporting layers                                                                                                                                                                                                2.3s 
 => => writing image sha256:6085c144bd63a63dd576fe5069a75dfbd79c37919d69793563e0bcf593e68ac8                                                                                                                           0.0s 
 => => naming to docker.io/library/cleverdatasets:latest                                                                                                                                                               0.0s 
 => [cleverdatasets] resolving provenance for metadata file                                                                                                                                                            0.0s 
[+] Running 3/3                                                                                                                                                                                                             
 ✔ cleverdatasets                             Built                                                                                                                                                                    0.0s 
 ✔ Volume dataset-uploader_huggingface_cache  Created                                                                                                                                                                  0.4s 
 ✔ Container cleverdatasets                   Created                                                                                                                                                                  1.6s 
Attaching to cleverdatasets
cleverdatasets  | usage: upload_all_datasets.py [-h] [--dataset DATASET]
cleverdatasets  |                               [--category {small,medium,large,xlarge,all}]
cleverdatasets  |                               [--base-dir BASE_DIR] [--skip-download]
cleverdatasets  |                               [--skip-convert] [--skip-upload] [--dry-run]
cleverdatasets  |                               [--list] [--rm] [--parallel PARALLEL]
cleverdatasets  | 
cleverdatasets  | Download, convert, and upload RDF datasets to HuggingFace
cleverdatasets  | 
cleverdatasets  | options:
cleverdatasets  |   -h, --help            show this help message and exit
cleverdatasets  |   --dataset, -d DATASET
cleverdatasets  |                         Specific dataset(s) to process (can be specified
cleverdatasets  |                         multiple times)
cleverdatasets  |   --category, -c {small,medium,large,xlarge,all}
cleverdatasets  |                         Process datasets by category
cleverdatasets  |   --base-dir BASE_DIR   Base directory for work
cleverdatasets  |   --skip-download       Skip download step
cleverdatasets  |   --skip-convert        Skip conversion step
cleverdatasets  |   --skip-upload         Skip upload step
cleverdatasets  |   --dry-run             Show what would be done without doing it
cleverdatasets  |   --list                List datasets that would be processed
cleverdatasets  |   --rm                  Remove mode: wipe all downloads at start, delete each
cleverdatasets  |                         dataset after processing
cleverdatasets  |   --parallel, -p PARALLEL
cleverdatasets  |                         Number of datasets to process in parallel (default: 1,
cleverdatasets  |                         sequential)
cleverdatasets exited with code 0 (restarting)                                                                                                                                                                              

and the `--help` message keeps repeating.

Is this the expected behavior?

Author
Member

Fixed! Changed `restart: unless-stopped` to `restart: on-failure`.
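
For context: with `restart: unless-stopped`, Compose restarts the container even after a clean exit, so the help text loops forever, while `on-failure` only restarts on a non-zero exit code. A minimal sketch of the relevant service after the change (service and image names are taken from the build log above; dropping the obsolete `version:` key that the Compose warning flags is assumed):

services:
  cleverdatasets:
    build: .
    image: cleverdatasets
    # on-failure restarts only on a non-zero exit code, so a clean
    # exit (like printing --help and exiting 0) no longer loops
    restart: on-failure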
brent.edwards marked this conversation as resolved
Member

Great work!
khird approved these changes 2025-12-05 15:05:21 +00:00
This pull request can be merged automatically.
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin 7-dockerfile-setup:7-dockerfile-setup
git switch 7-dockerfile-setup

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch mypy-and-pylint
git merge --no-ff 7-dockerfile-setup
git push origin mypy-and-pylint

Reference
cleverdatasets/dataset-uploader!24
No description provided.