CleverDatasets does not handle dates before 1 AD. #54

Open
opened 2026-01-20 20:08:37 +00:00 by brent.edwards · 0 comments

In https://matrix.to/#/!rOUJUiLQjurlBUFcvx:qoto.org/$nI85M0AWGPSv-4gAkLwxfSAsjLEZ8uKuwxMMYgPiitc?via=qoto.org&via=matrix.org , Kyle wrote:

I get the following error in the Docker logs when running $ docker run -d --name uploader dataset-uploader:

Input detected: registry → wikidata-full
Format auto-detected from registry: turtle → turtle
Rows per shard: 5,000,000, Record batch size: 500,000
Using provided/environment token
Mode: Streaming download from registry
Using line-based streaming parser (no multiprocessing)
Registry: wikidata-full — Wikidata Full
✓ Streaming source ready: wikidata-20251215-all-BETA.ttl.gz
✓ Source ready: wikidata-20251215-all-BETA.ttl.gz
Starting incremental conversion and upload...
Streaming Turtle from line iterator
Processed 100,000 lines...
WARNING:rdflib.term:Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#dateTime, Converter=<built-in method fromisoformat of type object at 0x7f299267ad00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/rdflib/term.py", line 2262, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
ValueError: Invalid isoformat string: '-34000-01-01T00:00:00Z'
WARNING:rdflib.term:Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#dateTime, Converter=<built-in method fromisoformat of type object at 0x7f299267ad00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/rdflib/term.py", line 2262, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
ValueError: Invalid isoformat string: '-34000-01-01T00:00:00Z'
WARNING:rdflib.term:Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#dateTime, Converter=<built-in method fromisoformat of type object at 0x7f299267ad00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/rdflib/term.py", line 2262, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
ValueError: Invalid isoformat string: '-34000-01-01T00:00:00Z'

Based on my investigation, the problem occurs on lines 83974, 85066, and 87342 of wikidata-20251215-all-BETA.ttl. Each of these lines expresses the triple "dog", "earliest date", "-34000-01-01T00:00:00Z"; this is the age of the Goyet dog, one candidate for the earliest known domesticated dog. Digging further, this is a known limitation of RDFLib: it converts xsd:dateTime literals to Python's datetime type, which only supports years 1 through 9999 and so cannot represent negative (i.e. BC rather than AD) years. As far as I can tell, CleverDatasets would need to work around this limitation in the underlying library in order to handle years prior to 1 AD.
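To make the limitation concrete, here is a minimal sketch of one possible workaround, using only the Python standard library. It first confirms that `datetime.fromisoformat` rejects the lexical form from the log, then parses the same string into a plain tuple that does not depend on `datetime`. The names `WideDateTime`, `parse_wide_datetime`, and `stdlib_accepts` are hypothetical and are not part of RDFLib or CleverDatasets; how such a helper would be wired into the uploader is an assumption left open here.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Hypothetical helper names: nothing below is part of RDFLib or CleverDatasets.
# The pattern accepts an optional leading "-" and four or more year digits,
# which is enough for values such as "-34000-01-01T00:00:00Z".
_XSD_DATETIME = re.compile(
    r"^(?P<year>-?\d{4,})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}(?:\.\d+)?)"
    r"(?P<tz>Z|[+-]\d{2}:\d{2})?$"
)


class WideDateTime(NamedTuple):
    """Date-time fields kept as plain numbers, so the year may be negative."""

    year: int
    month: int
    day: int
    hour: int
    minute: int
    second: float
    tz: Optional[str]  # "Z", an offset such as "+01:00", or None


def parse_wide_datetime(lexical: str) -> WideDateTime:
    """Split an xsd:dateTime-style string into fields without using datetime.

    Only the overall shape is checked; month, day, and time ranges are not
    validated in this sketch.
    """
    m = _XSD_DATETIME.match(lexical)
    if m is None:
        raise ValueError(f"Not an xsd:dateTime lexical form: {lexical!r}")
    return WideDateTime(
        year=int(m["year"]),
        month=int(m["month"]),
        day=int(m["day"]),
        hour=int(m["hour"]),
        minute=int(m["minute"]),
        second=float(m["second"]),
        tz=m["tz"],
    )


def stdlib_accepts(lexical: str) -> bool:
    """Return True if datetime.fromisoformat can parse the string."""
    try:
        datetime.fromisoformat(lexical)
    except ValueError:
        return False
    return True


goyet = "-34000-01-01T00:00:00Z"
print(stdlib_accepts(goyet))       # the standard library rejects the value
print(parse_wide_datetime(goyet))  # the tuple form keeps the negative year
```

Because the tuple starts with the year as an integer, values for a single time zone sort chronologically with ordinary tuple comparison, which may be sufficient if the uploader only needs to store or order such dates rather than do calendar arithmetic on them.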

Reference
cleverdatasets/dataset-uploader#54