`load_geojson`: Several improvements, especially around empty propert… by m-mohr · Pull Request #549 · Open-EO/openeo-processes

m-mohr · 2025-07-17T06:01:51Z

load_geojson:

Specify behavior if properties parameter is empty
Clarify mixed data types
Other minor clarifications

Closes #448

clausmichele

Seems fine to me, but I'd rather wait for a feedback from @soxofaan since I'm not using this process so often.

soxofaan · 2025-07-22T13:07:07Z

proposals/load_geojson.json

        {
            "name": "properties",
-            "description": "A list of properties from the GeoJSON file to construct an additional dimension from. A new dimension with the name `properties` and type `other` is created if at least one property is provided. Only applies for GeoJSON Features and FeatureCollections. Missing values are generally set to no-data (`null`).\n\nDepending on the number of properties provided, the process creates the dimension differently:\n\n- Single property with scalar values: A single dimension label with the name of the property and a single value per geometry.\n- Single property of type array: The dimension labels correspond to the array indices. There are as many values and labels per geometry as there are for the largest array.\n- Multiple properties with scalar values: The dimension labels correspond to the property names. There are as many values and labels per geometry as there are properties provided here.",
+            "description": "A list of properties from the GeoJSON file to construct an additional dimension from. A new dimension with the name `properties` and type `other` is created. If no property is provided, a label with value `id` is created and all values are set to a no-data value. Missing values are generally set to a no-data value.\n\nDepending on the number of properties provided, the process creates the dimension differently:\n\n- Single property with scalar values: A single dimension label with the name of the property and a single value per geometry.\n- Single property of type array: The dimension labels correspond to the array indices. There are as many values and labels per geometry as there are for the largest array.\n- Multiple properties with scalar values: The dimension labels correspond to the property names. There are as many values and labels per geometry as there are properties provided here.\n\nAlthough mixing data types of values is generally allowed, certain combinations of data type may be unsupported by some implementations. It is recommended to keep the data types homogeneous.",


Initial suggestion (unfortunately the diff-view does not seem to work properly on these long lines):

Suggested change

"description": "A list of properties from the GeoJSON file to construct an additional dimension from. A new dimension with the name `properties` and type `other` is created. If no property is provided, a label with value `id` is created and all values are set to a no-data value. Missing values are generally set to a no-data value.\n\nDepending on the number of properties provided, the process creates the dimension differently:\n\n- Single property with scalar values: A single dimension label with the name of the property and a single value per geometry.\n- Single property of type array: The dimension labels correspond to the array indices. There are as many values and labels per geometry as there are for the largest array.\n- Multiple properties with scalar values: The dimension labels correspond to the property names. There are as many values and labels per geometry as there are properties provided here.\n\nAlthough mixing data types of values is generally allowed, certain combinations of data type may be unsupported by some implementations. It is recommended to keep the data types homogeneous.",

"description": "A list of properties from the GeoJSON data to construct an additional dimension from. A new dimension with the name `properties` and type `other` is created. Missing values are generally set to a no-data value.\n\nDepending on the number of properties provided, the process creates the dimension differently:\n\n- Empty `properties` list: A single dimension label with value `id` is created and all values are set to a no-data value.\n\n- Single property with scalar values: A single dimension label with the name of the property and a single value per geometry.\n- Single property of type array: The dimension labels correspond to the array indices. There are as many values and labels per geometry as there are for the largest array.\n- Multiple properties with scalar values: The dimension labels correspond to the property names. There are as many values and labels per geometry as there are properties provided here.\n\nMissing values are generally set to a no-data value.\n\nAlthough mixing data types of values is generally allowed, certain combinations of data type may be unsupported by some implementations. It is recommended to keep the data types homogeneous.",

Some additional notes:

about the Empty properties list: ... id is created and all values are set to a no-data value: we could also set auto-increment values, which makes this case (which is the default by the way) quite a bit more useful I would say

about Single property with scalar values/of type array: I'm a bit hesitant about involving the type of the property value in the behavior of this process (e.g. to unroll arrays into separate "columns"). I guess this is to be more user friendly, but to me it feels a bit too magic and it undermines consistency and predictability.

Slightly related: what happens if multiple properties are listed, but some of them have array values: would that be an error?

What would feel more strict to me is to allow the properties argument not only be an array of strings like it is now, but also a single string; and maybe even null, which could then be the default:

properties arg is null (default) -> label id with no-data (or auto-increment) values

properties arg is array (must be non-empty) -> values are expected to be scalar. No automatic unrolling: to be decided if it is allowed to have arrays as pixel values or to raise error here.

properties arg is string -> values are expected to be an array to be unrolled (and raise error when they are scalar)

I think that will make the API more clean and predictable, because it is more explicit about assumptions and it avoids brittle value sniffing

about the Empty properties list: ... id is created and all values are set to a no-data value: we could also set auto-increment values, which makes this case (which is the default by the way) quite a bit more useful I would say

I think the intention was to not change the features, so I'd be hesistant to create something new here.
A simple load_geojson => save_result would add data, which was not there before.

about Single property with scalar values/of type array: I'm a bit hesitant about involving the type of the property value in the behavior of this process (e.g. to unroll arrays into separate "columns"). I guess this is to be more user friendly, but to me it feels a bit too magic and it undermines consistency and predictability.

What would happen in case of arrays then? openEO datacubes only consist of scalars, so we wouldn't be able to load any property that is of type array if we remove the "magic".

Slightly related: what happens if multiple properties are listed, but some of them have array values: would that be an error?

I guess that should result in an error, indeed. Or we create something like "property.index".

What would feel more strict to me is to allow the properties argument not only be an array of strings like it is now, but also a single string; and maybe even null, which could then be the default:

I don't understand this. Why is this more strict? My impression would be that this is less strict (and a rather verbose schema). A single string is equivalent to an array with one string and null is equivalent to an empty array. I don't get it. The same rules that apply could also apply to the array schema, especially as that would still be allowed unless we set minItems 2 in the schena for the array. I think I'd prefer keeping the input data type as array only.

I don't understand this. Why is this more strict? My impression would be that this is less strict (and a rather verbose schema).

I'll try to rephrase my concern. It's basically about this part of the current proposal:

Single property with scalar values: ...

Single property of type array: ...

Multiple properties with scalar values: ...

While this looks like a straightforward system, it's a bit deceptive:

the first part (single versus multiple) is fully determined by the properties argument of the process graph node

the second part (scalar values versus array) requires to inspect the actual GeoJSON data, which:

might not be under control of the author of the process graph (e.g. because it is loaded from an URL elsewhere in the process graph, or it is provided as argument to a UDP)

might be inconsistent, e.g. across all features in a feature collection. As a matter of fact: this raises the question how a backend implementation is expected to evaluate the scalar/array situation: check the whole feature collection, shortcut on the first non-null hit, use majority voting, ...

large/expensive to traverse in full for validation/dry-run contexts

I'd like to avoid the data sniffing of the second part because its brittleness undermines the reproducibility and predictability of the structure of the resulting vector cube.

Hence my proposal, that only depends on the provided properties argument, not the actual (runtime) data:

properties argument is null (the default): A single dimension label with value id is created and all values are set to the no-data value.

properties argument is a non-empty array of strings: The dimension labels correspond to the property names. There are as many values and labels per geometry as there are properties provided here.

properties arg is single string: The given property is expected to have arrays as values in the (GeoJSON) feature properties. The dimension labels correspond to the array indices. There are as many values and labels per geometry as there are for the largest array.

Notes:

the schema of the properties argument is indeed a bit more complex (union of null, array of strings and scalar string), but that's not too bad I think. Added benefit: it allows a backend to not support the properties=single string case.

It makes assumptions/expectations about the data more explicit, lowering risk on confusion and weird/inconsistent behavior. The flip side of less magic is that it might be a little less user friendly, but that can be accommodated for at the client level.

The structure of the output is directly predictable, which helps in situations where you want quick static analysis of the process graph (e.g. client-side or server side validation or dry-runs)

I consider this more strict (less magic) because the user is required to be explicit (at "authoring" time) about the expected property data structure, without depending on runtime data sniffing (which would be more magic)

A single string is equivalent to an array with one string and null is equivalent to an empty array.

I disagree here. Not considering them equivalent allows to express intent and expectations more fine-grained:

single string: I expect the corresponding GeoJSON feature property values to be arrays, which must be unrolled

array with one string: I expect the corresponding GeoJSON feature property values to be scalars, which can be readily be put in a single "column"

null: I don't care, just do the default thing

empty array: don't create columns (which makes little sense in this case, but that's not the point I try to make here)

I think I'm fine with the proposed way. Could you propose a wording or is the initial wording already covering that?

…ies parameter #448 Co-authored-by: Stefaan Lippens <soxofaan@users.noreply.github.com>

soxofaan

some more suggestions

soxofaan · 2025-10-21T12:03:26Z

proposals/load_geojson.json

    "id": "load_geojson",
    "summary": "Converts GeoJSON into a vector data cube",
-    "description": "Converts GeoJSON data as defined by [RFC 7946](https://www.rfc-editor.org/rfc/rfc7946.html) into a vector data cube. Feature properties are preserved.",
+    "description": "Converts GeoJSON data as defined by [RFC 7946](https://www.rfc-editor.org/rfc/rfc7946.html) into a vector data cube. The feature properties provided in the parameter `properties` are preserved.",


Suggested change

"description": "Converts GeoJSON data as defined by [RFC 7946](https://www.rfc-editor.org/rfc/rfc7946.html) into a vector data cube. The feature properties provided in the parameter `properties` are preserved.",

"description": "Converts GeoJSON data as defined by [RFC 7946](https://www.rfc-editor.org/rfc/rfc7946.html) into a vector data cube. The feature properties provided in the parameter `properties` are used as cube values.",

That is a bit ambiguous to me.

cube values = data values or dimension labels? I guess more clear would be something like "property keys are used as dimension labels and the property values are used as data values"?

proposals/load_geojson.json

m-mohr added this to the 2.0.0-rc.2 milestone Jul 17, 2025

m-mohr requested review from clausmichele and soxofaan July 17, 2025 06:01

m-mohr added the vector label Jul 17, 2025

m-mohr added this to Next processes release Jul 17, 2025

m-mohr linked an issue Jul 17, 2025 that may be closed by this pull request

load_geojson issues #448

Open

m-mohr force-pushed the load-geojson-issues branch from 3e04b16 to 8a303fd Compare July 17, 2025 06:03

m-mohr mentioned this pull request Jul 17, 2025

load_geojson issues #448

Open

m-mohr moved this to In Progress in Next processes release Jul 17, 2025

clausmichele reviewed Jul 17, 2025

View reviewed changes

soxofaan reviewed Jul 22, 2025

View reviewed changes

m-mohr force-pushed the load-geojson-issues branch 2 times, most recently from d51ffba to 63ea54d Compare August 7, 2025 19:34

load_geojson: Several improvements, especially around empty propert…

7bfede4

…ies parameter #448 Co-authored-by: Stefaan Lippens <soxofaan@users.noreply.github.com>

m-mohr force-pushed the load-geojson-issues branch from 63ea54d to 7bfede4 Compare September 8, 2025 15:56

m-mohr requested review from clausmichele and soxofaan September 8, 2025 15:56

m-mohr self-assigned this Sep 16, 2025

soxofaan reviewed Oct 21, 2025

View reviewed changes

Update proposals/load_geojson.json

21a8486

m-mohr modified the milestones: 2.0.0-rc.2, 2.0.0 Jan 7, 2026

m-mohr removed this from Next processes release Jan 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`load_geojson`: Several improvements, especially around empty propert…#549

`load_geojson`: Several improvements, especially around empty propert…#549
m-mohr wants to merge 2 commits intodraftfrom
load-geojson-issues

m-mohr commented Jul 17, 2025

Uh oh!

clausmichele left a comment

Uh oh!

soxofaan Jul 22, 2025

Uh oh!

m-mohr Jul 22, 2025 •

edited

Loading

Uh oh!

soxofaan Oct 21, 2025

Uh oh!

soxofaan Oct 21, 2025

Uh oh!

m-mohr Dec 5, 2025

Uh oh!

soxofaan left a comment

Uh oh!

soxofaan Oct 21, 2025

Uh oh!

m-mohr Dec 5, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	"description": "Converts GeoJSON data as defined by [RFC 7946](https://www.rfc-editor.org/rfc/rfc7946.html) into a vector data cube. The feature properties provided in the parameter `properties` are preserved.",
	"description": "Converts GeoJSON data as defined by [RFC 7946](https://www.rfc-editor.org/rfc/rfc7946.html) into a vector data cube. The feature properties provided in the parameter `properties` are used as cube values.",

Conversation

m-mohr commented Jul 17, 2025

Uh oh!

clausmichele left a comment

Choose a reason for hiding this comment

Uh oh!

soxofaan Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

m-mohr Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

soxofaan Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

soxofaan Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

m-mohr Dec 5, 2025

Choose a reason for hiding this comment

Uh oh!

soxofaan left a comment

Choose a reason for hiding this comment

Uh oh!

soxofaan Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

m-mohr Dec 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

m-mohr Jul 22, 2025 •

edited

Loading