I'm trying to write a query to produce a list of all nodes in a tree given a root, and also the paths (using names the parents give their children) taken to get there. The recursive CTE I have working is a textbook CTE straight from the docs here, however, it's proven difficult to get the paths working in this case.
Following the git model, names are given to children by their parents as a result of paths created by traversing the tree. This implies a map to children ids like git's tree structure.
I've been looking online for a solution for a recursive query but they all seem to contain solutions that use parent ids, or materialized paths, which all would break structural sharing concepts that Rich Hickey's database as value talk is all about.
Imagine the objects table is dead simple (for simplicity sake, let's assume integer ids):
drop table if exists objects;
create table objects (
id INT,
data jsonb
);
-- A
-- / \
-- B C
-- / \ \
-- D E F
INSERT INTO objects (id, data) VALUES
(1, '{"content": "data for f"}'), -- F
(2, '{"content": "data for e"}'), -- E
(3, '{"content": "data for d"}'), -- D
(4, '{"nodes":{"f":{"id":1}}}'), -- C
(5, '{"nodes":{"d":{"id":2}, "e":{"id":3}}}'), -- B
(6, '{"nodes":{"b":{"id":5}, "c":{"id":4}}}') -- A
;
drop table if exists work_tree;
create table work_tree (
id INT NOT NULL,
path text,
ref text,
data jsonb,
primary key (ref, id) -- TODO change to ref, path
);
create or replace function get_nested_ids_array(data jsonb) returns int[] as $$
select array_agg((value->>'id')::int) as nested_id
from jsonb_each(data->'nodes')
$$ LANGUAGE sql STABLE;
create or replace function checkout(root_id int, ref text) returns void as $$
with recursive nodes(id, nested_ids, data) AS (
select id, get_nested_ids_array(data), data
from objects
where id = root_id
union
select child.id, get_nested_ids_array(child.data), child.data
from objects child, nodes parent
where child.id = ANY(parent.nested_ids)
)
INSERT INTO work_tree (id, data, ref)
select id, data, ref from nodes
$$ language sql VOLATILE;
SELECT * FROM checkout(6, 'master');
SELECT * FROM work_tree;
If you are familiar, these objects' data
property look similar to git blobs/trees, mapping names to ids or storing content. So imagine you want to create an index, so, after a "checkout", you need to query for the list of nodes, and potentially the paths to produce a working tree or index:
id path ref data
6 NULL master {"nodes":{"b":{"id":5}, "c":{"id":4}}}
4 NULL master {"nodes":{"d":{"id":2}, "e":{"id":3}}}
5 NULL master {"nodes":{"f":{"id":1}}}
1 NULL master {"content": "data for d"}
2 NULL master {"content": "data for e"}
3 NULL master {"content": "data for f"}
id path ref data
6 / master {"nodes":{"b":{"id":5}, "c":{"id":4}}}
4 /b master {"nodes":{"d":{"id":2}, "e":{"id":3}}}
5 /c master {"nodes":{"f":{"id":1}}}
1 /b/d master {"content": "data for d"}
2 /b/e master {"content": "data for e"}
3 /c/f master {"content": "data for f"}
What's the best way to aggregate path
in this case? I'm aware that I'm compressing the information when I call get_nested_ids_array
when I do the recursive query, so not sure with this top-down approach how to properly aggregate with the CTE.
to explain more about why I need to use children ids instead of parent:
Imagine a data structure like so:
A
/ \
B C
/ \ \
D E F
If you make a modification to F
, you only add a new root A'
, and children nodes C'
, and F'
, leaving the old tree in tact:
A' A
/ \ / \
C' B C
/ / \ \
F' D E F
If you make a deletion, you only add a new root A"
that only points to B
and you still have A
if you ever need to time travel (and they share the same objects, just like git!):
A" A
\ / \
B C
/ \ \
D E F
So it seems that the best way to achieve this is with children ids so children can have multiple parents - across time and space! If you think there's another way to achieve this, by all means, let me know!
Using parent_ids has cascading effects that requires editing the entire tree. For example,
A
/ \
B C
/ \ \
D E F
If you make a modification to F
, you still need a new root A'
to maintain immutability. And if we use parent_ids, then that means both B
and C
now have a new parent. Hence, you can see how it ripples through the entire tree immediately requiring every node is touched:
A A'
/ \ / \
B C B' C'
/ \ \ / \ \
D E F D' E' F'
We can make a recursive query where objects store their own name, but the question I'm asking is specifically about constructing a path where the names are given to children from their parents. This is modeling a data structure similar to the git tree, for example if you see this git graph pictured below, in the 3rd commit there is a tree (a folder) bak
that points to the original root which represents a folder of all the files at the 1st commit. If that root object had it's own name, it wouldn't be possible to achieve this so simply as adding a ref. That's the beauty of git, it's as simple as making a reference to a hash and giving it a name.
This is the relationship I'm setting up which is why the jsonb
data structure exists, it's to provide a mapping from a name to an id (hash in git's case). I know it's not ideal, but it does the job of providing the hash map. If there's another way to create this mapping of names to ids, and thus, a way for parents to give names to children in a top-down tree, I'm all ears!
Any help is appreciated!
Store the parent of a node instead of its children. It is a simpler and cleaner solution, in which you do not need structured data types.
This is an exemplary model with the data equivalent to that in the question:
create table objects (
id int primary key,
parent_id int,
label text,
content text);
insert into objects values
(1, 4, 'f', 'data for f'),
(2, 5, 'e', 'data for e'),
(3, 5, 'd', 'data for d'),
(4, 6, 'c', ''),
(5, 6, 'b', ''),
(6, 0, 'a', '');
And a recursive query:
with recursive nodes(id, path, content) as (
select id, label, content
from objects
where parent_id = 0
union all
select o.id, concat(path, '->', label), o.content
from objects o
join nodes n on n.id = o.parent_id
)
select *
from nodes
order by id desc;
id | path | content
----+---------+------------
6 | a |
5 | a->b |
4 | a->c |
3 | a->b->d | data for d
2 | a->b->e | data for e
1 | a->c->f | data for f
(6 rows)
The variant with children_ids
.
drop table if exists objects;
create table objects (
id int primary key,
children_ids int[],
label text,
content text);
insert into objects values
(1, null, 'f', 'data for f'),
(2, null, 'e', 'data for e'),
(3, null, 'd', 'data for d'),
(4, array[1], 'c', ''),
(5, array[2,3], 'b', ''),
(6, array[4,5], 'a', '');
with recursive nodes(id, children, path, content) as (
select id, children_ids, label, content
from objects
where id = 6
union all
select o.id, o.children_ids, concat(path, '->', label), o.content
from objects o
join nodes n on o.id = any(n.children)
)
select *
from nodes
order by id desc;
id | children | path | content
----+----------+---------+------------
6 | {4,5} | a |
5 | {2,3} | a->b |
4 | {1} | a->c |
3 | | a->b->d | data for d
2 | | a->b->e | data for e
1 | | a->c->f | data for f
(6 rows)