I have never tried this kind of pyramidal display because I always thought that it was just a bad joke.
In particular, there is a single image for the two eyes (per side), so it is likely just like a head-up display (2D image in the air), repeated 4 times. The shadows and movement can probably trick some people, though.
By the way, from sampling the video you linked, the 4 images are identical in all but a few cases (one with 1/4 out of sync, one that has two opposite sides upside-down, and one with 4 real viewpoints), which makes sense since there is no real depth encoded. The one with 4 real viewpoints can add an interesting effect, but since the model turns, it may be less obvious.
However, I guess that if you want to do such a project, you must have found it nice. Actually, I think that the four-viewpoint idea is a good addition, even if, in my opinion, it will never have the same awesome feeling as stereoscopic images, where true depth can be perceived (assuming that you are not in the small part of the population that cannot use the strong binocular depth cues).
For your project, I would stream the videos from the StereoPi boards to a computer and do all the processing on that more powerful machine, including the rotations (or possibly rotate the cameras physically, depending on the expected aspect ratio). The videos will be in sync two by two. Synchronizing the two pairs with each other will be harder. However, sub-frame synchronization is probably not necessary in your case. You might try starting the videos in response to a change on a GPIO, which might give a better starting point (far less variable than network datagrams), but there are still several places where drift can be added. For the combination, you seem to have found one way; I personally would have started by exploring with
https://obsproject.com/ although I do not know if it can receive live feeds (probably).
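If you end up doing the combination yourself on the computer rather than in OBS, the rotation and layout step is small. Here is a rough sketch with NumPy, not a tested pipeline: the function name compose_pyramid, the view order (top, right, bottom, left) and the choice to turn the top of each image toward the center are all my own assumptions. Depending on whether your pyramid is upright or inverted, you may need the opposite orientation.

```python
import numpy as np

def compose_pyramid(views):
    """Place four square frames in a cross on a black canvas.

    views: four arrays of shape (s, s, 3), ordered top, right, bottom, left
    (hypothetical convention). Each frame is rotated so that its top edge
    faces the center of the canvas.
    """
    top, right, bottom, left = views
    s = top.shape[0]
    for v in views:
        if v.shape != (s, s, 3):
            raise ValueError("all views must be square and the same size")

    canvas = np.zeros((3 * s, 3 * s, 3), dtype=top.dtype)
    canvas[0:s, s:2 * s] = np.rot90(top, 2)          # 180 degrees
    canvas[s:2 * s, 2 * s:3 * s] = np.rot90(right, 1)  # 90 degrees counterclockwise
    canvas[2 * s:3 * s, s:2 * s] = bottom            # unchanged
    canvas[s:2 * s, 0:s] = np.rot90(left, 3)         # 90 degrees clockwise
    return canvas

if __name__ == "__main__":
    # Dummy frames standing in for the four decoded camera images.
    frame = np.zeros((240, 240, 3), dtype=np.uint8)
    print(compose_pyramid([frame] * 4).shape)
```

The frames would have to be cropped or resized to squares first, and the center square stays black, which is where the tip of the pyramid sits.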
Let us know if you manage to get something interesting.