I have been having fun playing with all those image-generation models, and I think it would be more interesting if we could turn the results into 3D photos.
There has been quite a bit of research on this topic, and the results look decent, like this one from CVPR 2020 (3d-photo-inpainting).
In the context of doing this on images generated by Stable Diffusion, there has been a lot of effort from the community as well:
They all look great! But they are all too complicated to set up and resource-demanding. I would either need beefy GPUs and the time to go through all the scripts to set something up locally, or I would need to wait in a queue to access some online service.
All I need is something easy enough to play with that adds just a little lively vibe to a photo.
I ended up finding this little JS library that renders a 3D photo from an RGB image and a depth image.
The artifacts can be obvious when the motion is large, but with limited movement around the center it is good enough.
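The core idea behind this kind of RGB + depth rendering is simple parallax: each pixel is shifted in proportion to its depth, so near pixels move more than far ones as the virtual camera pans. A minimal sketch of that idea (this is my own illustration, not the library's actual API):

```javascript
// Sketch of depth-based parallax. `depth` is assumed normalized to
// [0, 1], with 1 = nearest to the camera. `offsetX`/`offsetY` is the
// virtual camera motion (e.g. derived from the mouse position), and
// `strength` scales how pronounced the parallax is, in pixels.
function parallaxShift(x, y, depth, offsetX, offsetY, strength = 20) {
  return {
    x: x + offsetX * depth * strength,
    y: y + offsetY * depth * strength,
  };
}

// A far pixel (depth 0) stays put; a near pixel (depth 1) moves fully.
parallaxShift(100, 100, 0.0, 1, 0); // → { x: 100, y: 100 }
parallaxShift(100, 100, 1.0, 1, 0); // → { x: 120, y: 100 }
```

This is also why the artifacts grow with motion: large shifts expose regions behind foreground objects that the single RGB image simply has no pixels for.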
The missing piece now is obtaining the depth image. Some tutorials online show how to manually “draw” a depth image (Link). In the vision community, monocular depth estimation has been studied for quite a while.
It is always a trade-off between accuracy and efficiency, and I found that this one from Filippo Aleotti et al. is a well-balanced choice. It is lightweight enough to run locally on the client side.
It is now straightforward to put the two together and get a 3D photo from any photo.
P.S. I’ve put them together and made a little web app for this: