## Completing the picture in 3D

## Azmi Haider and Dan Rosenbaum - Department of Computer Science

**Machine Learning**

**Processing**

**PhD Grant 2023**

Humans can imagine a 3D scene from a single image. Given one image of a car, a person can guess how it looks from every angle; computers cannot. This ability comes from what we call “prior knowledge”: having seen cars before helps us imagine what a new car will look like from a new viewpoint. Our aim is to teach computers to develop prior knowledge and use it to generate images from unseen viewpoints, given one or more images.

Our research proposes representing 3D scenes in a more compact way. The details of a 3D scene are captured in what we term a latent variable, which can be transformed to and from a complete 3D scene.

A large neural network called a generative model is then trained on the latent variables and taught to generate new ones. In other words, this network learns what makes up a 3D scene through the latent representations. It learns what these latent representations have in common and which of their details are transformed into 3D scenes. This is the “prior knowledge” described above.

Given one image of a scene, we can use it to add information to the latent representation that the generative model produces. This forces the generated latent variable to carry the information in the image in addition to information about 3D structure.

After completing this process, we can transform the latent variable back to a 3D scene.
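As a rough illustration of the "learn a prior, then condition on an image" idea, the toy sketch below replaces the neural generative model with a plain Gaussian fitted to example latent vectors, and replaces the image with a noisy view of two of the four latent coordinates. This is not the model used in this research; the names `fit_gaussian_prior` and `condition_on_observation`, the dimensions, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian_prior(latents):
    """Fit a Gaussian (mean, covariance) to latent vectors stored as rows."""
    mean = latents.mean(axis=0)
    cov = np.cov(latents, rowvar=False) + 1e-6 * np.eye(latents.shape[1])
    return mean, cov

def condition_on_observation(mean, cov, obs_matrix, observation, noise_var=1e-2):
    """Gaussian posterior over z given observation = obs_matrix @ z + noise."""
    innovation_cov = obs_matrix @ cov @ obs_matrix.T
    innovation_cov += noise_var * np.eye(obs_matrix.shape[0])
    gain = cov @ obs_matrix.T @ np.linalg.inv(innovation_cov)
    post_mean = mean + gain @ (observation - obs_matrix @ mean)
    post_cov = cov - gain @ obs_matrix @ cov
    return post_mean, post_cov

rng = np.random.default_rng(0)

# Hypothetical training latents: 4 coordinates driven by 2 hidden factors,
# so the coordinates are strongly correlated (the "prior knowledge").
mixing = np.array([[1.0, 0.0, 1.0, -1.0],
                   [0.0, 1.0, 1.0, 1.0]])
factors = rng.normal(size=(500, 2))
latents = factors @ mixing + 0.05 * rng.normal(size=(500, 4))
prior_mean, prior_cov = fit_gaussian_prior(latents)

# A new scene; the stand-in "image" reveals only the first two coordinates.
true_z = np.array([1.0, -0.5]) @ mixing
obs_matrix = np.eye(4)[:2]
observation = obs_matrix @ true_z

post_mean, post_cov = condition_on_observation(
    prior_mean, prior_cov, obs_matrix, observation
)

# Compare the guess for the unseen coordinates before and after conditioning.
prior_error = np.linalg.norm(prior_mean[2:] - true_z[2:])
post_error = np.linalg.norm(post_mean[2:] - true_z[2:])
print(f"error on unseen coordinates: prior {prior_error:.3f}, "
      f"conditioned {post_error:.3f}")
```

In this toy setting, the correlations learned from earlier examples let the conditioned estimate fill in coordinates the observation never showed, which mirrors on a small scale how a prior lets a model guess unseen viewpoints.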

In the photo below, you can see a synthetic scene. The image on the left is a ground truth image from a certain angle. The image on the right is the “imagined” image from the model (it is slightly blurrier). You can see clearly that the model was able to capture the 3D structure of the scene and “guess” what the car will look like from a given angle.
