Using Generated Data to Train Computer Vision Models
In recent years, increasingly complex deep learning models have been developed to tackle problems of ever-increasing difficulty. This holds especially true in computer vision, where these models (mostly convolutional neural networks) process images to extract information about the scenes they depict. Problems addressed with deep learning include 6-DoF pose estimation, depth estimation, image segmentation, object recognition and detection, SVBRDF capture, and many more. Training these models requires a large number of images from the intended domain for which the desired annotations are known. Such a dataset is obtained either by manually annotating a set of existing images or by programmatically generating images in such a way that the desired annotations are known. Manual annotation can quickly become laborious, can result in lower-quality annotations, and may not be humanly possible at all when the required labels are complex. Furthermore, in some domains the availability and acquisition of images is itself problematic. For example, much of industry is moving towards high-variety, low-volume production, which makes acquiring labeled images even more difficult because the objects being produced change constantly. Generated training data can solve these problems, because the necessary labels can easily be extracted from the information used to construct the image. Such images can be generated either by constructing scenes from 3D models of the objects of interest and rendering them, or by taking existing images of the objects and composing them into new images in such a way that the annotations are known. In this way, a large amount of training data can be obtained at lower cost than by producing the data manually.
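The second generation strategy described above (composing new images from existing object images) can be illustrated with a minimal cut-and-paste sketch. It is only an illustration under stated assumptions, not the method used in this project: the function name `compose_image`, the array shapes, and the toy data are all hypothetical, and NumPy is assumed to be available. An object cut-out with a binary mask is pasted onto a background at a random position; because the program chooses that position itself, the segmentation mask and bounding box follow directly from it without any manual labeling.

```python
import numpy as np


def compose_image(background, obj_rgb, obj_mask, rng):
    """Paste an object cut-out onto a background at a random position.

    Hypothetical helper for illustration only.
    background: (H, W, 3) uint8 array
    obj_rgb:    (h, w, 3) uint8 array holding the object pixels
    obj_mask:   (h, w) bool array, True where the cut-out shows the object
    Returns the composed image, a full-size boolean segmentation mask, and
    the bounding box (x_min, y_min, x_max, y_max) with inclusive coordinates.
    """
    bg_h, bg_w = background.shape[:2]
    obj_h, obj_w = obj_mask.shape
    if obj_h > bg_h or obj_w > bg_w:
        raise ValueError("object cut-out must fit inside the background")
    if not obj_mask.any():
        raise ValueError("object mask is empty")

    # Random top-left corner chosen so the object stays fully inside the frame.
    top = int(rng.integers(0, bg_h - obj_h + 1))
    left = int(rng.integers(0, bg_w - obj_w + 1))

    # Copy the background, then overwrite only the masked object pixels.
    image = background.copy()
    region = image[top:top + obj_h, left:left + obj_w]  # view into `image`
    region[obj_mask] = obj_rgb[obj_mask]

    # The annotations are known by construction: no manual labeling needed.
    seg = np.zeros((bg_h, bg_w), dtype=bool)
    seg[top:top + obj_h, left:left + obj_w] = obj_mask
    ys, xs = np.nonzero(seg)
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return image, seg, bbox


# Toy data (assumed): a black background and a white 10x12 rectangular object.
rng = np.random.default_rng(0)
background = np.zeros((64, 64, 3), dtype=np.uint8)
obj_rgb = np.full((10, 12, 3), 255, dtype=np.uint8)
obj_mask = np.ones((10, 12), dtype=bool)

image, seg, bbox = compose_image(background, obj_rgb, obj_mask, rng)
```

A realistic pipeline would typically also vary scale, rotation, lighting, and occlusion, and blend the object edges into the background; those steps are omitted here to keep the annotation-by-construction idea visible.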
However, data generation is no magic fix for data scarcity: using generated data introduces new problems that prevent it from being easily applied in practice. First, the domain of images the model is trained on (the source domain) differs from the domain of images the model is intended to be used on (the target domain). This difference, known as the domain gap, causes the model to perform poorly on the real-world task it was trained for. An obvious first solution is to generate more realistic-looking images, so that the two domains match more closely. This brings us to the second problem: the high cost of generating training data. Achieving realism that matches the real world requires computationally expensive techniques such as Monte Carlo ray tracing. Although recent technologies such as Nvidia RTX speed up rendering, the cumulative cost of generating an entire dataset remains significant, as the datasets used for these complex problems are often very large (more than 100k images). Furthermore, in industrial applications such as pose estimation of produced items on a conveyor belt, the exact material properties of the items might not be known beforehand, making it impossible to create realistic renders. An important question is therefore which features the machine learning model relies on for which problems. Does an object detection model make use of low-level features such as texture, or are higher-level features such as shape the dominant cue in its decision making? Knowing the answer would make it possible to generate data in a more informed way, spending rendering effort only on the features the model actually uses. This still leaves the domain gap, which could also be addressed from the other side, by transferring images from the target domain to the training domain.
There is a clear need for advanced methods that efficiently bridge the domain gap in modern machine-learning-driven computer vision applications. In this research we will investigate and compare existing solutions for bridging the domain gap and determine which are applicable in an industrial setting. For a model to be usable in industrial applications, both the data generation cost and the training cost should be as low as possible, and the preliminary data requirements should be minimal, while the required accuracy is still achieved. Where necessary to achieve the desired results, we will develop our own methods or extend existing work.
Period of project
01 September 2020 - 31 August 2024