Understanding the role of each pixel in an image, the so-called semantic image segmentation, is one of the central problems in computer vision and pattern recognition. Because they allow a mathematically sound integration of different image labeling concepts into a single framework, conditional random fields are among the best-performing and best-understood techniques for solving this task. They belong to the class of undirected graphical models, in which the scene is represented by a graph whose nodes are the random variables involved in the classification process and whose edges model the dependencies between those variables. However, they are often regarded merely as a statistical model of context that has a smoothing effect on the classification results. In this thesis I show that conditional random fields are a much more powerful tool for semantic image segmentation by making two scientific contributions, described in Chapters 2 and 3.
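The graph formulation described above, and the smoothing effect commonly attributed to it, can be illustrated with a minimal sketch. It is not one of the models developed in this thesis: it assumes a 4-connected pixel grid, node costs given as negative log-probabilities, and a Potts edge model that charges a fixed penalty whenever two neighbouring pixels take different labels. The names `energy`, `map_labeling`, `p_fg` and `edge_weight` are hypothetical, and the exhaustive search is only feasible for toy-sized images.

```python
import itertools
import math


def energy(labels, unary, edge_weight):
    """Energy of a labeling on a 4-connected grid (lower is better)."""
    h, w = len(labels), len(labels[0])
    e = 0.0
    for r in range(h):
        for c in range(w):
            # Node term: cost of giving pixel (r, c) its current label.
            e += unary[r][c][labels[r][c]]
            # Edge terms (Potts): penalise disagreeing right/bottom neighbours.
            if c + 1 < w and labels[r][c] != labels[r][c + 1]:
                e += edge_weight
            if r + 1 < h and labels[r][c] != labels[r + 1][c]:
                e += edge_weight
    return e


def map_labeling(unary, edge_weight):
    """Exhaustive search for the minimum-energy labeling (toy sizes only)."""
    h, w, k = len(unary), len(unary[0]), len(unary[0][0])
    best, best_e = None, math.inf
    for flat in itertools.product(range(k), repeat=h * w):
        labels = [list(flat[r * w:(r + 1) * w]) for r in range(h)]
        e = energy(labels, unary, edge_weight)
        if e < best_e:
            best, best_e = labels, e
    return best


# Assumed toy data: per-pixel probability of class 1 on a 2x3 image.
p_fg = [[0.1, 0.6, 0.1],
        [0.1, 0.1, 0.1]]
unary = [[[-math.log(1.0 - p), -math.log(p)] for p in row] for row in p_fg]

independent = map_labeling(unary, 0.0)  # [[0, 1, 0], [0, 0, 0]]
smoothed = map_labeling(unary, 1.0)     # [[0, 0, 0], [0, 0, 0]]
```

With the edge weight set to zero every pixel is labeled independently, so the single weakly supported pixel keeps class 1. With a positive edge weight the three disagreeing edges outweigh that pixel's small node-cost advantage and the isolated label is smoothed away.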
The first part of this thesis is dedicated to the construction of conditional random field methods (Chapter 2). I first discuss some classical probabilistic models used for initializing the graph nodes and edges, and then propose new, more accurate and efficient models that build on the classical ones. I thereby demonstrate that this toolkit allows for great flexibility in modeling the graph structure and thus in binding various kinds of observations together. I also investigate the influence that different features extracted from the observations have on the entire labeling process. Finally, I construct a local-global classification engine: a conditional random field that incorporates not only classical local nodes but also additional global nodes, which correspond to global features describing the image as a whole. Extensive qualitative and quantitative benchmarks for eight different node models and five edge models show the accuracy and efficiency of the proposed implementations. At the time of writing, this provides the most precise random field approaches in the literature and makes the second scientific contribution possible.
The second part of this thesis extends the first contribution to a novel Multi-Layer-CRF framework (Chapter 3) that allows sophisticated occlusion potentials to be integrated into the model and enables automatic inference of the layer decomposition. I use a special message-passing algorithm to perform maximum a posteriori inference on mixed graphs and demonstrate the ability to infer the correct labels of occluded regions in both an aerial near-vertical dataset and an urban street-view dataset. A major innovation of the proposed framework is that the 3D structure of the scene is taken into account in the classification process, which is necessary for dealing with occlusions in a systematic way. To this end, multi-layer conditional random fields are built that use multiple nodes for the class labels at a given position in object space: one node corresponds to the base layer of the scene (containing background objects that do not occlude other objects but may themselves be occluded), and the others correspond to the occlusion layers (containing objects that may occlude other objects). Quality and efficiency benchmarks show the success of this layered framework: the classification accuracy in occluded areas is considerably higher than that of classical random field techniques.