### Abstract

Co-segmentation aims at segmenting common objects from a group of images. Markov random field (MRF) has been widely used to solve co-segmentation, which introduces a global constraint to make the foreground similar to each other. However, it is difficult to minimize the new model. In this paper, we propose a new Markov random field-based co-segmentation model to solve co-segmentation problem without minimization problem. In our model, foreground similarity constraint is added into the unary term of MRF model rather than the global term, which can be minimized by graph cut method. In the model, a new energy function is designed by considering both the foreground similarity and the background consistency. Then, a mutual optimization approach is used to minimize the energy function. We test the proposed method on many pairs of images. The experimental results demonstrate the effectiveness of the proposed method.

##### Keywords:

Co-segmentation; GrabCut; Graph cut algorithm; Markov fandom field### Introduction

Image segmentation is a fundamental problem for many computer vision tasks, such as object recognition [1,2], image understanding [3], and retrieval [4]. Due to variations of the objects, image segmentation remains a challenging problem. Recently, co-segmentation [5-15] has attracted much attention from the community. The goal of co-segmentation is to segment common objects from a group of images. Unlike traditional single-image segmentation, the co-segmentation method can segment multiple images jointly rather than independently segmenting each image based on the co-occurrence of objects in the images [16]. Several examples can be found in Figure 1, where six image pairs are shown. In each image pair, the co-segmentation aims to extract the common objects from the image pair, such as the ‘plane’ and ‘banana’ in the first two image pairs. Compared with traditional segmentation methods, co-segmentation can accurately segment objects from images by several related images and requires less user workload [17]. It has many potential applications in computer vision, such as image classification, object recognition, and image retrieval. This paper focuses on the co-segmentation problem.

**Figure 1.** **Some examples of co-segmentation showing six image pairs.**

The existing co-segmentation models address co-segmentation as an optimization problem, which achieves common objects by adding foreground similarity into segmentation models. Both the local smoothness in each image and the foreground similarity among the images are considered. Many traditional segmentation methods have been improved to solve co-segmentation method, such as the Markov random field (MRF)-based segmentation methods [5-8], random walker-based segmentation method [18], and discriminative clustering-based segmentation method [10,19]. Analyzing these methods, these co-segmentation methods can be concluded as the extensions of the interactive-based segmentation methods since it is natural to replace the initial seeds manually given in the traditional method by searching the local similar regions shared by images.

Several well-known interactive-based segmentation methods have been extended to solve co-segmentation problem. MRF-based segmentation method was first extended for co-segmentation task by Rother [5], which introduced a global term representing foreground similarity into the MRF-based image segmentation model. Kim et al. [15] extended heat diffusion-based interactive segmentation method to solve multi-class co-segmentation problem. The heat diffusion-based method spreads the heat from the source seeds to the other pixels by pixel similarity. To solve multiple images in co-segmentation, the heat was diffused among the common objects by foreground similarity. The random walker-based interactive segmentation method was extended to solve co-segmentation problem in the work of Collins et al. [18], which introduces foreground similarity constraint into the random walker-based method. In the work of Meng et al. [16,20], the active contour-based model (Chan-Vese model) was extended to fit co-segmentation task by considering both foreground similarity constraint and background consistency.

Among these methods, MRF-based co-segmentation methods attract much attentions since the success of the MRF-based segmentation method on single-image segmentation. Several MRF-based co-segmentation methods have been proposed [5-8]. Their differences focus on the formulation of foreground similarity constraints. Several foreground similarity constraints have been added, such as L1-norm [5], L2-norm [6], and reward strategy [7]. However, it remains challenging to minimize the MRF-based co-segmentation energy function although many global terms have been introduced. To cope with the minimization problem, the existing methods search approximate solutions [5,6,8] or require user to provide foreground appearance and locations [7]. Other methods use saliency map [13] to obtain initial object appearance model. For these methods, the results depend on the accuracies of the initial appearance models.

GrabCut [21] is an important MRF-based co-segmentation method, which segments the objects from a manual rectangle setting by graph cut algorithm. The main advantage of GrabCut is that the energy function can be efficiently minimized by mutually applying graph cut algorithm in polynominal time. Hence, it can be used in many real-time applications. Furthermore, it models the foreground and background appearance priors by a simple rectangle setting, which is convenient compared with the other interactive-based segmentation methods. It is seen that performing co-segmentation based on the GrabCut model can result in efficient optimization and prior model generation. Meanwhile, the GrabCut model can also benefit from co-segmentation task. The GrabCut model will be more robust to initial curve setting. The reason is that the prior provided by a pair of images in co-segmentation is more sufficient compared with a single image. Hence, automatically segmenting objects by GrabCut (without manual curve setting) can be achieved in co-segmentation task.

In this paper, we propose a new MRF-based co-segmentation method namely mutual GrabCut (MGrabCut) for common object segmentation, which extends GrabCut [21] to solve co-segmentation. In the method, the region outside each initial rectangle is treated as background region. Meanwhile, the regions inside initial rectangles are used to model unary potential of the foreground. To segment similar foregrounds, we introduce the foreground model of the other image in the unary term of the current image. The final co-segmentation results are achieved by graph cuts with iteratively updating unary term of the foreground appearance model and background appearance model. The main advantage of the proposed method is that compared with existing MRF-based co-segmentation methods, we consider foreground similarity into unary term rather than global term, which results in easier minimization. Hence, the proposed model is efficient and real time. Secondly, the proposed method is robust to initial curve setting because the common objects can be more accurately located by the constraint of foreground similarity. A fixed initial curve can be used for all pairs of images. Thirdly, since the foreground model is dynamically updated along the iteration, a more accurate appearance model is obtained by the proposed method. We test the proposed method on many pairs of images. The experimental results demonstrate the effectiveness of the proposed method.

The contributions of the proposed method are listed as follows:

1. A novel MRF-based co-segmentation model is designed. In the model, the foreground similarity constraint is added into unary term rather than global term, which results in the efficient minimization by graph cut algorithm.

2. Compared with traditional GrabCut model, the proposed model is more robust to initial curve setting and can segment objects with fixed initial curves. The benefit is caused by considering a pair of images instead of a single image.

3. A mutual graph cut-based minimization method is developed to minimize the energy pairs.

### Related work

In image segmentation, many minimization techniques have been used to achieve accurate object segmentation. Boykov et al. in [22] used graph cut algorithm to minimize the energy in MRF-based segmentation model. In the work of Meng et al. [16], the active contour-based energy function was minimized by level set techniques and the method of calculus of variations. In [17], the shortest path algorithm achieved by dynamic programming method was used for object segmentation. In the work of Zeng et al. [23], a hybrid extended Kalman filter and switching particle swarm optimization algorithm were proposed for model parameter estimation. In [24], a new particle filter was developed to simultaneously identify both the states and parameters of the model. In [25], Zineddin et al. presented a new image reconstruction algorithm using the cellular neural network that solves the Navier-Stokes equation, which offered a robust method for estimating the background signal within the gene spot region.

In the existing co-segmentation methods, co-segmentation is commonly modeled as an optimization problem, which introduces foreground similarity to fit common object segmentation. For MRF-based co-segmentation model, the energy function is usually defined as

where *U*_{pixel} is the data term which evaluates the potential of the pixel to the foreground or
background. *V*_{pair} is the smoothness term to measure the smoothness of local pixels. These two terms
are single-image segmentation-based term. The term *G*_{global} is the global term evaluates the similarity between the foregrounds. By minimizing
the energy function, only common objects are extracted.

Although the global term makes the foreground similar, it also results in difficult minimization since searching the regions with similar appearance is challenging. The existing methods employ various global terms to cope with the minimizations. Rother et al. [5] used L1-norm to measure foreground similarity. The trust region graph cut method was proposed for energy optimization. Mukherjee et al. [6] replaced L1-norm with L2-norm. Pseudo-Boolean optimization was used for optimization. Instead of penalizing foreground difference, Hochbaum and Singh [7] rewarded foreground similarity. Vicente et al. in [8] modified the Boykov-Jolly model for foreground similarity measurement. Dual decomposition was employed for minimization.

Other methods have also been used for co-segmentation task. Joulin et al. [10] segmented common objects by clustering strategy. The main idea was that the common objects can be classified into the same class since they have similar features. Hence, by searching a classifier based on spectral clustering technique and positive definite kernels that best classified the common objects, co-segmentation was achieved. In the work of Batra et al. [11], an interactive co-segmentation method which segmented common objects through human interaction guided by an automatic recommendation system was proposed. Mukherjee et al. [12] proposed a scale-invariant co-segmentation method to segment common objects through the fact that the rank of the matrix corresponding to foreground regions should be equal to 1. The algorithm of Chang et al. [13] solved co-segmentation by a novel global energy term which used the co-saliency model to measure foreground potentials. The energy function considering both foreground similarity and background consistency was submodular and can be efficiently minimized by graph cut algorithm. Vicente et al. [14] focused on interesting object co-segmentation. A useful feature to distinguish the common objects was trained from a total of 33 features through random forest regression. The common objects were segmented by loop belief propagation on a full connected graph. Kim et al. in [15] solved multiple-class-based co-segmentation problem by anisotropic heat diffusion. By combining clustering method and random walk segmentation method, multiple classes can be successfully labeled from a large number of images. Recently, Joulin et al. in [19] focused on multi-class co-segmentation, which considers discriminative clustering and multi-class co-segmentations into account. More accurate segmentation results were obtained. Collins et al. in [18] solved co-segmentation by random walker-based segmentation method which added foreground consistency into traditional random walker-based method. Compared with MRF-based co-segmentation, the random walker-based co-segmentation method was efficient. Rubio et al. in [26] segmented common objects by modifying the wrongly segmented from the other successful segmentations. A co-segmentation framework was formulated by MRF, and a new global term based on graph matching was proposed. In the work of Meng [17], co-segmentation from a large number of original images with similar backgrounds was considered. A digraph was constructed by foreground similarity and saliency values. The co-segmentation problem was formulated as the shortest path problem and was solved by dynamic programming method.

### The proposed model

In this section, we first introduce the GrabCut method. Then, the proposed method is illustrated.

#### GrabCut segmentation

GrabCut is an interactive image segmentation method. It has been widely used in many
computer vision tasks. In GrabCut, the image segmentation is a label problem which
assigns a label *α*_{i}∈{0,1},*i*=1,⋯,*N* to each image pixels *z*_{i},*i*=1,⋯,*N* with *α*_{i}=1 for foreground and 0 for background. *N* is the number of pixels. The label problem is then set as an optimization problem
by minimizing the energy function

where *α*=(*α*_{1},…,*α*_{N}), **z**=(*z*_{1},…,*z*_{N}), and *θ* describes image foreground and background appearance model which is represented as

where *α*=0 for the background model and *α*=1 for the foreground model. *h* is the appearance model, which is represented as a Gaussian mixture model. In the
model, a full-covariance Gaussian mixture with *K* components is considered for the construction. With a Gaussian mixture model (GMM)
for the foreground or the background, each pixel *z*_{i} is assigned a unique GMM component *k*_{i} either from the background or the foreground model according to *α*=0 or 1, where **k**=(*k*_{1},…,*k*_{N}), *θ*={*Π*(*α*,*k*_{i}),*μ*(*α*,*k*_{i}),*Σ*(*α*,*k*_{i}),*α*=0,1,*i*=1,⋯*N*}. Here, *Π*(·) are mixture weighting coefficients, and *μ*(·) and *Σ*(·) are means and covariances of the distribution *p*(·).

The data term *U*(*α*,**k**,*θ*,**z**) in Equation 2 evaluates the fit of the label *α* to the date **z** with *θ* and **k** and is represented as

where *n* is the number of pixels and

The smoothness term *V*(*α*,**z**) in Equation 2 encourages coherence in local regions and is defined as

where [·] denotes the indicator function taking values 0,1 for a predicate ·. *β* is constant. **C** is the set of pairs of neighboring pixels. The pixels are neighbors if they are adjacent
either horizontally/vertically or diagonally.

Based on Equation 2, the segmentation is obtained by minimizing Equation 2 represented as

By fixing **k** and *θ*, the problem in Equation 2 is solved by minimum cut algorithm (graph cut algorithm).
In GrabCut, the energy minimization scheme works iteratively, which updates **k** and *θ* by current segmentation and uses new **k** and *θ* to obtain new segmentation by solving the problem in Equation 2. The algorithm starts
from an initial curve setting manually. The iteration stops when convergence criterion
is satisfied.

#### The proposed method

Unlike single-image-based GrabCut method, a pair of images **z**^{l},*l*=0,1 is considered in the proposed model. Set is the *i*th pixel in the *l*th image and . The label for image **z**^{l},*l*=0,1 is *α*^{l},*l*=0,1. The proposed method sets co-segmentation as a label problem that assigns 1 for
pixels on the common objects and 0 otherwise. To segment common objects, we design
a new unary term by considering foreground similarity, which guarantees that only
common objects are considered. In the method, the unary term is defined as

where *θ*^{l} and **k**^{l} are the parameter sets of GMM representation of **z**^{l}, which is similar to the definition in GrabCut. *λ* is the scale factor to balance the impacts of the foregrounds in the current image
and the other image. *D*^{1} evaluates the fit of the label *α*^{l} to the date **z**^{l} with *θ*^{l} and **k**^{l} in the current image and is represented as

The foreground similarity term *D*^{2} evaluates the similarity between the foregrounds and is defined as

We use the smoothness term in GrabCut shown in Equation 6 to form the smoothness term of the proposed method. Then, the co-segmentation is set to minimize the energy function represented as

We can see from Equation 10 that *D*^{2} evaluates the fit of the pixels with in the current image to the foreground model *θ*^{1−l} in the next image. The pixels on common objects have small *D*^{2} since they are similar to the common objects in the next image. Hence, it intends
to be assigned 1. For other pixels, a larger *D*^{2} will be obtained. Hence, it intends to be a background pixel.

By keeping **k**^{l}, *θ*^{l}, **k**^{1−l}, and *θ*^{1−l} fixed, the energy function is minimized by minimum cut method. Similar to GrabCut,
we iteratively update the foreground model and the background model to accurately
segment the common objects. The main difference is that there are two images in our
model. Hence, we improve the iteration method by simultaneously updating the foreground
model and the background model of two images. In the optimization method, the initial
curve is first set to each image. The initial segmentations are obtained by treating
the pixels inside the curve as the foreground and the pixels outside the curve as
the background. Then, based on the initial segmentation, we model the foreground model
and background model and **k**^{l} for each image which are then used to obtain the foreground potential and background
potential for each image according to Equations 9 and 10. Finally, we optimize the
two energies by Equation 11 to obtain segmentation results. The segmentation results
are used as the new initial segmentations for the next iteration. The algorithm stops
when stop condition is satisfied.

We analyze next the proposed model compared with the GrabCut. Their difference can
be found in Figure 2, where Figure 2a shows the model of the GrabCut, which is related to a single image. There is an
initial curve *C*^{0} which separates the image *Z*^{0} into two regions, i.e., the region inside the curve and the region outside the curve.
The GrabCut considers the region inside the curve as the foreground and the region
outside the curve as the background. Then, the GMM of the foreground and the background
are determined based on the two regions. The GMM is represented as *k*^{0} and *θ*^{0}. For a pixel (the blue points), there are two influences in the GrabCut model. One
is the foreground model represented by the green lines. The other is the background
model represented by the yellow lines. Based on the two aspects, the point will be
given a label. We can see that GrabCut is sensitive to the initial curve setting because
the change of initial curve will also change the parameters of the foreground model
and background model, which results in different segmentations. Hence, for GrabCut,
manually selecting the initial curve is used for the segmentation.

**Figure 2.** **Difference between GrabCut and the proposed model.** (**a**) The segmentation model of GrabCut. (**b**) The segmentation model of the proposed method.

The proposed model is represented in Figure 2b, where there are two images, *Z*^{0} and *Z*^{1}, rather than a single image. For each image, there is a curve. The curve also segments
the image into two regions: the region inside the curve and the region outside the
curve. Like the analysis of the GrabCut in Figure 2a, we consider the blue points in *Z*^{0}. We can see that there are three terms in our model. The first two are the foreground
model (the green line in *Z*^{0}) and the background model (the yellow line) in the current image *Z*^{0}. These two terms are similar to the two in GrabCut. The third is the foreground model
in *Z*^{1}. For the third influence, since only the common objects share similar colors, the
pixels on the objects will have large response of the third term. While for a background
pixel, it has a small response, which results in the label of background. Hence, the
pixels on the common objects will be considered as foreground.

Comparing our model with GrabCut, the difference is that we introduce the third term
in our model, which results in the segmentation of the common objects. We can see
that the third term also results in the robustness to initial curve setting. The reason
is that the initial curve setting of the current image may change the foreground model.
However, the next image can provide the accurate foreground model when the curve *C*^{1} covers most of the area of the image pairs. The appearance model of the third model
can improve the label of the pixels and result in successful segmentation. Here, we
have to guarantee that the curve in the next image covers the most area on the common
objects. This can be simply satisfied by setting the initial curve as the rectangle
with small distance to the image edge. We can see that this initial curve setting
can be used for all image pairs, which means that the proposed method does not need
to manually set the initial curve. Note that other initial curve settings, such as
the saliency map-based initial curve setting or manual setting, can also be used as
the initial curve setting.

In this paper, we set the initial curve as a rectangle with small distance (*ν*=5) to the image edge; some examples are shown in Figure 3. The iteration stops when the difference between the old segmentation and new segmentation
is less than a threshold *T*_{s}. The algorithm of the proposed method is shown in Algorithm 1.

#### Algorithm 1 The algorithm for MGrabCut

**Figure 3.** **The initial curve setting used in this paper.** Two image pairs are shown. *ν* is the distance between the curve and the image edge.