Rank-Order Clustering

R
algorithms
Author

Paolo Bosetti

Published

2024-Dec-18

Rationale

Oftentimes, a process can be described by an incidence matrix: a logical matrix that shows the relationship between two classes of objects.

For example:

  • suppose that you have a manufacturing plant that produces 10 different components using 6 different machines: the incidence matrix states which machines are used to make which components.
  • suppose you have 20 different employees that perform 15 different tasks: the incidence matrix identifies which employee performs which task.

Of course the order of rows and columns in the incidence matrix is arbitrary. But suppose that you can rearrange rows and columns so that the matrix becomes block diagonal: then each block is self-sufficient: there’s no process flow between different blocks. In our examples, machines in the same block can be located in different buildings, fore there is non need to move semi-finished products from one block to another. Likewise, employees that are part of different blocks should share the same room (for they probably have to exchange information).

So the question is: how to rearrange the incidence matrix so that it becomes block diagonal?

An example matrix

Let us start with an example matrix, which we know is block diagonal (’cause we make it!), but with randomly resorted rows and columns:

set.seed(0)
m <- matrix(
  c(
    1, 1, 1, 1, 0, 0, 0, 0,
    1, 0, 1, 1, 0, 0, 0, 0,
    1, 0, 1, 1, 0, 0, 0, 0,
    1, 1, 1, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1, 0, 0,
    0, 0, 0, 0, 1, 0, 0, 0,
    0, 0, 0, 0, 1, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 1, 1,
    0, 0, 0, 0, 0, 0, 1, 1
  ) %>% as.logical(),
  ncol=8,
  byrow=TRUE
)
m <- m[sample(nrow(m)),sample(ncol(m))]
dimnames(m) <- list(LETTERS[1:nrow(m)], letters[1:ncol(m)])
m
      a     b     c     d     e     f     g     h
A FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
B  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
C FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
D  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
E FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
F FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
G FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
H FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
I FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

Column names can refer to parts, row names to machine tools, for example.

Graphically, we can make a tile plot of that. We need to convert the matrix to a tibble, make the tibble tidy, be sure to convert the part and machine names to ordered factors, and finally use geom_tile:

m %>% 
  as_tibble(rownames="machine") %>% 
  pivot_longer(-machine, names_to = "part", values_to = "uses") %>% 
  mutate(
    machine=factor(machine, levels=dimnames(m)[[1]], ordered=T),
    part=factor(part, levels=dimnames(m)[[2]], ordered=T),
    uses=ifelse(uses, "YES", "NO")
  ) %>% 
  ggplot(aes(x=part, y=machine, fill=uses)) +
  geom_tile()

Rank-Order Clustering

A Roc bird A Roc is also a huge mythological bird…

How can we reorder the matrix to make it block diagonal? we need to provide a proper weight to each cell, so that they can “float” in their right position. The ROC algorithm does exactly that:

  1. assign each rows a weight \(w_i\), as power of two (so there are no duplicate values);
  2. for each column, calculate its total weight \(t_j=\sum_i e_{ij} w_i\);
  3. sort the columns according to their total weight \(t_j\);
  4. repeat steps 1–3 by swapping “row” with “column” and \(i\) with \(j\);
  5. stop if there are no changes.

in R, this is really straightforward. The weights vector for rows is:

(w <- 2^(0:(nrow(m)-1)))
[1]   1   2   4   8  16  32  64 128 256

The total weight for each column \(t_j\) is then:

w %*% m
      a  b   c  d   e  f   g   h
[1,] 10 90 129 90 292 90 129 260

so the first column-reordering of m gives:

m[,order(w %*% m)]
      a     b     d     f     c     g     h     e
A FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
B  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
C FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
D  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
E FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
F FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
G FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
H FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
I FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

Then we can repeat by swapping nrow with ncol, and m[,order(w %*% m)] with m[order(m %*% w),] (note that we’re swapping terms in the multiplication).

It’s easy to make a function of that. We just have to decide when to stop: we monitor the two vectors of dimension names for changes:

roc <- function(m) {
  if (is.null(dimnames(m)[[1]]) | is.null(dimnames(m)[[2]]))
    stop("The matrix must have named rows and columns")
  repeat {
    dn <- dimnames(m)
    # sort rows
    w <- 2^(0:(ncol(m) - 1))
    (m <- m[order(m %*% w), ])
    # sort columns
    w <-  2^(0:(nrow(m) - 1))
    (m <- m[, order(w %*% m)])
    # check for changes
    changed = map2(dn, dimnames(m), ~ any(.x != .y)) %>% as.logical() %>% any()
    if (!changed) break
  }
  return(m)
}

So here we are: we can rearrange rows and columns to have a block diagonal matrix:

r <- roc(m) 

as_tibble(r, rownames="machine") %>% 
  pivot_longer(-machine, names_to = "part", values_to = "uses") %>% 
  mutate(
    machine=factor(machine, levels=dimnames(r)[[1]], ordered=T),
    part=factor(part, levels=dimnames(r)[[2]], ordered=T),
    uses=ifelse(uses, "YES", "NO")
  ) %>% 
  ggplot(aes(x=part, y=machine, fill=uses)) +
  geom_tile()

So that, in our example, machines B, D, E, G only produce parts a, b, d, f , and these parts only need these machines. Each block is the self-sufficient.

That’s all, folks!