4 Jul 2003    clmformat 1.003, 03-185

NAME

clmformat - display cluster results in readable form, optionally with labels and/or cohesion and stickiness measures attached.

SYNOPSIS

clmformat -icl fname (input cluster file) -imx fname (input matrix/graph file) [-tab fname (read tab file)] [-fmt fname (write results to single file)] [-dir dirname (write results to directory)] [-infix str (use after base name/directory)] [-do txt (write ascified output rather than html)] [-lump-size n (cluster size threshold)] [-lump-count n (node threshold)] [-nsm fname (output node stickiness file)] [-ccm fname (output cluster cohesion file)] [--adapt (allow domain mismatch)]

DESCRIPTION

The primary function of clmformat is to display cluster results in a readable form, by listing clusters in terms of the labels associated with the indices that are used in the mcl matrix. The labels must be stored in a so-called tab file; see the -tab option for more information.

By default the output is formatted using HTML. For each cluster a paragraph is output. First comes a listing of other clusters (in order of relevance, possibly empty) that share a significant number of edges with the current cluster. Second comes a listing of the nodes in the current cluster. For each node a small sublist is made (in order of relevance, possibly empty) of other clusters in which the node has neighbours and for which the total sum of corresponding edge weights is significant. The 'self' value is simply the projection value for the cluster to which the node belongs.

All (unqualified) values that are output are so-called projection values, described further below. The so-called coverage measures that are also output are described in [1]. You can safely ignore them, although they do sometimes explain why nodes with a low 'self' projection value form a cluster. When this happens the coverage measures are usually higher than they normally are, and this signifies that a small-area cluster is efficient compared with a large-area cluster for those nodes (which may or may not be what you want).

It is possible to split output over multiple files using the -dir option. The intent is simply that for very large graphs browsing quality can still be maintained. Clusters will by default be output to file until the total node count has exceeded a threshold (refer to the -lump-count option). Alternatively, it is possible to specify a threshold such that clusters with few entries are all collected in a single file. Refer to the -lump-size option.

clmformat also shows how well each node fits in the cluster it is in and how cohesive each cluster is, using simple but effective measures (described under respectively the -nsm option and the -ccm option). This enables you to compare the quality of the clusters in a clustering relative to each other, and may help in identifying both interesting areas and areas for which cluster structure is hard to find or perhaps absent.

OPTIONS

-icl fname (input cluster file)
   
Name of the clustering file.
   
-imx fname (input matrix/graph file)
   
Name of the graph/matrix file.
   
-tab fname (read tab file)
   
The file fname should be in tab format; each line starts with a unique number which is an index used in the matrix input file and the cluster input file. The rest of the line contains a descriptive string associated with the number. Lines starting with # are considered comment and are disregarded. A single unique line should be present for each node/index of the cluster row domain (or the graph/matrix domain optionally specified with the -imx option). The leading indices should be in ascending order.
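
The tab format described above can be illustrated with a minimal reader sketch. This is not the parser used by clmformat; the function name parse_tab is hypothetical, and skipping blank lines and splitting on the first run of whitespace are assumptions.

```python
def parse_tab(lines):
    """Map each leading index to its label, following the rules stated above."""
    labels = {}
    previous = None
    for raw in lines:
        line = raw.strip()
        # Lines starting with '#' are comments; skipping blank lines is an assumption.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        index = int(parts[0])
        label = parts[1] if len(parts) > 1 else ""
        # Indices must be unique and in ascending order.
        if previous is not None and index <= previous:
            raise ValueError(f"index {index} is not in ascending order")
        labels[index] = label
        previous = index
    return labels


example = ["# index label", "0 apple", "1 banana", "4 cherry pie"]
print(parse_tab(example))  # {0: 'apple', 1: 'banana', 4: 'cherry pie'}
```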
   
-fmt fname (write results to single file)
   
The formatted results are written to the file fname.
   
-dir dirname (write results to directory)
   
Each formatted cluster is written to a file in directory dirname. If the directory does not exist an attempt is made to create it. Output file names will be of the form 0-3.html or 0-3.txt depending on the output mode. If the -infix abc option is used, the file names will be of the form abc.0-3.html or abc.0-3.txt.

Clusters will by default be output to file until the total node count has exceeded a threshold (refer to the -lump-count option). Alternatively, it is possible to specify a threshold such that clusters with few entries are all collected in a single file. Refer to the -lump-size option.

   
-lump-count n (node threshold)
   
Used in conjunction with the -dir option. Clusters are formatted and output within a single file until the node threshold has been exceeded. A new file is then opened and the procedure repeats itself.
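
The grouping rule above can be sketched as follows. This is an illustration only, not clmformat's code: the function name lump_by_count is hypothetical, and reading "exceeded" as strictly greater than the threshold is an assumption.

```python
def lump_by_count(clusters, threshold):
    """Group clusters into files; close a file once its node count exceeds threshold."""
    files = []
    current = []
    nodes = 0
    for cluster in clusters:
        current.append(cluster)
        nodes += len(cluster)
        # Assumption: "exceeded" means strictly greater than the threshold.
        if nodes > threshold:
            files.append(current)
            current, nodes = [], 0
    if current:
        files.append(current)  # remaining clusters go into a last file
    return files


clusters = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10], [11]]
for group in lump_by_count(clusters, 4):
    print([len(c) for c in group])
# [3, 2]
# [4, 1]
# [1]
```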
   
-lump-size n (cluster size threshold)
   
Used in conjunction with the -dir option. Each cluster is output to a separate file, except for clusters whose size does not exceed the specified threshold. The latter are all output to a single file with a name of the form cut.html or cut.txt.
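
A minimal sketch of this partition, assuming that "does not exceed" means a size less than or equal to the threshold. The function name split_by_size is hypothetical and not part of clmformat.

```python
def split_by_size(clusters, threshold):
    """Return (clusters given their own file, clusters lumped into the cut file)."""
    separate = [c for c in clusters if len(c) > threshold]
    lumped = [c for c in clusters if len(c) <= threshold]
    return separate, lumped


clusters = [[0, 1, 2, 3], [4, 5], [6], [7, 8, 9]]
separate, lumped = split_by_size(clusters, 2)
print(separate)  # [[0, 1, 2, 3], [7, 8, 9]]
print(lumped)    # [[4, 5], [6]]
```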
   
--adapt (allow domain mismatch)
   
Allow the cluster domain to differ from the graph domain. Presumably the clustering is a clustering of a subgraph. The cohesion and stickiness measures will pertain to the relevant part of the graph only.
   
-nsm fname (output node stickiness file)
   
This option specifies the name of the file in which to store (optionally) the node stickiness matrix. It has the following structure. The columns range over all elements in the graph as specified by the -imx option. The rows range over the clusters as specified by the -icl option. Each entry contains the projection value of that particular node onto that particular cluster, i.e. the sum of the weights of all arcs going out from the node to some node in that cluster, written as a fraction of the sum of weights of all outgoing arcs.
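
The projection value defined above can be worked through on a small example. The sketch below is illustrative only: the graph representation (a dict of outgoing arc weights), the function name projection, and returning 0.0 for a node without outgoing arcs are all assumptions, not clmformat behaviour.

```python
def projection(graph, node, cluster):
    """Fraction of a node's outgoing arc weight that lands inside cluster."""
    out = graph.get(node, {})
    total = sum(out.values())
    if total == 0:
        return 0.0  # assumption: nodes without outgoing arcs project nowhere
    inside = sum(w for neighbour, w in out.items() if neighbour in cluster)
    return inside / total


# Hypothetical weighted graph: graph[u][v] is the weight of arc u -> v.
graph = {
    0: {1: 2.0, 2: 1.0, 3: 1.0},
    1: {0: 2.0, 2: 3.0},
    2: {0: 1.0, 1: 3.0},
    3: {0: 1.0, 4: 4.0},
    4: {3: 4.0},
}
clusters = [{0, 1, 2}, {3, 4}]

# Rows range over clusters, columns over nodes, as in the description above.
stickiness = [[projection(graph, n, c) for n in sorted(graph)] for c in clusters]
print(stickiness[0])  # [0.75, 1.0, 1.0, 0.2, 0.0]
print(stickiness[1])  # [0.25, 0.0, 0.0, 0.8, 1.0]
```

Node 0 sends weight 3 of its total 4 into the first cluster, hence the 'self' value 0.75.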
   
-ccm fname (output cluster cohesion file)
   
This option specifies the name of the file in which to store (optionally) the cluster cohesion matrix. It has the following structure. Both columns and rows range over all clusters in the clustering as specified by the -icl option. An entry specifies the projection of one cluster onto another cluster, which is simply the average of the projection value onto the second cluster of all nodes in the first cluster.
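
As a rough sketch of this averaging, under the assumption that the graph is held as a dict of outgoing arc weights: the names projection and cohesion are hypothetical, and the result is keyed by (first cluster, second cluster) because the row and column orientation of the output file is not assumed here.

```python
def projection(graph, node, cluster):
    """Fraction of a node's outgoing arc weight that lands inside cluster."""
    out = graph.get(node, {})
    total = sum(out.values())
    if total == 0:
        return 0.0  # assumption: nodes without outgoing arcs project nowhere
    return sum(w for nb, w in out.items() if nb in cluster) / total


def cohesion(graph, clusters):
    """Average projection of the nodes of cluster i onto cluster j, for all i, j."""
    result = {}
    for i, first in enumerate(clusters):
        for j, second in enumerate(clusters):
            values = [projection(graph, n, second) for n in first]
            result[(i, j)] = sum(values) / len(values)
    return result


# Hypothetical weighted graph: graph[u][v] is the weight of arc u -> v.
graph = {
    0: {1: 2.0, 2: 1.0, 3: 1.0},
    1: {0: 2.0, 2: 3.0},
    2: {0: 1.0, 1: 3.0},
    3: {0: 1.0, 4: 4.0},
    4: {3: 4.0},
}
clusters = [{0, 1, 2}, {3, 4}]
for pair, value in sorted(cohesion(graph, clusters).items()):
    print(pair, round(value, 3))
```

In this example the first cluster projects onto itself with value (0.75 + 1 + 1) / 3, roughly 0.917, so both clusters are fairly cohesive.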

AUTHOR

Stijn van Dongen.

REFERENCES

[1] Stijn van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Technical Report INS-R0012, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000.
http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z

SEE ALSO

mcl, mcx, clmdist, clminfo, clmmeet, mcxio.