So I'm trying to implement an algorithm for extracting message types from log files. The author created several published versions, of which this one appears to be the preferred reference:
http://web.cs.dal.ca/~makanju/publications/paper/kdd09.pdf (His TR omits one algorithm altogether, unfortunately.)
I can't wrap my head around what he meant by the following heuristic. P is a partition of log messages, which has already been tokenized and restricted to messages of a particular token length (i.e., count_token). We are trying to pick two columns within P for a subsequent partitioning step. Other published descriptions said that the heuristic was to pick the modal (most frequently occurring) token count per position, but the detailed version is:
1: Determine token count of P as count_token
2: if count_token = 2 then
3: Set P1 to first token position.
4: Set P2 to second token position.
5: else if { count_token is > 2 }
6: if P went through step 2 then
7: Determine cardinality of each token position.
8: Determine the token count value with the highest frequency
other than value 1 as freq_card.
9: If there is a tie for highest frequency value then
10: Select lower token value as freq_card
11: end if
12: if the frequency of freq_card > 1 then
13: Set P1 to first token position with cardinality freq_card.
14: Set P2 to second token position with cardinality freq_card.
15: else { the frequency of freq_card = 1 }
16: Set P1 to first token position with cardinality freq_card.
17: Set P2 to first token position with the next most frequent
cardinality other than value 1.
18: end if
19: else // other stuff, don't care
So, is freq_card allowed to be 1 or not? What if there is no mode, because all the positions have different cardinality?
I think what he's saying in a situation like
error: aaa bbb
error: aaa ccc
error: ddd eee
where the columns have cardinality 1, 2, and 3, is that we tiebreak among the "highest frequency value" by picking the lower cardinality: 1. But there is no "next most frequent cardinality" so I guess the only thing we could do is pick between 2 and 3, and the tiebreaker says to pick 2.
But if we had 1, 1, 2, 2 then we should pick 2 instead because it's not 1, in precedence to the tiebreaker in lines 8-9? Or if it's 1, 1, 1, 2, 2 we should pick 2 as well?
The rationale for picking the mode is that we're looking for the constant part of the messages. The messages include fixed phrases as well as variable components, so if there are N different messages in our partition, we should be able to find at least two columns that all have N different tokens in them. From that point of view, I guess if we can't find a modal token count then the heuristic simply shouldn't apply (but the algorithm spends quite a bit of effort when the relationship between the two columns we pick is not 1:1, so evidently he thought it was important to catch some other cases too.)