Difference between Closed and open Sequential Pattern Mining Algorithms

leon picture leon · Apr 22, 2013 · Viewed 9.7k times · Source

I want to use some algorithms to mine my log data.

I found a pattern mining framework on: http://www.philippe-fournier-viger.com/spmf/index.php?link=algorithms.php

I have tried several algorithms, the BIDE+ algorithm performs the best.

The BIDE+ algorithm is for mining frequent closed sequential patterns from a sequence database.

Can someone explain the definition about "closed" sequential patterns and open ones?

Answer

Phil picture Phil · Apr 26, 2013

Glad that you are using my SPMF software.

The support of a sequential pattern is the number of sequences that contains the sequential pattern.

A frequent sequential pattern is a pattern that appears in at least "minsup" sequences of a sequence database, where minsup is a parameter set by the user.

A frequent closed sequential pattern is a frequent sequential pattern such that it is not included in another sequential pattern having exactly the same support.

Algorithms such as PrefixSpan finds frequent sequential patterns. Algorithms such as BIDE+ finds frequent closed sequential patterns. BIDE+ is usually much faster than PrefixSpan because it uses pruning techniques to avoid generating all sequential patterns. Moreover, the set of closed patterns is usually much smaller than the set of sequential patterns so BIDE+ is also more memory efficient.

Another important thing to know is that closed sequential patterns are a compact and lossless representation of all sequential patterns. This means that the set of closed sequential patterns is usually much smaller but it is lossless, which means that it allows recovering the full set of sequential patterns (no information is loss), which is very convenient.

I can give you a simple example.

Let's consider 4 sequences:

a  b  c  d  e
a  b  d
b  e  a  
b  c  d  e

Let's say that minsup = 2.

b c is a frequent sequential patterns because it appears in two sequences (it has a support of 2). b c is not a closed sequential patterns because it is contained in a larger sequential pattern b c d having the same support.

b c d has a support of 2. It is also not a closed sequential pattern because it is contained in a larger sequential pattern b c d e having the same support. b c d e is a closed sequential pattern because there it is not included in any other sequential pattern having the same support.

By the way, you can also check my survey about sequential pattern mining. It gives a good introduction about this topic and the different algorithms.