Sequential Pattern Mining Framework in Java (SPMF)

While doing some research on how to mine patterns from sequence data, I found a really good open-source platform called SPMF.

SPMF is an open-source data mining platform written in Java. It is distributed under the GPL v3 license.

The link to the website: http://www.philippe-fournier-viger.com/spmf/index.php

It offers implementations of 52 data mining algorithms for:

  • sequential pattern mining,
  • association rule mining,
  • frequent itemset mining,
  • sequential rule mining,
  • clustering

It can be used as a standalone program with a user interface or from the command line. Moreover, the source code of each algorithm can be integrated into other Java software.

The following picture is a map that visualizes the relationships between the various data mining algorithms offered in SPMF.

Visual map of algorithms


Supported Algorithms

Sequential Pattern Mining Algorithms

  • the PrefixSpan algorithm for mining frequent sequential patterns from a sequence database (Pei et al., 2004).
  • the SPAM algorithm for mining frequent sequential patterns from a sequence database (Ayres et al., 2002).
  • the BIDE+ algorithm for mining frequent closed sequential patterns from a sequence database (Wang et al., 2007).
  • the SeqDIM algorithm for mining frequent multidimensional sequential patterns from a multi-dimensional sequence database (Pinto et al., 2001).
  • the Songram et al. algorithm for mining frequent closed multidimensional sequential patterns from a multi-dimensional sequence database (Songram et al., 2006).
  • the Fournier-Viger et al. algorithm, a sequential pattern mining algorithm that combines several features from well-known sequential pattern mining algorithms and also proposes some original features (Fournier-Viger et al., 2008).

Sequential Rule Mining Algorithms


Eager Learning vs. Lazy Learning

Eager Learning:

Eager learning methods construct a general, explicit description of the target function from the provided training examples.

Eager learning methods use the same approximation of the target function for every query; it is learned from the training examples before any input queries are observed.

Lazy Learning:

Lazy learning methods simply store the training data; generalizing beyond these data is postponed until an explicit request (a query) is made.

Lazy learning methods can construct a different approximation to the target function for each encountered query instance.

They are suitable for complex and incomplete problem domains, where a complex target function can be represented by a collection of less complex local approximations.

Eager learning normally requires less space than lazy learning, since only the learned model is kept rather than the full set of training examples.
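The lazy approach can be sketched with a toy 1-nearest-neighbour classifier (all class and method names here are illustrative, not from any library): "training" just stores the examples, and a local approximation is computed from scratch for each query.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative 1-nearest-neighbour classifier: a lazy learner that only
// stores the training data and defers all computation to query time.
public class NearestNeighbour {
    private final List<double[]> points = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    // "Training" is just storage -- no general model is built here.
    public void store(double[] point, String label) {
        points.add(point);
        labels.add(label);
    }

    // A local approximation is computed for each individual query:
    // the label of the closest stored example (squared Euclidean distance).
    public String classify(double[] query) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.size(); i++) {
            double[] p = points.get(i);
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - query[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = labels.get(i);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NearestNeighbour knn = new NearestNeighbour();
        knn.store(new double[]{0.0, 0.0}, "A");
        knn.store(new double[]{5.0, 5.0}, "B");
        System.out.println(knn.classify(new double[]{1.0, 0.5})); // prints A
        System.out.println(knn.classify(new double[]{4.0, 6.0})); // prints B
    }
}
```

Note that the space cost is the entire stored dataset, which is exactly why eager learners, which keep only a compact model, normally need less space.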


Decision Tree Learning Algorithm and ID3

General decision tree learning algorithms employ a top-down greedy search through the space of possible trees:

1. Perform a statistical test on each attribute to determine how well it classifies the training examples when considered alone.

2. Select the attribute that performs best and use it as the root of the tree.

3. To decide the descendant node down each branch of the root, sort the training examples according to the value related to the current branch and repeat steps 1 and 2.

ID3:

ID3 uses Information Gain to determine how informative an attribute is.

Information Gain is based on a measure called Entropy, which characterizes the impurity of a collection of examples S. (The larger the entropy, the larger the impurity.)
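The two measures can be sketched on a tiny boolean dataset (class and method names here are illustrative): Entropy(S) = -Σ p(c)·log2 p(c) over the classes c, and Gain(S, A) = Entropy(S) minus the size-weighted entropy of the subsets S_v produced by splitting on attribute A.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative computation of Entropy and Information Gain, the measures
// ID3 uses to pick the most informative attribute at each node.
public class InformationGain {

    // Entropy(S) = -sum over classes c of p(c) * log2(p(c))
    public static double entropy(List<String> classes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : classes) counts.merge(c, 1, Integer::sum);
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / classes.size();
            h -= p * Math.log(p) / Math.log(2); // log base 2
        }
        return h;
    }

    // Gain(S, A) = Entropy(S) - sum over values v of A of
    //              (|S_v| / |S|) * Entropy(S_v)
    public static double gain(List<String> attrValues, List<String> classes) {
        Map<String, List<String>> partitions = new HashMap<>();
        for (int i = 0; i < classes.size(); i++) {
            partitions.computeIfAbsent(attrValues.get(i), k -> new ArrayList<>())
                      .add(classes.get(i));
        }
        double remainder = 0.0;
        for (List<String> subset : partitions.values()) {
            remainder += (double) subset.size() / classes.size() * entropy(subset);
        }
        return entropy(classes) - remainder;
    }

    public static void main(String[] args) {
        // Toy dataset: attribute "windy" vs. class "play".
        List<String> windy = List.of("yes", "yes", "no", "no");
        List<String> play  = List.of("no",  "no",  "yes", "yes");
        System.out.println(entropy(play));     // 1.0: two classes, 50/50 split
        System.out.println(gain(windy, play)); // 1.0: the attribute separates
                                               // the classes perfectly
    }
}
```

An attribute whose values split S into pure subsets (entropy 0) gets the maximum gain, so ID3 would choose it as the root.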

Advantages:

1. Every discrete classification function can be represented by a decision tree.

2. Instead of making decisions based on individual training examples (e.g. Find-S), ID3 uses a statistical property of all the examples (Information Gain), and is therefore less sensitive to errors (compared to Find-S and Candidate-Elimination).

Disadvantages:

1. ID3 maintains a single hypothesis, not a space of consistent hypotheses.

2. ID3 does no backtracking in its search, so it may overfit the training data and converge to a locally optimal solution that is not globally optimal.

How to prevent overfitting?

1. Stop the training process before the learner reaches the point where it perfectly classifies the training data.

2. Apply backtracking: post-pruning of the overfitted tree.

3. Cross-validation
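The third remedy, cross-validation, can be sketched as a k-fold split (a purely illustrative helper, not tied to any library): each example lands in exactly one test fold, the tree is trained on the remaining folds, and the k evaluation scores are averaged to estimate generalization error.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative k-fold split for cross-validation: each example index lands
// in exactly one test fold; the rest of the data forms the training set.
public class CrossValidation {

    // Returns the indices of test fold number `fold` (0-based) out of k,
    // assigning example i to fold (i mod k).
    public static List<Integer> testIndices(int n, int k, int fold) {
        List<Integer> test = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (i % k == fold) test.add(i);
        }
        return test;
    }

    public static void main(String[] args) {
        int n = 10, k = 5;
        for (int fold = 0; fold < k; fold++) {
            // Train the decision tree on the other folds, evaluate on this
            // one, then average the k scores.
            System.out.println("fold " + fold + " tests on "
                    + testIndices(n, k, fold));
        }
    }
}
```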
