Q: Repeated ordered sequence search algorithm

I have a large ordered sequence of symbols, millions of symbols long. I need to find repeated ordered subsequences, subject to the following:

  1. The subsequences to search for are not known in advance; I have to discover subsequences that repeat elsewhere in the large sequence.
  2. Occurrences of a subsequence may differ from one another: they may contain some amount of noise, and some symbols may be missing.

Optional condition:

  1. Occurrences may contain a small number of permutations of neighboring symbols.

The alphabet consists of thousands of symbols.

Can you recommend a well-known and well-studied algorithm for this task?

Answer 1:

You can try Aho-Corasick multiple-pattern matching and use a wildcard to search for substrings. For subsequences you will also want the Levenshtein distance. You can try my PHP implementation of the Aho-Corasick algorithm with wildcard support at https://phpahocorasick.codeplex.com.
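
The answer names the Levenshtein distance without showing it, so here is a minimal Python sketch of that measure applied to two candidate subsequences. It is not taken from the linked PHP project; the function name `levenshtein` and the choice to represent symbols as arbitrary comparable values (for example integer IDs) are assumptions made for illustration.

```python
from typing import Hashable, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Edit distance between two symbol sequences.

    Counts the minimum number of insertions, deletions and
    substitutions needed to turn `a` into `b`.
    """
    if len(a) < len(b):
        # Keep the shorter sequence in the inner loop so the rows stay small.
        a, b = b, a

    # previous[j] is the distance between the first i-1 symbols of `a`
    # and the first j symbols of `b`.
    previous = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        current = [i]
        for j, sym_b in enumerate(b, start=1):
            cost = 0 if sym_a == sym_b else 1
            current.append(min(
                previous[j] + 1,         # delete sym_a
                current[j - 1] + 1,      # insert sym_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Two occurrences that differ by one missing symbol.
    print(levenshtein([101, 205, 17, 999], [101, 17, 999]))
```

Because the function only compares symbols for equality, an alphabet of thousands of symbols needs no special handling. Note that plain Levenshtein distance counts a swap of two neighboring symbols as two edits, not one.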

algorithm  sequence  data-mining  dynamic-programming  bioinformatics