找到你要的答案

Q:Text Search based algorithm not behaving as intended

Q:文本搜索为基础的算法不符合预期

Update

I've updated the question with newer code suggested by fellow SO users and will be clarifying any ambiguous text that was previously there.

Update #2

I only have access to the log files generated by the application in question. Thus I'm constrained to work within the content of the log files and no solutions out of that scope is quite possible. I'll have modified the sample data a little bit. I would like to point out the following key variables.

Thread ID - Ranges from 0..19 - A thread is used multiple times. Thus ScriptExecThread(2) could show up multiple times within the logs.

Script - Every thread will run a script on a particular file. Once again, the same script may run on the same thread but won't run on the same thread AND file.

File - Every Thread ID runs a Script on a File. If Thread(10) is running myscript.script on myfile.file, then that EXACT line won't be executed again. A successful example using the above example would be something like so.

------START------

Thread(10) starting myscript.script on myfile.file

Thread(10) finished myscript.script on myfile.file

------END-------

An unsuccessful example using the above example would be:

------START------

Thread(10) starting myscript.script on myfile.file

------END------


Before addressing my query I'll give a rundown of the code used and the desired behavior.


Summary

I'm currently parsing large log files (take an average of 100k - 600k lines) and am attempting to retrieve certain information in a certain order. I've worked out the boolean algebra behind my request which seemed to work on paper but no so much on code (I must've missed something blatantly obvious). I would like to inform in advance that the code is not in any shape or form optimized, right now I simply want to get it to work.

In this log file you can see that certain threads hang up if they started but never finished. The number of possible thread IDs ranges. Here is some pseudo code:

    REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" //in java
    Set started, finished
    for (int i=log.size()-1; i >=0; i--) {
    if(group(2).contains("starting")
        started.add(log.get(i))
    else if(group(2).contains("finished")
        finished.add(log.get(i)    
    }
    started.removeAll(finished);

Search Hung Threads

Set<String> started = new HashSet<String>(), finished = new HashSet<String>();

for(int i = JAnalyzer.csvlog.size()-1; i >= 0; i--) {
    if(JAnalyzer.csvlog.get(i).contains("ScriptExecThread")) 
        JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);     
}
started.removeAll(finished);

commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for(String s : started) { 
    JLogger.appendLineToConsole(s);
    commonTextArea.append(s+"\n");
}

Has Thread Hung

public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {      
    Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
    Matcher m = r.matcher(str);
    boolean hasHung = m.find();

        if(m.group(2).contains("starting"))
            started.add(str);
        else if (m.group(2).contains("finished"))
            finished.add(str);

        System.out.println("Started size: " + started.size());
        System.out.println("Finished size: " + finished.size());

    return hasHung;
}

Example Data

ScriptExecThread(1) started on afile.xyz

ScriptExecThread(2) started on bfile.abc

ScriptExecThread(3) started on cfile.zyx

ScriptExecThread(4) started on dfile.zxy

ScriptExecThread(5) started on efile.yzx

ScriptExecThread(1) finished on afile.xyz

ScriptExecThread(2) finished on bfile.abc

ScriptExecThread(3) finished on cfile.zyx

ScriptExecThread(4) finished on dfile.zxy

ScriptExecThread(5) finished on efile.yzy

ScriptExecThread(1) started on bfile.abc

ScriptExecThread(2) started on dfile.zxy

ScriptExecThread(3) started on afile.xyz

ScriptExecThread(1) finished on bfile.abc

END OF LOG

If you example this, you'll noticed Threads number 2 & 3 started but failed to finished (reason is not necessary, I simply need to get the line).

Sample Data

09.08 15:06.53, ScriptExecThread(7),Info,########### starting

09.08 15:06.54, ScriptExecThread(18),Info,###################### starting

09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########

09.08 15:06.54, ScriptExecThread(13),Info,########## starting

09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########

09.08 15:06.55, ScriptExecThread(0),Info,####finished in ###########

09.08 15:06.55, ScriptExecThread(19),Info,#### finished in ########

09.08 15:06.55, ScriptExecThread(8),Info,###### finished in 2777 #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(8),Info,####### starting

09.08 15:06.55, ScriptExecThread(0),Info,##########starting

09.08 15:06.55, ScriptExecThread(19),Info,Post ###### finished in #####

09.08 15:06.55, ScriptExecThread(0),Info,###### finished in #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(0),Info,########### starting

09.08 15:06.55, ScriptExecThread(9),Info,########## starting

09.08 15:06.56, ScriptExecThread(1),Info,####### finished in ########

09.08 15:06.56, ScriptExecThread(17),Info,###### finished in #######

09.08 15:06.56, ScriptExecThread(17),Info,###################### starting

09.08 15:06.56, ScriptExecThread(1),Info,########## starting


Currently the code simply displays the entire log file with lines started with "starting". Which does somewhat make sense when I review the code.

I have removed any redundant information that I don't wish to display. If there is anything that I might have left out feel free to let me know and I'll add it.

Update

我更新了问题,由新的代码建议,所以用户和将澄清任何暧昧的文字,以前有。

Update #2

我只能访问由应用程序产生的日志文件中的问题。因此,我被限制在日志文件的内容中工作,并且没有从该范围内解决的方法。我会稍微修改一下样本数据。我想指出以下关键变量。

线程ID -范围从0。19 -一个线程多次使用。因此scriptexecthread(2)可能出现多次在日志。

脚本-每个线程将在特定文件上运行脚本。同样的脚本可以在同一个线程上运行,但不会在同一个线程和文件上运行。

文件-每个线程ID在文件上运行一个脚本。如果线程(10)运行myscript.script在myfile.file,然后精确线不会再次执行。使用上述例子的一个成功例子就是这样。

------开始------

螺纹(10)开始在myfile.file myscript.script

螺纹(10)完成了对myfile.file myscript.script

------端-------

使用上述例子的一个不成功的例子将是:

------开始------

螺纹(10)开始在myfile.file myscript.script

------端------


演讲前我查询我给一个破败的代码和所期望的行为。


Summary

我目前解析大日志文件(以100K平均60万行),我试图在一个特定的顺序检索特定的信息。我曾经在我的要求似乎在纸上的工作却没有这么多的代码的布尔代数(我一定错过了什么明显)。我想事先告知,代码是不是在任何形状或形式优化,现在我只是想得到它的工作。

在这个日志文件你可以看到某些线程挂起,如果他们开始,但从来没有完成。可能的线程ID范围的数目。这里是一些伪代码:

    REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" //in java
    Set started, finished
    for (int i=log.size()-1; i >=0; i--) {
    if(group(2).contains("starting")
        started.add(log.get(i))
    else if(group(2).contains("finished")
        finished.add(log.get(i)    
    }
    started.removeAll(finished);

Search Hung Threads

Set<String> started = new HashSet<String>(), finished = new HashSet<String>();

for(int i = JAnalyzer.csvlog.size()-1; i >= 0; i--) {
    if(JAnalyzer.csvlog.get(i).contains("ScriptExecThread")) 
        JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);     
}
started.removeAll(finished);

commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for(String s : started) { 
    JLogger.appendLineToConsole(s);
    commonTextArea.append(s+"\n");
}

Has Thread Hung

public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {      
    Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
    Matcher m = r.matcher(str);
    boolean hasHung = m.find();

        if(m.group(2).contains("starting"))
            started.add(str);
        else if (m.group(2).contains("finished"))
            finished.add(str);

        System.out.println("Started size: " + started.size());
        System.out.println("Finished size: " + finished.size());

    return hasHung;
}

Example Data

scriptexecthread(1)开始afile.xyz

scriptexecthread(2)开始bfile.abc

scriptexecthread(3)开始cfile.zyx

scriptexecthread(4)开始dfile.zxy

scriptexecthread(5)开始efile.yzx

(1)完成了对afile.xyz scriptexecthread

(2)完成了对bfile.abc scriptexecthread

(3)完成了对cfile.zyx scriptexecthread

(4)完成了对dfile.zxy scriptexecthread

(5)完成了对efile.yzy scriptexecthread

scriptexecthread(1)开始bfile.abc

scriptexecthread(2)开始dfile.zxy

scriptexecthread(3)开始afile.xyz

(1)完成了对bfile.abc scriptexecthread

日志结束

如果你举这个例子,你会注意到2号和第三号线的开始,但没有完成(原因是没有必要的,我只需要得到的线)。

Sample Data

9.08 15: 6.53,scriptexecthread(7),信息,###########出发

9.08 15: 6.54,scriptexecthread(18),信息,######################出发

9.08 15: 6.54,scriptexecthread(13),信息,########完成#########

9.08 15: 6.54,scriptexecthread(13),信息,##########出发

9.08 15: 6.55,scriptexecthread(9),信息,#####完成########

9.08 15: 6.55,scriptexecthread(0),信息,# # # #完成###########

9.08 15: 6.55,scriptexecthread(19),信息,# # # #完成########

9.08 15: 6.55,scriptexecthread(8),信息,######完成2777 #########

9.08 15: 6.55,scriptexecthread(19),信息,##########出发

9.08 15: 6.55,scriptexecthread(8),信息,#######出发

9.08 15: 6.55,scriptexecthread(0),信息,##########出发

9.08 15: 6.55,scriptexecthread(19),信息,完成后###### #####

9.08 15: 6.55,scriptexecthread(0),信息,######完成#########

9.08 15: 6.55,scriptexecthread(19),信息,##########出发

9.08 15: 6.55,scriptexecthread(0),信息,###########出发

9.08 15: 6.55,scriptexecthread(9),信息,##########出发

9.08 15: 6.56,scriptexecthread(1),信息,#######完成########

9.08 15: 6.56,scriptexecthread(17),信息,######完成#######

9.08 15: 6.56,scriptexecthread(17),信息,######################出发

9.08 15: 6.56,scriptexecthread(1),信息,##########出发


目前的代码只显示整个日志文件与行开始“开始”。当我回顾代码时,有什么意义。

我删除了任何我不想显示的多余信息。如果有什么我可能已经离开了,随时让我知道,我会添加它。

answer1: 回答1:

Keeping two separate sets of started and finished threads (as described by @tucuxi) can't work. If a thread with ID 5 starts, runs, and finishes, then 5 will appear in the finished set, forever. If another thread with ID 5 starts, and hangs, it won't be reported.

Suppose, though, for a moment, that thread IDs aren't reused. Every thread ever created receives a new ID. By keeping separate started and finished sets, you'll have hundreds of thousands of elements in each by the time you're done. Performance on those data structures is proportional to what they've got in them at the time of the operation. It's unlikely that performance will matter for your use case, but if you were performing more expensive operations, or running on data 100 times larger, it might.

Preamble out of the way, here is a working version of @tucuxi's code:

import java.util.*;
import java.io.*;
import java.util.regex.*;

public class T {
    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> running = new HashMap<Integer, String>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting"))
                    running.put(n, d);
                else
                    running.remove(n);
            }
        }
        return running.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

Note that I've dropped the finished set, and the HashMap is now called running. When new threads start, they go in, and when the thread finishes, it is pulled out. This means that the HashMap will always be the size of the number of currently running threads, which will always be less than (or equal) to the total number of threads ever run. So the operations on it will be faster. (As a pleasant side effect, you can now keep track of how many threads are running on a log line by log line basis, which might be useful in determining why the threads are hanging.)

Here's a Python program I used to generate huge test cases:

#!/usr/bin/python

from random import random, choice
from datetime import datetime
import tempfile

all_threads = set([])
running = []
hung = []
filenames = { }

target_thread_count = 16
hang_chance = 0.001

def log(id, msg):
    now = datetime.now().strftime("%m:%d %H:%M:%S")
    print "%s, ScriptExecThread(%i),Info,%s" % (now, id, msg)

def new_thread():
    if len(all_threads)>0:
        for t in range(0, 2+max(all_threads)):
            if t not in all_threads:
                all_threads.add(t)
                return t
    else:
        all_threads.add(0)
        return 0

for i in range(0, 100000):
    if len(running) > target_thread_count:
        new_thread_chance = 0.25
    else:
        new_thread_chance = 0.75
        pass

    if random() < new_thread_chance:
        t = new_thread()
        name = next(tempfile._get_candidate_names())+".txt"
        filenames[t] = name
        log(t, "%s starting" % (name,))
        if random() < hang_chance:
            hung.append(t)
        else:
            running.append(t)
    elif len(running)>0:
        victim = choice(running)
        all_threads.remove(victim)
        running.remove(victim)
        log(t, "%s finished" % (filenames[victim],))

保持两套独立的开始和结束的线程(由@ tucuxi描述)不能工作。如果带有ID 5的线程启动、运行和完成,那么5将出现在已完成的集合中,直到永远。如果另一个线程与ID 5开始,并挂起,它不会被报告。

但是,假设一段时间,该线程ID不被重用。创建的每一个线程都会收到一个新的ID。通过保持单独的开始和完成的设置,你将有成千上万的元素在每次你完成。这些数据结构的性能与操作时的数据结构成正比。这是不太可能的性能将对您的用例,但如果您正在执行更昂贵的操作,或运行数据的100倍以上,它可能。

前导码的方式进行,这是一个工作版本@ tucuxi的代码:

import java.util.*;
import java.io.*;
import java.util.regex.*;

public class T {
    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> running = new HashMap<Integer, String>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting"))
                    running.put(n, d);
                else
                    running.remove(n);
            }
        }
        return running.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

请注意,我已经放弃了,和HashMap现在被称为运行。当新线程开始时,它们进入,当线程结束时,它被拔出。这意味着,HashMap永远是当前正在运行的线程数量的大小,这始终是小于(或等于)的线程总数过。所以它的操作会更快。(作为一个愉快的副作用,您现在可以跟踪多少线程正在运行的日志行的日志行的基础上,这可能是有用的,以确定为什么线程挂起。)

这是我用来产生巨大的测试用例的Python程序:

#!/usr/bin/python

from random import random, choice
from datetime import datetime
import tempfile

all_threads = set([])
running = []
hung = []
filenames = { }

target_thread_count = 16
hang_chance = 0.001

def log(id, msg):
    now = datetime.now().strftime("%m:%d %H:%M:%S")
    print "%s, ScriptExecThread(%i),Info,%s" % (now, id, msg)

def new_thread():
    if len(all_threads)>0:
        for t in range(0, 2+max(all_threads)):
            if t not in all_threads:
                all_threads.add(t)
                return t
    else:
        all_threads.add(0)
        return 0

for i in range(0, 100000):
    if len(running) > target_thread_count:
        new_thread_chance = 0.25
    else:
        new_thread_chance = 0.75
        pass

    if random() < new_thread_chance:
        t = new_thread()
        name = next(tempfile._get_candidate_names())+".txt"
        filenames[t] = name
        log(t, "%s starting" % (name,))
        if random() < hang_chance:
            hung.append(t)
        else:
            running.append(t)
    elif len(running)>0:
        victim = choice(running)
        all_threads.remove(victim)
        running.remove(victim)
        log(t, "%s finished" % (filenames[victim],))
answer2: 回答2:

If I understand correctly, you have large files and are trying to find patterns of the form "X started (but no mention of X finished)" for all numerical values of X.

If I were to do this, I would use this pseudocode:

Pattern p = Pattern.compile(
   "ScriptExecThread\\(([0-9]+).*?(finished|started)");
Set<Integer> started, finished;
Search for p; for each match m,
     int n = Integer.parseInt(m.group(1));
     if (m.group(2).equals("started")) started.add(n);
     else finished.add(n);
started.removeAll(finished); // found 'em: contains started-but-not-finished

This requires a single regex pass over each file and an O(size-of-finished) set substraction; it should be 20x faster than your current approach. The regex would use optional (|) matching to look for both alternatives at once, reducing the number of passes.

Edit: made regex explicit. Compiling the regex once instead of once-per-line should shave off some extra run-time.


Edit 2: implemented pseudocode, works for me


Edit 3: replaced implementation to show file & line. Reduces memory requirements (does not load whole file into memory); but printing the line does require all "start" lines to be stored.

public class T {

    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(   
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> started = new HashMap<Integer, String>();
        Set<Integer> finished = new HashSet<Integer>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting")) started.put(n, d);
                else finished.add(n);
            }                
        }
        for (int f : finished) {
            started.remove(f);
        }
        return started.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

Input: sample data above, as the first argument, called "data.txt". The same data in another file called "data2.txt" as the second argument (javac T.java && java T data.txt data2.txt). Output:

data.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
data2.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data2.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished

如果我理解正确,你有大的文件,并试图找到模式的形式“X开始(但没有提到X完成)”的所有数值x。

如果我这样做,我会使用伪代码:

Pattern p = Pattern.compile(
   "ScriptExecThread\\(([0-9]+).*?(finished|started)");
Set<Integer> started, finished;
Search for p; for each match m,
     int n = Integer.parseInt(m.group(1));
     if (m.group(2).equals("started")) started.add(n);
     else finished.add(n);
started.removeAll(finished); // found 'em: contains started-but-not-finished

这就需要一个正则表达式通过每个文件和一个O(尺寸的成品)减法;应该20x比你当前的方法更快。正则表达式可以使用可选的(|)匹配寻找方案时,减少传球次数。

编辑:明确了正则表达式。编译正则表达式而不是每行应该剃掉一些额外的运行。


编辑2:实现的伪代码,我的作品


编辑3:替换实现显示文件和行。减少内存需求(不加载整个文件到内存中),但打印行需要所有的“开始”行存储。

public class T {

    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(   
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> started = new HashMap<Integer, String>();
        Set<Integer> finished = new HashSet<Integer>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting")) started.put(n, d);
                else finished.add(n);
            }                
        }
        for (int f : finished) {
            started.remove(f);
        }
        return started.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

输入:样本数据,作为第一个参数,称为“数据.txt”。在另一个文件名为“数据相同的数据。txt”作为第二个参数(javac t.java &;&;java T data.txt DATA2。txt)。输出:

data.txt: '    9.08 15: 6.54,scriptexecthread(18),信息,######################出发' unfinished
data.txt: '    9.08 15: 6.53,scriptexecthread(7),信息,###########出发' unfinished
data2.txt: '    9.08 15: 6.54,scriptexecthread(18),信息,######################出发' unfinished
data2.txt: '    9.08 15: 6.53,scriptexecthread(7),信息,###########出发' unfinished
java  algorithm  text