Q：Text Search based algorithm not behaving as intended
I've updated the question with newer code suggested by fellow SO users and will be clarifying any ambiguous text that was previously there.
I only have access to the log files generated by the application in question. Thus I'm constrained to work within the content of the log files and no solutions out of that scope is quite possible. I'll have modified the sample data a little bit. I would like to point out the following key variables.
Thread ID - Ranges from 0..19 - A thread is used multiple times. Thus ScriptExecThread(2) could show up multiple times within the logs.
Script - Every thread will run a script on a particular file. Once again, the same script may run on the same thread but won't run on the same thread AND file.
File - Every Thread ID runs a Script on a File. If Thread(10) is running myscript.script on myfile.file, then that EXACT line won't be executed again. A successful example using the above example would be something like so.
An unsuccessful example using the above example would be:
Before addressing my query I'll give a rundown of the code used and the desired behavior.
I'm currently parsing large log files (take an average of 100k - 600k lines) and am attempting to retrieve certain information in a certain order. I've worked out the boolean algebra behind my request which seemed to work on paper but no so much on code (I must've missed something blatantly obvious). I would like to inform in advance that the code is not in any shape or form optimized, right now I simply want to get it to work.
In this log file you can see that certain threads hang up if they started but never finished. The number of possible thread IDs ranges. Here is some pseudo code:
Search Hung Threads
Has Thread Hung
If you example this, you'll noticed Threads number 2 & 3 started but failed to finished (reason is not necessary, I simply need to get the line).
Currently the code simply displays the entire log file with lines started with "starting". Which does somewhat make sense when I review the code.
I have removed any redundant information that I don't wish to display. If there is anything that I might have left out feel free to let me know and I'll add it.
线程ID -范围从0。19 -一个线程多次使用。因此scriptexecthread（2）可能出现多次在日志。
Search Hung Threads
Has Thread Hung
Keeping two separate sets of started and finished threads (as described by @tucuxi) can't work. If a thread with ID 5 starts, runs, and finishes, then 5 will appear in the finished set, forever. If another thread with ID 5 starts, and hangs, it won't be reported.
Suppose, though, for a moment, that thread IDs aren't reused. Every thread ever created receives a new ID. By keeping separate started and finished sets, you'll have hundreds of thousands of elements in each by the time you're done. Performance on those data structures is proportional to what they've got in them at the time of the operation. It's unlikely that performance will matter for your use case, but if you were performing more expensive operations, or running on data 100 times larger, it might.
Preamble out of the way, here is a working version of @tucuxi's code:
Note that I've dropped the finished set, and the HashMap is now called running. When new threads start, they go in, and when the thread finishes, it is pulled out. This means that the HashMap will always be the size of the number of currently running threads, which will always be less than (or equal) to the total number of threads ever run. So the operations on it will be faster. (As a pleasant side effect, you can now keep track of how many threads are running on a log line by log line basis, which might be useful in determining why the threads are hanging.)
Here's a Python program I used to generate huge test cases:
保持两套独立的开始和结束的线程（由@ tucuxi描述）不能工作。如果带有ID 5的线程启动、运行和完成，那么5将出现在已完成的集合中，直到永远。如果另一个线程与ID 5开始，并挂起，它不会被报告。
If I understand correctly, you have large files and are trying to find patterns of the form "X started (but no mention of X finished)" for all numerical values of X.
If I were to do this, I would use this pseudocode:
This requires a single regex pass over each file and an O(size-of-finished) set substraction; it should be 20x faster than your current approach. The regex would use optional (|) matching to look for both alternatives at once, reducing the number of passes.
Edit: made regex explicit. Compiling the regex once instead of once-per-line should shave off some extra run-time.
Edit 2: implemented pseudocode, works for me
Edit 3: replaced implementation to show file & line. Reduces memory requirements (does not load whole file into memory); but printing the line does require all "start" lines to be stored.
Input: sample data above, as the first argument, called "data.txt". The same data in another file called "data2.txt" as the second argument (javac T.java && java T data.txt data2.txt). Output:
输入：样本数据，作为第一个参数，称为“数据.txt”。在另一个文件名为“数据相同的数据。txt”作为第二个参数（javac t.java &；&；java T data.txt DATA2。txt）。输出：
|java algorithm text|