Q: Finding a needle in a haystack, what is a better solution?

So given "needle" and "there is a needle in this but not thisneedle haystack",

I wrote:

def find_needle(n, h):
    # Count occurrences of n as a standalone token, splitting h on single spaces.
    count = 0
    words = h.split(" ")
    for word in words:
        if word == n:
            count += 1
    return count

This is O(n), but I'm wondering if there is a better approach. Maybe without using split at all?

How would you write tests for this to make sure it handles all the edge cases?
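
For reference, here is the kind of test I have in mind: a minimal pytest-style sketch (the edge cases are my own guesses, and the expected values reflect the split-based implementation above):

import pytest

# Assumes find_needle (above) is defined in, or imported into, the test module.
@pytest.mark.parametrize("needle, haystack, expected", [
    ("needle", "there is a needle in this but not thisneedle haystack", 1),
    ("needle", "", 0),                 # empty haystack
    ("needle", "needle", 1),           # haystack is exactly the needle
    ("needle", "needle needle", 2),    # repeated matches
    ("needle", "needleneedle", 0),     # no separator, so no match
    ("needle", "Needle", 0),           # matching is case-sensitive
    ("needle", "a needle.", 0),        # trailing punctuation defeats split(" ")
])
def test_find_needle(needle, haystack, expected):
    assert find_needle(needle, haystack) == expected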

Answer 1:

I don't think it's possible to get below O(n) with this (because you need to iterate through the string at least once), but you can do some optimizations.

I assume you want to match "whole words"; for example, looking up foo should match like this:

foo and foo, or foobar and not foo.
^^^     ^^^                    ^^^

So splitting just based on spaces wouldn't do the job, because:

>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
#                  ^                                     ^

This is where the re module comes in handy, as it allows you to build powerful conditions. For example, \b inside a regexp means:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

So r'\bfoo\b' will match only the whole word foo. Also, don't forget to use re.escape() in case the needle contains regex metacharacters:

>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'

All you have to do now is use re.finditer() to scan the string. From the documentation:

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

I assume that matches are generated on the fly, so they never all have to be in memory at once (which may come in handy with large strings that contain many matches). And in the end, just count them:

>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
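
Putting the pieces together (my own consolidation of the steps above, not part of the original answer):

import re

def find_needle(n, h):
    # Whole-word count: escape the needle, anchor it with \b, and stream the matches.
    pattern = re.compile(r'\b{}\b'.format(re.escape(n)))
    return sum(1 for _ in pattern.finditer(h))

>>> find_needle('needle', 'there is a needle in this but not thisneedle haystack')
1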

Answer 2:

This does not address the complexity issue but simplifies the code:

def find_needle(n, h):
    # split() with no separator splits on any whitespace run (spaces, tabs, newlines).
    return h.split().count(n)

Answer 3:

Actually, when you say O(n) you are forgetting that after matching the first letter, you have to match the remaining ones as well (match the n from needle against the sentence, then match the e, then the next e, and so on). You are essentially trying to replicate the functionality of grep, so you can look at the grep algorithm. You can do well by building a finite state machine. There are many links that can help you; for one, you could start from "How does grep run so fast?"
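
To illustrate the finite-state-machine idea, here is my own sketch (not grep's actual algorithm) using a KMP failure table, which scans the haystack once with no backtracking; the whole-word check with isalnum() is an approximation of \b that ignores underscores:

def count_word_kmp(needle, haystack):
    # Assumes a non-empty needle.
    # Build the KMP failure (partial-match) table for the needle.
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    # Scan the haystack once, following the automaton's transitions.
    count = 0
    k = 0
    for j, ch in enumerate(haystack):
        while k and ch != needle[k]:
            k = fail[k - 1]
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            start = j - len(needle) + 1
            # Approximate \b: reject matches glued to alphanumeric neighbours.
            left_ok = start == 0 or not haystack[start - 1].isalnum()
            right_ok = j == len(haystack) - 1 or not haystack[j + 1].isalnum()
            if left_ok and right_ok:
                count += 1
            k = fail[k - 1]
    return count

>>> count_word_kmp('needle', 'there is a needle in this but not thisneedle haystack')
1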

Answer 4:

You can use Counter:

from collections import Counter

def find_needle(n, h):
    # Count every word once, then look up the needle.
    return Counter(h.split())[n]

For example:

n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'

print(find_needle(n, h))

Output:

2

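One point worth adding (my note, not the answer's): building the Counter pays off when you look up many different needles in the same haystack, since the counting pass happens only once:

word_counts = Counter(h.split())   # one O(len(h)) pass
for w in ('portugal', 'lobito', 'fromportugal'):
    print(w, word_counts[w])       # each lookup is O(1)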

Answer 5:

This is still going to be O(n), but it uses the power of the re module and Python's generator expressions.

import re

def find_needle(n, h):
    # Use regex word boundaries; escape the needle in case it contains metacharacters.
    g = re.finditer(r'\b%s\b' % re.escape(n), h)
    return sum(1 for _ in g)  # count the matches without building a list

This should use far less memory than .split() for a relatively large haystack.

Note that this is not exactly the same as the code in the OP, because it will find not only 'needle' but also 'needle,' and 'needle.'. It will not find 'needles', though.
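
For example (my own check of that claim, using the function above):

>>> find_needle('needle', 'needle, needle. needles')
2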

Answer 6:

If you are concerned with the wall-clock time it takes (as distinct from the time complexity), multiprocess it. Basically, make n smaller. Here is an example that runs the count in two processes.

from multiprocessing import Pool

def find(word, string):
    # Substring count, as in the original answer (so 'thisneedle' matches too).
    return string.count(word)

def search_for_words(word, string):
    mid = len(string) // 2  # integer division; the original misspelled this variable
    parts = [(word, string[:mid]), (word, string[mid:])]
    # A Pool collects return values; the original's bare Process objects discarded them.
    with Pool(2) as pool:
        return sum(pool.starmap(find, parts))
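
Two caveats with this sketch (my notes, not the answer's): on platforms that spawn worker processes (Windows, and macOS by default), the Pool must be created under an if __name__ == '__main__': guard, and cutting the string in half can miscount a word that straddles the boundary.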

If it's O(n) you are worried about -

Tags: python, dynamic-programming