Q: How to get Scala string split to match Python

I am using spark-shell and pyspark to do a word count on one article. In Scala, flatMap over line.split(" ") and, in Python, split() give different word counts (the Scala count is higher). I tried split(" +") and split("\W+") in the Scala code, but cannot get the count down to match the Python one.

Does anyone know what pattern would match Python exactly?

Answer 1:

Python's str.split() has some special behaviour when called with the default separator. From the Python documentation:

runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

For example, ' 1 2 3 '.split() returns ['1', '2', '3']

The easiest way to fully match this in Scala is probably like this:

scala> """\S+""".r.findAllIn(" 1  2   3  ").toList
res0: List[String] = List(1, 2, 3)

scala> """\S+""".r.findAllIn("   ").toList
res1: List[String] = List()

scala> """\S+""".r.findAllIn("").toList
res2: List[String] = List()
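
To see why the plain split(" ") from the question produces a higher count, here is a minimal sketch that compares it with the \S+ approach. The object name PySplit and the helper pySplit are made up for illustration; they are not part of Scala or Spark.

```scala
object PySplit {
  // Hypothetical helper: approximates Python's str.split() with no arguments
  // by collecting every run of non-whitespace characters.
  private val NonSpace = """\S+""".r

  def pySplit(s: String): List[String] = NonSpace.findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    val line = " 1  2   3  "

    // Splitting on a single space yields an empty string for every extra
    // space (only trailing empty strings are dropped), which inflates
    // the word count.
    println(line.split(" ").length)  // 7

    // Runs of non-whitespace give only the actual words.
    println(pySplit(line).length)    // 3
  }
}
```

In a word count job the helper would take the place of line.split(" ") inside the flatMap call.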

Another way is to trim() the string beforehand:

scala> " 1  2   3  ".trim().split("""\s+""")
res3: Array[String] = Array(1, 2, 3)

But that doesn't have the same behaviour as Python for empty strings:

scala> "".trim().split("""\s+""")
res4: Array[String] = Array("")

In Scala, split() on an empty string returns an array with one element (the empty string itself), but in Python the result is a list with zero elements.
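
If you prefer the trim() approach, one possible way to cover the empty-string case is to filter out empty tokens afterwards. This is only a sketch; TrimSplit and trimSplit are hypothetical names chosen for the example.

```scala
object TrimSplit {
  // Hypothetical helper: trim, split on runs of whitespace, then drop the
  // single empty token that split() returns for an empty input string.
  def trimSplit(s: String): Array[String] =
    s.trim.split("""\s+""").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    println(trimSplit(" 1  2   3  ").toList)  // List(1, 2, 3)
    println(trimSplit("").toList)             // List()
    println(trimSplit("   ").toList)          // List()
  }
}
```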

python  scala  split  apache-spark