Q: Regex to remove punctuation removes non-punctuation characters in R

While filtering and cleaning text in Hebrew, I found that

gsub("[[:punct:]]", "", txt)

actually removes a relevant character. The character is "ק", which sits on the "E" key of the keyboard. Interestingly, the gsub function in R removes the "ק" character, and then all the words get messed up. Does anyone have an idea why?

Answer 1:

According to Regular Expressions as used in R:

Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.
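One way to check whether the locale is involved is to print the active character-type locale and test the letter directly. This is only a diagnostic sketch: the variable names are illustrative, the result of the default-engine test will vary by platform and locale, and the perl = TRUE comparison is included on the assumption that the PCRE engine may classify the letter differently.

```r
# The Hebrew letter qof (U+05E7), written as an escape so the script
# does not depend on the encoding of the source file
qof <- "\u05e7"

# Locale category that governs character classification
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# Does the default regex engine treat qof as punctuation in this locale?
tre_hit <- grepl("[[:punct:]]", qof)

# Same test with the PCRE engine, for comparison
pcre_hit <- grepl("[[:punct:]]", qof, perl = TRUE)

print(c(default = tre_hit, perl = pcre_hit))
```

If the first value is TRUE on your machine, the named class is matching more than ASCII punctuation there.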

In the POSIX locale, [[:punct:]] should match ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. In other locales the class may match more than that, so you might need to adjust your regex to remove only the characters you want:

# The 32 ASCII punctuation characters
txt <- "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
# Explicit character class listing only those characters
gsub("[\\\\!\"#$%&'()*+,./:;<=>?@[\\^\\]_`{|}~-]", "", txt, perl = TRUE)

Sample program output:

[1] ""
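Applied to Hebrew text, the same explicit class should leave the letters alone. The sample string below is made up for illustration (the word "שלום", a comma, the letter "ק" and an exclamation mark), and the variable names are not from the question.

```r
# Hypothetical sample text, written with \u escapes so the string is
# UTF-8 regardless of how the script file is encoded
txt <- "\u05e9\u05dc\u05d5\u05dd, \u05e7!"

# The explicit ASCII punctuation class from the answer above
ascii_punct <- "[\\\\!\"#$%&'()*+,./:;<=>?@[\\^\\]_`{|}~-]"

# Remove only the listed characters
cleaned <- gsub(ascii_punct, "", txt, perl = TRUE)
print(cleaned)

# Confirm that the letter qof survived the cleanup
qof_kept <- grepl("\u05e7", cleaned, fixed = TRUE)
print(qof_kept)
```

Because the class names each character literally, it does not rely on how the locale defines [[:punct:]].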

regex  r  punctuation