Regular Expressions in Data Science

Technology, 2023-12-27

Finding Regex Matches in String Vectors

The grep function takes your regex as the first argument, and the input vector as the second argument. If you pass value=FALSE or omit the value parameter then grep returns a new vector with the indexes of the elements in the input vector that could be (partially) matched by the regular expression. If you pass value=TRUE, then grep returns a vector with copies of the actual elements in the input vector that could be (partially) matched.

> grep("a+", c("abc", "def", "cba a", "aa"), perl=TRUE, value=FALSE)
[1] 1 3 4
> grep("a+", c("abc", "def", "cba a", "aa"), perl=TRUE, value=TRUE)
[1] "abc"   "cba a" "aa"

The grepl function takes the same arguments as the grep function, except for the value argument, which is not supported. grepl returns a logical vector with the same length as the input vector. Each element in the returned vector indicates whether the regex could find a match in the corresponding string element in the input vector.

> grepl("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1]  TRUE FALSE  TRUE  TRUE
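
A common use of grep and grepl is subsetting a vector. The sketch below shows that the index vector from grep, the logical mask from grepl, and grep with value=TRUE all select the same elements. The fruits vector is made-up sample data, not taken from the examples above.

```r
# Hypothetical sample data for illustration only
fruits <- c("apple", "kiwi", "banana", "fig")

idx  <- grep("an", fruits, perl = TRUE)    # integer positions of matching elements
mask <- grepl("an", fruits, perl = TRUE)   # one TRUE/FALSE per input element

by_index <- fruits[idx]                                    # subset by position
by_mask  <- fruits[mask]                                   # subset by logical mask
by_value <- grep("an", fruits, perl = TRUE, value = TRUE)  # let grep return the elements

print(idx)
print(mask)
print(by_mask)

# Elements that do NOT match: negate the mask
non_matching <- fruits[!mask]
print(non_matching)
```

Negating the logical mask is the safer way to select non-matching elements. Negative indexing with fruits[-idx] returns an empty vector when nothing matched, because idx is then integer(0).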

The regexpr function takes the same arguments as grepl. regexpr returns an integer vector with the same length as the input vector. Each element in the returned vector indicates the character position in each corresponding string element in the input vector at which the (first) regex match was found. A match at the start of the string is indicated with character position 1. If the regex could not find a match in a certain string, its corresponding element in the result vector is -1. The returned vector also has a match.length attribute. This is another integer vector with the number of characters in the (first) regex match in each string, or -1 for strings that didn’t match.

gregexpr is the same as regexpr, except that it finds all matches in each string. It returns a list with the same length as the input vector. Each element of the list is an integer vector with one entry for each match found in the corresponding string, giving the character position at which that match starts. Each of these integer vectors also has a match.length attribute with the lengths of all matches. If no match could be found in a particular string, the corresponding list element is still an integer vector, but with the single value -1.

> regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1]  1 -1  3  1
attr(,"match.length")
[1]  1 -1  1  2
> gregexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[[1]]
[1] 1
attr(,"match.length")
[1] 1
[[2]]
[1] -1
attr(,"match.length")
[1] -1
[[3]]
[1] 3 5
attr(,"match.length")
[1] 1 1
[[4]]
[1] 1
attr(,"match.length")
[1] 2
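
The positions and the match.length attribute can be combined with substring to pull out the first match by hand. The sketch below is one possible way to do that while keeping the result aligned with the input, using NA for strings without a match. The variable names (start, len, found, first_match) are illustrative.

```r
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl = TRUE)

start <- as.integer(m)             # start positions; as.integer drops the attributes
len   <- attr(m, "match.length")   # match lengths; -1 where nothing matched
found <- start != -1               # TRUE for elements that contain a match

# Extract the first match per element, keeping NA where there was none
first_match <- ifelse(found,
                      substring(x, start, start + len - 1),
                      NA_character_)
print(first_match)
```

Because the result has one entry per input string, it can be stored as a new column next to the original data without any re-alignment.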

Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input that you passed to regexpr or gregexpr. As the second argument, pass the result returned by regexpr or gregexpr. If you pass the result of regexpr, then regmatches returns a character vector with all the strings that were matched. This vector is shorter than the input vector if no match was found in some of the elements. If you pass the result of gregexpr, then regmatches returns a list with the same number of elements as the input vector. Each element is a character vector with all the matches of the corresponding element in the input vector, or an empty character vector, character(0), if that element had no matches.

> x <- c("abc", "def", "cba a", "aa")
> m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1] "a"  "a"  "aa"
> m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
[[2]]
character(0)
[[3]]
[1] "a" "a"
[[4]]
[1] "aa"
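
Since gregexpr combined with regmatches yields one character vector per input element, counting matches per string is a short step. The following sketch applies base R's lengths function to that list. The names all_matches and counts are illustrative.

```r
x <- c("abc", "def", "cba a", "aa")

# One character vector of matches per input element
all_matches <- regmatches(x, gregexpr("a+", x, perl = TRUE))

# Number of matches in each string; character(0) counts as zero
counts <- lengths(all_matches)
print(counts)

# Flatten to a single character vector of every match found
flat <- unlist(all_matches)
print(flat)
```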

Replacing Regex Matches in String Vectors

The sub function has three required parameters: a string with the regular expression, a string with the replacement text, and the input vector. sub returns a new vector with the same length as the input vector. If a regex match could be found in a string element, it is replaced with the replacement text. Only the first match in each string element is replaced. If no matches could be found in some strings, those are copied into the result vector unchanged.

Use gsub instead of sub to replace all regex matches in all the string elements in your vector. Other than replacing all matches, gsub works in exactly the same way, and takes exactly the same arguments.

R uses its own replacement string syntax. Even though R 4.0.0 uses the PCRE2 regex flavor when you pass perl=TRUE, it still uses the R replacement string syntax. There is no option to use the PCRE2 replacement string syntax.

You can use the backreferences \1 through \9 in the replacement text to reinsert text matched by a capturing group. You cannot use backreferences to groups 10 and beyond. If your regex has named groups, you can use numbered backreferences to the first 9 groups. There is no replacement text token for the overall match. Place the entire regex in a capturing group and then use \1 to insert the whole regex match.

> sub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc"   "def"     "cbzaz a" "zaaz"
> gsub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc"     "def"       "cbzaz zaz" "zaaz"
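
Backreferences become more useful with several capturing groups, because the replacement text can reorder them. As a small sketch, the code below rewrites ISO-style dates as day/month/year. The dates vector and the pattern are made-up examples, not part of the original text.

```r
# Hypothetical sample data for illustration only
dates <- c("2023-12-27", "no date here")

# Group 1 = year, group 2 = month, group 3 = day; the replacement reorders them
reordered <- gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", dates, perl = TRUE)
print(reordered)
```

Note the doubled backslashes: inside an R string literal, "\\1" is how the two-character replacement token \1 is written.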

When you pass perl=TRUE, you can use \U and \L in the replacement text to convert the text inserted by all following backreferences to uppercase or lowercase. Use \E to end the case conversion, so that following backreferences are inserted unchanged. These escapes do not affect literal text in the replacement string.

> sub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc"   "def"     "cbzAz a" "zAAz"
> gsub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc"     "def"       "cbzAz zAz" "zAAz"
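
The examples above only use \U. The sketch below adds \L and \E, using made-up input strings: title-casing words, lowercasing a match, and switching case conversion off partway through the replacement.

```r
# Uppercase the first letter of every word (\b is a word boundary)
title_case <- gsub("\\b([a-z])", "\\U\\1", "hello big world", perl = TRUE)
print(title_case)

# Lowercase the first run of capital letters
lowered <- sub("([A-Z]+)", "\\L\\1", "HELLO world", perl = TRUE)
print(lowered)

# \U applies to group 1 only; \E ends the conversion before group 2 is inserted
mixed <- sub("(\\w+)@(\\w+)", "\\U\\1\\E@\\2", "user@example", perl = TRUE)
print(mixed)
```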

A very powerful way of making replacements is to assign new values to regmatches when you call it with the result of gregexpr. The value you assign should be a list with as many elements as the original input vector. Each element should be a character vector with as many strings as there are matches in the corresponding input string, or character(0) where there are none. The original input vector is then modified so that every regex match is replaced with the corresponding text from the list.

> x <- c("abc", "def", "cba a", "aa")
> m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m) <- list(c("one"), character(0), c("two", "three"), c("four"))
> x
[1] "onebc"       "def"         "cbtwo three" "four"
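
The replacement list does not have to be typed out by hand. One option, sketched below, is to compute it from the matches themselves with lapply, which automatically produces a list of the right shape. Here each match is replaced by its uppercase form. The copy y is only used so that the original x stays untouched.

```r
x <- c("abc", "def", "cba a", "aa")
m <- gregexpr("a+", x, perl = TRUE)

y <- x   # work on a copy so x keeps its original contents

# Apply a function to every match; toupper(character(0)) stays character(0),
# so elements without matches are handled without special cases
regmatches(y, m) <- lapply(regmatches(x, m), toupper)
print(y)
```

Any function that maps a character vector to a character vector of the same length can take the place of toupper, which allows replacements that sub and gsub cannot express with a fixed replacement string.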

Regular expressions can conveniently be created using rex::rex().

Bonus Information:

Cheat Sheet: https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

More such simplified Data Science concepts will follow. If you liked this or have some feedback or follow-up questions please comment below.

Thanks for Reading!

Translated from: https://medium.com/swlh/regex-in-r-for-data-science-96e144530494
