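Python Regular Expressions in Detail

A regular expression (regex or regexp for short) is a pattern language for describing, searching, and manipulating strings. Python supports it through the built-in re module. Regular expressions make text processing quicker to write and more flexible, and they turn up in data cleaning, log analysis, and form validation alike.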

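1. What Is a Regular Expression?

A regular expression is a pattern built from literal characters and operators. It gives you a compact way to:
* Search text for a specific pattern.
* Replace the text that matches a pattern.
* Validate string formats (such as email addresses or phone numbers).
* Extract information from complex text.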
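
As a rough illustration of these four uses, the sketch below runs each one against a made-up log line. The log format, the address, and the variable names are invented for the example. The validation step uses re.fullmatch(), a standard re function that requires the whole string to match and that this article does not otherwise cover.

```python
import re

# Hypothetical log line, invented for this example.
log_line = "2023-10-26 ERROR user=alice@example.com failed login from 10.0.0.7"

# 1. Search: is there an ERROR marker anywhere in the line?
has_error = re.search(r"ERROR", log_line) is not None

# 2. Replace: mask anything that looks like a dotted IPv4 address.
masked = re.sub(r"\d+\.\d+\.\d+\.\d+", "<ip>", log_line)

# 3. Validate: does the string look like a YYYY-MM-DD date?
looks_like_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2023-10-26") is not None

# 4. Extract: pull out the value that follows "user=".
user = re.search(r"user=(\S+)", log_line).group(1)

print(has_error, looks_like_date, user)
print(masked)
```

The date check only tests the shape of the string; it would also accept an impossible date such as 2023-99-99.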

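2. The `re` Module in Python

The re module is the standard-library home for regular expressions in Python. It provides functions to compile patterns and to search, match, find, and replace text with them.

2.1 Common Functions in the `re` Module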

re.search(pattern, string, flags=0)
- Scans the whole string and returns the first location where pattern matches.
- Returns a match object if a match is found, otherwise None.
- The match object holds the matched text together with its start and end positions.
```python
import re

text = "The quick brown fox jumps over the lazy dog."
pattern = "fox"
match = re.search(pattern, text)

if match:
    print(f"Pattern '{match.group()}' found.")  # group() returns the matched string
    print(f"Start: {match.start()}")
    print(f"End: {match.end()}")
    print(f"Span: {match.span()}")  # span() returns a (start, end) tuple

# Output:
# Pattern 'fox' found.
# Start: 16
# End: 19
# Span: (16, 19)
```
re.match(pattern, string, flags=0)
- Tries to match pattern only at the start of the string.
- Returns a match object if the beginning of the string matches, otherwise None.
```python
import re

text1 = "Hello, world!"
text2 = "world, Hello!"
pattern = "Hello"

match1 = re.match(pattern, text1)
match2 = re.match(pattern, text2)

if match1:
    print(f"text1 matches at the start: '{match1.group()}'")
else:
    print("text1 does not match at the start.")

if match2:
    print(f"text2 matches at the start: '{match2.group()}'")
else:
    print("text2 does not match at the start.")

# Output:
# text1 matches at the start: 'Hello'
# text2 does not match at the start.
```
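
Because re.match() anchors only the start of the string, it can accept input with trailing junk, which matters when the goal is validation. The following is a minimal sketch with a made-up input string. It also uses re.fullmatch(), a standard re function that this article does not otherwise discuss.

```python
import re

candidate = "123abc"  # made-up input that should fail an "all digits" check

prefix_only = re.match(r"\d+", candidate)   # succeeds: only the start is anchored
anchored = re.match(r"\d+$", candidate)     # fails: $ also pins the end
whole = re.fullmatch(r"\d+", candidate)     # fails: the entire string must match

print(prefix_only.group())  # 123
print(anchored, whole)      # None None
```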
re.findall(pattern, string, flags=0)
- Finds every non-overlapping match of pattern in the string.
- Returns the matched strings as a list.
```python
import re

text = "apple banana cherry apple date"
pattern = "apple"
matches = re.findall(pattern, text)
print(f"All matches: {matches}")

# Output:
# All matches: ['apple', 'apple']
```
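
One behaviour worth knowing: when the pattern contains capture groups (the parentheses described in section 3), re.findall() returns the groups rather than the whole match. The sample text below is invented for illustration.

```python
import re

text = "width=80 height=24 depth=3"  # made-up sample data

# No groups: each item is the whole match.
whole = re.findall(r"\w+=\d+", text)

# One group: each item is just that group's text.
values = re.findall(r"\w+=(\d+)", text)

# Two groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)   # ['width=80', 'height=24', 'depth=3']
print(values)  # ['80', '24', '3']
print(pairs)   # [('width', '80'), ('height', '24'), ('depth', '3')]
```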
re.sub(pattern, repl, string, count=0, flags=0)
- Replaces every part of string that matches pattern with repl.
- count is optional and caps the number of replacements (the default, 0, replaces every match).
- repl can be a string or a function.
```python
import re

text = "The price is $10.99 and $20.50."
pattern = r"\$\d+\.\d+"  # a literal $, digits, a literal dot, digits
replacement = "USD"

new_text = re.sub(pattern, replacement, text)
print(f"Original text: {text}")
print(f"Modified text: {new_text}")

# Replace only the first match
new_text_one = re.sub(pattern, replacement, text, count=1)
print(f"Modified text (first only): {new_text_one}")

# Output:
# Original text: The price is $10.99 and $20.50.
# Modified text: The price is USD and USD.
# Modified text (first only): The price is USD and $20.50.
```
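
The third point, a function as repl, might look like the following sketch. The function receives each match object and returns the replacement string. The add_tax name and the 10% rate are assumptions made up for this example.

```python
import re

text = "The price is $10.99 and $20.50."

def add_tax(match):
    """Return the matched price increased by 10%, formatted with two decimals."""
    amount = float(match.group(1))  # group(1) is the numeric part without the $
    return f"${amount * 1.10:.2f}"

# The parentheses capture the number so the callback can convert it.
with_tax = re.sub(r"\$(\d+\.\d+)", add_tax, text)
print(with_tax)  # The price is $12.09 and $22.55.
```

A callback is useful whenever the replacement depends on the matched text, which a fixed replacement string cannot express.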
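re.compile(pattern, flags=0)
- Compiles the regular expression pattern into a pattern object.
- When the same pattern is used many times, compiling it once can improve performance and keeps the pattern in one place.
- The compiled object has search(), match(), findall(), sub(), and related methods.
```python
import re

# The dot before the top-level domain is escaped, and the class is [A-Za-z]
# (a "|" inside square brackets would match a literal pipe character).
email_pattern_compiled = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# Illustrative placeholder addresses
text = "My email is alice@example.com and another is bob@example.org"

# Use the compiled pattern
matches = email_pattern_compiled.findall(text)
print(f"Emails found: {matches}")

# Output:
# Emails found: ['alice@example.com', 'bob@example.org']
```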
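
A sketch of the reuse case: one compiled pattern applied to several lines. The date pattern and the sample lines are assumed for illustration. The module-level functions also cache recently compiled patterns, so in small scripts the gain is often readability as much as speed.

```python
import re

# Compile once, then reuse the pattern object for every line.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# Made-up sample lines.
lines = [
    "2023-10-26 backup started",
    "no date on this line",
    "finished on 2023-10-27, next run 2023-11-03",
]

# findall() collects every date; match() only checks the start of each line.
dates_per_line = [date_pattern.findall(line) for line in lines]
starts_with_date = [date_pattern.match(line) is not None for line in lines]

print(dates_per_line)    # [['2023-10-26'], [], ['2023-10-27', '2023-11-03']]
print(starts_with_date)  # [True, False, False]
```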

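3. Regular Expression Metacharacters

Metacharacters are characters with a special meaning in a pattern; they are not matched literally.

. (dot): Matches any single character except the newline \n.
- re.findall("a.b", "acb aab a-b a\nb") -> ['acb', 'aab', 'a-b']
^ (caret): Matches the start of the string.
- re.findall("^Hello", "Hello world") -> ['Hello']
- re.findall("^Hello", "world Hello") -> []
$ (dollar sign): Matches the end of the string (or the position just before a newline that ends the string).
- re.findall("world$", "Hello world") -> ['world']
- re.findall("world$", "Hello world!") -> []
* (asterisk): Matches the preceding character or group zero or more times.
- re.findall("ab*c", "ac abc abbc abbbc") -> ['ac', 'abc', 'abbc', 'abbbc']
+ (plus): Matches the preceding character or group one or more times.
- re.findall("ab+c", "ac abc abbc") -> ['abc', 'abbc']
? (question mark): Matches the preceding character or group zero or one time, making it optional.
- re.findall("ab?c", "ac abc abbc") -> ['ac', 'abc']
{m,n} (quantifier): Matches the preceding character or group at least m and at most n times.
- {m}: exactly m times.
- {m,}: m or more times.
- {,n}: at most n times (including zero).
- re.findall("a.{2,4}b", "axxxxb axxb ab") -> ['axxxxb', 'axxb']
[] (character set): Matches any one of the characters inside the brackets.
- [aeiou] matches any lowercase vowel.
- [0-9] matches any digit.
- [^0-9] matches any character that is not a digit (a leading ^ negates the set).
- re.findall("[aeiou]", "hello world") -> ['e', 'o', 'o']
| (alternation): Matches the pattern on either side of the |.
- re.findall("cat|dog", "I have a cat and a dog.") -> ['cat', 'dog']
\ (backslash): Escapes a special character so that it matches literally, or introduces a special sequence.
- To match a literal dot, write \. in the pattern.
- re.findall(r"www\.example\.com", "Visit www.example.com") -> ['www.example.com']
() (grouping): Groups several characters into one unit and captures the substring they match.
- match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Date: 2023-10-26")
- match.group(0) or match.group() returns the whole match.
- match.group(1) returns the first capture group (here, the year).
- match.group(2) returns the second capture group (here, the month).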
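4. Special Sequences

Special sequences are predefined character classes and position assertions, introduced by a backslash \.

\d: Matches any decimal digit. With the re.ASCII flag this is exactly [0-9]; by default, str patterns also accept other Unicode digits.
- re.findall(r"\d", "Phone: 123-456-7890") -> ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
\D: Matches any character that is not a digit (the complement of \d).
\w: Matches any word character: letters, digits, and the underscore. With re.ASCII this is exactly [a-zA-Z0-9_]; by default, str patterns also accept Unicode letters and digits.
- re.findall(r"\w+", "Hello_world 123!") -> ['Hello_world', '123']
\W: Matches any character that is not a word character (the complement of \w).
\s: Matches any whitespace character (space, tab, newline, and so on).
- re.findall(r"\s", "Hello world\tPython\nRegex") -> [' ', '\t', '\n']
\S: Matches any non-whitespace character.
\b: Matches a word boundary. For example, \bcat\b matches a standalone "cat" but not the "cat" inside "concatenate".
- re.findall(r"\bcat\b", "The cat sat on the concatenate.") -> ['cat']
\B: Matches a position that is not a word boundary.
- re.findall(r"\Bcat\B", "The concatenate has a cat.") -> ['cat'] (from "concatenate")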
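
For str patterns, \d and \w are Unicode-aware by default, so they match more than the ASCII ranges. The sketch below compares the default behaviour with the re.ASCII flag on a made-up string containing an accented letter and full-width digits, written as escapes so the example does not depend on file encoding.

```python
import re

# "café" followed by the full-width digits for 2023 (sample text invented here).
text = "caf\u00e9 \uff12\uff10\uff12\uff13"

unicode_words = re.findall(r"\w+", text)             # accented letter and digits count
ascii_words = re.findall(r"\w+", text, re.ASCII)     # only [a-zA-Z0-9_] counts
unicode_digits = re.findall(r"\d+", text)            # full-width digits count
ascii_digits = re.findall(r"\d+", text, re.ASCII)    # only [0-9] counts

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```

When input must be restricted to plain ASCII digits or letters, either pass re.ASCII or spell out the class, for example [0-9].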

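5. Regular Expression Flags

Flags change how a pattern is matched. They are passed through the flags parameter: the third argument of re.search(), re.match(), and re.findall(), and the second argument of re.compile(). For re.sub(), pass flags by keyword, because its fourth positional argument is count.

re.IGNORECASE or re.I: Performs case-insensitive matching.
- re.findall("apple", "Apple pie", re.IGNORECASE) -> ['Apple']
re.MULTILINE or re.M: Makes ^ and $ match at the start and end of every line, not only at the start and end of the whole string.
```python
import re

text = "Line 1\nLine 2\nLine 3"
print(re.findall("^Line", text))                # ['Line'] (first line only)
print(re.findall("^Line", text, re.MULTILINE))  # ['Line', 'Line', 'Line']
```
re.DOTALL or re.S: Makes the . metacharacter match every character, including the newline.
```python
import re

text = "Hello\nworld"
print(re.findall("Hello.world", text))             # [] (by default . does not match a newline)
print(re.findall("Hello.world", text, re.DOTALL))  # ['Hello\nworld']
```
re.VERBOSE or re.X: Lets you add whitespace and comments inside the pattern to make it more readable.

```python
import re

pattern = re.compile(r"""
    \d{3}  # three digits
    -      # a hyphen
    \d{3}  # three digits
    -      # a hyphen
    \d{4}  # four digits
""", re.VERBOSE)
print(pattern.findall("My number is 123-456-7890."))

# Output: ['123-456-7890']
```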
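
Several flags can be applied at once by combining them with the bitwise OR operator. A minimal sketch, assuming a made-up three-line log text:

```python
import re

text = "error: disk full\nERROR: network down\nwarning: low battery"

# Combine flags with the bitwise OR operator.
pattern = re.compile(r"^error: (.+)$", re.IGNORECASE | re.MULTILINE)
messages = pattern.findall(text)
print(messages)  # ['disk full', 'network down']

# For re.sub(), pass flags by keyword; the fourth positional argument is count.
relabelled = re.sub(r"^error", "E", text, flags=re.I | re.M)
print(relabelled.splitlines())  # ['E: disk full', 'E: network down', 'warning: low battery']
```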

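6. Raw Strings (`r"..."`)

When writing regular expressions in Python, raw strings (a string literal prefixed with r) are strongly recommended. In a raw string the backslash \ is kept as a literal character instead of starting a Python escape sequence. This matters because regular expressions rely heavily on \ for special sequences (such as \d and \s) and for escaping metacharacters (such as \.).

```python
import re

# Without a raw string, Python turns '\b' into a backspace character (\x08)
# before re sees it, so the pattern looks for literal backspaces and finds nothing.
print(re.findall("\bword\b", "a word b"))
# Output: []

# With a raw string, '\b' reaches the re module intact and means a word boundary.
print(re.findall(r"\bword\b", "a word b"))
# Output: ['word']
```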
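
To see what Python actually hands to the re module, it can help to inspect the two string literals directly. A short sketch; the variable names are chosen for this example:

```python
import re

plain = "\bword\b"   # each "\b" is one backspace character (0x08)
raw = r"\bword\b"    # each "\b" stays as a backslash followed by "b"

print(len(plain), len(raw))  # 6 8
print(plain[0] == "\x08")    # True
print(raw[:2] == "\\b")      # True

# A raw string is equivalent to doubling every backslash in a normal string.
print(r"\d+" == "\\d+")      # True

plain_result = re.findall(plain, "a word b")
raw_result = re.findall(raw, "a word b")
print(plain_result, raw_result)  # [] ['word']
```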

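Summary

The re module offers a powerful, flexible toolkit for complex string matching and manipulation. Once you understand its core functions, metacharacters, special sequences, and flags, you can apply regular expressions efficiently to real problems. The syntax can look dense at first, but with practice it becomes a dependable part of your programming toolbox.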
