Python Regular Expressions: Basic and Advanced Usage Explained

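A regular expression (regex or regexp for short) is a sequence of characters that describes a pattern to match in text. Python's built-in re module lets you use regular expressions to search, replace, split, and extract strings. This article covers the basics of Python regular expressions and then their more advanced features.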

I. Regular expression basics

1. Importing the `re` module

Before using regular expressions, import Python's re module:

```python
import re
```

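2. Basic matching functions

The re module provides the following functions for working with regular expressions:

* re.match(pattern, string, flags=0): tries to match the pattern at the start of the string only. Returns a match object on success, otherwise None.
* re.search(pattern, string, flags=0): scans the whole string for the first location where the pattern matches. Returns a match object on success, otherwise None.
* re.findall(pattern, string, flags=0): finds all non-overlapping matches and returns them as a list of strings. If the pattern contains capture groups, the list holds the group contents instead of the whole matches.
* re.finditer(pattern, string, flags=0): finds all non-overlapping matches and returns an iterator that yields one match object per match.
* re.sub(pattern, repl, string, count=0, flags=0): replaces the parts of the string that match the pattern with repl. count sets the maximum number of replacements; the default 0 replaces every match.
* re.split(pattern, string, maxsplit=0, flags=0): splits the string into a list wherever the pattern matches. maxsplit sets the maximum number of splits; the default 0 means no limit.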

Example:

```python
import re

text = "Python is fun, Python is powerful."

# re.match: match at the start of the string
match_obj = re.match(r"Python", text)
print(f"Match object (match): {match_obj}")  # <re.Match object; span=(0, 6), match='Python'>

match_obj = re.match(r"fun", text)
print(f"Match object (match 'fun'): {match_obj}")  # None ('fun' is not at the start)

# re.search: find the first match anywhere in the string
search_obj = re.search(r"Python", text)
print(f"Search object: {search_obj}")  # <re.Match object; span=(0, 6), match='Python'>

search_obj = re.search(r"fun", text)
print(f"Search object ('fun'): {search_obj}")  # <re.Match object; span=(10, 13), match='fun'>

# re.findall: find all matches
all_matches = re.findall(r"Python", text)
print(f"All matches: {all_matches}")  # ['Python', 'Python']

# re.sub: replace
replaced_text = re.sub(r"Python", "Java", text)
print(f"Replaced text: {replaced_text}")  # Java is fun, Java is powerful.

# re.split: split
split_text = re.split(r", ", text)
print(f"Split text: {split_text}")  # ['Python is fun', 'Python is powerful.']
```
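The count and maxsplit parameters, and the way re.findall treats capture groups, are easy to overlook. The following is a small sketch reusing the same sample sentence; the variable names are only illustrative.

```python
import re

text = "Python is fun, Python is powerful."

# count=1 limits re.sub to the first match only
first_only = re.sub(r"Python", "Java", text, count=1)
print(first_only)  # Java is fun, Python is powerful.

# maxsplit=1 splits at the first run of whitespace and leaves the rest intact
head, rest = re.split(r"\s+", text, maxsplit=1)
print(head)  # Python
print(rest)  # is fun, Python is powerful.

# With capture groups, re.findall returns tuples of groups, not whole matches
pairs = re.findall(r"(\w+) is (\w+)", text)
print(pairs)  # [('Python', 'fun'), ('Python', 'powerful')]
```

Passing count and maxsplit by keyword keeps the call readable and avoids confusing them with the flags argument.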

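3. Match objects

When re.match or re.search succeeds, it returns a match object that carries the details of the match:

* group(0) or group(): returns the whole matched string.
* group(n): returns the string matched by the n-th capture group.
* groups(): returns a tuple with the strings matched by all capture groups.
* start(): returns the index where the match starts.
* end(): returns the index where the match ends (exclusive).
* span(): returns the tuple (start, end) with the start and end indices of the match.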

Example:

```python
import re

text = "My email is user@example.com"
# The dot is escaped so that it matches a literal "."
match = re.search(r"(\w+)@(\w+)\.(\w+)", text)

if match:
    print(f"Full match: {match.group(0)}")  # user@example.com
    print(f"Username: {match.group(1)}")  # user
    print(f"Domain: {match.group(2)}")  # example
    print(f"Top-level domain: {match.group(3)}")  # com
    print(f"All groups: {match.groups()}")  # ('user', 'example', 'com')
    print(f"Start index: {match.start()}")  # 12
    print(f"End index: {match.end()}")  # 28
    print(f"Span: {match.span()}")  # (12, 28)
```

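4. Metacharacters

Metacharacters are characters that have a special meaning in a regular expression. Quantifiers apply to the preceding element, which can be a single character, a character class, or a group.

| Character | Description | Example | Matches |
| --- | --- | --- | --- |
| `.` | Matches any single character except the newline `\n`. | `a.c` | `abc`, `axc` |
| `^` | Matches the start of the string. | `^Hello` | strings that start with `Hello` |
| `$` | Matches the end of the string (or just before a trailing newline). | `world$` | strings that end with `world` |
| `*` | Matches the preceding element zero or more times. | `ab*c` | `ac`, `abc`, `abbc` |
| `+` | Matches the preceding element one or more times. | `ab+c` | `abc`, `abbc` |
| `?` | Matches the preceding element zero or one time. | `ab?c` | `ac`, `abc` |
| `{n}` | Matches the preceding element exactly `n` times. | `a{3}b` | `aaab` |
| `{n,}` | Matches the preceding element at least `n` times. | `a{2,}b` | `aab`, `aaab` |
| `{n,m}` | Matches the preceding element at least `n` and at most `m` times. | `a{2,4}b` | `aab`, `aaab`, `aaaab` |
| `[]` | Matches any one of the characters inside the brackets. | `[abc]` | `a`, `b`, `c` |
| `[^]` | Matches any one character that is not listed inside the brackets. | `[^abc]` | any character other than `a`, `b`, `c` |
| `\|` | Alternation: matches the expression before or after the `\|`. | `cat\|dog` | `cat` or `dog` |
| `()` | Grouping: treats several characters as one unit and creates a capture group. | `(ab)+` | `ab`, `abab` |
| `\` | Escape: turns a special character into a literal one, or starts a special sequence such as `\d`. | `\.` | a literal `.` |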
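To see how the quantifiers, grouping, and escaping rules differ in practice, here is a minimal sketch. It uses re.fullmatch, which is not covered above but is part of the standard re module and succeeds only when the whole string matches; the helper full_matches and the candidate strings are made up for illustration.

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc"]

def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(full_matches(r"ab*c"))      # ['ac', 'abc', 'abbc', 'abbbc']
print(full_matches(r"ab+c"))      # ['abc', 'abbc', 'abbbc']
print(full_matches(r"ab?c"))      # ['ac', 'abc']
print(full_matches(r"ab{2,3}c"))  # ['abbc', 'abbbc']

# A quantifier applies to the preceding element: one character or a whole group
print(bool(re.fullmatch(r"(ab)+", "abab")))  # True
print(bool(re.fullmatch(r"ab+", "abab")))    # False

# An unescaped dot matches any character; \. matches only a literal dot
print(re.findall(r"a.c", "abc a.c"))   # ['abc', 'a.c']
print(re.findall(r"a\.c", "abc a.c"))  # ['a.c']
```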

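5. Common special sequences

These backslash sequences are shorthand for frequently used character classes and positions. For str patterns in Python 3 they are Unicode-aware by default; the `[0-9]`-style equivalents below hold exactly only with the re.ASCII flag.

| Sequence | Description | Example | Matches |
| --- | --- | --- | --- |
| `\d` | Matches any decimal digit. Equivalent to `[0-9]` in ASCII mode. | `\d{3}` | `123`, `456` |
| `\D` | Matches any non-digit character. Equivalent to `[^0-9]` in ASCII mode. | `\D` | `a`, `B`, `@` |
| `\w` | Matches any letter, digit, or underscore. Equivalent to `[a-zA-Z0-9_]` in ASCII mode. | `\w+` | `word`, `_var1` |
| `\W` | Matches any character that is not a letter, digit, or underscore. Equivalent to `[^a-zA-Z0-9_]` in ASCII mode. | `\W` | a space, `!`, `$` |
| `\s` | Matches any whitespace character (space, tab, newline, and so on). | `\s` | a space, `\t`, `\n` |
| `\S` | Matches any non-whitespace character. | `\S` | `a`, `1`, `$` |
| `\b` | Matches a word boundary. | `\bcat\b` | the standalone word `cat` |
| `\B` | Matches a position that is not a word boundary. | `\Bcat\B` | the `cat` inside `concatenate` (but not the one at the start of `category`) |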

Example:

```python
import re

text = "The price is $12.99, order_id: ABC-123."

# Match runs of digits
numbers = re.findall(r"\d+", text)
print(f"Numbers: {numbers}")  # ['12', '99', '123']

# Match words
words = re.findall(r"\b\w+\b", text)
print(f"Words: {words}")  # ['The', 'price', 'is', '12', '99', 'order_id', 'ABC', '123']

# Match an email address (simplified pattern; the dot before the TLD is escaped)
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# info@example.com is a placeholder address
email_match = re.search(email_pattern, "Contact us at info@example.com.")
if email_match:
    print(f"Found email: {email_match.group(0)}")  # info@example.com
```
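The difference between `\b` and `\B`, and between Unicode and ASCII matching of `\d`, is easier to see side by side. This is a small sketch; the sample strings are invented for the demonstration, and the fullwidth digits are written as escapes so the source stays unambiguous.

```python
import re

text = "cat category concatenate bobcat"

# \bcat\b needs a word boundary on both sides: only the standalone word
print(re.findall(r"\bcat\b", text))  # ['cat']

# \Bcat\B needs word characters on both sides: only the "cat" inside "concatenate"
print(re.findall(r"\w*\Bcat\B\w*", text))  # ['concatenate']

# Fullwidth digits (U+FF11 to U+FF13) followed by ordinary ASCII digits
digits = "\uff11\uff12\uff13 and 456"
unicode_digits = re.findall(r"\d+", digits)
ascii_digits = re.findall(r"\d+", digits, re.ASCII)
print(len(unicode_digits))  # 2 (both runs count as digits by default)
print(ascii_digits)         # ['456']
```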

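6. Compiling regular expressions

If you use the same regular expression many times, you can compile it into a pattern object once and reuse it. The re module also caches recently compiled patterns, so the speed gain is usually modest; the main benefit is that the pattern gets a name and its own methods.

```python
import re

# Compile the regular expression once
pattern = re.compile(r"\d+")

text1 = "There are 123 apples."
text2 = "I have 456 oranges."

# Use the compiled pattern object
matches1 = pattern.findall(text1)
matches2 = pattern.findall(text2)

print(f"Matches in text1: {matches1}")  # ['123']
print(f"Matches in text2: {matches2}")  # ['456']
```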
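A compiled pattern object offers the same operations as the module-level functions, plus a few extras such as an optional start position for search. The sketch below assumes a made-up input line; the names number and line are illustrative only.

```python
import re

number = re.compile(r"\d+")
line = "id=17 qty=250 price=99"

# The original pattern text stays available on the compiled object
print(number.pattern)  # \d+

first = number.search(line).group()
print(first)  # 17

# The second argument (pos) makes the scan start at index 5 instead of 0
after_pos = number.search(line, 5).group()
print(after_pos)  # 250

masked = number.sub("#", line)
print(masked)  # id=# qty=# price=#
```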

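II. Advanced regular expression usage

1. Flags

The re module provides flags that change how a regular expression matches:

* re.IGNORECASE or re.I: case-insensitive matching.
* re.MULTILINE or re.M: multi-line mode. ^ and $ match at the start and end of every line, not only at the start and end of the string.
* re.DOTALL or re.S: makes . match every character, including the newline.
* re.VERBOSE or re.X: verbose mode. Whitespace in the pattern and everything after a # is ignored, which makes complex patterns easier to read.
* re.ASCII or re.A: makes \w, \b, \s, \d and their uppercase counterparts match ASCII characters only.
* re.UNICODE or re.U: makes \w, \b, \s, \d and so on match Unicode characters. In Python 3 this is already the default for str patterns, so the flag is redundant.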

Example:

```python
import re

text = "Hello\nworld"

# re.DOTALL: . also matches the newline
match_dotall = re.search(r"Hello.world", text, re.DOTALL)
print(f"DOTALL match: {match_dotall.group(0)!r}")  # 'Hello\nworld'

# re.MULTILINE: ^ and $ match at the start and end of each line
text_multi = "Line 1\nLine 2\nLine 3"
lines = re.findall(r"^Line", text_multi, re.MULTILINE)
print(f"Multiline lines: {lines}")  # ['Line', 'Line', 'Line']

# re.IGNORECASE: case-insensitive matching
case_insensitive = re.search(r"python", "Python is great!", re.IGNORECASE)
print(f"Case insensitive match: {case_insensitive.group(0)}")  # Python

# re.VERBOSE: better readability
verbose_pattern = re.compile(r"""
    ^(\w+)      # leading word: the user name
    @           # the @ sign
    ([\w.-]+)   # the domain name
    \.          # a literal dot
    (\w+)$      # the top-level domain
""", re.VERBOSE)
email_match = verbose_pattern.match("user@domain.com")
if email_match:
    print(f"Verbose email match: {email_match.groups()}")  # ('user', 'domain', 'com')
```
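Flags can be combined with the bitwise OR operator, or written inline at the start of the pattern as `(?im)` and similar. A minimal sketch, assuming a small invented log string:

```python
import re

log = "ERROR: disk full\ninfo: retrying\nError: timeout"

# Combine flags with | : case-insensitive and line-by-line anchoring
messages = re.findall(r"^error: (.+)$", log, re.IGNORECASE | re.MULTILINE)
print(messages)  # ['disk full', 'timeout']

# The same flags written inline: i = IGNORECASE, m = MULTILINE
inline = re.findall(r"(?im)^error: (.+)$", log)
print(inline)  # ['disk full', 'timeout']
```

Inline flags are handy when the pattern is stored as plain text, for example in a configuration file, and no flags argument can be passed.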

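2. Non-greedy matching

By default the quantifiers *, +, ?, and {m,n} are greedy: they match as many characters as possible. Adding a ? after the quantifier makes it non-greedy, so it matches as few characters as possible.

| Greedy quantifier | Non-greedy quantifier |
| --- | --- |
| `*` | `*?` |
| `+` | `+?` |
| `?` | `??` |
| `{m,n}` | `{m,n}?` |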

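Example:

```python
import re

# Sample HTML; the tag name <p> is illustrative
html_text = "<p>Hello</p><p>World</p>"

# Greedy: matches everything from the first <p> to the last </p>
greedy_match = re.search(r"<p>.*</p>", html_text)
print(f"Greedy match: {greedy_match.group(0)}")  # <p>Hello</p><p>World</p>

# Non-greedy: matches from the first <p> to the nearest </p>
non_greedy_match = re.search(r"<p>.*?</p>", html_text)
print(f"Non-greedy match: {non_greedy_match.group(0)}")  # <p>Hello</p>
```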
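The same greedy versus non-greedy contrast applies to quoted strings and to the `{m,n}` quantifier. A short sketch with invented sample data:

```python
import re

line = 'say "hello" and "bye"'

# Greedy: from the first quote to the last quote
greedy = re.findall(r'".*"', line)
print(greedy)  # ['"hello" and "bye"']

# Non-greedy: each quoted string separately
lazy = re.findall(r'".*?"', line)
print(lazy)  # ['"hello"', '"bye"']

# {m,n} takes as many repetitions as allowed, {m,n}? as few as allowed
most = re.search(r"\d{2,4}", "123456").group()
fewest = re.search(r"\d{2,4}?", "123456").group()
print(most)    # 1234
print(fewest)  # 12
```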

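3. Lookahead and lookbehind assertions

Lookahead and lookbehind are zero-width assertions: they match a position rather than characters. They do not consume any of the string; they only assert whether a pattern is present directly after or directly before the current position.

* Positive lookahead `(?=...)`: asserts that `...` follows immediately after the current position.
* Negative lookahead `(?!...)`: asserts that `...` does not follow immediately after the current position.
* Positive lookbehind `(?<=...)`: asserts that `...` ends immediately before the current position.
* Negative lookbehind `(?<!...)`: asserts that `...` does not end immediately before the current position.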

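Example:

```python
import re

text = "apple, banana, orange, pineapple"

# Find 'pine' that is followed by 'apple' (only 'pine' is matched)
match_lookahead = re.search(r"pine(?=apple)", text)
print(f"Lookahead match: {match_lookahead.group(0)}")  # pine

# Find fruits that do not end in 'apple': the negative lookahead at the start
# of each word rejects words of the form <anything>apple
fruits_without_apple = re.findall(r"\b(?!\w*apple\b)\w+\b", text)
print(f"Fruits without 'apple': {fruits_without_apple}")  # ['banana', 'orange']

# Find a ',' that is preceded by 'apple' (only ',' is matched)
match_lookbehind = re.search(r"(?<=apple),", text)
print(f"Lookbehind match: {match_lookbehind.group(0)}")  # ,

# Find words that are not preceded by 'apple ' (with a space). Every word
# qualifies here, because the text always has a comma between the words.
words_not_after_apple = re.findall(r"(?<!apple )\b\w+\b", text)
print(f"Words not after 'apple': {words_not_after_apple}")  # ['apple', 'banana', 'orange', 'pineapple']
```

**Note:** For the lookbehind assertions `(?<=...)` and `(?<!...)`, Python's re module requires the `...` part to match strings of a fixed length. Patterns such as `abc` or `a|b` are allowed; `a*` and `a{3,4}` are not.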
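The fixed-width restriction on lookbehind can be checked directly. The sketch below uses invented price strings; it shows a fixed-width lookbehind that works, a variable-width one that re refuses to compile, and one possible workaround using a capture group.

```python
import re

# Fixed-width lookbehind: digits directly preceded by a dollar sign
prices = re.findall(r"(?<=\$)\d+", "cost $15, tax $3, qty 7")
print(prices)  # ['15', '3']

# Variable-width lookbehind (\s* can match any number of spaces) is rejected
try:
    re.compile(r"(?<=\$\s*)\d+")
    rejected = False
except re.error as exc:
    rejected = True
    print(f"re.error: {exc}")
print(rejected)  # True

# Workaround: consume the variable-width prefix and capture only the digits
spaced = re.findall(r"\$\s*(\d+)", "cost $ 15, tax $3")
print(spaced)  # ['15', '3']
```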

4. Named capture groups

The (?P<name>...) syntax gives a capture group a name, so the matched content can be accessed by name instead of by numeric index, which makes the code easier to read.

```python
import re

text = "Name: Alice, Age: 30"
match = re.search(r"Name: (?P<name>\w+), Age: (?P<age>\d+)", text)

if match:
    print(f"Name: {match.group('name')}")  # Alice
    print(f"Age: {match.group('age')}")  # 30
    print(f"Dictionary of groups: {match.groupdict()}")  # {'name': 'Alice', 'age': '30'}
```
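Named groups can also be referenced again, inside the pattern with `(?P=name)` and inside a replacement string with `\g<name>`. A brief sketch; the group names and sample strings are chosen only for illustration.

```python
import re

# (?P=word) matches the same text that the group "word" captured: doubled words
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")
print(doubled.group("word"))  # is

# \g<name> inserts a named group into the replacement: reorder a date
swapped = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    "2024-03-15",
)
print(swapped)  # 15/03/2024
```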

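5. Conditional matching

Python's re module does support a conditional construct, (?(id/name)yes-pattern|no-pattern), which chooses a branch depending on whether the given group took part in the match. It can only test whether a group matched, though. For more complex conditional logic, passing a function as the repl argument of re.sub is usually simpler.

```python
import re

def replace_func(match):
    """Choose the replacement text based on the matched title."""
    name = match.group(1)
    if name.startswith("Mr"):
        return f"Hello, {name}!"
    else:
        return f"Hi, {name}."

text = "Hello Mr. Smith and Ms. Jones."

# Find the names; the dots after the titles are escaped
replaced_text = re.sub(r"(Mr\.\s\w+|Ms\.\s\w+)", replace_func, text)
print(f"Conditional replacement: {replaced_text}")

# Output: Conditional replacement: Hello Hello, Mr. Smith! and Hi, Ms. Jones..
```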
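For reference, the re module does accept the `(?(id/name)yes-pattern|no-pattern)` form, which tests whether an earlier group matched. The sketch below is one minimal use of it, with a made-up pattern name balanced: a closing parenthesis is required only when an opening one was captured.

```python
import re

# Group 1 optionally captures "("; (?(1)\)) demands ")" only if group 1 matched
balanced = re.compile(r"(\()?\d+(?(1)\))")

results = {s: bool(balanced.fullmatch(s)) for s in ["42", "(42)", "(42", "42)"]}
for candidate, ok in results.items():
    print(candidate, ok)
# 42 True
# (42) True
# (42 False
# 42) False
```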

6. The `re.finditer()` iterator

When you need to process the matches one by one, re.finditer() is more memory-efficient than re.findall(): it returns an iterator that yields match objects lazily instead of building a list of all results up front.

```python
import re

text = "The quick brown fox jumps over the lazy dog."
for match in re.finditer(r"\b\w{4}\b", text):  # every four-letter word
    print(f"Found 4-letter word: {match.group(0)} at {match.span()}")

# Output:
# Found 4-letter word: over at (26, 30)
# Found 4-letter word: lazy at (35, 39)
```
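Because the iterator is lazy, a caller can stop after the first match it needs. A small sketch using the same sentence, with next() pulling a single match:

```python
import re

text = "The quick brown fox jumps over the lazy dog."
matches = re.finditer(r"\b\w{5}\b", text)  # five-letter words

# Only the first match is computed at this point
first = next(matches)
print(first.group(), first.span())  # quick (4, 9)

# The remaining matches are produced on demand
remaining = [m.group() for m in matches]
print(remaining)  # ['brown', 'jumps']
```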

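III. Summary

Python's re module brings full regular expression support to string processing. From basic pattern matching, searching, and replacing to flags, non-greedy matching, lookahead and lookbehind, and named capture groups, regular expressions offer a compact way to handle complex text-processing tasks. Knowing them well pays off in data cleaning, log analysis, and extracting data from scraped pages.

When writing regular expressions, keep the following in mind:
* Use raw strings: prefix the pattern with r (for example r"...") so that backslashes do not need to be escaped twice.
* Build up gradually: start with a simple pattern and add complexity step by step.
* Test: use an online regex tester or write test code to verify that the pattern behaves as expected.
* Readability: for complex patterns, use the re.VERBOSE flag and comments.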
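The raw-string advice matters most for sequences such as `\b`, which Python itself interprets as a backspace character in an ordinary string literal. A minimal sketch of the difference:

```python
import re

# In a normal string "\b" is one backspace character; in a raw string it is two characters
plain_len, raw_len = len("\b"), len(r"\b")
print(plain_len, raw_len)  # 1 2

# Without the r prefix the pattern looks for backspaces and finds nothing
without_raw = re.findall("\bcat\b", "cat")
with_raw = re.findall(r"\bcat\b", "cat")
escaped = re.findall("\\bcat\\b", "cat")  # works, but is harder to read
print(without_raw)  # []
print(with_raw)     # ['cat']
print(escaped)      # ['cat']
```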
