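A Must-Read for Beginners: Python Regular Expression Basics

When working with text data, regular expressions (often shortened to regex or regexp) are a powerful and flexible tool. They let you define a pattern and then match, find, or replace specific text in a string. Whether you are extracting information from log files, validating user input, or cleaning large amounts of text, regular expressions are well worth having in your toolkit.

Python ships with the re module, which provides full regular expression support. This article walks beginners through the basics and the most common operations of Python regular expressions, starting from zero.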

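1. What is a regular expression, and why use one?

A regular expression is essentially a tiny, highly specialized programming language that describes string patterns in a compact, declarative syntax.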

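Why use one?

Flexible text matching: Compared with string methods such as startswith(), endswith(), find(), and replace(), regular expressions handle more complex and more flexible matching needs.
Data extraction: Pull specific information out of unstructured text, such as email addresses, phone numbers, and dates.
Data validation: Check whether an input string follows a given format, such as password strength rules or URL format.
Text replacement and splitting: Replace or split strings in bulk based on a pattern.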
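
As a small illustration of the difference, the sketch below pulls every date out of a log line. The log text and variable names are invented for this example. A string method can check for a substring that is known in advance, while a pattern describes the shape of the data.

```python
import re

# Hypothetical log line, invented for this illustration.
log_line = "2024-01-01 ERROR disk full; retry scheduled for 2024-01-02"

# A string operation can only locate a substring that is known in advance.
has_error = "ERROR" in log_line

# A pattern describes the shape of the data: four digits, two digits, two digits.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log_line)

print(has_error)  # True
print(dates)      # ['2024-01-01', '2024-01-02']
```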

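2. Python's `re` module

To use regular expressions in Python, first import the re module:

```python
import re
```

The re module provides a number of functions for performing regular expression operations.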

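3. Basic regular expression syntax

A regular expression is made up of ordinary characters (such as letters and digits) and special characters (called "metacharacters").

3.1 Literal characters

Most characters are ordinary characters that match themselves in a pattern.
For example, "hello" matches the literal text "hello" in a string.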

```python
import re

pattern = "hello"
text = "hello world"
match = re.search(pattern, text)
print(match)  # <re.Match object; span=(0, 5), match='hello'>
```

3.2 Metacharacters

Metacharacters are what give regular expressions their matching power.

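Metacharacter	Description	Example	Matches
`.`	Matches any single character except the newline `\n`.	`a.b`	`acb`, `aab`, `a1b`, etc.
`^`	Matches the start of the string.	`^hello`	Strings that start with `hello`
`$`	Matches the end of the string (or just before a trailing newline).	`world$`	Strings that end with `world`
`*`	Matches the preceding element (a character, character class, or group) zero or more times.	`a*b`	`b`, `ab`, `aab`, `aaab`, etc.
`+`	Matches the preceding element one or more times.	`a+b`	`ab`, `aab`, `aaab`, etc. (not `b`)
`?`	Matches the preceding element zero or one time.	`a?b`	`b`, `ab`
`{n}`	Matches the preceding element exactly `n` times.	`a{3}b`	`aaab`
`{n,}`	Matches the preceding element at least `n` times.	`a{2,}b`	`aab`, `aaab`, etc.
`{n,m}`	Matches the preceding element between `n` and `m` times.	`a{1,3}b`	`ab`, `aab`, `aaab`
`[]`	Matches any one of the characters inside the brackets.	`[abc]`	Any one of `a`, `b`, `c`
`[^]`	Matches any one character that is not inside the brackets.	`[^abc]`	Any character except `a`, `b`, `c`
`|`	Matches the expression before or after the `|` (logical OR).	`cat|dog`	`cat` or `dog`
`()`	Capturing group: combines several characters into one unit and captures what it matches.	`(ab)+`	`ab`, `abab`, etc.
`\`	Escape character: turns a special character into a literal one, or introduces a special sequence.	`\.`, `\d`	A literal dot; a digit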
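
To make the table concrete, here is a minimal sketch that checks a handful of the patterns above against sample strings. It uses re.fullmatch(), which is not covered in this article but is part of the re module and succeeds only when the entire string matches. The list of cases is chosen for illustration.

```python
import re

# Each entry: (pattern, candidate string, whether the whole string should match).
cases = [
    (r"a.b", "a1b", True),
    (r"a*b", "b", True),
    (r"a+b", "b", False),
    (r"a?b", "ab", True),
    (r"a{2,}b", "ab", False),
    (r"[^abc]", "d", True),
    (r"cat|dog", "dog", True),
    (r"(ab)+", "abab", True),
    (r"\.", "x", False),
]

results = []
for pattern, candidate, expected in cases:
    # re.fullmatch returns a match object only if the entire string matches.
    matched = re.fullmatch(pattern, candidate) is not None
    results.append(matched)
    print(f"{pattern!r:12} {candidate!r:8} -> {matched}")
```

Changing a candidate string and re-running is a quick way to test your reading of each metacharacter.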

3.3 Common special sequences (character classes)

These are shorthands formed by \ followed by a specific character, and they are very handy:

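Sequence	Description	Equivalent to (ASCII-only matching)
`\d`	Matches any decimal digit (0-9; for str patterns also other Unicode digits unless `re.ASCII` is set).	`[0-9]`
`\D`	Matches any character that is not a digit.	`[^0-9]`
`\w`	Matches any word character (letters, digits, or underscore; Unicode-aware for str patterns).	`[a-zA-Z0-9_]`
`\W`	Matches any character that is not a word character.	`[^a-zA-Z0-9_]`
`\s`	Matches any whitespace character (space, tab, newline, etc.).	`[ \t\n\r\f\v]`
`\S`	Matches any character that is not whitespace.	`[^ \t\n\r\f\v]`
`\b`	Matches a word boundary.	None (zero-width assertion)
`\B`	Matches a position that is not a word boundary.	None (zero-width assertion)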
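
The sketch below shows two details of these sequences that are easy to miss: for str patterns `\d` also matches decimal digits from other scripts unless re.ASCII is passed, and `\b` matches a position rather than a character. The sample strings are made up for illustration.

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit outside 0-9.
text = "room 42, floor \u0663"

unicode_digits = re.findall(r"\d", text)          # default behavior for str patterns
ascii_digits = re.findall(r"\d", text, re.ASCII)  # restricts \d to [0-9]

print(len(unicode_digits))  # 3
print(ascii_digits)         # ['4', '2']

# \b only matches where a word character meets a non-word character (or the string edge),
# so "cat" inside "concat" or "cats" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
print(whole_words)          # ['cat', 'cat']
```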

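Note: When writing a regular expression in Python, prefer a raw string, that is, a string literal prefixed with r. This avoids having to escape backslashes twice. For example, r'\n' is the two characters \ and n, which the regex engine then interprets as a newline. By contrast, '\b' in a normal string is a backspace character, so the word-boundary sequence has to be written as r'\b' (or '\\b').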
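
A minimal sketch of why the r prefix matters, using the word-boundary sequence `\b` as the example. The sample text is made up.

```python
import re

# In a normal string literal, "\b" is the single backspace character.
print(len("\b"))   # 1
# In a raw string literal the backslash is kept, so there are two characters.
print(len(r"\b"))  # 2

text = "a cat sat"
normal = re.search("\bcat\b", text)  # searches for backspace + "cat" + backspace
raw = re.search(r"\bcat\b", text)    # searches for "cat" at word boundaries

print(normal)  # None
print(raw)     # <re.Match object; span=(2, 5), match='cat'>
```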

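```python
import re

# Match a single digit
print(re.search(r'\d', "hello123world"))  # <re.Match object; span=(5, 6), match='1'>

# Match three word characters (letters, digits, or underscore)
print(re.search(r'\w{3}', "Python"))  # <re.Match object; span=(0, 3), match='Pyt'>

# A simple pattern for an email address (the dot is escaped so it matches a literal ".")
email_pattern = r'\w+@\w+\.\w+'
# "user@example.com" is a placeholder address used for illustration
print(re.search(email_pattern, "user@example.com"))  # <re.Match object; span=(0, 16), match='user@example.com'>
```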

4. Common functions in the `re` module

4.1 `re.match(pattern, string, flags=0)`

Tries to match the pattern at the beginning of the string. Returns a match object if it matches, otherwise None.

```python
import re

text = "Hello World"
match1 = re.match(r"Hello", text)
print(match1)  # <re.Match object; span=(0, 5), match='Hello'>

match2 = re.match(r"World", text)  # "World" is not at the start
print(match2)  # None
```

4.2 `re.search(pattern, string, flags=0)`

Scans through the string looking for the first location where the pattern matches. Returns a match object if one is found, otherwise None.

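```python
import re

text = "Hello World"
match1 = re.search(r"Hello", text)
print(match1)  # <re.Match object; span=(0, 5), match='Hello'>

match2 = re.search(r"World", text)  # "World" appears later in the string
print(match2)  # <re.Match object; span=(6, 11), match='World'>
```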

The difference between re.match() and re.search(): match() only matches at the beginning of the string, while search() scans the whole string.
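
The following sketch puts the two functions side by side. It also shows that, with no flags set, re.match() gives the same result as re.search() with a leading `^`, and that the result should be checked for None before use. The variable names are illustrative.

```python
import re

text = "Hello World"

at_start = re.match(r"World", text)     # only tries position 0
anchored = re.search(r"^World", text)   # ^ pins the search to the start as well
anywhere = re.search(r"World", text)    # scans forward until it finds a match

print(at_start)  # None
print(anchored)  # None
print(anywhere)  # <re.Match object; span=(6, 11), match='World'>

# Both functions return None on failure, so test the result before calling methods on it.
m = re.search(r"W\w+", text)
word = m.group() if m else None
print(word)      # World
```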

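4.3 Match object methods

If match() or search() returns a match object, you can use its methods to get information about the match:

group(0) or group(): Returns the entire matched string.
group(N): Returns the string matched by the N-th capturing group.
groups(): Returns a tuple of the strings matched by all capturing groups.
start(): Returns the start index of the match.
end(): Returns the end index of the match (exclusive).
span(): Returns the tuple (start, end).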

```python
import re

text = "My phone number is 123-456-7890."
pattern = r"(\d{3})-(\d{3})-(\d{4})"  # three capturing groups
match = re.search(pattern, text)

if match:
    print("Full match:", match.group(0))    # 123-456-7890
    print("Area code:", match.group(1))     # 123
    print("Middle three:", match.group(2))  # 456
    print("Last four:", match.group(3))     # 7890
    print("All groups:", match.groups())    # ('123', '456', '7890')
    print("Start:", match.start())          # 19
    print("End:", match.end())              # 31
    print("Span:", match.span())            # (19, 31)
```

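4.4 `re.findall(pattern, string, flags=0)`

Finds all non-overlapping matches in the string and returns them as a list of strings. If the pattern contains one capturing group, the list holds that group's text instead; with several groups, it holds tuples of the groups.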

```python
import re

text = "Today is 2024-01-01, tomorrow is 2024-01-02."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)  # ['2024-01-01', '2024-01-02']

# With capturing groups
text_phones = "John: 123-456-7890, Jane: 987-654-3210"
phone_numbers = re.findall(r"(\d{3})-(\d{3})-(\d{4})", text_phones)
print(phone_numbers)  # [('123', '456', '7890'), ('987', '654', '3210')]
```
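
The shape of the list returned by findall() depends on how many capturing groups the pattern has. A short sketch of the three cases, reusing the date text from above:

```python
import re

text = "Today is 2024-01-01, tomorrow is 2024-01-02."

# No group: each item is the whole match.
whole = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(whole)        # ['2024-01-01', '2024-01-02']

# One capturing group: each item is just that group's text.
one_group = re.findall(r"\d{4}-\d{2}-(\d{2})", text)
print(one_group)    # ['01', '02']

# Several groups: each item is a tuple of the groups.
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(many_groups)  # [('2024', '01', '01'), ('2024', '01', '02')]
```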

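4.5 `re.finditer(pattern, string, flags=0)`

Similar to findall(), but returns an iterator that yields a match object on each iteration. This saves memory when there are many matches, and each match object also carries position information.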

```python
import re

text = "apple banana cherry"
for match in re.finditer(r"\b\w+\b", text):
    print(f"Found '{match.group()}' at {match.span()}")

# Found 'apple' at (0, 5)
# Found 'banana' at (6, 12)
# Found 'cherry' at (13, 19)
```

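4.6 `re.sub(pattern, repl, string, count=0, flags=0)`

Finds every substring of string that matches pattern and replaces it with repl.

pattern: The regular expression to match.
repl: The replacement. It can be a plain string, a string that refers to capturing groups (such as r'\1'), or even a function.
string: The original string to operate on.
count: The maximum number of replacements. The default, 0, replaces all matches.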

```python
import re

text = "The price is $100. Another item is $50."
new_text = re.sub(r"\$(\d+)", r"¥\1", text)  # \1 refers to the digits captured by the group
print(new_text)  # The price is ¥100. Another item is ¥50.

# Replace every whitespace character with an underscore
text_with_spaces = "hello world python"
new_text_underline = re.sub(r"\s", "_", text_with_spaces)
print(new_text_underline)  # hello_world_python
```
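
The repl argument can also be a function, and count can limit the number of replacements. The sketch below shows both; the helper double_price is a hypothetical name made up for this example. The function receives each match object and returns the replacement text.

```python
import re

def double_price(match):
    """Return the matched dollar amount doubled (illustrative helper)."""
    amount = int(match.group(1))
    return f"${amount * 2}"

text = "The price is $100. Another item is $50."

# The function is called once per match, left to right.
doubled = re.sub(r"\$(\d+)", double_price, text)
print(doubled)     # The price is $200. Another item is $100.

# count=1 replaces only the first match.
first_only = re.sub(r"\$(\d+)", double_price, text, count=1)
print(first_only)  # The price is $200. Another item is $50.
```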

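4.7 `re.split(pattern, string, maxsplit=0, flags=0)`

Splits the string into a list, using the text matched by pattern as the separator.

maxsplit: The maximum number of splits. The default, 0, splits at every match.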

```python
import re

text = "apple,banana;cherry orange"
parts = re.split(r"[,; ]", text)  # split on a comma, semicolon, or space
print(parts)  # ['apple', 'banana', 'cherry', 'orange']

text_with_numbers = "item100price200count50"
parts_numbers = re.split(r"\d+", text_with_numbers)
print(parts_numbers)  # ['item', 'price', 'count', '']
```
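
Two further behaviors of re.split() are worth a quick look: maxsplit limits the number of splits, and a capturing group in the pattern keeps the separators in the result. A minimal sketch using the same sample text:

```python
import re

text = "apple,banana;cherry orange"

# maxsplit=2 performs at most two splits; the remainder stays in the last element.
limited = re.split(r"[,; ]", text, maxsplit=2)
print(limited)          # ['apple', 'banana', 'cherry orange']

# With a capturing group, the matched separators are included in the list.
with_separators = re.split(r"([,; ])", text)
print(with_separators)  # ['apple', ',', 'banana', ';', 'cherry', ' ', 'orange']
```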

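5. Compiling regular expressions with `re.compile()`

When you use the same pattern many times, compiling it into a regular expression object with re.compile() lets you reuse it directly and keeps the code tidy. The re module also caches recently used patterns internally, so the speed gain is usually modest.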

```python
import re

phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

text1 = "My number is 123-456-7890."
match1 = phone_pattern.search(text1)
if match1:
    print(f"Found phone number: {match1.group()}")

text2 = "Call me at 987-654-3210 soon."
match2 = phone_pattern.search(text2)
if match2:
    print(f"Found phone number: {match2.group()}")
```
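
A compiled pattern object offers the same operations as the module-level functions (search, findall, sub, and so on) as methods, without passing the pattern each time. A short sketch; the masking format in the replacement string is made up for illustration.

```python
import re

phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")
text = "John: 123-456-7890, Jane: 987-654-3210"

# The same compiled object is reused for three different operations.
all_numbers = phone_pattern.findall(text)
masked = phone_pattern.sub(r"\1-***-****", text)  # keep the first group, hide the rest
first_span = phone_pattern.search(text).span()

print(all_numbers)  # [('123', '456', '7890'), ('987', '654', '3210')]
print(masked)       # John: 123-***-****, Jane: 987-***-****
print(first_span)   # (6, 18)
```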

6. Regular expression flags

The re module provides flags that modify how a pattern matches. They can be passed as a function argument or given at compile time.

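Flag	Description
`re.IGNORECASE` (or `re.I`)	Matches without regard to case.
`re.MULTILINE` (or `re.M`)	Makes `^` and `$` match at the start and end of each line (not only at the start and end of the string).
`re.DOTALL` (or `re.S`)	Makes `.` match every character, including the newline `\n`.
`re.ASCII` (or `re.A`)	Makes `\w`, `\b`, `\s`, `\d` (and their uppercase counterparts) match ASCII characters only.
`re.VERBOSE` (or `re.X`)	Ignores unescaped whitespace and comments after `#` in the pattern, which improves readability.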

Example: re.IGNORECASE

```python
import re

text = "Hello world"
match = re.search(r"hello", text, re.IGNORECASE)
print(match)  # <re.Match object; span=(0, 5), match='Hello'>
```

Example: re.MULTILINE

```python
import re

text = "Line 1\nLine 2\nLine 3"

# By default, ^ only matches at the start of the string
match_start_default = re.findall(r"^Line", text)
print(f"Default mode: {match_start_default}")  # Default mode: ['Line']

# With re.MULTILINE, ^ matches at the start of every line
match_start_multiline = re.findall(r"^Line", text, re.MULTILINE)
print(f"Multiline mode: {match_start_multiline}")  # Multiline mode: ['Line', 'Line', 'Line']
```

Example: re.DOTALL

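```python
import re

text = "Hello\nWorld"

# By default, . does not match a newline
match_dot_default = re.search(r"Hello.World", text)
print(f"Default mode: {match_dot_default}")  # Default mode: None

# With re.DOTALL, . matches every character, including a newline
match_dot_all = re.search(r"Hello.World", text, re.DOTALL)
print(f"DOTALL mode: {match_dot_all}")  # DOTALL mode: <re.Match object; span=(0, 11), match='Hello\nWorld'>
```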
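
The flags table also lists re.VERBOSE, which the examples above do not show. Here is a minimal sketch of it, followed by combining two flags with the | operator. The sample strings are made up for illustration.

```python
import re

# re.VERBOSE allows the pattern to span several lines and carry comments.
phone_pattern = re.compile(
    r"""
    (\d{3})   # area code
    -
    (\d{3})   # middle three digits
    -
    (\d{4})   # last four digits
    """,
    re.VERBOSE,
)

match = phone_pattern.search("Call 123-456-7890 today")
print(match.groups())  # ('123', '456', '7890')

# Several flags can be combined with the | operator.
text = "first line\nSECOND line"
starts = re.findall(r"^second", text, re.IGNORECASE | re.MULTILINE)
print(starts)          # ['SECOND']
```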

7. Greedy vs. non-greedy matching

Quantifiers (*, +, ?, {n,m}) are greedy by default, meaning they match as many characters as possible.

To make them match as few characters as possible, add ? after the quantifier, which makes it non-greedy.

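```python
import re

# Sample HTML-like text; the tag names here are illustrative
text = "<p>some content</p> <p>other content</p>"

# Greedy: matches from the first "<" to the last ">"
greedy_match = re.search(r"<.+>", text)
print(f"Greedy match: {greedy_match.group()}")  # Greedy match: <p>some content</p> <p>other content</p>

# Non-greedy: matches as little as possible, stopping at the nearest ">"
non_greedy_match = re.search(r"<.+?>", text)
print(f"Non-greedy match: {non_greedy_match.group()}")  # Non-greedy match: <p>
```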
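
The trailing ? works on every quantifier, not only on +. The following sketch compares greedy and non-greedy forms on a number and on quoted text; the sample strings are made up for illustration.

```python
import re

text = "price: 12345"

greedy_plus = re.search(r"\d+", text).group()        # as many digits as possible
lazy_plus = re.search(r"\d+?", text).group()         # as few digits as possible
greedy_range = re.search(r"\d{2,4}", text).group()   # up to four digits
lazy_range = re.search(r"\d{2,4}?", text).group()    # stops at the minimum of two

print(greedy_plus, lazy_plus, greedy_range, lazy_range)  # 12345 1 1234 12

quoted = 'say "yes" or "no"'
greedy_quotes = re.findall(r'".*"', quoted)
lazy_quotes = re.findall(r'".*?"', quoted)

print(greedy_quotes)  # ['"yes" or "no"']
print(lazy_quotes)    # ['"yes"', '"no"']
```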

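8. Summary and practical advice

Regular expressions can have a steep learning curve, but mastering them is well worth the effort.

Study tips:

Start with small patterns: Do not try to write a complex regular expression in one go. Match the simplest part first.
Build up step by step: Test the effect each time you add a new metacharacter or quantifier.
Practice a lot: Solving real problems is the best way to learn.
Use online tools: There are many online regex testers (such as regex101.com, regextester.com) that show matches in real time and explain the pattern, which helps a lot with learning and debugging.
Read the documentation: When something is unclear, consult the official documentation for Python's re module.

I hope this introduction helps you get started with regular expressions in Python!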