Using Python Regex to extract phone numbers from a text file


A data analytics engineer with four years of experience working as a data engineer. Holds an MSc in Data.

import re

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file into a string
# print(contents)                        # Print the string if you want to

# Now let's extract the phone numbers from the text

reg_ex = r"\+?\d+(?:[- (]+\d+\)?)+"      # Optional "+", digits, then separator-and-digit groups
print(re.findall(reg_ex, contents))      # Print a list of every match

The code imports the re module, which provides support for regular expressions in Python.
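
To try the pattern without a lorem.txt file, here is a minimal sketch that runs the same regular expression on an in-memory string. The sample text and the phone numbers in it are made up for illustration.

```python
import re

reg_ex = r"\+?\d+(?:[- (]+\d+\)?)+"

# Hypothetical sample text standing in for the contents of lorem.txt
contents = "Office: +1 (555) 123-4567, mobile: 555-987-6543, ext 42."

matches = re.findall(reg_ex, contents)
print(matches)
# ['+1 (555) 123-4567', '555-987-6543']
```

The trailing 42 is skipped because the pattern requires at least one separator followed by more digits after the first run of digits.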

  1. reg_ex = r"\+?\d+(?:[- (]+\d+\)?)+" defines a regular expression pattern. Let's break it down:

    \+?: Matches an optional plus sign (\+). The backslash \ is used to escape the plus sign because it has a special meaning in regular expressions.

    \d+: Matches one or more digits (\d). This is the first run of digits in the phone number, such as a country code or area code.

    (?:[- (]+\d+\)?)+: This is a non-capturing group (?: ... ) that matches one or more occurrences of a sequence of characters. Let's break it down further:

    [- (]+: Matches one or more occurrences of a hyphen, space, or opening parenthesis character. The characters are enclosed within square brackets [- (].

    \d+: Matches one or more digits.

    \)?: Matches an optional closing parenthesis \).

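    Because the non-capturing group (?: ... ) is followed by +, the regular expression matches one or more repetitions of the separator-and-digits sequence, so the entire phone number is matched as a single string. At least one separator followed by digits is required, which means a bare run of digits on its own does not match.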

    re.findall(reg_ex, contents) searches for all non-overlapping matches of the regular expression pattern reg_ex in the contents string. It returns a list of all matched substrings.
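
The non-capturing group matters for re.findall: when a pattern contains a capturing group, findall returns the text of that group instead of the whole match. The sketch below contrasts the two forms. The sample string and the capturing variant are made up for illustration and are not part of the original script.

```python
import re

# Hypothetical sample text
text = "Call 555-123-4567 now"

non_capturing = r"\+?\d+(?:[- (]+\d+\)?)+"  # pattern from the post
capturing = r"\+?\d+([- (]+\d+\)?)+"        # same pattern with (?: replaced by (

whole_matches = re.findall(non_capturing, text)
group_only = re.findall(capturing, text)

print(whole_matches)  # ['555-123-4567']
print(group_only)     # ['-4567'], only the last repetition of the group
```

With the capturing variant, each result holds only the last separator-and-digits repetition, which is why the post's pattern uses (?: ... ).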

Tips:

    \+?: The plus sign (\+) is optional (?), so it matches zero or one occurrence. This allows for phone numbers with or without a plus sign at the beginning, which usually indicates an international number.

    \d+: Matches one or more digits (\d). It matches a numeric portion of the phone number, such as the area code or the subscriber number.
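
As a quick check of the optional plus sign, the sketch below uses re.fullmatch, which succeeds only when the entire string fits the pattern. The candidate strings are made-up examples.

```python
import re

reg_ex = r"\+?\d+(?:[- (]+\d+\)?)+"

# Hypothetical candidates: with a plus, without a plus, and with two plus signs
candidates = ["+44 20 7946 0958", "020 7946 0958", "++44 20 7946 0958"]

results = [re.fullmatch(reg_ex, candidate) is not None for candidate in candidates]

for candidate, is_match in zip(candidates, results):
    print(f"{candidate!r}: {is_match}")
# '+44 20 7946 0958': True
# '020 7946 0958': True
# '++44 20 7946 0958': False
```

Note that re.findall would still find a number inside the third string, because it can start matching at the second plus sign.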
