[Solved] Regex works fine on Pythex, but not in Python

I used the following regular expression on pythex to test it:

(\d|t)(_\d+){1}\.

It works fine there, and I am primarily interested in group 2. The demo below shows the successful matches:

pythex demo

However, I can’t get Python to show me the correct results. Here’s an MWE:

import re

fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)

for line in fn_list:
    search_obj = re.match(pattern, line)
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)

Nothing is printed.

However, the pythex demo above clearly shows two groups returned for each match, and the second group should be present for many of these files. What am I doing wrong?

Solution #1:

You need to use re.search(), not re.match(). re.search() matches anywhere in the string, whereas re.match() matches only at the beginning.

import re

fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)

for line in fn_list:
    search_obj = re.search(pattern, line)  # CHANGED HERE
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)

Result:

('4', '_1')
('4', '_2')
('4', '_2')
('4', '_2')
('4', '_3')
('t', '_2')
('5', '_1')
('5', '_15')

Since you are compiling the regular expression, you can write search_obj = pattern.search(line) instead of search_obj = re.search(pattern, line). As for the regular expression itself, r'([\dt])(_\d+)\.' is equivalent to the one you’re using, and a bit cleaner.
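To make the difference concrete, here is a minimal sketch on a single filename from the question. It assumes the pattern was written with backslashes (\d and \.), which the results above imply; the variable names are only illustrative.

```python
import re

pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
name = 'IMG_0064_1.JPG'  # one filename from the question's list

# match() only tries the pattern at index 0, where 'I' is neither a digit nor 't'.
at_start = pattern.match(name)

# search() scans forward through the string and finds '4_1.' further along.
anywhere = pattern.search(name)

# The module-level function accepts a compiled pattern and behaves the same way.
via_module = re.search(pattern, name)

print(at_start)                                  # None
print(anywhere.groups())                         # ('4', '_1')
print(via_module.groups() == anywhere.groups())  # True
```

Calling the method on the compiled object simply avoids passing the pattern around as an argument; the match result is the same either way.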

Respondent: Bob Dylan

Solution #2:

You need to use the following code:

import re
fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE) # OPTIMIZED REGEX A BIT

for line in fn_list:
    search_obj = pattern.search(line)  # YOU NEED SEARCH WITH THE COMPILED REGEX
    if search_obj:
        matching_group = search_obj.group(2) # YOU NEED TO ACCESS GROUP 2 IF YOU ARE INTERESTED JUST IN GROUP 2
        print(matching_group)

See IDEONE demo

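As for the regex, (\d|t) matches the same text as ([\dt]), but the character class is more efficient because the engine does not have to try alternatives one by one. Also, {1} is redundant, since every token matches exactly once by default.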
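A quick way to check that claim is to run both spellings over a few of the question's filenames and compare the groups. This is only a sketch: it assumes the backslashed forms of the patterns, and all_groups is a hypothetical helper written for this comparison, not part of the re module.

```python
import re

# A few names from the question's list, including one that should not match.
names = ['IMG_0064.png',
         'IMG_0064_3.JPEG',
         'IMG-20150623-00176-preview-left_2.jpg',
         'thumb_2595_15.bmp']

with_alternation = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
with_char_class = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)


def all_groups(pattern, filenames):
    """Hypothetical helper: groups of the first match per name, or None."""
    results = []
    for filename in filenames:
        found = pattern.search(filename)
        results.append(found.groups() if found else None)
    return results


old_results = all_groups(with_alternation, names)
new_results = all_groups(with_char_class, names)

print(new_results)
# [None, ('4', '_3'), ('t', '_2'), ('5', '_15')]
print(old_results == new_results)  # True
```

Identical results on these samples do not measure speed; for that, the standard timeit module could be pointed at both compiled patterns.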

Respondent: Cyphase

The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .
