How can I determine the byte length of a utf-8 encoded string in Python?


I am working with Amazon S3 uploads and am having trouble with key names being too long. S3 limits the length of the key by bytes, not characters.

From the docs:

The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.

I also attempt to embed metadata in the file name, so I need to be able to calculate the current byte length of the string using Python to make sure the metadata does not make the key too long (in which case I would have to use a separate metadata file).

How can I determine the byte length of the UTF-8 encoded string? Again, I am not interested in the character length, but in the actual number of bytes used to store the string.

def utf8len(s):
    return len(s.encode('utf-8'))

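This works in Python 3 for str and in Python 2 for unicode objects. In Python 2, calling it on a byte str that contains non-ASCII bytes raises UnicodeDecodeError, because encode first attempts an implicit ASCII decode.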

Use the string ‘encode’ method to convert a character string to a byte string, then call len() on the result as usual:

>>> s = u"¡Hola, mundo!"
>>> len(s)  # characters
13
>>> len(s.encode('utf-8'))  # bytes
14
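Applied to the S3 case from the question, the same idea might be sketched as a small check. The names MAX_KEY_BYTES and fits_in_key and the example key are hypothetical; only the 1024-byte limit comes from the docs quoted in the question.

```python
MAX_KEY_BYTES = 1024  # limit taken from the S3 docs quoted in the question

def utf8len(s):
    """Number of bytes s occupies once encoded as UTF-8."""
    return len(s.encode('utf-8'))

def fits_in_key(base_name, metadata_suffix, limit=MAX_KEY_BYTES):
    """Hypothetical helper: True if the combined key stays within the byte limit."""
    return utf8len(base_name + metadata_suffix) <= limit

key = u"uploads/\u00a1Hola, mundo!.txt"          # 25 characters, 26 bytes
print(utf8len(key))                               # 26
print(fits_in_key(key, u"__author=me"))           # True
print(fits_in_key(key, u"\u00e9" * 600))          # False: 600 characters, 1200 bytes
```

If the check fails, the metadata would go into a separate file, as the question suggests.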

Encoding the string and calling len on the result works well, as other answers have shown. It does build a throw-away copy of the string, which might not be optimal for very large strings (I don’t consider 1024 bytes large, though). The structure of UTF-8 lets you work out the byte length of each character from its code point alone, without encoding it, although encoding a single character might still be easier. I present both methods here; they should give the same result.

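def utf8_char_len_1(c):
    # Method 1: derive the byte count from the code point range.
    codepoint = ord(c)
    if codepoint <= 0x7f:        # ASCII: 1 byte
        return 1
    if codepoint <= 0x7ff:       # 2-byte sequences
        return 2
    if codepoint <= 0xffff:      # rest of the Basic Multilingual Plane: 3 bytes
        return 3
    if codepoint <= 0x10ffff:    # supplementary planes: 4 bytes
        return 4
    raise ValueError('Invalid Unicode character: ' + hex(codepoint))

def utf8_char_len_2(c):
    # Method 2: encode the single character and measure the result.
    return len(c.encode('utf-8'))

# Pick whichever per-character method you prefer.
utf8_char_len = utf8_char_len_1

def utf8len(s):
    # Sum the per-character byte lengths without encoding the whole string.
    return sum(utf8_char_len(c) for c in s)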
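As a quick sanity check, the per-character approach can be compared against encoding the whole string. This is a minimal sketch assuming Python 3; the sample strings and the names samples and results are illustrative only. One caveat: in Python 3 a lone surrogate such as u"\ud800" cannot be encoded as UTF-8 and raises UnicodeEncodeError, whereas the range-based function would report 3 bytes for it, so the two methods agree only for encodable text.

```python
def utf8_char_len(c):
    """UTF-8 byte length of one character, derived from its code point range."""
    codepoint = ord(c)
    if codepoint <= 0x7f:
        return 1
    if codepoint <= 0x7ff:
        return 2
    if codepoint <= 0xffff:
        return 3
    if codepoint <= 0x10ffff:
        return 4
    raise ValueError('Invalid Unicode character: ' + hex(codepoint))

def utf8len(s):
    """Sum of per-character byte lengths; never encodes the full string."""
    return sum(utf8_char_len(c) for c in s)

# Illustrative samples covering 1-, 2-, 3- and 4-byte characters.
samples = [u"", u"abc", u"\u00a1Hola, mundo!", u"\u20ac", u"\U0001F600"]

# Pair the range-based length with the length of the encoded bytes.
results = [(utf8len(s), len(s.encode('utf-8'))) for s in samples]

for s, (by_range, by_encoding) in zip(samples, results):
    print(repr(s), by_range, by_encoding)
```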

The answers are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.