Changes between Version 1 and Version 2 of ThePerilsOfStr


Timestamp:
Jan 11, 2011, 4:29:51 PM (10 years ago)
Author:
flip
Comment:

--

  • ThePerilsOfStr

    v1 v2  
    77
    88 * It's OK to call `str()` on non-strings.
    9  * It's not OK to call `str()` on strings. Use the string's `.encode()`
    10    method and pass `"utf-8"` as the parameter. e.g. `my_string.encode("utf-8")`
     9 * It's not OK to call `str()` on strings because it might raise an exception.
     10 Use the string's `.encode()`
     11 method and pass `"utf-8"` as the parameter. e.g. `my_string.encode("utf-8")`
    1112 * The built-in function `unicode(my_string)` is a bit less cumbersome than
    12    calling `my_string.encode("utf-8")`, but `unicode()` doesn't exist in
    13    Python 3
    14    so using it can create compatibility problems down the road. We should
    15    avoid it.
     13 calling `my_string.encode("utf-8")`, but `unicode()` doesn't exist in
     14 Python 3 so all instances of it will eventually have to be replaced. We should
     15 keep our use to a minimum.
     16 * We can't avoid using `unicode()` entirely because that's the best way
     17 to convert our custom objects (experiments, metabs, etc.) to strings.
    1618
    1719
     
    2123Most modern Python libraries return Unicode. (See the historical note
    2224below.) That includes the two sources of most of our strings -- `sqlite`
    23 and `wxPython`. Python allows one to mix the two and in general we can ignore
    24 the difference.
     25and `wxPython`. Python makes it easy to mix the two and in general we can
     26ignore the difference.
    2527
    2628However, there are two cases where we must use 8 bit strings, and how we convert
    2729from Unicode to 8 bit is important. The two cases where we need 8 bit strings
    28 are (1) when passing strings to PyGAMMA and (2) when writing to a file or the
    29 console. (You might think that it'd be a concern when writing to the database,
    30 but the sqlite module handles that for us.)
     30are (1) when passing strings to PyGAMMA and (2) when writing to a stream
     31(like a file or the console). You might think that it'd be a concern when
     32writing to the database, but the sqlite module handles that for us.
    3133
    3234To represent Unicode in 8 bit strings, one must ''encode'' the string.
    3335For historical reasons there are lots of encodings but nowadays in this
    34 part of the world, UTF-8 is the most popular and the only one we need to
    35 think about.
     36part of the world, UTF-8 is the most popular and the only one we'll use.
    3637
    3738In Python, one encodes a Unicode string with its `.encode()` method like so --
     
    5253}}}
    5354
    54 What's `sys.getdefaultencoding()`? There's the problem. In most Pythons it's
     55What's `sys.getdefaultencoding()`? It's the problem. In most Pythons it's
    5556"ascii" which means the string you're converting had better be limited to the
    5657128 characters in the ASCII repertoire or you'll get an exception --
     
    5859{{{
    5960#!python
    60 str(u"Bj\xc3\xb6rn")
     61>>> str(u"Bj\xf6rn")
    6162Traceback (most recent call last):
    6263  File "<stdin>", line 1, in <module>
    63 UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
    64 }}}
    65 
    66 
    67 The string `u"Bj\xc3\xb6rn"` is Python's Unicode representation of
     64UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 2: ordinal not in range(128)
     65}}}
     66
     67
     68The string `u"Bj\xf6rn"` is Python's Unicode representation of
    6869"Björn" which is a fairly common Swedish first name.
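
The same failure can be reproduced without relying on the interpreter's default encoding by naming the codec explicitly. The following is only a minimal sketch; it assumes nothing beyond the built-in `.encode()` method and behaves the same under Python 2 and Python 3 --

```python
# u"Bj\xf6rn" is "Bjorn" with a diaeresis over the o; \xf6 is a single
# Unicode code point.
name = u"Bj\xf6rn"

# Encoding with the ASCII codec fails. Under Python 2, str(name) does
# this implicitly when the default encoding is "ascii".
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding with UTF-8 succeeds and yields an 8 bit string.
utf8_name = name.encode("utf-8")
```

Note that the UTF-8 result contains the two bytes `\xc3\xb6`, which are the 8 bit encoding of the single code point `\xf6`.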
    6970
     
    7172So now you can probably see where `str()` leads to trouble.
    7273A Swedish user might name a metabolite that he got from his colleague
    73 "aspartate från Björn". The code below is no problem --
     74"aspartate från Björn". The code below (in which metabolite.name is a
     75Unicode object) is no problem --
    7476{{{
    7577#!python
     
    8890}}}
    8991
     92Because that code is equivalent to this (assuming the typical default
     93encoding of ASCII) --
     94{{{
     95#!python
     96metabolite.name.encode("ascii")
     97Traceback (most recent call last):
     98  File "<stdin>", line 1, in <module>
     99UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 12: ordinal not in range(128)
     100}}}
     101
     102
    90103So our code that calls `str()` to convert strings from the GUI or the database
    91 into 8 bit representations in order to make them safe for PyGAMMA will
     104into 8 bit representations in order to make them safe for PyGAMMA or to
     105display them in a text file will
    92106break as soon as someone passes non-ASCII to it.
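
One way to avoid that breakage is to route every such conversion through a single function that always encodes to UTF-8. This is only a sketch: `to_utf8` is a hypothetical helper name, not something that exists in our code or in any library, and it assumes that any 8 bit string it receives is already UTF-8 --

```python
def to_utf8(text):
    """Return text as an 8 bit UTF-8 string.

    to_utf8 is a hypothetical name used for illustration only.
    """
    if isinstance(text, bytes):
        # Already 8 bit. Assumption: it was encoded as UTF-8 earlier.
        return text
    # A Unicode object: encode explicitly instead of calling str(),
    # which would use the default encoding (usually ASCII).
    return text.encode("utf-8")


encoded_name = to_utf8(u"aspartate fr\xe5n Bj\xf6rn")
```

Calling `to_utf8()` on the metabolite name from the example above returns an 8 bit string instead of raising `UnicodeEncodeError`.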
     107
     108
     109== What Does This Mean For Us? ==
     110
     111When one calls `unicode()` on an object, Python first checks to see if the object
     112implements `__unicode__()`. If not, it calls the object's `__str__()` method.
     113If the object doesn't implement a custom `__str__()` method, standard object
     114inheritance implies a call to the `__str__()` method on Python's `object`
     115class (remember that everything inherits from object). That returns
     116the reliable-but-boring representation you've surely seen before --
     117{{{
     118#!python
     119<vespa.common.mrs_metabolite.Metabolite object at 0x13f2350>
     120}}}
     121
     122That's the short version of how `unicode()` is implemented.
     123
     124Note that an object's `__unicode__()` method ''must'' return a Unicode object.
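
The lookup order described above can be mimicked in plain Python. This is a rough illustration, not how the interpreter is actually implemented: `describe` and the three classes are hypothetical names, and the sketch assumes the default encoding is ASCII --

```python
def describe(obj):
    """Roughly mimic the lookup order of Python 2's unicode(obj)."""
    if hasattr(obj, "__unicode__"):
        # First choice: the object's own __unicode__() method.
        return obj.__unicode__()
    # Fallback: __str__(), which may be inherited from object.
    text = obj.__str__()
    if isinstance(text, bytes):
        # An 8 bit result is decoded; ASCII is assumed here as the
        # default encoding.
        text = text.decode("ascii")
    return text


class WithUnicode(object):
    def __unicode__(self):
        return u"fr\xe5n __unicode__"

    def __str__(self):
        return "from __str__"


class WithStrOnly(object):
    def __str__(self):
        return "from __str__"


class Plain(object):
    pass
```

`describe(WithUnicode())` uses `__unicode__()`, `describe(WithStrOnly())` falls back to `__str__()`, and `describe(Plain())` falls through to the `<... object at 0x...>` representation inherited from `object`.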
     125
     126
     127When one calls `str()` on an object, Python ignores `__unicode__()` and calls
     128only `__str__()`. ''However'', objects may return Unicode objects from their
     129`__str__()` method. If an object's `__str__()` method returns a Unicode
     130object, Python converts
     131the string to 8 bit using the default encoding before `str()` returns.
     132
     133When we `print` an object, Python calls `str()`.
     134
     135Therefore --
     136
     137 * We want to be able to continue to use `print my_object`, so all
     138 of our objects need to implement `__str__()`.
     139 * If we return a Unicode object from the `__str__()` method, Python will
     140 convert it to 8 bit using the default encoding. As discussed above, this
     141 breaks when the string is non-ASCII.
     142 * Therefore, we need to return an 8 bit string from `__str__()`, which,
     143 for us, implies UTF-8.
     144 * Sometimes we want a Unicode representation of our objects, so our
     145 objects also need to implement `__unicode__()`.
     146 * To keep ourselves sane, the `__unicode__()` method should be the
     147 primary method for converting the object to a string. The `__str__()`
     148 method should be implemented like so --
     149 {{{
     150#!python
     151def __str__(self):
     152    return self.__unicode__().encode("utf-8")
     153 }}}
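
Put together, the points above might look like the class below. This is a sketch under stated assumptions: `Metabolite` here is a hypothetical stand-in rather than the real `vespa.common.mrs_metabolite.Metabolite`, and the Python 3 branch is an assumption about a future port (where `str` is already Unicode) that is not covered by the text above --

```python
import sys


class Metabolite(object):
    """Hypothetical stand-in for one of our custom objects."""

    def __init__(self, name):
        self.name = name  # expected to be a Unicode object

    def __unicode__(self):
        # The primary conversion; always returns a Unicode object.
        return u"Metabolite: " + self.name

    if sys.version_info[0] < 3:
        def __str__(self):
            # Python 2: str() and print want an 8 bit string, so
            # encode explicitly rather than relying on the default.
            return self.__unicode__().encode("utf-8")
    else:
        # Python 3 (assumption): str is Unicode, so reuse __unicode__.
        __str__ = __unicode__


metabolite = Metabolite(u"aspartate fr\xe5n Bj\xf6rn")
as_unicode = metabolite.__unicode__()
as_utf8 = as_unicode.encode("utf-8")
```

With this arrangement only `__unicode__()` needs to know how to describe the object; `__str__()` never has to be edited when the description changes.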
     154
     155This conversation goes into a little more detail on the subject
     156of `__str__()` and `__unicode__()` --[[br]]
     157http://mail.python.org/pipermail/python-dev/2006-December/070237.html
     158
    93159
    94160== So What's So Great About UTF-8? ==