Encoding fixes for python 3.4 #15

sfarbotka · 2014-08-21T21:34:18Z

In Python 3.x, the PyUnicode_FromString() function accepts only UTF-8 encoded strings.
But country_code, country_name, and country_continent are all ISO-8859-1 encoded.
This commit fixes the issue.

Before fix:

Python 3.4.1 (default, Aug 21 2014, 16:21:32) 
[GCC 4.6.3] on linux
>>> import GeoIP
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 4: invalid continuation byte

After fix:

Python 3.4.1 (default, Aug 21 2014, 16:20:07) 
[GCC 4.6.3] on linux
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Curaçao'
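
For readers without the affected libGeoIP build, the same failure can be reproduced in pure Python. This is only a sketch: the byte string is built by hand as a stand-in for what the C library returns, and `try_utf8` is a hypothetical helper, not part of the GeoIP module.

```python
# Latin-1 (ISO-8859-1) encodes U+00E7 (c with cedilla) as the single byte 0xE7.
# This byte string is a stand-in for the name the C library hands back.
latin1_bytes = "Cura\u00e7ao".encode("latin-1")


def try_utf8(raw):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


# 0xE7 looks like the lead byte of a 3-byte UTF-8 sequence, but the next
# byte is a plain ASCII "a", so strict UTF-8 decoding fails at position 4.
text, error = try_utf8(latin1_bytes)
print(error)

# Decoding with the encoding the bytes were actually written in succeeds.
fixed = latin1_bytes.decode("latin-1")
print(fixed)
```

The reported position and reason match the traceback in the "Before fix" session above.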

oschwald · 2014-08-22T01:11:34Z

We already set the character set for the C API. The strings coming from it should be UTF-8.

I can't reproduce your issue:

Python 3.4.1 (default, Jul 27 2014, 17:47:19) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Curacao'

What version of libGeoIP are you using? If any of those methods are returning ISO-8859-1, that sounds like a bug in libGeoIP.

oschwald · 2014-08-22T01:21:42Z

I took a closer look at the code in question, and I think the right fix is to populate from the UTF-8 country name array in libGeoIP.

This reverts commit e589530.

sfarbotka · 2014-08-22T11:04:27Z

My system details:

Python 3.4.1 (default, Aug 21 2014, 16:21:32) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.lib_version()
'1.4.8'

$ uname -a
Linux linuxhost 3.4.79 #6 SMP PREEMPT Fri Feb 14 23:58:54 CST 2014 armv7l GNU/Linux

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

$ cat /etc/issue
Debian GNU/Linux 7 \n \l

My fixed version of GeoIP prints 'Curaçao', while your output is 'Curacao'. The 5th characters are different.
Also, when I use the unfixed version of GeoIP in Python 2.7, print skips the 5th character in my locale:

Python 2.7.3 (default, Mar 14 2014, 17:55:54) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Cura\xe7ao'
>>> print GeoIP.country_names['CW']
Curao

I pushed updates. The code now uses GeoIP_utf8_country_name to populate the dictionary.

oschwald · 2014-08-22T14:42:14Z

Thanks. I merged this. I did change it to use GeoIP_country_name for Python 2 since people may be expecting latin1 there.
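
The split described above can be paraphrased in Python as follows. This is an illustrative sketch only: the real selection happens in the C extension, and the two tables and the `country_name_for` function are hypothetical stand-ins for the libGeoIP arrays, not the merged code.

```python
# Hypothetical stand-ins for the two libGeoIP name tables (one entry each).
LATIN1_NAMES = {"CW": b"Cura\xe7ao"}      # ISO-8859-1 bytes
UTF8_NAMES = {"CW": b"Cura\xc3\xa7ao"}    # UTF-8 bytes


def country_name_for(code, major_version):
    """Pick the table by interpreter major version (illustrative only)."""
    if major_version >= 3:
        # Python 3: str is Unicode, so decode the UTF-8 table.
        return UTF8_NAMES[code].decode("utf-8")
    # Python 2: keep the Latin-1 byte string existing callers may expect.
    return LATIN1_NAMES[code]


print(repr(country_name_for("CW", 3)))
print(repr(country_name_for("CW", 2)))
```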

oschwald · 2014-08-22T14:59:42Z

I also released 1.3.2 with this fix.

Encoding fixes for python 3.4

e589530

sfarbotka added 2 commits August 22, 2014 13:06

Revert "Encoding fixes for python 3.4"

1651fd0

This reverts commit e589530.

Use UTF-8 encoded country names

10b2a4a

oschwald closed this Aug 22, 2014
