This repository was archived by the owner on Jun 1, 2022. It is now read-only.

Encoding fixes for python 3.4 #15

Closed
wants to merge 3 commits into from


sfarbotka
Contributor

In Python 3.x, the PyUnicode_FromString() function accepts only UTF-8 encoded strings.
But country_code, country_name, and country_continent are all ISO-8859-1 encoded.
This commit fixes the issue.
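
The same failure can be reproduced in pure Python, without the extension module. This is a minimal sketch, assuming the C library hands back the ISO-8859-1 bytes for "Curaçao"; the byte string below is written out by hand for illustration and is not read from libGeoIP.

```python
# ISO-8859-1 (latin-1) bytes for "Curaçao"; 0xe7 is "ç" in that encoding.
latin1_name = b"Cura\xe7ao"

# Decoding these bytes as UTF-8 fails: 0xe7 announces a multi-byte
# sequence, but the next byte ("a") is not a continuation byte.
try:
    latin1_name.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding with the codec the bytes were actually written in succeeds.
decoded = latin1_name.decode("latin-1")

print(utf8_error)
print(decoded)
```

The error is raised at byte offset 4, which matches the position reported in the traceback below.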

Before fix:

Python 3.4.1 (default, Aug 21 2014, 16:21:32) 
[GCC 4.6.3] on linux
>>> import GeoIP
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 4: invalid continuation byte

After fix:

Python 3.4.1 (default, Aug 21 2014, 16:20:07) 
[GCC 4.6.3] on linux
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Curaçao' 

@oschwald
Member

We already set the character set for the C API. The strings coming from it should be UTF-8.

I can't reproduce your issue:

Python 3.4.1 (default, Jul 27 2014, 17:47:19) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Curacao'

What version of libGeoIP are you using? If any of those methods are returning ISO-8859-1, it sounds like a bug in libGeoIP.

@oschwald
Member

I took a closer look at the code in question, and I think the right fix is to populate from the UTF-8 country name array in libGeoIP.

@sfarbotka
Contributor Author

My system details:

Python 3.4.1 (default, Aug 21 2014, 16:21:32) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.lib_version()
'1.4.8'
$ uname -a
Linux linuxhost 3.4.79 #6 SMP PREEMPT Fri Feb 14 23:58:54 CST 2014 armv7l GNU/Linux

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

$ cat /etc/issue
Debian GNU/Linux 7 \n \l

My fixed version of GeoIP prints 'Curaçao', while your output is 'Curacao'; the fifth characters differ.
Also, when I use the unfixed version of GeoIP in Python 2.7, print skips the fifth character in my locale:

Python 2.7.3 (default, Mar 14 2014, 17:55:54) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Cura\xe7ao'
>>> print GeoIP.country_names['CW']
Curao

I pushed updates. The code now uses GeoIP_utf8_country_name to populate the dictionary.

@oschwald
Member

Thanks. I merged this. I did change it to use GeoIP_country_name for Python 2 since people may be expecting latin1 there.
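
To illustrate why the encoding was left alone on Python 2, here is a rough sketch of what could happen to a caller that already decodes the names as latin1. It assumes such a caller exists; the variable names are hypothetical and nothing here calls the GeoIP module.

```python
name = "Cura\u00e7ao"  # "Curaçao"

# The same name in the two encodings discussed in this thread.
latin1_bytes = name.encode("latin-1")  # 7 bytes, "ç" is the single byte 0xe7
utf8_bytes = name.encode("utf-8")      # 8 bytes, "ç" is the pair 0xc3 0xa7

# Existing code that assumes latin1 keeps working with latin1 bytes...
as_expected = latin1_bytes.decode("latin-1")

# ...but would silently produce mojibake if handed UTF-8 bytes instead.
mojibake = utf8_bytes.decode("latin-1")

print(as_expected)
print(mojibake)
```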

@oschwald oschwald closed this Aug 22, 2014
@oschwald
Member

I also released 1.3.2 with this fix.
