move regex config to same constants as everything else #1

derek73 · derek73 · commit 7446b6ab859a · 2014-05-15T04:33:10.000-07:00
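
The sketch below illustrates the pattern this commit moves to: the compiled regexes live on the same ``Constants`` object as the other settings (as ``C.RE``), and an instance gets its own copy only when something falsy is passed for ``constants``. It is a simplified stand-in under stated assumptions, not the library's code: the ``Name`` class, the regex patterns, and the ``TITLES`` values are hypothetical, and plain sets replace the ``Manager`` wrapper.

```python
import re

# Hypothetical stand-ins for the REGEXES and TITLES constants; the real
# values in nameparser/config are not shown in this diff.
REGEXES = (
    ("spaces", re.compile(r"\s+")),
    ("initial", re.compile(r"^\w\.?$")),
)
TITLES = ("dr", "mr", "hon")


class Regexes(object):
    """Expose each (name, pattern) pair as an attribute, as in the diff."""

    def __init__(self):
        for name, pattern in REGEXES:
            setattr(self, name, pattern)


class Constants(object):
    def __init__(self):
        # Simplified: the real code wraps these values in Manager.
        self.titles = set(TITLES)
        # Regexes now hang off the constants object instead of being a
        # separate module-level ``regexes`` object.
        self.RE = Regexes()


constants = Constants()  # module-level config shared by default


class Name(object):
    """Hypothetical, simplified stand-in for HumanName's constructor."""

    def __init__(self, full_name="", constants=constants):
        if constants:
            self.C = constants          # shared, module-wide config
            self.has_own_config = False
        else:
            self.C = Constants()        # private copy for this instance
            self.has_own_config = True
        # Regexes are reached through self.C.RE rather than self.RE.
        self.full_name = self.C.RE.spaces.sub(" ", full_name.strip())


shared = Name("Hon   Solo")
private = Name("Hon   Solo", constants=None)
private.C.titles.discard("hon")  # affects only the private config
```

Because ``private`` was built with ``constants=None``, removing ``"hon"`` from its titles leaves the module-level ``constants`` untouched, while ``shared`` keeps using the shared object.
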
diff --git a/README.rst b/README.rst
@@ -186,13 +186,16 @@ These constants are set at the module level using nameparser.config_.
 
 .. _nameparser.config: https://github.com/derek73/python-nameparser/tree/master/nameparser/config
 
-Predefined Variable Names
-+++++++++++++++++++++++++
+Predefined Variables
+++++++++++++++++++++
+
+These are available via ``from nameparser.config import constants`` or on the ``C`` 
+attribute of a ``HumanName`` instance, e.g. ``hn.C``.
 
 * **prefixes**:
   Parts that come before last names, e.g. 'del' or 'van'
 * **titles**:
-  Parts that come before the first names. Any words included in
+  Parts that come before the first names. Any strings included in
   here will never be considered a first name, so use with care.
 * **suffixes**:
   Parts that appear after the last name, e.g. "Jr." or "MD"
@@ -204,22 +207,25 @@ Predefined Variable Names
   Most parts should be capitalized by capitalizing the first letter.
   There are some exceptions, such as roman numbers used for suffixes.
   You can update this with a dictionary or a tuple. 
+* **RE**: 
+  Contains all the various regular expressions used in the parser.
 
-Each of these predefined sets of variables includes ``add()`` and ``remove()``
-methods for easy modification. They also inherit from ``set()`` so you can 
-modify them with any methods that work on sets. 
+Each of these predefined sets of variables (except ``RE``) includes ``add()``
+and ``remove()`` methods for easy modification. They also inherit from ``set()``
+so you can modify them with any methods that work on sets. ``RE`` is a tuple
+but can be replaced with a dictionary if you need to modify it.
 
 Any strings you add to the constants should be lower case and not include
 periods. The ``add()`` and ``remove()`` method handles that for you
-automatically, but other set methods will not.
+automatically, but other ``set()`` methods will not.
 
 Parser Customization Examples
 +++++++++++++++++++++++++++++
 
 "Hon" is a common abbreviation for "Honorable", a title used when addressing
 judges. It is also sometimes a first name. If your dataset contains more
 "Hon"s than judges, you may wish to remove it from the titles constant so
-that "Hon" can be recognized as a first name.
+that "Hon" can be parsed as a first name.
 
 ::
 
@@ -272,13 +278,13 @@ methods and each string will be added or removed.
     ]>
 
 
-Parser Customizations Are At Module-Level 
+Parser Customizations Are Module-Wide 
 +++++++++++++++++++++++++++++
 
-When you modify the configuration, by default this will modify the behavior all HumanName
-instances. This could be a handy way to set it up for your entire project, but could also 
-lead to some unexpected behavior because changing one instance could modify the behavior 
-of another instance. 
+When you modify the configuration, by default this will modify the behavior of
+all HumanName instances. This could be a handy way to set it up for your entire
+project, but it could also lead to some unexpected behavior because changing one
+instance could modify the behavior of another instance.
 
 ::
 
@@ -307,15 +313,14 @@ of another instance.
     ]>
 
 
-If you'd prefer new instances to have their own config values, you can pass ``None``
-as the second argument when instantiating ``HumanName``. The instance's constants can
-be accessed via its ``C`` attribute. Similarly the regexes can be overridden by
-setting the ``regexes`` argument to ``None``, and the instance's regexes are availabe
-via its ``RE`` attribute.
+If you'd prefer new instances to have their own config values, you can pass
+``None`` as the second argument (or ``constants`` keyword argument) when
+instantiating ``HumanName``. The instance's constants can be accessed via its
+``C`` attribute. 
 
-Note that each instance always has a ``C`` attribute, but if you didn't pass ``None``
-or ``False`` to the ``constants`` argument then you'd still be modifying the module-level
-config values with the behavior described above.
+Note that each instance always has a ``C`` attribute, but if you didn't pass
+something falsey to the ``constants`` argument then you'd still be
+modifying the module-level config values with the behavior described above.
 
 ::
 
@@ -346,10 +351,11 @@ Contributing
 ------------
 
 Please let me know if there are ways this library could be restructured to make
-it easier for you to use in your projects. 
+it easier for you to use in your projects. Read CONTRIBUTING.md_ for more info.
 
     https://github.com/derek73/python-nameparser
 
+.. _CONTRIBUTING.md: https://github.com/derek73/python-nameparser/tree/master/CONTRIBUTING.md
 
 Testing
 +++++++
@@ -391,6 +397,9 @@ Naming Practices and Resources
 Release Log
 -----------
 
+    * 0.3 - May ?, 2014
+        - Refactor configuration to simplify modifications to constants
+        - Use unicode_literals to simplify Python 2 & 3 support.
     * 0.2.10 - May 6, 2014
         - If name is only a title and one part, assume it's a last name instead of a first name. (`#7 <https://github.com/derek73/python-nameparser/issues/7>`_).
         - Add some judicial and other common titles. 
diff --git a/nameparser/config/__init__.py b/nameparser/config/__init__.py
@@ -47,6 +47,13 @@ def remove(self, *strings):
         [self.elements.remove(lc(s)) for s in strings if lc(s) in self.elements]
         return self.elements
 
+
+class Regexes(object):
+    def __init__(self):
+        for name, re in REGEXES:
+            setattr(self, name, re)
+
+
 class Constants(object):
     
     def __init__(self):
@@ -55,7 +62,7 @@ def __init__(self):
         self.titles            = Manager(TITLES)
         self.first_name_titles = Manager(FIRST_NAME_TITLES)
         self.conjunctions      = Manager(CONJUNCTIONS)
-        self.regexes           = Manager(REGEXES)
+        self.RE                = Regexes()
     
     @property
     def suffixes_prefixes_titles(self):
@@ -65,10 +72,4 @@ def suffixes_prefixes_titles(self):
     capitalization_exceptions = CAPITALIZATION_EXCEPTIONS
     
 
-class Regexes(object):
-    def __init__(self):
-        for name, re in REGEXES:
-            setattr(self, name, re)
-
 constants = Constants()
-regexes = Regexes()
diff --git a/nameparser/parser.py b/nameparser/parser.py
@@ -38,15 +38,13 @@ class HumanName(object):
      
     """
     
-    def __init__(self, full_name="", constants=constants, regexes=regexes,
-                                    encoding=ENCODING, string_format=None):
+    def __init__(self, full_name="", constants=constants, encoding=ENCODING, 
+                string_format=None):
         if constants:
             self.C = constants
-            self.RE = regexes or Regexes()
             self.has_own_config = False
         else:
             self.C = Constants()
-            self.RE = Regexes()
             self.has_own_config = True
         self.ENCODING = encoding
         self.string_format = string_format
@@ -197,7 +195,7 @@ def is_rootname(self, piece):
             and not self.is_an_initial(piece) 
     
     def is_an_initial(self, value):
-        return self.RE.initial.match(value) or False
+        return self.C.RE.initial.match(value) or False
 
     # def is_a_roman_numeral(value):
     #     return re_roman_numeral.match(value) or False
@@ -215,6 +213,21 @@ def full_name(self, value):
         self.parse_full_name()
 
     
+    def pre_process(self):
+        """
+        This happens at the beginning of parse_full_name(), before any other
+        processing of the string aside from unicode normalization.
+        """
+        self.parse_nicknames()
+        
+
+    def post_process(self):
+        """
+        This happens at the end of parse_full_name(), after all other
+        processing has taken place.
+        """
+        self.handle_firstnames()
+
     def parse_nicknames(self):
         """
         Handling Nicknames
@@ -226,11 +239,23 @@ def parse_nicknames(self):
         
         https://code.google.com/p/python-nameparser/issues/detail?id=33
         """
-        re_nickname = self.RE.nickname
+        re_nickname = self.C.RE.nickname
         if re_nickname.search(self._full_name):
             self.nickname_list = re_nickname.findall(self._full_name)
             self._full_name = re_nickname.sub('', self._full_name)
 
+    def handle_firstnames(self):
+        """
+        If there are only two parts and one is a title, assume the other is a
+        last name instead of a first name, e.g. "Mr. Johnson". The exception is
+        a special title like "Sir": a single name following it is always a
+        first name.
+        """
+        if self.title \
+            and len(self) == 2 \
+            and not lc(self.title) in self.C.first_name_titles:
+            self.last, self.first = self.first, self.last
+
     def parse_full_name(self):
         """
         Parse full name into the buckets
@@ -247,10 +272,10 @@ def parse_full_name(self):
         if not isinstance(self._full_name, text_type):
             self._full_name = u(self._full_name, self.ENCODING)
         
-        self.parse_nicknames()
+        self.pre_process()
         
         # collapse multiple spaces
-        self._full_name = self.RE.spaces.sub(" ", self._full_name.strip())
+        self._full_name = self.C.RE.spaces.sub(" ", self._full_name.strip())
         
         # break up full_name by commas
         parts = [x.strip() for x in self._full_name.split(",")]
@@ -350,11 +375,13 @@ def parse_full_name(self):
 
     def _parse_pieces(self, parts, additional_parts_count=0):
         """
-        Split parts on spaces and remove commas, join on conjunctions and lastname prefixes.
+        Split parts on spaces and remove commas, join on conjunctions and
+        lastname prefixes.
         
         additional_parts_count: if the comma format contains other parts, we need to know 
         how many there are to decide if things should be considered a conjunction.
         """
+        
         ps = []
         for part in parts:
             ps += [x.strip(' ,') for x in part.split(' ')]
@@ -451,15 +478,6 @@ def find_p(p):
         log.debug("pieces: {0}".format(pieces))
         return pieces
     
-    def post_process(self):
-        # if there are only two parts and one is a title,
-        # assume it's a last name instead of a first name.
-        # e.g. Mr. Johnson. 
-        if self.title \
-            and len(self) == 2 \
-            and not lc(self.title) in self.C.first_name_titles:
-            self.last, self.first = self.first, self.last
-    
     
     ### Capitalization Support
     
@@ -469,19 +487,19 @@ def cap_word(self, word):
         exceptions = dict(self.C.capitalization_exceptions)
         if word in exceptions:
             return exceptions[word]
-        mac_match = self.RE.mac.match(word)
+        mac_match = self.C.RE.mac.match(word)
         if mac_match:
             def cap_after_mac(m):
                 return m.group(1).capitalize() + m.group(2).capitalize()
-            return self.RE.mac.sub(cap_after_mac, word)
+            return self.C.RE.mac.sub(cap_after_mac, word)
         else:
             return word.capitalize()
 
     def cap_piece(self, piece):
         if not piece:
             return ""
         replacement = lambda m: self.cap_word(m.group(0))
-        return self.RE.word.sub(replacement, piece)
+        return self.C.RE.word.sub(replacement, piece)
 
     def capitalize(self):
         """