Commit 7446b6a

move regex config to same constants as everything else #1
1 parent 2a56dbb commit 7446b6a

3 files changed: 78 additions and 50 deletions

README.rst

Lines changed: 31 additions & 22 deletions
@@ -186,13 +186,16 @@ These constants are set at the module level using nameparser.config_.
 
 .. _nameparser.config: https://github.com/derek73/python-nameparser/tree/master/nameparser/config
 
-Predefined Variable Names
-+++++++++++++++++++++++++
+Predefined Variables
+++++++++++++++++++++
+
+These are available via ``from nameparser.config import constants`` or on the ``C``
+attribute of a ``HumanName`` instance, e.g. ``hn.C``.
 
 * **prefixes**:
     Parts that come before last names, e.g. 'del' or 'van'
 * **titles**:
-    Parts that come before the first names. Any words included in
+    Parts that come before the first names. Any strings included in
     here will never be considered a first name, so use with care.
 * **suffixes**:
     Parts that appear after the last name, e.g. "Jr." or "MD"
@@ -204,22 +207,25 @@ Predefined Variable Names
     Most parts should be capitalized by capitalizing the first letter.
     There are some exceptions, such as roman numbers used for suffixes.
     You can update this with a dictionary or a tuple.
+* **RE**:
+    Contains all the various regular expressions used in the parser.
 
-Each of these predefined sets of variables includes ``add()`` and ``remove()``
-methods for easy modification. They also inherit from ``set()`` so you can
-modify them with any methods that work on sets.
+Each of these predefined sets of variables (except ``RE``) includes ``add()``
+and ``remove()`` methods for easy modification. They also inherit from ``set()``
+so you can modify them with any methods that work on sets. ``RE`` is a tuple
+but can be replaced with a dictionary if you need to modify it.
 
 Any strings you add to the constants should be lower case and not include
 periods. The ``add()`` and ``remove()`` method handles that for you
-automatically, but other set methods will not.
+automatically, but other ``set()`` methods will not.
 
 Parser Customization Examples
 +++++++++++++++++++++++++++++
 
 "Hon" is a common abbreviation for "Honorable", a title used when addressing
 judges. It is also sometimes a first name. If your dataset contains more
 "Hon"s than judges, you may wish to remove it from the titles constant so
-that "Hon" can be recognized as a first name.
+that "Hon" can be parsed as a first name.
 
 ::
 
@@ -272,13 +278,13 @@ methods and each string will be added or removed.
     ]>
 
 
-Parser Customizations Are At Module-Level
+Parser Customizations Are Module-Wide
 +++++++++++++++++++++++++++++
 
-When you modify the configuration, by default this will modify the behavior all HumanName
-instances. This could be a handy way to set it up for your entire project, but could also
-lead to some unexpected behavior because changing one instance could modify the behavior
-of another instance.
+When you modify the configuration, by default this will modify the behavior all
+HumanName instances. This could be a handy way to set it up for your entire
+project, but it could also lead to some unexpected behavior because changing one
+instance could modify the behavior of another instance.
 
 ::
 
@@ -307,15 +313,14 @@ of another instance.
     ]>
 
 
-If you'd prefer new instances to have their own config values, you can pass ``None``
-as the second argument when instantiating ``HumanName``. The instance's constants can
-be accessed via its ``C`` attribute. Similarly the regexes can be overridden by
-setting the ``regexes`` argument to ``None``, and the instance's regexes are availabe
-via its ``RE`` attribute.
+If you'd prefer new instances to have their own config values, you can pass
+``None`` as the second argument (or ``constant`` keyword argument) when
+instantiating ``HumanName``. The instance's constants can be accessed via its
+``C`` attribute.
 
-Note that each instance always has a ``C`` attribute, but if you didn't pass ``None``
-or ``False`` to the ``constants`` argument then you'd still be modifying the module-level
-config values with the behavior described above.
+Note that each instance always has a ``C`` attribute, but if you didn't pass
+something falsey to the ``constants`` argument then you'd still be
+modifying the module-level config values with the behavior described above.
 
 ::
 
@@ -346,10 +351,11 @@ Contributing
 ------------
 
 Please let me know if there are ways this library could be restructured to make
-it easier for you to use in your projects.
+it easier for you to use in your projects. Read CONTRIBUTING.md_ for more info.
 
 https://github.com/derek73/python-nameparser
 
+.. _CONTRIBUTING.md: https://github.com/derek73/python-nameparser/tree/master/CONTRIBUTING.md
 
 Testing
 +++++++
@@ -391,6 +397,9 @@ Naming Practices and Resources
 Release Log
 -----------
 
+* 0.3 - May ?, 2014
+    - Refactor configuration to simplify modifications to constants
+    - use unicode_literals to simplify Python 2 & 3 support.
 * 0.2.10 - May 6, 2014
     - If name is only a title and one part, assume it's a last name instead of a first name. (`#7 <https://github.com/derek73/python-nameparser/issues/7>`_).
     - Add some judicial and other common titles.
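
Note on the README change above: the module-wide versus per-instance behavior it describes can be sketched without the library. The `Constants` and `Name` classes below are hypothetical stand-ins (they are not nameparser's real classes), and the title set is made up; only the `if constants:` branching mirrors the `HumanName.__init__` shown later in this commit.

```python
class Constants(object):
    """Toy stand-in for the library's Constants: one mutable set of titles."""

    def __init__(self):
        self.titles = set(["mr", "dr", "hon"])  # illustrative values only


# Module-level instance that every Name shares by default.
constants = Constants()


class Name(object):
    """Hypothetical, stripped-down analogue of HumanName.__init__."""

    def __init__(self, full_name="", constants=constants):
        if constants:
            # Shared object: edits made through self.C are visible everywhere.
            self.C = constants
            self.has_own_config = False
        else:
            # Falsey argument (e.g. None): build a private set of defaults.
            self.C = Constants()
            self.has_own_config = True
        self.full_name = full_name


shared_a = Name("Hon Solo")
shared_b = Name("Dr Who")
private = Name("Hon Solo", None)

shared_a.C.titles.discard("hon")  # changes the module-wide config
private.C.titles.add("captain")   # changes only this instance

print("hon" in shared_b.C.titles)     # False
print("captain" in constants.titles)  # False
print("hon" in private.C.titles)      # True
```

Under these assumptions, removing "hon" through one shared instance also removes it for the other, while the instance created with `None` keeps its own copy of the defaults.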

nameparser/config/__init__.py

Lines changed: 8 additions & 7 deletions
@@ -47,6 +47,13 @@ def remove(self, *strings):
         [self.elements.remove(lc(s)) for s in strings if lc(s) in self.elements]
         return self.elements
 
+
+class Regexes(object):
+    def __init__(self):
+        for name, re in REGEXES:
+            setattr(self, name, re)
+
+
 class Constants(object):
 
     def __init__(self):
@@ -55,7 +62,7 @@ def __init__(self):
         self.titles = Manager(TITLES)
         self.first_name_titles = Manager(FIRST_NAME_TITLES)
         self.conjunctions = Manager(CONJUNCTIONS)
-        self.regexes = Manager(REGEXES)
+        self.RE = Regexes()
 
     @property
     def suffixes_prefixes_titles(self):
@@ -65,10 +72,4 @@ def suffixes_prefixes_titles(self):
     capitalization_exceptions = CAPITALIZATION_EXCEPTIONS
 
 
-class Regexes(object):
-    def __init__(self):
-        for name, re in REGEXES:
-            setattr(self, name, re)
-
 constants = Constants()
-regexes = Regexes()
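
Note on the config change above: a minimal sketch of how a `Regexes` container built from `(name, pattern)` pairs ends up reachable as `C.RE.<name>`, which is how parser.py reads it in this commit. The two patterns in `REGEXES` are assumptions made for illustration; the real tuple is not part of this diff.

```python
import re

# Hypothetical patterns; the library's actual REGEXES tuple is not shown here.
REGEXES = (
    ("spaces", re.compile(r"\s+")),
    ("initial", re.compile(r"^(\w\.|[A-Z])$")),
)


class Regexes(object):
    """Expose each (name, pattern) pair as an attribute."""

    def __init__(self):
        for name, pattern in REGEXES:
            setattr(self, name, pattern)


class Constants(object):
    """Reduced to the one attribute this sketch needs."""

    def __init__(self):
        self.RE = Regexes()


C = Constants()

# Same call shape as the "collapse multiple spaces" line in parser.py.
collapsed = C.RE.spaces.sub(" ", "  John   Q.  Public ".strip())
print(collapsed)                       # John Q. Public
print(bool(C.RE.initial.match("Q.")))  # True

# A second Constants gets its own Regexes object, so rebinding one of its
# attributes leaves the first instance untouched.
own = Constants()
own.RE.spaces = re.compile(r"[ \t]+")
print(C.RE.spaces is own.RE.spaces)    # False
```

Because each `Constants()` builds a fresh `Regexes()`, a pattern can be swapped on one config object by plain attribute assignment.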

nameparser/parser.py

Lines changed: 39 additions & 21 deletions
@@ -38,15 +38,13 @@ class HumanName(object):
 
     """
 
-    def __init__(self, full_name="", constants=constants, regexes=regexes,
-                 encoding=ENCODING, string_format=None):
+    def __init__(self, full_name="", constants=constants, encoding=ENCODING,
+                 string_format=None):
         if constants:
             self.C = constants
-            self.RE = regexes or Regexes()
             self.has_own_config = False
         else:
             self.C = Constants()
-            self.RE = Regexes()
             self.has_own_config = True
         self.ENCODING = encoding
         self.string_format = string_format
@@ -197,7 +195,7 @@ def is_rootname(self, piece):
             and not self.is_an_initial(piece)
 
     def is_an_initial(self, value):
-        return self.RE.initial.match(value) or False
+        return self.C.RE.initial.match(value) or False
 
     # def is_a_roman_numeral(value):
     # return re_roman_numeral.match(value) or False
@@ -215,6 +213,21 @@ def full_name(self, value):
         self.parse_full_name()
 
 
+    def pre_process(self):
+        """
+        This happens at the beginning of the parse_full_name() before
+        any other processing of the string aside from unicode normalization.
+        """
+        self.parse_nicknames()
+
+
+    def post_process(self):
+        """
+        This happens at the end of the parse_full_name() after
+        all other processing has taken place.
+        """
+        self.handle_firstnames()
+
     def parse_nicknames(self):
         """
         Handling Nicknames
@@ -226,11 +239,23 @@ def parse_nicknames(self):
 
         https://code.google.com/p/python-nameparser/issues/detail?id=33
         """
-        re_nickname = self.RE.nickname
+        re_nickname = self.C.RE.nickname
         if re_nickname.search(self._full_name):
             self.nickname_list = re_nickname.findall(self._full_name)
             self._full_name = re_nickname.sub('', self._full_name)
 
+    def handle_firstnames(self):
+        """
+        If there are only two parts and one is a title, assume it's a last name
+        instead of a first name. e.g. Mr. Johnson. Unless it's a special title
+        like "Sir", then when it's followed by a single name that name is always
+        a first name.
+        """
+        if self.title \
+            and len(self) == 2 \
+            and not lc(self.title) in self.C.first_name_titles:
+            self.last, self.first = self.first, self.last
+
     def parse_full_name(self):
         """
         Parse full name into the buckets
@@ -247,10 +272,10 @@ def parse_full_name(self):
         if not isinstance(self._full_name, text_type):
             self._full_name = u(self._full_name, self.ENCODING)
 
-        self.parse_nicknames()
+        self.pre_process()
 
         # collapse multiple spaces
-        self._full_name = self.RE.spaces.sub(" ", self._full_name.strip())
+        self._full_name = self.C.RE.spaces.sub(" ", self._full_name.strip())
 
         # break up full_name by commas
         parts = [x.strip() for x in self._full_name.split(",")]
@@ -350,11 +375,13 @@ def parse_full_name(self):
 
     def _parse_pieces(self, parts, additional_parts_count=0):
         """
-        Split parts on spaces and remove commas, join on conjunctions and lastname prefixes.
+        Split parts on spaces and remove commas, join on conjunctions and
+        lastname prefixes.
 
         additional_parts_count: if the comma format contains other parts, we need to know
         how many there are to decide if things should be considered a conjunction.
         """
+
         ps = []
         for part in parts:
             ps += [x.strip(' ,') for x in part.split(' ')]
@@ -451,15 +478,6 @@ def find_p(p):
         log.debug("pieces: {0}".format(pieces))
         return pieces
 
-    def post_process(self):
-        # if there are only two parts and one is a title,
-        # assume it's a last name instead of a first name.
-        # e.g. Mr. Johnson.
-        if self.title \
-            and len(self) == 2 \
-            and not lc(self.title) in self.C.first_name_titles:
-            self.last, self.first = self.first, self.last
-
 
     ### Capitalization Support
 
@@ -469,19 +487,19 @@ def cap_word(self, word):
         exceptions = dict(self.C.capitalization_exceptions)
         if word in exceptions:
             return exceptions[word]
-        mac_match = self.RE.mac.match(word)
+        mac_match = self.C.RE.mac.match(word)
         if mac_match:
             def cap_after_mac(m):
                 return m.group(1).capitalize() + m.group(2).capitalize()
-            return self.RE.mac.sub(cap_after_mac, word)
+            return self.C.RE.mac.sub(cap_after_mac, word)
         else:
             return word.capitalize()
 
     def cap_piece(self, piece):
         if not piece:
             return ""
         replacement = lambda m: self.cap_word(m.group(0))
-        return self.RE.word.sub(replacement, piece)
+        return self.C.RE.word.sub(replacement, piece)
 
     def capitalize(self):
         """

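
Note on the parser.py change: this commit turns `pre_process()` and `post_process()` into overridable hooks around `parse_full_name()`, and moves the title-plus-one-name swap into `handle_firstnames()`. The sketch below shows that shape with a hypothetical `TinyName` class. Its parsing logic, title sets, and `__len__` are simplified assumptions, and its `pre_process()` strips periods as a stand-in for the nickname handling in the real method.

```python
class TinyName(object):
    """Hypothetical, heavily simplified parser; not nameparser's real logic."""

    TITLES = set(["mr", "dr", "sir"])  # illustrative values only
    FIRST_NAME_TITLES = set(["sir"])   # titles followed by a first name

    def __init__(self, full_name):
        self.title = self.first = self.last = ""
        self.full_name = full_name
        self.parse_full_name()

    def __len__(self):
        # Count the non-empty name parts.
        return len([p for p in (self.title, self.first, self.last) if p])

    def pre_process(self):
        # Hook that runs before splitting; here it only drops periods.
        self.full_name = self.full_name.replace(".", "")

    def post_process(self):
        # Hook that runs after all other parsing.
        self.handle_firstnames()

    def handle_firstnames(self):
        # Title plus one name: treat the name as a last name ("Mr. Johnson"),
        # unless the title is one that precedes a first name ("Sir").
        if self.title and len(self) == 2 \
                and self.title.lower() not in self.FIRST_NAME_TITLES:
            self.last, self.first = self.first, self.last

    def parse_full_name(self):
        self.pre_process()
        pieces = self.full_name.split()
        if pieces and pieces[0].lower() in self.TITLES:
            self.title = pieces.pop(0)
        if pieces:
            self.first = pieces.pop(0)
        self.last = " ".join(pieces)
        self.post_process()


class KeepFirst(TinyName):
    """Subclass that overrides the hook to skip the swap."""

    def post_process(self):
        pass


print(repr(TinyName("Mr. Johnson").last))    # 'Johnson'
print(repr(TinyName("Sir Galahad").first))   # 'Galahad'
print(repr(KeepFirst("Mr. Johnson").first))  # 'Johnson'
```

With the hooks split out this way, a subclass can change what happens before or after parsing without copying `parse_full_name()` itself.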