Best answer
This is a self-answered post. Below I outline a common problem in the NLP domain and propose a few performant methods to solve it.
Oftentimes the need arises to remove punctuation during text cleaning and pre-processing. Punctuation is defined as any character in string.punctuation:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
This is a common enough problem and has been asked before ad nauseam. The most idiomatic solution uses pandas str.replace. However, when a lot of text is involved, a more performant solution may be needed.
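For reference, the str.replace baseline mentioned above might look like the following minimal sketch. The DataFrame, the column names text and clean, and the sample strings are made up for illustration; the pattern is simply a regex character class built from string.punctuation.

```python
import re
import string

import pandas as pd

# Hypothetical sample data; the column name "text" is an assumption.
df = pd.DataFrame({"text": ["Hello, world!", "a.b,c;d", "no punctuation here"]})

# Character class matching any single character in string.punctuation.
# re.escape guards characters such as ], \ and ^ that are special inside [...].
pattern = "[{}]".format(re.escape(string.punctuation))

# regex=True is passed explicitly because the default has changed across pandas versions.
df["clean"] = df["text"].str.replace(pattern, "", regex=True)

print(df["clean"].tolist())
# ['Hello world', 'abcd', 'no punctuation here']
```

This is the version any alternative would be compared against.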
What are some good, performant alternatives to str.replace when dealing with hundreds of thousands of records?
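As one candidate worth timing (a sketch only, not necessarily the fastest option), Python's built-in str.translate can delete all punctuation in a single pass per string without going through the regex engine. The helper name strip_punctuation and the sample records below are hypothetical; whether this actually beats str.replace on a given dataset should be measured, for example with timeit.

```python
import string

# Translation table that maps every character in string.punctuation to None,
# which tells str.translate to delete it.
table = str.maketrans("", "", string.punctuation)


def strip_punctuation(texts):
    """Return a new list with ASCII punctuation removed from each string."""
    return [t.translate(table) for t in texts]


# Hypothetical records standing in for a large text column.
records = ["Hello, world!", "it's 5 o'clock...", "plain"]
cleaned = strip_punctuation(records)

print(cleaned)
# ['Hello world', 'its 5 oclock', 'plain']
```

With pandas, the same function could be applied to a column by passing its values as a list and assigning the result back. Note that this sketch assumes every record is a string; missing values would need to be handled separately.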