How to clean our data in Python (here, Persian documents)
In language processing, if you want better results you should clean your data in advance. Some words and punctuation marks have very little effect on your result.
The first thing you should do before cleaning your data is get to know your data thoroughly, because data cleaning is NOT a fixed method that you can apply to every project.
There are three common methods that would be useful (the ones I use in this project):
. STOP WORDS
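Stop words are very frequent words that add little meaning on their own, so they are usually removed before analysis. In this project I collected a CSV file that contains "STOP WORDS IN PERSIAN".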
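As a rough illustration of this step, here is a minimal sketch of loading stop words from a CSV and dropping them from a sentence. The layout of the CSV (one word per row) and the helper names `load_stop_words` and `remove_stop_words` are my assumptions, not the actual file or code of this project, and the small in-memory string only stands in for the real file.

```python
import csv
import io

# Hypothetical stand-in for the stop-word CSV (assumed: one word per row).
# With a real file you would pass open(path, encoding="utf-8", newline="") instead.
STOP_WORDS_CSV = "و\nدر\nبه\nاز\nکه\nاین\n"


def load_stop_words(csv_file):
    """Read the first column of every non-empty row into a set."""
    return {row[0].strip() for row in csv.reader(csv_file) if row and row[0].strip()}


def remove_stop_words(text, stop_words):
    """Split on whitespace and keep only tokens that are not stop words."""
    return " ".join(token for token in text.split() if token not in stop_words)


stop_words = load_stop_words(io.StringIO(STOP_WORDS_CSV))
cleaned = remove_stop_words("این کتاب در خانه است", stop_words)
print(cleaned)
```

A set is used so that each lookup is fast even when the stop-word list is long. Splitting on whitespace is a simplification; real Persian text usually needs normalization and a proper tokenizer first.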
. TfidfVectorizer
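The TF-IDF method tries to find words that are repeated a lot across the documents and do not carry important information. Such words appear in most documents, so they get a low weight, while words that are specific to a few documents get a high weight.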
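To make this concrete, here is a small sketch using scikit-learn's `TfidfVectorizer` on three toy Persian sentences. The sentences are invented for illustration and the default settings are an assumption, not necessarily the configuration used in this project. The learned `idf_` values show which words occur in every document and are therefore candidates for removal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
docs = [
    "این کتاب خوب است",
    "این فیلم خوب است",
    "این غذا خوشمزه است",
]

vectorizer = TfidfVectorizer()          # default tokenizer keeps words of 2+ characters
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Map every term to its inverse document frequency (IDF).
idf = {term: vectorizer.idf_[index] for term, index in vectorizer.vocabulary_.items()}

# Terms with the lowest IDF appear in the most documents.
for term, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(term, round(weight, 3))
```

Here "این" and "است" occur in all three sentences, so they get the lowest IDF, while a word like "کتاب" that occurs in only one sentence gets the highest. Note that the default tokenizer drops one-character words such as "و", which may or may not be what you want for Persian.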