I have five text files that I input to a CountVectorizer. When specifying min_df and max_df for the CountVectorizer instance, what exactly do the min/max document frequencies mean? Is it the frequency of a word within its particular text file, or the frequency of the word across the entire corpus (all five text files)?
What are the differences when min_df and max_df are provided as integers versus as floats?
The documentation doesn't seem to provide a thorough explanation, nor does it supply an example demonstrating the use of these two parameters. Could someone provide an explanation or example demonstrating min_df and max_df?