18. [Activity] Improving the Word Count Script with Regular Expressions
IMPROVING WORD COUNT
Normalizing data, sorting the results
Text Normalization
Problem: word variants with different capitalization, punctuation, etc.
There are fancy natural language processing toolkits like NLTK
But we'll keep it simple, and use a regular expression.
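To make the problem concrete, here is a minimal sketch in plain Scala (no Spark) of what a naive split on spaces does to a line of text. The object name, the helper names, and the sample line are hypothetical and chosen only for illustration; they are not taken from the course files.

```scala
object NaiveSplitDemo {
  // Hypothetical sample line, not taken from the course data set.
  val line: String = "Spark is fast. spark is fun, and Spark scales"

  // Split on single spaces only, leaving punctuation and capitalization untouched.
  def naiveTokens(text: String): Seq[String] = text.split(" ").toSeq

  // Count how many times each distinct token occurs.
  def counts(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // "Spark" and "spark" are counted separately, and "fast." and "fun," keep their punctuation.
    counts(naiveTokens(line)).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```

With this input, "Spark" and "spark" end up as separate entries, and "fun," is treated as a different word from "fun".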
Activity
Looking at WordCount.scala, we need to account for punctuation such as "," that ends up attached to the neighboring word
To improve on WordCount.scala and handle this punctuation, we can look at WordCountBetter.scala from the resource folder
Import WordCountBetter.scala and open it in the Eclipse Scala IDE
The code splits each line with a regular expression that matches runs of one or more non-word characters, which drops the stray "," and other punctuation
It then converts every word to lowercase, so that capitalized and uncapitalized forms of the same word are not counted as separate entries
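The two normalization steps described above can be sketched in plain Scala as follows. This is not the contents of WordCountBetter.scala; it is a small stand-alone illustration that assumes the pattern `\W+` (one or more non-word characters) and uses ordinary collections in place of whatever data structure the course script works on. The object and method names are hypothetical, and the filter for empty strings is an addition of this sketch.

```scala
object NormalizeDemo {
  // Step 1: split on runs of non-word characters (assumed pattern: \W+).
  // Step 2: lowercase every token so capitalization variants merge.
  // The nonEmpty filter is an extra step in this sketch: String.split yields
  // a leading empty string when the text starts with a delimiter.
  def normalize(text: String): Seq[String] =
    text.split("\\W+").toSeq.filter(_.nonEmpty).map(_.toLowerCase)

  // Count occurrences of each normalized word.
  def wordCounts(text: String): Map[String, Int] =
    normalize(text).groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample line, not taken from the course data set.
    val line = "Spark is fast. spark is fun, and Spark scales"
    wordCounts(line).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```

On the sample line, all three spellings of "spark" collapse into a single entry, and "fast" and "fun" lose their trailing punctuation.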
Run it to see the difference in output
Now the output looks much cleaner than that of WordCount.scala
It would be even easier to read if the results were sorted
We will explore this in the next lecture