I need to do two things. First, I need to find the most used words and word sequences (limited to n) in a given text. Example:
Lorem *ipsum* dolor sit amet, consectetur adipiscing elit. Nunc auctor urna sed urna mattis nec interdum magna ullamcorper. Donec ut lorem eros, id rhoncus nisl. Praesent sodales lorem vitae sapien volutpat et accumsan lorem viverra. Proin lectus elit, cursus ut feugiat ut, porta sit amet leo. Cras est nisl, aliquet quis lobortis sit amet, viverra non erat. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Integer euismod scelerisque quam, et aliquet nibh dignissim at. Pellentesque ut elit neque. Etiam facilisis nisl eu mauris luctus in consequat libero volutpat. Pellentesque auctor, justo in suscipit mollis, erat justo sollicitudin ipsum, in cursus erat ipsum id turpis. In tincidunt hendrerit scelerisque.
(Some words may have been omitted, but it's just an example.)
I'd like the result to contain "sit amet" as a single sequence, not "sit" and "amet" as separate words.
Any ideas on how to start?
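To make the first part concrete, here is a rough sketch of what I have in mind, written in Python only as an illustration. It assumes that "limited to n" means sequences of at most n words, that words are runs of letters/digits, and that case does not matter. The names `count_sequences`, `drop_subsumed` and `min_count` are made up for this example. The second function is one possible way to keep "sit amet" and discard "sit" and "amet": a sequence is dropped when a longer sequence containing it occurs exactly as often.

```python
import re
from collections import Counter

def count_sequences(text, max_n=3):
    """Count every run of 1..max_n consecutive words (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def drop_subsumed(counts, min_count=2):
    """Keep frequent sequences that are not fully explained by a longer one.

    Assumption: a sequence is redundant when some longer sequence that
    contains it (on word boundaries) has exactly the same count.
    """
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    kept = {}
    for seq, c in frequent.items():
        padded = f" {seq} "  # pad so "sit" does not match inside "transit"
        subsumed = any(
            other != seq and oc == c and padded in f" {other} "
            for other, oc in frequent.items()
        )
        if not subsumed:
            kept[seq] = c
    return kept

sample = "sit amet foo sit amet bar sit amet"
print(drop_subsumed(count_sequences(sample, max_n=2)))
```

On that small sample only `sit amet` (3 occurrences) should survive, because `sit` and `amet` never occur outside it. The pairwise comparison is quadratic in the number of frequent sequences, which I assume is acceptable for a short list.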
Second, I need to wrap every word or word sequence from a given list wherever it is matched in a given file.
For this, I'm thinking of sorting the list by descending length and then passing each string to a replace function in that order, so that sit amet gets wrapped as a whole even if sit also appears on its own in my list. Is this a good approach?
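As a sketch of the longest-first idea (again in Python, purely as an illustration): instead of running one replace per string, the sorted strings could be joined into a single regular expression, so the text is scanned once and an already wrapped match is never touched again. The function name `wrap_phrases` and the `<b>`/`</b>` wrappers are placeholders I made up, and the sketch assumes every entry in the list starts and ends with a word character.

```python
import re

def wrap_phrases(text, phrases, before="<b>", after="</b>"):
    """Wrap each listed phrase, preferring the longest phrase at any position."""
    # Longest first: the regex engine tries alternatives left to right,
    # so "sit amet" is tried before "sit".
    ordered = sorted(set(phrases), key=len, reverse=True)
    if not ordered:
        return text  # nothing to match
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Single pass: matched text is consumed and not scanned a second time.
    return pattern.sub(lambda m: before + m.group(0) + after, text)

print(wrap_phrases("porta sit amet leo, sit here", ["sit", "sit amet"]))
```

With the list `["sit", "sit amet"]`, this wraps `sit amet` as one unit and the later standalone `sit` separately. Doing it with sequential replaces instead would risk matching `sit` again inside text that was already wrapped.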
Thank you