(Thanks to greg0ire below for helping with key concepts)
The challenge: Build a program that finds all repeated substrings and "tags" them with color attributes (effectively highlighting them in XML).
The rules:
- This should only be done for substrings of length 2 or more.
- Substrings are just strings of consecutive characters, which may include non-alphabetic characters. Note that spaces and other punctuation do not delimit substrings.
- Character casing cannot be ignored.
- The "highlight" should be done by tagging the substring in XML. Your tagging should be of the form <TAG#>theSubstring</TAG#>, where # is a positive number unique to that substring and identical substrings.
- The priority of the algorithm is to find the longest substring, not how many times it matches within the text.
Note: The order of the tagging shown in the example below is not important. It's just used by the OP for clarity.
An example input:
LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.
A partially correct output (the OP may NOT have replaced everything perfectly in this example):
<TAG1>LoremIpsum</TAG1>issimply<TAG2>dummytext</TAG2>of<TAG5>the</TAG5><TAG3>print</TAG3>ingand<TAG4>type</TAG4>setting<TAG6>industry</TAG6>.<TAG1>LoremIpsum</TAG1>hasbeen<TAG5>the</TAG5><TAG6>industry</TAG6>'sstandard<TAG2>dummytext</TAG2>eversince<TAG5>the</TAG5>1500s,whenanunknown<TAG3>print</TAG3>ertookagalleyof<TAG4>type</TAG4>andscrambledittomakea<TAG4>type</TAG4>specimenbook.
Your code should be able to handle edge cases, such as the following:
Example Input 2:
hello!TAG!</hello.TAG.</
Example Output 2:
<TAG1>hello</TAG1>!<TAG2>TAG</TAG2>!<TAG3></</TAG3><TAG1>hello</TAG1>.<TAG2>TAG</TAG2>.<TAG3></</TAG3>
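For anyone wanting a starting point, here is a minimal Python sketch of one possible reading of the rules. It is not a reference solution. The assumptions are mine, not the OP's: the search is greedy (longest repeated substring first), occurrences must not overlap, already tagged text is never searched again, and ties between equally long candidates go to the leftmost one. The names `find_longest_repeat` and `tag_repeats` are made up for illustration.

```python
def find_longest_repeat(segments):
    """Return the longest substring (length >= 2) that occurs at least
    twice, without overlapping, in the untagged segments, or None."""
    texts = [s for s, tag in segments if tag is None]
    longest = max((len(t) for t in texts), default=0)
    # Try the longest candidates first, down to the minimum length of 2.
    for length in range(longest, 1, -1):
        for t in texts:
            for i in range(len(t) - length + 1):
                cand = t[i:i + length]
                # str.count only counts non-overlapping occurrences.
                if sum(u.count(cand) for u in texts) >= 2:
                    return cand
    return None


def tag_repeats(text):
    """Greedily wrap repeated substrings in <TAGn>...</TAGn>."""
    # Each segment is (string, tag number), with None meaning "untagged".
    segments = [(text, None)]
    n = 0
    while True:
        cand = find_longest_repeat(segments)
        if cand is None:
            break
        n += 1
        new_segments = []
        for s, tag in segments:
            if tag is not None:
                new_segments.append((s, tag))  # already tagged, keep as is
                continue
            # Splitting on cand leaves the untagged text around each match.
            for k, part in enumerate(s.split(cand)):
                if k:
                    new_segments.append((cand, n))
                if part:
                    new_segments.append((part, None))
        segments = new_segments
    return "".join(
        s if tag is None else f"<TAG{tag}>{s}</TAG{tag}>"
        for s, tag in segments
    )


print(tag_repeats("hello!TAG!</hello.TAG.</"))
```

On Example Input 2 this prints the same string as Example Output 2. It is plain brute force, so it gets slow on long inputs, and on the first example the tag numbers may come out in a different order than the OP's (which the note above says is fine).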
The winner:
- Most elegant solution wins (judged by others' comments and upvotes)
- Bonus points/consideration for solutions utilizing shell scripting
Minor clarifications:
- Input can be hard coded or read from a file
- The criterion remains "elegance", which admittedly IS slightly vague, but it also encapsulates simple character/line counts. Comments by others and/or upvotes are also indicative of how the SO community views the challenge.