21 years after Perl 5, 15 years after Larry Wall’s State of the Onion, 12 years after the book Perl 6 Essentials, yes, finally, Perl 6 is coming this Christmas (2015)!!! I am a happy Webmin user, and I can book a hotel room via booking.com. Perl is still alive.
My favorite Perl 5 script is a one-liner:
perl -0777 -ne 'print "$1\n" while m!(<Templates>.*?</Templates>)!sg'
- -0 sets the input record separator as an octal character code. The separator is normally the newline. 777 is a special value that tells Perl there is no separator at all, so each file is read as one single record. This makes it possible to search across newlines.
- -n loops through all records
- -e runs the script given on the command line
- print "$1\n" prints group 1 of the regular expression, i.e. the text matched between ( and ), followed by a newline
- m! starts a match with ! as alternative delimiter. A regular expression is normally written as /<pattern>/ or m/<pattern>/, where / is the delimiter. Perl allows you to use another delimiter. ! is used here so that / can occur in the search text without escaping it with a backslash.
- <Templates> … </Templates> is what I search for in my XML files.
- .* is a wildcard match. Normally you get the longest match, but .*? gives the shortest match. In other words: it matches non-greedily.
- !sg closes the pattern and adds two modifiers. The s modifier lets the . in .* match newlines. The g modifier (global match) searches for all matches.
The above Perl script searches for <Templates> blocks in XML files. It doesn’t work correctly with nested blocks, but I don’t need that anyway. Its purpose is to quickly scan a project for Template information without needing a heavy-duty program to parse XML files. If I want to find something else, I just change the regular expression to my liking.
Wait a minute, isn’t this blog about Elixir?
Of course it is. I was curious what the Perl script would look like in Elixir, and I wanted to get acquainted with Elixir through something simple, so I decided to make an Elixir command line script that does the same.
I first looked at how regular expressions are done in Elixir. I found the Regex module, and tried some things in the iex shell:
nico@nico-ubuntu:~/elixir/adhoc_scripts/multilinesearch$ iex
Erlang/OTP 18 [erts-7.1] [smp:4:4] [async-threads:10] [kernel-poll:false]
Interactive Elixir (1.1.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Regex.replace(~r/abc/, "aaabccccc abc qq", "def")
"aadefcccc def qq"
iex(2)> Regex.replace(~r/a(b|x)c/, "aaaxccccc abc axc q", "\\1")
"aaxcccc b x q"
iex(3)> Regex.scan(~r/BEGIN.*END/, "ab BEGIN cde \n fg END h \n ij BEGIN klm END nop")
[["BEGIN klm END"]]
iex(4)> Regex.scan(~r/BEGIN.*END/U, "ab BEGIN cde \n fg END h \n ij BEGIN klm END nop")
[["BEGIN klm END"]]
iex(5)> Regex.scan(~r/BEGIN.*END/s, "ab BEGIN cde \n fg END h \n ij BEGIN klm END nop")
[["BEGIN cde \n fg END h \n ij BEGIN klm END"]]
iex(6)> Regex.scan(~r/BEGIN.*END/Us, "ab BEGIN cde \n fg END h \n ij BEGIN klm END nop")
[["BEGIN cde \n fg END"], ["BEGIN klm END"]]
iex(7)>
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
(v)ersion (k)ill (D)b-tables (d)istribution
a
My conclusion: use Regex.scan to find all matches, with the U modifier for the ungreedy match and the s modifier to let . match newlines as well.
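As a small sketch of that conclusion: the U modifier and an explicit lazy quantifier (.*?) should give the same matches, so the pattern can stay close to the Perl original. The module name RegexSketch and its function names are made up for this illustration, and flattening the result with List.flatten is my own addition, not something the iex session above does.

```elixir
defmodule RegexSketch do
  # Hypothetical helper: returns every BEGIN..END block as a flat list of strings.
  # The s modifier lets . match newlines, the U modifier makes .* ungreedy.
  def blocks_ungreedy(text) do
    ~r/BEGIN.*END/Us
    |> Regex.scan(text)
    |> List.flatten()
  end

  # Same idea with the lazy quantifier .*? that the Perl one-liner uses.
  def blocks_lazy(text) do
    ~r/BEGIN.*?END/s
    |> Regex.scan(text)
    |> List.flatten()
  end
end

text = "ab BEGIN cde \n fg END h \n ij BEGIN klm END nop"
IO.inspect(RegexSketch.blocks_ungreedy(text))
IO.inspect(RegexSketch.blocks_lazy(text))
```

Both calls should print the same two blocks as iex(6) above, only without the inner lists.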
I needed some test data, so I created two XML files:
mltest_a.xml:
<?xml version="1.0" encoding="utf-8"?>
<root>
<item name="item a1">
<definition />
<Templates>
<tpl name="tpl1">tpl text1</tpl>
<tpl name="tpl2">tpl text2</tpl>
</Templates>
<properties />
</item>
<item name="item a2"><Templates><tpl name="tpl3" >tpl text3</tpl></Templates></item>
</root>
mltest_b.xml:
<?xml version="1.0" encoding="utf-8"?>
<root>
<item name="item b">
<definition />
<Templates>
<tpl name="tpl4">tpl text4</tpl>
<!-- here comes a nested template -->
<Templates>
<tpl name="tpl5">tpl text5</tpl>
</Templates>
<tpl name="tpl6">this one will be missed out</tpl>
</Templates>
<properties />
</item>
</root>
Output
The Perl output when I pass both files as arguments is:
<Templates>
<tpl name="tpl1">tpl text1</tpl>
<tpl name="tpl2">tpl text2</tpl>
</Templates>
<Templates><tpl name="tpl3" >tpl text3</tpl></Templates>
<Templates>
<tpl name="tpl4">tpl text4</tpl>
<!-- here comes a nested template -->
<Templates>
<tpl name="tpl5">tpl text5</tpl>
</Templates>
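The nested case can be reproduced with a much smaller input. The sketch below uses Elixir’s Regex.scan with a lazy pattern, which I assume behaves like the Perl one-liner for this input. The tag name <T> and the module name NestedSketch are invented for the example.

```elixir
defmodule NestedSketch do
  # Hypothetical helper: lazy match from an opening <T> to the nearest closing </T>.
  def blocks(text) do
    ~r/<T>.*?<\/T>/s
    |> Regex.scan(text)
    |> List.flatten()
  end
end

# The outer block is cut off at the first </T>, so the trailing "c" is never reported.
IO.inspect(NestedSketch.blocks("<T>a<T>b</T>c</T>"))
```

The match stops at the first closing tag it meets, which is why the last tpl line of mltest_b.xml is missing from the output.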
If I don’t pass any parameter, Perl will read from stdin. The same applies if you pass the ‘-‘ parameter. Maybe you know this trick from cat:
echo "and the next file comes here:" | cat mltest_a.xml - mltest_b.xml
This will print the echoed text between the contents of the two files.
The Elixir script has to parse its command line parameters as well, so next I had a look at the OptionParser:
iex(1)> OptionParser.parse(["--what", "-c=5", "-d", "100", "test.xml"], switches: [help: :boolean])
{[what: true], ["test.xml"], [{"-c", "5"}, {"-d", "100"}]}
iex(2)> OptionParser.parse(["--what", "-c=5", "-d", "100", "test.xml"], strict: [help: :boolean])
{[], ["100", "test.xml"], [{"--what", nil}, {"-c", nil}, {"-d", nil}]}
iex(3)> OptionParser.parse(["--help", "-c=6", "-d", "200", "test2.xml"], switches: [help: :boolean])
{[help: true], ["test2.xml"], [{"-c", "6"}, {"-d", "200"}]}
iex(4)> OptionParser.parse(["--help", "-c=6", "-d", "200", "test2.xml"], strict: [help: :boolean])
{[help: true], ["200", "test2.xml"], [{"-c", nil}, {"-d", nil}]}
iex(5)> { ok, files, failures } = OptionParser.parse(["--help", "-c=6", "-d", "200", "test2.xml"], strict: [help: :boolean])
{[help: true], ["200", "test2.xml"], [{"-c", nil}, {"-d", nil}]}
iex(6)> ok
[help: true]
iex(7)> files
["200", "test2.xml"]
iex(8)> failures
[{"-c", nil}, {"-d", nil}]
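The three-element tuple is convenient to pattern match on. Here is a minimal sketch, assuming strict mode and a single --help switch as in the experiments above. The module name ArgsSketch and the return values (:help, {:files, ...}, {:invalid, ...}) are invented for illustration. The values in the third list may differ between Elixir versions, so the sketch only looks at the option names.

```elixir
defmodule ArgsSketch do
  # Hypothetical helper: classify the command line into one of three outcomes.
  def classify(args) do
    case OptionParser.parse(args, strict: [help: :boolean]) do
      # --help was the only recognised option
      {[help: true], _files, _invalid} ->
        :help

      # no unknown options: everything left over is a file name
      {_opts, files, []} ->
        {:files, files}

      # at least one unknown option: report the option names only
      {_opts, _files, invalid} ->
        {:invalid, Enum.map(invalid, fn {name, _value} -> name end)}
    end
  end
end

IO.inspect(ArgsSketch.classify(["--help"]))
IO.inspect(ArgsSketch.classify(["a.xml", "b.xml"]))
IO.inspect(ArgsSketch.classify(["--what", "a.xml"]))
```

Returning a value instead of printing keeps the decision logic easy to try out in iex.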
This is the final result:
grep_templates.exs:
#!/usr/bin/env elixir
#

defmodule GrepTemplates do
  @moduledoc """
  Demonstrate multiline search in file content.
  Prints everything between <Templates> and </Templates> markers.
  Runs as CLI utility.
  """

  @doc "Call main with file names. E.g. main([\"xmlfile.xml\"])"
  def main(args) do
    # OptionParser.parse returns { option_list, file_list, unknown_option_list }
    parse = OptionParser.parse(args, strict: [help: :boolean])
    case parse do
      {[help: true], _, _} -> show_help()
      {_, [], []} -> scan ["-"]
      {_, file_name_list, []} -> scan file_name_list
      {_, _, failed_option_list} -> show_error failed_option_list
      _ -> IO.puts(:stderr, "Error while parsing arguments.")
    end
  end

  # print usage line
  defp show_help, do: IO.puts "usage: grep_templates.exs [--help] [file...]"

  # print last line of unknown options
  defp show_error([]), do: IO.puts(:stderr, "Type 'grep_templates.exs --help' for more information.")

  # print unknown options
  defp show_error([option_value | tail]) do
    { option, _ } = option_value
    # String.slice(option, 1..-1) drops the first dash of the option name
    IO.puts(:stderr, "grep_templates.exs: Unknown option '" <> String.slice( option, 1..-1 ) <> "'")
    show_error tail
  end

  def scan([]), do: :ok

  def scan(["-" | tail]) do
    stdin_text = IO.read :all
    print_templates stdin_text
    scan tail
  end

  @doc "search in given files for everything between <Templates> and </Templates> markers"
  def scan([filename | tail]) do
    file = File.read! filename
    print_templates file
    scan tail
  end

  defp print_templates(text) do
    list_of_lists = Regex.scan(~r/<Templates>.*<\/Templates>/Us, text, capture: :all)
    IO.puts Enum.join(list_of_lists, "\n")
  end
end

# call main method with given arguments
GrepTemplates.main(System.argv())
Make the script executable:
chmod +x *.exs
Make sure that elixir is in the executable search path and start the script:
./grep_templates.exs mltest_a.xml mltest_b.xml
The output is exactly the same as that of the Perl script.
There is a noticeable delay when you start the script before you get any output. On my system it takes half a second. I would not worry about it too much, but don’t place the script in a big loop. I combine the find and xargs commands to get the list of files that must be passed on to the script:
find . -type f -name '*.xml' | xargs ./grep_templates.exs
Explanation of the code:
- The first line is the shebang, which invokes elixir to run the script.
- The last line, GrepTemplates.main(System.argv()), makes it a script. Most of the code is put in a module because that makes it easier to adopt it in a ‘mix’ project.
- @moduledoc and @doc: these are for generating documentation with ExDoc. You can use markdown, but please don’t use # or h1, because that looks weird in the end result. The first line of the @moduledoc appears on the modules overview page.
- def main(args): I could have used any function name, but I used main/1 for compatibility with ‘mix escript.build’.
- OptionParser.parse: parses the command line arguments.
- The case expression: some serious pattern matching. Works like a router.
- show_help: private function to print the usage line.
- show_error([]): prints the closing hint after all unknown switches have been reported. This clause also ends the recursive show_error calls.
- show_error([option_value | tail]): picks the first item of the list.
- { option, _ } = option_value: we are only interested in the option of the option-value tuple.
- <> is used for string concatenation.
- show_error tail and scan tail: recursive calls to handle the other items of the list.
- scan([]): the recursion of the scan function ends here.
- scan(["-" | tail]): matches the “-” parameter.
- IO.read :all: reads everything from stdin.
- print_templates: receives the text from stdin or from a file and prints the Templates found in it.
- File.read!: reads the file content. Fails if the file doesn’t exist.
- Regex.scan: the regular expression to find anything between <Templates> and </Templates>.
- Enum.join: joins the strings in the list of lists, using newline as separator.
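The recursive style used by show_error and scan can be tried out in isolation. This is a sketch of my own and not part of the script: the module name RecursionSketch and the function messages/1 are made up, and it returns the messages as a list instead of printing them to stderr, which makes it easy to check in iex. It also assumes every option name starts with a dash.

```elixir
defmodule RecursionSketch do
  # Base case: an empty list ends the recursion with the closing hint.
  def messages([]), do: ["Type 'grep_templates.exs --help' for more information."]

  # Recursive case: the function head takes the first tuple apart and drops
  # the leading dash of the option name, then the tail is handled recursively.
  def messages([{"-" <> name, _value} | tail]) do
    ["Unknown option '" <> name <> "'" | messages(tail)]
  end
end

IO.inspect(RecursionSketch.messages([{"-c", nil}, {"-d", nil}]))
```

Each function head handles exactly one shape of input, so there is no need for an if or a loop counter.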
This was a fun exercise for me.
Nico