Fri, Apr 19, 2019
Read in 5 minutes
Elixir has an interesting set of tools that make parsing text a breeze.
So recently I was getting annoyed that my local bank (looking at you, DBS Bank) doesn’t provide .csv or .xlsx files for my credit card transaction history. I can see the transactions in the iBanking platform, but there’s no way to download them (confirmed by the Live Chat CSO).
Then a thought struck me. I’ve heard a lot of good things about Elixir being a good text parsing language. You can even parse .mp3 files and read metadata in just 25 lines of code. Sick!
So let’s see what we get after highlighting the HTML tables and pasting in Sublime Text:
...
12 Mar 2019 DELIVEROO.COM.SG S$31.26
13 Mar 2019 FAVEPAY S$52.82
17 Mar 2019 GRAB 4437139-9-024 S$13.00
...
Wow, that looks easy enough. It appears that as long as the data is tabular, copying and pasting keeps it in an almost tabular form. Let’s see how iex interprets this. Save the file as sample and use File.read!/1 to read the binary:
iex(1)> File.read!("sample")
"12 Mar 2019 DELIVEROO.COM.SG S$31.26\n13 Mar 2019 FAVEPAY S$52.82\n17 Mar 2019 GRAB 4437139-9-024 S$13.00"
Nice. The pattern is well behaved: there is clearly just one \n for each new line. After initialising our project with mix new bank_parser, we add:
# In lib/bank_parser.ex
defmodule BankParser do
  # Read the whole file and return one string per line
  def parse(file) do
    file
    |> File.read!()
    |> String.split("\n")
  end
end
# Output
# ["12 Mar 2019 DELIVEROO.COM.SG S$31.26",
# "13 Mar 2019 FAVEPAY S$52.82",
# "17 Mar 2019 GRAB 4437139-9-024 S$13.00"]
String.split/2 is a common function in most languages. Nothing special to be said about it.
Now we have a list of strings to parse.
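One thing to watch for: a pasted file may end with a trailing newline or carry Windows-style \r\n line endings. The sample above shows neither, so treat this as a sketch for a hypothetical messier paste. String.split/3 accepts a list of patterns and a trim: true option that drops empty strings from the result:

```elixir
# Hypothetical input with mixed line endings and a trailing newline.
contents = "12 Mar 2019 DELIVEROO.COM.SG S$31.26\r\n13 Mar 2019 FAVEPAY S$52.82\n"

# Split on either line ending and drop the empty string left at the end.
lines = String.split(contents, ["\r\n", "\n"], trim: true)

IO.inspect(lines)
# => ["12 Mar 2019 DELIVEROO.COM.SG S$31.26", "13 Mar 2019 FAVEPAY S$52.82"]
```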
When I started programming, I didn’t quite understand the difference between strings and binaries. Many other programming languages don’t even seem to make the distinction. With Elixir, however, I’m starting to appreciate how deep the language is, and how careful it is in dealing with strings.
Just try this in the iex console:
iex> 'Hello'
'Hello'
iex> "Hello"
"Hello"
iex> is_binary "Hello"
true
iex> is_binary 'Hello'
false
What? I’ve never encountered this before! In JavaScript, Python and Ruby, you can use single and double quotes more or less interchangeably!
Don’t believe me? Let’s try String.valid?/1 on it:
iex> String.valid? 'Hello'
false
All functions in the String module expect their input to be a binary!
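If you do end up holding a single-quoted value, it is a charlist (a list of integer codepoints), and it can be converted before being handed to String functions. A small sketch, using the ~c sigil, which builds the same value as the single quotes above:

```elixir
# ~c"Hello" builds the same charlist as 'Hello'.
charlist = ~c"Hello"

IO.inspect(is_list(charlist))                      # => true
IO.inspect(charlist == [72, 101, 108, 108, 111])   # => true

# Convert to a binary so the String module can work with it.
string = List.to_string(charlist)

IO.inspect(is_binary(string))      # => true
IO.inspect(String.length(string))  # => 5
```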
A binary is a sequence of bytes. Therefore, by extension, strings are also sequences of bytes. In Elixir you use << and >> to wrap the bytes:
# Each integer is one byte; for ASCII characters it equals the decimal codepoint
iex> <<36>>
"$"
# You can also use ? to get the decimal codepoint
iex> ?a
97
That <> you use to join strings? You are joining byte sequences!
iex> "hello" <> "there"
"hellothere"
iex> <<36>> <> "there"
"$there"
The official Elixir docs give an excellent introduction to binaries and strings! Even if you don’t end up using Elixir, you’ll be more sensitive to the differences in other languages 😄. It’s great background on how string encoding and decoding work at a lower level.
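To make the bytes-versus-characters point concrete, here is a short sketch. It uses é (codepoint 233) as an example I picked; it does not appear in the bank data, which is plain ASCII:

```elixir
# A plain integer inside << >> is one byte (0..255).
dollar = <<36>>
IO.inspect(dollar == "$")            # => true

# Outside ASCII, one character can take several bytes in UTF-8.
e_acute = "\u00E9"
IO.inspect(e_acute == <<195, 169>>)  # => true
IO.inspect(byte_size(e_acute))       # => 2
IO.inspect(String.length(e_acute))   # => 1

# The utf8 modifier encodes a codepoint instead of writing a raw byte.
IO.inspect(<<233::utf8>> == e_acute) # => true
```

So byte counts and character counts only agree while the text stays within ASCII.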
Yes, you can pattern match on binaries! If we look at our lines in the bank transaction:
"13 Mar 2019 FAVEPAY S$52.82"
The first part appears to have a fixed length of 11 characters. We can strip out the date just like this:
iex> <<date::binary-size(11), rest::binary>> = "13 Mar 2019 FAVEPAY S$52.82"
"13 Mar 2019 FAVEPAY S$52.82"
iex> date
"13 Mar 2019"
iex> rest
" FAVEPAY S$52.82"
Sweet! In just 1 line we got what we wanted! date::binary-size(11) tells Elixir that date is a binary of exactly 11 bytes at the start (11 characters here, since the date is plain ASCII), and rest is a binary of any size that matches whatever is left.
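The same kind of pattern also works in a function head, and fixed-size segments can be chained. Here is a minimal sketch that turns the date string into a Date. The DateSketch module, its parse_date/1 function and the month map are names I made up for illustration; they are not part of the parser above:

```elixir
defmodule DateSketch do
  # Hypothetical helper: maps the three-letter month to its number.
  @months %{
    "Jan" => 1, "Feb" => 2, "Mar" => 3, "Apr" => 4, "May" => 5, "Jun" => 6,
    "Jul" => 7, "Aug" => 8, "Sep" => 9, "Oct" => 10, "Nov" => 11, "Dec" => 12
  }

  # Matches exactly "DD Mon YYYY": 2 bytes, a space, 3 bytes, a space, 4 bytes.
  def parse_date(<<day::binary-size(2), " ", mon::binary-size(3), " ", year::binary-size(4)>>) do
    # Map.fetch!/2 raises on an unknown month; fine for a sketch.
    Date.new(String.to_integer(year), Map.fetch!(@months, mon), String.to_integer(day))
  end

  # Anything that does not have that exact shape.
  def parse_date(_other), do: :error
end

IO.inspect(DateSketch.parse_date("13 Mar 2019"))
# => {:ok, ~D[2019-03-13]}
```

This assumes the day is always zero-padded to two digits, which I have not confirmed against the bank's output.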
However, as powerful as Elixir is, there’s still no way to match a string of open size at the head:
iex> <<item::binary, "S$", amount::binary>> = " FAVEPAY S$52.82"
** (CompileError) iex: a binary field without size is only allowed at the end of a binary pattern and never allowed in binary generators
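In case you are confused: inside a pattern, <<>> describes the binary segment by segment, and Elixir needs to know the size of every segment up front to know where the next one starts. Only the last segment may be left without a size. To split on something that sits in the middle of the line, we need a regular expression instead.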
# Define a regex that matches the amount (inside the BankParser module)
@regex_figures ~r/(S\$)(\d{0,}\.\d{0,2})/
# Define a function
def parse_rest(line) do
  case Regex.split(@regex_figures, line, include_captures: true, trim: true) do
    [line, figures] ->
      {String.trim(line), figures, "debit"}
    [line, figures, " cr"] ->
      # Some lines end with `cr`, which indicates a credit
      {String.trim(line), figures, "credit"}
    _rest ->
      {:error}
  end
end
# Test in iex with the string
iex> BankParser.parse_rest(" FAVEPAY S$52.82")
{"FAVEPAY", "S$52.82", "debit"}
Elixir comes with a super helpful Regex module, which has some cool functions like Regex.split/3 that let you keep the matched text in the result (include_captures: true) and even drop empty strings (trim: true). Kudos to the Elixir core team for improving QoL of developers. More information is in the Regex module documentation.
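Putting the two ideas together, a whole line could be parsed in one go. This is only a sketch under my own assumptions, not the code from the repo: the LineSketch module and parse_line/1 are hypothetical names, the regex is a variation that uses named groups with Regex.named_captures/2, and it assumes credits show up as a trailing " cr" after the amount:

```elixir
defmodule LineSketch do
  # Fixed-size date at the head, then a regex for the open-ended remainder.
  def parse_line(<<date::binary-size(11), rest::binary>>) do
    # Named groups: item text, amount after "S$", optional trailing "cr".
    regex = ~r/^(?<item>.*?)\s*S\$(?<amount>\d+\.\d{2})(?<credit>\s+cr)?\s*$/

    case Regex.named_captures(regex, rest) do
      %{"item" => item, "amount" => amount, "credit" => credit} ->
        # An optional group that did not take part in the match comes back as "".
        type = if credit == "", do: "debit", else: "credit"
        {:ok, %{date: date, item: String.trim(item), amount: amount, type: type}}

      nil ->
        :error
    end
  end

  # Lines shorter than 11 bytes cannot contain a date.
  def parse_line(_short), do: :error
end

IO.inspect(LineSketch.parse_line("13 Mar 2019 FAVEPAY S$52.82"))
# => {:ok, %{amount: "52.82", date: "13 Mar 2019", item: "FAVEPAY", type: "debit"}}
```

The amount is kept as a string here; converting it to a number is a separate decision (floats are usually a poor fit for money).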
Parser for DBS Credit Card Transactions
I’ve managed to cover the basic points on parsing strings and binaries! There’s a lot more to read. Once you understand the power of pattern matching and binaries, you can start to do things like parsing .png files and even .mp3 files!