Parsing Text & Binaries with Elixir

Fri, Apr 19, 2019

Elixir has an interesting set of tools that make parsing text a breeze.

Recently I was getting annoyed that my local bank (looking at you, DBS Bank) doesn’t provide .csv or .xlsx files for my credit card transaction history. I can see the transactions in the iBanking platform, but there’s no way to download them (confirmed by the Live Chat CSO).

Then a thought struck me. I’ve heard a lot of good things about Elixir being a good text parsing language. You can even parse .mp3 files and read metadata in just 25 lines of code. Sick!

Copy-paste from the HTML

So let’s see what we get after highlighting the HTML tables and pasting in Sublime Text:

...
12 Mar 2019   DELIVEROO.COM.SG  S$31.26
13 Mar 2019   FAVEPAY   S$52.82
17 Mar 2019   GRAB 4437139-9-024   S$13.00
...

Wow, that looks easy enough. It appears that as long as the data is tabular, copying and pasting keeps it in an almost tabular form. Let’s see how iex interprets this. Save the file as sample and use File.read!/1 to read it as a binary:

iex(1)> File.read!("sample")
"12 Mar 2019   DELIVEROO.COM.SG  S$31.26\n13 Mar 2019   FAVEPAY   S$52.82\n17 Mar 2019   GRAB 4437139-9-024   S$13.00"

Nice. The pattern is well behaved: there is exactly one \n between lines. After initialising our project with mix new bank_parser, we add:

# In lib/bank_parser.ex, inside the BankParser module

def parse(file) do
  file
  |> File.read!()
  |> String.split("\n")
end

# Output
# ["12 Mar 2019   DELIVEROO.COM.SG  S$31.26", 
# "13 Mar 2019   FAVEPAY   S$52.82",
# "17 Mar 2019   GRAB 4437139-9-024   S$13.00"]

String.split/2 is a common function in most languages, so there is nothing special to say about it.

Now we have a list of strings to parse.
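One caveat: splitting on "\n" alone leaves an empty string behind if the file ends with a newline, and a stray \r on every line if the file was saved with Windows line endings. A minimal sketch that guards against both, assuming a hypothetical LineSplitSketch module that is not part of the original project:

```elixir
defmodule LineSplitSketch do
  # Hypothetical helper, not from the original project.
  # Splits on both Windows ("\r\n") and Unix ("\n") line endings and
  # drops the empty strings left by blank lines or a trailing newline.
  def lines(contents) when is_binary(contents) do
    String.split(contents, ["\r\n", "\n"], trim: true)
  end
end

sample = "12 Mar 2019   DELIVEROO.COM.SG  S$31.26\r\n13 Mar 2019   FAVEPAY   S$52.82\n"
IO.inspect(LineSplitSketch.lines(sample))
```

String.split/3 accepts a list of patterns, and trim: true removes empty strings from the result.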

Strings & Binaries? What’s the difference?

When I started programming, I didn’t quite understand the difference between strings and binaries. Other programming languages don’t even seem to make the distinction. In Elixir, however, I’m starting to appreciate how deep the language is and how careful it is in dealing with strings.

Just try this in the iex console:

iex> 'Hello'
'Hello'
iex> "Hello"
"Hello"
iex> is_binary "Hello" 
true
iex> is_binary 'Hello'
false

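What? I’ve never encountered this before! In JavaScript, Python and Ruby, you can use single and double quotes (more or less) interchangeably! In Elixir, single quotes create a charlist, which is a list of integer codepoints, not a binary.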

Don’t believe me? Let’s try String.valid?/1 on it:

iex> String.valid? 'Hello'
false

All functions in the String module require the input to be a binary!
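If a charlist does turn up (for example from an Erlang library), it can be converted explicitly before it reaches the String module. A small sketch, assuming a hypothetical CharlistSketch module; ~c"Hello" is the sigil form of the charlist 'Hello':

```elixir
defmodule CharlistSketch do
  # Hypothetical helper, not from the original project.
  # Returns a binary whether it is given a binary or a charlist.
  def to_binary(value) when is_binary(value), do: value
  def to_binary(value) when is_list(value), do: List.to_string(value)
end

charlist = ~c"Hello"

# A charlist is a plain list of integer codepoints.
IO.inspect(charlist == [72, 101, 108, 108, 111])

# After conversion, String functions work as usual.
IO.inspect(String.upcase(CharlistSketch.to_binary(charlist)))
```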

What is this binary I keep hearing?

A binary is a sequence of bytes. Therefore, by extension, strings are also sequences of bytes. In Elixir you use << and >> to wrap the byte values:

# For ASCII characters, the byte value is the same as the decimal codepoint
iex> <<36>>
"$"

# You can also use ? to get the decimal codepoint
iex> ?a
97

That <> you use to join strings? You are joining byte sequences!

iex> "hello" <> "there"
"hellothere"

iex> <<36>> <> "there"
"$there"

The official Elixir docs give an excellent introduction to binaries and strings! Even if you don’t end up using Elixir, you’ll be more sensitive to the differences in other languages 😄. It’s great background on how strings are encoded and decoded at the byte level.
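To make the "sequence of bytes" idea concrete, here is a small sketch that lists the bytes behind a string. The ByteSketch module is hypothetical; :binary.bin_to_list/1, byte_size/1 and String.length/1 are standard functions:

```elixir
defmodule ByteSketch do
  # Hypothetical helper: returns the raw byte values of a binary.
  def bytes(binary) when is_binary(binary), do: :binary.bin_to_list(binary)
end

dollar = <<36>>
# U+00E9 ("e" with an acute accent), encoded as UTF-8
e_acute = <<233::utf8>>

IO.inspect(ByteSketch.bytes(dollar))
IO.inspect(ByteSketch.bytes(e_acute))

# One character, but two bytes: the byte count and the length differ.
IO.inspect({byte_size(e_acute), String.length(e_acute)})
```

For plain ASCII text such as the bank lines, one character is one byte, which is why the fixed-size patterns later on can count in characters.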

Pattern matching on binaries!

Yes, you can pattern match on binaries! If we look at our lines in the bank transaction:

"13 Mar 2019   FAVEPAY   S$52.82"

The first part, the date, has a fixed length of 11 characters. We can strip it out just like this:

iex> <<date::binary-size(11), rest::binary>> = "13 Mar 2019   FAVEPAY   S$52.82"
"13 Mar 2019   FAVEPAY   S$52.82"
iex> date
"13 Mar 2019"
iex> rest
"   FAVEPAY   S$52.82"

Sweet! In just 1 line we got what we wanted! date::binary-size(11) tells Elixir to bind date to the first 11 bytes of the binary, and rest::binary matches everything after that, whatever its size.
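Taking the same idea one step further, the fixed-width date can itself be broken into day, month and year segments. The sketch below is hypothetical (the DateSketch module and its month table are not from the original project) and assumes every line starts with a two-digit day, a three-letter English month and a four-digit year, as in the sample:

```elixir
defmodule DateSketch do
  # Hypothetical helper, not from the original project.
  @months %{
    "Jan" => 1, "Feb" => 2, "Mar" => 3, "Apr" => 4, "May" => 5, "Jun" => 6,
    "Jul" => 7, "Aug" => 8, "Sep" => 9, "Oct" => 10, "Nov" => 11, "Dec" => 12
  }

  # Every segment before `rest` has a fixed size, so the match is allowed.
  def parse_line(
        <<day::binary-size(2), " ", month::binary-size(3), " ", year::binary-size(4),
          rest::binary>>
      ) do
    with {:ok, m} <- Map.fetch(@months, month),
         {d, ""} <- Integer.parse(day),
         {y, ""} <- Integer.parse(year),
         {:ok, date} <- Date.new(y, m, d) do
      {:ok, date, String.trim_leading(rest)}
    else
      _ -> :error
    end
  end

  # Anything that does not have the expected shape is rejected.
  def parse_line(_other), do: :error
end

IO.inspect(DateSketch.parse_line("13 Mar 2019   FAVEPAY   S$52.82"))
```

Returning a Date struct instead of the raw "13 Mar 2019" string makes the value easier to sort or reformat later.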

However, as powerful as Elixir is, there’s still no way to match a string of open size at the head:

iex> <<item::binary, "S$", amount::binary>> = "   FAVEPAY   S$52.82"
** (CompileError) iex: a binary field without size is only allowed at the end of a binary pattern and never allowed in binary generators
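What does work at the head of a pattern is anything with a known size, including a literal prefix. A minimal sketch, assuming a hypothetical PrefixSketch module and an input that has already been reduced to just the amount:

```elixir
defmodule PrefixSketch do
  # Hypothetical helper, not from the original project.
  # The literal "S$" has a known size (2 bytes), so it may come first;
  # the unsized `amount` is allowed because it is the last segment.
  def strip_currency("S$" <> amount), do: {:ok, amount}
  def strip_currency(_other), do: :error
end

IO.inspect(PrefixSketch.strip_currency("S$52.82"))
IO.inspect(PrefixSketch.strip_currency("52.82"))
```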

In case you are confused: inside <<>>, every segment except the last needs a known size, because the match walks the bytes from left to right and has no way of knowing where an unsized item should stop. For that, we turn to a regular expression.

# Define a regex that captures the currency symbol and the amount
@regex_figures ~r/(S\$)(\d{0,}\.\d{0,2})/

# Define a function
def parse_rest(line) do
  case Regex.split(@regex_figures, line, include_captures: true, trim: true) do
    [line, figures] ->
      {String.trim(line), figures, "debit"}

    # Some lines end with `cr`, which indicates a credit
    [line, figures, " cr"] ->
      {String.trim(line), figures, "credit"}

    _rest ->
      {:error}
  end
end

# Test in iex with the string
iex> BankParser.parse_rest("   FAVEPAY   S$52.82")
{"FAVEPAY", "S$52.82", "debit"}

Elixir comes with a super helpful Regex module, which has some cool functions like Regex.split/3 that let you include the matched text in the result and even drop empty strings from it 😭. Kudos to the Elixir core team for improving developers' quality of life. More information is in the Regex module docs.
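The same line can also be parsed with named capture groups, which avoids depending on the shape of the split list. This is an alternative sketch, not the code from the project repo; the NamedCaptureSketch module, the group names and the assumption that amounts always have two decimal places are mine:

```elixir
defmodule NamedCaptureSketch do
  # Hypothetical alternative to parse_rest/1, not from the original project.
  def parse_rest(line) do
    # item: the merchant text, amount: digits with two decimals,
    # cr: an optional trailing " cr" marker for credits.
    regex = ~r/^\s*(?<item>.+?)\s+S\$(?<amount>\d+\.\d{2})(?<cr>\s+cr)?\s*$/

    case Regex.named_captures(regex, line) do
      # A group that did not take part in the match comes back as "".
      %{"item" => item, "amount" => amount, "cr" => ""} ->
        {:ok, item, amount, "debit"}

      %{"item" => item, "amount" => amount} ->
        {:ok, item, amount, "credit"}

      nil ->
        :error
    end
  end
end

IO.inspect(NamedCaptureSketch.parse_rest("   FAVEPAY   S$52.82"))
IO.inspect(NamedCaptureSketch.parse_rest("   GRAB 4437139-9-024   S$13.00"))
```

Unlike the split-based version, this one returns the amount without the S$ prefix, which is closer to what a .csv column would want.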

Project Repo

Parser for DBS Credit Card Transactions

That’s it!

I’ve managed to cover the basic points on parsing strings and binaries! There’s a lot more to read. Once you understand the power of pattern matching and binaries, you can start to do things like parsing .png files and even .mp3 files!
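As a taste of that kind of binary-file parsing, here is a hedged sketch that reads the width and height from a PNG header. It relies on the PNG layout (an 8-byte signature followed by an IHDR chunk whose data starts with a 32-bit big-endian width and height); the PngSketch module name is hypothetical, and the bytes are built in memory rather than read from a real file:

```elixir
defmodule PngSketch do
  # Hypothetical helper, not from the original project.
  # Matches the PNG signature, skips the 4-byte chunk length, checks the
  # "IHDR" chunk type, then reads width and height as 32-bit integers.
  def dimensions(
        <<137, "PNG", 13, 10, 26, 10, _length::32, "IHDR", width::32, height::32,
          _rest::binary>>
      ),
      do: {:ok, width, height}

  def dimensions(_other), do: :error
end

# A hand-built header for a 640x480 image (no pixel data, no checksum).
header = <<137, "PNG", 13, 10, 26, 10, 13::32, "IHDR", 640::32, 480::32, 8, 6, 0, 0, 0>>
IO.inspect(PngSketch.dimensions(header))
```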

A list of references below:

  1. Elixir - Binaries, strings, and charlists
  2. Elixir Docs - <<>> operator
  3. Playing Together with Elixir Binaries-Strings :)