Mitto IO job, Regex, and mismatched files

We had a Mitto support ticket come through where the Mitto user had a RegEx job piping multiple CSV files into a single database table. The user felt like resulting output database table had results they weren’t expecting.

This is a good example to explain how Mitto IO jobs using the RegEx input handle multiple files and specifically files with mismatched columns and column names.

So let’s take at this use case by digging deeper into these aspects:

  • The individual flat files
  • The Mitto IO job with a RegEx input
  • The output database table
  • The Mitto job logs

File #1 - regex_mismatch1.csv

id,Item,Person
1,123,Justin

This file has three columns and one row:

  • id
  • Item
  • Person

File #2 - regex_mismatch2.csv

id,Account Name,Item # ,Person
2,Account A,345,Andy

This file has four columns and one row:

  • id
  • Account Name
  • Item #
  • Person

Some important things to note:

  • Account Name is a column that only exists in file #2.
  • Item # might be an error and should be the Item column from the first file, but we can’t know for sure. This column also has a trailing space.
  • Person exists in both files, but in this file it is the 4th column, whereas in the first file it is the 3rd column.

Mitto IO job with RegEx input

{
    "input": {
        "base": "xsv.iov2#XsvInput2",
        "delimiter": ",",
        "encoding": "UTF-8",
        "includes_header": true,
        "regex": "^regex_mismatch.+\\.csv$",
        "use": "flatfile.iov2#RegexInput"
    },
    "output": {
        "dbo": "postgresql://localhost/analytics",
        "schema": "test",
        "tablename": "regex_mismatch",
        "use": "call:mitto.iov2.db#todb"
    },
    "steps": [
        {
            "column": "__mtime__",
            "use": "mitto.iov2.steps#MaxTimestamp"
        },
        {
            "use": "mitto.iov2.steps.upsert#SetUpdatedAt"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "use": "mitto.iov2.steps.upsert#CreateTempTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "key": "__source__",
            "use": "mitto.iov2.steps.upsert#SyncTempTable"
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]
}

This Mitto IO job’s RegEx input will pipe all files matching this regular expression: ^regex_mismatch.+\.csv$:

  • regex_mismatch1.csv
  • regex_mismatch2.csv

The output is a PostgreSQL database table: test.regex_mismatch

The job is configured for upsert based on the __mtime__ and __source__ columns created by the RegEx input.

If you would like to learn more about all the parts of this job config, take a look at the Mitto RegEx documentation.

Output database table

Here’s the output PostgreSQL database table:

__index__ id item person __source__ __mtime__ account_name item_number
1 1 123 Justin regex_mismatch1.csv 2021-03-03 18:48:29.318373 NULL NULL
1 2 NULL Andy regex_mismatch2.csv 2021-03-03 18:48:29.318373 Account A 345

Some important things to note:

  • __index__ is a new column that wasn’t in either of the source files. Mitto creates this column when piping data from a file. The __index__ denotes the row from the original file.
  • The original Item column from file #1 has been renamed to item. Rows from input files that didn’t have the column Item (file #2) have NULL values.
  • The original Person column from file #1 and file #2 has been renamed to person. Notice that the output table has data from both files, even though ordinal position of the Person column was different. Mitto took care of this for us.
  • __source__ is a new column that wasn’t in either of the source files. Mitto creates this column when using the RegEx input. The __source__ column denotes the original file this data came from.
  • __mtime__ is a new column that wasn’t in either of the source files. Mitto creates this column when using the RegEx input. The __mtime__ column denotes the modified time of the original file this data came from.
  • The original column Account Name from file #2 has been renamed to account_name. Mitto has converted the name to snake_case. Rows from input files that didn’t have the column Account Name (file #1) have NULL values.
  • The original column Item # from file #2 has been renamed to item_number. MItto has converted the name to snake_case, stripped the trailing space, and converted the # to the word number. Rows from input files that didn’t have the column Item # (file #1) have NULL values.

Mitto Job Logs

So what happened?

Let’s take a look at the Mitto job’s logs:

[2021-03-03T12:49:32] INFO [root] job started, pid=4537
[2021-03-03T12:49:32] INFO [root] handling signals
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.MaxTimestamp] started
[2021-03-03T12:49:33] INFO [root] Table test.regex_mismatch does not exist, skipping timestamp
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.upsert.SetUpdatedAt] started
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.transform.Input] started
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] id(self)=139654354855752, jpath=$, ignores=[]
[2021-03-03T12:49:33] INFO [root] ColumnsTransform: table does not exist, using initial SDL
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch1.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __index__ $.['__index__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column id $.['id']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item $.['Item']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column person $.['Person']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __source__ $.['__source__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __mtime__ $.['__mtime__']
[2021-03-03T12:49:33] INFO [root] set column '__index__' to Integer (auto), value '1'
[2021-03-03T12:49:33] INFO [root] updating column 'id' (NULL) to Boolean, value '1'
[2021-03-03T12:49:33] INFO [root] updating column 'item' (NULL) to Integer, value '123'
[2021-03-03T12:49:33] INFO [root] updating column 'person' to String 6, value 'b'Justin''
[2021-03-03T12:49:33] INFO [root] updating column 'person' (NULL) to String, value 'Justin'
[2021-03-03T12:49:33] INFO [root] updating column '__source__' to String 19, value 'b'regex_mismatch1.csv''
[2021-03-03T12:49:33] INFO [root] updating column '__source__' (NULL) to String, value 'regex_mismatch1.csv'
[2021-03-03T12:49:33] INFO [root] updating column '__mtime__' (NULL) to DateTime, value '2021-03-03T18:48:29.318373+00:00'
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch2.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column account_name $.['Account Name']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item_number $.['Item # ']
[2021-03-03T12:49:33] INFO [root] updating column 'id' (Boolean) to Integer, value '2'
[2021-03-03T12:49:33] INFO [root] updating column 'account_name' to String 9, value 'b'Account A''
[2021-03-03T12:49:33] INFO [root] updating column 'account_name' (NULL) to String, value 'Account A'
[2021-03-03T12:49:33] INFO [root] updating column 'item_number' (NULL) to Integer, value '345'
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.CreateTable] started
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.CreateTable] Creating table: test.regex_mismatch
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.upsert.CreateTempTable] started
[2021-03-03T12:49:33] INFO [root] CreateTempTable: main table just created, skipping creating temp table
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.transform.Output] started
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.upsert.SyncTempTable] started
[2021-03-03T12:49:33] INFO [root] SyncTempTable: No temp table found so skipping syncing
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.CollectMeta] started
[2021-03-03T12:49:34] INFO [mitto.iov2.steps.builtin.CollectMeta] 'test.regex_mismatch' row count: 2
[2021-03-03T12:49:34] INFO [root] {
    "__inputter__": "<flatfile.iov2.regex.RegexInput object at 0x7f03d07a7b38>",
    "__location__": "None",
    "__name__": "regex_regex_mistmatch_csv",
    "__origin_table__": "regex_mismatch",
    "__outputter__": "functools.partial(<function todb at 0x7f03d035e7b8>)",
    "__session__": "<sqlalchemy.orm.scoping.scoped_session object at 0x7f03e8f73518>",
    "__starttime__": "2021-03-03T18:49:32.420753+00:00",
    "__store__": "None",
    "__temporary__": "<mitto.iov2.steps.transform.TemporaryStore object at 0x7f03d0493908>",
    "count": 2,
    "created": "test.regex_mismatch",
    "schema": "test",
    "table": "regex_mismatch"
}
[2021-03-03T12:49:34] INFO [root] regex_regex_mistmatch_csv succeeded

Mitto is running through all the steps and logging information throughout the process.

I want to specifically focus on the log lines with the phrases regex and adding column:

...
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch1.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __index__ $.['__index__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column id $.['id']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item $.['Item']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column person $.['Person']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __source__ $.['__source__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __mtime__ $.['__mtime__']
...
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch2.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column account_name $.['Account Name']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item_number $.['Item # ']
...

The RegEx input processes all the files that it finds that match the regular expression in the job config.

First, Mitto processed the regex_mismatch1.csv file:

[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch1.csv

As Mitto was processing this file, it came across columns it had not seen before (all of the column at first) and handled them with the ExtraColumnsTransform step:

[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __index__ $.['__index__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column id $.['id']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item $.['Item']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column person $.['Person']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __source__ $.['__source__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __mtime__ $.['__mtime__']

You can see the original column name from the input as jpath and the name the output table’s column will be.

For example: Item column ($.['Item']) is renamed to item.

Next, Mitto processed the regex_mismatch2.csv file:

[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch2.csv

As Mitto was processing tis file, it came across columns it had not seen before and again handled them with the ExtraColumnsTransform step:

[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column account_name $.['Account Name']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item_number $.['Item # ']

This is how the output table ends up with the account_name and item_number columns.

End Result

In general, Mitto automatically processes multiple flat files and combines them into a single database table using an IO job with a RegEx input and database output.

In this particular case, the input files’ column names were not consistent, so the user cleaned up the source files, dropped the output table, and reran the Mitto job.

1 Like

@Andy created a Python tool to investigate RegEx files: Mitto REGEX job - Debugging with python