We had a Mitto support ticket come through where the Mitto user had a RegEx job piping multiple CSV files into a single database table. The user felt like resulting output database table had results they weren’t expecting.
This is a good example to explain how Mitto IO jobs using the RegEx input handle multiple files and specifically files with mismatched columns and column names.
So let’s take at this use case by digging deeper into these aspects:
- The individual flat files
- The Mitto IO job with a RegEx input
- The output database table
- The Mitto job logs
File #1 - regex_mismatch1.csv
id,Item,Person
1,123,Justin
This file has three columns and one row:
- id
- Item
- Person
File #2 - regex_mismatch2.csv
id,Account Name,Item # ,Person
2,Account A,345,Andy
This file has four columns and one row:
- id
- Account Name
- Item #
- Person
Some important things to note:
-
Account Name
is a column that only exists in file #2. -
Item #
might be an error and should be theItem
column from the first file, but we can’t know for sure. This column also has a trailing space. -
Person
exists in both files, but in this file it is the 4th column, whereas in the first file it is the 3rd column.
Mitto IO job with RegEx input
{
"input": {
"base": "xsv.iov2#XsvInput2",
"delimiter": ",",
"encoding": "UTF-8",
"includes_header": true,
"regex": "^regex_mismatch.+\\.csv$",
"use": "flatfile.iov2#RegexInput"
},
"output": {
"dbo": "postgresql://localhost/analytics",
"schema": "test",
"tablename": "regex_mismatch",
"use": "call:mitto.iov2.db#todb"
},
"steps": [
{
"column": "__mtime__",
"use": "mitto.iov2.steps#MaxTimestamp"
},
{
"use": "mitto.iov2.steps.upsert#SetUpdatedAt"
},
{
"transforms": [
{
"use": "mitto.iov2.transform#ExtraColumnsTransform"
},
{
"use": "mitto.iov2.transform#ColumnsTransform"
}
],
"use": "mitto.iov2.steps#Input"
},
{
"use": "mitto.iov2.steps#CreateTable"
},
{
"use": "mitto.iov2.steps.upsert#CreateTempTable"
},
{
"transforms": [
{
"use": "mitto.iov2.transform#FlattenTransform"
}
],
"use": "mitto.iov2.steps#Output"
},
{
"key": "__source__",
"use": "mitto.iov2.steps.upsert#SyncTempTable"
},
{
"use": "mitto.iov2.steps#CollectMeta"
}
]
}
This Mitto IO job’s RegEx input will pipe all files matching this regular expression: ^regex_mismatch.+\.csv$
:
- regex_mismatch1.csv
- regex_mismatch2.csv
The output is a PostgreSQL database table: test.regex_mismatch
The job is configured for upsert based on the __mtime__
and __source__
columns created by the RegEx input.
If you would like to learn more about all the parts of this job config, take a look at the Mitto RegEx documentation.
Output database table
Here’s the output PostgreSQL database table:
__index__ | id | item | person | __source__ | __mtime__ | account_name | item_number |
---|---|---|---|---|---|---|---|
1 | 1 | 123 | Justin | regex_mismatch1.csv | 2021-03-03 18:48:29.318373 | NULL | NULL |
1 | 2 | NULL | Andy | regex_mismatch2.csv | 2021-03-03 18:48:29.318373 | Account A | 345 |
Some important things to note:
-
__index__
is a new column that wasn’t in either of the source files. Mitto creates this column when piping data from a file. The__index__
denotes the row from the original file. - The original
Item
column from file #1 has been renamed toitem
. Rows from input files that didn’t have the columnItem
(file #2) haveNULL
values. - The original
Person
column from file #1 and file #2 has been renamed toperson
. Notice that the output table has data from both files, even though ordinal position of thePerson
column was different. Mitto took care of this for us. -
__source__
is a new column that wasn’t in either of the source files. Mitto creates this column when using the RegEx input. The__source__
column denotes the original file this data came from. -
__mtime__
is a new column that wasn’t in either of the source files. Mitto creates this column when using the RegEx input. The__mtime__
column denotes the modified time of the original file this data came from. - The original column
Account Name
from file #2 has been renamed toaccount_name
. Mitto has converted the name to snake_case. Rows from input files that didn’t have the columnAccount Name
(file #1) haveNULL
values. - The original column
Item #
from file #2 has been renamed toitem_number
. MItto has converted the name to snake_case, stripped the trailing space, and converted the#
to the wordnumber
. Rows from input files that didn’t have the columnItem #
(file #1) haveNULL values
.
Mitto Job Logs
So what happened?
Let’s take a look at the Mitto job’s logs:
[2021-03-03T12:49:32] INFO [root] job started, pid=4537
[2021-03-03T12:49:32] INFO [root] handling signals
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.MaxTimestamp] started
[2021-03-03T12:49:33] INFO [root] Table test.regex_mismatch does not exist, skipping timestamp
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.upsert.SetUpdatedAt] started
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.transform.Input] started
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] id(self)=139654354855752, jpath=$, ignores=[]
[2021-03-03T12:49:33] INFO [root] ColumnsTransform: table does not exist, using initial SDL
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch1.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __index__ $.['__index__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column id $.['id']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item $.['Item']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column person $.['Person']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __source__ $.['__source__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __mtime__ $.['__mtime__']
[2021-03-03T12:49:33] INFO [root] set column '__index__' to Integer (auto), value '1'
[2021-03-03T12:49:33] INFO [root] updating column 'id' (NULL) to Boolean, value '1'
[2021-03-03T12:49:33] INFO [root] updating column 'item' (NULL) to Integer, value '123'
[2021-03-03T12:49:33] INFO [root] updating column 'person' to String 6, value 'b'Justin''
[2021-03-03T12:49:33] INFO [root] updating column 'person' (NULL) to String, value 'Justin'
[2021-03-03T12:49:33] INFO [root] updating column '__source__' to String 19, value 'b'regex_mismatch1.csv''
[2021-03-03T12:49:33] INFO [root] updating column '__source__' (NULL) to String, value 'regex_mismatch1.csv'
[2021-03-03T12:49:33] INFO [root] updating column '__mtime__' (NULL) to DateTime, value '2021-03-03T18:48:29.318373+00:00'
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch2.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column account_name $.['Account Name']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item_number $.['Item # ']
[2021-03-03T12:49:33] INFO [root] updating column 'id' (Boolean) to Integer, value '2'
[2021-03-03T12:49:33] INFO [root] updating column 'account_name' to String 9, value 'b'Account A''
[2021-03-03T12:49:33] INFO [root] updating column 'account_name' (NULL) to String, value 'Account A'
[2021-03-03T12:49:33] INFO [root] updating column 'item_number' (NULL) to Integer, value '345'
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.CreateTable] started
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.CreateTable] Creating table: test.regex_mismatch
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.upsert.CreateTempTable] started
[2021-03-03T12:49:33] INFO [root] CreateTempTable: main table just created, skipping creating temp table
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.transform.Output] started
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.upsert.SyncTempTable] started
[2021-03-03T12:49:33] INFO [root] SyncTempTable: No temp table found so skipping syncing
[2021-03-03T12:49:33] INFO [mitto.iov2.steps.builtin.CollectMeta] started
[2021-03-03T12:49:34] INFO [mitto.iov2.steps.builtin.CollectMeta] 'test.regex_mismatch' row count: 2
[2021-03-03T12:49:34] INFO [root] {
"__inputter__": "<flatfile.iov2.regex.RegexInput object at 0x7f03d07a7b38>",
"__location__": "None",
"__name__": "regex_regex_mistmatch_csv",
"__origin_table__": "regex_mismatch",
"__outputter__": "functools.partial(<function todb at 0x7f03d035e7b8>)",
"__session__": "<sqlalchemy.orm.scoping.scoped_session object at 0x7f03e8f73518>",
"__starttime__": "2021-03-03T18:49:32.420753+00:00",
"__store__": "None",
"__temporary__": "<mitto.iov2.steps.transform.TemporaryStore object at 0x7f03d0493908>",
"count": 2,
"created": "test.regex_mismatch",
"schema": "test",
"table": "regex_mismatch"
}
[2021-03-03T12:49:34] INFO [root] regex_regex_mistmatch_csv succeeded
Mitto is running through all the steps and logging information throughout the process.
I want to specifically focus on the log lines with the phrases regex
and adding column
:
...
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch1.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __index__ $.['__index__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column id $.['id']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item $.['Item']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column person $.['Person']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __source__ $.['__source__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __mtime__ $.['__mtime__']
...
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch2.csv
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column account_name $.['Account Name']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item_number $.['Item # ']
...
The RegEx input processes all the files that it finds that match the regular expression in the job config.
First, Mitto processed the regex_mismatch1.csv
file:
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch1.csv
As Mitto was processing this file, it came across columns it had not seen before (all of the column at first) and handled them with the ExtraColumnsTransform
step:
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __index__ $.['__index__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column id $.['id']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item $.['Item']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column person $.['Person']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __source__ $.['__source__']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column __mtime__ $.['__mtime__']
You can see the original column name from the input as jpath and the name the output table’s column will be.
For example: Item
column ($.['Item']
) is renamed to item
.
Next, Mitto processed the regex_mismatch2.csv
file:
[2021-03-03T12:49:33] INFO [flatfile.iov2.regex] /var/mitto/data/regex_mismatch2.csv
As Mitto was processing tis file, it came across columns it had not seen before and again handled them with the ExtraColumnsTransform
step:
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column account_name $.['Account Name']
[2021-03-03T12:49:33] INFO [mitto.iov2.transform.extra_columns.ExtraColumnsTransform] adding column item_number $.['Item # ']
This is how the output table ends up with the account_name
and item_number
columns.
End Result
In general, Mitto automatically processes multiple flat files and combines them into a single database table using an IO job with a RegEx input and database output.
In this particular case, the input files’ column names were not consistent, so the user cleaned up the source files, dropped the output table, and reran the Mitto job.