Mitto - Single and Multiple Flat File Use Cases

This post consolidates ideas and best practices around common flat file use cases.

Single Flat File into Single Database Table

How is the flat file named, and what happens when it is updated?

  • Scenario #1 - Rows are added/updated and the file is overwritten with the same name every time. This is the simplest case.
    • data.csv - File is created on 2021-01-01. 100 total rows in the file.
    • data.csv - File is updated on 2021-01-02. 105 total rows in the file. 5 new rows added and 1 row updated.

Which Mitto job should I use?

What’s the file extension?

  • Any delimited flat file (CSV, TSV, etc.) - Use an IO job with an XSV input
  • Excel file - Use an IO job with an Excel input
  • JSON / JSON Lines file - Use an IO job with a JSON or JSON Lines input
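
For intuition, here’s a minimal sketch (in Python with pandas, not Mitto itself) of how these input types map onto different parsers. The function name and the extension lists are assumptions for illustration:

```python
from pathlib import Path

import pandas as pd

# Illustrative only -- this is not how Mitto dispatches internally. It just
# shows which parser each input type corresponds to. The function name and
# the extension lists are assumptions for the sake of the example.
def read_flat_file(path: str) -> pd.DataFrame:
    suffix = Path(path).suffix.lower()
    if suffix in {".csv", ".tsv"}:
        # XSV input: any delimited text file; sep=None makes pandas
        # sniff the delimiter instead of assuming a comma.
        return pd.read_csv(path, sep=None, engine="python")
    if suffix in {".xls", ".xlsx"}:
        # Excel input
        return pd.read_excel(path)
    if suffix in {".json", ".jsonl"}:
        # JSON / JSON Lines input
        return pd.read_json(path, lines=(suffix == ".jsonl"))
    raise ValueError(f"unsupported file type: {suffix}")
```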

What happens to the data?

On the first run of the Mitto job, the entire flat file’s data is loaded into a database table.

One new column is added to the output database table:

  • __index__ - This is the index of each row in the source file.

On subsequent Mitto job runs, the database table’s data is truncated and the entire flat file’s data is loaded into the database table.
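
This amounts to a truncate-and-reload. Below is a minimal sketch of that logic using SQLite and Python’s standard csv module; it illustrates the behavior described above, not Mitto’s actual implementation. The file path and table name are hypothetical:

```python
import csv
import sqlite3

# A minimal sketch of the truncate-and-reload behavior described above.
# This illustrates the behavior only; it is not Mitto's implementation.
def reload_flat_file(conn: sqlite3.Connection, path: str, table: str) -> None:
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in range(len(header) + 1))
        # First run: create the table with the extra __index__ column.
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" (__index__ INTEGER, {cols})'
        )
        # Subsequent runs: truncate, then load the whole file again, so the
        # table always mirrors the current contents of the file.
        conn.execute(f'DELETE FROM "{table}"')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            ((i, *row) for i, row in enumerate(reader)),
        )
    conn.commit()

# Hypothetical database file, source file, and table name.
reload_flat_file(sqlite3.connect("analytics.db"), "data.csv", "data")
```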

Multiple Flat Files into Single Database Table

  • Scenario #1 - Single source of data, but spread across multiple files (by some segmentation or date criteria). Rows are added/updated; depending on the segmentation, the updated rows either land in a file with a new name (the date-based example) or overwrite an existing file in place (the customer-based example).
    • data_2021_01_01.csv - File is created on 2021-01-01. 100 total rows in the file.
    • data_2021_01_02.csv - File is created on 2021-01-02. 105 total rows in the file. 5 new rows added and 1 row updated.
    • customer_a.csv - File is specific to customer A. It is an example of the Single Flat File Scenario #1.
    • customer_b.csv - File is specific to customer B, but contains the same columns of data as Customer A’s version of the file. It is an example of the Single Flat File Scenario #1.
  • Scenario #2 - Single source of data, but spread across multiple files (by some segmentation or date criteria). New file is created with new/changed rows.
    • data_2021_01_01.csv - File is created on 2021-01-01. 100 total rows in the file.
    • data_2021_01_02.csv - File is created on 2021-01-02. 6 rows in the file: 5 new rows and 1 existing row that has changed.

Which Mitto job should I use?

Use an IO job with a RegEx input.
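
For example, a pattern along these lines would match the date-segmented files in the scenarios above; the exact pattern depends on your file naming convention:

```python
import re

# A hypothetical pattern; adjust it to your actual naming convention.
pattern = re.compile(r"data_\d{4}_\d{2}_\d{2}\.csv")

assert pattern.fullmatch("data_2021_01_02.csv")
assert not pattern.fullmatch("customer_a.csv")
```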

What happens to the data?

On the first run of the Mitto job, for each flat file that matches the RegEx input’s regex pattern, the entire file’s data is loaded into a single database table.

Three new columns are added to the output database table:

  • __index__ - This is the index of each row in the source file.
  • __source__ - This is the source file the rows originate from.
  • __mtime__ - This is the modified time of the source file.

On subsequent Mitto job runs, only files that have been modified since the last job run are processed: specifically, each flat file that matches the RegEx input’s regex pattern and whose modified time is greater than or equal to the maximum __mtime__ in the output database table. For each file meeting these criteria, all of its existing rows (identified by __source__) are deleted from the output database table, and the entire file’s data is reloaded.
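
Putting it together, here’s a minimal sketch of that incremental delete-and-reload logic, again using Python and SQLite as a stand-in for Mitto’s actual implementation. The directory, regex pattern, and table name are hypothetical:

```python
import csv
import os
import re
import sqlite3

# A minimal sketch of the incremental behavior described above. It
# illustrates the logic only; it is not Mitto's implementation.
def load_matching_files(conn, directory, pattern, table):
    # Find the high-water mark from the previous run (0 on the first run,
    # when the table does not exist yet).
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    max_mtime = 0.0
    if exists:
        row = conn.execute(f'SELECT MAX(__mtime__) FROM "{table}"').fetchone()
        max_mtime = row[0] or 0.0

    for name in sorted(os.listdir(directory)):
        if not re.fullmatch(pattern, name):
            continue  # file does not match the regex pattern
        path = os.path.join(directory, name)
        mtime = os.path.getmtime(path)
        if mtime < max_mtime:
            continue  # not modified since the last run; skip it
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            # First run: create the table with the three extra columns.
            conn.execute(
                f'CREATE TABLE IF NOT EXISTS "{table}" '
                f'(__index__ INTEGER, __source__ TEXT, __mtime__ REAL, {cols})'
            )
            # Delete-and-reload: drop this file's old rows by __source__,
            # then load the entire file again.
            conn.execute(f'DELETE FROM "{table}" WHERE __source__ = ?', (name,))
            placeholders = ", ".join("?" for _ in range(len(header) + 3))
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})',
                ((i, name, mtime, *r) for i, r in enumerate(reader)),
            )
    conn.commit()

# Hypothetical database file, directory, pattern, and table name.
conn = sqlite3.connect("analytics.db")
load_matching_files(conn, "/var/mitto/data", r"data_\d{4}_\d{2}_\d{2}\.csv", "data")
```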