Mitto v2.9 Sneak Peek - Output to JSON and JSON lines

anon68878319 · May 7, 2021, 7:37pm

Before Mitto v2.9, IO jobs could output data from APIs, databases, and files to relational databases (the typical behavior) or to delimited flat files (e.g. CSV, TSV).

Mitto v2.9 introduces two new file outputs for IO jobs: JSON and JSON lines.

These new outputs enable several new use cases for IO jobs:

Output raw API data directly to JSON or JSON lines. This is especially useful when exploring data from a new API, because it helps you understand the structure of potentially nested data.
Output database data directly to JSON or JSON lines
Convert files (e.g. CSV, TSV, JSON, JSON lines) to JSON or JSON lines
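
For readers new to the two formats, here is a minimal Python sketch of how the same records look as JSON (one array holding every record) and as JSON lines (one object per line). It only illustrates the file formats and is not how Mitto implements its outputs; the sample CSV data and the function names are hypothetical.

```python
import csv
import io
import json


def rows_to_json(rows):
    """Serialize all records as a single JSON array."""
    return json.dumps(rows, indent=2)


def rows_to_jsonl(rows):
    """Serialize records as JSON lines: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"


# Hypothetical sample data standing in for a delimited input file.
csv_text = "name,species\nRex,dog\nTom,cat\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows_to_json(rows))           # one array containing both records
print(rows_to_jsonl(rows), end="")  # two lines, one record each
```

JSON lines is convenient for large or streaming data because each line can be parsed on its own, while a JSON array has to be read as a whole.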

Example Use Case

This example uses Mitto curl jobs to download a .json file and a .jsonl file from a public GitHub repository, pipes that data through Mitto IO jobs, and outputs it as a JSON or JSON lines file.

curl job

Here are the two files we will be downloading with Mitto: zuar_pets.json and zuar_pets.jsonl.

Here are the two curl job configs:

{
  url: https://raw.githubusercontent.com/zuarbase/data/master/zuar_pets.json
  args: [
    -s
    -b
    /tmp/cookies
    -L
    -O
    -f
  ]
}

{
  url: https://raw.githubusercontent.com/zuarbase/data/master/zuar_pets.jsonl
  args: [
    -s
    -b
    /tmp/cookies
    -L
    -O
    -f
  ]
}

End result: Two new files in Mitto’s file manager.
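
To make the args list easier to read, the sketch below assembles the equivalent curl command line from one of the configs. How Mitto actually builds and runs the command is an assumption here, and curl_command is a hypothetical helper. The flags themselves are standard curl options: -s silences progress output, -b reads cookies from the given file, -L follows redirects, -O saves the download under its remote file name, and -f fails on HTTP errors instead of saving the error page.

```python
import shlex


def curl_command(config):
    """Build a curl argv list from a dict with 'url' and 'args' keys.

    Assumption: the args are passed to curl in order, followed by the URL.
    """
    return ["curl", *config.get("args", []), config["url"]]


# The first curl job config from above, written as a Python dict.
job = {
    "url": "https://raw.githubusercontent.com/zuarbase/data/master/zuar_pets.json",
    "args": ["-s", "-b", "/tmp/cookies", "-L", "-O", "-f"],
}

# Print the command as it could be typed in a shell.
print(shlex.join(curl_command(job)))
```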

IO job - JSON input with JSON output

Here’s the IO job that takes the zuar_pets.json file and pipes it through Mitto and outputs it as zuar_pets_tojson.json:

{
  input: {
    use: flatfile.iov2#JsonInput
    source: /var/mitto/data/zuar_pets.json
  }
  output: {
    path: /var/mitto/data/zuar_pets_tojson.json
    use: call:mitto.iov2#tojson
  }
  steps: [
    {
      transforms: [
        {
          use: mitto.iov2.transform#ExtraColumnsTransform
          rename_columns: false
          include_empty_columns: true
          include_nested_json: true
        }
      ]
      use: mitto.iov2.steps#Input
    }
    {
      transforms: [
        {
          use: mitto.iov2.transform#FlattenTransform
        }
      ]
      use: mitto.iov2.steps#Output
    }
  ]
}

Two critical job config pieces here:

The output's use references the new tojson code and the path references the output file Mitto will create.
The ExtraColumnsTransform transform step includes a new parameter include_nested_json: true.

End result - We end up with the exact same JSON file we started with as a new file.
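
One way to check the round trip outside of Mitto is to parse both files and compare the resulting data. This is a small sketch, assuming both files have been copied somewhere a Python script can read them; same_json is a hypothetical helper. It compares parsed content, so differences in whitespace or key order do not count as differences.

```python
import json


def same_json(path_a, path_b):
    """Return True if two JSON files contain the same data once parsed."""
    with open(path_a, encoding="utf-8") as a, open(path_b, encoding="utf-8") as b:
        return json.load(a) == json.load(b)


# Example call, using the paths from the job config above:
# same_json("/var/mitto/data/zuar_pets.json",
#           "/var/mitto/data/zuar_pets_tojson.json")
```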

IO job - JSON lines input with JSON lines output

Here’s the IO job that takes the zuar_pets.jsonl file and pipes it through Mitto and outputs it as zuar_pets_tojsonl.jsonl:

{
  input: {
    use: flatfile.iov2#JsonlInput
    source: /var/mitto/data/zuar_pets.jsonl
  }
  output: {
    path: /var/mitto/data/zuar_pets_tojsonl.jsonl
    use: call:mitto.iov2#tojsonl
  }
  steps: [
    {
      transforms: [
        {
          use: mitto.iov2.transform#ExtraColumnsTransform
          rename_columns: false
          include_empty_columns: true
          include_nested_json: true
        }
      ]
      use: mitto.iov2.steps#Input
    }
    {
      transforms: [
        {
          use: mitto.iov2.transform#FlattenTransform
        }
      ]
      use: mitto.iov2.steps#Output
    }
  ]
}

Differences here:

The input's use is JsonlInput instead of JsonInput.
The output's use is tojsonl instead of tojson.
The output's path ends in .jsonl instead of .json.

End result - We end up with the exact same JSON lines file we started with as a new file.
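
The same kind of check works for JSON lines by parsing each line as its own JSON document. This is a sketch under the same assumption that the files are readable from a Python script; read_jsonl and same_jsonl are hypothetical helpers, and blank lines are skipped.

```python
import json


def read_jsonl(path):
    """Parse a JSON lines file into a list of records, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def same_jsonl(path_a, path_b):
    """Return True if two JSON lines files hold the same records in the same order."""
    return read_jsonl(path_a) == read_jsonl(path_b)


# Example call, using the paths from the job config above:
# same_jsonl("/var/mitto/data/zuar_pets.jsonl",
#            "/var/mitto/data/zuar_pets_tojsonl.jsonl")
```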