Mule 4: Processing Multibyte Characters in Fixed-Width Flat Files

A custom solution for handling multibyte characters in fixed-width flat files in MuleSoft.

Ravneet Bhardwaj

Mar. 19, 21 · Tutorial

Introduction

DataWeave can process several different types of data formats, such as flat files, copybooks, and fixed-width files. For most of these types, you can import a schema that describes the input structure in order to have access to valuable metadata at design time.

Problem Statement

These schemas currently only work with certain single-byte character encodings. In UTF-8, however, text in many languages contains multibyte characters, such as Japanese characters or the accented letters of Spanish and French, unlike plain English (ASCII) text, where 1 character = 1 byte.

MuleSoft doesn't have the capability to handle multibyte characters in this case. While processing, the Mule runtime treats each character in the fixed-width flat file as 1 (one) byte.

Thus, if any multibyte character is present in the flat file, the file cannot be parsed correctly.

Solution Brief

The solution pads each multibyte character in the flat file with spaces so that it occupies as many character positions as it has bytes, thus ensuring correct parsing when the file is processed in MuleSoft.

Optional conversion back to the original structure while exporting the file to external systems.

Prerequisites

Anypoint Studio 7.6, Mule Runtime 4.3.0, Knowledge of Flat File Schemas.

Problem

For the purpose of demonstration, I am using the Flat File Definition (.ffd) as given below:

YAML

form: FIXEDWIDTH
id: 'flatfile'
name: 'flatfile'
values:
- { name: 'Id', usage: M, type: String, length: 2 }
- { name: 'FirstName', usage: M, type: String, length: 10 }
- { name: 'LastName', usage: M, type: String, length: 10 }
- { name: 'City', usage: M, type: String, length: 10 }
- { name: 'State', usage: M, type: String, length: 10 }
- { name: 'Country', usage: M, type: String, length: 10 }

As per the schema definition, FirstName has a length of 10 (ten). When an ERP system creates a fixed-width flat file, each field is sized in bytes, not in characters. For example, each of the Japanese characters below takes 3 (three) bytes in UTF-8. It occupies only one character position in the file, but the ERP system counts it as 3 positions.

Plain Text

日 -> 3 bytes

本 -> 3 bytes

語 -> 3 bytes

日本語 -> 9 bytes

ABC -> 3 bytes (English Alphabets 1 character = 1 byte)
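To make the byte counts above concrete, here is a minimal Java sketch that reports the character count and the UTF-8 byte count for the same strings. The names ByteCountSketch and utf8Length are my own for illustration and are not part of the utility shown later. The \u escapes stand for 日, 本, and 語 so that the source file stays ASCII-only.

```java
import java.nio.charset.StandardCharsets;

class ByteCountSketch {

    // Returns the number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // \u65E5, \u672C and \u8A9E are the three Japanese characters from the list above.
        String[] samples = {"\u65E5", "\u672C", "\u8A9E", "\u65E5\u672C\u8A9E", "ABC"};
        for (String s : samples) {
            System.out.println(s + " -> " + s.length() + " character(s), "
                    + utf8Length(s) + " byte(s)");
        }
    }
}
```

Each Japanese character counts as one character but three bytes, while ABC is three characters and three bytes.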

The process flow below is used for processing the fixed-width flat file:

The fixed-width flat file to be processed is as below:

Plain Text

1 Ravneet   Bhardwaj  Gurugram  Haryana   India

2 日本語 Bhardwaj  Gurugram  Haryana   India

Here the first line contains only single-byte characters, so MuleSoft handles it without any problems. In the second line, however, FirstName contains Japanese characters. These occupy only 3 character positions but are actually 9 bytes, so the field is followed by just one padding space to reach 10 bytes. As per the FFD schema, FirstName should be of size 10 (ten), so MuleSoft throws an exception while processing this line.

Since MuleSoft treats every character as a single byte, it reports that the expected size is not met for a particular field. In this case, it is FirstName.
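The mismatch can be reproduced outside Mule with a small Java sketch. It only illustrates the counting problem and is not a description of how the Mule runtime is implemented: sliceByChars and sliceByBytes are hypothetical helpers, and the offset 2 and length 10 come from the Id and FirstName lengths in the FFD above. The \u escapes stand for 日本語.

```java
import java.nio.charset.StandardCharsets;

class SliceSketch {

    // Second record of the sample file; the escapes are the 3-character Japanese first name.
    static final String LINE =
            "2 \u65E5\u672C\u8A9E Bhardwaj  Gurugram  Haryana   India";

    // Slices by character positions: every character is counted as one column.
    static String sliceByChars(String line, int offset, int length) {
        int end = Math.min(offset + length, line.length());
        return line.substring(offset, end);
    }

    // Slices by UTF-8 byte positions: every byte is counted as one column.
    static String sliceByBytes(String line, int offset, int length) {
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        int safeLength = Math.min(length, bytes.length - offset);
        return new String(bytes, offset, safeLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // FirstName starts after the 2-column Id field and is 10 columns wide.
        System.out.println("by characters: [" + sliceByChars(LINE, 2, 10) + "]");
        System.out.println("by bytes:      [" + sliceByBytes(LINE, 2, 10) + "]");
    }
}
```

Slicing by characters pulls the first six letters of the last name into FirstName, while slicing by bytes returns the three Japanese characters plus one padding space, which is the layout the file was written with.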

Solution

A custom Java utility reads the flat file line by line and inspects each character. As soon as a multibyte character is encountered, the utility appends extra spaces right after it.

The number of spaces added is dynamic: it is the byte size of the character minus one, so a 3-byte character is followed by 2 spaces.

MyUtils.java

Java

package com.ravneet.utility;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class MyUtils {

    // Preprocessor: pads every multibyte character with spaces so that the
    // number of character positions it occupies equals its byte length.
    public static String convertflatfile(Object input) {
        StringBuilder builder = new StringBuilder();
        // UTF-8 is assumed here, matching the byte counts discussed above.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader((InputStream) input, StandardCharsets.UTF_8))) {
            String fileOutput;
            while ((fileOutput = bufferedReader.readLine()) != null) {
                String[] arr = fileOutput.split("");
                for (int i = 0; i < arr.length; i++) {
                    builder.append(arr[i]);
                    int bytelength = arr[i].getBytes(StandardCharsets.UTF_8).length;
                    // Append one space for every byte beyond the first.
                    while (bytelength > 1) {
                        builder.append(" ");
                        --bytelength;
                    }
                }
                builder.append("\n");
            }
        } catch (IOException e) {
            // As in the original, the error is only logged and the text read so far is returned.
            e.printStackTrace();
        }
        return builder.toString();
    }

    // Post-processor: removes the spaces that convertflatfile added after
    // each multibyte character.
    public static String removingSpace(String input) {
        String[] arr = input.split("");
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < arr.length; i++) {
            builder.append(arr[i]);
            int bytelength = arr[i].getBytes(StandardCharsets.UTF_8).length;
            // Skip one following position for every byte beyond the first.
            while (bytelength > 1) {
                --bytelength;
                i++;
            }
        }
        return builder.toString();
    }
}

The above Java utility has two functions: convertflatfile and removingSpace.

convertflatfile: This is a preprocessor that reads the flat file and pads each multibyte character with spaces, so that every character position corresponds to one byte.
removingSpace: This is a post-processor that removes the extra spaces added by the preprocessor.
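As a rough illustration of what the two functions do to a single field, the sketch below pads and then unpads the FirstName value from the sample file. It is not the utility itself: pad, unpad, and utf8Bytes are hypothetical names, the code walks the string by code point instead of using split(""), and it assumes UTF-8 and that each code point should occupy as many character positions as it has bytes. The \u escapes stand for 日本語.

```java
import java.nio.charset.StandardCharsets;

class PaddingSketch {

    // Number of UTF-8 bytes needed to encode one code point.
    static int utf8Bytes(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    // Follows every multibyte code point with (byte length - 1) spaces.
    static String pad(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(cp);
            for (int k = 1; k < utf8Bytes(cp); k++) {
                out.append(' ');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Reverses pad(): keeps each code point and skips the spaces added after it.
    static String unpad(String padded) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < padded.length(); ) {
            int cp = padded.codePointAt(i);
            out.appendCodePoint(cp);
            // Move past the character itself and its padding spaces.
            i += Character.charCount(cp) + utf8Bytes(cp) - 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // FirstName as it appears in the sample file: 3 characters plus 1 space (10 bytes).
        String firstName = "\u65E5\u672C\u8A9E ";
        String padded = pad(firstName);
        System.out.println("[" + padded + "] length=" + padded.length());
        System.out.println("round trip ok: " + unpad(padded).equals(firstName));
    }
}
```

The padded value is 10 characters long (each Japanese character followed by two spaces, plus the original trailing space), which matches the FirstName length in the FFD. Note that unpad only works on text that still contains the padding exactly as pad produced it.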

Below is the modified process flow:

While reading the file, the below property should be selected:

The Set Payload component will contain the FFD schema details.

In the post-processing, the spaces are removed:

Output:

Conclusion

The above solution enables MuleSoft to process and parse flat files with multibyte characters. This is an interim solution until MuleSoft provides out-of-the-box support for the problem.


URL: https://dzone.com/articles/mule-4-processing-multibyte-characters-in-fixed-wi