Taking a Look At Minifi

Apache NiFi Minifi has been out for a while now, but it just came across my radar and I was intrigued. I have done a lot of work with Apache NiFi itself. I enjoy it as a platform for getting about 95% of the way to a solution for many problems.

For the uninitiated, Apache NiFi is a tool for data flow that supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

NiFi excels at getting data from point A to point B through any number of technologies and protocols. Data is acted upon through logical units called Processors, and a slew of Processors out of the box can make developing a data flow trivial. What's the catch, you ask? It can be a resource hog: it is a massive platform that at its core is a wrapper around file I/O. The vast majority of my NiFi experience is with deployments that predate the 1.x release series, so this may have changed.

Enter Minifi. It seeks to solve the problem of having the power and flexibility of NiFi itself, but without a large footprint and closer to the data source:

MiNiFi – a subproject of Apache NiFi – is a complementary data collection approach that supplements the core tenets of NiFi in dataflow management, focusing on the collection of data at the source of its creation.

At first glance it looks to be competing in the space of embedded systems or lightweight programs like Filebeat. In this article we are going to set up a Minifi flow and have it talk to a NiFi instance.

You will need:

  • Apache NiFi 1.6.0
  • Apache NiFi Minifi 0.4.0 (We’ll be using the Java agent)

To demonstrate, we’ll collect some data with Minifi and have it route to NiFi for upload to Elasticsearch.

The tricky part of coming up with a non-trivial, relevant example is putting Minifi in an environment where you might not put NiFi itself.

For this, I have created a minimal CentOS 7 VM: 3 GB HDD and 256 MB of RAM. This certainly does not emulate a true embedded system, but it will provide constraints otherwise not present in a powerful server or desktop setup.

[Screenshot: the minimal CentOS 7 VM]

Unlike NiFi, Minifi's config file is YAML, and there is no GUI to "auto generate" it the way NiFi's canvas generates the XML flow file.

The guidance from the project is to create a flow in NiFi, export the template, and then use a conversion tool to produce the config.yml for Minifi. That seems overly convoluted, but after a conversion I can see why they suggest it: a simple GetFile Processor routing to a Remote Process Group generated a 105-line config YAML file.

Connections:
- id: e63b2765-fb5a-39b3-0000-000000000000
  name: GetDataFile/success/32d4c68b-0164-1000-ce63-b3cc7e1b1bb9
  source id: a9b20b2d-dc82-3506-0000-000000000000
  source relationship names:
  - success
  destination id: 32d4c68b-0164-1000-ce63-b3cc7e1b1bb9
  max work queue size: 10000
  max work queue data size: 1 GB
  flowfile expiration: 0 sec
  queue prioritizer class: ''

What I found more surprising is that they kept the same naming conventions from the flow.xml file:

<connections>
  <id>e63b2765-fb5a-39b3-0000-000000000000</id>
  <parentGroupId>a192196c-295e-3a96-0000-000000000000</parentGroupId>
  <destination>
    <groupId>71ea7c85-9108-32b0-0000-000000000000</groupId>
    <id>e792bad3-c8c3-32db-9032-8745ecb34678</id>
    <type>REMOTE_INPUT_PORT</type>
  </destination>
  <flowFileExpiration>0 sec</flowFileExpiration>
  <labelIndex>1</labelIndex>
  <name/>
  <selectedRelationships>success</selectedRelationships>
  <source>
    <groupId>a192196c-295e-3a96-0000-000000000000</groupId>
    <id>a9b20b2d-dc82-3506-0000-000000000000</id>
    <type>PROCESSOR</type>
  </source>
</connections>
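The toolkit's conversion amounts to a mapping between these two shapes. As a minimal sketch (not the actual toolkit code), here is how one flow.xml <connections> element could be mapped onto the Minifi YAML keys, using the element names shown above:

```python
import xml.etree.ElementTree as ET

# A single <connections> element as found in a NiFi flow.xml (trimmed).
CONNECTION_XML = """
<connections>
  <id>e63b2765-fb5a-39b3-0000-000000000000</id>
  <source><id>a9b20b2d-dc82-3506-0000-000000000000</id><type>PROCESSOR</type></source>
  <destination><id>32d4c68b-0164-1000-ce63-b3cc7e1b1bb9</id><type>REMOTE_INPUT_PORT</type></destination>
  <selectedRelationships>success</selectedRelationships>
  <flowFileExpiration>0 sec</flowFileExpiration>
</connections>
"""

def connection_to_minifi(xml_text):
    """Map one flow.xml <connections> element onto the Minifi YAML key names."""
    conn = ET.fromstring(xml_text)
    return {
        "id": conn.findtext("id"),
        "source id": conn.findtext("source/id"),
        "source relationship names": [r.text for r in conn.iter("selectedRelationships")],
        "destination id": conn.findtext("destination/id"),
        "flowfile expiration": conn.findtext("flowFileExpiration"),
    }

print(connection_to_minifi(CONNECTION_XML))
```

The real converter handles the whole graph, defaults, and the remaining keys (queue sizes, prioritizers, and so on), which is exactly why doing this by hand is unappealing.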

Generating UUIDs for Processors and Connections and lacing them up by hand is a path to madness. Thankfully, creating a data flow template and converting it to YAML was relatively painless.

Here is the process group we created the template from and converted to Minifi:

[Screenshot: the Minifi process group we created the template from]

Here is the NiFi data flow set up:

[Screenshot: the NiFi data flow]

I copied the generated config.yml to the VM and ran minifi.sh start. Some minor tweaking was needed: I had forgotten to set the site-to-site properties on the regular NiFi instance. That information also has to be reflected in a new Properties section in Minifi's config.yml.
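For reference, the site-to-site settings on the NiFi side live in nifi.properties. The values below reflect this particular setup (the raw-socket port matching the Port in the Minifi config), not the defaults:

```properties
# Site-to-site settings in nifi.properties (values for this setup, not defaults)
nifi.remote.input.host=192.168.1.193
nifi.remote.input.secure=false
nifi.remote.input.socket.port=8081
```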

Remote Process Groups:
- id: 71ea7c85-9108-32b0-0000-000000000000
  name: ''
  url: http://192.168.1.193:8080/nifi
  Input Ports:
  - id: 32d4c68b-0164-1000-ce63-b3cc7e1b1bb9
    name: From Minifi
    comment: ''
    max concurrent tasks: 1
    use compression: false
    Properties:
     Port: 8081
     Host Name: localhost
  Output Ports: []

 

[Screenshot: Minifi connected to the NiFi instance]

Great, we now have connectivity between the Minifi instance and our home NiFi! One note: on startup, Minifi puts the literal classpath of every jar on the command line, so the resulting process command line is massive. Be forewarned if grepping for it.

Resource usage looks good:

[Screenshot: resource usage on the VM]

Minifi comes with some command line tools to check flow status; let's look at our GetFile processor.

The system admin guide for Minifi originally suggested this call:

./bin/minifi.sh flowStatus processor:GetDataFile:health,stats,bulletins

'GetDataFile' is the name we gave the processor. For whatever reason (it still eludes me as of this writing), that did not work. The guide also suggested using the UUID of the processor instead. I was surprised that wasn't the default guidance, since processor names are not required to be unique, at least in NiFi. So, trying again:

./bin/minifi.sh flowStatus processor:a9b20b2d-dc82-3506-0000-000000000000:all
{"controllerServiceStatusList":null,"processorStatusList":[{"name":"GetDataFile","processorHealth":null,"processorStats":null,"bulletinList":null}],
"connectionStatusList":null,"remoteProcessGroupStatusList":null,"instanceStatus":null,
"systemDiagnosticsStatus":null,"reportingTaskStatusList":null,"errorsGeneratingReport":[]}
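The status report is at least plain JSON, so it is easy to pick apart programmatically. A quick sketch parsing the output above with the standard library:

```python
import json

# The flowStatus output returned above.
report = json.loads(
    '{"controllerServiceStatusList":null,'
    '"processorStatusList":[{"name":"GetDataFile","processorHealth":null,'
    '"processorStats":null,"bulletinList":null}],'
    '"connectionStatusList":null,"remoteProcessGroupStatusList":null,'
    '"instanceStatus":null,"systemDiagnosticsStatus":null,'
    '"reportingTaskStatusList":null,"errorsGeneratingReport":[]}'
)

# No errors were reported, yet every per-processor field is null.
for proc in report["processorStatusList"] or []:
    print(proc["name"], proc["processorHealth"], proc["processorStats"])
```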

Uh, I'm not sure whether something went wrong here or not; even querying "all" instead of the individual options resulted in the above output. Either the tooling needs a little help or the issue lies between the computer and the chair; it is unclear at the moment.

In any event, we have connectivity, so let’s do something a little more interesting. In addition to writing out the content to disk, we’ll do a simple post through an existing NiFi processor into Elasticsearch.

We'll drop some JSON files into the data_source directory that look something like this:

{"name": "dataFromMinifi1", "date": "20180624", "message": "This is a payload"}
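To keep the GetFile processor fed, a few of these payloads can be generated with a short script. A sketch (the write_payloads helper and payloadN.json file names are my own; data_source is the directory Minifi is watching):

```python
import json
import os
from datetime import date

def write_payloads(directory, count=3):
    """Drop numbered JSON payload files into the watched directory."""
    os.makedirs(directory, exist_ok=True)
    for i in range(1, count + 1):
        payload = {
            "name": f"dataFromMinifi{i}",
            "date": date.today().strftime("%Y%m%d"),
            "message": "This is a payload",
        }
        path = os.path.join(directory, f"payload{i}.json")
        with open(path, "w") as f:
            json.dump(payload, f)

write_payloads("data_source")
```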

Data transmission and upload looks good in NiFi:

[Screenshot: data transmission and upload in NiFi]

Checking Elastic:

[Screenshot: the documents indexed in Elasticsearch]

I definitely think there is a place for Minifi, especially on low-resource systems. As of this writing, though, interacting with it leaves a lot to be desired, namely in developer friendliness. YAML was an exceptionally interesting choice given the language's particulars regarding whitespace and formatting, and given that NiFi's original flow graph is XML. Having worked on a large system that used YAML to stitch components together, with no tooling to support that endeavor, I know how frustrating it can be.

Happy coding!

Knowles Atchison, Jr.

Senior Software Engineer, Synergist Computing

 
