When collecting data from devices, there are two fundamental choices – are we going to poll them (pull) periodically, or are the devices going to send the data to us with no prompting (push)?
As is the way of the world – there is no simple answer, and hybrid solutions are often the best choice.
Polling is great when the device you are trying to collect from has a lot of different types of data, and you may be interested in a small subset of it. An oversimplistic example is the phone switch. The switch is tracking thousands of simultaneous phone calls, but all we need is a periodic reading of the number of available lines. Asking the switch for that value alone saves a lot of work and bandwidth over having the switch push all of its data periodically.
There are a few disadvantages to polling, however. The data is not real time: its granularity is determined by the polling interval. Shortening the interval requires the monitored device to do more and more work, so polling at scale can consume a lot of resources on the polled device. The number of metrics we collect is also a factor, since each of these metrics is maintained in a database and, in the worst case (which we have to plan for), each read is a random access. Finally, the device has no say in when it provides data.
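To make the granularity trade-off concrete, here is a minimal sketch in Python. The `poll` helper and the `free_lines` readings are invented for illustration; a real poller would issue a network request (an SNMP GET, for example) at each step.

```python
def poll(readings, interval):
    """Return the (tick, value) samples a poller sees at a fixed interval."""
    return [(t, readings[t]) for t in range(0, len(readings), interval)]

# Hypothetical per-second count of free lines on a switch.
# The dip at ticks 3 and 4 is short-lived.
free_lines = [40, 40, 39, 2, 3, 38, 40, 40, 40, 40]

coarse = poll(free_lines, interval=5)  # 2 requests hit the device
fine = poll(free_lines, interval=1)    # 10 requests hit the device

print(coarse)                      # [(0, 40), (5, 38)]
print(min(v for _, v in coarse))   # 38: the dip was never observed
print(min(v for _, v in fine))     # 2: seen, at five times the request load
```

The coarse poller costs the device very little but never sees the dip. The fine poller catches it, and the device answers five times as many requests to make that possible.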
Push is great when the device wants to report a fairly small subset of data at high rates and in real time. To use our phone switch example – it would be great if, when a call terminates, the switch sent us a small packet with all the quality and utilization information in real time. The small subset of fields is always the same. The disadvantage here, of course, is that the subset of fields that gets sent (even if configurable) is somewhat limited. The moment we try to export a great variety of data fields – the model begins to break down.
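A push from the switch could look roughly like the sketch below. Everything here is an assumption made for illustration: the `CallRecord` fields, the `Switch` class and the JSON encoding are not taken from any real switch or protocol, and the `send` callable stands in for whatever transport (a UDP socket, say) would carry the packet.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CallRecord:
    # A fixed, small set of fields: the same ones for every call.
    call_id: int
    duration_s: float
    jitter_ms: float
    packet_loss_pct: float

class Switch:
    def __init__(self, send):
        self._send = send  # transport callable, e.g. a socket's send method

    def end_call(self, record):
        # The device decides when to report: the moment the call ends.
        self._send(json.dumps(asdict(record)).encode("utf-8"))

received = []  # stands in for the collector on the other end
switch = Switch(send=received.append)
switch.end_call(CallRecord(call_id=1, duration_s=93.5, jitter_ms=4.2, packet_loss_pct=0.1))

print(json.loads(received[0]))
```

The record is compact and arrives as soon as the event happens, but the collector only ever gets these four fields. Anything else on the switch stays out of reach unless the record format changes.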
In traditional IT systems, the two best-known examples are SNMP for pull and NetFlow for push. Syslog is also a great example of push – where events are sent as they occur – but it’s not traditionally used for metrics.
In IoT land, things, especially in the field, are usually small and extremely purpose-built, and also need to be extremely battery and bandwidth efficient. This makes the push model ideal for monitoring sensors and small devices that want to manage their power consumption and wake cycles. It’s also beneficial when data needs to be sent right away and cannot wait for the next poll.
The data quickly leaves the devices and makes its way to the management station where it is stored. There is a lot to be said about asynchronous communication here, and multiplexing and processing pipelines, but that story will be told another time.
Once it has hit the management station a few milliseconds after being born, the data from the IoT device has two journeys to take. A fairly small subset of applications can use the data right away – for example, to update a status screen, generate an alert or update a map position. These applications can take advantage of a Publisher-Subscriber (PubSub) architecture where they get notified by the management station when new data arrives.
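The PubSub idea can be sketched with a tiny in-memory broker. The `Broker` class, topic names and message shapes below are hypothetical; a production system would typically use a message broker rather than in-process callbacks.

```python
class Broker:
    """Minimal in-memory publish/subscribe hub."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Notify every subscriber of this topic as soon as data arrives.
        for callback in self._subscribers.get(topic, []):
            callback(message)

broker = Broker()
alerts, positions = [], []

def alert_on_high_temperature(message):
    if message["value"] > 80:
        alerts.append(message)

broker.subscribe("temperature", alert_on_high_temperature)
broker.subscribe("gps", positions.append)

broker.publish("temperature", {"device": "sensor-1", "value": 72})
broker.publish("temperature", {"device": "sensor-1", "value": 91})
broker.publish("gps", {"device": "truck-7", "lat": 45.5, "lon": -73.6})

print(alerts)     # only the 91-degree reading triggered an alert
print(positions)  # the map application received the position update
```

Neither application ever asks for data. Each one registers its interest once and is called when something relevant shows up.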
The majority of the data, however, gets stored in the management station waiting to be... polled. Well, we don't really call it polling at that point. We usually 'query' for the data, and query is the more appropriate term. “Poll” usually implies retrieving only the most current value, whereas “query” usually retrieves multiple data points and some historical context. Being able to pull this historical context in near-real time and enrich it is what inherently makes most IoT applications valuable. Real-time status is great, but understanding behavior is the true value-add.
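The difference between the two retrieval styles can be shown with a small time-series store. `SeriesStore` and its methods are illustrative names only, and the sketch assumes samples are appended in timestamp order.

```python
from bisect import bisect_left, bisect_right

class SeriesStore:
    """Toy time-series store; assumes samples arrive in timestamp order."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, ts, value):
        self._times.append(ts)
        self._values.append(value)

    def latest(self):
        # "Poll": only the most current value.
        return (self._times[-1], self._values[-1])

    def query(self, start, end):
        # "Query": every sample with start <= ts <= end.
        lo = bisect_left(self._times, start)
        hi = bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))

store = SeriesStore()
for ts, temp in [(0, 20.0), (10, 20.5), (20, 21.0), (30, 24.0)]:
    store.append(ts, temp)

print(store.latest())       # (30, 24.0)
print(store.query(10, 20))  # [(10, 20.5), (20, 21.0)]
```

`latest()` answers "what is the temperature now?", while `query()` returns the history needed to see how it got there.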
To sum it up, both polling and pushing are, and will continue to be, used in the IoT industry. Depending on the particulars of the application, either or both methods may be used to collect the data. But in the case of most IoT applications, push is the technology of choice between the sensor and, at least, the gateway. This is dictated by processing limitations at the edge, power management concerns, and bandwidth availability and management.
It goes without saying that the management layer must support both modes seamlessly. The data collected, once processed, should be normalized, and the end users should not care how it was collected. Making sure that the massive volume of data collected (regardless of how) is available for real-time query via API or other mechanisms (MQs, for example) is essential if the system is to provide the kind of open architecture that a truly transformative IoT application requires.
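One way to picture normalization is a pair of adapters that map polled and pushed inputs onto one record shape. The field names, the compact pushed payload and both function names are assumptions chosen for this sketch, not an established schema.

```python
def normalize_polled(device, metric, raw_value, ts):
    """Adapter for a value the collector fetched itself."""
    return {"device": device, "metric": metric, "value": float(raw_value), "ts": ts}

def normalize_pushed(payload):
    """Adapter for a compact payload a device sent on its own (hypothetical keys)."""
    return {
        "device": payload["id"],
        "metric": payload["m"],
        "value": float(payload["v"]),
        "ts": payload["t"],
    }

polled = normalize_polled("switch-1", "free_lines", "38", ts=1700000000)
pushed = normalize_pushed({"id": "sensor-9", "m": "temperature", "v": 21.5, "t": 1700000003})

# Downstream code sees one shape and cannot tell how each record was collected.
print(polled)
print(pushed)
```

Once both paths produce the same record, storage, queries and subscriptions can be written once and serve either collection mode.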