Sean Cohen, director of product management at Red Hat, tells EdgeIR exclusively that the world is currently seeing edge use cases and devices in almost every vertical.
“This includes automotive with software-defined vehicles, industrial (manufacturing, mining), retail (store, shop floor, POS, kiosks), service providers (5G RAN, telco cloud, cable, MEC), energy and utilities (smart grids), IoT (smart cities) and government (defense),” says Cohen.
“We see devices being deployed in all of these locations and even at the very far edge in a remote desert, at sea or even in space. With such a broad range of use cases, we also see a huge variety of applications. Some examples include fraud detection, facial recognition, user experience/surveys, build tooling, chemical/deposit evaluation, field rescue operations, radio/networking, in-vehicle and many more. And with these applications, observability, control, security and lifecycle management are as crucial as the very application that’s running.”
Visual data collection
Observability at the edge is about gaining insights into the performance and behavior of these distributed systems in real time. It involves collecting and analyzing data from diverse sources at the edge of the network to provide a comprehensive view of the entire ecosystem.
When asked what type of monitoring tools and platforms should typically be used to collect and visualize data, Cohen says the landscape of possibilities can be overwhelming and Red Hat’s mission is to make it easier for its customers to manage their applications while keeping their options open.
“It’s very important to understand that one particular solution does not fill every need, and we put a lot of effort into providing choice and added value. That’s where different planes start to be important. We see observability in stages – from the very first data collection at far edge locations until we reach the final data analysis and remediation, this data needs to be filtered, transformed, shipped and stored in many different but optimal ways,” he adds.
“So adopting standards in the first collection phase is crucial, and OpenTelemetry is unveiling itself as the best choice. We are helping and co-creating with our customers in this new and exciting path. Then, to store data, scalable solutions like Tempo and Loki, together with Prometheus and Thanos, cover a wide variety of needs. Data can then be visualized in many ways: from our own out-of-the-box OpenShift console to other tailored solutions like building custom Grafana dashboards, or other third-party partners like Dynatrace or Splunk. It really comes down to particular needs, and we are happy to adapt and help our users.”
To ensure real-time monitoring and alerting for critical edge applications, Cohen adds that enterprises should deploy a decentralized monitoring approach. This involves placing monitoring agents directly on edge devices or nearby edge servers, reducing latency in data collection and alert generation.
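The decentralized approach Cohen describes can be sketched in a few lines of Python. This is an illustrative toy, not a real agent: the `EdgeAgent` and `AlertRule` names, metrics and thresholds are all assumptions, chosen only to show why evaluating alert rules on the device itself avoids a round trip to a central server.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertRule:
    """A threshold check evaluated locally on the edge device."""
    name: str
    metric: str
    predicate: Callable[[float], bool]

class EdgeAgent:
    """Minimal on-device agent: checks readings against local alert
    rules, so an alert can fire even if the uplink to a central
    monitoring server is slow or down."""
    def __init__(self, rules):
        self.rules = rules
        self.alerts = []

    def observe(self, metric: str, value: float) -> None:
        for rule in self.rules:
            if rule.metric == metric and rule.predicate(value):
                self.alerts.append((rule.name, metric, value))

# Hypothetical rules for a small industrial gateway
agent = EdgeAgent([
    AlertRule("cpu_hot", "cpu_temp_c", lambda v: v > 85.0),
    AlertRule("disk_full", "disk_used_pct", lambda v: v > 90.0),
])
agent.observe("cpu_temp_c", 91.5)    # fires the "cpu_hot" alert
agent.observe("disk_used_pct", 40.0) # below threshold, no alert
```

In a real deployment the `alerts` list would feed a local notifier or an exporter, but the decision to fire is made at the edge.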
“In order to not get overwhelmed by massive amounts of data from many devices, it’s important to understand where there are critical pain points and take an iterative approach. It all comes down to a mix of instrumenting your own applications where it matters, together with the underlying device/platform data,” he continues.
“But there’s no silver bullet here. We recommend booking time and resources to know your system and implement metrics, logs, traces and alarms that are relevant to you. Also, ask yourself which data is being used, and which can be deprecated once in a while. Being prepared for time periods in which devices can be disconnected is also critical for reliable data.”
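Preparing for disconnected periods usually means some form of store-and-forward. The sketch below, a simplified assumption rather than any particular product's mechanism, buffers samples in a bounded local queue while offline and drains it in order once the uplink returns, so a long outage cannot exhaust device storage:

```python
from collections import deque

class StoreAndForward:
    """Buffers telemetry while the uplink is down and flushes it,
    oldest first, on reconnect. The bounded deque silently drops the
    oldest samples if the outage outlasts the buffer capacity."""
    def __init__(self, max_buffered: int = 1000):
        self.buffer = deque(maxlen=max_buffered)
        self.sent = []          # stand-in for the remote backend
        self.online = False

    def record(self, sample: dict) -> None:
        if self.online:
            self.sent.append(sample)   # ship immediately
        else:
            self.buffer.append(sample) # cache locally

    def reconnect(self) -> None:
        self.online = True
        while self.buffer:             # drain the backlog in order
            self.sent.append(self.buffer.popleft())

pipe = StoreAndForward()
pipe.record({"cpu": 0.7})   # offline: buffered on the device
pipe.record({"cpu": 0.9})
pipe.reconnect()            # backlog flushed oldest-first
pipe.record({"cpu": 0.4})   # online: sent directly
```

The deliberate design choice here is the bounded buffer: on a constrained device, dropping the oldest data is usually preferable to filling the disk.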
A clear view into edge operations
A clear view into edge operations requires the comprehensive understanding and real-time insights gained from monitoring, analyzing, and managing the various components and activities within an edge computing environment.
Cohen notes that ideally, individuals should have a unified monitoring dashboard that aggregates data from all edge locations.
“From the beginning, you need pattern-based deployments, whether via GitOps or automation tooling. This requires consistent data collection, configuration, deployment and reporting standards that span all devices,” he continues.
“Again, a curated and maintained collection of observability signals is crucial to success in edge operations. In this scenario, the small and bigger pictures both matter. Thus, both a general fleet view with a smart alerting system and the ability to enable low-level debugging in specific devices are important.”
Best practices and challenges
The decentralized nature of edge computing introduces new challenges for securing systems. Observability tools can play a crucial role in detecting and responding to security incidents, providing visibility into anomalous behavior and potential threats.
“Best practices include implementing a layered monitoring approach that covers hardware, network and application levels. Regularly updating and patching edge devices, and conducting simulations of common failure scenarios also form crucial aspects of a robust monitoring strategy,” says Cohen.
“Many different devices with workloads on top of them are going to produce data, and you will want to observe it all. Because of that, it’s very important to use a standard such as OpenTelemetry. With a common language, organizations can aggregate data together and take action.
“From an operational point of view, you also want to cache data in the device’s local storage to ride out networking outages, batch signals to avoid network storms by reducing the messaging rate, and even filter out nearly all successful data that is not going to help after all. All of this (and more) can be done with an OpenTelemetry collector.”
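As a rough illustration of the techniques Cohen lists, a collector configuration along these lines combines the `batch` processor, a `filter` processor to drop spans for successful requests, and the `file_storage` extension backing the exporter queue. The endpoint, paths, thresholds and attribute name are placeholder assumptions; consult the collector documentation for the exact components available in your distribution.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

extensions:
  file_storage:                # persist the export queue to local disk
    directory: /var/lib/otelcol/storage

processors:
  batch:                       # batch signals to reduce the messaging rate
    timeout: 30s
    send_batch_size: 512
  filter/drop-ok:              # drop spans for successful HTTP requests
    traces:
      span:
        - 'attributes["http.status_code"] == 200'

exporters:
  otlp:
    endpoint: central-collector.example.com:4317  # placeholder endpoint
    sending_queue:
      storage: file_storage    # queue survives restarts and outages
    retry_on_failure:
      enabled: true

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop-ok, batch]
      exporters: [otlp]
```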
Regarding the aspects getting in the way of observability at the edge, Cohen explains that major challenges include network instability, limited computing resources on edge devices and the sheer volume and diversity of data generated.
“Additionally, ensuring security and compliance across distributed networks is a challenge. And the current stack of observability offerings is heavy compared to device edge constraints. On top of that, edge computing is a concept that covers many areas, from very small devices close to each other, to others that can have high computational capabilities but are very far away. This makes it hard to create a solution that covers observability at the edge,” he continues.
“Edge devices are designed for low cost and low power consumption and generally have a narrow enclosure. The device design can lack flexibility. Devices might have very few hardware controllers that are specifically used to perform a job function and may have only one software controller that is capable of loading new workload OS images.
“The nature of the edge device is itself constrained by packaging – a drone, a drilling head, a surgical camera. Edge devices often have limited processing power, memory, and storage, which constrains the ability to run full-fledged monitoring tools. This necessitates lightweight monitoring solutions that can operate efficiently within these resource constraints.”
Failure scenarios with edge computing and solutions
Edge computing, while offering numerous advantages, is not immune to challenges and failure scenarios. Cohen pinpoints network connectivity loss, hardware malfunctions due to harsh operating environments, and security breaches as some of the common failure scenarios of edge computing. He says these failures can lead to data loss, compromised device functionality, and delayed decision-making processes.
“Small devices can easily run out of resources, which is a pretty normal failure case. Disconnections and latency issues are always something to look after. These cases are very relevant at the edge, so it’s important to design observability around this,” says Cohen.
“Then, you have typical failure scenarios like hardware problems (e.g. failed SD card), software problems (bugs) and human/operator errors. And of course Murphy’s law strikes back, e.g. a power outage during an upgrade, a natural disaster, etc.
“We provide solutions at the operating system level to allow for more resilient deployments. We provide a transactional upgrade technology which first stages the update on the disk, then reboots into the new version and, if that turns out to be broken, automatically falls back and reboots into the previous version.”
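The rollback behavior Cohen describes follows a familiar A/B-slot pattern. The sketch below is a toy state machine to illustrate the idea only, not Red Hat's actual implementation: stage the new version in the inactive slot, boot into it, and keep the known-good slot active if the health check fails.

```python
class ABUpdater:
    """Illustrative A/B-slot transactional upgrade: the update is
    staged in the inactive slot, and the device commits to it only
    if the new version passes a post-boot health check."""
    def __init__(self, current: str):
        self.slots = {"a": current, "b": None}
        self.active = "a"

    def stage(self, version: str) -> None:
        spare = "b" if self.active == "a" else "a"
        self.slots[spare] = version     # write update to the inactive slot

    def reboot(self, healthy) -> str:
        spare = "b" if self.active == "a" else "a"
        candidate = self.slots[spare]
        if candidate and healthy(candidate):
            self.active = spare         # health check passed: commit
        # otherwise remain on (fall back to) the known-good slot
        return self.slots[self.active]

dev = ABUpdater("v1.0")
dev.stage("v1.1-broken")
dev.reboot(lambda v: "broken" not in v)  # check fails: still on v1.0
dev.stage("v1.1-fixed")
dev.reboot(lambda v: "broken" not in v)  # check passes: now on v1.1-fixed
```

The key property is that a bad update never destroys the working system; the previous version stays on disk until a replacement proves itself healthy.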
He notes that another approach is simple and fast zero-touch provisioning of the device, with no manual intervention needed.
“This makes replacing a failed device really easy: simply remove the broken one, replace with a new one, connect power and network and turn it on. The device will then connect to the management plane and get the required software and configuration. The open standard behind this is FIDO Device Onboarding,” adds Cohen.
“We are co-creating solutions that can cover a high number of customers with these problems. Our observability stack is evolving hand-in-hand with our customers’ software and use cases, so we can decide and understand what fits best. We are finding out that OpenTelemetry, together with a solid central point to store, process, analyze and visualize the data, is a game changer in this area that clears the path to observe these complex systems.”
The company also recently launched Red Hat Device Edge, a platform designed for resource-constrained environments which require small form factor compute at the device edge, including Internet of Things (IoT) gateways, industrial controllers, smart displays, point of sales terminals, vending machines, and robots.
Red Hat is working with partners and customers including ABB, DSO National Laboratories, Dynatrace, Guise AI, Intel, Lockheed Martin and more to deploy, test and validate that the Red Hat Device Edge platform can extend operational consistency across edge and hybrid cloud environments.