Data Center Energy Management

My first UX project: a data center energy management platform I helped design from scratch at HCL, with no design system and, for most of it, no access to the people who would actually use it.

Role: UI/UX Designer
Year: 2023-2024
Client: HCLSoftware (in-house)
Tools: Figma

The DCEM dashboard: a row of energy KPIs, a world map of data centers, a data-center health table, and energy trend, distribution, and savings charts.

Overview

DCEM stands for Data Center Energy Management. HCL's telecom side ran a few of its own data centers and wanted one place to see how much energy they were using, where the problems were, and where they could save. So the product is built for data center operators, the people who look after the racks, servers and VMs every day. It started as an in-house tool for HCL's own data centers.

Starting with almost nothing

This was my first real UX project, and we started with almost nothing. No design system, no component library, just a color palette and an icon set we were told to stay inside. So the UX, the UI and the components all got built at the same time. I would sketch a screen, work out what it needed, and build that component in the same pass.

Two things made it harder. The domain was new to me, so I had to learn what racks, VMs, power profiles and energy KPIs even were before I could design for them. And for most of the project I never spoke to a real operator. Requirements came from management, and I leaned on a persona we wrote ourselves. We only met an actual user late, after the demos had started.

Designing for a user I couldn't reach

Since I couldn't talk to a real operator, I built one to design against. Jerry runs data center operations and has spent fifteen years in telecom. His job is keeping the data centers performing, planning capacity, catching problems early and protecting the equipment. Whenever I wasn't sure a screen earned its place, I checked it against him.

User persona for Jerry, an enterprise data-center operator: head of data-center operations, 45, 15 years in telecom, with a list of key responsibilities from ensuring uptime to capacity planning. — The data center operator persona we wrote to stand in for a user I couldn't reach.

Discovery: inspiration and sketches

Before designing anything, I looked at how other energy and monitoring dashboards handle this kind of data: which KPIs they show, what units they use, and which chart they reach for and when. That turned into an inspiration board. From there I sketched rough versions of the dashboard and the module layouts before moving to high fidelity.

An inspiration board collecting screenshots of energy and sustainability dashboards, showing PUE gauges, consumption-by-sector charts, and savings visualizations. — An inspiration board of energy dashboards, used to work out which KPIs and chart types fit this data.

Low-fidelity sketches and wireframes of the DCEM dashboard and observability screens, exploring KPI placement, map, charts, and per-site summaries. — Rough sketches of the dashboard and module layouts, before any high-fidelity work.

Building the system in parallel

Because nothing existed yet, every screen meant building new components. Not just inputs and dropdowns, but the harder ones: KPI cards, gauges, heat maps, mixed bar and line charts, maps, and a lot of dense tables. Some were involved, like the PDU and SNMP setup form, which had to map KPIs to OIDs and validate things like IP addresses inline. I documented each one as I built it so the rest of the team could reuse it instead of starting over.

A board of DCEM components: VM-to-host ratio line charts with inference panels, the data-center overview map and table, energy distribution bars, alarm summaries, KPI configuration, and a managed-nodes table. — From KPI cards and tables to inference panels and the data center map.

A board of chart components built for DCEM: rack heatmaps for power, temperature and space, plus grouped bar charts for power, temperature, U-space and weight mapping. — Chart components: the rack heat maps and mapping bar charts, built to suit the data each one carries.

The PDU configuration form component in multiple states, with fields for management interface type, SNMP version, community string, port and IP address, KPI-to-OID mapping, and inline validation for an incorrect IP. — One of the involved ones: the PDU and SNMP setup form, with KPI to OID mapping and inline validation.
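
The inline checks on that form could look something like the sketch below. It is an illustration, not the code DCEM shipped: the function names, the error messages and the sample OID are all made up here, and the OID check only tests the dotted-decimal shape, not whether the device actually exposes it.

```python
import ipaddress
import re
from typing import Dict, Optional

# Dotted-decimal OID shape, e.g. 1.3.6.1.4.1.99999.1.1 (sample value is invented).
OID_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)+")


def validate_ip(value: str) -> Optional[str]:
    """Return an inline error message, or None when the address parses."""
    try:
        # ip_address accepts both IPv4 and IPv6 and raises ValueError otherwise.
        ipaddress.ip_address(value.strip())
    except ValueError:
        return "Enter a valid IP address"
    return None


def validate_kpi_mapping(mapping: Dict[str, str]) -> Dict[str, str]:
    """Return {kpi_name: error} for every KPI whose OID is not dotted-decimal."""
    return {
        kpi: "Enter a valid OID"
        for kpi, oid in mapping.items()
        if not OID_PATTERN.fullmatch(oid.strip())
    }
```

Returning a message per field, rather than a single pass or fail, is what lets the form flag the incorrect IP next to its own input while the rest of the form stays usable.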

The dashboard I owned

The dashboard was mine to lead, and it is the page I iterated on the most. The world map shows every data center at once. I zoomed out to world level on purpose so all the sites sit together, even the ones that are close to each other. The Energy Trend chart is the one I cared about most: it lays projected power over actual power, so you can see consumption drop once servers move onto a Dynamic Power profile. Energy Distribution splits usage across the three profiles: dynamic power saving, balanced and high performance. The savings chart shows daily savings as bars and cumulative savings as a line on the same view. Data center health, with its alarms and power utilization, sits next to the map.

The DCEM dashboard in full: energy KPIs across the top, a world map of data centers beside a data-center health table, and Energy Trend, Energy Distribution and Adjusted Energy Savings charts below. — The main dashboard. A global view, the savings story and health, on the first screen an operator lands on.
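
The two series behind the savings chart can be sketched in a few lines. This assumes a day's saving is simply projected energy minus actual energy, which is my reading of the chart rather than a documented DCEM formula; the function name and the sample numbers are invented for illustration.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def savings_series(
    projected: Sequence[float], actual: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (daily savings for the bars, running total for the line)."""
    if len(projected) != len(actual):
        raise ValueError("projected and actual must cover the same days")
    # Assumption: saving for a day = projected consumption - actual consumption.
    daily = [p - a for p, a in zip(projected, actual)]
    cumulative = list(accumulate(daily))
    return daily, cumulative


# Invented sample values, one entry per day.
daily, cumulative = savings_series([120, 118, 121], [110, 100, 101])
print(daily)       # [10, 18, 20]
print(cumulative)  # [10, 28, 48]
```

Bars suit the daily values because each day stands alone, while the line suits the running total because it only makes sense as a trend.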

Reports

Reports is really a set of reports behind one page: Utilization Overview, Utilization Trend, DCEM KPIs, ESG, Infra Utilization and Anomaly Overview. Each card opens its own detailed view.

Inside a report

The reports go deep: make and model power insights, the data centers running at low utilization, a scatter of CPU usage against power, rack temperature mapping, and ESG numbers like power density and the VM to host ratio. You can filter them and download them. These pages are tall, so I kept them in a carousel here instead of stacking them into one long scroll.

Utilization Overview report: make-and-model power insights, data centers with low power utilization, a make-and-model power trend line chart, rack temperature mapping, and a CPU-bucket versus average-power scatter.

A continuation of the Utilization Overview report with further utilization and power-consumption breakdowns.

The Utilization Trend report, charting utilization and power-consumption trends over time.

The ESG report showing Data Center Power Density, IT Equipment Energy Utilization for servers, and VM-to-host ratio as bar and line charts over a year.

Utilization Overview: power by make and model, the low-utilization data centers, and rack temperature mapping.


Using color to mean something

We couldn't invent new colors, so the few we had needed to carry meaning. The same rack grid shows up for three different metrics, and each one gets a color scale chosen so the thing that matters stands out.

A rack temperature heatmap grid, racks by rows, colored cool green for under 39 degrees, amber for 40 to 49, and red for 50 and above.

Step 01 / 03
Temperature runs cool green to amber to red, so a hot rack is impossible to miss.
Step 02 / 03
Power consumption uses one light to dark ramp. One number, one color, getting deeper as it climbs.
Step 03 / 03
Weight uses a neutral scale. Full racks read green and lighter ones red, so an operator can see where there is room to add.
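
The first two scales reduce to two small rules, sketched here. The temperature cut-offs come from the heatmap legend described above; that legend leaves readings between 39 and 40 unassigned, so this sketch assumes everything below 40 is green. The function names and the five-step ramp are hypothetical.

```python
def temperature_band(degrees: float) -> str:
    """Map a rack temperature to its band: green, amber or red."""
    if degrees >= 50:
        return "red"
    if degrees >= 40:
        return "amber"
    # Assumption: anything below 40 counts as cool.
    return "green"


def ramp_step(value: float, low: float, high: float, steps: int = 5) -> int:
    """Map a power reading to a step on a single-hue ramp (0 = lightest)."""
    if high <= low:
        return 0
    fraction = (value - low) / (high - low)
    # Clamp so readings outside the range still land on the ramp.
    return max(0, min(steps - 1, int(fraction * steps)))


print(temperature_band(45))    # amber
print(ramp_step(50, 0, 100))   # 2
```

Temperature gets named bands because the operator needs a verdict, while power gets a continuous ramp because the operator is comparing magnitudes.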


Site Explorer

Site Explorer is a set of cards, one per data center. It covers our own sites and external ones too, like Amazon and Google, because the idea was that outside data centers could run on this as well. Each card shows where the site is, down to latitude and longitude, with its energy, memory and CPU use. Open one and you get the detail: category-wise energy, how much each server uses against its CPU, and consumption broken down by region.

Step 01 / 02
Cards for internal and external data centers, each with location and live utilization.
Step 02 / 02
Opening a site: category-wise energy, server energy against CPU, and consumption by region.


Site Provisioning

Provisioning is where the infrastructure itself lives, from the site down through buildings, floors, rows, racks and managed nodes. There were 245 sites in all. You can add a new site or bulk upload them. Open one, say Bangalore, and every building, floor and node underneath it is laid out in tables.

The Bangalore Data Centre detail with location and provisioned time, a Buildings table, and a Floors table listing floor plans, rows, racks and nodes per floor. — Inside a site: its buildings, floors and managed nodes, each one editable.
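
The hierarchy behind those tables can be pictured as nested records, with the per-floor counts rolled up from below. This is a sketch of the shape only: the class and field names are hypothetical, and the real data model was not part of the design work.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Rack:
    nodes: List[str] = field(default_factory=list)  # managed node names


@dataclass
class Row:
    racks: List[Rack] = field(default_factory=list)


@dataclass
class Floor:
    name: str
    rows: List[Row] = field(default_factory=list)


def floor_summary(floor: Floor) -> Tuple[int, int, int]:
    """Return (rows, racks, nodes) for one line of the Floors table."""
    racks = [rack for row in floor.rows for rack in row.racks]
    nodes = sum(len(rack.nodes) for rack in racks)
    return len(floor.rows), len(racks), nodes


# Invented sample floor: two rows, three racks, six nodes.
floor = Floor(
    name="Floor 1",
    rows=[
        Row(racks=[Rack(nodes=["n1", "n2", "n3"]), Rack(nodes=["n4"])]),
        Row(racks=[Rack(nodes=["n5", "n6"])]),
    ],
)
print(floor_summary(floor))  # (2, 3, 6)
```

Buildings and sites would wrap floors the same way, which is why each level of the page can reuse the same table component.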

Policy management

This is where an operator actually changes how power behaves. The page groups servers by their profile (dynamic power saving, high performance, remote control) and shows what each group is consuming. Next to that, the system suggests a better profile for each server, with the saving you would expect and an Apply button. Bringing a new server under a policy takes two steps: pick the servers, then define the policy, either scheduled or on demand.

The Power Recycle Policy Applied Servers table, listing servers with manufacturer, average CPU and network utilization, power consumption, status and scheduled or on-demand policy type.

Step one of the Add New Server flow: a Select Server table with checkboxes, manufacturer, average CPU and network utilization and power consumption, and a two-step progress indicator.

Step two of the Add New Server flow: Define Policy form with policy type Scheduled, control type Powerdown, weekly schedule, timezone, dates, day and time range, and Apply.

Policy management: servers grouped by power profile, with a suggestion for each one.


Alarms

The alarms page keeps the count of active alarms by severity, shows which sites are throwing the most, and lists the recent ones against their site and ticket.

Mobile companion

A companion app for the floor

The web app lives on a desk, but a lot of an operator's job happens on their feet, walking the data center to find the one server throwing an alert. So there is a companion mobile app built around that reality.

Its user is Mike, an operation specialist who is on call and rarely sitting at a screen. He needs to find the faulty server fast, see what is wrong and how to fix it while standing in front of it, and close the alert on the spot. So the app is built around two things: AR navigation to reach the server, and resolving alerts in the field.

The mobile app brief and persona board: an overview of the Data Centre Management app, and Mike, a 29-year-old data center operation specialist, with his goals and challenges. — The mobile brief and Mike, the on-call operation specialist the app was designed for.

Getting in

Because the app uses the camera for AR and scanning, and location to place you on the floor, signing in is followed by a clear permissions step rather than asking for everything silently.

The mobile login screen for Data Centre Management, with username and password over a data-center photo. — Login.

A grant-permissions sheet asking for location, camera and storage access, with skip and continue. — Permissions are asked up front, because the camera and location power AR and scanning.

The dashboard in your pocket

A quick read on the way in: how many alerts are open, server utilization, device counts, and how alerts have been trending.

The mobile dashboard: total alerts, a server-utilization gauge at 67 percent, device details for servers, VMs and applications, and an alert-stats chart, with a bottom tab bar. — The mobile dashboard: open alerts, utilization, device counts.

An alert-status screen with 7-day and 15-day toggles, totals for in-progress and completed, a stacked alert-stats bar chart, and key insights. — Alert stats over 7 or 15 days, with a short read on the trend.

Finding the server with AR

This is the part that actually needed to be mobile. You pick the floor, the app draws a route over the live camera view, and arrows on the floor walk you to the exact rack. When you arrive it tells you, and you scan the server to pull up its alert.

AR navigation on the live camera: a Go Straight 5m instruction and blue arrows on the floor, with Karle Data Centre floor, room, row and rack labels. — Live AR arrows guide you down the aisle.

AR navigation showing a Turn Left instruction with a curved arrow and Destination on the Left. — It calls the turns as you walk.

A floor-plan view of the Karle data center with a grid of named servers, a red route line to the target, floor tabs, and a Live AR View button. — A floor map plots the route, with a switch to the live AR view.

The floor map with the target server highlighted and a prompt reading you have reached the server with a scan button. — Arrival is confirmed at the rack.

A scan-code screen with a camera viewfinder and the prompt hold the camera still to scan the image. — Scan the server to pull up its alert.

From alert to fix

Alerts can be filtered and sorted by device, impact and status. Open one and you get the server's state, the downtime and cost impact, and a list of mitigation options. You work through the steps, upload proof of each fix, and submit to close it.

The alerts list with a search bar, Top 5 and Show All toggles, filter and sort, and severity-coloured alert cards for servers and VMs. — The alerts list, severity-coded.

A filter sheet over the alerts list with Device type, Impact and Status options. — Filter by device, impact or status.

The device-type filter with checkboxes for device types one through five and an apply button. — Drilling into a filter.

Alert details for Server 7872MC: CPU usage, VMs and applications, the alert with downtime and dollar impact, and a list of mitigation options. — Alert details: state, downtime, cost impact, and ways to fix it.

A scanned-server detail sheet repeating the alert and mitigation options with a scan-again button. — The same detail, pulled up by scanning the server.

A mitigation steps screen with numbered steps and click-here-to-upload buttons for proof of each fix. — Work the steps and upload proof of each fix.

The mitigation steps marked uploaded successfully with filenames and an active submit button. — Proof uploaded, ready to submit and close.

Topologies, profile and closed work

The rest rounds out the picture: browse the topology by site, floor and type, drill into a server's history, and see your own closed alerts from the profile.

Topologies, browsable by site, floor and type.

A server's detail and event history.

Mike's profile, with his alert counts.

His closed alerts.

A closed alert, with the proof that closed it.

Mobile wireframes

The screens were built from these wireframes, where the flows and layouts were worked out first.

The wireframes the mobile screens were built from.

Outcome and impact

DCEM began as an internal tool and turned into something we could sell. We ran a lot of demos, and two external clients came on board, with roughly 2,500 racks between them to monitor. That was the proof it worked outside HCL's own data centers.

Together the web app and the mobile companion covered both halves of the job: watching and planning at the desk, and finding and fixing on the floor.

What I'd do differently

Because it was my first project, the clearest lessons are the things I would change. I would step back and map the whole user journey first, then design the components and screens, instead of doing all three at once. I would push harder to get in front of a real operator early, since that feedback only reached us near the end. I would design and document proper patterns from the start, because handoff turned into a lot of back and forth explaining decisions that were never written down. And I would build in accessibility from day one rather than leaving it out, which is something we just didn't know to do at the time.