Introducing Cruz: An AI Data Engineer In-a-Box

Read about why we built Cruz, an autonomous agentic AI that automates data engineering tasks to empower security and data teams

February 12, 2025

Why we built it and what it does

Artificial Intelligence is perceived as a panacea for modern business challenges with its potential to unlock greater efficiency, enhance decision-making, and optimize resource allocation. However, today’s commercially available AI solutions are reactive – they assist, enhance analysis, and bolster detection, but don’t act on their own. With the explosion of data from cloud applications, IoT devices, and distributed systems, data teams are burdened with manual monitoring, complex security controls, and fragmented systems that demand constant oversight. What they really need is more than an AI copilot: a complementary data engineer that takes over the exhausting work and frees them up for more strategic data and security work.

That’s where we saw an opportunity. The question that inspired us: How do we transform the way organizations approach data management? The answer led us to Cruz—not just another AI tool, but an autonomous AI data engineer that monitors, detects, adapts, and actively resolves issues with minimal human intervention.

Why We Built Cruz

Organizations face unprecedented challenges in managing vast amounts of data across multiple systems. From integration headaches to security threats, data engineers and security teams are under immense pressure to keep pace with evolving data risks. These challenges extend beyond mere volume: they affect effectiveness, security, and real-time insight generation.

  1. Integration Complexity

Data ecosystems are expanding, encompassing diverse tools and platforms—from SIEMs to cloud infrastructure, data lakes, and observability tools. The challenge lies in integrating these disparate systems to achieve unified visibility without compromising security or efficiency. Data teams often spend days or even weeks developing custom connections, which then require continuous monitoring and maintenance.

  2. Disparate Data Formats

Data is generated in varied formats—from logs and alerts to metrics and performance data—making it difficult to maintain quality and extract actionable insights. Compounding this challenge, these formats are not static; schema drifts and unexpected variations further complicate data normalization.

  3. The Cost of Scaling and Storage

With data growing exponentially, organizations struggle with storage, retrieval, and analysis costs. Storing massive amounts of data inflates SIEM and cloud storage costs, while manually filtering out data without loss is nearly impossible. The challenge isn’t just about storage—it’s about efficiently managing data volume while preserving essential information.

  4. Delayed and Inconsistent Insights

Even after data is properly integrated and parsed, extracting meaningful insights is another challenge. Overwhelming volumes of alerts and events make it difficult for data teams to manually query and review dashboards. This overload delays insights, increasing the risk of missing real-time opportunities and security threats.

These challenges demand excessive manual effort—updating normalization, writing rules, querying data, monitoring, and threat hunting—leaving little time for innovation. While traditional AI tools improve efficiency by automating basic tasks or detecting predefined anomalies, they lack the ability to act, adapt, and prioritize autonomously.

What if AI could do more than assist? What if it could autonomously orchestrate data pipelines, proactively neutralize threats, intelligently parse data, and continuously optimize costs? This vision drove us to build Cruz to be an AI system that is context-aware, adaptive, and capable of autonomous decision-making in real time.

Cruz as Agentic AI: Informed, Perceptive, Proactive

Traditional data management solutions are struggling to keep up with the complexities of modern enterprises. We needed a transformative approach—one that led us to agentic AI. Agentic AI represents the next evolution in artificial intelligence, blending sophisticated reasoning with iterative planning to autonomously solve complex, multi-step problems. Cruz embodies this evolution through three core capabilities: being informed, perceptive, and proactive.

Informed Decision-Making

Cruz leverages Retrieval-Augmented Generation (RAG) to understand complex data relationships and maintain a holistic view of an organization’s data ecosystem. By analyzing historical patterns, real-time signals, and organizational policies, Cruz goes beyond raw data analysis to make intelligent, autonomous decisions that improve efficiency and optimization.

Perceptive Analysis

Cruz’s perceptive intelligence extends beyond basic pattern detection. It recognizes hidden correlations across diverse data sources, differentiates between routine fluctuations and critical anomalies, and dynamically adjusts its responses based on situational context. This deep awareness ensures smarter, more precise decisions without requiring constant human intervention.

Proactive Intelligence

Rather than waiting for issues to emerge, Cruz actively monitors data environments, anticipating potential challenges before they impact operations. It identifies optimization opportunities, detects anomalies, and initiates corrective actions autonomously while continuously evolving to deliver smarter and more effective data management over time.

Redefining Data Management with Autonomous Intelligence

Modern data environments are complex and constantly evolving, requiring more than just automation. Cruz’s agentic capabilities redefine how organizations manage data by autonomously handling tasks that traditionally consume significant engineering time. For example, when schema drift occurs, traditional tools may only alert administrators, but Cruz autonomously analyzes the data pattern, identifies inconsistencies, and updates normalization in real time.
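To make "schema drift" concrete, here is a minimal Python sketch of the detection step only: comparing an incoming record against an expected field-to-type mapping. It is an illustration under stated assumptions, not Cruz's implementation; the function name `detect_schema_drift`, the field names, and the sample record are all hypothetical.

```python
from typing import Any


def detect_schema_drift(
    expected: dict[str, type], record: dict[str, Any]
) -> dict[str, list[str]]:
    """Compare one record against an expected field -> type mapping."""
    # Fields the schema expects but the record no longer carries.
    missing = sorted(k for k in expected if k not in record)
    # Fields the record carries that the schema has never seen.
    unexpected = sorted(k for k in record if k not in expected)
    # Fields present in both, but with a value of a different type.
    type_changed = sorted(
        k for k, t in expected.items()
        if k in record and not isinstance(record[k], t)
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "type_changed": type_changed,
    }


# Hypothetical schema and a record from a source that renamed a field
# and started sending a number as a string.
expected_schema = {"timestamp": str, "src_ip": str, "bytes": int}
drifted_record = {
    "timestamp": "2025-02-12T10:00:00Z",
    "source_ip": "10.0.0.5",
    "bytes": "512",
}

report = detect_schema_drift(expected_schema, drifted_record)
```

A report like this is only the alerting half of the problem; the point made above is that an agentic system would go on to update the normalization rules itself.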

Unlike traditional tools that rely on static monitoring, Cruz actively scans your data ecosystem, identifying threats and optimization opportunities before they escalate. Whether it's streamlining data flows, transforming data, or reducing data volume, Cruz executes these tasks autonomously while ensuring data integrity.

Cruz's Core Capabilities

  • Plug and Play Integration: Cruz automatically discovers data sources across cloud and on-prem environments, providing a comprehensive data overview. With a single click, Cruz streamlines what would typically be hours of manual setup into a fast, effortless process, ensuring quick and seamless integration with your existing infrastructure.
  • Automated Parsing: Where traditional tools stop at flagging issues, Cruz takes the next step. It proactively parses, normalizes, and resolves inconsistencies in real time. It autonomously updates schemas, masks sensitive data, and refines structures—eliminating days of manual engineering effort.
  • Real-time AI-driven Insights: Cruz leverages advanced AI capabilities to provide insights that go far beyond human-scale analysis. By continuously monitoring data patterns, it provides real-time insights into performance, emerging trends, volume reduction opportunities, and data quality enhancements, enabling better decision-making and faster data optimization.
  • Intelligent Volume Reduction: Cruz actively monitors data environments to identify opportunities for volume reduction by analyzing patterns and creating rules to filter out irrelevant data. For example, it identifies irrelevant fields in logs sent to SIEM systems, eliminating data that doesn't contribute to security insights. Additionally, it filters out duplicate or redundant data, minimizing storage and observability costs while maintaining data accuracy and integrity.
  • Automating Analytics: Cruz operates 24/7, continuously monitoring and analyzing data streams in real time to ensure no insights are missed. With deep contextual understanding, it detects patterns, anticipates potential threats, and uncovers optimization opportunities. By automating these processes, Cruz saves engineering hours, minimizes human errors, and ensures data remains protected, enriched, and readily available for actionable insights.
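The volume-reduction idea in the list above (dropping fields that add nothing to security insight, then filtering duplicates) can be sketched in a few lines of Python. This is a hedged illustration rather than how Cruz works internally: `reduce_volume`, the `DROP_FIELDS` set, and the sample events are hypothetical, and a real pipeline would bound the memory used for duplicate tracking.

```python
import hashlib
import json
from typing import Any, Iterable, Iterator

# Hypothetical fields assumed to carry no security value.
DROP_FIELDS = {"debug_info", "user_agent_raw"}


def reduce_volume(
    events: Iterable[dict[str, Any]],
    drop_fields: set[str] = DROP_FIELDS,
) -> Iterator[dict[str, Any]]:
    """Drop irrelevant fields, then skip events already seen."""
    seen: set[str] = set()  # unbounded here; a real pipeline would cap or expire it
    for event in events:
        slim = {k: v for k, v in event.items() if k not in drop_fields}
        # Hash a canonical JSON form so key order does not affect the digest.
        canonical = json.dumps(slim, sort_keys=True, default=str)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield slim


events = [
    {"ts": "t1", "action": "deny", "src": "10.0.0.5", "debug_info": "x"},
    {"ts": "t1", "action": "deny", "src": "10.0.0.5", "debug_info": "y"},
    {"ts": "t2", "action": "allow", "src": "10.0.0.9", "debug_info": "z"},
]
reduced = list(reduce_volume(events))
```

Here the first two events differ only in a dropped field, so they collapse into one once that field is removed.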

Conclusion

Cruz is more than an AI tool—it’s an AI Data Engineer that evolves with your data ecosystem, continuously learning and adapting to keep your organization ahead of data challenges. By automating complex tasks, resolving issues, and optimizing operations, Cruz frees data teams from the burden of constant monitoring and manual intervention. Instead of reacting to problems, organizations can focus on strategy, innovation, and scaling their data capabilities.

In an era where data complexity is growing, businesses need more than automation—they need an intelligent, autonomous system that optimizes, protects, and enhances their data. Cruz delivers just that, transforming how companies interact with their data and ensuring they stay competitive in an increasingly data-driven world.

With Cruz, data isn’t just managed—it’s continuously improved.

Ready to transform your data ecosystem with Cruz? Learn more about Cruz here.


See related articles

Two years ago, our DataBahn journey began with a simple yet urgent realization: security data management is fundamentally flawed. Enterprises are overwhelmed by security and telemetry data, struggling to collect, store, and process it, while finding it harder and harder to gain timely insights from it. As leaders and practitioners in cybersecurity, data engineering, and data infrastructure, we saw this pattern everywhere: spiraling SIEM costs, tool sprawl, noisy data, tech debt, brittle pipelines, and AI initiatives blocked by legacy systems and architectures.

We founded DataBahn to break this cycle. Our platform is specifically designed to help enterprises regain control: disconnecting data pipelines from outdated tools, applying AI to automate data engineering, and constructing systems that empower security, data, and IT teams. We believe data infrastructure should be dynamic, resilient, and scalable, and we are creating systems that leverage these core principles to enhance efficiency, insight, and reliability.

Today, we’re announcing a significant milestone in this journey: a $17M Series A funding round led by Forgepoint Capital, with participation from S3 Ventures and returning investor GTM Capital. Since coming out of stealth, our trajectory has been remarkable – we’ve secured a Fortune 10 customer and have already helped several Fortune 500 and Global 200 companies cut over 50% of their telemetry processing costs and automate most of their data engineering workloads. We're excited by this opportunity to partner with these incredible customers and investors to reimagine how telemetry data is managed.

Tackling an industry problem

As operators, consultants, and builders, we worked with CISOs across continents who complained that they had gone from managing gigabytes of data every month to drowning in terabytes of data daily, while using the same pipelines as before. Proprietary formats, a growing disparity in sources and devices, and an evolving threat landscape added layer upon layer of complexity. With the advent of Generative AI, CISOs and CIOs found themselves facing an incredible opportunity wrapped in an existential threat, and without the right tools to prepare for it.

DataBahn is setting a new benchmark for how modern enterprises and their CISOs and CIOs can manage and operationalize their telemetry across security, observability, IoT/OT systems, and AI ecosystems. Built on a revolutionary AI-driven architecture, DataBahn parses, enriches, and suppresses noise at scale, all while minimizing egress costs. This is the approach our current customers are excited about, because it addresses key pain points they have been unable to solve with other solutions.

Our two new Agentic AI products are solving problems for enterprise data engineering and analytics teams. Cruz automates complex data engineering tasks, from log discovery and pipeline creation to tracking and maintaining telemetry health and providing insights on data quality. Reef surfaces context-aware and enriched insights from streaming telemetry data, turning hours of complex querying across silos into seconds of natural-language queries.

The Right People

We’re incredibly grateful to our early customers; their trust, feedback, and high expectations have shaped who we are. Their belief drives us every day to deliver meaningful outcomes. We’re not just solving problems with them, we’re building long-term partnerships to help enterprise security and IT teams take control of their data, and design systems that are flexible, resilient, and built to last. There’s more to do, and we’re excited to keep building together.

We’re also deeply thankful for the guidance and belief of our advisors, and now our investors. Their support has not only helped us get here but also sharpened our understanding of the opportunity ahead. Ernie, Aaron, and Saqib’s support has made this moment more meaningful than the funding; it’s the shared conviction that the way enterprises manage and use data must fundamentally change. Their backing gives us the momentum to move faster, and the guidance to keep building towards that mission.

Above all, we want to thank our team. Your passion, resilience, and belief in what we’re building together are what got us here. Every challenge you’ve tackled, every idea you’ve contributed, every late night and early morning has laid the foundation for what we have done so far and for what comes next. We’re excited about this next chapter and are grateful to have been on this journey with all of you.

The Next Chapter

The complexity of enterprise data management is growing exponentially. But we believe that with the right foundation, enterprises can turn that complexity into clarity, efficiency, and competitive advantage.

If you’re facing challenges with your security or observability data, and you’re ready to make your data work smarter for AI, we’d love to show you what DataBahn can do. Request a demo and see how we can help.

Onwards and upwards!

Nanda and Nithya
Cofounders, DataBahn

In September 2022, cybercriminals accessed, encrypted, and stole a substantial amount of data from Suffolk County’s IT systems, which included personally identifiable information (PII) of county residents, employees, and retirees. Although Suffolk County did not pay the ransom demand of $2.5 million, it ultimately spent $25 million to address and remediate the impact of the attack.

Members of the county’s IT team reported receiving hundreds of alerts every day in the weeks leading up to the attack. Several months earlier, frustrated by the excessive number of unnecessary alerts, the team redirected the notifications from their tools to a Slack channel. Although the frequency and severity of the alerts increased leading up to the September breach, the constant stream of alerts wore the small team down, leaving them too exhausted to respond and distinguish false positives from relevant alerts. This situation created an opportunity for malicious actors to successfully circumvent security systems.

The alert fatigue problem

Today, cybersecurity teams are continually bombarded by alerts from security tools throughout the data lifecycle. Firewalls, XDRs/EDRs, and SIEMs are among the common tools that trigger these alerts. In 2020, Forrester reported that SOC teams received 11,000 alerts daily, and 55% of cloud security professionals admitted to missing critical alerts. Organizations cannot afford to ignore a single alert, yet alert fatigue (and an overwhelming number of unnecessary alerts) causes SOCs to miss up to 30% of security alerts, which go uninvestigated or are overlooked entirely.

While this creates a clear cybersecurity and business continuity problem, it also presents a pressing human issue. Alert fatigue leads to cognitive overload, emotional exhaustion, and disengagement, resulting in stress, mental health concerns, and attrition. More than half of cybersecurity professionals cite their workload as the primary source of stress, two-thirds reported experiencing burnout, and over 60% of cybersecurity professionals surveyed stated it contributed to staff turnover and talent loss.

Alert fatigue poses operational challenges, represents a critical security risk, and truly becomes an adversary of the most vital resource that enterprises rely on for their security — SOC professionals doing their utmost to combat cybercriminals. SOCs are spending so much time and effort triaging alerts and filtering false positives that there’s little room for creative threat hunting.

Data is the problem – and the solution

Alert fatigue is a result, not a root cause. When these security tools were initially developed, cybersecurity teams managed gigabytes of data each month from a limited number of computers on physically connected sites. Today, Security Operations Centers (SOCs) are tasked with handling security data from thousands of sources and devices worldwide, arriving in many different formats. The developers of these devices did not intend to simplify the lives of security teams, and the tools they designed to identify patterns often resemble a fire alarm in a volcano. The more data that is sent as input to these tools, the more likely they are to malfunction – further exhausting and overwhelming already stretched security teams.

Well-intentioned leaders advocate for improved triaging, the use of automation, refined rules to reduce false-positive rates, and the application of popular technologies like AI and ML. Until we can stop security tools from being overwhelmed by large volumes of unstructured, unrefined, and chaotic data from diverse sources and formats, these fixes will be band-aids on a gaping wound.

The best way to address alert fatigue is to filter the data before it is ingested into downstream security tools. Consolidate, correlate, parse, and normalize data before it enters your SIEM or UEBA. If it isn’t necessary, store it in blob storage. If it’s duplicated or irrelevant, discard it. Keep poor data out of your SIEM so that it doesn’t overwhelm your SOC with alerts no one requested.
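The routing rule described above (forward what matters to the SIEM, park what is not needed in blob storage, discard duplicates and irrelevant events) can be sketched as a small Python function. This is an illustrative sketch, not DataBahn's pipeline: the `Route` enum, the event-type sets, and the `id` and `type` field names are assumptions made for the example.

```python
from enum import Enum
from typing import Any


class Route(Enum):
    SIEM = "siem"        # security-relevant: forward for detection
    ARCHIVE = "archive"  # not needed for detection: keep in blob storage
    DISCARD = "discard"  # duplicate or irrelevant: drop


# Hypothetical event types, chosen only for illustration.
SECURITY_RELEVANT = {"auth_failure", "malware_detected", "privilege_escalation"}
IRRELEVANT = {"heartbeat", "debug"}


def route_event(event: dict[str, Any], seen_ids: set[str]) -> Route:
    """Decide where one normalized event should go."""
    event_id = event.get("id")
    if event_id is not None:
        if event_id in seen_ids:
            return Route.DISCARD  # already processed: duplicate
        seen_ids.add(event_id)

    event_type = event.get("type")
    if event_type in IRRELEVANT:
        return Route.DISCARD
    if event_type in SECURITY_RELEVANT:
        return Route.SIEM
    return Route.ARCHIVE  # keep for later, but out of the SIEM


seen: set[str] = set()
sample = [
    {"id": "a1", "type": "auth_failure"},
    {"id": "a1", "type": "auth_failure"},  # duplicate of the first event
    {"id": "b2", "type": "heartbeat"},
    {"id": "c3", "type": "dns_query"},
]
routes = [route_event(e, seen) for e in sample]
```

Of the four sample events, only the first reaches the SIEM; the static type sets stand in for whatever classification logic a real pipeline would use.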

How DataBahn helps

At DataBahn, we help enterprises cut through cybersecurity noise with our security data pipeline solution, which works around the clock to:

1. Aggregate and normalize data across tools and environments automatically

2. Use AI-driven correlation and prioritization

3. Denoise the data going into the SIEM, ensuring more actionable alerts with full context

SOCs using DataBahn aren’t overwhelmed with alerts; they only see what’s relevant, allowing them to respond more quickly and effectively to threats. They are empowered to take a more strategic approach in managing operations, as their time isn’t wasted triaging and filtering out unnecessary alerts.

Organizations looking to safeguard their systems – and protect their SOC members – should shift from raw alert processing to smarter alert management, driven by an intelligent pipeline that combines automation, correlation, and transformation to filter out the noise and combat alert fatigue.

Interested in saving your SOC from alert fatigue? Contact DataBahn
In the past, we've written about how we solve this problem for Sentinel. You can read more here: 
AI-powered Sentinel Log Optimization

I see a lot more organizations heading toward Headless Cyber Architecture. Traditionally, cybersecurity teams relied on one massive tool: the SIEM. For years, cybersecurity orgs funneled all their cyber data into it, not because it was optimal, but because it was the compliance checkbox.

That’s how SIEMs earned their core seat at the table. Over time, SIEMs evolved from log aggregators into something more sophisticated: first UEBA, then security analytics, and now, increasingly, SaaS-based platforms and AI SOC. But there’s a catch: in this model, you don’t truly own your data. It lives in the vendor’s ecosystem, locked into their proprietary format rather than an open standard.

You end up paying for storage, analytics, and access to your own telemetry—creating a cycle of dependency and vendor lock-in.

But the game is changing. What’s New?

SIEMs are not going away; they remain mission-critical. But they’re no longer the sole destination for all cyber data. Instead, they are being refocused: They now consume only Security-Relevant Data (SRDs)—purposefully curated feeds for advanced threat detection, correlation, and threat chaining. Nearly 80% of organizations have only integrated baseline telemetry—firewalls, endpoints, XDRs, and the like. But where’s the visibility into mission-critical apps? Your plant data? Manufacturing systems? The rest of your telemetry often remains siloed, unparsed, and not in open, interoperable formats like OTEL or OCSF.


The shift is this: that telemetry is now flowing into your Security Data Lake (SDL), where it is parsed, normalized, and enriched with context like threat intel, HR systems, identity, and geo signals. This data increasingly lives in your own environment: Databricks, Snowflake, Amazon Web Services (AWS), Microsoft Azure, Google Cloud, or Hydrolix.

With this shift, a new category is exploding: headless cybersecurity products—tools that sit on top of your data rather than ingesting it.

· Headless SIEMs: Built for detection, not data hoarding.

· Headless Vulnerability Analytics: Operating directly on vuln data inside your SDL.

· Headless Data Science: ML models run atop your lake, no extraction needed.

· Soon: Headless IAM & Access Analytics: Compliance and reporting directly from where access logs reside.

These solutions don’t route your data out—they bring their algorithm to your lake. This flips the control model.


To Get There: The Data Pipeline Must Evolve

What’s needed is an independent platform purpose-built for streaming ETL and pipeline management, the connective tissue that moves, filters, and enriches your telemetry in real time. A platform that is:

· Lightweight and modular—drop a node anywhere to start collecting from a new business unit or acquisition.

· Broadly integrated—connecting with thousands of systems to maximize visibility.

· Smart at filtering: removing as much as 60% to 80% of the Non-Security Data (NSD) that bloats your SIEM.

· Enrichment-first—applying threat intel, identity, geo, and other contextual data before forwarding to your Security Data Lake (SDL) and SIEM. Remember, analysts spend valuable time manually stitching together context during investigations. Pre-enriched data dramatically reduces that effort—cutting investigation time, improving accuracy, and accelerating response.

· AI-ready: feeding clean, contextualized data into your models, reducing noise and improving MTTD/MTTR. It also helps sanitize sensitive information before it leaves your environment.

· Insightful in motion—offering real-time observability as data flows through the pipeline.

In short, the pipeline becomes the foundation for modern security architecture and the fuel for your AI-driven transformation.
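As a rough illustration of the enrichment-first step in the list above, the Python sketch below attaches threat-intel, geo, and identity context to an event before it is forwarded. Everything in it is an assumption made for the example: the in-memory lookup tables stand in for real context sources, and the field names (`src_ip`, `user`, `threat_label`, and so on) are hypothetical.

```python
from typing import Any

# Hypothetical lookup tables standing in for real threat-intel,
# geo, and identity sources.
THREAT_INTEL = {"203.0.113.7": "known-scanner"}
GEO = {"203.0.113.7": "NL"}
IDENTITY = {"jdoe": {"department": "finance", "privileged": False}}


def enrich(event: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the event with context attached.

    Lookups that find nothing yield None, so downstream consumers
    always see the same set of keys.
    """
    enriched = dict(event)  # leave the original event untouched
    src_ip = event.get("src_ip")
    enriched["threat_label"] = THREAT_INTEL.get(src_ip)
    enriched["src_country"] = GEO.get(src_ip)
    enriched["identity"] = IDENTITY.get(event.get("user"))
    return enriched


raw = {"user": "jdoe", "src_ip": "203.0.113.7", "action": "login"}
enriched_event = enrich(raw)
```

An analyst opening `enriched_event` already sees who the user is and what is known about the source address, which is the manual stitching the list says pre-enrichment removes.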

Bottom line: we’re entering a new era where:

· SIEMs do less ingestion, more detection

· Data lakes become the source of truth: enriched and stored in your format

· Vendors no longer take your data—they work on top of it

· Security teams get flexibility, visibility, and control—without the lock-in

This is the rise of modular, headless cybersecurity, where your data stays yours. Vendors’ analytics run where you want, and computing happens on your terms, all while you keep complete control over your data.