Serverless will democratize modern data infrastructure
It’s funny that people talk about “serverless” as a new trend. Isn’t every SaaS application like Zendesk/Salesforce, or platform like Twilio/Zapier, also “serverless” from the perspective of a consumer of that application? And that’s really what’s driven the democratization of access to modern software. No one needs to hire an IT group, set up servers, or install and maintain software anymore. You just sign up and start using it. You never think about the idea of a server, even though that’s what’s running it behind the scenes. And because you’re not making a big investment up front, there’s a low entry point from a cost perspective, and it then scales with your business. True democratization of access.
Going beyond that self-service SaaS haven, you then run into your first challenges. Maybe you want to integrate your CRM data into your marketing automation system. Maybe you want to enrich your CRM data with data from something like Clearbit, Crunchbase, or another data provider. Maybe you want to send a Slack notification when a lead is submitted on your website.
This is where things like Zapier, Automate.io, Tray.io, and Workato come in. They’re event-based integration platforms taking triggers from certain applications (e.g. new customer created in CRM) and then running multi-step workflows (e.g. call out to the Clearbit API, pull detail on the prospect, send it back to the CRM). And you can use them without any engineering expertise (for the most part).
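To make the pattern concrete, here’s a minimal, hand-rolled sketch of the trigger → enrich → write-back flow these platforms automate. The function names and payload fields are hypothetical stand-ins; in a real workflow, `fetch_company_info` would call an enrichment API like Clearbit and `update_crm` would call your CRM’s API.

```python
# Sketch of an event-based integration workflow: a "new lead" trigger fires,
# we enrich the record from a (stubbed) data provider, and write it back.

def fetch_company_info(domain: str) -> dict:
    # Stub: a real workflow would call an enrichment API (e.g. Clearbit) here.
    return {"domain": domain, "employees": 250, "industry": "Logistics"}

def update_crm(lead: dict) -> dict:
    # Stub: a real workflow would PATCH the CRM record via its API here.
    return lead

def on_new_lead(event: dict) -> dict:
    """Triggered when a new lead is created (e.g. a website form submission)."""
    lead = dict(event)
    domain = lead["email"].split("@")[1]       # derive company domain
    lead.update(fetch_company_info(domain))    # enrich with provider data
    return update_crm(lead)                    # write back to the CRM

enriched = on_new_lead({"email": "jane@acme.com", "name": "Jane"})
print(enriched["industry"])  # Logistics
```

Note that nothing persists after `on_new_lead` returns, which is exactly the point made next: these workflows are stateless by design.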
This approach works well when you don’t care about anything being stored beyond the lifecycle of each workflow. Once the customer in the CRM is enriched with data, the workflow disappears, without a trace.
But what do you do when you want to capture things that aren’t logical to store in your existing systems? Maybe you’re running a B2B marketplace and you want to capture the “clicks” on certain products to see what clients are looking for. You don’t want a long list of clicks in your CRM; it’s not built to store that volume of records, and it’s not built to query them efficiently.
This is the type of data that modern data warehouses (e.g. Snowflake, Google BigQuery, Amazon Redshift, etc.) are built to store. They’ll save long lists of information and then allow you to query it to generate aggregate metrics. So to continue with the B2B marketplace example, you can query for the most frequently “clicked” product for a given customer and push that insight back into your CRM. Or maybe you want to run a machine learning model that predicts what or when a customer is likely to buy based on their click history. This is where data warehouses really shine.
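The kind of aggregate query a warehouse answers well can be sketched in a few lines. Here sqlite3 stands in for Snowflake/BigQuery/Redshift, and the table and column names are hypothetical; the shape of the query is the point.

```python
# Illustrative sketch: store a long list of click events, then aggregate
# to find the most frequently clicked product for one customer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (customer_id TEXT, product_id TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("acme", "widget-a"), ("acme", "widget-b"), ("acme", "widget-a")],
)

# Most frequently "clicked" product for a given customer.
row = conn.execute(
    """
    SELECT product_id, COUNT(*) AS n
    FROM clicks
    WHERE customer_id = 'acme'
    GROUP BY product_id
    ORDER BY n DESC
    LIMIT 1
    """
).fetchone()
print(row)  # ('widget-a', 2)
```

A CRM would choke on millions of rows like this; a columnar warehouse is designed to scan and aggregate them cheaply.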
So why doesn’t everyone do this?
It’s the worst type of problem — one that seems easy, but is actually really hard to solve well.
Reason #1 — The paradox of choice / there are many ways to skin a cat
Here is a list of technologies you’d need to consider leveraging (keep in mind you need a combination of some categories, not just one of them):
- Data Warehouses (storage) — Snowflake, BigQuery, Redshift, etc.
- ETL/ELT/EL (system -> warehouse) — Fivetran, Matillion, HVR, Airbyte, Hevo Data
- Event Data Capture (like ETL, but managing the capture of events) — Funnel, Snowplow, Segment
- Reverse ETL (warehouse -> system) — Census, Hightouch, RudderStack
- Transformation (change data formats / clean) — dbt
- Custom Data Pipelines (vs. the standardized versions from the ETL and reverse ETL players above) — Kafka, and those that manage it (e.g. Confluent, Aiven)
- DataOps (CI/CD, but for data pipelines/models) — DataKitchen, Lenses
- Data Observability & Quality Monitoring — Bigeye, Acceldata, Monte Carlo
- Data Connectors / Drivers (connectivity to each app’s specific protocols/APIs) — CData
- iPaaS (point-to-point data replication) — MuleSoft Anypoint, Informatica, TIBCO
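To make one of these categories concrete, here’s a minimal sketch of the “transformation” step: normalizing raw event records into a consistent shape before analysis. The field names are hypothetical; in practice, tools like dbt express this kind of logic as SQL running inside the warehouse.

```python
# Hand-rolled sketch of a transformation step: clean one raw click event
# into a canonical format (trimmed, lowercased IDs; defaults for gaps).

def transform(raw: dict) -> dict:
    """Normalize a raw event record into the warehouse's canonical shape."""
    return {
        "customer_id": raw["customerId"].strip().lower(),
        "product_id": raw.get("productId", "unknown"),
        "clicked_at": raw["timestamp"],  # assumed already ISO 8601 here
    }

clean = transform({
    "customerId": " ACME ",
    "productId": "widget-a",
    "timestamp": "2021-06-01T12:00:00Z",
})
print(clean["customer_id"])  # acme
```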
What makes this even more complicated / hard to think about is that many of these players overlap in what they do. Kafka Connect, for example, provides a subset of the connectors that CData provides. Some of dbt’s transformation work can happen in many of these other apps. You can accomplish with MuleSoft much of what you can do with a combination of other tools. Just a few examples…
Reason #2 — Underlying server infrastructure — scaling and maintenance
With many of these technologies, you don’t just sign up and go. You’re managing the infrastructure that this software runs on top of. You’re dealing with servers crashing and keeping backup nodes in sync. You’re splitting up, or “sharding,” data across many servers to scale with your data volume. And you need to make sure none of the pipelines break because of unexpected data types and formats.
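A tiny sketch of the kind of work a managed service hides from you: routing records to shards by hashing a key, so volume can spread across servers. The shard count and key names here are hypothetical, and real systems add rebalancing, replication, and failover on top of this.

```python
# Deterministic hash-based shard routing: every record with the same key
# always lands on the same shard, so lookups know where to go.
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(key: str) -> int:
    """Map a record key to a shard index in [0, NUM_SHARDS)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All of a customer's events route to one shard, deterministically.
assert shard_for("customer-42") == shard_for("customer-42")
```

Even this toy version hints at the operational burden: change `NUM_SHARDS` and every key suddenly routes differently, which is why resharding a live cluster is hard and why teams pay vendors to handle it.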
Some companies are starting to solve this complexity for you by abstracting away the underlying servers, or helping you manage them: e.g. Fivetran in ETL and Census in reverse ETL (both fully abstracted), and Confluent and Aiven for managed Kafka.
But much of this technology still requires some significant experience to administer, which brings us to reason #3:
Reason #3 — Data engineering talent is in short supply
It’s worth noting: data engineering is different from data science. Data engineers get the data organized and aggregated for analysis. Data science teams are the ones actually doing the analysis (building and tuning models, etc.).
In any case, data engineering is a relatively new discipline, it’s not easy, and every big company building this stuff wants it. The inevitable outcome is that this talent is hard to find and expensive…
So where does this all go?
In the abstract, it says it in the title — software vendors that create serverless, self-service applications for building modern data infrastructure will increasingly take share. In practice, it probably looks like the following:
Trend #1 — Companies fully abstracting away complexity will grow quickly
Companies like Fivetran and Census are already doing this by allowing customers to think about getting data from point A to point B, not about the servers getting it there. Google BigQuery arguably improved on Snowflake by offering its solution in an entirely serverless format; I suspect this has served it well. And companies like Y42 that offer fully managed, end-to-end data infrastructure will increasingly take share in the SMB segment of the market.
Trend #2 — Vertical-specific and data-specific applications will be increasingly popular
It’s the classic trade-off of generalizability / breadth of applicability vs. depth of solution. Segment and others in the customer data space figured this out early on: focusing just on customer data lets you solve the problem in far greater depth for your customers. Brinqa takes a similar strategy in cybersecurity.
And on the vertical-specific side, I’d expect to see far more companies offering data integration for healthcare, for financial services, for logistics, etc. Health Catalyst is a great example in healthcare; Codat is another in financial services.