by Marek Kolodziej, Sr. Research Engineer at Nitro, author of Machine Learning on Spark (Manning, 2016)
The term NoETL is a play on NoSQL, and it may be misunderstood for the same reasons its predecessor once was. When the current notion of NoSQL emerged in 2009 (I’m not talking about Carlo Strozzi’s use of the name back in 1998), its originators, Johan Oskarsson and Eric Evans, merely had in mind gathering ideas for escaping the constraints of existing relational systems - the scalability problems associated with ACID. People ridiculed the term, in part because there was little that unified the various architectures. After all, isn’t it better to define something by what it is rather than by what it’s not? Surely we would be interested in what (if anything) Cassandra, Redis and Neo4J have in common, rather than what they don’t stand for. And anyway, why make the lack of a quasi-set-theoretic language (apologies here to purists like Chris Date) a selling point, when that language was exactly what was useful about the relational model - the ability to ask arbitrary questions of the data, to roll it up any way one wished, regardless of the underlying schema? The schema could be rigid, but the queries were flexible and almost limitless. Contrast this with existing “aggregate-oriented” (to use Martin Fowler’s term) NoSQL databases, which make it really hard to ask questions that require a different data representation than the one tied to the current aggregate, e.g. the current document (Mongo, CouchDB), value blob (Redis, Riak), etc.
Yet it was this very unfortunate term and the controversy that followed that drew attention to the problems with the relational model - distributed consistency, extreme normalization, the complex joins necessary to support it, and so on. The relational model might not be going out the window, particularly for sensitive transactional data, but with growing data sizes we have reached the point at which “polyglot persistence” (to borrow Martin Fowler’s term) has become the norm. Surely we don’t need as much durability when dealing with trivial problems such as shopping cart contents - meanwhile, response time matters a lot to users of e-commerce sites. Similarly, relational queries aren’t needed for looking up the details of a user’s cart, so perhaps a key-value store would serve better here, even as the relational model might still be needed to support credit card transactions.
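To make the polyglot-persistence point concrete, here is a minimal sketch of the shopping-cart case in Scala, assuming a Redis instance and the Jedis client; the key names and host are purely illustrative. The cart is just a hash keyed by user - no joins, no relational schema, and a lookup is a single round trip.

```scala
import redis.clients.jedis.Jedis
import scala.jdk.CollectionConverters._

object CartStore {
  // Illustrative assumption: a local Redis instance holds cart state.
  private val redis = new Jedis("localhost", 6379)

  // Add or update one line item in the user's cart hash.
  def addItem(userId: String, sku: String, quantity: Int): Unit =
    redis.hset(s"cart:$userId", sku, quantity.toString)

  // Fetch the whole cart as a simple sku -> quantity map.
  def cart(userId: String): Map[String, Int] =
    redis.hgetAll(s"cart:$userId").asScala.toMap.map { case (sku, qty) => sku -> qty.toInt }
}
```

The credit card transaction, by contrast, would stay in the relational database, where the ACID guarantees actually earn their keep.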
Personally, I hope the same kind of controversy arises around the idea of NoETL, to get people finally talking about evolving the way data is processed. What’s broken in today’s world? For one, the promise of ingesting mountains of “unstructured data” into systems like HDFS resulted in a lot of demand for storage, without a corresponding increase in value added. Sure, one can store petabytes of logs, but without a plan to process them, what’s the point? The point of MapReduce, in whatever incarnation (Hadoop, Spark), was to increase the volume of data that could be processed in batch, yet one can usually foresee how the data will need to be stored - in some structured, strongly-typed form. The return to strong typing in the Big Data ecosystem, thanks to technologies such as Avro, Protobuf and Thrift (object models and storage) and Parquet (storage only), suggests that perhaps we really don’t need the “E” in ETL. Why don’t we simply store the data in a reasonable form to begin with, removing the need for extraction? Of course, as our needs change, we may need to transform the data, so I doubt the “T” is going anywhere. As to the “L,” I think it’s a bit outdated too, especially for data that a given company controls end-to-end. For example, if your organization’s data pipeline is backed by a queue (e.g. Kafka), the data can be consumed by many services, each of which can do its own transformation and emit (produce) an output, or simply push the result out to a given destination, be it a relational database, a document store, or HDFS. Even without a multi-producer, multi-consumer queue, the idea of a real-time “fan-out” has been around for quite some time. After all, it’s easier to keep a system processing at a constant load in real time than to amass tons of stale data and periodically do a batch “load” from A to B.
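As a rough illustration of what one of those queue-backed services might look like, here is a hedged sketch in Scala using the plain Kafka consumer and producer clients. The topic names (“raw-events”, “enriched-events”) and the toy enrichment step are assumptions, not a prescription, and in practice the values would be Avro or Protobuf records rather than strings.

```scala
import java.util.Properties
import java.time.Duration
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EnrichmentService extends App {
  // Hypothetical broker address and topic names, for illustration only.
  val consumerProps = new Properties()
  consumerProps.put("bootstrap.servers", "localhost:9092")
  consumerProps.put("group.id", "enrichment-service")
  consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val producerProps = new Properties()
  producerProps.put("bootstrap.servers", "localhost:9092")
  producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val consumer = new KafkaConsumer[String, String](consumerProps)
  val producer = new KafkaProducer[String, String](producerProps)
  consumer.subscribe(java.util.Collections.singletonList("raw-events"))

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    for (record <- records.asScala) {
      // "T": this service owns its own transformation (a toy one here)...
      val enriched = record.value().toUpperCase
      // ...and produces the result for whoever is downstream, instead of batch-loading it anywhere.
      producer.send(new ProducerRecord[String, String]("enriched-events", record.key(), enriched))
    }
  }
}
```

The structural point is that there is no separate “load” step: each service consumes from the queue, applies its own “T,” and produces for the next consumer in the fan-out.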
Of course, let’s be realistic - we don’t always control all the data we may need to use. We may need to partner with third parties and interoperate with them via APIs. If we submit one API request at a time to a third-party service and get a near-real-time response, then of course nothing changes in our architecture - it’s just another event in our real-time system. On the other hand, if we need to move data from company X’s service to our own to pre-compute a combined result (e.g. because we need to correlate a lot of records and generate a summary), then this model seems a bit of a stretch. Nevertheless, the bulk movement can again be treated as a simulation of real-time processing, with reactive considerations such as backpressure taken into account. For this reason, I believe that even integration with third-party systems is, in a way, still mostly a transformation.
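One way to picture “bulk movement as simulated real-time” is a bounded, rate-limited stream. The sketch below uses Akka Streams purely as an example of a backpressure-aware runner; fetchPage, the page range, and the rate limit are all hypothetical stand-ins for company X’s API.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ThrottleMode}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future
import scala.concurrent.duration._

object BulkImportAsStream extends App {
  implicit val system: ActorSystem = ActorSystem("bulk-import")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical client for the third party's paged API.
  def fetchPage(page: Int): Future[Seq[String]] =
    Future(Seq(s"record-from-page-$page"))

  Source(1 to 10000)                                   // pages we plan to pull
    .throttle(10, 1.second, 10, ThrottleMode.Shaping)  // respect the partner's rate limit
    .mapAsync(parallelism = 4)(fetchPage)              // bounded concurrency, i.e. backpressure
    .mapConcat(_.toList)                               // flatten pages into individual records
    .runWith(Sink.foreach(r => println(r)))            // the "produce" side of the flow
}
```

Whether the runner is Akka Streams, Kafka Streams, or something homegrown matters less than the property that the source cannot outrun the sink - that is what makes a one-off bulk copy look like just another real-time flow.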
I think ETL is a bit of an outdated term - for modern architectures, I’d like to propose CTP: “consume, transform, produce.” It carries the connotations of real-time processing. More importantly, though, by dropping the notion of arbitrary, untyped “extraction,” the idea of consumption hopefully suggests some discipline regarding the types of items being processed. Even before binary formats such as Protobuf, RESTful APIs still had schemas, and before that there were SOAP and other technologies with clear schemas, so none of this is new. What was temporarily new was the misguided obsession with “stringly-typed” data sitting in HDFS and “schema-on-read” processing. On the “P” side, we again skip the idea of ad-hoc, batch data loading and replace it with a predictable, resource-usage-balanced, flow-based system.
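To show what dropping “extraction” in favor of typed consumption might look like, here is a minimal, dependency-free Scala sketch; PageView, ViewCount and the aggregation are invented for illustration. The point is only the shape: “C” and “P” are the endpoints, and “T” is a pure function over typed records rather than strings fished out of HDFS.

```scala
// Hypothetical typed events - the schema lives in the code (or in Avro/Protobuf),
// not in whatever a "schema-on-read" parser guesses at run time.
final case class PageView(userId: String, url: String, timestampMs: Long)
final case class ViewCount(url: String, count: Long)

// "T": a pure, typed transformation - here, counting views per URL.
def transform(views: Iterable[PageView]): Iterable[ViewCount] =
  views.groupBy(_.url).map { case (url, vs) => ViewCount(url, vs.size.toLong) }

// "C" and "P" are just the endpoints plugged in around the transformation.
def runCtp(consume: () => Iterable[PageView],
           produce: ViewCount => Unit): Unit =
  transform(consume()).foreach(produce)
```

Because the transformation is a function from typed input to typed output, the same logic can sit inside a Kafka consumer loop, a stream processor, or a unit test - there is nothing to “extract” and nothing to “load.”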