by Marek Kolodziej, Sr. Research Engineer at Nitro, author of Machine Learning on Spark (Manning, 2016)
The term NoETL is a play on NoSQL, and it may be misunderstood for the same reasons its predecessor once was. When the current notion of NoSQL emerged in 2009 (I’m not talking about Carlo Strozzi’s use of the name back in 1998), its originators, Johan Oskarsson and Eric Evans, merely had in mind gathering ideas for escaping the constraints of existing relational systems - the scalability problems associated with ACID. People ridiculed the term, in part because there was little that unified the various architectures. After all, isn’t it better to define something by what it is rather than by what it’s not? Surely we would be interested in what (if anything) Cassandra, Redis and Neo4J have in common, rather than what they don’t stand for. And anyway, why make the lack of a quasi-set-theoretic language (apologies here to purists like Chris Date) a selling point, when that language was exactly what was useful about the relational model - the ability to ask arbitrary questions of the data, to roll it up any way one wished, regardless of the underlying schema? The schema could be rigid, but the queries were flexible and almost limitless. Contrast this with existing “aggregate-oriented” (to use Martin Fowler’s term) NoSQL databases, which make it really hard to ask questions that require a different data representation than the one tied to the current aggregate, e.g. the current document (Mongo, CouchDB), value blob (Redis, Riak), etc.
Yet it was this very unfortunate term and the controversy that followed that drew attention to the problems with the relational model - distributed consistency, extreme normalization, the complex joins necessary to support it, and so on. The relational model might not be going out the window, particularly for sensitive transactional data, but with growing data sizes we have reached the point at which “polyglot persistence” (to borrow Martin Fowler’s term) has become the norm. Surely we don’t need as much durability when dealing with trivial problems such as shopping cart contents - meanwhile, response time matters a lot to users of e-commerce sites. Similarly, relational queries aren’t needed for looking up the details of a user’s cart, so perhaps a key-value store would serve better here, even as the relational model might still be needed to support credit card transactions.
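To make the polyglot-persistence point concrete, here is a minimal sketch of the shopping-cart case in Scala, assuming a Redis instance and the Jedis client; the key names and host are purely illustrative. The cart is just a hash keyed by user - no joins, no relational schema, and a lookup is a single round trip.

```scala
import redis.clients.jedis.Jedis
import scala.jdk.CollectionConverters._

object CartStore {
  // Illustrative assumption: a local Redis instance holds cart state.
  private val redis = new Jedis("localhost", 6379)

  // Add or update one line item in the user's cart hash.
  def addItem(userId: String, sku: String, quantity: Int): Unit =
    redis.hset(s"cart:$userId", sku, quantity.toString)

  // Fetch the whole cart as a simple sku -> quantity map.
  def cart(userId: String): Map[String, Int] =
    redis.hgetAll(s"cart:$userId").asScala.toMap.map { case (sku, qty) => sku -> qty.toInt }
}
```

The credit card transaction, by contrast, would stay in the relational database, where the ACID guarantees actually earn their keep.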
Personally, I hope the same kind of controversy arises around the idea of NoETL, to get people finally talking about evolving the way data is processed. What’s broken in today’s world? For one, the promise of ingesting mountains of “unstructured data” into systems like HDFS resulted in a lot of demand for storage, without a corresponding increase in value added. Sure, one can store petabytes of logs, but without a plan to process them, what’s the point? The point of MapReduce, in whatever incarnation (Hadoop, Spark), was to increase the volume of data that could be processed in batch, yet one can usually foresee how the data will need to be stored - in some structured, strongly-typed form. The return to strong typing in the Big Data ecosystem, thanks to technologies such as Avro, Protobuf and Thrift (object models and storage) and Parquet (storage only), suggests that perhaps we really don’t need the “E” in ETL. Why don’t we simply store the data in a reasonable form to begin with, removing the need for extraction? Of course, as our needs change, we may need to transform the data, so I doubt the “T” is going anywhere. As to the “L,” I think it’s a bit outdated too, especially for data that a given company controls end-to-end. For example, if your organization’s data pipeline is backed by a queue (e.g. Kafka), the data can be consumed by many services, each of which can do its own transformation and emit (produce) an output, or simply push the result out to a given destination, be it a relational database, a document store, or HDFS. Even without a multi-producer, multi-consumer queue, the idea of a real-time “fan-out” has been around for quite some time. After all, it’s easier to keep a system processing at a constant load in real time than to amass tons of stale data and periodically do a batch “load” from A to B.
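As a rough illustration of what one of those queue-backed services might look like, here is a hedged sketch in Scala using the plain Kafka consumer and producer clients. The topic names (“raw-events”, “enriched-events”) and the toy enrichment step are assumptions, not a prescription, and in practice the values would be Avro or Protobuf records rather than strings.

```scala
import java.util.Properties
import java.time.Duration
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EnrichmentService extends App {
  // Hypothetical broker address and topic names, for illustration only.
  val consumerProps = new Properties()
  consumerProps.put("bootstrap.servers", "localhost:9092")
  consumerProps.put("group.id", "enrichment-service")
  consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val producerProps = new Properties()
  producerProps.put("bootstrap.servers", "localhost:9092")
  producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val consumer = new KafkaConsumer[String, String](consumerProps)
  val producer = new KafkaProducer[String, String](producerProps)
  consumer.subscribe(java.util.Collections.singletonList("raw-events"))

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    for (record <- records.asScala) {
      // "T": this service owns its own transformation (a toy one here)...
      val enriched = record.value().toUpperCase
      // ...and produces the result for whoever is downstream, instead of batch-loading it anywhere.
      producer.send(new ProducerRecord[String, String]("enriched-events", record.key(), enriched))
    }
  }
}
```

The structural point is that there is no separate “load” step: each service consumes from the queue, applies its own “T,” and produces for the next consumer in the fan-out.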
Of course, let’s be realistic - we don’t always control all the data we may need to use. We may need to partner with third parties and interoperate with them via APIs. If we submit one API request at a time to a third-party service and get a near-real-time response, then of course nothing changes in our architecture - it’s just another event in our real-time system. On the other hand, if we need to move data from company X’s service to our own to pre-compute a combined result (e.g. because we need to correlate a lot of records and generate a summary), then this model seems a bit of a stretch. Nevertheless, the bulk movement can again be treated as a simulation of real-time processing, with reactive considerations such as backpressure taken into account. For this reason, I believe that even integration with third-party systems is, in a way, still mostly a transformation.
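One way to picture “bulk movement as simulated real-time” is a bounded, rate-limited stream. The sketch below uses Akka Streams purely as an example of a backpressure-aware runner; fetchPage, the page range, and the rate limit are all hypothetical stand-ins for company X’s API.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ThrottleMode}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future
import scala.concurrent.duration._

object BulkImportAsStream extends App {
  implicit val system: ActorSystem = ActorSystem("bulk-import")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical client for the third party's paged API.
  def fetchPage(page: Int): Future[Seq[String]] =
    Future(Seq(s"record-from-page-$page"))

  Source(1 to 10000)                                   // pages we plan to pull
    .throttle(10, 1.second, 10, ThrottleMode.Shaping)  // respect the partner's rate limit
    .mapAsync(parallelism = 4)(fetchPage)              // bounded concurrency, i.e. backpressure
    .mapConcat(_.toList)                               // flatten pages into individual records
    .runWith(Sink.foreach(r => println(r)))            // the "produce" side of the flow
}
```

Whether the runner is Akka Streams, Kafka Streams, or something homegrown matters less than the property that the source cannot outrun the sink - that is what makes a one-off bulk copy look like just another real-time flow.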
I think ETL is a bit of an outdated term - for modern architectures, I’d like to propose CTP: “consume, transform, produce.” It carries the connotations of real-time processing. More importantly, though, by dropping the notion of arbitrary, untyped “extraction,” the idea of consumption hopefully suggests some discipline regarding the types of items being processed. Even before binary formats such as Protobuf, RESTful APIs still had schemas, and before that there were SOAP and other technologies with clear schemas, so none of this is new. What was temporarily new was the misguided obsession with “stringly-typed” data sitting in HDFS and “schema-on-read” processing. On the “P” side, we again skip the idea of ad-hoc, batch data loading and replace it with a predictable, resource-usage-balanced, flow-based system.
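To show what dropping “extraction” in favor of typed consumption might look like, here is a minimal, dependency-free Scala sketch; PageView, ViewCount and the aggregation are invented for illustration. The point is only the shape: “C” and “P” are the endpoints, and “T” is a pure function over typed records rather than strings fished out of HDFS.

```scala
// Hypothetical typed events - the schema lives in the code (or in Avro/Protobuf),
// not in whatever a "schema-on-read" parser guesses at run time.
final case class PageView(userId: String, url: String, timestampMs: Long)
final case class ViewCount(url: String, count: Long)

// "T": a pure, typed transformation - here, counting views per URL.
def transform(views: Iterable[PageView]): Iterable[ViewCount] =
  views.groupBy(_.url).map { case (url, vs) => ViewCount(url, vs.size.toLong) }

// "C" and "P" are just the endpoints plugged in around the transformation.
def runCtp(consume: () => Iterable[PageView],
           produce: ViewCount => Unit): Unit =
  transform(consume()).foreach(produce)
```

Because the transformation is a function from typed input to typed output, the same logic can sit inside a Kafka consumer loop, a stream processor, or a unit test - there is nothing to “extract” and nothing to “load.”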