Replace Camus with Kafka Connect for event data imports
Closed, Duplicate, Public
Assigned To

Authored By

	Ottomata
	May 17 2019, 3:49 PM

Description

Camus imports data from Kafka and writes to HDFS. It is old and unsupported, and does not use a modern Kafka client. Unfortunately, the Kafka Connect HDFS Plugin is written by Confluent, and was recently placed under their Confluent Community License (CCL). We will either have to get approval to use the HDFS connector under this license, or fork the connector from before the license change.

Related Objects

Status	Assigned	Task
Resolved	Ottomata	T185233 Modern Event Platform
Open	None	T214430 Event Platform: Stream Connectors
Duplicate	Ottomata	T223626 Kafka Connect development work
Resolved	JAllemandou	T238400 Evaluate possible replacements for Camus: Gobblin, Marmaray, Kafka Connect HDFS, etc.
Duplicate	Ottomata	T223628 Replace Camus with Kafka Connect for event data imports

Event Timeline

Ottomata created this task. May 17 2019, 3:49 PM

Ottomata mentioned this in T185233: Modern Event Platform. May 17 2019, 3:55 PM

Milimetric triaged this task as High priority. May 20 2019, 3:36 PM

Milimetric moved this task from Incoming to Event Platform on the Analytics board.

It looks like the latest kafka-connect-hdfs release not under the CCL is 5.0.3, released on March 24th this year.

@Nuria @faidon o/

It is time to start the discussion about the Confluent Community License. This is complicated, and I know we have 'policies'(?) about using only FLOSS/OSI-OSD licenses. The main Kafka Connect stuff is Apache-licensed, but some of the plugins, e.g. kafka-connect-hdfs, are from Confluent and are licensed under the CCL.

I'd really like to use kafka-connect-hdfs to replace Camus. I'd much rather use up-to-date releases than fork from a pre-CCL version and maintain it from there. Faidon and I talked about this a bit at All-Hands this year. I understand our intention to use only open source, but I also understand why Confluent has chosen this new license. For them it is the best they can do as a for-profit company. They want to make free and open software, without allowing others to use that software to compete directly with their revenue model. More info in this CCL FAQ and this blog post.

Since hosting a streaming SaaS is quite far outside of WMF's mission, I don't think the CCL conflicts with our intentions of using open source, even if it technically conflicts with FLOSS/OSI-OSD. I think Faidon differs. :)

Let's discuss! Who else should we add to this conversation?

Unfortunately, I think this is one of the matters that we cannot fully discuss in a public task. I'll start a private email thread; if anyone reading this is interested in being part of it, ping me off-list and I can loop you in :)

Ok!

Other questions too. Assuming we will use Kafka Connect (and hopefully kafka-connect-hdfs) in some capacity, we'll have to figure out where and how to run it. It seems the easiest way to run a distributed cluster is to do so via Kubernetes. For this task, we could alternatively run this in YARN, but that would take a lot of manual work (and use of Slider?) to do. I'd of course prefer if we did something more standard like Kubernetes. But, this would mean that we run a Kubernetes service that can read from Kafka and write to HDFS & Hive. A distributed Kafka Connect cluster running in k8s would accept actual Connect jobs via its REST API. For this task, we'd have a single job reading from Kafka topics and writing to HDFS directories & Hive tables. We might want to have more Connect jobs in the future, but those are outside the scope of this task. Those could be run in the same Kafka Connect cluster or a totally separate one.
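To make the "single job submitted via the REST API" idea concrete, here is a minimal sketch of what registering such a job could look like. It is illustrative only: the connector name, topic names, HDFS and Hive metastore URLs, and the numeric settings are all placeholder assumptions, not values from this task. The config keys are ones the kafka-connect-hdfs sink documents; whether Hive integration works for us would also depend on the data format and schemas in use.

```python
import json
import urllib.request


def build_hdfs_sink_payload(name, topics, hdfs_url, metastore_uri):
    """Build the JSON body for POST /connectors on a Kafka Connect worker.

    All argument values are supplied by the caller; nothing here is
    specific to any real cluster.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "tasks.max": "4",            # assumed parallelism, tune as needed
            "topics": ",".join(topics),  # comma-separated list of Kafka topics
            "hdfs.url": hdfs_url,
            "flush.size": "10000",       # assumed: records per file before commit
            "hive.integration": "true",
            "hive.metastore.uris": metastore_uri,
            # Hive integration needs a schema compatibility mode other than NONE.
            "schema.compatibility": "BACKWARD",
        },
    }


def make_request(connect_url, payload):
    """Prepare (but do not send) the HTTP request that registers the job."""
    return urllib.request.Request(
        url=connect_url.rstrip("/") + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical names and hosts, for illustration only.
    payload = build_hdfs_sink_payload(
        name="event-hdfs-sink",
        topics=["example.topic.a", "example.topic.b"],
        hdfs_url="hdfs://namenode.example:8020",
        metastore_uri="thrift://metastore.example:9083",
    )
    request = make_request("http://connect.example:8083", payload)
    print(request.get_method(), request.full_url)
    print(json.dumps(payload, indent=2))
    # To actually submit: urllib.request.urlopen(request)
```

Additional Connect jobs would be further POSTs of the same shape, either to this cluster or to a separate one.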

@akosiaris thoughts about running this in k8s? Our timeline is to begin using this in Q2.

Ottomata merged a task: T166832: Import Kafka messages into HDFS authenticating with TLS/SSL. Jun 26 2019, 4:13 PM

Ottomata added a parent task: T238400: Evaluate possible replacements for Camus: Gobblin, Marmaray, Kafka Connect HDFS, etc. Nov 18 2019, 4:41 PM

Merging this into a parent task about replacing Camus with ____.

Ottomata closed this task as a duplicate of T238400: Evaluate possible replacements for Camus: Gobblin, Marmaray, Kafka Connect HDFS, etc. May 14 2020, 2:25 PM
