
Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL
Open, Medium · Public

Description

T234565: Standardize the logging format aims to standardize the software logging format on the Elastic Common Schema (ECS). If we can produce these ECS logs with Event Platform, they will be automatically ingested into the WMF Data Lake.

This will allow people to query the logs with SQL via Spark SQL (e.g. spark3-sql or pyspark) or Presto, and to build dashboards with Superset.

This would be particularly useful if we successfully migrate the MediaWiki logging format to ECS, since MediaWiki software logs could then be joined with other MediaWiki data in Hive.
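As a sketch of what such a join could look like (the event.mediawiki_ecs_log table and its ECS field layout are hypothetical; event.mediawiki_api_request exists today, and event tables in the Data Lake are partitioned by year/month/day):

spark3-sql -e "
  -- Hypothetical: error counts per wiki, joining ECS logs to API request events on request id.
  SELECT r.database AS wiki, COUNT(*) AS error_count
  FROM event.mediawiki_ecs_log e
  JOIN event.mediawiki_api_request r
    ON e.trace.id = r.meta.request_id
  WHERE e.log.level = 'ERROR'
    AND e.year = 2024 AND e.month = 10 AND e.day = 1
    AND r.year = 2024 AND r.month = 10 AND r.day = 1
  GROUP BY r.database
  ORDER BY error_count DESC
  LIMIT 20;
"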

Since the logs would be in Kafka with a well defined schema, they would also be consumable and reusable for other purposes, e.g. stream processing, anomaly detection and alerting, or ingestion into different data stores.
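For example, such a stream could be tailed directly from Kafka (the broker host and topic name below are illustrative only; no ECS log topic exists yet):

# Consume new events from a hypothetical ECS log topic and filter for errors.
kafkacat -C \
  -b kafka-jumbo1001.eqiad.wmnet:9092 \
  -t eqiad.mediawiki.ecs_log \
  -o end \
  | jq 'select(.log.level == "ERROR")'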


In 2021, the Data Engineering and Observability teams met to discuss this idea. To accomplish this, we'd need:

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. · Sep 24 2021, 4:29 AM

@colewhite, in https://phabricator.wikimedia.org/T288851#7456931 you said:

topics prefixed by rsyslog- will be automatically picked up by Logstash.

We've found using topic naming conventions for ingestion jobs to be brittle. We're moving towards using EventStreamConfig to automate configuring things like this. See: https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#consumers_and_producers

Example:

curl -s 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json&all_settings=1&streams=mediawiki.api-request' | jq .
{
  "streams": {
    "mediawiki.api-request": {
      "topics": [
        "eqiad.mediawiki.api-request",
        "codfw.mediawiki.api-request"
      ],
      "stream": "mediawiki.api-request",
      "consumers": {
        "analytics_hadoop_ingestion": {
          "enabled": true,
          "job_name": "event_default"
        }
      },
      "canary_events_enabled": true,
      "topic_prefixes": [
        "eqiad.",
        "codfw."
      ],
      "destination_event_service": "eventgate-analytics",
      "schema_title": "mediawiki/api/request"
    }
  }
}

Here, we are declaring a consumer called 'analytics_hadoop_ingestion'. The settings for that consumer are arbitrary and specific to the consumer job. When that job runs, it requests all streams that have consumers.analytics_hadoop_ingestion declared, and uses those settings to import the data.
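Concretely, a consumer job can discover its streams with a single EventStreamConfig request. A minimal sketch, filtering client-side with jq (whether a real job filters server-side or like this is an implementation detail):

# List all streams that declare the analytics_hadoop_ingestion consumer.
curl -s 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json&all_settings=1' \
  | jq '.streams | to_entries
      | map(select(.value.consumers.analytics_hadoop_ingestion.enabled == true))
      | map(.key)'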

Logstash ingestion could probably do something similar, if the logging streams to import were declared in EventStreamConfig.
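Hypothetically, such a declaration might look like the fragment below. The stream name, consumer name, and settings are all made up for illustration; nothing like this exists in EventStreamConfig today:

  "rsyslog.ecs-log": {
    "schema_title": "ecs/log",
    "topic_prefixes": ["eqiad.", "codfw."],
    "consumers": {
      "logstash": {
        "enabled": true,
        "index_prefix": "ecs"
      }
    }
  }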

This would have been useful to debug T374662. Aggregating the timings out of Elasticsearch is hard, since it would have to aggregate 50M requests through a single core (estimated time: multiple days, due to repeated work for each pagination). Being able to throw Hadoop at the problem would solve it in a few tens of minutes with an easy query.

This would be very useful for understanding whether known problematic reusers (see: https://phabricator.wikimedia.org/T317001) are similarly saturating other data streams, to aid in correcting the problematic behavior.

This would also help with some analysis in T375146.

I've been told that this project would let me process Logstash data with SQL queries, and I would like that very much.

Ottomata renamed this task from Integrate Event Platform and ECS logs to Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL. · Oct 10 2024, 1:16 PM
Ottomata updated the task description.
Ottomata updated the task description.