Updating the casing for the messageid field #3206

Merged
merged 1 commit into from
Jul 12, 2022
10 changes: 5 additions & 5 deletions src/guides/duplicate-data.md
@@ -6,15 +6,15 @@ Segment guarantees that 99% of your data won't have duplicates within a 24 hour

## 99% deduplication

Segment has a special deduplication service that sits behind the `api.segment.com` endpoint and attempts to drop 99% of duplicate data. Segment stores 24 hours worth of event `message_id`s, allowing Segment to deduplicate any data that appears within a 24 hour rolling window.
Segment has a special deduplication service that sits behind the `api.segment.com` endpoint and attempts to drop 99% of duplicate data. Segment stores 24 hours worth of event `messageId`s, allowing Segment to deduplicate any data that appears within a 24 hour rolling window.

Segment deduplicates on the event's `message_id`, _not_ on the contents of the event payload. Segment doesn't have a built-in way to deduplicate data over periods longer than 24 hours or for events that don't generate `message_id`s.
Segment deduplicates on the event's `messageId`, _not_ on the contents of the event payload. Segment doesn't have a built-in way to deduplicate data over periods longer than 24 hours or for events that don't generate `messageId`s.

> info ""
> Keep in mind that Segment's libraries all generate `message_id`s for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique `message_id` when the message is ingested. You can override these default generated IDs and manually assign a `message_id` if necessary.
> Keep in mind that Segment's libraries all generate `messageId`s for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique `messageId` when the message is ingested. You can override these default generated IDs and manually assign a `messageId` if necessary.

## Warehouse deduplication
Duplicate events that are more than 24 hours apart from one another deduplicate in the Warehouse. Segment deduplicates messages going into a Warehouse based on the `message_id`, which is the `id` column in a Segment Warehouse.
Duplicate events that are more than 24 hours apart from one another deduplicate in the Warehouse. Segment deduplicates messages going into a Warehouse based on the `messageId`, which is the `id` column in a Segment Warehouse.

## Data Lake deduplication
To ensure clean data in your Data Lake, Segment removes duplicate events at the time your Data Lake ingests data. The Data Lake deduplication process dedupes the data the Data Lake syncs within the last 7 days with Segment deduping the data based on the `message_id`.
To ensure clean data in your Data Lake, Segment removes duplicate events at the time your Data Lake ingests data. The Data Lake deduplication process dedupes the data the Data Lake syncs within the last 7 days with Segment deduping the data based on the `messageId`.
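
For readers who want a concrete picture of what deduplicating on `messageId` within a rolling window means, here is a minimal Python sketch. It is not Segment's implementation: the `RollingDeduplicator` class, its `accept` method, and the in-memory dictionary are all hypothetical names chosen for illustration. The only details taken from the guide text are the `messageId` key, the 24 hour window, and the fact that payload contents are ignored.

```python
from datetime import datetime, timedelta, timezone


class RollingDeduplicator:
    """Hypothetical in-memory deduplicator keyed on messageId.

    Illustrative only; it does not describe how Segment's service is built.
    """

    def __init__(self, window=timedelta(hours=24)):
        self.window = window
        self._seen = {}  # messageId -> time the ID was first accepted

    def accept(self, event, now):
        """Return True if the event should be kept, False if it is a duplicate."""
        # Forget IDs that have aged out of the rolling window.
        cutoff = now - self.window
        self._seen = {mid: ts for mid, ts in self._seen.items() if ts > cutoff}

        message_id = event.get("messageId")
        if message_id is None:
            # Without a messageId there is nothing to deduplicate on.
            return True
        if message_id in self._seen:
            # Same messageId inside the window: drop it, whatever the payload says.
            return False
        self._seen[message_id] = now
        return True


dedup = RollingDeduplicator()
t0 = datetime(2022, 7, 12, tzinfo=timezone.utc)
first = dedup.accept({"messageId": "a", "event": "Order Completed"}, t0)
repeat = dedup.accept({"messageId": "a", "event": "Different payload"}, t0 + timedelta(hours=1))
later = dedup.accept({"messageId": "a", "event": "Order Completed"}, t0 + timedelta(hours=25))
```

In this sketch the second call is rejected even though its payload differs, because only the `messageId` is compared. The third call is accepted because the ID has left the 24 hour window, which is the case the guide leaves to Warehouse or Data Lake deduplication.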