Skip to content

Add summary statistics for metrics views via MCP #7602

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

grahamplata
Copy link
Contributor

@grahamplata grahamplata commented Jul 9, 2025

Add summary statistics for metrics views via MCP server tool

Screenshot 2025-07-09 at 3 00 31 PM

query_metrics_view_summary mcp tool

request

{
  "metrics_view": "product_metrics_pg_snapshot_date_2025_03_21_orgs_metrics"
}

response

    [{
        "max": "2025-03-21T15:06:06.104879Z",
        "min": "2023-05-04T12:24:25.106518Z",
        "watermark": "2025-03-21T15:06:06.104879Z",
        "summary": {
            "_all_usergroup_id": {
                "DimensionName": "_all_usergroup_id",
                "DataType": "VARCHAR",
                "SampleValue": "5bd92381-6a5b-4858-a9ec-7cc056ec82f3"
            },
            "_billing_customer_id": {
                "DimensionName": "_billing_customer_id",
                "DataType": "VARCHAR",
                "SampleValue": "24d49175-8de2-428e-9ea3-0b01db4724a5"
            },
            "_created_by_user_id": {
                "DimensionName": "_created_by_user_id",
                "DataType": "VARCHAR",
                "SampleValue": "d991b5f5-fde2-4c15-b30d-169f030b27fa"
            },
            "_id": {
                "DimensionName": "_id",
                "DataType": "VARCHAR",
                "SampleValue": "24d49175-8de2-428e-9ea3-0b01db4724a5"
            },
            "_payment_customer_id": {
                "DimensionName": "_payment_customer_id",
                "DataType": "VARCHAR",
                "SampleValue": "cus_RIEnoGvpb0AUlS"
            },
            "billing_customer_id": {
                "DimensionName": "billing_customer_id",
                "DataType": "VARCHAR",
                "SampleValue": "24d49175-8de2-428e-9ea3-0b01db4724a5"
            },
// ...snip

Sample execution from Claude Desktop

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@grahamplata grahamplata marked this pull request as ready for review July 9, 2025 20:53
@avaitla
Copy link

avaitla commented Jul 9, 2025

One thing to consider later is if SampleValue can be [SampleValue], with top five sample values, I suspect this will get better results (low cardinality columns will show their full set of values) - then again on the other hand might be too much context. A human generally sees this immediately in the ui and benefits from it.

@grahamplata
Copy link
Contributor Author

One thing to consider later is if SampleValue can be [SampleValue], with top five sample values, I suspect this will get better results (low cardinality columns will show their full set of values) - then again on the other hand might be too much context. A human generally sees this immediately in the ui and benefits from it.

Done and done, added more core sample

[
    {
        "max": "2025-03-21T15:06:06.104879Z",
        "min": "2023-05-04T12:24:25.106518Z",
        "watermark": "2025-03-21T15:06:06.104879Z",
        "summary": {
            "custom_domain": {
                "DimensionName": "custom_domain",
                "DataType": "CODE_STRING",
                "Values": [
                    "reports.gdb.net",
                    "",
                    "rill.romen.ai",
                    "lens.digiadglobal.com",
                    "nebula.boldcollective.co"
                ]
            },
            "description": {
                "DimensionName": "description",
                "DataType": "CODE_STRING",
                "Values": [
                    "A sandbox for me",
                    "Audience-driven revenue data provided by Auth",
                    "BlueCargo",
                    "\"\"",
                    ""
                ]
            },
            "display_name": {
                "DimensionName": "display_name",
                "DataType": "CODE_STRING",
                "Values": [
                    "iansilva2305",
                    "audiohook",
                    "emodo",
                    "djkim",
                    "networkcraze"
                ]
            },
            "favicon_asset_id": {
                "DimensionName": "favicon_asset_id",
                "DataType": "CODE_UUID",
                "Values": [
                    null,
                    "60ec1638-dc36-4c2f-b4fb-56074f2aab57"
                ]
            },
            "id": {
                "DimensionName": "id",
                "DataType": "CODE_UUID",
                "Values": [
                    "78599ac2-c458-453c-86bd-1cd4aa1711c7",
                    "2ca72c5b-e5d7-48ff-8ef8-d235e493f47e",
                    "61cfb1bb-8426-468e-b023-94685cd015ae",
                    "ed5226e0-4b61-4c43-b315-db5ec6b464af",
                    "54c47392-b6ec-4dac-a9f6-db436bbf51da"
                ]
            },
            "logo_asset_id": {
                "DimensionName": "logo_asset_id",
                "DataType": "CODE_UUID",
                "Values": [
                    "54652be3-1ef5-4009-b4ff-e06f3aec2945",
                    "d9526a4f-176a-482b-ae5f-ecfe881e9280",
                    "106d3b6f-f66f-42a0-96a7-6feb3dda97db",
                    "3ec454c6-2fec-40b0-bfe8-57f37425abaf",
                    "70fa445b-596f-49a5-b6a4-41dd499ef2ae"
                ]
            },
            "name": {
                "DimensionName": "name",
                "DataType": "CODE_STRING",
                "Values": [
                    "kiwialec",
                    "syft_data",
                    "BillingTest",
                    "TTI",
                    "kensuke-b-sasaki"
                ]
            },
            "payment_customer_id": {
                "DimensionName": "payment_customer_id",
                "DataType": "CODE_STRING",
                "Values": [
                    "",
                    "cus_RnO55p6OIzqR7u",
                    "cus_RIEnwjUggkNuuB",
                    "cus_RIEnaX6gP5KpAl",
                    "cus_RIEnPcDZ10wDMq"
                ]
            },
            "updated_on": {
                "DimensionName": "updated_on",
                "DataType": "CODE_TIMESTAMP",
                "Values": [
                    "2025-01-06 01:16:48.392374+00",
                    "2025-02-25 15:32:28.471094+00",
                    "2024-12-09 06:55:56.456532+00",
                    "2025-03-17 01:15:48.121846+00",
                    "2025-01-06 01:18:41.720782+00"
                ]
            }
        }
    }
]

@avaitla
Copy link

avaitla commented Jul 10, 2025

Nice. I think Examples or Samples is still a better name than just Values as the latter indicates it's the only set of values but the former indicates it's just an example subset. You can also ask gpt itself what it thinks the best naming is for it to best use it in context window for the use case of an AI Analyst trying to better understand the metrics view of interest.

@ericpgreen2 ericpgreen2 requested a review from k-anshul July 10, 2025 22:37
@ericpgreen2
Copy link
Contributor

ericpgreen2 commented Jul 10, 2025

Tagged @k-anshul for the backend code review, given Benjamin's OOO.

Some notes about the data structure:

  • The current summary structure has redundancy: dimension names serve as both object keys and object properties.
  • It's unclear how to extend this pattern to include measure summaries.

What about something like this (doesn't all need implementing now, but demonstrates a cleaner, more extensible structure):

{
  "time_range": {
    "min": "2023-05-04T12:24:25.106518Z",
    "max": "2025-03-21T15:06:06.104879Z",
    "watermark": "2025-03-21T15:06:06.104879Z"
  },
  "dimensions": [
    {
      "name": "custom_domain",
      "data_type": "CODE_STRING",
      "sample_values": ["reports.foodie.com", "", "rill.groupie.ai", ...],
      "null_count": 42,
      "distinct_count": 5
    },
    {
      "name": "description", 
      "data_type": "CODE_STRING",
      "sample_values": ["A sandbox for me", "Audience-driven...", ...],
      "null_count": 15,
      "distinct_count": 100
    }
  ],
  "measures": [
    {
      "name": "revenue",
      "data_type": "CODE_FLOAT64",
      "min": 0.0,
      "max": 1000000.0,
      "avg": 50000.0,
      "null_count": 0,
      "zero_count": 150
    }
  ],
  "metadata": {
    "total_rows": 1000000,
    "dimensions_count": 9,
    "measures_count": 3,
    "last_refreshed": "2025-03-21T15:06:06.104879Z"
  }
}

@avaitla
Copy link

avaitla commented Jul 11, 2025

Curious if you run both in your own set of basic eval queries on your demo dashboard which performs better. I think the suggestion from Eric is good, not having dim name as the key for the dict but rather name: user_id, the null and total counts are also good signals (but it's worthwhile to build eval queries to avoid regression)

@grahamplata
Copy link
Contributor Author

Screenshot 2025-07-11 at 12 08 23 PM

initial pass
remove unused
save
save
golangci-lint
Comment on lines 101 to 128
// Validate time range
if defaultTimeRange.Max.IsZero() {
// If we have no time data, still process time dimensions we found
for _, dim := range timeDimensions {
var dataType string
if dim.DataType != nil {
dataType = dim.DataType.Code.String()
}

timeDimensionSummaries = append(timeDimensionSummaries, DimensionSummary{
Name: dim.Name,
DataType: dataType,
})
}

// Include default time dimension if it wasn't found in explicit dimensions
if !defaultTimeDimFound && defaultTimeDimName != "" && e.security.CanAccessField(defaultTimeDimName) {
timeDimensionSummaries = append(timeDimensionSummaries, DimensionSummary{
Name: defaultTimeDimName,
DataType: "TIMESTAMP", // Assume timestamp for default time dimension
})
}

return &SummaryResult{
Dimensions: timeDimensionSummaries,
TimeRange: defaultTimeRange,
}, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case seems a little weird. If the range is empty, I think it can just keep it empty and keep going (not return early).

(And not add a time filter in the dimension profiling.)

@grahamplata
Copy link
Contributor Author

Sample

[
    {
        "dimensions": [
            {
                "name": "__time",
                "data_type": "TIMESTAMP",
                "min_value": "2025-07-15T18:24:27.8Z",
                "max_value": "2025-07-22T18:24:27.8Z"
            },
            {
                "name": "adomain",
                "data_type": "CODE_STRING",
                "example_value": "clorox.com",
                "min_value": "1800petmeds.com",
                "max_value": "zyp.test"
            },
            {
                "name": "advertiser_name",
                "data_type": "CODE_STRING",
                "example_value": "Clorox",
                "min_value": "6sense",
                "max_value": "vertex"
            },
            {
                "name": "app_or_site",
                "data_type": "CODE_STRING",
                "example_value": "App",
                "min_value": "App",
                "max_value": "Site"
            }
        ],
        "time_range": {
            "name": "__time",
            "data_type": "TIMESTAMP",
            "min_value": "2025-07-15T18:24:27.8Z",
            "max_value": "2025-07-22T18:24:27.8Z"
        }
    }
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants