Create schema for landing page views
Add EventLogging call to FundraiserLandingPage
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add EventLogging beacon for all pageviews | mediawiki/extensions/FundraiserLandingPage | master | +134 -2
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T183978 [Epic] Fundraising kafkatee changes
Resolved | | AndyRussG | T185933 Donatewiki: use EventLogging to log pageloads
Event Timeline
Quick sanity check:
The approach we've been looking at so far is to send an EventLogging event from the client for every pageview of a donation form on Donate Wiki. If I understand correctly, the advantage of doing this is that it will facilitate the ingress of data about the pageviews via a separate Kafka topic, so we don't have to filter the entire firehose of all web requests.
This would also essentially send back to our servers information that is already on the URL and fully available server-side. So it's an extra round trip from the client back to our infrastructure, just to get the data in the right place in our own infrastructure.
Somehow, it doesn't quite feel right... So, before moving forward, I thought we might circle back and make sure it's really our best option... Apologies for the bother! Thanks much!!!!
Maybe you can add descriptions to the schema fields? That way it is clear what things like "form-template" stand for.
Otherwise the schema looks fine; it is really more up to the FR folks to decide which dimensions to track.
That schema looks good, but I think we need to add the page title in there, unless it's already gathered as part of EL.
Existing code parses it out of the url with a regex, then splits it up and takes the second part.
Also, looks like we can call country and language required.
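For anyone following along, here is a minimal sketch of the "regex, then split, then take the second part" approach described above. It is not the legacy script's code: the pattern `WIKI_PATH`, the function name `extract_title`, and the assumed URL shape (`https://donate.wikimedia.org/wiki/<title>`) are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical pattern; the real regex in the legacy script is not reproduced here.
# Assumes article URLs of the form https://donate.wikimedia.org/wiki/<title>
WIKI_PATH = re.compile(r"^https?://donate\.wikimedia\.org/(wiki/[^?#]+)")


def extract_title(url):
    """Return the page title from a Donate Wiki article URL, or None if no match."""
    match = WIKI_PATH.match(url)
    if match is None:
        return None
    # The captured path looks like "wiki/<title>": split once, keep the second part.
    parts = match.group(1).split("/", 1)
    return parts[1]
```

If the title were sent as a schema field from the client instead, this kind of URL parsing would no longer be needed on the ingestion side.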
Thanks!!! Pretty complex filtering in there--hard to tell just from the code what's current and what's cruft!
As I understand it, we're no longer interested in data from wikimediafoundation.org. So, of the top-level regexes used by LoadLPImpressions.py to parse lines in the log files, we'd only be interested in the last two.
Of these, only one parses out "title" (that is, the page used to create the forms on Donatewiki). The other matches only pageviews of Special:LandingPage. (The article for any previous wiki page that was viewed is not in the data.) Either the Donatewiki article or the params used by Special:LandingPage to create the form eventually end up in the landingpage field in landingpageimpression_raw table.
As per discussions in Hangouts, we'll log all pageviews on Donatewiki via EventLogging. This should ensure the greatest possible equivalence with existing data. (At a later date, we could change logging/filtering to remove Donatewiki pages other than Special:LandingPage as needed--apparently they're unused, even though random pageviews of them still make their way into the logs.)
Also, looks like we can call country and language required.
OK! Will do... :)
I think it will be useful to look at this schema: https://meta.wikimedia.org/wiki/Schema:VirtualPageView. You can probably reuse most fields and even some of the instrumenting code. Seems like you would want to add the campaign id to your schema, but other than that the fields should be pretty similar.
Also, seems like you are using the schema for two types of information:
- basic pageview fields + campaign
- form field info
Maybe it is worth decoupling those two pieces? If you do that, and the first piece follows a schema that just sends pageview data, the data could be aggregated more easily.
We'll probably want to keep those if possible, since we're planning to move more pages to donatewiki soon (T189668: Move fundraising support pages to donate.wikimedia.org).
Change 423952 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[mediawiki/extensions/FundraiserLandingPage@master] Add EventLogging beacon for all pageviews
Thanks so much for the suggestions!!! Interesting stuff... In this case we're trying to copy as closely as possible the fields currently used by the Python script that ingresses data from these specific pageviews into a Fundraising database. So, for now, it does make sense to keep all this together. (At this stage, it's important to keep the workflow of the users of this data intact. Later on, I do hope we can refactor to get some even better tools up and running...)
¡¡OK!! Thx :)
qq: Did you intend to use a hyphen in the form-template and form-countryspecific names? Underscore might be better.
Yep, intentional ;P For anyone interested (/me hides under table) here they are in the legacy Python script. (May be taking this a bit too far, but the idea has been to be very incremental, so, jiggle things around as little as possible in that script and in the data format for now, and refactor on a longer timeline once the new pipeline is fully proven...) Thanks!!!
After staring at the code longer than is healthy, I decided to make those fields not required in the schema, because the script actually ensures those variables have values a little earlier on (so the conditional on line 360 is partly dead code).
Regarding language, let's condense language and uselang URL params into a single schema property, since the Python script doesn't care which URL param it comes from, and having just one such param in the schema will keep it a bit simpler.
Description fields on the schema coming soooooon...
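To make the language/uselang condensing concrete, here is a rough sketch in Python. The function name `landing_page_language` is hypothetical, and the precedence (`language` before `uselang`) is an assumption for illustration only; the thread says just that the Python script doesn't care which URL param the value comes from.

```python
from urllib.parse import parse_qs, urlsplit


def landing_page_language(url, default=None):
    """Collapse the language and uselang URL params into a single value."""
    query = parse_qs(urlsplit(url).query)
    # Assumed precedence: prefer "language", fall back to "uselang".
    for param in ("language", "uselang"):
        values = query.get(param)
        if values:
            return values[0]
    return default
```

With something like this, the schema only needs one language property no matter which param a given link happened to use.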
I'd really recommend not using a hyphen in a schema field name. Those fields are going to be directly mapped to SQL table fields (Hive, MySQL, etc.), and you'll either run into errors, or need to remember to always backtick quote them when querying them. Choosing names in schemas is really important! You aren't allowed to make backwards incompatible changes later, so you'll be stuck with what you choose forever!
BTW, just in case you haven't seen it: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines
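As a quick illustration of the hyphen concern, here is a small sketch that flags field names which would need quoting as SQL column names, and shows an underscore alternative. The helper names and the identifier pattern are assumptions for illustration; the check is deliberately simplified (it ignores reserved words, for example) and is not taken from EventLogging or the schema guidelines.

```python
import re

# Simplified rule: letters, digits and underscores only, not starting with a digit.
SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def needs_quoting(field_name):
    """True if the name could not be used as an unquoted SQL column name."""
    return SAFE_NAME.match(field_name) is None


def to_safe_name(field_name):
    """Replace hyphens with underscores, e.g. form-template -> form_template."""
    return field_name.replace("-", "_")
```

Under this rule, `form-template` and `form-countryspecific` would both need backticks in queries, while `form_template` and `form_countryspecific` would not.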
Change 423952 merged by jenkins-bot:
[mediawiki/extensions/FundraiserLandingPage@master] Add EventLogging beacon for all pageviews