User Details
- User Since
- Jan 20 2024, 12:05 AM (45 w, 3 d)
- Availability
- Available
- IRC Nick
- amastilovic
- LDAP User
- Aleksandar Mastilovic
- MediaWiki User
- AMastilovic-WMF [ Global Accounts ]
Yesterday
| Wikitech account/LDAP | AMastilovic-WMF |
| SUL account | AMastilovic-WMF |
| Account linked on IDM | Y |
| I have visited MediaWiki:Loginprompt | Y |
| I have tried to reset my password using Special:PasswordReset | Y |
Maybe we should just use the tool provided by Airflow itself? For example, airflow db clean --clean-before-timestamp.
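As a rough illustration only (not something we run today), that cleanup could even be scheduled from Airflow itself. The DAG id, the 30-day retention window and the daily cadence below are made-up values, and `schedule=` assumes Airflow 2.4 or newer:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="airflow_db_cleanup",   # hypothetical maintenance DAG
    start_date=datetime(2024, 1, 1),
    schedule="@daily",             # assumes Airflow >= 2.4
    catchup=False,
) as dag:
    BashOperator(
        task_id="db_clean",
        # Delete metadata rows older than ~30 days (an assumed retention
        # value); --yes skips the interactive confirmation prompt.
        bash_command=(
            'airflow db clean '
            '--clean-before-timestamp "{{ macros.ds_add(ds, -30) }}" --yes'
        ),
    )
```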
Wed, Nov 27
Related Airflow-DAGs MR: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/943
Oct 23 2024
@brouberol got it. You'll mount Ceph as a file system local to the Airflow instance, and HDFS sync will write to Ceph - effectively, to Airflow, this will look like a local fs directory being updated.
Oct 22 2024
@brouberol automatic sync from Ceph every 5 minutes might cause some issues with the HDFS synchronizer. We need to think about the scenario where your 5-minute sync starts in the middle of an HDFS synchronization to Ceph - in that case you would only get part of the airflow-dags repository, and Airflow would most likely choke. Sure, it would fix itself on the next run within 5 minutes, but it would be much better if we could mitigate this scenario somehow.
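One common mitigation, sketched below purely as an illustration (the paths and function names are made up, and the atomic-rename guarantee would need to be verified on CephFS): publish each sync into a fresh directory and flip a symlink, so the dags_folder never points at a half-written tree.

```python
import os
import tempfile

CEPH_MOUNT = "/mnt/ceph/airflow"              # hypothetical mount point
DAGS_LINK = os.path.join(CEPH_MOUNT, "dags")  # what dags_folder points at


def publish_snapshot(copy_repo_into) -> None:
    """copy_repo_into(dest) must write a complete copy of airflow-dags into dest."""
    snapshot = tempfile.mkdtemp(prefix="airflow-dags-", dir=CEPH_MOUNT)
    copy_repo_into(snapshot)

    # Point a temporary symlink at the new snapshot, then rename it over the
    # live one. os.rename() is atomic on POSIX filesystems, so readers see
    # either the old tree or the new one, never a partial copy.
    tmp_link = DAGS_LINK + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(snapshot, tmp_link)
    os.rename(tmp_link, DAGS_LINK)
```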
Oct 17 2024
Update: We've refactored the library to support a cache_key_fn config parameter, which enabled us to get rid of FsVersionedArtifactCache in favor of simply having one FsArtifactCache class:
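(The original code excerpt isn't reproduced in this feed. The snippet below is only a minimal sketch of the idea - one cache class whose on-disk key comes from a pluggable cache_key_fn - and uses a hypothetical artifact.url attribute; it is not the actual library code.)

```python
import os
import shutil
from typing import Callable, Optional


def default_cache_key(artifact) -> str:
    # Default: cache under the artifact's file name (artifact.url is a
    # hypothetical attribute used here only for illustration).
    return os.path.basename(artifact.url)


class FsArtifactCache:
    def __init__(self, base_uri: str, cache_key_fn: Optional[Callable] = None):
        self.base_uri = base_uri
        self.cache_key_fn = cache_key_fn or default_cache_key

    def cached_path(self, artifact) -> str:
        return os.path.join(self.base_uri, self.cache_key_fn(artifact))

    def put(self, artifact, local_file: str) -> str:
        # Copy a locally fetched file into its cache location.
        dest = self.cached_path(artifact)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy(local_file, dest)
        return dest
```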
Oct 16 2024
Oct 15 2024
Cool! How will this be used via artifact.yaml config?
But for this MR, what do you think of limiting the change to just restricting to fsspec, perhaps by simply renaming FsArtifactSource and FsArtifactCache to ArtifactSource and ArtifactCache and removing those abstract base classes? Then I think your refactor which removes the ArtifactCache.open abstract method will just work as is.
Oct 10 2024
@Ottomata @mforns I think we should expand the scope of this refactor to include redefining the relationships between Artifact, ArtifactLocator, ArtifactSource and ArtifactCache, too. The current design is cumbersome and unintuitive IMHO, for the following reasons (a possible alternative shape is sketched after the list):
- ArtifactSource is defined/constructed through class name and base_uri which is optional, but in practice base_uri is not optional and points to either a directory URI or to an actual instance of Artifact. However, since it extends ArtifactLocator, all of its methods require an Artifact instance as a parameter.
- ArtifactCache is defined through class name and base_uri which is again defined as optional, but in practice it's actually required. All of its methods also require an instance of Artifact as a parameter.
- Even though ArtifactLocators all de-facto depend on an Artifact, the Artifact class itself introduces a hard dependency on ArtifactLocators through its source and caches arguments.
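To make the argument concrete, here is one possible shape of a decoupled design - an illustration only, not an agreed-upon refactor: sources and caches take a required base_uri and operate on the Artifact passed into their methods, while Artifact itself carries no locator fields at all.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    # Artifact is just data (e.g. a Maven coordinate or file name);
    # it has no source/caches fields.
    id: str


class ArtifactSource:
    def __init__(self, base_uri: str):  # base_uri required, not optional
        self.base_uri = base_uri

    def uri_of(self, artifact: Artifact) -> str:
        return f"{self.base_uri.rstrip('/')}/{artifact.id}"


class ArtifactCache:
    def __init__(self, base_uri: str):  # likewise required
        self.base_uri = base_uri

    def cached_uri_of(self, artifact: Artifact) -> str:
        return f"{self.base_uri.rstrip('/')}/{artifact.id}"
```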
Oct 7 2024
I've published a draft MR that, as far as I can tell, implements most of the support for the produced_by configuration of a dataset. The code correctly recognizes, parses and models the produced_by section, and inserts a DatasetProducer object into a Dataset object. This DatasetProducer object provides its own implementation of the get_sensor_for method, which takes precedence over Dataset's normal method when a DatasetProducer is available.
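Roughly, the precedence logic described above looks like the sketch below; class and field names are simplified, and the real implementation lives in the draft MR and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetProducer:
    dag_id: str
    task_id: str

    def get_sensor_for(self, dag):
        # In the real code this would return e.g. an ExternalTaskSensor
        # pointing at the producing DAG/task; elided here.
        ...


@dataclass
class Dataset:
    name: str
    producer: Optional[DatasetProducer] = None  # built from produced_by, if present

    def get_sensor_for(self, dag):
        # The producer's sensor takes precedence over the dataset's default.
        if self.producer is not None:
            return self.producer.get_sensor_for(dag)
        return self._default_sensor_for(dag)

    def _default_sensor_for(self, dag):
        ...
```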
Oct 2 2024
Sep 27 2024
How would Kerberos authentication play out when it comes to different Airflow instances communicating with each other? Would each instance get a Kerberos ticket automatically, and then use that to communicate to other instances?
Sep 26 2024
Things are definitely not settled yet. The version of Airflow we use already supports the notion of a Dataset (Data Asset is basically a rename+upgrade of a Dataset), and I think we could already use that for "event-driven pipelines" that react to Datasets being updated.
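For reference, a minimal example of the stock Dataset mechanism in Airflow 2.4+ (the URI and DAG ids below are made up): one DAG declares a Dataset as an outlet, and another DAG is scheduled on that Dataset, so it runs whenever the producer updates it.

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

pageviews = Dataset("hdfs://analytics-hadoop/wmf/data/pageviews")  # made-up URI

with DAG("pageviews_producer", start_date=datetime(2024, 1, 1),
         schedule="@hourly", catchup=False):
    # Marking the dataset as an outlet records an update when this task succeeds.
    EmptyOperator(task_id="load_pageviews", outlets=[pageviews])

with DAG("pageviews_consumer", start_date=datetime(2024, 1, 1),
         schedule=[pageviews],  # triggered whenever the producer updates it
         catchup=False):
    EmptyOperator(task_id="process_pageviews")
```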
Sep 25 2024
Currently, each instance specifies its own datasets.yaml files in its “config” directory. This approach is problematic for a couple of reasons:
OK, so the schema I proposed in the Google Doc looks like this:
I've posted a draft proposal for the implementation design of ideas/needs described in this ticket: https://docs.google.com/document/d/1lapUHpWY2rm9si1iupRYHTcsDY-_QSbLcL03MXOSOlE/edit
Sep 23 2024
Project migrated to GitLab: https://gitlab.wikimedia.org/repos/data-engineering/gobblin-wmf/-/merge_requests/1
Sep 17 2024
Sep 16 2024
This was resolved some time ago when SRE released an updated Airflow Debian package.
Sep 5 2024
Aug 23 2024
Maybe just change the name of the file to data_dependencies.yaml and the module to DataDependency?
And then the user would use it like data_dependency("data-dependency-name").get_sensor_for(dag)?
With this, the semantic weirdness would be solved, no?
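Just to make the suggestion concrete, a sketch of what such an accessor could look like - the names are the ones proposed above, not existing code:

```python
import yaml


class DataDependency:
    def __init__(self, name: str, conf: dict):
        self.name = name
        self.conf = conf

    def get_sensor_for(self, dag):
        # Build whatever sensor the config entry describes (URL sensor,
        # partition sensor, ...) for the given DAG; elided here.
        ...


def data_dependency(name: str, config_path: str = "data_dependencies.yaml") -> DataDependency:
    # Look up one named entry from the renamed data_dependencies.yaml file.
    with open(config_path) as f:
        return DataDependency(name, yaml.safe_load(f)[name])
```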
Aug 7 2024
I talked to @BTullis about obtaining a functional test environment that would mimic the real world this service would be operating in, and he kindly provided a list of things to do in order to build such an environment. The list is in the subtask ticket https://phabricator.wikimedia.org/T371994
Jul 25 2024
Jul 22 2024
Also, could we add a settings.xml file with the following contents to the Docker image? It's necessary for the Maven release plugin to interact with GitLab:
Jul 11 2024
Jul 8 2024
I've considered the option of pulling from the git origin into the destination HDFS, albeit not using a systemd timer. I've actually done something similar before in previous jobs/roles, by mounting HDFS onto a local file system, but I don't think this is a viable solution for a number of reasons:
Jul 2 2024
Seconded about the .test in the group ID - do we really need that? I believe the group ID should simply be org.wikimedia and then we can have the test part in the artifact ID.
Jun 24 2024
Jun 12 2024
Jun 10 2024
Merged and applied - done
Jun 6 2024
Jun 5 2024
Jun 4 2024
This ticket has been resolved; the tasks from the ticket definition have been performed on an-launcher1002 (an-launcher1001 has been decommissioned).
May 22 2024
@lbowmaker this task can be closed.
May 20 2024
We are attempting to resolve this issue in this ticket: T365382
Still no update - let's wait for a bit and see what happens; the sync refresh period might be daily as opposed to hourly.
May 16 2024
This task can be closed as the issue has been fixed and changes to the DAG have been merged.
May 13 2024
May 9 2024
MR to switch to using DagProperties: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/689
May 8 2024
The issue was in the path to the log4j.properties file configured in the Airflow UI: hdfs:///user/aqu/aqu-log4j.properties was not accessible to the Airflow user analytics.
Apr 26 2024
Apr 17 2024
Apr 16 2024
Apr 9 2024
Feb 16 2024
Jan 29 2024
I need access to the following (from the wiki page you provided):
Jan 25 2024
@Arnoldokoth @Dzahn thank you!