Self hosted machine translation service
Closed, Resolved, Public

Description

Background

The NLLB-200 machine translation system provided by a research team from Meta (Facebook) was running on AWS hosting managed by Meta as a temporary solution. We recently migrated it to an AWS account managed by WMF (T321781). This allowed us to keep supporting the initial set of communities, several of which had no previous machine translation options. However, the budget constraints of this approach prevent us from using the machine translation system to its full potential. Hosting this system directly on Wikimedia infrastructure was not an option because it depends on NVIDIA GPUs and hence on non-free CUDA drivers.

A recent exploration by @santhosh discovered an alternative mechanism that gets the same or better performance using only CPUs. This is achieved by a one-time conversion of the model to a special format with the help of CTranslate2, which optimizes the model for inference in low-processor and low-memory settings. A version of this is running at https://translate.wmcloud.org/. It provides good translation performance, but it runs on a cloud VM.
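As a rough illustration of the one-time conversion and CPU-only inference described above, here is a minimal sketch based on CTranslate2's documented Transformers converter and Python API. The task does not say which NLLB-200 checkpoint, quantization, or output path the service uses, so the model name `facebook/nllb-200-distilled-600M`, the `int8` setting, and the directory `nllb-200-ct2` are assumptions chosen for the example, and the helper names are hypothetical.

```python
import shlex


def build_convert_command(model_name, output_dir, quantization="int8"):
    """Return the argv for CTranslate2's Transformers converter CLI.

    The int8 default is an assumption for this sketch, not a setting
    taken from the task.
    """
    return [
        "ct2-transformers-converter",
        "--model", model_name,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]


def translate_on_cpu(model_dir, tokenizer_name, text, src_lang, tgt_lang):
    """Translate one sentence with an already converted model, on CPU only."""
    # Imported lazily so the command builder works without these packages.
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir, device="cpu")
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        tokenizer_name, src_lang=src_lang
    )

    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    # The first hypothesis token is the target language tag; drop it.
    target = results[0].hypotheses[0][1:]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))


if __name__ == "__main__":
    # Hypothetical checkpoint and output directory, for illustration only.
    cmd = build_convert_command("facebook/nllb-200-distilled-600M", "nllb-200-ct2")
    print(shlex.join(cmd))
```

The conversion command is run once. After that, `translate_on_cpu("nllb-200-ct2", "facebook/nllb-200-distilled-600M", "Hello world!", "eng_Latn", "fra_Latn")` would load the converted model without a GPU, provided `ctranslate2` and `transformers` are installed.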

The WMF Language team would like to host this system in production.

As per the consensus from the team, the MT service will be called the "MinT" machine translation service. This name is only for exposing it as an option to users.

Plan

A rough plan for the next steps (the order isn't strict, things can be done in parallel)

Details

Repo | Branch | Lines +/-
mediawiki/extensions/ContentTranslation | master | +4 K -4 K
operations/deployment-charts | master | +2 -10
mediawiki/extensions/ContentTranslation | master | +5 -7
mediawiki/services/cxserver | master | +0 -233
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +7 -0
operations/deployment-charts | master | +7 -1
mediawiki/services/cxserver | master | +3 -0
operations/puppet | production | +5 -0
operations/puppet | production | +1 -1
operations/deployment-charts | master | +1 -1
mediawiki/services/machinetranslation | master | +8 -2
operations/deployment-charts | master | +1 -1
mediawiki/services/machinetranslation | master | +4 -4
operations/dns | master | +17 -15
operations/puppet | production | +15 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +14 -7
operations/deployment-charts | master | +20 -58
mediawiki/services/machinetranslation | master | +2 -3
operations/deployment-charts | master | +6 -1
operations/deployment-charts | master | +1 -1
mediawiki/services/machinetranslation | master | +1 -1
operations/deployment-charts | master | +20 -0
operations/deployment-charts | master | +7 -1
operations/deployment-charts | master | +20 -4
operations/deployment-charts | master | +1 -1
mediawiki/services/machinetranslation | master | +11 -7
operations/deployment-charts | master | +6 -0
operations/deployment-charts | master | +4 -0
operations/deployment-charts | master | +93 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +53 -23
mediawiki/services/machinetranslation | master | +1 -1
mediawiki/services/machinetranslation | master | +63 -40
operations/deployment-charts | master | +2 -2
operations/puppet | production | +10 -6
operations/deployment-charts | master | +1 K -0
operations/deployment-charts | master | +1 -0
labs/private | master | +4 -0
mediawiki/services/cxserver | master | +131 -0
mediawiki/services/machinetranslation | master | +145 -31
mediawiki/services/machinetranslation | master | +34 -0
integration/config | master | +23 -0

Event Timeline

Change 912811 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Switch to 2023-04-27-093807-production

https://gerrit.wikimedia.org/r/912811

Change 912812 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Enable monitoring

https://gerrit.wikimedia.org/r/912812

Change 912863 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/machinetranslation@master] Fix OpenAPI spec endpoints

https://gerrit.wikimedia.org/r/912863

Thanks a lot for the info! Is it going to be the permanent solution? I am asking because the ML cluster uses Swift, which may be more resilient in the long term.

I don't expect it to be a permanent solution. Given the large size of the models, this was more of a pragmatic solution to get the system up and running. The initial goal is to replace the existing service on AWS. For example, we plan to initially enable the service for the same 23 languages, although the models support over 200. Once we can replace the current service, I expect follow-up iterations to expand languages (T326578) and models (T333969), and to improve other infrastructure aspects that can make the service better (more robust, maintainable, etc.).

I created a ticket based on your proposal (T335491: Provide better long-term storage for translation models). Feel free to share more details in the ticket or propose other improvements that can help make the service better.
Thanks @elukey!

Sure, as an initial step it makes sense! My point is that Lift Wing already has this functionality (fetching models from Swift), and the more I see the requirements of this service, the more I wonder why it wasn't onboarded as an ML service (see https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). It is fine to keep going in this direction, since too much work has been done, but it is worth keeping in mind for the future.

Change 913108 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Bump limitranges and resourcequotas

https://gerrit.wikimedia.org/r/913108

Change 913109 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Enable thanos-swift service mesh

https://gerrit.wikimedia.org/r/913109

Change 913116 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Enable ingress functionality

https://gerrit.wikimedia.org/r/913116

Change 913152 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] service::catalog: Add machinetranslation service

https://gerrit.wikimedia.org/r/913152

Change 913108 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Bump limitranges and resourcequotas

https://gerrit.wikimedia.org/r/913108

Change 913109 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Enable thanos-swift service mesh

https://gerrit.wikimedia.org/r/913109

Change 913116 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Enable ingress functionality

https://gerrit.wikimedia.org/r/913116

Change 912863 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Fix OpenAPI spec endpoints

https://gerrit.wikimedia.org/r/912863

Change 914322 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Deploy 2023-05-02-080334-production

https://gerrit.wikimedia.org/r/914322

Change 914322 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Deploy 2023-05-02-080334-production

https://gerrit.wikimedia.org/r/914322

Change 914365 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Enable ingress in chart

https://gerrit.wikimedia.org/r/914365

Change 914365 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Support ingress in chart

https://gerrit.wikimedia.org/r/914365

Change 914468 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2023-05-03-044244-production

https://gerrit.wikimedia.org/r/914468

Change 914721 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Support configuration as env variables

https://gerrit.wikimedia.org/r/914721

Change 914722 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Add people to egress

https://gerrit.wikimedia.org/r/914722

Change 914721 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Support configuration as env variables

https://gerrit.wikimedia.org/r/914721

Change 914722 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Add people to egress

https://gerrit.wikimedia.org/r/914722

Change 914732 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/machinetranslation@master] Install wmf-certificates in the image

https://gerrit.wikimedia.org/r/914732

Change 914732 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Install wmf-certificates in the image

https://gerrit.wikimedia.org/r/914732

Change 914767 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Use 2023-05-03-104124-production

https://gerrit.wikimedia.org/r/914767

Change 914767 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Use 2023-05-03-104124-production

https://gerrit.wikimedia.org/r/914767

Change 915364 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: networkpolicy for metrics-exporter

https://gerrit.wikimedia.org/r/915364

Change 915364 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: networkpolicy for metrics-exporter

https://gerrit.wikimedia.org/r/915364

Change 915483 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/machinetranslation@master] Allow passing env var GUNICORN_WORKERS

https://gerrit.wikimedia.org/r/915483

Change 915488 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Remove args, document env vars

https://gerrit.wikimedia.org/r/915488

Change 915493 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] Ship a prometheus-statsd-export configuration

https://gerrit.wikimedia.org/r/915493

Change 915483 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Allow passing env var GUNICORN_WORKERS

https://gerrit.wikimedia.org/r/915483
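
For readers unfamiliar with the pattern behind this change, the following is a minimal sketch of how a Gunicorn configuration file can take its worker count from the `GUNICORN_WORKERS` environment variable. The actual patch is not reproduced in this task, so the file name, the helper `workers_from_env`, and the fallback of one worker are assumptions made for illustration.

```python
# gunicorn.conf.py (hypothetical file; not taken from the machinetranslation repo)
import os


def workers_from_env(environ, default=1):
    """Parse GUNICORN_WORKERS, falling back to a default for missing or invalid values."""
    raw = environ.get("GUNICORN_WORKERS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


# Gunicorn reads the module-level name `workers` from its config file.
workers = workers_from_env(os.environ)
```

With a file like this, the number of workers can be tuned per deployment through the container environment rather than through command-line arguments.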

Change 915493 merged by jenkins-bot:

[operations/deployment-charts@master] Ship a prometheus-statsd-export configuration

https://gerrit.wikimedia.org/r/915493

Change 915488 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Remove args, document env vars

https://gerrit.wikimedia.org/r/915488

Change 914468 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2023-05-03-044244-production

https://gerrit.wikimedia.org/r/914468

Mentioned in SAL (#wikimedia-operations) [2023-05-04T11:38:02Z] <kart_> Updated cxserver to 2023-05-03-044244-production (T333835, T335019, T331505)

Mentioned in SAL (#wikimedia-operations) [2023-05-08T06:48:10Z] <kart_> Deployed MinT to the production (T331505)

Change 914351 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/dns@master] Add machinetranslation service RRs

https://gerrit.wikimedia.org/r/914351

Change 913152 merged by Alexandros Kosiaris:

[operations/puppet@production] service::catalog: Add machinetranslation service

https://gerrit.wikimedia.org/r/913152

Change 917819 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-05-09-082017-production

https://gerrit.wikimedia.org/r/917819

Change 914351 merged by Alexandros Kosiaris:

[operations/dns@master] Add machinetranslation service RRs

https://gerrit.wikimedia.org/r/914351

Change 917828 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/machinetranslation@master] OpenAPI spec: Fix translate API parameter name

https://gerrit.wikimedia.org/r/917828

Change 917828 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] OpenAPI spec: Fix translate API parameter name

https://gerrit.wikimedia.org/r/917828

Change 917819 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-05-09-110213-production

https://gerrit.wikimedia.org/r/917819

Change 917906 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/machinetranslation@master] Header and response fixes to the spec

https://gerrit.wikimedia.org/r/917906

Change 917906 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Header and response fixes to the spec

https://gerrit.wikimedia.org/r/917906

Change 918002 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-05-10-045734-production

https://gerrit.wikimedia.org/r/918002

Change 918002 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-05-10-045734-production

https://gerrit.wikimedia.org/r/918002

Mentioned in SAL (#wikimedia-operations) [2023-05-10T05:42:43Z] <kart_> Updated MinT to 2023-05-10-045734-production (T331505)

Change 918243 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] machinetranslation: Switch service::catalog to production

https://gerrit.wikimedia.org/r/918243

Change 918243 merged by Alexandros Kosiaris:

[operations/puppet@production] machinetranslation: Switch service::catalog to production

https://gerrit.wikimedia.org/r/918243

Change 911887 merged by Alexandros Kosiaris:

[operations/puppet@production] services_proxy: Add machinetranslation

https://gerrit.wikimedia.org/r/911887

Change 918343 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/cxserver@master] Add MinT service to production config

https://gerrit.wikimedia.org/r/918343

Change 918407 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] cxserver: Enable machintranslation proxy

https://gerrit.wikimedia.org/r/918407

Change 918343 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Add MinT service to production config

https://gerrit.wikimedia.org/r/918343

Change 905579 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Add MinT support to cxserver

https://gerrit.wikimedia.org/r/905579

Change 905579 merged by jenkins-bot:

[operations/deployment-charts@master] Add MinT support to cxserver

https://gerrit.wikimedia.org/r/905579

Change 918407 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: Enable machintranslation proxy

https://gerrit.wikimedia.org/r/918407

Change 918441 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] cxserver: Bump chart version

https://gerrit.wikimedia.org/r/918441

Change 918441 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: Bump chart version

https://gerrit.wikimedia.org/r/918441

Change 918509 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] cxserver: mesh configuration updated

https://gerrit.wikimedia.org/r/918509

Change 918509 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: mesh configuration updated

https://gerrit.wikimedia.org/r/918509

Change 922059 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Remove Flores client as it is replaced by MinT

https://gerrit.wikimedia.org/r/922059

Change 922061 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/extensions/ContentTranslation@master] Replace references to Flores by MinT and remove custom label

https://gerrit.wikimedia.org/r/922061

Change 922059 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Remove Flores client as it is replaced by MinT

https://gerrit.wikimedia.org/r/922059

Change 922064 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] cxserver: Remove Flores MT service

https://gerrit.wikimedia.org/r/922064

Change 922061 merged by KartikMistry:

[mediawiki/extensions/ContentTranslation@master] Replace references to Flores by MinT and remove custom label

https://gerrit.wikimedia.org/r/922061

Change 922064 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: Remove Flores MT service

https://gerrit.wikimedia.org/r/922064

Mentioned in SAL (#wikimedia-operations) [2023-05-23T06:04:44Z] <kart_> cxserver: Remove Flores MT service (T331505)

Change 923921 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20230529

https://gerrit.wikimedia.org/r/923921

Change 923921 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20230529

https://gerrit.wikimedia.org/r/923921

Since MinT was launched, the service has been running in support of Content Translation and Section Translation.

For pending items in the task:

  • Regarding the QA test checkbox, we can mark it as resolved. As we enabled MinT for each language that its models support, we checked the languages individually (T326578, T339105, T340953, T336683).
  • The use of a better storage option (T335491) is a follow-up task to improve the service, but I don't see it as a blocker for closing the current task, since the service is up and running. So we can close the current ticket and leave T335491 as a follow-up.