
WMCS-roots wiki replica access
Closed, Resolved (Public)

Description

We need to decide on and update the permissions for the WMCS-roots and wmcs-admins groups.

The convention we have in the admin data.yaml file is that $group-roots are more powerful than $group-admins; however, it seems that in the WMCS case this is not strictly true. From what I can tell, the wmcs-roots group is used to give sudo root access to a number of wmcs hosts, but not all of them. (update: this has been fixed)

The wmcs-admins group is used to allow users to manage wikireplicas; specifically, it allows users to run the following (see the sketch after this list):

  • cluster::management
    • /usr/local/bin/secure-cookbook sre.wikireplicas
  • wmcs::db::wikireplicas
    • /usr/local/sbin/maintain-views
    • /usr/local/sbin/maintain-meta_p
    • /usr/local/sbin/maintain-replica-indexes
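
For illustration, here is a sketch of how a group entry like this typically looks in the admin module's data.yaml (the privileges match the ones quoted verbatim later in this thread; the gid and members shown are hypothetical placeholders):

  wmcs-admins:
    gid: 799                     # hypothetical gid
    description: allows users to run wikireplica maintenance scripts
    members: [alice, bob]        # hypothetical members
    privileges: ['ALL = (ALL) NOPASSWD: /usr/local/bin/secure-cookbook sre.wikireplicas.*',
                 'ALL = (ALL) NOPASSWD: /usr/local/sbin/maintain-views',
                 'ALL = (ALL) NOPASSWD: /usr/local/sbin/maintain-meta_p',
                 'ALL = (ALL) NOPASSWD: /usr/local/sbin/maintain-replica-indexes']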

Currently the members of the wmcs-roots and wmcs-admins groups are almost identical, the only difference being that @taavi is in the former and not the latter. I created a change to add wmcs-roots to the wmcs-admins group. It would allow wmcs-roots to perform the above maintenance tasks on the db wiki hosts; this is possibly less of a concern, but would still need someone with knowledge to confirm that it is acceptable. (update: this patch has been merged)

For the sre.wikireplicas.* use case, I wonder if these cookbooks could be run from the cloudcumin hosts instead of the cumin hosts; then we could simply drop wmcs-admin access from cluster::management.

Also, when looking at the wmcs-roots group I noticed that it does not have access to all wmcs machines. I believe that the ultimate goal is that wmcs engineers could perform 100% of their role with wmcs-roots, and would in theory be able to drop global ops membership some time in the future. As such, I think we should try to ensure that the wmcs-roots group does have the necessary access. I have created a change to add what seem to me to be the obvious ones, but suspect it's missing some. (update: this patch has been merged, but there are some operations that still require global ops)

@nskaggs please add, correct or update anything I may have missed.

Event Timeline

jbond triaged this task as Medium priority. May 31 2023, 11:35 AM
jbond created this task.
Reedy renamed this task from WMCS-roots acess to WMCS-roots access. May 31 2023, 11:18 PM
Reedy updated the task description.
Reedy updated the task description.
taavi renamed this task from WMCS-roots access to WMCS-roots wiki replica access. Jun 8 2023, 12:33 PM
taavi edited projects, added cloud-services-team, Data-Services; removed Cloud-Services.

Also, when looking at the wmcs-roots group I noticed that it does not have access to all wmcs machines. I believe that the ultimate goal is that wmcs engineers could perform 100% of their role with wmcs-roots, and would in theory be able to drop global ops membership some time in the future. As such, I think we should try to ensure that the wmcs-roots group does have the necessary access. I have created a change to add what seem to me to be the obvious ones, but suspect it's missing some.

Indeed, that's the intention. Right now, for the roles in your patch set, there is site-specific hiera data which grants access already.

Also, when looking at the wmcs-roots group I noticed that it does not have access to all wmcs machines. I believe that the ultimate goal is that wmcs engineers could perform 100% of their role with wmcs-roots, and would in theory be able to drop global ops membership some time in the future. As such, I think we should try to ensure that the wmcs-roots group does have the necessary access. I have created a change to add what seem to me to be the obvious ones, but suspect it's missing some.

Indeed, that's the intention. Right now, for the roles in your patch set, there is site-specific hiera data which grants access already.

Thanks @taavi, I have updated the patch. However, this also highlights that we also have a group called labtest-roots which has the same members as wmcs-roots; I wonder if the former could be dropped, or at the very least updated so it automatically includes the users from wmcs-roots.

My preference would be to find a resolution to any lingering concerns from the DBA group about full sudo for non-global roots on the clouddb Wiki Replica hosts, apply the wmcs-roots group there, and finally remove the wmcs-admins group entirely from Puppet. T166310#3292145 is as far as I know the only written documentation of the concern which gave rise to the wmcs-admins group and its limited permissions for the clouddb servers 6+ years ago.

We really need to come up with a way to grant root access to clouddb* hosts that doesn't imply root on all the production databases, because that is really overkill. I don't know if this is something the Infrastructure-Foundations team could help with? cc @joanna_borun

Giving root to labsdbs would be equivalent to giving root to all mysql servers for many reasons. No problem with that, but he should be added to the paging system (if he is not already there) and respond to the database alerts.

@jcrespo or @Marostegui could we expand on this and explore ways we may be able to close this issue?

finally remove the wmcs-admins group entirely from Puppet.

I think it's probably worth keeping this around, but redefining it as described above, i.e. as a less powerful wmcs-roots. This would allow either new staff or volunteers to get some access without full root.

Jbond, the short answer is I believe I agree with all the points you've raised in the description.

I agree with the goal to ensure wmcs-roots have access to all the hardware, cookbooks, and databases they need to effectively maintain the platforms and services WMCS is responsible for.

However, this would give wmcs-roots access to production hosts, in particular the cumin hosts (arguably the most powerful production hosts), and we have not previously considered this access.

I presume controls would still exist on those cumin hosts to control what cookbooks could actually be run? And therefore might minimize potential risk?

Further, it would allow wmcs-roots to perform the above maintenance tasks on the db wiki hosts; this is possibly less of a concern, but would still need someone with knowledge to confirm that it is acceptable.

Given the recent wikireplicas incident T337446, support and ownership discussions are currently happening regarding wikireplicas. In the short term, the lack of access by those who could potentially help, including community members, was identified as a contributing factor.

For the sre.wikireplicas.* use case, I wonder if these cookbooks could be run from the cloudcumin hosts instead of the cumin hosts; then we can simply drop wmcs-roots access from cluster::management.

I think there would remain more sre.* cookbooks that wmcs-roots would want to run, including things like the imaging and decommissioning cookbooks. That said, I support enabling cloudcumin to run these cookbooks, along with granting wmcs-roots access.

Thanks for the feedback

However, this would give wmcs-roots access to production hosts, in particular the cumin hosts (arguably the most powerful production hosts), and we have not previously considered this access.

I presume controls would still exist on those cumin hosts to control what cookbooks could actually be run? And therefore might minimize potential risk?

Currently no, not really. All cookbooks on the production cumin hosts run as root; however, there are a number of things we can explore to change this:

  • First off, we now have the cloudcumin hosts, which afaik solve a lot of the problems for cloud hosts. I'll raise a child ticket to document the gaps that we currently have with this.
  • We have also started some work on unprivileged cumin hosts, which use Kerberos instead of a root SSH key to perform tasks. This in theory works, but needs to be tested more and then rolled out; however, I think it would have some of the same issues as the cloudcumin hosts.
  • Start an investigation to identify which cookbooks actually need root (or root-like) privileges.

That said, I think that without a significant amount of work there will be some cookbooks that need global root.

Further, it would allow wmcs-roots to perform the above maintenance tasks on the db wiki hosts; this is possibly less of a concern, but would still need someone with knowledge to confirm that it is acceptable.

Given the recent wikireplicas incident T337446, support and ownership discussions are currently happening regarding wikireplicas. In the short term, the lack of access by those who could potentially help, including community members, was identified as a contributing factor.

I think we will need to get some input from Data-Persistence to understand the risks first.

For the sre.wikireplicas.* use case, I wonder if these cookbooks could be run from the cloudcumin hosts instead of the cumin hosts; then we can simply drop wmcs-roots access from cluster::management.

I think there would remain more sre.* cookbooks that wmcs-roots would want to run, including things like the imaging and decommissioning cookbooks. That said, I support enabling cloudcumin to run these cookbooks, along with granting wmcs-roots access.

As mentioned above, I'll create a child ticket to try to capture the remaining issues with cloudcumin.

I have replied on the specific sub-task you've created.

Given the recent wikireplicas incident T337446, support and ownership discussions are currently happening regarding wikireplicas. In the short term, the lack of access by those who could potentially help, including community members, was identified as a contributing factor.

Could you clarify this sentence: a contributing factor to what?

this would give wmcs-roots access to production hosts, in particular the cumin hosts

I'm not following here. On cumin1001 I don't see anything that would allow wmcs-roots or wmcs-admin to run cumin, so my understanding is that members of those groups can SSH to cumin1001 but cannot become root, run sudo, or run cumin. Am I missing something?

I think it's probably worth keeping this around, but redefining it as described above, i.e. as a less powerful wmcs-roots. This would allow either new staff or volunteers to get some access without full root.

I agree it could be useful in the future, but at the moment it would probably be empty.

Given the recent wikireplicas incident T337446, support and ownership discussions are currently happening regarding wikireplicas. In the short term, the lack of access by those who could potentially help, including community members, was identified as a contributing factor.

Could you clarify this sentence: a contributing factor to what?

Yes, I'm sorry this wasn't very well said. It's my understanding that during recovery, view creation specifically was delayed by a lack of knowledge, as well as a lack of access for those who had some understanding but no access. So that part of the recovery was delayed until, I believe, Amir was able to understand what was needed, as well as fix some scripts to make it work. In my opinion, more people having more access during the incident would have helped wikireplicas recover faster, and could have allowed more individuals to assist those working on the problem. I was trying to share that in the interim/short term, wmcs-roots having access *might* be helpful.

However, that doesn't mean giving more groups access to perform database operations would result in a better maintained or supported database. In the long-term, I would hope WMCS roots don't need access to db-wiki hosts. Ideally because their help isn't required. But if it was, then ideally access to the db hosts wouldn't be required to help.

Please feel free to correct anything I've misunderstood or misrepresented. Does that clarify my comment?

my understanding is that members of those groups can SSH to cumin1001 but cannot become root

Re-reading the thread, I wonder if your concern is that SSH access is already too much? I agree it would be nice to remove access entirely, and that does not seem to be hard because....

For the sre.wikireplicas.* use case I wonder if these cookbooks could be run from the cloudcumin hosts instead of the cumin hosts,

I agree this seems the best way forward for those cookbooks, I believe that's the only reason why wmcs-admin members can currently ssh to cuminXXXX hosts. Allowing users to run sre.wikireplicas.* from cloudcumins could be done in the context of T325067.

then we can simply drop wmcs-roots access from cluster::management

I agree but I think you mean wmcs-admin here, which is the group that currently has access.

my understanding is that members of those groups can SSH to cumin1001 but cannot become root

Re-reading the thread, I wonder if your concern is that SSH access is already too much? I agree it would be nice to remove access entirely, and that does not seem to be hard because....

You are right: if we add wmcs-roots to wmcs-admins, then they would still only get the privileges for wmcs-admins (see below), unless we explicitly add wmcs-roots to profile::admin::groups for cluster::management.

privileges: ['ALL = (ALL) NOPASSWD: /usr/local/bin/secure-cookbook sre.wikireplicas.*',
             'ALL = (ALL) NOPASSWD: /usr/local/sbin/maintain-views',
             'ALL = (ALL) NOPASSWD: /usr/local/sbin/maintain-meta_p',
             'ALL = (ALL) NOPASSWD: /usr/local/sbin/maintain-replica-indexes']
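
In other words, the sudo rules come from the group definition, while shell access to a host comes from the role's hiera data. A minimal sketch of the hiera side, assuming the usual role-hiera layout (the file path and the existing entries are illustrative):

  # hieradata/role/common/cluster/management.yaml (illustrative path)
  profile::admin::groups:
    - ops
    - wmcs-admin
    - wmcs-roots   # adding this line would grant wmcs-roots shell access (and its own sudo rules) on these hosts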

For the sre.wikireplicas.* use case I wonder if these cookbooks could be run from the cloudcumin hosts instead of the cumin hosts,

I agree this seems the best way forward for those cookbooks, I believe that's the only reason why wmcs-admin members can currently ssh to cuminXXXX hosts. Allowing users to run sre.wikireplicas.* from cloudcumins could be done in the context of T325067.

then we can simply drop wmcs-roots access from cluster::management

I agree but I think you mean wmcs-admin here, which is the group that currently has access.

Yes I do, thanks :)

Change 951469 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] admin: deprecate labtest-roots group

https://gerrit.wikimedia.org/r/951469

Change 923681 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] wmcs: add wmcs-roots to roles where it is missing

https://gerrit.wikimedia.org/r/923681

@fnegri your read of the wmcs-admin access on cumin1001 matches mine, and I can confirm as a member it is so (no funny obvious things going on via other indirect permissions for sudo).

For me, while I'm analyzing Wiki Replicas, the wmcs-admin access to the clouddb#### prod realm hosts (via analytics_multiinstance.yaml and web_multiinstance.yaml) is coming in handy. Thinking ahead, as long as the cloudcumin privs are modeled the same for users like me, it seems like we could just transfer the wmcs-admin sudo material over to cloudcumin and drop it from cluster::management; for now, of course, the users who might run wmcs-admin things on cumin are folks familiar with the infrastructure.

As a note to myself for the near future, and as we're thinking through this, the things I'm seeing that would be good for wmcs-admin are the following. @jbond @Marostegui okay if I start submitting some patches for these? It seems like we have a social protocol of it being okay to post a draft to show the thinking, but I also don't want to use our precious cycles on this if you have a notion to take it in a very different direction.

  • Ability to query some private hiera configuration via narrowly defined cumin aliases. (Not sure how this might translate in a cloudcumin world, but that's a TODO).
  • dbproxy1018 and dbproxy1019 access (presently this only seems achievable via ops membership implicit rights)
    • On the prod realm, some haproxy and other commands
  • Probably (?) some things that are currently only possible on cloudcontrol - in particular DNS updates. This looks maybe somewhat more complicated.

If wmcs-admins stays small for NDA'd high privilege users, a simpler ability to run arbitrary sudo commands, mentioned in passing in T344599#9115620 on T344599: wikireplicas root access, might be less toilsome later on, though (but as also noted, maybe it's possible to enumerate all of the essential commands and otherwise use low privs as a standard Linux user). I wouldn't mind joining wmcs-roots, but I think that should be issued to a pretty small group as well.

Thanks for the input @dr0ptp4kt.

@jbond okay if I start submitting some patches for these?

That's fine by me, but please make sure to read the child tasks, specifically T344599. Also please check out the current patches [1][2] to ensure we are not duplicating work.

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/923681/
[2] https://gerrit.wikimedia.org/r/c/operations/puppet/+/951469/

  • Ability to query some private hiera configuration via narrowly defined cumin aliases. (Not sure how this might translate in a cloudcumin world, but that's a TODO).

We should probably discuss this more on T344412; there is no way to do this currently, but depending on which secrets you want it may not be too hard.

  • dbproxy1018 and dbproxy1019 access (presently this only seems achievable via ops membership implicit rights)

Would this make sense to discuss on T344599? However, this seems like a bit of feature creep; afaik these servers are very much production servers owned by the data-persistence team, so it would ultimately be their call.

  • On the prod realm, some haproxy and other commands

Can you expand on which machines? Again, if these are machines owned by a production team, i.e. SRE, then we will need to include different stakeholders.

  • Probably (?) some things that are currently only possible on cloudcontrol - in particular DNS updates. This looks maybe somewhat more complicated.

These are cloud hosts, so that's completely up to WMCS.

If wmcs-admins stays small for NDA'd high privilege users, a simpler ability to run arbitrary sudo commands, mentioned in passing in T344599#9115620 on T344599: wikireplicas root access,

If we can define them, then that's fine.

I wouldn't mind joining wmcs-roots, but I think that should be issued to a pretty small group as well.

To clarify: currently, in my mind, wmcs-roots should be a group that gives root privileges to WMCS machines, to allow the WMCS team and highly trusted volunteers to do their work, including things we haven't thought of before, e.g. running strace/tcpdump etc. wmcs-admins should be a group that allows doing regular, pre-determined maintenance tasks. That is to say, if there are additional known tasks we should definitely add them to the sudo rules for wmcs-admins; however, the end goal, I think, should be to give wmcs-roots full sudo access to clouddb.
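
To make that split concrete, a rough data.yaml sketch (descriptions and rules here are illustrative, not the actual entries):

  wmcs-roots:
    description: full root on WMCS machines for the team and highly trusted volunteers
    privileges: ['ALL = (ALL) NOPASSWD: ALL']
  wmcs-admins:
    description: pre-determined maintenance tasks only
    privileges: ['ALL = (ALL) NOPASSWD: /usr/local/sbin/maintain-views']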

Change 951469 merged by Jbond:

[operations/puppet@production] admin: deprecate labtest-roots group

https://gerrit.wikimedia.org/r/951469

@fnegri your read of the wmcs-admin access on cumin1001 matches mine, and I can confirm as a member it is so (no funny obvious things going on via other indirect permissions for sudo).

For me, while I'm analyzing Wiki Replicas, the wmcs-admin access to the clouddb#### prod realm hosts (via analytics_multiinstance.yaml and web_multiinstance.yaml) is coming in handy. Thinking ahead, as long as the cloudcumin privs are modeled the same for users like me, it seems like we could just transfer the wmcs-admin sudo material over to cloudcumin and drop it from cluster::management; for now, of course, the users who might run wmcs-admin things on cumin are folks familiar with the infrastructure.

As a note to myself for the near future, and as we're thinking through this, the things I'm seeing that would be good for wmcs-admin are the following. @jbond @Marostegui okay if I start submitting some patches for these? It seems like we have a social protocol of it being okay to post a draft to show the thinking, but I also don't want to use our precious cycles on this if you have a notion to take it in a very different direction.

  • Ability to query some private hiera configuration via narrowly defined cumin aliases. (Not sure how this might translate in a cloudcumin world, but that's a TODO).

I don't have any strong feelings here, as Data Persistence do not really work on/with this.

  • dbproxy1018 and dbproxy1019 access (presently this only seems achievable via ops membership implicit rights)

We (Data Persistence) do not own these hosts. So I don't have input on it.

    • On the prod realm, some haproxy and other commands
  • Probably (?) some things that are currently only possible on cloudcontrol - in particular DNS updates. This looks maybe somewhat more complicated.

Same as above, we do not own them. But whoever ends up owning this whole service should have access to haproxy in order to depool hosts. Either that, or use @BTullis's cookbook, which I believe runs from cumin.

Change 952455 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] dbproxy: change ownership to wmcs

https://gerrit.wikimedia.org/r/952455

  • dbproxy1018 and dbproxy1019 access (presently this only seems achievable via ops membership implicit rights)

We (Data Persistence) do not own these hosts. So I don't have input on it.

Do you or anyone else know who does? Looking at /etc/haproxy/conf.d/multi-db-replicas.cfg, it seems to only reference clouddb hosts, so I'm guessing WMCS. If so, we should update the contacts info.

  • dbproxy1018 and dbproxy1019 access (presently this only seems achievable via ops membership implicit rights)

We (Data Persistence) do not own these hosts. So I don't have input on it.

Do you or anyone else know who does? Looking at /etc/haproxy/conf.d/multi-db-replicas.cfg, it seems to only reference clouddb hosts, so I'm guessing WMCS. If so, we should update the contacts info.

The latest I know of was cloud-services-team, but I don't know if this has changed already. We (Data Persistence) never owned these hosts.

Change 923681 merged by Jbond:

[operations/puppet@production] wmcs: add wmcs-roots to roles where it is missing

https://gerrit.wikimedia.org/r/923681

Change 952455 merged by Andrew Bogott:

[operations/puppet@production] dbproxy: change ownership to wmcs

https://gerrit.wikimedia.org/r/952455

There's no pending discussion at the moment, so I'm moving this task out of the "Needs discussion" column and back to the inbox. Feel free to leave a comment if you would like this to be prioritized.

fnegri updated the task description.
fnegri claimed this task.

I updated the description of this task noting which parts have been fixed since the description was written.

I'm resolving this task, as the parts that have not been fixed yet are already tracked in the following other tasks: