Page MenuHomePhabricator

Deprecate raw SQL conditions for IDatabase methods (select, insert, etc.)
Open, MediumPublic

Description

There is currently a fairly common use case for which we rely on raw SQL, despite having a pretty good abstraction layer in place.

For example:

$res = $db->newSelectQueryBuilder()
    ->select( [
        'rev_id',
        'rev_user',
    ] )
    ->from( 'revision' )
    ->where( [
        'rev_page' => $pageId,
        'rev_timestamp > ' . $db->addQuotes( $db->timestamp( $since ) )
    ] ),
    ->fetchResultSet();

Goals

  1. Able to achieve the above without using raw SQL.
  2. As consumer, able to guarantee that my use of IDatabase cannot accidentally cause raw SQL to be used (e.g. I want an "opt-in" or "opt-out" way to say that I promise to never use this insecure method and would expect failure instead of insecurity if something it is accidentally used in my call, no matter of hooks or indirect supply of parts of the conds array.)

Prior art and considerations

We currently have two pieces of a query builder. We have the outer structure of the query created via methods such as $db->select() and $db->insert(). For the inner segments, mainly for the conditions, we also have helper methods such as $db->makeList(), $db->buildConcat(), and $db->buildSelectSubquery().

The idea is to bring these together in an interface for query building, that different backends can implement based on the syntax for that particular RDBMS backend.

The Wikimedia\Rdbms\Subquery class and other encasing classes also resemble the direction of typed value objects in favour of strings, which this would build upon.

The proposed QueryInfo class (see https://gerrit.wikimedia.org/r/459242) also seems relevant here. It's orthogonal in so far that it could co-exist alongside the query builder, but may want to avoid having two very similar concepts of a "query" value object.

Proposal 1: Expression builder

We currently have builder methods like makeList and buildConcat directly on IDatabase. We could continue that model and add the missing ones for expresions where we still use raw SQL today (e.g.buildGreatherThan, buildLessThan, etc).

We coud also move them to the SelectQueryBuilder class or some new Expression class.

Either way, this could look something like this:

$db->newSelectQueryBuilder()
    /* -> … */ 
    ->where( [
        'rev_page' => $pageId,
        $expr->buildGreaterThan( 'rev_timestamp', $sinceTimestamp )
    ] )

Proposal 1B: Enforcement

The issue proposal 1 leaves unsolved is enforcement and confidence that nothing breaks out of the model.

One way to do this, is to have the expression building methods return an internal object instead of a string, and then methods like IDatabase::select and SelectQueryBuilder::where could deprecate use of raw SQL strings.

Possibly, we'd opt-in through some means like $db->select( $db::SAFE_MODE, 'revision', [ .. ], [ .. ] );

This would make it so that conditions have to be an array, or expressin object built by the above methods like Wikimedia\Rdbms\LikeMatch and other encasings.

Below is what that might look like:

$db->newSelectQueryBuilder( $db::SAFE_MODE )
    /* -> … */ 
    ->where( [
        'rev_page' => $pageId,
        $expr->buildGreaterThan( 'rev_timestamp', $sinceTimestamp )
    ] )

The downside is that this doesn't make for a good migration target.

The parameter will either always remain optional and rarely used, or if we do decide to deprecate non-safe mode, we'd have to first add it everywhere, and then either keep the pointless option forever, or remove it everywhere again.

One way to avoid that could be to use a different method name, something that doesn't look optional but is just different for no particular reason, e.g. $db->selectQuery() that wouldn't look weird in the long-term, but would however make for two confusingly similar methods until the migration is complete.

One way to avoid that, could be to have the parameters be encapsulated. E.g. we'd keep $db->select() and make the deprecation based on signature. E.g. the legacy signature is the current position arguments, and the new signature could be something like: $db->select( $db->makeSelectQuery( ... ) );. That is basically the query builder idea.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 970868 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Add support for LIKE in expression builder

https://gerrit.wikimedia.org/r/970868

Change 971951 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] maintenance: Migrate $db->buildLike() to expression builder

https://gerrit.wikimedia.org/r/971951

Change 971951 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Migrate $db->buildLike() to expression builder

https://gerrit.wikimedia.org/r/971951

@daniel @matmarex @Krinkle the LIKE expression implementation gave me an idea for cases of ipb_range_end = ipb_range_start in WHERE conditions and possibly join conditions. We could introduce a RawValue class and turn ipb_range_end = ipb_range_start into $dbr->expr('ipb_range_end', '=', RawValue( 'ipb_range_start' ) ) instead. It's not pretty but it's clearly not a common usecase. Thoughts?

Basically flipping the default of "raw SQL unless specified" to "quoted unless explicitly specified"

Change 972431 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Migrate all non-API code to use expression builder instead of buildLike

https://gerrit.wikimedia.org/r/972431

@daniel @matmarex @Krinkle the LIKE expression implementation gave me an idea for cases of ipb_range_end = ipb_range_start in WHERE conditions and possibly join conditions. We could introduce a RawValue class and turn ipb_range_end = ipb_range_start into $dbr->expr('ipb_range_end', '=', RawValue( 'ipb_range_start' ) ) instead. It's not pretty but it's clearly not a common usecase. Thoughts?

Basically flipping the default of "raw SQL unless specified" to "quoted unless explicitly specified"

I like the idea, except for the name. For the use case in this example, I'd suggest something like new TableName( 'ipb_range_start' ) - we would specifically declare the object to represent a table name. It should probablby trigger identifier quoting.

We could also support new RawCondition( "ipb_range_end = ipb_range_start" ).
Or, perhaps better: new JoinCondition( "ipb_range_end", "ipb_range_start" ).

Change 972431 merged by jenkins-bot:

[mediawiki/core@master] Migrate all non-API code to use expression builder instead of buildLike

https://gerrit.wikimedia.org/r/972431

Change 972850 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] api: Migrate away from buildLike to expression builder

https://gerrit.wikimedia.org/r/972850

@Ladsgroup @daniel I'd be fine with any of those syntaxes. I'm not sure if it's worth doing for join conditions, though? They're almost always just 'a = b', maybe we can keep them as strings.

@Krinkle has also proposed something similar earlier:

->and( $db->expr( 'rev_timestamp', '<=', $db->expField( 'page_touched' ) )

I think the question is: do we want to allow more types for the third parameter to $db->expr(), or do we want to introduce a new IExpression subclass for this kind of condition? Given how you've implemented the LIKE expressions, I guess you prefer the former. That seems alright to me, as long as we don't do anything that is impossible for Phan to validate.

@daniel @matmarex @Krinkle the LIKE expression implementation gave me an idea for cases of ipb_range_end = ipb_range_start in WHERE conditions and possibly join conditions. We could introduce a RawValue class and turn ipb_range_end = ipb_range_start into $dbr->expr('ipb_range_end', '=', RawValue( 'ipb_range_start' ) ) instead. It's not pretty but it's clearly not a common usecase. Thoughts?

Basically flipping the default of "raw SQL unless specified" to "quoted unless explicitly specified"

Maybe we can just add a Column class, supported by Expression/$db->expr(), with identifier escaping/quoting used automatically when needed. The constructor would support:

  • new Column("column")
  • new Column( "alias.column" ) [not supporting a period in either of the two components]

It would be a little nice to also have an ExcludedColumn class for automatically dealing with "excluded.<column>" "ON CONFLICT"/upsert syntax in postgres/sqlite (which can't be done now due to raw sql).

I'd be careful with the wording of column vs field, though columns are also fields. Note that the rdbms Field interface already exists for other purposes, so that cannot exactly be used a class name without some renaming.

An interesting case is in SQLBagOStuff::buildMultiUpsertSetForOverwrite() that compares SUBSTR(column,x,y) with a literal. Maybe there could be a ComputedValue class, supported by Expression/$db->expr(). The constructor would take an an IComputedValue constant (e.g. IComputedValue::SUBSTR) plus the variadic int/string/Column/ComputedValue arguments. Maybe there could be a $db->computedValue() method, e.g.:

$db->computedValue( IComputedValue::SUBSTR, new Column( 'modtoken' ), 1, 13 );

I suppose ComputedValue could be shortened to Computation (though I like the clarity of LikeValue/ComputedValue in Expression construction).

@daniel @matmarex @Krinkle the LIKE expression implementation gave me an idea for cases of ipb_range_end = ipb_range_start in WHERE conditions and possibly join conditions. We could introduce a RawValue class and turn ipb_range_end = ipb_range_start into $dbr->expr('ipb_range_end', '=', RawValue( 'ipb_range_start' ) ) instead. It's not pretty but it's clearly not a common usecase. Thoughts?

Basically flipping the default of "raw SQL unless specified" to "quoted unless explicitly specified"

I like the idea, except for the name. For the use case in this example, I'd suggest something like new TableName( 'ipb_range_start' ) - we would specifically declare the object to represent a table name. It should probablby trigger identifier quoting.

We could also support new RawCondition( "ipb_range_end = ipb_range_start" ).
Or, perhaps better: new JoinCondition( "ipb_range_end", "ipb_range_start" ).

The problem is that it's not always join condition, in the example I gave above it is a WHERE condition (and not even ANSI-89 join, an actual condition to pick up rows that have the same value in two different fields).

I like the RawCondition( "ipb_range_end = ipb_range_start" ), maybe RawExpression to make it consistent?

@Ladsgroup @daniel I'd be fine with any of those syntaxes. I'm not sure if it's worth doing for join conditions, though? They're almost always just 'a = b', maybe we can keep them as strings.

@Krinkle has also proposed something similar earlier:

->and( $db->expr( 'rev_timestamp', '<=', $db->expField( 'page_touched' ) )

I think the question is: do we want to allow more types for the third parameter to $db->expr(), or do we want to introduce a new IExpression subclass for this kind of condition? Given how you've implemented the LIKE expressions, I guess you prefer the former. That seems alright to me, as long as we don't do anything that is impossible for Phan to validate.

Yeah, I prefer the former but at the same time, I really don't want to introduce way too many new classes developers would need to learn and get used to, that's why I'm a bit against Column/Join, First I want to have RawSQL and then we can look at the usages and take out the most common ones into dedicated classes but to avoid having a class for every possible type of value that could exist. Does that make sense to you?

An interesting case is in SQLBagOStuff::buildMultiUpsertSetForOverwrite() that compares SUBSTR(column,x,y) with a literal. Maybe there could be a ComputedValue class, supported by Expression/$db->expr(). The constructor would take an an IComputedValue constant (e.g. IComputedValue::SUBSTR) plus the variadic int/string/Column/ComputedValue arguments. Maybe there could be a $db->computedValue() method, e.g.:

$db->computedValue( IComputedValue::SUBSTR, new Column( 'modtoken' ), 1, 13 );

I suppose ComputedValue could be shortened to Computation (though I like the clarity of LikeValue/ComputedValue in Expression construction).

I would really like to avoid coining new terms, we already have way too many terms that are not standard and each new developer needs learning and getting used to. it's not super clear either, does the php do the computation or it just produces sql and that happen db server side? Is it a computation in terms of halting or is it a transform? "function" is actually much clearer or "function call". I still would want to know how common this usecase is before jumping to implementing new interfaces.

Re-reading the description and comments it seems to me that this task is more about addQuotes(), phan-taint, and a more fluent looking interface more so than eliminating raw SQL from code calling rdbms query methods. Is the idea to keep most of the build*() methods around? Currently, the $field arguments to Database::expr() allows raw SQL in $field. Is that intentional in the long run? If so, then buildMultiUpsertSetForOverwrite() can just use SUBSTR() in the $field argument. SelectQueryBuilder::fields() allows raw SQL fields and SelectQueryBuilder::where() allows raw SQL. On the other hand, SelectQueryBuilder::table() wants a query builder for computed/derived tables rather than a string from selectSQLText() or such.

From the task description:

We coud also move them to the SelectQueryBuilder class or some new Expression class.

It terms of moving the build*() methods to SqlQueryBuilder, that seems like it would decrease convenience since the query builder would have to be bound to a variable instead of just reusing $db. Alternatively, using a new Expression class would be a lot like ComputedValue...maybe it could be called ExpressionValue and reuse IExpression:: constants. I could even be called FunctionCall, though it would be slightly odd for mere arithmetic operations. In any case, it would mean "value computed server side as specified by the query". It's unfortunate that both "calculated" and "computed" are used in SQL server documentation for this case, in addition to the related concept of schema-defined computed columns (both VIRTUAL and PERSISTENT).

I agree that it would be nice to cut down on classes that developers must be aware of. If the build*() methods go with a class based approach, this gets tricky. More so if raw SQL is not allowed in $fields/$conds since basic arithmetic operators also would need to use the class. If we are mostly fine with keeping the build*() methods, then new RawExpression("ipb_range_end = ipb_range_start") seems fine.

Change 972850 merged by jenkins-bot:

[mediawiki/core@master] api: Migrate away from buildLike to expression builder

https://gerrit.wikimedia.org/r/972850

While $field can take raw SQL but it's different from status quo drastically, we don't take fields from user input so the chance of sql injection is much lower, on top of that, the plan is have strict regex on field (allowing only a-zA-Z\d\.) so it's different from the current wild west.

The plan is not to keep build*, we could have a replacement for them, e.g. buildSubString() -> subStringExpr() that would return IExpression.

Change 975915 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Add a strict regex on $field on expression builder

https://gerrit.wikimedia.org/r/975915

Change 975915 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Add a strict regex on $field on expression builder

https://gerrit.wikimedia.org/r/975915

Change 980493 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] rdbms: Add support for NOT LIKE in expression builder

https://gerrit.wikimedia.org/r/980493

Change 980493 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Add support for NOT LIKE in expression builder

https://gerrit.wikimedia.org/r/980493

Change 991395 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] maintenance: Migrate to expression builders

https://gerrit.wikimedia.org/r/991395

Change 991395 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Migrate to expression builders

https://gerrit.wikimedia.org/r/991395

Are there any updates on when it will be possible to pass a column name as the value in an IExpression? This will block T350972: Use expression builder instead of raw SQL in GlobalBlocking due to the usage of > between two columns.

There are several different suggestions, I support a broad API, called "RawSQLValue" class which means toSQL() wouldn't call addQuotes on it. Because we have many cases that we can't cover all. There are column comparisons, there are function calls, etc. etc. Of course its use should be used with care (at least it makes searching for potential SQL injection vectors easier).

If someone makes a decision to move forward with one of suggestions above, it's just matter of implementation.

Change #1014090 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Introduce RawSQLValue for edge cases

https://gerrit.wikimedia.org/r/1014090

Change #1014090 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Introduce RawSQLExpression for edge cases

https://gerrit.wikimedia.org/r/1014090

Change #1020368 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] rdbms: Create RawSQLValue for SET clauses in update/upsert

https://gerrit.wikimedia.org/r/1020368

Change #1020368 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Create RawSQLValue for SET clauses in update/upsert

https://gerrit.wikimedia.org/r/1020368

So the raw SQL part has not been deprecated yet since there are many usecases left but the replacement in all cases is in place. I suggest renaming this ticket to narrow its scope to "provide replacement for raw SQL in conditions" and file a new ticket for the actual deprecation. Does that sound fine by people?

So the raw SQL part has not been deprecated yet since there are many usecases left but the replacement in all cases is in place. I suggest renaming this ticket to narrow its scope to "provide replacement for raw SQL in conditions" and file a new ticket for the actual deprecation. Does that sound fine by people?

Can we do it the other way around, preserving this as the deprecate task and have a new sub-task for the provide replacement which we can mark as Resolved?

So the raw SQL part has not been deprecated yet since there are many usecases left but the replacement in all cases is in place. I suggest renaming this ticket to narrow its scope to "provide replacement for raw SQL in conditions" and file a new ticket for the actual deprecation. Does that sound fine by people?

It sounds fine to me. I think that's how we actually used this ticket, for the most part (we tagged all patches introducing the replacements, but only a few random patches that made use of them).

Seeing these new IExpression in 1.42, it recalls me an article I read three years ago about prevention of SQLi, I mention it since it is a step further compared to the discussion here, but it remains a POC/theory. The article (in French) is here, written by a researcher then at Orange Cyberdefense (Judicaël Courant), and an equivalent conference paper in English is here; he wrote a POC in Java and I translated it into a PHP library three years ago.

The idea is to encode the SQL into PHP objects, creating an AST for the SQL query, which is then compiled into a prepared statement (*). According to the author, the advantages of this way of writing SQL are:

  1. we keep the full expressivity of the SQL, and
  2. it is no-SQLi-enforced by design (except if you bypass the literals, using id instead of str) so the reviews are facilitated, and
  3. the "syntax" for constructing the AST is near standard SQL, so it should not be too difficult to learn.

An example for SELECT user_id FROM user WHERE user_name = '$user'; with my POC library (**):

use function SQLTrees\{select,from,where,operator,id,str};

$user = "SQLi' OR 1=1 OR '";
$sql = select( id( 'user_id' ),
               from( id( 'user' ) ),
               where( operator( id( 'user_name' ), '=', str( $user ) ) )
);

$statement = $sql->compile();
var_dump( $statement->getPreparedStatement() );

$conn = new mysqli( 'localhost', 'user', 'password', 'mediawiki' );
$res = $statement->run_mysqli( $conn );

with the var_dump displaying:

array(2) {
  'template' =>
  string(46) "SELECT user_id FROM user WHERE user_name = ?;"
  'parameters' =>
  array(1) {
    [0] =>
    array(2) {
      [0] =>
      string(1) "s"
      [1] =>
      string(17) "SQLi' OR 1=1 OR '"
    }
  }
}

(*) using prepared statements instead of writing directly the SQL statements is an implementation choice, mainly to reduce the complexity of the library by avoiding each DBMS-specific escaping rules
(**) I cheat a bit because, currently, "USER" is recognized as a reserved SQL identifier and an exception is thrown for this specific issue (indeed, it is standard-SQL recognized word, but it is not recognized as such by MySQL), I opened an issue on my library about this.

Ladsgroup edited projects, added Data-Persistence (work done); removed DBA.

I have done the work of introducing the methods but I don't have the capacity to do the deprecation of previous ones as it requires migrating everything in production. So I stop cookie licking. If anyone wants to pick it up, feel free to do so.