Regular expression - SQL manipulation

Question

[pol@fedora data]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: Fedora
Description:    Fedora release 34 (Thirty Four)
Release:    34
Codename:   ThirtyFour

I'm trying to convert a sample database file from MS SQL Server to PostgreSQL.

So, I'm having two small niggles that I can't resolve.

shipname       NVARCHAR(40) NOT NULL,

That's

(always) two spaces
identifier (i.e. field name) - always [a-z] - lower case alphabetical
followed by an unknown number of spaces
followed by NVARCHAR(xy) NOT NULL or it may be followed by NVARCHAR(xy) NULL

and I want to transform this into

shipname       TEXT NOT NULL CHECK (LENGTH(shipname)  <= xy),

or

shipname       TEXT NULL,

What I have so far:

sed 's/^  [a-z]+[ ]+NVARCHAR([0-9]+) NOT NULL/TEXT NOT NULL CHECK \(LENGTH\((\1) <= (\2)\)/g'

So,

^ is the beginning of the string
followed by two spaces
followed by my field name [a-z]+
followed by an arbitrary no. of spaces [ ]+
NVARCHAR([0-9]+)

and substitute in

TEXT followed by NOT NULL then CHECK(LENGTH(xy) - back reference 1 - <= back reference 2...

I've tried various permutations and combinations of the above, but nothing appears to work for me.

[pol@fedora data]$ sed 's/^  [a-z]+[ ]+NVARCHAR([0-9]+) NOT NULL/TEXT NOT NULL CHECK \(LENGTH\((\1) <= (\2)\)/g' 
sed: -e expression #1, char 87: invalid reference \2 on `s' command's RHS

Get invalid back reference...

Ideally, and I stress ideally, if the string following NVARCHAR(xy) is NULL and not NOT NULL, I don't want any length check - because it doesn't make sense to take the LENGTH of a NULL... this is conditional behaviour - not sure if it's possible in regexps....

p.s. thought this would be trivial.

Have data like this:

N'Strada Provinciale 1234', N'Reggio Emilia', NULL, N'10289', N'Italy');

I want to change the N' into just plain apostrophe ' (the N' is a SQL Server thing) but I don't want to change the NULL into the empty string, or worse ULL - so I tried:

[pol@fedora data]$ sed 's/N\'\'/g TSQLV5.sql

but get

sed: -e expression #1, char 7: unterminated `s' command

I know that I've used sed a lot, but would be open to any awk commands that could perform the tasks required.

Does your first instance of xy match the 40 from the corresponding original? — Chris Davies, Commented Jun 1, 2021 at 14:37
@roaima - not necessarily - xy could be any integer from 5 - 5000 (arbitrary)... hence [0-9]+ - there has to be at least one digit present, so not [0-9]*. That is: (xy) corresponds to whatever integers would be in those brackets! fieldname VARCHAR(543) => xy = 543 and is the back-reference in the regex! I hope this is clear? — Vérace, Commented Jun 1, 2021 at 14:44
Yes... I appreciate the generalisation, thanks; I'm trying to check I understand your mapping in the case of the example — Chris Davies, Commented Jun 1, 2021 at 14:50
1. You say "it doesn't make sense to take the LENGTH of a NULL". Both region and postalcode are VARCHAR(n) NULL, yet you want them to have a length check in the desired output. Which is correct? 2. Where does the test_field come from? and the extra blank lines (before birthdate, address, phone)? — cas, Commented Jun 2, 2021 at 3:18
I have rolled back your edit as it substantially changed the question and since you have already received an answer to your original question. If you have another related issue, then please open a new question rather than re-writing your original question. — Kusalananda, Commented Jun 2, 2021 at 7:53

DanieleGrassini · Accepted Answer · 2021-06-01 15:44:42Z

Since you use Fedora, you have GNU sed, and this should work:

s="  shipname       NVARCHAR(40) NOT NULL,"
echo "$s" | sed -E '/NOT/{s/^  ([[:lower:]]+)\s*NVARCHAR\(([[:digit:]]+)\) NOT NULL,$/\1 TEXT NOT NULL CHECK \(LENGTH\(\1\) <= \2\),/;q0} ; s/^  ([[:lower:]]+)/\1 TEXT NULL,/'

This emulates an if/else.

if:

a NOT (/NOT/) is found on the line, the first sed command is executed and sed then quits (q0) without executing the second statement.

else:

no NOT keyword is found, and the second substitution is executed.

For the second requirement:

sed "s/N'/'/g"

Globally search for N' and replace it with just '. I find it useful to wrap the sed script in double quotes instead of single quotes here, which keeps it clean without a lot of escaping.
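One caveat with a bare N' pattern: it also matches an N that ends a quoted value, so 'LONDON' would become 'LONDO'. A minimal sketch that strips the N only when it follows the start of the line, a space, a comma, or an opening parenthesis. That is an assumption about how the INSERT statements are laid out, and strip_n_prefix is a hypothetical helper name:

```shell
# Hypothetical helper: remove the T-SQL N prefix from string literals on stdin.
strip_n_prefix() {
  # (^|[ ,(]) captures whatever precedes the N so it can be put back as \1.
  sed -E "s/(^|[ ,(])N'/\1'/g"
}

# Made-up sample line: the N inside LONDON and the NULL keyword stay untouched.
printf '%s\n' "N'Strada Provinciale 1234', N'LONDON', NULL, N'Italy');" | strip_n_prefix
```

An N' that appears inside a string value right after a space or comma would still be changed, so it is worth diffing the result against the original file.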

Put the first sed inside a file:

#!/bin/sed -Ef

# If a NOT is found, execute this:
# capture the column name and its length
/NOT/ {
    s/^  ([[:lower:]]+)\s*NVARCHAR\(([[:digit:]]+)\) NOT NULL,$/\1 TEXT NOT NULL CHECK \(LENGTH\(\1\) <= \2\),/

    # Quit without executing the other statement
    q0
}

# Else: if we get here, the line has no NOT,
# so no length check is needed
# and the column should be NULL
s/^  ([[:lower:]]+)/\1 TEXT NULL,/

The { command is used to group several sed commands together.

The q command makes sed quit. Here I have used it to force sed to exit before reaching the last command if the first test succeeded.
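Note that q stops sed altogether, so on a file with many lines nothing after the first NOT line gets processed. A minimal sketch of one alternative: the b command without a label jumps to the end of the script for the current line only. The sample input and the convert_columns name are made up for illustration, and the else branch here matches the NVARCHAR(n) NULL form explicitly, which differs from the script above.

```shell
# Hypothetical helper: convert NVARCHAR column definitions read from stdin.
convert_columns() {
  sed -E '
    /NOT NULL/ {
      # NOT NULL columns keep their length as a CHECK constraint.
      s/^(  ([[:lower:]]+) +)NVARCHAR\(([[:digit:]]+)\) NOT NULL/\1TEXT NOT NULL CHECK (LENGTH(\2) <= \3)/
      # Skip the remaining commands for this line, but keep reading input.
      b
    }
    # Nullable columns: drop the length.
    s/^(  [[:lower:]]+ +)NVARCHAR\([[:digit:]]+\) NULL/\1TEXT NULL/
  '
}

printf '%s\n' \
  '  shipname       NVARCHAR(40) NOT NULL,' \
  '  shipregion     NVARCHAR(15) NULL,' | convert_columns
```

Capturing the leading spaces and the column name as group 1 keeps the original alignment in the output.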

Philippos · Accepted Answer · 2021-06-02 10:39:05Z

You already got answers, but I want to add what went wrong in your own approach, so you can learn from it instead of just copying some solution:

You use extended regular expressions, but forgot to give the -E option to sed.
You want to reuse the identifier, but you did not enclose it in ()
You seem to mix ERE () groups with literal ones. You probably mean sed -E 's/^  ([a-z]+)[ ]+NVARCHAR\(([0-9]+)\) NOT NULL/TEXT NOT NULL CHECK \(LENGTH\((\1) <= (\2)\)/g'
The first part up to the spaces doesn't show in the replacement. You also need to group it and use it as reference in the replacement: sed -E 's/^(  ([a-z]+)[ ]+)NVARCHAR\(([0-9]+)\) NOT NULL/\1TEXT NOT NULL CHECK \(LENGTH\((\2) <= (\3)\)/g'
[ ]+ is the same as " +" (a space followed by +). Not an error, but makes it more confusing to read.
The g option is superfluous. With an anchor like ^ or $ in the pattern multiple replacements are not possible.
You can avoid multiple expressions by making the NOT optional: sed -E 's/^(  ([a-z]+) +)NVARCHAR\(([0-9]+)\) (NOT )?NULL/\1TEXT \4NULL CHECK \(LENGTH\((\2) <= (\3)\)/'
On the other hand, if you want to leave out the check, you can do that with a separate replacement: s/^(  [a-z]+ +)NVARCHAR\(([0-9]+)\) NULL/\1TEXT NULL/
Your s/N\'\'/g misses the separator between search pattern and replacement: s/N\'/\'/g

So you end up with

sed -E 's/^(  ([a-z]+) +)NVARCHAR\(([0-9]+)\) NOT NULL/\1TEXT NOT NULL CHECK \(LENGTH\((\2) <= (\3)\)/
  s/^(  [a-z]+ +)NVARCHAR\(([0-9]+)\) NULL/\1TEXT NULL/
  s/N\'/\'/g'
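
Two details may still bite when this command is pasted into a shell. Inside a single-quoted string a backslash cannot escape an apostrophe, so the last expression ends the quoting early. And the replacement, as far as I can tell, emits four opening but only three closing parentheses. A minimal sketch that sidesteps both by passing separate -e expressions; the to_postgres name and the sample lines are made up for illustration:

```shell
# Hypothetical helper: apply all three substitutions to stdin.
to_postgres() {
  # Single quotes protect the backslashes in the first two expressions;
  # the last one is double-quoted so it can hold a literal apostrophe.
  sed -E \
    -e 's/^(  ([a-z]+) +)NVARCHAR\(([0-9]+)\) NOT NULL/\1TEXT NOT NULL CHECK (LENGTH(\2) <= \3)/' \
    -e 's/^(  [a-z]+ +)NVARCHAR\([0-9]+\) NULL/\1TEXT NULL/' \
    -e "s/N'/'/g"
}

printf '%s\n' \
  '  shipname       NVARCHAR(40) NOT NULL,' \
  "  VALUES (N'Italy', NULL);" | to_postgres
```

In the replacement part of s, parentheses are ordinary characters, so they need no backslashes there.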

cas · Accepted Answer · 2021-06-02 04:33:52Z

sed is great for some tasks, but some other tasks require a full-featured language, like awk or perl, with conditionals and printf and more. And preferably a language that doesn't read like some hideous hybrid of a regex and an RPN calculator :-).

#!/usr/bin/perl
use strict;

while(<>) {
  # print verbatim any lines that don't define an identifier
  unless (m/^\s+\S/) { print; next };
  # print a blank line before certain identifiers
  print "\n" if m/birthdate|address|phone/;

  # various regex transformations for IDENTITY and VARCHAR fields
  s/\s+NOT NULL IDENTITY/ GENERATED BY DEFAULT AS IDENTITY/;
  s/([[:lower:]]+)\s+NVARCHAR\((\d+)\) NOT NULL/$1 TEXT NOT NULL CHECK (LENGTH($1) <= $2)/;
  s/\s+NVARCHAR\((\d+)\)\s+NULL/ TEXT NULL/;

  # remove length checks from NULL definitions
  s/\s+CHECK.*/,/ if /(?<!NOT) NULL/;

  # add a comma at the end of the mgrid line if it's not there
  s/\s*$/,/ if /mgrid/ && ! /,\s*$/;

  # hacky crap to nicely format "TYPE (NOT )?NULL" output.
  my @F = split;
  my $identifier = shift @F;
  my $type = shift @F;
  $type .= " " . shift @F if ($F[0] =~ /NOT/);
  $type = sprintf "%-8s", $type;
  $type .= " " . shift @F if ($F[0] =~ /NULL/);

  printf "  %-15s %-13s%s\n", $identifier, $type, join(" ",'',@F);

  # print the test_field definition after mgrid
  if ($identifier eq 'mgrid') {
    print "  test_field      TEXT     NULL CHECK (LENGTH(test_field) <= 25)\n";
  };
}

This is a fairly brute-force method of transforming your input to (roughly) your desired output: a few regex transformations, some code to line up the "fields" nicely, and a few extra print statements to add blank lines and the test_field in the appropriate places. As such, it's not generically useful, but it can be adapted to suit other SQL transformations as required.
The script implements the description in your question, not what is displayed in the "desired output" (so, for example, region and postalcode do not have length checks because they are NULL fields).

Output:

CREATE TABLE employee
(
  empid           INT           GENERATED BY DEFAULT AS IDENTITY,
  lastname        TEXT NOT NULL CHECK (LENGTH(lastname) <= 20),
  firstname       TEXT NOT NULL CHECK (LENGTH(firstname) <= 10),
  title           TEXT     NULL,
  titleofcourtesy TEXT     NULL,

  birthdate       DATE NOT NULL,
  hiredate        DATE NOT NULL,

  address         TEXT NOT NULL CHECK (LENGTH(address) <= 60),
  city            TEXT NOT NULL CHECK (LENGTH(city) <= 15),
  region          TEXT     NULL,
  postalcode      TEXT     NULL,
  country         TEXT NOT NULL CHECK (LENGTH(country) <= 15),

  phone           TEXT NOT NULL CHECK (LENGTH(phone) <= 24),
  mgrid           INT      NULL,
  test_field      TEXT     NULL CHECK (LENGTH(test_field) <= 25)

);

Here's a diff of the script's output vs your desired output (after cleaning up to remove comments and some extraneous space characters):

-  region          TEXT     NULL CHECK (LENGTH(region) <= 15),
-  postalcode      TEXT     NULL CHECK (LENGTH(postalcode) <= 10),
+  region          TEXT     NULL,
+  postalcode      TEXT     NULL,

Other comments:

You probably want PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY for empid
postgresql has a VARCHAR(n) data type, which is probably more appropriate than TEXT, and much simpler to transform: s/NVARCHAR/VARCHAR/. VARCHARs have a fixed length, so a) don't need the length constraint checks, and b) are faster to index and search.
Allowing a field to be NULL is the default, so there's no real need to explicitly define them as such.

Yes, I would want a PK for my table. VARCHARs do not have a fixed length - that's the reason for the name :-) And yes, my regex powers would extend to 's/NVARCHAR/VARCHAR/g'! CHAR(n) has a fixed length - see here- also here. — Vérace, Commented Jun 2, 2021 at 6:20
yes, sorry, sloppy thinking when i wrote that. I meant CHAR. Also looks like a lot has changed with TEXT vs CHAR/VARCHAR since I first started using pg (late 90s). back then, CHAR / VARCHAR were preferable to TEXT. — cas, Commented Jun 2, 2021 at 6:21
Been a long day? Mine's just starting here in Europe :-) (+1 BTW). — Vérace, Commented Jun 2, 2021 at 6:25
not exactly, but a long night last night and poor sleep. btw, unless you're doing more constraint checks than just a length check it doesn't seem as if there's any real advantage to using TEXT (and the disadvantage of having to explicitly write the length constraint). — cas, Commented Jun 2, 2021 at 6:29
I'm not sure now. I'll have to look into it next time it actually matters to me :-). I've always used VARCHAR(n) for shortish columns up to 256 chars or so (CHAR was always kind of pointless with Pg. and with bonus annoying padded spaces), and TEXT for bulk text. — cas, Commented Jun 2, 2021 at 6:33
