Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python script to generate a .json.gz file per each locale #5

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

abaevbog
Copy link

@abaevbog abaevbog commented May 2, 2023

No description provided.

Copy link
Member

@dstillman dstillman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also have this output the necessary lines to copy into .htaccess to map locale parameter values to the files between written. We mostly won't need to update them, but good to keep them coupled to the available locales, and the logic might need tweaking.

And for that logic, we basically want to implement this code:

https://github.com/zotero/zotero/blob/3e12f3f20b097d56006c040ea466fb422ba78308/chrome/content/zotero/xpcom/utilities_internal.js#L1581-L1620

…but entirely with rewrite rules. (I haven't totally though through how possible this is, but I think it's mostly doable by writing out a lot of rules in a given order. Let's see what we can do.)

en-US → en-US (exact match)
ar → ar (exact match)
de-DE → de (matching language part)
ca → ca-AD (matching language part)
en-NZ → en-US (prefer en-US over other available en locales for inexact match)
pt → pt-PT (prefer country code matching language code if unspecified — this one's debatable, and 217 million Brazilians would presumably disagree, but it's what we do elsewhere)
zh → zh-CN (this we just get from sorting the country codes, but it makes sense for our userbase)
zz → full file (ignore locale parameter if language code is unknown)

schema_text = f.read()
schema = json.loads(schema_text)

if not os.path.exists('../locales'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Script should be runnable from any folder. You're already getting the parent folder above, so you should just use that for other paths.

You should also wipe locales to remove any existing files. (Locales will pretty much never be removed, but just to fix possible bugs, etc.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, this has been done.

for creator_type in item_type['creatorTypes']:
if creator_type['creatorType'] in current_locale["itemTypes"]:
creator_type['creatorType'] = current_locale["itemTypes"][creator_type['creatorType']]
del schema_with_one_locale['locales']
Copy link
Member

@dstillman dstillman May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely don't need all of this — sorry for not specifying. Can just keep the one locale keyed properly in locales, as below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of it!

@abaevbog
Copy link
Author

abaevbog commented May 2, 2023

Added the generation of the .htaccess file. The example output is as below:

RewriteRule ^schema/(af-ZA|ar|bg-BG|br|ca-AD|cs-CZ|da-DK|de|el-GR|en-GB|en-US|es-ES|et-EE|eu-ES|fa|fi-FI|fr-FR|gl-ES|he-IL|hr-HR|hu-HU|id-ID|is-IS|it-IT|ja-JP|km|ko-KR|lt-LT|mn-MN|nb-NO|nl-NL|nn-NO|pl-PL|pt-BR|pt-PT|ro-RO|ru-RU|sk-SK|sl-SI|sr-RS|sv-SE|th-TH|tr-TR|uk-UA|vi-VN|zh-CN|zh-TW)$ /zotero-schema/locales/$1.gz [L]
...
RewriteRule ^schema/he(-.*)?$ /zotero-schema/locales/he-IL.gz [L]
RewriteRule ^schema/hr(-.*)?$ /zotero-schema/locales/hr-HR.gz [L]
RewriteRule ^schema/hu(-.*)?$ /zotero-schema/locales/hu-HU.gz [L]
RewriteRule ^schema/id(-.*)?$ /zotero-schema/locales/id-ID.gz [L]
...
RewriteRule ^schema/* /zotero-schema/schema.json.gz [L]

Each of these rules matches /schema/{country_code}-... or /schema/{country_code} but not /schema/{country_code}something_else

@abaevbog abaevbog requested a review from dstillman May 2, 2023 19:58
os.mkdir(locales_folder)

# String that accumulates the rules to paste into htaccess
htaccess_rules = f"RewriteRule ^schema/({'|'.join(schema['locales'].keys())})$ /zotero-schema/locales/$1.gz [L]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a locale query string parameter, not a path component.

# Catch all for default schema with all locales
htaccess_rules += f'\nRewriteRule ^schema/* /zotero-schema/schema.json.gz [L]'

print("--- .htacess rules --- \n" + htaccess_rules + "\n--- ---")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can skip the header and footer

@abaevbog abaevbog requested a review from dstillman May 3, 2023 14:59
@dstillman
Copy link
Member

Can replace update-gz with update-gz.py, no extension, mode 755

@dstillman
Copy link
Member

dstillman commented May 4, 2023

I think we'll need some additional conditions/rules for the locale gz files, like we already have for schema.json.gz:

https://github.com/zotero/dataserver/blob/5fba6e4c4a9df97a305764a69ebf70a872d38cd8/htdocs/.htaccess#L22-L38

E.g., checking Accept-Encoding

Let's just generate the whole block in this script. And we don't need newlines — better to keep this as a single block.

@dstillman
Copy link
Member

Not sure it matters, but since we're generating this in a script anyway and know the total count, we can probably add a skip flag (S) to skip all of these rules unless the URL begins with schema. These will be checked for every single dataserver request, so we do want to avoid any slowdown.

(Will this all be meaningfully faster than just having a single schema.php file that generate a locale-specific file and dumped it in memcached? Unclear. But this is certainly the more fun way to do it…)

@abaevbog
Copy link
Author

abaevbog commented May 4, 2023

After testing it out with a version of dataserver I ran locally, I had to add one more condition RewriteCond %{SCRIPT_FILENAME} !/zotero-schema/ at the end for everything to work properly. This is a gist of the .htaccess file I ended up with

https://gist.github.com/abaevbog/23a986e6966000325f609652cd25e6ce

@dstillman
Copy link
Member

I think we can do a slightly simpler block:

https://gist.githubusercontent.com/dstillman/82db17dca11e2bda0bd87b4d27c02c89/raw/2398339bc434c5dfef7d44e8824b58df09122381/gistfile1.txt

Specifically:

  • Last for everything
  • Removed the lines with E=no-gzip:1 and move those things (setting correct content type, preventing double-gzip) into the FilesMatch
  • No need for zotero-schema in the Skip RewriteCond, since we're setting Last
  • No need for RewriteCond %{SCRIPT_FILENAME} !/zotero-schema/ at the end, since we're setting Last

@abaevbog
Copy link
Author

abaevbog commented May 5, 2023

Yes, you are right...
Adding the [L] flag solved whatever issues I had

htaccess_rules = f'''RewriteCond %{{REQUEST_URI}} !^/(schema|zotero-schema)
RewriteRule ".?" "-" [S=LINES_TO_SKIP]
htaccess_rules = f'''RewriteCond %{{REQUEST_URI}} !^/schema
RewriteRule ".?" "-" [S=LINES_TO_SKIP,L]
Copy link
Member

@dstillman dstillman May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L isn't right here, though — that breaks the dataserver completely by skipping the main redirect at the bottom of the file.

Copy link
Author

@abaevbog abaevbog May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean - I removed it.

For some reason when I was testing it on my local dataserver setup, all non /schema requests reached index.php as they were supposed to even with that L flag. Probably can blame it on my local dataserver or apache setup

RewriteCond %{{HTTP:Accept-Encoding}} !gzip
RewriteRule ^schema(/.*)?$ /zotero-schema/schema.json [QSD,L]
RewriteCond %{{QUERY_STRING}} (?:^|&)locale=({'|'.join(schema['locales'].keys())})(?:&|$)
RewriteRule ^schema(/.*)?$ /zotero-schema/locales/%1.json.gz [QSD]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing L here, so exact matches (e.g., locale=en-US) don't work

# For every country code, sort locale candidates and add rule to htacecss
for country_code in htaccess_mapings.keys():
htaccess_mapings[country_code].sort(key=locale_sort_key)
# Each rule is only applid is gzip encoding is accepted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applid isapplied if

@abaevbog abaevbog requested a review from dstillman May 8, 2023 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants