Import cirrus indexes to hdfs
Runs a weekly import from the codfw elasticsearch search cluster into hdfs.
- Uses the search_after functionality of elasticsearch to read the data (first sketch below).
- Splits the input into thousands of smaller partitions for reading (second sketch below). This was necessary because otherwise the 50G shards would take hours to finish, and any error during the import would have spark retrying the whole shard. Partitions are still lumpy, but with current settings the median size is 100M and the largest is 280M, which seems acceptable.
- The cirrus document schema isn't particularly exact; I had to work through and find inconsistencies to get avro to accept the data (third sketch below). We should probably try to apply some of these fixes upstream in cirrus.
- Includes a debug option that will instead print type errors into the executor logs. This was quite useful for tracking down which parts of the data spark/avro didn't like.
- Considered using parquet, but it showed memory issues both when writing and when reading back those files. Switched to avro, which still requires some extra memory to write but performed much better when reading back: fewer executors killed by yarn for excessive memory usage, and less memory overhead required. In a test run parquet needed 4G of memory overhead per executor to read the data back; avro only needed 768M (final sketch below).
- Considered using elasticsearch-hadoop, but it has no support for index patterns, which would have meant creating a separate dataframe for each of the thousands of indexes we want to read. Decided that was too much complication.
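
Roughly the shape of the search_after read, as a minimal sketch assuming the official elasticsearch python client; the function name and batch size are illustrative, not the actual code:

    from typing import Any, Dict, Iterator
    from elasticsearch import Elasticsearch

    def scan_index(es: Elasticsearch, index: str, batch_size: int = 1000) -> Iterator[Dict[str, Any]]:
        # search_after needs a stable total order; _doc is the cheapest
        # tiebreaker elasticsearch offers.
        body: Dict[str, Any] = {"size": batch_size, "sort": [{"_doc": "asc"}]}
        while True:
            hits = es.search(index=index, body=body)["hits"]["hits"]
            if not hits:
                return
            for hit in hits:
                yield hit["_source"]
            # Resume the next page from the sort values of the last hit.
            body["search_after"] = hits[-1]["sort"]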
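And a sketch of how that read might be fanned out into thousands of spark tasks; the page_id range partitioning, the index names, and the endpoint are hypothetical stand-ins for whatever the job actually splits on:

    from elasticsearch import Elasticsearch
    from pyspark import SparkContext

    def read_partition(task):
        index, lower, upper = task
        es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
        # Each task reads only a slice of the id space, so a failed task
        # retries a few hundred megabytes instead of a whole 50G shard.
        body = {
            "size": 1000,
            "query": {"range": {"page_id": {"gte": lower, "lt": upper}}},
            "sort": [{"page_id": "asc"}],
        }
        while True:
            hits = es.search(index=index, body=body)["hits"]["hits"]
            if not hits:
                return
            for hit in hits:
                yield hit["_source"]
            body["search_after"] = hits[-1]["sort"]

    sc = SparkContext.getOrCreate()
    indexes = ["enwiki_content", "enwiki_general"]  # illustrative names
    step = 100_000
    tasks = [(idx, lo, lo + step) for idx in indexes for lo in range(0, 10_000_000, step)]
    docs = sc.parallelize(tasks, numSlices=len(tasks)).flatMap(read_partition)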
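The kind of per-document fix-up and the debug mode look roughly like this; the field name and the specific inconsistency are hypothetical examples of what had to be handled:

    from typing import Any, Dict, Optional

    def coerce_document(doc: Dict[str, Any], debug: bool = False) -> Optional[Dict[str, Any]]:
        try:
            # Hypothetical example: a field that is sometimes a bare string
            # where the avro schema expects a list of strings.
            if isinstance(doc.get("external_link"), str):
                doc["external_link"] = [doc["external_link"]]
            return doc
        except Exception as e:
            if debug:
                # print() from inside a task ends up in the executor logs,
                # so the bad document gets reported instead of failing the task.
                print("rejected doc %s: %r" % (doc.get("page_id"), e))
                return None
            raise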
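The write side is roughly this shape; format("avro") needs the spark-avro package on the classpath, the output path and dataframe are illustrative, and the overhead value is the one from the test run above:

    from pyspark.sql import Row, SparkSession

    spark = (
        SparkSession.builder
        # 768M of executor memory overhead was enough when reading avro
        # back; parquet needed 4G per executor in the same test.
        .config("spark.executor.memoryOverhead", "768m")
        .getOrCreate()
    )
    # Illustrative stand-in for the real dataframe of cirrus documents.
    df = spark.createDataFrame([Row(page_id=1, title="Example")])
    df.write.format("avro").save("hdfs:///tmp/cirrus_indexes_example")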
Bug: T317023
Change-Id: I90ae77dfa6cc2a5b659b25d1d33959ff6af7f2e5