Import cirrus indexes to hdfs
Runs a weekly import from the codfw elasticsearch search cluster into hdfs.
- Uses the search_after functionality of elasticsearch to read the data (first sketch below).
- Splits the input into thousands of smaller partitions for reading (second sketch below). This was necessary because otherwise the 50G shards would take hours to finish, and any error during the import would have spark retrying the whole shard. Partitions are still lumpy, but with current settings the median size is 100M and the largest is 280M, which seems acceptable.
- The cirrus document schema isn't particularly exact; I had to work through and find inconsistencies to get avro to accept the data (third sketch below). We should probably try to apply some of these fixes upstream in cirrus.
- Includes a debug option that will instead print type errors into the executor logs. This was quite useful for tracking down which parts of the data spark/avro didn't like.
- Considered using parquet, but it showed memory issues both when writing and when reading back those files. Switched to avro, which still requires some extra memory to write but performed much better when reading back: fewer executors killed by yarn for excessive memory usage, and less memory overhead required. In a test run parquet needed 4G of memory overhead per executor to read the data back; avro only needed 768M (final sketch below).
- Considered using elasticsearch-hadoop, but it has no support for index patterns, which would have meant creating a separate dataframe for each of the thousands of indexes we want to read. Decided that was too much complication.
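
Roughly the shape of the search_after read, as a minimal sketch assuming the official elasticsearch python client; the function name and batch size are illustrative, not the actual code:

    from typing import Any, Dict, Iterator
    from elasticsearch import Elasticsearch

    def scan_index(es: Elasticsearch, index: str, batch_size: int = 1000) -> Iterator[Dict[str, Any]]:
        # search_after needs a stable total order; _doc is the cheapest
        # tiebreaker elasticsearch offers.
        body: Dict[str, Any] = {"size": batch_size, "sort": [{"_doc": "asc"}]}
        while True:
            hits = es.search(index=index, body=body)["hits"]["hits"]
            if not hits:
                return
            for hit in hits:
                yield hit["_source"]
            # Resume the next page from the sort values of the last hit.
            body["search_after"] = hits[-1]["sort"]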
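And a sketch of how that read might be fanned out into thousands of spark tasks; the page_id range partitioning, the index names, and the endpoint are hypothetical stand-ins for whatever the job actually splits on:

    from elasticsearch import Elasticsearch
    from pyspark import SparkContext

    def read_partition(task):
        index, lower, upper = task
        es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
        # Each task reads only a slice of the id space, so a failed task
        # retries a few hundred megabytes instead of a whole 50G shard.
        body = {
            "size": 1000,
            "query": {"range": {"page_id": {"gte": lower, "lt": upper}}},
            "sort": [{"page_id": "asc"}],
        }
        while True:
            hits = es.search(index=index, body=body)["hits"]["hits"]
            if not hits:
                return
            for hit in hits:
                yield hit["_source"]
            body["search_after"] = hits[-1]["sort"]

    sc = SparkContext.getOrCreate()
    indexes = ["enwiki_content", "enwiki_general"]  # illustrative names
    step = 100_000
    tasks = [(idx, lo, lo + step) for idx in indexes for lo in range(0, 10_000_000, step)]
    docs = sc.parallelize(tasks, numSlices=len(tasks)).flatMap(read_partition)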
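The kind of per-document fix-up and the debug mode look roughly like this; the field name and the specific inconsistency are hypothetical examples of what had to be handled:

    from typing import Any, Dict, Optional

    def coerce_document(doc: Dict[str, Any], debug: bool = False) -> Optional[Dict[str, Any]]:
        try:
            # Hypothetical example: a field that is sometimes a bare string
            # where the avro schema expects a list of strings.
            if isinstance(doc.get("external_link"), str):
                doc["external_link"] = [doc["external_link"]]
            return doc
        except Exception as e:
            if debug:
                # print() from inside a task ends up in the executor logs,
                # so the bad document gets reported instead of failing the task.
                print("rejected doc %s: %r" % (doc.get("page_id"), e))
                return None
            raise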
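The write side is roughly this shape; format("avro") needs the spark-avro package on the classpath, the output path and dataframe are illustrative, and the overhead value is the one from the test run above:

    from pyspark.sql import Row, SparkSession

    spark = (
        SparkSession.builder
        # 768M of executor memory overhead was enough when reading avro
        # back; parquet needed 4G per executor in the same test.
        .config("spark.executor.memoryOverhead", "768m")
        .getOrCreate()
    )
    # Illustrative stand-in for the real dataframe of cirrus documents.
    df = spark.createDataFrame([Row(page_id=1, title="Example")])
    df.write.format("avro").save("hdfs:///tmp/cirrus_indexes_example")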
Bug: T317023
Change-Id: I90ae77dfa6cc2a5b659b25d1d33959ff6af7f2e5