Task 2018

Highload Cup 2018 (Task)

Welcome to Highload Cup 2018! Carefully read the rules and the technical task outlined below. If something is not clear, or doesn't work or you've got some ideas how to improve our system, you can always write to cups@corp.mail.ru or to our chat in Telegram. Good luck!

Contest legend

In an alternate reality, humanity decided to create and launch a global "other half" search system. Such a system is designed to reduce the number of single people in the world and contribute to the creation of strong families.

Participants of the Highload Cup 2018 competition are invited to act as an engineer who was asked to create a prototype of such a system. The prototype should, as soon as possible, send the correct responses to requests from third-party services that do something with the responses (for example, they may display it to users in shiny user interfaces). In essence, it should serve as a functional API for external hypothetical services.

1. Competition rules

To get started, %username%, launch your favourite IDE and download an archive with test data in JSON format from contest website https://highloadcup.ru. You need to first create and then deploy a productive application server that will implement the necessary Web API to this data. Pay attention that solutions must be packed as docker images. You can learn more about it by reading our article below.

You can use any web technologies that you can find or come up with. Choose your own programming language and a framework. This could be C++, Java + Tomcat, Python + Django, Ruby + RoR, GoLang, JavaScript + NodeJs, Haskell or something else you like. Also for data storage: MySQL, PostgreSQL, Redis, MongoDB, caches - all is up to you! Pay attention, we estimate not only the number of correct answers to requests, but also the speed of the server - choose carefully!

First test your solution locally on the test data. And when you are ready, build a docker image from it and upload it into the system of the competition. The corresponding commands are written on the task page. After the container is uploaded, a record will appear on the same page about the accepted solution and about putting it in a queue for preliminary shelling.

Tip 1. If the same solution is uploaded twice unchanged, then the system will not start the shelling. In each decision there should be, albeit minimal, differences from the previous ones.

Shelling scenario:

The solution will be sent to a testing machine with an Intel Core i7 processor. The solution will be allocated 4 cores of 2.4 GHz, 2 GB of RAM and 10 GB of hard disk.
The solution will run as a docker container (docker run). In case of launch errors, it will be shown on the page with the shelling log.

After launching the countainer, the data.zip will be available in /tmp/data directory. It contains "production" data (approximately 10 MB of data for the preliminary and 1 GB for a full shelling). Please note that the /tmp/data directory is read-only, so the solution must load the archive into RAM for processing. The archive itself will contain files with names like "accounts_<file_number>.json". These files content is a valid JSON data.
An example of the structure is shown below:

{"accounts": [
    {
        "id": 10003,
        "fname": "Мария",
        "email": "ewheten@icloud.com",
        "interests": [
            "Красное вино",
            "Стейк",
            "Вкусно поесть"
        ],
        "status": "свободны",
        "premium": {
            "start": 1533321770,
            "finish": 1533321770
        },
        "sex": "f",
        "phone": "8(985)4076805",
        "likes": [
            {
                "ts": 1476378752,
                "id": 41803
            },
            ...
        ],
        "birth": 870172195,
        "city": "Испляндия",
        "country": "Кроноштадт",
        "joined": 1450137600
    },
    ... // and many other accounts below
]

The solution has a fixed time before the start of the shelling, in order to fill this data into its own database and prepare it for processing (1 minute for the preliminary and 10 minutes for the full shelling).
After this time, shelling begins with requests from those specified in the API section. The duration of the shelling is 90 seconds for the preliminary and 9 minutes for the full rating shelling. Please note that the server must listen to port 80 in order for the shelling to be successful! Requests come with the Host: accounts.com header over HTTP/1.1 with re-used connections (keep-alive). Network losses are completely absent.
You will see the results and errors of shelling on the website, on the page with the details of the solution in the sections "Shelling" and "Results", respectively.

Tip 2. In case of hacker attacks on the servers of the Highload Cup 2018 competition are noticed, the participant will be banned, and the results of the shelling won't be counted.

Please, note! Preliminary shelling starts automatically and is intended to test solutions at low load. For such a shelling, the results are shown in the form of graphs, but not rated. To participate in the rating, you must manually launch rating shelling, which is carried out in much more hardcore conditions. The number of rating shelling sessions is limited, 4 launches in 12 hours.

The results of rating attacks on all participants will be tabulated on the site. The best of the best will receive prizes!

2. Description of the subject area

Both in test, and in the "production" data there are records about one entity: Account. It describes all known information about the user - his name, contacts, interests, revealed sympathy for other users. Guaranteed correctness of the data provided in accordance with the following types and restrictions.

One Account record (Profile) contains following personal data:

id - unique external user id. It is set by shelling system and then used to validate server responses. Type — 32-bit integer.
email - user's email address. Type - unicode-string with length up to 100 characters. Unique.
fname и sname - first name and the last name respectively. Type - unicode-string with length up to 50 characters. These fields are optional and may not be present in a specific record.
phone - mobile phone number. Type - unicode-string with length up to 16 characters. This field is optional but specified values are unique. Specified quite rarely.
sex - unicode-string. "m" is male, "f" - female.
birth - birth date, saved as count of seconds from start of the UNIX Epoch by UTC (in other words, it's a timestamp). Limited to 01/01/1950 from the bottom and 01/01/2005 from above.
country - country of residence. Type - unicode-string with length up to 50 characters. This field is optional.
city - city of residence. Type - unicode-string with length up to 50 characters. This field is optional and is filled rarely. Each city is located in a particular country.

Tip 3. All data is randomly generated and not related to real people, contacts or places, even if there were coincidences. The data generator code does not use third-party solutions, except for importing the random, datetime, calendar and string modules from the standard Python library.

Also in the Account record there are fields specific to the "other half" search system:

joined - user's sign up date. Type - timestamp limited to 01/01/2011 from the bottom and to 01/01/2018 from above.
status - user's current status. Type - one of the following options: "свободны" (looking for relationship), "заняты" (currently in a relationship), "всё сложно" (it's complicated).
interests - user's interests. Type - array of unicode-strings, may be empty. Strings have length up to 100 characters.
premium - the beginning and end of the premium period in the system (when users really wanted to find an “other half” and they made a cash contribution). In JSON this field is represented by the nested object with the fields start and finish, where the timestamps with the lower boundary 01/01/2018.
likes - array of known user likes, possibly empty. All sympathies are at odds with each one and is an object of the following fields:
- id - id of another account to which sympathy is present. Account id in the source data always exists. In the data there can be several likes with the same id.
- ts - timestamp when sympathy was recorded in the system.

3. Required API description

The API is the http request schemas that the server developed by the member must maintain. URLs are built according to the REST paradigm. The angle brackets indicate parts of the URL that can and will vary from request to request.

All responses from the server take into account the headers Content-Type, Content-Length, Connection.

Tip 4. All examples are further nicely and neatly formatted for easier perception. Server responses do not include formatting. Cyrillic and special characters in the URL are encoded by the python function urlencode.

Data retreival requests (GET):

Retreiving list of users: /accounts/filter/

This API method is planned to be used to search for users by predefined or desired fields. For example, someone wanted to see all people of a certain age and gender who live in a particular city.

The response body is expected to have a {"accounts": [...]} structure with users whose data correspond to the restrictions specified in the GET parameters. For each matching account entry, it is not necessary to transfer all the data known about it, but only the id, email and those used in the request.

20.12.2018: this request do not use interests and likes in output anymore. This is done by restrictions of server storages.

As a result, users should be sorted by descending values in the id field. The number of records to be selected is limited by the required GET parameter limit.

The remaining GET parameters are configured as <field>_<predicate>. For different fields, only certain filtering predicates can be used, which are listed in the table below. In this query, the action of several parameters is added, that is, first filtering one by one, then filtering the result by the second, and so on.

#	Field	Possible predicates
1	sex	`eq` - matching specific gender - "m" или "f";
2	email	`domain` - select all whose emails have the specified domain; `lt` - select all whose emails are lexicographically earlier; `gt` - same but lexicographically later;
3	status	`eq` - matching specific status; `neq` - select all whose status not matching specified one;
4	fname	`eq` - matching specific first name; `any` - in any of the names listed separated by commas; `null` - select all those who have a first name specified (if 0) or not (if 1);
5	sname	`eq` - matching specifict last name; `starts` - select all whose last names begin with the transmitted prefix; `null` - select all those who have a last name specified (if 0) or not (if 1);
6	phone	`code` - select everyone who has a specific code on the phone (three digits in brackets); `null` - similar to the rest of the fields;
7	country	`eq` - everyone who lives in a particular country; `null` - similar;
8	city	`eq` - everyone who lives in a particular city; `any` - in any of the cities listed separated by commas; `null` - similar;
9	birth	`lt` - select anyone born before the specified date; `gt` - select anyone born after the specified date; `year` - select anyone who was born at that year
10	interests	`contains` - select everyone who has all the listed interests; `any` - select everyone who has any of the listed interests;
11	likes	`contains` - select everyone who likes all listed users (in the value - comma-separated ids);
12	premium	`now` - all who have a premium for the current date; `null` - similar

Of course, we do not generate sets of these parameters by absolute randomness. Only certain combinations are allowed and each field has the probability of including it in the request. Combinations and probabilities are chosen for a reason - try to use this knowledge to your advantage.

Sample request and correct response to it::

GET: /accounts/filter/?status_neq=всё+сложно&birth_lt=643972596&country_eq=Индляндия&limit=5&query_id=110

{
    "accounts": [
        {
            "email": "monnorakodehrenod@list.ru",
            "country": "Индляндия",
            "id": 99270,
            "status": "заняты",
            "birth": 581863572
        },{
            "email": "erwirarhadmemeddifde@yahoo.com",
            "country": "Индляндия",
            "id": 98881,
            "status": "свободны",
            "birth": 640015608
        },{
            "email": "rupewseor@rambler.ru",
            "country": "Индляндия",
            "id": 98828,
            "status": "заняты",
            "birth": 604256977
        },{
            "email": "fiotnefaersohhev@inbox.ru",
            "country": "Индляндия",
            "id": 98804,
            "status": "свободны",
            "birth": 596799123
        },{
            "email": "geslasereshedot@yahoo.com",
            "country": "Индляндия",
            "id": 98718,
            "status": "свободны",
            "birth": 640919302
        }
    ]
}

Tip 5. In the case of an unknown field or an unresolved predicate, code 400 with an empty body is expected in the response. In all other cases, a 200 response is expected, even if no users were found.

Tip 6. All API requests have a technical GET parameter query_id. The solution should simply ignore this parameter, since it does not require any action.

Splitting users into groups: /accounts/group/

This API method is planned to be used to create reports on the system operation. The fields by which the grouping is performed are transferred in the GET parameter keys, separated by commas. They are not as numerous as in the user filtering request. There are five fields for grouping - sex, status, interests, country, city.

Before grouping, it is necessary to perform a selection as in the previous query, but by specific values, and not by predicates. For example, if country=Алания is specified in the GET parameters, then grouping is performed only by users from this country. Sampling can go on any field, but the value in it will be only one (for likes there will be only one id, for interests only one line, for birth and joined - there will be one number - year).

The response body is expected to have a {"groups": [...]} structure with a list of groups. Each group must have the keys, which were grouped with the corresponding specific values. For each identified group, it is necessary to calculate how many users it has got into and to record as a result using the count key. That is, the aggregation function for this query is only one and this is a count.

As a result, it is necessary to return not all identified groups, but only the N largest ones or the N smallest ones. The number N is given by the GET parameter limit=N, and whether to return first large or small first - with the GET parameter order=-1 or order=1 respectively. In response, there may be groups with the same count and this can create problems at the stage of validating responses. Please sort these groups among themselves by the values of other fields in the order specified by order.

Please note that N does not exceed 50. Perhaps this fact will help you with optimization :)

Sample request and correct response to it:
```
GET: /accounts/group/?birth=1998&limit=4&order=-1&keys=country
```
(return 4 countries with the most users since the year of birth 1998)
```
{"groups": [
    {"country": "Малатрис", "count": 8745},
    {"country": "Алания", "count": 4390},
    {"country": "Финляндия", "count": 2100},
    {"country": "Гератрис", "count": 547}
]}
```
Tip 7. If unexpected grouping fields or unknown GET parameters appear in the request, code 400 with an empty response body is expected in the response.
Compatibility recommendations: /accounts/<id>/recommend/

This query is used to search for the "other half" by the specified user data. The request passes the id of the user for whom those who are best compatible by status, age and interests are searched. The decision should check compatibility only with the opposite sex (we are not against sexual minorities and condemn discrimination, it just happened :)). If a country or city with the keys country and city, respectively, is transmitted in a GET request, then you need to search only among those living in the specified location.

The response is expected code 200 and the structure {"accounts": [...]} or code 404, if the user with the required id is not found in the stored data. The accounts key should be N users sorted in descending order of compatibility with the indicated id. The number N is specified in the request with the GET parameter limit and is no more than 20.

Compatibility is defined as a function of two users: compatibility = f (me, somebody). The function is built by the participants themselves, but so as to comply with the following rules:
1. The greatest contribution to compatibility is given by the status "свободны". Those who are "все сложно" go secondarily, and the "заняты" in the third and last (very likely they will not be in the response).
2. Next comes compatibility for interests. The more matched interests of users, the more compatible they are.
3. The third parameter is the difference in age. The greater the difference, the less compatibility.
4. Those who have activated a premium account, push through to the very top, ahead of ordinary users. If there are several, they are sorted by compatibility with each other.
5. If there are no common interests, then users should be considered completely incompatible from compatibility = 0.
Only the following fields should be displayed in the final list: id, email, status, fname, sname, birth, premium, interests . If the answer contains equally compatible users (the same status, interests, birth), then output them in ascending order by id

20.12.2018: this request do not use interests in output anymore. This is done by restrictions of server storages.

Sample request and correct response to it:
```
GET: /accounts/89528/recommend/?country=Индция&limit=8&query_id=151
```
(return the 8 most compatible with user id=89528 in the country "Индция")
```
{
    "accounts":  [
        {
            "email": "heernetletem@me.com",
            "premium": {"finish": 1546029018.0, "start": 1530304218},
            "status": "свободны",
            "sname": "Данашевен",
            "fname": "Анатолий",
            "id": 35473,
            "birth": 926357446
        },{
            "email": "teicfiwidadsuna@inbox.com",
            "premium": {"finish": 1565741391.0, "start": 1534205391},
            "status": "свободны",
            "id": 23067,
            "birth": 801100962
        },{
            "email": "nonihiwwahigtegodyn@inbox.com",
            "premium": {"finish": 1557069862.0, "start": 1525533862},
            "status": "свободны",
            "sname": "Стаметаный",
            "fname": "Виталий",
            "id": 90883,
            "birth": 773847481
        }
    ]
}
```
Tip 8. If there is no user in the stored data with the passed id, then 404 code is expected with an empty response body.
Selection for similar sympathies: /accounts/<id>/suggest/

This type of query is similar to the previous one in that it is also about searching for the "other half". The user id for which we are looking for the "other half" is sent in a similar way and the GET parameter limit is used in the same way. Differences are in the implementation. Now we are looking for people of the same sex who like the same “likes” and offer those whom they recently liked themselves. If the country or city GET parameter is transmitted in the request, then you should look for "similar sympathies" only in a certain location.

Similarity between sympathies is defined as a function: similarity = f (me, account), which is calculated uniquely as the sum of fractions 1 / abs (my_like ['ts'] - like ['ts']), where my_like and like are sympathy for the same user. For a fraction, where my_like ['ts'] == like ['ts'], replace the fraction with 1. If there are no common likes, then users should be considered completely different from similarity = 0. If one account has several likes on the same user with different dates, then the average of their dates is used in the formula.

The response contains a list of those who have not yet been liked by the user with the specified id, but whom the users with the most similar likes liked. Sort by similarity descending, and between likes of one such user - by descending of the like date.

Sample request and correct response to it:
```
GET: /accounts/51774/suggest/?country=Испляндия&limit=6&query_id=152
```
```
{
    "accounts": [
        {
             "email": "itwonudiahsu@yandex.ru",
             "id": 94155,
             "status": "заняты",
             "fname": "Никита"
        },{
             "email": "neeficyreddohypot@ymail.com",
             "id": 93449,
             "status": "свободны",
             "fname": "Иван"
        },{
             "email": "sotheralnes@inbox.ru",
             "id": 89997,
             "sname": "Лукетатин",
             "fname": "Руслан",
             "status": "заняты"
        },{
             "email": "kihatneselritunuwryr@ya.ru",
             "id": 88119,
             "sname": "Лукушутин",
             "fname": "Николай",
             "status": "свободны"
        },{
             "email": "otnideonfomedec@icloud.com",
             "id": 87873,
             "status": "свободны",
             "sname": "Фаетавен",
             "fname": "Сидор"
        },{
             "email": "poodreantasis@me.com",
             "id": 85461,
             "sname": "Даныкалан",
             "fname": "Вадим",
             "status": "заняты"
        },
    ]
}
```
Tip 9. If there is no user in the stored data with the passed id, then 404 code is expected with an empty response body.

Data change requests (POST):

Add new user: /accounts/new/

This query simply adds a new user record to the stored data. New data is recorded in the request body in JSON format. It is assumed that the solution itself will control the uniqueness of the fields and data types.

In response, code 201 is expected with an empty JSON in the response body ({}) if the creation of a new user was successful. In the case of incorrect data types or unknown keys, you must return the code 400 with an empty body.

Sample request and correct response to it:
```
POST: /accounts/new/
...
{
    "sname": "Хопетачан",
    "email": "orhograanenor@yahoo.com",
    "country": "Голция",
    "interests": [],
    "birth": 736598811,
    "id": 50000,
    "sex": "f",
    "likes": [
        {"ts": 1475619112, "id": 38753},
        {"ts": 1464366718, "id": 14893},
        {"ts": 1510257477, "id": 37967},
        {"ts": 1431722263, "id": 38933}
    ],
    "premium": {"start": 1519661251, "finish": 1522253251},
    "status": "всё сложно",
    "fname": "Полина",
    "joined": 1466035200
}
```
```
{}
```
Tip 10. In the case of incorrect data types, the presence of unknown keys or violation of uniqueness, you must return the code 400 with an empty body.
Update user data/accounts/<id>/

This request updates the data of a single user in the stored data. The request body in JSON format contains only updated fields and their values. The id field is never contained among the updateable fields and is sent to the request URL. It is assumed that the solution itself will check the uniqueness of the updated fields and data types.

In response, code 202 is expected with an empty JSON in the response body ({}) if the update was successful. If the record with the specified id does not exist in the available data, then the 404 code with an empty body is expected. If the record exists, but unknown fields are transmitted in the request body or value types are incorrect, then code 400 is expected.

Sample request and correct response to it:
```
POST: /accounts/46133/?query_id=308
...
{
    "birth": 664945551,
    "city": "Санктобирск",
    "email": "fywdolpa@yandex.ru",
    "status": "заняты",
    "country": "Алмаль"
}
```
```
{}
```
Tip 11. Similar to adding a new user, in the case of incorrect data types, the presence of unknown keys or violation of uniqueness, you need to return the code 400 with an empty body.
Add new likes: /accounts/likes/

This request adds a lot of new likes to many different users. This is how dating services work in life: the client application recommends suitable candidates, and users put a mark of sympathy (inspired by badoo). There are no restrictions on uniqueness in likes.

In the body of the request, the {"likes": [...]} structure is passed, where likes contains an array of objects with such keys:
- liker - id of the one who put the mark of sympathy;
- likee - id of the one who is marked;
- ts - timestamp when mark was set;
In response, code 202 is expected with an empty JSON in the response body ({}) if the update was successful. If the body of the request passed unknown fields or value types are incorrect, then code 400 is expected.

Sample request and correct response to it:
```
POST: /accounts/likes/?query_id=316
...
{"likes":[
    {"likee": 3929, "ts": 1464869768, "liker": 25486},
    {"likee": 13239, "ts": 1431103000, "liker": 26727},
    {"likee": 2407, "ts": 1439604510, "liker": 6403},
    {"likee": 26677, "ts": 1454719940, "liker": 22248},
    {"likee": 22411, "ts": 1481309376, "liker": 32820},
    {"likee": 9747, "ts": 1431850118, "liker": 43794},
    {"likee": 43575, "ts": 1499496173, "liker": 16134},
    {"likee": 29725, "ts": 1479087147, "liker": 22248}
]}
```
```
{}
```
Tip 12. If an unknown field or value type is passed in the request body, the code 400 with an empty body must be returned.

Tip 13. For all URLs that are not listed in the above API, a response with code 404 and an empty body is expected.

Now, dear participant, when you got acquainted with the rules for holding the Highload Cup 2018 and setting the task, it’s time to try and win!
We are Technopark Laboratory and Mail.ru Group. With all our hearts we wish you good luck!