Rakun Extractor¶
Rakun Extractor is a tool for extracting keywords from texts. It is based on RaKUn, an unsupervised graph-based keyword extraction method.
Creation¶
Parameters¶
The following section gives an overview of Rakun Extractor’s input parameters.
- description:
Name of the Rakun Extractor model to easily distinguish it from the other Rakun instances with different parameters.
- distance_method:
Method used while pruning the graph by merging similar words/phrases into a single node.
- Supported options:
- editdistance:
Uses Levenshtein distance for pruning the graph.
- fasttext:
Uses a FastText Embedding for pruning the graph.
Note
To use a FastText embedding as the distance method, the embedding must first be trained using the Embedding app.
- distance_threshold:
The distance below which two words are treated as the same word: the words are merged if the difference between them is lower than the set threshold. If the selected distance_method is editdistance, the threshold should be an integer and is interpreted as a Levenshtein distance. If the selected distance_method is fasttext, the threshold should be a float in the range [0, 1].
- num_keywords:
Maximum number of keywords that should be returned.
Note
The algorithm’s running time doesn’t depend on the number of keywords returned, so extracting the top 3 keywords takes as much time as extracting the top 50. As the probability of each keyword is also returned, it is usually more reasonable to set the number of extracted keywords higher and filter out the most relevant ones afterwards, although this depends on the task.
- pair_diff_length:
The maximum difference in length for words to be considered as candidates for merging.
Note
NB! Pay attention to how this parameter interacts with distance_method and distance_threshold! For example, consider the words “gift” and “present”. If pair_diff_length is set to 2, the words are automatically treated as different, because the difference in their lengths (7 - 4 = 3) is greater than 2, the value of pair_diff_length.
- bigram_count_threshold:
Minimum number of times a word pair must occur in a text for it to be considered a bigram.
- min_tokens:
Minimum number of words in a keyword.
- max_tokens:
Maximum number of words in a keyword.
- max_similar:
Maximum number of overlapping words allowed in bi- and trigrams.
Note
Used only if the value of the parameter max_tokens is greater than 1.
- max_occurrence:
Maximum frequency of a word/phrase for it to be considered as a possible keyword candidate.
- fasttext_embedding:
The FastText embedding used for pruning the graph.
Note
Relevant only if the value for distance_method is “fasttext”.
- stopwords:
A list of words to ignore as potential keywords.
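The interaction between pair_diff_length and distance_threshold described above can be illustrated with a small Python sketch. It covers only the editdistance case and is not RaKUn's actual implementation; the function names `levenshtein` and `is_merge_candidate` are hypothetical and exist only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def is_merge_candidate(a: str, b: str, pair_diff_length: int, distance_threshold: int) -> bool:
    """Hypothetical helper mirroring the parameter descriptions above."""
    # Words whose lengths differ by more than pair_diff_length are never merged,
    # so the distance is not even computed for them.
    if abs(len(a) - len(b)) > pair_diff_length:
        return False
    # Otherwise the words are merged if their distance is lower than the threshold.
    return levenshtein(a, b) < distance_threshold


print(is_merge_candidate("gift", "present", pair_diff_length=2, distance_threshold=10))
print(is_merge_candidate("road", "roads", pair_diff_length=2, distance_threshold=2))
```

The first call reproduces the “gift”/“present” case from the note: the length difference of 3 exceeds pair_diff_length = 2, so the pair is rejected however large distance_threshold is.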
GUI¶
To create a new Rakun Extractor, navigate to “Models” -> “Rakun Extractors” as seen in Fig. 72.
If the navigation is successful, you should see a panel similar to Fig. 73 with a “Create” button in the top left corner of the page.
Clicking on the “Create” button opens a modal window titled “New Rakun Extractor”, as depicted in Fig. 74.
Fill in the required fields and click on the “Create” button in the bottom right corner of the window (Fig. 75).
The created Rakun Extractor can now be seen as the first (or only, if no previous Rakun Extractors exist under the project) row in the table of Rakun Extractors (Fig. 76).
API¶
Endpoint /projects/{project_pk}/rakun_extractors/
Example:
curl -X POST "http://localhost:8000/api/v1/projects/1/rakun_extractors/" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Token 8229898dccf960714a9fa22662b214005aa2b049" \
  -d '{
    "description": "Test Rakun Extractor",
    "distance_method": "editdistance",
    "distance_threshold": 2,
    "num_keywords": 15,
    "pair_diff_length": 3,
    "stopwords": ["and", "or", "if", "why", "when"],
    "bigram_count_threshold": 3,
    "min_tokens": 1,
    "max_tokens": 2,
    "max_similar": 3,
    "max_occurrence": 10
  }'
Response:
{
  "id": 38,
  "url": "http://localhost:8000/api/v1/projects/1/rakun_extractors/38/",
  "author_username": "test",
  "description": "Test Rakun Extractor",
  "distance_method": "editdistance",
  "distance_threshold": 2.0,
  "num_keywords": 15,
  "pair_diff_length": 3,
  "stopwords": [
    "and",
    "or",
    "if",
    "why",
    "when"
  ],
  "bigram_count_threshold": 3,
  "min_tokens": 1,
  "max_tokens": 2,
  "max_similar": 3,
  "max_occurrence": 10,
  "fasttext_embedding": null,
  "task": null
}
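The same request can be prepared from Python. The following is a minimal sketch using only the standard library: it builds the POST request without sending it and checks distance_threshold against the rules given in the parameter overview. The helper names `threshold_is_valid` and `build_create_request` are hypothetical, and the token is a placeholder; the URL layout and field names are taken from the curl example above.

```python
import json
import urllib.request


def threshold_is_valid(distance_method: str, distance_threshold: float) -> bool:
    """Check distance_threshold against the documented rule for each method."""
    if distance_method == "editdistance":
        # Levenshtein distances are whole, non-negative numbers.
        return float(distance_threshold).is_integer() and distance_threshold >= 0
    if distance_method == "fasttext":
        return 0 <= distance_threshold <= 1
    return False  # only the two documented methods are accepted here


def build_create_request(base_url: str, project_id: int, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a Rakun Extractor."""
    # This sketch assumes both keys are present in the payload.
    if not threshold_is_valid(payload["distance_method"], payload["distance_threshold"]):
        raise ValueError("distance_threshold does not match distance_method")
    return urllib.request.Request(
        f"{base_url}/projects/{project_id}/rakun_extractors/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


payload = {
    "description": "Test Rakun Extractor",
    "distance_method": "editdistance",
    "distance_threshold": 2,
    "num_keywords": 15,
}
request = build_create_request("http://localhost:8000/api/v1", 1, "YOUR_TOKEN", payload)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a Toolkit instance.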
Usage¶
The following section covers the most important functionalities of Rakun Extractor.
Extract from text¶
The “Extract from text” function extracts keywords from a single text using a selected Rakun Extractor model.
GUI¶
To extract keywords with an existing Rakun Extractor model, navigate to “Models” -> “Rakun Extractors” as seen in Fig. 72.
Select the model you wish to use and open its options menu, marked with three vertical dots, as seen in Fig. 77.
Select the option “Extract from text” from the selection menu as seen in Fig. 78.
Selecting the option opens a new modal window “Extract From Text”. Enter the text from which you wish to extract keywords and click on the “Extract” button in the bottom right corner of the panel (Fig. 79).
The results are displayed in the same modal window as seen in Fig. 80.
API¶
Endpoint /projects/{project_pk}/rakun_extractors/{id}/extract_from_text/
Example:
curl -X POST "http://localhost:8000/api/v1/projects/1/rakun_extractors/38/extract_from_text/" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Token 8229898dccf960714a9fa22662b214005aa2b049" \
  -d '{
    "text": "The Tallinn Urban Environment and Public Works Department has released a warning that the city\u0027s streets may become exceptionally slippery due to falling temperatures. \"The difficult winter is coming to an end, but this sometimes makes the weather even more problematic. The rain, combined with sub-zero temperatures, will inevitably lead to dangerous icing conditions on both pathways and roads. All our partners will start de-icing at the first sign of icy roads, but unfortunately, it is simply not possible to do it everywhere at once. Therefore, we urge road users to be careful and remind property owners that granite aggregate is the best way for de-icing,\" Vladimir Svet (Center), Deputy Mayor of Tallinn said.",
    "add_spans": true
  }'
Response:
{
"rakun_id": 38,
"desscription": "Test Rakun Extractor",
"result": true,
"text": "The Tallinn Urban Environment and Public Works Department has released a warning that the city's streets may become exceptionally slippery due to falling temperatures. \"The difficult winter is coming to an end, but this sometimes makes the weather even more problematic. The rain, combined with sub-zero temperatures, will inevitably lead to dangerous icing conditions on both pathways and roads. All our partners will start de-icing at the first sign of icy roads, but unfortunately, it is simply not possible to do it everywhere at once. Therefore, we urge road users to be careful and remind property owners that granite aggregate is the best way for de-icing,\" Vladimir Svet (Center), Deputy Mayor of Tallinn said.",
"keywords": [
{
"fact": "Test Rakun Extractor",
"str_val": "temperatures",
"spans": "[[154, 166]]",
"doc_path": "text",
"probability": 0.5953014184397163
},
{
"fact": "Test Rakun Extractor",
"str_val": "temperatures",
"spans": "[[305, 317]]",
"doc_path": "text",
"probability": 0.5953014184397163
},
{
"fact": "Test Rakun Extractor",
"str_val": "roads",
"spans": "[[391, 396]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "roads",
"spans": "[[460, 465]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "de-icing",
"spans": "[[426, 434]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "de-icing",
"spans": "[[655, 663]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "environment",
"spans": "[[18, 29]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "conditions",
"spans": "[[359, 369]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "released",
"spans": "[[62, 70]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "vladimir",
"spans": "[[666, 674]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "icing",
"spans": "[[353, 358]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "icing",
"spans": "[[429, 434]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "icing",
"spans": "[[658, 663]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "public",
"spans": "[[34, 40]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "department",
"spans": "[[47, 57]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "urban",
"spans": "[[12, 17]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "become",
"spans": "[[109, 115]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "exceptionally",
"spans": "[[116, 129]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "center",
"spans": "[[681, 687]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "slippery",
"spans": "[[130, 138]]",
"doc_path": "text",
"probability": 0.41976950354609927
}
]
}
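As the response shows, a keyword that occurs several times in the text is returned once per occurrence, and each "spans" value is a JSON-encoded string rather than a nested list. The sketch below shows one possible way to post-process such a response in Python: it drops keywords below a chosen probability, groups repeated keywords, and decodes their spans. The helper name `summarize_keywords` and the cut-off value are assumptions made for illustration; only the field names come from the response above.

```python
import json


def summarize_keywords(response: dict, min_probability: float = 0.0) -> list:
    """Group repeated keywords and keep those at or above min_probability."""
    grouped = {}
    for item in response["keywords"]:
        if item["probability"] < min_probability:
            continue
        entry = grouped.setdefault(
            item["str_val"],
            {"keyword": item["str_val"], "probability": item["probability"], "spans": []},
        )
        # "spans" arrives as a string such as "[[154, 166]]", so decode it first.
        entry["spans"].extend(tuple(span) for span in json.loads(item["spans"]))
    # Highest probability first; ties keep the order in which they were returned.
    return sorted(grouped.values(), key=lambda entry: entry["probability"], reverse=True)


# A shortened copy of the response above, used as sample input.
sample_response = {
    "keywords": [
        {"str_val": "temperatures", "spans": "[[154, 166]]", "probability": 0.5953014184397163},
        {"str_val": "temperatures", "spans": "[[305, 317]]", "probability": 0.5953014184397163},
        {"str_val": "roads", "spans": "[[391, 396]]", "probability": 0.5048758865248227},
        {"str_val": "urban", "spans": "[[12, 17]]", "probability": 0.41976950354609927},
    ]
}

for entry in summarize_keywords(sample_response, min_probability=0.5):
    print(entry["keyword"], entry["probability"], entry["spans"])
```

This follows the advice given for num_keywords: request a generous number of keywords and filter by probability afterwards, with the cut-off chosen per task.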