Rakun Extractor
Rakun Extractor is a tool for extracting keywords from texts. It is based on RaKUn, an unsupervised graph-based keyword extraction method.
Creation
Parameters
The following section gives an overview of Rakun Extractor’s input parameters.
- description:
Name of the Rakun Extractor model to easily distinguish it from the other Rakun instances with different parameters.
- distance_method:
Method used while pruning the graph by merging similar words/phrases into a single node.
- Supported options:
- editdistance:
Uses Levenshtein distance for pruning the graph.
- fasttext:
Uses a FastText Embedding for pruning the graph.
Note
To use a FastText embedding as a distance method, the embedding must first be trained using the Embedding app.
- distance_threshold:
The distance below which two words are treated as the same word. Two words are merged if the distance between them is lower than the set threshold. If the selected distance_method is editdistance, the threshold should be an integer and is interpreted as a Levenshtein distance. If the selected distance_method is fasttext, the threshold should be a float in the range [0, 1].
- num_keywords:
Maximum number of keywords that should be returned.
Note
The algorithm's running time does not depend on the number of keywords returned, so extracting the top 3 keywords takes as long as extracting the top 50. Because the probability of each keyword is also returned, it is usually more reasonable to set the number of extracted keywords higher and filter out the most relevant ones afterwards, although this depends on the task.
- pair_diff_length:
The maximum difference in length for words to be considered as candidates for merging.
Note
NB! Pay attention to how this parameter interacts with distance_method and distance_threshold. For example, consider the words "gift" and "present". If pair_diff_length is set to 2, the words are automatically treated as different, because the difference in their lengths is 3 (4 vs. 7 characters), which is greater than the pair_diff_length value of 2.
- bigram_count_threshold:
The minimum number of times a bigram must occur in a text for it to be considered a bigram.
- min_tokens:
Minimum number of words in a keyword.
- max_tokens:
Maximum number of words in a keyword.
- max_similar:
Maximum number of overlapping words allowed in bi- and trigrams.
Note
Used only if the value of max_tokens is greater than 1.
- max_occurrence:
Maximum frequency of a word/phrase for it to be considered as a possible keyword candidate.
- fasttext_embedding:
A FastText embedding used for pruning the results.
Note
Relevant only if the value of distance_method is "fasttext".
- stopwords:
A list of words to ignore as potential keywords.
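The interplay between pair_diff_length, distance_method and distance_threshold can be illustrated with a minimal Python sketch. This is not the RaKUn implementation: the helper names `levenshtein` and `is_merge_candidate` are hypothetical, only the editdistance method is covered, and the merge rule (a distance lower than the threshold) is an assumption taken from the parameter descriptions above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_merge_candidate(a: str, b: str, pair_diff_length: int,
                       distance_threshold: int) -> bool:
    """Hypothetical check: may two words be merged into a single node?"""
    # Assumption: a length difference above pair_diff_length rules out
    # merging before any distance is computed.
    if abs(len(a) - len(b)) > pair_diff_length:
        return False
    # Assumption: words are merged when the distance is lower than the threshold.
    return levenshtein(a, b) < distance_threshold


print(is_merge_candidate("gift", "present", pair_diff_length=2, distance_threshold=10))
print(is_merge_candidate("road", "roads", pair_diff_length=3, distance_threshold=2))
```

Under these assumptions, "gift" and "present" are rejected by the length check alone, however generous the distance threshold is, while "road" and "roads" pass both checks.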
GUI
To create a new Rakun Extractor, navigate to "Models" -> "Rakun Extractors" as seen in Figure 72.
If the navigation is successful, you should see a panel similar to Figure 73 with a "Create" button in the top left corner of the page.
Clicking the "Create" button opens a modal window titled "New Rakun Extractor", as depicted in Figure 74.
Fill in the required fields and click the "Create" button in the bottom right corner of the window (Figure 75).
The created Rakun Extractor now appears as the first row (or the only row, if the project has no other Rakun Extractors) in the table of Rakun Extractors (Figure 76).
API
Endpoint /projects/{project_pk}/rakun_extractors/
Example:
curl -X POST "http://localhost:8000/api/v1/projects/1/rakun_extractors/" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Token 8229898dccf960714a9fa22662b214005aa2b049" \
  -d '{
    "description": "Test Rakun Extractor",
    "distance_method": "editdistance",
    "distance_threshold": 2,
    "num_keywords": 15,
    "pair_diff_length": 3,
    "stopwords": ["and", "or", "if", "why", "when"],
    "bigram_count_threshold": 3,
    "min_tokens": 1,
    "max_tokens": 2,
    "max_similar": 3,
    "max_occurrence": 10
  }'
Response:
{
  "id": 38,
  "url": "http://localhost:8000/api/v1/projects/1/rakun_extractors/38/",
  "author_username": "test",
  "description": "Test Rakun Extractor",
  "distance_method": "editdistance",
  "distance_threshold": 2.0,
  "num_keywords": 15,
  "pair_diff_length": 3,
  "stopwords": [
    "and",
    "or",
    "if",
    "why",
    "when"
  ],
  "bigram_count_threshold": 3,
  "min_tokens": 1,
  "max_tokens": 2,
  "max_similar": 3,
  "max_occurrence": 10,
  "fasttext_embedding": null,
  "task": null
}
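The same creation request can also be prepared from Python. The sketch below uses only the standard library and mirrors the curl example above. `BASE_URL`, `PROJECT_ID`, `TOKEN` and the helper `build_create_request` are placeholders introduced for illustration, not part of the API. The request is only built here; sending it requires a running server, so that step is left commented out.

```python
import json
import urllib.request

# Placeholder values: replace with your own host, project ID and API token.
BASE_URL = "http://localhost:8000/api/v1"
PROJECT_ID = 1
TOKEN = "<your-token>"


def build_create_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a Rakun Extractor."""
    return urllib.request.Request(
        url=f"{BASE_URL}/projects/{PROJECT_ID}/rakun_extractors/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Token {TOKEN}",
        },
        method="POST",
    )


# Same parameter values as in the curl example.
payload = {
    "description": "Test Rakun Extractor",
    "distance_method": "editdistance",
    "distance_threshold": 2,
    "num_keywords": 15,
    "pair_diff_length": 3,
    "stopwords": ["and", "or", "if", "why", "when"],
    "bigram_count_threshold": 3,
    "min_tokens": 1,
    "max_tokens": 2,
    "max_similar": 3,
    "max_occurrence": 10,
}

request = build_create_request(payload)

# Sending the request needs a reachable server, so it is commented out:
# with urllib.request.urlopen(request) as response:
#     created = json.loads(response.read())
#     print(created["id"])
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and body before anything is sent.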
Usage
The following section covers the most important functionalities of Rakun Extractor.
Extract from text
The "Extract from text" function extracts keywords from a single text using a selected Rakun Extractor model.
GUI
To extract keywords with an existing Rakun Extractor model, navigate to "Models" -> "Rakun Extractors" as seen in Figure 72.
Select the model you wish to use and open its options panel, denoted by three vertical dots, as seen in Figure 77.
Select the option "Extract from text" from the selection menu as seen in Figure 78.
Selecting the option opens a new modal window, "Extract From Text". Insert the text you wish to extract keywords from and click the "Extract" button in the bottom right corner of the panel (Figure 79).
The results are displayed in the same modal window as seen in Figure 80.
API
Endpoint /projects/{project_pk}/rakun_extractors/{id}/extract_from_text/
Example:
# The apostrophe in "city's" is written as '\'' so that it survives the shell's single quotes.
curl -X POST "http://localhost:8000/api/v1/projects/1/rakun_extractors/38/extract_from_text/" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Token 8229898dccf960714a9fa22662b214005aa2b049" \
  -d '{
    "text": "The Tallinn Urban Environment and Public Works Department has released a warning that the city'\''s streets may become exceptionally slippery due to falling temperatures. \"The difficult winter is coming to an end, but this sometimes makes the weather even more problematic. The rain, combined with sub-zero temperatures, will inevitably lead to dangerous icing conditions on both pathways and roads. All our partners will start de-icing at the first sign of icy roads, but unfortunately, it is simply not possible to do it everywhere at once. Therefore, we urge road users to be careful and remind property owners that granite aggregate is the best way for de-icing,\" Vladimir Svet (Center), Deputy Mayor of Tallinn said.",
    "add_spans": true
  }'
Response:
{
"rakun_id": 38,
"description": "Test Rakun Extractor",
"result": true,
"text": "The Tallinn Urban Environment and Public Works Department has released a warning that the city's streets may become exceptionally slippery due to falling temperatures. \"The difficult winter is coming to an end, but this sometimes makes the weather even more problematic. The rain, combined with sub-zero temperatures, will inevitably lead to dangerous icing conditions on both pathways and roads. All our partners will start de-icing at the first sign of icy roads, but unfortunately, it is simply not possible to do it everywhere at once. Therefore, we urge road users to be careful and remind property owners that granite aggregate is the best way for de-icing,\" Vladimir Svet (Center), Deputy Mayor of Tallinn said.",
"keywords": [
{
"fact": "Test Rakun Extractor",
"str_val": "temperatures",
"spans": "[[154, 166]]",
"doc_path": "text",
"probability": 0.5953014184397163
},
{
"fact": "Test Rakun Extractor",
"str_val": "temperatures",
"spans": "[[305, 317]]",
"doc_path": "text",
"probability": 0.5953014184397163
},
{
"fact": "Test Rakun Extractor",
"str_val": "roads",
"spans": "[[391, 396]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "roads",
"spans": "[[460, 465]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "de-icing",
"spans": "[[426, 434]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "de-icing",
"spans": "[[655, 663]]",
"doc_path": "text",
"probability": 0.5048758865248227
},
{
"fact": "Test Rakun Extractor",
"str_val": "environment",
"spans": "[[18, 29]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "conditions",
"spans": "[[359, 369]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "released",
"spans": "[[62, 70]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "vladimir",
"spans": "[[666, 674]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "icing",
"spans": "[[353, 358]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "icing",
"spans": "[[429, 434]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "icing",
"spans": "[[658, 663]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "public",
"spans": "[[34, 40]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "department",
"spans": "[[47, 57]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "urban",
"spans": "[[12, 17]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "become",
"spans": "[[109, 115]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "exceptionally",
"spans": "[[116, 129]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "center",
"spans": "[[681, 687]]",
"doc_path": "text",
"probability": 0.41976950354609927
},
{
"fact": "Test Rakun Extractor",
"str_val": "slippery",
"spans": "[[130, 138]]",
"doc_path": "text",
"probability": 0.41976950354609927
}
]
}
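As the response above shows, a keyword that occurs several times is returned once per occurrence, and "spans" is a JSON-encoded string rather than a list. In the example the spans appear to be character offsets into "text" (for instance, [12, 17] covers "Urban"). The sketch below groups repeated keywords, decodes their spans and applies a probability cut-off, which is one way to do the post-filtering suggested in the num_keywords note. The function name `summarize_keywords`, the `min_probability` parameter and the shortened `sample` response are made up for illustration.

```python
import json


def summarize_keywords(response: dict, min_probability: float = 0.0) -> list:
    """Group repeated keywords, collect their spans and filter by probability."""
    merged = {}
    for item in response["keywords"]:
        if item["probability"] < min_probability:
            continue
        entry = merged.setdefault(
            item["str_val"],
            {"keyword": item["str_val"],
             "probability": item["probability"],
             "spans": []},
        )
        # "spans" arrives as a JSON-encoded string such as "[[154, 166]]".
        entry["spans"].extend(json.loads(item["spans"]))
    # Most probable keywords first.
    return sorted(merged.values(), key=lambda e: e["probability"], reverse=True)


# Shortened, hand-made response in the same shape as the example above.
sample = {
    "text": "Icy roads ahead: the roads are slippery.",
    "keywords": [
        {"str_val": "roads", "spans": "[[4, 9]]", "probability": 0.5},
        {"str_val": "roads", "spans": "[[21, 26]]", "probability": 0.5},
        {"str_val": "slippery", "spans": "[[31, 39]]", "probability": 0.4},
    ],
}

for entry in summarize_keywords(sample, min_probability=0.45):
    for start, end in entry["spans"]:
        # Slice the original text with the decoded offsets.
        print(entry["keyword"], entry["probability"], sample["text"][start:end])
```

With a cut-off of 0.45, only "roads" remains in the sample and both of its occurrences are kept as separate spans under a single entry.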