Document Importer
Document Importer provides API endpoints for adding, viewing, deleting and replacing JSON documents inside Elasticsearch indices.
Importing documents
Adding documents through the API is the easiest way to integrate existing datasets and systems with TEXTA Toolkit. However, for security reasons, users may only insert documents into indices that already belong to their Project. API users should also be aware that such indices need to be set up with the proper schema to work with tools like Tagger and Tagger Groups; please refer to the Index API documentation.
Parameters
- documents:
A collection of Elasticsearch documents. Each element of the list has the following fields:
"_id": the ID under which Elasticsearch stores the document. If it is missing, Elasticsearch generates the ID itself.
"_index": the already existing index into which Elasticsearch should insert the document. When this field is missing, the documents are sent to an index named "texta-{DEPLOY_KEY}-import-project-{project_id}", where DEPLOY_KEY is "1" by default.
"_type": specifies the doc_type of the Elasticsearch documents. It defaults to "_doc", which is also the value to use when setting it manually.
"_source": the actual JSON content of the document. All documents should follow the same schema, as conflicts will cause errors.
- split_text_in_fields:
Specifies which text fields should be split into smaller pieces; defaults to the field named "text" if none is given. By default, texts are split at a limit of 3000 characters. Users who do not want their documents split should set this field to an empty list.
Note
If the documents do not have "_index" and "_type" fields, the index name is generated automatically.
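The automatically generated index name follows the format given in the "_index" description. As a minimal sketch of that format only, the shell function below reproduces it; the name default_import_index is hypothetical and not part of TEXTA Toolkit, and reading DEPLOY_KEY from the environment is an assumption made for illustration.

```shell
# Hypothetical helper (not part of TEXTA Toolkit): prints the index name
# that documents without an "_index" field are expected to be sent to.
# Usage: default_import_index PROJECT_ID [DEPLOY_KEY]
default_import_index() {
  # The second argument wins; otherwise fall back to $DEPLOY_KEY, then to "1".
  printf 'texta-%s-import-project-%s\n' "${2:-${DEPLOY_KEY:-1}}" "$1"
}

default_import_index 1 1   # prints texta-1-import-project-1
```

Knowing the generated name in advance is useful when the imported documents need to be viewed or deleted later, since those endpoints require the index name.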
Examples
Endpoint: /projects/{project_pk}/elastic/documents/
Example where the index name is generated automatically and the texts are not split:
curl -X POST "http://localhost:8000/api/v2/projects/1/elastic/documents/" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Token 8229898dccf960714a9fa22662b214005aa2b049" \
-d '{
"documents": [{"_id": "3", "_source": {"hello": "general kenobi!"}}],
"split_text_in_fields": []
}'
Example where the index name is supplied and the text is split:
curl -X POST "http://localhost:8000/api/v2/projects/1/elastic/documents/" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Token 8229898dccf960714a9fa22662b214005aa2b049" \
-d '{
"documents": [{"_id": "30", "uuid": "aa15-ghh4-41af-af51", "_index": "texta_test_index", "_type": "_doc", "_source": {"hello": "general kenobi! Here is a very long text that should be splitted", "date": "2015-01-01T12:10:30Z"}}],
"split_text_in_fields": ["hello"]
}'
Viewing a document
Endpoint: /projects/{project_pk}/elastic/documents/{index_name}/{document_id}/
curl -X GET "http://localhost:8000/api/v2/projects/1/elastic/documents/texta_test_index/30/"
Deleting a document
Endpoint: /projects/{project_pk}/elastic/documents/{index_name}/{document_id}/
curl -X DELETE "http://localhost:8000/api/v2/projects/1/elastic/documents/texta_test_index/30/"
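The view and delete endpoints both address a single document by project, index name and document ID. The sketch below composes that URL in shell so the same value can be reused for both requests. The function name document_url is hypothetical, and the default base URL is simply the one used in the examples on this page.

```shell
# Hypothetical helper: builds the URL of a single document.
# Usage: document_url PROJECT_PK INDEX_NAME DOCUMENT_ID [BASE_URL]
document_url() {
  # The base URL defaults to the one used in the examples on this page.
  printf '%s/projects/%s/elastic/documents/%s/%s/\n' \
    "${4:-http://localhost:8000/api/v2}" "$1" "$2" "$3"
}

url="$(document_url 1 texta_test_index 30)"
echo "$url"   # prints http://localhost:8000/api/v2/projects/1/elastic/documents/texta_test_index/30/

# The same URL serves both operations:
# curl -X GET "$url"      # view the document
# curl -X DELETE "$url"   # delete the document
```

The helper does no URL encoding, so it only suits index names and document IDs made of URL-safe characters.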
Updating a split document
Parameters
- id_field:
The field to use as the ID marker for grouping the pieces of a split document back into a single unit.
- id_value:
The value of the ID field of the document to be replaced.
- text_field:
Specifies the text field you want to update.
- content:
The content that replaces the selected text field.
Example
Endpoint: /projects/{project_pk}/elastic/documents/{index_name}/update_split
Note
The lack of a trailing "/" is important for this endpoint!
curl -X POST "http://localhost:8000/api/v2/projects/1/elastic/documents/texta_test_index/update_split" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Token 8229898dccf960714a9fa22662b214005aa2b049" \
-d '{
"content": "general kenobi! Here is a very long text that should be splitted and now there is more text I forgot to add before and am replacing now",
"text_field": "hello",
"id_value": "aa15-ghh4-41af-af51",
"id_field": "uuid"
}'