| Title: | Blazing Fast Morphological Analyzer Based on Kiwi(Korean Intelligent Word Identifier) |
|---|---|
| Description: | An R wrapper for Kiwi (Korean Intelligent Word Identifier), a fast morphological analyzer for Korean. It supports user dictionary configuration and frequency-based detection of unregistered nouns. |
| Authors: | Chanyub Park [aut, cre] |
| Maintainer: | Chanyub Park <[email protected]> |
| License: | LGPL (>= 3) |
| Version: | 0.3.1 |
| Built: | 2026-05-17 09:44:38 UTC |
| Source: | https://github.com/mrchypark/elbird |
Simple version of analyze function.
analyze(text, top_n = 3, match_option = Match$ALL, stopwords = FALSE, blocklist = NULL, pretokenized = NULL, normalize_coda = FALSE, typos = NULL, typo_cost_threshold = 2.5, open_ending = FALSE, allowed_dialects = Dialect$STANDARD, dialect_cost = 3)
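| Argument | Description |
|---|---|
| text | target text. |
| top_n | number of results. Default is 3. |
| match_option | Match option to use. Default is Match$ALL. |
| stopwords | stopwords option. Default is FALSE, which uses no stopwords. If TRUE, use the embedded stopwords dictionary. If char (path of a dictionary txt file), use that file. If a Stopwords object, use it. Any other value works the same as FALSE. |
| blocklist | Morphset (optional): morpheme set to block from analysis results. |
| pretokenized | Pretokenized (optional): pretokenized object for guided analysis. |
| normalize_coda | |
| typos | |
| typo_cost_threshold | |
| open_ending | |
| allowed_dialects | |
| dialect_cost | |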
## Not run:
analyze("Test text.")
analyze("Please use Korean.", top_n = 1)
analyze("Test text.", 1, Match$ALL_WITH_NORMALIZING)
analyze("Test text.", stopwords = FALSE)
analyze("Test text.", stopwords = TRUE)
analyze("Test text.", stopwords = "user_dict.txt")
analyze("Test text.", stopwords = Stopwords$new(TRUE))
# New features with Kiwi v0.21.0
kw <- Kiwi$new()
morphset <- kw$create_morphset()
analyze("Test text.", blocklist = morphset)
## End(Not run)
Dialect constants for analysis options.
Dialect
An object of class EnumGenerator of length 11.
## Not run:
Dialect
Dialect$STANDARD
## End(Not run)
Get kiwi language model file.
get_model(size = "base", path = model_home(), clean = FALSE)
| Argument | Description |
|---|---|
| size | model size to download. Default is "base"; "all" is also available. |
| path | path for model files. Default is model_home(). |
| clean | remove previous model files before downloading new ones. |
https://github.com/bab2min/Kiwi/releases
## Not run:
get_model("base")
## End(Not run)
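The download step can be made idempotent by checking for the model first. The following is a minimal sketch: `ensure_model` is a hypothetical helper name (not part of elbird), and it assumes only the `model_exists(size)` and `get_model(size)` signatures shown in this manual. The two function arguments exist so the logic can be exercised without network access.

```r
# Hypothetical helper: download a Kiwi model only if it is not already present.
# exists_fn and get_fn default to the elbird functions documented here; the
# defaults are only evaluated when they are actually used.
ensure_model <- function(size = "base",
                         exists_fn = elbird::model_exists,
                         get_fn = elbird::get_model) {
  if (!isTRUE(exists_fn(size))) {
    get_fn(size)
  }
  # Report whether the model is available after the attempt.
  invisible(isTRUE(exists_fn(size)))
}
```

Calling `ensure_model("base")` before `Kiwi$new()` avoids repeated downloads in scripts that run more than once.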
Joiner class provides methods to compose morphemes into text.
## Not run:
kw <- Kiwi$new()
joiner <- kw$create_joiner()
joiner$add("테스트", "NNG")
joiner$get()
## End(Not run)
The Kiwi class provides methods for Korean morphological analysis.
print()
print method for Kiwi objects
Kiwi$print(x, ...)
x: self
...: ignored
new()
Create a kiwi instance.
Kiwi$new( num_workers = 0, model_size = "base", integrate_allomorph = TRUE, load_default_dict = TRUE )
num_workers: int (optional). Number of worker threads. Default is 0, which uses all cores.
model_size: char (optional). Kiwi model to use. Default is "base".
integrate_allomorph: bool (optional). Default is TRUE.
load_default_dict: bool (optional). Use the default dictionary. Default is TRUE.
add_user_word()
Add a user word with its POS tag and score.
Kiwi$add_user_word(word, tag, score, orig_word = "")
word: char (required). Target word to add.
tag: Tags (required). Tag information for the word.
score: num (required). Score information for the word.
orig_word: char (optional). Original word.
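Registering several domain terms usually means calling `add_user_word()` in a loop. The sketch below wraps that loop; `add_domain_terms` is a hypothetical helper name, and the default score of 0 is an assumption for illustration rather than a value recommended by the package.

```r
# Hypothetical helper: register several user words with the same tag and score.
# kw is expected to be a Kiwi instance (anything exposing add_user_word()).
add_domain_terms <- function(kw, terms, tag, score = 0) {
  for (term in terms) {
    kw$add_user_word(term, tag, score)
  }
  # Return the number of words submitted, invisibly.
  invisible(length(terms))
}

# Usage (requires elbird and a downloaded model):
# kw <- elbird::Kiwi$new()
# add_domain_terms(kw, c("엘버드", "키위"), elbird::Tags$nnp)
```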
add_pre_analyzed_words()
TODO
Kiwi$add_pre_analyzed_words(form, analyzed, score)
form: char (required). Target word to add the analyzed result for.
analyzed: data.frame (required). Expected analyzed result.
score: num (required). Score information for the pre-analyzed result.
add_rules()
TODO
Kiwi$add_rules(tag, pattern, replacement, score)
tag: Tags (required). Target tag to add rules for.
pattern: char (required). Regular expression.
replacement: char (required). Replacement text.
score: num (required). Score information for the rules.
load_user_dictionarys()
Add a user dictionary from a text file.
Kiwi$load_user_dictionarys(user_dict_path)
user_dict_path: char (required). Path of the user dictionary file.
extract_words()
Extract noun candidates from texts.
Kiwi$extract_words(input, min_cnt, max_word_len, min_score, pos_threshold, apply = FALSE)
input: char (required). Target text data.
min_cnt: int (required). Minimum count of a word in the text.
max_word_len: int (required). Maximum word length.
min_score: num (required). Minimum score.
pos_threshold: num (required). POS threshold.
apply: bool (optional). Add the extracted words to the user word dictionary.
analyze()
Analyze text into token and tag results.
Kiwi$analyze(text, top_n = 3, match_option = Match$ALL, stopwords = FALSE, blocklist = NULL, pretokenized = NULL)
text: char (required). Target text.
top_n: int (optional). Number of results. Default is 3.
match_option: Match. Match option to use. Default is Match$ALL.
stopwords: stopwords option. Default is FALSE, which uses no stopwords.
If TRUE, use the embedded stopwords dictionary.
If char (path of a dictionary txt file), use that file.
If a Stopwords object, use it.
Any other value works the same as FALSE.
blocklist: Morphset (optional). Morpheme set to block from analysis results.
pretokenized: Pretokenized (optional). Pretokenized object for guided analysis.
A list of results.
tokenize()
Analyze text and return tokens and POS tags for the top result only.
Kiwi$tokenize(text, match_option = Match$ALL, stopwords = FALSE, form = "tibble")
text: char (required). Target text.
match_option: Match. Match option to use. Default is Match$ALL.
stopwords: stopwords option. Default is FALSE, which uses no stopwords.
If TRUE, use the embedded stopwords dictionary.
If char (path of a dictionary txt file), use that file.
If a Stopwords object, use it.
Any other value works the same as FALSE.
form: char (optional). Return form. Default is "tibble"; "list" and "tidytext" are also available.
split_into_sents()
Some text is not already split sentence by sentence. split_into_sents splits such text into individual sentences.
Kiwi$split_into_sents(text, match_option = Match$ALL, return_tokens = FALSE)
text: char (required). Target text.
match_option: Match. Match option to use. Default is Match$ALL.
return_tokens: bool (optional). Add the tokenized result.
get_tidytext_func()
Return a tokenizer function for use with tidytext unnest_tokens.
Kiwi$get_tidytext_func(match_option = Match$ALL, stopwords = FALSE)
match_option: Match. Match option to use. Default is Match$ALL.
stopwords: stopwords option. Default is FALSE.
If TRUE, use the embedded stopwords dictionary.
If FALSE, do not use a stopwords dictionary.
If char (path of a dictionary txt file), use that file.
If a Stopwords object, use it.
Any other value works the same as FALSE.
A function.
\dontrun{
kw <- Kiwi$new()
tidytoken <- kw$get_tidytext_func()
tidytoken("test")
}
set_typo_correction()
Set typo correction settings for the Kiwi instance.
Kiwi$set_typo_correction( enabled = TRUE, cost_threshold = 2.5, custom_typos = NULL )
enabled: bool (optional). Enable or disable typo correction. Default is TRUE.
cost_threshold: num (optional). Cost threshold for typo correction. Default is 2.5.
custom_typos: list (optional). List of custom typo corrections.
Each element should be a list with 'orig', 'error', and optionally 'cost' fields.
\dontrun{
kw <- Kiwi$new()
# Enable with default settings
kw$set_typo_correction(TRUE)
# Enable with custom rules
custom_rules <- list(
list(orig = "안녕", error = "안뇽", cost = 1.0),
list(orig = "하세요", error = "하셰요", cost = 1.5)
)
kw$set_typo_correction(TRUE, cost_threshold = 3.0, custom_typos = custom_rules)
# Disable typo correction
kw$set_typo_correction(FALSE)
}
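When custom rules come from parallel vectors (for example, columns of a data frame), the nested list expected by `custom_typos` can be assembled programmatically. This is a small sketch in base R; `make_typo_rules` is a hypothetical helper name, and the only assumption about the package is the 'orig' / 'error' / optional 'cost' element layout described above.

```r
# Hypothetical helper: build a custom_typos list from parallel vectors.
# Each element is a list with 'orig', 'error', and optionally 'cost'.
make_typo_rules <- function(orig, error, cost = NULL) {
  stopifnot(length(orig) == length(error))
  if (!is.null(cost)) stopifnot(length(cost) == length(orig))
  lapply(seq_along(orig), function(i) {
    rule <- list(orig = orig[[i]], error = error[[i]])
    if (!is.null(cost)) rule$cost <- cost[[i]]
    rule
  })
}

# Usage (requires elbird and a downloaded model):
# rules <- make_typo_rules(c("안녕", "하세요"), c("안뇽", "하셰요"), c(1.0, 1.5))
# kw$set_typo_correction(TRUE, cost_threshold = 3.0, custom_typos = rules)
```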
get_typo_correction_settings()
Get current typo correction settings.
Kiwi$get_typo_correction_settings()
list with typo correction settings
create_morphset()
Create a new morpheme set for blocking specific morphemes from analysis.
Kiwi$create_morphset()
Morphset object
create_pretokenized()
Create a new pretokenized object for guided analysis.
Kiwi$create_pretokenized()
Pretokenized object
clone()
The objects of this class are cloneable with this method.
Kiwi$clone(deep = FALSE)
deep: Whether to make a deep clone.
## Not run:
kw <- Kiwi$new()
kw$analyze("test")
kw$tokenize("test")
## End(Not run)
## ------------------------------------------------
## Method `Kiwi$get_tidytext_func`
## ------------------------------------------------
## Not run:
kw <- Kiwi$new()
tidytoken <- kw$get_tidytext_func()
tidytoken("test")
## End(Not run)
## ------------------------------------------------
## Method `Kiwi$set_typo_correction`
## ------------------------------------------------
## Not run:
kw <- Kiwi$new()
# Enable with default settings
kw$set_typo_correction(TRUE)
# Enable with custom rules
custom_rules <- list(
  list(orig = "안녕", error = "안뇽", cost = 1.0),
  list(orig = "하세요", error = "하셰요", cost = 1.5)
)
kw$set_typo_correction(TRUE, cost_threshold = 3.0, custom_typos = custom_rules)
# Disable typo correction
kw$set_typo_correction(FALSE)
## End(Not run)
ALL option contains URL, EMAIL, HASHTAG, MENTION, SERIAL.
Match
An object of class EnumGenerator of length 22.
## Not run:
Match
Match$ALL
## End(Not run)
Verifies that model files exist.
model_exists(size = "all")
| Argument | Description |
|---|---|
| size | model size. Default is "all", which returns TRUE only if all three models are present. |
logical: whether the model files exist.
## Not run:
get_model("small")
model_exists("small")
## End(Not run)
kiwi_model_path()
Returns the kiwi model path. TODO explain ELBIRD_MODEL_HOME
model_home()
character: file path
model_home()
Verifies that models work correctly.
model_works(size = "all")
| Argument | Description |
|---|---|
| size | model size. Default is "all", which returns TRUE only if all three models are present. |
logical: whether the model works.
## Not run:
get_model("small")
model_works("small")
## End(Not run)
Morphset class provides methods for managing morpheme sets that can be used to block specific morphemes from analysis results.
print()
print method for Morphset objects
Morphset$print(x, ...)
x: self
...: ignored
new()
Create a morphset instance with a C++ handle.
Morphset$new(handle)
handle: C++ handle for the morphset object.
add()
Add a morpheme to the morphset.
Morphset$add(form, tag)
form: char (required). Morpheme form to add.
tag: char (required). POS tag for the morpheme.
logical indicating success
add_multiple()
Add multiple morphemes to the morphset.
Morphset$add_multiple(forms, tags)
forms: character vector. Morpheme forms to add.
tags: character vector. POS tags for the morphemes.
logical vector indicating success for each morpheme
get_handle()
Get the internal C++ handle for this morphset. This is used internally by the Kiwi class.
Morphset$get_handle()
C++ handle
size()
Get the number of morphemes in this morphset.
Morphset$size()
integer number of morphemes
get_morphemes()
Get a list of all morphemes in this morphset.
Morphset$get_morphemes()
list of morphemes with form and tag
clear()
Clear all morphemes from this morphset. Note: This creates a new morphset handle, so existing references may become invalid.
Morphset$clear()
clone()
The objects of this class are cloneable with this method.
Morphset$clone(deep = FALSE)
deep: Whether to make a deep clone.
## Not run:
kw <- Kiwi$new()
morphset <- kw$create_morphset()
morphset$add("테스트", "NNG")
result <- kw$analyze("테스트 문장", blocklist = morphset)
## End(Not run)
Pretokenized class provides methods for managing pretokenized objects that can guide the morphological analysis process by providing predefined token boundaries and information.
print()
print method for Pretokenized objects
Pretokenized$print(x, ...)
x: self
...: ignored
new()
Create a pretokenized instance with a C++ handle.
Pretokenized$new(handle)
handle: C++ handle for the pretokenized object.
add_span()
Add a span to the pretokenized object.
Pretokenized$add_span(begin, end)
begin: integer (required). Beginning position of the span.
end: integer (required). Ending position of the span.
integer span ID for adding tokens to this span
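Span positions can be computed from the text instead of being counted by hand. The sketch below is base R only and rests on an assumption that this manual does not state outright: that positions are 0-based, end-exclusive character offsets, as the class example `add_token_to_span(span_id, "테스트", "NNG", 0, 3)` suggests. `find_span` is a hypothetical helper name.

```r
# Hypothetical helper: locate the first occurrence of `form` in `text` and
# return its offsets, ASSUMING 0-based, end-exclusive character positions.
find_span <- function(text, form) {
  pos <- regexpr(form, text, fixed = TRUE)  # 1-based position, or -1 if absent
  if (pos == -1L) {
    return(NULL)
  }
  begin <- as.integer(pos) - 1L
  c(begin = begin, end = begin + nchar(form))
}

# Usage (requires elbird and a downloaded model):
# span <- find_span("테스트 문장", "테스트")
# pt <- kw$create_pretokenized()
# span_id <- pt$add_span(span[["begin"]], span[["end"]])
```

If the underlying library counts positions differently (for example in bytes), the offsets would need to be converted before calling `add_span()`.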
add_token_to_span()
Add a token to a specific span.
Pretokenized$add_token_to_span(span_id, form, tag, begin, end)
span_id: integer (required). ID of the span to add the token to.
form: char (required). Token form.
tag: char (required). POS tag for the token.
begin: integer (required). Beginning position of the token.
end: integer (required). Ending position of the token.
logical indicating success
add_tokens_to_span()
Add multiple tokens to a span at once.
Pretokenized$add_tokens_to_span(span_id, forms, tags, begins, ends)
span_id: integer (required). ID of the span to add tokens to.
forms: character vector. Token forms.
tags: character vector. POS tags for the tokens.
begins: integer vector. Beginning positions of the tokens.
ends: integer vector. Ending positions of the tokens.
logical vector indicating success for each token
get_handle()
Get the internal C++ handle for this pretokenized object. This is used internally by the Kiwi class.
Pretokenized$get_handle()
C++ handle
span_count()
Get the number of spans in this pretokenized object.
Pretokenized$span_count()
integer number of spans
token_count()
Get the total number of tokens across all spans.
Pretokenized$token_count()
integer total number of tokens
get_span_info()
Get information about a specific span.
Pretokenized$get_span_info(span_id)
span_id: integer (required). ID of the span to get information for.
list with span information
get_all_spans()
Get information about all spans.
Pretokenized$get_all_spans()
list of all spans
clear()
Clear all spans and tokens from this pretokenized object. Note: This creates a new pretokenized handle, so existing references may become invalid.
Pretokenized$clear()
clone()
The objects of this class are cloneable with this method.
Pretokenized$clone(deep = FALSE)
deep: Whether to make a deep clone.
## Not run:
kw <- Kiwi$new()
pt <- kw$create_pretokenized()
span_id <- pt$add_span(0, 10)
pt$add_token_to_span(span_id, "테스트", "NNG", 0, 3)
result <- kw$analyze("테스트 문장", pretokenized = pt)
## End(Not run)
Some text is not already split sentence by sentence. split_into_sents splits such text into individual sentences.
split_into_sents(text, return_tokens = FALSE)
| Argument | Description |
|---|---|
| text | target text. |
| return_tokens | add the tokenized result. |
## Not run:
split_into_sents("text")
split_into_sents("text", return_tokens = TRUE)
## End(Not run)
The Stopwords class filters analysis results.
print()
print method for Stopwords objects
Stopwords$print(x, ...)
x: self
...: ignored
new()
Create a stopwords object for filtering stopwords from analyze() and tokenize() results.
Stopwords$new(use_system_dict = TRUE)
use_system_dict: bool (optional). Whether to use the system stopwords dictionary. Default is TRUE.
add()
Add one stopword at a time.
Stopwords$add(form = NA, tag = Tags$nnp)
form: char (optional). Form information. Default is NA.
tag: char (optional). Tag information. Default is "NNP". Please check Tags.
\dontrun{
sw <- Stopwords$new()
sw$add("word", "NNG")
sw$add("word", Tags$nng)
}
add_from_dict()
Add stopwords from a text file. Each entry in the file must have the form "TEXT/TAG". TEXT can be omitted, as in "/NNP". TAG is required, as in "FORM/NNP".
Stopwords$add_from_dict(path, dict_name = "user")
path: char (required). Dictionary file path.
dict_name: char (optional). Default is "user".
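A dictionary file in the "TEXT/TAG" form can be generated from R vectors. The following is a sketch in base R; `write_stopword_dict` is a hypothetical helper name, and writing one entry per line is an assumption, since the file layout beyond "TEXT/TAG" is not spelled out here.

```r
# Hypothetical helper: write stopword entries as "TEXT/TAG" lines.
# An NA form produces a tag-only entry such as "/NNP".
# ASSUMPTION: one entry per line, UTF-8 encoded.
write_stopword_dict <- function(forms, tags, path = tempfile(fileext = ".txt")) {
  stopifnot(length(forms) == length(tags), !anyNA(tags), all(nzchar(tags)))
  entries <- paste0(ifelse(is.na(forms), "", forms), "/", tags)
  writeLines(enc2utf8(entries), path, useBytes = TRUE)
  path
}

# Usage (requires elbird):
# path <- write_stopword_dict(c("word", NA), c("NNG", "NNP"))
# sw <- elbird::Stopwords$new(use_system_dict = FALSE)
# sw$add_from_dict(path)
```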
remove()
Remove one stopword at a time.
Stopwords$remove(form = NULL, tag = NULL)
form: char (optional). Form information. If form is not set, stopwords are removed by tag.
tag: char (required). Tag information. Please check Tags.
save_dict()
Save the current stopwords list to a text file.
Stopwords$save_dict(path)
path: char (required). File path to save the stopwords list.
get()
Return a tibble of stopwords.
Stopwords$get()
a tibble of stopwords options for the analyze() / tokenize() functions.
clone()
The objects of this class are cloneable with this method.
Stopwords$clone(deep = FALSE)
deep: Whether to make a deep clone.
## Not run:
Stopwords$new()
## End(Not run)
## ------------------------------------------------
## Method `Stopwords$add`
## ------------------------------------------------
## Not run:
sw <- Stopwords$new()
sw$add("word", "NNG")
sw$add("word", Tags$nng)
## End(Not run)
SwTokenizer class provides encode/decode helpers for subword tokenization.
## Not run:
kw <- Kiwi$new()
swt <- kw$create_swtokenizer("tokenizer.json")
swt$encode("안녕하세요")
## End(Not run)
Tags contains tag list for elbird.
Tags
An object of class EnumGenerator of length 47.
https://github.com/bab2min/Kiwi
## Not run:
Tags
Tags$nnp
## End(Not run)
Simple version of tokenizer function.
tokenize(text, match_option = Match$ALL, stopwords = TRUE, blocklist = NULL, pretokenized = NULL, normalize_coda = FALSE, typos = NULL, typo_cost_threshold = 2.5, open_ending = FALSE, allowed_dialects = Dialect$STANDARD, dialect_cost = 3)
tokenize_tbl(text, match_option = Match$ALL, stopwords = TRUE, blocklist = NULL, pretokenized = NULL, normalize_coda = FALSE, typos = NULL, typo_cost_threshold = 2.5, open_ending = FALSE, allowed_dialects = Dialect$STANDARD, dialect_cost = 3)
tokenize_tidytext(text, match_option = Match$ALL, stopwords = TRUE, blocklist = NULL, pretokenized = NULL, normalize_coda = FALSE, typos = NULL, typo_cost_threshold = 2.5, open_ending = FALSE, allowed_dialects = Dialect$STANDARD, dialect_cost = 3)
tokenize_tidy(text, match_option = Match$ALL, stopwords = TRUE, blocklist = NULL, pretokenized = NULL, normalize_coda = FALSE, typos = NULL, typo_cost_threshold = 2.5, open_ending = FALSE, allowed_dialects = Dialect$STANDARD, dialect_cost = 3)
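| Argument | Description |
|---|---|
| text | target text. |
| match_option | Match option to use. Default is Match$ALL. |
| stopwords | stopwords option. Default is TRUE, which uses the embedded stopwords dictionary. If FALSE, do not use the embedded stopwords dictionary. If char (path of a dictionary txt file), use that file. If a Stopwords object, use it. Any other value works the same as FALSE. |
| blocklist | |
| pretokenized | |
| normalize_coda | |
| typos | |
| typo_cost_threshold | |
| open_ending | |
| allowed_dialects | |
| dialect_cost | |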
list type of result.
## Not run:
tokenize("Test text.")
tokenize("Please use Korean.", Match$ALL_WITH_NORMALIZING)
# New features with Kiwi v0.21.0
kw <- Kiwi$new()
morphset <- kw$create_morphset()
tokenize("Test text.", blocklist = morphset)
## End(Not run)
Typo correction set constants.
TypoSet
An object of class EnumGenerator of length 6.
## Not run:
TypoSet
TypoSet$BASIC_TYPO_SET
## End(Not run)