Title: | Blazing Fast Morphological Analyzer Based on Kiwi(Korean Intelligent Word Identifier) |
---|---|
Description: | This is an R wrapper package for Kiwi (Korean Intelligent Word Identifier), a blazing-fast morphological analyzer for Korean. It supports user dictionary configuration and frequency-based detection of unregistered nouns. |
Authors: | Chanyub Park [aut, cre] |
Maintainer: | Chanyub Park <[email protected]> |
License: | LGPL (>= 3) |
Version: | 0.2.5 |
Built: | 2024-10-31 04:06:14 UTC |
Source: | https://github.com/mrchypark/elbird |
Simple version of the analyze function.
analyze(text, top_n = 3, match_option = Match$ALL, stopwords = FALSE)
text |
target text. |
top_n |
number of results. Default is 3. |
match_option |
match option. Default is Match$ALL. |
stopwords |
stopwords option. Default is FALSE, which uses no stopwords. If TRUE, use the embedded stopwords dictionary. If char: path to a dictionary txt file; that file is used. If a Stopwords object, use it. Any other value behaves the same as FALSE. |
## Not run: 
analyze("Test text.")
analyze("Please use Korean.", top_n = 1)
analyze("Test text.", 1, Match$ALL_WITH_NORMALIZING)
analyze("Test text.", stopwords = FALSE)
analyze("Test text.", stopwords = TRUE)
analyze("Test text.", stopwords = "user_dict.txt")
analyze("Test text.", stopwords = Stopwords$new(TRUE))
## End(Not run)
Get a Kiwi language model file.
get_model(size = "base", path = model_home(), clean = FALSE)
size |
model size: "small", "base", or "large". Default is "base"; "all" is also available. |
path |
path for model files. Default is model_home(). |
clean |
remove previous model files before downloading new ones. |
https://github.com/bab2min/Kiwi/releases
## Not run: 
get_model("small")
## End(Not run)
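A typical model workflow can be sketched with the functions documented on this page (get_model(), model_exists(), model_works()); this is a hedged example, not package documentation:

```r
## Not run: 
# Download the small model into the default location (model_home()),
# then confirm the files exist and actually load.
get_model("small")
if (model_exists("small") && model_works("small")) {
  message("small model is ready")
}
## End(Not run)
```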
The Kiwi class provides methods for Korean morphological analysis.
print()
print method for Kiwi
objects
Kiwi$print(x, ...)
x
self
...
ignored
new()
Create a kiwi instance.
Kiwi$new(num_workers = 0, model_size = "base", integrate_allomorph = TRUE, load_default_dict = TRUE)
num_workers
int(optional)
: number of worker threads. Default is 0, which uses all cores.
model_size
char(optional)
: Kiwi model selection. Default is "base"; "small" and "large" are also available.
integrate_allomorph
bool(optional)
: whether to integrate allomorphs. Default is TRUE.
load_default_dict
bool(optional)
: use the default dictionary. Default is TRUE.
add_user_word()
Add a user word with its POS tag and score.
Kiwi$add_user_word(word, tag, score, orig_word = "")
word
char(required)
: target word to add.
tag
Tags(required)
: tag information about word.
score
num(required)
: score information about word.
orig_word
char(optional)
: origin word.
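A minimal sketch of adding a user word with the parameters above; the word, sentence, and score are illustrative assumptions, not values from the package:

```r
## Not run: 
kw <- Kiwi$new()
# Register a hypothetical proper noun (NNP) with score 0.0,
# then tokenize a sentence containing it.
kw$add_user_word("엘버드", Tags$nnp, 0.0)
kw$tokenize("엘버드는 형태소 분석기입니다.")
## End(Not run)
```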
add_pre_analyzed_words()
TODO
Kiwi$add_pre_analyzed_words(form, analyzed, score)
form
char(required)
: target word to add analyzed result.
analyzed
data.frame(required)
: the expected analysis result.
score
num(required)
: score information about pre analyzed result.
add_rules()
TODO
Kiwi$add_rules(tag, pattern, replacement, score)
tag
Tags(required)
: target tag to add rules.
pattern
char(required)
: regular expression.
replacement
char(required)
: replacement text.
score
num(required)
: score information about rules.
load_user_dictionarys()
Add user dictionaries from a text file.
Kiwi$load_user_dictionarys(user_dict_path)
user_dict_path
char(required)
: path of user dictionary file.
extract_words()
Extract noun word candidates from texts.
Kiwi$extract_words(input, min_cnt, max_word_len, min_score, pos_threshold, apply = FALSE)
input
char(required)
: target text data
min_cnt
int(required)
: minimum occurrence count of a word in the text.
max_word_len
int(required)
: maximum word length.
min_score
num(required)
: minimum score.
pos_threshold
num(required)
: POS threshold.
apply
bool(optional)
: apply the extracted words to the user word dictionary.
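The extraction parameters above can be sketched as follows; the corpus file name and all threshold values are illustrative assumptions:

```r
## Not run: 
kw <- Kiwi$new()
docs <- readLines("corpus.txt")  # hypothetical corpus file
# Keep candidates seen at least 10 times and at most 10 characters long;
# apply = TRUE registers the accepted candidates as user words.
kw$extract_words(docs, min_cnt = 10, max_word_len = 10,
                 min_score = 0.25, pos_threshold = -3, apply = TRUE)
## End(Not run)
```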
analyze()
Analyze text to token and tag results.
Kiwi$analyze(text, top_n = 3, match_option = Match$ALL, stopwords = FALSE)
text
char(required)
: target text.
top_n
int(optional)
: number of results. Default is 3.
match_option
Match(optional)
: match option. Default is Match$ALL.
stopwords
stopwords option. Default is FALSE, which uses no stopwords.
If TRUE, use the embedded stopwords dictionary.
If char: path to a dictionary txt file; that file is used.
If a Stopwords object, use it.
Any other value behaves the same as FALSE.
list of results.
tokenize()
Analyze text into tokens and POS tags, returning only the top result.
Kiwi$tokenize(text, match_option = Match$ALL, stopwords = FALSE, form = "tibble")
text
char(required)
: target text.
match_option
Match(optional)
: match option. Default is Match$ALL.
stopwords
stopwords option. Default is FALSE, which uses no stopwords.
If TRUE, use the embedded stopwords dictionary.
If char: path to a dictionary txt file; that file is used.
If a Stopwords object, use it.
Any other value behaves the same as FALSE.
form
char(optional)
: return format. Default is "tibble"; "list" and "tidytext" are also available.
split_into_sents()
Some text is not already split into sentences. split_into_sents splits such text sentence by sentence.
Kiwi$split_into_sents(text, match_option = Match$ALL, return_tokens = FALSE)
text
char(required)
: target text.
match_option
Match(optional)
: match option. Default is Match$ALL.
return_tokens
bool(optional)
: also return the tokenized result.
get_tidytext_func()
Create a tokenizer function for use with tidytext unnest_tokens.
Kiwi$get_tidytext_func(match_option = Match$ALL, stopwords = FALSE)
match_option
Match(optional)
: match option. Default is Match$ALL.
stopwords
stopwords option. Default is FALSE, which uses no stopwords.
If TRUE, use the embedded stopwords dictionary.
If char: path to a dictionary txt file; that file is used.
If a Stopwords object, use it.
Any other value behaves the same as FALSE.
function
## Not run: 
kw <- Kiwi$new()
tidytoken <- kw$get_tidytext_func()
tidytoken("test")
## End(Not run)
clone()
The objects of this class are cloneable with this method.
Kiwi$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run: 
kw <- Kiwi$new()
kw$analyze("test")
kw$tokenize("test")
## End(Not run)

## ------------------------------------------------
## Method `Kiwi$get_tidytext_func`
## ------------------------------------------------

## Not run: 
kw <- Kiwi$new()
tidytoken <- kw$get_tidytext_func()
tidytoken("test")
## End(Not run)
The ALL option includes URL, EMAIL, HASHTAG, and MENTION.
Match
Match
An object of class EnumGenerator
of length 13.
## Not run: 
Match
Match$ALL
## End(Not run)
Verifies whether model files exist.
model_exists(size = "all")
size |
model size. Default is "all", which returns TRUE only if all three models are present. |
logical
whether the model files exist or not.
## Not run: 
get_model("small")
model_exists("small")
## End(Not run)
kiwi_model_path()
Returns the Kiwi model path. TODO: explain ELBIRD_MODEL_HOME.
model_home()
character
: file path
Verifies whether the models work.
model_works(size = "all")
size |
model size. Default is "all", which returns TRUE only if all three models work. |
logical
whether the models work or not.
## Not run: 
get_model("small")
model_works("small")
## End(Not run)
Some text is not already split into sentences. split_into_sents splits such text sentence by sentence.
split_into_sents(text, return_tokens = FALSE)
text |
target text. |
return_tokens |
also return the tokenized result. |
## Not run: 
split_into_sents("text")
split_into_sents("text", return_tokens = TRUE)
## End(Not run)
Stopwords filters analysis results.
print()
print method for Stopwords
objects
Stopwords$print(x, ...)
x
self
...
ignored
new()
Create a Stopwords object to filter stopwords from analyze() and tokenize() results.
Stopwords$new(use_system_dict = TRUE)
use_system_dict
bool(optional)
: use the system stopwords dictionary or not. Default is TRUE.
add()
Add one stopword at a time.
Stopwords$add(form = NA, tag = Tags$nnp)
form
char(optional)
: Form information. Default is NA.
tag
char(optional)
: Tag information. Default is "NNP". Please check Tags.
## Not run: 
sw <- Stopwords$new()
sw$add("word", "NNG")
sw$add("word", Tags$nng)
## End(Not run)
add_from_dict()
Add stopwords from a text file. Each line must have the form "TEXT/TAG". TEXT may be omitted, as in "/NNP"; TAG is required, as in "FORM/NNP".
Stopwords$add_from_dict(path, dict_name = "user")
path
char(required)
: dictionary file path.
dict_name
char(optional)
: dictionary name. Default is "user".
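A sketch of loading stopwords from a file in the "TEXT/TAG" format described above; the file name and entries are illustrative assumptions:

```r
## Not run: 
# One entry per line: "FORM/TAG", or "/TAG" to drop every form with that tag.
writeLines(c("하/VV", "/NNP"), "my_stopwords.txt")
sw <- Stopwords$new(use_system_dict = FALSE)
sw$add_from_dict("my_stopwords.txt", dict_name = "custom")
sw$get()  # inspect the current stopword list as a tibble
## End(Not run)
```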
remove()
Remove one stopword at a time.
Stopwords$remove(form = NULL, tag = NULL)
form
char(optional)
: Form information. If form is not set, all entries with the given tag are removed.
tag
char(required)
: Tag information. Please check Tags.
save_dict()
Save the current stopwords list to a text file.
Stopwords$save_dict(path)
path
char(required)
: file path to save stopwords list.
get()
Return a tibble of stopwords.
Stopwords$get()
a tibble of stopwords, usable as the stopwords option of the analyze() / tokenize() functions.
clone()
The objects of this class are cloneable with this method.
Stopwords$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run: 
Stopwords$new()
## End(Not run)

## ------------------------------------------------
## Method `Stopwords$add`
## ------------------------------------------------

## Not run: 
sw <- Stopwords$new()
sw$add("word", "NNG")
sw$add("word", Tags$nng)
## End(Not run)
Tags contains the tag list used by elbird.
Tags
Tags
An object of class EnumGenerator
of length 47.
https://github.com/bab2min/Kiwi
## Not run: 
Tags
Tags$nnp
## End(Not run)
Simple version of the tokenizer function.
tokenize(text, match_option = Match$ALL, stopwords = TRUE)
tokenize_tbl(text, match_option = Match$ALL, stopwords = TRUE)
tokenize_tidytext(text, match_option = Match$ALL, stopwords = TRUE)
tokenize_tidy(text, match_option = Match$ALL, stopwords = TRUE)
text |
target text. |
match_option |
match option. Default is Match$ALL. |
stopwords |
stopwords option. Default is TRUE, which uses the embedded stopwords dictionary. If FALSE, no stopwords are used. If char: path to a dictionary txt file; that file is used. If a Stopwords object, use it. |
a list of results.
## Not run: 
tokenize("Test text.")
tokenize("Please use Korean.", Match$ALL_WITH_NORMALIZING)
## End(Not run)