Title: | Convert Data to Memorable Phrases |
---|---|
Description: | Convert keys and other values to memorable phrases. Includes some methods to build lists of words. |
Authors: | Max Candocia |
Maintainer: | Max Candocia <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.1 |
Built: | 2025-02-24 03:31:58 UTC |
Source: | https://github.com/mcandocia/keytoenglish |
Converts a collection of documents to a word list
corpora_to_word_list( paths, ascii_only = TRUE, custom_regex = NA, max_word_length = 20, stopword_fn = DEFAULT_STOPWORDS, min_word_count = 5, max_size = 16^3, min_word_length = 3, output_file = NA, json_path = NA )
paths | Paths of plaintext documents |
ascii_only | If TRUE, omits non-ASCII characters |
custom_regex | If not NA, overrides ascii_only and determines what a valid word consists of |
max_word_length | Maximum length of extracted words |
stopword_fn | Filename containing stopwords to use, or a list of stopwords (if length > 1) |
min_word_count | Minimum number of occurrences for a word to be added to the word list |
max_size | Maximum size of the list |
min_word_length | Minimum length of words |
output_file | File to write the list to |
json_path | If the input text is JSON, it is parsed as such when this is a 'character' vector of JSON keys to follow |
A 'character' vector of words
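The filtering steps described by the arguments above can be illustrated with a small base-R sketch that works on an in-memory string. This is not the package's implementation: `sketch_word_list` is a hypothetical helper, and the tokenising regex (runs of ASCII letters) is an assumption.

```r
# Hypothetical helper: a simplified, in-memory analogue of corpora_to_word_list().
sketch_word_list <- function(text, stopwords = character(0),
                             min_word_count = 2, min_word_length = 3,
                             max_word_length = 20, max_size = 16^3) {
  lowered <- tolower(text)
  # Pull out runs of ASCII letters (assumed definition of a word).
  words <- unlist(regmatches(lowered, gregexpr("[a-z]+", lowered)))
  # Apply the length limits and drop stopwords.
  keep <- nchar(words) >= min_word_length & nchar(words) <= max_word_length
  words <- words[keep & !(words %in% stopwords)]
  # Count occurrences and keep words seen often enough.
  counts <- sort(table(words), decreasing = TRUE)
  counts <- counts[counts >= min_word_count]
  # Truncate to the maximum list size, most frequent first.
  head(names(counts), max_size)
}

text <- "The cat sat. The cat ran. A dog sat and the dog ran and ran."
wl <- sketch_word_list(text, stopwords = c("the", "and"))
# wl contains "ran", "cat", "dog" and "sat", with "ran" (3 occurrences) first
```

The default `max_size = 16^3` mirrors the signature above; a list of 4096 words can be indexed by exactly three hexadecimal characters.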
Calculates the greatest common divisor of a list of numbers
GCD(...)
... | Any number of 'numeric' vectors or nested 'list's containing such |
A 'numeric' that is the greatest common divisor of the input values
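For reference, a greatest common divisor over vectors and nested lists can be computed with the Euclidean algorithm. The sketch below uses the hypothetical names `gcd2` and `gcd_all`; it illustrates the technique and makes no claim about how `GCD()` is implemented internally.

```r
# Hypothetical helpers illustrating the Euclidean algorithm.
gcd2 <- function(a, b) {
  # Repeatedly replace (a, b) with (b, a mod b) until the remainder is zero.
  while (b != 0) {
    r <- a %% b
    a <- b
    b <- r
  }
  abs(a)
}

# Flatten vectors and nested lists, then fold gcd2 over all values.
gcd_all <- function(...) {
  values <- unlist(list(...))
  Reduce(gcd2, values)
}

gcd_all(12, 18)                    # 6
gcd_all(c(256, 2048), list(4096))  # 256
```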
Randomly generate sentences with a specific structure
generate_random_sentences(n, punctuate = TRUE, fast = FALSE)
n | 'numeric' number of sentences to generate |
punctuate | 'logical' value of whether to add spaces, capitalize first letter, and append period |
fast | 'logical' |
'character' vector of randomly generated sentences
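The idea of a fixed sentence structure can be sketched in base R by drawing one word per slot from separate word lists. The three tiny lists, the adjective-noun-verb-noun structure, and the helper `sketch_random_sentences` are all hypothetical; the structure and lists the package actually uses are not documented here.

```r
# Hypothetical word lists; the package's own lists are much larger.
adjectives <- c("quiet", "bright", "heavy")
nouns      <- c("rivers", "lanterns", "engines")
verbs      <- c("carry", "follow", "repair")

sketch_random_sentences <- function(n, punctuate = TRUE) {
  # Draw one word per slot for each of the n sentences.
  parts <- cbind(sample(adjectives, n, replace = TRUE),
                 sample(nouns, n, replace = TRUE),
                 sample(verbs, n, replace = TRUE),
                 sample(nouns, n, replace = TRUE))
  if (!punctuate) {
    # Without punctuation, the words are simply concatenated.
    return(apply(parts, 1, paste, collapse = ""))
  }
  s <- apply(parts, 1, paste, collapse = " ")
  # Capitalise the first letter and append a period.
  paste0(toupper(substring(s, 1, 1)), substring(s, 2), ".")
}

set.seed(1)
sentences <- sketch_random_sentences(3)
```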
Hashes data to a sentence that contains 54 bits of entropy
hash_to_sentence(x, ...)
x | Input data, which will be converted to 'character' if not already 'character' |
... | Other parameters to pass to 'keyToEnglish()', besides 'word_list', 'hash_subsection_size', and 'hash_function' |
'character' vector of hashed field resembling phrases
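The entropy of a phrase that takes one word from each of several lists is the sum of the base-2 logarithms of the list sizes, assuming each word is chosen uniformly. The sketch below shows the arithmetic with a hypothetical helper, `phrase_entropy_bits`, and illustrative list sizes; these sizes are not the composition of the list used by `hash_to_sentence()`.

```r
# Bits of entropy for a phrase built from one word per list,
# given the number of words in each list (uniform selection assumed).
phrase_entropy_bits <- function(list_sizes) {
  sum(log2(list_sizes))
}

# Hypothetical structure: two 256-word lists and one 2048-word list.
phrase_entropy_bits(c(256, 256, 2048))  # 8 + 8 + 11 = 27
```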
Hashes field to sequence of words from a list.
keyToEnglish( x, hash_function = "md5", phrase_length = 5, corpus_path = NA, word_list = wl_common, hash_subsection_size = 3, sep = "", word_trans = "camel", suppress_warnings = FALSE, hash_output_length = NA, forced_limit = NA, numeric_append_range = NA )
x | Field to hash |
hash_function | 'character' name of hash function or hash 'function' itself, returning a hexadecimal character |
phrase_length | 'numeric' number of words to use in each hashed key |
corpus_path | 'character' path to word list, as a single-column text file with one word per row |
word_list | 'character' list of words to use in phrases |
hash_subsection_size | 'numeric' length of each subsection of hash to use for word index. 16^N unique words can be used for a size of N. This value times phrase_length must be less than or equal to the length of the hash output. Must be less than 14. |
sep | 'character' separator to use between each word |
word_trans | A 'function', 'list' of functions, or 'camel' (for CamelCase). If a list is used, the function at each index is applied to the word at the corresponding position in each phrase, recycling as necessary |
suppress_warnings | 'logical' value indicating whether the warning about non-character input should be suppressed |
hash_output_length | Optional 'numeric' length of the hash output, for use when the provided hash function is not given as a 'character' name. It is used to warn when the hash output is too short to cover the full range of possible output combinations. |
forced_limit | For multiple word lists, the maximum number of values used for calculating the index (prior to taking the modulus) for each word in a phrase. Using this may speed up processing of longer word lists with a large least common multiple among individual word list lengths. It introduces a small amount of bias into the randomness. This value should be much larger than the length of any individual word list whose length is not a factor of this value. |
numeric_append_range | Optional 'numeric' vector of two integers indicating the range of integers to append onto the data |
'character' vector of hashed field resembling phrases
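The role of `hash_subsection_size` can be pictured as follows: the hexadecimal digest is cut into chunks of N characters, each chunk is read as an integer between 0 and 16^N - 1, and that integer selects a word. The base-R sketch below is an assumption about the general approach, not the package's code; `sketch_digest_to_phrase` is a hypothetical helper, and the digest is an arbitrary hex string rather than a real hash of any input.

```r
# Hypothetical helper: map a hexadecimal digest to words.
sketch_digest_to_phrase <- function(hex, word_list, phrase_length = 5,
                                    hash_subsection_size = 3) {
  # The digest must be long enough to supply one chunk per word.
  stopifnot(nchar(hex) >= phrase_length * hash_subsection_size)
  starts <- seq(1, by = hash_subsection_size, length.out = phrase_length)
  # Each chunk of N hex characters encodes an integer in 0 .. 16^N - 1.
  chunks <- substring(hex, starts, starts + hash_subsection_size - 1)
  idx <- strtoi(chunks, base = 16L)
  # Wrap the integer into the word list (R vectors are 1-indexed).
  word_list[idx %% length(word_list) + 1]
}

words <- c("apple", "breeze", "copper", "delta")
phrase <- sketch_digest_to_phrase("00100200300f004", words)
# chunks "001" "002" "003" "00f" "004" give integers 1 2 3 15 4
# phrase is "breeze" "copper" "delta" "delta" "apple"
```

In this sketch, a word list whose length does not divide 16^N makes some words slightly more likely than others after the modulus, which is why list sizes such as 256, 2048, or 4096 are convenient.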
# hash the numbers 1 through 5
keyToEnglish(1:5)
# alternate upper and lowercase, 3 words only
keyToEnglish(1:5, word_trans=list(tolower, toupper), phrase_length=3)
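The recycling behaviour described for a 'list' passed as `word_trans` can be sketched in base R. The helper `apply_word_trans` is hypothetical and only illustrates how positions map onto a shorter list of functions; it is not taken from the package.

```r
# Hypothetical helper: apply a list of functions to words by position,
# recycling the list when there are more words than functions.
apply_word_trans <- function(words, funs) {
  vapply(seq_along(words), function(i) {
    # Position i uses function ((i - 1) mod length(funs)) + 1.
    f <- funs[[(i - 1) %% length(funs) + 1]]
    f(words[i])
  }, character(1))
}

apply_word_trans(c("Red", "Fox", "Den"), list(tolower, toupper))
# "red" "FOX" "den"
```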
Calculates least common multiple of a list of numbers
LCM(...)
... | Any number of 'numeric' vectors or nested 'list's containing such |
A 'numeric' that is the least common multiple of the input values
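A least common multiple can be derived from the greatest common divisor via lcm(a, b) = |a * b| / gcd(a, b). The sketch below uses the hypothetical names `gcd2`, `lcm2`, and `lcm_all`, and is only an illustration of the technique. The least common multiple of the word list lengths is the quantity that the `forced_limit` argument of `keyToEnglish()` refers to.

```r
# Hypothetical helpers: LCM built on the Euclidean algorithm.
gcd2 <- function(a, b) {
  while (b != 0) {
    r <- a %% b
    a <- b
    b <- r
  }
  abs(a)
}

# Divide before multiplying to keep intermediate values small.
lcm2 <- function(a, b) abs(a / gcd2(a, b) * b)

# Flatten vectors and nested lists, then fold lcm2 over all values.
lcm_all <- function(...) Reduce(lcm2, unlist(list(...)))

lcm_all(4, 6)           # 12
lcm_all(c(256, 2048))   # 2048
lcm_all(list(3, 5), 7)  # 105
```

Word lists whose lengths are powers of two keep this value small, whereas lengths that share few factors make it grow quickly.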
Returns approximate number of elements that you can select out of a set of size 'N' if the probability of there being any duplicates is less than or equal to 'p'
uniqueness_max_size(N, p)
N | 'numeric' size of set elements are selected from, or a 'list' of 'list's of 'character' vectors (e.g., 'wml_animals') |
p | 'numeric' probability that there are any duplicate elements |
'numeric' value indicating size. Value will most likely be non-integer
# how many values from 1-1,000 can I randomly select before
# I have a 10% chance of having at least one duplicate?
uniqueness_max_size(1000,0.1) # 14.51
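An estimate of this kind can be obtained from the standard birthday-problem approximation, in which the probability of no duplicates among r draws from N values is roughly exp(-r^2 / (2N)); solving for r gives r = sqrt(2 N ln(1 / (1 - p))). Whether the package uses exactly this formula is an assumption, and `sketch_max_size` is a hypothetical helper.

```r
# Hypothetical helper based on the birthday-problem approximation.
sketch_max_size <- function(N, p) {
  # Solve exp(-r^2 / (2 * N)) = 1 - p for r.
  sqrt(2 * N * log(1 / (1 - p)))
}

approx_size <- sketch_max_size(1000, 0.1)  # roughly 14.5
```

For N = 1000 and p = 0.1 the approximation lands close to the value shown in the example above.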
Calculates the probability that all 'r' elements selected with replacement from a set of size 'N' are unique
uniqueness_probability(N, r)
N | 'numeric' size of set. Becomes unstable for values greater than 10^16. |
r | 'numeric' number of elements selected with replacement |
'numeric' probability that all 'r' elements are unique
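The exact probability is the product of (N - k) / N for k from 0 to r - 1. A minimal sketch, assuming a log-scale evaluation with `lgamma` (the package's actual method is not stated here), is shown below; `sketch_uniqueness_probability` is a hypothetical name.

```r
# Hypothetical helper: probability that r draws with replacement
# from N equally likely values are all distinct.
sketch_uniqueness_probability <- function(N, r) {
  # More draws than values guarantees a duplicate.
  if (r > N) return(0)
  # N! / ((N - r)! * N^r), computed on the log scale.
  exp(lgamma(N + 1) - lgamma(N - r + 1) - r * log(N))
}

# Classic birthday problem: 23 people, 365 days.
sketch_uniqueness_probability(365, 23)  # about 0.493
```

With very large N, the two `lgamma` terms are nearly equal and their difference loses floating-point precision, which is one plausible reason for instability at extreme set sizes.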
Clean JSON text from Wikipedia
wiki_clean(x)
x | 'character' JSON text |
'character' JSON text
Word list of 256 adjectives that do not describe origin, so they can usually be used prior to visual/origin adjectives without breaking any grammar rules
data(wl_adjectives_nonorigin)
A 'character' vector
Word list of 256 adjectives that visually describe an object.
data(wl_adjectives_visual)
A 'character' vector
Word list generated by processing several animal-related pages on Wikipedia
data(wl_animal)
An object of class 'character'
data(wl_animal)
keyToEnglish(1:5, word_list=wl_animal)
Public domain word list of common words
data(wl_common)
An object of class 'character'
Public Domain Word Lists. Michael Wehar https://github.com/MichaelWehar/Public-Domain-Word-Lists
data(wl_common)
keyToEnglish(1:5, word_list=wl_common)
Public domain word list of common words, slightly truncated from original version
data(wl_freq5663)
An object of class 'character'
Public Domain Word Lists. Michael Wehar https://github.com/MichaelWehar/Public-Domain-Word-Lists
data(wl_freq5663)
keyToEnglish(1:5, word_list=wl_freq5663)
Word list generated by processing several works of literature on Project Gutenberg
data(wl_literature)
An object of class 'character'
Project Gutenberg.
data(wl_literature)
keyToEnglish(1:5, word_list=wl_literature)
Word list of 2048 singular, concrete nouns, largely excluding materials and liquids that cannot be referred to in the singular form
data(wl_nouns_concrete)
A 'character' vector
Word list of 2048 concrete nouns in plural form, largely excluding materials and liquids that cannot be referred to in the singular form.
data(wl_nouns_concrete_plural)
A 'character' vector
Word list generated by processing several science-related pages on Wikipedia
data(wl_science)
An object of class 'character'
data(wl_science)
keyToEnglish(1:5, word_list=wl_science)
Word list of 256 transitive verbs in gerund form (i.e., "ing" at end)
data(wl_verbs_transitive_gerund)
A 'character' vector
Word list of 256 transitive verbs in infinitive form (minus the "to")
data(wl_verbs_transitive_infinitive)
A 'character' vector
Word list of 256 transitive verbs in present tense
data(wl_verbs_transitive_present)
A 'character' vector
Word lists of sizes, colors, animals, and attributes to construct memorable phrases
data(wml_animals)
A 'list' of 'character' vectors
keyToEnglish(1:5, word_list=wml_animals)
List of word lists that combine cute words with physics-related words
data(wml_cutephysics)
A 'list' of 'character' vectors
List of word lists that can be used to make an often humorous sentence with 54 bits of entropy
data(wml_long_sentence)
A 'list' of 'character' vectors