  • The Slides for Guiding Large Language Models to Generate Computer-Parsable Content

    Subjects: Computer Science >> Computer Software Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2024-04-21

    Abstract: This slide presentation describes the research on Guiding Large Language Models to Generate Computer-Parsable Content in terms of Background, Motivation, Method, Effect, Prospect and Acknowledgements. For the full paper, please refer to: https://arxiv.org/abs/2404.05499

  • Constraining Large Language Model for Generating Computer-Parsable Content

    Subjects: Computer Science >> Computer Software Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2024-04-07

    Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in learning patterns from massive text corpora, including word relationships, sentence structures, and even complex semantic and pragmatic information. However, it remains challenging to induce pre-trained language models to generate structured content that strictly follows specific conventions. We propose a scheme for guiding LLMs to generate content that is highly usable by computers, without fine-tuning or additional neural network inference. The scheme introduces coroutine-based content generation constraints through a pre-agreed context-free grammar (CFG), which guides the autoregressive Transformer model to sample the correct tokens during its decoding phase so that the output forms a formal language conforming to the program's conventions. This effectively improves the stability and consistency of LLMs in generating target data structures, types or instructions, and reduces the difficulty of application development and integration. Through a bracket-pair matching experiment, we first verified that the error rate of models such as GPT-2 and Gemma reaches 95% when the length of the generated domain-specific languages (DSLs) is greater than 36 and 282, respectively, which illustrates the performance problem of some current LLMs in generating specific DSLs. We also present YieldLang, a coroutine-based DSL generation framework, and conduct experiments using LLMs on multiple task datasets, including JSON, Mermaid flowchart, and function call expression generation. These experiments show that the approach improves accuracy by a factor of 1.09 to 11.6 compared to the benchmarks and, in the best case, reduces the number of samples used by the LLMs to generate JSON to about 16.5% of the benchmarks, which effectively improves the usability of LLM-generated content for computer programs.
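
    The grammar-guided decoding idea in the abstract above can be illustrated with a toy sketch. It is not YieldLang's API and not the authors' implementation: the names `bracket_grammar`, `toy_model_scores` and `constrained_generate` are hypothetical, the "model" is a random scorer standing in for an LLM, and the grammar is the bracket-matching CFG S -> '(' S ')' S | empty. A Python generator plays the role of the coroutine: at each step it yields the tokens the grammar allows, and the decoder picks the best-scoring token from that set only.

```python
import random


def bracket_grammar(max_depth=4):
    """Coroutine for the CFG  S -> '(' S ')' S | empty.

    Each step yields the set of tokens the grammar allows next and
    receives the token that was actually emitted via send().
    """
    depth = 0
    while True:
        allowed = set()
        if depth < max_depth:
            allowed.add("(")
        if depth > 0:
            allowed.add(")")
        else:
            allowed.add("<eos>")  # ending is legal only when all brackets are closed
        token = yield allowed
        depth += 1 if token == "(" else -1


def toy_model_scores(vocab, rng):
    """Hypothetical stand-in for an LLM: one random score per vocabulary token."""
    return {tok: rng.random() for tok in vocab}


def constrained_generate(max_len=20, seed=0):
    """Decode greedily, but only among tokens the grammar coroutine allows."""
    rng = random.Random(seed)
    vocab = ["(", ")", "x", "<eos>"]  # "x" is never allowed by the grammar
    grammar = bracket_grammar()
    allowed = next(grammar)  # prime the coroutine
    out = []
    while len(out) < max_len:
        scores = toy_model_scores(vocab, rng)
        token = max(sorted(allowed), key=lambda t: scores[t])
        if token == "<eos>":
            break
        out.append(token)
        allowed = grammar.send(token)
    # Length cap reached: close any brackets that are still open.
    out.extend(")" * (out.count("(") - out.count(")")))
    return "".join(out)


def is_balanced(text):
    """Check that every '(' has a matching ')'."""
    depth = 0
    for ch in text:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


if __name__ == "__main__":
    sample = constrained_generate(seed=1)
    print(repr(sample), is_balanced(sample))
```

    Because disallowed tokens are never candidates, every output is well formed by construction, whatever scores the model assigns. This is the property the abstract attributes to grammar constraints applied during decoding.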

  • A General Rhetorical Interpretation of Sentence Translation

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2024-01-26

    Abstract: Rhetoric in language, like air, is ubiquitous. It is not only presented in the form of narrow rhetoric (rhetorical devices), but from a broad rhetorical perspective, rhetoric is also implicit in all sentences, inherently encompassing the domain of narrow rhetoric. This article starts with rhetorical devices, explains the source language and target language in a broad sense of rhetoric, explores the connection between the two, analyzes the process of sentence translation from a broad rhetorical perspective, and proposes dynamic principles for measuring the quality of sentence translation.
     

  • New Possibilities for Linguistic Research in the Era of Large Language Models

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics Subjects: Computer Science >> Natural Language Understanding and Machine Translation submitted time 2024-01-11

    Abstract: The research and engineering paradigm of natural language processing has shifted with the rapid development of large language models, represented by the GPT series. This shift has had a significant impact on related fields such as healthcare, education, the judiciary and finance. At the same time, it also brings new possibilities for linguistics, the study of language itself. In this paper, we employ GPT4, Baichuan2 and ChatGLM3 and investigate their ability to analyze complex linguistic phenomena, taking ambiguity as an example. The experimental results show that GPT4 can effectively perceive and understand complex linguistic phenomena by integrating ambiguity resolution and syntactic analysis. Baichuan2, if guided properly via prompt engineering, can improve its analytical ability without parameter optimization. In addition, the relationship between linguistic phenomena and large language models can be visually demonstrated by monitoring the internal features and neuron activities of the models when they process ambiguous sentences in different contexts. In general, our experiments indicate that large language models are beneficial for better understanding and analyzing complex linguistic phenomena, hence providing new alternatives for linguistic research.

  • How semantic prosody is acquired in novel word learning: Evidence from the “Double-Jujube Tree” Effect

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2024-01-05

    Abstract: Generally, a word’s meaning consists of at least two components. The first is denotative meaning, representing the definitional meaning found in dictionaries and serving as the word’s fundamental meaning. The second component involves semantics that a word “absorbs” from its linguistic context, not constrained by definitions; this is known as semantic prosody, described as “a consistent aura of meaning with which a form is imbued by its collocates” (Louw, 1993, p. 157). While theories and empirical studies have shed light on mechanisms supporting the acquisition of the first word meaning component, the acquisition of the connotative meaning engendered by semantic prosody has been overlooked. It remains unclear whether readers can unconsciously acquire the semantic prosody (or emotional connotations) of a novel word after encountering it consistently in a context with a strong emotional polarity.
    Against this backdrop, we conducted a word learning experiment, manipulating context emotionality (negative vs. neutral vs. positive) and context variability (same-repeated vs. varied contexts) as crucial contextual variables. This aimed to address two understudied questions in vocabulary acquisition: (1) Does transfer of affect to a word from its linguistic context take place through reading exposures, facilitating the acquisition of semantic prosody for the word? If so, is such transfer influenced by context variability? (2) Does the acquired semantic prosody for words affect the acquisition of word forms and meanings, and is this acquisition modulated by context variability? This experiment involved two sessions: a reading-and-learning phase and a testing phase. During the reading-and-learning session, participants read emotionally charged passages, simultaneously learning embedded target words. The testing session included an immediate posttest, incorporating four vocabulary tests—valence rating, orthographic choice, definition matching, and definition generation. A total of 196 Chinese speakers participated in the experiment.
    Mixed-effects models were utilized to analyze data from the valence rating task and the other three vocabulary knowledge tests. The findings revealed that, within the same-repeated context, manipulating context emotionality (positive vs. neutral vs. negative) significantly influenced valence ratings, showing significantly higher ratings in the positive condition compared to neutral and negative conditions. Conversely, in the varied context, no significant differences in valence ratings were observed. This result supports the hypothesis of the “Double-Jujube Tree” effect, emphasizing the effect of repetitive texts compared to multiple texts. However, in the varied context, valence ratings played a role in influencing participants’ performances in the vocabulary tests, leading to better outcomes as valence ratings increased. In the same-repeated context, valence ratings had minimal effect on accuracy in the orthographic choice test and the definition prompting test.
    We posit that the effective mechanism for learning the semantic-prosody-engendered connotations of words involves the transfer of affect from their collocations. However, this transfer seems to be contingent on context variability, occurring only in the same-repeated context and not in the varied context. Furthermore, we illustrate that the emotionality of context influences the quality of both orthographic and semantic word learning, with words being better learned in positive contexts as opposed to negative or neutral ones.
     

  • The independent effect of transitional probability on verbal statistical learning

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-11-01

    Abstract: In a typical statistical learning (SL) task, participants are first exposed to a nonsensical artificial language for 5 to 10 minutes and then asked to complete a two-alternative forced-choice (2AFC) task. Transitional probabilities (TPs), the core concept in SL, represent the predictability between syllables. In a given artificial language, syllables within a target word occur together more frequently, resulting in higher TPs than those of syllable sequences that span word boundaries. The latter are referred to as partwords and have lower TPs. After the exposure phase, participants enter the test phase and are presented with a partword and a target word in each trial of the 2AFC task. If the accuracy across participants is higher than chance level, it is assumed that learning has occurred. However, studies have also shown that factors other than TPs, such as word token frequency and word length variation (or the lack thereof), also affect SL performance in such tasks. To date, these factors and their interactive effects remain understudied.
    In Experiment 1, we aimed to investigate whether TPs affect SL performance when the token frequencies of target words and partwords are controlled. To do so, we created the artificial language by randomizing the order of two trisyllabic words and two disyllabic words. During the 2AFC task, three types of items (target word, partword, and nonword) were paired together, with the two items in each trial being of equal length. There were 24 trials in the test. Forty native Mandarin monolinguals participated in the experiment; they first listened to the artificial language for 5 minutes and then completed the 2AFC task. In Experiment 2, an artificial language was generated with 10 syllables and presented in the exposure phase, to examine whether the learning effect in Experiment 1 came from TPs or from participants’ prior language bias.
    Results of Experiment 1 showed that accuracy across all trials was significantly higher than chance (0.5) at the group level, suggesting that participants were able to segment an artificial language of mixed word lengths. Participants were also marginally better at choosing target words over partwords, and partwords over nonwords. To investigate the independent effect of TP in SL, we subset the data by word length and found that participants’ accuracy in choosing trisyllabic target words over partwords was marginally lower than their accuracy in choosing disyllabic target words over partwords, which suggests that disyllabic words confer an advantage in SL for this group of participants. In addition, participants’ accuracy in choosing trisyllabic partwords over nonwords was significantly higher than their accuracy in choosing disyllabic partwords over nonwords. In Experiment 2, there was no significant learning effect at any level when the statistical information was absent.
    The results of the two behavioral experiments highlight the unique contribution of TPs alone, since accuracy was assessed while controlling for word token frequency and word length. Thus, the present study suggests that TP exerts an effect on verbal SL performance independent of word token frequency. Further studies should take into account more types of statistical rule, such as mutual information and backward TP.
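
    As a rough illustration of the transitional probabilities discussed in this abstract, the forward TP from syllable x to syllable y is commonly estimated as count(xy) / count(x). The sketch below computes it for a made-up syllable stream. The four words and their syllables are hypothetical, not the study's materials, and the stream is built from all orderings of the words rather than a random sequence so that the output is deterministic.

```python
from collections import Counter
from itertools import permutations


def transitional_probabilities(syllables):
    """Forward TP for each adjacent pair: count(x followed by y) / count(x)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # the final syllable has no successor
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}


# Hypothetical lexicon: two trisyllabic and two disyllabic words.
words = [("tu", "pi", "ro"), ("go", "la", "bu"), ("da", "ko"), ("mi", "se")]

# Concatenate every ordering of the four words into one continuous stream.
stream = [syl for order in permutations(words) for word in order for syl in word]

tp = transitional_probabilities(stream)

word_final = {word[-1] for word in words}
within_word = {pair: p for pair, p in tp.items() if pair[0] not in word_final}
across_boundary = {pair: p for pair, p in tp.items() if pair[0] in word_final}

if __name__ == "__main__":
    print("within-word TPs:", within_word)
    print("boundary TPs:", across_boundary)
```

    In this toy stream each syllable belongs to exactly one word, so within-word TPs equal 1.0 while TPs across word boundaries are lower. This is the contrast between target words and partwords that the 2AFC task relies on.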
     

  • The comparison research on phonology of Kunshan, Suzhou and Shanghai dialect

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-08-19

    Abstract: Linguists have paid more attention to research on the Shanghai (SH) and Soochow (SC) dialects since the founding of the P.R.C. Some describe the Khuense (KS) dialect as a “half-Soochow, half-Shanghai” accent and therefore do not examine it in depth. To draw conclusions about KS’s phonetic distinctions and its rules of phonological evolution, the author conducted a comparative study of the modern phonologies of SC, KS and SH, and further analyzed their onsets and rhymes on the basis of the traditional Chinese rhyme system. It is true that the phonetics of KS has been strongly influenced and changed by that of SH; however, there are precise systematic correspondences between the phonemes of KS and SC. One cannot simply conclude that KS is a younger sibling or a child of SH. For instance, the rhymes of many Yu She (遇摄) and Guo She (果摄) syllables are əʊ in KS but u in SH. It is not easy to distinguish between two phonological changes: əʊ > u, a labialization of the rhyme caused by Bang Xi (帮系) onsets and medial u; and u > əʊ, a phonetic split that happens frequently in Wu Chinese. The question is critical because many cause-and-effect relationships depend on the judgment of KS’s identity, namely whether it represents an older or a younger generation than SH.
     

  • The Essence of syllables and A new explanation of the nature of vowels and consonants as syllables

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-07-18

    Abstract: The syllable has long been regarded as a one-order phonetic unit without being known as an illusion; vowels and consonants have been the most solid units of phonetics without being known as impostors of letters; and the historical misconception that vowels and consonants make up syllables has gone unrecognized. The article analyzes the reasons why syllables have no linguistic status, describes the confusion over the unknown origin of vowels and consonants, analyzes the relationship between syllables and letters from a chronological perspective, experimentally explores the unitary form of speech production, and reveals that the temporal structure of articulation prescribes the nature of syllables from a co-temporal perspective. Based on this, the paper redefines syllables in a subversive way, and reshapes the true status and value of vowels and consonants. The article concludes by suggesting that syllables evolved in a long history and eventually produced complex word syllable structures and limited pure syllable forms through shedding, whereby a set of minimal syllable analysis schemes and syncopation principles can be proposed.

  • Pronunciation analysis of late summer

    Subjects: Medicine, Pharmacy >> Traditional Chinese Medicine and Chinese Materia Medica Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-07-13

    Abstract: Late summer (Changxia) is an important term in Chinese medicine. Among the ancient medical scholars, only Gao Shizong was found to have clearly explained the pronunciation of the character "Chang" in Changxia. In order to clarify the pronunciation of this character, we searched the ancient classical Chinese medical texts, sorted through the existing literature and analyzed the views of existing scholars. On this basis, we propose a new opinion: when the character "Chang" is pronounced zhǎng, it means that the elders are honored; when it is pronounced cháng, it expresses the good wish of long-lasting life and vitality and the long duration of time, and it also indicates the longest state of the yellow bell. Clarifying the pronunciation of late summer can help to further understand Changxia and to further interpret and study Chinese medicine.
     

  • Visual world paradigm reveals the time course of spoken language processing

    Subjects: Psychology >> Other Disciplines of Psychology Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-07-12

    Abstract: The visual world paradigm (VWP) assesses real-time language processing by tracking and measuring eye movements in visual contexts. Linking hypotheses, such as the coordinated interplay account and the goal-based linking hypothesis, establish the link between eye movements and the cognitive processes of language comprehension. Time sensitivity is characteristic of the data generated by this paradigm. Analytical methods include the analysis of fixation proportions within time windows, divergence point analysis, and growth-curve analysis. Studies using the VWP provide important evidence for speech and lexical recognition, syntactic parsing, semantic integration, and the processing of discourse and pragmatic information.

  • A Study of the Double Tone in the Two tone Dialect of Lan-Yin Mandarin

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-04-28

    Abstract: This paper studies the disyllabic patterns of four two-tone dialects of Lan-Yin Mandarin, and describes and compares the monosyllabic and disyllabic patterns of these four dialects. Then, from an acoustic perspective, it compares the duration, pitch, and intensity of double tones and two-syllable words within the four two-tone dialects. The results show that the duration of double tones is significantly longer than that of two-syllable words, and that there are corresponding differences in the pitch and intensity curves; the intensity curve of double tones has a clear valley. Finally, from the perspective of language contact, the paper analyzes the causes of the double-tone characteristics of the two-tone dialects of Lan-Yin Mandarin, and points out the significance of the comparative study of two-syllable words and double tones.

  • Research on Chinese Phonetic Awareness: Taking Cantonese Dialect as an Example

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-04-28

    Abstract: This study explores the ability of Chinese adults to detect and manipulate different speech units through phonological awareness testing, and discusses the impact of rhyming, the basic units of speech perception, and knowledge of Chinese Pinyin on the establishment of phonological awareness in Chinese speakers. The experiment examined the performance of two groups of adult participants (15 people each), with and without Pinyin ability, in the local dialect of Duanzhou, Zhaoqing, in seven phonological awareness tests (syllable awareness, vowel detection, tone detection, initial detection, vowel substitution, tone substitution, and initial deletion). The results showed that both groups had complete syllable awareness. The Pinyin group was proficient in detecting and manipulating tonal units, while the non-Pinyin group found it difficult to operate longitudinal tones alone but had a certain level of detection ability. Among the detection tasks, vowel detection was the best, followed by tones, and initials were the worst. Participants could generate syllables that rhyme with a given stimulus without needing to extract vowels from the syllables. The Pinyin group could cut syllables at the initial-vowel boundary, while the non-Pinyin group could not. The acquisition of Pinyin changes the basic perception mode of Chinese phonetics.

  • On Phonetic and Phonetic Systems: Taking Chinese as an Example

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2023-04-28

    Abstract: With the progress of phonetics technology and research methods, phonemics has gradually developed from the methodology of structuralism to the methodology of cognitive science, and gradually established the "cognitive phonemics". From the perspective of cognitive phonemics, phonetic awareness is the first step in establishing phonemes. Phonemic awareness is usually defined as the phonetic units that can be perceived by native speakers without discriminating meanings. Phonetic awareness can be divided into natural phonetic awareness and unnatural phonetic awareness. Based on these basic concepts, this paper discusses the phonemic system of Chinese in different historical periods: 1) Syllabic phoneme system "Zhiyin"; 2) Phonemic system of initials and finals "Fanqie"; 3) Quasi-phonemic system of initials, rhymes and tones "Guang Yun"; 4) Quasi-phonemic system of initials, rhymes and tones "Zhuyin Fuhao"; 5) Phonemic system of segments "Hanyu Pinyin". The study found that the phonemic system of Chinese gradually changed from a phonemic system based on natural phonetic awareness to one based on unnatural phonetic awareness. From the point of view of phonetic units, it also gradually changed from syllables to segments. Based on these studies and findings, this paper gives a clear definition of "phonemic system" and "phonetic notation system", and expounds the difference between them. According to these definitions, it is found that most of the field investigations and studies on phonemic construction of Sino-Tibetan languages have established a "phonetic notation system" rather than a "phonemic system".

  • CCTE-A database of Chinese COVID-19 Terms

    Subjects: Psychology >> Cognitive Psychology Subjects: Psychology >> Experimental Psychology Subjects: Psychology >> Psychological Measurement Subjects: Psychology >> Statistics in Psychology Subjects: Psychology >> Other Disciplines of Psychology Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics Subjects: Other Disciplines >> Synthetic discipline submitted time 2023-02-08

    Abstract: Objective: To establish a multi-dimensional and standardized lexical database of COVID-19-related terms and words. The database may facilitate COVID-19-related research in domains such as Psychology, Psychiatry, Neuroscience, etc. Methods: This database referred to the established methods of emotional lexical databases at home and abroad, and used the dot-detection task, with words in the database as experimental materials, to test the attention bias of subjects suspected of having COVID-19 phobia, so as to test the validity of the database. Results: 196 COVID-19-related words and 99 neutral words were included in the word database. Then, we classified and evaluated the words along six dimensions, and established a standardized database of Chinese COVID-19-related terms. The words have good reliability and internal consistency. In addition, the validity was tested through the dot-detection task. Subjects with COVID-19 fear and those without COVID-19 fear showed a significant attentional bias toward COVID-19-related words. Limitations: The initial sample size is small and the database application needs further development. Conclusions: The database of Chinese COVID-19 terms has good reliability, internal consistency, and validity, and can be used as material for COVID-19-related research in the future.

  • Research on the rising phenomenon of intonation at the end of sentences in Jinzhou dialect

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2022-08-10

    Abstract: Objective: This paper explores the proportion and influencing factors of rising intonation at the end of sentences in Jinzhou discourse, in order to objectively display the intonation characteristics of Jinzhou discourse. At the same time, it identifies the acoustic parameters that can reflect rising intonation at the end of a sentence, and lays a foundation for acoustic research on intonation. Methods: Combining acoustic experiments and listening discrimination experiments, and taking the four sentence types in Mandarin as the control group, this paper uses parameters such as the pitch difference between the beginning and the end of the word at the end of the sentence, and the pitch difference between the beginning and the end of the last non-modal-particle word, to analyze and describe the characteristics of intonation at the end of sentences in Jinzhou dialect. Results: The rate of rising intonation at the end of a sentence is 47.6%. Whether an intonation is judged as rising mainly depends on how the pitch difference between the beginning and the end of a sentence in Jinzhou dialect compares with that in Mandarin. When expressing doubt and shock, Jinzhou dialect shows a larger rise than Mandarin, and declarative sentences ending in “啊” show a kind of rising intonation. Limitations: This paper only gives an initial analysis of the conditions and performance of rising intonation at the end of sentences in Jinzhou from the perspective of pitch, while the duration and intensity of rising intonation also differ from those in Mandarin, and further research is needed. Conclusions: The results show that there is a rise in pitch at the end of sentences in Jinzhou dialect, but not in every sentence. The pitch difference between the beginning and the end of the word at the end of the sentence, and the pitch difference between the beginning and the end of the last non-modal-particle word, can be used as indicators for judging a rise. The rising intonation at the end of a sentence corresponds to a pragmatic purpose, which is to attract others' attention in order to obtain a response.

  • Using Deep Learning to Study the Effects of Chinese Writing Systems and Fonts on Reading Performance

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2022-01-05

    Abstract: [Objective] We study the reading performance of different fonts and writing systems that are used in Chinese publications. [Methods] Specifically, the Chinese characters in a sentence are rendered into their corresponding glyph images, and those images are then folded into a three-dimensional sentence tensor according to the word order. For different fonts or for simplified/traditional Chinese text, we obtain corresponding representations with visual differences. By feeding the obtained sentence tensor into the proposed deep language model, we test the representations on text classification, which allows the influence of font and writing system on reading performance to be studied objectively. [Results] According to experiments on two real-world Chinese text classification datasets, Toutiao and Thucnews, the accuracy of text classification with some uncommon fonts is lower than with commonly used fonts, and the text representation efficiency also differs among the common fonts. [Conclusions] Through a hypothesis test, we found a significant difference in text classification accuracy between datasets rendered in regular script and in bold script, with regular script being more efficient than bold script. There are also significant differences in reading performance between the simplified and traditional writing systems.

  • Acoustic Analysis of Monosyllabic and Disyllabic Tones in the Xingyang Dialect

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2021-03-08

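    Abstract: The Xingyang dialect of Henan belongs to Central Plains Mandarin, a branch of Mandarin, and has four tones, Yinping, Yangping, Shangsheng and Qusheng, with no entering tone (Rusheng). This paper designs a word list based on the dialect's phonological system, uses experimental phonetic methods to extract parameters such as fundamental frequency and duration from representative characters and words, and analyzes the fundamental frequency patterns of monosyllabic and disyllabic tones in the Xingyang dialect according to the five-degree tone notation and normalization principles. Using Praat, the monosyllabic tones were re-measured as two falling tones (Yangping 52, Qusheng 31), one level tone (Shangsheng 33) and one rising tone (Yinping 23). Disyllabic tones for the different tone combinations were also measured, and on this basis the tone sandhi rules of disyllabic tones were summarized. In the disyllabic tone sandhi patterns of the Xingyang dialect, sandhi occurs mainly on the first syllable and produces four new tone values, 53, 42, 44 and 32; the sandhi types are mainly simplifying and dissimilatory.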

  • An Experimental Study of Monosyllabic Tones and Disyllabic Tone Sandhi in the Xinjiang (新绛) Dialect

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2021-03-08

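    Abstract: The Xinjiang (新绛) dialect belongs to the Jiangzhou sub-cluster of the Fenhe cluster of Central Plains Mandarin. Using an experimental approach with fundamental frequency as the main acoustic parameter, this paper studies the monosyllabic tones and disyllabic tone sandhi of the dialect of Xinjiang County, Yuncheng City. The results show that the Xinjiang dialect has three monosyllabic tone categories: Yinping and Qusheng are a high falling tone (41), Yangping is a mid rising tone (34), and Shangsheng is a level tone (33). Disyllabic tone sandhi is pronounced, affecting both the first and the second syllable. Yangping, Shangsheng and Qusheng merge when each is the first syllable before a Shangsheng syllable. There are 16 disyllabic tone-combination patterns in total, which reduce to 13 after merging. The tone sandhi is highly regular. In most cases Yinping keeps its falling contour, with slight changes in tone value. Because some words derive from the historical entering tone, there are two additional combination patterns when Yangping is the first syllable. Sandhi occurs mostly in Yangping and Shangsheng.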

  • Acoustic Analysis of Monosyllabic and Disyllabic Tones in the Qianjiang Dialect

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2021-03-08

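    Abstract: The Qianjiang dialect belongs to the Wu(han)-Tian(men) cluster of Southwestern Mandarin within the Northern dialect area and has four tone categories: Yinping, Yangping, Shangsheng and Qusheng. There has been extensive discussion of the Qianjiang dialect's classification within Southwestern Mandarin, its phonological system, and its grammatical and lexical features, but no acoustic analysis based on measured data has been carried out. Using experimental phonetic methods and Praat, this paper extracts the fundamental frequency of monosyllabic and disyllabic tones in the Qianjiang dialect to analyze them, re-measures their tone values, and summarizes the sandhi rules in disyllabic tones.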

  • Acoustic Analysis of Monosyllabic and Disyllabic Tones in Lecheng Baihua, Gaoyao

    Subjects: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics submitted time 2021-03-08

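    Abstract: Lecheng Town, in Gaoyao District, Zhaoqing City, lies in the west-central part of Guangdong Province. The locally spoken “Baihua” is a variety of Guangfu Yue (Cantonese). Earlier dialectological monographs recorded Gaoyao and urban Zhaoqing as a single dialect point; no acoustic analyses of other dialect points within Gaoyao District and no records of Lecheng Baihua have been found. The experimental results show that Lecheng speech has nine tone categories, designated Yinping, Yangping, Yinshang, Yangshang, Yinqu, Yangqu, upper Yinru, lower Yinru and Yangru. Lecheng Baihua shows no obvious disyllabic tone sandhi, but the first syllable has a narrower pitch range and a shorter duration.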