
Lab Set 3

Ike Cai

2024-10-04

1 Lab Set 3

1.1 Dependency parsing

1.1.1 Task 1

From the keyness table (acad_v_fic), identify the preposition with the highest keyness value.
Report the relevant statistics (including frequencies and dispersions) following the conventions
described in Brezina.

library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --


v dplyr 1.1.4 v readr 2.1.5
v forcats 1.0.0 v stringr 1.5.1
v ggplot2 3.5.1 v tibble 3.2.1
v lubridate 1.9.3 v tidyr 1.3.1
v purrr 1.0.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to becom

library(quanteda)

Package version: 4.0.2
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.

library(quanteda.textstats)
library(udpipe)
library(gt)

source("../R/keyness_functions.R")
source("../R/helper_functions.R")

load("../data/sample_corpus.rda")

set.seed(123)
sub_corpus <- sample_corpus %>%
  mutate(text_type = str_extract(doc_id, "^[a-z]+")) %>%
  group_by(text_type) %>%
  sample_n(5) %>%
  ungroup() %>%
  dplyr::select(doc_id, text)

corpus_split <- split(sub_corpus, seq(1, nrow(sub_corpus), by = 10))

library(future.apply)

Loading required package: future

ncores <- 4L
plan(multisession, workers = ncores)

annotate_splits <- function(corpus_text) {
  ud_model <- udpipe_load_model("../models/english-ewt-ud-2.5-191206.udpipe")
  x <- data.table::as.data.table(
    udpipe_annotate(ud_model, x = corpus_text$text, doc_id = corpus_text$doc_id)
  )
  return(x)
}

annotation <- future_lapply(corpus_split, annotate_splits, future.seed = TRUE)

annotation <- data.table::rbindlist(annotation)

anno_edit <- annotation %>%
  dplyr::select(doc_id, sentence_id, token_id, token, lemma, upos, xpos,
                head_token_id, dep_rel) %>%
  rename(pos = upos, tag = xpos)

anno_edit <- structure(anno_edit, class = c("spacyr_parsed", "data.frame"))

sub_tkns <- as.tokens(anno_edit, include_pos = "tag", concatenator = "_")

doc_categories <- names(sub_tkns) %>%
  data.frame(text_type = .) %>%
  mutate(text_type = str_extract(text_type, "^[a-z]+"))

docvars(sub_tkns) <- doc_categories

sub_dfm <- dfm(sub_tkns)

sub_dfm <- sub_tkns %>%
  tokens_select("^.*[a-zA-Z0-9]+.*_[a-z]", selection = "keep",
                valuetype = "regex", case_insensitive = TRUE) %>%
  dfm()

acad_dfm <- dfm_subset(sub_dfm, text_type == "acad") %>%
  dfm_trim(min_termfreq = 1)
fic_dfm <- dfm_subset(sub_dfm, text_type == "fic") %>%
  dfm_trim(min_termfreq = 1)

acad_v_fic <- keyness_table(acad_dfm, fic_dfm) %>%
  separate(col = Token, into = c("Token", "Tag"), sep = "_")

acad_v_fic %>%
  filter(Tag == "in") |>
  slice(1:10) |>
  gt() |>
  cols_hide(columns = c('LL', 'LR', 'PV', "Tag")) |>
  fmt_number(columns = c('Per_10.4_Tar', 'Per_10.4_Ref'), decimals = 2) |>
  fmt_number(columns = c('DP_Tar', 'DP_Ref'), decimals = 3) |>
  as_raw_html()

Token     AF_Tar  AF_Ref  Per_10.4_Tar  Per_10.4_Ref  DP_Tar  DP_Ref
of           563     159        445.02        121.08   0.094   0.067
by            96      30         75.88         22.84   0.188   0.165
as            94      35         74.30         26.65   0.084   0.401
than          37      11         29.25          8.38   0.275   0.163
per           14       1         11.07          0.76   0.605   0.800
in           258     191        203.94        145.45   0.060   0.094
among         15       2         11.86          1.52   0.405   0.800
near           5       0          3.95          0.00   0.798      NA
between       17       6         13.44          4.57   0.333   0.269
for          120      92         94.85         70.06   0.217   0.090

Your response: The preposition with the highest keyness value is “of”, the first
row of the table above. In the target corpus (academic writing) it has an absolute
frequency (AF_Tar) of 563 and a relative frequency (Per_10.4_Tar) of 445.02 per
10,000 tokens, compared with an absolute frequency (AF_Ref) of 159 and a relative
frequency (Per_10.4_Ref) of 121.08 per 10,000 tokens in the reference corpus
(fiction). Its relative frequency is therefore roughly 3.7 times higher in academic
writing. The dispersion values are low in both corpora (DP_Tar = 0.094, DP_Ref =
0.067), which indicates that “of” is spread evenly across the texts in each corpus
rather than concentrated in a few of them. Other prepositions in the table, such as
“by” and “in”, are also more frequent in the target corpus, but their frequencies
are much lower than those of “of”.
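
Because the LL and LR columns are hidden in the table, the following is a minimal sketch of how a keyness value for “of” could be recomputed from the reported frequencies. It uses the two-term log-likelihood formula that is common in keyword analysis, which may differ in detail from what keyness_table() does internally. The corpus sizes are not given in the report; here they are an assumption, back-calculated from the absolute and per-10,000 frequencies.

```r
# Frequencies for "of" as reported in the keyness table
af_tar <- 563
af_ref <- 159

# ASSUMPTION: corpus sizes are back-calculated from the per-10,000 rates,
# since the report does not state them directly
n_tar <- af_tar / 445.02 * 1e4
n_ref <- af_ref / 121.08 * 1e4

# Expected frequencies if "of" occurred at the same rate in both corpora
total_af <- af_tar + af_ref
exp_tar <- n_tar * total_af / (n_tar + n_ref)
exp_ref <- n_ref * total_af / (n_tar + n_ref)

# Log-likelihood keyness statistic (two-term form)
ll <- 2 * (af_tar * log(af_tar / exp_tar) + af_ref * log(af_ref / exp_ref))

# Log ratio effect size: binary log of the ratio of relative frequencies
log_ratio <- log2((af_tar / n_tar) / (af_ref / n_ref))

print(c(LL = ll, LR = log_ratio))
```

A log-likelihood above 3.84 corresponds to p < 0.05 on one degree of freedom, so the statistic and the effect size can be reported alongside the frequencies and dispersions.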

Create a KWIC table (with the preposition as the node word and a context window of 3) of
10 rows.

preposition <- kwic(sub_tkns, pattern = "of_IN", window = 3)


preposition |>
  head(10) |>
  gt() |>
  cols_hide(columns = c('keyword', 'pattern')) |>
  as_raw_html()

docname  from   to  pre                                post
acad_31    24   24  revising_VBG the_DT picture_NN     how_WRB they_PRP lived_VBD
acad_31    53   53  a_DT new_JJ category_NN            reptiles_NNS ._. Today_NN
acad_31    62   62  familiar_JJ to_IN people_NNS       all_DT ages_NNS ,_,
acad_31   129  129  the_DT American_NNP Museum_NNP     Natural_NNP History_NNP ,_,
acad_31   154  154  complete_JJ fossil_NN skeleton_IN  a_DT sauropod_NN to_TO
acad_31   167  167  An_DT 1891_CD reconstruction_NN    B._NNP excelsus_NNP by_IN
acad_31   182  182  at_IN the_DT top_NN                the_DT page_NN ._.
acad_31   201  201  under_IN the_DT direction_NN       the_DT paleontologist_NN Edward_NNP
acad_31   230  230  ”_'' 1910_CD reconstruction_NN     Diplodocus_NNP ,_, by_IN
acad_31   241  241  followed_VBD the_DT views_NNS      the_DT paleontologist_NN Oliver_NNP

Posit an explanation for the higher frequency of the preposition in the target corpus (academic
writing) vs. the reference corpus (fiction).

Your response: There are a few likely reasons for the higher frequency of “of” in
academic writing compared to fiction. First, academic writing often employs more
complex sentence structures, in which prepositions clarify relationships between
ideas; in the KWIC table, “of” repeatedly attaches one noun phrase to another (e.g.,
“the picture of how they lived,” “a new category of reptiles”). Second, academic
writing includes detailed explanations and descriptions that need prepositions to
specify the relationships between concepts (e.g., “people of all ages,” “the top of
the page”). Lastly, references to studies, institutions, and attributions often rely
on “of” (e.g., “the American Museum of Natural History,” “reconstruction of B.
excelsus,” “under the direction of the paleontologist”). Fiction, on the other hand,
typically focuses on narrative flow and dialogue, which may require fewer of these
constructions, leading to the lower frequency.

1.1.2 Task 2

Calculate the mean length of the noun phrases in the two text-types (acad_nps and
fic_nps). Report the results and posit an explanation for the findings that connects to
the previous findings related to prepositions.

library(dplyr)
library(stringr)

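extract_noun_phrases <- function(annotation_data) {
  # Approximate noun-phrase length by counting, for each head token, the
  # dependents whose relation is one of the listed types.
  # Note: UD v2 models label direct objects "obj", so "dobj" may match no rows.
  np_candidates <- annotation_data %>%
    filter(dep_rel %in% c("det", "amod", "compound", "nmod", "nsubj", "dobj")) %>%
    group_by(doc_id, sentence_id, head_token_id) %>%
    summarise(np_length = n())

  return(np_candidates)
}

np_acad <- extract_noun_phrases(filter(anno_edit, str_detect(doc_id, "^acad")))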

`summarise()` has grouped output by 'doc_id', 'sentence_id'. You can override
using the `.groups` argument.

np_fic <- extract_noun_phrases(filter(anno_edit, str_detect(doc_id, "^fic")))

`summarise()` has grouped output by 'doc_id', 'sentence_id'. You can override
using the `.groups` argument.

mean_np_length_acad <- np_acad %>% summarise(mean_length = mean(np_length))

`summarise()` has grouped output by 'doc_id'. You can override using the
`.groups` argument.

mean_np_length_fic <- np_fic %>% summarise(mean_length = mean(np_length))

`summarise()` has grouped output by 'doc_id'. You can override using the
`.groups` argument.

mean(mean_np_length_acad$mean_length)

[1] 1.483593

mean(mean_np_length_fic$mean_length)

[1] 1.21275

Your response: After summarizing the noun phrase lengths (measured here as the
number of modifying dependents attached to each head), the mean noun phrase length
is approximately 1.48 for academic texts and approximately 1.21 for fiction texts.
The longer mean noun phrase length in academic texts reflects the more complex and
structured language typically used in such writing. Academic writing tends to employ
more detailed and elaborate noun phrases (e.g., multiple adjectives, compounds) to
provide precision and clarity. In contrast, fiction tends to use simpler noun phrases
to maintain narrative flow and accessibility. This finding complements the
observation that certain prepositions (such as “of”) are more frequent in academic
writing, as they often serve to link these longer, more complex noun phrases.
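
To make the length measure concrete, here is a small sketch in base R that applies the same counting logic to a single phrase. The parse below is a hand-made, hypothetical annotation of “the complete fossil skeleton of a sauropod” (not output from the udpipe model), so the relation labels are an assumption about what a parser would produce.

```r
# HYPOTHETICAL hand-made parse (not taken from the lab data)
toy <- data.frame(
  token_id      = 1:7,
  token         = c("the", "complete", "fossil", "skeleton", "of", "a", "sauropod"),
  head_token_id = c(4, 4, 4, 0, 7, 7, 4),
  dep_rel       = c("det", "amod", "compound", "root", "case", "det", "nmod"),
  stringsAsFactors = FALSE
)

# Same relation set as in extract_noun_phrases()
np_rels <- c("det", "amod", "compound", "nmod", "nsubj", "dobj")

# Keep only the counted relations, then count dependents per head token
kept <- toy[toy$dep_rel %in% np_rels, ]
np_length <- table(kept$head_token_id)

# Mean number of counted dependents per head
mean_np_length <- mean(as.numeric(np_length))

print(np_length)
print(mean_np_length)
```

In this toy parse the head “skeleton” collects four counted dependents, one of which is the nmod relation introduced by “of”. This is the mechanism by which frequent “of” phrases would raise the mean length in academic texts.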

1.2 Logistic regression

1.2.1 Task 1

Following the example in Brezina (pg. 129), report and briefly interpret the output of the
regression model (wt_regs, line 245).

Your response: The context type was a significant predictor of the relative pronoun
used (that/which). Entry of the predictors into the model significantly improved
model fit: the deviance fell from 8764 (null) to 8135 (residual), with AIC = 8143.
As the coefficients table shows, the use of a comma significantly increases the odds
of using which (OR = 4.89, 95% CI [1.46, 5.77]) compared to the absence of a comma.
Being a student from the UK also increases the odds of which (OR = 1.43, 95% CI
[1.29, 1.57]). In contrast, speaker status had a negligible effect (OR = 1.02, 95%
CI [0.89, 1.16]); because this interval includes 1, it indicates no significant
influence on the choice between that and which.
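
The improvement in fit can be checked directly from the reported deviances. The sketch below is a rough illustration, not the output of wt_regs itself. It assumes a binomial model fitted to binary (ungrouped) outcomes, in which case AIC equals the residual deviance plus twice the number of estimated parameters, so the number of predictors can be backed out.

```r
# Values reported in the response
null_deviance     <- 8764
residual_deviance <- 8135
aic               <- 8143

# ASSUMPTION: binary-outcome logistic model, so AIC = residual deviance + 2k
k_params <- (aic - residual_deviance) / 2   # parameters, including the intercept
df_model <- k_params - 1                    # predictors added to the null model

# Likelihood-ratio chi-square test of the model against the null model
lr_chisq <- null_deviance - residual_deviance
p_value  <- pchisq(lr_chisq, df = df_model, lower.tail = FALSE)

# McFadden's pseudo R-squared as a rough effect-size measure
mcfadden_r2 <- 1 - residual_deviance / null_deviance

print(c(chisq = lr_chisq, df = df_model, p = p_value, R2 = mcfadden_r2))
```

Under these assumptions the deviance drops by 629 on 3 degrees of freedom, which supports the statement that the predictors significantly improve fit. The pseudo R-squared of about 0.07 suggests the improvement is modest in size.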

1.3 Task 2

Write a brief interpretation of the probability curves illustrated in Figure 5.

Your response: Figure 5 displays the relationship between the normalized frequencies
of hedges and boosters (hedges_norm and boosters_norm, respectively) and the
predicted probabilities of the three disciplines (Humanities, Sciences, and Social
Sciences). For Humanities, the probability increases smoothly from 0 to 1 as
boosters_norm rises. The curve is sigmoidal (S-shaped), the form produced by a
logistic model: as the use of boosters increases, the likelihood of a text being
classified as Humanities grows steadily, reflecting a strong association between
boosters and Humanities texts. For Sciences, the probability shows the opposite
trend, starting at 1 and declining to 0 as boosters_norm approaches a value of 3.
This suggests that higher frequencies of boosters correlate with lower probabilities
of a text being classified as Sciences. For Social Sciences, the probability follows
a bell-shaped curve that peaks at a boosters_norm value of around 2. This indicates
that moderate frequencies of boosters are most characteristic of Social Sciences
texts, while both high and low frequencies are less indicative. The curves for
hedges_norm mirror those for boosters_norm, indicating that both features influence
the classification of Social Sciences texts. In particular, as hedges increase, the
probability for Social Sciences also increases, highlighting the nuanced role that
both hedging and boosting language play in this discipline’s textual characteristics.
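
The three curve shapes described above (one rising, one falling, one peaking in the middle) are what a multinomial logistic model typically produces for three categories. The sketch below illustrates this with hypothetical coefficients chosen only to reproduce the general shapes; they are assumptions, not estimates from the lab's model or data.

```r
# Hypothetical range of boosters_norm values
x <- seq(0, 4, by = 0.1)

# HYPOTHETICAL linear predictors; Sciences is treated as the baseline category
eta_sci <- rep(0, length(x))
eta_soc <- -3 + 2 * x
eta_hum <- -8 + 4 * x

# Softmax: convert the linear predictors into probabilities that sum to 1
denom <- exp(eta_sci) + exp(eta_soc) + exp(eta_hum)
probs <- data.frame(
  x               = x,
  Sciences        = exp(eta_sci) / denom,
  Social_Sciences = exp(eta_soc) / denom,
  Humanities      = exp(eta_hum) / denom
)

# Value of x at which the middle category is most probable
peak_soc <- x[which.max(probs$Social_Sciences)]
print(peak_soc)
```

Calling matplot(x, probs[, -1], type = "l") would draw the three curves. The category with the intermediate slope is squeezed out at both extremes, which is why it appears bell-shaped even though each linear predictor is a straight line.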

1.4 Multi-dimensional analysis

1.4.1 Task 1

Brezina similarly plots the Brown corpus registers on pg. 169. His process is a little different.
Rather than extracting factor loadings from the Brown corpus, he uses the loadings from the
original Biber data (some of which are listed on pg. 168).
Our loadings for dimension 1 are similar to Biber’s, though with some differences. Likewise,
the resulting plot is similar to the one on pg. 169. Why is this the case, do you think? (If you
want to check Biber’s description of his corpus, it’s on pg. 66 of his book.)

Your response: The similarity between the dimension loadings found in this analysis
and those in Biber’s work could stem from the way language use is structured across
the registers of the Brown Corpus. Both analyses focus on similar linguistic features
and register categories, leading to comparable results despite differences in
methodology and datasets. For instance, Biber’s Dimension 1, “Involved vs.
Informational Production,” distinguishes discourse types that remain relevant in
both analyses. The insights gained from MDA can inform various fields, including
linguistics, discourse analysis, and even artificial intelligence, where
understanding language patterns is essential.

1.4.2 Task 2

Using information from the factor loadings, the positions of the disciplines along the dimension,
and KWIC tables, name dimension 1 following the X vs. Y convention. In a couple of sentences,
explain your reasoning.

Your response: Based on Brezina’s functional interpretation of factors as dimensions
and the analysis of the KWIC tables, I would name dimension 1 “Conversational
Engagement vs. Academic Formality.” The positive end of this dimension,
characterized by features such as private verbs, contractions, and personal pronouns
(as seen in the “Confidence High” KWIC examples), reflects a style of communication
that is interactive and engaging. These features suggest a focus on personal
experiences and subjective interpretations, indicative of conversational language
that aims to connect with the audience. In contrast, the negative end, marked by
features such as nouns, prepositional phrases, and structured academic writing moves
(highlighted in the “Academic Writing Moves” KWIC examples), indicates a formal and
informational approach typical of academic discourse. This style prioritizes
objectivity, structure, and logical argumentation, aimed at conveying information
rather than fostering interpersonal engagement.
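
As an illustration of how texts end up at the positive or negative end of a dimension, the sketch below computes dimension scores in the usual multi-dimensional analysis way: standardize each feature, then sum the positively loading features and subtract the negatively loading ones. The text names, frequencies, and the choice of four features are all hypothetical, made up for this example rather than taken from the lab data.

```r
# HYPOTHETICAL normalized feature frequencies for four made-up texts
feats <- data.frame(
  private_verbs = c(18, 15, 4, 3),
  contractions  = c(25, 20, 1, 0),
  nouns         = c(160, 170, 290, 310),
  prepositions  = c(80, 90, 140, 150),
  row.names     = c("conv_1", "conv_2", "acad_1", "acad_2")
)

# Standardize each feature to z-scores across texts
z <- scale(feats)

# ASSUMPTION: which features load positively and negatively on dimension 1
positive <- c("private_verbs", "contractions")
negative <- c("nouns", "prepositions")

# Dimension score: sum of positive-loading z-scores minus negative-loading ones
dim1 <- rowSums(z[, positive]) - rowSums(z[, negative])
print(round(dim1, 2))
```

Texts rich in private verbs and contractions receive positive scores, while noun-heavy and preposition-heavy texts receive negative scores, which matches the “X vs. Y” reading of the two ends of the dimension.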
