
Translation and Technology Chapter 2 Summary

Chapter 2 discusses the translation memory (TM) database, its features, and how it can be customized to enhance translation quality and productivity. It covers methods for increasing TM size, the importance of segmentation, and the role of metadata in managing translation units. Additionally, it highlights the significance of TMX files for sharing and maintaining consistency across translation projects.

Uploaded by

i5alid55

Translation and Technology Chapter 2: The Translation Memory Database

Key concepts

• The translation memory database can be filled to suit our individual needs.
• The translation memory database accepts several methods to boost its content.
• The translation memory database can be customised to suit our language pairs.
• The translation memory database includes features to improve translation quality and
productivity.
• The translation memory database can affect profit margins.

Introduction

- What does the TM contain, and what does it need? What is the primary database in a CAT tool?

Translation memory (TM) is a database consisting of translation units (TUs), which we
have entered over time. It is a valuable resource that needs intelligent management, regular edits,
and good maintenance to stay customized and up to date. It is the primary database in the CAT tool.

- How do we increase the size of the TM? What is TMX?

1. Importing TMX files (interchangeable TMs)


2. Constant use over time

- What are TM features?

1. It automatically stores TUs when we confirm a translated segment.


2. We must manually add and highlight the terminology pairs we want to store in the
terminology database (Tmdb).

2.1 Creating a translation memory database

- What does the TM do in the translation process?

The Translation Results window displays target results when a translation memory (TM)
segment matches an identical or almost identical source segment. The TM (1) compares the
content in the source segment against TM segments and (2) presents the translation units (TUs) found.

The degree of match is expressed in percentages:

• 100% or perfect match: the content of a translation memory segment matches the document
segment exactly.
• 101% match or context match: a new segment and a translation memory segment match
precisely, including tags, numbers, punctuation, etc., and typically also share the same
surrounding context.
• Fuzzy match (less than 100%): the default fuzzy match threshold is generally 70%. Below this
level there is usually too little usable content.
• No match: no match is found (translate from scratch).
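
The four match levels above can be sketched in a few lines of Python. This is only an illustration: real CAT tools use their own, undisclosed scoring, so the character-based similarity from the standard library's `difflib` is a stand-in, and the `same_context` flag is a hypothetical parameter that assumes a context match is an exact match found in the same context.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 70  # default threshold quoted in the notes above


def match_percentage(new_segment: str, tm_segment: str) -> int:
    """Rough character-based similarity, 0 to 100 (illustrative only)."""
    if new_segment == tm_segment:
        return 100
    ratio = SequenceMatcher(None, new_segment, tm_segment).ratio()
    # Cap at 99 so that only identical strings can score 100.
    return min(99, int(ratio * 100))


def classify(new_segment: str, tm_segment: str, same_context: bool = False) -> str:
    """Label a TM hit using the four levels listed above."""
    score = match_percentage(new_segment, tm_segment)
    if score == 100:
        # Assumption: a context match is an exact match in the same context.
        return "context match (101%)" if same_context else "exact match (100%)"
    if score >= FUZZY_THRESHOLD:
        return f"fuzzy match ({score}%)"
    return "no match"


print(classify("Press the red button.", "Press the red button."))
print(classify("Press the red button.", "Press the green button."))
print(classify("Press the red button.", "xyz"))
```

Changing one word in a short sentence still lands in the fuzzy band, while an unrelated string falls below the threshold and would be translated from scratch.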

- What can the translator do with proposed matches? How does TMX help to keep consistency?

The translator can accept or reject proposed matches, edit fuzzy matches, or translate
from scratch. Confirmed segments are sent to the translation memory (TM), which can be
edited at any time. The interchangeable TMX format allows translators to share a TM to
maintain consistency in translation projects. Sharing is more effective when all translators
are linked to a central server, as the TUs are confirmed in real time. Language service providers
(LSPs) use CAT systems or translation management systems (TMS) for this purpose.

- What are the conditions for a successful TM?

A TM is successful when (1) the source text is consistent and error-free, (2) external
TMs are maintained, and (3) the supplied glossaries are reliable.

2.1.1 Segmentation

- What are the stages of segmentation? What are the segment boundaries? Are all segmentations the same?

The CAT tool splits a source document into segments so that the TM can store the TUs
systematically for recall.

- Stages of segmentation / Task of the CAT tool:

1. Import the document through 'Project preparation'.
2. Convert to Translatable Format: this is the segmentation process, which converts the source text
to a segmented bilingual CAT tool format.
• The segment boundaries are spaces or punctuation marks, which works well in word-based
languages such as English where a full stop signifies the end of a sentence.
• The segmentation is defined by rules specific to each source language and may vary between
different CAT tool programs.
• Segmentation may vary according to the language system.
• Segmentation rules can be edited in CAT tools to support match searches, but edited rules may
prevent recall.
• Segmentation has traditionally been defined by characters.
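
A minimal sketch of punctuation-based segmentation, assuming one hypothetical rule: break after a full stop, exclamation mark, or question mark when it is followed by whitespace and an uppercase letter. Real CAT tools ship much richer, language-specific rule sets, so the pattern below is an illustration and not any tool's actual rule.

```python
import re

# Hypothetical rule: sentence-final punctuation, then whitespace, then an uppercase letter.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def segment(text: str) -> list[str]:
    """Split a source text into sentence-like segments."""
    return [part.strip() for part in SENTENCE_BREAK.split(text) if part.strip()]


print(segment("Open the lid. Remove the filter! Is it clean?"))
print(segment("Dr. Smith arrived."))  # the abbreviation triggers an unwanted break
```

The second call shows why rules differ between languages and programs: without an exception list for abbreviations, "Dr." becomes a segment of its own.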

FIGURE 2.1 File preparation dialog in SDL Trados Studio


- What does segmentation cause, and how can the problem be solved?

- Pym (2008) and Dragsted (2005) argue that the translator only sees what is in the box without
realising syntactic or semantic relationships with the preceding or following units or sentence(s),
which leads to non-contextual translation.

- CAT tool manufacturers have done their best to address the perceived limitations of
segmentation:

• A preview feature has been introduced in many CAT tools to compensate for segmentation and
the suggested lack of context.
• It is possible for the translator to review and revise the exported target text outside the CAT
tool and then re-import the revised monolingual target text to update the TM.

- How does TM affect the Translation Quality?

Pym (2008) claims that TM software may help maintain terminological consistency, but it
requires too much management to bring about great productivity gains.

- Advantages of Segmentation:

1. Improves repetition rates


2. Guarantees consistency
3. Leads to cost reduction

- What should be given priority, productivity or quality?

The translator is the decision-maker regarding acceptance or modification of
segmentation rules. The translator may format or reformat editable source texts in MS Word.
Reducing special characters prior to importing the file will greatly reduce segmentation
and the number of format tags. The translator should prioritise quality over productivity.

2.1.2 The concordance and consistency

The concordance feature, a popular feature, allows users to look up specific words or word
sequences in the TMs.

- What is the purpose of the concordance?

• To show how a term or phrase is used in context.


• To help the translator retrieve a TU that is not shown in the Translation Results box, a ‘no
match’.

- Why does the concordance sometimes show 'no match'?

The segments may differ in their tags, an extra space, missing punctuation, or numbers. Seemingly
minor differences prevent matches.

- Is the CAT concordance bilingual? Why do translators rely on it, and what is it designed for?
Do adjectival endings cause problems?

• The CAT concordance is bilingual and shows source and target segments.
• Translators are known to rely on concordance to find term translations in the TM instead of
building a Tmdb.
• The bilingual concordance is not designed to achieve consistency but to focus on
terminology usage.
• Adjectival endings in the target language prevent matches, so you should highlight the stem
of the word in the concordance search to find results with and without the adjectival ending.
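
The effect of searching for the stem can be shown with a small sketch. The TM is modelled here as a plain list of (source, target) pairs and `concordance` is a hypothetical helper doing a case-insensitive substring search; the English-German units are invented for illustration.

```python
def concordance(tm, query, side="source"):
    """Return every TU whose source (or target) text contains the query."""
    index = 0 if side == "source" else 1
    needle = query.lower()
    return [unit for unit in tm if needle in unit[index].lower()]


# Invented sample TM: (source, target) pairs.
tm = [
    ("a functional test", "ein funktionaler Test"),
    ("the functional units", "die funktionalen Einheiten"),
    ("Press the button.", "Drücken Sie die Taste."),
]

print(len(concordance(tm, "funktionalen", side="target")))  # inflected form
print(len(concordance(tm, "funktional", side="target")))    # stem only
```

Searching for the inflected form finds one unit; searching for the stem finds both, with and without the adjectival ending.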

FIGURE 2.2 Dialog box opened in concordance search in SDL Trados Studio 2019

FIGURE 2.3 Concordance target results for the term ‘functional’


FIGURE 2.4 Search and find dialog in SDL Studio 2019

FIGURE 2.5 Search and Find function Ctrl+F

FIGURE 2.6 Concordance dialog in memoQ 9.1


2.1.3 The analysis feature and fees

- Translators run the analysis/statistics feature before they start:

• To check the word count
• To show the amount of repetition, which should speed up the job
• LSPs use this feature to cost the translation project

- LSPs expect discounts for repetitions, 90%-100% matches, and 70%-90% fuzzy
matches, but no discounts for fuzzy matches below 70%.
- The LSP creates a Purchase Order (PO), allowing the translator to accept or reject the
job. Some translators reject discounts because of the cost of CAT tools or the context-specific
checks that matches still require.
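
How such a discount scheme feeds into a quote can be sketched as follows. The bands come from the notes above, but the pay factors and the per-word rate are hypothetical figures chosen only to make the arithmetic visible; actual discounts are negotiated between LSP and translator.

```python
# Hypothetical share of the full per-word rate paid in each band.
PAY_FACTOR = {"repetition": 0.25, "high": 0.50, "fuzzy": 0.75, "new": 1.00}


def band(match_percent: int, is_repetition: bool = False) -> str:
    """Map an analysis result to one of the discount bands named above."""
    if is_repetition:
        return "repetition"
    if match_percent >= 90:
        return "high"    # 90%-100% matches
    if match_percent >= 70:
        return "fuzzy"   # 70%-90% fuzzy matches
    return "new"         # below 70%: no discount


def quote(segments, rate_per_word: float) -> float:
    """segments: iterable of (word_count, match_percent, is_repetition)."""
    total = 0.0
    for words, match_percent, is_repetition in segments:
        total += words * rate_per_word * PAY_FACTOR[band(match_percent, is_repetition)]
    return round(total, 2)


analysis = [(100, 0, False), (40, 95, False), (20, 75, False), (10, 100, True)]
print(quote(analysis, rate_per_word=0.10))
```

With these invented figures, 170 words are paid as 10.00 + 2.00 + 1.50 + 0.25 = 13.75 instead of the undiscounted 17.00.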

- The benefits of the analysis feature are:

• It is convenient for costing, both for translators and LSPs.
• It gives word and character counts and other statistical data, such as the number of new
words and repetitions in a file, before or after translation.

- Fees are generally based on:

• the source text, because these counts are available at the initial negotiation stage (they
vary according to the language)
• hourly rates
• page rates
• flat fees (minimum fees)

FIGURE 2.7 Analysis report (statistics tab in memoQ 9.1)


2.2 Metadata and subsegment matching

- How can metadata help the translator, and where is it saved?

• Metadata is data about data.
• It is crucial and can be added to the Tmdb during project setup.
• It includes information like date, name, domain, client, and deadline. These data are visible
in dialog boxes and can help identify multiple TMs and their associated clients. Having this
information is helpful when multiple TMs are open, as it allows translators to see which TM
each match is associated with.

- How can the Tmdb be used for metadata? What is the difference between TM and Tmdb searches?

• The Tmdb can be used to add additional metadata such as part of speech, gender, and
number, which can help with declensions and inflections.

- What does lowering the fuzzy match threshold cause? Which has better recall, the TM or the Tmdb?

• The Tmdb searches term pairs or phrases and allows for modification, extension, and making
plurals, while the TM searches for exact matches. Lowering the fuzzy match percentage in
the settings is not recommended as it can result in unhelpful suggestions from the TM. The Tmdb's
recall is superior to the TM's, as its terms are recalled even when they are part of longer
strings in segments.

- Why are software developers interested in subsegment matching? What is Matching Data by TAUS,
and what is a granular subsegment?

Software developers are improving subsegment matching in translation memory (TM)
through the Translation Automation User Society (TAUS), a Netherlands-based association of
major translation buyers and organizations interested in machine translation. TAUS offers
webinars and conferences and has developed a framework for testing and disseminating machine
translated data.

TAUS has introduced Matching Data, a product that addresses the issue of TUs being
locked into one domain by transforming parallel language data into unique corpora. Matching
Data uses granular subsegments, focusing on smaller units with more lexical or morphemic
detail. This approach allows TM manufacturers to search at subsegment level in the TU, resulting
in partial perfect matches without post-editing. In CAT tools, subsegment matching is known as
'Deepminer', 'Uplift', or 'Longest Substring Concordance'.

• CAT programs operate differently when their TMs reassemble or auto-assemble new and
longer strings, resulting in varying results.

- When does ‘Uplift’ work better, and how are developers remedying the issue?

- ‘Uplift’, which matches subsegments in the TM, works better if the TM is smaller and more
specific.

• CAT tool developers are working to remedy this issue by giving translators more control over
resources that can be used in addition to the TM.

- What was the first form of subsegmentation? Does ‘Deepminer’ use statistical data?

- ‘Deepminer’ was the first form of subsegmentation. The TM ‘mines’ for subsegments and uses
statistical data to analyse TM content.

- How does a CAT tool find subsegment pairs, and where are subsegment matches presented? What
does the quality of subsegment matching depend on? What is the difference between a random TM
and a focused TM?

• CAT tools perform concordance searches, finding subsegment pairs in source and target
strings. They present these subsegments in the Translation Results window, allowing
translators to select, insert, and translate the rest of the string. The quality of subsegment
matching depends on the quality and specificity of the TM.
• A random TM will give random matches, whereas a customised, domain-focused, well-
maintained TM will deliver higher quality matches and a lower edit distance.
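
The idea of a subsegment match can be illustrated by finding the longest run of words that a new source segment shares with a TM source segment. This is a simplified sketch using `difflib` from the Python standard library; it is not how 'Deepminer', 'Uplift', or 'Longest Substring Concordance' are implemented, and it covers only the source side (locating the corresponding target words needs additional alignment data).

```python
from difflib import SequenceMatcher


def longest_shared_subsegment(new_source: str, tm_source: str) -> str:
    """Return the longest run of consecutive words found in both segments."""
    new_words = new_source.split()
    tm_words = tm_source.split()
    matcher = SequenceMatcher(None, new_words, tm_words, autojunk=False)
    match = matcher.find_longest_match(0, len(new_words), 0, len(tm_words))
    return " ".join(new_words[match.a:match.a + match.size])


print(longest_shared_subsegment(
    "Replace the air filter every six months",
    "Clean the air filter with a dry cloth",
))
```

Neither segment would reach the fuzzy threshold as a whole, yet the shared subsegment "the air filter" can still be offered to the translator.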



2.3 Boosting the translation memory

Any new CAT tool arrives with an empty TM. Translators start to build their
own TM through constant use of the CAT tool. TMs in the transferable and exchangeable TMX
format can be imported when the TM needs a boost prior to translating.

- A repeated warning for users of public PCs in training centres regarding TM creation:

(1) Public PCs shut down without storing data. (2) They do not allow you to build a TM. (3) You
must remember to export TMs in the transferable and exchangeable TMX format at the end of a
session and store the files in personal folders.

- CAT tools offer features to enhance the TM, including: (1) importing monolingual or bilingual
reference files, (2) importing external TMX files, (3) aligning source and target texts, and (4)
importing reference materials. These tools also aid in TM edits, maintenance, and project
preparation, ensuring effective translation.

- How do we boost the TM? With the alignment features.

2.3.1 Alignment

- What is alignment?

The alignment feature in CAT tools turns previously translated documents and their
source texts into translation units (TUs) so that we can add them to a TM.

- CAT tools provide five types of alignment:

1. alignment with review


2. review in the translation editor on the fly
3. alignment of single files
4. alignment of multiple files
5. monolingual review
These tools allow translators to revise exported target files, update changes, and create
content with existing public translations or bilingual corpora. However, human revision is
needed each time a match presents itself to ensure good results.
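
A very reduced sketch of what alignment produces, assuming the source and target documents are already split into the same number of segments. The helper `align` and its `max_ratio` parameter are hypothetical; the length-ratio check only stands in for the human review mentioned above by flagging pairs that look suspicious.

```python
def align(source_segments, target_segments, max_ratio=2.0):
    """Pair segments by position and flag suspicious pairs for human review."""
    if len(source_segments) != len(target_segments):
        raise ValueError("Segment counts differ: align manually before adding to the TM.")
    units = []
    for source, target in zip(source_segments, target_segments):
        shorter = max(1, min(len(source), len(target)))
        longer = max(len(source), len(target))
        needs_review = longer / shorter > max_ratio
        units.append((source, target, needs_review))
    return units


units = align(
    ["Open the lid.", "Stop."],
    ["Öffnen Sie den Deckel.", "Halten Sie das Gerät sofort an und ziehen Sie den Stecker."],
)
for source, target, needs_review in units:
    print(needs_review, source, "|", target)
```

The second pair is flagged because the target is far longer than its source, which often means the two texts have drifted out of step.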

2.3.2 Translation memory and reference files

- What does 'reference material' refer to? What are the components of the packages?

LSPs use the term 'reference material' to refer to additional files for translators. They send
packages, a CAT tool term for ZIP archives that are created in the CAT tool, containing project
files, TM, Tmdb, analysis reports, and reference files. These packages are imported and opened
in the translator's CAT tool and are interchangeable between programs.

FIGURE 2.8 Alignment with join-up lines between source and target segments

- Reference material in CAT tools can be:

• Contextual, including extra-linguistic information to explain the topic, images of products,
applications, packaging, or illustrations to help understand the source text.
• Linguistic, including resources with relevant terminology, glossaries (mono or bilingual), or
previously translated TTs.

- What are the formats of reference materials?

These materials can now be imported and processed in suitable formats such as MS Word
without performing an alignment. Keywords in context are presented in the Translation Results
window, with colors and icons used to differentiate between references and translation matches.
Reference files can be included in packages, sent via email, or made available online. Programs
import these materials as separate features after or during import.

- The benefit of importing reference files in your CAT tool is:

1. The TM does the searching and presents relevant matches, explanations, and definitions
on the fly.
2. It saves the translator from having to scan or highlight relevant phrases in reference files
before or during translation.

2.3.3 TMX File

A TMX file is a TM that has been exported in an interchangeable format, so that it can be
imported for use in other CAT tools or shared between colleagues. Shared TMX files are best
used in Read Only format.

- Numerous open-source TMX files can be found online and some can be downloaded, but be
aware that they should only be used as Read Only files.

- Boosting an empty TM can be tempting, but it's best to create, edit, and manage small,
customized TMs yourself to avoid high expectations and time investment.

- If a proposed match is accepted by the user, it will then be entered in their own TM, contrary to
importing an entire TMX without checking the content.
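
TMX is an XML format, so a small TM can be written and read back with the Python standard library. The sketch below follows the general TMX 1.4 layout (a header, then a body of `tu` elements with one `tuv` per language); the header values, the language codes, and the helper names `build_tmx` and `read_tmx` are placeholders chosen for illustration, not values any particular CAT tool requires.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang


def build_tmx(units, src_lang="en", tgt_lang="de"):
    """Turn (source, target) pairs into a TMX document string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # placeholder values
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in units:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


def read_tmx(tmx_text):
    """Read the TUs back as tuples of segment texts."""
    root = ET.fromstring(tmx_text)
    return [tuple(tuv.findtext("seg") for tuv in tu.findall("tuv"))
            for tu in root.iter("tu")]


document = build_tmx([("Hello", "Hallo"), ("Open the lid.", "Öffnen Sie den Deckel.")])
print(read_tmx(document))
```

Reading a downloaded TMX this way, before importing it, is one way to check its content instead of importing the whole file unseen.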

2.4 Formats

CAT tool manufacturers are constantly updating their translation tools to keep up with
the changing and interchangeable formats of source text files. These tools enable the translation
of files that were previously inaccessible or unreadable on translators' desktops. However, some
file formats may have formatting instructions hidden in tags, which must be observed in the
target segments. Tags affect word count and string identification but are considered inconvenient
and take extra time to insert. Some formats, such as Excel spreadsheets, web-based HTML files,
PDF files, and PowerPoint files, are easier to manage within the CAT tool due to its
standardization of all formats in the translation editor. However, their formatting must be
maintained through format tags, which must be inserted in the target segments by the translator.

MS Excel
- How does an MS Excel file work more conveniently in a CAT tool?

- The translation process of an MS Excel file format is more convenient in a CAT tool than in
Excel:

• the text does not hide behind other cells
• it can be spellchecked
• the word count feature operates

- What are Excel spreadsheets popular for?

- Excel spreadsheets are popular for managing short or fragmented source texts, such as
instructions or specifications, that need translating into multiple languages.

- CAT tools import and convert Excel format to a clear bilingual editor interface, exporting the
translation in its original format.

- One tool (DVX3) excludes red text in spreadsheets, ensuring the target text is exported in the
required column.

HTML

- HTML (HyperText Markup Language) files contain tags that define text structure and layout on
the web.

- Can the CAT tool read hidden instructions?

- The CAT tool reads hidden instructions and presents them with its own set of tags.
- Translators complain about tags, as they require extreme accuracy and cost extra time.

- CAT tool manufacturers are working to simplify and automate tag insertion, as it is essential
for posting HTML target files on the web.
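
Because every tag in the source must reappear in the target, a simple check can list tags that were lost during translation. This is a sketch of the idea only: `missing_tags` is a hypothetical helper that works on raw HTML tags, whereas CAT tools show their own placeholder tags and run their own QA checks.

```python
import re
from collections import Counter

TAG = re.compile(r"</?[A-Za-z][^>]*>")


def missing_tags(source: str, target: str) -> list[str]:
    """List the tags that occur in the source but are absent from the target."""
    difference = Counter(TAG.findall(source)) - Counter(TAG.findall(target))
    return sorted(difference.elements())


print(missing_tags("Click <b>Save</b>.", "Klicken Sie auf <b>Speichern.</b>"))
print(missing_tags("Click <b>Save</b>.", "Klicken Sie auf <b>Speichern."))
```

The first target keeps both tags, so nothing is reported; the second has lost the closing tag, which would break the formatting of the exported HTML file.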

XLIFF and TMX

- What types of files are compatible with CAT tool programs?

- XLIFF (text files), TMX (TM databases), and TBX (terminology database) are exchange
formats, which means they are interchangeable between desktops/ laptops and compatible with a
variety of CAT programs.

- They are useful for collaboration between translators who are not working on a server or in the
cloud.

- Are all CAT tool programs the same?

- Translators should be aware that different CAT programs have different methods for
segmenting and storing information, which can cause differences in expected matches and
segmentation processes.

- How can we make sure that a CAT tool can handle a given format?

XLIFF file exchanges can cause data loss due to the variety of file formats and programs.
To ensure that the CAT tool can handle imported XLIFF files, we can apply the 'pseudo-translation'
feature in the CAT tool. It translates a source text into a pseudo target language to test whether
the XLIFF file can be handled and processed. The pseudo-translation feature is useful because it
generates a timely warning if the CAT tool cannot process the format.
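
What a pseudo target language might look like can be sketched as follows. The accent mapping, the square brackets, and the helper name `pseudo_translate` are all assumptions made for illustration; each CAT tool generates its pseudo-translation in its own way. The point is that the text visibly changes while the tags stay intact, so the exported file shows whether the format survives a round trip.

```python
import re

ACCENTS = str.maketrans("aeiouAEIOU", "àéîõüÀÉÎÕÜ")
TAG = re.compile(r"(<[^>]+>)")


def pseudo_translate(segment: str) -> str:
    """Replace vowels with accented letters, leave tags untouched, add brackets."""
    parts = TAG.split(segment)  # the capturing group keeps the tags in the list
    converted = [part if TAG.fullmatch(part) else part.translate(ACCENTS)
                 for part in parts]
    return "[" + "".join(converted) + "]"


print(pseudo_translate("Click <b>Save</b> now"))
```

The brackets make truncated segments easy to spot in the exported file, and the accented letters reveal encoding problems before any real translation work is done.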

FIGURE 2.9 Pseudo translated PDF file - (Adobe Acrobat Reader could not open ‘xxx.pdf’ because it is either not a supported
file type or because the file has been damaged. For example, it was sent as an email attachment and was not decoded correctly)

PDF

- Does the PDF format work in CAT tool programs?

- PDF (Portable Document Format) is a widely used file format, often used for converting MS Word
files to uneditable files.

- It can be problematic for translators as it may change sentence sequencing and format.
- The pseudo-translation feature is helpful to see if the file can be exported and what the target
file will look like prior to attempting a translation.

- What does an image-based file need?

- PDF files with text-based content can be searched and copied, while image-based files require
an Optical Character Recognition (OCR) program for conversion and editing.

- The clearer the text, the better the conversion, but a PDF file can be a source of despair for
the translator and the CAT tool.

- If a translator receives many PDF files, a high-quality OCR program can help retain format
and sentence structure.

- Can CAT tools process PowerPoint files?

- CAT tools can process common formats like PowerPoint, providing good formatting results.

2.5 Other functions and features on the ribbon


- Do MS Word and CAT tools have similar ribbons? What are filters and regex used for?

• MS Word and CAT tools have similar ribbons that operate in similar ways.
• They have tabs, icons, and drop-down boxes which vary slightly between CAT tools.
• A minimalist approach to using TM, i.e. how to make it work with the least effort, may be
efficient in the short term but not satisfying.
• Each translator's requirements vary depending on text genres, domains, and language pairs,
and customising the program is necessary.
• Filters and regex can be used to personalise the TM.

FIGURE 2.10 Ribbon in MS Word

FIGURE 2.11 Ribbon in CAT tool

2.5.1 Filters

- What is setting up filters a part of?

Setting up filters is part of the 'maintenance work' for CAT programs. By re-indexing the
database from the edit tab in the TM, we can remove duplicates.

- Can filters be used to search for a specific source/target term? How can we benefit from
‘history’? Can we set or add filters?

- The TM editor has many filters and deserves our attention:

• Edit filters can be used to search for a specific source/target term or phrase in the TM.
• We can check modification dates in ‘history’ to recall previous versions for comparison.
• We can set or add filters.
• We can set a filter to capitalise certain words in our languages or we can tell it not to
capitalise.

The TM editor offers various functions to improve matches and prevent false positives,
especially when using multiple TMs simultaneously in a project to recall as many matches as
possible.

2.5.2 Regex

The CAT tool uses Regex to apply transfer rules for various formats, including date and
time, currencies, metrics, numbers, and email addresses, allowing the TM to recognize and
match digits in source and target segments.
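
As an illustration of such a transfer rule, the sketch below rewrites a date from a US-style month/day/year pattern to a day.month.year pattern. The pattern and the target format are assumptions chosen for the example; CAT tools define these rules in their own settings and syntax.

```python
import re

# Hypothetical transfer rule: MM/DD/YYYY in the source, DD.MM.YYYY in the target.
US_DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def localise_dates(segment: str) -> str:
    """Swap month and day and change the separator, leaving other text alone."""
    return US_DATE.sub(r"\2.\1.\3", segment)


print(localise_dates("Delivery is due on 03/25/2024."))
```

Because the rule recognises the digits as a pattern, two segments that differ only in their date can still be treated as a match, with the date adapted automatically.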

- What does regex stand for?

- Regex, which stands for ‘regular expressions’, is a pattern notation with roots in mathematical
theory, on which pattern matching is based in MT and CAT.
- Why is regex important to us? If we recognise and understand pattern matching, we can change
segmentation rules and improve our error checking in the CAT tool’s QA (quality assurance)
functions.

❖ The following is an example where the TM will not recognise the sentences as identical
because they lack automatic numbering:

❖ Matching is improved if we change the segmentation as follows:

❖ If you go to segmentation rules in the CAT tool you can add the following regex:
It means: look for all segments that start with any lowercase letter or uppercase letter or any
number between 0 and 9, which repeats itself one or more times, is preceded or not by a left
parenthesis, is followed by a right parenthesis, then by a space character or a tab character, which
repeats zero or more times. To change this rule we then add #!#, which means apply a segment
break here.
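
The rule described in words above can be written as a regular expression and tried out in Python. The pattern below is a reconstruction from that description, not a copy of the original example, and the `#!#` break marker belongs to the CAT tool's own rule syntax, so the hypothetical helper `split_label` simply imitates the segment break by returning two parts.

```python
import re

# Optional "(", one or more letters or digits, ")", then zero or more spaces or tabs.
LIST_LABEL = re.compile(r"^\(?[a-zA-Z0-9]+\)[ \t]*")


def split_label(segment: str) -> list[str]:
    """Separate a list label such as "(a)" or "2)" from the sentence that follows."""
    found = LIST_LABEL.match(segment)
    if not found:
        return [segment]
    return [found.group(0).rstrip(), segment[found.end():]]


print(split_label("(a) Switch off the power."))
print(split_label("2)\tSwitch off the power."))
print(split_label("Switch off the power."))
```

Once the label sits in a segment of its own, the sentence "Switch off the power." is identical in all three cases and the TM can return it as a full match.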

❖ Regex codes consist of metacharacters, which are standard. Here are some examples:

❖ If you want to check for inconsistencies, and you realize/realise that you may have confused
GB and US spelling, you can click on Find and Search for one spelling and then again for the
other spelling.

- What do MT and TM systems rely on to match TUs?

- CAT tools provide instructions on creating a regex list, which prevents the TM from producing
errors or false positives. Both MT and TM systems rely on regexes for matching TUs, and CAT
tools and MT systems are converging on the same modus operandi.
