Translation and Technology Chapter 2 Summary
Key concepts
• The translation memory database can be filled to suit our individual needs.
• The translation memory database accepts several methods to boost its content.
• The translation memory database can be customised to suit our language pairs.
• The translation memory database includes features to improve translation quality and
productivity.
• The translation memory database can affect profit margins.
Introduction
A translation memory (TM) is a database of translation units (TUs) that we have entered
over time. It is a valuable resource that needs intelligent management, regular edits, and good
maintenance to stay customised and up to date. It is the primary database in the CAT tool.
2.1 Creating a translation memory database
The Translation Results window displays target results when a translation memory (TM)
segment matches an identical or almost identical source segment. The TM (1) compares the
content in the source segment against TM segments and (2) presents the translation units (TUs)
found. The matches fall into the following categories:
• 100% or perfect match: the content of a translation memory segment matches the document
segment exactly.
• 101% match or context match: the new segment and the translation memory segment match
precisely, including tags, numbers, punctuation, etc.
• Fuzzy match (less than 100%): the default fuzzy match threshold is generally 70%, because
there is insufficient usable content below this level.
• No match: no match is found (translate from scratch).
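The match categories above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity score comes from the standard library's `difflib.SequenceMatcher`, which is not the metric any particular CAT tool uses, and the `same_context` flag is a hypothetical stand-in for whatever extra check a tool applies before it reports a 101% match.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 70  # default threshold named in the text, in percent


def similarity(a: str, b: str) -> int:
    """Character-level similarity in percent (illustrative metric only)."""
    return int(SequenceMatcher(None, a, b, autojunk=False).ratio() * 100)


def classify_match(source: str, tm_source: str, same_context: bool = False) -> str:
    """Sort a TM hit into one of the categories listed above."""
    if source == tm_source:
        # same_context is an assumed flag, not a real CAT tool setting
        return "context match (101%)" if same_context else "perfect match (100%)"
    score = similarity(source, tm_source)
    if score >= FUZZY_THRESHOLD:
        return f"fuzzy match ({score}%)"
    return "no match"


print(classify_match("Press the red button.", "Press the red button."))
print(classify_match("Press the red button.", "Press the green button."))
print(classify_match("Press the red button.", "Wait."))
```

A segment that falls below the threshold is treated as "no match", which mirrors the point that there is too little usable content below 70%.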
The translator can accept or reject proposed matches, edit fuzzy matches, or translate
from scratch. Confirmed segments are sent to the translation memory (TM), which can be
edited at any time. The interchangeable TMX format allows translators to share a TM to
maintain consistency in translation projects. This method is more effective when all translators
are linked to a central server, as the TUs are confirmed in real time. LSPs use CAT systems or
translation management systems (TMS) for this purpose.
A TM works successfully on three conditions: (1) the source text is consistent and error-free,
(2) external TMs are maintained, and (3) the supplied glossaries are reliable.
2.1.1 Segmentation
The CAT tool splits a source document into segments so that the TM can store the TUs
systematically for recall.
- Stages of segmentation / Task of the CAT tool:
1. Import the document through the 'Project preparation'.
2. Convert to Translatable Format: this step is the segmentation process, which converts the
source text to a segmented bilingual CAT tool format.
• The segment boundaries are spaces or punctuation marks, which work well in word-based
languages such as English, where a full stop signifies the end of a sentence.
• The segmentation is defined by rules specific to each source language and may vary between
different CAT tool programs.
• Segmentation may vary according to the language system.
• Segmentation rules can be edited in the CAT tool to support match searches, but the changes
may prevent recall.
• Segmentation has been traditionally defined by characters.
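A punctuation-based segmentation rule of the kind described above might look like the following sketch. The rule itself is an assumption made for illustration (a segment ends at a full stop, exclamation mark, or question mark followed by whitespace); real CAT tools ship their own language-specific rule sets.

```python
import re

# Assumed rule: break after '.', '!' or '?' when whitespace follows.
SEGMENT_BREAK = re.compile(r"(?<=[.!?])\s+")


def segment(text: str) -> list[str]:
    """Split a source text into segments at sentence-final punctuation."""
    return [s for s in SEGMENT_BREAK.split(text.strip()) if s]


print(segment("Open the file. Save it! Done?"))
# An abbreviation shows why one simple rule is not enough:
print(segment("See p. 5 for details."))
```

The second call splits after "p.", which is the kind of false boundary that leads tools to add exception rules per language.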
- CAT tool manufacturers have done their best to address the perceived limitations of
segmentation:
• A preview feature has been introduced in many CAT tools to compensate for segmentation and
the perceived lack of context.
• It is possible for the translator to review and revise the exported target text outside the CAT
tool and then re-import the revised monolingual target text to update the TM.
Pym (2008) claims that TM software may help maintain terminological consistency, but it
requires too much management to bring about great productivity gains.
- Advantages of Segmentation:
The popular concordance feature allows users to look up specific words or word sequences in
the TMs.
Seemingly minor differences, such as a different set of tags, an extra space, missing
punctuation, or different numbers, prevent matches.
• The CAT concordance is bilingual and shows source and target segments.
• Translators are known to rely on concordance to find term translations in the TM instead of
building a Tmdb.
• The bilingual concordance is not designed to achieve consistency but to focus on
terminology usage.
• Adjectival endings in the target language prevent matches, so you should highlight the stem
of the word in the concordance search to find results with and without the adjectival ending.
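The advice to search on the stem can be illustrated with a small bilingual concordance lookup. This is a sketch, not how any CAT tool implements its concordance: the TM is modelled as a plain list of (source, target) pairs, and the German sample units are invented for the example.

```python
def concordance(tm, query):
    """Return every TU whose source or target contains the query string."""
    q = query.lower()
    return [(src, tgt) for src, tgt in tm if q in src.lower() or q in tgt.lower()]


# Hypothetical TM content with three different adjectival endings
tm = [
    ("technical data", "technische Daten"),
    ("in the technical manual", "im technischen Handbuch"),
    ("technical support", "technischer Support"),
    ("user manual", "Benutzerhandbuch"),
]

print(len(concordance(tm, "technischen")))  # full form with one ending
print(len(concordance(tm, "technisch")))    # stem only
```

Searching for the full form finds only the unit with that exact ending, while the stem finds all three inflected forms.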
FIGURE 2.2 Dialog box opened in concordance search in SDL Trados Studio 2019
- LSPs expect discounts for repetitions, 90%-100% matches, and 70%-90% fuzzy matches, but
no discounts for fuzzy matches below 70%.
- The LSP creates a Purchase Order (PO), allowing the translator to accept or reject the
job. Some translators reject discounts because CAT tools are expensive and matches still
need context-specific checks.
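The effect of match-based discounts on a job can be worked through as a weighted word count. The pay fractions below are hypothetical figures chosen only to make the arithmetic visible; the text does not state any rates, and real rates are negotiated per job.

```python
# Hypothetical pay fractions per match band (assumptions, not from the text)
PAY_FRACTION = {
    "repetition": 0.25,
    "90-100": 0.5,
    "70-90": 0.75,
    "below-70": 1.0,  # no discount below the 70% threshold
}


def weighted_word_count(analysis):
    """Convert per-band word counts into the number of words paid in full."""
    return sum(PAY_FRACTION[band] * words for band, words in analysis.items())


# Invented analysis report: words per match band in one file
analysis = {"repetition": 200, "90-100": 300, "70-90": 400, "below-70": 1000}
print(weighted_word_count(analysis))
```

With these assumed fractions, a file of 1,900 words is paid as 1,500 full-rate words.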
The benefits of the analysis feature are:
• It is convenient for costing, both for translators and LSPs.
• It gives word and character counts and other statistical data, such as the number of new
words and repetitions in a file, before or after translation.
Fees are generally based on:
• the source text, because these counts are available at the initial negotiation stage (they
vary according to the language)
• hourly rates
• page rates
• flat fees (minimum fees)
• CAT programs operate differently when their TMs reassemble or auto-assemble new and
longer strings, resulting in varying results.
- ‘Uplift’, which matches subsegments in the TM, works better if the TM is smaller and more
specific.
• CAT tool developers are working to remedy this issue by giving translators more control over
resources that can be used in addition to the TM.
- ‘Deepminer’ was the first form of subsegmentation. The TM ‘mines’ for subsegments and uses
statistical data to analyse TM content.
• CAT tools perform concordance searches, finding subsegment pairs in source and target
strings. They present these subsegments in the Translation Results window, allowing
translators to select, insert, and translate the rest of the string. The quality of subsegment
matching depends on the quality and specificity of the TM.
• A random TM will give random matches, whereas a customised, domain-focused, well-
maintained TM will deliver higher quality matches and a lower edit distance.
Any new CAT tool arrives with an empty TM. Translators start to build their own TM
through constant use of the CAT tool. TMs can be exchanged in the transferable TMX format,
which makes it possible to boost an empty TM prior to translating.
- A repeated warning for users of public PCs in training centres regarding TM creation:
(1) public PCs shut down without storing data. (2) They do not allow you to build a TM. (3) You
must remember to export TMs in the transferable and exchangeable TMX format at the end of a
session and store the files in personal folders.
- CAT tools offer features to enhance the TM, including: (1) importing monolingual or bilingual
reference files, (2) importing external TMX files, (3) aligning source and target texts, and (4)
importing reference materials. These tools also aid in TM edits, maintenance, and project
preparation, ensuring effective translation.
The alignment feature in CAT tools turns previously translated documents and their
source texts into translation units (TUs) so that we can add them to a TM.
LSPs use the term 'reference material' to refer to additional files for translators. They send
packages, a CAT tool term for ZIP archives that are created in the CAT tool, containing project
files, TM, Tmdb, analysis reports, and reference files. These packages are imported and opened
in the translator's CAT tool and are interchangeable between programs.
FIGURE 2.8 Alignment with join-up lines between source and target segments
These materials can now be imported and processed in suitable formats such as MS Word
without performing an alignment. Keywords in context are presented in the Translation Results
window, with colors and icons used to differentiate between references and translation matches.
Reference files can be included in packages, sent via email, or made available online. Programs
import these materials as separate features after or during import.
1. The TM does the searching and presents relevant matches, explanations, and definitions
on the fly.
2. It saves the translator from having to scan or highlight relevant phrases in reference files
before or during translation.
A TMX file is a TM that can be exported in an interchangeable format, which can then be
imported for use in other CAT tools or shared between colleagues. Shared TMX files are best
used in Read Only format.
- Numerous open-source TMX files can be found online. Some of them can be downloaded, but
be aware that they should only be used as Read Only files.
- Boosting an empty TM can be tempting, but it's best to create, edit, and manage small,
customized TMs yourself to avoid high expectations and time investment.
- If a proposed match is accepted by the user, it will then be entered in their own TM, contrary to
importing an entire TMX without checking the content.
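Because TMX is an XML-based exchange format, the content of a TMX file can be inspected before it is imported wholesale. The sketch below reads translation units with Python's standard library; the sample file is invented and kept minimal, and the header values are placeholders rather than the output of a real CAT tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample with one translation unit (tu) and two language variants (tuv)
SAMPLE_TMX = """
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx(xml_text):
    """Return each translation unit as a {language: segment text} dict."""
    root = ET.fromstring(xml_text)
    return [
        {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        for tu in root.iter("tu")
    ]


for unit in read_tmx(SAMPLE_TMX):
    print(unit)
```

Listing the units this way is one means of checking content before accepting it into a personal TM, in line with the advice to treat downloaded TMX files as Read Only.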
2.4 Formats
CAT tool manufacturers are constantly updating their translation tools to keep up with
the changing and interchangeable formats of source text files. These tools enable the translation
of files that were previously inaccessible or unreadable on translators' desktops. However, some
file formats may have formatting instructions hidden in tags, which must be observed in the
target segments. Tags affect word count and string identification but are considered inconvenient
and take extra time to insert. Some formats, such as Excel spreadsheets, web-based HTML files,
PDF files, and PowerPoint files, are easier to manage within the CAT tool due to its
standardization of all formats in the translation editor. However, their formatting must be
maintained through format tags, which must be inserted in the target segments by the translator.
MS Excel
- The translation process of an MS Excel file format is more convenient in a CAT tool than in
Excel:
- CAT tools import and convert Excel format to a clear bilingual editor interface, exporting the
translation in its original format.
- One tool (DVX3) excludes red text in spreadsheets, ensuring the target text is exported in the
required column.
HTML
- HTML (HyperText Markup Language) files contain tags that define text structure and layout on
the web.
- The CAT tool reads hidden instructions and presents them with its own set of tags.
- Translators complain about tags, as they require extreme accuracy and cost extra time.
- CAT tool manufacturers are working to simplify and automate tag insertion, as it is essential
for posting HTML target files on the web.
XLIFF and TMX
- XLIFF (text files), TMX (TM databases), and TBX (terminology databases) are exchange
formats, which means they are interchangeable between desktops/laptops and compatible with a
variety of CAT programs.
- They are useful for collaboration between translators who are not working on a server or in the
cloud.
- Translators should be aware that different CAT programs have different methods for
segmenting and storing information, which can cause differences in expected matches and
segmentation processes.
XLIFF file exchanges can cause data loss due to the variety of file formats and programs.
To ensure the CAT tool can handle imported XLIFF files, we can apply the 'pseudo-translation'
feature in the CAT tool. It translates a source text into a pseudo target language to test if the
XLIFF file can be handled and processed. The pseudo-translation feature is useful because it
generates a timely warning that the CAT tool cannot process the format.
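What a pseudo target text looks like can be shown with a short sketch. The substitution scheme here is an assumption made for illustration (vowels replaced by accented characters, the segment wrapped in brackets); each CAT tool has its own pseudo-translation settings.

```python
# Assumed substitution table: plain vowels become accented look-alikes
PSEUDO_TABLE = str.maketrans("aeiouAEIOU", "àéîõüÀÉÎÕÜ")


def pseudo_translate(text: str) -> str:
    """Return a fake 'translation' that is readable but visibly altered."""
    return "[" + text.translate(PSEUDO_TABLE) + "]"


print(pseudo_translate("Open file"))
```

If the brackets and accented characters survive the export to the original format, the file is likely to survive a real translation as well; if they do not, the warning arrives before any translation time has been spent.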
FIGURE 2.9 Pseudo translated PDF file - (Adobe Acrobat Reader could not open ‘xxx.pdf’ because it is either not a supported
file type or because the file has been damaged. For example, it was sent as an email attachment and was not decoded correctly)
- PDF (Portable Document Format) is a widely used file format for converting MS Word files to
uneditable image files.
- It can be problematic for translators as it may change sentence sequencing and format.
- The pseudo-translation feature is helpful to see if the file can be exported and what the target
file will look like prior to attempting a translation.
- PDF files with text-based content can be searched and copied, while image-based files require
an Optical Character Recognition (OCR) program for conversion and editing.
- The clearer the text, the better the conversion, but a PDF file can still be a source of despair
for the translator and the CAT tool.
- If a translator receives many PDF files, a high-quality OCR program can help retain format
and sentence structure. CAT tools can process common formats such as PowerPoint files, with
good formatting results.
Setting up filters is part of the ‘maintenance work’ for CAT programs. In the TM's edit
tab, we can re-index the database to remove duplicates.
• We can check modification dates in ‘history’ to recall previous versions for comparison.
• We can set or add filters.
• We can set a filter to capitalise certain words in our languages or we can tell it not to
capitalise.
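Removing duplicates, one of the maintenance tasks mentioned above, can be sketched as follows. This is a simplified model that treats the TM as a list of (source, target) pairs and counts two units as duplicates only when both sides are identical after trimming whitespace; real TM editors may apply other criteria.

```python
def remove_duplicates(units):
    """Keep the first occurrence of each (source, target) pair, in order."""
    seen = set()
    unique = []
    for src, tgt in units:
        key = (src.strip(), tgt.strip())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


# Invented sample TM with one exact duplicate
units = [
    ("Save the file.", "Enregistrez le fichier."),
    ("Close the window.", "Fermez la fenêtre."),
    ("Save the file.", "Enregistrez le fichier."),
]
print(remove_duplicates(units))
```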
The TM editor offers various functions to improve matches and prevent false positives,
especially when using multiple TMs simultaneously in a project to recall as many matches as
possible.
2.5.2 Regex
The CAT tool uses Regex to apply transfer rules for various formats, including date and
time, currencies, metrics, numbers, and email addresses, allowing the TM to recognize and
match digits in source and target segments.
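The idea of matching digits in source and target segments can be illustrated with a small regex check. This is a sketch of the principle only: it compares runs of digits and ignores locale differences such as decimal commas, which a real transfer rule would have to handle.

```python
import re

NUMBER = re.compile(r"\d+")  # one or more consecutive digits


def numbers_match(source: str, target: str) -> bool:
    """True when source and target contain the same digit groups."""
    return sorted(NUMBER.findall(source)) == sorted(NUMBER.findall(target))


print(numbers_match("Add 250 ml in 2 steps.", "Ajoutez 250 ml en 2 étapes."))
print(numbers_match("Add 250 ml.", "Ajoutez 205 ml."))
```

The second pair fails the check because the digits were transposed in the target, which is the kind of error a QA function is meant to flag.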
- Regex, which stands for ‘regular expressions’, is a mathematical theory on which pattern
matching is based in MT and CAT.
- Why is regex important to us? If we recognise and understand pattern matching, we can change
segmentation rules and improve our error checking in the CAT tool’s QA (quality assurance)
functions.
❖ The following is an example where the TM will not recognise the sentences as identical
because they lack automatic numbering:
❖ If you go to segmentation rules in the CAT tool you can add the following regex:
It means: look for all segments that start with any lowercase letter or uppercase letter or any
number between 0 and 9, which repeats itself one or more times, is preceded or not by a left
parenthesis, is followed by a right parenthesis, then by a space character or a tab character, which
repeats zero or more times. To change this rule we then add #!#, which means apply a segment
break here.
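The pattern described in words above can be reconstructed as a regex and tried out in Python. The expression below is a reconstruction from the prose description, not the rule quoted in the original, and the `#!#` break marker is specific to the CAT tool, so it is left out here.

```python
import re

# Optional '(', one or more letters or digits, ')', then zero or more spaces/tabs
LIST_MARKER = re.compile(r"^\(?[a-zA-Z0-9]+\)[ \t]*")


def split_marker(segment: str):
    """Separate a leading list marker such as 'a) ' or '(12)' from the text."""
    m = LIST_MARKER.match(segment)
    if m is None:
        return "", segment
    return m.group(0), segment[m.end():]


print(split_marker("a) Open the valve."))
print(split_marker("(12)\tOpen the valve."))
print(split_marker("Open the valve."))
```

Once the marker is split off, all three lines carry the same sentence, so the TM can treat them as identical.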
❖ Regex codes consist of metacharacters, which are standard. Here are some examples:
❖ If you want to check for inconsistencies, and you realize/realise that you may have confused
GB and US spelling, you can click on Find and Search for one spelling and then again for the
other spelling.
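A single regex can cover both spellings in one pass instead of two separate searches. The sketch below counts the -ise and -ize variants of one word stem across target segments; the function name and the sample segments are invented for the example.

```python
import re
from collections import Counter


def spelling_variants(segments, stem="reali"):
    """Count 's' (GB) and 'z' (US) spellings of stem + ise/ize (+ s or d)."""
    pattern = re.compile(rf"\b{re.escape(stem)}([sz])e[sd]?\b", re.IGNORECASE)
    counts = Counter()
    for seg in segments:
        for m in pattern.finditer(seg):
            counts[m.group(1).lower()] += 1
    return counts


segments = ["We realise the risk.", "They realized it late.", "Realise it now."]
print(spelling_variants(segments))
```

A result with both keys present signals mixed GB and US spelling in the same text.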
- CAT tools provide instructions on creating a regex list, which prevents TM from producing
errors or false positives. Both MT and TM systems rely on regexes for matching TUs, and CAT
tools and MT systems are merging based on the same modus operandi.