Full name: | SZTAKI Dictionary's major rewrite |
Start date: | 2010. 01. 03. |
End date: | 2011. 01. 03. |
Project homepage: | szotar.sztaki.hu |
SZTAKI Dictionary (SZTAKI Szótár) is the most popular Hungarian dictionary service, with nearly 100,000 visitors a day and dictionaries in 7 languages. It has been in operation since 1995 and today serves as an infrastructural service for foreign language learners and for anyone who needs to translate texts from or into Hungarian.
"SZTAKI Szótár 4.0" is the fourth generation of the dictionary service, with a brand new visual layout, a new architecture, and enriched dictionary content and services.
SZTAKI Dictionary in numbers
SZTAKI Dictionary provides comprehensive dictionaries in 7 languages with the following content:
Dictionary | Entry count | Meanings | Sub meanings | Translations | POS | Thematic qualification | Geo qualification | Stylistic qualification | Examples | Expressions |
English-Hungarian | 92606 | 124253 | 12600 | 209046 | 92606 | 13203 | 19211 | 58305 | 738 | 1130 |
Hungarian-English | 112485 | 209046 | 0 | 209046 | 112485 | 21459 | 4677 | 15323 | 1008 | 637 |
German-Hungarian | 34859 | 49629 | 2968 | 72586 | 34859 | 6716 | 1542 | 25396 | 0 | 885 |
Hungarian-German | 39886 | 72586 | 0 | 72586 | 39886 | 7866 | 408 | 8897 | 0 | 462 |
French-Hungarian | 22701 | 42369 | 10231 | 61787 | 22701 | 10337 | 1268 | 24978 | 0 | 0 |
Hungarian-French | 30223 | 61787 | 0 | 61787 | 30223 | 11394 | 398 | 11287 | 0 | 0 |
Italian-Hungarian | 102566 | 231395 | 0 | 263542 | 102566 | 12469 | 3604 | 44025 | 0 | 0 |
Hungarian-Italian | 146233 | 263542 | 0 | 263542 | 146233 | 13374 | 6580 | 71771 | 0 | 0 |
Polish-Hungarian | 16544 | 19556 | 0 | 19556 | 16544 | 0 | 0 | 293 | 0 | 0 |
Hungarian-Polish | 15581 | 19556 | 0 | 19556 | 15581 | 0 | 0 | 279 | 0 | 0 |
Bulgarian-Hungarian | 51655 | 74412 | 7241 | 114984 | 51655 | 0 | 0 | 0 | 22272 | 0 |
Hungarian-Bulgarian | 71658 | 114984 | 0 | 114984 | 71658 | 0 | 0 | 0 | 9306 | 0 |
Overall, SZTAKI Dictionary provides roughly:
- 740,000 dictionary entries
- 1,200,000 meanings
- 1,500,000 translations
Visual layout
SZTAKI Dictionary's modern, responsive HTML5 layout makes it easy to look up words on any device, be it a desktop browser, a tablet or a phone. All services, including the talking dictionary functionality, work across all devices.
The user interface focuses on the main function of the service: looking up words. The central user interface element is therefore the input field for entering words, and all other elements are built around it. The search experience is enhanced with an autocomplete feature, which shows the basic meanings of words as you type. In many cases this means you don't even have to press "Search" to find the meaning of a word; the basic meaning appears in a dropdown below the input field. This makes word lookup fast and convenient, and makes the work of SZTAKI Dictionary users more effective.
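The autocomplete behaviour described above can be illustrated with a minimal sketch. It assumes a sorted in-memory list of headwords with their basic meanings and uses binary search to find the first candidate for a typed prefix. The sample entries and the function name `suggest` are hypothetical and are not taken from the actual service.

```python
from bisect import bisect_left

# Hypothetical sample data: (headword, basic meaning) pairs, sorted by headword.
ENTRIES = sorted([
    ("apple", "alma"),
    ("apply", "alkalmaz"),
    ("apricot", "sárgabarack"),
    ("dog", "kutya"),
    ("door", "ajtó"),
])
HEADWORDS = [word for word, _ in ENTRIES]


def suggest(prefix, limit=5):
    """Return up to `limit` (headword, meaning) pairs whose headword starts with `prefix`."""
    prefix = prefix.lower()
    if not prefix:
        return []
    # Binary search gives the position of the first headword >= prefix.
    start = bisect_left(HEADWORDS, prefix)
    results = []
    for word, meaning in ENTRIES[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later headword can match
        results.append((word, meaning))
        if len(results) == limit:
            break
    return results
```

Calling `suggest("app")` on this sample data returns the pairs for "apple" and "apply", which a dropdown could render while the user is still typing.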
Architecture
SZTAKI Dictionary has been implemented in a service-oriented architecture. The reason for this was to make it easy to build, reuse - or eventually replace - individual services built around dictionaries. The current system has one service-oriented component, the Dictionary Server (mostly referred to as the "backend"), and a graphical user interface component, known as the frontend. From a service-oriented architecture point of view, the frontend is not considered a service but rather an interface to the backend services for human users.
The frontend has been implemented on top of the Drupal CMS, with new modules and a theme added. The modules implement the communication with the backend: they convert frontend HTTP requests into invocations of the backend and translate the backend data formats into the final HTML5 presented to the user.
The Dictionary Server provides all functionality related to dictionaries, including the lookup of words and the management of dictionaries and dictionary entries. To find an appropriate technology on which to base the implementation of the Dictionary Server, we first "reverse engineered" the structure of paper dictionaries. We found that dictionary entries consist of the following:
Multiple levels of meanings. A meaning provides the semantics of a word. In a dictionary entry we can identify multiple levels of meaning, starting from a main meaning, then sub meanings of one or two levels, and at the lowest level the synonyms belonging to that particular meaning level. This structure suggests that dictionary entries have a recursive, tree-like nature.
Qualifiers related to meaning. Besides the meaning and sub meaning structure, the translations in dictionary entries are enhanced with qualifiers. These qualifiers provide context for the word, indicating where and how it can be used, and refine its semantics. Such qualifiers can provide thematic, stylistic or geographical categorization of the translation, as well as etymological or grammatical information about the word.
Information related to the word form. Dictionary form provides the basic forms of the index word. The dictionary form differs in each language. For some languages the dictionary form includes the plural form of nouns, or the past form of verbs, while for others there is an article for nouns.
Miscellaneous information related to words or meanings. Dictionary entries may contain not just the translation of a word but also example sentences, expressions or pictures, all of which help to explain the translation or provide context for its meaning and use.
Here is an example of a typical dictionary entry from an English-Hungarian dictionary and how its structure can be represented.
As can be seen, this structure forms a graph, or more precisely a specific kind of graph: a tree. From a computer science point of view, representing and storing trees is simple; however, looking up and walking trees may not be efficient with certain database technologies, e.g. an RDBMS. But what if we want to make our structure more web-like? For example, if we want to create paths between the meanings of a Hungarian word across all the other languages, then a simple tree structure won't be enough; we will need graphs.
Based on this evaluation of the structure and nature of dictionaries and dictionary entries, we decided to build our Dictionary Server on a graph database called Neo4J.
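The tree-like structure of a single entry can be sketched as follows. This is a minimal illustration only: the class names, the fields and the sample entry are assumptions made for the example and do not reflect the actual SZTAKI data model.

```python
from dataclasses import dataclass, field


@dataclass
class Meaning:
    """One node of the meaning tree: translations plus optional sub meanings."""
    translations: list
    qualifiers: dict = field(default_factory=dict)    # e.g. {"thematic": "geography"}
    examples: list = field(default_factory=list)
    sub_meanings: list = field(default_factory=list)  # recursive: list of Meaning


@dataclass
class Entry:
    """A dictionary entry: word form information plus its top-level meanings."""
    headword: str
    part_of_speech: str
    meanings: list


def walk(meaning, depth=0):
    """Depth-first traversal yielding (depth, translation) pairs."""
    for translation in meaning.translations:
        yield depth, translation
    for sub in meaning.sub_meanings:
        yield from walk(sub, depth + 1)


# Hypothetical English-Hungarian entry with two main meanings.
entry = Entry(
    headword="bank",
    part_of_speech="noun",
    meanings=[
        Meaning(translations=["bank"],
                sub_meanings=[Meaning(translations=["pénzintézet"])]),
        Meaning(translations=["part"], qualifiers={"thematic": "geography"}),
    ],
)

# Flatten the tree into (depth, translation) pairs.
flat = [pair for meaning in entry.meanings for pair in walk(meaning)]
```

Because `Meaning` refers to itself through `sub_meanings`, the same traversal works for any nesting depth, which is the recursive property noted above.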
A graph database is a kind of database that uses nodes to store data and edges to connect the nodes. Nodes, or both nodes and edges, can have properties, which are name-value pairs. With these elements - nodes, edges, properties - we can easily represent any kind of data, and all this without requiring a data schema. This means that we can easily add new nodes to the store, set new properties on nodes and connect them freely via edges.
Neo4J is a popular, ACID-transactional, high-performance graph database written in Java. What makes Neo4J even more suitable for the Dictionary Server is that it not only supports walking the graph from node to node but also integrates Lucene, the well-known full-text search engine. With Lucene integrated, we can look up the node representing the index word searched for by the user and then walk the nodes of the dictionary entry, which hold its meanings, translations, expressions, etc.
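To make the node, edge and property model concrete, here is a minimal in-memory sketch. It is not Neo4J and does not use its API; the class `Graph`, the edge type names and the sample data are all hypothetical.

```python
class Graph:
    """Tiny in-memory property graph: nodes and edges both carry free-form properties."""

    def __init__(self):
        self.nodes = {}    # node id -> properties dict
        self.edges = []    # (source id, edge type, target id, properties dict)
        self._next_id = 0

    def add_node(self, **props):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = dict(props)
        return node_id

    def add_edge(self, source, edge_type, target, **props):
        self.edges.append((source, edge_type, target, dict(props)))

    def neighbours(self, source, edge_type):
        """Ids of nodes reachable from `source` over edges of type `edge_type`."""
        return [t for s, e, t, _ in self.edges if s == source and e == edge_type]


g = Graph()
word = g.add_node(kind="word", text="dog", lang="en")
meaning = g.add_node(kind="meaning", order=1)
translation = g.add_node(kind="translation", text="kutya", lang="hu")
g.add_edge(word, "HAS_MEANING", meaning)
g.add_edge(meaning, "TRANSLATED_AS", translation)

# No schema: a new property can be set on an existing node at any time.
g.nodes[translation]["style"] = "neutral"
```

Nothing had to be declared in advance to add the `style` property, which is the schema-free flexibility the paragraph describes.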
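The two-step lookup described here (find the headword node through a text index, then walk the entry from that node) can be sketched without any database. In the sketch below a plain dictionary stands in for the full-text index and adjacency lists stand in for the graph store; all identifiers and the sample data are assumptions for illustration, not the Neo4J or Lucene API.

```python
from collections import deque

# Hypothetical data: node id -> properties, and adjacency lists for outgoing edges.
NODES = {
    1: {"kind": "word", "text": "dog"},
    2: {"kind": "meaning"},
    3: {"kind": "translation", "text": "kutya"},
    4: {"kind": "expression", "text": "hot dog"},
    5: {"kind": "word", "text": "cat"},
    6: {"kind": "translation", "text": "macska"},
}
EDGES = {1: [2, 4], 2: [3], 5: [6]}

# Stand-in for the full-text index: maps a normalised headword to its node id.
INDEX = {
    props["text"].lower(): node_id
    for node_id, props in NODES.items()
    if props["kind"] == "word"
}


def lookup(query):
    """Find the headword node via the index, then walk its entry breadth-first."""
    start = INDEX.get(query.strip().lower())
    if start is None:
        return []
    found, queue, seen = [], deque([start]), {start}
    while queue:
        node_id = queue.popleft()
        found.append(NODES[node_id])
        for target in EDGES.get(node_id, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return found
```

The index is consulted only once per query; everything after that is a local walk over the nodes of a single entry, so the cost of the walk depends on the size of the entry rather than on the size of the whole dictionary.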
What do we store in Neo4J?
- Dictionaries
- Metadata of dictionaries
- Dictionary schemes, describing structure of dictionary entries for specific languages and language pairs
- Dictionary entries
- Label dictionaries, which contain small pieces of text that appear on the screen and may need to be translated into multiple languages, e.g. qualifiers, names of dictionaries, etc.
The Dictionary Server, with the Neo4J database embedded in a servlet engine (Tomcat), performs well: a typical dictionary search takes only 50-100 ms.
Services
The main service of SZTAKI Dictionary is looking up words and providing their translations in another language.
Usually, when you search for a word in a dictionary service, you have to specify which dictionary you want to search in or which languages you want to translate from and to, and perhaps also the search mode, the result format or other parameters. All these (pre)settings make the search experience less attractive.
In SZTAKI Dictionary we therefore set out to build a search facility that allows zero-config searches. This means that even if users don't set any search parameters, we try to find the best matching translation in the most probable language pairs. Searches can of course be refined later by setting specific parameters, but thanks to the intelligent search and sorting algorithm there is usually no need for this. To make the dictionary experience even more attractive, we enhanced the dictionary databases with features found in any paper-based dictionary, including proper word forms, parts of speech, expressions and example sentences, and written and audio pronunciation.
Besides searching in dictionaries, we also provide a full dictionary management platform, where anyone can create and develop dictionaries privately or in a community built around a dictionary.
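The idea of a zero-config search can be illustrated with a small sketch. It assumes that each language pair has a prior weight expressing how likely a user is to mean that pair when nothing is specified, and it orders the hits by that weight. The weights, the sample dictionaries and the function name are hypothetical; the actual search and sorting algorithm of the service is not described here and is certainly more elaborate.

```python
# Hypothetical dictionaries keyed by (source, target) language pair.
DICTIONARIES = {
    ("en", "hu"): {"hand": ["kéz"], "gift": ["ajándék"]},
    ("de", "hu"): {"hand": ["kéz"], "gift": ["méreg"]},
    ("hu", "en"): {"kéz": ["hand"]},
}

# Assumed prior: how probable each pair is when the user specifies nothing.
PAIR_PRIOR = {("en", "hu"): 0.6, ("hu", "en"): 0.3, ("de", "hu"): 0.1}


def zero_config_search(query, pair=None):
    """Search every dictionary unless `pair` is given; most probable pair first."""
    query = query.strip().lower()
    pairs = [pair] if pair else list(DICTIONARIES)
    hits = [
        (p, DICTIONARIES[p][query])
        for p in pairs
        if query in DICTIONARIES.get(p, {})
    ]
    # Stable sort: higher prior first, ties keep dictionary order.
    return sorted(hits, key=lambda hit: PAIR_PRIOR.get(hit[0], 0.0), reverse=True)
```

With no parameters, `zero_config_search("gift")` lists the English-Hungarian hit before the German-Hungarian one; passing `pair=("de", "hu")` corresponds to the optional refinement mentioned above and returns only the German-Hungarian result.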
Summary
With the services provided by SZTAKI Dictionary we would like to encourage our users to create dictionaries, even small ones covering a niche topic in a given language. Over time, these small thematic dictionaries will add up to a full, thematically rich dictionary for the given language pairs.
Our hope is that all these changes will make SZTAKI Szótár a viable alternative to well established paper dictionaries.