JPA Native Query Postgres Full Text Search#
What Is Full Text Search?#
- Full text searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers a query as a set of words and similarity as the frequency of query words in the document.
- Textual search operators have existed in databases for years. PostgreSQL has `~`, `~*`, `LIKE`, and `ILIKE` operators for textual data types, but they lack many essential properties required by modern information systems:
    - There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g., `satisfies` and `satisfy`. You might miss documents that contain `satisfies`, although you probably would like to find them when searching for `satisfy`. It is possible to use `OR` to search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives).
    - They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
    - They tend to be slow because there is no index support, so they must process all documents for every search.
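To make the contrast concrete, here is a small sketch (the table `docs` and its text column `body` are hypothetical, used only for illustration):

```sql
-- ILIKE matches only the literal substring, so 'satisfies' is missed:
SELECT count(*) FROM docs WHERE body ILIKE '%satisfy%';

-- Full text search normalizes both the document and the query to the same
-- lexeme, so 'satisfy', 'satisfies' and 'satisfied' all match:
SELECT count(*) FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'satisfy');
```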
- Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
    - Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a parser to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
    - Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as `s` or `es` in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
    - Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for proximity ranking, so that a document that contains a more "dense" region of query words is assigned a higher rank than one with scattered query words.
- Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:
    - Define stop words that should not be indexed.
    - Map synonyms to a single word using Ispell.
    - Map phrases to a single word using a thesaurus.
    - Map different variations of a word to a canonical form using an Ispell dictionary.
    - Map different variations of a word to a canonical form using Snowball stemmer rules.
- A data type `tsvector` is provided for storing preprocessed documents, along with a type `tsquery` for representing processed queries. There are many functions and operators available for these data types, the most important of which is the match operator `@@`, which we introduce below. Full text searches can be accelerated using indexes.
What Is A Document?#
- A *document* is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
- For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole.
- For example, suppose we have a table `customers` which contains the columns below.

| id | address | dob | email | full_name | gender | phone |
|---|---|---|---|---|---|---|
| b94d2cb0-2b55-4191-b946-12dbd117cfa2 | 893 Bogisich Row | 1995-10-10 07:00:00.000 | Maximilian.Renner68@example.net | Abel Rath IV | M | 377-802-6499 |
| 305920c6-102b-4592-a33d-15657d7ee515 | 57786 Selmer Squares | 1995-10-10 07:00:00.000 | Henri.Schowalter8@example.net | Katrina Schulist | M | 417-341-8058 |
| 59f3a54c-5bde-4790-952a-3be66bc3f94e | 7668 Schowalter Streets | 1995-10-10 07:00:00.000 | Chadrick_Beier@example.net | Elbert Kris | M | 647-843-2866 |
| 6f623a35-f4b9-4a51-ad50-91c3206e41f6 | 3805 Estevan Radial | 1995-10-10 07:00:00.000 | Daniela.Quitzon20@example.com | June Rowe | M | 841-709-3392 |
| 9f686134-e286-47b3-bb61-60bb56f5bad2 | 76029 Lloyd Points | 1995-10-10 07:00:00.000 | Reid_McDermott15@example.com | Sammy King | M | 908-596-0825 |
- Then we will use the PostgreSQL query below to create documents which contain the combined values of the text columns `address`, `email`, `full_name` and `phone`.

```sql
SELECT address || ' ' || email || ' ' || full_name || ' ' || phone AS document
FROM customers;
```
- After executing the query we get the following result.
| document |
|---|
| 893 Bogisich Row Maximilian.Renner68@example.net Abel Rath IV 377-802-6499 |
| 57786 Selmer Squares Henri.Schowalter8@example.net Katrina Schulist 417-341-8058 |
| 7668 Schowalter Streets Chadrick_Beier@example.net Elbert Kris 647-843-2866 |
| 3805 Estevan Radial Daniela.Quitzon20@example.com June Rowe 841-709-3392 |
| 76029 Lloyd Points Reid_McDermott15@example.com Sammy King 908-596-0825 |
- Now we can use these documents to create a `tsvector`.
What Is The tsvector Datatype?#
PostgreSQL provides two data types that are designed to support full text search, which is the activity of searching through a collection of natural-language documents to locate those that best match a query. The `tsvector` type represents a document in a form optimized for text search; the `tsquery` type similarly represents a text query.

- A `tsvector` value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word. Sorting and duplicate elimination are done automatically during input, as shown in this example:
```sql
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
```
| tsvector |
|---|
| 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat' |
- It is important to understand that the `tsvector` type itself does not perform any word normalization; it assumes the words it is given are normalized appropriately for the application.
- For most English-text-searching applications the above words would be considered non-normalized, but `tsvector` doesn't care. Raw document text should usually be passed through `to_tsvector` to normalize the words appropriately for searching.
What Is The to_tsvector Function?#
- To implement full text searching there must be a function to create a `tsvector` from a document and a `tsquery` from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It's also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.
- PostgreSQL provides the function `to_tsvector` for converting a *document* to the `tsvector` data type.
```sql
to_tsvector([ config regconfig, ] document text) returns tsvector
```
- `to_tsvector` parses a textual document into tokens, reduces the tokens to lexemes, and returns a `tsvector` which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:
```sql
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
```
| to_tsvector |
|---|
| 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4 |
- In the example above we see that the resulting `tsvector` does not contain the words `a`, `on` and `it`; the word `rats` became `rat`; and the punctuation sign `-` was ignored.
- The `to_tsvector` function internally calls a parser which breaks the *document* text into *tokens* and assigns a type to each token. For each token, a list of *dictionaries* is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized *lexemes* to represent the token. For example, `rats` became `rat` because one of the dictionaries recognized that the word `rats` is a plural form of `rat`. Some words are recognized as *stop words*, which causes them to be ignored since they occur too frequently to be useful in searching. In our example these are `a`, `on`, and `it`. If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign `-` because there are in fact no dictionaries assigned for its token type (`Space symbols`), meaning space tokens will never be indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration. It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration `english` for the English language.
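To inspect how the parser classifies each token and which dictionary handles it, PostgreSQL's `ts_debug` function is handy; a quick sketch:

```sql
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'a fat rats');
-- 'a' maps to an empty lexeme list (stop word); 'rats' is stemmed to {rat}
```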
What Is The tsquery Datatype?#
- A `tsquery` value stores lexemes that are to be searched for, and can combine them using the Boolean operators `&` (AND), `|` (OR), and `!` (NOT), as well as the phrase search operator `<->` (FOLLOWED BY). There is also a variant `<N>` of the FOLLOWED BY operator, where `N` is an integer constant that specifies the distance between the two lexemes being searched for. `<->` is equivalent to `<1>`.
- Parentheses can be used to enforce grouping of these operators. In the absence of parentheses, `!` (NOT) binds most tightly, `<->` (FOLLOWED BY) next most tightly, then `&` (AND), with `|` (OR) binding the least tightly.
```sql
SELECT 'fat & rat'::tsquery;
```
| tsquery |
|---|
| 'fat' & 'rat' |
```sql
SELECT 'fat & rat & ! cat'::tsquery;
```
| tsquery |
|---|
| 'fat' & 'rat' & !'cat' |
- Searching for phrases is possible with the help of the `<->` (FOLLOWED BY) `tsquery` operator, which matches only if its arguments have matches that are adjacent and in the given order. For example:
```sql
SELECT 'fatal <-> error'::tsquery;
```
| tsquery |
|---|
| 'fatal' <-> 'error' |
- Also, lexemes in a `tsquery` can be labeled with `*` to specify prefix matching:
```sql
SELECT 'super:*'::tsquery;
```
| tsquery |
|---|
| 'super':* |
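This query matches any lexeme in a `tsvector` beginning with `super`; a quick sketch (the exact lexemes produced depend on the stemmer):

```sql
SELECT to_tsvector('english', 'supernova') @@ 'super:*'::tsquery;
-- true: the stored lexeme begins with "super"
```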
What Are Supported Functions For tsquery?#
- In PostgreSQL, `to_tsquery`, `plainto_tsquery`, and `phraseto_tsquery` are functions related to full text search. These functions are used to construct queries in the `tsquery` format, which is then used in conjunction with the `@@` operator to perform full text searches against the `tsvector` data type.
- `to_tsquery` function:
    - Syntax: `to_tsquery(config_name, query_text)`
    - Description: Converts a query string into a `tsquery` data type, which is used for full text searching. The `config_name` parameter specifies the text search configuration to use, and `query_text` is the actual query string.
- Example:
```sql
SELECT to_tsquery('english', 'The & Fat & Rats');
```
| to_tsquery |
|---|
| 'fat' & 'rat' |
- `plainto_tsquery` function:
    - Syntax: `plainto_tsquery(config_name, query_text)`
    - Description: Converts a plain text query string into a `tsquery`. It is similar to `to_tsquery`, but the text is parsed and normalized much as for `to_tsvector`, and the resulting lexemes are simply joined with the `&` (AND) operator.
- Example:
```sql
SELECT plainto_tsquery('english', 'The cat ate rats');
```
| plainto_tsquery |
|---|
| 'cat' & 'ate' & 'rat' |
- `phraseto_tsquery` function:
    - Syntax: `phraseto_tsquery(config_name, query_text)`
    - Description: Similar to `plainto_tsquery`, but it connects the surviving words with the `<->` (FOLLOWED BY) operator instead of `&`, so the order of the words in the input matters.
- Example:
```sql
SELECT phraseto_tsquery('english', 'The Fatal Errors');
```

| phraseto_tsquery |
|---|
| 'fatal' <-> 'error' |
- There is a more general version of the FOLLOWED BY operator having the form `<N>`, where `N` is an integer standing for the difference between the positions of the matching lexemes. `<1>` is the same as `<->`, while `<2>` allows exactly one other lexeme to appear between the matches, and so on. The `phraseto_tsquery` function makes use of this operator to construct a `tsquery` that can match a multi-word phrase when some of the words are stop words. For example:
```sql
SELECT phraseto_tsquery('english', 'The cat ate the rats');
```
| phraseto_tsquery |
|---|
| 'cat' <-> 'ate' <2> 'rat' |
Basic Text Matching#
Full text searching in PostgreSQL is based on the match operator `@@`, which returns `true` if a `tsvector` (document) matches a `tsquery` (query). It doesn't matter which data type is written first.
```sql
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
```
| ?column? |
|---|
| true |
- As the above example suggests, a `tsquery` is not just raw text, any more than a `tsvector` is. A `tsquery` contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. There are functions `to_tsquery`, `plainto_tsquery`, and `phraseto_tsquery` that are helpful in converting user-written text into a proper `tsquery`, primarily by normalizing words appearing in the text. Similarly, `to_tsvector` is used to parse and normalize a document string. So in practice a text search match would look more like this:
```sql
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
```
| ?column? |
|---|
| true |
- The `@@` operator also supports `text` input, allowing explicit conversion of a text string to `tsvector` or `tsquery` to be skipped in simple cases. The variants available are:
```sql
tsvector @@ tsquery
tsquery  @@ tsvector
text     @@ tsquery
text     @@ text
```
- The first two of these we saw already. The form `text @@ tsquery` is equivalent to `to_tsvector(x) @@ y`. The form `text @@ text` is equivalent to `to_tsvector(x) @@ plainto_tsquery(y)`.
Within a
tsquery, the&(AND) operator specifies that both its arguments must appear in the document to have a match. Similarly, the|(OR) operator specifies that at least one of its arguments must appear, while the!(NOT) operator specifies that its argument must not appear in order to have a match. For example, the queryfat & ! ratmatches documents that containfatbut notrat.
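A quick sketch of the NOT operator in action:

```sql
SELECT to_tsvector('english', 'fat cats chased the dogs')
       @@ to_tsquery('english', 'fat & ! rat');
-- true: "fat" is present and "rat" is absent
```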
Example With JPA#
- Now, let's take an example of applying PostgreSQL full text search with JPA.
Dependency#
- Let's create a sample project and add the dependencies below.
**pom.xml**
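The original dependency list was not preserved in this copy; a minimal `pom.xml` fragment for this kind of setup might look like the following (the exact set of dependencies and the versions are assumptions, not taken from the original project):

```xml
<dependencies>
    <!-- Spring Web and Spring Data JPA -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <!-- PostgreSQL JDBC driver -->
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <scope>runtime</scope>
    </dependency>
    <!-- MapStruct for DTO/entity mapping -->
    <dependency>
        <groupId>org.mapstruct</groupId>
        <artifactId>mapstruct</artifactId>
        <version>1.5.5.Final</version>
    </dependency>
    <!-- Lombok to reduce boilerplate -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
</dependencies>
```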
- Then let's apply the plugins below for the OpenAPI generator and MapStruct.
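The original plugin configuration was not preserved; a trimmed sketch of the two build plugins (versions and generator options are assumptions) might look like:

```xml
<build>
    <plugins>
        <!-- Generate API interfaces and models from the OpenAPI spec -->
        <plugin>
            <groupId>org.openapitools</groupId>
            <artifactId>openapi-generator-maven-plugin</artifactId>
            <version>7.2.0</version>
            <executions>
                <execution>
                    <goals><goal>generate</goal></goals>
                    <configuration>
                        <inputSpec>${project.basedir}/src/main/resources/openapi/openapi-server.yaml</inputSpec>
                        <generatorName>spring</generatorName>
                        <configOptions>
                            <interfaceOnly>true</interfaceOnly>
                        </configOptions>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Wire the MapStruct annotation processor into compilation -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <annotationProcessorPaths>
                    <path>
                        <groupId>org.mapstruct</groupId>
                        <artifactId>mapstruct-processor</artifactId>
                        <version>1.5.5.Final</version>
                    </path>
                </annotationProcessorPaths>
            </configuration>
        </plugin>
    </plugins>
</build>
```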
OpenApi#
- Now, in the `resources` folder, we will create a folder `openapi` and an `openapi-server.yaml` file inside it. The content of this file will define some APIs.
**openapi-server.yaml**
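The original spec was not preserved; a minimal sketch of what the search endpoint in `openapi-server.yaml` might look like (paths, operation names and parameters are assumptions):

```yaml
openapi: 3.0.3
info:
  title: Customer Service API
  version: 1.0.0
paths:
  /customers/search:
    get:
      operationId: searchCustomers
      parameters:
        - name: keyword
          in: query
          required: true
          schema: { type: string }
        - name: pageSize
          in: query
          schema: { type: integer, default: 10 }
        - name: pageNumber
          in: query
          schema: { type: integer, default: 0 }
      responses:
        '200':
          description: Matching customers
```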
- Now, we can use the command `mvn clean install` to build and generate the APIs and models from `openapi-server.yaml`.
Entity#
- Then let's define a `CustomerEntity` as below.
**CustomerEntity.java**
**Gender.java**
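The original listings are not preserved in this copy; the sketch below shows one plausible shape for the entity and the enum, with field names following the `customers` table shown earlier (the annotations, types and UUID generation strategy are assumptions):

```java
import jakarta.persistence.*;
import java.time.LocalDateTime;
import java.util.UUID;

@Entity
@Table(name = "customers")
public class CustomerEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    private String address;

    private LocalDateTime dob;

    private String email;

    @Column(name = "full_name")
    private String fullName;

    @Enumerated(EnumType.STRING)
    private Gender gender;

    private String phone;

    // getters and setters omitted for brevity
}

enum Gender {
    M,
    F
}
```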
Repository#
- Now, let's create the repository for the `CustomerEntity` above.
**CustomerRepository.java**
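The repository listing is not preserved; below is a sketch of what such a native full text search query might look like (the method name, the set of concatenated columns and the paging arithmetic are assumptions). It builds a per-row document with `to_tsvector` and matches it against the keyword with `plainto_tsquery`:

```java
import java.util.List;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface CustomerRepository extends JpaRepository<CustomerEntity, UUID> {

    // Concatenate the searchable columns into one document per row, normalize it
    // with to_tsvector, and match it against the keyword's lexemes (ANDed
    // together by plainto_tsquery).
    @Query(value = """
            SELECT * FROM customers c
            WHERE to_tsvector('english',
                      c.address || ' ' || c.email || ' ' || c.full_name || ' '
                      || c.phone || ' ' || c.dob::text || ' ' || c.gender::text)
                  @@ plainto_tsquery('english', :keyword)
            LIMIT :pageSize OFFSET :pageNumber * :pageSize
            """, nativeQuery = true)
    List<CustomerEntity> searchCustomers(@Param("keyword") String keyword,
                                         @Param("pageSize") int pageSize,
                                         @Param("pageNumber") int pageNumber);
}
```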
- As you can see, in this repository we define a method to search customers using a native query with some parameters such as `keyword`, `pageSize` and `pageNumber`.
Dto#
- Next, let's create DTOs for the `CustomerEntity` and a `CustomerFilterResult` as below.
**Customer.java**
**CustomerFilterResult.java**
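These listings were not preserved either; a minimal sketch, with field names assumed to mirror the entity:

```java
import java.util.List;
import java.util.UUID;

public class Customer {
    private UUID id;
    private String address;
    private String dob;
    private String email;
    private String fullName;
    private String gender;
    private String phone;
    // getters and setters omitted for brevity
}

class CustomerFilterResult {
    private List<Customer> customers; // the page of matching customers
    private long total;               // total number of matches
    // getters and setters omitted for brevity
}
```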
Mapper#
- Next, let's create a mapper interface for mapping between the Customer DTO, the entity, and the CustomerFilterResult models using MapStruct.
**AutoCustomerMapper.java**
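The original mapper is not preserved; a MapStruct mapper for these types might look like the sketch below (method names are assumptions):

```java
import java.util.List;
import org.mapstruct.Mapper;

@Mapper(componentModel = "spring")
public interface AutoCustomerMapper {

    // Entity -> DTO
    Customer toDto(CustomerEntity entity);

    // DTO -> Entity
    CustomerEntity toEntity(Customer customer);

    // Map a page of entities at once
    List<Customer> toDtos(List<CustomerEntity> entities);
}
```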
Service#
- Now, with the classes we defined above, we can create a service class handling the CRUD logic for our entities.
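The service listing is not preserved; the sketch below shows only the search path (class and method names follow the earlier sketches and are assumptions; the CRUD methods are omitted):

```java
import java.util.List;
import org.springframework.stereotype.Service;

@Service
public class CustomerService {

    private final CustomerRepository customerRepository;
    private final AutoCustomerMapper customerMapper;

    public CustomerService(CustomerRepository customerRepository,
                           AutoCustomerMapper customerMapper) {
        this.customerRepository = customerRepository;
        this.customerMapper = customerMapper;
    }

    // Delegate the full text search to the repository's native query,
    // then map the matching entities to DTOs for the API layer.
    public List<Customer> searchCustomers(String keyword, int pageSize, int pageNumber) {
        List<CustomerEntity> entities =
                customerRepository.searchCustomers(keyword, pageSize, pageNumber);
        return customerMapper.toDtos(entities);
    }
}
```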
Controller#
- Now, we can create some basic controllers and implement the generated APIs.
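The controller listing is not preserved; in the original it implements the interfaces generated from the OpenAPI spec, but a plain-Spring sketch of the search endpoint (path and signature are assumptions) might look like:

```java
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/v1/customers")
public class CustomerController {

    private final CustomerService customerService;

    public CustomerController(CustomerService customerService) {
        this.customerService = customerService;
    }

    // Search endpoint backed by the native full text search query
    @GetMapping("/search")
    public ResponseEntity<?> searchCustomers(
            @RequestParam String keyword,
            @RequestParam(defaultValue = "10") int pageSize,
            @RequestParam(defaultValue = "0") int pageNumber) {
        return ResponseEntity.ok(customerService.searchCustomers(keyword, pageSize, pageNumber));
    }
}
```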
Testing#
- Now, let's start the application, then use Postman with a request body as below.
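The original request body is not preserved; a plausible CreateCustomer payload using Postman's dynamic variables (the field names are assumptions mirroring the entity) might be:

```json
{
  "address": "{{$randomStreetAddress}}",
  "dob": "1995-10-10T07:00:00.000Z",
  "email": "{{$randomEmail}}",
  "fullName": "{{$randomFullName}}",
  "gender": "M",
  "phone": "{{$randomPhoneNumber}}"
}
```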
- Postman can randomize some common fields for us (via its dynamic variables) every time we send the request.
- Now, let's call the CreateCustomer API 10,000 times. Then we will have 10,000 records in the `customers` table.
- Next, let's continue using Postman to call the search customer API as below.
- As you can see, the search query looks up the keyword `Knolls` or `Knoll` in the columns `full_name`, `email`, `address`, `dob`, `phone` and `gender`. If any column contains this keyword then the record is returned. Currently, 10 out of 9994 records contain this keyword.




