Building Transformer Models With Attention
This is not an introduction to natural language processing techniques. In fact, before you read this book,
you should already know some language preprocessing terms, such as tokenization. The goal of this book is
to introduce you to the attention mechanisms that can extract key information from a sequence and to
show you how a transformer model, in which an attention mechanism is applied, is built and used.
There is only one main theme in this book: to make a machine that can translate an English sentence
into German.
You don’t have to. You can just download a model from some repository and copy over the sample code. You
can finish your project without knowing how attention works.
However, when you find an issue in the code or discover hundreds of different models with similar
names, you will want to know what the code or the model is doing behind the scenes. Understanding
the transformer models and the attention mechanisms that power them will allow you to tell why
one thing works and another doesn’t.
There can be a lot to learn about attention and transformers. But there are three basic questions that
you should be able to answer.
3. What is a Transformer?
If attention can be applied to the states of a recurrent neural network, it can also be applied directly to the
input sequences. After all, the output of an attention mechanism is also a sequence. Therefore, we can stack
up multiple attention layers and build a neural network out of them without any recurrent neural network. It
turns out such a network works well on many problems, and this architecture is called a transformer
model.
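To make that concrete, here is a minimal sketch, assuming TensorFlow/Keras as used later in the book, of a sequence processed purely by stacked attention layers with no recurrence. The layer sizes, depth, and names are illustrative assumptions, not code from the book:

```python
# A minimal sketch (not code from the book): stacking self-attention layers
# in Keras so a sequence is processed without any recurrent layers.
# The sizes and depth below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

seq_len, d_model = 10, 64                       # assumed sequence length and feature size
inputs = layers.Input(shape=(seq_len, d_model))

x = inputs
for _ in range(2):                              # stack two attention blocks
    # Self-attention: the sequence attends to itself, and the output is
    # again a sequence of the same length.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
    x = layers.LayerNormalization()(x + attn)   # residual connection + normalization
    ff = layers.Dense(d_model, activation="relu")(x)
    x = layers.LayerNormalization()(x + ff)     # feed-forward sublayer, as in a transformer

model = tf.keras.Model(inputs, x)
model.summary()                                 # output shape is still (seq_len, d_model)
```

Because the output of each attention block is a sequence of the same shape as its input, the blocks can be stacked as deep as needed, which is exactly the trick the transformer architecture relies on.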
This book is designed to teach machine learning practitioners like you about transformer models from
the ground up. This book is for you if you use some off-the-shelf models and see them working but feel
clueless about how attention and transformers can solve your problems.
It starts by giving you a high-level overview of what attention mechanisms are and how people use
them. You will then learn the fundamental theory and implement a transformer model line by line in
Keras. By the time you finish this book, you will have a working transformer model that can translate
English sentences into German.
This book will teach you the inner workings of a transformer model in the fastest and most effective way
we know how: to learn by doing. We give you executable code that you can run to develop the intuitions
required and that you can copy and paste into your project to
immediately get a result. You can even reuse the code on a
different dataset to build a translator for your favorite language pair.
Convinced?
Click to jump straight to the packages.
Perhaps you have already finished our other book Deep Learning
with Python. Perhaps you finished a project with LSTM or other
recurrent neural networks. Then, the lessons in this book will guide
you to the advanced topic of attention and transformers.
The lessons in this book do assume a few things about you.
This guide was written in the top-down and results-first style that you’re used to from Machine Learning
Mastery.
Researchers have developed transformer models for computer vision. While the data are fundamentally
different, the same idea applies. Even if you are not interested in NLP problems, you will understand
why the same architecture can work in other domains.
The tutorials in the book do not require sophisticated background knowledge. Following this book and
building a translator can be your first project in NLP.
You can benefit from this book even if you can barely code. You will know how to learn from other
people’s code. You will know how to learn from your own mistakes!
You should be able to learn a new idea or two from this book to bring your NLP project to the next level.
There is a lot to do to build a transformer model. You are not going to get lost or distracted. We aim to
take you from start to finish to develop a working transformer model that you can reuse in your other
deep learning projects. Step-by-step with laser-focused tutorials.
Each tutorial is designed to take you about one hour to read through and complete, excluding the
extensions and further reading.
You can choose to work through the lessons one per day, one per week, or at your own pace. I think
momentum is critically important, and this book is intended to be read and used, not to sit idle.
Part 3: Building a Transformer from Scratch. In multiple steps, you will create the building
blocks of a transformer model in Keras. Then you will connect the pieces to build a working
transformer with training, testing, and inference.
Part 4: Applications. There are larger transformer models available. They take a much longer
time to train and need much larger datasets, but some of them are available off the shelf. We
picked one such model and will show you how to use it to do something in only a few lines of
code.
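To give a flavor of what "a few lines of code" can look like, here is a sketch that loads a pre-trained BERT model (the model covered in Chapter 23) through the Hugging Face transformers library. The choice of library and the masked-word task are assumptions for illustration; the book's own application chapter may use a different setup:

```python
# Illustrative sketch, not the book's code: running an off-the-shelf
# pre-trained BERT model via the Hugging Face `transformers` library
# (assumed to be installed, e.g. with `pip install transformers`).
from transformers import pipeline

# Download a pre-trained BERT and wrap it in a masked-word prediction pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Ask the model to fill in the blank; it returns candidate tokens with scores
for prediction in fill_mask("Attention is all you [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```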
Chapters Overview
Below is an overview of the 23 step-by-step tutorial lessons you will work through. Each chapter was
designed to be completed in about 30 to 60 minutes by the average developer.
Table of Contents
The screenshot below was taken from the PDF Ebook. It provides you with a full overview of the
table of contents from the book.
Foundations of Attention
Chapter 01: What Is Attention?
Chapter 02: A Bird’s Eye View of Research on Attention
Chapter 03: A Tour of Attention-Based Architectures
Chapter 04: The Bahdanau Attention Mechanism
Chapter 05: The Luong Attention Mechanism
Applications
Chapter 23: A Brief Introduction to BERT
Appendix
Appendix A: How to Setup a Workstation for Python
Appendix C: How to Setup Amazon EC2 for Deep Learning on GPUs
Enter your email address, and your sample chapter will be sent to your inbox.
BUY NOW
FOR $587
(1) Click the button. (2) Enter your details. (3) Download immediately.
I live in Australia with my wife and sons. I love to read books, write
tutorials, and develop systems.
I teach an unconventional top-down and results-first approach to machine learning where we start by
working through tutorials and problems, then later wade into theory as we need it.
I'm here to help if you ever have any questions. I want you to be awesome at machine learning.
Businesses know what these skills are worth and are paying sky-high starting salaries.
OR...
You're A Professional
What is the difference between the LSTM and Deep Learning books?
What is the difference between the LSTM and Deep Learning for Time Series books?
What is the difference between the LSTM and the NLP books?
What is your business or corporate tax number (e.g. ABN, ACN, VAT, etc.)?
Where is my purchase?